 # Tutorial 4 Social Network Analysis

So far we know about various online social
networks like Facebook and Twitter. In this tutorial, we will learn how to represent
social media data in a graph format consisting of nodes and edges. We will also learn the basics of social network
analysis.

This is an example of my Twitter followee
network graph. Nodes are my followees, and a directed edge signifies
that one node follows the other.

A graph is a data structure which consists
of a finite set of nodes and edges. Nodes represent the entities of a social network,
like users, pages or groups. Edges define the relationships between
nodes. For instance, a directed edge from user a to user b can mean that a follows b,
and an edge between user a and a page can mean that the user likes that particular page.

However, how would a computer understand such
a node-edge format? There are various ways to represent a node-edge
graph; some of the most widely used are the adjacency matrix, the GraphML format and
CSV files. Let us look at what they are.

An adjacency matrix is a 2-dimensional square
matrix whose size is equal to the number of nodes in the graph. In this particular example, since the graph
has 6 nodes, the size of the corresponding adjacency matrix is 6 x 6. The cell at the intersection
of the ith row and jth column is 1 if an edge exists between node i and node j, otherwise
0. In this example, there are edges from node
1 to nodes 2 and 3. Therefore, the cell at the intersection of the
first row and second column gets a 1. Similarly, the cell at the intersection of the
first row and third column also gets a 1. However, there is no edge between node 1 and
node 4. Therefore, the cell at the intersection of the
first row and fourth column remains 0. The rest of the adjacency matrix is filled
in a similar manner.

An adjacency matrix is very easy to construct
using an array data structure in any programming language. However, if the input graph has
a high number of nodes and few edges, then the resulting adjacency matrix is very sparse
and wastes space.

Therefore, let us look at another way to represent
a graph, the GraphML format. GraphML is an XML file format for graphs. It consists of an XML file containing a graph
element, within which is an unordered sequence of node and edge elements. Each node element should have a distinct id
attribute, and each edge element has source and target attributes that identify the end
points of an edge between two nodes. In this example, we have a graph with 11 nodes that
have node ids n0 to n10. The first edge element signifies that there
exists an edge between node n0 and node n2.

Now, we will learn how to collect your own
Twitter following network as a graph file. We will be using twecoll, a command line tool,
to get the Twitter data. Using twecoll, we will collect our followees'
information, which is also called friends and friends of friends information. Let us see how that works.

Go to this particular URL and let us fetch
the code of twecoll. Download the code into a folder and unzip it,
or clone this git repo to get the code. Now, go to the terminal and navigate to the folder where
you saved the code. I will go there now.

Now, we will start collecting data. As the first
step, we will initialize the program to authorize a Twitter app. Run the command python twecoll
init, followed by the user name. In this case, I am using a dummy user name
for demo purposes. The program will prompt you to go to
apps.twitter.com and enter the consumer key of a Twitter application. Let us go back to the browser and navigate
to apps.twitter.com. Now, I have already created an app; you can
either use a previously created app or create a new one. Go to the Keys and Access Tokens tab
and get hold of the consumer key. Then it will also ask for the consumer secret.

Next, it will generate a link and ask
you to enter a PIN. Copy and paste that URL into a browser and you will be redirected to the
Twitter application authorization page. Authorize the app and get hold of the PIN. Copy and paste it into the terminal and your initialization
will be complete. Next, it will start processing the ids of
the followees of your Twitter account.

Now, we will collect the friends of friends
information of your Twitter account. To do that, run the command python twecoll
fetch followed by the user name. This will start getting the list of followees
of your own followees.

Next, we want to build the edges between the
followees and their followees. To do that, we will run the command python twecoll edgelist
followed by the user name. This command will generate the edges between
your followees and their followees. Do not worry if you get a notification which
says no module named igraph; you can list the files in the folder and you should be
able to see a file named with your user name followed by the extension .gml. (GML is a plain-text graph format, distinct from the XML-based GraphML format described earlier; Gephi reads both.) This is the file which contains the information
about your followees and their followees, and we will use the Gephi tool to generate a network
graph of the Twitter data.

Before we move on to graph visualization,
let us first look at the basics of social network analysis.

The most commonly used SNA metric is degree.
In a directed graph, the in-degree is equal to the number of edges entering a node. In this example, for node 2, an edge enters
it from node 1 and there is a self loop from itself; therefore, its in-degree is 2. The out-degree equals the number of edges leaving
a node. In this example, edges go
from node 2 to node 4 and from node 2 to node 5, and node 2
also has a self loop. Therefore, the out-degree of node 2 is 3. The total degree of a node is calculated by summing
its in-degree and out-degree.

One of the other useful SNA metrics is centrality,
that is, finding out which node is the most central or important. There can be various ways to define centrality. Let us look at them one by one. In-degree
centrality finds the node with the highest in-degree. It can signify the most influential node or,
in the case of a Twitter follower graph, the user with the highest number of followers. Out-degree centrality helps in locating the
node whose out-degree is the highest. Other ways to measure centrality are betweenness
and closeness. The betweenness centrality of a node is equal to the number
of shortest paths from all vertices to all others that pass through that node. Closeness
centrality helps to find the node with the lowest total distance to all other nodes.

Let us also briefly look at communities in a
graph. A community is a group of similar or densely
connected nodes. The measure to define the strength of a community
is modularity, which compares the fraction of edges that fall within the given groups with the fraction expected if edges were placed at random.

Now, we will look at a tool called Gephi for
graph visualization, which I have installed on my machine. Start Gephi and open the GML file which
you previously created. This will show some statistics about
the graph, namely the number of nodes and the number of edges. Click on OK and soon
your graph will be loaded. If you do not see anything, click on the Overview tab
and the graph will initialize. In case your
graph is very dense, you can change the layout of the graph. From the left panel, click on the drop-down
menu and choose a different layout. Let us choose the Fruchterman Reingold algorithm and run
it; you can stop it after a while. Now, let us customize this graph to make it
look prettier. Before doing that, notice that on your right
you will be able to see a lot of network statistics, for instance, the average degree, average
weighted degree, network diameter, etc. You can click on Run to generate the reports. For instance, this is the degree distribution
of my Twitter followee graph. You can generate similar statistics. We will
also run modularity to find out how many communities there are in our
Twitter followee graph.

Next, let us customize this graph. The nodes
are our followees, but we are not able to see their labels. Therefore, click on the 'T' icon to show the labels.
We can also customize the color of the nodes and edges. In the left panel, choose the color palette,
select Nodes, click on Attributes and then select Modularity Class. Click on Apply. We can even customize the
size of the nodes. Click on the size palette, then Nodes and then Attributes,
and choose an attribute. Let us select in-degree, so that we can get
a sense of which are the most important nodes in the graph. Click on Apply; this will change
the color and size of the nodes. You can zoom in and see which are the
most important nodes in your graph and what their colors are according to the
communities to which they belong. We can further customize this graph by changing the appearance
of the edges. Let us go to Preview. In case you do not see it here, click on the Window menu
and then click on Preview Settings; again click on Window and click on Preview. Now, from the presets select Default Curved
and press Refresh. This will make the edges curved instead of
straight. Now, you can save this graph in SVG, PDF or
PNG format by clicking on Export in the left panel.

You can also browse the data which is used
to generate this graph by clicking on the Data Laboratory tab. This will show you the list of the nodes,
edges and the relationships between them. There will be two tabs, Nodes and Edges. The Nodes tab will list the information about
the nodes and the Edges tab will list the source and target information as it is stored
in the GML file.

Now, we know how to generate a network graph
from social media data.
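
As a recap of the adjacency matrix and degree definitions covered earlier in this tutorial, here is a minimal Python sketch. The 6-node count and the edges 1 to 2, 1 to 3, 2 to 4, 2 to 5 and the self loop on node 2 are the ones mentioned in the text; combining them into a single graph is an assumption, since the original figures may contain more edges. The function names are made up for illustration.

```python
# Illustrative graph: only the edges the tutorial mentions explicitly.
# Treating them as one 6-node directed graph is an assumption.
NUM_NODES = 6
EDGES = [(1, 2), (1, 3), (2, 2), (2, 4), (2, 5)]


def build_adjacency_matrix(num_nodes, edges):
    """Return a num_nodes x num_nodes matrix with 1 where an edge i -> j exists."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for source, target in edges:
        # Nodes are numbered from 1 in the tutorial; lists are indexed from 0.
        matrix[source - 1][target - 1] = 1
    return matrix


def out_degree(matrix, node):
    """Number of edges leaving `node`: the sum of its row."""
    return sum(matrix[node - 1])


def in_degree(matrix, node):
    """Number of edges entering `node`: the sum of its column."""
    return sum(row[node - 1] for row in matrix)


adjacency = build_adjacency_matrix(NUM_NODES, EDGES)
for row in adjacency:
    print(row)

print("in-degree of node 2:", in_degree(adjacency, 2))    # 2
print("out-degree of node 2:", out_degree(adjacency, 2))  # 3
```

Most cells in the printed matrix stay 0, which is the sparsity problem described above: the matrix always needs 36 cells here, however few edges the graph has.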

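The GraphML structure described earlier (a graph element holding node elements with id attributes and edge elements with source and target attributes) can be read with Python's standard library, roughly as sketched below. The sample document is hypothetical: only the node ids n0 and n2 and the edge between them come from the tutorial, and the helper name `read_graphml` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical GraphML document; only n0, n2 and the n0 -> n2 edge
# are taken from the tutorial, the rest is made up for illustration.
GRAPHML_TEXT = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <edge source="n0" target="n2"/>
    <edge source="n1" target="n2"/>
  </graph>
</graphml>
"""

# GraphML elements live in this XML namespace, so lookups must use it.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def read_graphml(text):
    """Return (node ids, list of (source, target) pairs) from a GraphML string."""
    root = ET.fromstring(text)
    graph = root.find("g:graph", NS)
    nodes = [node.get("id") for node in graph.findall("g:node", NS)]
    edges = [
        (edge.get("source"), edge.get("target"))
        for edge in graph.findall("g:edge", NS)
    ]
    return nodes, edges


nodes, edges = read_graphml(GRAPHML_TEXT)
print("nodes:", nodes)
print("edges:", edges)
```

Unlike the adjacency matrix, this representation stores one element per edge, so its size grows with the number of edges rather than with the square of the number of nodes.
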
## One thought on “Tutorial 4 Social Network Analysis”

1. Monika Singh says:

How can I get this for Windows?