Community Detection Algorithms

Latest posts

Community Detection Algorithms

Beyond BLUE Score

How AI Is Transforming Insurance Fraud Detection

Gentle introduction to State Space Models

Introduction

In our highly connected society, there has been an increase in relations and connectedness. Throughout history, the complexity of these relations has grown more and more due to our technological growth which has allowed us to connect in more ways than we could ever imagine. Traditionally these connections were communities, friends, businesses etc. In modern society, thanks to distant travel, digital interaction and rapid connectivity to anywhere in the world, all these connections and networks have gotten larger and more complex, specially with the birth of social media.

The information we have access to also has similar network structures. The analysis of the networks between academies, new organizations and publishers has created very interesting insights that gives us a new understanding of how things work.

In this article we will explore what networks and communities are and will see an example of how to detect communities in networks.

What are networks?

Let’s start from the base. Networks in their simplest form are points (nodes) that are related to each other by connection (edges or link). These connections often multiply and intersect each other.

A network is created when many nodes are connected to other nodes through many edges.

Types of networks:

There are multiple types of networks, the most common ones are:

Technological network: An example of a technological network is the internet. In this case, the nodes are computers, and the edges are the connections between them. These could be fiber cables or telephone lines.
Information networks: These are larger networks that represent the network structures of the internet, such as the world wide we

World Wide Web | Source

Biological networks: Examples of biological networks are connections between neurons in the brain.

Biological network | Source

Social networks: This is the main type of network we will be going through. In social networks, the nodes are people, and the edges are their relations. These can be a physical relations, or online social media connections such as Instagram or twitter.

Social networks | Source

Community detection

Communities are nodes within a network that have denser connections between each other than the rest of the network. For example, in your daily interactions in Instagram or twitter, you share a community with your friends, family, colleagues because you interactions.

Communities | Source

Practical applications

Due to the fact that networks have played an important role in many real world issues, community detection algorithms have become useful in many fields, from social network analysis to public health use cases. For example:

– Community detection algorithms are used to detect terrorist groups in social networks by analyzing their connections and interactions.

– Using the same techniques as above, groups of malicious bots can be detected to find their origin.

– Community detection can also be used to study the more vulnerable groups to a disease. This technique can be used to discover links among patients.

– In digital forensics, communities can be detected when analyzing emails received and sent between individuals.

Algorithms

Two popular researchers of community detection are M. Girvan and M. E. J. Newman. They have proposed one of the most popular community detection algorithms, and these algorithms have been a foundation of the newer and more modern algorithms. They believe that groups of nodes in a community are tightly connected within communities and loosely connected with other communities.

The two main types of community detection techniques are agglomerative and divisive.

Agglomerative methods start with a network that includes only the nodes, without any of the edges (connections). Then, the edges are added one by one, prioritizing the stronger ones over the weaker ones. There are various ways the strength of the edges are calculated, and it depends on the implemented algorithm.

Divisive methods have an opposite approach. They start with the full graph including the nodes and edges and removes the stronger edges before going to the weaker ones. Each time an edge is removed, the strength of all the edges are calculated since the strength of the remaining edges change after a stronger edge is removed. After a few iterations, we get communities of densely connected nodes.

Code

Since this is a beginners guide, we will use the Girvan-Newman algorithm, a simple Divisive method which has been a foundational algorithm in community detection and has influenced many other approaches and improvements in this field.

First, the algorithm calculates the betweenness centrality, which represents the strength of the nodes. Edges with higher betweenness are more likely to connect to different communities.

Then, the algorithm removes the edges with the highest betweenness centralities. As the edges are removed, the network slowly splits into smaller components (communities). This step is repeated until the network is separated into different communities.

To implement this algorithm, we will be using the NetworkX python library.

Data set

The data set we will use for this example is Zachary’s karate club data set, which is a social network data set collected by Wayne W. Zacahry in 1977. It represents the social interactions between members of of karate club at a university. It has become a very popular dataset because during the observation period, the instructor and the administrator of the karate club got into a conflict. One group followed the instructor and the other followed the administrator. These two groups represent the 2 communities in the dataset, and that is what the algorithm should identify.

Lets start

The following code has been inspired by the codes in this GIT:

GitHub — cullenwatson/community-detection: Implements Girvan-Newman algorithm to detect communities

To start, we first download the necessary python libraries and load the data.

				
					import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx
import networkx as nx
from networkx.algorithms import community
%matplotlib inline

The data set used is from Wayback Machine (archive.org)

In this data set, node x and y represent the individuals, and the pair of these numbers represent their interaction (eg: they have a friendship or social interaction). In other words, each pair of numbers [x,y] represent the edge between node x and y.

				
					data = [
    [2, 1], [3, 1], [3, 2], [4, 1], [4, 2], [4, 3], [5, 1], [6, 1],
    [7, 1], [7, 5], [7, 6], [8, 1], [8, 2], [8, 3], [8, 4], [9, 1],
    [9, 3], [10, 3], [11, 1], [11, 5], [11, 6], [12, 1], [13, 1],
    [13, 4], [14, 1], [14, 2], [14, 3], [14, 4], [17, 6], [17, 7],
    [18, 1], [18, 2], [20, 1], [20, 2], [22, 1], [22, 2], [26, 24],
    [26, 25], [28, 3], [28, 24], [28, 25], [29, 3], [30, 24], [30, 27],
    [31, 2], [31, 9], [32, 1], [32, 25], [32, 26], [32, 29], [33, 3],
    [33, 9], [33, 15], [33, 16], [33, 19], [33, 21], [33, 23], [33, 24],
    [33, 30], [33, 31], [33, 32], [34, 9], [34, 10], [34, 14], [34, 15],
    [34, 16], [34, 19], [34, 20], [34, 21], [34, 23], [34, 24], [34, 27],
    [34, 28], [34, 29], [34, 30], [34, 31], [34, 32], [34, 33]
]

Then, we separate the data variable into a pandas DataFrame. The columns FROM and TO represent the source and the target, which in this case is the connection between the x and y club members.

				
					df = pd.DataFrame(data, columns=['FROM', 'TO'])
df['SOURCE'] = df['FROM'].astype(str)
df['TARGET'] = df['TO'].astype(str)
print(df.head())

To start with the algorithm, we convert the pandas dataframe into a NetworkX graph

				
					G = nx.from_pandas_edgelist(df, source='SOURCE', target='TARGET')

Then, we detect the communities using the Girvan-Newman method. As mentioned before, this algorithm analyzes the graph to detect communities by removing the edges with highest centrality.

Right after, we retrieve the first level of detected communities in a list called community_list for easier manipulation.

				
					# Implement algorithm on graph G
community = community.girvan_newman(G)

# First level of detected communities
community_list = list(next(community))

Then, we create a dictionary called community_mapping that maps each node to its corresponding community index using a for loop.

				
					community_mapping = {}
for i, com in enumerate(community_list):
    for node in com:
        community_mapping[node] = i

After that, to make it easier to reference the community membership of each node in the visualization process, we add the community data to the nodes in the graph by adding the information of: graph G, community_mapping dictionary, and string name of the attribute (‘Community’).

				
					nx.set_node_attributes(G, community_mapping, 'Community')

Moving on, we calculate the position of the nodes using the nx.spring_layout() function, which uses an algorithm to position the nodes in a way that visually represents the connections between the nodes. Nodes that are more connected will be placed closer together.

				
					pos = nx.spring_layout(G)

Finally, we get the community colors, set the node size by its degree (importance) and draw the graph.

				
					#Community color
colors = [community_mapping[node] for node in G.nodes()]

# Node size based on degree (importance)
node_sizes = [50 * G.degree(node) for node in G.nodes()]  

# Draw the graph
plt.figure(figsize=(10, 8))
nx.draw(G, pos, with_labels=True, node_color=colors, cmap=plt.cm.Set1, node_size=500, edge_color='gray')
plt.title("Girvan-Newman Community Detection on Zachary's Karate Club")
plt.show()

As a result, we get a graph clearly separated by two communities. Node 34 (Grey) represents the Administrator of the club, and node 1 (Red) is the instructor.

Image created by author

Additional Community Analysis

For additional community analysis, we can compare the number of members in a community in a bar graph:

				
					community_sizes = pd.Series(list(community_mapping.values())).value_counts()

plt.figure(figsize=(8, 4))
community_sizes.plot(kind='bar', color='orange')
plt.ylabel('Number of nodes')
plt.xlabel('Community')
plt.title('Size of communities')
plt.xticks(rotation=0)
plt.show()

Image created by author

We can also identify the nodes in each community:

				
					community_0_members = [node for node in G.nodes() if community_mapping[node] == 0]
print("Members of community 0:", community_0_members)

Finally, we can analyze the node Centralities to find the most influencial nodes, in which we see that the Administrator (34) is the most influencial node, followed by the instructor (1).

				
					degree_centrality = nx.degree_centrality(G)

centrality_df = pd.DataFrame.from_dict(degree_centrality, orient='index', columns=['Centrality'])
centrality_df = centrality_df.sort_values(by='Centrality')

print(centrality_df.head(10))

Additional Community detection algorithms in the NetworkX library

If you would like to test more algorithms, the NetworkX libraryalso includes some other community detection algorithms such as:

– Fluid communities algorithm: This algorithm uses the simple idea of the interaction so fluids in an environment which expand and push each other.

NetworkX Reference Guide

– Laber propagation algorithm: A semi-supervised machine learning algorithm that assigns labels to unlabeled data points.

NetworkX Reference Guide

– Clique percolation algorithm: It finds the k-clique communities in a network using the percolation method.

NetworkX Reference Guide

– Kernighan-Lin algorithm: This algorithm splits a network into two sets by exchanging the position of a pair of nodes to reduce the edge cut between the two set.

NetworkX Reference Guide

Conclusion

In summary, we went over the principle ideas and definitions of Networks and Community detection, with an example of an algorithm that detects communities. In the future, we will go over more modern and complex algorithm that are used in different industries today.