GT CS 7280: Network Science
Assignment 5: Network Models and Statistical Analysis
Spring 2024
Overview
The objective of this assignment is to understand network generation and network statistical analysis. This assignment covers topics in lessons 12 and 13.
Submission
Please submit your Jupyter notebook A5-YOURGTUSERNAME.ipynb
along with a requirements.txt
so that we can replicate your Python dependencies and run your code as needed.
With Anaconda, you can do this by running:
conda list -e > requirements.txt
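If you are not using Anaconda, a pip-based equivalent (assuming a virtual environment with your project's packages installed) is:

```shell
pip freeze > requirements.txt
```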
Getting Started
In this assignment you will use “football.gml” to answer parts 1 & 2. This network is the NCAA College Football Network. For the Hierarchical Random Graph in Part 2, you will be using the modified file “football-hrg.gml” and PyHRG. For part 3 you will be using the “slashdot.txt” file. The graph loading code is already provided for you. You DO NOT need to alter it.
NCAA College Football Network
Part 1: Structural Properties of the Graph [25 points]
The goal for the first part of this assignment is to show the structural properties of the empirical network. Just like the previous assignment, the code to load the graph is already provided and you don’t need to make any modifications.
1. [5 points]
Complete the calculate_graph_statistics
function to calculate the network diameter, characteristic path length (CPL), average clustering coefficient, transitivity, assortativity, and degree sequence (a list containing the degree of each node). Return a dictionary of the results with the name of each statistic as the key.
def calculate_graph_statistics(football):
    # calculate the network diameter
    diameter = nx.diameter(football)
    # calculate the characteristic path length (CPL)
    cpl = nx.average_shortest_path_length(football)
    # calculate the average clustering coefficient
    avg_clustering = nx.average_clustering(football)
    # calculate the transitivity
    transitivity = nx.transitivity(football)
    # calculate the assortativity
    assortativity = nx.degree_assortativity_coefficient(football)
    # the degree sequence: one entry per node
    degree_sequence = [d for n, d in football.degree()]
    # return the dictionary of results
    return {
        'Network Diameter': diameter,
        'Characteristic Path Length (CPL)': cpl,
        'Average Clustering Coefficient': avg_clustering,
        'Transitivity': transitivity,
        'Assortativity': assortativity,
        'Degree Sequence': degree_sequence
    }
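As a quick sanity check, the same NetworkX calls can be run on a small built-in graph (Zachary's karate club here, purely illustrative, not the assignment network):

```python
import networkx as nx

G = nx.karate_club_graph()
print(nx.diameter(G))                                # → 5
print(round(nx.average_shortest_path_length(G), 2))
print(round(nx.transitivity(G), 2))
```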
2. [10 points]
Complete the sweep_louvain_resolutions
function to identify community structures within the graph. Use the Louvain algorithm to find the best partition. Use 10 different resolution parameters from min_resolution
to max_resolution
(inclusive), and compare the result of each partition to the ground truth (you can find the ground truth in the value
field associated with each node in the graph) using the normalized mutual information function. Return the list of resolutions and NMIs from the resulting community assignments.
Additionally, complete the plot_nmi_vs_resolution
function to display the NMI for each resolution as a line plot.
# NMI is not provided by NetworkX; use scikit-learn's implementation
from sklearn.metrics import normalized_mutual_info_score

def sweep_louvain_resolutions(football, min_resolution, max_resolution):
    # 10 evenly spaced resolutions, inclusive of both endpoints
    resolutions = np.linspace(min_resolution, max_resolution, 10)
    nmis = []
    # ground-truth community labels from the 'value' node attribute
    true_labels = [football.nodes[node]['value'] for node in football.nodes]
    for resolution in resolutions:
        # run the Louvain algorithm at this resolution
        communities = nx.algorithms.community.louvain_communities(football, resolution=resolution)
        # convert the list of community sets into one label per node
        node_to_community = {node: i for i, community in enumerate(communities) for node in community}
        pred_labels = [node_to_community[node] for node in football.nodes]
        # compare the partition against the ground truth
        nmis.append(normalized_mutual_info_score(true_labels, pred_labels))
    # return the list of resolutions and NMIs
    return resolutions, nmis
def plot_nmi_vs_resolution(resolutions, nmis):
    # create a figure and axis
    fig, ax = plt.subplots()
    # plot the NMI achieved at each resolution
    ax.plot(resolutions, nmis)
    ax.set_xlabel('Resolution')
    ax.set_ylabel('NMI')
    ax.set_title('NMI vs Resolution')
    plt.show()
3. [5 points]
Complete the calculate_best_partition
function to find the resolution associated with the partition that generates the highest NMI.
Additionally, complete the plot_best_partition
function to visualize the network with both the ground truth community assignments and the best partition assignments. Plot both networks side by side as subplots within a single figure. Make sure to label which graph is which (louvain vs ground truth).
def calculate_best_partition(resolutions, nmis):
    # find the index of the maximum NMI
    max_nmi_index = np.argmax(nmis)
    # return the resolution associated with the maximum NMI
    best_resolution = resolutions[max_nmi_index]
    return best_resolution
def plot_best_partition(football, best_resolution):
    # rerun the Louvain algorithm at the best resolution
    louvain_communities = nx.algorithms.community.louvain_communities(football, resolution=best_resolution)
    # map each node to its Louvain community index so it can be used as a color
    node_to_community = {node: i for i, community in enumerate(louvain_communities) for node in community}
    louvain_colors = [node_to_community[node] for node in football.nodes]
    ground_truth_colors = [football.nodes[node]['value'] for node in football.nodes]
    # use a single layout so the two panels are directly comparable
    pos = nx.spring_layout(football, seed=0)
    fig, axs = plt.subplots(1, 2, figsize=(10, 5))
    # plot the ground truth community assignments
    nx.draw_networkx(football, pos=pos, node_color=ground_truth_colors, with_labels=False, ax=axs[0])
    # plot the best partition assignments
    nx.draw_networkx(football, pos=pos, node_color=louvain_colors, with_labels=False, ax=axs[1])
    axs[0].set_title('Ground Truth')
    axs[1].set_title('Louvain (Best Partition)')
    plt.show()
4. [5 points]
Complete the calculate_inter_community_density
function to calculate the intercommunity connection density matrix (see matrix P in L12: Generating Networks with Community Structure for the definition).
Additionally, complete the plot_p_matrix
function to plot the inter-community connection density matrix as a heatmap. Make sure to annotate the values within each cell and provide a legend.
def calculate_inter_community_density(football):
    # order the communities consistently so matrix rows/columns line up
    labels = [football.nodes[node]['value'] for node in football.nodes]
    communities = sorted(set(labels))
    index = {c: i for i, c in enumerate(communities)}
    sizes = np.array([labels.count(c) for c in communities])
    k = len(communities)
    # count the edges between (and within) each pair of communities
    edge_counts = np.zeros((k, k))
    for u, v in football.edges:
        cu, cv = index[football.nodes[u]['value']], index[football.nodes[v]['value']]
        edge_counts[cu][cv] += 1
        if cu != cv:
            edge_counts[cv][cu] += 1
    # number of possible node pairs between (and within) communities
    possible = np.outer(sizes, sizes).astype(float)
    np.fill_diagonal(possible, sizes * (sizes - 1) / 2)
    possible[possible == 0] = 1  # guard against single-node communities
    # density P[i][j] = observed edges / possible edges
    return edge_counts / possible
def plot_p_matrix(p_matrix):
    # create a figure and axis
    fig, ax = plt.subplots()
    # plot the heatmap
    im = ax.imshow(p_matrix)
    # annotate each cell with its density value
    for i in range(p_matrix.shape[0]):
        for j in range(p_matrix.shape[1]):
            ax.text(j, i, f'{p_matrix[i, j]:.2f}', ha='center', va='center', color='w')
    ax.set_title('Inter-Community Connection Density Matrix')
    ax.set_xlabel('Community')
    ax.set_ylabel('Community')
    # the colorbar serves as the legend for the density scale
    fig.colorbar(im)
    plt.show()
5. [5 points]
Run the code in the 1.5 cell. How does the resolution impact the NMI? Is the partition for the best NMI a good match to the ground truth? Justify your answer based on the visual plot and the NMI value itself.
Answer
Over the range swept, the NMI increases with the resolution. A higher resolution makes the Louvain algorithm favor a larger number of smaller communities, which moves the detected partition closer to the fine-grained conference structure of the ground truth. The partition with the best NMI is a good match to the ground truth: its NMI is close to 1, indicating strong agreement, and the side-by-side plots show largely identical community assignments.
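For intuition about the NMI scale, note that it is invariant to label permutation: two partitions that group nodes identically score 1 even if the label values differ (a minimal sketch, assuming scikit-learn):

```python
from sklearn.metrics import normalized_mutual_info_score

# same grouping, different label names → perfect score of 1
print(normalized_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0]))
# grouping unrelated to the reference → score of 0
print(normalized_mutual_info_score([0, 0, 1, 1], [0, 1, 0, 1]))
```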
Part 2: Graph Generation [40 points]
In this section you will use different graph generators that you learned about in the lessons to generate graphs based on the statistics you calculated in part 1.
1. [10 points]
Complete the generate_configuration_graphs
and generate_sbm_graphs
functions to generate n_graphs=100
graphs using the Configuration Model and Stochastic Block model graph generator in NetworkX (with the configuration_model
function, use the create_using=nx.Graph()
argument to get a Graph and not MultiGraph).
def generate_configuration_graphs(n_graphs, football):
    generated_graphs = []
    # degree sequence of the empirical network
    degree_sequence = [d for n, d in football.degree()]
    for i in range(n_graphs):
        # configuration model; create_using=nx.Graph() collapses parallel edges
        generated_graph = nx.configuration_model(degree_sequence, create_using=nx.Graph())
        # remove self-loops left over from the stub-matching process
        generated_graph.remove_edges_from(nx.selfloop_edges(generated_graph))
        generated_graphs.append(generated_graph)
    return generated_graphs
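A standalone toy run of the configuration model (illustrative degree sequence; its sum must be even for a valid stub pairing):

```python
import networkx as nx

deg_seq = [3, 3, 2, 2, 1, 1]  # sums to 12, so a stub pairing exists
G = nx.configuration_model(deg_seq, create_using=nx.Graph())
# collapsing to a simple Graph drops parallel edges, but self-loops may remain
G.remove_edges_from(nx.selfloop_edges(G))
print(G.number_of_nodes())  # → 6
```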
def generate_sbm_graphs(n_graphs, football):
    generated_graphs = []
    # block sizes from the ground-truth 'value' labels, in sorted label order
    labels = [football.nodes[node]['value'] for node in football.nodes]
    communities = sorted(set(labels))
    index = {c: i for i, c in enumerate(communities)}
    sizes = [labels.count(c) for c in communities]
    # edge-density matrix between blocks (the P matrix from part 1.4)
    counts = np.zeros((len(communities), len(communities)))
    for u, v in football.edges:
        cu, cv = index[football.nodes[u]['value']], index[football.nodes[v]['value']]
        counts[cu][cv] += 1
        if cu != cv:
            counts[cv][cu] += 1
    possible = np.outer(sizes, sizes).astype(float)
    np.fill_diagonal(possible, [max(s * (s - 1) / 2, 1) for s in sizes])
    for i in range(n_graphs):
        # generate an SBM graph with the empirical block sizes and densities
        generated_graphs.append(nx.stochastic_block_model(sizes, (counts / possible).tolist()))
    return generated_graphs
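A standalone toy SBM, with illustrative block sizes and densities:

```python
import networkx as nx

sizes = [3, 3]                    # two blocks of three nodes each
p = [[0.9, 0.1], [0.1, 0.9]]      # dense within blocks, sparse between
G = nx.stochastic_block_model(sizes, p, seed=42)
print(G.number_of_nodes())        # → 6
# the block membership is stored as the graph attribute 'partition'
print(len(G.graph['partition']))  # → 2
```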
2. [10 points]
Here you will be working with the Hierarchical Random Graphs. We have composed a dendrogram fitted on the empirical network which you can find in “football-hrg.gml”, using PyHRG. The dendrogram is formulated as a directed graph. Each leaf node (node with no outgoing edges) in the dendrogram represents a node in the empirical network. Each non-leaf node stores the information about its left/right child (“L” / “R”) and the probability of leaf nodes in the left tree connecting to the leaf nodes in the right tree (“p”), as node attributes.
Complete the calculate_edge_probability
function to create a dictionary of edge probabilities between all pairs of leaf nodes.
Then, complete the generate_graph_from_probs
function to generate a networkx graph based on the edge probabilities calculated in the previous part. Specifically, if two nodes i and j have probability p of having an edge, generate an edge randomly with probability p for each pair of nodes.
Finally, complete the generate_hrg_graphs
function to generate n_graphs=100
HRG graphs using these edge probabilities and the generate function you just completed above. This should function very similarly to the functions in part 2.1, but you will call your own generate_graph_from_probs
inside the loop instead of calling a networkx generator.
def calculate_edge_probability(dendrogram):
    # collect the leaf nodes reachable below a given dendrogram node
    def leaves_under(node):
        if dendrogram.out_degree(node) == 0:
            return [node]
        return [leaf for child in dendrogram.successors(node) for leaf in leaves_under(child)]

    edge_probabilities = {}
    # each internal node is the lowest common ancestor of every (left-leaf,
    # right-leaf) pair below it, so its 'p' attribute is the connection
    # probability for exactly those pairs
    for node, attrs in dendrogram.nodes(data=True):
        if dendrogram.out_degree(node) == 0:
            continue  # leaf nodes carry no probability
        p = attrs['p']
        for node1 in leaves_under(attrs['L']):
            for node2 in leaves_under(attrs['R']):
                edge_probabilities[(node1, node2)] = p
    return edge_probabilities
def generate_graph_from_probs(edge_probabilities):
    graph = nx.Graph()
    # add every leaf node up front so nodes with no sampled edges are kept
    graph.add_nodes_from({node for pair in edge_probabilities for node in pair})
    # flip a biased coin for each pair of nodes
    for (node1, node2), p in edge_probabilities.items():
        if np.random.rand() < p:
            graph.add_edge(node1, node2)
    return graph
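The Bernoulli edge-sampling step in isolation, with deterministic toy probabilities (not from the dendrogram) so the outcome is fixed:

```python
import numpy as np
import networkx as nx

probs = {('a', 'b'): 1.0, ('a', 'c'): 0.0}  # illustrative probabilities
G = nx.Graph()
G.add_nodes_from({node for pair in probs for node in pair})
for (u, v), p in probs.items():
    # rand() lies in [0, 1), so p=1.0 always fires and p=0.0 never does
    if np.random.rand() < p:
        G.add_edge(u, v)
print(sorted(G.edges()))  # → [('a', 'b')]
```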
def generate_hrg_graphs(edge_probabilities, n_graphs):
    generated_graphs = []
    for i in range(n_graphs):
        # sample one HRG realization from the edge probabilities
        generated_graph = generate_graph_from_probs(edge_probabilities)
        generated_graphs.append(generated_graph)
    return generated_graphs
3. [10 points]
Complete the calculate_generated_statistic
function to calculate the network diameter, CPL, average clustering coefficient, transitivity, and assortativity for each graph in a list of graphs. Format the return as a dictionary where the key is a string describing the property and the value is a list of the result for each of the graphs in the list. Note: if any of your graphs are not connected, you may use the largest connected component.
Next, complete the compare_generated_to_ground_truth
function to compare the diameter, CPL, average clustering coefficient, transitivity, and assortativity for the empirical network with the sampled values from each of the graph models using a one-sample t-test.
Additionally, complete the plot_graph_statistics
function to visualize the distribution of each individual network property (e.g. CPL, average clustering) as a boxplot. Create the plot as a single plot with a series of five side-by-side boxplots (one for each property) as subplots. Label the overall plot with the generator name and label each individual box plot with the property being shown along the y axis.
def calculate_generated_statistic(graphs):
    statistics = {
        'Network Diameter': [],
        'Characteristic Path Length (CPL)': [],
        'Average Clustering Coefficient': [],
        'Transitivity': [],
        'Assortativity': []
    }
    for graph in graphs:
        # diameter and CPL are undefined on disconnected graphs,
        # so fall back to the largest connected component
        if not nx.is_connected(graph):
            largest_cc = max(nx.connected_components(graph), key=len)
            graph = graph.subgraph(largest_cc)
        statistics['Network Diameter'].append(nx.diameter(graph))
        statistics['Characteristic Path Length (CPL)'].append(nx.average_shortest_path_length(graph))
        statistics['Average Clustering Coefficient'].append(nx.average_clustering(graph))
        statistics['Transitivity'].append(nx.transitivity(graph))
        statistics['Assortativity'].append(nx.degree_assortativity_coefficient(graph))
    return statistics
# one-sample t-test from SciPy
from scipy.stats import ttest_1samp

def compare_generated_to_ground_truth(football, generated_statistics):
    # empirical value of each property, from part 1
    empirical = calculate_graph_statistics(football)
    t_test_results = {}
    # test whether the sampled values are centered on the empirical value
    for prop, samples in generated_statistics.items():
        t_statistic, p_value = ttest_1samp(samples, empirical[prop])
        t_test_results[prop] = {
            't_statistic': t_statistic,
            'p_value': p_value
        }
    return t_test_results
def plot_graph_statistics(generated_statistics, generator_name):
    # five side-by-side boxplots, one per network property
    fig, axs = plt.subplots(1, 5, figsize=(20, 4))
    for i, (prop, values) in enumerate(generated_statistics.items()):
        axs[i].boxplot(values)
        # label each boxplot with the property shown on the y axis
        axs[i].set_ylabel(prop)
        axs[i].set_xticks([])
    # the overall title identifies which generator produced the graphs
    fig.suptitle(generator_name)
    plt.show()
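To illustrate how the one-sample t-test behaves in this comparison, a minimal sketch with synthetic samples (illustrative numbers only):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=1.0, size=100)  # stand-in for 100 sampled statistics

# hypothesized value far from the sample mean → tiny p-value, reject
t_far, p_far = ttest_1samp(samples, 0.0)
# hypothesized value equal to the sample mean → large p-value, cannot reject
t_near, p_near = ttest_1samp(samples, samples.mean())
print(p_far < 0.001, p_near > 0.9)  # → True True
```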
4. [10 points]
How do the statistics compare against the three different types of graph generators? Are there any particular measures that are very different? Can you explain how the algorithm for each generator influences the statistics that are different? Which graph generator would you say is the best fit for the ground truth? Justify your answer based on the t-test results and the box plots.
Answer
The Configuration Model and Stochastic Block Model graphs produce statistics close to those of the empirical network, while the HRG graphs differ substantially. The HRG graphs have a noticeably larger diameter and CPL: the hierarchical generation process spreads edges across the levels of the dendrogram, producing a sparser, more stretched-out graph. Their average clustering coefficient, transitivity, and assortativity also deviate, because HRG edges are sampled independently for each node pair and the model does not directly preserve local structure such as triangles or degree correlations. By contrast, the Configuration Model preserves the degree sequence exactly, and the Stochastic Block Model preserves the community-level edge densities, so both reproduce the empirical statistics more faithfully. The t-test results support this: the p-values for the Configuration Model and SBM statistics are generally large (the empirical values are consistent with the sampled distributions), while those for the HRG are small, rejecting the hypothesis that the HRG samples are centered on the empirical values. The box plots tell the same story, with the empirical values falling inside the Configuration Model and SBM distributions but outside the HRG ones. Overall, the Configuration Model and Stochastic Block Model are the best fit for the ground truth.