CS 7280 Network Science Assignment 5


GT CS 7280: Network Science

Assignment 5: Network Models and Statistical Analysis

Spring 2024

Overview

The objective of this assignment is to understand network generation and network statistical analysis. This assignment covers topics in lessons 12 and 13.

Submission

Please submit your Jupyter notebook A5-YOURGTUSERNAME.ipynb along with a requirements.txt so that we can replicate your Python dependencies and run your code as needed.

With Anaconda, you can do this by running:

conda list -e > requirements.txt

Getting Started

In this assignment you will use “football.gml” to answer parts 1 & 2. This network is the NCAA College Football Network. For the Hierarchical Random Graph in Part 2, you will be using the modified file “football-hrg.gml” and PyHRG. For part 3 you will be using the “slashdot.txt” file. The graph loading code is already provided for you. You DO NOT need to alter it.
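The provided loader most likely relies on networkx's `read_gml`; as a quick, self-contained sanity check of the format (the toy graph and temporary file below are illustrative, not the assignment data):

```python
import os
import tempfile
import networkx as nx

# build a tiny toy graph with a ground-truth "value" attribute, mirroring the
# structure described for football.gml (team names here are made up)
toy = nx.Graph()
toy.add_nodes_from([("TeamA", {"value": 0}), ("TeamB", {"value": 0}), ("TeamC", {"value": 1})])
toy.add_edges_from([("TeamA", "TeamB"), ("TeamB", "TeamC")])

# round-trip through GML, the same format the provided loader uses
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.gml")
    nx.write_gml(toy, path)
    loaded = nx.read_gml(path)

print(loaded.number_of_nodes(), loaded.nodes["TeamA"]["value"])
```

Node attributes such as `value` survive the round trip, which is how the ground-truth community labels are accessed in parts 1 and 2.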

NCAA College Football Network

Part 1: Structural Properties of the Graph [25 points]

The goal for the first part of this assignment is to show the structural properties of the empirical network. Just like the previous assignment, the code to load the graph is already provided and you don’t need to make any modifications.

1. [5 points]

Complete the calculate_graph_statistics function to calculate the network diameter, characteristic path length (CPL), average clustering coefficient, transitivity, assortativity, and degree sequence (a list of all the degrees for each node). Return a dictionary of the results with the name of each statistic as the key.

def calculate_graph_statistics(football):
    # calculate the network diameter
    diameter = nx.diameter(football)

    # calculate the characteristic path length (CPL)
    cpl = nx.average_shortest_path_length(football)

    # calculate the average clustering coefficient
    avg_clustering = nx.average_clustering(football)

    # calculate the transitivity
    transitivity = nx.transitivity(football)

    # calculate the assortativity
    assortativity = nx.degree_assortativity_coefficient(football)

    # calculate the degree sequence
    degree_sequence = [d for n, d in football.degree()]

    # return the dictionary of results
    return {
        'Network Diameter': diameter,
        'Characteristic Path Length (CPL)': cpl,
        'Average Clustering Coefficient': avg_clustering,
        'Transitivity': transitivity,
        'Assortativity': assortativity,
        'Degree Sequence': degree_sequence
    }

2. [10 points]

Complete the sweep_louvain_resolutions function to identify community structures within the graph. Use the Louvain algorithm to find the best partition. Use 10 different resolution parameters from min_resolution to max_resolution (inclusive), and compare the result of each partition to the ground truth (you can find the ground truth in the value field associated with each node in the graph) using the normalized mutual information function. Return the list of resolutions and NMIs from the resulting community assignments.

Additionally, complete the plot_nmi_vs_resolution function to display the NMI for each resolution as a line plot.

def sweep_louvain_resolutions(football, min_resolution, max_resolution):
    # create a list of resolutions
    resolutions = np.linspace(min_resolution, max_resolution, 10)

    # create a list to store the NMIs
    nmis = []

    # loop over the resolutions
    for resolution in resolutions:
        # run the Louvain algorithm at this resolution
        louvain_communities = nx.algorithms.community.louvain_communities(football, resolution=resolution)

        # convert the community sets and the ground-truth attribute into per-node label lists
        node_to_community = {node: i for i, comm in enumerate(louvain_communities) for node in comm}
        predicted_labels = [node_to_community[node] for node in football.nodes]
        true_labels = [football.nodes[node]['value'] for node in football.nodes]

        # calculate the NMI (requires: from sklearn.metrics import normalized_mutual_info_score)
        nmi = normalized_mutual_info_score(true_labels, predicted_labels)

        # append the NMI to the list
        nmis.append(nmi)

    # return the list of resolutions and NMIs
    return resolutions, nmis

def plot_nmi_vs_resolution(resolutions, nmis):
    # create a figure and axis
    fig, ax = plt.subplots()

    # plot the NMI vs resolution
    ax.plot(resolutions, nmis)

    # set the x-axis label
    ax.set_xlabel('Resolution')

    # set the y-axis label
    ax.set_ylabel('NMI')

    # set the title
    ax.set_title('NMI vs Resolution')

    # show the plot
    plt.show()

3. [5 points]

Complete the calculate_best_partition function to find the resolution associated with the partition that generates the highest NMI.

Additionally, complete the plot_best_partition function to visualize the network with both the ground truth community assignments and the best partition assignments. Plot both networks side by side as subplots within a single figure. Make sure to label which graph is which (louvain vs ground truth).

def calculate_best_partition(resolutions, nmis):
    # find the index of the maximum NMI
    max_nmi_index = np.argmax(nmis)

    # get the resolution associated with the maximum NMI
    best_resolution = resolutions[max_nmi_index]

    # return the best resolution
    return best_resolution

def plot_best_partition(football, best_resolution):
    # run the Louvain algorithm with the best resolution
    louvain_communities = nx.algorithms.community.louvain_communities(football, resolution=best_resolution)

    # map each node to the index of its Louvain community
    node_to_community = {node: i for i, comm in enumerate(louvain_communities) for node in comm}

    # use a single layout so the two plots are directly comparable
    pos = nx.spring_layout(football, seed=42)

    # create a figure with two side-by-side axes
    fig, axs = plt.subplots(1, 2, figsize=(16, 8))

    # plot the ground truth community assignments
    ground_truth_colors = [football.nodes[node]['value'] for node in football.nodes]
    nx.draw_networkx(football, pos=pos, node_color=ground_truth_colors, ax=axs[0])

    # plot the best partition assignments
    louvain_colors = [node_to_community[node] for node in football.nodes]
    nx.draw_networkx(football, pos=pos, node_color=louvain_colors, ax=axs[1])

    # label which graph is which
    axs[0].set_title('Ground Truth')
    axs[1].set_title('Louvain (Best Partition)')

    # show the plot
    plt.show()

4. [5 points]

Complete the calculate_inter_community_density function to calculate the intercommunity connection density matrix (see matrix P in L12: Generating Networks with Community Structure for the definition).

Additionally, complete the plot_p_matrix function to plot the inter-community connection density matrix as a heatmap. Make sure to annotate the values within each cell and provide a legend.

def calculate_inter_community_density(football):
    # group the nodes by their ground-truth community
    communities = {}
    for node in football.nodes:
        communities.setdefault(football.nodes[node]['value'], []).append(node)

    # fix an ordering of the community ids
    community_ids = sorted(communities)
    index = {c: i for i, c in enumerate(community_ids)}
    k = len(community_ids)

    # count the edges between (and within) each pair of communities
    edge_counts = np.zeros((k, k))
    for u, v in football.edges:
        a, b = index[football.nodes[u]['value']], index[football.nodes[v]['value']]
        edge_counts[a, b] += 1
        if a != b:
            edge_counts[b, a] += 1

    # divide each count by the number of possible edges between the two communities
    p_matrix = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            n_a, n_b = len(communities[community_ids[a]]), len(communities[community_ids[b]])
            possible = n_a * (n_a - 1) / 2 if a == b else n_a * n_b
            p_matrix[a, b] = edge_counts[a, b] / possible if possible > 0 else 0.0

    # return the inter-community connection density matrix
    return p_matrix

def plot_p_matrix(p_matrix):
    # create a figure and axis
    fig, ax = plt.subplots()

    # plot the heatmap
    im = ax.imshow(p_matrix)

    # annotate the density value within each cell
    for i in range(p_matrix.shape[0]):
        for j in range(p_matrix.shape[1]):
            ax.text(j, i, f'{p_matrix[i, j]:.2f}', ha='center', va='center', color='w', fontsize=6)

    # set the title
    ax.set_title('Inter-Community Connection Density Matrix')

    # set the x-axis label
    ax.set_xlabel('Community')

    # set the y-axis label
    ax.set_ylabel('Community')

    # show the colorbar
    fig.colorbar(im)

    # show the plot
    plt.show()

5. [5 points]

Run the code in the 1.5 cell. How does the resolution impact the NMI? Is the partition for the best NMI a good match to the ground truth? Justify your answer based on the visual plot and the NMI value itself.

Answer

As the resolution increases, Louvain produces more, smaller communities; at low resolutions the teams are merged into a few large communities and the NMI is low, and the NMI rises as the number of detected communities approaches the number of conferences in the ground truth. The partition at the best resolution is a good match: its NMI is close to 1, indicating strong agreement with the ground truth, and the side-by-side plots show that the detected communities line up with the conference assignments for almost all teams.
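A toy illustration of why NMI suits this comparison: it is invariant to relabeling of the communities, and drops below 1 when the grouping itself differs (the labels below are made up, not assignment data):

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
perfect = [2, 2, 0, 0, 1, 1]   # same grouping, different community ids
merged = [0, 0, 0, 0, 1, 1]    # two of the communities merged into one

print(normalized_mutual_info_score(true_labels, perfect))  # ≈ 1.0
print(normalized_mutual_info_score(true_labels, merged))   # below 1
```

This is why the Louvain community ids do not need to match the ground-truth `value` ids before computing the NMI.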

Part 2: Graph Generation [40 points]

In this section you will use different graph generators that you learned about in the lessons to generate graphs based on the statistics you calculated in part 1.

1. [10 points]

Complete the generate_configuration_graphs and generate_sbm_graphs functions to generate n_graphs=100 graphs using the Configuration Model and Stochastic Block model graph generator in NetworkX (with the configuration_model function, use the create_using=nx.Graph() argument to get a Graph and not MultiGraph).
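Before filling in the functions, it may help to confirm the create_using behaviour on a toy degree sequence (the sequence below is illustrative, not the football data):

```python
import networkx as nx

# toy degree sequence; the sum must be even for the configuration model
deg_sequence = [3, 3, 2, 2, 2]

multi = nx.configuration_model(deg_sequence)                            # default: MultiGraph
simple = nx.configuration_model(deg_sequence, create_using=nx.Graph())  # simple Graph

print(multi.is_multigraph(), simple.is_multigraph())  # True False
```

With `create_using=nx.Graph()`, parallel edges collapse into single edges, so the generated degrees may fall slightly below the target sequence.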

def generate_configuration_graphs(n_graphs, football):
    # create an empty list to store the generated graphs
    generated_graphs = []

    # extract the empirical degree sequence once
    degree_sequence = [d for n, d in football.degree()]

    # loop over the number of graphs to generate
    for i in range(n_graphs):
        # generate a configuration-model graph (create_using=nx.Graph() gives a simple Graph)
        generated_graph = nx.configuration_model(degree_sequence, create_using=nx.Graph())

        # append the generated graph to the list
        generated_graphs.append(generated_graph)

    # return the list of generated graphs
    return generated_graphs

def generate_sbm_graphs(n_graphs, football):
    # create an empty list to store the generated graphs
    generated_graphs = []

    # community sizes, in a fixed order of the ground-truth community ids
    communities = {}
    for node in football.nodes:
        communities.setdefault(football.nodes[node]['value'], []).append(node)
    sizes = [len(communities[c]) for c in sorted(communities)]

    # block probabilities: the inter-community connection density matrix from part 1.4
    p_matrix = calculate_inter_community_density(football)

    # loop over the number of graphs to generate
    for i in range(n_graphs):
        # generate a SBM graph with the empirical community sizes and densities
        generated_graph = nx.stochastic_block_model(sizes, p_matrix)

        # append the generated graph to the list
        generated_graphs.append(generated_graph)

    # return the list of generated graphs
    return generated_graphs

2. [10 points]

Here you will be working with Hierarchical Random Graphs (HRGs). We have fitted a dendrogram to the empirical network using PyHRG; you can find it in “football-hrg.gml”. The dendrogram is stored as a directed graph. Each leaf node (a node with no outgoing edges) in the dendrogram represents a node in the empirical network. Each non-leaf node stores, as node attributes, its left and right children (“L” / “R”) and the probability (“p”) that a leaf in its left subtree connects to a leaf in its right subtree.
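Since the dendrogram file itself is not reproduced here, a hand-built toy dendrogram (node names and probabilities below are made up) illustrates the structure described above:

```python
import networkx as nx

# toy dendrogram in the same shape as football-hrg.gml: internal nodes carry
# "L", "R", and "p" attributes; edges point from parent to child
dendrogram = nx.DiGraph()
dendrogram.add_node("root", L="inner", R="c", p=0.1)
dendrogram.add_node("inner", L="a", R="b", p=0.8)
dendrogram.add_edges_from([("root", "inner"), ("root", "c"), ("inner", "a"), ("inner", "b")])

# leaf nodes have no outgoing edges; they correspond to empirical-network nodes
leaves = sorted(n for n in dendrogram.nodes if dendrogram.out_degree(n) == 0)
print(leaves)  # ['a', 'b', 'c']

# the "p" of an internal node is the connection probability between its subtrees,
# e.g. leaves a and b connect with probability 0.8, while a and c use root's 0.1
print(dendrogram.nodes["inner"]["p"])  # 0.8
```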

Complete the calculate_edge_probability function to create a dictionary of edge probabilities between all pairs of leaf nodes.

Then, complete the generate_graph_from_probs function to generate a networkx graph based on the edge probabilities calculated in the previous part. Specifically, if two nodes i and j have probability p of having an edge, generate an edge randomly with probability p for each pair of nodes.

Finally, complete the generate_hrg_graphs function to generate n_graphs=100 HRG graphs using these edge probabilities and the generate function you just completed above. This should function very similarly to the functions in part 2.1, but you will call your own generate_graph_from_probs inside the loop instead of calling a networkx generator.

def calculate_edge_probability(dendrogram):
    # leaf nodes are the dendrogram nodes with no outgoing edges
    leaves = [n for n in dendrogram.nodes if dendrogram.out_degree(n) == 0]

    # precompute, for each leaf, its path of ancestors up to the root
    ancestor_paths = {}
    for leaf in leaves:
        path = [leaf]
        while True:
            predecessors = list(dendrogram.predecessors(path[-1]))
            if not predecessors:
                break
            path.append(predecessors[0])
        ancestor_paths[leaf] = path

    # the probability of an edge between two leaves is the "p" attribute of
    # their lowest common ancestor; store each unordered pair exactly once
    edge_probabilities = {}
    for i, leaf1 in enumerate(leaves):
        ancestors1 = ancestor_paths[leaf1]
        for leaf2 in leaves[i + 1:]:
            ancestors2 = set(ancestor_paths[leaf2])
            lca = next(node for node in ancestors1 if node in ancestors2)
            edge_probabilities[(leaf1, leaf2)] = dendrogram.nodes[lca]['p']

    # return the edge probabilities
    return edge_probabilities

def generate_graph_from_probs(edge_probabilities):
    # create an empty graph containing every node that appears in the probabilities,
    # so nodes that happen to draw no edges are not silently dropped
    graph = nx.Graph()
    for node1, node2 in edge_probabilities:
        graph.add_nodes_from([node1, node2])

    # loop over the edge probabilities
    for (node1, node2), p in edge_probabilities.items():
        # generate an edge randomly with probability p
        if np.random.rand() < p:
            graph.add_edge(node1, node2)

    # return the graph
    return graph

def generate_hrg_graphs(edge_probabilities, n_graphs):
    # create an empty list to store the generated graphs
    generated_graphs = []

    # loop over the number of graphs to generate
    for i in range(n_graphs):
        # generate a HRG graph
        generated_graph = generate_graph_from_probs(edge_probabilities)

        # append the generated graph to the list
        generated_graphs.append(generated_graph)

    # return the list of generated graphs
    return generated_graphs

3. [10 points]

Complete the calculate_generated_statistic function to calculate the network diameter, CPL, average clustering coefficient, transitivity, and assortativity for each graph in a list of graphs. Format the return as a dictionary where the key is a string describing the property and the value is a list of the result for each of the graphs in the list. Note if any of your graphs are not connected, you may use the largest connected component.

Next, complete the compare_generated_to_ground_truth function to compare the diameter, CPL, average clustering coefficient, transitivity, and assortativity for the empirical network with the sampled values from each of the graph models using a one-sample t-test.
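As a quick sanity check of scipy's one-sample t-test API (the numbers below are toy values, not assignment results):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)

# toy samples standing in for 100 sampled CPL values from one generator
samples = rng.normal(loc=2.5, scale=0.1, size=100)

# one-sample t-test: does the sample mean differ from a hypothetical empirical value?
t_statistic, p_value = ttest_1samp(samples, popmean=3.0)
print(t_statistic < 0, p_value < 0.05)  # sample mean is well below 3.0: reject
```

A small p-value means the generator's samples are significantly different from the empirical value for that property; a large p-value means the generator reproduces it well.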

Additionally, complete the plot_graph_statistics function to visualize the distribution of each individual network property (e.g. CPL, average clustering) as a boxplot. Create the plot as a single plot with a series of five side-by-side boxplots (one for each property) as subplots. Label the overall plot with the generator name and label each individual box plot with the property being shown along the y axis.

def calculate_generated_statistic(graphs):
    # create an empty dictionary to store the statistics
    statistics = {
        'Network Diameter': [],
        'Characteristic Path Length (CPL)': [],
        'Average Clustering Coefficient': [],
        'Transitivity': [],
        'Assortativity': []
    }

    # loop over the graphs
    for graph in graphs:
        # if the graph is not connected, use its largest connected component
        if not nx.is_connected(graph):
            graph = graph.subgraph(max(nx.connected_components(graph), key=len))

        # calculate the network diameter
        diameter = nx.diameter(graph)

        # calculate the characteristic path length (CPL)
        cpl = nx.average_shortest_path_length(graph)

        # calculate the average clustering coefficient
        avg_clustering = nx.average_clustering(graph)

        # calculate the transitivity
        transitivity = nx.transitivity(graph)

        # calculate the assortativity
        assortativity = nx.degree_assortativity_coefficient(graph)

        # append the statistics to the dictionary
        statistics['Network Diameter'].append(diameter)
        statistics['Characteristic Path Length (CPL)'].append(cpl)
        statistics['Average Clustering Coefficient'].append(avg_clustering)
        statistics['Transitivity'].append(transitivity)
        statistics['Assortativity'].append(assortativity)

    # return the dictionary of statistics
    return statistics

def compare_generated_to_ground_truth(football, generated_statistics):
    # calculate the statistics of the empirical network to test against
    football_stats = calculate_graph_statistics(football)

    # create an empty dictionary to store the t-test results
    t_test_results = {}

    # loop over each network property and its sampled values
    # (requires: from scipy.stats import ttest_1samp)
    for prop, samples in generated_statistics.items():
        # one-sample t-test of the sampled values against the empirical value
        t_statistic, p_value = ttest_1samp(samples, football_stats[prop])

        # store the t-test result for this property
        t_test_results[prop] = {
            't_statistic': t_statistic,
            'p_value': p_value
        }

    # return the dictionary of t-test results
    return t_test_results

def plot_graph_statistics(generated_statistics, generator_name='Generated Graphs'):
    # create a figure with five side-by-side boxplots, one per property
    fig, axs = plt.subplots(1, 5, figsize=(20, 4))

    # loop over each network property and its sampled values
    for i, (prop, samples) in enumerate(generated_statistics.items()):
        # plot the boxplot for this property
        axs[i].boxplot(samples)

        # label the individual boxplot with the property shown on the y axis
        axs[i].set_title(prop)
        axs[i].set_ylabel(prop)

    # label the overall plot with the generator name
    fig.suptitle(generator_name)

    # show the plot
    plt.tight_layout()
    plt.show()

4. [10 points]

How do the statistics compare against the three different types of graph generators? Are there any particular measures that are very different? Can you explain how the algorithm for each generator influences the statistics that are different? Which graph generator would you say is the best fit for the ground truth? Justify your answer based on the t-test results and the box plots.

Answer

The Configuration Model and Stochastic Block Model graphs reproduce the empirical statistics closely, while the HRG graphs stand out: their diameter and characteristic path length (CPL) are noticeably larger, because the hierarchical generator spreads edges according to the dendrogram probabilities rather than concentrating them within dense local neighborhoods. The HRG average clustering coefficient, transitivity, and assortativity also deviate, since edges are drawn independently from the lowest-common-ancestor probabilities and the generator does not preserve local structure such as triangles or degree correlations. The Configuration Model preserves the degree sequence and the SBM preserves the community-level edge densities, so both stay much closer to the empirical values. The t-test results agree: for most properties the Configuration Model and SBM samples are not significantly different from the empirical values, while the HRG samples are, and the box plots show the empirical values falling inside the Configuration Model and SBM distributions but outside the HRG ones. Asked for a single best fit, the SBM has the edge: it matches the statistics about as well as the Configuration Model while also reproducing the community structure of the ground truth.

