Learning Objectives

The Language and Basic Tools of Networks

On successfully completing the course, students should have a solid grasp of the vocabulary, central concepts, and some basic facts relating to networks. The study of networks draws on a variety of disciplines: mathematics, statistics, computer science, economics, and sociology. Thus, students should be fluent in combining and reasoning about terms and ideas coming from each of these. For instance, they will be able to understand and answer questions such as:

Is there an efficient algorithm for computing the average degree in a graph?
If so, describe one. If not, describe what is hard about the problem.
Based on the average degree, can you tell whether a graph is likely to be connected?
Networks look like “small worlds”: there are paths of short distance connecting most pairs of nodes. How do we know this? What is a leading model of small-world networks? Why are small worlds “easily navigable”?
What are some of the dimensions on which homophily occurs in real social networks? What statistical test might you run to detect whether there is significant homophily on a given dimension? Write a program in Python to run that test.
True or false: if a bipartite graph has two parts with equal numbers of nodes on each side, then the average (median) degree on each side is the same.
If this is false, correct the statement.
Comment on how this idea can be useful in detecting reporting bias in sociological surveys on sexual behavior.
Note that several of these questions involve a term from a social science (homophily, survey) as well as ideas from mathematics and computer science (algorithm, bipartite graph). Being able to connect ideas from formal/quantitative domains to questions that have traditionally had their home in the social sciences is one of the main skills we expect students to acquire.

Puzzles and Big Ideas

The study of networks is full of great ideas that can be summarized in a short story form. We aim to give students the power to tell these short stories well to their colleagues and friends when they’re relevant, and to apply them in research and business.

The friendship paradox: if you look around at your friends and take their social lives as a benchmark for what’s going on in society, you get a distorted view that makes you think you’re less popular (and less interesting, and less successful…) relative to others than you are.
Why the number 1 is so important in “viral” processes: does a typical victim (of the flu, or of a YouTube cat video) transmit it to more or less than 1 other?
Network centrality: who is especially central or influential in a network? It is often not the person with the most friends, but the person with the right friends, and you can tell by looking at the network who that is. Indeed, the cool kids are the ones that the other cool kids think are cool. We’ll teach the formal mathematical theory of this, and its role in how Google ranks web pages.
Adding a road to a transportation network can lead to everyone getting to their destinations more slowly, unless we can have smart tolls (which we usually don’t). Bonus: why having traffic jams is not a sad fact of life but a policy choice we are constantly making.
Will a protest attract more than a few people? Will a new social app you built gain critical mass? The curve that matters.
Power-law degree distributions and fat tails: “whales” on Facebook (people with 5000 friends); why there are more of these than suggested by the statistical models you probably learned in stats class, and how the fat tails radically change the life of networks.
Measuring and Modeling Networks in the Wild

The “highest” learning outcome will involve students combining multiple concepts they learn to address practical challenges that come up in measuring and modeling networks. We now give two extended exercises along these lines. They gives an idea of the sort of problem-solving that students will be able to contribute intelligently to, whether as members or an academic or industrial research team, a product design group, or a policymaking committee.

While we do not expect students to be able to analyze the following situations perfectly, even at the end of the course, we do expect intelligent suggestions and programs to be produced based on the ideas learned.

You are running a marketing campaign on Facebook that relies on people resharing your page. You run a pilot where you show the content to 100 users and monitor, for each pilot user, how many reshares occur as a consequence of that user seeing it. You find that the mean number of reshares per pilot user is 0.9 with a standard deviation of 0.3.
Assuming that the number of reshares follows the same binomial distribution for all individuals, give the parameters n and p of the binomial distribution.
Continuing with the assumption that individual reshares are binomially distributed, describe a branching process model of how the resharing process will go. Explicitly say what you are assuming about correlation between different individuals’ resharing decisions. Neglect clustering in the network.
Write a Python program to simulate the resharing process, starting with 7 initially seeded nodes. The output of a single run should be the total number of people that eventually reshare; we call this the reshare count.
Run the program you just wrote 1000 times, and record the reshare count of each simulation in an output file. State the mean and standard deviation of the distribution you obtain.
Extra credit: use the theory of Galton-Watson branching processes (as outlined here or here) to give a theoretical version of the simulation exercise you just did.
If the campaign costs $7 to run and the value of the campaign is half a cent per reshare, is the campaign worth it in terms of expectated value? What is the probability that it returns more than it costs to run?
If you did the theoretical version of this exercise in part (e), give an answer to part (f) that works for any numbers, not just the 0.9 mean and 0.3 variance from our pilot. Make a plot where the horizontal axis is the mean number of reshares (0.9 in the example above), and the vertical axis is the standard deviation (0.3 in the example above); then shade the area of pilot results for which the viral campaign is worth running, at a cost of $7.
We want to analyze the social network of a group of about 500 people but we initially only have access to four of its members. We construct a sample of the group like this: we ask each of these people to name 5 friends in the group, and then ask each of those people to name 5 friends in the group. This happens to give a sample of 67 people, after eliminating those who were named multiple times. Let’s assume that everyone selects friends to name uniformly at random from among their friends. We then survey those 67 people and ask them (i) how many friends they have in the group; (ii) how many times a month they have dinner at a restaurant.
Is the sample of 67 people likely to be representative of those in the network? Propose a way to correct for any biases encountered in the sampling.
In which direction is the answer to (ii) likely to be biased based on this sample? Explain your answer.