Skip to main content

The facebook graph

Putting "six degrees of separation" to the test with graph theory on a very large scale, Facebook analyses the connections between 10% of the worlds population.

Original paper:

As one would expect the network is nearly fully connected, with 99.91% of the nodes belonging to a single connected component. The largest subgraph not connected to the major graph consists of about 2000 users.

The degrees of separation for any two facebook users is closer to 4 than 6. In 92% of all pairs of users on facebook; a friend of your friend knows a friend of my friend.

The graph is inherently sparse, duein part to our social ineptness at connecting with any significant fraction of the global population and also by facebook limiting the maximum number of friends to 5000. The median number of friends (degree) globally was 99.

The graph is also highly clustered - "for a median user, 14% of all their friend pairs are themselves friends", and interestingly this "drops rapidly for users with close to 5000 friends, indicating that these users are likely using Facebook for less coherently social purposes and friending users more indiscriminately."

The clique problem is to find the largest subset of people who all know each other, the study calculates this sparsity of the neighborhood graphs by measuring the degeneracy for each degree. Two examples highlight the findings of just how dense the local neighborhood graphs are:
* A node with degree 100 (a user with 100 friends) has an average degeneracy of 15, meaning on average 16 of the 100 friends each know at least 15 other mutual friends.
* Users with 500 friends have an average degeneracy of 53, they have at least 54 friends who all know 53 of their other friends.

The fact that neighborhood graphs center around sizable dense cores is clearly how facebook can offer such good friend suggestions and smart lists - I'm sure could be used to improve sharing between users major cliques (or ahem *circles*).

Pictured is the cumulative degree distribution showing the number of nodes (users) who have degree (number of friends) greater than k. The number ofunique friends added per additional friend was to approximate a linear line of best fit: 355*k-15057. So in theory I have ~205398 unique friends-of-friends.

Some other very interesting revelations: "until you have nearly 700 friends, your (average) neighbor has more friends than you""a positive correlation does exist between degree and logins... So since your friends have more friends than you do, they also login to Facebook more than you do."

While there are roughly 30 million fewer active female users on Facebook, a random neighbor of a user is slightly more likely to be female. Why? Because the average female degree (198) is larger than the average male degree (172).

All the path length crunching was done on a 24-core machine with 72 GiB of memory and 1 TiB of disk space, the edge graph comprised ~69 billion friend connections between 721 million users and took up 345 GB at 20 bits per arc and was traversed by a new algorithm called HyperANF.

Popular posts from this blog

My setup for downloading & streaming movies and tv

I recently signed up for Netflix and am retiring my headless home media pc. This blog will have to serve as its obituary. The box spent about half of its life running FreeNAS, and half running Archlinux. I’ll briefly talk about my experience with FreeNAS, the migration, and then I’ll get to the robust setup I ended up with.

The machine itself cost around $1000 in 2014. Powered by an AMD A4-7300 3.8GHz cpu with 8GB of memory. A SilverStone DS380 case is both functional, quiet and looks great. The hard drives have been updated over the last two years until it had a full compliment of 6 WD Green 4TiB drives - all spinning bits of metal though.

Initially I had the BSD based FreeNAS operating system installed. I had a single hard drive in its own ZFS pool for TV and Movies, and a second ZFS pool comprised of 5 hard drives for documents and photos.

FreeNAS is straight forward to use and setup, provided you only want to do things supported out of the box or by plugins. Each plugin is install…

Driveby contribution to Python Cryptography

While at PyConAU 2016 I attended the Monday sprints and spent some time looking at a proposed feature I hoped would soon be part of cryptography. As most readers of this blog will know, cryptography is a very respected project within the Python ecosystem and it was an interesting experience to see how such a prominent open source project handles contributions and reviews.

The feature in question is the Diffie-Hellman Key Exchange algorithm used in many cryptography applications. Diffie-Helman Key Exchange is a way of generating a shared secret between two parties where the secret can't be determined by an eavesdropper observing the communication. DHE is extremely common - it is one of the primary methods used to provide "perfect forward secrecy" every time you initiate a TLS connection to an HTTPS website. Mathematically it is extremely elegant and the inventors were the recipients of the 2015 Turing award.

I wanted to write about this particular contribution because man…

Python, Virtualenv and Docker

Unsurprisingly I use some very popular Scientific Python packages like Numpy, Scipy and Scikit Learn. These packages don't get on that well with virtualenv and pip as they take a lot of external dependencies to build. These dependencies can be optional libraries like libblas and libatlas which if present will make Numpy run faster, or required dependencies like a fortran compiler.

Back in the good old days you wouldn't pin all your dependency versions down and you'd end up with a precarious mix of apt-get installed and pip installed packages. Working with other developers, especially on different operating system update schedules could be a pain. It was time to update your project when it breaks because of a dependency upgraded by the operating system.

Does virtualenv fully solve this? No, not when you have hard requirements on the binaries that must be installed at a system level.

Docker being at a lower level gives you much more control without adding too much extra comp…