Skip to main content

Wikipedia Mining

So I was studying nearest neighbor for my machine learning exam tomorrow and I stumbled across "breadth-first search" and it got me thinking... Okay I've come across it before but never thought of a breadth-first search of wikipedia as a means of finding the nearest neighbour...

I thought about all the internal links that wikipedia keeps and how easy it would be to use each page as a graph node then do a bit of a breadth-first search by visiting all those child nodes. Then I thought maybe the links from the child nodes would be interesting so I created a link counter to keep track of how often a link shows up across all children of a page. The way I implemented it wasn't using the wikipedia api or anything - just scraping the data off the web. This means it kinda downloads a few thousand wikipedia pages for a single query... but still it was interesting!

If I query "Machine learning" the closest match is not "Artificial intelligence" as you might guess, but rather International Standard Book Number. It has 38 links pointing to it as opposed to 30. I looked at a few of the pages and it appears these are links I could ignore - info boxes for journals seem to like linking to the ISBN article.... DOI was up there as well.

Anyhow all these links are taken from all the children of the Machine Learning page on wikipedia. There were 11008 total links only one node away, and of these 6873 were unique.

I thought about adding another level to it, ie exploring the children linked to by the children - but as it already takes a minute or so, I imagine that will take a few hours to run and download a crapload of wikipedia. I minor improvement would be needed if any depth could be specified - we wouldn't want to revisit the same node again and again.

Running the program with input "cheese", I found a few more children - 417 of them. This took about 10 minutes to download all the webpages then it quickly used some sets and dictionaries to work out that cheese is equally related to "Rice" and "Carbohydrate" according to this quick measure,  from all of cheese's internal links 124 of them point to each. Bread is a close follower with 122 links.

Hmm well that was a fun diversion, back to study me thinks.

The code in case anyone wants to play:

#!/usr/bin/env python

from urllib2 import build_opener
import HTMLParser

from HTMLParser import HTMLParser
URL = ''
ignore_urls = [

class LinkingHTMLParser(HTMLParser):
    def __init__(self):
        #super(LinkingHTMLParser, self).__init__()
        self.links = set()
    def handle_starttag(self, tag, attrs):
        if tag == "a" and len(attrs) > 1:
            (_,url), (_, title) = attrs[0], attrs[1]

            if url.startswith('/wiki') and not any(addy in url for addy in ignore_urls):
                #print 'Linked to %s at address: %s' % (title, url)

opener = build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

class Page(object):
    A node on the internet... a webpage

        >>> Page('')
    def __init__(self, url):
        self.url = url
        self.handle =
        self.parser = LinkingHTMLParser()
    def __repr__(self):
        return 'Page(%s)' % self.url
    def __str__(self):
        return '\n'.join(line for line in self.handle)

    def parse(self):
        print 'Number of internal links: %d' % len(self.parser.links)
    def create_children(self):
        '''Create nodes for each link in self'''
        self.children = []
        for url in self.parser.links:
            child = Page(URL + url)
def find_n_most_related_pages(topic='Nearest neighbor search', n=10):
    topic = '/wiki/' + topic.replace(' ', '_')
    p = Page(URL + topic)
    total_links = len(p.parser.links)
    all_links = p.parser.links
    link_counts = {}
    for child in p.children:
            #print 'Child "%s" has %d links.' % (child.url, len(child.parser.links))
            total_links += len(child.parser.links)
            all_links = all_links.union(child.parser.links)
            for link in child.parser.links:
                if link in link_counts:
                    link_counts[link] += 1
                    link_counts[link] = 1
            print 'error'

    print 'Link counts:'
    for i, link in enumerate(sorted(link_counts, cmp=lambda x, y: link_counts[y] - link_counts[x])):
        if i <= n:
            print '%s has %d links.' % (link, link_counts[link])
    print '*' * 80
    print 'Total links only one node away: %d' % total_links
    print 'Unique pages only one node away: %d' % len(all_links)


Popular posts from this blog

My setup for downloading & streaming movies and tv

I recently signed up for Netflix and am retiring my headless home media pc. This blog will have to serve as its obituary. The box spent about half of its life running FreeNAS, and half running Archlinux. I’ll briefly talk about my experience with FreeNAS, the migration, and then I’ll get to the robust setup I ended up with.

The machine itself cost around $1000 in 2014. Powered by an AMD A4-7300 3.8GHz cpu with 8GB of memory. A SilverStone DS380 case is both functional, quiet and looks great. The hard drives have been updated over the last two years until it had a full compliment of 6 WD Green 4TiB drives - all spinning bits of metal though.

Initially I had the BSD based FreeNAS operating system installed. I had a single hard drive in its own ZFS pool for TV and Movies, and a second ZFS pool comprised of 5 hard drives for documents and photos.

FreeNAS is straight forward to use and setup, provided you only want to do things supported out of the box or by plugins. Each plugin is install…

Driveby contribution to Python Cryptography

While at PyConAU 2016 I attended the Monday sprints and spent some time looking at a proposed feature I hoped would soon be part of cryptography. As most readers of this blog will know, cryptography is a very respected project within the Python ecosystem and it was an interesting experience to see how such a prominent open source project handles contributions and reviews.

The feature in question is the Diffie-Hellman Key Exchange algorithm used in many cryptography applications. Diffie-Helman Key Exchange is a way of generating a shared secret between two parties where the secret can't be determined by an eavesdropper observing the communication. DHE is extremely common - it is one of the primary methods used to provide "perfect forward secrecy" every time you initiate a TLS connection to an HTTPS website. Mathematically it is extremely elegant and the inventors were the recipients of the 2015 Turing award.

I wanted to write about this particular contribution because man…

Python, Virtualenv and Docker

Unsurprisingly I use some very popular Scientific Python packages like Numpy, Scipy and Scikit Learn. These packages don't get on that well with virtualenv and pip as they take a lot of external dependencies to build. These dependencies can be optional libraries like libblas and libatlas which if present will make Numpy run faster, or required dependencies like a fortran compiler.

Back in the good old days you wouldn't pin all your dependency versions down and you'd end up with a precarious mix of apt-get installed and pip installed packages. Working with other developers, especially on different operating system update schedules could be a pain. It was time to update your project when it breaks because of a dependency upgraded by the operating system.

Does virtualenv fully solve this? No, not when you have hard requirements on the binaries that must be installed at a system level.

Docker being at a lower level gives you much more control without adding too much extra comp…