Skip to main content

Wikipedia Mining

So I was studying nearest neighbor for my machine learning exam tomorrow and I stumbled across "breadth-first search" and it got me thinking... Okay I've come across it before but never thought of a breadth-first search of wikipedia as a means of finding the nearest neighbour...

I thought about all the internal links that wikipedia keeps and how easy it would be to use each page as a graph node then do a bit of a breadth-first search by visiting all those child nodes. Then I thought maybe the links from the child nodes would be interesting so I created a link counter to keep track of how often a link shows up across all children of a page. The way I implemented it wasn't using the wikipedia api or anything - just scraping the data off the web. This means it kinda downloads a few thousand wikipedia pages for a single query... but still it was interesting!

If I query "Machine learning" the closest match is not "Artificial intelligence" as you might guess, but rather International Standard Book Number. It has 38 links pointing to it as opposed to 30. I looked at a few of the pages and it appears these are links I could ignore - info boxes for journals seem to like linking to the ISBN article.... DOI was up there as well.

Anyhow all these links are taken from all the children of the Machine Learning page on wikipedia. There were 11008 total links only one node away, and of these 6873 were unique.

I thought about adding another level to it, ie exploring the children linked to by the children - but as it already takes a minute or so, I imagine that will take a few hours to run and download a crapload of wikipedia. I minor improvement would be needed if any depth could be specified - we wouldn't want to revisit the same node again and again.

Running the program with input "cheese", I found a few more children - 417 of them. This took about 10 minutes to download all the webpages then it quickly used some sets and dictionaries to work out that cheese is equally related to "Rice" and "Carbohydrate" according to this quick measure,  from all of cheese's internal links 124 of them point to each. Bread is a close follower with 122 links.

Hmm well that was a fun diversion, back to study me thinks.

The code in case anyone wants to play:

#!/usr/bin/env python

from urllib2 import build_opener
import HTMLParser

from HTMLParser import HTMLParser
URL = 'http://en.wikipedia.org'
ignore_urls = [
    '/wiki/Portal',
    '/wiki/Special',
    '/w/index.php',
    '/wiki/Help',
    'wiki/Wikipedia',
    '/wiki/Main_Page',
    '/wiki/Category',
    '/wiki/Template',
    '/wiki/Talk:',
    '/wiki/File:',
    
]

class LinkingHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        #super(LinkingHTMLParser, self).__init__()
        self.links = set()
    def handle_starttag(self, tag, attrs):
        if tag == "a" and len(attrs) > 1:
            (_,url), (_, title) = attrs[0], attrs[1]

            if url.startswith('/wiki') and not any(addy in url for addy in ignore_urls):
                #print 'Linked to %s at address: %s' % (title, url)
                self.links.add(url)


opener = build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

class Page(object):
    '''
    A node on the internet... a webpage

        >>> Page('http://en.wikipedia.org/wiki/Nearest_neighbor_search')
    
    '''
    
    def __init__(self, url):
        self.url = url
        self.handle = opener.open(url)
        self.parser = LinkingHTMLParser()
    
    def __repr__(self):
        return 'Page(%s)' % self.url
    
    def __str__(self):
        return '\n'.join(line for line in self.handle)

    def parse(self):
        self.parser.feed(str(self))
        print 'Number of internal links: %d' % len(self.parser.links)
        
    def create_children(self):
        '''Create nodes for each link in self'''
        self.children = []
        
        for url in self.parser.links:
            child = Page(URL + url)
            self.children.append(
                child
                )
def find_n_most_related_pages(topic='Nearest neighbor search', n=10):
    topic = '/wiki/' + topic.replace(' ', '_')
    p = Page(URL + topic)
    p.parse()
    p.create_children()
    total_links = len(p.parser.links)
    all_links = p.parser.links
    link_counts = {}
    for child in p.children:
        try:
            child.parse()
            #print 'Child "%s" has %d links.' % (child.url, len(child.parser.links))
            total_links += len(child.parser.links)
            all_links = all_links.union(child.parser.links)
            for link in child.parser.links:
                if link in link_counts:
                    link_counts[link] += 1
                else:
                    link_counts[link] = 1
        except:
            print 'error'

    print 'Link counts:'
    for i, link in enumerate(sorted(link_counts, cmp=lambda x, y: link_counts[y] - link_counts[x])):
        if i <= n:
            print '%s has %d links.' % (link, link_counts[link])
    print '*' * 80
    print 'Total links only one node away: %d' % total_links
    print 'Unique pages only one node away: %d' % len(all_links)

find_n_most_related_pages('Cheese')

Popular posts from this blog

Bluetooth with Python 3.3

Since about version 3.3 Python supports Bluetooth sockets natively. To put this to the test I got hold of an iRacer from sparkfun . To send to New Zealand the cost was $60. The toy has an on-board Bluetooth radio that supports the RFCOMM transport protocol. The drive  protocol is dead easy, you send single byte instructions when a direction or speed change is required. The bytes are broken into two nibbles:  0xXY  where X is the direction and Y is the speed. For example the byte 0x16 means forwards at mid-speed. I was surprised to note the car continues carrying out the last given demand! I let pairing get dealt with by the operating system. The code to create a  Car object that is drivable over Bluetooth is very straight forward in pure Python: import socket import time class BluetoothCar : def __init__ ( self , mac_address = "00:12:05:09:98:36" ): self . socket = socket . socket ( socket . AF_BLUETO...

Matplotlib in Django

The official django tutorial is very good, it stops short of displaying data with matplotlib - which could be very handy for dsp or automated testing. This is an extension to the tutorial. So first you must do the official tutorial! Complete the tutorial (as of writing this up to part 4). Adding an image to a view To start with we will take a static image from the hard drive and display it on the polls index page. Usually if it really is a static image this would be managed by the webserver eg apache. For introduction purposes we will get django to serve the static image. To do this we first need to change the template. Change the template At the moment poll_list.html probably looks something like this: <h1>Django test app - Polls</h1> {% if object_list %} <ul> {% for object in object_list %} <li><a href="/polls/{{object.id}}">{{ object.question }}</a></li> {% endfor %} </ul> {% else %} <p>No polls...

Python and Gmail with IMAP

Today I had to automatically access my Gmail inbox from Python. I needed the ability to get an unread email count, the subjects of those unread emails and then download them. I found a Gmail.py library on sourceforge, but it actually opened the normal gmail webpage and site scraped the info. I wanted something much faster, luckily gmail can now be accessed with both pop and imap. After a tiny amount of research I decided imap was the better albiet slightly more difficult protocol. Enabling imap in gmail is straight forward, it was under labs. The address for gmail's imap server is: imap.gmail.com:993 Python has a library module called imaplib , we will make heavy use of that to access our emails. I'm going to assume that we have already defined two globals - username and password. To connect and login to the gmail server and select the inbox we can do: import imaplib imap_server = imaplib . IMAP4_SSL ( "imap.gmail.com" , 993 ) imap_server . login ( use...