Showing posts from June, 2010

Wikipedia Mining

So I was studying nearest neighbor for my machine learning exam tomorrow when I stumbled across "breadth-first search", and it got me thinking... Okay, I've come across it before, but I'd never thought of a breadth-first search of Wikipedia as a means of finding the nearest neighbor...

I thought about all the internal links that Wikipedia keeps and how easy it would be to treat each page as a graph node, then do a bit of a breadth-first search by visiting all of its child nodes (the pages it links to). Then I thought maybe the links from the child nodes would be interesting, so I created a link counter to keep track of how often each link shows up across all children of a page. I didn't implement it with the Wikipedia API or anything, I just scraped the data off the web. This means it kinda downloads a few thousand Wikipedia pages for a single query... but still, it was interesting!
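To make the idea a bit more concrete, here's a rough Python sketch of the link counter. It isn't the original scraper: the names `extract_internal_links`, `count_child_links` and `get_links` are made up for illustration, the `/wiki/Title` link pattern is an assumption about how the page HTML looks, and the pages come from a tiny in-memory toy graph instead of being downloaded.

```python
import re
from collections import Counter

# Assumption: internal article links look like href="/wiki/Title".
# Titles containing ":" (e.g. "File:", "Category:") and links with a
# "#fragment" are skipped, which is a simplification.
WIKI_LINK = re.compile(r'href="/wiki/([^"#:]+)"')


def extract_internal_links(html):
    """Return the set of article titles linked from a page's HTML."""
    return set(WIKI_LINK.findall(html))


def count_child_links(start, get_links):
    """Visit every page linked from `start` (its children) and count
    how many of those children link to each title.

    `get_links` is a hypothetical callable: title -> set of linked titles.
    In a real scraper it would download and parse the page.
    """
    counts = Counter()
    for child in get_links(start):
        # get_links returns a set, so each child votes once per title.
        for title in get_links(child):
            if title != start:  # assumption: links back to the query are ignored
                counts[title] += 1
    return counts


# Toy stand-in for Wikipedia: title -> page HTML.
toy_pages = {
    "A": '<a href="/wiki/B">B</a> <a href="/wiki/C">C</a> '
         '<a href="/wiki/File:x.png">img</a>',
    "B": '<a href="/wiki/C">C</a> <a href="/wiki/D">D</a> '
         '<a href="/wiki/A">A</a>',
    "C": '<a href="/wiki/D">D</a>',
    "D": "",
}


def toy_get_links(title):
    return extract_internal_links(toy_pages.get(title, ""))


counts = count_child_links("A", toy_get_links)
closest, votes = counts.most_common(1)[0]
print(closest, votes)
```

In the toy graph, "D" is linked from both children of "A", so it ends up with the highest count. Swapping `toy_get_links` for something that actually fetches pages is where the "few thousand downloads per query" comes from: one request for the query page plus one for every child.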

If I query "Machine learning" the closest match is not "Artificial intelligence" as you might …