Monday, February 13, 2006

Search your intranet with Nutch!

Our company maintains several wikis, one for each department. This is where our knowledge lives: how to set up servers, the milestones and deliverables of projects, new project ideas, and so on. There is (actually, was) one problem - the wikis were not searchable. The search engine that came with the wiki software was pretty much broken; most of the time, a search would hang the browser.

So I looked around and found this amazing open source project - Nutch - with which I built the infrastructure for a crawl of a few intranet sites. Within a couple of days, I had the system running, and the results were really good.

Then another developer hooked up the search engine to the wiki's search button.

Later on, I extended the crawl to other web-based information we have, such as the bug database and the system that logs Perforce changelists. The results were quite good: now, from a single place, we can find a lot of information about an item that is scattered across many links.

I remember back at Microsoft, someone was always suggesting that all the systems maintaining different datasets should somehow be unified, so that from one point we could gather all the data for one item. This, of course, is a huge dev/test effort requiring a major overhaul of a bunch of working systems. Enter search - a good one - and all this costly development work disappears.

I want to commend the people working on the Nutch project for giving us such a cool set of tools. Keep up the excellent work!