Ruby web spider Part 0: concept

(I should probably mention that I have never written a spider or worked on a search engine before, so this is a learning process… I don’t pretend to be an expert on this – I picked this partly because it is far enough from my “day” job that I’m not going to inadvertently end up in a conflict of interest. The closest I’ve come in the past was working on a natural language interface to search engine queries, way back in 2001 while I was in my final year at UTas.)

So how did I start?


As I mentioned in my first post on this topic, the web spider I’ve been developing is simply a framework to test out some ideas of mine, as well as being a fun little Ruby project to fill what little spare time I have.

The ideas I’m tossing around at the moment require a broad collection of documents to analyse, so it seemed pretty clear that the way to go was to narrow the initial application to blog posts, and that I would therefore need a spider that could discover not only page links, but also RSS feeds.
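To make the "page links plus RSS feeds" requirement concrete, here is a minimal sketch of what that discovery step might look like. It is not the spider's actual code: the `discover` and `attr_value` helpers and the `FEED_TYPES` list are names made up for illustration, and it assumes feeds are advertised with `<link type="application/rss+xml">` (or the Atom equivalent) tags. Regexes are a crude way to read HTML, so treat this as a rough approximation only.

```ruby
require 'uri'

# Hypothetical helpers, not taken from the spider itself.
# MIME types assumed to mark a <link> tag as a feed.
FEED_TYPES = %w[application/rss+xml application/atom+xml].freeze

# Return the quoted value of attribute `name` inside a single tag, or nil.
def attr_value(tag, name)
  tag[/\b#{name}\s*=\s*["']([^"']*)["']/i, 1]
end

# Split the links found in `html` into ordinary page links and feed links,
# resolving relative hrefs against `base_uri`.
def discover(html, base_uri)
  pages = []
  feeds = []

  html.scan(/<(?:a|link)\b[^>]*>/i).each do |tag|
    href = attr_value(tag, 'href')
    next if href.nil? || href.empty?

    begin
      uri = URI.join(base_uri, href).to_s
    rescue URI::Error
      next # skip anything that is not a usable URI
    end

    if tag =~ /\A<link/i
      # <link> tags only count when they advertise a feed type.
      type = attr_value(tag, 'type').to_s.downcase
      feeds << uri if FEED_TYPES.include?(type)
    else
      pages << uri
    end
  end

  { pages: pages.uniq, feeds: feeds.uniq }
end
```

Calling `discover(html, 'http://example.com/blog/')` returns a hash with `:pages` and `:feeds` arrays. A proper HTML parser would cope far better with unquoted attributes and other real-world markup than these regexes do.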

My initial reflex was to sketch out the few basic components I thought I would need and allow a rough architecture to evolve from there.
[Figure: Ruby web spider: concept sketch]
The initial concept called for four components: a scheduler, a spider, a harvester, and an analyser. This is a cleaned-up version of my initial sketch; the first one had so much crossing out that even I was getting confused about what my initial “design” was.

I’ll spare you my handwriting; I’ve copied out the annotations, numbered 1-4:

  1. The scheduler provides a public interface to trigger the seeding of the URI queue, and control the spiders. Additional URIs can be submitted through the scheduler and added to the queue.
  2. The spiders retrieve a URI for processing, and pass the retrieved data to the harvester. Additional URIs detected by the spider are inserted back into the URI queue.
  3. The harvester collects the data retrieved by the spider and inserts this data into a “page” cache, an opaque storage service for HTML or RSS data. Currently this is filesystem based, however there is no reason that harvester subclasses could not store this data in a DB, virtual FS or other appropriate datastore (that’s why it’s called “opaque” ;-) ). The harvester also inserts a record of the retrieved data into the analysis queue. Along with a reference to the file in the page cache, the record contains additional metadata such as URI, referring URI, time of retrieval, etc.
  4. An analyser retrieves data to process, removing the record from the analysis queue and using this to retrieve the file from the page cache. And then some magic we’re not talking about happens here ;-)
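
The four annotations above can be sketched in a few dozen lines of Ruby. This is only a rough, single-threaded illustration under stated assumptions, not the design the rest of the series builds: every class and method name (`UriQueue`, `Scheduler#seed`, `Spider#crawl_next`, `Harvester#harvest`, `Analyser#analyse_next`, `crawl_demo`) is hypothetical, the "fetcher" is a stub lambda instead of real HTTP, link extraction only handles absolute `href="..."` values, and the analyser's "magic" is replaced by a byte count.

```ruby
require 'digest/sha1'
require 'set'
require 'tmpdir'

# Hypothetical sketch: none of these names come from the real spider.
# URI queue that remembers what it has seen, so link cycles terminate.
class UriQueue
  def initialize
    @queue = Queue.new
    @seen = Set.new
  end

  def push(uri, referrer = nil)
    return false unless @seen.add?(uri) # add? returns nil for repeats

    @queue << [uri, referrer]
    true
  end

  def pop
    @queue.pop(true) # non-blocking; raises ThreadError when empty
  rescue ThreadError
    nil
  end
end

# 1. Scheduler: seeds the URI queue, accepts extra URIs, drives the spiders.
Scheduler = Struct.new(:uri_queue, :spiders) do
  def submit(uri)
    uri_queue.push(uri)
  end

  def seed(uris)
    uris.each { |uri| submit(uri) }
  end

  def run
    loop { break unless spiders.map(&:crawl_next).any? }
  end
end

# 2. Spider: fetches one URI, hands the data on, re-queues discovered links.
Spider = Struct.new(:uri_queue, :harvester, :fetcher) do
  def crawl_next
    uri, referrer = uri_queue.pop
    return false if uri.nil?

    body = fetcher.call(uri) # assumed: any callable mapping URI to a string
    harvester.harvest(uri, referrer, body)
    body.scan(/href="([^"]+)"/).flatten.each { |link| uri_queue.push(link, uri) }
    true
  end
end

# 3. Harvester: writes to a filesystem page cache, queues a metadata record.
Harvester = Struct.new(:cache_dir, :analysis_queue) do
  def harvest(uri, referrer, body)
    path = File.join(cache_dir, Digest::SHA1.hexdigest(uri))
    File.write(path, body)
    analysis_queue << { uri: uri, referrer: referrer,
                        retrieved_at: Time.now, cache_path: path }
  end
end

# 4. Analyser: takes a record off the queue and reads the cached file.
Analyser = Struct.new(:analysis_queue) do
  def analyse_next
    record = analysis_queue.pop(true)
    body = File.read(record[:cache_path])
    { uri: record[:uri], referrer: record[:referrer], bytes: body.bytesize }
  rescue ThreadError
    nil
  end
end

# Wire everything together; `pages` is a stub hash standing in for the web.
def crawl_demo(pages, seeds)
  Dir.mktmpdir do |dir|
    uri_queue = UriQueue.new
    analysis_queue = Queue.new
    fetcher = ->(uri) { pages.fetch(uri, '') }
    spider = Spider.new(uri_queue, Harvester.new(dir, analysis_queue), fetcher)
    scheduler = Scheduler.new(uri_queue, [spider])
    scheduler.seed(seeds)
    scheduler.run
    analyser = Analyser.new(analysis_queue)
    results = []
    while (result = analyser.analyse_next)
      results << result
    end
    results
  end
end
```

Passing `crawl_demo` a hash of URI to HTML strings plus a list of seed URIs returns one result per page reached. Keeping the "seen" set inside the queue wrapper is just one way to stop two pages that link to each other from looping forever; it could equally live in the scheduler.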

As you can see, there is no real mention of where each of these components will run, or indeed exactly how they will be implemented. As this series continues, we’ll build up to those details…

Well, that’s it for this exciting installment. Comments, criticism and offers of high-paying contracts gladly accepted below ;-)

Next: Part 1: the scheduler

One Response to “Ruby web spider Part 0: concept”

  1. Phill Midwinter Says:

    Not a bad start. You may find it worth putting the cache into a DB if you want to search it effectively. You should also try breaking down the page into what are known as ‘barrels’: basically a fat list of all the keywords on the page and their associated attributes (density, colour, font size, whatever floats your boat). This makes it a lot easier to search later on.
