Ruby web spider - watch this space

I mentioned before that I’ve been busy this last week. One of the things I’ve been working on in my own time is a web spider (written in Ruby) that can trawl both HTML pages and RSS feeds. I won’t say much about what I’m using it for, other than that I’m testing some ideas out right now :-)

Anyway, I’m almost at the point where I’m happy to share this code (probably under the GPL). It’s not exactly rocket science (I’ve only invested a week or so of evenings in it), but it has a couple of neat tricks that made it a good exercise in Ruby. A short laundry list of features:

  • Multi-process and multi-threaded
  • Can parse RSS, Atom and HTML
  • Capable of distributing workload across multiple machines, via a queue-based scheduler (using DRb)
  • Respects robots.txt and is generally well behaved; process and thread limits can be set per machine or across the pool
  • Disk-based storage of spidered pages (for now)
  • Asynchronous analyser(s) can process pages independently of the spiders’ execution
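
To give a rough feel for the queue-based scheduling over DRb mentioned in the list, here is a minimal sketch. It is not the spider's actual code (which isn't released yet): the names UrlScheduler, serve_scheduler, connect_scheduler and drain are made up for illustration, and the de-duplication step is an assumption about what a scheduler like this would want to do.

```ruby
require 'drb/drb'
require 'set'

# Hypothetical shared work queue: hands each URL out at most once.
class UrlScheduler
  def initialize
    @queue = Queue.new   # thread-safe FIFO of URLs waiting to be fetched
    @seen  = Set.new     # every URL ever scheduled, to avoid re-queuing
    @lock  = Mutex.new   # guards the check-then-add on @seen
  end

  # Queues a URL unless it was already scheduled; returns true if queued.
  def enqueue(url)
    @lock.synchronize do
      return false unless @seen.add?(url)
      @queue << url
      true
    end
  end

  # Non-blocking pop; returns nil when no work is waiting.
  def next_url
    @queue.pop(true)
  rescue ThreadError
    nil
  end
end

# Scheduler machine: expose one scheduler instance over DRb at the given URI.
def serve_scheduler(uri, scheduler = UrlScheduler.new)
  DRb.start_service(uri, scheduler)
  scheduler
end

# Spider machine: obtain a proxy object that forwards calls to the scheduler.
def connect_scheduler(uri)
  DRbObject.new_with_uri(uri)
end

# Pulls URLs until the queue is empty, yielding each one to the caller's
# fetch code. Works with a local UrlScheduler or a DRb proxy to one.
def drain(scheduler)
  count = 0
  while (url = scheduler.next_url)
    yield url
    count += 1
  end
  count
end
```

In this sketch one machine would call serve_scheduler with a druby:// URI of its choosing and keep the process alive (for example with DRb.thread.join), while each spider process would call connect_scheduler with the same URI and pass the result to drain. Because drain only relies on next_url, the same worker loop runs unchanged against a local scheduler, which is handy for testing.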

I won’t be releasing the analysers I’m using right now (gotta keep a bit of mystery), but I will release a generic analyser that you will be able to extend (i.e. subclass) to implement your own analysis, classification, etc.
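
As a purely illustrative guess at what extending a generic analyser by subclassing could look like, here is a small sketch. The generic analyser described above has not been published, so the Analyser base class, the Page struct, the wants?/analyse hooks and the TitleAnalyser subclass are all hypothetical names, not the real interface.

```ruby
# Hypothetical record for a spidered page as an analyser might receive it.
Page = Struct.new(:url, :content_type, :body)

# Hypothetical generic analyser: subclasses override the two hooks below.
class Analyser
  # Entry point: skips pages the subclass is not interested in.
  def process(page)
    return nil unless wants?(page)
    analyse(page)
  end

  # Hook: return false to skip a page. Accepts everything by default.
  def wants?(_page)
    true
  end

  # Hook: the actual analysis; must be provided by subclasses.
  def analyse(_page)
    raise NotImplementedError, "#{self.class} must implement analyse"
  end
end

# Example subclass: pulls the <title> text out of HTML pages only.
class TitleAnalyser < Analyser
  def wants?(page)
    page.content_type == 'text/html'
  end

  def analyse(page)
    # Crude regex extraction, good enough for a sketch; returns nil if absent.
    page.body[%r{<title>(.*?)</title>}im, 1]&.strip
  end
end
```

Calling TitleAnalyser.new.process(page) on an HTML page would return its title text, and nil for a feed or any other content type, so the spider side never needs to know which analysers care about which pages.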

Given the scope of the system (even though the code itself is simple), I’ll break it down according to function, and show how I’ve fitted it all together. I hope to have the first part up by Sunday evening.

edit:

Part 0 is now up; you can read it here.

3 Responses to “Ruby web spider - watch this space”

  1. warren Says:

    Slight delay, good friend of mine got married on the weekend, and too much free red wine left me in no state to finish writing up the first part yesterday arvo. :-) It is coming along though…

  2. Sander Says:

    Hi Warren,
    It has been a while, any progress on this? :-)

  3. Pukimak Says:

    well shit, at least it was good in your head. :P
