Ruby web spider Part 1: The scheduler
This is the second part of a series of posts covering the development of my web spider in Ruby. You can read about the initial idea here, and the architecture in Part 0: Concept.
You may also recognise some of the code in Scheduler#run from a short post I made to check that the syntax highlighting was working.
First I want to recap the goal of the scheduler before getting into the code itself. Simply put, the scheduler exists to manage the list of URIs (web pages, RSS feeds) that need to be spidered, and to manage the spiders themselves. In particular, we want to be able to limit the number of spiders working at any one time, out of politeness if nothing else.
I’m not going to make this a tutorial in Ruby syntax by explaining things line by line. If you haven’t used Ruby before and find something you don’t understand, the PragProg book, Programming Ruby, is the place to look.
So let’s take a peek at some code!
Class declaration
require 'singleton'
require 'thread'
require 'uri'

module RWS
  class Scheduler < RWS::Service
    include Singleton

    def initialize
      super
      @spider_queue = Queue.new            # URIs waiting to be spidered
      @spider_threads = ThreadGroup.new    # one thread per running spider
    end
Here we can see the basic structure of the RWS::Scheduler, which inherits from RWS::Service (we’ll get back to that in a moment). The key things to note here are the spider_queue, which is the basis for our URI queue, and spider_threads, a thread group which will contain a separate thread for each executing spider instance.
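For anyone new to these two building blocks, here is a small sketch of how the Singleton mixin and ThreadGroup behave. DemoScheduler is a hypothetical stand-in for RWS::Scheduler (no RWS::Service, no logger), so treat it as an illustration rather than the spider's real class.

```ruby
require "singleton"

# Hypothetical stand-in for RWS::Scheduler, reduced to its two fields.
class DemoScheduler
  include Singleton

  attr_reader :spider_queue, :spider_threads

  def initialize
    @spider_queue = Queue.new
    @spider_threads = ThreadGroup.new
  end
end

# Singleton makes .new private; .instance always returns the same object.
scheduler = DemoScheduler.instance
same_object = scheduler.equal?(DemoScheduler.instance)

# A thread added to the group appears in ThreadGroup#list while it is alive.
worker = Thread.new { sleep }
scheduler.spider_threads.add(worker)
tracked = scheduler.spider_threads.list.include?(worker)

worker.kill
worker.join

puts same_object
puts tracked
```

Because the constructor is private, there is no obvious place to pass configuration in, which is the root of the @@settings issue mentioned at the end of this post.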
Queue management
    # schedule a URI for spidering
    def add(uri)
      @spider_queue << URI.parse(uri).normalize
      @logger.info("Scheduled: #{uri}")
    rescue URI::InvalidURIError
      # URI.parse raises this for malformed input
      @logger.info("Invalid: #{uri}")
    end

    # get a URI to spider (blocks while the queue is empty)
    def get_uri
      @spider_queue.pop
    end
Here we see two extremely simple (for now) methods to work with the URI queue. Because Ruby’s Queue class is inherently thread safe (it’s designed specifically to allow synchronisation between threads), we can safely operate without the need for explicit synchronisation. This is not a big deal at the moment, but will become important as we build out the spidering infrastructure.
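To make both points concrete, here is a minimal sketch using only the standard library. It shows what URI#normalize does to a sloppy URI and has several threads push onto one Queue without any locking. The example.com addresses and the thread counts are made up for illustration.

```ruby
require "uri"

# Normalising lower-cases the scheme and host and fills in an empty path,
# so equivalent spellings of a URI end up as the same string.
normalized = URI.parse("HTTP://Example.COM").normalize.to_s

# Four producer threads share one Queue; no Mutex is needed around <<.
queue = Queue.new
producers = 4.times.map do |n|
  Thread.new do
    25.times { |i| queue << URI.parse("http://example.com/#{n}/#{i}").normalize }
  end
end
producers.each(&:join)

puts normalized   # http://example.com/
puts queue.size   # 100
```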
Service control
    def run
      super {
        if @spider_queue.empty?
          sleep(@@settings["timeout"])
        else
          # if we haven’t reached the concurrency limit
          # schedule a spider
          if @spider_threads.list.length < @@settings["thread_limit"]
            @spider_threads.add Thread.new { Spider.new.process(get_uri) }
          end
        end
      }
    end

    def shutdown
      super {
        @spider_threads.list.each { |spider_thread| spider_thread.join }
      }
    end
  end
end
This is probably the most interesting code we’ll see today. Scheduler#run and Scheduler#shutdown both override base class methods of RWS::Service only to call the base class method with a code block (the bit between the {}’s)!
Whilst it seems slightly counter-intuitive, this allows the subclass to provide its own “work block”, without having to deal with the additional overhead of service management. In the case of Scheduler#run, the code block is executed repeatedly, until the service receives a signal to stop. #run will sleep when the URI queue is exhausted, otherwise it will attempt to create a new work thread for a spider, which retrieves a URI to request via #get_uri, unless the concurrent threads limit has been reached.
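One detail of this loop deserves a closer look. Because #get_uri is called inside the new thread, the loop can come round and check the queue again before that thread has popped its URI, and so start more threads than there are queued URIs; the extras then sit blocked in Queue#pop. A rough sketch of one way around this is to pop in the scheduling loop and hand the URI to the thread. The process lambda is a hypothetical stand-in for Spider#process, and the limit of 2 is an arbitrary value where the scheduler would read @@settings.

```ruby
queue = Queue.new
%w[http://example.com/a http://example.com/b http://example.com/c].each { |u| queue << u }

threads = ThreadGroup.new
thread_limit = 2        # assumed value; the scheduler reads this from @@settings
processed = Queue.new   # collects finished URIs so we can inspect them afterwards

# Hypothetical stand-in for Spider#process.
process = ->(uri) { sleep 0.01; processed << uri }

until queue.empty?
  if threads.list.count(&:alive?) < thread_limit
    uri = queue.pop   # pop here, so each URI maps to exactly one thread
    threads.add(Thread.new(uri) { |u| process.call(u) })
  else
    sleep 0.01        # back off instead of spinning while at the limit
  end
end

threads.list.each(&:join)
puts processed.size   # 3
```

Only the scheduling loop pops here, so the empty? check and the pop cannot race with each other.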
The block passed into the #shutdown method in contrast is executed only once. This calls join on each thread, effectively blocking until each thread in turn finishes, thus initiating a graceful termination.
The service superclass
Here’s a look at the base class beneath the scheduler. As you can see it provides simplistic state management through the #run, #stop and #shutdown methods. I think it’s sufficiently straightforward to not need any further explanation.
class Service < RWS::Base
  def initialize
    super
    @state = :Stop
  end

  def run
    @state = :Running
    while @state == :Running
      yield if block_given?
    end
  end

  def stop
    if @state == :Running
      @state = :Stop
    end
  end

  def shutdown
    stop
    @state = :Shutdown
    @logger.info("#{self} Shutting down…")
    yield if block_given?
  end
end
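To see the pieces working together, here is a stripped-down, runnable sketch of the same pattern. MiniService mirrors the state handling above but drops RWS::Base and the logger, and Ticker is a hypothetical subclass that supplies only its work block, so none of this is the spider's real code.

```ruby
# Simplified stand-in for RWS::Service (no logger, no RWS::Base).
class MiniService
  attr_reader :state

  def initialize
    @state = :Stop
  end

  def run
    @state = :Running
    while @state == :Running
      yield if block_given?
    end
  end

  def stop
    @state = :Stop if @state == :Running
  end

  def shutdown
    stop
    @state = :Shutdown
    yield if block_given?
  end
end

# Hypothetical subclass: it overrides #run and #shutdown only to pass a block up.
class Ticker < MiniService
  attr_reader :ticks, :cleaned_up

  def initialize
    super
    @ticks = 0
    @cleaned_up = false
  end

  def run
    super do
      @ticks += 1
      sleep 0.01
    end
  end

  def shutdown
    super { @cleaned_up = true }
  end
end

ticker = Ticker.new
runner = Thread.new { ticker.run }            # #run blocks, so it gets its own thread
sleep 0.01 until ticker.state == :Running     # wait for the loop to start
sleep 0.05
ticker.shutdown                               # loop exits after its current pass
runner.join

puts ticker.state
puts ticker.ticks > 0
```

Note that #run never returns on its own, so whatever calls it needs either a separate thread or some other way of reaching #stop.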
Issues so far:
The best thing I can say about the current code is that it works with no unintended side effects.
There are a number of areas which I need to go back and assess; after all, this is supposed to be a learning experience.
- Too much object creation going on. For each URI, a new Spider instance and a new Thread instance are created. As we will see later, the Spider class is relatively stateless, so it should be simple to re-use.
- Likewise, thread pooling would be a better approach than the current thread group to which new threads are constantly being added.
- Queue#pop can be used in a blocking fashion: it will block until there is something in the queue to process. This may be better than the fixed timeout approach currently in use.
- Use of blocks instead of callbacks - maybe I just got a bit too excited with the ability to use a block for the Service#run method.
- I’m not particularly fond of using a class variable @@settings, so I’m looking to extract this and inject a settings instance into the scheduler. This is somewhat difficult given that I am using the Singleton mixin so cannot simply pass settings into a constructor.
- I’m still tainted by Java… I find myself doing things in an overly convoluted Java-esque fashion. This habit is hard to break, but I will
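As a rough sketch of where a few of these ideas could lead, the example below starts a fixed set of worker threads once, lets them block in Queue#pop instead of sleeping on a timeout, and takes its settings through the constructor. Everything in it (PooledScheduler, the Settings struct, the STOP sentinel) is hypothetical, and it sidesteps the Singleton question by simply not being a singleton.

```ruby
# Hypothetical settings object, injected rather than read from @@settings.
Settings = Struct.new(:thread_limit)

class PooledScheduler
  STOP = Object.new.freeze   # sentinel telling a worker to exit

  def initialize(settings, &work)
    @queue = Queue.new
    @work = work
    # Start a fixed set of workers once, instead of one thread per URI.
    @workers = Array.new(settings.thread_limit) do
      Thread.new do
        # Queue#pop blocks while the queue is empty, so no sleep or timeout is needed.
        until (uri = @queue.pop).equal?(STOP)
          @work.call(uri)
        end
      end
    end
  end

  def add(uri)
    @queue << uri
  end

  def shutdown
    @workers.size.times { @queue << STOP }   # one sentinel per worker
    @workers.each(&:join)
  end
end

done = Queue.new
scheduler = PooledScheduler.new(Settings.new(3)) { |uri| done << uri }
10.times { |i| scheduler.add("http://example.com/page#{i}") }
scheduler.shutdown

puts done.size   # 10
```

The sentinels are queued behind the real URIs, so every URI is processed before the workers exit. Whether this is actually cheaper than creating a thread per URI is a separate question.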
So there you have it, the first proper installment of code for the web spider. I hope to shortly begin uploading actual .rb files containing what I’ve covered here, but there are a couple of housekeeping jobs still to be done on that front. I’ll add links here when the code is up.
March 12th, 2006 at 12:28 pm
Ruby doesn’t have native threads, so I don’t see how thread pooling makes a difference. You’ll spend more CPU handling the pool than on actual object creation.
Java had us stop obsessing about memory allocation and deallocation. Then it reincarnated them as object pooling. Ruby goes a long way to let us stop obsessing about object pooling. I hope they keep it that way.
Similarly, don’t worry too much about using blocks.
March 14th, 2006 at 12:08 am
hi assaf, thanks for the advice. in particular, what you say about native threads makes perfect sense in hindsight. again, that’s me still thinking in Java, ugh.
It’s true that object pooling is obsolete when you have relatively cheap object instantiation, but in this instance I was wondering whether repeatedly creating and deleting objects which are stateless is a waste of cycles that I can design out without any overhead…
March 14th, 2006 at 11:04 am
If the object is stateless, then does it really matter? You’re not losing any state when it gets garbage collected, you’re not building any state when you create it. The cost of creating a new object is lower than the cost of pooling it.
It matters in EJB because a “stateless” session bean is void of application state, but is packed with container state. So it’s not really stateless. What EJB does is ask you to help manage those “stateless” objects by writing code that pretends they are really stateless, but acting as if they’re really stateful.
How they managed to pull it off (and how I fell for it for so long) is still a mystery.
May 14th, 2006 at 10:56 pm
Warren,
I wanted to thank you for your piece on Ruby Spidering. I’ve been trying to find some sample code with little luck. I’m trying to learn Ruby and develop an application that scrapes horse racing pages and inserts the results into a MySQL database.
Your work has been very helpful. Thanks again.
Can’t wait for the next installment.
Jim