<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Warren Seen &#187; web</title>
	<atom:link href="http://warrenseen.com/blog/category/web/feed/" rel="self" type="application/rss+xml" />
	<link>http://warrenseen.com/blog</link>
	<description>freelance software developer</description>
	<lastBuildDate>Wed, 03 Jun 2009 23:54:34 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Morfik @ San Francisco Web 2.0 Expo</title>
		<link>http://warrenseen.com/blog/2008/04/24/morfik-san-francisco-web-20-expo/</link>
		<comments>http://warrenseen.com/blog/2008/04/24/morfik-san-francisco-web-20-expo/#comments</comments>
		<pubDate>Wed, 23 Apr 2008 23:49:54 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[web]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2008/04/24/morfik-san-francisco-web-20-expo/</guid>
		<description><![CDATA[I mentioned Morfik a while back, the little Tassie company that was taking on Google with a number of patents.
Today they were featured on Qik, interviewed by Scoble himself. Check it out.

]]></description>
			<content:encoded><![CDATA[<p>I mentioned <a target="_blank" title="Morfik Home" href="http://www.morfik.com">Morfik</a> a <a href="http://warrenseen.com/blog/2007/04/03/local-company-taking-on-google-over-gwt/">while back</a>, the little Tassie company that was taking on Google with a number of patents.</p>
<p>Today they were featured on <a href="http://qik.com">Qik</a>, interviewed by <a href="http://www.fastcompany.com/scoble">Scoble</a> himself. Check it out.</p>
<p><object width="320" height="280"><param name="movie" value="http://qik.com/player.swf?streamname=fb2eacd6aa8544cfb01835a6eb7c6b76&#038;vid=63143&#038;playback=false&#038;polling=false&#038;user=scobleizer&#038;userlock=true&#038;islive=&#038;username=anonymous" ></param><param name="wmode" value="transparent" ></param><param name="allowScriptAccess" value="always" ><embed src="http://qik.com/player.swf?streamname=fb2eacd6aa8544cfb01835a6eb7c6b76&#038;vid=63143&#038;playback=false&#038;polling=false&#038;user=scobleizer&#038;userlock=true&#038;islive=&#038;username=anonymous" type="application/x-shockwave-flash" wmode="transparent" width="320" height="280" allowScriptAccess="always"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2008/04/24/morfik-san-francisco-web-20-expo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Biz Cards, New Template</title>
		<link>http://warrenseen.com/blog/2007/04/14/new-biz-cards-new-template/</link>
		<comments>http://warrenseen.com/blog/2007/04/14/new-biz-cards-new-template/#comments</comments>
		<pubDate>Sat, 14 Apr 2007 05:59:36 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[bizcards]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby on rails]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2007/04/14/new-biz-cards-new-template/</guid>
		<description><![CDATA[
For some time now, I&#8217;ve needed some new business cards printed up, so I&#8217;ve been tooling around a bit in Adobe Illustrator in my spare time, until I finally came up with a style that suited me. 
Now I was happy with my design, on the recommendation of a friend, I shot a PDF off [...]]]></description>
			<content:encoded><![CDATA[<div>
<p>For some time now, I&#8217;ve needed some new business cards printed up, so I&#8217;ve been tooling around a bit in Adobe Illustrator in my spare time, until I finally came up with a style that suited me.</p>
<p>Now that I was happy with my design, on the recommendation of a friend I shot a PDF off to Click Business Cards via <a href="http://clickbusinesscards.com.au/">their website</a>, and within 24 hours my cards had been printed and express posted to my door (they arrived yesterday). Their website may not be much chop, but their service sure is.</p></div>
<div style="float:right;margin: 5px;"><a href="http://www.flickr.com/photos/warren_seen/458332397/"><img src="http://farm1.static.flickr.com/245/458332397_799319bb0e_m.jpg" alt="New Biz Cards" /></a></div>
<p>With a little hesitation, I cracked open the box, hoping that I&#8217;d gotten the resolution, colours and bleeds right. I had; they look excellent, if I do say so myself! </p>
<p>I&#8217;m the first to admit I&#8217;m no designer, and I&#8217;ll cop to the fact that the design was largely inspired by <a href="http://davidseah.com/archives/2006/11/11/quickie-business-card-design-iv/">Dave Seah&#8217;s cards</a>, whilst the loco image and ruby red were chosen to tie into the fact that I&#8217;m moving towards specialising in Ruby on Rails.</p>
<p>So, with new cards, I realised that the WordPress template I&#8217;ve been using was pretty tired. </p>
<p>Time for some branding re-alignment! </p>
<p>A quick CSS change and a new header image later, here we are. I&#8217;ve kept the layout the same for the time being; I don&#8217;t want to sink a lot of time into a new WordPress template, as I&#8217;m thinking of packing this whole site up and moving to a <a href="http://www.slicehost.com/">Slicehost</a> slice. If I do that, it will be goodbye WordPress, hello to either <a href="http://trac.typosphere.org/">Typo</a> or <a href="http://mephistoblog.com/">Mephisto</a>.</p>
<p>For the time being, however, it&#8217;s steady as she goes!
</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2007/04/14/new-biz-cards-new-template/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Getting off my lazy butt and doing something&#8230;</title>
		<link>http://warrenseen.com/blog/2006/11/15/getting-off-my-lazy-butt-and-doing-something/</link>
		<comments>http://warrenseen.com/blog/2006/11/15/getting-off-my-lazy-butt-and-doing-something/#comments</comments>
		<pubDate>Wed, 15 Nov 2006 02:18:20 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[building]]></category>
		<category><![CDATA[house]]></category>
		<category><![CDATA[kids]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/11/15/getting-off-my-lazy-butt-and-doing-something/</guid>
		<description><![CDATA[Err, yeah. I remember, once upon a time, I used to have time to blog. That was before we built this:

Oh, and before he was born:

So despite the fact that I&#8217;m lying on the couch there, I&#8217;ve been far from idle.
My yard still looks like this:

so it&#8217;s not as though I don&#8217;t have enough to [...]]]></description>
			<content:encoded><![CDATA[<p>Err, yeah. I remember, once upon a time, I used to have time to blog. That was before we built this:<br />
<img src="http://static.flickr.com/80/232313924_bea9d7170b_m.jpg" alt="New House" /></p>
<p>Oh, and before he was born:</p>
<p><img src="http://static.flickr.com/111/290388847_1117a3b7e8.jpg?v=0" alt="Warren and Liam" /></p>
<p>So despite the fact that I&#8217;m lying on the couch there, I&#8217;ve been far from idle.</p>
<p>My yard still looks like this:</p>
<p><img src="http://static.flickr.com/83/243864033_b614b07b58.jpg?v=0" alt="New House - yard" /></p>
<p>so it&#8217;s not as though I don&#8217;t have enough to do! However, I *am* committing to finishing up, in the very near future, the Ruby Web Spider I began to write about some 5-6 months ago. </p>
<p>Consider this a dusting-off-the-cobwebs kind of post&#8230;
</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/11/15/getting-off-my-lazy-butt-and-doing-something/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Ruby web spider Part 1: The scheduler</title>
		<link>http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/</link>
		<comments>http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/#comments</comments>
		<pubDate>Tue, 07 Mar 2006 13:05:13 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/</guid>
		<description><![CDATA[This is the second part of a series of posts covering the development of my web spider in Ruby. You can read about the initial idea here, and the architecture in Part 0: Concept.
You may also recognise some of the code in Scheduler#run from a short post I made to check that the syntax highlighting [...]]]></description>
			<content:encoded><![CDATA[<p>This is the second part of a series of posts covering the development of my web spider in Ruby. You can read about the initial idea <a href="http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/">here</a>, and the architecture in <a href="http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/">Part 0: Concept</a>.</p>
<p>You may also recognise some of the code in Scheduler#run from a <a href="http://warrenseen.com/blog/2006/02/28/testing/">short post</a> I made to check that the syntax highlighting was working <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>First I want to recap the goal of the scheduler before getting into the code itself. Simply put, the scheduler exists to manage the list of URIs (web pages, RSS feeds) that need to be spidered, and to manage the spiders themselves. In particular, we want to be able to limit the number of spiders working at any one time, out of politeness if nothing else.</p>
<p>I&#8217;m not going to make this a tutorial in Ruby syntax by explaining things line by line. If you haven&#8217;t used Ruby before and find something you don&#8217;t understand, the PragProg book, <a href="http://www.rubycentral.com/book/index.html">Programming Ruby</a>, is the place to look.</p>
<p>So let&#8217;s take a peek at some code!</p>
<p><span id="more-27"></span></p>
<h3>Class declaration</h3>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8578fbc7aa7">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8578fbc7aa7').style.display='block';document.getElementById('plain_synthi_4c8578fbc7aa7').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
module RWS
  class Scheduler &lt; RWS::Service
    include Singleton

    def initialize
      super
      @spider_queue = Queue.new
      @spider_threads = ThreadGroup.new
    end
</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8578fbc7aa7">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8578fbc7aa7').style.display='block';document.getElementById('styled_synthi_4c8578fbc7aa7').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">module</span> RWS</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#9966CC; font-weight:bold;">class</span> Scheduler &lt; RWS::Service</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">include</span> Singleton</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> initialize</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">super</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @spider_queue = Queue.<span style="color:#9900CC;">new</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @spider_threads = ThreadGroup.<span style="color:#9900CC;">new</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
<p>Here we can see the basic structure of the RWS::Scheduler, which inherits from RWS::Service (we'll get back to that in a moment). The key things to note here are the <code>spider_queue</code>, which is the basis for our URI queue, and <code>spider_threads</code>, a thread group which will contain a separate thread for each executing spider instance.</p>
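<p>To make the roles of those two objects a little more concrete, here is a minimal standalone sketch, not part of the spider itself. It assumes a Ruby where <code>Queue</code> is available without a require (older versions need <code>require 'thread'</code>). The <code>gate</code> and <code>fetched</code> queues and the example.com URIs are hypothetical, there only to make the behaviour observable: a <code>ThreadGroup</code> lists only threads that are still alive, which is what lets the scheduler count running spiders.</p>

```ruby
# Standalone sketch: a Queue of pending URIs plus a ThreadGroup of workers.
spider_queue   = Queue.new        # pending URIs, like @spider_queue
spider_threads = ThreadGroup.new  # running workers, like @spider_threads
gate           = Queue.new        # hypothetical: holds workers until released
fetched        = Queue.new        # hypothetical: records what each worker took

%w[http://example.com/a http://example.com/b].each { |u| spider_queue << u }

workers = Array.new(2) do
  t = Thread.new do
    gate.pop                      # block here until the main thread says go
    fetched << spider_queue.pop   # take one URI off the shared queue
  end
  spider_threads.add(t)           # move the thread into our group
  t
end

# Both workers are parked on the gate, so both count as alive.
alive_before = spider_threads.list.length

2.times { gate << :go }           # release the workers
workers.each(&:join)

# Finished threads no longer appear in ThreadGroup#list.
alive_after = spider_threads.list.length
```

<p>Comparing <code>alive_before</code> with <code>alive_after</code> shows why no manual bookkeeping is needed: threads drop out of the group's list on their own once they finish.</p>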
<h3>Queue management</h3>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8578fbca1ab">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8578fbca1ab').style.display='block';document.getElementById('plain_synthi_4c8578fbca1ab').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
    # schedule a URI for spidering
    def add(uri)
      begin
        @spider_queue &lt;&lt; URI.parse(uri).normalize
        @logger.info(&#034;Scheduled: #{uri}&#034;)
      rescue
        @logger.info(&#034;Invalid:   #{uri}&#034;)
      end
    end

    # get a URI to spider
    def get_uri
      @spider_queue.pop
    end
</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8578fbca1ab">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8578fbca1ab').style.display='block';document.getElementById('styled_synthi_4c8578fbca1ab').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#008000; font-style:italic;"># schedule a URI for spidering</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> add<span style="color:#006600; font-weight:bold;">&#40;</span>uri<span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">begin</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @spider_queue &lt;&lt; URI.<span style="color:#9900CC;">parse</span><span style="color:#006600; font-weight:bold;">&#40;</span>uri<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">normalize</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @logger.<span style="color:#9900CC;">info</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Scheduled: #{uri}&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">rescue</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @logger.<span style="color:#9900CC;">info</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Invalid:&nbsp; &nbsp;#{uri}&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#008000; font-style:italic;"># get a URI to spider</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> get_uri</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @spider_queue.<span style="color:#9900CC;">pop</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
<p>Here we see two extremely simple (for now) methods to work with the URI queue. Because Ruby's <code>Queue</code> class is inherently thread safe (it's designed specifically to allow synchronisation between threads), we can safely operate without the need for explicit synchronisation. This is not a big deal at the moment, but will become important as we build out the spidering infrastructure.</p>
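<p>As a rough illustration of that claim, the sketch below has several threads push onto one <code>Queue</code> with no locking of their own. It is a standalone approximation rather than the scheduler's code: the <code>schedule</code> lambda is a hypothetical stand-in for <code>Scheduler#add</code>, the <code>rejected</code> queue replaces the logger, and it rescues <code>URI::InvalidURIError</code> specifically instead of using a bare <code>rescue</code>.</p>

```ruby
require 'uri'

queue    = Queue.new
rejected = Queue.new  # hypothetical: stands in for the logger in Scheduler#add

# Hypothetical helper mirroring Scheduler#add, minus the logging.
schedule = lambda do |uri|
  begin
    queue << URI.parse(uri).normalize
  rescue URI::InvalidURIError
    rejected << uri
  end
end

# Four producers share the queue without any explicit synchronisation.
producers = Array.new(4) do |n|
  Thread.new do
    100.times { |i| schedule.call("http://Example.COM/feed/#{n}/#{i}") }
    schedule.call("http://bad uri/#{n}")  # the space makes this unparseable
  end
end
producers.each(&:join)

scheduled = queue.size

# Drain the queue and collect the normalised host names.
hosts = []
hosts << queue.pop.host until queue.empty?
```

<p>Every valid URI arrives exactly once, and <code>normalize</code> has lower-cased the host, so differently capitalised spellings of the same address end up looking the same in the queue.</p>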
<h3>Service control</h3>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8578fbcc8bb">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8578fbcc8bb').style.display='block';document.getElementById('plain_synthi_4c8578fbcc8bb').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
    def run
      super {
        if @spider_queue.empty?
          sleep(@@settings[&#034;timeout&#034;])
        else
          # if we haven't reached the concurrency limit
          # schedule a spider
          if @spider_threads.list.length &lt; @@settings[&#034;thread_limit&#034;]
            @spider_threads.add Thread.new { Spider.new.process(get_uri) }
          end
        end
      }
    end

    def shutdown
      super {
        @spider_threads.list.each { |spider_thread| spider_thread.join }
      }
    end
  end
end
</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8578fbcc8bb">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8578fbcc8bb').style.display='block';document.getElementById('styled_synthi_4c8578fbcc8bb').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">def</span> run</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">super</span> <span style="color:#006600; font-weight:bold;">&#123;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> @spider_queue.<span style="color:#9900CC;">empty</span>?</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#CC0066; font-weight:bold;">sleep</span><span style="color:#006600; font-weight:bold;">&#40;</span>@@settings<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;timeout&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">else</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#008000; font-style:italic;"># if we haven't reached the concurrency limit</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#008000; font-style:italic;"># schedule a spider</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> @spider_threads.<span style="color:#9900CC;">list</span>.<span style="color:#9900CC;">length</span> &lt; @@settings<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;thread_limit&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; @spider_threads.<span style="color:#9900CC;">add</span> Thread.<span style="color:#9900CC;">new</span> <span style="color:#006600; font-weight:bold;">&#123;</span> Spider.<span style="color:#9900CC;">new</span>.<span style="color:#9900CC;">process</span><span style="color:#006600; font-weight:bold;">&#40;</span>get_uri<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> shutdown</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">super</span> <span style="color:#006600; font-weight:bold;">&#123;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @spider_threads.<span style="color:#9900CC;">list</span>.<span style="color:#9900CC;">each</span> <span style="color:#006600; font-weight:bold;">&#123;</span> |spider_thread| spider_thread.<span style="color:#9900CC;">join</span> <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
<p>This is probably the most interesting code we'll see today. <code>Scheduler#run</code> and <code>Scheduler#shutdown</code> both override base class methods of <code>RWS::Service</code> only to call the base class method with a code block (the bit between the {}'s)! </p>
<p>Whilst it seems slightly counter-intuitive, this allows the subclass to provide its own "work block", without having to deal with the additional overhead of service management. In the case of <code>Scheduler#run</code>, the code block is executed repeatedly, until the service receives a signal to stop. <code>#run</code> will sleep when the URI queue is exhausted, otherwise it will attempt to create a new work thread for a spider, which retrieves a URI to request via <code>#get_uri</code>, unless the concurrent threads limit has been reached.</p>
<p>The block passed into the <code>#shutdown</code> method, in contrast, is executed only once. It calls join on each thread in turn, blocking until that thread finishes, which gives us a graceful termination.</p>
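<p>If the override-then-<code>super</code>-with-a-block idiom is new to you, here is a stripped-down sketch of it in isolation. <code>MiniService</code> and <code>Countdown</code> are hypothetical names invented for this example, not classes from the spider; the base class owns the loop and the state, and the subclass only supplies the work.</p>

```ruby
# Hypothetical base class: owns the state machine and the run loop.
class MiniService
  attr_reader :state

  def initialize
    @state = :Stop
  end

  # Repeats the caller's block until something flips the state.
  def run
    @state = :Running
    yield while @state == :Running
  end

  def stop
    @state = :Stop if @state == :Running
  end

  # Runs the caller's block exactly once, after stopping.
  def shutdown
    stop
    @state = :Shutdown
    yield if block_given?
  end
end

# Hypothetical subclass: overrides run/shutdown only to hand super a block.
class Countdown < MiniService
  attr_reader :ticks, :cleaned_up

  def initialize
    super
    @ticks = 0
    @cleaned_up = false
  end

  def run
    super do
      @ticks += 1
      stop if @ticks == 3   # the work block decides when to stop
    end
  end

  def shutdown
    super { @cleaned_up = true }
  end
end

job = Countdown.new
job.run        # loops in MiniService#run, executing the block each pass
job.shutdown   # executes the cleanup block once
```

<p>The block still runs with the subclass instance as <code>self</code>, which is why it can touch <code>@ticks</code> and call <code>stop</code> even though the loop itself lives in the base class.</p>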
<h3>The service superclass</h3>
<p>Here's a look at the base class beneath the scheduler. As you can see it provides simplistic state management through the <code>#run</code>, <code>#stop</code> and <code>#shutdown</code> methods. I think it's sufficiently straightforward to not need any further explanation. <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8578fbcefc3">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8578fbcefc3').style.display='block';document.getElementById('plain_synthi_4c8578fbcefc3').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
  class Service &lt; RWS::Base
    def initialize
      super
      @state = :Stop
    end

    def run
      @state = :Running
      while @state == :Running
        yield if block_given?
      end
    end

    def stop
      if @state == :Running
        @state = :Stop
      end
    end

    def shutdown
      stop
      @state = :Shutdown
      @logger.info(&#034;#{self} Shutting down...&#034;)
      yield if block_given?
    end
  end
</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8578fbcefc3">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8578fbcefc3').style.display='block';document.getElementById('styled_synthi_4c8578fbcefc3').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">class</span> Service &lt; RWS::Base</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> initialize</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">super</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @state = :Stop</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> run</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @state = :Running</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">while</span> @state == :Running</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">yield</span> <span style="color:#9966CC; font-weight:bold;">if</span> block_given?</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> stop</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> @state == :Running</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @state = :Stop</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> shutdown</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; stop</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @state = :Shutdown</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @logger.<span style="color:#9900CC;">info</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;#{self} Shutting down...&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">yield</span> <span style="color:#9966CC; font-weight:bold;">if</span> block_given?</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
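<p>To make the block-driven lifecycle concrete, here is a minimal sketch of how <code>Service</code> might be driven. The real <code>RWS::Base</code> is not shown in this post, so the version below is a hypothetical stand-in that only supplies the <code>@logger</code> that <code>shutdown</code> expects; the <code>ticks</code> counter and the stop-after-three-passes rule are purely for illustration.</p>

```ruby
require 'logger'
require 'stringio'

module RWS
  # Hypothetical stand-in: the real RWS::Base is not shown in this post.
  # All it does here is supply the @logger that Service#shutdown expects.
  class Base
    def initialize
      @logger = Logger.new(StringIO.new)
    end
  end
end

# Same Service class as in the listing above.
class Service < RWS::Base
  def initialize
    super
    @state = :Stop
  end

  def run
    @state = :Running
    while @state == :Running
      yield if block_given?
    end
  end

  def stop
    if @state == :Running
      @state = :Stop
    end
  end

  def shutdown
    stop
    @state = :Shutdown
    @logger.info("#{self} Shutting down...")
    yield if block_given?
  end
end

service = Service.new
ticks = 0

# The block is the body of the service loop. Calling stop from inside it
# ends the loop once the current pass has finished.
service.run do
  ticks += 1
  service.stop if ticks == 3
end

cleaned_up = false
# The shutdown block runs once, after the state has changed to :Shutdown.
service.shutdown { cleaned_up = true }
```

<p>Note that <code>run</code> without a block would spin until another thread calls <code>stop</code>, which is one reason a plain callback might turn out to be the clearer choice.</p>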
<h3>Issues so far:</h3>
<p>The best thing I can say about the current code is that it works with no unintended side effects. <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  There are a number of areas I need to go back and assess; after all, this is supposed to be a learning experience.</p>
<ol>
<li>Too much object creation going on. For each URI, a new <code>Spider</code> instance and a new <code>Thread</code> instance are created. As we will see later, the Spider class is relatively stateless, so it should be simple to re-use.</li>
<li>Likewise, thread pooling would be a better approach than the current thread group to which new threads are constantly being added.</li>
<li><code>Queue#pop</code> can be used in a blocking fashion: it will block until there is something in the queue to process. This may be better than the fixed timeout approach currently in use.</li>
<li>Use of blocks instead of callbacks - maybe I just got a bit too excited with the ability to use a block for the <code>Service#run</code> method.</li>
<li>I'm not particularly fond of using a class variable <code>@@settings</code>, so I'm looking to extract this and inject a settings instance into the scheduler. This is somewhat difficult given that I am using the <code>Singleton</code> mixin, so I cannot simply pass settings into a constructor.</li>
<li>I'm still tainted by Java... I find myself doing things in an overly convoluted Java-esque fashion. This habit is hard to break, but I will <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </li>
</ol>
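<p>Points 1 to 3 could be combined along the following lines. This is only a sketch under some assumptions, not the scheduler's actual code: the <code>Spider</code> class here is a hypothetical stand-in, <code>POOL_SIZE</code> plays the role of <code>ThreadLimit</code>, and the <code>STOP</code> sentinel is just one possible way of telling workers to exit.</p>

```ruby
require 'thread' # needed for Queue on older Rubies, harmless on newer ones

# Hypothetical stand-in for the real Spider class, which is not shown here.
class Spider
  def process(uri)
    "fetched #{uri}"
  end
end

POOL_SIZE = 3 # assumed value, playing the role of ThreadLimit
STOP = :stop  # sentinel telling a worker to exit

uri_queue = Queue.new
results   = Queue.new

# A fixed pool: each worker owns one Spider and re-uses it for every URI,
# so no Spider or Thread is created per URI.
workers = Array.new(POOL_SIZE) do
  Thread.new do
    spider = Spider.new
    # Queue#pop blocks until something is available, so there is no
    # sleep-and-poll loop with a fixed timeout.
    while (uri = uri_queue.pop) != STOP
      results.push(spider.process(uri))
    end
  end
end

uris = %w[
  http://example.com/a
  http://example.com/b
  http://example.com/c
  http://example.com/d
]
uris.each { |uri| uri_queue.push(uri) }

# One sentinel per worker, queued after the real work.
POOL_SIZE.times { uri_queue.push(STOP) }
workers.each { |worker| worker.join }

fetched = []
fetched.push(results.pop) until results.empty?
```

<p>The trade-off is that the pool size is fixed up front, whereas the current thread group grows and shrinks with the workload.</p>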
<p>So there you have it, the first proper installment of code for the web spider. I hope to shortly begin uploading actual .rb files containing what I've covered here, but there are a couple of housekeeping jobs still to be done on that front. I'll add links here when the code is up.</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Ruby web spider Part 0: concept</title>
		<link>http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/</link>
		<comments>http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/#comments</comments>
		<pubDate>Fri, 03 Mar 2006 04:49:53 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/</guid>
		<description><![CDATA[(I should probably mention that I have never written a spider or worked on a search engine before, so this is a learning process&#8230; I don&#8217;t pretend to be an expert on this &#8211; I picked this partly because it is far enough from my &#8220;day&#8221; job that I&#8217;m not going to inadvertently end up [...]]]></description>
			<content:encoded><![CDATA[<p>(I should probably mention that I have never written a spider or worked on a search engine before, so this is a learning process&#8230; I don&#8217;t pretend to be an expert on this &#8211; I picked this partly because it is far enough from my &#8220;day&#8221; job that I&#8217;m not going to inadvertently end up in a conflict of interest. The closest I&#8217;ve come in the past was working on a natural language interface to search engine queries, way back in 2001 while I was in my final year at UTas.)</p>
<p>So how did I start?</p>
<p><span id="more-26"></span><br />
As I mentioned in my <a href="http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/">first post on this topic</a>, the web spider I&#8217;ve been developing is simply a framework to test out some ideas of mine, as well as being a fun little Ruby project to fill what little spare time I have. </p>
<p>The ideas I&#8217;m tossing around at the moment require a broad collection of documents to analyse, so it seemed pretty clear that the way to go was to narrow the initial application to blog posts, and that I would therefore need a spider that could discover not only page links, but also RSS feeds.</p>
<p>My initial reflex was to sketch out the few basic components I thought I would need and allow a rough architecture to evolve from there.<br />
<img src="http://static.flickr.com/41/107051912_94bed6920b.jpg" alt="Ruby web spider: concept sketch" style="border: 1px solid black;"/><br />
The initial concept called for four components: a scheduler, a spider, a harvester, and an analyser. This is a cleaned up version of my initial sketch &#8211; the first one had a lot of crossing out, so much so that even I was getting confused about what my initial &#8220;design&#8221; was.</p>
<p>I&#8217;ll spare you my handwriting; I&#8217;ve copied out the annotations numbered 1-4 below:</p>
<ol>
<li>The scheduler provides a public interface to trigger the seeding of the URI queue, and control the spiders. Additional URIs can be submitted through the scheduler and added to the queue. </li>
<li>The spiders retrieve a URI for processing, and pass the retrieved data to the harvester. Additional URIs detected by the spider are inserted back into the URI queue.</li>
<li>The harvester collects the data retrieved by the spider and inserts this data into a &#8220;page&#8221; cache, an opaque storage service for HTML or RSS data. Currently this is filesystem based; however, there is no reason that harvester subclasses could not store this data in a DB, virtual FS or other appropriate datastore (that&#8217;s why it&#8217;s called &#8220;opaque&#8221; <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> ). This also triggers the insertion of a record of the retrieved data into the analysis queue &#8211; along with a reference to the file in the page cache, the record contains additional metadata such as URI, referring URI, time of retrieval, etc.</li>
<li>An analyser retrieves data to process, removing the record from the analysis queue and using this to retrieve the file from the page cache. And then some magic we&#8217;re not talking about happens here <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />  </li>
</ol>
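<p>The hand-off in annotations 3 and 4 can be sketched roughly as follows. Everything named here is an assumption made for illustration (the <code>AnalysisRecord</code> fields, the <code>Harvester</code> and <code>Analyser</code> method names, and the cache file naming), and the sketch is single-threaded, so it ignores any locking a real harvester would need.</p>

```ruby
require 'thread' # needed for Queue on older Rubies, harmless on newer ones
require 'tmpdir'

# Hypothetical record placed on the analysis queue; the field names are
# assumptions based on the metadata listed in annotation 3.
AnalysisRecord = Struct.new(:uri, :referrer, :retrieved_at, :cache_path)

# Filesystem-backed harvester: stores the body in the page cache and
# queues a record that points at the cached file.
class Harvester
  def initialize(cache_dir, analysis_queue)
    @cache_dir = cache_dir
    @analysis_queue = analysis_queue
    @count = 0 # not thread-safe; fine for a single-threaded sketch
  end

  def harvest(uri, referrer, body)
    @count += 1
    path = File.join(@cache_dir, "page-#{@count}.cache")
    File.write(path, body)
    @analysis_queue.push(AnalysisRecord.new(uri, referrer, Time.now, path))
  end
end

# Analyser side: remove a record from the queue, then use it to load the
# file from the page cache.
class Analyser
  def initialize(analysis_queue)
    @analysis_queue = analysis_queue
  end

  def next_document
    record = @analysis_queue.pop
    [record, File.read(record.cache_path)]
  end
end

analysis_queue = Queue.new
cache_dir = Dir.mktmpdir('page-cache')

harvester = Harvester.new(cache_dir, analysis_queue)
harvester.harvest('http://example.com/feed', 'http://example.com/', 'example page body')

record, body = Analyser.new(analysis_queue).next_document
```

<p>Because the analyser only ever sees the record and a path, a harvester subclass could swap the filesystem for another datastore, provided the record carries whatever reference that store needs.</p>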
<p>As you can see, there is no real mention of where each of these components will run, or indeed exactly how they will be implemented. As this series continues, we&#8217;ll build up to those details&#8230;</p>
<p>Well, that&#8217;s it for this exciting installment. Comments, criticism and offers of high paying contracts gladly accepted below <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>Next: <a href="http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/">Part 1: the scheduler</a>
</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The expressiveness of ruby&#8230;</title>
		<link>http://warrenseen.com/blog/2006/02/28/testing/</link>
		<comments>http://warrenseen.com/blog/2006/02/28/testing/#comments</comments>
		<pubDate>Mon, 27 Feb 2006 13:03:21 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/02/28/testing/</guid>
		<description><![CDATA[The web spider writeup continues. I am no longer best friends with red wine after the weekend however.
Speaking of things that are red, I *love* the expressiveness of Ruby. The following is the run method of my Scheduler class, Ruby makes it a piece of cake to understand what is going on here, even without [...]]]></description>
			<content:encoded><![CDATA[<p>The web spider writeup continues. I am no longer <a href="http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/#comment-33">best friends with red wine</a> after the weekend however.</p>
<p>Speaking of things that are red, I *love* the expressiveness of Ruby. The following is the run method of my Scheduler class. Ruby makes it a piece of cake to understand what is going on here; even without the full context, I think the following is fairly intuitive&#8230;</p>
<p><span id="more-25"></span></p>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8578fc02089">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8578fc02089').style.display='block';document.getElementById('plain_synthi_4c8578fc02089').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
def run
  while not stop?
    if @spider_queue.empty?
      sleep(Timeout)
    else
      # if we haven't reached the concurrency limit
      # schedule a spider
      if @spider_threads.list.length &lt; ThreadLimit
        @spider_threads.add Thread.new {
          Spider.new.process(remove)
        }
      end
    end
  end
  # join threads
  @spider_threads.list.each {
    |spider_thread| spider_thread.join
  }
end</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8578fc02089">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8578fc02089').style.display='block';document.getElementById('styled_synthi_4c8578fc02089').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">def</span> run</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#9966CC; font-weight:bold;">while</span> <span style="color:#9966CC; font-weight:bold;">not</span> stop?</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> @spider_queue.<span style="color:#9900CC;">empty</span>?</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#CC0066; font-weight:bold;">sleep</span><span style="color:#006600; font-weight:bold;">&#40;</span>Timeout<span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">else</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#008000; font-style:italic;"># if we haven't reached the concurrency limit</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#008000; font-style:italic;"># schedule a spider</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> @spider_threads.<span style="color:#9900CC;">list</span>.<span style="color:#9900CC;">length</span> &lt; ThreadLimit</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @spider_threads.<span style="color:#9900CC;">add</span> Thread.<span style="color:#9900CC;">new</span> <span style="color:#006600; font-weight:bold;">&#123;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Spider.<span style="color:#9900CC;">new</span>.<span style="color:#9900CC;">process</span><span style="color:#006600; font-weight:bold;">&#40;</span>remove<span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#008000; font-style:italic;"># join threads</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; @spider_threads.<span style="color:#9900CC;">list</span>.<span style="color:#9900CC;">each</span> <span style="color:#006600; font-weight:bold;">&#123;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; |spider_thread| spider_thread.<span style="color:#9900CC;">join</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
<p>(For those wondering, yes, it runs in its own thread).</p>
<p>Anyway, this was really only a test of the <a href="http://www.indyjt.com/software/">SyntHihol plugin</a> for WordPress. It took a bit of setting up: to get Ruby highlighting, I needed the latest version of <a href="http://qbnz.com/highlighter/">GeSHi</a>, and I had to turn off TinyMCE because it kept stealing my indentation <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_sad.gif' alt=':-(' class='wp-smiley' />
</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/02/28/testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ruby web spider &#8211; watch this space</title>
		<link>http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/</link>
		<comments>http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/#comments</comments>
		<pubDate>Fri, 24 Feb 2006 00:41:26 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[ruby]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/</guid>
		<description><![CDATA[I mentioned before that I&#8217;ve been busy this last week, one of the things I&#8217;ve been working on in my own time is a web spider (written in Ruby) that can trawl both HTML pages and RSS feeds. I won&#8217;t say much about what I&#8217;m using it for, other than to say I&#8217;m testing some [...]]]></description>
			<content:encoded><![CDATA[<p>I mentioned before that I&#8217;ve been busy this last week. One of the things I&#8217;ve been working on in my own time is a web spider (written in Ruby) that can trawl both HTML pages and RSS feeds. I won&#8217;t say much about what I&#8217;m using it for, other than to say I&#8217;m testing <a title="Whose authority?" href="http://warrenseen.com/blog/2006/02/15/whose-authority/">some</a> <a target="_blank" title=" Regional Inbound Link Authority Up the Duff" href="http://benbarren.blogspot.com/2006/02/regional-inbound-link-authority-up.html">ideas</a> out right now <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Anyway, I&#8217;m almost at a point where I&#8217;m happy to share this code (probably under GPL) as it&#8217;s not exactly rocket science (and I&#8217;ve only invested a week or so of evenings into it), but it has a couple of neat tricks that made it a good exercise in Ruby. A short laundry list of features:</p>
<p><span id="more-24"></span></p>
<ul>
<li>Multi-process and multi-threaded</li>
<li>Can parse RSS, Atom and HTML</li>
<li>Capable of distributing workload across multiple machines, via a queue based scheduler (using DRb)</li>
<li>Respects robots.txt and is generally well behaved. Can set process and thread limits per machine or across the pool</li>
<li>Disk-based storage of spidered pages (for now)</li>
<li>Asynchronous analyser(s) can process pages independent of the spiders&#8217; execution</li>
</ul>
<p>I won&#8217;t be releasing the analysers I&#8217;m using right now (gotta keep a bit of mystery) but I will release a generic analyser that you will be able to extend (ie subclass) to implement your own analysis, classification, etc.</p>
<p>Given the scope of the system (even though the code itself is simple), I&#8217;ll break it down according to function, and show how I&#8217;ve fitted it all together. I hope to have the first part up by Sunday evening.</p>
<p>edit:</p>
<p>Part 0 is now up; you can read it <a href="http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/">here</a>.
</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
