<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Warren Seen &#187; web2.0</title>
	<atom:link href="http://warrenseen.com/blog/category/web20/feed/" rel="self" type="application/rss+xml" />
	<link>http://warrenseen.com/blog</link>
	<description>freelance software developer</description>
	<lastBuildDate>Wed, 03 Jun 2009 23:54:34 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Bye bye hashjobs.com, it was fun.</title>
		<link>http://warrenseen.com/blog/2009/06/04/bye-bye-hashjobscom-it-was-fun/</link>
		<comments>http://warrenseen.com/blog/2009/06/04/bye-bye-hashjobscom-it-was-fun/#comments</comments>
		<pubDate>Wed, 03 Jun 2009 23:54:34 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/?p=84</guid>
		<description><![CDATA[It&#8217;s funny how things work out. When I made the initial version of hashjobs live back in January, I mentioned that I wasn&#8217;t really sure how or if I could make anything from it. As luck would have it, it turns out that I managed to catch the &#8220;post jobs on Twitter&#8221; trend in its [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s funny how things work out. When I made the initial version of <a href="http://hashjobs.com/">hashjobs</a> live back in January, <a href="http://warrenseen.com/blog/2009/01/22/just-launched-hashjobscom/">I mentioned</a> that I wasn&#8217;t really sure how or if I could make anything from it. As luck would have it, it turns out that I managed to catch the &#8220;post jobs on Twitter&#8221; trend in its infancy. </p>
<p>Almost immediately I started getting offers to buy the domain name. Each inquiry I got, I would knock back as I have always had plans as to what I wanted to work on for future versions of the site. Most of these seemed like tire-kickers who I never heard from again. </p>
<p>But there was one guy who kept coming back asking about the site. Early last month, he forwarded a very specific offer via a 3rd party to me that made me realise he was VERY serious about acquiring HashJobs. After a bit of back and forth negotiating, and consideration of what I could make from the site in the next 12-18 months if I had a serious go at it, I happily accepted his offer.</p>
<p>So I&#8217;m pleased to announce that as of this morning, the final installment of our transaction has cleared, and Jason Davis of <a href="http://recruitingblogs.com/">recruitingblogs.com</a> is the new owner of HashJobs. I&#8217;ve had a sneak peek at Jason&#8217;s plans for the site and I think he is going to do it far more justice than I could have, given his position in the online recruiting community. I wish him all the best.</p>
<p>(As a side note, I am amazed how quickly incoming wire transfers clear, compared to transfers between local banks!)</p>
<p>As for my plans, well at least I&#8217;ve filled a GFC-induced hole in my previous 6 months&#8217; cashflow. People who I&#8217;ve told about this always ask me, &#8220;Can you do the same again in a different field?&#8221; The answer is, &#8220;Probably not, it was more luck than good management.&#8221; </p>
<p>I think I&#8217;ll just start another side project (I have a few ideas) and see where the wind takes me&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2009/06/04/bye-bye-hashjobscom-it-was-fun/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Morfik @ San Francisco Web 2.0 Expo</title>
		<link>http://warrenseen.com/blog/2008/04/24/morfik-san-francisco-web-20-expo/</link>
		<comments>http://warrenseen.com/blog/2008/04/24/morfik-san-francisco-web-20-expo/#comments</comments>
		<pubDate>Wed, 23 Apr 2008 23:49:54 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[web]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2008/04/24/morfik-san-francisco-web-20-expo/</guid>
		<description><![CDATA[I mentioned Morfik a while back, the little Tassie company that was taking on Google with a number of patents.
Today they were featured on Qik, interviewed by Scoble himself. Check it out.

]]></description>
			<content:encoded><![CDATA[<p>I mentioned <a target="_blank" title="Morfik Home" href="http://www.morfik.com">Morfik</a> a <a href="http://warrenseen.com/blog/2007/04/03/local-company-taking-on-google-over-gwt/">while back</a>, the little Tassie company that was taking on Google with a number of patents.</p>
<p>Today they were featured on <a href="http://qik.com">Qik</a>, interviewed by <a href="http://www.fastcompany.com/scoble">Scoble</a> himself. Check it out.</p>
<p><object width="320" height="280"><param name="movie" value="http://qik.com/player.swf?streamname=fb2eacd6aa8544cfb01835a6eb7c6b76&#038;vid=63143&#038;playback=false&#038;polling=false&#038;user=scobleizer&#038;userlock=true&#038;islive=&#038;username=anonymous" ></param><param name="wmode" value="transparent" ></param><param name="allowScriptAccess" value="always" ><embed src="http://qik.com/player.swf?streamname=fb2eacd6aa8544cfb01835a6eb7c6b76&#038;vid=63143&#038;playback=false&#038;polling=false&#038;user=scobleizer&#038;userlock=true&#038;islive=&#038;username=anonymous" type="application/x-shockwave-flash" wmode="transparent" width="320" height="280" allowScriptAccess="always"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2008/04/24/morfik-san-francisco-web-20-expo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ruby web spider Part 1: The scheduler</title>
		<link>http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/</link>
		<comments>http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/#comments</comments>
		<pubDate>Tue, 07 Mar 2006 13:05:13 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/</guid>
		<description><![CDATA[This is the second part of a series of posts covering the development of my web spider in Ruby. You can read about the initial idea here, and the architecture in Part 0: Concept.
You may also recognise some of the code in Scheduler#run from a short post I made to check that the syntax highlighting [...]]]></description>
			<content:encoded><![CDATA[<p>This is the second part of a series of posts covering the development of my web spider in Ruby. You can read about the initial idea <a href="http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/">here</a>, and the architecture in <a href="http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/">Part 0: Concept</a>.</p>
<p>You may also recognise some of the code in Scheduler#run from a <a href="http://warrenseen.com/blog/2006/02/28/testing/">short post</a> I made to check that the syntax highlighting was working <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>First I want to recap the goal of the scheduler before getting into the code itself. Simply put, the scheduler exists to manage the list of URIs (web pages, RSS feeds) that need to be spidered, and to manage the spiders themselves. In particular, we want to be able to limit the number of spiders working at any one time, out of politeness if nothing else.</p>
<p>I&#8217;m not going to make this a tutorial in Ruby syntax by explaining things line by line. If you haven&#8217;t used Ruby before and find something you don&#8217;t understand, the PragProg book, <a href="http://www.rubycentral.com/book/index.html">Programming Ruby</a>, is the place to look.</p>
<p>So let&#8217;s take a peek at some code!</p>
<p><span id="more-27"></span></p>
<h3>Class declaration</h3>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8ae16479697">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8ae16479697').style.display='block';document.getElementById('plain_synthi_4c8ae16479697').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
module RWS
  class Scheduler < RWS::Service
    include Singleton

    def initialize
      super
      @spider_queue = Queue.new
      @spider_threads = ThreadGroup.new
    end
</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8ae16479697">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8ae16479697').style.display='block';document.getElementById('styled_synthi_4c8ae16479697').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">module</span> RWS</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#9966CC; font-weight:bold;">class</span> Scheduler &lt; RWS::Service</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">include</span> Singleton</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> initialize</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">super</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @spider_queue = Queue.<span style="color:#9900CC;">new</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @spider_threads = ThreadGroup.<span style="color:#9900CC;">new</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
<p>Here we can see the basic structure of the RWS::Scheduler, which inherits from RWS::Service (we'll get back to that in a moment). The key things to note here are <code>spider_queue</code>, which is the basis for our URI queue, and <code>spider_threads</code>, a thread group which will contain a separate thread for each spider instance that is executing.</p>
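<p>To make the <code>Singleton</code> mixin and the thread group a little more concrete, here is a minimal stand-alone sketch. <code>DemoScheduler</code> and the example.com URIs are made up for illustration and are not part of RWS; the sketch also leaves out the <code>RWS::Service</code> base class.</p>

```ruby
require 'singleton'

# Hypothetical stand-in for RWS::Scheduler, without the RWS::Service base class.
class DemoScheduler
  include Singleton

  attr_reader :spider_queue, :spider_threads

  def initialize
    @spider_queue = Queue.new
    @spider_threads = ThreadGroup.new
  end
end

scheduler = DemoScheduler.instance
same_object = scheduler.equal?(DemoScheduler.instance) # one shared instance

# Singleton makes .new private, so direct construction raises NoMethodError.
new_is_blocked = begin
  DemoScheduler.new
  false
rescue NoMethodError
  true
end

# Two worker threads that block on the empty queue, so they stay alive.
workers = Array.new(2) do
  Thread.new { scheduler.spider_queue.pop }
end
workers.each { |t| scheduler.spider_threads.add(t) }

# The group lists the threads that belong to it.
live_count = scheduler.spider_threads.list.length

# Give each worker one item so it can finish, then wait for both.
2.times { |i| scheduler.spider_queue.push("http://example.com/#{i}") }
workers.each(&:join)
```

<p>The point of the sketch is that every caller of <code>DemoScheduler.instance</code> shares the same queue and the same thread group, which is what the scheduler relies on.</p>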
<h3>Queue management</h3>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8ae1647bdce">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8ae1647bdce').style.display='block';document.getElementById('plain_synthi_4c8ae1647bdce').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
    # schedule a URI for spidering
    def add(uri)
      begin
        @spider_queue << URI.parse(uri).normalize
        @logger.info(&#034;Scheduled: #{uri}&#034;)
      rescue
        @logger.info(&#034;Invalid:   #{uri}&#034;)
      end
    end

    # get a URI to spider
    def get_uri
      @spider_queue.pop
    end
</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8ae1647bdce">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8ae1647bdce').style.display='block';document.getElementById('styled_synthi_4c8ae1647bdce').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#008000; font-style:italic;"># schedule a URI for spidering</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> add<span style="color:#006600; font-weight:bold;">&#40;</span>uri<span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">begin</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @spider_queue &lt;&lt; URI.<span style="color:#9900CC;">parse</span><span style="color:#006600; font-weight:bold;">&#40;</span>uri<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">normalize</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @logger.<span style="color:#9900CC;">info</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Scheduled: #{uri}&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">rescue</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @logger.<span style="color:#9900CC;">info</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Invalid:&nbsp; &nbsp;#{uri}&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#008000; font-style:italic;"># get a URI to spider</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> get_uri</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @spider_queue.<span style="color:#9900CC;">pop</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
<p>Here we see two extremely simple (for now) methods to work with the URI queue. Because Ruby's <code>Queue</code> class is inherently thread safe (it's designed specifically to allow synchronisation between threads), we can safely operate without the need for explicit synchronisation. This is not a big deal at the moment, but will become important as we build out the spidering infrastructure.</p>
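<p>As a rough illustration of why no explicit locking is needed, the sketch below has several threads scheduling URIs into one <code>Queue</code> while another thread pops them. The <code>schedule</code> helper is a hypothetical stand-alone version of <code>Scheduler#add</code> (no logger, and it rescues only <code>URI::InvalidURIError</code>), and the example.com URIs are placeholders.</p>

```ruby
require 'uri'

# Hypothetical stand-alone version of Scheduler#add: returns true if queued.
def schedule(queue, uri)
  queue.push(URI.parse(uri).normalize)
  true
rescue URI::InvalidURIError
  false
end

queue = Queue.new

# Four producers schedule 25 URIs each, with no Mutex around the queue.
producers = Array.new(4) do |p|
  Thread.new do
    25.times { |i| schedule(queue, "HTTP://Example.COM/#{p}/#{i}") }
  end
end

# One consumer pops all 100; pop blocks whenever the queue is momentarily empty.
received = []
consumer = Thread.new { 100.times { received.push(queue.pop) } }

producers.each(&:join)
consumer.join

# A space is not valid in a URI, so this one is rejected rather than queued.
rejected = !schedule(queue, "http://exa mple.com/")
```

<p>Every URI arrives exactly once, and <code>normalize</code> has lower-cased the scheme and host on the way in, so the spiders only ever see one spelling of each address.</p>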
<h3>Service control</h3>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8ae1647e4b8">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8ae1647e4b8').style.display='block';document.getElementById('plain_synthi_4c8ae1647e4b8').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
    def run
      super {
        if @spider_queue.empty?
          sleep(@@settings[&#034;timeout&#034;])
        else
          # if we haven't reached the concurrency limit
          # schedule a spider
          if @spider_threads.list.length < @@settings[&#034;thread_limit&#034;]
            @spider_threads.add Thread.new { Spider.new.process(get_uri) }
          end
        end
      }
    end

    def shutdown
      super {
        @spider_threads.list.each { |spider_thread| spider_thread.join }
      }
    end
  end
end
</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8ae1647e4b8">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8ae1647e4b8').style.display='block';document.getElementById('styled_synthi_4c8ae1647e4b8').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">def</span> run</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">super</span> <span style="color:#006600; font-weight:bold;">&#123;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> @spider_queue.<span style="color:#9900CC;">empty</span>?</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#CC0066; font-weight:bold;">sleep</span><span style="color:#006600; font-weight:bold;">&#40;</span>@@settings<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;timeout&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">else</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#008000; font-style:italic;"># if we haven't reached the concurrency limit</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#008000; font-style:italic;"># schedule a spider</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> @spider_threads.<span style="color:#9900CC;">list</span>.<span style="color:#9900CC;">length</span> &lt; @@settings<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;thread_limit&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; @spider_threads.<span style="color:#9900CC;">add</span> Thread.<span style="color:#9900CC;">new</span> <span style="color:#006600; font-weight:bold;">&#123;</span> Spider.<span style="color:#9900CC;">new</span>.<span style="color:#9900CC;">process</span><span style="color:#006600; font-weight:bold;">&#40;</span>get_uri<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> shutdown</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">super</span> <span style="color:#006600; font-weight:bold;">&#123;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @spider_threads.<span style="color:#9900CC;">list</span>.<span style="color:#9900CC;">each</span> <span style="color:#006600; font-weight:bold;">&#123;</span> |spider_thread| spider_thread.<span style="color:#9900CC;">join</span> <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#006600; font-weight:bold;">&#125;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
<p>This is probably the most interesting code we'll see today. <code>Scheduler#run</code> and <code>Scheduler#shutdown</code> both override base class methods of <code>RWS::Service</code> only to call the base class method with a code block (the bit between the {}'s)! </p>
<p>Whilst it seems slightly counter-intuitive, this allows the subclass to provide its own "work block", without having to deal with the additional overhead of service management. In the case of <code>Scheduler#run</code>, the code block is executed repeatedly, until the service receives a signal to stop. On each pass, the block sleeps if the URI queue is empty. Otherwise, provided the concurrent thread limit has not been reached, it creates a new work thread for a spider, which retrieves a URI to request via <code>#get_uri</code>.</p>
<p>The block passed into the <code>#shutdown</code> method, in contrast, is executed only once. It calls join on each thread, blocking until each thread in turn finishes, which gives us a graceful termination.</p>
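<p>Stripped of the spider details, the "override, then call <code>super</code> with a block" pattern looks something like the sketch below. <code>DemoService</code> and <code>Countdown</code> are hypothetical names used only for this illustration; the base class owns the loop and the state, and the subclass supplies nothing but the work.</p>

```ruby
# Hypothetical base class: owns the run loop and the state.
class DemoService
  def initialize
    @state = :Stop
  end

  def run
    @state = :Running
    while @state == :Running
      yield if block_given?
    end
  end

  def stop
    @state = :Stop
  end
end

# Hypothetical subclass: overrides #run only to hand its work block to super.
class Countdown < DemoService
  attr_reader :ticks

  def initialize(limit)
    super()
    @limit = limit
    @ticks = 0
  end

  def run
    super do
      @ticks += 1
      stop if @ticks >= @limit # ask the base class loop to finish
    end
  end
end

counter = Countdown.new(3)
counter.run
```

<p>The block runs in the context of the subclass instance, so it can see <code>@ticks</code> and call <code>stop</code>, while the looping itself stays in the base class.</p>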
<h3>The service superclass</h3>
<p>Here's a look at the base class beneath the scheduler. As you can see it provides simplistic state management through the <code>#run</code>, <code>#stop</code> and <code>#shutdown</code> methods. I think it's sufficiently straightforward to not need any further explanation. <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<div class="synthi_code" style="display:none;" id ="plain_synthi_4c8ae16480beb">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('styled_synthi_4c8ae16480beb').style.display='block';document.getElementById('plain_synthi_4c8ae16480beb').style.display='none';return false">Show Styled Code</a>]:</span></div>
<pre style="width:100%;overflow:auto;">
  class Service < RWS::Base
    def initialize
      super
      @state = :Stop
    end

    def run
      @state = :Running
      while @state == :Running
        yield if block_given?
      end
    end

    def stop
      if @state == :Running
        @state = :Stop
      end
    end

    def shutdown
      stop
      @state = :Shutdown
      @logger.info(&#034;#{self} Shutting down...&#034;)
      yield if block_given?
    end
  end
</pre>
</div>
<div class="synthi_code" style="display:block;" id ="styled_synthi_4c8ae16480beb">
<div class="synthi_header" style="font-weight:bold;"> Ruby <span  class="synthi_button"style="font-weight:lighter;font-size:smaller;">[<a href="#" onClick="javascript:document.getElementById('plain_synthi_4c8ae16480beb').style.display='block';document.getElementById('styled_synthi_4c8ae16480beb').style.display='none';return false">Show Plain Code</a>]:</span></div>
<div class="ruby" style="font-family: monospace;">
<ol>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#9966CC; font-weight:bold;">class</span> Service &lt; RWS::Base</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> initialize</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">super</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @state = :Stop</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> run</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @state = :Running</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">while</span> @state == :Running</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">yield</span> <span style="color:#9966CC; font-weight:bold;">if</span> block_given?</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> stop</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">if</span> @state == :Running</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; @state = :Stop</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">def</span> shutdown</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; stop</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @state = :Shutdown</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; @logger.<span style="color:#9900CC;">info</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;#{self} Shutting down...&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">yield</span> <span style="color:#9966CC; font-weight:bold;">if</span> block_given?</div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#9966CC; font-weight:bold;">end</span></div>
</li>
<li style="font-weight: bold;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; <span style="color:#9966CC; font-weight:bold;">end</span> </div>
</li>
</ol>
</div>
</div>
<h3>Issues so far:</h3>
<p>The best thing I can say about the current code is that it works with no unintended side effects. <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  There are a number of areas which I need to go back and assess; after all, this is supposed to be a learning experience.</p>
<ol>
<li>Too much object creation going on. For each URI, a new <code>Spider</code> instance and a new <code>Thread</code> instance are created. As we will see later, the Spider class is relatively stateless, so it should be simple to re-use.</li>
<li>Likewise, thread pooling would be a better approach than the current thread group to which new threads are constantly being added.</li>
<li><code>Queue#pop</code> can be used in a blocking fashion: it blocks until there is something in the queue to process. This may be better than the fixed timeout approach currently in use.</li>
<li>Use of blocks instead of callbacks - maybe I just got a bit too excited with the ability to use a block for the <code>Service#run</code> method.</li>
<li>I'm not particularly fond of using a class variable <code>@@settings</code>, so I'm looking to extract this and inject a settings instance into the scheduler. This is somewhat difficult given that I am using the <code>Singleton</code> mixin so cannot simply pass settings into a constructor.</li>
<li>I'm still tainted by Java... I find myself doing things in an overly convoluted Java-esque fashion. This habit is hard to break, but I will <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </li>
</ol>
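<p>To make the first three points concrete, here is a minimal sketch of what a pooled version might look like. This is not the scheduler code from this post: <code>SpiderPool</code>, the <code>STOP</code> sentinel and the stubbed <code>Spider#crawl</code> method are all names I've made up for illustration. Each worker thread creates one <code>Spider</code>, re-uses it for every URI, and waits in a blocking <code>Queue#pop</code> rather than polling with a timeout.</p>

```ruby
require 'thread' # Queue lives here on older Rubies; harmless on newer ones

# Hypothetical stand-in for the real Spider class; the real one fetches the URI.
class Spider
  def crawl(uri)
    "fetched #{uri}"
  end
end

# A fixed pool of workers sharing one queue. Each worker owns a single
# Spider and blocks in Queue#pop until work arrives.
class SpiderPool
  STOP = :stop # sentinel pushed once per worker to end its loop

  attr_reader :results

  def initialize(size)
    @queue   = Queue.new
    @results = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        spider = Spider.new          # created once per worker, then re-used
        loop do
          uri = @queue.pop           # blocks while the queue is empty
          break if uri == STOP
          @results.push(spider.crawl(uri))
        end
      end
    end
  end

  # Add a URI for the next free worker to pick up.
  def submit(uri)
    @queue.push(uri)
  end

  # Let the workers drain the queue, then wait for them to exit.
  def shutdown
    @workers.size.times { @queue.push(STOP) }
    @workers.each(&:join)
  end
end
```

<p>Usage would be along the lines of <code>pool = SpiderPool.new(4)</code>, a few calls to <code>pool.submit(uri)</code>, then <code>pool.shutdown</code>. Because the sentinels are queued behind any pending URIs, the workers finish the outstanding work before they stop.</p>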
<p>So there you have it, the first proper installment of code for the web spider. I hope to shortly begin uploading actual .rb files containing what I've covered here, but there are a couple of housekeeping jobs still to be done on that front. I'll add links here when the code is up.</p>
<p><!--f13d79fe9b2263fe4d225931e955fdd2--></p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Ruby web spider Part 0: concept</title>
		<link>http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/</link>
		<comments>http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/#comments</comments>
		<pubDate>Fri, 03 Mar 2006 04:49:53 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/</guid>
		<description><![CDATA[(I should probably mention that I have never written a spider or worked on a search engine before, so this is a learning process&#8230; I don&#8217;t pretend to be an expert on this &#8211; I picked this partly because it is far enough from my &#8220;day&#8221; job that I&#8217;m not going to inadvertently end up [...]]]></description>
			<content:encoded><![CDATA[<p>(I should probably mention that I have never written a spider or worked on a search engine before, so this is a learning process&#8230; I don&#8217;t pretend to be an expert on this &#8211; I picked this partly because it is far enough from my &#8220;day&#8221; job that I&#8217;m not going to inadvertently end up in a conflict of interest. The closest I&#8217;ve come in the past was working on a natural language interface to search engine queries, way back in 2001 while I was in my final year at UTas.)</p>
<p>So how did I start?</p>
<p><span id="more-26"></span><br />
As I mentioned in my <a href="http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/">first post on this topic</a>, the web spider I&#8217;ve been developing is simply a framework to test out some ideas of mine, as well as being a fun little ruby project to fill what little spare time I have. </p>
<p>The ideas I&#8217;m tossing around at the moment require a broad collection of documents to analyse, so it seemed pretty clear that the way to go was to narrow the initial application to blog posts, and that I would therefore need a spider that could discover not only page links, but also RSS feeds.</p>
<p>My initial reflex was to sketch out the few basic components I thought I would need and allow a rough architecture to evolve from there.<br />
<img src="http://static.flickr.com/41/107051912_94bed6920b.jpg" alt="Ruby web spider: concept sketch" style="border: 1px solid black;"/><br />
The initial concept called for four components: a scheduler, a spider, a harvester, and an analyser. This is a cleaned up version of my initial sketch &#8211; the first one had a lot of crossing out, so much so that even I was getting confused about what my initial &#8220;design&#8221; was.</p>
<p>I&#8217;ll spare you my handwriting; I&#8217;ve copied out the annotations numbered 1-4 below:</p>
<ol>
<li>The scheduler provides a public interface to trigger the seeding of the URI queue, and control the spiders. Additional URIs can be submitted through the scheduler and added to the queue. </li>
<li>The spiders retrieve a URI for processing, and pass the retrieved data to the harvester. Additional URIs detected by the spider are inserted back into the URI queue.</li>
<li>The harvester collects the data retrieved by the spider and inserts this data into a &#8220;page&#8221; cache, an opaque storage service for HTML or RSS data. Currently this is filesystem based, however there is no reason that harvester subclasses could not store this data in a DB, virtual FS or other appropriate datastore (that&#8217;s why it&#8217;s called &#8220;opaque&#8221; <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> ). This also triggers insertion of a record of the retrieved data into the analysis queue &#8211; along with a reference to the file in the page cache, the record contains additional metadata such as URI, referring URI, time of retrieval, etc.</li>
<li>An analyser retrieves data to process, removing the record from the analysis queue and using this to retrieve the file from the page cache. And then some magic we&#8217;re not talking about happens here <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />  </li>
</ol>
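<p>The four annotations above can be sketched as a toy, single-process pipeline. None of this is the real implementation: the class bodies, the <code>PageRecord</code> struct, the <code>fetcher</code> callable and the use of a plain Hash as the page cache are all assumptions made for illustration, and actual fetching and link extraction are stubbed out.</p>

```ruby
require 'thread'

# Hypothetical record placed on the analysis queue by the harvester.
PageRecord = Struct.new(:uri, :referrer, :retrieved_at, :cache_key)

# (1) The scheduler's seeding role, reduced to a single method.
class Scheduler
  def initialize(uri_queue)
    @uri_queue = uri_queue
  end

  def submit(uri)
    @uri_queue.push([uri, nil])       # seeds have no referring URI
  end
end

# (3) Stores page bodies in the cache and queues a record for analysis.
class Harvester
  def initialize(page_cache, analysis_queue)
    @page_cache     = page_cache      # a Hash stands in for the filesystem cache
    @analysis_queue = analysis_queue
  end

  def harvest(uri, referrer, body)
    key = "page-#{@page_cache.size}"
    @page_cache[key] = body
    @analysis_queue.push(PageRecord.new(uri, referrer, Time.now, key))
  end
end

# (2) Takes one URI off the queue, hands the body to the harvester and
# feeds any discovered links back into the URI queue.
class Spider
  def initialize(uri_queue, harvester, fetcher)
    @uri_queue = uri_queue
    @harvester = harvester
    @fetcher   = fetcher              # callable returning [body, links]; stubbed
  end

  def process_next
    uri, referrer = @uri_queue.pop
    body, links = @fetcher.call(uri)
    @harvester.harvest(uri, referrer, body)
    links.each { |link| @uri_queue.push([link, uri]) }
  end
end

# (4) Removes a record from the analysis queue and loads its page from the cache.
class Analyser
  def initialize(page_cache, analysis_queue)
    @page_cache     = page_cache
    @analysis_queue = analysis_queue
  end

  def process_next
    record = @analysis_queue.pop
    yield record, @page_cache.fetch(record.cache_key)
  end
end
```

<p>Wiring it together means creating two <code>Queue</code> instances and one Hash, passing them to the components, and supplying a <code>fetcher</code> such as a lambda that looks URIs up in a canned Hash of pages. Note that this sketch does nothing to avoid re-queuing a URI it has already seen, which a real spider would have to handle.</p>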
<p>As you can see, there is no real mention of where each of these components will run, or indeed exactly how they will be implemented. As this series continues, we&#8217;ll build up to those details&#8230;</p>
<p>Well, that&#8217;s it for this exciting installment, comments, criticism and offers of high paying contracts gladly accepted below <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>Next: <a href="http://warrenseen.com/blog/2006/03/08/ruby-web-spider-part-1-the-scheduler/">Part 1: the scheduler</a>
</p>
<p><!--56306cbea7726d8ec332355bc85deae8--></p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Ruby web spider &#8211; watch this space</title>
		<link>http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/</link>
		<comments>http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/#comments</comments>
		<pubDate>Fri, 24 Feb 2006 00:41:26 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[ruby]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/</guid>
		<description><![CDATA[I mentioned before that I&#8217;ve been busy this last week, one of the things I&#8217;ve been working on in my own time is a web spider (written in Ruby) that can trawl both HTML pages and RSS feeds. I won&#8217;t say much about what I&#8217;m using it for, other than to say I&#8217;m testing some [...]]]></description>
			<content:encoded><![CDATA[<p>I mentioned before that I&#8217;ve been busy this last week, one of the things I&#8217;ve been working on in my own time is a web spider (written in Ruby) that can trawl both HTML pages and RSS feeds. I won&#8217;t say much about what I&#8217;m using it for, other than to say I&#8217;m testing <a title="Whose authority?" href="http://warrenseen.com/blog/2006/02/15/whose-authority/">some</a> <a target="_blank" title=" Regional Inbound Link Authority Up the Duff" href="http://benbarren.blogspot.com/2006/02/regional-inbound-link-authority-up.html">ideas</a> out right now <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Anyway, I&#8217;m almost at a point where I&#8217;m happy to share this code (probably under GPL) as it&#8217;s not exactly rocket science (and I&#8217;ve only invested a week or so of evenings into it), but it has a couple of neat tricks that made it a good exercise in Ruby. A short laundry list of features:</p>
<p><span id="more-24"></span></p>
<ul>
<li>Multi-process and multi-threaded</li>
<li>Can parse RSS, Atom and HTML</li>
<li>Capable of distributing workload across multiple machines, via a queue based scheduler (using DRb)</li>
<li>Respects robots.txt and is generally well behaved. Can set process and thread limits per machine or across the pool</li>
<li>disk-based storage of spidered pages (for now)</li>
<li>Asynchronous analyser(s) can process pages independent of the spiders&#8217; execution</li>
</ul>
<p>I won&#8217;t be releasing the analysers I&#8217;m using right now (gotta keep a bit of mystery) but I will release a generic analyser that you will be able to extend (ie subclass) to implement your own analysis, classification, etc.</p>
<p>Given the scope of the system (even though the code itself is simple), I&#8217;ll break it down according to function, and show how I&#8217;ve fitted it all together. I hope to have the first part up by Sunday evening.</p>
<p>edit:</p>
<p>Part 0 is now up, you can read it <a href="http://warrenseen.com/blog/2006/03/03/ruby-web-spider-part-0-concept/">here</a>.
</p>
<p><!--f21c02c6fe04950e86942429fa1400ab--></p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/02/24/ruby-web-spider-watch-this-space/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>On Blogcode and missing the mark</title>
		<link>http://warrenseen.com/blog/2006/02/23/on-blogcode-and-missing-the-mark/</link>
		<comments>http://warrenseen.com/blog/2006/02/23/on-blogcode-and-missing-the-mark/#comments</comments>
		<pubDate>Thu, 23 Feb 2006 12:43:17 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[blog]]></category>
		<category><![CDATA[blogcode]]></category>
		<category><![CDATA[technorati]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/02/23/on-blogcode-and-missing-the-mark/</guid>
		<description><![CDATA[Noticed Blogcode last week, signed up and had a play, but haven&#8217;t really collected my thoughts on this one. I&#8217;ve been flat out with &#8220;real&#8221; work, etc., so have neglected to write anything this last week.
Anyway, precis on BlogCode is fairly straight forward, drop the name and url (not the feed tho) of the blog [...]]]></description>
			<content:encoded><![CDATA[<p>Noticed <a title="blogcode.com" href="http://www.blogcode.com/">Blogcode</a> last week, signed up and had a play, but haven&#8217;t really collected my thoughts on this one. I&#8217;ve been flat out with &#8220;real&#8221; work, etc., so have neglected to write anything this last week.</p>
<p>Anyway, the precis on BlogCode is fairly straightforward: drop the name and URL (not the feed though) of the blog you want to code into the UI, then score the blog on a range of sliding scales covering content, tone, etc.</p>
<p>Neat, but essentially useless in and of itself.</p>
<p><span id="more-22"></span>The trick however is that the sliding scale scores allow statistical analysis to determine the blogs that match according to the criteria rated. eg <a title="blogcode.com" href="http://www.blogcode.com/lcompare.php?r=860">for this blog</a>, a range of sites are returned with a 72-78% &#8220;match&#8221;.</p>
<p>The underlying idea is that over time, as more people code a blog, the ratings will become a more accurate and democratic opinion of the blog in question.</p>
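<p>I have no idea what formula Blogcode actually uses, so treat the following purely as an illustration of how a &#8220;match&#8221; percentage could fall out of sliding scale scores. It assumes every criterion is scored from 0 to 10 (my assumption, not theirs), takes the average gap between two blogs across all criteria, and turns that into a percentage. The <code>consensus</code> helper shows the &#8220;more people coding the same blog&#8221; idea as a simple per-criterion average.</p>

```ruby
# Assumed: each slider runs from 0 to 10 and both blogs use the same criteria.
SCALE_MAX = 10.0

# Returns 100.0 for identical codings and 0.0 for codings at opposite
# ends of every slider.
def match_percentage(scores_a, scores_b)
  raise ArgumentError, 'score lists must be the same length' unless scores_a.size == scores_b.size
  raise ArgumentError, 'need at least one criterion' if scores_a.empty?

  total_gap = scores_a.zip(scores_b).sum { |a, b| (a - b).abs }
  mean_gap  = total_gap / scores_a.size.to_f
  (100.0 * (1.0 - mean_gap / SCALE_MAX)).round(1)
end

# Averages several readers' codings of one blog, criterion by criterion.
def consensus(codings)
  codings.transpose.map { |column| column.sum / column.size.to_f }
end
```

<p>For example, two blogs coded <code>[8, 2, 5, 7]</code> and <code>[6, 3, 5, 9]</code> differ by 5 points over 4 criteria, an average gap of 1.25 on a 10 point scale, which works out to an 87.5% match under this made-up formula.</p>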
<p>I noticed that they do collect the country that the blogger appears to be from, however it doesn&#8217;t seem that you can use this to narrow searches, etc.</p>
<p>The problem I can see with this system however is this: there are a LOT of categories to rank, and although it doesn&#8217;t take long with sliding scales to do this, the novelty of assessing a blog over 15+ criteria is going to get old pretty fast.</p>
<p>Additionally, there&#8217;s no real incentive to come back to the site again and again. This feature would be more useful as an element of existing blog search (eg Technorati) or built into an aggregator (eg Rojo) as a value add. Imagine you add a blog to your feeds, and get suggestions for similar blogs, with sample content so you can decide whether you want to read them too. Not to mention being able to sort blogs according to their BlogCode &#8220;match&#8221; level.<br />
Alternatively, it could serve as the basis of a meme tracker in which you can influence the results you see based on blogs you&#8217;ve coded.</p>
<p>In short, an interesting statistical experiment, but the model (which has been carried over from storycode.com), doesn&#8217;t seem to be that tight a fit with the blogging state of the art. I mean, no RSS feeds, what were they thinking? <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' />
</p>
<p><!--5115bbbcbb05cd58c683d72e2a8c44ee-->
</p>
<p><!--dbd05ebf2b3b65d97ccd0bfa1485e816-->
</p>
<p><!--d02531ea0abf12dab00c175cbf494a1e-->
</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/02/23/on-blogcode-and-missing-the-mark/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Link love for authority-dissenters</title>
		<link>http://warrenseen.com/blog/2006/02/15/link-love-for-authority-dissenters/</link>
		<comments>http://warrenseen.com/blog/2006/02/15/link-love-for-authority-dissenters/#comments</comments>
		<pubDate>Tue, 14 Feb 2006 13:26:51 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[blog]]></category>
		<category><![CDATA[technorati]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/02/15/link-love-for-authority-dissenters/</guid>
		<description><![CDATA[Quick follow up to the previous post  &#8211; I wanted to get my thoughts out before I was influenced by anyone else&#8217;s thoughts  

Steve Rubel takes the classic high school debate approach of defining the word and building an argument from there. Conclusion &#8211; yes, it&#8217;s not authority, it&#8217;s popularity, people.
Data mining agrees [...]]]></description>
			<content:encoded><![CDATA[<p>Quick follow up to the previous post  &#8211; I wanted to get my thoughts out before I was influenced by anyone else&#8217;s thoughts <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<ul>
<li><a title="What is authority?" href="http://www.micropersuasion.com/2006/02/what_is_authori.html">Steve Rubel</a> takes the classic high school debate approach of defining the word and building an argument from there. Conclusion &#8211; yes, it&#8217;s not authority, it&#8217;s popularity, people.</li>
<li><a title="Technorati, Authority, and Getting Names Right" href="http://datamining.typepad.com/data_mining/2006/02/technorati_auth.html">Data mining</a> agrees with Steve, <em>&#8220;name things for what they are, not for what they are used for&#8221;. </em>That is quite obviously right out of Usability 101.</li>
<li>Jack Krupansky leaves an excellent comment on <a href="http://scobleizer.wordpress.com/2006/02/13/technorati-adds-authority-weighting/#comment-14405">Scobelizer</a>: &#8220;<em>to Technorati, &#8220;authority&#8221; is simply popularity. That makes *no* sense.</em>&#8221;</li>
</ul>
<p>Consensus seems to be that tracking popularity but calling it &#8220;authority&#8221; muddies the waters&#8230; There is nothing wrong with the feature itself, it&#8217;s just the name that&#8217;s misleading.</p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/02/15/link-love-for-authority-dissenters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Whose authority?</title>
		<link>http://warrenseen.com/blog/2006/02/15/whose-authority/</link>
		<comments>http://warrenseen.com/blog/2006/02/15/whose-authority/#comments</comments>
		<pubDate>Tue, 14 Feb 2006 13:12:15 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[blog]]></category>
		<category><![CDATA[technorati]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/02/15/whose-authority/</guid>
		<description><![CDATA[It&#8217;s been a busy few days for me, but I&#8217;m back in time to notice this:
Technorati has just added &#8216;authority&#8217; filtering to their search (see Scoble, TechCrunch, Dave Sifry, et cetera).
My issue with this is simple &#8211; the number of inbound links does NOT necessarily qualify the source as being an &#8220;authority&#8221; on anything. This [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a busy few days for me, but I&#8217;m back in time to notice this:</p>
<p>Technorati has just added &#8216;authority&#8217; filtering to their search (see <a title="Scobelizer" href="http://scobleizer.wordpress.com/2006/02/13/technorati-adds-authority-weighting/">Scoble</a>, <a title="TechCrunch" href="http://www.techcrunch.com/2006/02/13/technorati-now-has-authority/">TechCrunch</a>, <a title="Dave Sifry" href="http://www.sifry.com/alerts/archives/000420.html">Dave Sifry</a>, et cetera).</p>
<p>My issue with this is simple &#8211; the number of inbound links does NOT necessarily qualify the source as being an &#8220;authority&#8221; on anything. This is nothing more than a popularity contest.</p>
<p>Ben Barren hits the nail on the head <a title="Bali 9 Death. Inbound Aussie/UK LinkLove Inequity" href="http://feeds.feedburner.com/blogspot/Fumd?m=3118">here</a>.</p>
<blockquote><p><em>So its very hard to determine relative popularity, on a regional basis</em></p>
</blockquote>
<p>Before pointing to this <a href="http://newyorkmetro.com/news/media/15967/index1.html">New York Metro</a> quote:</p>
<blockquote><p><em>In the blogosphere, the biggest audiences&#8212;and the advertising revenue they bring&#8212;go to a small, elite few. Most bloggers toil in total obscurity.</em></p>
</blockquote>
<p>Popularity, that&#8217;s all it is. And that&#8217;s sad, because some of the people who deserve to be held as authorities in their field are lost amongst the noise, while the &#8220;blogosphere&#8221; (gack, I HATE that &#8220;word&#8221;) becomes more and more like a conversation between a panel of &#8220;A-list&#8221; bloggers with everyone else on the sideline. Then you&#8217;ve flipped from being &#8220;citizen&#8221; media to just media.</p>
<p>Disparate thoughts to follow&#8230;</p>
<p><span id="more-19"></span></p>
<ul>
<li>Blog search the way Technorati does it is fundamentally broken. Ranking based purely on &#8220;link love&#8221; is primitive at best. Calling it &#8220;Authority&#8221; filtering is inaccurate. That&#8217;s a big call to make, but I&#8217;m going to be dumb and stand by that one&#8230;</li>
<li>It&#8217;s broken because it causes new blogs to be put in a catch 22 &#8211; if you want to get noticed, you have to get links. If you want to get links you need to get noticed in the first place.</li>
<li>Even if you can get some traffic driven to your blog from eg Technorati, how do you make it sticky? The majority of my page views reported by awstats are in the 30 second region. The next highest is the 30-60 second range. It tails off quickly after that.</li>
<li>Have you ever wondered about the great bloggers you haven&#8217;t read? How will you ever find them if no one links to them?</li>
<li>Don&#8217;t you wish you could find blogs similar to the ones you already read? Or ones that provide a counter-point? ie &#8220;Show more like this&#8230;&#8221;</li>
<li>I&#8217;m not talking about just keyword matching here, it&#8217;s a far more sophisticated problem than it appears. Memetrackers DON&#8217;T do this&#8230; bet you wish they could.</li>
<li>What if your aggregator sorted items from your feeds by relevance? Think a mutant cross between <a title="Reddit" href="http://reddit.com/">reddit</a> and <a title="amazon.com" href="http://amazon.com">amazon.com</a>&#8217;s &#8220;for you&#8221; for your feeds. Key word filtered &#8220;smart&#8221; folders are dumb by comparison.</li>
<li>What if technorati did the same? Why should I have to manually create my watch list, when technorati knows what I search for regularly? Hell, they know what I tag, that&#8217;s a BFC (big effing clue) right there&#8230;</li>
</ul>
<p>time for the (long) tail to wag the dog. <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />
</p>
<p><!--c9a0aabed165a7cbdf3760929705547c--></p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/02/15/whose-authority/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Note to self: if you look before you post&#8230;</title>
		<link>http://warrenseen.com/blog/2006/02/09/note-to-self-if-you-look-before-you-post/</link>
		<comments>http://warrenseen.com/blog/2006/02/09/note-to-self-if-you-look-before-you-post/#comments</comments>
		<pubDate>Wed, 08 Feb 2006 15:31:08 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[demo2006]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/02/09/note-to-self-if-you-look-before-you-post/</guid>
		<description><![CDATA[you might actually find a blog for Michael @ Zingee, linked from TechCrunch.   I also should probably read smh.com.au in future, not just The Age site.
Seems they travelled a similar path to one of my previous employers, via the ANZATech conference. I&#8217;ll leave it as an exercise to the reader to figure out [...]]]></description>
			<content:encoded><![CDATA[<p>you might actually find a blog for <a target="_blank" href="http://zingee.blogs.com/michael/">Michael @ Zingee</a>, linked from TechCrunch. <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  I also should probably read <a title="Aussies debut on world stage" href="http://www.smh.com.au/news/breaking/aussies-debut-on-world-stage/2006/02/06/1139074134083.html">smh.com.au</a> in future, not just The Age site.</p>
<p>Seems they travelled a similar path to one of my previous employers, via the ANZATech conference. I&#8217;ll leave it as an exercise to the reader to figure out which company that was. Hope it works out better after Demo than it did for the guys from <a target="_blank" title="NetPriva Takes Reins from Foursticks " href="http://www.impress.com.au/2005/netpriva_2005.asp#netpriva">Foursticks</a> <img src='http://warrenseen.com/blog/wp-includes/images/smilies/icon_sad.gif' alt=':-(' class='wp-smiley' /><br />
Nice work guys, keen to see the video when it&#8217;s up on the Demo website.
</p>
<p><!--3e78ca2347abdaae6a3709f7bb84b00f-->
</p>
<p><!--d16129e0e8e72210166b2cbd28e218b9-->
</p>
<p><!--a08aa3f0a223cb1c635c8c8fd8fcbbe1--></p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/02/09/note-to-self-if-you-look-before-you-post/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Zingee: stealth mode?</title>
		<link>http://warrenseen.com/blog/2006/02/09/zingee-stealth-mode/</link>
		<comments>http://warrenseen.com/blog/2006/02/09/zingee-stealth-mode/#comments</comments>
		<pubDate>Wed, 08 Feb 2006 15:17:13 +0000</pubDate>
		<dc:creator>warren</dc:creator>
				<category><![CDATA[demo2006]]></category>
		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://warrenseen.com/blog/2006/02/09/zingee-stealth-mode/</guid>
		<description><![CDATA[Why have I not heard of Zingee before? Apparently, they&#8217;re Australian and doing the whole online &#8220;storage&#8221; thing, but with a nice twist in that they don&#8217;t actually store the files (?) I guess it&#8217;s more like mediated file transfer than true online storage? As eg Hamachi is as opposed to a true VPN.
Zingee &#8211; [...]]]></description>
			<content:encoded><![CDATA[<p>Why have I not heard of <a target="_blank" title="Zingee" href="http://www.zingee.com/index.aspx">Zingee</a> before? <a target="_blank" href="http://ipioneer.typepad.com/ipioneer/2006/02/demo_day_1.html">Apparently, they&#8217;re Australian</a> and doing the whole online &#8220;storage&#8221; thing, but with a nice twist in that they don&#8217;t actually store the files (?) I guess it&#8217;s more like mediated file transfer than true online storage? As eg <a target="_blank" title="hamachi.cc" href="http://www.hamachi.cc">Hamachi</a> is as opposed to a true VPN.</p>
<blockquote><p><em>Zingee &#8211; an Australian-based company, who claims they&#8217;ve traveled the furthest to get to Demo. They allow you to share files from your computer, without any uploads. A neat way to preserve your privacy!</em></p>
</blockquote>
<p>Not just me though, apparently Michael Arrington at TechCrunch <a target="_blank" title="TechCrunch" href="http://www.techcrunch.com/2006/02/07/a-taste-of-demo-2006/">hadn&#8217;t heard of them</a> last week.</p>
<blockquote><p><em>I finally got a look at newcomer storage service Zingee, which would have been included on my &#8220;<a href="http://www.techcrunch.com/2006/01/31/the-online-storage-gang/">Online Storage Gang</a>&#8221; post if they had been around. </em></p>
</blockquote>
<p>Trying to find out a bit more, but all I can see is a couple of phone numbers on their site (only one in Sydney), and their <a target="_blank" title="demo.com" href="http://www.demo.com/demonstrators/demo2006/63051.html">profile</a> on the Demo site, which actually lists Singapore contact details.</p>
<p>This is not really surprising, Singapore is a better place to build to flip than Australia according to some accounting folks I&#8217;ve worked with.</p>
<p>In any case, this may be another local outfit to watch.
</p>
<p><!--6d2a69a8c59ed2eb167ab7a7867b180f--></p>
]]></content:encoded>
			<wfw:commentRss>http://warrenseen.com/blog/2006/02/09/zingee-stealth-mode/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
