<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Mike Perham</title>
	
	<link>http://www.mikeperham.com</link>
	<description>On Ruby, software and the Internet</description>
	<lastBuildDate>Thu, 08 Jul 2010 14:31:52 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/mikeperham" /><feedburner:info uri="mikeperham" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Onehood is Hiring!</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/326ax3Q08HM/</link>
		<comments>http://www.mikeperham.com/2010/07/01/onehood-is-hiring/#comments</comments>
		<pubDate>Thu, 01 Jul 2010 21:30:44 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=577</guid>
		<description><![CDATA[
Move to San Francisco. ✔
Found Onehood, a shiny new startup. ✔
Hire awesome technical staff. nil, oops

Onehood is a stealthy but funded startup located in downtown San Francisco.  We are absolutely looking for great people in the Ruby, Javascript and UI design world.  We&#8217;ve got the core of a great team but need people [...]]]></description>
			<content:encoded><![CDATA[<ul>
<li>Move to San Francisco. ✔</li>
<li>Found <a href="http://www.onehood.com">Onehood</a>, a shiny new startup. ✔</li>
<li>Hire awesome technical staff. <code>nil</code>, oops</li>
</ul>
<p>Onehood is a stealthy but funded startup located in downtown San Francisco.  We are absolutely looking for great people in the Ruby, Javascript and UI design world.  We&#8217;ve got the core of a great team but need people to fill out that team.</p>
<p>You:</p>
<ul>
<li>should have experience with Ruby, JavaScript and/or HTML/CSS.</li>
<li>should communicate well, be located in the Bay Area and love working in a tight-knit team setting.</li>
</ul>
<p>Send an email to jobs@onehood.com with whatever represents you best: a resume, a link to your github account, a project portfolio, whatever works best for you and we&#8217;ll get back to you.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/07/01/onehood-is-hiring/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/07/01/onehood-is-hiring/</feedburner:origLink></item>
		<item>
		<title>Detecting Duplicate Images with Phashion</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/djo1uTDDxpc/</link>
		<comments>http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/#comments</comments>
		<pubDate>Sat, 22 May 2010 03:05:29 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=556</guid>
		<description><![CDATA[Recently I was given a ticket to implement a &#8220;near-duplicate&#8221; image detector.  Look at these three images:
The original image files have different bytesizes and different sizes but they show essentially the same thing.  This is what we call a &#8220;near-duplicate&#8221; and the problem was that when displaying an automatically generated image gallery for [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I was given a ticket to implement a &#8220;near-duplicate&#8221; image detector.  Look at these three images:<br />

<a href='http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/earns-apple/' title='Earns Apple'><img width="86" height="86" src="http://www.mikeperham.com/wp-content/uploads/2010/05/86x86-0a1e.jpeg" class="attachment-thumbnail" alt="" title="Earns Apple" /></a>
<a href='http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/86x86-83d6/' title='86x86-83d6'><img width="86" height="86" src="http://www.mikeperham.com/wp-content/uploads/2010/05/86x86-83d6.jpeg" class="attachment-thumbnail" alt="" title="86x86-83d6" /></a>
<a href='http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/86x86-a855/' title='86x86-a855'><img width="86" height="86" src="http://www.mikeperham.com/wp-content/uploads/2010/05/86x86-a855.jpeg" class="attachment-thumbnail" alt="" title="86x86-a855" /></a>
<br />
The original image files have different byte sizes and different dimensions but they show essentially the same thing.  This is what we call a &#8220;near-duplicate&#8221; and the problem was that when displaying an automatically generated image gallery for a given subject, we were sometimes showing the same picture more than once because the files differed slightly.</p>
<p>Obviously we can&#8217;t use something like an MD5 or SHA1 fingerprint &#8211; we have to create a fingerprint based on the content of the image, not the exact bytes.  This is what the <a href="http://phash.org">pHash library</a> does.  A &#8220;perceptual hash&#8221; is a 64-bit value based on the discrete cosine transform of the image, i.e. its frequency spectrum data.  Similar images will have hashes that are close in terms of <a href="http://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a>.  That is, a binary hash value of 1000 is closer to 0000 than 0011 is, because 1000 differs from 0000 in only one bit whereas 0011 differs in two.  The duplicate threshold defines how many bits must be different between two hashes for the two associated images to be considered different images.  Our testing showed that 15 bits is a good value to start with: it detected all duplicates with a minimum of false positives.</p>
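<p>To make the Hamming distance comparison concrete, here is a minimal sketch in plain Ruby.  It does not use pHash or Phashion at all: the hashes are ordinary integers, and <code>hamming_distance</code> and <code>near_duplicate?</code> are hypothetical helper names chosen for illustration, with the 15-bit threshold taken from the paragraph above.</p>

```ruby
# Count the bits that differ between two non-negative integer hashes.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# Hypothetical helper: two hashes count as near-duplicates when fewer
# than `threshold` bits differ (assumed reading of the threshold above).
def near_duplicate?(a, b, threshold = 15)
  hamming_distance(a, b) < threshold
end

hamming_distance(0b1000, 0b0000) # => 1
hamming_distance(0b0011, 0b0000) # => 2
```

<p>XOR leaves a 1 in every position where the two values disagree, so counting the 1s gives the distance directly.</p>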
<p><a href="http://github.com/mperham/phashion">Phashion</a> is my new Ruby wrapper for the pHash library and wraps just enough of the pHash API to implement the described functionality.  Here&#8217;s the test in the test suite which verifies that Phashion considers the images to be duplicates:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">  <span style="color:#9966CC; font-weight:bold;">def</span> assert_duplicate<span style="color:#006600; font-weight:bold;">&#40;</span>a, b<span style="color:#006600; font-weight:bold;">&#41;</span>
    assert a.<span style="color:#9900CC;">duplicate</span>?<span style="color:#006600; font-weight:bold;">&#40;</span>b<span style="color:#006600; font-weight:bold;">&#41;</span>, <span style="color:#996600;">&quot;#{a.filename} not dupe of #{b.filename}&quot;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> test_duplicate_detection
    files = <span style="color:#006600; font-weight:bold;">%</span>w<span style="color:#006600; font-weight:bold;">&#40;</span>86x86<span style="color:#006600; font-weight:bold;">-</span>0a1e.<span style="color:#9900CC;">jpeg</span> 86x86<span style="color:#006600; font-weight:bold;">-</span>83d6.<span style="color:#9900CC;">jpeg</span> 86x86<span style="color:#006600; font-weight:bold;">-</span>a855.<span style="color:#9900CC;">jpeg</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    images = files.<span style="color:#9900CC;">map</span> <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span> <span style="color:#6666ff; font-weight:bold;">Phashion::Image</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;#{File.dirname(__FILE__) + '/../test/'}#{f}&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#125;</span>
    assert_duplicate images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span>, images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    assert_duplicate images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span>, images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">2</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    assert_duplicate images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span>, images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">2</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>pHash does have much more functionality, including video and audio support and persistent MVP tree support for similarity searches based on previously processed files, but I have not wrapped any of those APIs.  Try it out and let me know what you think!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/</feedburner:origLink></item>
		<item>
		<title>Stream Processing and “Trending” Data</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/EhtI9BIGlqY/</link>
		<comments>http://www.mikeperham.com/2010/05/05/stream-processing-and-trending-data/#comments</comments>
		<pubDate>Wed, 05 May 2010 19:01:35 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=553</guid>
		<description><![CDATA[The Britney Spears Problem is a fantastic article from American Scientist about real-time processing of streaming data to determine trends.  I love discovering clever new algorithms and the &#8220;majority algorithm&#8221; is simple, easy to implement but something you probably wouldn&#8217;t think up yourself if solving the same problem.  If you&#8217;ve ever wondered how [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.americanscientist.org/issues/id.3822,y.0,no.,content.true,page.2,css.print/issue.aspx">The Britney Spears Problem</a> is a fantastic article from American Scientist about real-time processing of streaming data to determine trends.  I love discovering clever new algorithms and the &#8220;majority algorithm&#8221; is simple, easy to implement but something you probably wouldn&#8217;t think up yourself if solving the same problem.  If you&#8217;ve ever wondered how Twitter&#8217;s trending feature is implemented, this is probably a good place to start.</p>
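<p>As a rough illustration, here is a minimal Ruby sketch of a majority algorithm, assuming the one the article describes is the classic Boyer-Moore majority vote.  The method name <code>majority_candidate</code> is made up for this example.  It reads the stream once and keeps only one candidate and one counter.</p>

```ruby
# Single pass, constant memory: returns the only element that could be a
# strict majority of the stream. If no element has a majority, the result
# is meaningless, so a second counting pass is needed to confirm it.
def majority_candidate(stream)
  candidate = nil
  count = 0
  stream.each do |item|
    if count.zero?
      candidate = item
      count = 1
    elsif item == candidate
      count += 1
    else
      count -= 1
    end
  end
  candidate
end

majority_candidate(%w[a b a c a]) # => "a"
```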
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/05/05/stream-processing-and-trending-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/05/05/stream-processing-and-trending-data/</feedburner:origLink></item>
		<item>
		<title>bayes_motel – Bayesian classification for Ruby</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/41nj3Ct_Wjc/</link>
		<comments>http://www.mikeperham.com/2010/04/28/bayes_motel-bayesian-classification-for-ruby/#comments</comments>
		<pubDate>Thu, 29 Apr 2010 01:20:17 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=540</guid>
		<description><![CDATA[Bayesian classification is an algorithm which allows us to categorize documents probabilistically.  I recently started playing with Twitter data and realized there was no Ruby gem which would allow me to build a spam detector for tweets.  The classifier gem just works on a set of text by figuring out which words appear [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Bayesian classification</a> is an algorithm which allows us to categorize documents probabilistically.  I recently started playing with Twitter data and realized there was no Ruby gem which would allow me to build a spam detector for tweets.  The <code>classifier</code> gem just works on a set of text by figuring out which words appear in a category but a tweet is much more complicated than that.  A tweet looks like this:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#006600; font-weight:bold;">&#123;</span>:text<span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;Firesale prices, too! RT @nirajc: Time to change your Facebook password. Hacker selling 1.5m accounts. http://bit.ly/dryY7&quot;</span>, 
<span style="color:#ff3333; font-weight:bold;">:truncated</span><span style="color:#006600; font-weight:bold;">=&gt;</span>false, <span style="color:#ff3333; font-weight:bold;">:created_at</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;Fri Apr 23 18:26:51 +0000 2010&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:coordinates</span><span style="color:#006600; font-weight:bold;">=&gt;</span>nil, <span style="color:#ff3333; font-weight:bold;">:geo</span><span style="color:#006600; font-weight:bold;">=&gt;</span>nil, <span style="color:#ff3333; font-weight:bold;">:favorited</span><span style="color:#006600; font-weight:bold;">=&gt;</span>false,
<span style="color:#ff3333; font-weight:bold;">:source</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;&lt;a href=<span style="color:#000099;">\&quot;</span>http://www.tweetdeck.com<span style="color:#000099;">\&quot;</span> rel=<span style="color:#000099;">\&quot;</span>nofollow<span style="color:#000099;">\&quot;</span>&gt;TweetDeck&lt;/a&gt;&quot;</span>,  <span style="color:#ff3333; font-weight:bold;">:place</span><span style="color:#006600; font-weight:bold;">=&gt;</span>nil, <span style="color:#ff3333; font-weight:bold;">:contributors</span><span style="color:#006600; font-weight:bold;">=&gt;</span>nil,
<span style="color:#ff3333; font-weight:bold;">:user</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#006600; font-weight:bold;">&#123;</span>:verified<span style="color:#006600; font-weight:bold;">=&gt;</span>false, <span style="color:#ff3333; font-weight:bold;">:profile_text_color</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;666666&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:friends_count</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#006666;">226</span>, <span style="color:#ff3333; font-weight:bold;">:created_at</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;Wed Oct 08 07:15:23 +0000 2008&quot;</span>,
<span style="color:#ff3333; font-weight:bold;">:profile_link_color</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;2FC2EF&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:favourites_count</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#006666;">12</span>, <span style="color:#ff3333; font-weight:bold;">:description</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;All the news that's fit to tweet (and most that isn't)&quot;</span>,
<span style="color:#ff3333; font-weight:bold;">:lang</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;en&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:profile_sidebar_fill_color</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;252429&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:location</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;Brooklyn, NY&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:following</span><span style="color:#006600; font-weight:bold;">=&gt;</span>nil, <span style="color:#ff3333; font-weight:bold;">:notifications</span><span style="color:#006600; font-weight:bold;">=&gt;</span>nil,
<span style="color:#ff3333; font-weight:bold;">:time_zone</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;Eastern Time (US &amp; Canada)&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:statuses_count</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#006666;">981</span>, <span style="color:#ff3333; font-weight:bold;">:profile_sidebar_border_color</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;181A1E&quot;</span>, 
<span style="color:#ff3333; font-weight:bold;">:profile_image_url</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;http://a1.twimg.com/profile_images/834612904/Photo_on_2010-04-16_at_00.38__3_normal.jpg&quot;</span>, 
<span style="color:#ff3333; font-weight:bold;">:profile_background_image_url</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;http://s.twimg.com/a/1271725794/images/themes/theme9/bg.gif&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:protected</span><span style="color:#006600; font-weight:bold;">=&gt;</span>false, 
<span style="color:#ff3333; font-weight:bold;">:contributors_enabled</span><span style="color:#006600; font-weight:bold;">=&gt;</span>false, <span style="color:#ff3333; font-weight:bold;">:url</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;http://www.aolnews.com&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:screen_name</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;carlfranzen&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:name</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;Carl Franzen&quot;</span>, 
<span style="color:#ff3333; font-weight:bold;">:profile_background_tile</span><span style="color:#006600; font-weight:bold;">=&gt;</span>false, <span style="color:#ff3333; font-weight:bold;">:profile_background_color</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#996600;">&quot;1A1B1F&quot;</span>, <span style="color:#ff3333; font-weight:bold;">:id</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#006666;">16645918</span>, <span style="color:#ff3333; font-weight:bold;">:geo_enabled</span><span style="color:#006600; font-weight:bold;">=&gt;</span>false, 
<span style="color:#ff3333; font-weight:bold;">:utc_offset</span><span style="color:#006600; font-weight:bold;">=&gt;-</span><span style="color:#006666;">18000</span>, <span style="color:#ff3333; font-weight:bold;">:followers_count</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#006666;">174</span><span style="color:#006600; font-weight:bold;">&#125;</span>, <span style="color:#ff3333; font-weight:bold;">:id</span><span style="color:#006600; font-weight:bold;">=&gt;</span><span style="color:#006666;">12717456105</span><span style="color:#006600; font-weight:bold;">&#125;</span></pre></div></div>

<p>As you can see, a tweet is just a hash of variables.  So which variables are a better indicator of spam?  I don&#8217;t know and chances are you don&#8217;t either.  But if we create a corpus of ham tweets and a corpus of spam tweets, we can train a Bayesian classifier with the two datasets and it will figure out which variable values are seen often in spam and which in ham.</p>
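<p>Here is a minimal sketch of that training idea in plain Ruby.  It is not the bayes_motel API: <code>TinyBayes</code> is a hypothetical class, it only handles flat hashes (no nested <code>:user</code> hash), and the add-one smoothing with a denominator of <code>total + 2</code> is a simplifying assumption.</p>

```ruby
# Hypothetical naive Bayes over flat hashes of variables.
class TinyBayes
  def initialize
    # category => { [variable, value] => number of times seen }
    @counts = Hash.new { |hash, category| hash[category] = Hash.new(0) }
    @totals = Hash.new(0) # category => number of training documents
  end

  def train(category, doc)
    @totals[category] += 1
    doc.each { |key, value| @counts[category][[key, value]] += 1 }
  end

  # Return the trained category with the highest log-probability score.
  def classify(doc)
    @totals.keys.max_by { |category| score(category, doc) }
  end

  private

  def score(category, doc)
    total = @totals[category].to_f
    prior = Math.log(total / @totals.values.sum)
    doc.reduce(prior) do |sum, (key, value)|
      seen = @counts[category][[key, value]]
      sum + Math.log((seen + 1) / (total + 2)) # add-one smoothing
    end
  end
end

# Made-up training data, for illustration only.
bayes = TinyBayes.new
2.times { bayes.train(:ham,  :lang => "en", :verified => true) }
2.times { bayes.train(:spam, :lang => "xx", :verified => false) }
bayes.classify(:lang => "en", :verified => true) # => :ham
```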
<p>Some variables don&#8217;t work, statistically speaking:</p>
<ul>
<li><strong>:id, :created_at</strong> &#8211; these variables are unique for each tweet which means they are useless for classification.  BayesMotel will trim any variable values that don&#8217;t appear in more than 3% of the corpus.</li>
<li><strong>:followers_count</strong> &#8211; this is probably a pretty good spam/ham indicator in general, but not as a simple number.  There are millions of possible values (@aplusk has 4.5 million followers) but we are only training on hundreds or thousands of tweets.  What would be better is the floor of the binary logarithm of the followers_count to create discrete buckets: 32-63 followers = 5, 1024-2047 = 10 and so on.  I&#8217;d bet any tweet with a value greater than 12 or so (i.e. 4096+ followers) is very likely to be ham.</li>
</ul>
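<p>The log bucketing described above can be sketched in a couple of lines.  <code>followers_bucket</code> is a hypothetical helper, not part of bayes_motel.  It uses <code>Integer#bit_length</code> (Ruby 2.1 or later) to get the floor of the binary logarithm without floating point rounding, and puts counts below 1 into bucket 0 as an arbitrary choice.</p>

```ruby
# Hypothetical helper: floor(log2(count)) as a discrete bucket number.
def followers_bucket(count)
  return 0 if count < 1 # log2 is undefined for 0, so share bucket 0
  count.bit_length - 1
end

followers_bucket(32)        # => 5
followers_bucket(63)        # => 5
followers_bucket(1024)      # => 10
followers_bucket(4_500_000) # => 22
```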
<p>There are additional things we could do to improve our spam detector:</p>
<ul>
<li>We aren&#8217;t deep inspecting the value of the tweet text.  It might be useful to have variables like &#8220;text_link_count&#8221; or &#8220;text_hashtag_count&#8221; to provide basic metrics for the tweet text content.</li>
<li>We aren&#8217;t performing any timeline checks or storing previous tweet state &#8211; spammers tend to tweet the same text over and over and their tweets all contain links.  This is beyond the scope of a generic Bayesian system.</li>
</ul>
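<p>For the first idea, a rough sketch of deriving those text metrics might look like this.  The method name <code>text_metrics</code> and both regular expressions are assumptions for illustration; real link and hashtag detection has more edge cases than this.</p>

```ruby
# Hypothetical helper: derive simple countable variables from tweet text.
def text_metrics(text)
  {
    :text_link_count    => text.scan(%r{https?://\S+}).size,
    :text_hashtag_count => text.scan(/#\w+/).size
  }
end

tweet = "Firesale prices, too! RT @nirajc: Time to change your Facebook " \
        "password. Hacker selling 1.5m accounts. http://bit.ly/dryY7"
text_metrics(tweet) # => {:text_link_count=>1, :text_hashtag_count=>0}
```

<p>The resulting hash could be merged into the tweet before training so the classifier sees the counts as ordinary variables.</p>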
<p>I wrote <a href="http://github.com/mperham/bayes_motel">bayes_motel</a> based on my research this last weekend.  Give it a try and send a pull request if you make changes you&#8217;d like to see.  The test suite gives more detail about the API and has a few thousand tweets to use as sample data.  Happy coding!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/04/28/bayes_motel-bayesian-classification-for-ruby/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/04/28/bayes_motel-bayesian-classification-for-ruby/</feedburner:origLink></item>
		<item>
		<title>Risk and Startups</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/8C_Vzn954zM/</link>
		<comments>http://www.mikeperham.com/2010/04/20/risk-and-startups/#comments</comments>
		<pubDate>Tue, 20 Apr 2010 15:25:04 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Personal]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=524</guid>
		<description><![CDATA[I&#8217;ve worked at 7-8 startups in the last 12 years, learning along the way that I love the freedom and flexibility that a small company affords.  You pay a good price for that freedom though in the form of risk: your job will be measured in terms of months and years, not decades.  [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve worked at 7-8 startups in the last 12 years, learning along the way that I love the freedom and flexibility that a small company affords.  You pay a good price for that freedom though in the form of risk: your job will be measured in terms of months and years, not decades.  My parents spent decades at their jobs working for large corporations; that kind of job security does not exist at a startup.</p>
<p><strong>An Analogy</strong></p>
<p>Risk is something that you either purposefully manage or you roll the dice with your life, sometimes literally.  I ride/race a motorcycle as my main hobby away from the computer.  Riding a moto is a risky activity and I do several things to manage that risk:</p>
<ul>
<li>Always wear a helmet, gloves and jacket</li>
<li>Ride a relatively low power bike</li>
<li>Take every MSF training course available</li>
<li>Refuse to ride in groups</li>
</ul>
<p>Do these guarantee I won&#8217;t crash?  Certainly not but I hope they will lessen the odds and minimize any damage if I do.</p>
<p><strong>Managing Risks</strong></p>
<p>As engineers, what are the risks of working at a startup?  The main risk is the company failing and going bankrupt.  A second, related risk is being laid off.  In both cases, your job and paycheck are at risk.  How do we manage those risks?  I have three tactics to manage the risk of working at a startup.</p>
<p>1) Make it as easy as possible to find a job</p>
<p>You could make yourself essential to the operation of the company; that helps with layoffs but does not help with bankruptcy and has the drawback that you will start from square one at the next startup.  My strategy has been to make myself a valuable developer, independent of any one startup, by working on open source software and maintaining a high-quality blog that evangelizes myself and my work.  This is a last-resort strategy: if anything happens to make my job disappear, ideally I can interview and find another job within days.  This recently proved successful when I announced my upcoming move to San Francisco and had 20-30 inquiries over the next few days.</p>
<p>2) Exercise common sense and your math skills</p>
<p>Do you know your startup&#8217;s monthly burn rate, cash reserves and revenue?  I&#8217;d bet that the majority of people at startups do not.  Get those numbers and figure out how many months the company has before it has no money.  Just a few months left?  Would it be difficult to raise more money?  Are you part of a &#8220;layer of fat&#8221; that could be laid off to cut the burn rate?  Is revenue rising or dropping?  Are you getting more customers?  These are questions you should be asking yourself every month to evaluate the health of your startup.  At some point you will need to leave on your own terms, before you are forced out by bankruptcy or layoffs.  I left FiveRuns last year when these questions made bankruptcy look unavoidable.  Leaving on my own terms meant I could take a few weeks to interview around to find the right job.</p>
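<p>The runway arithmetic is simple enough to sketch.  This assumes the burn rate is gross monthly spending and that burn and revenue stay flat; <code>runway_months</code> is a hypothetical helper and the dollar figures are invented for the example.</p>

```ruby
# Hypothetical helper: months of cash left at the current net burn.
def runway_months(cash_reserves, monthly_burn, monthly_revenue)
  net_burn = monthly_burn - monthly_revenue
  return Float::INFINITY if net_burn <= 0 # revenue already covers spending
  cash_reserves.to_f / net_burn
end

runway_months(600_000, 150_000, 50_000) # => 6.0
```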
<p>3) Stick with Success</p>
<p>They say failure is the best way to learn but in my experience nothing breeds success more than previous success.  I try to stick with entrepreneurs that have past successes.  As developers, we want to work with smart developers, yes, but we also want to work with great business guys who have a network of contacts, know how to raise funding and can navigate the company to a successful exit.  I can interview a person to learn if they are a good developer but I can&#8217;t interview a CEO to learn if they are a good CEO.  I have only two metrics:</p>
<ul>
<li>do they have a reasonable business plan with a way to make money?</li>
<li>have they had previous startup successes?</li>
</ul>
<p>The &#8220;halo&#8221; effect is very real.  VCs are more willing to talk to someone who has previous success and knows the funding process.  People are more willing to work at a company run by someone with previous success.  Press is easier to get and customers are easier to talk to if they already know the company as the latest effort by a successful entrepreneur.</p>
<p>4) Educate yo&#8217;self (Extra bonus tip!)</p>
<p>You may know computer science but how much do you know about management or finance?  Read a management book.  I recommend anything by Peter Drucker &#8211; he literally invented the science of management and his writing really opened my eyes.  Read a book on business finance.  You&#8217;re not trying to become an expert in these fields but when you learn a little bit about the other major roles in a startup, you&#8217;ll be able to evaluate your startup&#8217;s current situation more accurately.</p>
<p>Even with all this, you will fail often.  I&#8217;ve been part of two moderately successful exits and several bankruptcies.  I&#8217;ve only been caught flat-footed once and tried to learn as much as I could from that experience.  No matter what happens the startup experience is rewarding but with a little foresight you can minimize the inevitable risk to yourself and your livelihood.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/04/20/risk-and-startups/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/04/20/risk-and-startups/</feedburner:origLink></item>
		<item>
		<title>Phat News</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/tML9XPFionI/</link>
		<comments>http://www.mikeperham.com/2010/04/06/phat-news/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 14:47:03 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=521</guid>
		<description><![CDATA[Gregg and Nathaniel (both of whom are notorious Gowalla cheats, which I would never do, no sir) chat a bit about Phat in the latest episode of Ruby5.
The Changelog crew also gave their take on Phat in a recent posting.
I&#8217;ve spent 100s of hours working on the technology behind Phat over the last six months. [...]]]></description>
			<content:encoded><![CDATA[<p>Gregg and Nathaniel (both of whom are notorious Gowalla cheats, which I would never do, no sir) chat a bit about Phat in the <a href="http://ruby5.envylabs.com/episodes/67-episode-64-april-2-2010">latest episode of Ruby5</a>.</p>
<p>The Changelog crew also gave <a href="http://thechangelog.com/post/494315826/phat-scale-rails-with-single-thread-multiple-fiber-ruby">their take on Phat</a> in a recent posting.</p>
<p>I&#8217;ve spent 100s of hours working on the technology behind Phat over the last six months.  If you think it&#8217;s awesome, please consider <a href="http://workingwithrails.com/person/10797-mike-perham">recommending me on Working with Rails</a>.  I&#8217;m not asking for money, just an electronic thumbs up from my fellow Ruby community members.  Thanks!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/04/06/phat-news/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/04/06/phat-news/</feedburner:origLink></item>
		<item>
		<title>Introducing Phat, an Asynchronous Rails app</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/RI3Z-eX7xTg/</link>
		<comments>http://www.mikeperham.com/2010/04/03/introducing-phat-an-asynchronous-rails-app/#comments</comments>
		<pubDate>Sat, 03 Apr 2010 22:51:37 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Rails]]></category>
		<category><![CDATA[eventmachine]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=509</guid>
		<description><![CDATA[Phat is my new Rails 2.3.5 application which runs 100% asynchronous, supporting many concurrent requests in a single Ruby process.
This is a new breed of Rails application which uses a new mode of execution available in Ruby 1.9: single Thread, multiple Fiber.  Existing modes of execution suck:

Single thread harkens back to the days of [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://github.com/mperham/phat">Phat</a> is my new Rails 2.3.5 application which runs 100% asynchronous, supporting many concurrent requests in a single Ruby process.</p>
<p>This is a new breed of Rails application which uses a new mode of execution available in Ruby 1.9: single Thread, multiple Fiber.  Existing modes of execution suck:</p>
<ul>
<li>Single thread harkens back to the days of Rails 1.x, where you started N mongrels to handle up to N concurrent requests.</li>
<li>Multiple threads are better but still have fundamental issues in Ruby.  <a href="http://redmine.ruby-lang.org/issues/show/921">Autoloading is simply broken</a> and Ruby&#8217;s thread implementation does not scale across cores because the GIL lets only one thread run Ruby code at a time.</li>
</ul>
<p>Here&#8217;s a sample action which uses memcached and the database.  There&#8217;s nothing odd here &#8211; it&#8217;s the same old Rails API and codebase we are used to as Ruby developers.  It just executes differently under the covers.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#9966CC; font-weight:bold;">class</span> HelloController <span style="color:#006600; font-weight:bold;">&lt;</span> ApplicationController
  <span style="color:#9966CC; font-weight:bold;">def</span> world
    site_ids = Rails.<span style="color:#9900CC;">cache</span>.<span style="color:#9900CC;">fetch</span> <span style="color:#996600;">'site_ids'</span>, <span style="color:#ff3333; font-weight:bold;">:expires_in</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> 1.<span style="color:#9900CC;">minute</span> <span style="color:#9966CC; font-weight:bold;">do</span>
      Site.<span style="color:#9900CC;">all</span>.<span style="color:#9900CC;">map</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&amp;</span>:id<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
    render <span style="color:#ff3333; font-weight:bold;">:text</span> <span style="color:#006600; font-weight:bold;">=&gt;</span> site_ids
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>How does it work?  If you want the nitty-gritty, <a href="/2010/01/27/scalable-ruby-processing-with-eventmachine/">watch my talk on EventMachine and Fibers</a>.  Everything that does network access ideally should be modified to be Fiber-aware.  I&#8217;ve updated many gems to be Fiber-aware: <a href="http://github.com/mperham/memcache-client">memcache-client</a>, <a href="http://github.com/mperham/em_postgresql">em_postgresql</a> (and activerecord), cassandra, bunny and rsolr to name a few.  You&#8217;ll also need to run thin as your app server, since all of this code assumes it is executing within EventMachine.</p>
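<p>To make &#8220;Fiber-aware&#8221; concrete, here is a minimal sketch in plain Ruby with no EventMachine involved.  <code>FakeReactor</code> and <code>async_fetch</code> are hypothetical names made up for illustration, not part of any of the gems above; a real library would register the callback with EventMachine instead of a toy queue.</p>

```ruby
require 'fiber'

# Hypothetical stand-in for an event loop: it queues callbacks and
# runs them later. EventMachine plays this role in a real app.
class FakeReactor
  def initialize
    @pending = []
  end

  # Schedule a callable to run on a later pass of the loop.
  def defer(callback)
    @pending.push(callback)
  end

  # Run queued callbacks until none remain.
  def run
    @pending.shift.call until @pending.empty?
  end
end

# Looks synchronous to the caller, but parks the current Fiber until
# the simulated reply arrives, leaving the thread free for other work.
def async_fetch(reactor, key)
  fiber = Fiber.current
  reactor.defer(lambda { fiber.resume("value for #{key}") })
  Fiber.yield
end

reactor = FakeReactor.new
results = []

# Start three "requests", each in its own Fiber.
%w[a b c].each do |key|
  Fiber.new { results.push(async_fetch(reactor, key)) }.resume
end

# All three requests are in flight and none has completed yet.
in_flight_results = results.size

# Deliver the replies; each one resumes the Fiber that was waiting.
reactor.run
```

<p>The calling code never sees a callback: <code>async_fetch</code> simply returns a value, which is why the controller action above can stay unchanged.</p>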
<p>Additionally we need to ensure that each request runs in its own Fiber.  My new gem, <a href="http://github.com/mperham/rack-fiber_pool">rack-fiber_pool</a>, will do this for you; just add it as Rack middleware in <code>config/environment.rb</code>.  Here&#8217;s the basic configuration:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;"># Asynchronous DNS lookup</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'em-resolv-replace'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'rack/fiber_pool'</span>
<span style="color:#008000; font-style:italic;"># Pull in the evented memcache-client.</span>
<span style="color:#008000; font-style:italic;"># You'll need to configure config.cache_store as normal.</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'memcache/event_machine'</span>
&nbsp;
<span style="color:#6666ff; font-weight:bold;">Rails::Initializer</span>.<span style="color:#9900CC;">run</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>config<span style="color:#006600; font-weight:bold;">|</span>
  config.<span style="color:#9900CC;">cache_store</span> = <span style="color:#ff3333; font-weight:bold;">:mem_cache_store</span>
  <span style="color:#008000; font-style:italic;"># Run each request in a Fiber</span>
  config.<span style="color:#9900CC;">middleware</span>.<span style="color:#9900CC;">use</span> <span style="color:#6666ff; font-weight:bold;">Rack::FiberPool</span>
  <span style="color:#008000; font-style:italic;"># Get rid of Rack::Lock so we don't kill our concurrency</span>
  config.<span style="color:#9900CC;">threadsafe</span>!
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>Additionally we need to <a href="http://github.com/mperham/phat/blob/master/config/database.yml">configure Postgresql</a> and <a href="http://github.com/mperham/phat/blob/master/config/initializers/disable_locking.rb">disable ActionController&#8217;s reloader mutex</a> as it really doesn&#8217;t like fibered execution.  This is ok because remember &#8211; there&#8217;s only a single thread executing in our process!</p>
<p>With that done, we can try some tests to see how we scale now.  EventMachine works best when you have significant network latency.  A simple test with database access over coffeeshop WiFi:</p>
<blockquote><p>
    Without EventMachine:<br />
    Requests per second:    4.39 [#/sec] (mean)</p>
<p>    With EventMachine:<br />
    Requests per second:    21.31 [#/sec] (mean)
</p></blockquote>
<p>That&#8217;s it!  There&#8217;s no magic here: you can make your Rails app a &#8220;phat&#8221; app by following the guidelines above.  Fire up one thin instance per processor/core, put nginx in front of it and it should scale like crazy!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/04/03/introducing-phat-an-asynchronous-rails-app/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/04/03/introducing-phat-an-asynchronous-rails-app/</feedburner:origLink></item>
		<item>
		<title>Using ActiveRecord with EventMachine</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/N-EhAlimJ44/</link>
		<comments>http://www.mikeperham.com/2010/03/30/using-activerecord-with-eventmachine/#comments</comments>
		<pubDate>Tue, 30 Mar 2010 05:25:14 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Rails]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[eventmachine]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=494</guid>
		<description><![CDATA[Given all my work with Fibers and EventMachine over the last three months, it should come as no surprise that I&#8217;ve been working on infrastructure based on Fibers and EventMachine to get maximum scalability without the callback style of code which I dislike for many reasons.  Watch my talk on scaling with EventMachine if [...]]]></description>
			<content:encoded><![CDATA[<p>Given all my work with Fibers and EventMachine over the last three months, it should come as no surprise that I&#8217;ve been working on infrastructure based on Fibers and EventMachine to get maximum scalability without the callback style of code which I dislike for many reasons.  <a href="/2010/01/27/scalable-ruby-processing-with-eventmachine/">Watch my talk on scaling with EventMachine</a> if you need more background on the problem.</p>
<p>Now that I have RabbitMQ, Cassandra, Solr and the Amazon AWS services evented, the only holdup was ActiveRecord.  Some people may advocate using another ORM layer but when you have 2-3 other Rails apps, all sharing 100+ models, you can&#8217;t afford to maintain two separate ORM layers.  Plus, frankly I like the Rails stack: it works pretty well, is thoroughly documented and every Ruby developer is familiar with it.</p>
<p>So what do we need to do to get AR working event-style?  At a high level, there are two things required:</p>
<ul>
<li>The database driver itself must be modified to send SQL asynchronously.  The postgresql driver, for instance, calls the <code>exec(sql)</code> method for all traffic to the database.  So we just need to provide an exec method which uses Fibers under the covers to work asynchronously.</li>
<li>AR&#8217;s connection pooling needs to be Fiber-safe.  Out of the box, it is Thread-safe.  Since we are using an execution model based on a single Thread with multiple Fibers, all the Fibers would try to use the same connection, with disastrous consequences.</li>
</ul>
<p>These are the things that em_postgresql does.</p>
<ul>
<li><a href="http://github.com/mperham/em_postgresql/blob/master/lib/postgres_connection.rb">postgres_connection</a> is a basic, EM-aware Postgres driver.  It provides the Fibered <code>exec()</code> method which makes the whole thing asynchronous.</li>
<li><a href="http://github.com/mperham/em_postgresql/blob/master/lib/active_record/connection_adapters/em_postgresql_adapter.rb">em_postgresql_adapter.rb</a> wraps postgres_connection to make it a proper ActiveRecord driver.</li>
<li><a href="http://github.com/mperham/em_postgresql/blob/master/lib/active_record/patches.rb">patches.rb</a> overrides a bunch of AR&#8217;s internal connection pooling to make it Fiber-friendly.</li>
</ul>
<p>Unfortunately the latter makes one hack necessary &#8211; we have to have a list of current Fibers to release any lingering connections associated with those Fibers.  The Threaded version can use <code>Thread.list</code> but Ruby does not provide an equivalent method for Fibers.  Instead I require the application to register a FiberPool with AR to clear stale connections.</p>
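<p>As a rough sketch of the idea (not the actual code in patches.rb), a pool can key its reserved connections on <code>Fiber.current</code> instead of <code>Thread.current</code>.  <code>FiberKeyedPool</code> and its methods are hypothetical names, and the &#8220;connections&#8221; are just labelled strings.</p>

```ruby
require 'fiber'

# Hypothetical pool that reserves one connection per Fiber rather
# than one per Thread.
class FiberKeyedPool
  def initialize(size)
    @available = (1..size).map { |n| "conn-#{n}" }
    @reserved = {} # Fiber => connection
  end

  # Return the connection owned by the current Fiber, checking one
  # out on first use.
  def connection
    @reserved[Fiber.current] ||= checkout
  end

  # Reclaim connections held by Fibers that are no longer registered.
  # Ruby has no Fiber.list, so the caller passes in the Fibers it
  # still considers active.
  def clear_stale(active_fibers)
    stale = @reserved.keys.reject { |fiber| active_fibers.include?(fiber) }
    stale.each { |fiber| @available.push(@reserved.delete(fiber)) }
  end

  def available_count
    @available.size
  end

  private

  def checkout
    raise 'pool exhausted' if @available.empty?
    @available.shift
  end
end

pool = FiberKeyedPool.new(2)
seen = []

first = Fiber.new do
  seen.push(pool.connection)
  Fiber.yield                 # pause mid-request, as if waiting on SQL
  seen.push(pool.connection)
end
second = Fiber.new { seen.push(pool.connection) }

first.resume   # first Fiber reserves a connection and pauses
second.resume  # second Fiber gets its own connection
first.resume   # first Fiber still sees the connection it reserved

# Neither Fiber is registered as active any more, so both are stale.
pool.clear_stale([])
```

<p>With a Thread-keyed pool, both Fibers would have been handed the same connection while the first one was still waiting on its query.</p>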
<p>So what does it all mean?  Well, here&#8217;s <a href="http://github.com/mperham/em_postgresql/blob/master/examples/app.rb">a Sinatra application</a> that uses plain old ActiveRecord and <strong>is completely asynchronous</strong>!  Try <code>ab -n 100 -c 20 http://localhost:9292/test</code> to hit the app with 20 concurrent connections; it will process them all concurrently, without any painful threading issues (autoloading, misbehaving extensions, etc.).  Awesome!</p>
<p>You can probably guess what&#8217;s next.  Coming soon: the whole Rails stack, running asynchronously&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/03/30/using-activerecord-with-eventmachine/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/03/30/using-activerecord-with-eventmachine/</feedburner:origLink></item>
		<item>
		<title>Cassandra Internals – Tricks!</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/cza_aB9qJ6Q/</link>
		<comments>http://www.mikeperham.com/2010/03/20/cassandra-internals-tricks/#comments</comments>
		<pubDate>Sat, 20 Mar 2010 16:59:16 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[cassandra]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=478</guid>
		<description><![CDATA[In my previous posts, I covered how Cassandra reads and writes data.  In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.
Gossip
Cassandra is a cluster of individual nodes &#8211; there&#8217;s no &#8220;master&#8221; node or single point of failure &#8211; so each node must actively [...]]]></description>
			<content:encoded><![CDATA[<p>In my previous posts, I covered how Cassandra <a href="/2010/03/17/cassandra-internals-reading/">reads</a> and <a href="/2010/03/13/cassandra-internals-writing/">writes</a> data.  In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.</p>
<p><strong>Gossip</strong></p>
<p>Cassandra is a cluster of individual nodes &#8211; there&#8217;s no &#8220;master&#8221; node or single point of failure &#8211; so each node must actively verify the state of the other cluster members.  They do this with a mechanism known as <a href="http://wiki.apache.org/cassandra/ArchitectureGossip">gossip</a>.  Each node &#8216;gossips&#8217; to 1-3 other nodes every second about the state of each node in the cluster.  The gossip data is versioned so that any change for a node will quickly propagate throughout the entire cluster.  In this way, every node will know the current state of every other node: whether it is bootstrapping, running normally, etc. </p>
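<p>The versioning idea can be sketched in a few lines of Ruby.  This is a deliberately simplified illustration, not Cassandra&#8217;s actual gossip protocol: the hash layout and the <code>merge_gossip</code> name are assumptions of mine.</p>

```ruby
# Merge a peer's view of the cluster into our own. Each entry maps a
# node name to { :version => Integer, :status => Symbol }; for any
# node, the entry with the higher version wins.
def merge_gossip(local, remote)
  merged = local.dup
  remote.each do |node, state|
    known = merged[node]
    merged[node] = state if known.nil? || state[:version] > known[:version]
  end
  merged
end

view_on_a = {
  'a' => { :version => 5, :status => :normal },
  'b' => { :version => 2, :status => :bootstrapping }
}
view_on_b = {
  'b' => { :version => 3, :status => :normal },
  'c' => { :version => 1, :status => :normal }
}

# After one exchange, node a learns that b finished bootstrapping
# and that node c exists.
merged_view = merge_gossip(view_on_a, view_on_b)
```

<p>Because newer versions always replace older ones, repeating this exchange with a few peers each second is enough to spread a change across the cluster.</p>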
<p><strong>Hinted Handoff</strong></p>
<p>In <a href="/2010/03/13/cassandra-internals-writing/">writing</a>, I mentioned that Cassandra stores a copy of the data on N nodes.  The client can select a consistency level for a write based on the importance of the data &#8211; for example, ConsistencyLevel.QUORUM means that a majority of those N nodes must reply success for the write to be considered successful.</p>
<p>What happens if one of those nodes goes down?  How do those writes propagate to that node later?  Cassandra uses a technique known as <a href="http://wiki.apache.org/cassandra/HintedHandoff">hinted handoff</a>, where the data is written to another random node X to be stored and replayed for node Y when it comes back online (remember that gossip will quickly tell X when Y comes online).  Hinted handoff ensures that node Y will quickly match the rest of the cluster.  Note that read repair would still eventually &#8220;fix&#8221; the old data if hinted handoff did not work for some reason but only once the client asked for that data.</p>
<p>Hinted writes are not readable (since node X is not officially one of the N copies) so they don&#8217;t count toward write consistency.  If Cassandra is configured for three copies and two of those nodes are down, it would be impossible to fulfill a ConsistencyLevel.QUORUM write.</p>
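<p>The arithmetic behind that example is easy to check.  Here is a small sketch; the method names are mine, not part of any Cassandra client.</p>

```ruby
# Replicas that must acknowledge a QUORUM write: a strict majority
# of the N copies.
def quorum(replication_factor)
  replication_factor / 2 + 1
end

# A QUORUM write can succeed only if enough of the real replicas are
# up. Hinted writes to stand-in nodes do not count.
def quorum_write_possible?(replication_factor, replicas_down)
  live_replicas = replication_factor - replicas_down
  live_replicas >= quorum(replication_factor)
end

one_down = quorum_write_possible?(3, 1)
two_down = quorum_write_possible?(3, 2)
```

<p>With three copies the quorum is two, so losing one replica is survivable but losing two is not, no matter how many hints are stored elsewhere.</p>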
<p><strong>Anti-Entropy</strong></p>
<p>The final trick up Cassandra&#8217;s proverbial sleeve is <a href="http://wiki.apache.org/cassandra/ArchitectureAntiEntropy">anti-entropy</a>.  AE explicitly ensures that the nodes in the cluster agree on the current data.  If read repair or hinted handoff don&#8217;t work due to some set of circumstances, the AE service will ensure that nodes reach eventual consistency.  The AE service runs during &#8220;major compactions&#8221; (the equivalent of rebuilding a table in an RDBMS) so it is a relatively heavyweight process that runs infrequently.  AE uses a <a href="http://en.wikipedia.org/wiki/Hash_tree">Merkle Tree</a> to determine where within the tree of column family data the nodes disagree and then repairs each of those branches.</p>
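<p>To show why a Merkle tree makes this comparison cheap, here is a toy sketch in Ruby.  It is not how Cassandra lays out its trees: the row strings, the method names and the assumption that both replicas hold the same number of rows are all mine.</p>

```ruby
require 'digest'

# Build a Merkle tree as an array of levels: levels[0] holds the
# leaf hashes and the last level holds the single root hash.
def merkle_levels(rows)
  level = rows.map { |row| Digest::SHA256.hexdigest(row) }
  levels = [level]
  while level.size > 1
    level = level.each_slice(2).map { |pair| Digest::SHA256.hexdigest(pair.join) }
    levels.push(level)
  end
  levels
end

# Walk down from the root, descending only into subtrees whose
# hashes differ, and return the indexes of the leaves that disagree.
# Assumes both trees were built from the same number of rows.
def differing_leaves(a, b, depth = a.size - 1, index = 0)
  return [] if a[depth][index] == b[depth][index]
  return [index] if depth.zero?
  [2 * index, 2 * index + 1].flat_map do |child|
    next [] if a[depth - 1][child].nil?
    differing_leaves(a, b, depth - 1, child)
  end
end

replica_one = merkle_levels(%w[row0 row1 row2 row3 row4])
replica_two = merkle_levels(%w[row0 row1 row2 CHANGED row4])

out_of_sync = differing_leaves(replica_one, replica_two)
```

<p>Matching subtrees are skipped after a single hash comparison, so two replicas that mostly agree only need to exchange data for the few branches that differ.</p>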
<p>This is the last post in my series on Cassandra.  I hope you enjoyed them!  Please leave a comment if you have questions or if I&#8217;ve made an error above.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/03/20/cassandra-internals-tricks/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/03/20/cassandra-internals-tricks/</feedburner:origLink></item>
		<item>
		<title>Ruby Open Files</title>
		<link>http://feedproxy.google.com/~r/mikeperham/~3/YnAi7InKKRU/</link>
		<comments>http://www.mikeperham.com/2010/03/19/ruby-open-files/#comments</comments>
		<pubDate>Fri, 19 Mar 2010 16:57:41 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=461</guid>
		<description><![CDATA[Get the number of open files for each of your Ruby processes:

sudo lsof &#124; grep ruby &#124; ruby -e 'h=Hash.new(0);$&#60;.each_line {&#124;line&#124; h[line.split[1]] += 1};p h'

Example output:

{"3268"=>808, "4513"=>399, "4795"=>237, "5067"=>178, "5083"=>16, "23751"=>108}

]]></description>
			<content:encoded><![CDATA[<p>Get the number of open files for each of your Ruby processes:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #c20cb9; font-weight: bold;">sudo</span> lsof <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">grep</span> ruby <span style="color: #000000; font-weight: bold;">|</span> ruby <span style="color: #660033;">-e</span> <span style="color: #ff0000;">'h=Hash.new(0);$&lt;.each_line {|line| h[line.split[1]] += 1};p h'</span></pre></div></div>

<p>Example output:<br />
<code><br />
{"3268"=>808, "4513"=>399, "4795"=>237, "5067"=>178, "5083"=>16, "23751"=>108}<br />
</code></p>
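<p>If the one-liner is hard to read, here is the same counting logic spelled out.  The sample lines are made up to resemble <code>lsof</code> output, where the PID is the second whitespace-separated column.</p>

```ruby
# Count lines per PID, taking the PID from the second column.
def open_files_by_pid(lines)
  counts = Hash.new(0) # missing keys start at zero
  lines.each { |line| counts[line.split[1]] += 1 }
  counts
end

sample_lines = [
  'ruby 3268 mike cwd DIR 8,1 4096 2 /',
  'ruby 3268 mike txt REG 8,1 9000 3 /usr/bin/ruby',
  'ruby 4513 mike cwd DIR 8,1 4096 2 /'
]

counts = open_files_by_pid(sample_lines)
p counts
```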
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/03/19/ruby-open-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.mikeperham.com/2010/03/19/ruby-open-files/</feedburner:origLink></item>
	</channel>
</rss>
