<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Kitchen Soap</title>
	
	<link>http://www.kitchensoap.com</link>
	<description>Thoughts on capacity planning and web operations.</description>
	<lastBuildDate>Fri, 09 Oct 2009 04:01:24 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/KitchenSoap" type="application/rss+xml" /><feedburner:browserFriendly></feedburner:browserFriendly><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
		<title>When you deploy: your internal monologue</title>
		<link>http://www.kitchensoap.com/2009/10/07/when-you-deploy-your-internal-monologue/</link>
		<comments>http://www.kitchensoap.com/2009/10/07/when-you-deploy-your-internal-monologue/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 22:22:33 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=318</guid>
		<description><![CDATA[The minimum cycle of questions you should be asking yourself. As brought up by @debuggist and @benjaminblack.

]]></description>
			<content:encoded><![CDATA[<p>The minimum cycle of questions you should be asking yourself. As brought up by <a href="http://twitter.com/debuggist" target="_blank">@debuggist</a> and <a href="http://twitter.com/benjaminblack" target="_blank">@benjaminblack</a>.</p>
<p><a href="http://www.kitchensoap.com/wp-content/uploads/2009/10/InternalMonologue.png"><img class="alignnone size-full wp-image-319" style="border: 1px solid black;" title="What you might want to ask yourself before you deploy changes to production?" src="http://www.kitchensoap.com/wp-content/uploads/2009/10/InternalMonologue.png" alt="What you might want to ask yourself before you deploy changes to production?" width="724" height="547" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/10/07/when-you-deploy-your-internal-monologue/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Meanwhile: More Meta-Metrics</title>
		<link>http://www.kitchensoap.com/2009/10/05/meanwhile-more-meta-metrics/</link>
		<comments>http://www.kitchensoap.com/2009/10/05/meanwhile-more-meta-metrics/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 17:50:26 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Tools]]></category>
		<category><![CDATA[WebOps]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=292</guid>
		<description><![CDATA[Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as&#8230;
What:
&#8230;did we do before (historical trending, etc)
&#8230;is going [...]]]></description>
			<content:encoded><![CDATA[<p>Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as&#8230;</p>
<p>What:</p>
<p style="padding-left: 30px;">&#8230;did we do before (historical trending, etc)<br />
&#8230;is going on right now? (troubleshooting, health, etc.)<br />
&#8230;is coming down the road (capacity planning, new feature adoption, etc.)<br />
&#8230;can we do to make things better (business intelligence, user-behavior, etc.)</p>
<p>All of which, of course, should be considered mandatory in order to help your business increase its awesome. Yay metrics!</p>
<p>Some time ago, Matthias wrote a great <a title="Agile Web Operations" href="http://www.agileweboperations.com/visible-ops-continuous-improvement/" target="_blank">blog post</a> about some of the metrics that can reasonably profile the effectiveness of web operations, taken from the <a title="VisibleOps" href="http://www.itpi.org/home/visibleops.php" target="_blank">ITIL primer, VisibleOps</a>.</p>
<p>In my opinion, there&#8217;s nothing on that list of things that isn&#8217;t valuable, as long as the cost of gathering those metrics isn&#8217;t too behaviorally, technically, or organizationally expensive. The topics included in that list of metrics and the context they live in are fodder for many, many blog posts.</p>
<p>But in the category of historical trending, I&#8217;m more and more fascinated by gathering what I&#8217;ll call &#8220;meta-metrics&#8221;, which is data about how you respond to the changes your system is experiencing.</p>
<p>One of the best examples of this is gathering information about operational disruptions. Collecting information about how many times your on-call rotation was alerted/paged/woken-up, during what times, and for what service(s) can be enlightening to say the least.  We&#8217;ve been tracking the volume of alerts a lot more closely recently, and even with the level of automation we&#8217;ve got at Flickr, it&#8217;s still something you have to keep on top of, especially if you&#8217;re always finding new things to measure and alert on.</p>
<p>Now ideally, you have an alerting system that only communicates conditions that need resolvable action by a human. Which means every alert is critically important, and you&#8217;re not ignoring or dismissing any pages for any reasons that sound like <em>&#8220;oh, that&#8217;s ok, that cluster always does that&#8230;it&#8217;ll clear up, I&#8217;ll just acknowledge the page so I can shut up nagios.&#8221;</em> In other words, our goal is to have a zero-noise alerting system. Which means that <em>all</em> alerts are actionable, not ignorable, and require a human to troubleshoot or fix. Over time, you push as much of this work as you can to the robots. In the meantime, save humans for the yet-to-be-automated work, or the stuff that isn&#8217;t easily captured by robots.</p>
<p>Why is this important to us? I may be stating the obvious, but it&#8217;s because interrupting humans with alerts that don&#8217;t require action has a mental and physical context switching cost (especially if the guy on-call was sleeping), and it increases the likelihood of missing a truly critical page in a slew of non-critical ones.</p>
<p>Of course in the reality of evolving and growing web applications, even if we could reach a 100% noise-free alerting system, it&#8217;s impossible to sustain for any extended period of time, because your application, usage, and failure modes are constantly changing. So in the meantime, knowing how your alerts affect the team is a worthwhile thing to do for us. In fact, I think it&#8217;s so important that it&#8217;s worth collecting and displaying next to the rest of your metrics, and exposing these metrics to the entire dev and ops groups.</p>
<p>Something like this: (made-up numbers)</p>
<div id="attachment_295" class="wp-caption alignnone" style="width: 300px">
	<a href="http://www.kitchensoap.com/wp-content/uploads/2009/10/Alerts-Mockup.png"><img class="size-medium wp-image-295" title="Tracking Critical Alerts" src="http://www.kitchensoap.com/wp-content/uploads/2009/10/Alerts-Mockup-300x206.png" alt="Tracking Critical Alerts " width="300" height="206" /></a>
	<p class="wp-caption-text">Tracking Critical Alerts </p>
</div>
<p>Gathering up info about these alerts should give us a better perspective on where we can improve. So, things like:</p>
<ul>
<li> How many critical alerts are sent on a daily/hourly/weekly basis?</li>
<li> What does a time histogram of the alerts look like? Do you get more or fewer alerts during nighttime or non-peak hours?</li>
<li>How much (if any) correlation is there between critical alerts and:</li>
</ul>
<blockquote style="padding-left: 30px;"><p>- code deploys?<br />
- software upgrades?<br />
- feature launches?<br />
- open API abuse?</p></blockquote>
<ul>
<li> What does a breakdown of the alerts look like, in terms of: host type, service type, and frequency of each in a given time period?</li>
</ul>
<p>and maybe the most important ones:</p>
<ul>
<li> How many of those alerts aren&#8217;t actually critical or don&#8217;t demand human attention?</li>
<li> How many of them always self-recover?</li>
<li> How many (and which) don&#8217;t matter in their role context (like, a single node in a load-balanced cluster) and could be turned into an aggregate check?</li>
</ul>
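<p>To make those questions concrete, here is a minimal Python sketch of the kind of crunching involved. It is not the tooling we run at Flickr: the record layout (time sent, service, whether the alert cleared on its own), the sample data, and every function name are assumptions made up for illustration.</p>

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical alert records: (time sent, service, cleared without human action?)
ALERTS = [
    (datetime(2009, 10, 1, 3, 12), "mysql", True),
    (datetime(2009, 10, 1, 3, 40), "memcached", False),
    (datetime(2009, 10, 1, 14, 5), "mysql", True),
    (datetime(2009, 10, 2, 3, 55), "webserver", False),
]


def alerts_by_hour(alerts):
    """Count alerts per hour of day (0-23), i.e. a simple time histogram."""
    return Counter(sent.hour for sent, _service, _recovered in alerts)


def alerts_by_service(alerts):
    """Count alerts per service type."""
    return Counter(service for _sent, service, _recovered in alerts)


def self_recovery_ratio(alerts):
    """Fraction of alerts that cleared without anyone touching anything."""
    if not alerts:
        return 0.0
    recovered = sum(1 for _sent, _service, ok in alerts if ok)
    return recovered / len(alerts)


def alerts_near_deploys(alerts, deploys, window=timedelta(minutes=30)):
    """Count alerts sent within `window` after any deploy time."""
    return sum(
        1
        for sent, _service, _recovered in alerts
        if any(window >= sent - deployed >= timedelta(0) for deployed in deploys)
    )


if __name__ == "__main__":
    print(alerts_by_hour(ALERTS))
    print(alerts_by_service(ALERTS))
    print(self_recovery_ratio(ALERTS))
    print(alerts_near_deploys(ALERTS, [datetime(2009, 10, 1, 3, 0)]))
```

<p>A high self-recovery ratio for one service is a hint that its check is noise and might be better off as an aggregate check. The 30-minute deploy window is an arbitrary choice, and counting alerts near deploys only suggests where to look; it does not show that a deploy caused anything.</p>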
<p>We&#8217;ve built our own stuff to track and analyze these things. My question to the community is: I&#8217;m not aware of any open-source tool that is dedicated to analyzing these metrics. Does one exist? Nagios obviously has host/hostgroup/cluster warning and critical histories, and those can be crunched to find critical alert statistics, but I&#8217;m not aware of any comprehensive crunching. Of course, until I find one, we&#8217;re just building our own.</p>
<p>Thoughts, lazyweb?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/10/05/meanwhile-more-meta-metrics/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>WebOps: Good prep for becoming a new parent?</title>
		<link>http://www.kitchensoap.com/2009/09/29/webops-good-prep-for-becoming-a-new-parent/</link>
		<comments>http://www.kitchensoap.com/2009/09/29/webops-good-prep-for-becoming-a-new-parent/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 04:23:36 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=281</guid>
		<description><![CDATA[I think I&#8217;ve said before somewhere that working in the field of web operations prepared me somewhat for being a parent. I thought the other day that I should write down some of this reasoning, because it&#8217;s pretty often that I&#8217;m reminded of similarities:
High availability
Having redundant infrastructure is WebOps 101. For my kids&#8217; most prized [...]]]></description>
			<content:encoded><![CDATA[<p>I think I&#8217;ve said before somewhere that working in the field of web operations prepared me somewhat for being a parent. I thought the other day that I should write down some of this reasoning, because it&#8217;s pretty often that I&#8217;m reminded of similarities:</p>
<p><em><strong>High availability</strong></em></p>
<p>Having redundant infrastructure is WebOps 101. For my kids&#8217; most prized possessions, their sleeping <a title="Dollies" href="http://www.flickr.com/photos/eekaroo/3361150569/" target="_blank">&#8216;loveys&#8217;</a>, there is no reason to have a <a title="Single Point of Failure" href="http://en.wikipedia.org/wiki/Single_Point_of_Failure" target="_blank">SPOF</a> under any circumstances. We have at least 4 backups for each on any trip that we go on, as well as a couple of trusted stuffed animals who might meet unfortunate fates.</p>
<p><em><strong>Capacity planning</strong></em></p>
<p>This applies to both disposable diapers (a.k.a.<em> consumable capacity</em>) and episodes of the few TV shows we allow them to watch, on the TiVo. My daughter, at 3 and a half, knows every detail from every one of the 49 episodes of <a title="The Backyardigans" href="http://www.google.com/url?sa=t&amp;source=web&amp;ct=res&amp;cd=1&amp;url=http%3A%2F%2Fwww.nickjr.com%2Fshows%2Fbackyardigans%2Findex.jhtml&amp;ei=LNjCSoypKZOCsgOQqPTuAg&amp;usg=AFQjCNFUuMBPdoxeunE6pvhpJtEtG1WSSw&amp;sig2=x2bYLViPeXoS70pEK6shww" target="_blank">The Backyardigans.</a> Having some of them on iPods and iPhones can make a 6 hour drive to L.A. feel like 4, not 12.</p>
<p><em><strong>Documentation</strong></em></p>
<p>Since I&#8217;m already used to writing down observations and techniques learned &#8216;in the field&#8217;, I was totally prepared:</p>
<div class="wp-caption alignnone" style="width: 500px">
	<a href="http://www.flickr.com/photos/allspaw/2592579909/"><img title="Allspaw Baby Soothing Method, v1" src="http://farm4.static.flickr.com/3205/2592579909_a5d8b25bb9.jpg" alt="Allspaw Baby Soothing Method, v1" width="500" height="327" /></a>
	<p class="wp-caption-text">Allspaw Baby Soothing Method, v1</p>
</div>
<p>and in case I ever forgot what my most successful swaddling method was:</p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="400" height="300" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="flashvars" value="intl_lang=en-us&amp;photo_secret=de8c6a5027&amp;photo_id=2554081561&amp;flickr_show_info_box=true" /><param name="bgcolor" value="#000000" /><param name="allowFullScreen" value="true" /><param name="src" value="http://www.flickr.com/apps/video/stewart.swf?v=71377" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="400" height="300" src="http://www.flickr.com/apps/video/stewart.swf?v=71377" allowfullscreen="true" bgcolor="#000000" flashvars="intl_lang=en-us&amp;photo_secret=de8c6a5027&amp;photo_id=2554081561&amp;flickr_show_info_box=true"></embed></object><br />
<em><strong></strong></em></p>
<p><em><strong>Architecture and design</strong></em></p>
<p>It&#8217;s unfortunate that I was so sleep-deprived that I never got a photo of the RadioShack remote-control truck that I turned into a cam-driven <a title="Moses basket" href="http://www.flickr.com/photos/nathanleland/2596474846/" target="_blank">Moses basket</a> automatic rocker mechanism. But you <a href="http://boingboing.net/2009/08/26/scripting-a-pc-cd-tr.html">understand what I&#8217;m talking about</a>.</p>
<p>There is one other thing that I learned from working at Flickr which turned out to be useful new parent advice: expect the unexpected, and never rely on past behaviors as an indication of what can happen in the future. They&#8217;re kids, not applications. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/09/29/webops-good-prep-for-becoming-a-new-parent/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Automated Control paper by the RAD Lab folks</title>
		<link>http://www.kitchensoap.com/2009/08/01/automated-control-paper-by-the-rad-lab-folks/</link>
		<comments>http://www.kitchensoap.com/2009/08/01/automated-control-paper-by-the-rad-lab-folks/#comments</comments>
		<pubDate>Sat, 01 Aug 2009 22:32:11 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=271</guid>
		<description><![CDATA[Wow, how did I miss this until now? In June, some smart people gathered in Barcelona for the First Workshop on Automated Control for Datacenters and Clouds (ACDC09) and jeez it looked like it was a good time, from a glance at the program.
One of the cooler papers is &#8220;Automatic exploration of datacenter performance regimes&#8221; in [...]]]></description>
			<content:encoded><![CDATA[<p>Wow, how did I miss this until now? In June, some smart people gathered in Barcelona for the <a href="http://www.cs.duke.edu/nicl/acdc09/" target="_blank">First Workshop on Automated Control for Datacenters and Clouds (ACDC09)</a> and jeez it looked like it was a good time, from a glance at the <a href="http://www.cs.duke.edu/nicl/acdc09/program.html" target="_blank">program</a>.</p>
<p>One of the cooler papers is <a href="http://portal.acm.org/citation.cfm?id=1555271.1555273" target="_blank">&#8220;Automatic exploration of datacenter performance regimes&#8221;</a> in which the smart folks over at the <a href="http://radlab.cs.berkeley.edu/" target="_blank">RAD Lab</a> at UCB tackle the idea of:</p>
<ol>
<li>Gathering up real usage metrics in production</li>
<li>Taking that data to feed a resource allocation (&#8220;auto-scaling&#8221;) controller</li>
</ol>
<p>The bit about coming up with an <em>exploration policy</em> is where the juicy stuff comes in, building in safety factors driven by external SLAs. You should read the whole thing to see how thoughtful their method was, which includes taking into account effects such as cold ramping, which you almost never see accounted for in simulated situations.  Rock on, RAD Lab: this is the stuff that brings the academia smarts to the real world. Kudos.</p>
<p><em>FYI: I&#8217;m not just saying the paper is cool because they cite my book as a resource in it. <img src='http://www.kitchensoap.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/08/01/automated-control-paper-by-the-rad-lab-folks/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Extreme Automated Infrastructure</title>
		<link>http://www.kitchensoap.com/2009/07/18/extreme-automated-infrastructure/</link>
		<comments>http://www.kitchensoap.com/2009/07/18/extreme-automated-infrastructure/#comments</comments>
		<pubDate>Sat, 18 Jul 2009 15:39:00 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=269</guid>
		<description><![CDATA[I&#8217;ve said before that I&#8217;ve always been a huge fan of SystemImager, for super simple imaging. It has some shortcomings for config management, but those are solved with things like Chef or Puppet.
With all of the great things being talked about surrounding &#8216;Automated Infrastructure&#8217;, I&#8217;ll point to something insanely cool: 1,190 nodes installed from [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve said before that I&#8217;ve always been a huge fan of SystemImager, for super simple imaging. It has some shortcomings for config management, but those are solved with things like <a href="http://wiki.opscode.com/display/chef/Home" target="_blank">Chef</a> or <a href="http://reductivelabs.com/products/puppet/" target="_blank">Puppet</a>.</p>
<p>With all of the great things being talked about surrounding &#8216;Automated Infrastructure&#8217;, I&#8217;ll point to something insanely cool: <a href="http://wiki.systemimager.org/index.php/BitTorrent" target="_blank">1,190 nodes installed from bare metal to all done in 15 minutes. </a></p>
<p>That&#8217;s One Thousand One Hundred and Ninety nodes. Completely installed in: Fifteen. Fucking. Minutes.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/07/18/extreme-automated-infrastructure/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>SLAs, clouds, and whatnot</title>
		<link>http://www.kitchensoap.com/2009/07/16/slas-clouds-and-whatnot/</link>
		<comments>http://www.kitchensoap.com/2009/07/16/slas-clouds-and-whatnot/#comments</comments>
		<pubDate>Thu, 16 Jul 2009 16:43:16 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=266</guid>
		<description><![CDATA[Excellent. Good work, Ben:
ah, the mighty service level agreement! the tooth and claw by which the wily customer brings the vendor to heel. get the SLA right and you, the customer, can sit back and relax, safe in the knowledge that should there be an outage, you are covered. your business is protected from harm [...]]]></description>
			<content:encoded><![CDATA[<p>Excellent. Good work, Ben:</p>
<blockquote><p>ah, the mighty service level agreement! the tooth and claw by which the wily customer brings the vendor to heel. get the SLA right and you, the customer, can sit back and relax, safe in the knowledge that should there be an outage, you are covered. your business is protected from harm by the warm, experienced embrace of a big, stable telco. pinch me, i must be dreaming.</p></blockquote>
<p>go read the whole <a href="http://blog.b3k.us/service_level_agreements.html" target="_blank">thing</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/07/16/slas-clouds-and-whatnot/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Uncaching bits in filesystem cache</title>
		<link>http://www.kitchensoap.com/2009/07/09/uncaching-bits-in-filesystem-cache/</link>
		<comments>http://www.kitchensoap.com/2009/07/09/uncaching-bits-in-filesystem-cache/#comments</comments>
		<pubDate>Thu, 09 Jul 2009 18:17:26 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Random]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=263</guid>
		<description><![CDATA[Domas makes something more useful than I bet most would think: http://mituzas.lt/2009/06/26/uncache/
]]></description>
			<content:encoded><![CDATA[<p>Domas makes something more useful than I bet most would think: <a href="http://mituzas.lt/2009/06/26/uncache/" target="_blank">http://mituzas.lt/2009/06/26/uncache/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/07/09/uncaching-bits-in-filesystem-cache/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Slides for Velocity Talk 2009</title>
		<link>http://www.kitchensoap.com/2009/06/23/slides-for-velocity-talk-2009/</link>
		<comments>http://www.kitchensoap.com/2009/06/23/slides-for-velocity-talk-2009/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 23:39:53 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Culture]]></category>
		<category><![CDATA[Slides]]></category>
		<category><![CDATA[Talks]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[WebOps]]></category>
		<category><![CDATA[velocity conference]]></category>
		<category><![CDATA[Web Ops]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=257</guid>
		<description><![CDATA[UPDATE: blip.tv has the video of the talk as well, below. Jeez I have some major bed-head.
That was a blast! I had never done a &#8216;duet&#8217; talk before. Here are the slides:
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
&#8230;and the video of it is here:

]]></description>
			<content:encoded><![CDATA[<p>UPDATE: blip.tv has the video of the talk as well, below. Jeez I have some major bed-head.</p>
<p>That was a blast! I had never done a &#8216;duet&#8217; talk before. Here are the slides:</p>
<div id="__ss_1628368" style="width: 425px; text-align: left;"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" title="10+ Deploys Per Day: Dev and Ops Cooperation at Flickr" href="http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr?type=presentation">10+ Deploys Per Day: Dev and Ops Cooperation at Flickr</a><object style="margin:0px" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="355" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=allspawhammondvelocity2009-090623161942-phpapp01&amp;stripped_title=10-deploys-per-day-dev-and-ops-cooperation-at-flickr" /><param name="allowfullscreen" value="true" /><embed style="margin:0px" type="application/x-shockwave-flash" width="425" height="355" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=allspawhammondvelocity2009-090623161942-phpapp01&amp;stripped_title=10-deploys-per-day-dev-and-ops-cooperation-at-flickr" allowscriptaccess="always" allowfullscreen="true"></embed></object></div>
<div style="width: 425px; text-align: left;">&#8230;and the video of it is here:</div>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="640" height="390" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="src" value="http://blip.tv/play/AYGMoH+LqzQ" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="640" height="390" src="http://blip.tv/play/AYGMoH+LqzQ" allowfullscreen="true"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/06/23/slides-for-velocity-talk-2009/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Annoying To Me.</title>
		<link>http://www.kitchensoap.com/2009/05/22/annoying-to-me/</link>
		<comments>http://www.kitchensoap.com/2009/05/22/annoying-to-me/#comments</comments>
		<pubDate>Fri, 22 May 2009 19:41:38 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=247</guid>
		<description><![CDATA[I can&#8217;t tell you how ripped I get when people say things like this:
&#8220;cloud computing means getting rid of ops&#8221;
If by &#8220;ops&#8221; you mean &#8220;people in data centers racking servers, installing OSes, running cables, replacing broken hardware, etc.&#8221; then sure, cloud computing aims to relieve you of those burdens. If you really think &#8216;ops&#8217; is [...]]]></description>
			<content:encoded><![CDATA[<p>I can&#8217;t tell you how ripped I get when people say things like this:</p>
<blockquote><p>&#8220;cloud computing means getting rid of ops&#8221;</p></blockquote>
<p>If by &#8220;ops&#8221; you mean &#8220;people in data centers racking servers, installing OSes, running cables, replacing broken hardware, etc.&#8221; then sure, cloud computing aims to relieve you of those burdens. If you really think &#8216;ops&#8217; is just that, then you really should put down your Nick Carr book and pay attention to the real world for a change.</p>
<p>The reality is, if your ops team is spending a lot of time doing that, then you&#8217;re either:</p>
<ol>
<li>Too big to use someone *else&#8217;s* cloud, because you basically have your own (Yahoo, Amazon, Google, etc.)</li>
<li>Stuck in 1999.</li>
</ol>
<p>If you deal with any of these things:</p>
<ul>
<li>handling site issues/incidents</li>
<li>building and maintaining tools to monitor and gather systems and application-level metrics</li>
<li>programming the ability to adapt infrastructure to changing system or application-level conditions (usage, failure, degradation, etc.)</li>
<li>implementing and maintaining deployment systems (code, config management, etc.)</li>
<li>capacity planning (no, really)</li>
</ul>
<p>then you&#8217;re doing &#8220;ops&#8221;, by my definition. In some environments, these things are done by &#8220;developers&#8221;. But <em>my</em> definition says those devs are performing ops functions.</p>
<p>Cloud computing isn&#8217;t going to make &#8216;ops&#8217; go away; it&#8217;s relieving ops (and dev) of a bunch of pain-in-the-ass things so they can focus on the real work needed. Namely: your application.</p>
<p>Last I checked, clouds don&#8217;t perform the tasks listed above, because those things (done right) are application-specific. And while cloud computing enables (in an excellent way) the efficient resource allocation (or de-allocation) for an application, it doesn&#8217;t get rid of the need to do the above things.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/05/22/annoying-to-me/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Context and Operational Metrics</title>
		<link>http://www.kitchensoap.com/2009/05/10/context-and-operational-metrics/</link>
		<comments>http://www.kitchensoap.com/2009/05/10/context-and-operational-metrics/#comments</comments>
		<pubDate>Mon, 11 May 2009 02:35:44 +0000</pubDate>
		<dc:creator>allspaw</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.kitchensoap.com/?p=195</guid>
		<description><![CDATA[I really don&#8217;t think it can be overestimated how important context can be when it comes to troubleshooting or evaluating the health of an infrastructure. When starting to troubleshoot a complex problem, web ops 101 &#8220;best practices&#8221; usually start with asking at least these questions:

When did this problem start?
What changes, if any, (software, hardware, usage, [...]]]></description>
			<content:encoded><![CDATA[<p>I really don&#8217;t think it can be overestimated how important context can be when it comes to troubleshooting or evaluating the health of an infrastructure. When starting to <a class="zem_slink" title="Troubleshooting" rel="wikipedia" href="http://en.wikipedia.org/wiki/Troubleshooting">troubleshoot</a> a complex problem, web ops 101 &#8220;<a class="zem_slink" title="Best practice" rel="wikipedia" href="http://en.wikipedia.org/wiki/Best_practice">best practices</a>&#8221; usually start with asking at least these questions:</p>
<ol>
<li>When did this problem <em>start</em>?</li>
<li>What changes, if any, (software, hardware, usage, environmental, etc.) were made just previous to the start of the problem?</li>
</ol>
<p>The context surrounding these problem events is pretty damn critical to figuring out what the hell is going on.<br />
Most monitoring systems are based around the idea that you want to know if a particular metric is above (or sometimes below) a certain threshold, and have &#8216;warning&#8217; or &#8216;critical&#8217; values that represent what is going bad or already bad. When these alarms go off, knowing how and when they got there is really important to your troubleshooting approach. This context is paramount in figuring out where to spend your time and focus.</p>
<p>For example: an alarm goes off because a monitor has detected that some metric has reached a critical state. Something that goes critical instantly can be quite different from something that edged into critical after being in a warning state for some time.</p>
<p>Check it out:</p>
<div id="attachment_199" class="wp-caption alignleft" style="width: 300px">
	<a href="http://www.kitchensoap.com/wp-content/uploads/2009/05/context1-monitoring.png"><img class="size-medium wp-image-199" title="Context: Monitoring" src="http://www.kitchensoap.com/wp-content/uploads/2009/05/context1-monitoring-300x183.png" alt="Monitored metric passing thru warning and critical thresholds." width="300" height="183" /></a>
	<p class="wp-caption-text">Metric passing thru warning and critical thresholds.</p>
</div>
<div id="attachment_200" class="wp-caption alignright" style="width: 300px">
	<a href="http://www.kitchensoap.com/wp-content/uploads/2009/05/context2-monitoring.png"><img class="size-medium wp-image-200" title="Context: Monitoring" src="http://www.kitchensoap.com/wp-content/uploads/2009/05/context2-monitoring-300x183.png" alt="Almost instantaneous critical, no time spent in warning." width="300" height="183" /></a>
	<p class="wp-caption-text">Almost instantaneous critical, no time spent in warning.</p>
</div>
<p style="text-align: left;">For this discussion, the actual metric here isn&#8217;t that important. It could be CPU on a webserver, it could be latency on a cache hit or miss on memcached/squid/varnish/etc, or it could be network bandwidth on a rack switch. The values you set for warning and critical are normally informed by how long the system can tolerate being in warning mode given &#8216;normal&#8217; failure modes, and should allow enough wall-clock time for recovery actions to take place before the metric reaches critical.</p>
<p>Most people would approach these two scenarios quite differently, because of the context that <em>time</em> lends to the issue.</p>
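<p>To make the difference between those two graphs concrete, here is a minimal sketch that measures how long a metric sat in warning before it first hit critical. The <code>time_in_warning</code> function, the sample data, and the 70/90 thresholds are hypothetical, and it only handles the &#8216;above a threshold&#8217; case.</p>

```python
def time_in_warning(samples, warning, critical):
    """Seconds spent at or above `warning` immediately before first reaching `critical`.

    `samples` is a list of (unix_seconds, value) pairs in time order.
    Returns None if the metric never reaches critical.
    """
    warning_since = None
    for ts, value in samples:
        if value >= critical:
            return 0 if warning_since is None else ts - warning_since
        if value >= warning:
            if warning_since is None:
                warning_since = ts  # first sample of this warning stretch
        else:
            warning_since = None  # recovered, so reset the clock
    return None

# Hypothetical one-minute samples of some metric (percent)
gradual = [(0, 50), (60, 72), (120, 78), (180, 85), (240, 93)]
sudden = [(0, 50), (60, 52), (120, 51), (180, 49), (240, 97)]

print(time_in_warning(gradual, warning=70, critical=90))  # 180
print(time_in_warning(sudden, warning=70, critical=90))   # 0
```

<p>A result of 180 seconds says the metric edged its way up; a result of 0 says it jumped straight to critical, which points you at a very different set of suspects.</p>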
<p>In <a title="The Art of Capacity Planning: Scaling Web Resources" href="http://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/0596518579/" target="_blank">the book</a>, I give <a href="http://books.google.com/books?id=Yi3trjJ1JuQC&amp;pg=PA28&amp;dq=%22art+of+capacity+planning%22+%22using+metric+collection+to+identify+problems%22&amp;ei=ng4BSpriHJDmkASaxq3LBA&amp;client=firefox-a" target="_blank">an example</a> of how valuable this context is in troubleshooting interconnected systems. When metrics from different clusters or systems are laid right next to each other, significant changes in usage can be put into the right context. Cascading failures can be pretty hard to track down to begin with. Tracking them down without the big picture of the system is impossible. That graph you&#8217;re using for troubleshooting: is it showing you a <em>cause</em>, or a <em>symptom</em>?</p>
<p>Because context is so important, I&#8217;m a huge fan of overlaying higher-level application statistics with lower-level systems ones. This guy has a great example of it over on the <a title="Web Ops Visualization Group Pool on Flickr.com" href="http://www.flickr.com/groups/webopsviz/" target="_blank">Web Ops Visualization group pool</a>:</p>
<p><a href="http://www.flickr.com/photos/34790652@N06/3408390203/in/pool-webopsviz"><img class="alignnone" title="System Efficiency" src="http://farm4.static.flickr.com/3019/3408390203_b86fa87ace.jpg" alt="" width="500" height="222" /></a></p>
<p>He&#8217;s not just measuring the webserver CPU, he&#8217;s also measuring the ratio of requests per second <strong>to</strong> total CPU. This is context that can be hugely valuable. If any of the underlying resources change (faster CPUs, more caching on the back-end, application optimizations, etc.) he&#8217;ll be able to tell quickly how much benefit he&#8217;ll gain (or lose) by tracking this bit.</p>
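<p>A ratio like that is cheap to compute if you already collect both numbers. The sketch below is only an illustration under my own assumptions: the <code>requests_per_cpu</code> and <code>percent_change</code> helpers and the sample values are made up, and the graph above may well be built differently.</p>

```python
from statistics import mean

def requests_per_cpu(samples):
    """Return requests-per-second divided by CPU percent for each sample.

    `samples` is a list of (timestamp, requests_per_second, cpu_percent).
    Samples with zero CPU are skipped to avoid dividing by zero.
    """
    return [rps / cpu for _ts, rps, cpu in samples if cpu > 0]

def percent_change(before, after):
    """Percent change in the average ratio between two sets of samples."""
    old = mean(requests_per_cpu(before))
    new = mean(requests_per_cpu(after))
    return (new - old) / old * 100.0

# Hypothetical samples taken before and after an application optimization
before = [(0, 400, 50.0), (60, 440, 55.0)]
after = [(120, 500, 50.0), (180, 600, 60.0)]

print(percent_change(before, after))  # 25.0
```

<p>Raw CPU looks about the same in both sets of samples, but the ratio shows each unit of CPU now serving 25% more requests, which is exactly the kind of context a CPU graph alone won&#8217;t give you.</p>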
<p>At the Velocity Summit, <a title="Theo Schlossnagle" href="http://lethargy.org/~jesus/" target="_blank">Theo</a> mentioned that since <a title="OmniTI" href="http://omniti.com/" target="_blank">OmniTI</a> started throwing metrics for all their clients into <a title="Reconnoiter" href="https://labs.omniti.com/trac/reconnoiter" target="_blank">reconnoiter</a>, they almost always plot their business metrics on top of their system metrics, because why the hell not? Even if there&#8217;s no immediate correlation, it gives their system statistics the context needed for the bigger picture, which is:</p>
<blockquote><p>How is my infrastructure actually <em>enabling</em> my business?</p></blockquote>
<p>I&#8217;ll say that gathering metrics is pretty key to running a tight ship, but seeing them in context is invaluable.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kitchensoap.com/2009/05/10/context-and-operational-metrics/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
