<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Dataspora Blog</title>
	
	<link>http://dataspora.com/blog</link>
	<description>Big Data, open source analytics, and data visualization</description>
	<pubDate>Tue, 25 Aug 2009 23:02:51 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/data-evolution" type="application/rss+xml" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
		<title>How XML Threatens Big Data</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/sgrANTdWFkU/</link>
		<comments>http://dataspora.com/blog/xml-and-big-data/#comments</comments>
		<pubDate>Sun, 23 Aug 2009 06:25:02 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[bigdata]]></category>

		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=91</guid>
		<description><![CDATA[Confessions from a Massive, Nightmarish Data Project
Back in 2000, I went to France to build a genomics platform.  A biotech hired me to combine their in-house genome data with that of public repositories like Genbank.  The problem was that the repositories, all with millions of records, each had its own format.  It sounded [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/elephant.jpg"><img class="alignleft size-thumbnail wp-image-93" title="elephant" src="http://dataspora.com/blog/wp-content/uploads/2009/08/elephant-150x150.jpg" alt="Credit:  http://www.flickr.com/photos/digitalart/2101765353" width="150" height="150" /></a>Confessions from a Massive, Nightmarish Data Project</strong></p>
<p>Back in 2000, I went to France to build a genomics platform.  A biotech hired me to combine their in-house genome data with that of public repositories like Genbank.  The problem was that the repositories, all with millions of records, each had its own format.  It sounded like a massive, nightmarish data interoperability project.  And an ideal fit for <a href="http://www.nytimes.com/2000/06/07/business/the-next-big-leap-it-s-called-xml.html">a hot new technology</a>:  XML.</p>
<p>So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (&#8220;taxon&#8221; or &#8220;species&#8221;?  attribute or element?).  At night I dreamt in ontologies.  <a href="http://labs.dataspora.com/pubseq/docs/overview/records2xml.gif">It was perfect.</a></p>
<p>Then reality struck.  The pipeline was slow:  Oracle loaded XML at a crawl.  And it was a memory hog, since XSLT required putting full document trees in RAM.</p>
<p>We had a deadline to meet (and, mon dieu, a 35-hour work week).  So we changed course.  We hacked our Perl scripts to emit a flat tab-delimited format &#8212; &#8220;TabML&#8221; &#8212; which was bulk loaded into Oracle.  It wasn&#8217;t elegant, but it was fast and it worked.</p>
<p>Yet looking back, I realize that XML was the wrong format from the start.  And as I&#8217;ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including  initiatives like <a href="http://www.data.gov">Data.gov</a>.</p>
<p>In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity.  Finally, I generalize to three rules that advocate a more liberal approach to data.</p>
<p><span id="more-91"></span></p>
<h3>Three Reasons Why XML Fails for Big Data</h3>
<p><strong>I. XML Spawns Data Bureaucracy </strong></p>
<p>In its natural habitat, data lives in relational databases or as data structures in programs.  The common import and export formats of these environments do not resemble XML, so much effort is dedicated to making XML fit.  When more time is spent on inter-converting data &#8212; serializing, parsing, translating &#8212; than on using it, you&#8217;ve created a data bureaucracy.</p>
<p>Indeed, it was what Doug Crockford called <a href="http://www.json.org/fatfree.html">&#8220;impedance mismatch inefficiencies&#8221;</a> that spurred him to create JSON, standardizing JavaScript&#8217;s object notation as a portable data container.</p>
<p><strong>II. Yes, Size Matters for Data</strong></p>
<p>Size matters for data in a way it does not for documents.  Documents are intended for human consumption and have human-sized upper bounds (a lifetime&#8217;s worth of reading fits on a thumb drive).  Data designed for machine consumption is bounded only by bandwidth and storage.</p>
<p>XML&#8217;s expansiveness &#8212; for even when compressed, the genie must be let out of the bottle at some point &#8212; imposes memory, storage, and CPU costs.</p>
<p><strong>III. Complexity Carries a Cost</strong></p>
<p>I never fail to sigh when I open a data file and discover an army of tags, several ranks deep, surrounding the data I need.  XML&#8217;s complexity imposes costs without commensurate benefits, specifically:</p>
<ul>
<li>In-line, element-by-element tagging is redundant.  Far preferable is stating the data model separately, and using a lightweight delimiter (such as a comma or a tab).</li>
<li> Text tags are purported to be self-documenting, but textual meaning is a slippery thing: it&#8217;s rare that one can be sure of a tag&#8217;s data type without consulting its DTD (in a separate document).</li>
<li> End-tags support nested structures (such as an aside (within (an aside))).  But to facilitate data exchange, flattened structures are preferable, and arbitrary levels of nesting are best used sparingly.</li>
</ul>
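<p>To make the first point concrete, here is a minimal sketch of the alternative: the data model is stated once, apart from the records, and the records themselves are plain tab-delimited rows.  The field names and sample rows are hypothetical, invented purely for illustration.</p>

```python
import csv
import io

# Hypothetical data model, stated once and kept apart from the records.
SCHEMA = [("accession", str), ("taxon", str), ("length", int)]

# Hypothetical tab-delimited records: no per-field tags, one row per line.
RAW = "AB000001\tHomo sapiens\t1542\nAB000002\tMus musculus\t987\n"


def read_records(text, schema=SCHEMA):
    """Yield one dict per row, casting each field with the schema's type."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for row in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


records = list(read_records(RAW))
```

<p>The type information lives in one short list rather than being repeated around every value, which is roughly the trade our &#8220;TabML&#8221; made.</p>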
<p>XML&#8217;s complexity inflicts misery on both sides of the data divide: on the publishing side, developers struggle to comply with the latest edicts of a fussy standards group, while on the consuming side, data suitors labor to <a href="http://www.crummy.com/software/BeautifulSoup/">quickly unravel</a> that XML format into something they can use.</p>
<h3>Three Rules for XML Rebels</h3>
<p><strong>I.  Stop Inventing New Formats</strong> <a href="http://www.tbray.org/ongoing/When/200x/2006/01/08/No-New-XML-Languages">(as Tim Bray said in 2006)</a></p>
<p>Before you call for &#8220;an XML format for X&#8221;, let me tell you a story about LaTeX and MathML.  (And while these are document formats, there&#8217;s a lesson here for data).</p>
<p>The LaTeX typesetting system is the lingua franca for composing scientific documents.  As the one-million plus LaTeX-formatted articles on arXiv.org attest, it is spoken by scientists worldwide.</p>
<p>MathML, on the other hand, is a markup language for mathematics recommended by the W3C.  If you&#8217;re a scientist looking to use MathML, you have two choices: (i) find a program to convert LaTeX, which you already know, to MathML 3.0 or (ii) familiarize yourself with this <a href="http://www.w3.org/TR/2009/WD-MathML3-20090604/"> handy 354-page spec</a> and code it yourself.</p>
<p>Two years ago, Mike Adams thought of a third way: why not just let people use LaTeX directly in WordPress?  So he wrote a plug-in that did it.  <a href="http://en.blog.wordpress.com/2007/02/17/math-for-the-masses/">The applause was deafening</a>.</p>
<p>Spoken languages are strengthened by usage, not by imperial fiat, and data formats are no different.  Far better to evolve and adapt the standards we already have (as JSON and SQLite&#8217;s file format do), than to fabricate new ones from whole cloth.  <a href="http://blog.jonudell.net/2009/07/31/polymath-equals-user-innovatio/">As Jon Udell says</a>, &#8220;good-enough solutions [that are] here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.&#8221;</p>
<p><strong>II.  Obey The Fifteen Minute Rule</strong></p>
<p><a href="http://www.ddj.com/184404686">Interviewed several years ago</a>, James Clark stated &#8220;If a technology is too complicated, no matter how wonderful it is and how easy it makes a user&#8217;s life, it won&#8217;t be adopted on a wide scale.&#8221;</p>
<p>Accordingly, if you absolutely must develop a new API, language, or format, it should satisfy a simple rule: a person of reasonable ability should be able to get from zero to &#8216;Hello World&#8217; in fifteen minutes.  (This does not preclude complex languages or formats, per se:  it does require that additional complexity not be sui generis, but built on some existing foundation, <a href="http://people.mandriva.com/~prigaux/language-study/diagram-light.png">for example.</a>) </p>
<p>Despite <a href="http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/">a noble vision for the semantic web</a>, the barriers to adopting the W3C&#8217;s proposals for linked data are too high.  The beauty of the original HTML standard was that it was dead simple.  The flaw of RDF is that it is too hard.</p>
<p><strong>III.  Embrace Lazy Data Modeling</strong></p>
<p>To keep data bureaucracy to a minimum, <a href="http://my.safaribooksonline.com/9780596801656/information_platforms_as_dataspaces">several Big Data thinkers </a> have advocated a more <a href="http://en.wiktionary.org/wiki/catholic">catholic</a> approach to data:  building data stores that accommodate <a href="http://infochimps.org/">a broad range of data types and formats</a>.</p>
<p>Lazy data modeling is similar to lazy evaluation.  The right schema for data depends on future use cases, in as-yet-undeveloped applications.  Instead of trying to guess the future, we can store the data &#8220;as-is&#8221; &#8212; and deal with its transformation when (and if) a necessary use case arises.  As <a href="http://www.eecs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf">Michael Franklin and colleagues note</a>: &#8220;the most scarce resource available for semantic integration is human attention.&#8221;</p>
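<p>A rough sketch of what storing data as-is and modeling it late might look like.  The raw store, its two record shapes, and the field names are all hypothetical; the point is only that a schema gets imposed at read time, for one concrete use case, rather than up front.</p>

```python
import json

# Hypothetical raw store: records are kept exactly as they arrived,
# one as JSON and one as a tab-delimited line.
RAW_STORE = [
    '{"city": "Paris", "temp_c": "21.5"}',
    "Berlin\t19.0",
]


def city_temperatures(raw_store):
    """Impose a schema only at read time, for this one use case."""
    for raw in raw_store:
        if raw.startswith("{"):
            record = json.loads(raw)
            yield record["city"], float(record["temp_c"])
        else:
            city, temp = raw.split("\t")
            yield city, float(temp)


temps = dict(city_temperatures(RAW_STORE))
```

<p>If a different application later needs a different view of the same records, it writes its own reader; nothing already stored has to be converted.</p>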
<p>This liberal view also reduces barriers to data sharing, barriers which threaten initiatives like <a href="http://www.data.gov/">Data.gov</a>.  The US Census Bureau shouldn&#8217;t expend resources to publish in XML if they have a good-enough format available right now.</p>
<p>For the data geeks in the trenches, who are building the next generation of data services, the laws of economics hold fast: there are unlimited opportunities in the face of one limited resource, time. (Which also explains why <a href="http://blog.i2pi.com/">data geeks </a> <a href="http://www.datawrangling.com/">seem to </a> <a href="http://twitter.com/dpatil">get </a> <a href="http://anyall.org/blog/">no sleep</a>).</p>
<p>XML&#8217;s unfulfilled promise for data testifies that formats can create friction.  The easier it is for data to be shared and consumed, the more quickly we&#8217;ll realize our visions for smarter businesses and <a href="http://www.readwriteweb.com/archives/how_tim_oreilly_aims_to_change_government.php">better governments.</a></p>
<p><strong>(25-Aug-2009 Update:  <a href="http://groups.google.com/group/sunlightlabs/browse_thread/thread/da9118b9fe566c">  Read a response from open gov advocates at Sunlight Labs</a>).</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/xml-and-big-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/xml-and-big-data/</feedburner:origLink></item>
		<item>
		<title>The Rise of the Data Web</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/trYsY0hnfNQ/</link>
		<comments>http://dataspora.com/blog/the-rise-of-the-data-web/#comments</comments>
		<pubDate>Fri, 21 Aug 2009 01:51:33 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[data bigdata xml]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=86</guid>
		<description><![CDATA[The future of the web is data, not documents.  The web has evolved from Tim Berners-Lee&#8217;s original vision of &#8220;some big, virtual documentation system in the sky&#8221; into a vibrant ecosystem of data where documents &#8212; and human actors &#8212; will play an ever smaller role.
As others have noted, we&#8217;ve reached a tipping point [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/stream.jpg"><img class="alignleft size-medium wp-image-88" title="stream" src="http://dataspora.com/blog/wp-content/uploads/2009/08/stream-188x300.jpg" alt="" width="188" height="300" /></a>The future of the web is data, not documents.  The web has evolved from Tim Berners-Lee&#8217;s original vision of <a href="http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html">&#8220;some big, virtual documentation system in the sky&#8221;</a> into a vibrant ecosystem of data where documents &#8212; and human actors &#8212; will play an ever smaller role.</p>
<p><a href="http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel programming/">As others have noted</a>, we&#8217;ve reached a tipping point in history: more data is being manufactured by machines &#8212; servers, cell phones, GPS-enabled cars &#8212; than by people.  The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.</p>
<p>Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext.  Similarly, we&#8217;ve also industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.</p>
<p>The web we experience will continue to be dominated by documents &#8212; e-mail, blogs, and news.  And while many sites are data-centric &#8212; Google Maps, Weather.com, and Yahoo Finance &#8212; it&#8217;s the web that we can&#8217;t see that is surging with data.  It&#8217;s not about us, it&#8217;s about servers in the cloud mediating <a href="http://radar.oreilly.com/archives/2007/02/pipes-and-filte.html">entire pipelines of data</a>, only occasionally surfacing in a browser.</p>
<p>But the web&#8217;s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data.  As we build out the data web, we ought to embrace standards that mirror data&#8217;s form in its natural habitats &#8212; as programmatic data structures, relational tables, or key-value pairs &#8212; while taking advantage of data&#8217;s stream-like nature.  Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.</p>
<p><span id="more-86"></span></p>
<p><strong>Sacred &#8220;Words &amp; Enthusiasm&#8221; vs Meaningless Utterances</strong></p>
<p>Documents and data are different.  The table below reflects my thin grasp of the fissure lines, as a step towards arguing why we ought to design around them.</p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/documents_vs_data.png"><img class="alignnone size-full wp-image-90" title="documents_vs_data" src="http://dataspora.com/blog/wp-content/uploads/2009/08/documents_vs_data.png" alt="" width="499" height="356" /></a></p>
<p>Documents are made of <a href="http://www.ted.com/talks/view/id/161">&#8220;words and enthusiasm&#8221;</a>: sonnets, cake recipes, blog posts, Supreme Court rulings, and dictionary definitions.  Their core stuffing is text.  Their structure is unpredictable and irregular &#8212; even <a href="http://seanmcgrath.blogspot.com/2004_05_23_seanmcgrath_archive.html"> fractal</a>.</p>
<p>Data are not created but collected (<a href="http://www.archives.nd.edu/cgi-bin/lookit.pl?latin=datum">something given</a>, not something made): city temperatures, stock prices, web visitors, and home runs. They are observations in time and space, with periodic and predictable structure.  Data are reorderable and divisible: you can relay city temperatures in any order, but you can&#8217;t rearrange a Shakespearian sonnet without muddling its meaning.  Some documents are so meaningful as to be considered <a href="http://www.ietf.org/rfc/rfc1.txt">sacred</a>.</p>
<p>Data are, in this regard, meaningless on their own; they do not signify, they simply are.  These data are the <a href="http://plato.stanford.edu/entries/assertion/">utterances </a>of the <a href="http://boingboing.net/images/blobjects.htm">spimes </a> that surround us.</p>
<p><strong>Documents as Trees, Data as Streams</strong></p>
<p>The argument for shifting away from markup languages as data formats is not just practical, it&#8217;s philosophical: it&#8217;s about pivoting our conception away from the dominant metaphor of documents &#8212; trees &#8212; towards one far more suitable for data &#8212; streams.</p>
<p>Trees are rooted and finite: you can&#8217;t chop up a tree and easily put it back together again (while XML has made concessions to <a href="http://www.w3.org/TR/xml-fragment">document fragments</a>, it is not a natural fit).</p>
<p>Streams can be split, sampled, and filtered.  The divisibility of data streams lends itself to parallelism in a way that document trees do not.  The stream paradigm conceives of data as extending infinitely forward in time.  The Twitter data stream has no end: it ought to have no end tag.</p>
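<p>As a small illustration of the stream view, here is a sketch using Python generators.  The endless <code>readings</code> source and its values are made up; what matters is that sampling and filtering compose lazily, and that the consumer never needs the stream to end.</p>

```python
import itertools


def readings():
    """Hypothetical endless stream of (sequence number, temperature) pairs."""
    for n in itertools.count():
        yield n, 15.0 + (n % 10)


def sample(stream, every):
    """Keep every `every`-th item of a stream."""
    return itertools.islice(stream, 0, None, every)


def warm(stream, threshold):
    """Keep only readings at or above `threshold`."""
    return (item for item in stream if item[1] >= threshold)


# Compose lazily; nothing is read until items are requested.
pipeline = warm(sample(readings(), every=3), threshold=20.0)
first_three = list(itertools.islice(pipeline, 3))
# first_three == [(6, 21.0), (9, 24.0), (15, 20.0)]
```

<p>Each stage holds only the item in flight, so memory use stays flat however long the stream runs, in contrast to a document tree that must be complete before it can be processed.</p>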
<p>Conceiving of data as streams moves us out of the realm of static objects and into the <a href="http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-24.html#%_sec_3.5">realm of signal processing</a>.  This is the domain of the living: where the web is not an archive but an organism, <a href="http://radar.oreilly.com/2009/08/big-data-and-real-time-structured-data-analytics.html">reacting in real-time</a>.</p>
<p><strong>XML Considered Harmful for Data</strong></p>
<p>XML is a poor language for data because it solves the wrong problems &#8212; those of documents &#8212; while leaving many of data&#8217;s unique issues unaddressed.   But many promising alternatives exist &#8212; microformats like <a href="http://www.json.org/fatfree.html">JSON</a>, <a href="http://developers.facebook.com/thrift/thrift-20070401.pdf">Thrift</a>, and even <a href="http://www.sqlite.org/fileformat.html">SQLite&#8217;s file format</a> &#8212; as I will detail in <a href="http://dataspora.com/blog/xml-and-big-data/">my next post</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/the-rise-of-the-data-web/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/the-rise-of-the-data-web/</feedburner:origLink></item>
		<item>
		<title>The Three Sexy Skills of Data Geeks</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/Wf9Z8ufjH2o/</link>
		<comments>http://dataspora.com/blog/sexy-data-geeks/#comments</comments>
		<pubDate>Wed, 27 May 2009 10:02:05 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=85</guid>
		<description><![CDATA[Hal Varian, Google&#8217;s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
&#8220;The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/marilyn_scatter.png"><img class="alignnone size-medium wp-image-84" title="marilyn_scatter" src="http://dataspora.com/blog/wp-content/uploads/2009/05/marilyn_scatter-300x300.png" alt="Marilyn Monroe Scatterplot Mashup" width="300" height="300" /></a>Hal Varian, Google&#8217;s Chief Economist, was interviewed a few months ago, and said the following in <a href="http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286">the McKinsey Quarterly</a>:<br />
<em>&#8220;The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.” </em></p>
<p>In prepping for tonight&#8217;s talk at the <a href="http://www.youtube.com/watch?v=hcl3qmawY_0">Google I/O Ignite</a> event, this quote inspired me to muse about how sex appeal and statistics might go together:  so I chose to mash up a few scatter plots with Andy Warhol&#8217;s Marilyn Monroe.</p>
<p>Statisticians&#8217; sex appeal has little to do with their lascivious leanings (ahem, <a href="http://www.bedposted.com">BedPost</a>), and more with the scarcity of their skills.  I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas:  statistics, data munging, and data visualization.  (In parentheses next to each, I&#8217;ve put the salient character trait needed to acquire it).</p>
<p><strong>Skill #1: Statistics (Studying).</strong> Statistics is perhaps the most important skill and the hardest to learn. <span id="more-85"></span>It&#8217;s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was only <a href="http://arxiv.org/abs/math/0406456">recently developed in 2004</a>).  I expect to be on its learning curve my entire life.  This being the case, people who possess a solid grasp of modern statistics are rare.   And yet problems that require its application continue to multiply.  The text that I was exposed to in graduate school and find to be an unparalleled survey is Hastie, Tibshirani, and Friedman&#8217;s <a href="http://www.amazon.com/Elements-Statistical-Learning-T-Hastie/dp/0387952845">Elements of Statistical Learning</a>.</p>
<p><strong>Skill #2: Data Munging (Suffering).</strong> The second critical skill mentioned above is  &#8220;data munging.&#8221;  Among data geek circles (you can find us with a <a href="http://search.twitter.com/search?q=%23rstats">Twitter search for #rstats</a>), this refers to the painful process of cleaning, parsing, and proofing one&#8217;s data before it&#8217;s suitable for analysis.  Real world data is messy.  At best it&#8217;s inconsistently delimited or packed into an unnecessarily complex XML schema.  At worst, it&#8217;s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.</p>
<p>A good data munger excels at turning coffee into regular expressions and parsers, implemented in a high-level scripting language of choice (often Perl, Python, even JavaScript).  This is problem solving with programming, and quite different from statistics.  An aspiration towards elegance &#8212; in the form of a perfect XSLT filter, for example &#8212; is rarely rewarded, and often punished.  A decade ago, I thought that the world&#8217;s data would soon be well-structured, and my talent for syntactical incantations of regular expressions would be a moot skill.   I was wrong.  (Perhaps there&#8217;s an analogy with the paper industry:  the growing volume of data means we&#8217;ll likely need more regular expressions before we need fewer).</p>
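<p>For a taste of the genre, here is a minimal sketch of munging inconsistently delimited records with one regular expression.  The sample lines and the three-field layout are invented for illustration; real data is rarely this cooperative.</p>

```python
import re

# Hypothetical messy input: the same three fields, delimited three ways.
LINES = [
    "Paris, 21.5 ,2009-08-23",
    "Berlin;19.0;2009-08-23",
    "Rome\t27.1\t2009-08-23",
]

# One delimiter (comma, semicolon, or tab) with optional padding around it.
DELIMITER = re.compile(r"\s*[,;\t]\s*")


def parse(line):
    """Split one line into (city, temperature, date)."""
    city, temp, date = DELIMITER.split(line.strip())
    return city, float(temp), date


rows = [parse(line) for line in LINES]
```

<p>It is not elegant, and a line with a stray extra delimiter will raise an error rather than pass silently, which is usually what you want when proofing data.</p>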
<p>Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).</p>
<p>And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like <a href="http://databeta.wordpress.com/2009/05/14/bigdata-node-density/">96 nodes of Postgres</a>, <a href="http://cran.r-project.org/web/views/HighPerformanceComputing.html">snow and Rmpi</a>, Hadoop and MapReduce, and <a href="http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop">on Amazon EC2 to boot.</a></p>
<p><strong>Skill #3: Visualization (Storytelling).</strong> This third and last skill that Professor Varian refers to is the easiest to believe one has.  Most of us have had exposure to basic chart-making widgets of Excel (and to date myself, tools like Harvard Graphics).   But a little knowledge is a dangerous thing:  these software tools are often insufficient when faced with the visualization of large, multivariate data sets.</p>
<p>Here it&#8217;s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals.  The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst&#8217;s understanding of the data.   These may consist of <a href="http://dsarkar.fhcrc.org/lattice/book/images/Figure_05_17_stdBW.png">scatter plot matrices</a> and histograms, where labels and colors are minimally set by default.   Their goal is to help develop a hypothesis about the data, and their audience typically numbers one or a small team.</p>
<p>The second kind are visualizations intended to communicate with a wider audience; their goal is to advocate visually for a hypothesis.  While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, these visual narratives, is again a separate skill &#8212; with separate tools.  (R is excellent for static visualizations, but cannot compete with the kinds of rich interactive visualizations that tools like <a href="http://processing.org/">Processing</a> and <a href="http://flare.prefuse.org/">Flare</a> make possible).  Luckily, successful collaboration often occurs <a href="http://blog.jonudell.net/2009/05/26/a-conversation-with-eric-rodenbeck-about-usefully-cool-design-and-engineering/">between data analysts and designers</a>, the <a href="http://flowingdata.com/2009/04/22/narrow-minded-data-visualization/">occasional fracas</a> notwithstanding.</p>
<p>The ability to visualize and communicate data is critical, because even with good data and rigorous statistical techniques, if the results of an analysis are poorly visualized, they will not convince:  whether it&#8217;s an academic discovery or a business proposal.</p>
<p><strong>Put All Three Skills Together:  Sexy. </strong>Thus with the Age of Data upon us, those who can model, munge, and visually communicate data &#8212; call us statisticians or data geeks &#8212; are a hot commodity.  I grew up before the age of geek chic, when the computer whizzes were social pariahs, and feature-length movies were dedicated to <a href="http://www.imdb.com/title/tt0088000/">nerds seeking revenge</a>.  But in the last decade, Steve Jobs became an icon, the Internet became cool, and an entire generation of tech kids grew up well adjusted.  They even built the social web to prove it.   I believe the same could happen to statistics and data geeks too.</p>
<p><strong>(Update Aug-2009:  If you liked this post, consider </strong><a href="http://panelpicker.sxsw.com/ideas/view/4287"><strong>voting for it at the 2010 SXSW Conference</strong></a><strong>).</strong><a href="http://panelpicker.sxsw.com/ideas/view/4287"><img src="http://sxsw.com/files/SXSWPanelPicker-sm.png" alt="Vote for my PanelPicker idea at SXSW" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/sexy-data-geeks/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/sexy-data-geeks/</feedburner:origLink></item>
		<item>
		<title>Dataviz Salon SF #2:  Maps, Grammars, &amp; Models</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/YIddP1eXxpc/</link>
		<comments>http://dataspora.com/blog/dataviz-sf-salon-no/#comments</comments>
		<pubDate>Fri, 08 May 2009 10:11:35 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=75</guid>
		<description><![CDATA[A few nights ago the talented folks at Stamen Design hosted us at their studios for our second dataviz salon in San Francisco.  (Special thanks to Tom Carden and Michal Migurski for inviting us).  Four talks were given, which I&#8217;ll review in turn.

Stamen:  Reaching through Maps
Protovis: A Declarative, Open Source Graphical Toolkit
A Mathematician&#8217;s View:  A Visualization is a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/dataviz_salon_poster_5may20.png"><img class="alignleft size-thumbnail wp-image-76" title="dataviz_salon_poster_5may20" src="http://dataspora.com/blog/wp-content/uploads/2009/05/dataviz_salon_poster_5may20-150x150.png" alt="" width="150" height="150" /></a>A few nights ago the talented folks at <a href="http://www.stamen.com">Stamen Design</a> hosted us at their studios for our second dataviz salon in San Francisco.  (Special thanks to <a href="http://www.tom-carden.co.uk">Tom Carden</a> and <a href="http://mike.teczno.com/">Michal Migurski</a> for inviting us).  Four talks were given, which I&#8217;ll review in turn.</p>
<ul>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#stamen">Stamen:  Reaching through Maps</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#protovis">Protovis: A Declarative, Open Source Graphical Toolkit</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#morton">A Mathematician&#8217;s View:  A Visualization is a Hypothesis</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#uuorld">UUorld:  Multidimensional Extrusion Maps</a></li>
</ul>
<h3 id="stamen">Stamen:  Reaching through Maps</h3>
<p>Eric Rodenbeck (Stamen) started by highlighting several mapping visualizations that Stamen has been hacking on recently and in the past, including <a href="http://www.cabspotting.org">Cabspotting in San Francisco</a>, <a href="http://oakland.crimespotting.org/">Crimespotting in Oakland</a>, and <a href="http://www.london2012.com/in-your-area/map/index.php">Olympic Stadium spotting in London</a>.</p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/stamen_cabspotting.png"><img class="alignleft size-thumbnail wp-image-79" title="stamen_cabspotting" src="http://dataspora.com/blog/wp-content/uploads/2009/05/stamen_cabspotting-150x150.png" alt="" width="150" height="150" /></a>Eric showed how Stamen has attempted to move away from what <a href="http://mappinghacks.com/2006/04/07/web-map-api-roundup/">Schuyler Erle has dubbed &#8220;red dot fever&#8221;</a>, whereby the overlaid data can overwhelm our visual attention, and toward allowing various data layers to &#8220;reach through&#8221; the maps.</p>
<p>For example, the London Olympic maps provide a mixture of schematic, satellite, and webcam images.  These various drill-downs of detail are not all exposed, but rather collaged.  Even more interesting was a movable &#8216;lens&#8217; that, as it is moved over regions of a map, reveals another layer (reminiscent of a <a href="http://www.flickr.com/photos/cdevers/2896777351/"> polarized-light based mural</a> at Boston&#8217;s MoS).  In these ways, additional layers of data are only selectively brought into focus (echoing a design pattern in Japanese gardening, <a href="http://www.amazon.com/Visual-Spatial-Structure-Landscapes/dp/0262580942">mie gakure</a>, meaning &#8220;seen and unseen&#8221;).<br />
<span id="more-75"></span><br />
One practical gem that Mike Migurski shared regarding the Oakland Crimespotting site was, &#8220;the design of a comments section is a huge part of how it&#8217;s perceived and used.&#8221;  Nota bene, social web developers.</p>
<h3 id="protovis">Protovis: A Declarative, Open Source Graphical Toolkit</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/burtin_yeast_mic.png"><img class="alignnone size-thumbnail wp-image-77" title="burtin_yeast_mic" src="http://dataspora.com/blog/wp-content/uploads/2009/05/burtin_yeast_mic-150x150.png" alt="" width="150" height="150" /></a>Mike Bostock (Stanford CS) introduced <a href="http://vis.stanford.edu/protovis/">Protovis</a>, an extensible visualization toolkit implemented in JavaScript on top of the HTML canvas element.  Protovis draws inspiration from Leland Wilkinson&#8217;s <a href="http://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448">Grammar of Graphics</a>, which argues for moving away from the prevailing method of building visualizations, where data are simply poured into one of several chart types &#8212; pie, stacked bar, or scatter.</p>
<p>Wilkinson argues that visualizations should not be cast from chart typologies, but rather composed of graphical primitives.  In Protovis, these primitives include dots, areas, lines, and labels (called &#8220;marks&#8221;).</p>
<p>Among Protovis&#8217;s strengths are:</p>
<dl>
<dt><strong> A More Declarative Syntax for Creating Graphics </strong></dt>
<dd> One disadvantage of using the canvas element directly from JavaScript is its imperative style.  To draw a diagonal line, the code must manipulate and move a pen using x,y coordinates.  With Protovis, however, the code declares (roughly) &#8220;add a bar to this graph&#8221; (<a href="http://vis.stanford.edu/protovis/ex/weather.html">example</a>).  Thus Protovis provides a grammar for statements about graphical marks, rather than statements about graphical mechanics. </dd>
<dt><strong> Visible Open Source </strong></dt>
<dd> With Protovis, the source code is not just open and available, it&#8217;s viewable from within the browser.  I have an admittedly personal bias for <a href="http://dataspora.com/blog/open-source-dataviz/">open source data visualization</a>, but lowering the barriers to sharing source code ultimately drives faster adoption and iteration of visualization techniques. </dd>
</dl>
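<p>To make the imperative versus declarative contrast concrete, here is a minimal sketch in Python.  It is not Protovis&#8217;s API: the <em>Bar</em> class, its property names, and the sample numbers are all hypothetical, chosen only to show a mark whose properties are declared as functions of the data, with the drawing mechanics hidden in one place.</p>

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Bar:
    """Hypothetical 'mark': each property is a function of (datum, index)."""
    data: Sequence[float]
    left: Callable[[float, int], float]
    height: Callable[[float, int], float]
    width: float = 20.0

    def render(self, panel_height: float) -> List[Rect]:
        """Resolve the declaration into one rectangle per datum."""
        rects = []
        for i, d in enumerate(self.data):
            h = self.height(d, i)
            # Screen coordinates grow downward, so bars hang from the bottom.
            rects.append((self.left(d, i), panel_height - h, self.width, h))
        return rects


# Declarative use: say what a bar is, not how to move a pen.
bars = Bar(data=[1.0, 1.2, 1.7, 1.5, 0.7],
           left=lambda d, i: i * 25,
           height=lambda d, i: d * 80)
rects = bars.render(panel_height=150)
```

<p>The caller never issues a move-to or line-to; swapping the height function changes the whole chart without touching the rendering loop.</p>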
<p>Mike has used Protovis to recreate classic data visualizations by Will Burtin, Florence Nightingale, William Playfair, and others.  You can find these at the <a href="http://vis.stanford.edu/protovis">Protovis site</a> and in their <a href="http://vis.stanford.edu/protovis/protovis.pdf">InfoVis &#8216;09 paper</a>.</p>
<p>(For those interested in a Wilkinson-inspired approach for graphics in R, check out <a href="http://had.co.nz/ggplot2/">Hadley Wickham&#8217;s ggplot2</a>.)</p>
<h3 id="morton">A Mathematician&#8217;s View:  A Visualization is a Hypothesis</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/dataspora_wordle.png"><img class="alignleft size-thumbnail wp-image-78" title="dataspora_wordle" src="http://dataspora.com/blog/wp-content/uploads/2009/05/dataspora_wordle-150x150.png" alt="" width="150" height="150" /></a>Jason Morton (Stanford Mathematics) made the argument that a data visualization is not merely a descriptive vessel, it is a predictive model.</p>
<p>A visualization is a model because, especially with large data sets, not every dimension of every observation can be shown.  Quite simply, a (compressed) 100 KB data visualization cannot losslessly describe a (compressed) 10 MB data set: information must be discarded. What remains is a <em>model</em> of the original data, albeit a visual model.</p>
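<p>A small illustration of that loss, as a sketch with made-up numbers: once values are summarized into histogram bins (the hypothetical <em>histogram</em> helper below), two different data sets can produce exactly the same picture, so the picture cannot be inverted back to the data.</p>

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; the counts are all a bar chart would display."""
    return Counter(int(v // bin_width) for v in values)


# Two different (made-up) data sets ...
a = [1.0, 2.5, 4.9, 7.2, 8.8]
b = [0.3, 3.1, 4.0, 5.5, 9.9]

# ... that collapse to the same two bars.
same_picture = histogram(a, 5.0) == histogram(b, 5.0)
```

<p>The bin width plays the role of the model&#8217;s resolution: the coarser it is, the more distinct data sets share one image.</p>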
<p>Moreover, a data visualization&#8217;s model is predictive: it presents a hypothesis about how observable data points were generated, and implies predictions about future, as-yet-unobserved data.</p>
<p>Seen from this perspective, Stamen&#8217;s Crimespotting maps are powerful precisely because they make compelling hypotheses about when and where crime occurs in Oakland.  Their London Olympic maps, which integrate time series photographs of the stadium site, take a position about the pace of construction and how it is impacting the landscape.</p>
<p><strong>&#8220;Form Ever Follows Function&#8221;</strong></p>
<p>And if the function of a data visualization is to make hypotheses, then its form should follow this function. The arbitrary use of color, position, shape, and ornament only adds noise.</p>
<p>The ever popular <a href="http://www.wordle.net/"> Wordle </a> provides a visual model for word distribution in a text: more frequent words are larger.  However, a word&#8217;s color, position, and font are arbitrarily chosen - they carry no meaning, and model nothing. Indeed, the &#8220;randomize&#8221; button is an admission of as much (for it does not randomize size).</p>
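<p>The one meaningful mapping here, frequency to size, can be sketched in a few lines.  This is an assumption-laden toy, not Wordle&#8217;s actual algorithm: the linear scaling, the point sizes, and the <em>word_sizes</em> name are mine.</p>

```python
import re
from collections import Counter


def word_sizes(text, min_pt=10, max_pt=48):
    """Map each word to a font size that grows linearly with its count."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    top = max(counts.values())
    span = max(top - 1, 1)  # avoid dividing by zero when every count is 1
    return {w: min_pt + (max_pt - min_pt) * (n - 1) / span
            for w, n in counts.items()}


sizes = word_sizes("data data data model model noise")
```

<p>Size is the only output that depends on the text.  Anything else a renderer adds, such as color or placement, would have to be picked independently of the counts, which is exactly why it models nothing.</p>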
<p>Adding arbitrary marks or dimensions to a visualization carries two related risks: first, it can obscure the model the visualization is trying to convey (what do same-colored words have in common?); second, the added complexity, beyond polluting the information channel, has a cost: the visualization is larger.  <a href="http://www.swivel.com/graphs/image/28893777/default/600/337/5/absolute/HorizontalBarGraph/ASC/all+time/daily/ignore?s=1241769339">Bar graphs with iPhone ads</a> in the background cannot be succinctly rendered.</p>
<p>The parallels to the modernist movement in architecture are obvious. Adolf Loos wrote in 1908 that &#8220;the evolution of culture marches with the elimination of ornament from useful objects.&#8221;  The American modernist Louis Sullivan proclaimed that &#8220;form ever follows function.&#8221;</p>
<p>But the truth is that stripping visualizations down to their bare models can be counterproductive.  Call it noise or ornamentation, but even visual marks that do not advance a hypothesis can act to support it,  by guiding the eye, providing context, or otherwise speeding the absorption of a pattern by the human brain.  At the very least, this functionalist perspective can help data visualizers use ornamentation intentionally, not inadvertently.</p>
<h3 id="uuorld">UUorld:  Multidimensional Extrusion Maps</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/uuorld_stlouis.png"><img class="alignleft size-thumbnail wp-image-80" title="uuorld_stlouis" src="http://dataspora.com/blog/wp-content/uploads/2009/05/uuorld_stlouis-150x150.png" alt="" width="150" height="150" /></a>Zach Wilson (UUorld) showcased his <a href="http://www.uuorld.com">company&#8217;s</a> software that simplifies creating and exploring extrusion maps.  Among the several interesting applications of his software, Zach showed off a temporal visualization <a href="http://vimeo.com/4480815"> of the spread of swine flu in the United States</a> over the past several weeks.</p>
<p>In response to the critique that layering data dimensions on two-dimensional maps could be done more effectively by using other indicators such as color &#8212; instead of the simulation of a third dimension of height &#8212; Zach indicated that research has shown that physical dimensions (or their simulation) possess greater visual saliency to the human eye.</p>
<p>Zach also mentioned UUorld&#8217;s <a href="http://www.uuorld.com/portal">data portal</a>, which contains thousands of downloadable statistics from a variety of public sources, some of which have been used to generate UUorld visualizations.</p>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/YIddP1eXxpc" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/dataviz-sf-salon-no/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/dataviz-sf-salon-no/</feedburner:origLink></item>
		<item>
		<title>Color:  The Cinderella of dataviz</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/EIEl54Ti7Bg/</link>
		<comments>http://dataspora.com/blog/how-to-color-multivariate-data/#comments</comments>
		<pubDate>Sat, 14 Mar 2009 00:14:42 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[color theory]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[dataviz]]></category>

		<category><![CDATA[sabermetrics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=58</guid>
		<description><![CDATA[&#8220;Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.&#8221;  &#8212; Envisioning Information, Edward Tufte, Graphics Press, 1990   
Color is one of the most abused and neglected tools in data visualization.  It is abused when we make poor color choices; it is neglected when we rely on poor software [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>&#8220;Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.&#8221;  &#8212; <em>Envisioning Information</em>, Edward Tufte, Graphics Press, 1990   </p></blockquote>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png"><img class="alignnone size-full wp-image-73" title="stripcolor2d_4001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png" alt="multivariate color strip plot " width="400" height="185" /></a>Color is one of the most abused and neglected tools in data visualization.  It is abused when we make poor color choices; it is neglected when we rely on poor software defaults.  Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.</p>
<p>Most of us think twice before walking outside in fluorescent red underoos.  If only we were as cautious in choosing colors for infographics.  The difference is that few of us design our own clothes.  But until good palettes (like <a href="http://www.colorbrewer.org">ColorBrewer</a>) are commonplace, to get colors that fit our purposes, we must be our own tailors.</p>
<p>While obsessing about how to implement color on the <a href="http://labs.dataspora.com/gameday">Dataspora Labs&#8217; PitchFX viewer</a>, I began with a basic motivating question:<span id="more-58"></span></p>
<h3>Why use color in data graphics?</h3>
<p>If our data are simple, a single color is sufficient, even preferable.  For example, below is a scatter plot of 287 pitches thrown by the major league pitcher Oscar Villarreal in 2008.  With just two dimensions of data to describe &#8212; the x and y location in the strike zone &#8212; black and white is sufficient.  In fact, this scatter plot is a perfectly lossless representation of the data set (assuming no data points perfectly overlap).</p>
<p><strong>Fig 1. Location of Pitches </strong><strong>(Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/bwxy_250.png"><img class="alignnone size-full wp-image-59" title="bwxy_250" src="http://dataspora.com/blog/wp-content/uploads/2009/03/bwxy_250.png" alt="Simple black and white scatter plot" width="250" height="250" /></a></p>
<p>But what if we&#8217;d like to know more: for instance, what kinds of pitches (curveballs, fastballs) landed where?  Or their speed?  Visualizations live in two dimensions, but the world they describe is rarely so confined.</p>
<p><strong>The defining challenge of data visualization is projecting high dimensional data onto a low dimensional canvas.</strong> (As a rule, one should never do the reverse: visualize more dimensions than already exist in the data.)</p>
<p>Getting back to our pitching example, if we want to layer another dimension of data &#8212; pitch type &#8212; into our plot, we have several methods at our disposal:</p>
<ol>
<li><strong>plotting symbols </strong> - vary the glyphs that we use (circles, triangles, etc.),</li>
<li><strong>small multiples</strong> - vary extra dimensions in space, creating a series of smaller plots</li>
<li><strong>color</strong> - we can color our data, encoding extra dimensions inside a color space</li>
</ol>
<p>Which techniques you employ depends on the nature of the data and the medium of your canvas.  I will describe all three by way of example.</p>
<h3>Multivariate Method I:  Vary Your Plotting Symbols</h3>
<p><strong>Fig 2. Location and Pitch Type (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/glyphs_300.png"><img class="alignnone size-full wp-image-60" title="glyphs_300" src="http://dataspora.com/blog/wp-content/uploads/2009/03/glyphs_300.png" alt="Scatterplot with varied plotting symbols." width="300" height="300" /></a></p>
<p>In this plot, I&#8217;ve layered the categorical dimension of pitch type into our plot by using four different plotting symbols.</p>
<p>I consider this visualization an abject failure.  In fact, the prize for my most despised graphs in graduate school goes to <a href="http://www.rbej.com/content/figures/1477-7827-4-23-10-l.jpg"> bacterial growth curves rendered this way </a>.  These graphs make our heads hurt for two reasons: (i) distinguishing glyphs demands extra attention (versus what academics call &#8216;<a href="http://www.csc.ncsu.edu/faculty/healey/PP/index.html">pre-attentively processed</a>&#8217; cues like color), and (ii) even after we visually decode the symbols, we have yet another step: mapping symbols to their semantic categories.  (Admittedly this can be improved with <a href="http://eagereyes.org/VisCrit/ChernoffFaces.html">Chernoff faces</a> or other iconic symbols, where the categorical mapping is self-evident.)</p>
<h3>Multivariate Method II:  Small Multiples on a Canvas</h3>
<p>Folding additional dimensions into a partitioned canvas has a distinguished pedigree in information graphics.  It has been employed everywhere from <a href="http://hsci.ou.edu/images/jpg-100dpi-5in/17thCentury/Galileo/1613/Galileo-1613-Pt3-27.jpg"> Galileo&#8217;s sunspot illustrations </a> to William Cleveland&#8217;s trellis plots.  And as Scott McCloud&#8217;s unexpected <a href="http://www.amazon.com/Understanding-Comics-Invisible-Scott-Mccloud/dp/006097625X"> tour de force on comics </a> makes clear, panels of pictures possess a narrative power that a single, undivided canvas lacks.</p>
<p>In the plot below, the four types of pitches that Oscar throws are splintered horizontally.   By reducing our plot sizes, we&#8217;ve given up some resolution in positional information. But in return, patterns that were invisible in our first plot, and obscured in our second (by varied symbols), are now made clear (Oscar throws his fastballs low, but his sliders high).</p>
<p><strong>Fig 3:  Location and Pitch Type (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/strip_4002.png"><img class="alignnone size-full wp-image-70" title="strip_4002" src="http://dataspora.com/blog/wp-content/uploads/2009/03/strip_4002.png" alt="black and white strip plot" width="400" height="185" /></a></p>
<p>Multiplying plots in space works especially well on printed media, which can hold more than ten times as many dots per square inch as a screen.  Both columns and rows can be used to lattice over additional dimensions, the result being a <a href="http://dsarkar.fhcrc.org/lattice/book/images/Figure_06_07_stdBW.png"> matrix of scatter plots </a> (in R, see the &#8216;<a href="http://finzi.psych.upenn.edu/R/library/lattice/html/splom.html">splom</a>&#8217; function).</p>
<h3>Multivariate Method III: Color Your Data</h3>
<p><strong>So why bother with color?</strong></p>
<p>First, as compared to most print media, computer displays have fewer units of space, but a broader color gamut.  So color is a compensatory strength.</p>
<p>For multi-dimensional data, color can convey additional dimensions inside a unit of space &#8212; and can do so instantly.  Color differences can be detected within 200 ms, before you&#8217;re even conscious of paying attention (the &#8216;pre-attentive&#8217; concept I mentioned earlier).</p>
<p>But the most important reason to use color in multivariate graphics is that<strong> color is itself multidimensional</strong>.  Our perceptual color space &#8212; <a href="http://en.wikipedia.org/wiki/Opponent_process"> however </a><a href="http://en.wikipedia.org/wiki/RGB_color_model"> you </a><a href="http://en.wikipedia.org/wiki/HSL_and_HSV"> slice </a><a href="http://en.wikipedia.org/wiki/Lab_color_space"> it </a> &#8212; is three-dimensional.</p>
<p>In the example below, I&#8217;ve used color as a means of encoding a fourth dimension of our pitching data: the speed of pitches thrown. The palette I&#8217;ve chosen is a divergent palette that moves along one dimension (think of it as the &#8216;redness-blueness&#8217; dimension) in the <a href="http://en.wikipedia.org/wiki/CIELUV_color_space">CIELUV</a> color space, while maintaining a constant level of luminosity.</p>
<p><strong>Fig 4. Location, Pitch Type, and Velocity (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor1d_3001.png"><img class="alignnone size-full wp-image-69" title="keycolor1d_3001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor1d_3001.png" alt="isoluminant, diverging color ramp" width="300" height="150" /></a></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor1d_4002.png"><img class="alignnone size-full wp-image-71" title="stripcolor1d_4002" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor1d_4002.png" alt="color strip plot" width="396" height="187" /></a></p>
<p>Holding luminosity constant is important, because luminosity (similar to brightness) determines a color&#8217;s visual impact. Bright colors pop, and dark colors recede.  A color ramp that varies luminosity along with hue will highlight data points as an artifact of color choice.</p>
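<p>For readers who want to build such a ramp themselves, here is a sketch in Python of an isoluminant, diverging palette in polar CIELUV coordinates (lightness, chroma, hue).  The plots in this post were made in R; this is not that code.  The specific lightness, chroma, and hue values, and every function name below, are my own assumptions, and the conversion uses the standard D65 white point and sRGB matrix.</p>

```python
import math

# D65 reference white and its (u', v') chromaticity.
XN, YN, ZN = 0.95047, 1.0, 1.08883
_DEN = XN + 15 * YN + 3 * ZN
UN, VN = 4 * XN / _DEN, 9 * YN / _DEN


def _gamma(c):
    """Linear light -> sRGB-encoded value, clipped to [0, 1]."""
    c = min(max(c, 0.0), 1.0)
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055


def lch_uv_to_srgb(L, C, h_deg):
    """Polar CIELUV (lightness, chroma, hue in degrees) -> sRGB in [0, 1]."""
    if L <= 0:
        return (0.0, 0.0, 0.0)
    u = C * math.cos(math.radians(h_deg))
    v = C * math.sin(math.radians(h_deg))
    Y = YN * (((L + 16) / 116) ** 3 if L > 8 else L / 903.3)
    up = u / (13 * L) + UN
    vp = v / (13 * L) + VN
    X = Y * 9 * up / (4 * vp)
    Z = Y * (12 - 3 * up - 20 * vp) / (4 * vp)
    r = 3.2406 * X - 1.5372 * Y - 0.4986 * Z
    g = -0.9689 * X + 1.8758 * Y + 0.0415 * Z
    b = 0.0557 * X - 0.2040 * Y + 1.0570 * Z
    return tuple(_gamma(c) for c in (r, g, b))


def diverging_palette(n=7, L=70, max_chroma=40, h_low=260, h_high=10):
    """n colors from bluish through gray to reddish, all at lightness L."""
    colors = []
    for i in range(n):
        t = 2 * i / (n - 1) - 1  # runs from -1 to 1
        hue = h_low if t < 0 else h_high
        colors.append(lch_uv_to_srgb(L, abs(t) * max_chroma, hue))
    return colors


def relative_luminance(rgb):
    """Decode sRGB and weight the channels (Rec. 709 coefficients)."""
    lin = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
           for c in rgb]
    return 0.2126 * lin[0] + 0.7152 * lin[1] + 0.0722 * lin[2]


palette = diverging_palette()
hex_codes = ["#%02X%02X%02X" % tuple(round(255 * c) for c in rgb)
             for rgb in palette]
```

<p>Only chroma and hue change along the ramp, so <em>relative_luminance</em> returns nearly the same value for every entry.  If you raise the chroma, check that the colors stay inside the sRGB gamut: the clipping in <em>_gamma</em> would otherwise quietly break the constant-luminosity property.</p>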
<p>I chose only seven gradations of color, so I&#8217;m downsampling (in a lossy way) our speed data - but further segmentation of our color ramp is not likely to be perceptible.</p>
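<p>The downsampling step amounts to binning.  A sketch, assuming equal-width bins over a fixed speed range; the range, the sample speeds, and the <em>quantize</em> name are made up for illustration.</p>

```python
def quantize(value, lo, hi, n_bins=7):
    """Map a value in [lo, hi] to an equal-width bin index 0 .. n_bins-1."""
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    frac = (value - lo) / (hi - lo)
    # Clamp so out-of-range values land in the first or last bin.
    return min(max(int(frac * n_bins), 0), n_bins - 1)


speeds_mph = [77.0, 84.5, 85.5, 91.2, 97.0]  # made-up pitch speeds
bins = [quantize(s, 70.0, 98.0) for s in speeds_mph]
```

<p>Two pitches one mile per hour apart (84.5 and 85.5) fall in the same bin and so get the same color: that is the lossy part.</p>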
<p>I&#8217;ve also chosen to use filled circles as my plotting symbol, as opposed to the open circles in all my previous plots.  This is done to improve the perception of each pitch&#8217;s speed via its color: small patches of color are less perceptible.  But a consequence of this choice &#8212; compounded by our choice to work with a series of smaller plots &#8212; is that more points overlap.  We&#8217;ve further degraded some of our positional information.  However, in our last step, we attempt to recover some of this.</p>
<p>Now I&#8217;ve finally brought color to bear on this visualization, but I&#8217;ve only encoded a single dimension &#8212; speed.  Which leads to another question:</p>
<h3>If color is three-dimensional, can I encode three dimensions with it?</h3>
<p>In theory, yes.  <a href="http://dataspora.com/blog/wp-content/uploads/2009/03/ware_infoviz_p142.jpg">Colin Ware researched this exact question</a>.  In practice, it&#8217;s difficult.  It turns out that asking observers to assess the amount of &#8216;redness&#8217;, &#8216;blueness&#8217;, and &#8216;greenness&#8217; of points is possible, but not intuitive (I suspect it&#8217;s somewhat like parsing symbols).</p>
<p>Another complicating factor is that a nontrivial fraction of the population has some form of color blindness.  This effectively reduces their color perception to two dimensions.</p>
<p>And finally, the truth is that our sensation of color is not equal along all dimensions; it&#8217;s thought the closely related &#8216;red&#8217; and &#8216;green&#8217; receptors emerged via duplication of the single long wavelength receptor (useful for distinguishing ripe from unripe fruits, according to one just-so story).</p>
<p>Because of the high level of dichromacy in the population, and because of the challenge of encoding three dimensions in color, I feel color is best used to encode no more than two dimensions of data.</p>
<p>So, for my last example of our pitching plot data, I will introduce luminosity as a means of encoding the local density of points (using a kernel density estimator).  This allows us to recover some of the data lost by increasing the sizes of our plotting symbols.</p>
<p><strong>Fig 5. Location, Pitch Type, Velocity, and Density (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor2d_3001.png"><img class="alignnone size-full wp-image-72" title="keycolor2d_3001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor2d_3001.png" alt="two-dimensional color palette" width="291" height="278" /></a></p>
<p><span style="text-decoration: underline;"><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png"><img class="alignnone size-full wp-image-73" title="stripcolor2d_4001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png" alt="multivariate color strip plot " width="400" height="185" /></a><br />
</span></p>
<p>Here we have effectively employed a two-dimensional color palette, with blueness-redness varying along one axis for speed, and luminosity varying in the other to denote local density.</p>
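<p>The density half of that palette can be sketched as follows, assuming a plain Gaussian kernel with a hand-picked bandwidth and a linear map from density to lightness.  The point locations, the bandwidth, the lightness range, and both function names are illustrative assumptions rather than the settings used for the figures.</p>

```python
import math


def kde_density(points, bandwidth):
    """Gaussian kernel density estimate evaluated at each input point."""
    norm = 1.0 / (len(points) * 2 * math.pi * bandwidth ** 2)
    out = []
    for x0, y0 in points:
        total = sum(
            math.exp(-((x0 - x) ** 2 + (y0 - y) ** 2) / (2 * bandwidth ** 2))
            for x, y in points
        )
        out.append(norm * total)
    return out


def density_to_lightness(densities, light=85.0, dark=40.0):
    """Denser neighborhoods get darker: linear map from density to L."""
    lo, hi = min(densities), max(densities)
    span = (hi - lo) or 1.0  # all-equal densities map to the light end
    return [light - (light - dark) * (d - lo) / span for d in densities]


# Three clustered pitches and one isolated pitch (made-up locations).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0)]
lightness = density_to_lightness(kde_density(points, bandwidth=0.5))
```

<p>Each resulting lightness would be paired with the hue and chroma chosen for that pitch&#8217;s speed, giving the two-dimensional palette described above.</p>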
<p>One final point about using luminosity.  Observing colors in a data visualization involves overloading, in the programming sense.  We rely on cognitive functions that were developed for one purpose (perceiving lions) and use them for another (perceiving lines).</p>
<p>Since we can overload color any way we want, whenever possible,  we should choose mappings that are natural.  Mapping pitch density to luminosity feels right because the darker shadows in our pitch plots imply depth.  Likewise, when sampling from the color space, we might as well choose colors found in nature.  These are the palettes our eyes were gazing at for the millions of years before #FF0000 showed up.</p>
<p>Color, used thoughtfully and responsibly, can be an incredibly valuable tool in visualizing high dimensional data.</p>
<h3>FutureMan Asks:  What about Animation?</h3>
<p>This discussion has focused on using static graphics in general, and color in particular, as a means of visualizing multivariate data.  I&#8217;ve purposely neglected one very powerful tool:  motion. The ability to animate graphics multiplies by several orders of magnitude the amount of information that can be packed into a visualization.   But packing  information into a time-varying data structure has to be done by someone (you or me) and from my view, this remains a significant challenge.  Canonical forms of animated visualizations (equivalent to the histograms, box plots, and scatterplots of the static world) are still a ways off, but frameworks like <a href="http://processing.org">Processing</a> and <a href="http://prefuse.org/">Prefuse</a> are a promising start towards their development.</p>
<h3>Methods</h3>
<p>The final product of these five-dimensional pitch plots &#8212; for all available data for the 2008 season &#8212; can be explored via the <a href="http://labs.dataspora.com/gameday">PitchFX</a> Django-driven web tool at Dataspora labs.</p>
<p>All of the visualizations here were developed using R and the Lattice graphics package.  (Of note, Hadley Wickham is developing <a href="http://had.co.nz/ggplot2/">ggplot2</a>, a bold re-write of the R graphics system based on a grammar of graphics).</p>
<h3>References for Further Reading</h3>
<ul>
<li>Ross Ihaka - <a href="http://www.stat.auckland.ac.nz/~ihaka/120/lectures.html">Lectures on Information Visualization</a>, Lectures 12-14</li>
</ul>
<ul>
<li>Colin Ware - <a href="http://www.amazon.com/Information-Visualization-Second-Interactive-Technologies/dp/1558608192"> Information Visualization</a>, Ch. 4</li>
</ul>
<ul>
<li>Edward Tufte - <a href="http://www.amazon.com/Envisioning-Information-Edward-R-Tufte/dp/0961392118"> Envisioning Information</a>, Ch. 4</li>
</ul>
<ul>
<li> Deepayan Sarkar - <a href="http://lmdvr.r-forge.r-project.org">Lattice: Multivariate Data Visualization with R</a> (web site with code)</li>
</ul>
<ul>
<li>Maureen Stone - <a href="http://www.stonesc.com/">StoneSoup Consulting </a> (color consultant to Tableau Software)</li>
</ul>
<ul>
<li>Stephen Few - <a href="http://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167"> Information Dashboard Design</a>, Ch. 4</li>
</ul>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/EIEl54Ti7Bg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/how-to-color-multivariate-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/how-to-color-multivariate-data/</feedburner:origLink></item>
		<item>
		<title>People who love scatter plots &amp; connecting dots</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/_ps1Q8A3iHQ/</link>
		<comments>http://dataspora.com/blog/dataviz-sf/#comments</comments>
		<pubDate>Fri, 20 Feb 2009 06:02:34 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[dataviz]]></category>

		<category><![CDATA[sabermetrics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=54</guid>
		<description><![CDATA[
We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop  Shane Booth, dataviz wiz  Lee Byron , computational journalist Brad Stenger, data wrangler  Pete Skomoroch , and any/all data enthusiast  Brendan O&#8217;Connor .
I was going to blog all about it &#8212; but Tom Carden of [...]]]></description>
			<content:encoded><![CDATA[<p><img title="dataviz-sf" src="http://dataspora.com/blog/wp-content/uploads/2009/02/dataviz_salon_poster_smal.jpg" alt="" /><br />
We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop <a href="http://criminalizeboring.tumblr.com/"> Shane Booth</a>, dataviz wiz <a href="http://www.leebyron.com"> Lee Byron </a>, computational journalist <a href="http://nbagraphs.tumblr.com">Brad Stenger</a>, data wrangler <a href="http://www.datawrangling.com"> Pete Skomoroch </a>, and any/all data enthusiast <a href="http://www.anyall.org/blog"> Brendan O&#8217;Connor </a>.</p>
<p>I was going to blog all about it &#8212; but <a href="http://www.tom-carden.co.uk/2009/02/18/dataviz-salon-sf-1/">Tom Carden of Stamen Design already has a great write-up</a>.</p>
<blockquote><p>&#8230; Dataspora invited a few people to a Dataviz Salon yesterday evening. Mike and I went along and huddled in a brick-built basement in SoMa to listen to <a href="http://www.tom-carden.co.uk/2009/02/18/dataviz-salon-sf-1/">the following</a>:</p></blockquote>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/_ps1Q8A3iHQ" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/dataviz-sf/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/dataviz-sf/</feedburner:origLink></item>
		<item>
		<title>How Google and Facebook are using R</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/GeD2DzlYIYs/</link>
		<comments>http://dataspora.com/blog/predictive-analytics-using-r/#comments</comments>
		<pubDate>Fri, 20 Feb 2009 03:11:03 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[prediction]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=49</guid>
		<description><![CDATA[
(March 26th Update:  Video now available)   Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled &#8220;The R and Science of Predictive Analytics&#8221;, co-located with the  Predictive Analytics World  conference here in SF.
The panel comprised four recognized R users from industry:

 Bo [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/02/decision-tree.png"><img class="alignleft size-thumbnail wp-image-53" title="decision-tree" src="http://dataspora.com/blog/wp-content/uploads/2009/02/decision-tree-150x150.png" alt="" width="150" height="150" /></a><br />
<strong><a href="http://www.lecturemaker.com/2009/02/r-kickoff-video/">(March 26th Update:  Video now available)</a></strong>  <br /> Last night, I moderated our <a href="http://www.meetup.com/R-Users">Bay Area R Users Group</a> kick-off event with a panel discussion entitled &#8220;The R and Science of Predictive Analytics&#8221;, co-located with the <a href="http://www.predictiveanalyticsworld.com"> Predictive Analytics World </a> conference here in SF.</p>
<p>The panel comprised four recognized R users from industry:</p>
<ul>
<li> Bo Cowgill, Google</li>
<li> Itamar Rosenn, Facebook</li>
<li> David Smith, Revolution Computing</li>
<li> Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)</li>
</ul>
<p>The panelists were asked to explain how they use R for predictive analytics within their firms, to describe its strengths and weaknesses as a tool, and to provide a case study.  What follows is my summary with comments.</p>
<p><span id="more-49"></span></p>
<p><em> Panel Introduction </em></p>
<p>I began by describing R as a programming language with strengths in three areas: (i) data manipulation, (ii) statistics, and (iii) data visualization.</p>
<p>What sets it apart from other data analysis tools?  It was developed by statisticians, it&#8217;s free software, and it is extensible via user-developed packages &#8212; there are nearly 2000 of them as of today at the <a href="http://cran.r-project.org"> Comprehensive R Archive Network </a> or CRAN.</p>
<p>Many of these packages can be used for predictive analytics.  Jim highlighted Max Kuhn&#8217;s <a href="http://caret.r-forge.r-project.org"> caret package </a>, which provides a wrapper for accessing dozens of classification and regression models, from neural networks to naive Bayes.</p>
<p><em> Bo Cowgill, Google </em></p>
<p>R is the most popular statistical package at Google, according to Bo Cowgill, and indeed Google is a donor to the R Foundation.  He remarked that &#8220;The best thing about R is that it was developed by statisticians.  The worst thing about R is that&#8230; it was developed by statisticians.&#8221;  Nonetheless, he&#8217;s optimistic: as the R developer community has expanded, R&#8217;s documentation and performance have both improved.</p>
<p>One theme that Bo first brought up, but which was echoed by others, was that while Google uses R for data exploration and model prototyping, it is not typically used in production: in Bo&#8217;s group, R is typically run in a desktop environment.</p>
<p>The typical workflow that Bo described for using R was: (i) pulling data with some external tool, (ii) loading it into R, (iii) performing analysis and modeling within R, and (iv) implementing the resulting model in Python or C++ for a production environment.</p>
<p><em> Itamar Rosenn, Facebook </em></p>
<p>Itamar conveyed how Facebook&#8217;s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and  (ii) if they stay, which data points predict how active they&#8217;ll be after three months?</p>
<p>For the first question, Itamar&#8217;s team used recursive partitioning (via the <a href="http://cran.r-project.org/web/packages/rpart">rpart</a> package) to infer that just two data points are significantly predictive of whether a user remains on Facebook:  (i) having more than one session as a new user, and (ii) entering basic profile information.</p>
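<p>To make the idea of recursive partitioning concrete, here is a minimal sketch in Python (not R, and not the rpart implementation) of the single step such a method repeats: choosing the yes/no feature that most reduces Gini impurity.  The feature names, the eight users, and their outcomes are invented for illustration and are not Facebook data.</p>
<pre>
```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcomes (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (feature, impurity_drop) for the yes/no feature that best separates labels."""
    parent = gini(labels)
    best = (None, 0.0)
    for feature in rows[0]:
        yes = [y for row, y in zip(rows, labels) if row[feature]]
        no = [y for row, y in zip(rows, labels) if not row[feature]]
        # Impurity after the split, weighted by how many users land on each side.
        weighted = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        drop = parent - weighted
        if drop > best[1]:
            best = (feature, drop)
    return best


# Invented toy data, NOT Facebook's: eight hypothetical new users.
FEATURES = ("multiple_sessions", "has_profile_info", "uploaded_photo")
RAW = [
    (1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 0, 0),
    (0, 1, 1), (0, 0, 0), (0, 0, 1), (0, 1, 0),
]
USERS = [dict(zip(FEATURES, values)) for values in RAW]
STAYED = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = the user stayed, 0 = the user left

if __name__ == "__main__":
    feature, drop = best_split(USERS, STAYED)
    print(f"first split: {feature} (impurity drop {drop:.3f})")
```
</pre>
<p>A full tree would apply the same step again to each side of the chosen split, stopping when no remaining feature reduces impurity by a useful amount.</p>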
<p>For the second question, they fit a logistic model to the data using a least angle regression approach (via the <a href="http://cran.r-project.org/web/packages/lars"> lars </a> package), and found that activity at three months was predicted by variables related to three classes of behavior: (i)  how often others reached out to a user, (ii) frequency of third party application use, and (iii) what Itamar termed &#8220;receptiveness&#8221; &#8212; related to how forthcoming a user was on the site.</p>
<p><em> David Smith, Revolution Computing </em></p>
<p>David&#8217;s firm, Revolution Computing, not only uses R, but R is their core business.  David said that &#8220;we are to R what Red Hat is to Linux&#8221;.  His firm addresses some of the pain points of using R, such as (i) supporting older versions of the software and (ii)  providing parallel computing in R through their ParallelR suite.</p>
<p>David showcased how one of their life sciences clients used R to classify genomic data through use of the <a href="http://cran.r-project.org/web/packages/randomForest"> randomForest </a> package, and how the analysis of classification trees could be easily parallelized using their &#8216;foreach&#8217; package.</p>
<p>He also mentioned that several firms they have worked with do use R in production environments, whereby a particular script is exposed on a server, and a client calls it with some data to return a result (several ways exist to set up R in a client-server manner, such as <a href="http://cran.r-project.org/web/packages/Rserve"> RServe </a>, <a href="http://biostat.mc.vanderbilt.edu/rapache/"> rapache </a>, and <a href="http://biocep-distrib.r-forge.r-project.org/"> Biocep</a>).</p>
<p>David evangelizes and educates about R at the <a href="http://blog.revolution-computing.com"> Revolutions blog </a>.</p>
<p><em> Jim Porzak, The Generations Network </em></p>
<p>Jim, who also co-chairs the R Users Group, gave a brief overview of his <a href="http://www.predictiveanalyticsworld.com/agenda.php#sun"> PAW talk </a> on using R for marketing analytics.  In particular, Jim has used the <a href="http://cran.r-project.org/web/packages/flexclust"> flexclust </a> package to cluster customer survey data for Sun Microsystems, and to apply the resulting profiles to identify high-value sales leads.</p>
<p>During the Q &amp; A session, the panelists were asked several questions.</p>
<p><em><strong>How do you work around R&#8217;s memory limitations?</strong> (R workspaces are stored in RAM, and thus their size is limited)</em></p>
<p>Three responses were given (including one from the audience):</p>
<p>(i) use R&#8217;s database connectivity (e.g. <a href="http://cran.r-project.org/web/packages/RMySQL">RMySQL</a>), and pull in only slices of your data, (ii) downsample your data (do you really need a billion data points to test your model?), or (iii) run your scripts on a RAM-obsessed colleague&#8217;s machine or fire up a <a href="http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/"> virtual server on Amazon&#8217;s compute cloud </a> with up to 15 gigabytes of RAM.</p>
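<p>The first two workarounds are not specific to R.  As a rough sketch of the pattern, the Python below uses the standard-library sqlite3 module as a stand-in for RMySQL: it reads a query result one slice at a time and keeps a fixed-size random sample of the stream.  The table name, column names, and helper functions are hypothetical.</p>
<pre>
```python
import random
import sqlite3


def iter_slices(conn, query, slice_size=1000):
    """Yield query results one slice at a time instead of loading them all."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(slice_size)
        if not rows:
            break
        yield rows


def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k rows from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for n, row in enumerate(rows, start=1):
        if k >= n:
            sample.append(row)        # fill the reservoir with the first k rows
        else:
            j = rng.randrange(n)      # row n is kept with probability k/n
            if k > j:
                sample[j] = row
    return sample


def build_demo_db(num_rows=10_000):
    """Create a throwaway in-memory table (hypothetical schema) to read from."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value REAL)")
    conn.executemany(
        "INSERT INTO events (id, value) VALUES (?, ?)",
        ((i, i * 0.5) for i in range(num_rows)),
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    query = "SELECT id, value FROM events"
    # Flatten the slices into one stream; only one slice is held at a time.
    stream = (row for rows in iter_slices(conn, query) for row in rows)
    sample = reservoir_sample(stream, k=100)
    print(f"kept {len(sample)} sampled rows")
```
</pre>
<p>Only one slice plus the sample is held in memory at any moment, which is the point of both suggestions.</p>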
<p><em><strong>What&#8217;s the general ramp-up process for groups wanting to use R?</strong></em></p>
<p>Itamar and Bo both indicated that within their groups, almost everyone arrived having learned R in their university studies.  Jim Porzak led an R tutorial within his last firm using an internal slide deck.</p>
<p><em><strong>How easy is it for developers who are not statisticians to learn R?</strong></em></p>
<p>The consensus seemed to be that R is a difficult language to achieve competency in, vis-a-vis Python, Perl, or other high-level scripting languages.   Jim emphasized, however, that he is not a statistician - nor were any of our panelists.  (As a non-statistician R user myself, I will say this &#8212; a consequence of learning R is an improved grasp of statistics.  Knowing statistics is a necessary prerequisite for understanding R&#8217;s features, from its data types to its modeling syntax).</p>
<p><em><strong>How well does R interface with other tools and languages?</strong></em></p>
<p>There are several packages on CRAN for importing and exporting data to and from Matlab (<a href="http://cran.r-project.org/web/packages/R.matlab/"> R.matlab</a>), Splus, SAS, Excel and other tools.  In addition, there are interfaces for running R within Python (<a href="http://rpy.sourceforge.net/"> RPy </a>) and Java (<a href="http://www.rforge.net/rJava/"> rJava </a>).</p>
<p>The panelists mentioned that they typically run R within a GUI, either <a href="http://en.wikipedia.org/wiki/R_Commander"> RCommander </a> or <a href="http://rattle.togaware.com"> Rattle </a>.  (Aside: I run R exclusively in emacs using <a href="http://ess.r-project.org/"> ESS </a> &#8212; incidentally, one of its authors was panelist David Smith).</p>
<p><a href="http://www.lecturemaker.com/2009/02/r-kickoff-video/">A video of the event is now available</a> courtesy of <a href="http://www.lecturemaker.com"> Ron Fredericks</a> and LectureMaker.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/predictive-analytics-using-r/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/predictive-analytics-using-r/</feedburner:origLink></item>
		<item>
		<title>Is Big Data at a tipping point?</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/AEiXJbTfbU8/</link>
		<comments>http://dataspora.com/blog/tipping-points-and-big-data/#comments</comments>
		<pubDate>Fri, 09 Jan 2009 07:01:03 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[bigdata]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=46</guid>
		<description><![CDATA[
(5/18/09 update - included an overdue reference to linked data!) 
Stuart Kauffman, in one of his books about complexity, discusses tipping points in networks &#8212; what he calls a phase transition &#8212; by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal"><a href="http://dataspora.com/blog/wp-content/uploads/2009/01/buttons_sketch.png"><img class="alignleft size-medium wp-image-45" style="float: left;" title="buttons_sketch" src="http://dataspora.com/blog/wp-content/uploads/2009/01/buttons_sketch.png" alt="" width="250" height="166" /></a></p>
<p class="MsoNormal"><em><span style="color: #808080;">(5/18/09 update - included an overdue reference to linked data!) </span></em></p>
<p class="MsoNormal"><em><span style="color: #808080;"><span style="color: #000000; font-style: normal;">Stuart Kauffman, in <a href="http://books.google.com/books?id=FxvENHL0qzYC">one of his books about complexity</a>, discusses tipping points in networks &#8212; what he calls a phase transition &#8212; by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string at random. At first, you have just pairs of buttons.   Then, you have clusters of threes, which in turn get tied into ever larger clumps. The question is: How long until picking any button off the floor pulls them all off together, in one connected mass?</span></span></em></p>
<p class="MsoNormal">It turns out that this supercluster of buttons doesn’t build gradually as we tie more threads; it emerges suddenly.  This rapid phase transition, from relatively unconnected to mostly connected, occurs right around where we have about half as many threads as buttons (see figure).  This is the tipping point of the system:  where a few threads make a big difference.</p>
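<p class="MsoNormal">The button experiment is easy to try in code.  The following is a minimal sketch, assuming threads join two distinct buttons chosen uniformly at random; the function name and the use of a union-find structure are my own choices here, not Kauffman's.  It reports the size of the largest clump for several thread-to-button ratios.</p>
<pre>
```python
import random


def largest_cluster(num_buttons, num_threads, seed=0):
    """Tie random pairs of buttons together; return the size of the biggest clump."""
    rng = random.Random(seed)
    parent = list(range(num_buttons))   # union-find forest: each button starts alone
    size = [1] * num_buttons

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _ in range(num_threads):
        a, b = rng.sample(range(num_buttons), 2)   # two distinct buttons
        ra, rb = find(a), find(b)
        if ra != rb:
            if size[rb] > size[ra]:
                ra, rb = rb, ra                    # attach the smaller clump to the larger
            parent[rb] = ra
            size[ra] += size[rb]
    return max(size[find(i)] for i in range(num_buttons))


if __name__ == "__main__":
    buttons = 400
    for ratio in (0.1, 0.25, 0.5, 0.75, 1.0, 1.5):
        threads = int(buttons * ratio)
        biggest = largest_cluster(buttons, threads)
        print(f"threads/buttons = {ratio:4.2f}  largest cluster = {biggest:3d} of {buttons}")
```
</pre>
<p class="MsoNormal">Running it with different seeds should show the same qualitative picture: small clumps at low ratios, and a single dominant cluster once the ratio moves past roughly one half.</p>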
<p class="MsoNormal"><a href="http://dataspora.com/blog/wp-content/uploads/2009/01/phase_transition_kauffman.png"><img class="alignleft alignnone size-full wp-image-44" style="float: left;" title="phase_transition_kauffman" src="http://dataspora.com/blog/wp-content/uploads/2009/01/phase_transition_kauffman.png" alt="" width="300" height="174" /></a>A similar phase transition has already occurred with regard to data inside business ecosystems. For the past several decades, an increasing number of business processes, from sales to customer service to shipping, have come online, along with the data they throw off.  As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center.  And every action &#8212; sales lead, mouse click, and shipping update  &#8212; is stored.  The result:  organizations are overwhelmed by what feels like a tsunami of data.</p>
<p class="MsoNormal">The same trend is occurring in the larger universe of data that these organizations inhabit.  <a href="http://www.nature.com/nature/journal/v455/n7209/full/455001a.html">Big Data</a> is being unleashed by the <a href="http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/">“Industrial Revolution of Data”</a>, whether from public agencies, non-profit institutes, or forward-thinking private firms.</p>
<p class="MsoNormal">At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It&#8217;s frozen because incompatible formats and meta-data standards make it hard for data to flow from one place to another:  comparing the SEC&#8217;s financial data with Europe&#8217;s requires common formats and labels (ahem, <a href="http://blogmaverick.com/2008/12/16/the-sec-madoff-and-xbrl/">XBRL</a>) that don&#8217;t yet exist. Data is “underwater” when, whether for reasons of competitiveness, privacy, or sheer incompetence, it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling <a href="http://content.nejm.org/cgi/content/full/359/20/2105">studies with huge cohorts</a>).</p>
<p class="MsoNormal">Yet there&#8217;s a slow thaw underway as evidenced by a number of initiatives:  <a href="http://theinfo.org">Aaron Swartz’s theinfo.org</a>, <a href="http://infochimps.org">Flip Kromer’s infochimps</a>, <a href="http://bulk.resource.org">Carl Malamud’s bulk.resource.org</a>, the <a href="http://www.linkedata.org">Tim-Berners-Lee-inspired LinkedData.org</a>, as well as <a href="http://www.numbrary.com">Numbrary</a>, <a href="http://www.swivel.com">Swivel</a>, <a href="http://www.freebase.com">Freebase</a>, and Amazon’s <a href="http://aws.amazon.com/publicdatasets/">public data sets</a>.  These are all ambitious projects, but the challenge of weaving these data sets together is still greater.</p>
<p class="MsoNormal">How far are we from the tipping point of Big Data? When will the world’s icebergs of data melt into one sea? More importantly, when it happens, will we be ready to do something useful with it all?</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/tipping-points-and-big-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/tipping-points-and-big-data/</feedburner:origLink></item>
		<item>
		<title>What can Darwin’s finches tell us about the downturn?</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/2CPdDM4cp0g/</link>
		<comments>http://dataspora.com/blog/darwin-and-the-business-cycle/#comments</comments>
		<pubDate>Fri, 21 Nov 2008 02:26:01 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[evolution]]></category>

		<category><![CDATA[natural selection]]></category>

		<category><![CDATA[productivity paradox]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=39</guid>
		<description><![CDATA[Newspaper articles paint the markets in metaphors like “difficult climate” and “harsh landscape” –but these clichéd phrases have a kernel of truth.   Thinking about markets as natural environments reveals that selective forces are at work.  But it also predicts when they work.  In the natural world, as the story of Darwin&#8217;s finches tells us, selection [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal"><a href="http://dataspora.com/blog/wp-content/uploads/2008/11/darwin.jpg"><img class="alignleft size-medium wp-image-40" style="float: left;" title="Charles Darwin" src="http://dataspora.com/blog/wp-content/uploads/2008/11/darwin-227x300.jpg" alt="" width="140" height="186" /></a>Newspaper articles paint the markets in metaphors like “difficult climate” and “harsh landscape” –but these clichéd phrases have a kernel of truth.   Thinking about markets as natural environments reveals that selective forces are at work.  But it also predicts <em>when </em>they work.  In the natural world, as the story of Darwin&#8217;s finches tells us, selection acts in times of crisis:  drought, famine, and disease.  For our markets, that time is now.</p>
<p class="MsoNormal">(Aside:  I confess that relating the economic crisis to Darwin is a symptom of an academic bad habit:  namely, mapping every phenomenon onto the intellectual giant of one&#8217;s field.  Somewhere there is a psychologist blogging about Freud and the economy).</p>
<p><span id="more-39"></span></p>
<p class="MsoNormal">When does natural selection act?   This question motivated two modern naturalists, Peter and Rosemary Grant, who studied Darwin&#8217;s finches over several decades on the Galapagos islands, and whose work is chronicled in Jonathan Weiner’s <a href="http://www.amazon.com/Beak-Finch-Story-Evolution-Time/dp/067973337X">The Beak of the Finch</a>.</p>
<p class="MsoNormal">During the wet seasons, it was hard to see how a finch&#8217;s beak made any difference to its fitness.</p>
<p class="MsoNormal">“[F]inches with long thin beaks and short fat parrot-like beaks [were] all hopping on the same lava, eating identical bird food… All those beaks were cracking the same birdseed.” [p.52]</p>
<p class="MsoNormal">A long line of ornithologists had concluded that the beak of the finch was unimportant.</p>
<p class="MsoNormal">But despite this, Peter and Rosemary Grant kept returning to the islands, and kept measuring beaks. In 1977, the rainy season brought no rain. Weiner describes what the naturalists witnessed:</p>
<blockquote>
<p class="MsoNormal">“They found fewer than two hundred finches alive on the island. Just one finch in seven had made it through the drought… The average beak before the drought was 10.68 millimeters long and 9.42 deep. The average beak of the <em>fortis </em>that survived was 11.07 millimeters long and 9.96 deep&#8230; The birds were not simply magnified by the drought: they were reformed and revised. They were changed by their dead. Their beaks were carved by their losses.” [p.78]</p>
</blockquote>
<p class="MsoNormal">The drought was the crucible that shaped the species. And it wasn’t simply size, but dimension (longer and deeper beaks, versus wider) that separated the survivors from the dead.</p>
<p class="MsoNormal">In the same way, the benefits of new technologies are often masked during good times. Firms with both new and old technologies remain solidly profitable, happily hopping along. Like ornithologists watching finches in the wet season, some analysts have questioned whether technological innovation even matters. Robert Solow summed up this paradox by quipping “You can see the computer age everywhere but in the productivity statistics.&#8221;</p>
<p class="MsoNormal">But when hard times hit, innovators survive. More importantly, they flourish when the business cycle swings up again. <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=290325">Work by Erik Brynjolfsson</a> and others has shown strong positive evidence for technology’s impact on productivity, most markedly over five-to-seven-year periods – the resonant frequency of the business cycle. But like Darwin’s finches, the survivors are not just those who have more technology investments, but those who get the dimensions right.</p>
<p class="MsoNormal">Downturns are not only good for innovation, they are necessary.  While innovation may occur in times of plenty, crises allow the right innovations (hybrid cars) to outcompete the wrong ones (SUVs).  This assumes that crises are allowed to run their course (the case against bailouts), but that there are at least some survivors (the case for them).</p>
<p class="MsoNormal">As a data guy, I&#8217;m cautiously optimistic that firms who have invested in analytics, who have quietly innovated in understanding their business data, will emerge as winners on the other side of this downturn.  As a <a href="http://en.wikipedia.org/wiki/Friedrich_Nietzsche">contemporary of Darwin&#8217;s</a> said, &#8220;That which does not kill us makes us stronger.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/darwin-and-the-business-cycle/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/darwin-and-the-business-cycle/</feedburner:origLink></item>
		<item>
		<title>What I’ll be presenting at O’Reilly Money Tech 2009</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/xswZMh8eoBQ/</link>
		<comments>http://dataspora.com/blog/what-ill-be-presenting-at-oreilly-money-tech-2009/#comments</comments>
		<pubDate>Tue, 21 Oct 2008 10:56:37 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[bigdata]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=37</guid>
		<description><![CDATA[
(April 2009 Update:  Unfortunately, The Money Tech Conference was indefinitely postponed, but fortunately I will be presenting a version of this talk in July at OSCON 2009).
I’ve been invited to speak at O’Reilly’s Money Tech conference this coming February 4-6th in New York City and thought I’d share the abstract for my talk here.  I’ll likely [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2008/10/oreilly.gif"><img class="alignnone size-medium wp-image-38" title="oreilly" src="http://dataspora.com/blog/wp-content/uploads/2008/10/oreilly-300x60.gif" alt="" width="300" height="60" /></a></p>
<p>(<strong>April 2009 Update:</strong>  Unfortunately, the Money Tech conference was indefinitely postponed, but fortunately I will be presenting a <a href="http://en.oreilly.com/oscon2009/public/schedule/speaker/33953">version of this talk</a> in July at OSCON 2009).</p>
<p>I’ve been invited to speak at <a onclick="javascript:pageTracker._trackPageview('/outbound/article/en.oreilly.com');" href="http://en.oreilly.com/money2009">O’Reilly’s Money Tech</a> conference this coming February 4-6th in New York City and thought I’d share the abstract for my talk here.  I’ll likely be in New York for several days; if you’d like to get together to chat about data, drop me a line!</p>
<p>My talk is entitled &#8220;Open Source Analytics: Visualization and Predictive Modeling of Big Data with the R Programming Language&#8221;<br />
<span id="more-37"></span></p>
<p><strong>ABSTRACT</strong></p>
<p>Just as the explosion of online data catalyzed the development of<br />
storage technologies such as Hadoop, new challenges in data analytics<br />
(turning terabytes into actionable insights) demand new tools.  R,<br />
an open-source language for statistical computing and graphics, is an<br />
extensible, embeddable, and industry-strength solution for analytics.<br />
In this session, I showcase R&#8217;s power by building predictive models<br />
for Brazilian soybean harvests and baseball slugger salaries.</p>
<p><strong>DESCRIPTION</strong></p>
<p>The economics of data aggregation and analysis are being disrupted by<br />
falling costs for storage and CPU power, the continuing shift of<br />
business processes online, and the deluge of data that is being<br />
generated as a consequence.</p>
<p>Satellite images, SEC filings, supply chain data (RFID data streams),<br />
online prices, and newsgroup content represent just a few of the data<br />
sources that hold potential for predictive modeling of markets.</p>
<p>Much of this data does not fit within existing paradigms for business<br />
analysis: either its size overwhelms traditional desktop tools such as<br />
Excel, or else its unique dimensions (such as geocodes) prevent its<br />
being pipelined into more powerful, but narrowly designed, analysis<br />
tools.  Finally, closed-source tools cannot keep pace with the leading<br />
edge of innovation in statistical and machine-learning algorithms.</p>
<p>Enter the open source programming language R.  R has been dubbed the<br />
lingua franca for statistical computing and graphical analysis, with a<br />
pedigree tracing back several decades at Bell Labs.  Though its<br />
million-plus users are concentrated within academia, R is gaining<br />
currency within several high-profile quantitative analysis groups,<br />
including Google&#8217;s Customer Insights team and Barclays Global<br />
Investors.  In addition, R&#8217;s extensibility via user-contributed<br />
packages has spawned an active developer community.</p>
<p>In this session, I will focus on applying R&#8217;s powerful visualization<br />
tools to guide the construction of predictive models, using the kind<br />
of large, multidimensional data sets that increasingly confront<br />
quantitative analysts.  Along the way, I will highlight R&#8217;s packages<br />
for inferential statistics, its compact modeling syntax, and its ease<br />
of connectivity with persistent data stores.</p>
<p>The two specific examples I will discuss are:</p>
<p>- an analysis of NASA&#8217;s Landsat imagery of Brazil&#8217;s center-west<br />
agricultural regions to detect correlates for soybean harvest yields,<br />
and a derived predictor of the Brazilian soybean market based in part<br />
on these correlates.</p>
<p>- a validation of Bill James&#8217; sabermetrics approach to batting<br />
performance using 30 years of Major League Baseball statistics, and a<br />
derived predictor for batters&#8217; salaries.</p>
<p>For all of its strengths, R has an admittedly steep learning curve.<br />
While source code for the examples will be provided online, this talk<br />
will emphasize techniques and working examples over technical details.<br />
The goal of this session is to give quantitative analysts the courage<br />
to invest in learning the R language, by showcasing R&#8217;s power,<br />
highlighting its features, and providing examples of its use for<br />
innovative applications.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/what-ill-be-presenting-at-oreilly-money-tech-2009/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/what-ill-be-presenting-at-oreilly-money-tech-2009/</feedburner:origLink></item>
	</channel>
</rss>
