<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Marginally Interesting by Mikio L. Braun</title>
 
 <link href="http://blog.mikiobraun.de/" />
 <updated>2012-01-24T10:46:33+01:00</updated>
 <id>http://blog.mikiobraun.de/</id>
 <author>
   <name>Mikio L. Braun</name>
   <uri>http://mikiobraun.de/</uri>
   <email>mikiobraun@gmail.com</email>
 </author>
 
 
   <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/MarginallyInteresting" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="marginallyinteresting" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
   <title type="html">What it means to do a Ph.D. - psychologically</title>
   <link href="http://blog.mikiobraun.de/2012/01/what-it-means-to-do-a-phd.html" />
   <updated>2012-01-24T10:23:00+01:00</updated>
   <published>2012-01-24T10:23:00+01:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2012/01/what-it-means-to-do-a-phd</id>
   <content type="html">&lt;p&gt;&lt;p&gt;Most people who decide to do a Ph.D. are well aware that it will mean a lot of work. You have to learn a lot of new stuff, possibly also outside of the topics you have studied so far. Taking machine learning as an example, you probably need to learn much more math than you&amp;#8217;ve already been exposed to, including a mix of linear algebra, optimization theory, probability theory, statistics, and so on. But you also need to learn something about the area where you apply your methods, for example, bioinformatics, linguistics, and so on.&lt;/p&gt;

&lt;p&gt;But at the same time, doing a Ph.D. also poses some psychological challenges and from my experience I can say that many students are quite surprised by the level of problems they face. In contrast to a Bachelor or a Master, which requires you to learn some topic and be able to apply what you&amp;#8217;ve learned to new similar problems, doing a Ph.D. means doing something which hasn&amp;#8217;t been done before. You need to solve a problem which hasn&amp;#8217;t been solved before.&lt;/p&gt;

&lt;p&gt;Now this may not sound that surprising because that&amp;#8217;s what research is all about: exploring questions, solving problems, advancing the state of the art. But you only realize what this really means when you&amp;#8217;re one or two years into your graduate studies, you have learned quite a lot and come to understand the nature of the problem, and you realize that you have no idea how to solve the problem.&lt;/p&gt;

&lt;p&gt;There is of course a lot you can do to hedge the risk of failing. For example, you can start with simpler subproblems and work your way up towards the full problem. You can work on a number of smaller problems such that you build up a collection of work done. But at some point you will invariably find yourself in a situation where you have to admit that you really cannot know whether you&amp;#8217;ll be able to solve the problem, or whether any of your usual strategies will help.&lt;/p&gt;

&lt;p&gt;And this doesn&amp;#8217;t even include the social aspects of doing a Ph.D., of getting published, getting cited, building up some form of reputation in the community.&lt;/p&gt;

&lt;p&gt;I found myself in exactly this situation towards the end of my studies. I had to switch topics in between because the original idea didn&amp;#8217;t quite turn out as expected. I wrote my thesis about convergence of eigenvalues and eigenvectors of the kernel matrix. But till the very end, a central proof was missing. I had run extensive numerical simulations so I was quite sure about what I wanted to prove, but only at the very end did I manage to put the proof together. So here I was, with a few months left before my position ended, trying to solve that problem every day but not knowing whether I would be able to do that in the end or not. To illustrate my state of mind, when I moved to a different town, I couldn&amp;#8217;t rent the truck of the size I had reserved but only one which was about a meter shorter. All my friends told me &amp;#8220;Mikio, forget it, we&amp;#8217;ll never get all your stuff in there&amp;#8221;, but I was just like &amp;#8220;ah, impossible, well, yes&amp;#8230; .&amp;#8221; In the end, everything except for one cupboard went in, which was ok, and showed that we both had been wrong.&lt;/p&gt;

&lt;p&gt;Actually, I have come to believe that this experience is part of what it means to do a Ph.D. Eventually, you will succeed in one way or another, and you will have learned a very valuable lesson. You will see how the problem slowly sinks into your mind until your understanding of the problem leads you to a solution, or uncovers that it is not possible, but then you will also have understood why.&lt;/p&gt;

&lt;p&gt;In the end, doing a Ph.D. is exactly about this: Learning to do what no one has done before and be confident even when there is only a limited amount of time and you have no idea whether you will be able to solve the problem. And that is an important part of what science is about.&lt;/p&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2012/01/what-it-means-to-do-a-phd.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/m308jom4HiI" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">Fast Cross Validation</title>
   <link href="http://blog.mikiobraun.de/2011/12/fast-cross-validation.html" />
   <updated>2011-12-20T14:58:00+01:00</updated>
   <published>2011-12-20T14:58:00+01:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/12/fast-cross-validation</id>
   <content type="html">&lt;p&gt;&lt;a href='http://www.scribd.com/doc/76134034/Fast-Cross-Validation-Via-Sequential-Analysis-Talk' title='View Fast Cross Validation Via Sequential Analysis - Talk on Scribd' style='margin: 12px auto 6px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block; text-decoration: underline;'&gt;
   Fast Cross Validation Via Sequential Analysis - Talk
&lt;/a&gt;&lt;object name='doc_33688' data='http://d1.scribdassets.com/ScribdViewer.swf' id='doc_33688' type='application/x-shockwave-flash' height='600' width='100%' style='outline:none;'&gt;            
  &lt;param name='movie' value='http://d1.scribdassets.com/ScribdViewer.swf' /&gt;
  &lt;param name='wmode' value='opaque' /&gt;
  &lt;param name='bgcolor' value='#ffffff' /&gt;
  &lt;param name='allowFullScreen' value='true' /&gt;
  &lt;param name='allowScriptAccess' value='always' /&gt;
  &lt;param name='FlashVars' value='document_id=76134034&amp;access_key=key-26djy06t8tk841ycqlg4&amp;page=1&amp;viewMode=slideshow' /&gt;
  &lt;embed name='doc_33688' src='http://d1.scribdassets.com/ScribdViewer.swf?document_id=76134034&amp;access_key=key-26djy06t8tk841ycqlg4&amp;page=1&amp;viewMode=slideshow' allowfullscreen='true' id='doc_33688' type='application/x-shockwave-flash' allowscriptaccess='always' wmode='opaque' height='600' width='100%' bgcolor='#ffffff' /&gt;
&lt;/object&gt;
&lt;p&gt;These are the slides to our talk (joint work with Tammo Krüger and Danny Panknin) at the &lt;a href='http://biglearn.org/'&gt;BigLearning&lt;/a&gt; workshop at NIPS 2011. You can also have a look at the &lt;a href='http://www.scribd.com/doc/76134303'&gt;paper&lt;/a&gt; and the &lt;a href='http://www.scribd.com/doc/76134483'&gt;appendix&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So what is it about? In a nutshell, we try to speed up cross-validation by starting with subsamples of the data and quickly identifying parameter configurations which are clearly suboptimal. Learning on subsets is of course much faster, so ideally you&amp;#8217;ll save a lot of time because you will only have a handful of parameter candidates left on the full data set.&lt;/p&gt;

&lt;p&gt;The method is based on the &lt;a href='http://en.wikipedia.org/wiki/Sequential_analysis'&gt;sequential analysis framework&lt;/a&gt; which deals with the problem of statistical hypothesis testing when the sample size isn&amp;#8217;t fixed.&lt;/p&gt;

&lt;p&gt;The main problem one faces is that the performance of parameter configurations changes significantly as the sample size increases. For a fixed parameter configuration (say a kernel width and a regularization parameter for an SVM), it is clear that the error converges, and usually becomes smaller as the number of samples increases. However, if one compares two configurations, one can often observe that one configuration is better for small sample sizes, while the other becomes better later on. This phenomenon is linked to the complexity of the model associated with a parameter choice. Generally speaking, more complex models require more data to fit correctly and will overfit on too few data points.&lt;/p&gt;

&lt;p&gt;Our method accounts for this effect by adjusting the statistical tests to maximize the number of failures before a configuration is removed from the set of active configurations. Nevertheless, fast cross-validation is faster by a factor of 50-100 on our benchmark data sets.&lt;/p&gt;&lt;/p&gt;
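&lt;p&gt;To make the general idea concrete, here is a minimal sketch in Python. It is not the sequential test from the paper: the elimination rule (a candidate collects a failure whenever its error exceeds &lt;code&gt;tolerance&lt;/code&gt; times the best error at that step, and is dropped after &lt;code&gt;max_failures&lt;/code&gt; failures) is a crude stand-in, and the k-nearest-neighbour learner, the synthetic data and all names are assumptions chosen for illustration only.&lt;/p&gt;

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train, test, k):
    """Test error of a k-NN regressor trained on the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)


def fast_cv_sketch(train, test, candidates, steps=5, tolerance=1.5, max_failures=2):
    """Drop clearly suboptimal candidates while the training subsample grows."""
    active = list(candidates)
    failures = {k: 0 for k in candidates}
    for step in range(1, steps + 1):
        # Linearly growing subsample; the last step uses all training data.
        subset = train[: len(train) * step // steps]
        errors = {k: mean_squared_error(subset, test, k) for k in active}
        best = min(errors.values())
        for k in active:
            if errors[k] > tolerance * best:
                failures[k] += 1
        # Allowing a few failures gives complex models time to catch up.
        active = [k for k in active if max_failures > failures[k]]
    return active


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def sample(n):
    """Hypothetical data: a noisy sine curve on the interval [0, 6]."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 6.0)
        points.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
    return points


train, test = sample(200), sample(100)
candidates = [1, 3, 5, 9, 25, 75]  # numbers of neighbours to choose from
survivors = fast_cv_sketch(train, test, candidates)
print(survivors)
```

&lt;p&gt;Only the surviving candidates are ever trained on the full data, which is where the time saving comes from. The failure allowance plays the role described above: it keeps a configuration alive that looks bad on small subsamples but may become better later on.&lt;/p&gt;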
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/12/fast-cross-validation.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/-0HzbkZzuuI" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">Scala discussion heating up?</title>
   <link href="http://blog.mikiobraun.de/2011/11/scala-discussion-heating-up.html" />
   <updated>2011-11-30T11:18:00+01:00</updated>
   <published>2011-11-30T11:18:00+01:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/11/scala-discussion-heating-up</id>
   <content type="html">&lt;p&gt;&lt;p&gt;Apparently, the discussions about &amp;#8220;Scala being too complex&amp;#8221; are heating up, mostly due to a &lt;a href='http://codahale.com/downloads/email-to-donald.txt'&gt;leaked email&lt;/a&gt; from one of &lt;a href='http://twitter.com/coda'&gt;Yammer&amp;#8217;s programmers&lt;/a&gt; to the Scala people where he discusses some of his experiences he&amp;#8217;s had with using Scala in a production environment, and the other being a &lt;a href='http://news.ycombinator.com/item?id=3293109'&gt;post on HN&lt;/a&gt; comparing Scala to Perl in the sense that both languages have too much flexibility in solving a specific task leading to a mix of different programming paradigms and styles which will make you code harder to read and maintain.&lt;/p&gt;

&lt;p&gt;Now we&amp;#8217;ve been using Scala as our main programming language for the last two and a half years for &lt;a href='http://twimpact.com'&gt;TWIMPACT&lt;/a&gt;, so I know what people are talking about. And the truth is, it is all true, sadly. On the one hand, Scala is a pretty awesome programming language which is very nicely designed. I&amp;#8217;ve said this before, but normally you will eventually come across some feature of a programming language which is not designed well and you have to code your way around it, but I&amp;#8217;ve yet to come across something like it in Scala.&lt;/p&gt;

&lt;p&gt;On the other hand, it is also true that some of the libraries are not as fast as they should be. Although I like the idea of immutable collections a lot, every time I need performance, I&amp;#8217;d rather put in a Java collection. Also, it&amp;#8217;s true that the collection library is pretty complex. It all kind of makes sense to get a clean design of the classes, but it&amp;#8217;s pretty complicated with all those classes like Seq, SeqLike, Traversable, TraversableOnce, etc. However, you&amp;#8217;ll probably only need to know all the details if you want to write your own collections which integrate seamlessly with the existing collection classes.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s also true that upgrading to a new version is hard. For some reason, many libraries seem to be quite deeply interlocked with the Scala version. While our own code never had to be changed if Scala went to a new version, this wasn&amp;#8217;t true for most libraries, unfortunately, meaning that you have to wait till all the libraries have been upgraded to the new version before you can do the update yourself. And frankly, I don&amp;#8217;t see why this is necessary.&lt;/p&gt;

&lt;p&gt;We&amp;#8217;ve never bothered with sbt, but directly went for maven due to its better integration in most IDEs. We&amp;#8217;re using IntelliJ IDEA whose Scala plugin has come a long way and gives pretty good support. There is also a lot to be improved in the basic tools like the compiler or the shell in terms of startup time. Scala seems to preload several megabytes of jar files on startup, probably in an attempt at optimization, but in the end, it only means that starting Scala takes anywhere between 5 and 10 seconds, which is really a lot if you&amp;#8217;re working on the shell (and every other language starts up almost immediately). The guys behind JRuby have invested a lot of time to cut down on the startup time, and that was time well spent.&lt;/p&gt;

&lt;p&gt;People are also often attacking Scala for its complexity. While it&amp;#8217;s certainly true that it&amp;#8217;s easier to hire some Java expert than someone who knows Scala, IMHO Scala is a big improvement in many ways over Java, which feels overly verbose once you&amp;#8217;ve learned Scala. As with every language, there are more basic concepts and more advanced concepts and usually, you don&amp;#8217;t have to master them all from the start. Also, people often argue as if the complexity of learning a programming language is all in the programming language, but you also have to consider the standard libraries and tools. For example, while the Java programming language is relatively simple in terms of concepts, the standard tools and frameworks are pretty intimidating to learn (all that XML, Maven, Spring, etc.).&lt;/p&gt;

&lt;p&gt;Then people are also complaining about the community, which is supposedly not helpful enough, or too fragmented, or only consists of crazy people who are just thinking about how to implement everything in terms of category theory. I don&amp;#8217;t think that is true. Scala is still young, and the community can still grow. We&amp;#8217;ve uncovered a number of bugs (mostly &lt;a href='http://twitter.com/thinkberg'&gt;Leo&lt;/a&gt; who has a knack for finding bugs in libraries) and people were mostly as responsive as you&amp;#8217;d expect them to be. One of the strengths of Scala is also that it is quite painless to reuse existing Java projects (as with any other programming language for the JVM). I never found using a Java library from Scala as repulsive as some seem to. The integration is quite painless, and if you really have to, you can add a bit of syntactic sugar on your side for the stuff you need most.&lt;/p&gt;

&lt;p&gt;Finally, I really don&amp;#8217;t get the argument of people who are saying &amp;#8220;Scala is too complex, I switched to Python (or some other scripting language)&amp;#8221;. To me, these are completely different sets of programming languages. While it&amp;#8217;s true that there are some applications like writing medium sized web sites which you can nowadays do in either a scripting language or a compiled language, there are many applications where Python (or any other scripting language) just can&amp;#8217;t compete. In scripting languages, it&amp;#8217;s hard to add primitive data types which are really fast unless someone else already took care of implementing the most computing-intensive routines in C.&lt;/p&gt;

&lt;p&gt;So in summary, Scala is both awesome and awful, just like almost every piece of sufficiently advanced technology. You can work with Scala, and it&amp;#8217;s a lot of fun, or you can reject it for a number of reasons; just acknowledge the complexity and don&amp;#8217;t give in to hype and marketing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Updates:&lt;/em&gt; &lt;a href='http://codahale.com/the-rest-of-the-story/'&gt;Coda Hale&amp;#8217;s comment on the leaked email&lt;/a&gt;, &lt;a href='http://eng.yammer.com/blog/2011/11/30/scala-at-yammer.html'&gt;Yammer&amp;#8217;s official statement on Scala&lt;/a&gt;, &lt;a href='http://blog.typesafe.com/getting-down-to-work'&gt;Typesafe&amp;#8217;s post on their commitment to the industry&lt;/a&gt;.&lt;/p&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/11/scala-discussion-heating-up.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/ECJOe1LZhZ8" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">Analyzing Social Media Data</title>
   <link href="http://blog.mikiobraun.de/2011/11/analyzing-social-media-data.html" />
   <updated>2011-11-01T22:20:00+01:00</updated>
   <published>2011-11-01T22:20:00+01:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/11/analyzing-social-media-data</id>
   <content type="html">&lt;p&gt;&lt;p&gt;Analyzing social media has become quite popular. People have been &lt;a href='http://arxiv.org/abs/1003.5699'&gt;predicting box office openings based on Twitter chatter&lt;/a&gt;, studied &lt;a href='http://cs.stanford.edu/people/jure/pubs/lim-icdm10.pdf'&gt;information diffusion patterns&lt;/a&gt;, &lt;a href='http://research.yahoo.com/pub/3386'&gt;information flows between classes of users&lt;/a&gt;, &lt;a href='http://www.iq.harvard.edu/blog/netgov/2011/08/tweetquake.html'&gt;how real-world events like earthquakes are reflected in Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is all pretty exciting and interesting, but there are also a few things where there is still room for improvement.&lt;/p&gt;

&lt;p&gt;There is very little work on real-time analysis. Many papers boast about the hundreds of millions of tweets (and the access to Twitter&amp;#8217;s firehose necessary to get that amount of data) which have formed the basis for the paper. However, many papers later introduce some more or less arbitrary ways of truncating the data, for example by taking a number of &amp;#8220;most active users&amp;#8221;. This is true both for &lt;a href='http://cs.stanford.edu/people/jure/pubs/lim-icdm10.pdf'&gt;Jure Leskovec&amp;#8217;s paper&lt;/a&gt; and for the &lt;a href='http://research.yahoo.com/pub/3386'&gt;Yahoo research&amp;#8217;s paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, I think that getting to real-time is extremely important, because you cannot just wait for days or longer to get your analysis. By that time, more data will have been streaming in, and when are you going to analyze that data?&lt;/p&gt;

&lt;p&gt;Another problem with many of the analyses is that they focus on the positive cases only. That is, they develop some method to detect bursts or trends and then use some famous real-world example (like Japan winning the women&amp;#8217;s soccer championship) to show that the method is triggered by the data. However, few publications go so far as to validate their method on negative examples as well, showing that the method not only detects trends well, but also does so robustly with few false positives.&lt;/p&gt;

&lt;p&gt;A classical example is the highly cited 2003 paper by Jon Kleinberg &amp;#8221;&lt;a href='http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.3462&amp;amp;rep=rep1&amp;amp;type=pdf'&gt;Bursty and Hierarchical Structure from Streams&lt;/a&gt;&amp;#8221; which explains how to detect areas of higher than usual activity, for example, from email streams. But then, the paper shows how the detected structure coincides with real deadlines for two examples without discussing negative examples in depth.&lt;/p&gt;

&lt;p&gt;Many authors also seem to believe that an analysis which is based on hundreds of millions of data points is automatically true in general. While this is certainly true for simple statistics which you can estimate well, there are other methods which can overfit. And for those, as many other disciplines like bioinformatics have had to learn the hard way, as you get more data, the probability that you find some evidence for your hypothesis increases drastically.&lt;/p&gt;

&lt;p&gt;To get reliable results, you need to follow the same rules as when validating the performance of a machine learning algorithm: Test on data which is disjoint from training data. If your method detects trends, check it on data which you believe has no structure. If you aggregate topics, check it on days when nothing special was happening. If you analyze the structure of the data, check on an independent sample (ideally from a period of time which is a bit removed from the original sample).&lt;/p&gt;

&lt;p&gt;That way you might have less data available, but your results will improve a lot in terms of reliability.&lt;/p&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/11/analyzing-social-media-data.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/6wGhPbwxOnk" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">One does not simply scale into real-time</title>
   <link href="http://blog.mikiobraun.de/2011/10/one-does-not-simply-scale-into-realtime-processing.html" />
   <updated>2011-10-10T22:15:00+02:00</updated>
   <published>2011-10-10T22:15:00+02:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/10/one-does-not-simply-scale-into-realtime-processing</id>
   <content type="html">&lt;p&gt;&lt;a href='http://qkme.me/355zge'&gt;&lt;img class='teaser-pic' src='/images/scale-into-mordor.jpg' /&gt;&lt;/a&gt;
&lt;p&gt;Real-time seems to be the next big thing in &lt;em&gt;big data&lt;/em&gt;. MapReduce has shown how to perform big analyses on huge data sets in parallel, and the next challenge seems to be to find a similar kind of approach to real-time.&lt;/p&gt;

&lt;p&gt;When you look around the web, there are two major approaches out there which try to build something which can scale to deal with Twitter-firehose-scale amounts of data. One starts with a MapReduce framework like Hadoop and somehow finagles real-time or at least streaming capabilities onto it. The other approach starts with some event-driven &amp;#8220;streaming&amp;#8221; computing architecture and makes it scale on a cluster.&lt;/p&gt;

&lt;p&gt;These are interesting and very cool projects, however from our own experience with retweet analysis at &lt;a href='http://twimpact.com'&gt;TWIMPACT&lt;/a&gt;, I get the feeling that both approaches fall short of providing a definitive answer.&lt;/p&gt;

&lt;p&gt;In short: &lt;a href='http://www.youtube.com/watch?v=yrVGiuch1ag'&gt;One does not simply &lt;em&gt;scale&lt;/em&gt; into &lt;em&gt;real-time&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id='realtime_stream_analysis'&gt;Real-Time Stream Analysis&lt;/h2&gt;
&lt;div class='figure'&gt;
&lt;img src='/images/sd-real-time-stream-analysis.png' /&gt;
&lt;/div&gt;
&lt;p&gt;So what is real-time stream analysis? (Apart from the fact that I seem to be unable to decide whether to write it with a hyphen or as one word like &lt;a href='http://en.bab.la/dictionary/german-english/donaudampfschifffahrtskapitaensmuetze'&gt;Donaudampfschifffahrtskapitänsmütze&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Basically, the idea is that you have some sort of event stream like Twitter messages, or Apache log data, or URL expansion requests at bit.ly, which comes in at a high volume of several hundreds or even thousands of events per second. Let&amp;#8217;s just focus on Twitter stream analysis for now.&lt;/p&gt;

&lt;p&gt;Typical applications are to compute some statistics from the stream which summarizes it in a nice fashion. For example&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is the most frequently retweeted tweet?&lt;/li&gt;

&lt;li&gt;what is the most frequently mentioned URL?&lt;/li&gt;

&lt;li&gt;what are the most influential users (in terms of mentions/retweets)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are pretty basic counting tasks. Very closely linked are questions like what the score of an arbitrary tweet or URL is, or what the first few hundred most influential users are, and so on. You can also compute more complex scores based on more of these numbers. For example, our own TWIMPACT score is based on all the counts of a user&amp;#8217;s retweets, and something like the Klout score is also based on a number of statistics (I&amp;#8217;m only assuming here, of course).&lt;/p&gt;
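&lt;p&gt;As a rough illustration of how basic these counting tasks are when everything fits into memory, here is a small Python sketch. The event format (a tuple of retweeted tweet id, optional URL, and mentioned user) and the sample data are made up for this example and do not reflect the actual Twitter data format or how TWIMPACT computes its scores.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical, simplified events: (retweeted tweet id, URL or None, mentioned user).
events = [
    ("t1", "http://example.com/a", "alice"),
    ("t2", None, "bob"),
    ("t1", "http://example.com/a", "alice"),
    ("t3", "http://example.com/b", "alice"),
    ("t1", None, "carol"),
]

retweets, urls, mentions = Counter(), Counter(), Counter()


def update(event):
    """Update all counts with a single event from the stream."""
    tweet_id, url, user = event
    retweets[tweet_id] += 1
    if url is not None:
        urls[url] += 1
    mentions[user] += 1


for event in events:
    update(event)

top_tweet = retweets.most_common(1)[0]  # most frequently retweeted tweet
top_url = urls.most_common(1)[0]        # most frequently mentioned URL
top_users = mentions.most_common(2)     # most mentioned users
print(top_tweet, top_url, top_users)
```

&lt;p&gt;Each event updates the counts in constant time, and the score of an arbitrary tweet or URL is a single dictionary lookup. The hard part is not the counting itself but that these tables keep growing with the stream.&lt;/p&gt;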

&lt;p&gt;Since the data arrives in real-time, you typically also want to get the results in real-time (instead of a daily report which is already hours old when you get it). Ideally, you get updated scores with each event, although I&amp;#8217;d say everything is ok if a query always takes less than a second.&lt;/p&gt;

&lt;p&gt;Finally, you&amp;#8217;d probably also want to be able to look at historical data to see how a user or retweet has performed in the past and to see whether its activity is going up or going down.&lt;/p&gt;

&lt;p&gt;Just to give you an idea of the amount of data involved: Each tweet corresponds to about 1k of data (including all metadata). If we assume that we have about 1000 tweets per second (actually it&amp;#8217;s probably more), then we get about 86.4 million tweets per day, or about 82.4GB of new data per day (about 30TB per year).&lt;/p&gt;
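&lt;p&gt;The back-of-the-envelope numbers above can be reproduced with a few lines of Python. The sketch assumes that &amp;#8220;1k&amp;#8221; means 1024 bytes and that GB and TB are meant as binary units, which is the reading under which the figures above come out.&lt;/p&gt;

```python
TWEET_BYTES = 1024        # about 1k per tweet, including metadata
TWEETS_PER_SECOND = 1000  # assumed average rate
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = TWEETS_PER_SECOND * SECONDS_PER_DAY
gb_per_day = tweets_per_day * TWEET_BYTES / 2**30  # binary gigabytes
tb_per_year = gb_per_day * 365 / 1024              # binary terabytes

print(f"{tweets_per_day / 1e6:.1f} million tweets per day")
print(f"{gb_per_day:.1f} GB per day")
print(f"{tb_per_year:.1f} TB per year")
```

&lt;p&gt;This gives 86.4 million tweets and 82.4 GB per day, and 29.4 TB per year, which rounds to the 30TB mentioned above.&lt;/p&gt;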

&lt;p&gt;Now let&amp;#8217;s discuss how you would approach this problem in a database centric fashion and using a stream processing framework.&lt;/p&gt;

&lt;h2 id='databases_approach'&gt;Databases Approach&lt;/h2&gt;
&lt;div class='figure'&gt;
&lt;img src='/images/sd-databases.png' /&gt;
&lt;/div&gt;
&lt;p&gt;With &amp;#8220;database approach&amp;#8221; I try to cover a whole range including traditional relational databases, NoSQL databases and MapReduce frameworks. The common denominator is that you basically pipe your data into the database and use the built-in queries of the database to compute your statistics. Depending on the type of NoSQL database you are using, you&amp;#8217;ll probably have to do the analysis online to precompute all your statistics because the database doesn&amp;#8217;t have sufficient query capabilities.&lt;/p&gt;

&lt;p&gt;As I see it, there are two main problems with this approach: First of all, the size of your database grows at a non-trivial rate. Unless you&amp;#8217;re Google and you&amp;#8217;ve planned for exponential growth of your data centers anyway, you will eventually run out of space and performance. Even if you assume that the response time will increase only logarithmically in your data size, your cluster will eventually become too slow to deal with real-time data.&lt;/p&gt;

&lt;p&gt;This will directly affect the time it will take for your queries to complete. However, as queries also put quite some load on your disks, this problem will only get worse over time.&lt;/p&gt;

&lt;p&gt;At the same time, most of the data will likely be irrelevant for your current analysis. Tweets stop being retweeted, URLs fall out of fashion. In other words, there is a huge amount of historical baggage clogging up your servers. It is true that you still need this data in order to compute historical statistics, but note that the data doesn&amp;#8217;t change anymore, so that it would make much more sense to analyse the data once and only put the results in some read-only storage.&lt;/p&gt;

&lt;p&gt;You could of course try to keep the database size constant by periodically discarding old data. I think this is the right approach and I&amp;#8217;ll discuss it in more detail below. Note, however, that many databases cannot really deal well with huge numbers of deletions, as they need some form of more or less disruptive garbage collection (vacuum, compaction, etc.) to actually free up the space.&lt;/p&gt;

&lt;p&gt;Finally, one should also not forget that MapReduce doesn&amp;#8217;t work with all kinds of problems, but only with problems which are already inherently easy to parallelize. Actually, if you look into research in parallel algorithms, you will see that almost no problems scale linearly in the number of available processors (and you will also see that many of the efficient algorithms work with shared mutable state! Ugh!)&lt;/p&gt;

&lt;h2 id='stream_processing'&gt;Stream Processing&lt;/h2&gt;
&lt;div class='figure'&gt;
&lt;img src='/images/sd-stream-processing.png' /&gt;
&lt;/div&gt;
&lt;p&gt;So if storing all your data on slow disks to process it later is the wrong approach, you should probably try to process the data as it comes in through some form of pipeline. This is more or less the idea behind stream processing. Two example frameworks are &lt;a href='https://github.com/nathanmarz/storm'&gt;Storm&lt;/a&gt;, originally developed by BackType (recently acquired by Twitter), and &lt;a href='s4.io'&gt;S4&lt;/a&gt; developed by Yahoo.&lt;/p&gt;

&lt;p&gt;If you look closer, these are basically quite sophisticated frameworks to scale an actor based approach to concurrency. If you&amp;#8217;re not familiar with it, it&amp;#8217;s the idea to structure some computation in terms of independent small pieces of code which do not share state and communicate with one another through messages.&lt;/p&gt;

&lt;p&gt;Frameworks like the ones above let you define a computation in terms of a number of processing nodes which may also run on different servers, with the ability to add more parallel workers of a certain kind on the fly to scale up processing resources where necessary.&lt;/p&gt;

&lt;p&gt;This approach is essentially orthogonal to the database approach. In fact, stream processing frameworks usually don&amp;#8217;t even deal with persistence, they only focus on the computation. Therefore, there is also nothing specific to real-time stream processing in these frameworks. In essence, they deal with the question of how to split up a computation into small independent parts and how to scale such a network on a cluster.&lt;/p&gt;

&lt;p&gt;To me, the basic problem with this approach (and with actor based concurrency) is that it doesn&amp;#8217;t deal well with peak volumes which surpass the computation bandwidth. In fact, it&amp;#8217;s not even that simple to say, given such a network, what the maximum throughput is. Conceptually, it is the throughput of the slowest component, but you also have to take the message routing topology into account.&lt;/p&gt;

&lt;p&gt;Now, once more messages need to be processed than possible, message queues start to fill up somewhere in the system. This is a general problem with such systems, not specific to actor based concurrency. The book &lt;a href='http://pragprog.com/book/mnee/release-it'&gt;Release It!&lt;/a&gt; contains some very entertaining and also frightening real-world war stories of what can go wrong with systems where you plug together components with quite different capacities.&lt;/p&gt;

&lt;p&gt;Another problem is that actor-based concurrency tends to break a simple function which may fit on a single screen into dozens of classes, but that is a different story.&lt;/p&gt;

&lt;p&gt;In any case, the question is how you can guarantee that your system will run stably even for high peak volumes, apart from just adding more nodes? The easiest thing would be to randomly drop events (resulting in a more or less consistent subsample which at least has the same distribution as the original data stream), but is there something different you could do?&lt;/p&gt;
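&lt;p&gt;As a point of reference, the random-dropping baseline can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the function name &lt;code&gt;thin_stream&lt;/code&gt;, the arrival rate, and the capacity are all made up for the example.&lt;/p&gt;

```python
import random


def thin_stream(events, keep_probability, rng=None):
    """Yield each event independently with probability keep_probability.

    Because every event has the same chance of surviving, the kept events
    are a uniform random subsample of the original stream.
    """
    rng = rng or random.Random()
    for event in events:
        # rng.random() is uniform on [0, 1), so on average a fraction
        # keep_probability of the events passes this check.
        if keep_probability > rng.random():
            yield event


# Hypothetical numbers: 5000 events/s arrive, but only 1000/s can be processed.
capacity, arrival_rate = 1000, 5000
sample = list(thin_stream(range(arrival_rate), capacity / arrival_rate))
print(len(sample))  # close to 1000, varies from run to run
```

&lt;p&gt;The subsample keeps the distribution of the stream, but every count you compute on it is scaled down by the keep probability and rare events may vanish entirely.&lt;/p&gt;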

&lt;h2 id='stream_distillation'&gt;Stream Distillation&lt;/h2&gt;
&lt;div class='figure'&gt;
&lt;img src='/images/sd-stream-distiller.png' /&gt;
&lt;/div&gt;
&lt;p&gt;So to summarize: Putting all your data into a database is problematic because the data steadily grows and computing statistics based on the data is too slow. You also don&amp;#8217;t really need to keep all your data at hand to have an analysis of the current state of the stream.&lt;/p&gt;

&lt;p&gt;Stream processing, on the other hand, is a nice tool to scale your computations, but it doesn&amp;#8217;t deal well with peak volumes, and depending on how you persist your data, you run into the same scaling issues as the database-centric approach.&lt;/p&gt;

&lt;p&gt;In other words, what we need is a way to separate the data we currently need for our analysis from the historic data and analyses, and we also need a way to limit the bandwidth of events while having as little distortion as possible on the statistics we&amp;#8217;re interested in.&lt;/p&gt;

&lt;p&gt;As it turns out, these kinds of problems are closely related to &amp;#8220;frequent itemset&amp;#8221; problems in data mining. The idea is to identify the most frequent items in a data stream with limited resources, both in terms of memory and computation. Such methods are discussed, for example, in Chapter 4 of Rajaraman and Ullman&amp;#8217;s upcoming book &lt;em&gt;&lt;a href='http://i.stanford.edu/~ullman/mmds.html'&gt;Mining of Massive Datasets&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Most of these methods work by keeping a fixed number of counts and then replacing the least frequent ones when a new type of event occurs. They come with some sort of theoretical guarantee to identify the correct set of items, or at least some bound on the differences in the count. In practice, however, when analysing retweets you get quite good results because only a fraction of the retweets is retweeted more than once, so that most of the slots are occupied by retweets which do not reoccur and discarding them doesn&amp;#8217;t hurt.&lt;/p&gt;
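&lt;p&gt;To make the &amp;#8220;fixed number of counts&amp;#8221; idea concrete, here is a minimal Python sketch in the spirit of the Space-Saving algorithm, one well-known member of this family. It is an assumption for illustration only, not necessarily the method used in TWIMPACT or the one described in the book; the function name and the sample stream are made up.&lt;/p&gt;

```python
def space_saving(stream, num_slots):
    """Approximate counts of frequent items using at most num_slots counters.

    Assumes num_slots is at least 1. Counts never underestimate the true
    frequency of an item that is currently tracked.
    """
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif num_slots > len(counts):
            # A free slot is still available.
            counts[item] = 1
        else:
            # Evict the least frequent item; the newcomer inherits its
            # count plus one, which is what bounds the counting error.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts


# Hypothetical stream: two popular items among many that occur only once.
stream = []
for i in range(100):
    stream.append("a")
    if i % 2 == 0:
        stream.append("b")
    stream.append("once-%d" % i)

top = space_saving(stream, num_slots=10)
print(sorted(top, key=top.get, reverse=True)[:2])
```

&lt;p&gt;Items that occur only once keep evicting each other from the low-count slots, while the genuinely frequent items stay in the table, which matches the observation about retweets above.&lt;/p&gt;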

&lt;h2 id='summary'&gt;Summary&lt;/h2&gt;

&lt;p&gt;As you might have expected, this is exactly the approach we took with TWIMPACT. All the demos you can see at our site are powered by our new trending backend (codename &amp;#8220;Trevor&amp;#8221;), which is basically an in-memory trend database built using ideas from stream mining, together with read-only snapshots on disk for later historical analyses.&lt;/p&gt;

&lt;p&gt;These ideas can of course also be combined with databases and stream processing, but already without a huge amount of parallelism, we&amp;#8217;re able to process a few thousand tweets per second.&lt;/p&gt;

&lt;p&gt;In summary: it&amp;#8217;s not enough to just scale your database and your computational algorithms. To make your analysis framework stable against peak volumes and sustainable in terms of data growth, you also need to separate your analysis database from the historic data and add some ideas from stream mining to extract a statistically approximate subsample with capped bandwidth from your data.&lt;/p&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/10/one-does-not-simply-scale-into-realtime-processing.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/TbrrweBqxZ4" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">Peer Review and NoSQL</title>
   <link href="http://blog.mikiobraun.de/2011/09/on-peer-review.html" />
   <updated>2011-09-20T10:03:00+02:00</updated>
   <published>2011-09-20T10:03:00+02:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/09/on-peer-review</id>
   <content type="html">&lt;p&gt;&lt;p&gt;&lt;em&gt;Disclaimer: This post definitely falls into the &lt;a href='http://en.wiktionary.org/wiki/TLDR'&gt;tl;dr&lt;/a&gt; category of posts. I&amp;#8217;ve been collecting these ideas for quite some time now, and somehow this post got longer and longer. Anyway, it is a complex topic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ever since I started to work in an academic environment back in &amp;#8216;96 (working as a student in Rolf Eckmiller&amp;#8217;s &lt;a href='http://www.nero.uni-bonn.de/index_nero.html'&gt;neuroinformatics group&lt;/a&gt;), the peer review process has always been a big topic. There were always some people complaining, discussing possible ways to improve it, or dismissing the whole idea of peer review altogether.&lt;/p&gt;

&lt;p&gt;The interesting thing is that very little has changed since then. If we look not only at peer review but at the whole scientific publication landscape, we can see a few significant changes. For example, open access is much more real than it used to be, and there are more scientific journals which let authors keep their full copyright and the right to republish the papers on their webpages, etc.&lt;/p&gt;

&lt;p&gt;These changes are all important, but what I find curious is that the general peer review process hasn&amp;#8217;t changed at all. The process to get published or accepted at a conference is still the same: You submit your paper to some board where it gets handed to two or more reviewers whose identity is not revealed to you. Based on the verdict of the reviewers the action editor/program chair decides what to do with your paper.&lt;/p&gt;

&lt;p&gt;I won&amp;#8217;t repeat all the things which people perceive as being broken with this system. Let&amp;#8217;s just say that the process has a high error rate, both false positives and false negatives (a.k.a. the &amp;#8220;bad review problem&amp;#8221;), it can take very long for a paper to get published, and the workload on the reviewers is pretty high.&lt;/p&gt;

&lt;p&gt;Still, nothing has changed. Is this only because we sort of bring this problem upon ourselves (as opposed to the system being forced upon us by some external agency)? Or is there a deeper reason?&lt;/p&gt;

&lt;h2 id='a_closer_look_at_peer_review'&gt;A Closer Look at Peer Review&lt;/h2&gt;

&lt;p&gt;I think the main reason why peer review is so resilient is that it is already a quite elegant partial solution to a number of interwoven requirements. In other words, just like democracy, it is an imperfect system but the best we have discovered so far.&lt;/p&gt;

&lt;p&gt;In this whole scientific publication business, there are a number of stakeholders involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Science&lt;/em&gt; as a whole wants to progress to solve the smaller and bigger mysteries of life, the universe and all the rest. Science needs the publication system to be efficient, fair, and open, such that information can be distributed quickly and without bias.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;&lt;em&gt;Researchers&lt;/em&gt; want to have fun researching, but also need to build up a reputation to keep doing so. For this, they need the publication system to build up a track record of their work.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;&lt;em&gt;Researchers&lt;/em&gt; also need to have access to the works of others, to know what has already been done, which problems have been solved, and so on. The publication system is basically like an enormous, ever expanding library of knowledge.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;&lt;em&gt;Universities and funding agencies&lt;/em&gt; need the publication system to assess the scientific output of researchers for hiring decisions and to explain to taxpayers how and why the money has been spent. Peer review is a very handy way of assessing scientific output. You can use it to basically just say &amp;#8220;I don&amp;#8217;t know exactly what they have been doing (and it&amp;#8217;s probably not even practically relevant for another decade or so), but at least these other researchers said that it&amp;#8217;s good.&amp;#8221;&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;&lt;em&gt;Publishers&lt;/em&gt; mainly want/need to make money (and probably also have a name in the whole scientific endeavor).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What&amp;#8217;s important to understand, however, is how peer review addresses all of these problems at least partially:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Exchange of information is more or less efficient, fair, and open. It could be more efficient, but the publication lag is still on the same order of magnitude as the actual work. It&amp;#8217;s not like science is already five years ahead of a huge backlog of publications (at least I hope that&amp;#8217;s not the case&amp;#8230;). It is fair because a good reviewer is bound by a scientific code of ethics to be fair and unbiased, and it is open because everyone can submit something (as opposed to a closed club where you first have to become a member to get a chance to publish).&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;Researchers get an excellent standardized measure of scientific output. A published journal paper is something nobody can take away from you. Not only is a published paper an important step towards tenure, it is also something everyone agrees on. On the other hand, peer review gives you a level of filtering such that the amount of new results is just large enough to be manageable.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;Universities and funding agencies are also happy, because they have a solid, generally accepted measure of scientific productivity, which even laymen can understand.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;Publishers can get their share by building a strong brand, becoming a journal with a high impact (while having researchers doing the actual peer review work for free, but that is another problem).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, peer review is an okay solution to a complex problem, and whatever solution you propose to replace it has to cover all of these aspects as well.&lt;/p&gt;

&lt;h2 id='you_cant_ignore_the_complexity_of_the_problem'&gt;You can&amp;#8217;t ignore the complexity of the problem&lt;/h2&gt;

&lt;p&gt;You&amp;#8217;re probably wondering when I&amp;#8217;ll come to the NoSQL bit of this post, but before we get there, let&amp;#8217;s briefly discuss how common alternatives fail because they do not address all of the above aspects.&lt;/p&gt;

&lt;p&gt;For example, a common approach is to say that we should replace this whole process with a social media site around publications. Let&amp;#8217;s just call it SciNet for now (and I know that already exists, but it&amp;#8217;s really hard to find something in that namespace which isn&amp;#8217;t already taken). &amp;#8220;Likes&amp;#8221; or &amp;#8220;Recommendations&amp;#8221; would work as filtering, connections between users would give people structure to navigate or to form &amp;#8220;Webs of Trust&amp;#8221;, and so on.&lt;/p&gt;

&lt;p&gt;This idea has some appeal, but it neglects the aspect of building a track record and giving an objective measure of scientific output, because you&amp;#8217;ll have a hard time explaining to your funding agencies that that non-peer-reviewed paper of yours is a solid piece of work because it got 1.5M &amp;#8220;likes&amp;#8221; on SciNet. I&amp;#8217;m not saying that this problem cannot be solved, but you can&amp;#8217;t just copy existing concepts, and you would also need to invest quite an amount of lobbying to convince the universities which hire professors and the funding agencies which pay for your research to accept these measures.&lt;/p&gt;

&lt;p&gt;&lt;a href='https://freedom-to-tinker.com/blog/dwallach/rebooting-cs-publication-process'&gt;Other approaches&lt;/a&gt; focus mainly on turn-around times and open access, proposing some central server which is a mixture of a preprint server and a perpetual archive. Such systems don&amp;#8217;t really address the filtering aspect, and also don&amp;#8217;t deal with the main problem of how to improve peer review.&lt;/p&gt;

&lt;h2 id='finally_getting_to_the_nosql_part'&gt;Finally getting to the NoSQL part&lt;/h2&gt;

&lt;p&gt;So from a higher level, we have a situation which is pretty common in engineering: We have a well-tested and established piece of &amp;#8220;technology&amp;#8221; for a complex problem. It&amp;#8217;s been around for quite some time now, and it shows. Somehow, it hasn&amp;#8217;t kept up with the acceleration of communication which the Internet brought about. We&amp;#8217;ve seen how fast information can be exchanged, and we&amp;#8217;d like to have that kind of quality for our professional scientific exchange as well.&lt;/p&gt;

&lt;p&gt;Of course, there is still room for improvement. People could just work harder to write better reviews on time, and action editors could press reviewers harder to give good reviews. Communities have already found ways around the long turn-around times by moving to conferences (like computer science) or preprint servers (like physics). Conferences are actually an interesting example, because they play a quite different role in computer science and mathematics. In CS, conferences have become as important as journals, which is problematic because the review process is quite different (as there is really no room for a revision). In mathematics, conferences are much more informal. Often you can apply with just an abstract. That way, conferences function mostly as a platform for exchange, and less as an outlet for publications.&lt;/p&gt;

&lt;p&gt;But in order to change the system, you either have to find a solution which is uniformly better than the current system on all the aspects I&amp;#8217;ve talked about earlier, or you have to put an equal amount of work into marketing to convince people that some of the aspects are not important anymore.&lt;/p&gt;

&lt;p&gt;All of this reminds me of the NoSQL movement. Classical relational database systems were the standard till a few years ago. Like peer review they address a very sophisticated set of requirements, and have been around for quite some time. However, it also became more and more apparent that they aren&amp;#8217;t good for certain applications.&lt;/p&gt;

&lt;p&gt;The main contribution of the NoSQL movement was to understand that some of the requirements could be weakened because they really weren&amp;#8217;t that important for certain kinds of applications, and to see how that changed set of requirements could be used to produce systems which scale more easily.&lt;/p&gt;

&lt;p&gt;What does this mean for the scientific publication system? I think to find an alternative process, we need to be fully aware of all the requirements the current system addresses, but we also need to question these requirements and be ready to fight hard to make people do the same. Because otherwise we&amp;#8217;re stuck with finding a system which is better in all the aspects than the current system.&lt;/p&gt;

&lt;p&gt;Note that this approach is also different from just focusing on one aspect and ignoring the rest, as some of the approaches I&amp;#8217;ve discussed above do. It&amp;#8217;s really something different to say &amp;#8220;we&amp;#8217;ve considered these requirements, but we think they aren&amp;#8217;t important anymore&amp;#8221; than &amp;#8220;we just considered half of the problem for now.&amp;#8221;&lt;/p&gt;

&lt;h2 id='rethinking_why_we_have_peer_review'&gt;Rethinking why we have peer review&lt;/h2&gt;

&lt;p&gt;So the question is which parts of the problem can go? I think generally there is the tendency to believe that we don&amp;#8217;t really need the publishers anymore. The Internet has made it very easy to publish something even in a permanent fashion, and most of the actual work has already been done by us anyway.&lt;/p&gt;

&lt;p&gt;There is really no way around an efficient exchange of information and being able to find the information you look for. These are probably the core requirements.&lt;/p&gt;

&lt;p&gt;Track records and objective measures of scientific output are of course important, but I think we might be able to find something new here eventually (and the current system also doesn&amp;#8217;t really work well anyway. Daniel Lemire has a &lt;a href='http://lemire.me/blog/archives/2011/04/29/is-science-more-art-or-industry/'&gt;number&lt;/a&gt; of &lt;a href='http://lemire.me/blog/archives/2011/04/28/the-case-against-double-blind-peer-review/'&gt;posts&lt;/a&gt; on how papers as units of scientific work don&amp;#8217;t make sense).&lt;/p&gt;

&lt;p&gt;I think peer review is still very valuable, but its role probably needs to change. If we find more effective ways of filtering and measuring the impact, we no longer need peer review to be the first threshold to publication, and we no longer suffer from its errors or long turn-around times.&lt;/p&gt;

&lt;h2 id='what_can_we_do_for_now'&gt;What can we do for now&lt;/h2&gt;

&lt;p&gt;So what can we do for now? Actually, I think you can do a lot. Don&amp;#8217;t forget that we&amp;#8217;re running this system ourselves. So whenever you are a reviewer, work hard to be an unbiased and fair reviewer. Never recommend rejecting a paper just because you somehow missed the point and didn&amp;#8217;t like the overall approach. NEVER reject a paper simply because it hasn&amp;#8217;t compared itself against method X (there are thousands of methods out there), unless there is a very good reason to do so. NEVER reject a paper because you believe it is similar to method Y, unless you are very certain that they are very similar. In all the cases where I got reviews like this, it was never true.&lt;/p&gt;

&lt;p&gt;If you are an action editor or area chair, don&amp;#8217;t accept bad reviews. If you organize a workshop, think about alternative ways to accept and review papers. Turn your blog into an informal journal, invite people to submit their work if they want to get the word out.&lt;/p&gt;

&lt;p&gt;If you are in a position to discuss with decision makers in funding agencies, talk to them about alternative ways to measure scientific output. If you are on a committee to hire new faculty members, don&amp;#8217;t just rely on impact factors to assess the scientific output of a candidate, but encourage the others to also look at the contributions of a candidate to the community besides peer reviewed publications.&lt;/p&gt;

&lt;p&gt;And if you want to develop something new, always be aware of the full complexity of the problem, and be ready to explain why you neglect some of its aspects.&lt;/p&gt;

&lt;p&gt;For further reading, Marcio von Muhlen has an interesting post called &lt;a href='http://marciovm.com/i-want-a-github-of-science/index.html'&gt;&amp;#8220;We Need a Github of Science&amp;#8221;&lt;/a&gt; which covers a lot of ground, and also tries to take into account the whole problem.&lt;/p&gt;

&lt;p&gt;A last piece of advice: First get tenure or some other kind of permanent position, then work on improving the system. Always remember that others are publishing papers in the old system while you fantasize about a better world.&lt;/p&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/09/on-peer-review.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/Y2BRVoK-5p0" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">Short Review: Visualize This by Nathan Yau</title>
   <link href="http://blog.mikiobraun.de/2011/08/review-yau-visualize-this.html" />
   <updated>2011-08-26T12:20:00+02:00</updated>
   <published>2011-08-26T12:20:00+02:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/08/review-yau-visualize-this</id>
   <content type="html">&lt;p&gt;&lt;img class='teaser-pic' src='/images/yau-visualize-this-teaser.png' /&gt;
&lt;p&gt;I can&amp;#8217;t really remember how I came across this book. I think it was recommended by Amazon. The price was ok (about 31€), at least for a Wiley book, so I just went ahead and bought it. You probably know Nathan Yau from his blog &lt;a href='http://flowingdata.com'&gt;FlowingData&lt;/a&gt; where he frequently posts visualizations and interesting infographics.&lt;/p&gt;

&lt;p&gt;Overall, the book is quite nice. It starts with some basic discussion about visualizations in general, stressing the fact that visualizations are an excellent tool to tell the stories behind statistical data. It then goes through some tools, starting with Excel, and then covering tools like &lt;a href='http://r-project.org'&gt;R&lt;/a&gt;, as well as JavaScript plotting libraries like &lt;a href='http://mbostock.github.com/protovis/'&gt;protovis&lt;/a&gt; (now abandoned in favor of &lt;a href='http://mbostock.github.com/d3/'&gt;D3.js&lt;/a&gt;), and several other more specialized libraries, for example, for maps.&lt;/p&gt;

&lt;p&gt;The remainder of the book goes through different kinds of visualizations in detail, from time series data to scatter plots, maps, etc. Each of the chapters focuses on one tool and shows how to get the final plot in great detail.&lt;/p&gt;

&lt;p&gt;What I found particularly interesting is that his workflow almost always includes refining the plot in Illustrator to make the graphic more appealing and to add explanations and further labels. This might be nice if you create static visualizations, but if you want to generate dynamic visualizations automatically from data, you&amp;#8217;ll have to keep tweaking the original plot until it looks nice enough.&lt;/p&gt;

&lt;p&gt;The book is probably too entry level if you already have some experience with data analysis or programming. It tries to require no prior knowledge in programming, although I wonder whether you can really learn how to use R from the examples alone. On the other hand, if you want to do some visualization with maps, for example, it&amp;#8217;s nice to have almost complete examples in there.&lt;/p&gt;

&lt;p&gt;I also particularly liked the introduction and the final chapter which give a lot of interesting insight on the business of creating visualizations.&lt;/p&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/08/review-yau-visualize-this.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/5HYKl_vK378" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">Short Review of Edward R. Tufte's "The Visual Display of Quantitative Information"</title>
   <link href="http://blog.mikiobraun.de/2011/08/review-visual-display-quantitative-information.html" />
   <updated>2011-08-15T17:21:00+02:00</updated>
   <published>2011-08-15T17:21:00+02:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/08/review-visual-display-quantitative-information</id>
   <content type="html">&lt;p&gt;&lt;a href='http://mikiobraun.tumblr.com/post/7681880118/another-book-the-visual-display-of-quantative'&gt;&lt;img class='teaser-pic' src='/images/tufte-teaser.jpg' /&gt;&lt;/a&gt;
&lt;p&gt;On the bottom line, I found &lt;a href='http://www.edwardtufte.com/tufte/books_vdqi'&gt;the book&lt;/a&gt; quite interesting to read, although you probably would have managed to fit the material into three to five blog posts (yes, that&amp;#8217;s how we measure document lengths today). The book spends about a third of the time reviewing the history of statistical plots. While it might be quite fascinating that &lt;a href='http://en.wikipedia.org/wiki/William_Playfair'&gt;William Playfair&lt;/a&gt; produced pretty modern looking plots already in the 18th century, I&amp;#8217;m not sure that this is the best way to approach such a field.&lt;/p&gt;

&lt;p&gt;Another third is spent on common mistakes and lies in statistical plots, including all kinds of exaggerations, visual noise (cross hatching, moire effects and friends), and downright stupid plots, for example &amp;#8220;executive summary&amp;#8221;-style plots consisting of only three bars (one of which is the sum of the other two).&lt;/p&gt;

&lt;p&gt;The strongest part of the book IMHO was the third part which develops a number of design principles. Tufte&amp;#8217;s main points are to use as much &amp;#8220;ink&amp;#8221; (read &amp;#8220;toner&amp;#8221; or &amp;#8220;pixels&amp;#8221;) as possible to show data, reducing gridlines and axes as much as possible. Tufte is also an advocate for data rich plots, arguing that our visual system is quite capable of dealing with high information densities.&lt;/p&gt;

&lt;p&gt;Like most machine learners, I&amp;#8217;ve done most of my plots with MATLAB and more recently &lt;a href='http://matplotlib.sourceforge.net/'&gt;matplotlib&lt;/a&gt;, and I&amp;#8217;m sort of used to the style they provide. Tufte&amp;#8217;s approach is somewhat different, and cleaner, which is a nice change. JavaScript plotting libraries like &lt;a href='http://protovis.org'&gt;protovis&lt;/a&gt; or &lt;a href='http://mbostock.github.com/d3/'&gt;D3.js&lt;/a&gt; follow the aesthetics of Tufte more.&lt;/p&gt;

&lt;p&gt;What I particularly liked about his approach was the idea that visualizations can really help to understand data using our visual system. As he says, &amp;#8220;Above all, show data&amp;#8221;, meaning that you shouldn&amp;#8217;t hesitate to put as much data as possible before your eyes (within reason) so that you can really start exploring the structure in your data visually.&lt;/p&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/08/review-visual-display-quantitative-information.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/uWMyzwzhrrI" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">Cross-post: Hey, Google+ is not world peace</title>
   <link href="http://blog.mikiobraun.de/2011/07/hey-google-plus-is-not-world-peace.html" />
   <updated>2011-07-19T10:11:00+02:00</updated>
   <published>2011-07-19T10:11:00+02:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/07/hey-google-plus-is-not-world-peace</id>
   <content type="html">&lt;p&gt;&lt;p&gt;&lt;em&gt;Sorry for the lack of posts lately. Somehow I&amp;#8217;ve become a bit spread out too thin across &lt;a href='http://mikiobraun.tumblr.com'&gt;tumblr&lt;/a&gt;, the &lt;a href='http://twimpact.tumblr.com'&gt;TWIMPACT Dev Blog&lt;/a&gt;, and now of course &lt;a href='http://profiles.google.com/mikiobraun'&gt;Google+&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I originally set up this blog to be the ideal place for everything with jsMath to render some latex, markdown for the editing, and disqus for comments, but sometimes I find it easier to post something on tumblr. No idea whether this will eventually converge on some platform.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In any case, I posted the following &lt;a href='http://twimpact.tumblr.com/post/7796139131/hey-google-is-not-world-peace'&gt;originally on the TWIMPACT Dev Blog&lt;/a&gt;:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Some people start behaving like Google+ is the second coming of Christ, a cure against cancer and world peace all rolled into one. &lt;a href='https://plus.google.com/113117251731252114390/posts'&gt;Mike Elgan&lt;/a&gt; has gone on a &lt;a href='http://www.computerworld.com/s/article/9218456/Elgan_What_I_lost_on_the_Google_Diet'&gt;&amp;#8220;Google+ diet&amp;#8221;&lt;/a&gt; and is redirecting all his communication (including email!) to Google+. Other bloggers (e.g. Kevin Rose) have shut down their blog completely, &lt;a href='http://kevinrose.com'&gt;redirecting their site&lt;/a&gt; to their Google+ profile. Others state that everything else &lt;a href='http://scobleizer.com/2011/07/17/google-has-made-twitter-boring-heres-what-twitter-should-do-about-that/'&gt;has started to become boring&lt;/a&gt; once you&amp;#8217;re exposed to Google+.&lt;/p&gt;

&lt;p&gt;Now admittedly, Google+ is very nice and it&amp;#8217;s definitely a big step forward, but there is still a lot that needs to be done. I&amp;#8217;m not dismissing it in any way. It has a lot of potential, but as with every complex system out there, there are a lot of details which need further attention.&lt;/p&gt;

&lt;p&gt;So here is the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;RSS feeds. A lot of people are still using RSS readers for blogs. Google+ doesn&amp;#8217;t have this feature yet. Switching your blog to Google+ currently means that all those people won&amp;#8217;t get your updates anymore. Doesn&amp;#8217;t sound like a nice move to me. The closest thing you can get right now is an (unofficial) hack at http://plusfeed.appspot.com/&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;Some form of bookmarking. Just as in Twitter, it is currently very hard to find interesting stuff in your stream again. +1 would be a nice way to bookmark posts, but currently they don&amp;#8217;t show up under the +1 tab.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;Private messages. Sometimes you want to have a small private exchange. You can always go back to email, but that would be quite disruptive. I know that you can have a private conversation if you share a post to just one person, but that&amp;#8217;s not very obvious. What you need is a button or a menu entry next to the person&amp;#8217;s icon or on their profile page.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;Real-time search. Now that has me really baffled. On all other Google products (email, calendar, docs, etc.), search is an integral part of the experience. In fact, putting full text search on Gmail was one of the game changers back when it came out. Still, no real-time search on Google+, neither for all public posts nor on your private streams. IMHO, this is probably the most important feature that is missing right now.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;Public API and third party integration. To cross post your stuff to Facebook and make everyone there crazy.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/07/hey-google-plus-is-not-world-peace.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/VHjk0kTjYmc" height="1" width="1"/&gt;</content>
 </entry>
 

 
   <entry>
   <title type="html">Resist Holy Tech Wars</title>
   <link href="http://blog.mikiobraun.de/2011/06/holy-tech-wars.html" />
   <updated>2011-06-26T20:30:00+02:00</updated>
   <published>2011-06-26T20:30:00+02:00</published>
   <author>
     <name>Mikio L. Braun</name>
     <uri>http://mikiobraun.de</uri>
     <email>mikiobraun@gmail.com</email>
   </author>
   <id>http://blog.mikiobraun.de/2011/06/holy-tech-wars</id>
   <content type="html">&lt;p&gt;&lt;p class='quote'&gt;
Only a Sith deals in absolutes.&lt;br /&gt;
Obi-Wan Kenobi, Star Wars Episode III: Revenge of the Sith
&lt;/p&gt;
&lt;p&gt;It is probably due to my age (I recently turned thirty-six), but lately I&amp;#8217;ve been noticing a certain aversion in myself to tech holy wars. It&amp;#8217;s something between &amp;#8220;C&amp;#8217;mon, you can&amp;#8217;t be serious&amp;#8221; and &amp;#8220;Not again&amp;#8221;. I know that they have a decades-old tradition in computer science, but still, I find them somewhat irritating.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Discussions about programming languages are full of these. For example, you have the dynamic vs. static typing discussion, functional vs. object-oriented, the question of the right abstraction for concurrency (locks, actors, STM, or something entirely different), immutable vs. mutable data structures, etc. Of course, most of this discussion is conducted in a reasonably rational tone, but every now and then, you meet people who categorically reject anything which isn&amp;#8217;t X.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;In database systems, there is the NoSQL vs. SQL debate, which breaks down into consistency vs. eventual consistency, &amp;#8220;scaling up&amp;#8221; vs. &amp;#8220;scaling out&amp;#8221;, and so on.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;In machine learning we have the frequentist vs. Bayesian divide. I actually often forget that people take this seriously, but then I again run into people who reject ideas because they come from the other group.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that most of these questions are pretty big ones: programming languages, databases, notions of probability and inference under uncertainty. There are few holy wars over smaller things.&lt;/p&gt;

&lt;p&gt;I had an interesting exchange on Twitter today with &lt;a href='http://twitter.com/DRMacIver'&gt;David MacIver&lt;/a&gt; on this, and we agreed that in many cases, it is clear that there are arguments in favor of each side depending on the context. Different programming languages are fit for different things, and there is no language or programming paradigm which is universally superior. At the same time, the cost of mastering both alternatives is often quite large. You tend to grow attached to the programming paradigm you do most of your work in, and also to the tools, editors, libraries, the community, etc., and might over time become reluctant to switch.&lt;/p&gt;

&lt;p&gt;David added that in many cases, however, it&amp;#8217;s actually not that difficult to do both but people suffer from &lt;a href='http://en.wikipedia.org/wiki/Sunk_costs#Loss_aversion_and_the_sunk_cost_fallacy'&gt;sunk cost fallacies&lt;/a&gt;. What this means is that people are somewhat averse to writing off investments they have already made. Basically, if you spent a lot of time learning a certain technology, you would feel like you&amp;#8217;ve wasted all that time if you switched to a different technology.&lt;/p&gt;

&lt;p&gt;Of course, and this is the reason why it&amp;#8217;s called a fallacy, &lt;em&gt;this usually has nothing to do with which technology is actually better for the problem at hand&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As I said, holy wars are often about complex, high-level stuff. And unfortunately, things are never that easy. It&amp;#8217;s usually quite complicated. Both sides have areas where they shine, and others where they fail. At the same time, problems usually also have rather complex requirements which seldom align with the available solutions.&lt;/p&gt;

&lt;p&gt;I think that for many, holy wars are also rooted in a wish to simplify life, to get easy answers. &amp;#8220;Which programming language should I learn?&amp;#8221; For which problem? Number crunching? Distributed programming? Building a web site? &amp;#8220;Which database technology is the best one?&amp;#8221; Which will scale with your demands for the next ten years? There are no simple answers. But that&amp;#8217;s life, basically.&lt;/p&gt;

&lt;p&gt;I think what irritates me most about this is when smart people, who otherwise seem to be able to take in a huge amount of detail, give in to holy wars. This is particularly so when they&amp;#8217;re scientists, because our professionalism demands that we&amp;#8217;re open to new things, and always suspicious of what we think we already know.&lt;/p&gt;

&lt;p&gt;To close this rant, and to counter the argument that I&amp;#8217;m basically having a holy war against holy wars, I admit that holy wars (at least in tech) have their merit. Often, they force both sides to focus on what&amp;#8217;s special about their approach, and possibly also to grow in the process. Given the huge number of technologies to learn today, it&amp;#8217;s also a good thing to just focus on one thing for some time, and nothing helps like believing that you&amp;#8217;ve found the greatest thing there is. At the end of the day, however, you have to put it all into perspective and admit that perfect solutions only exist in fairy tales.&lt;/p&gt;&lt;/p&gt;
   &lt;p&gt;&lt;a href="http://blog.mikiobraun.de/2011/06/holy-tech-wars.html"&gt;Click here for the full article&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/MarginallyInteresting/~4/kQKmayI9LCw" height="1" width="1"/&gt;</content>
 </entry>
 

 
 
</feed>

