<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Semantic Void</title>
	<atom:link href="http://semanticvoid.com/index.php/feed" rel="self" type="application/rss+xml" />
	<link>http://semanticvoid.com/blog</link>
	<description>Extracting the semantics from the void</description>
	<lastBuildDate>Wed, 07 Oct 2009 08:42:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Reading Less Is Reading More</title>
		<link>http://semanticvoid.com/blog/2009/10/07/reading-less-is-reading-more/</link>
		<comments>http://semanticvoid.com/blog/2009/10/07/reading-less-is-reading-more/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 08:19:27 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Project]]></category>
		<category><![CDATA[dygest]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=363</guid>
		<description><![CDATA[If information is what drives you to the internet, like me, you might be spending roughly 60-70% of your time online reading blogs, news and feeds (not to forget twitter). For me at least, reading online has superseded email (and updating social networks) as the most time consuming activity. And yet everyone is busy generating [...]]]></description>
			<content:encoded><![CDATA[<p>If information is what drives you to the internet, like me, you might be spending roughly 60-70% of your time online reading blogs, news and feeds (not to forget twitter). For me at least, reading online has superseded email (and updating social networks) as the most time consuming activity. And yet everyone is busy generating more content rather than finding a way to consume all this information. We are trying to tackle precisely this problem with <a href="http://dyge.st">Dygest</a>. At its core <a href="http://dyge.st">Dygest</a> is a summarization engine that tries to sift through all the noise and present only the *real* content/news contained in any (news) article/text. Recently, we released an experimental version of a feed summarizer that uses the <a href="http://dyge.st">Dygest</a> engine to summarize blogposts/news for any RSS/ATOM feed. You can subscribe to this summarized feed in any feed reader, such as Bloglines or Google Reader.</p>
<p><strong>NOTE</strong>: A feed that our system has never encountered before should be summarized within a couple of minutes.</p>
<p><center><img src="http://farm4.static.flickr.com/3423/3988948493_63da2cb1bd_o.png" alt="Feed Summarizer" /></center></p>
<p>On the whole with Dygest, reading blogs has now become much faster, much more concise and consuming information has become a great deal easier. Imagine the time saved reading the summarized version as compared to the original post (also you are not overwhelmed with useless information). See for yourself below:</p>
<p><center><img src="http://farm3.static.flickr.com/2594/3989711414_1f28fd59bd.jpg" alt="Original Post"/><br />
<strong>Original Post</strong></center></p>
<p><center><img src="http://farm3.static.flickr.com/2600/3988953559_d203feb1b6.jpg" alt="Summarized Post"/><br />
<strong>Summarized Post</strong></center></p>
<p>While you might have the urge to head over to Dygest and summarize your entire subscription list on Google Reader, I would recommend reading this post a bit further for some real cool stuff we have in store. If you must though &#8211; <a href="http://dyge.st">click here to Dygest</a>.</p>
<h3>Summarizing Your Twitter Links</h3>
<p><a href="http://readtwit.com">Readtwit</a> is a really cool service launched recently, which extracts links from your twitter feed and packages them in a clean RSS format. The awesome combination of Readtwit along with Dygest yields a summarized twitter feed delivered to your favorite feed reader.</p>
<p>Steps to get a summarized twitter feed:</p>
<p>(1) Sign into <a href="http://readtwit.com">Readtwit</a>.<br />
(2) Copy the link on the &#8216;Get me the feed&#8217; button:<br />
<center><img src="http://farm3.static.flickr.com/2454/3989734546_db979a08f5_m.jpg"/></center><br />
(3) Paste this link into the <a href="http://dyge.st">Dygest</a> interface and subscribe to the summarized feed returned in your favorite feed reader.<br />
<center><img src="http://farm3.static.flickr.com/2473/3988983827_57010939ff_o.png"/></center></p>
<h3>More To Come</h3>
<p>This is just an experimental release of <a href="http://dyge.st">Dygest</a> and so do send in your feedback on the summaries and help us improve. In the coming months we are working on improving the algorithms and churning out other great applications of <a href="http://dyge.st">Dygest</a> (there is something really cool in the works). So while we are busy teaching computers to read, <a href="http://dyge.st">Dygest</a> your feeds &#8211; because reading less is reading more.</p>
<p>Follow us on twitter &#8211; <a href="http://twitter.com/dygest">@dygest</a></p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2009/10/07/reading-less-is-reading-more/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Web Content Extraction Dataset</title>
		<link>http://semanticvoid.com/blog/2009/08/22/web-content-extraction-dataset/</link>
		<comments>http://semanticvoid.com/blog/2009/08/22/web-content-extraction-dataset/#comments</comments>
		<pubDate>Sun, 23 Aug 2009 05:42:10 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[content]]></category>
		<category><![CDATA[dataset]]></category>
		<category><![CDATA[extraction]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=326</guid>
		<description><![CDATA[For a recent project, we (sudheer_624 and I) have had to deal with developing algorithms to extract the true content from any given web page. By true content I mean the text excluding the ads, navigational links/text, etc even excluding comments (if any). Thus, given a blog post we are interested in extracting just the [...]]]></description>
			<content:encoded><![CDATA[<p>For a recent project, we (<a href="http://twitter.com/sudheer_624">sudheer_624</a> and I) have had to deal with developing algorithms to extract the true content from any given web page. By true content I mean the text excluding the ads, navigational links/text, etc even excluding comments (if any). Thus, given a blog post we are interested in extracting just the content of the post and not the comments and other surrounding text. We did not come across any dataset for the given task that would let us evaluate our algorithms. We recently generated our own dataset for this purpose and would like to share it with anyone tackling a similar problem.</p>
<p>The dataset contains the HTML source and text content (true content) for around 4000 webpages. One metric for measuring your algorithm against this dataset could be the edit distance between its output and the true content. If you do use this dataset, it would be great if you could share the results of your algorithms as benchmarks to compare against. I&#8217;ll be updating this post with the accuracy of our algorithm soon enough.</p>
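<p>As a rough illustration of the edit distance metric mentioned above, here is a minimal Java sketch of character-level Levenshtein distance. The class and method names (<code>EditDistance</code>, <code>levenshtein</code>, <code>normalized</code>) are made up for this example and are not part of the dataset or of our algorithm; for full pages you would probably want a word-level or more memory-efficient variant.</p>

```java
// Hypothetical helper, not shipped with the dataset: compares the text an
// extractor produced against the expected "true content" of a page.
final class EditDistance {

    // Classic Levenshtein distance, keeping only two rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int viaInsertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(viaInsertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the finished row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: distance divided by the longer length,
    // so 0.0 means identical text and 1.0 means nothing in common.
    static double normalized(String extracted, String expected) {
        int longest = Math.max(extracted.length(), expected.length());
        return longest == 0 ? 0.0 : (double) levenshtein(extracted, expected) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

<p>Dividing by the longer length gives a score between 0 and 1 that can be averaged over the pages in the dataset; whether to normalise this way is a design choice, not something the dataset prescribes.</p>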
<p><strong><a href="http://semanticvoid.com/data/content_extraction_dataset.tar.gz">Download the dataset here (gzipped)</a></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2009/08/22/web-content-extraction-dataset/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>`Fact`orize Your Search</title>
		<link>http://semanticvoid.com/blog/2009/08/14/factorize-your-search/</link>
		<comments>http://semanticvoid.com/blog/2009/08/14/factorize-your-search/#comments</comments>
		<pubDate>Fri, 14 Aug 2009 07:37:08 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Yahoo!]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=308</guid>
		<description><![CDATA[Dygest and a hackday later, @sudheer_624 and I (@semanticvoid) are back with &#8216;dfacto&#8217;, codename for our latest search hack for Yahoo! Hackday Summer 2009.
I think that search is undergoing a paradigm shift &#8211; it&#8217;s no longer about who presents the best ten blue links but now more about presenting the answers upfront. Dfacto (pronounced as [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://semanticvoid.com/blog/2009/03/19/dygest-your-search/">Dygest</a></strong> and a hackday later, <a href="http://twitter.com/sudheer_624">@sudheer_624</a> and I (<a href="http://twitter.com/semanticvoid">@semanticvoid</a>) are back with <strong>&#8216;dfacto&#8217;</strong>, codename for our latest search hack for Yahoo! Hackday Summer 2009.</p>
<p>I think that search is undergoing a paradigm shift &#8211; it&#8217;s no longer about who presents the best ten blue links but now more about presenting the answers upfront. <strong>Dfacto</strong> (pronounced as &#8216;<em>de facto</em>&#8217;, Latin for &#8216;<em>by [the] fact</em>&#8217;) is aimed at addressing this issue. A large percentage (nearly 68%) of queries are informational queries &#8211; ones where the searcher knows what she&#8217;d like to do or find but does not know how this can be achieved. <strong>Dfacto</strong> is aimed primarily at addressing this class of queries by presenting a set of facts associated with the query/topic to the searcher. It uses natural language algorithms to get facts that are most &#8220;semantically&#8221; related to the query. In lay terms, it literally tries to understand your query and the results. I&#8217;ll save the algorithmic details for another post. The few examples below show how it works:</p>
<p><em>Disclaimer: This is a work in progress, so you might notice a few &#8216;facts&#8217; that are irrelevant to the query.</em></p>
<p>Let&#8217;s say the searcher is (losing hair and) looking for causes of hair loss. Normally he/she would need to click through a bunch of links to get an overview of the causes. This hack, on the other hand, makes life a bit easier by presenting the causes upfront (click to enlarge):</p>
<p><center><a href="http://farm3.static.flickr.com/2525/3819295965_c7f9c3a651_o.png">click to enlarge<br /><img src="http://farm3.static.flickr.com/2525/3819295965_d8d3055f49.jpg" alt="'hair loss cause'" /></a><br /></center></p>
<p>Along with the facts, we also list the source from which each was extracted. Alternatively, the searcher can select a bunch of facts he/she thinks are relevant and refine the search. This in turn would yield a new set of &#8216;web results&#8217; along with new refined and related &#8216;facts&#8217;.</p>
<p>Another example (one which I particularly like) is a query about &#8216;table manners&#8217;. This lists precisely the rules of etiquette to follow at the table (click to enlarge).</p>
<p><center><a href="http://farm3.static.flickr.com/2587/3820121342_ac99f01072_o.png"> click to enlarge<br /> <img src="http://farm3.static.flickr.com/2587/3820121342_543ae9bb92.jpg" alt="'table manners'" /></a></center></p>
<p>Alternatively, <strong>Dfacto</strong> also serves well as a product research tool. A query for &#8216;iphone 3gs&#8217; yields (click to enlarge):</p>
<p><center><a href="http://farm3.static.flickr.com/2595/3820128618_cfbc2db7d6_o.png"> click to enlarge<br /> <img src="http://farm3.static.flickr.com/2595/3820128618_5fb29f2762.jpg" alt="'iphone 3gs'" /></a></center></p>
<p>On another note, if you have a date in the coming weeks you might be interested in reading the list below (:</p>
<p><center><a href="http://farm3.static.flickr.com/2669/3819328509_59c127b413_o.png"> click to enlarge<br /> <img src="http://farm3.static.flickr.com/2669/3819328509_ba08fe9e02.jpg" alt="'first date tips'" /></a></center></p>
<p>Happy hacking!</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2009/08/14/factorize-your-search/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Dygest Your Search</title>
		<link>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/</link>
		<comments>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/#comments</comments>
		<pubDate>Fri, 20 Mar 2009 06:56:36 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Project]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[Yahoo!]]></category>
		<category><![CDATA[summarization]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=256</guid>
		<description><![CDATA[Update: This hack won the coveted &#8216;Search&#8217; category award.
For the last couple of days, I and @sudheer_624 have been busy working on this hack for a Yahoo! Hackday. Although still a prototype, the hack has turned out to be interesting so we thought of putting it out for others to play around with.
Dygest (pronounced as [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Update:</strong> This hack won the coveted &#8216;Search&#8217; category award.</p>
<p>For the last couple of days, I and <a href="http://twitter.com/sudheer_624">@sudheer_624</a> have been busy working on this hack for a Yahoo! Hackday. Although still a prototype, the hack has turned out to be interesting so we thought of putting it out for others to play around with.</p>
<p><strong>Dygest</strong> (pronounced as &#8216;digest&#8217; &#8211; thanks to <a href="http://twitter.com/bluesmoon">@bluesmoon</a>) is aimed at changing the conventional way of displaying search context via a snippet to a more informative, machine generated document summary. There are two kinds of relevance for evaluating search results:</p>
<ul>
<li>Vertical relevance: determined by the ranking algorithms.</li>
<li>Horizontal relevance: the contextual information made available to the user about the result &#8211; Searchmonkey is a good initiative on this front.</li>
</ul>
<p>
The current way of displaying this context is via a snippet of text under every result. This snippet shows the neighborhood of the occurrence of the query terms. Usually this information is not rich enough for a searcher to make the right judgement about the result. This causes the searcher to switch back and forth between the documents and the search results whenever a page turns out not to be relevant. This can be frustrating at times.</p>
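<p>To make the idea concrete, a conventional snippet can be sketched as a fixed window of text around the first occurrence of a query term. The Java below is only an illustration under that assumption; <code>SnippetDemo</code>, <code>snippet</code> and the <code>radius</code> parameter are hypothetical names, and real search engines do considerably more than this.</p>

```java
// Hypothetical illustration of a keyword-in-context snippet; not how any
// particular search engine builds its result snippets.
final class SnippetDemo {

    // Returns up to `radius` characters on each side of the first
    // case-insensitive match of `term`. If the term does not occur,
    // falls back to the first 2 * radius characters of the document.
    static String snippet(String doc, String term, int radius) {
        int hit = -1;
        for (int i = 0; i + term.length() <= doc.length(); i++) {
            if (doc.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit == -1) {
            return doc.substring(0, Math.min(doc.length(), 2 * radius));
        }
        int start = Math.max(0, hit - radius);
        int end = Math.min(doc.length(), hit + term.length() + radius);
        // Mark truncation on either side with an ellipsis.
        String head = (start == 0) ? "" : "...";
        String tail = (end == doc.length()) ? "" : "...";
        return head + doc.substring(start, end) + tail;
    }

    public static void main(String[] args) {
        System.out.println(snippet("abcdefghij", "EF", 2)); // prints ...cdefgh...
    }
}
```

<p>A window like this only tells the searcher that the term occurs and what surrounds it, which is why it often says little about what the page as a whole is about.</p>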
<p>
<strong>Dygest</strong> aims to solve this by either replacing or enhancing the current search snippet with a summary of the result page. At its core lies a summarization engine which figures out what the *real* content of the page is (distinguishing it from the other junk like surrounding text, navigational text, comments etc) and then performs text summarization on this content. The summary of the page is then displayed to the user via the appropriate interface. How cool is that?</p>
<p>
The user no longer needs to click on irrelevant links. He/She can perceive the theme/important facts of the page from right within the results page. The other advantage of this is that it gives the user a good overview of the query topic &#8211; he/she no longer needs to spend time reading many long documents but can rather read a few summaries from the top results to get a good overview of the subject. This is particularly well suited for mobile devices where it&#8217;s frustrating to switch back and forth between pages and the search results. This is also fit for news articles where we just need the important facts about the story.</p>
<p>
Well, here is an example to convince you. A search for &#8216;Carol Bartz&#8217; yields the following result, which at first glance is not at all informative.</p>
<p><center> <img alt="" border="2" src="http://farm4.static.flickr.com/3456/3369960208_48edc07644_o.png" title="search snippet for Carol Bartz" /> </center></p>
<p>
Enhancing the existing view with an abstract of the page helps gauge the content and theme of the document. This would now look like:</p>
<p><center> <img alt="" src="http://farm4.static.flickr.com/3637/3369975750_f0b313ae61_o.png" title="summarized view" /> </center></p>
<p><strong>Dygest</strong> outputs the following summaries for the query &#8216;<a href="http://datacracy.info/cgi-bin/dygest/search.py?q=iran+site%3Anews.yahoo.com">Iran</a>&#8217; restricted to Yahoo! News:</p>
<p><center><img alt="" src="http://farm4.static.flickr.com/3658/3370011200_a757dc42d8_o.png" title="Query for Iran" /></center></p>
<p>And the following for &#8216;<a href="http://datacracy.info/cgi-bin/dygest/search.py?q=obama+stimulus+plan">Obama stimulus plan</a>&#8217;:</p>
<p><center><img alt="" src="http://farm4.static.flickr.com/3578/3370098322_1a73cd285b_o.png" title="obama stimulus plan"  /></center></p>
<p>Currently, <strong>Dygest</strong> has two interfaces &#8211; (1) a search interface powered by Yahoo! BOSS and (2) a Searchmonkey plugin. It&#8217;s just a prototype so be kind and don&#8217;t be too judgmental.</p>
<p>Start dygest<em>ing</em> <a href="http://datacracy.info/dygest/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Interfacing Hadoop With MySQL</title>
		<link>http://semanticvoid.com/blog/2009/03/05/interfacing-hadoop-with-mysql/</link>
		<comments>http://semanticvoid.com/blog/2009/03/05/interfacing-hadoop-with-mysql/#comments</comments>
		<pubDate>Fri, 06 Mar 2009 01:11:19 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=154</guid>
		<description><![CDATA[When you have terabytes of time series data, deciding how you will process it becomes more important than the issue of storage. MySQL serves well for storing such data but the complexity arises when we have to perform complex calculations or data mining operations on this sequential data. The mapreduce framework is designed to handle [...]]]></description>
			<content:encoded><![CDATA[<p>When you have terabytes of time series data, deciding how you will process it becomes more important than the issue of storage. MySQL serves well for storing such data but the complexity arises when we have to perform complex calculations or data mining operations on this sequential data. The mapreduce framework is designed to handle this kind of data well, but getting MySQL to do mapreduce-like processing is not supported unless you have access to <a href="http://www.asterdata.com/blog/index.php/2009/02/10/advanced-sql-made-easy-introducing-npath/">nPath</a>. The other solution is to get this data into an existing mapreduce framework like <a href="http://hadoop.apache.org/">Hadoop</a>. In a recent hadoop release (0.19), mapreduce jobs have the ability to take the input from databases [<a href="http://issues.apache.org/jira/browse/HADOOP-2536">link</a>]. Recently, I tried interfacing hadoop with MySQL and although it was an easy task, I did not find much documentation on the topic. So in this post I intend to outline the way you can get Hadoop talking to MySQL.</p>
<p>Let&#8217;s try to implement Hadoop&#8217;s &#8216;Hello World&#8217; (Word Count) example. The MySQL table is as follows:<br />
<code><br />
CREATE TABLE `wordcount` (<br />
  `word` varchar(255) DEFAULT NULL,<br />
  `count` int(11) DEFAULT NULL<br />
)<br />
</code></p>
<p>We need a class that implements the map and reduce tasks. Let&#8217;s call this class WordCount. This class needs to extend <a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/conf/class-use/Configured.html">Configured</a> and implement the <a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/util/Tool.html">Tool</a> interface.</p>
<p><code>public class WordCount extends Configured implements Tool {</code></p>
<p>We need to implement the Tool interface so that the generic options get parsed. This is needed as we will be passing the mysql-connector jar via a command line argument (-libjars) to hadoop. This jar will become part of the custom configuration for the WordCount class. Thus, the WordCount class needs to be configurable as well. This is done by extending Configured.</p>
<p>Tuples/rows from the DB are converted to Java objects. Thus we need to define a class that would hold the tuples. All such classes need to implement the Writable and DBWritable interfaces. Typically every table that we want to read/write needs to be represented by a class implementing the above interfaces. We will be dealing with reading tables, hence only the read functions are overridden.</p>
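<pre>
    static class WordRecord implements Writable, DBWritable {
        String word;
        int count;

        // Writable: serialization is not needed for this read-only job
        public void write(DataOutput arg0) throws IOException {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        // Writable: deserialize a record from Hadoop's binary format
        public void readFields(DataInput in) throws IOException {
            this.word = Text.readString(in);
            this.count = in.readInt();
        }

        // DBWritable: writing rows back to the DB is not needed here
        public void write(PreparedStatement arg0) throws SQLException {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        // DBWritable: populate the record from one row of the result set
        // (column order matches the field list passed to DBInputFormat)
        public void readFields(ResultSet rs) throws SQLException {
            this.word = rs.getString(1);
            this.count = rs.getInt(2);
        }
    }
</pre>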
<p>Once we&#8217;re done with this, it&#8217;s time to define the map and reduce operations. In the map function, we just output the tuple, with the word as the key and its count as the value.</p>
<pre>
    static class WordCountMapper extends MapReduceBase
            implements Mapper&lt;LongWritable, WordRecord, Text, LongWritable&gt; {

        public void map(LongWritable key, WordRecord value,
                OutputCollector&lt;Text, LongWritable&gt; output, Reporter reporter)
                throws IOException {

            output.collect(new Text(value.word), new LongWritable(value.count));
        }
    }
</pre>
<p>In the reduce function, we sum up the counts associated with a given word and output the pair [word, sum].</p>
<pre>
    static class WordCountReducer extends MapReduceBase
            implements Reducer&lt;Text, LongWritable, Text, LongWritable&gt; {

        public void reduce(Text key, Iterator&lt;LongWritable&gt; values,
                OutputCollector&lt;Text, LongWritable&gt; output, Reporter reporter)
                throws IOException {

            long sum = 0L;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new LongWritable(sum));
        }
    }
</pre>
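<p>For intuition, the mapper and reducer above can be mimicked without a cluster. The sketch below is plain Java with no Hadoop or MySQL involved; the class <code>LocalWordCount</code> and its sample rows are invented for illustration. It emits one line per row, sorts the lines so that equal words end up adjacent (standing in for the sort Hadoop performs between the map and reduce phases), and sums each run.</p>

```java
import java.util.Arrays;

// Hypothetical stand-alone simulation of the word count job; it only mirrors
// the data flow and uses none of the Hadoop classes shown above.
final class LocalWordCount {

    // "Map" step: emit one "word TAB count" line per table row.
    static String[] map(String[] words, int[] counts) {
        String[] emitted = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            emitted[i] = words[i] + "\t" + counts[i];
        }
        return emitted;
    }

    // "Shuffle" and "reduce" steps: sort so that lines with the same word are
    // adjacent, then sum the counts of each run. Assumes words contain no tabs.
    static String reduce(String[] emitted) {
        String[] sorted = emitted.clone();
        Arrays.sort(sorted);
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < sorted.length) {
            String key = sorted[i].substring(0, sorted[i].indexOf('\t'));
            long sum = 0L;
            while (i < sorted.length && sorted[i].startsWith(key + "\t")) {
                sum += Long.parseLong(sorted[i].substring(key.length() + 1));
                i++;
            }
            out.append(key).append('\t').append(sum).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "mysql", "hadoop"};
        int[] counts = {2, 5, 3};
        // prints two tab-separated lines: hadoop 5, then mysql 5
        System.out.print(reduce(map(words, counts)));
    }
}
```

<p>Because summing is associative, the same reduce logic can also run as a combiner on partial results, which is why the job configuration can reuse the reducer class for both roles.</p>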
<p>Since we implemented the Tool interface, we need the WordCount class to implement the <strong>run</strong> function. In this function, we will specify the job configuration, configure the mapper and reducer classes, configure the DB, set the input and output to the job etc.</p>
<pre>
public int run(String[] arg0) throws Exception {
        // the getConf method is implemented by Configured - this way we can pass the generic options to the job
        JobConf job = new JobConf(getConf(), WordCount.class);

        job.setJobName("word count job");

        // set the mapper and reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        // configure the DB - provide the driver class, provide the mysql host and db name in the url
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/mysqlhadoop");

        // define the fields you want to access from the table
        String[] fields = {"word", "count"};

        // specify which class represents the tuple as well as the table to be accessed (in this case 'wordcount')
        // alternatively we can also specify a SQL query (which here is null)
        // we are sorting the results by the field 'word'
        DBInputFormat.setInput(job, WordRecord.class, "wordcount", null, "word", fields);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // write the final results to a folder in HDFS
        // alternatively we can also write the output back to mysql using DBOutputFormat
        FileOutputFormat.setOutputPath(job, new Path("output_wordcount"));

        JobClient.runJob(job);

        return 0;
    }
</pre>
<p>And now finally we can define the main function as</p>
<pre>
    public static void main(String args[]) throws Exception {
        int ret = ToolRunner.run(new WordCount(), args);
        System.exit(ret);
    }
</pre>
<p>To see how the above pieces fit together take a look at <a href="http://semanticvoid.com/code/WordCount.java">WordCount.java</a>. To run this job we need to provide the mysql-connector jar in the classpath. This can be done either by placing this JAR in $HADOOP_HOME/lib or by providing this JAR on the command line as follows:</p>
<p><code>$ ==>  hadoop jar wordcount.jar WordCount -libjars mysql-connector-java-5.1.7-bin.jar</code></p>
<p>It&#8217;s that easy to get hadoop talking to MySQL and you are ready to do some heavy number crunching. <a href="http://semanticvoid.com/code/WordCount.java">Get hold of WordCount.java here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2009/03/05/interfacing-hadoop-with-mysql/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>How the &#8220;What&#8221; becomes the &#8220;How&#8221;</title>
		<link>http://semanticvoid.com/blog/2008/10/16/how-the-what-becomes-the-how/</link>
		<comments>http://semanticvoid.com/blog/2008/10/16/how-the-what-becomes-the-how/#comments</comments>
		<pubDate>Fri, 17 Oct 2008 03:45:51 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2008/10/16/how-the-what-becomes-the-how/</guid>
		<description><![CDATA[It&#8217;s about time I shared one of the best articles (one of my favorites) I have come across yet &#8211; one that provides a unique perspective on AI and gets you a step closer to understanding Alan Turing. This is an article written by Edward A. Feigenbaum (co-recipient of the Turing Award &#8217;94 along with [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s about time I shared one of the best articles (one of my favorites) I have come across yet &#8211; one that provides a unique perspective on AI and gets you a step closer to understanding Alan Turing. This is an article written by Edward A. Feigenbaum (co-recipient of the Turing Award &#8217;94 along with Raj Reddy).</p>
<p>The *What* to *How* spectrum on page 100 is a must see.</p>
<p><object codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" id="doc_263466573406366" name="doc_263466573406366" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" align="middle"	height="500" width="100%"><param name="movie"	value="http://documents.scribd.com/ScribdViewer.swf?document_id=6911818&#038;access_key=key-1mtts2o7xsqw77c81a1a&#038;page=&#038;version=1&#038;auto_size=true&#038;viewMode="></param><param name="quality" value="high"></param><param name="play" value="true"></param><param name="loop" value="true"></param><param name="scale" value="showall"></param><param name="wmode" value="opaque"></param><param name="devicefont" value="false"></param><param name="bgcolor" value="#ffffff"></param><param name="menu" value="true"></param><param name="allowFullScreen" value="true"></param><param name="allowScriptAccess" value="always"></param><param name="salign" value=""><embed src="http://documents.scribd.com/ScribdViewer.swf?document_id=6911818&#038;access_key=key-1mtts2o7xsqw77c81a1a&#038;page=&#038;version=1&#038;auto_size=true&#038;viewMode=" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" play="true" loop="true" scale="showall" wmode="opaque" devicefont="false" bgcolor="#ffffff" name="doc_263466573406366_object" menu="true" allowfullscreen="true" allowscriptaccess="always" salign="" type="application/x-shockwave-flash" align="middle"  height="500" width="100%"></embed></param></object>
<div style="font-size:10px;text-align:center;width:100%"><a href="http://www.scribd.com/doc/6911818/How-the-What-Becomes-the-How">How the &#8220;What&#8221; Becomes the &#8220;How&#8221;</a> &#8211; <a href="http://www.scribd.com/upload">Upload a Document to Scribd</a></div>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2008/10/16/how-the-what-becomes-the-how/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dangerous Knowledge</title>
		<link>http://semanticvoid.com/blog/2008/10/10/dangerous-knowledge/</link>
		<comments>http://semanticvoid.com/blog/2008/10/10/dangerous-knowledge/#comments</comments>
		<pubDate>Sat, 11 Oct 2008 06:53:35 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2008/10/10/dangerous-knowledge/</guid>
		<description><![CDATA[Stumbled upon this great documentary on four famous mathematicians &#8211; Georg Cantor, Ludwig Boltzmann, Kurt Gödel and Alan Turing. This documentary talks about their extraordinary intellectual powers and how it drove them insane, eventually leading them to committing suicide.

 

Fun fact: Turing was conceived in Chhatrapur, Orissa, India. His father, Julius Mathison Turing, was a [...]]]></description>
			<content:encoded><![CDATA[<p>Stumbled upon this great documentary on four famous mathematicians &#8211; Georg Cantor, Ludwig Boltzmann, Kurt Gödel and Alan Turing. This documentary talks about their extraordinary intellectual powers and how it drove them insane, eventually leading them to committing suicide.</p>
<p><center><br />
<embed id="VideoPlayback" src="http://video.google.com/googleplayer.swf?docid=-5122859998068380459&#038;hl=en&#038;fs=true" style="width:400px;height:326px" allowFullScreen="true" allowScriptAccess="always" type="application/x-shockwave-flash"> </embed><br />
</center></p>
<p>Fun fact: Turing was conceived in Chhatrapur, Orissa, India. His father, Julius Mathison Turing, was a member of the Indian Civil Service. (via <a href="http://en.wikipedia.org/wiki/Alan_Turing">Wikipedia</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2008/10/10/dangerous-knowledge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Grammar Of Thought</title>
		<link>http://semanticvoid.com/blog/2008/09/03/the-grammar-of-thought/</link>
		<comments>http://semanticvoid.com/blog/2008/09/03/the-grammar-of-thought/#comments</comments>
		<pubDate>Wed, 03 Sep 2008 09:10:43 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Knowledge]]></category>
		<category><![CDATA[Natural Language Processing]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2008/09/03/the-grammar-of-thought/</guid>
		<description><![CDATA[ Update:  Found this interesting book related to this post &#8211; The Language Instinct [link]
I have just started to scratch the surface of Natural Language Processing for my next project (involving NLP and Twitter &#8211; details to follow) and I already have a dozen questions bothering me. I shall attempt to put forth a [...]]]></description>
			<content:encoded><![CDATA[<p><b> Update: </b> Found this interesting book related to this post &#8211; The Language Instinct [<a href="http://pinker.wjh.harvard.edu/books/tli/index.html">link</a>]</p>
<p>I have just started to scratch the surface of Natural Language Processing for my next project (involving NLP and Twitter &#8211; details to follow) and I already have a dozen questions bothering me. I shall attempt to put forth a few of the ideas and questions in this post. Let&#8217;s talk briefly about the structure of language. Language has different levels of structure:</p>
<ol>
<li> discourse &#8211; group of sentences</li>
<li> sentences</li>
<li> phrases</li>
<li> words</li>
<li> and so on&#8230;</li>
</ol>
<p>Between the &#8216;sentences&#8217; and &#8216;words&#8217; lies the syntactic structure of language. This syntactic structure is built using the <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">parts of speech</a> of the words: nouns, verbs, etc. Words are grouped into phrases whose formation is governed by grammar rules, for example:</p>
<p>Sentence -> &#8216;Noun Phrase&#8217; . &#8216;Verb Phrase&#8217;<br />
&#8216;Noun Phrase&#8217; -> Determiner . Adjective . Noun<br />
&#8216;Verb Phrase&#8217; -> Verb . &#8216;Noun Phrase&#8217;</p>
<p>A sentence is grammatically correct if it adheres to the grammar of the language (as described above). With just this knowledge about language (something you might have learnt in the 5th grade) we can see that for a candidate sentence to make sense in some language, it has to be composed of meaningful components, and these components have to appear in a specific order.</p>
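<p>As a rough illustration, the three rules above can be checked mechanically. The Python sketch below is only a toy: the tiny lexicon and the names LEXICON, tag and is_grammatical are made up for this example, and because these particular rules are not recursive they are simply flattened into one expected sequence of parts of speech. A real parser would need a proper tagger and a far richer grammar.</p>
<pre>
```python
# Toy recognizer for the three rules above. The lexicon is an assumption
# invented for illustration; it does not come from any real tagger.
LEXICON = {
    "the": "Determiner",
    "a": "Determiner",
    "quick": "Adjective",
    "lazy": "Adjective",
    "fox": "Noun",
    "dog": "Noun",
    "chased": "Verb",
}

# 'Noun Phrase' = Determiner . Adjective . Noun
NOUN_PHRASE = ["Determiner", "Adjective", "Noun"]


def tag(sentence):
    """Map each word to its part of speech, or None if the word is unknown."""
    return [LEXICON.get(word) for word in sentence.lower().split()]


def is_grammatical(sentence):
    """Sentence = 'Noun Phrase' . 'Verb Phrase'; 'Verb Phrase' = Verb . 'Noun Phrase'."""
    expected = NOUN_PHRASE + ["Verb"] + NOUN_PHRASE
    return tag(sentence) == expected


print(is_grammatical("The quick fox chased the lazy dog"))  # True
print(is_grammatical("Chased the fox quick the dog lazy"))  # False
```
</pre>
<p>Both sentences use exactly the same words; only the ordering decides whether the grammar accepts them.</p>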
<p><b>Grammar of Thought</b></p>
<p>This has led me to ponder if an analogous grammar exists for &#8216;thought&#8217;. Our thoughts can also be broken down into meaningful components and the components here also have to follow some implicit ordering for the &#8216;thought&#8217; to make sense. If you think about the way you think, you will notice that as you run from one thought to another there is some logical connection between them just as between the sentences in a paragraph. If we could somehow get a formal representation of this grammar, wouldn&#8217;t it enable machines to think?</p>
<p><b>Language and Thought</b></p>
<p>There is enough literature out there which links the structure of language with the structure of thought. Benjamin Whorf states in his writings:</p>
<blockquote><p> the structure of a human being&#8217;s language influences the manner in which he understands reality and behaves with respect to it </p></blockquote>
<p>On this view, human cognition is shaped by the structure of language, which in turn is defined by the grammar of that language. Hence a machine capable of generating a sequence of grammatically correct sentences that also fit together logically (a discourse) should have some capacity for cognition. Even the Turing test uses natural language as a test for some level of cognition. Is this perspective of Natural Language Processing, as a means of giving a machine cognition, correct? Could this be another path to achieving artificial intelligence? I would love to get an answer to this from NLP experts out there.</p>
<p>Or is this just another of my posts that doesn&#8217;t make sense because it&#8217;s 3am and I&#8217;m half asleep?</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2008/09/03/the-grammar-of-thought/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The Monkey Just Got Delicious &#8211; II</title>
		<link>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/</link>
		<comments>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/#comments</comments>
		<pubDate>Tue, 12 Aug 2008 06:11:35 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Yahoo!]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/</guid>
		<description><![CDATA[[UPDATE] Try the search monkey app here
This is a follow-up post of The Monkey Just Got Delicious &#8211; I. The app is not yet public for the reasons mentioned in part I. As I had mentioned, my goal was to generate a tag cloud for the search results. Well, search monkey does not allow you [...]]]></description>
			<content:encoded><![CDATA[<p><b>[UPDATE]</b> Try the search monkey app <a href="http://gallery.search.yahoo.com/application?smid=YLs.s">here</a></p>
<p>This is a follow-up to <a href="http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/">The Monkey Just Got Delicious &#8211; I</a>. The app is not yet public for the reasons mentioned in <a href="http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/">part I</a>. As I mentioned there, my goal was to generate a tag cloud for the search results. However, Search Monkey does not allow you to emit arbitrary HTML, which makes it difficult to render a tag cloud. After much thought I settled for a color-coded tag cloud (as in the screenshot below): the color of the tags gradually fades, and a darker shade means that the tag is more popular.</p>
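<p>The color coding can be sketched roughly as follows. This is not the actual Deliciousify code: the function names, the choice of gray shades and the 0 to 200 brightness range are all assumptions made for illustration, written in Python. The only idea taken from the app is that a more popular tag gets a darker shade.</p>
<pre>
```python
def shade_for(count, max_count):
    """Return a hex gray for a tag: the more popular the tag, the darker the shade."""
    ratio = count / max(max_count, 1)    # 1.0 for the most popular tag
    level = round(200 * (1 - ratio))     # 0 is black, 200 is light gray (assumed range)
    return "#{0:02x}{0:02x}{0:02x}".format(level)


def color_tags(tag_counts):
    """Map each tag to a shade, relative to the most bookmarked tag."""
    top = max(tag_counts.values(), default=0)
    return {tag: shade_for(count, top) for tag, count in tag_counts.items()}


# Hypothetical bookmark counts for one search result.
print(color_tags({"python": 40, "nlp": 20, "search": 10}))
# {'python': '#000000', 'nlp': '#646464', 'search': '#969696'}
```
</pre>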
<p>Got feedback, will listen.</p>
<p><center><br />
<table>
<tr>
<td><img src="http://farm4.static.flickr.com/3105/2756185874_3907a8df26_o.png" alt="New Deliciousify" /></td>
<td> <img src="http://farm4.static.flickr.com/3043/2756301394_021e904c67_o.png"/></td>
</tr>
</table>
</center></p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Monkey Just Got Delicious</title>
		<link>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/</link>
		<comments>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/#comments</comments>
		<pubDate>Mon, 11 Aug 2008 09:04:56 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Yahoo!]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/</guid>
		<description><![CDATA[[UPDATE] Try the search monkey app here
[UPDATE] New tag cloud UI for deliciousify can be viewed here
[UPDATE] The search monkey app is currently disabled for public use as it was hitting the delicious rate limit. Hence it will remain as a prototype for now. BTW the delicious team is working on their own search monkey [...]]]></description>
			<content:encoded><![CDATA[<p><strong>[UPDATE]</strong> Try the search monkey app <a href="http://gallery.search.yahoo.com/application?smid=YLs.s">here</a></p>
<p><strong>[UPDATE]</strong> New tag cloud UI for deliciousify can be viewed <a href="http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/">here</a></p>
<p><strong>[UPDATE]</strong> The search monkey app is currently disabled for public use as it was hitting the delicious rate limit. Hence it will remain a prototype for now. BTW the delicious team is working on their own search monkey app and I bet it&#8217;s going to be much cooler.</p>
<p><img src="http://developer.search.yahoo.com/images/searchmonkeyLogo147x150.gif" alt="" />I&#8217;m a big fan and an avid user of <a href="http://developer.search.yahoo.com">Yahoo! Search Monkey</a>. So this weekend I decided to write myself the search monkey application that I have always wished for. I think we can all agree that nothing beats human-created metadata, and what better metadata about search results can there be than the vast and rich knowledge stored in bookmarking services? My search monkey application enriches the organic search results from Yahoo! with metadata from del.icio.us.</p>
<p><center>[LINK DISABLED] <a href="">Try Deliciousify Search Monkey App here</a></center></p>
<p>Sometimes the search summary does not provide a useful insight into the contents of the search result (as seen below). The only way for users to ascertain relevance is to click on the result and figure it out themselves. Wouldn&#8217;t it be better if the contents of the result could be summarized by just a few words &#8211; keywords that broadly highlight what the document talks about? Deliciousify (as seen below) aims to solve this problem by listing the top del.icio.us tags for a search result, along with the result&#8217;s popularity (the number of people who have bookmarked it). In the future, I plan to display a tag cloud for the results. Give it a try and send any comments/feedback my way.</p>
<p><center>[LINK DISABLED] Make your search results more delicious &#8211; <a title="Add the deliciousify Enhanced Result to your Search preferences" href=""> click here </a></center></p>
<p><center><img src="http://farm4.static.flickr.com/3024/2753013032_2393cce7b0_o.png" alt="" /></center></p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>
