<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Matt_ptr *</title>
	
	<link>http://mattptr.net</link>
	<description>Programming and stuff -- incoherent and unfocused since 1997</description>
	<lastBuildDate>Tue, 07 Sep 2010 18:35:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/mattptr/vmiR" /><feedburner:info uri="mattptr/vmir" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:browserFriendly></feedburner:browserFriendly><item>
		<title>Haven’t given up</title>
		<link>http://mattptr.net/2010/09/07/havent-given-up/</link>
		<comments>http://mattptr.net/2010/09/07/havent-given-up/#comments</comments>
		<pubDate>Tue, 07 Sep 2010 18:35:37 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[real life]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=185</guid>
		<description><![CDATA[I haven&#8217;t given up on the projects that I&#8217;ve started recently. However, work has been pretty busy and looks to be getting busier. That means the jQuery Plugin Index that I was in the midst of creating has stalled. I still *want* to do it. Very much so. However, I did notice that someone [...]]]></description>
			<content:encoded><![CDATA[<p>I haven&#8217;t given up on the projects that I&#8217;ve started recently. However, work has been pretty busy and looks to be getting busier. That means the jQuery Plugin Index that I was in the midst of creating has stalled. I still *want* to do it. Very much so. However, I did notice that someone is working on <a href="http://pypi.appspot.com/">PyPI for Google App Engine</a>. If it gets fully implemented, I think it would be easy enough to fork that and adapt it.</p>
<p>In the meantime, <a href="http://mattptr.net/2010/07/28/building-python-extensions-in-a-modern-windows-environment/">my post</a> on building Python extensions on Windows has gotten a lot of attention. I hope it helps people out, but believe it or not, I still have trouble building certain extensions. It&#8217;s especially painful if the extension depends on a library that doesn&#8217;t have native Windows support. But hopefully, this cuts down on the need for running and maintaining a VM just so you can code fun stuff with Python.</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/09/07/havent-given-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stupid site hacked</title>
		<link>http://mattptr.net/2010/08/11/stupid-site-hacked/</link>
		<comments>http://mattptr.net/2010/08/11/stupid-site-hacked/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 19:02:03 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=182</guid>
		<description><![CDATA[So it looks like some bot managed to guess my FTP password and install a malware script onto my WordPress files. This in turn caused Google to report this site as distributing malware and block it (at least in Chrome). I&#8217;m 99% certain it was a bot, since I&#8217;ve seen the *exact* same hack [...]]]></description>
			<content:encoded><![CDATA[<p>So it looks like some bot managed to guess my FTP password and install a malware script onto my WordPress files. This in turn caused Google to report this site as distributing malware and block it (at least in Chrome).</p>
<p>I&#8217;m 99% certain it was a bot, since I&#8217;ve seen the *exact* same hack done on a Drupal site that I did for work. </p>
<p>I&#8217;ve changed everything&#8230; hopefully this won&#8217;t happen again. </p>
<p>I think this came from submitting the Python Extensions post to reddit. That post had a good number of spam comments blocked.</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/08/11/stupid-site-hacked/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building Python Extensions in a Modern Windows Environment</title>
		<link>http://mattptr.net/2010/07/28/building-python-extensions-in-a-modern-windows-environment/</link>
		<comments>http://mattptr.net/2010/07/28/building-python-extensions-in-a-modern-windows-environment/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 14:37:10 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[64bit]]></category>
		<category><![CDATA[Python Extensions]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=180</guid>
		<description><![CDATA[A few days ago I decided to upgrade to Python 2.7. I&#8217;m running Windows 7 64-bit &#8212; pretty sweet as far as Windows goes. ;) So, I&#8217;m thinking to myself, &#8220;I&#8217;m running a 64-bit OS, why was I running a 32-bit Python?&#8221; While the core Python distribution is available in 64-bit, many many packages that [...]]]></description>
			<content:encoded><![CDATA[<p>A few days ago I decided to upgrade to Python 2.7. I&#8217;m running Windows 7 64-bit &#8212; pretty sweet as far as Windows goes. ;) So, I&#8217;m thinking to myself, &#8220;I&#8217;m running a 64-bit OS, why was I running a 32-bit Python?&#8221;</p>
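<p>If you&#8217;re not sure which flavor of Python you actually have installed, a quick check along these lines should tell you. This is only a sketch using the standard library; the function name is made up for illustration.</p>
<pre>import platform
import struct
import sys


def interpreter_bits():
    """Return the pointer size of the running interpreter, in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Python %s, %d-bit" % (platform.python_version(), interpreter_bits()))
    # sys.maxsize is 2**63 - 1 on the usual 64-bit builds.
    print("64-bit build: %s" % (sys.maxsize == 2**63 - 1))</pre>
<p>A 64-bit OS will happily run a 32-bit interpreter, so the OS alone doesn&#8217;t answer the question.</p>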
<p>While the core Python distribution is available in 64-bit, many many packages that I depend on only supply precompiled binaries for the 32-bit Python distribution. Why? I have no idea. There are two things you can do.</p>
<ol>
<li>Use <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">this site</a>. There are a bunch of packages available with 64-bit in mind that aren&#8217;t available from the package&#8217;s maintainers. MySQL-Python, for instance.</li>
<li>Compile them yourself. The unofficial repository doesn&#8217;t have all packages on PyPI compiled for Windows. gevent is one I&#8217;ve come to depend on a lot, and it&#8217;s not available &#8212; so I had to find a way to build extensions myself. Here&#8217;s how&#8230;</li>
</ol>
<h3>Install Microsoft Visual C++ 2008</h3>
<p>Don&#8217;t bother with MinGW. Let me say it again &#8212; DO NOT USE MINGW FOR THIS! For one, the standard mingw distro is 32-bit. I found a gcc toolchain for 64-bit Windows, but I couldn&#8217;t get it to work. The Python import lib is made for Visual Studio. There are apparently ways to convert the file to something compatible, but I spent 4-5 hours trying to get this to work to absolutely no avail. Save yourself the trouble.</p>
<p>Additionally, you can&#8217;t use Visual C++ 2010. Python&#8217;s distutils lib is not set up to handle it. Visual C++ Express works, as long as it&#8217;s 2008.</p>
<p>Note that if you have Visual Studio 2008 Professional, Team Studio or whatever, you should be able to stop here. The Express editions, however, don&#8217;t have the 64-bit environment, so we need to do more stuff.</p>
<h3>Install the Windows 7 Platform SDK</h3>
<p>Now just called the Windows SDK. You can get it <a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=6b6c21d2-2006-4afa-9702-529fa782d63b&amp;displaylang=en">here</a>. It&#8217;s pretty large, so be prepared. Obviously, make sure you install the 64-bit environment.</p>
<h3>Trick distutils</h3>
<p>distutils looks for a file called vcvarsall.bat, runs it, and gets the include and lib directories that the batch file sets up. The batch file sets up the environment based on what platform you supply to it &#8212; in this case, amd64. Unfortunately, Visual C++ Express does not have the proper files for 64-bit compilation, but you can set it up pretty easily.</p>
<p>vcvarsall.bat should be in a directory like: C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC</p>
<p>You need to create:</p>
<ul>
<li>The directory &#8212; C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\amd64\</li>
<li>The file &#8212; C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\amd64\vcvarsamd64.bat</li>
</ul>
<p>The Windows SDK comes with a fully working 64-bit environment, so we just need to point vcvarsamd64.bat to the new SDK &#8212; which distutils doesn&#8217;t recognize.</p>
<p>So in vcvarsamd64.bat put:</p>
<pre>call "C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\SetEnv.cmd" /x64 /Release</pre>
<p>This assumes you let the Windows SDK install in the default location.</p>
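<p>The directory and batch file can also be created from a short script. The sketch below assumes the default paths quoted above; the function name and its parameters are hypothetical, and writing under Program Files needs an administrator prompt.</p>
<pre>import os

# Default locations quoted in the post; adjust them if you installed elsewhere.
VC_DIR = r"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC"
SETENV = r"C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\SetEnv.cmd"


def write_vcvarsamd64(vc_dir=VC_DIR, setenv=SETENV):
    """Create bin/amd64/vcvarsamd64.bat under vc_dir and return its path."""
    amd64_dir = os.path.join(vc_dir, "bin", "amd64")
    if not os.path.isdir(amd64_dir):
        os.makedirs(amd64_dir)
    bat_path = os.path.join(amd64_dir, "vcvarsamd64.bat")
    with open(bat_path, "w") as bat:
        # The stub just hands off to the SDK's own environment script.
        bat.write('call "%s" /x64 /Release\n' % setenv)
    return bat_path


if __name__ == "__main__":
    # Writing under Program Files needs an elevated (administrator) prompt.
    print(write_vcvarsamd64())</pre>
<p>Pass different arguments if Visual C++ or the SDK lives somewhere else on your machine.</p>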
<h3>Still not done</h3>
<p>We have to patch distutils now. Unfortunately, the new linker doesn&#8217;t generate .manifest files by default, but distutils tries to embed a manifest file in the dll (pyd) that it just built, and *will fail* if it is unable to do so.</p>
<p>To fix this, add the following line to distutils\msvc9compiler.py after line 648:</p>
<pre>ld_args.append('/MANIFEST')</pre>
<h3>That&#8217;s it!</h3>
<p>You should now be able to build your own extensions for 64-bit Python in Windows 7! You can have PyCrypto, gevent, ZODB, and so on.</p>
<h3>Side Note</h3>
<p>If you&#8217;re having trouble with pip or easy_install opening up a separate console window, it&#8217;s an easy fix. It&#8217;s not necessarily a problem, but it&#8217;s annoying: the console window disappears as soon as the operation is done, whether it fails or completes.</p>
<p>The issue is that setuptools is running a 32-bit application, and Windows 7 (smartly) runs 32-bit applications in a separate process.</p>
<p>The fix is to uninstall setuptools and pip, and reinstall setuptools from source. Do not use ez_setup.py. I don&#8217;t know if you need to be able to build extensions before you can build setuptools, but that&#8217;s what I did. After that, you can easy_install pip, and pip will now run in a 64-bit environment too. Yay!</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/07/28/building-python-extensions-in-a-modern-windows-environment/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Replacing the jQuery Plugin site</title>
		<link>http://mattptr.net/2010/07/09/replacing-the-jquery-plugin-site/</link>
		<comments>http://mattptr.net/2010/07/09/replacing-the-jquery-plugin-site/#comments</comments>
		<pubDate>Fri, 09 Jul 2010 17:26:11 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[jquery]]></category>
		<category><![CDATA[new project]]></category>
		<category><![CDATA[plugins]]></category>
		<category><![CDATA[redo the jquery plugin site]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=172</guid>
		<description><![CDATA[I touched on this a while back, but I never followed through with it. Having some down time at work, I&#8217;ve decided to jump in. I want to replace http://plugins.jquery.com. There are numerous problems with it, and one of the reasons I got overwhelmed by this project originally is that I wanted to fix the [...]]]></description>
			<content:encoded><![CDATA[<p>I touched on this a while back, but I never followed through with it. Having some down time at work, I&#8217;ve decided to jump in. I want to replace <a href="http://plugins.jquery.com">http://plugins.jquery.com</a>.</p>
<p>There are numerous problems with it, and one of the reasons I got overwhelmed by this project originally is that I wanted to fix the site rather than replace it. So now I&#8217;ve wised up and decided to start from scratch without considering any aspect of how the site currently works.</p>
<p>Here are the problem areas, as I see them, in no particular order.</p>
<h3>Browsing through plugins is ridiculously terrible</h3>
<p>First, when you get to the page, you get a bunch of categories. Compare this with <a href="http://pypi.python.org">http://pypi.python.org</a>. PyPI gives a tabular listing of 40 recently updated packages. For the Latest Releases, PJC (plugins.jquery.com) gives a full body of content for each item and goes on for a zillion pages.</p>
<p>Second, from the start page on PJC (with the category listing), the &#8220;Browse by Name&#8221; tab doesn&#8217;t work. The &#8220;Browse by Date&#8221; tab does work, but what date? The date the plugin was created, or the date of the last release? It turns out this is the same as the &#8220;Latest Releases&#8221; page, except that the tab navigation at the top doesn&#8217;t disappear. The &#8220;All Plugins&#8221; link is the same as the &#8220;Browse by Name&#8221; tab and also doesn&#8217;t work.</p>
<p>Lastly, browsing plugins in a category gives a different layout from browsing by date. Why? It&#8217;s the same information, just sorted differently and filtered.</p>
<h3>Searching is basically useless</h3>
<p>Do you know why I&#8217;m surprised that people have actually used my timer plugin? Because I can&#8217;t even find it myself. Searching for &#8220;timer&#8221; yields 10 pages of results, and includes plain pages and issue tracker items.</p>
<p>I understand the appeal of having the bug tracker and plugin page tied together, but it&#8217;s terrible. A plugin like mine is so small that it doesn&#8217;t need a bug tracker. Not to mention that use of a bug tracker is annoying without the use of source control. The plugin author should bear the responsibility of setting up bug tracking, source control, etc. There are plenty of free sites to do that.</p>
<p>The search is easily bombed by adding keywords and tags (which are not moderated). So when I search for timer, the sixth result I get is for <a href="http://plugins.jquery.com/project/dualSlider">dualSlider</a> &#8212; perfect for managing timeouts and intervals.</p>
<h3>The Rating System</h3>
<p>There&#8217;s no point to this. The &#8220;Top Rated&#8221; plugins all have 1-3 votes. Plugins with more votes should have more clout. But it doesn&#8217;t really matter anyway. It&#8217;s not a popularity contest.</p>
<p>This particular part of PJC will have no part whatsoever in my new project. If there will be any spotlighting of plugins, it will be done by moderators.</p>
<h3>Other Data Formats</h3>
<p>Right now, there are no RSS feeds for plugins at all. Each plugin should have its own release feed, as well as a feed for all latest releases.</p>
<p>Writing a plugin manager currently would involve screen scraping the existing plugin page to see if there have been any changes. Of course, you have to know the URL of the plugin because searching basically gets nowhere, and if by some chance you were able to search, you&#8217;d have to scrape the search page as well.</p>
<p>That&#8217;s why I want to have everything available as JSON. Plugin details, list of plugins by category, search results&#8230; the new site has to be highly query-able. PyPI uses XML-RPC to expose its API. JSONRPC might be an option for this, or XML-RPC, but I&#8217;ll cross that bridge when I come to it.</p>
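<p>To make the idea concrete, here is a rough sketch of what a JSON details record might look like and how a plugin manager could use it. Nothing here is a real API: the field names, the values, and both function names are invented for illustration.</p>
<pre>import json

# Hypothetical record: every field name and value here is made up.
plugin = {
    "name": "timer",
    "summary": "Manage timeouts and intervals",
    "categories": ["Utilities"],
    "latest_release": {"version": "1.0", "released": "2010-07-09"},
}


def plugin_details_json(record):
    """Serialize one plugin record the way a details endpoint might."""
    return json.dumps(record, sort_keys=True, indent=2)


def needs_update(installed_version, details_json):
    """Compare version strings instead of scraping an HTML page."""
    latest = json.loads(details_json)["latest_release"]["version"]
    return installed_version != latest


if __name__ == "__main__":
    body = plugin_details_json(plugin)
    print(body)
    print(needs_update("0.9", body))</pre>
<p>The point is only that a client can read one small, predictable document rather than parse a page meant for humans.</p>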
<h3>Categories</h3>
<p>The Categories on PJC are terrible. Not in the way that they aren&#8217;t descriptive, but they just suck. They should be hierarchical. For example, &#8220;Widgets&#8221; and &#8220;Windows and Overlays&#8221; could fall under &#8220;User Interface.&#8221; Menus could as well.</p>
<p>I&#8217;m not sure how Navigation and Menus are different.</p>
<p>DOM should probably be a child of Utilities.</p>
<p>I don&#8217;t know what AJAX means for a category. If the plugin is an AJAX request helper, it should go under &#8220;Utilities&#8221; or &#8220;jQuery Extension.&#8221; If it&#8217;s something like an auto-complete widget, well it should go under Widgets.</p>
<p>The point is that categories aren&#8217;t very helpful in their current state. I put my Timer plugin under jQuery Extensions, Javascript, and Utilities, leading me to believe that they could all be the same category. I don&#8217;t know why Javascript is a category actually, since jQuery encapsulates, rather than extends.</p>
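<p>One simple way to model hierarchical categories is a child-to-parent mapping. The sketch below uses only the groupings suggested above; the data layout and function names are assumptions, not how any existing site stores its categories.</p>
<pre># Parent of each category; None marks a top-level category.
# The groupings follow the suggestions above; nothing here is final.
PARENT = {
    "User Interface": None,
    "Utilities": None,
    "Widgets": "User Interface",
    "Windows and Overlays": "User Interface",
    "Menus": "User Interface",
    "DOM": "Utilities",
}


def category_path(name, parent=PARENT):
    """Return the chain from the top-level category down to name."""
    path = []
    while name is not None:
        path.append(name)
        name = parent[name]
    path.reverse()
    return path


def children(name, parent=PARENT):
    """Return the direct children of a category, sorted by name."""
    return sorted(child for child, p in parent.items() if p == name)


if __name__ == "__main__":
    print(" / ".join(category_path("Widgets")))
    print(children("User Interface"))</pre>
<p>With a structure like this, a plugin only needs one leaf category and the broader ones come for free.</p>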
<h3>The New Site</h3>
<p>I&#8217;ve already started. <a href="http://code.google.com/p/jqpi">http://code.google.com/p/jqpi</a> (the app page will be http://jquerypi.appspot.com)</p>
<p>Basically, I want to create PyPI for jQuery plugins. I figured using Google App Engine would be nice. Also, knowing my penchant for dragging out projects, I&#8217;m coding it for HTML5, since it will probably be widely supported by the time I&#8217;m finished.</p>
<p>There are a few things that I don&#8217;t know how to do with GAE though. Hierarchical categories, searching, optimization, JSONRPC or XML-RPC. I&#8217;ll figure it out eventually, but help is always appreciated. Create an issue, create a wiki page, send patches, join the project, anything. We shouldn&#8217;t have to suffer the damned plugins.jquery.com any more.</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/07/09/replacing-the-jquery-plugin-site/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Windows output redirection bug</title>
		<link>http://mattptr.net/2010/05/21/windows-output-redirection-bug/</link>
		<comments>http://mattptr.net/2010/05/21/windows-output-redirection-bug/#comments</comments>
		<pubDate>Fri, 21 May 2010 16:48:10 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=169</guid>
		<description><![CDATA[While I was working on the SiteCrawler script, I had a problem getting it to redirect output to a file. In fact it&#8217;s one of the reasons I put it on the back burner. I thought it was a Python (on Windows) problem, but it turns out it&#8217;s just a Windows® Issue™: http://support.microsoft.com/kb/321788 Even though the article [...]]]></description>
			<content:encoded><![CDATA[<p>While I was working on the SiteCrawler script, I had a problem getting it to redirect output to a file. In fact it&#8217;s one of the reasons I put it on the back burner. I thought it was a Python (on Windows) problem, but it turns out it&#8217;s just a Windows® Issue™:</p>
<p><a href="http://support.microsoft.com/kb/321788">http://support.microsoft.com/kb/321788</a></p>
<p>Even though the article is about WinXP and 2k, I thought I&#8217;d try the registry fix. Sure enough &#8212; it works! So I will probably start adding stuff to the site crawler again and may actually release it!</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/05/21/windows-output-redirection-bug/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What to do?</title>
		<link>http://mattptr.net/2010/05/19/what-to-do/</link>
		<comments>http://mattptr.net/2010/05/19/what-to-do/#comments</comments>
		<pubDate>Wed, 19 May 2010 19:53:57 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=167</guid>
		<description><![CDATA[I&#8217;m a little torn about what to do with this site. I barely program any more in my free time. I want to do stuff, but I lack motivation. I have about 20 unfinished posts. By the time I get through, I don&#8217;t care to re-read or add the links I want to [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m a little torn about what to do with this site. I barely program any more in my free time. I want to do stuff, but I lack motivation. I have about 20 unfinished posts. By the time I get through, I don&#8217;t care to re-read or add the links I want to reference.</p>
<p>I&#8217;m also a little worried about WordPress. I don&#8217;t like it and I want to ditch it. However, I have a bunch of posts that get me some attention from Google. Is there a way to export my posts to static files?</p>
<p>On a more positive note, I got to use <a href="http://www.djangoproject.com">Django</a> for a project at work. It&#8217;s an excellent product. Anyone developing webapps should try it, at least once. I might want to convert my site to <a href="http://www.django-cms.org">Django CMS</a>&#8230; dunno yet.</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/05/19/what-to-do/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Crap that annoys me, part 20184: Javascript dates</title>
		<link>http://mattptr.net/2010/04/09/crap-that-annoys-me-part-20184-javascript-dates/</link>
		<comments>http://mattptr.net/2010/04/09/crap-that-annoys-me-part-20184-javascript-dates/#comments</comments>
		<pubDate>Fri, 09 Apr 2010 18:13:42 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Crap that Annoys Me]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=158</guid>
		<description><![CDATA[I realize that a lot of my posts are rants, especially when I&#8217;m dealing with PHP, and I hate sounding like a jerk/whiner/maniac&#8230; But I just couldn&#8217;t help myself today. Today&#8217;s gripe is about Javascript and Dates. There is a convenient Date object that handles just about everything I, as a web programmer, need when [...]]]></description>
			<content:encoded><![CDATA[<p>I realize that a lot of my posts are rants, especially when I&#8217;m dealing with PHP, and I hate sounding like a jerk/whiner/maniac&#8230;</p>
<p>But I just couldn&#8217;t help myself today.</p>
<p>Today&#8217;s gripe is about Javascript and Dates. There is a convenient Date object that handles just about everything I, as a web programmer, need when handling Dates in Javascript. However, there&#8217;s one particular thing that I think is probably the dumbest thing I&#8217;ve ever seen.</p>
<pre>//So let's say we have to get some information about today's date.
//It's April 9th.
//I want the number of the month and the day in this format: m/d
var now = new Date();
document.write(now.getMonth() + "/" + now.getDate())</pre>
<pre>//What do I get?
//  3/9</pre>
<p>&#8230;</p>
<p>There are three things wrong.</p>
<ol>
<li>Getters and Setters &#8212; die, ok? Strings and Arrays have a length property, it&#8217;s not blah.getLength(). Date objects have no properties (that I&#8217;m aware of).</li>
<li>Keeping in mind that I must use a method begrudgingly, it should be getDay() instead of getDate(). To me, getDate() implies that you&#8217;re getting a full date string, or a Date Object, which you already have, which wouldn&#8217;t make sense. getDay() returns the day of the week, which should be getDayOfWeek().</li>
<li>You probably noticed that the value of getMonth() is wrong. Technically, it isn&#8217;t. getMonth() returns the <em>zero-based</em> month number, you know, just like how they count months in <strong>nowhere on Planet Earth! </strong>Who the hell thought up this? I know that I might want to use an array of month names because the Date Object doesn&#8217;t really provide a way to get that, but most of the time, no.</li>
</ol>
<p>Why then, aren&#8217;t the calendar days zero based? I don&#8217;t get it.</p>
<p>This could be inspired by POSIX C, in which the tm structure has many of the same details (although not as getters and setters) and specifies that months are, you guessed it, numbered 0 to 11.</p>
<p>It also gives room for up to 2 leap seconds in the structure, which is really useful. No really, if you&#8217;re using localtime_r() for high-precision timing, or polling every second of every day to see if there might be a leap second&#8230;I&#8217;ll punch you.</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/04/09/crap-that-annoys-me-part-20184-javascript-dates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Site Crawler Chronicles, Part 4: I might be dumb</title>
		<link>http://mattptr.net/2010/03/23/site-crawler-chronicles-part-4-i-might-be-dumb/</link>
		<comments>http://mattptr.net/2010/03/23/site-crawler-chronicles-part-4-i-might-be-dumb/#comments</comments>
		<pubDate>Tue, 23 Mar 2010 19:08:00 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=154</guid>
		<description><![CDATA[Turns out urljoin() wasn&#8217;t behaving badly; I just supplied it a lousy URL. After running urlopen, the file-like object that is returned has two additional methods, one of them giving the true URL (i.e. after redirects). So far, that seems to have fixed the issue. Hopefully I&#8217;ll have a release soon, but I [...]]]></description>
			<content:encoded><![CDATA[<p>Turns out urljoin() wasn&#8217;t behaving badly; I just supplied it a lousy URL. After running urlopen, the file-like object that is returned has two additional methods, one of them giving the true URL (i.e. after redirects). So far, that seems to have fixed the issue.</p>
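<p>Here is a small sketch of why the base URL matters. It is written against Python 3&#8217;s urllib, which is an assumption (the crawler itself most likely used the Python 2 equivalents), and the helper names are made up. The first function shows where the post-redirect URL comes from; the demo at the bottom needs no network access.</p>
<pre>from urllib.parse import urljoin
from urllib.request import urlopen


def final_url(url):
    """Fetch url and return the address it ended up at after any redirects."""
    response = urlopen(url)
    try:
        return response.geturl()
    finally:
        response.close()


def absolute_links(base, hrefs):
    """Resolve each href against base, the URL the page was served from."""
    return [urljoin(base, href) for href in hrefs]


if __name__ == "__main__":
    hrefs = ["page.html", "../abc.html"]
    # The URL as linked, then the URL a server would typically redirect to.
    print(absolute_links("http://www.example.com/some/directory", hrefs))
    print(absolute_links("http://www.example.com/some/directory/", hrefs))</pre>
<p>Without the trailing slash, urljoin treats &#8220;directory&#8221; as a file name and resolves links one level too high, which is correct behavior for the URL it was given.</p>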
<p>Hopefully I&#8217;ll have a release soon, but I still gotta work out some bugs.</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/03/23/site-crawler-chronicles-part-4-i-might-be-dumb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Site Crawler Chronicles – Part 3</title>
		<link>http://mattptr.net/2010/03/19/the-site-crawler-chronicles-part-3/</link>
		<comments>http://mattptr.net/2010/03/19/the-site-crawler-chronicles-part-3/#comments</comments>
		<pubDate>Fri, 19 Mar 2010 14:33:05 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Thoughts]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=150</guid>
		<description><![CDATA[I managed to find a solution for the problem I had yesterday, though I don&#8217;t particularly know if it&#8217;s ideal. Originally I had thought that I would need to store the entire hierarchy of the site in a tree-like structure. Instead, I figured I could just store a list of the links on a page [...]]]></description>
			<content:encoded><![CDATA[<p>I managed to find a solution for the problem I had yesterday, though I don&#8217;t particularly know if it&#8217;s ideal.</p>
<p>Originally I had thought that I would need to store the entire hierarchy of the site in a tree-like structure. Instead, I figured I could just store a list of the links on a page in a dict structure and then output all of the errors when the crawl was finished. I don&#8217;t know why I was hung up on the idea that errors had to be reported as they were encountered.</p>
<p>I was worried that memory use would be a factor, but it seems to be ok.</p>
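<p>A rough sketch of that bookkeeping follows. It is not the crawler&#8217;s actual code: the names are invented, and the dict is keyed by link rather than by page (an assumption made here so the &#8220;where was this found?&#8221; lookup is direct).</p>
<pre>from collections import defaultdict

# Maps each link to the set of pages it was found on.
found_on = defaultdict(set)
# Maps each failing link to the error seen when fetching it.
errors = {}


def record_links(page_url, links):
    """Remember that page_url links to every URL in links."""
    for link in links:
        found_on[link].add(page_url)


def record_error(link, reason):
    """Note a failed fetch; reporting waits until the crawl is over."""
    errors[link] = reason


def error_report():
    """Return one line per broken link, listing the pages that reference it."""
    lines = []
    for link in sorted(errors):
        pages = ", ".join(sorted(found_on[link]))
        lines.append("%s (%s) linked from: %s" % (link, errors[link], pages))
    return lines


if __name__ == "__main__":
    record_links("http://www.example.com/", ["http://www.example.com/a", "http://www.example.com/gone"])
    record_links("http://www.example.com/a", ["http://www.example.com/gone"])
    record_error("http://www.example.com/gone", "404")
    print("\n".join(error_report()))</pre>
<p>Memory grows with the number of distinct links rather than with the shape of the site, which may be why it stays manageable.</p>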
<p>But there&#8217;s another issue:</p>
<pre>    #taken from lxml.html.__init__
    def make_links_absolute(self, base, root):
        """This function exists because urljoin behaves obnoxiously.
        For example, if I'm on the page:
            http://www.example.com/some/directory/index.html, or just:
            http://www.example.com/some/directory/

        And I join the relative URL: ../../abc.html
        I end up with: http://www.example.com/abc.html

        *But*
        If I'm on: http://www.example.com/some/directory  [no trailing slash]
        I end up with: http://www.example.com/../abc.html
        """</pre>
<p>My fix for it was stripping out one &#8220;../&#8221;. Yesterday I thought that it would be a good fix. Today, I can&#8217;t figure out why I thought it would fix all cases.</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/03/19/the-site-crawler-chronicles-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Site Crawler Chronicles</title>
		<link>http://mattptr.net/2010/03/18/the-site-crawler-chronicles/</link>
		<comments>http://mattptr.net/2010/03/18/the-site-crawler-chronicles/#comments</comments>
		<pubDate>Thu, 18 Mar 2010 18:04:32 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Thoughts]]></category>

		<guid isPermaLink="false">http://mattptr.net/?p=148</guid>
		<description><![CDATA[So I managed to stop v4 of my web crawler from opening up a billion connections in parallel. Turns out that gevent has a Pool object and that was exactly what I needed. Now my little script (137 lines, including a utility object and comments) will not be a sysadmin&#8217;s nightmare. However, I now have a [...]]]></description>
			<content:encoded><![CDATA[<p>So I managed to stop v4 of my web crawler from opening up a billion connections in parallel. Turns out that <a href="http://www.gevent.org">gevent</a> has a <a href="http://www.gevent.org/gevent.pool.html">Pool</a> object and that was exactly what I needed.</p>
<p>Now my little script (137 lines, including a utility object and comments) will not be a sysadmin&#8217;s nightmare.</p>
<p>However, I now have a new problem. I described how the older versions work in my <a href="http://mattptr.net/2010/03/17/new-old-ideas/">previous post</a>, but this version is quite a bit different. Instead of using a queue or stack data structure to figure out where to go next, this version has a greenlet scrape all links from a page, filter out the ones it&#8217;s already been to, then return the rest. The main thread accumulates the lists once all greenlets are finished. After the accumulation (which also ensures that there are no duplicate links), the main thread spawns a greenlet <em>for each link</em> and waits until the greenlets finish again. When the greenlets return no links, the main thread is done and the script terminates.</p>
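<p>The level-by-level loop described above can be sketched roughly as follows. To keep it runnable with only the standard library, this uses a thread pool as a stand-in for gevent&#8217;s Pool, and a hard-coded dict as a stand-in for real HTTP fetches; both are assumptions, as are all the names. It also does the &#8220;already visited&#8221; filtering in the main thread rather than in each worker, to keep the sketch short.</p>
<pre>from concurrent.futures import ThreadPoolExecutor

# Hypothetical site used in place of real fetches: each page maps to its links.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c"],
    "/c": [],
}


def scrape(page):
    """Worker: return the links found on one page."""
    return SITE.get(page, [])


def crawl(start, max_workers=4):
    """Crawl one level at a time with at most max_workers fetches running."""
    seen = {start}
    frontier = [start]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # list() blocks until every worker for this level has finished.
            results = list(pool.map(scrape, frontier))
            next_level = set()
            for links in results:
                next_level.update(links)
            # Drop anything already visited, then queue the rest.
            frontier = sorted(next_level - seen)
            seen.update(frontier)
    return seen


if __name__ == "__main__":
    print(sorted(crawl("/")))</pre>
<p>The pool size is what keeps the crawler from opening a connection for every link at once.</p>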
<p>The problem is, if there&#8217;s a 404 or some kind of error retrieving the page, I have no way of knowing what page that link was found on.</p>
<p>The only solution that I see is to use a custom data structure and hope that it doesn&#8217;t kill performance.</p>
]]></content:encoded>
			<wfw:commentRss>http://mattptr.net/2010/03/18/the-site-crawler-chronicles/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
