nirvdrum

Open Sourcing a Failed Startup

2014-11-20T00:00:00+00:00

Background

In late October 2014, I announced that I would be shutting down Mogotest. After close to 5 years of operations it was clear I wouldn't be able to grow the business. I don't think it was due to lack of business opportunity, but due to some business decisions made early on that became very difficult to course correct. The exact line of reasoning that justified the shutdown is a topic for another day. The purpose of this post is to discuss what to do with the code after the fact.

Are you Going to Open Source It?

Rather predictably, one of the first things I was asked after I announced the shutdown was whether I would be open sourcing it. I was asked by current customers, by friends, by companies that were interested in the tech but never felt the need to support it by giving us business, by random people on Twitter, and so on. I had already gone through some of the thought process a priori, but I was in a different state of mind then. Getting the bombardment of questions after the announcement impacted me in ways I couldn't predict.

For some additional context, I contribute to a lot of open source projects. I don't have a "brand name" and I've never professionally sold open source software or sold consulting services around it, but I've worked with a lot of projects. I use the Apache Software License, version 2.0 for just about everything. And I guess I would consider myself more of a pragmatist than an ideologue when it comes to open source software.

With that said, my gut reaction was to not open source it. My analytical reaction was also not to open source it.

Why Not?

I'd just like to insert a standard disclaimer at this point that what follows is my own experience and my own potentially irrational thought process. If anything I say comes off as a generality, please note that my pomposity stops short of speaking for others.

First up is the emotional aspect. I had just made the extremely difficult decision to walk away from something I spent the past 5 years of my life dedicated to. During that time, I lost at least two full years of wages, pissed through my savings, and lost ~$40K USD in cash invested into the business. I battled with some form of founder depression. Statistically speaking, this was the most likely outcome, so I'm hardly looking for sympathy. But, having made that gut-wrenching decision to walk away from it all, the prospect of going back to it and investing a non-trivial amount of effort just to give it away is a really tough pill to swallow.

Also on the emotional side is just my own human pettiness. I've been asked to open source the codebase by people that evidently didn't think the software was good enough to be worth paying for as a service. I've been asked to open source the codebase by other companies in the space that didn't want to buy the rights when I was shopping the company around. So, while I really want to provide a soft landing for my customers, I really didn't want to just be giving everything away to those that just wanted to mooch.

Setting all that aside, open sourcing the codebase is not some trivial process. And I'm not talking about wanting to clean up stuff I might be embarrassed by. Here's a non-comprehensive list of issues that need to be addressed:

The web site design was a theme bought on WrapBootstrap that I don't have the rights to sublicense.
The rich UI widgets come from the commercial version of ExtJS. That needs to be excised or the whole project needs to be GPLv3.
Sidekiq Pro needs to be removed.
Every JS lib and every image resource we used must have its license examined and potentially be replaced.
Any customer info that made its way into the code needs to be removed. As an example, we built up an extensive regression suite around customer data that can't be distributed. This whole process means auditing every file in the codebase.
Ensuring any API keys or passwords aren't floating around in the source or configuration files (obviously bad, but things happen).
Potentially unobscuring security holes while the service is still running.
Removing all the billing code.
Removing all the drip email campaign code.
Removing any other non-Web Consistency Testing parts from the code.

A lot of this is a liability. Going through it all is a ton of work. After all that, I open myself up to all sorts of scrutiny I don't really care for. Sometimes I swear in code. I hold a somewhat traditionalist view of English and prefer my plurality to match up, so I use gendered pronouns in my personal writings, which will have now just become public. I'm certain there is some colorful commentary about each of the browser vendors buried somewhere in the source. Without a doubt, something in this codebase will offend someone and my personal reputation is at risk when it simply wouldn't have been by keeping it private.

It's basically all the work required to clean up during an acquisition, but with the inverse financial outcome.

If I managed to clear that hurdle, the next problem is that I simply don't find there to be much value in open source code. Open source projects, yes. Open source code, rarely. I won't have either the time or the wherewithal to spend any additional effort on this project. If I make the code publicly available, people will have questions that I won't have time to answer. Consequently, I'm just going to constantly feel like an inadequate piece of garbage. On the other hand, if I manage to find time to engage, I don't have the energy to justify every design decision. Some things do just look silly, but they were the product of the constraints imposed at the time. Contextually, they were sound. In today's world … probably not so much. Fixing them would certainly be progress, but in my experience these sorts of things aren't approached tactfully and I'd rather not be called an idiot without having the resources to defend the context.

Second Thoughts

At the end of the day, I want Web Consistency Testing to evolve. If making Mogotest open source will help achieve that, I'm willing to overlook some of the other problems. I've already released the ancillary libraries as ASLv2, and I was going to release the main application under the Affero General Public License (AGPL). After spending 14 hours cleaning things up this past weekend, I'm still not 100% certain I'm not violating IP somewhere or leaking customer info and I've had to gut the product so thoroughly that it's virtually useless. Rewriting all the view code just isn't something I have the desire to do.

In conjunction with the decision to use the AGPL, I decided to try a crowd-sourced campaign to help with the open sourcing effort. Precisely zero of the companies that have been begging me to open-source the code have contributed in any capacity. The incredible amount of spam I've received via comments on IndieGogo and Twitter has been equally disheartening.

Conclusion

I had my initial emotional reaction, I analyzed the hell out of it, I decided against my better judgment to try opening the code anyway, and I simply can't do it. I think the tools I have open-sourced will be beneficial to others and I've explained how things work fairly extensively in a talk I gave at Google's Test Automation Conference. A clean-room implementation shouldn't be too onerous, given I've solved a lot of the environmental problems you're apt to encounter. Unfortunately, this is where I have to get off the train.

How to Take Full Page or Full Canvas Screenshots in Windows

2010-03-25T00:00:00+00:00

Introduction

While developing MogoTest, a service for detecting Web browser rendering issues, I found it necessary to be able to capture the contents of the entire browser canvas rather than just the current viewport. A full page screenshot lets the user see how their content looks from a holistic perspective. Unfortunately, capturing all of the content clipped by the scrollable viewport is not a very straightforward process.

One approach to grabbing the full canvas is to take a screenshot and then scroll the viewport to display all the previously hidden sections, stitching all of the images together to form one composite that represents the full canvas contents. While this generally works, it fails if there are any fixed position elements on the page or if there is ECMAScript in place to modify the DOM on scroll events because in both cases the very act of scrolling modifies the canvas's contents. It would be better then to capture the entire canvas without scrolling.

The Problem with Making Windows Large Enough to Remove Scrollbars

Windows does not allow windows to be larger than the virtual screen resolution by default. The virtual screen resolution is defined as the vertical and horizontal span of all connected displays. If you have a single display, the current screen resolution will have a 1:1 match with the virtual screen resolution and if you have two displays side-by-side, the virtual screen resolution will be the height of the lowest resolution by the sum of the two horizontal screen resolutions.

Resizing the window to display the entire canvas contents, thus, requires some trickery in handling the virtual screen dimensions. As it turns out, every window is sent a WM_GETMINMAXINFO message just before the window's screen coordinates or size is changed. WM_GETMINMAXINFO passes as its lParam a MINMAXINFO value which contains the virtual screen dimensions as the ptMaxTrackSize member. Modifying the lParam before the window receives the message would allow us to effectively change the virtual screen resolutions on a per-process basis.

If you have access to the source code of the application you'd like to capture the full canvas contents of, the solution is simple: just modify your message processing loop to handle WM_GETMINMAXINFO messages and modify the lParam as necessary. In the general case, however, you won't be able to modify the binary so you'll have to modify the executing process's address space to inject your own message handler.

While the general concept is rather straightforward, the implementation is fairly convoluted. At the core of it, the complexity is caused by the WM_GETMINMAXINFO message being sent rather than retrieved by the window.

Tricking Windows into Letting you Resize the Window Larger than the Screen

The Win32 API allows applications to monitor global event message traffic by setting up a hook procedure via the SetWindowsHookEx function. This is how utilities like Spy++ are able to tell what messages are being sent to a program. As of this writing, there are thirteen different types of hooks, each with its own context and execution point in the global chain.

If you're like me, you'd probably try to register a WH_GETMESSAGE hook with GetMsgProc. At the outset it seems logical enough: intercept the WM_GETMINMAXINFO message before GetMessage or PeekMessage is called and modify accordingly. This will not work, however, and you will waste a lot of time trying to make it work. The nuance here is that WM_GETMINMAXINFO is sent to the window -- the window does not poll for it -- and as such a WH_GETMESSAGE hook will never see the message.

The next seemingly logical hook type to try is WH_CALLWNDPROC, which you register with the CallWndProc function. This type of hook will indeed intercept the WM_GETMINMAXINFO message before the window will, but unfortunately, the hook procedure cannot modify the message. This is by design and Windows will enforce it; trying to get a reference to the lParam to modify the value in memory will not work.

And so it goes with all the hook types. Many look like they'll do what you need, but will fail in some way. It seems that modifying the WM_GETMINMAXINFO message from out of process is not possible. And largely that's true. However, we can get creative by supplanting the process's window procedure, which executes in process, by using SetWindowLongPtr from the WH_CALLWNDPROC hook. Example 1 shows what that interaction may look like.

// The WH_CALLWNDPROC hook procedure, executed out-of-process.
LRESULT WINAPI CallWndProc(int nCode, WPARAM wParam, LPARAM lParam)
{
    CWPSTRUCT* cwp = (CWPSTRUCT*) lParam;

    if (WM_GETMINMAXINFO == cwp->message)
    {
        // Inject our own message processor into the process so we
        // can modify the WM_GETMINMAXINFO message.  It is not possible
        // to modify the message from this hook, so the best we can do
        // is inject a function that can.
        LONG_PTR proc = SetWindowLongPtr(cwp->hwnd, GWLP_WNDPROC, (LONG_PTR) MinMaxInfoHandler);
        
        // Store a handle to the original window procedure in the window's
        // property list so we can restore the procedure from the custom processor.
        SetProp(cwp->hwnd, L"__original_message_processor__", (HANDLE) proc);
    }

    return CallNextHookEx(hkHook, nCode, wParam, lParam);
}

// Install the hook procedure in the global hook chain.  DLL_PATH must be
// a fully qualified path to the DLL file containing the WH_CALLWNDPROC hook procedure.
HINSTANCE hinstDLL = LoadLibrary(DLL_PATH);
HOOKPROC hkprcSysMsg = (HOOKPROC)GetProcAddress(hinstDLL, "CallWndProc");
HHOOK hkHook = SetWindowsHookEx(WH_CALLWNDPROC, hkprcSysMsg, hinstDLL, 0);

Example 1: Registering the WH_CALLWNDPROC hook procedure.

SetWindowLongPtr is an amazing feature of Windows that lets you supply a new function pointer for a restricted set of functions in a Windows process. The new function can then call out to the original function through a handle to that function. One of the functions allowed to be replaced is the window procedure. By supplying our own we will be able to finally modify that WM_GETMINMAXINFO message. In Example 1 we showed how to call SetWindowLongPtr. Example 2 shows what the custom window procedure looks like:

// The custom window procedure, executed in-process, to manipulate the WM_GETMINMAXINFO message.
LRESULT CALLBACK MinMaxInfoHandler(HWND hwnd, UINT message, WPARAM wParam, LPARAM lParam)
{
    // Grab a reference to the original message processor.
    HANDLE originalMessageProc = GetProp(hwnd, L"__original_message_processor__");
    RemoveProp(hwnd, L"__original_message_processor__");
 
    // Uninstall this custom window procedure so the next message will use the original procedure.
    SetWindowLongPtr(hwnd, GWLP_WNDPROC, (LONG_PTR) originalMessageProc);
 
    // Only handle the message we're interested in.
    if (WM_GETMINMAXINFO == message)
    {
        MINMAXINFO* minMaxInfo = (MINMAXINFO*) lParam;
 
        // ptMaxTrackSize corresponds to the screen's virtual dimensions.
        // MAX_WIDTH and MAX_HEIGHT should be the width and height of the window allowing the full
        // canvas to be displayed without scrollbars.  This is application dependent.
        minMaxInfo->ptMaxTrackSize.x = MAX_WIDTH;
        minMaxInfo->ptMaxTrackSize.y = MAX_HEIGHT;
 
        // We're not going to pass this message onto the original message processor, so we should
        // return 0, per the contract for the WM_GETMINMAXINFO message.
        return 0;
    }
 
    // All other messages should be handled by the original message processor.
    return CallWindowProc((WNDPROC) originalMessageProc, hwnd, message, wParam, lParam);
}

Example 2: Modifying the WM_GETMINMAXINFO message with a custom window procedure.

Note that we only handle the WM_GETMINMAXINFO message and delegate all others to the original window procedure. Additionally, we uninstall the custom procedure as soon as we've accomplished what we need to.

We modify the ptMaxTrackSize component of the MINMAXINFO struct, which is itself a POINT struct, having an x and a y component. These should be set large enough to handle the full canvas plus the window chrome that surrounds the main client area. Once this is done, you should be able to size the window large enough to obviate the need for scrollbars.

Fig. 1 shows how this all ties together for a theoretical screenshot.exe process taking full canvas screenshots in Internet Explorer.

Fig. 1: Comprehensive interaction diagram for taking full page screenshots.

Capturing the Canvas Contents

Now that your application can be sized large enough to capture the canvas contents, you must resize the window to that maximum size. This calculation is application dependent. For Internet Explorer much of the work is done with the IWebBrowser2 interface, for example.

One caveat is that if the window is already maximized, Windows will not send it a sizing message. My solution to this problem is to first check if the window is already maximized and if so note that fact, change the maximized state to restored, then resize the window to be large enough for the full canvas contents. Once done, I then re-maximize the window if it was previously maximized, effectively restoring the window to its original dimensions. It is a bit kludgy, but I haven't been able to come up with a better solution. I suspect there is a way by intercepting a different window message, but I couldn't figure out which one if it is in fact possible. This process can be seen in Example 3.

All that remains now is to capture the contents, unregister the WH_CALLWNDPROC hook, and resize the window to its original dimensions so the user doesn't have to deal with a massive window. Example 3 pulls all this code together.

// Check if the window is maximized.
BOOL isMaximized = IsZoomed(hwnd);
if (isMaximized)
{
    ShowWindow(hwnd, SW_SHOWNORMAL);
}
else
{
    // Store the window's original dimensions into some local variables.
}

// Set the window to its new dimensions.  There are a variety of ways to do this.

// Note that CImage is part of ATL.  If you want to use strict Win32 API for DIBs, you
// can do so; it's just much more complicated.
CImage image;
image.Create(imageWidth, imageHeight, 24);
CImageDC imageDC(image);

// Capture the contents of the client area to our image DC.
PrintWindow(hwnd, imageDC, PW_CLIENTONLY);

// Remove our `WH_CALLWNDPROC` hook procedure from the global hook chain.
// hkHook was the return value from the SetWindowsHookEx function call.
UnhookWindowsHookEx(hkHook);

// Restore the window to the original dimensions.
if (isMaximized)
{
    ShowWindow(hwnd, SW_MAXIMIZE);
}
else
{
    // Set the window to its original dimensions.
}

// Actually save the image file.
image.Save(CW2T(outputFile));

Example 3: Taking the full canvas screenshot.

Conclusion

Taking full page or full canvas screenshots in Windows can be tricky, but the method discussed in this article should be widely applicable. In my particular case I was enhancing the SnapsIE utility. My SnapsIE fork illustrates how I use all of these techniques. Note that SnapsIE is written as an ActiveX control for Internet Explorer, so the code is likely more complex than is warranted in many cases.

Acknowledgments

Haw-Bin Chai for SnapsIE, which served as a basis for much of the work I did.
Jim Evans for more IE screenshot work on selenium, which handled IE8 a bit more gracefully than SnapsIE did.
sunnyandy, who had the closest answer on how to take full screen screenshots that I was able to find.
Igor Tandetnik, who knows VC++ better than any human I'm aware of.
Jeff Rafter, who helped me debug all sorts of issues while I was developing the foundation for this article and then served as a peer reviewer of the content.
MogoTest for allowing me to spend all this time solving the problem.

On Amazon EC2 Spot Instances

2010-01-22T00:00:00+00:00

Introduction

A couple months ago Amazon announced support for EC2 spot instances. In a nutshell, a spot instance is an EC2 instance that you bid on and that Amazon creates and destroys based upon whatever spare capacity is available in a given EC2 availability zone (i.e., supply) and your maximum bidding price versus the current spot instance price (i.e., demand). A spot instance is less flexible than an on-demand or reserved instance is in terms of lifecycle, but could be significantly cheaper if your application can handle that volatility.

This post summarizes my experience with spot instances and how I make use of them.

Background

My latest project is a front-end web testing tool, running a variety of web browsers across both Linux and Windows. We make heavy use of EC2, which allows us to pay for servers as we use them. While EC2 drastically reduces the start-up costs because we don't need to bulk purchase equipment, it can still be costly. The rate as of this posting for a small Windows instance is $0.12 USD / hour. At approximately 720 hours in a month, that's roughly $86 USD / month. In order to process our work queue quickly we need to run a decent sized cluster.

Like any reasonable organization, we'd like to reduce cost without adversely impacting quality of service. Prior to spot instances there were several ways to reduce cost, but none were ideal:

The simplest, but most costly, is to purchase a reserved instance. With a reserved instance you pay an up-front fee but then pay reduced hourly rates as you run your instance. Over the long term there are significant savings, but you have to be able to afford the initial cost and Amazon only supports reserved instances for Linux.
Another cost-saving technique is to adjust your number of running instances based on load, so you don't pay for resources you aren't really using. This can be tricky to do correctly though and you could be caught with an anemic cluster if you have a large burst of traffic.
The hardest approach is to try to increase throughput on a given server. This could require significant manpower to achieve and for some applications may not even be possible.

Spot instances change the problem domain by making the instance price variable without having to be burdened with the initial expense of a reserved instance. We've been able to get small Windows instances for as low as $0.05 / hour, which equates to a nearly 60% savings. Similar savings can be had for Linux servers as well at all of the various EC2 sizes (e.g., we routinely pay less for a medium Linux spot instance than for a small Linux on-demand instance). Spot instance pricing can change at various times throughout the day, but the price is almost always below the current on-demand instance price. Theoretically it could go higher than the on-demand price, but it would be silly to bid that high because you could just get an on-demand instance then. With that savings, we can run more instances for each browser type on the same budget, increasing quality of service.
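The arithmetic behind those figures can be checked with a short Ruby sketch. The prices and the 720-hour month come from this post; the constant and method names are purely illustrative. Prices are kept in whole cents to sidestep floating-point rounding.

```ruby
HOURS_PER_MONTH = 720 # approximation used in this post
ON_DEMAND_CENTS = 12  # small Windows on-demand instance, $0.12 / hour
SPOT_CENTS      = 5   # lowest spot price we have seen, $0.05 / hour

# Monthly cost in cents for an instance running the whole month.
def monthly_cost_cents(hourly_cents, hours = HOURS_PER_MONTH)
  hourly_cents * hours
end

# Percentage saved by paying the spot price instead of the on-demand price.
def savings_percent(on_demand_cents, spot_cents)
  ((on_demand_cents - spot_cents) * 100.0 / on_demand_cents).round(1)
end

on_demand = monthly_cost_cents(ON_DEMAND_CENTS)
spot      = monthly_cost_cents(SPOT_CENTS)
puts format("on-demand: $%.2f / month", on_demand / 100.0)
puts format("spot:      $%.2f / month", spot / 100.0)
puts "savings:   #{savings_percent(ON_DEMAND_CENTS, SPOT_CENTS)}%"
```

With the numbers above this works out to $86.40 per month on-demand versus $36.00 at the lowest spot price, a savings of about 58.3%.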

Of course, this is all predicated on the cluster being able to handle the dynamic addition and removal of instances. You will have to account for the case where a spot instance dies in the middle of processing a request and be able to recover from that. So, spot instances are not ideal for all applications. But, for a background worker system it can be a cost-effective way to work through your queue more quickly.

Implementation

We use rubber for our cluster configuration and app deployment. Rubber is a capistrano plugin that simplifies working with Amazon Web Services. Using role-based deployment, we can configure the packages and gems to be installed on an EC2 instance, attach an EBS volume if necessary, and back up files to S3 with succinct YAML configuration. As of the 1.2.0 version, rubber can now handle spot instance requests.

A sample configuration for a single host in rubber would look like the following rubber.yml extract.

# Sample spot instance request configuration in rubber.yml.
hosts:
  ie8:                         # The instance's hostname
    instance_roles: "vnc,rdp"  # Only expose VNC and RDP for this server
    cloud_providers:
      aws:
        image_id: ami-df20c3b6 # Standard 32-bit Windows 2003 Server image
        image_type: m1.small   # Create a small EC2 instance
        spot_price: "0.12"     # Max. spot price you are willing to pay
        spot_instance: true    # Default is false.
        spot_instance_request_timeout: 600 # Fall back to on-demand after 10 min.

While this example shows configuration for a single host, any option could also be applied globally for all nodes in your cluster and can be overridden on a host-by-host basis. So, you can vary the maximum price you're willing to pay for a server on a per-instance basis and you can have a combination of on-demand and spot instances in your cluster.

One thing to note is that rubber was originally designed for on-demand and reserved instances, which have synchronous creation characteristics. Spot instance requests, on the other hand, are satisfied asynchronously. Rubber's solution is to block after the spot instance request is made and to poll Amazon until the instance is created. Since waiting ad infinitum isn't ideal for everyone, rubber lets you set your own service level target through a request timeout value (spot_instance_request_timeout in the example). If the request fails to be fulfilled before that timeout is exceeded, the spot instance request will be canceled and rubber will fall back to creating an on-demand instance.
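The block-and-poll behavior described above might look roughly like the following. This is not rubber's actual code; acquire_instance and its lambda parameters are hypothetical stand-ins for the real AWS calls, and the timeout is assumed to be in seconds.

```ruby
# Polls for a spot instance until one appears or the timeout elapses, then
# cancels the request and falls back to an on-demand instance.
# All callables are hypothetical placeholders for real AWS API calls.
def acquire_instance(timeout:, spot_ready:, cancel_spot:, create_on_demand:,
                     interval: 5, sleeper: method(:sleep))
  waited = 0
  loop do
    instance = spot_ready.call        # returns nil while the request is pending
    return instance if instance
    break if waited >= timeout
    sleeper.call(interval)
    waited += interval
  end
  cancel_spot.call                    # give up on the spot request...
  create_on_demand.call               # ...and pay the on-demand price instead
end
```

Passing the sleep function in as a parameter is only there so the loop can be exercised without actually waiting.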

We use resque for our work queue. Resque does an excellent job of adapting to changes in the worker topology. So, adding new workers through spot instances and even removing instances cleanly is managed nicely for us. Additionally, resque was designed to handle job failures from the outset. While this won't help you if your job is shut down midway through a non-atomic operation, it does ease the task of job management -- you just have to make sure your jobs are resumable.
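As a rough illustration of what "resumable" can mean, here is a sketch of a job that records each completed step so a retry skips work already done. The class follows resque's convention of a class-level perform method and a @queue instance variable, but everything else is an assumption: the job name, the in-memory CHECKPOINTS store (a real worker needs something durable that outlives the instance), and the block, which merely stands in for the real per-item work.

```ruby
# Hypothetical checkpoint store. In-memory only for the sake of the example.
CHECKPOINTS = Hash.new { |hash, key| hash[key] = [] }

class ScreenshotBatchJob
  @queue = :screenshots # resque reads the queue name from this variable

  # Processes each URL once, even across retries of the same batch.
  def self.perform(batch_id, urls)
    done = CHECKPOINTS[batch_id]
    urls.each do |url|
      next if done.include?(url) # finished during a previous attempt
      yield url if block_given?  # stand-in for the real work
      done << url                # record progress only after the step succeeds
    end
    done
  end
end
```

If the instance disappears while processing the second URL, re-running the job with the same batch id picks up at that URL rather than starting over.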

Conclusion

As would be expected, spot instance requests are easier to fulfill during non-peak hours. Likewise, the most expensive operating times are during peak hours. We've found that trying to create a spot instance during peak business hours may take a while to fulfill, whereas requests during non-peak hours are fulfilled quickly (oftentimes under 3 minutes). If you set your maximum price high enough, you shouldn't lose your instance after it's created either, unless Amazon needs to reclaim resources for on-demand customers. In practice we've run spot instances for weeks at a time. We've also had some die shortly after creation because we didn't set a high enough maximum price. You'll have to do some analysis to find out what's best for your application.

If you can be flexible with your EC2 availability zones, you'll see the best results. There are marginal bandwidth fees between availability zones in the same region, but in our case the savings from a spot instance trump the bandwidth charges. However, if you do large amounts of data transfer between instances, you should take that into consideration.

Overall, we've found spot instances to be a great way to grow our cluster with a fixed budget. We've had to architect our application to be resilient to nondeterministic node additions and removals, but that was a lot easier for us than trying to increase the work throughput on any single server.

My New Tech Tumblr

2010-01-17T00:00:00+00:00

Following in the footsteps of Mike Champion and several others from boston.rb, I've decided to set up a tech tumblr account. Basically, it's a place where I can bookmark and annotate links to content of high technical value without having to deal with the ephemeral nature of twitter or the time sink of writing a full weblog post.

Anyway, if you're interested in some of the same things I am, you may find some value in it:

Lessons Learned in Large Computations with Ruby

2009-09-17T00:00:00+00:00

Introduction

This is the follow-up post to my GitHub Contest Recap post that I promised. As mentioned, I submitted two entries for the GitHub contest, starting with Ruby and then rewriting in Java. This post summarizes why I ultimately dropped Ruby in favor of Java for this particular task. I apologize for its length, but it is divided into discrete sections and can be read in chunks without great loss of continuity.

Things that Worked Well with Ruby

Very quick and easy text processing

Overall, Ruby excelled at the tasks I knew it would excel at. Principally, I was able to write fairly clean code that produced results in short order. Its text processing capabilities are solid and were a big boon in processing the data files supplied as part of the contest. Likewise, generating a results file was a trivial matter.

Class re-opening

I generally shy away from monkey-patching unless absolutely necessary, but having the ability to do it easily is always nice. I found myself working around limitations in AI4R, a Ruby-based artificial intelligence library, and adding basic statistics functions to Array. Being able to call [x, y, z].mean or [t, u, v].sum was an extremely concise way to represent terms in some of my equations.
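For readers who haven't done this before, re-opening Array looks something like the sketch below. It is a reconstruction rather than the code from my entry. Note that Array#sum has been built into Ruby since 2.4, so the definition is guarded; on 1.8.7 it had to be added by hand.

```ruby
class Array
  # Only define sum where the running Ruby doesn't already provide it.
  unless method_defined?(:sum)
    def sum
      inject(0) { |total, value| total + value }
    end
  end

  # Arithmetic mean as a Float; nil for an empty array.
  def mean
    return nil if empty?
    sum / size.to_f
  end
end

puts [1, 2, 3].sum
puts [1, 2, 3].mean
```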

Things that Did Not Work Well with Ruby

I had thought that I understood Ruby fairly well, but this project taught me otherwise. Much like Java and the JVM, there are a lot of subtleties in Ruby I was either blissfully unaware of or haven't had to worry about in any great detail. These ranged from the seemingly inane, such as the three different methods of object equality, to the surprising, such as the lack of a Float::Infinity constant.
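A small sketch of both points, assuming the three equality methods in question are ==, eql?, and equal?. Later Ruby versions (1.9.2 and up) did gain a Float::INFINITY constant; on older interpreters dividing a float by zero is the usual workaround, which is what the fallback below does.

```ruby
a = String.new("repo")
b = String.new("repo")

value_equal    = (a == b)     # same content
hash_key_equal = a.eql?(b)    # same content and type; what Hash uses for keys
same_object    = a.equal?(b)  # identity: the very same object in memory

int_float_loose  = (1 == 1.0)  # numeric comparison converts between types
int_float_strict = 1.eql?(1.0) # eql? does not

# Use the built-in constant where it exists, else the division trick.
INFINITY = defined?(Float::INFINITY) ? Float::INFINITY : 1.0 / 0

puts [value_equal, hash_key_equal, same_object].inspect
puts [int_float_loose, int_float_strict].inspect
puts INFINITY
```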

Marshalling circular relationships

One of my earliest setbacks was related to marshalling. Given that I was dealing with processed datasets and a large number of objects, I thought marshalling the data to and from disk would be an appropriate thing to do in order to save start-up time. I had written my code such that a Watcher class and a Repository class maintained bi-directional references to one another. By using a set for the associations, I avoided creating infinite loops when establishing the relationships. However, the marshalling library does not know how to deal with circular references. Thus, an attempt to marshal resulted in an infinite loop. This seems quite odd to me as the problem of persisting a graph of objects is not intractable and thus I consider the limitation to be a bug in the marshalling library.

To be fair, Ruby does provide means of manually controlling marshalling, but I did not pursue this path. Ultimately, I was using Ruby because it was supposed to make my development faster. Getting drawn into the nuances of marshalling was something I didn't have time for and had no interest in doing.
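For what it's worth, the manual route could look something like this sketch. The class shapes are assumptions, not the classes from my entry: each Repository persists only its id, each Watcher persists its id plus its repositories, and the back-references are rebuilt in marshal_load so the dumped data never contains a cycle.

```ruby
require "set"

class Repository
  attr_reader :id, :watchers

  def initialize(id)
    @id = id
    @watchers = Set.new
  end

  # Persist only the id; watcher links are rebuilt when watchers load.
  def marshal_dump
    @id
  end

  def marshal_load(id)
    @id = id
    @watchers = Set.new
  end
end

class Watcher
  attr_reader :id, :repositories

  def initialize(id)
    @id = id
    @repositories = Set.new
  end

  # Establish the relationship in both directions.
  def watch(repository)
    @repositories << repository
    repository.watchers << self
  end

  def marshal_dump
    [@id, @repositories.to_a]
  end

  def marshal_load(data)
    @id, repositories = data
    @repositories = Set.new
    repositories.each { |repository| watch(repository) }
  end
end
```

Dumping the list of watchers with Marshal.dump and reading it back with Marshal.load then restores both directions of the relationship, at the cost of exactly the kind of bookkeeping I was hoping to avoid.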

Creating a large number of objects

An early design decision I made was to keep as much data in memory as possible. This was because I was planning on running a large number of computations and wanted data access to be as quick as possible; faulting in from disk or DB would have been too slow. A consequence of that decision was that in my very first pass at the program I had 750,000 objects in memory. I never grew significantly beyond that because I managed to make the garbage collector in MRI segfault several times. For about a week I tried to clean up my memory space: I dropped the bi-directional relationships between a watcher and a repository; I removed memoized calculations; and I did away with local copies of data at both the method and instance levels in favor of effectively global data. Essentially, I tossed away any legibility my code once had. In the end, however, the program was able to run without crashing the garbage collector, albeit many times slower. This prompted me to attempt profiling the application.

Profiling with perftools.rb

My first attempt at profiling was with perftools.rb, which uses Google's profiling tools. I had read good things about it from people that I generally hold in high regard. Alas, I was unable to run it. On both my MacOS X 10.5.7 laptop using prefixed Gentoo's Ruby (1.8.7p174) and Ubuntu 9.04's Ruby (1.8.7p72) the profiler segfaulted almost immediately. I believe this had to do with the large object graph, but the error message wasn't terribly helpful and I had little interest in analyzing the coredump.

Profiling with ruby-prof

Having failed to accomplish anything with perftools.rb, I fell back to the venerable ruby-prof. This profiler worked, but was extremely slow. After several hours of running, I finally relented and sent it a SIGTERM. I was pleasantly surprised to see that it maintained intermediary stats and output them upon application exit. Unfortunately, it hadn't moved beyond my initial data loading code, so the reported information was of limited value.

Memoizing at class method level

At one point my back-of-the-napkin calculation was that I was performing 13 MM point comparisons in my instance space, many of them duplicating internal computations. This was an easy target for optimization just through caching, so I decided to try out the memoize gem. The general lack of code examples for it leads me to believe that it is not in widespread use. All examples I found, including the one in "The Ruby Way," use the library outside of any class structure. After searching news groups for a while and studying the source, I managed to get memoization going for instance methods on an object. However, I was unable to figure out how to get the library to work on class methods without code modification; another irritating setback.
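As an illustration of the class-method case, here is a hand-rolled sketch that does not use the memoize gem at all. The `Distance` class, its `between` method, and the `memoize_class_method` helper are all hypothetical names; the idea is simply to wrap the existing singleton method in a new one that consults a hash keyed on the arguments.

```ruby
class Distance
  # Hypothetical helper: replace a class method with a caching wrapper.
  def self.memoize_class_method(name)
    cache = {}
    original = method(name) # Method object for the current definition
    define_singleton_method(name) do |*args|
      cache.fetch(args) { cache[args] = original.call(*args) }
    end
  end

  # Hypothetical "expensive" class method; counts how often it really runs.
  def self.between(a, b)
    @calls = (@calls || 0) + 1
    (a - b).abs
  end

  def self.calls
    @calls
  end

  memoize_class_method :between
end

first  = Distance.between(3, 10)
second = Distance.between(3, 10) # served from the cache
```

Because the cache lives in a closure, it is never evicted; for millions of distinct argument pairs that would need a bound of some kind.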

Memcache only storing 1MB files

Having had issues with the MRI garbage collector and hoping to save costly calculations between data runs, I turned to the quintessential memory caching tool, memcached. I was running version 1.4.0, which sports a new binary protocol for compressed communication. Although I was connecting to localhost, I wanted to reduce latency as much as possible. Unfortunately, the memcache-client library had not yet been updated to use the new protocol, forcing me to use the older, less efficient protocol.

As it turned out, fretting over the application protocol was not time well spent; unbeknownst to me, memcache has a 1 MB limit on the size of an object being stored and I was storing rather large hashes. Although the limit can be configured at compilation time, preliminary research indicated this was not a recommended practice. Even if it worked, the memcache-client lib has the 1 MB limit hardcoded to avoid network traffic for large data, so the gem would also have to be patched and recompiled.
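A cheap guard against this kind of surprise is to measure the serialized size before attempting to store a value. The sketch below assumes the value is serialized with `Marshal` and treats the limit as exactly 1 MB; the real limit also has to cover the key and per-item overhead, so this is only an approximation. The `cacheable?` helper is a hypothetical name.

```ruby
ONE_MEGABYTE = 1024 * 1024 # approximate default item size limit

# Hypothetical helper: would the marshalled value fit under the limit?
def cacheable?(value, limit = ONE_MEGABYTE)
  Marshal.dump(value).bytesize <= limit
end

small_hash = { 1 => [2, 3] }
large_hash = { 1 => "x" * (2 * ONE_MEGABYTE) }

small_ok = cacheable?(small_hash)
large_ok = cacheable?(large_hash)
```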

Alternative Ruby Implementations

Throughout the course of my Ruby implementation I used three of the major Ruby implementations. I began with MRI (1.8.7p174), moved to YARV (1.9.1p129), and ended up with JRuby 1.3.1 (1.8.6p287 compatible). Most of the discussion up until now has focused on my use of MRI. The following sections describe my experiences with alternative Ruby implementations.

YARV

I was excited at the prospect of being able to use Ruby 1.9 for a project. My contest entry had a small number of dependencies to cope with, so I thought my potential for failure was limited. Generally, things seemed to work well with YARV. I definitely saw a speed improvement over MRI. However, the ruby-debug gem had not been updated for 1.9. Not having a debugger is not acceptable for development, so I had to abandon this path.

JRuby

After having experienced issues with MRI & Ruby 1.9, JRuby ended up being my runtime of choice. It was remarkably faster than MRI and did not suffer from the garbage collection issues. Likewise, it had the full debugger support that Ruby 1.9 lacked. However, there were several issues I ran into with JRuby.

Breaking the debugger with optimizations turned on

In order to maximize execution speed, I used the --fast flag, which performs some bytecode optimizations especially suited for Fixnum operations at the cost of compatibility with Ruby's stacktrace format. Generally, this was acceptable because I could largely deduce what a problem was by unraveling the stacktrace by inspection. The problem is that the Ruby debugger apparently relies quite heavily on its stacktrace format. I was astonished by this finding, but I was unable to use the debugger until the --fast flag was removed, forcing me to bounce between the two modes.

Lack of profiling tools

Discovering the source of massive heap use was difficult due to the lack of profiling tools for JRuby. None of the standard Ruby ones seemed to work at all. I tried to use YourKit for Java, which technically worked, but the class names did not match the Ruby ones and I think I was profiling JRuby proper more than my application. Unable to make much use of the results, I abandoned the profiling effort and resorted to manual code analysis, something humans are notoriously bad at.

Debugger issues

I wrote most of my Ruby implementation using the RubyMine IDE. RubyMine has an integrated graphical debugger, which worked fairly well. I found that often, however, the debugger would detach itself from the process forcing me to start debugging from scratch. It appears that this may be a problem with the ruby-debug-ide gem. I noticed that the issue occurred more frequently with the 1.5 beta of RubyMine than it did with the stable 1.1.1 release, so I suspect it may have to do with RubyMine itself to some degree. I also found it happened much more frequently with JRuby than it did with MRI.

The Switch to Java

Unhappy with the progress I had made with my Ruby-based contest entry, I made the decision to port the whole project over to Java. It was something I was extremely loath to do, but ultimately I wanted to see what the performance impact would be.

Static type checking

Dealing with generics in Java is an abomination, but static type checking was of enormous use when porting the Ruby code over to Java. One of the biggest headaches I had in Ruby was related to my use of the property named "id." When I had a rich object model, it wasn't a problem, but when I had to decompose that model I ran into all sorts of problems because every object in Ruby has a default "id" property. It became hard to keep track of what were domain model objects and what were simple Strings. When porting over to Java, the distinction was quite clear and the IDE let me know immediately if I was using the wrong type. Granted it was more verbose, but when running an application with long runtime characteristics, relying on runtime type checking can be a brutal cycle of run-wait-fix-repeat.

Fast execution

My initial data load went down from about 2.5 minutes in JRuby (the fastest Ruby implementation I tried) to about 8 seconds in Java. The gain was so great that I was convinced I had made a mistake somewhere. Fortunately, I had a test suite to help indicate otherwise.

My best explanation for the difference in execution times is that Java must handle object creation much faster than JRuby can, despite both of them running on the JVM. All this particular section of code did was iterate through the data files and create my watcher and repository representations. That is to say, at this stage no "real work" had yet been performed.

Handling large object graph easily

During my translation from Ruby to Java I migrated back to a rich object model. Whereas I had to break down true bi-directional associations into hash lookups just to make my Ruby implementation run, Java could handle my full object graph quite easily. As a result, the code was much more straightforward. Additionally, I was able to cache values in memory judiciously without eating up the entire heap or breaking the garbage collector. This helped in making the overall solution faster.

Built-in thread pooling with producer / consumer model

In my Ruby implementation I wrote my own poor man's thread pool, used in conjunction with JRuby's native thread pool. The former allowed me to do concurrent execution of point comparisons while the latter allowed me to reuse discarded native threads. While my pool did work, I didn't abstract it very well and had to drain it frequently lest I ran out of stack space. I'm still amazed I couldn't find anything in the Ruby standard library or even a widely used third party gem for thread pooling.
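For illustration, a producer / consumer arrangement of this kind can be sketched with nothing more than Ruby's built-in `Queue` and `Thread` classes. This is not the pool from the contest entry: the worker count, the `:done` sentinel, and the stand-in "comparison" (an absolute difference) are all assumptions made for the example.

```ruby
WORKER_COUNT = 4 # assumed pool size

jobs    = Queue.new # work waiting to be picked up
results = Queue.new # finished comparisons

# Consumers: each worker pulls pairs until it sees the :done sentinel.
workers = Array.new(WORKER_COUNT) do
  Thread.new do
    while (pair = jobs.pop) != :done
      a, b = pair
      results << [pair, (a - b).abs] # stand-in for a point comparison
    end
  end
end

# Producer: enqueue the work, then one sentinel per worker.
[[1, 4], [2, 9], [5, 5]].each { |pair| jobs << pair }
WORKER_COUNT.times { jobs << :done }
workers.each(&:join)

# Drain the results; completion order is not deterministic.
collected = []
collected << results.pop until results.empty?
distances = collected.to_h
```

A fixed set of long-lived workers sidesteps the need to drain and rebuild the pool, since no new thread is created per comparison.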

While constructing threads in Java is not as nice as in Ruby, Java 5 introduced a set of concurrency APIs that remove a lot of the pain with thread management. Creating a thread pool and using a producer / consumer model backed by the pool was trivial to do. It was a small thing, but it meant I didn't have to worry about managing the pool in my code and I had a high degree of confidence that the JDK's implementation was correct. The JVM's native threads scheduling on multiple cores and the easy-to-use thread pool helped increase the throughput of my contest entry.

Profiling tools

The JVM makes it very easy to profile a running Java application. I used YourKit for Java on my entry early on when it felt slow. I found a few problem areas and addressed them, allowing the program to run an order of magnitude faster. This whole process didn't take much more than a couple of hours, in contrast to the many hours I spent, with minimal results, trying to do the same thing by inspection with my Ruby implementation.

Debugger

Debugging applications in Java is quite simple. The tools for doing so have been around a long time and are generally quite solid. I used IntelliJ IDEA as my primary development tool for the Java-based solution and relied on its graphical debugger for solving various problems. A debugger that stays attached to its process should be a given, but disconnects happened all too frequently with RubyMine & JRuby.

Conclusion

After three days of porting, my first run of the Java implementation yielded a 36% prediction accuracy and ran in less than 10 minutes. My best Ruby implementation, after weeks of development, only had a 31% prediction accuracy and took almost 7 hours to compute it. With a few more hours of work the Java implementation broke 40% accuracy and ran in less than 3 minutes. Unfortunately, at this point I became ill and by the time I recovered the contest had ended.

My takeaway from the project is that Ruby is a great language hampered by a terrible execution environment. When writing Rails apps, this usually isn't a problem. I'd even go so far as to say for most types of applications it shouldn't be a problem. For anything that's heavily CPU bound or dealing with large object graphs, however, I don't think Ruby is a suitable option. In these cases, Ruby may best serve as a rapid prototyping tool. By writing the code in Ruby first I had a clear approach to use when translating to Java.

This was the first Java project I had written start to finish in a while. Most of my Java work is on existing codebases. I was pleasantly surprised to see how few issues I had with the Maven build system and it was nice to have a real profiler at my disposal; for much of my Ruby work I must rely on a hosted service for my profiling needs, which is less than ideal. The Java language wasn't even that bad to write in. I'm convinced that if there were easier syntax for collection access and proper language-level property support, writing in Java might even be a pleasant experience.

It seems Ruby and Java have become the two representatives for the dynamic versus static language debate. It would be nice if some of the vitriol could be removed from the discussion and the two languages evaluated on their merits for particular tasks. It should go without saying that every language has its place. This exercise has given me a clearer understanding of where Ruby should and should not be used. It also helped reaffirm for me what Java's strengths and weaknesses are.

GitHub Contest Recap

2009-09-03T00:00:00+00:00

Recently I decided to throw my hat into the GitHub contest ring, the goal of which was to predict repositories that a GitHub user should watch. It had been some years since I had really done anything intensive with artificial intelligence (AI) and I thought it would be fun. I was also attracted to the bounty offered: an aged bottle of bourbon and a lifetime large GitHub account. This post serves as my high-level recap of the contest.

While I had never worked on a recommendation system per se, I had done work with classifiers before. Looking at the problem, my gut reaction was that an instance-based learning algorithm, such as k-nearest neighbors, would be the most effective approach. I also anticipated the top three winning entries to end up somewhere in the 70 - 85% accuracy range. As it turns out, the best accuracy at the end of the contest was 56%. While it is possible that this prediction accuracy is the best that could be achieved with the data, my guess is the relatively short time allowed for the contest (roughly a month) made it difficult to produce a very strong entry. Additionally, the prize offered likely only attracted amateur entries. This is not to say the contestants were unskilled, just that I doubt the dedication shown for something such as the Netflix contest was exerted for the GitHub contest.

I decided to use the weighted k-nearest-neighbors algorithm as the basis for my submission. In order to measure progress and avoid overfitting results, I used a stratified n-fold cross validation evaluation strategy. The idea behind n-fold cross validation is to partition the dataset into proper subsets (the "folds") that each maintain the underlying distribution of the full dataset. This consisted of taking a stratified sample of (watcher ID, repository ID) pairings, using the repository ID as the class value. Once the n folds are created, one is used as a test set while the other n - 1 are used for training; the process is repeated until every fold has had the opportunity to be the test set (the cross validation). The average of the observed accuracies is used to mitigate the effect of potential sampling bias.
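The procedure can be sketched in a few lines of Ruby. This is a simplified illustration rather than the contest code: `stratified_folds` and `cross_validate` are hypothetical names, stratification is approximated by dealing each repository's pairings round-robin across the folds, and the evaluator passed as a block is a trivial stand-in for a real classifier.

```ruby
# Deal (watcher_id, repo_id) pairs into n folds, spreading each repository's
# pairs across the folds so every fold roughly keeps the class distribution.
def stratified_folds(pairs, n)
  folds = Array.new(n) { [] }
  slot = 0
  pairs.group_by { |_watcher_id, repo_id| repo_id }.each_value do |group|
    group.each do |pair|
      folds[slot % n] << pair
      slot += 1
    end
  end
  folds
end

# Use each fold once as the test set, train on the rest, and average the
# accuracies returned by the block.
def cross_validate(pairs, n)
  folds = stratified_folds(pairs, n)
  accuracies = folds.each_index.map do |i|
    test_set = folds[i]
    training_set = (folds[0...i] + folds[(i + 1)..-1]).flatten(1)
    yield(training_set, test_set)
  end
  accuracies.sum / accuracies.size.to_f
end

pairs = [[1, :rails], [2, :rails], [3, :rails], [4, :rails], [1, :git], [2, :git]]
folds = stratified_folds(pairs, 2)

# Stand-in evaluator: fraction of test repositories that appear in training.
mean_accuracy = cross_validate(pairs, 2) do |training, test|
  seen = training.map { |_watcher_id, repo_id| repo_id }
  test.count { |_watcher_id, repo_id| seen.include?(repo_id) } / test.size.to_f
end
```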

My first approach was not terribly efficient, but I had taken the engineering approach of making it correct and then making it fast. I was quite surprised when my first pass wouldn't even run because it broke the MRI Ruby garbage collector. This forced me to reduce the search space earlier than I had intended. While I could have taken a stratified sample of the training data, I wanted to minimize data loss, so I opted to restructure the instance space into regions grouped by repository ancestry. This reduced the instance space down from about 113,000 points to roughly 77,000 using 10 folds. While an improvement, this reduced space still proved to be too large to perform full evaluations over using pairwise comparisons (an O(n²) approach). Other pruning methods, such as removing regions with single repositories or single watchers, further reduced the search space, but at the cost of accuracy.

The next step, thus, was to devise a search strategy that attempted to find the regions that the test instance belonged to, while avoiding those that the instance did not. I tried several heuristics with varying degrees of success. Empirical evidence suggested that if a test instance either watched 25% or more of an owner's repositories or 25% or more of the test instance's repositories were owned by the same person, the test instance would be likely to watch other repositories owned by that person. For my final submission the search heuristic I used considered all the regions that the test instance was already known to belong to, plus regions that contained repositories owned by any repository owner that the test instance was known to watch a substantial portion of.
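The 25% rule can be expressed compactly. The following is a sketch under assumed data structures, not the contest code: `watched` is the list of repositories a test instance watches, `owner_of` maps every known repository to its owner, and `candidate_owners` is a hypothetical helper name.

```ruby
THRESHOLD = 0.25

# Return the owners whose other repositories are worth searching: either the
# watcher follows at least `threshold` of the owner's repositories, or at
# least `threshold` of the watcher's repositories belong to that owner.
def candidate_owners(watched, owner_of, threshold = THRESHOLD)
  repos_by_owner   = owner_of.keys.group_by { |repo| owner_of[repo] }
  watched_by_owner = watched.group_by { |repo| owner_of[repo] }

  selected = watched_by_owner.select do |owner, repos|
    share_of_owner   = repos.size.to_f / repos_by_owner[owner].size
    share_of_watched = repos.size.to_f / watched.size
    share_of_owner >= threshold || share_of_watched >= threshold
  end
  selected.keys
end

# Made-up data: alice and carol own ten repositories each, bob owns one.
owner_of = {}
(1..10).each do |i|
  owner_of["alice/a#{i}"] = "alice"
  owner_of["carol/c#{i}"] = "carol"
end
owner_of["bob/b1"] = "bob"

watched = ["alice/a1", "bob/b1"] + (1..8).map { |i| "carol/c#{i}" }
owners = candidate_owners(watched, owner_of)
```

Here alice is excluded because the watcher follows only 10% of her repositories and she accounts for only 10% of the watch list, while bob and carol each clear one of the two thresholds.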

Simply finding the correct regions wasn't sufficient, however. Once the regions were chosen there was still the matter of choosing the correct repositories from those regions. I found that the most forked repository in a region was generally the correct choice. A minor accuracy boost was achieved by evaluating the parent of any repository that a test instance was known to be watching, working under the observation that watchers tended to watch parent repositories when watching one of that repository's children.
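Picking the most forked repository per region is a one-liner once the regions are known. A small sketch with made-up data follows; `RegionRepo` and `most_forked_per_region` are hypothetical names, and the fork counts are invented for the example.

```ruby
RegionRepo = Struct.new(:name, :region, :forks)

# Map each region to its most forked member.
def most_forked_per_region(repos)
  repos.group_by(&:region).map do |region, members|
    [region, members.max_by(&:forks)]
  end.to_h
end

repos = [
  RegionRepo.new("rails/rails", :rails, 900),
  RegionRepo.new("bob/rails",   :rails, 3),
  RegionRepo.new("git/git",     :git,   120)
]

picks = most_forked_per_region(repos)
```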

Evaluations were performed by calculating the Euclidean distance between two repositories. My first approaches were pure distances and yielded symmetric results. As the project evolved, however, I found myself augmenting the definition of distance to be a route taken between two repositories by a given training instance and even adopted some aspects of Hamming distance. This meant that for two repositories r₁ and r₂, given training instances t₁ and t₂, the distance between r₁ and r₂ could vary depending on characteristics of t₁ and t₂. While this may not be a strict Euclidean distance calculation, I likened it to modes of transportation. E.g., the distance between two city centers may be fixed for a straight line, but can vary considerably if one chooses to walk versus take a train.

The distance calculation weighted different attributes based upon observations of the shape of the training data. The attributes I ended up using were: parent-child relationships, general common ancestry, common owner, and common watchers. I had planned on using a genetic algorithm to optimize the weighting system, but ran out of time before the end of the contest.
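To make the idea of attribute weighting concrete, here is a rough sketch. The weights, the `Repo` struct, and the way similarity is turned into a distance are all assumptions made for illustration; they are not the values or formula used in the contest entry.

```ruby
Repo = Struct.new(:id, :owner, :parent_id, :root_id, :watchers)

# Hypothetical weights, one per attribute named above.
WEIGHTS = { parent_child: 4.0, ancestry: 2.0, owner: 1.5, watchers: 1.0 }

# Sum the weights of every attribute the two repositories share.
def similarity(a, b, weights = WEIGHTS)
  score = 0.0
  score += weights[:parent_child] if a.parent_id == b.id || b.parent_id == a.id
  score += weights[:ancestry] if a.root_id == b.root_id
  score += weights[:owner] if a.owner == b.owner

  shared = (a.watchers & b.watchers).size
  total  = (a.watchers | b.watchers).size
  score += weights[:watchers] * shared / total unless total.zero?
  score
end

# More shared attributes means a smaller distance, bounded by (0, 1].
def distance(a, b, weights = WEIGHTS)
  1.0 / (1.0 + similarity(a, b, weights))
end

upstream  = Repo.new(1, "rails", nil, 1, [1, 2, 3])
fork      = Repo.new(2, "bob",   1,   1, [2, 3, 4])
unrelated = Repo.new(3, "carol", nil, 3, [9])

near = distance(upstream, fork)
far  = distance(upstream, unrelated)
```

A weight hash like this is also the natural thing for an optimizer such as a genetic algorithm to mutate, since each candidate solution is just a different set of four numbers.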

I had two goals when starting this project: 1) win the contest; and 2) learn more about Ruby. Unfortunately, achieving the latter may have come at the expense of the former. In the end, I achieved slightly over 40% accuracy, which was a far cry from where I had expected to end up and wasn't enough to win the contest. My original choice of technology stack was not appropriate. In most of my previous AI projects I used Java with a smattering of Python and, to a lesser extent, ML and LISP. Having been using Ruby as my primary programming language for the past year and a half, I thought I would use it for my contest entry. I spent roughly three weeks trying to optimize a Ruby solution that could only achieve 31% accuracy after a seven hour run on a quad core machine with 4 GB RAM. During the last week of the contest, I rewrote the project in Java and achieved 40% accuracy with a 10 minute long run on a dual core MacBook Pro. I plan on elaborating on that a bit in a future post.

[Edit: (Sept. 17, 2009) I've now published the post comparing my Ruby and Java entries]

The code for my entry can be found on GitHub.

Nesting alias_method_chain Calls

2009-05-15T00:00:00+00:00

Introduction

Rails provides a nifty utility in ActiveSupport called alias_method_chain. For those not familiar with it, it simplifies the task of "replacing" an already defined method with an augmented one. The new method is aliased to the name of the original method and the original method is aliased to some other name in order that it may still be referenced.

More succinctly, the following call:

alias_method_chain :number_printer, :filter

is effectively the same as:

alias_method :number_printer_without_filter, :number_printer
alias_method :number_printer, :number_printer_with_filter

Fig. 1 illustrates how the alias_method_chain call changes references to the method definitions like so:

Fig. 1: Results of alias_method_chain call.

Now the original method defined as :number_printer is referenced as :number_printer_without_filter. :number_printer now points to the method definition for :number_printer_with_filter, which can be referenced as either :number_printer or :number_printer_with_filter.

This implies that prior to the execution of the alias_method_chain call, you must define both methods :number_printer and :number_printer_with_filter.
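The aliasing can be reproduced without ActiveSupport by writing out the two alias_method calls by hand. The `Printer` class below is a hypothetical example, and its methods return strings instead of printing so the effect is easy to inspect.

```ruby
class Printer
  def number_printer(num)
    "plain #{num}"
  end

  # The "with" method must exist before the second alias_method call.
  def number_printer_with_filter(num)
    "filtered " + number_printer_without_filter(num)
  end

  # What alias_method_chain :number_printer, :filter boils down to:
  alias_method :number_printer_without_filter, :number_printer
  alias_method :number_printer, :number_printer_with_filter
end

decorated = Printer.new.number_printer(3)
original  = Printer.new.number_printer_without_filter(3)
```

After the two calls, `number_printer` and `number_printer_with_filter` run the same definition, while the original body is reachable only through `number_printer_without_filter`.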

Motivating Example

The ninja-decorators project relies heavily on alias_method_chain and its usage will be used as the example throughout the remainder of the article. ninja-decorators gives you before_filter, after_filter, and around_filter functionality outside of Rails controllers. With these methods you can handle cross-cutting concerns in a class located elsewhere in your Rails app or without having to use Rails at all. Using the standard examples of security and logging as cross-cutting concerns, we have something like the following:

around_filter :secure_around, [:number_printer]
around_filter :log_around, [:number_printer]

Here, we want :log_around to decorate :number_printer with :secure_around applied. Internally, around_filter delegates to alias_method_chain to handle method decoration.

Problem

The problem with the implementation of alias_method_chain is one of definition order with regards to its two internal alias_method calls. If the new head of the chain is an enhancement of an existing method in the chain, there likely exists a coupling between the two. Since the alias_method_chain call is effectively atomic, however, this complicates how the two methods reference each other. Fig. 2 shows the intermediate states between each of the two alias_method calls made internally by alias_method_chain.

Figure 2: Detailed breakdown of alias_method_chain mechanics.

:number_printer_without_filter will not exist until after the alias_method_chain call is complete. :number_printer_with_filter must exist before the alias_method_chain call can begin, otherwise the second alias_method call made internally will fail. As a consequence, :number_printer_with_filter must call :number_printer_without_filter dynamically. As long as you only use one level of alias_method_chain calls, this isn't a problem. With multiple levels of chaining, however, dynamic calls like this fall apart.

To make the discussion a little more concrete, we'll use the following example adapted from the ninja-decorators project. It is a bit contrived, but should serve well enough as a basis for discussion.

require 'activesupport'
  
class NumberFun
  def self.around_filter(around_method, method_names)
    method_names.each do |meth|
      define_method("#{meth}_with_around_filter") do |*args|        
        send(around_method, *args) do |*ar_args|
          send("#{meth}_without_around_filter", *ar_args)
        end
      end

      alias_method_chain meth, :around_filter
    end
  end

  def increment_filter(num)
    yield(num + 1)
  end

  def number_printer(num)
    puts num
  end

  def square_printer(num)
    puts num * num
  end

  around_filter :increment_filter, [:number_printer, :square_printer]
end

:number_printer and :square_printer are two simple methods. They take a number in and print out its value or its square, respectively. :increment_filter is a simple "around filter"; it augments a method by incrementing the input argument by 1 before executing the original method. Running both methods will produce the following in IRB:

>> NumberFun.new.number_printer 3
4
=> nil
>> NumberFun.new.square_printer 5
36
=> nil

around_filter is where all the hard work is being done and is where alias_method_chain is employed. It takes as its arguments a filter method name and a list of method names to decorate with that filter. For each method to decorate it defines the required "with" method for alias_method_chain. This newly defined method will call the filter method (:increment_filter in this case), which will in turn call the original, undecorated method (:number_printer_without_around_filter or :square_printer_without_around_filter) as a block. Once the "with" method is defined, alias_method_chain is called so that the original method name can be used to transparently call the newly decorated method.

While convoluted (don't worry, it's wrapped up in a library), this approach will work dandily until you need to start decorating a method more than once. For the sake of the example, we'll pretend that we actually want to increment each input argument by two. In reality, we'd likely want to apply a completely different filter. The outcome is precisely the same, but to keep things simple, we'll just apply the :increment_filter twice:

around_filter :increment_filter, [:number_printer, :square_printer]
around_filter :increment_filter, [:number_printer, :square_printer]

Running this through IRB again, we'd likely expect to see :number_printer print out 2 + num for argument num. Instead, the session looks like this:

>> NumberFun.new.number_printer 3
SystemStackError: stack level too deep
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:6:in `send'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:6:in `number_printer_with_around_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:7:in `number_printer_without_around_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:7:in `send'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:7:in `number_printer_with_around_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:16:in `increment_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:6:in `send'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:6:in `number_printer_with_around_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:7:in `number_printer_without_around_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:7:in `send'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:7:in `number_printer_with_around_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:16:in `increment_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:6:in `send'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:6:in `number_printer_with_around_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:7:in `number_printer_without_around_filter'
    from /Users/nirvdrum/dev/workspaces/alias_method_chain/blah.rb:7:in `send'
... 7586 levels...

That's the polite way of telling you that you have infinite recursion, not the expected value of 5. The issue is that each around_filter call defines a new "with" method that calls the "without" method dynamically. Calling a method by name with send, however, only calls the one at the current lexical scope. Meanwhile, each call to alias_method_chain changes the alias target of the "without" method. As such, we have the execution flow illustrated in Fig. 3 rather than the expected one in Fig. 4.

Figure 3: Actual behavior when chaining alias_method_chain calls.

Figure 4: Expected behavior when chaining alias_method_chain calls.

The key thing to note about the code behavior observed in Fig. 3 is that the second alias_method_chain call will alias :number_printer to :number_printer_without_filter, just like the first call will. However, after the first alias_method_chain call is made, :number_printer is aliased to the definition :number_printer_with_filter (as seen in Fig. 2). Calling :number_printer at this point will call :number_printer_with_filter because of the decoration and, subsequently, :number_printer_with_filter will call :number_printer_without_filter, the latter of which is now pointing at the definition of :number_printer_with_filter as well. That's a lot of words to say that we end up in a situation where :number_printer_with_filter calls itself inadvertently and there's no base case to break out.

There is no* clean way around this with alias_method_chain. It's a classic chicken-and-egg situation. The best that can be done is for around_filter to maintain a stack of UnboundMethod objects in a class instance variable. While doable, this resource management is error-prone and would have to be replicated by any method affected by the problem. Effectively, it is the same process the VM would normally perform in managing stack frames at each recursion level, so it's best to let the VM do it.

The problem can be averted with a minor change in alias_method_chain. The idea is to yield to a block between the two alias_method calls, allowing the proper formation of closures to the "without" method. Unfortunately, this is not backwards-compatible with Rails because alias_method_chain yields to a block elsewhere for reasons that are not quite clear to me (I believe it's to handle method names with punctuation).

A simplified definition is thus:

def alias_method_chain(target, feature)
  with_method = "#{target}_with_#{feature}"
  without_method = "#{target}_without_#{feature}"

  alias_method without_method, target
  yield if block_given?
  alias_method target, with_method
end

The block passed to alias_method_chain can then take care of the creation of the "with" method, which will have access to the "without" method at the current level. Breaking away from around_filter, we can more easily see how nested alias_method_chain calls work with the new definition:

class MoreNumberFun
  # Build up a proc that will construct the filtered method
  # Execution of the proc is delayed until we encounter the alias_method_chain call.
  filtered_method_builder = Proc.new do

    # Get a reference to the unfiltered method or, more accurately, the original method with
    # all previous filters already applied.  This new filtered method builds up on the filters
    # already applied.
    unfiltered_method = instance_method :number_printer_without_filter

    # Define the newly filtered method.
    define_method("number_printer_with_filter") do |*args|
      unfiltered_method.bind(self).call(args.first + 1)
    end
  end

  def number_printer(num)
    puts num
  end

  alias_method_chain :number_printer, :filter, &filtered_method_builder
  alias_method_chain :number_printer, :filter, &filtered_method_builder
end

In this admittedly convoluted example, the block passed to the alias_method_chain calls is built up as a proc first. This allows us to make the same alias_method_chain calls without needing to duplicate code. The proc gets a reference to :number_printer_without_filter and calls it within the newly defined :number_printer_with_filter, which, for simplicity in the example, provides the same behavior that :increment_filter previously did. This forms a closure and lets each level of "with" and "without" methods pair up, avoiding the infinite recursion that occurs when relying on send alone.

Running in IRB now, we get the expected printout of 2 + num for argument num, rather than the stack overflow exception we previously experienced:

>> MoreNumberFun.new.number_printer 3
5
=> nil

Conclusion

I began writing this post just to document the changes necessary to alias_method_chain in order to make ninja-decorators work. If this work could make its way back into Rails core, great. Otherwise, it serves as a decent rationale document. If you've run into similar issues yourself, you should now know why and how to work around them. One issue not addressed here is reordering the chain or removing links from the chain. Since each link has a tight coupling at the time of definition, altering the chain via anything other than an append/prepend may be confusing.

* I suspect someone much smarter than me knows a way. After a couple days on the issue, I couldn't come up with anything.

Composite Index Problem with PostgreSQL and Rails 2.x

2009-04-26T00:00:00+00:00

Introduction

Recently I ran into an issue with using composite indices in PostgreSQL and Rails 2.3.2. I only managed to catch the problem by using the shoulda should_have_index macro. This macro asserts that an index appears on a list of columns. Since it is a list, the order of the columns is in fact significant.

Problem

The problem is that given a table with the following definition:

  create_table "video_games", :force => true do |t|
    t.string   "asin"
    t.integer  "user_id", :null => false
  end

and the following migration:

  add_index :video_games, [:user_id, :asin], :unique => true

the schema dumper for the ActiveRecord PostgreSQL adapter may actually produce the following in schema.rb:

  add_index "video_games", ["asin", "user_id"],
    :name => "index_video_games_on_user_id_and_asin",
    :unique => true

The distinction here is subtle, but important. In the migration, I declared the index should be on the tuple (user_id, asin) and the schema dumper in turn generated code that would add an index on (asin, user_id).

The issue was with the way the adapter fetched the index data. It issued a query against PostgreSQL's system catalogs to reconstruct the index pseudo-DDL statement. The query used in Rails 2.3.2 is:

  SELECT distinct i.relname, d.indisunique, a.attname
     FROM pg_class t, pg_class i, pg_index d, pg_attribute a
   WHERE i.relkind = 'i'
     AND d.indexrelid = i.oid
     AND d.indisprimary = 'f'
     AND t.oid = d.indrelid
     AND t.relname = '#{table_name}'
     AND i.relnamespace IN (SELECT oid FROM pg_namespace WHERE nspname IN (#{schemas}) )
     AND a.attrelid = t.oid
     AND ( d.indkey[0]=a.attnum OR d.indkey[1]=a.attnum
        OR d.indkey[2]=a.attnum OR d.indkey[3]=a.attnum
        OR d.indkey[4]=a.attnum OR d.indkey[5]=a.attnum
        OR d.indkey[6]=a.attnum OR d.indkey[7]=a.attnum
        OR d.indkey[8]=a.attnum OR d.indkey[9]=a.attnum )
  ORDER BY i.relname

There's a lot going on there that may be hard to follow. The query returns the index name (i.relname), a Boolean indicating whether or not the index is unique (d.indisunique), and a member column of the index (a.attname). For composite indices, there are multiple rows, one for each member column.

The important thing to note is that d.indkey is a PostgreSQL array type (int2vector) that contains a list of column positions for member columns of the index. As can be seen from the query, there is no explicit ordering on a.attname, so PostgreSQL is free to return the rows in any order it wishes. In PostgreSQL 8.3, this ordering appears to be the attribute's positional index, in ascending order. Please note that I have not consulted the PostgreSQL source to verify this. Suffice it to say, the returned ordering should not be relied upon and is not guaranteed to match the order in d.indkey. The problem is that the schema dumper did in fact rely on this order.
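The following Ruby sketch contrasts the two orderings for the video_games table. It is only an illustration under stated assumptions: the constants and method names are hypothetical, the attnum values assume the implicit id column comes first, and none of this is the adapter's real code.

```ruby
# Hypothetical stand-ins for catalog data; none of these names come from Rails.

# pg_attribute for video_games: attnum => attname.
# Assumes the implicit id column is attnum 1, followed by asin and user_id.
ATTRIBUTES = { 1 => "id", 2 => "asin", 3 => "user_id" }

# pg_index.indkey for the index, in the space-separated text form of an int2vector.
# The migration asked for (user_id, asin), so user_id's attnum comes first.
INDKEY = "3 2"

# Mimics rows coming back sorted by attnum, as observed on PostgreSQL 8.3.
def columns_in_row_order(attributes, indkey)
  member_attnums = indkey.split.map { |n| n.to_i }
  attributes.keys.sort.
    select { |attnum| member_attnums.include?(attnum) }.
    map { |attnum| attributes[attnum] }
end

# Order-preserving alternative: walk indkey and look up each attnum in turn.
def columns_in_index_order(attributes, indkey)
  indkey.split.map { |n| attributes[n.to_i] }
end

puts columns_in_row_order(ATTRIBUTES, INDKEY).inspect   # ["asin", "user_id"]
puts columns_in_index_order(ATTRIBUTES, INDKEY).inspect # ["user_id", "asin"]
```

Walking d.indkey itself is what preserves the declared order; relying on whatever order the rows come back in does not.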

As an aside, there is another problem with this query. It only inspects the first 10 elements of the d.indkey array, leading to a ceiling of 10 columns per index. This is a Rails-imposed limit. PostgreSQL's own limit has been 32 by default since at least version 7.4, and it can be configured higher at compile-time.
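A small Ruby sketch shows the effect of that ceiling. It mirrors the shape of the WHERE clause above by comparing only slots 0 through 9; the constant, the method name, and the 12-column index are hypothetical.

```ruby
# Hypothetical names; this only imitates the shape of the query's OR clause.
SLOTS_CHECKED = 10

# True if attnum appears in one of the indkey slots the query looks at.
def matched_by_query?(indkey, attnum)
  (0...SLOTS_CHECKED).any? { |slot| indkey[slot] == attnum }
end

# attnums for a hypothetical 12-column index, in index order.
wide_index = (1..12).to_a

seen = wide_index.select { |attnum| matched_by_query?(wide_index, attnum) }

puts seen.length           # 10
p(wide_index - seen)       # [11, 12]
```

The last two member columns never match the predicate, so they would be silently dropped from the dumped index.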

Resolution

Both issues were fixed on April 21, 2009 with the closing of Rails ticket #2515, roughly 3.5 years after the problem was first introduced on September 23, 2005. Curiously, the problem was reported by three different parties in April 2009. Between the time I came across it and the time I came up with a fix and filed a ticket, someone else had reported the issue and fixed it. So, that's how I ended up with this analysis of a problem that in the end I didn't have to solve.

Interestingly, the issue shows up with rake db:test:load but not rake db:test:clone_structure because the former uses the ActiveRecord PostgreSQL adapter's implementation of schema dumping and loading, whereas the latter uses the pg_dump tool to create a DDL file. rake db:test:prepare does a clone_structure followed by a load, which yields a test database that does not match the correct one used in development.