I was overall very happy with these bulbs: decent Android and iOS apps and, compared to fancier solutions (e.g., Philips Hue or Belkin WeMo), they do not require any proprietary base stations, and you can’t beat the price!  However, switching off the lights before falling asleep involved hunting for the phone, opening the app, and waiting for it to scan the network; not an ideal user experience.  I was actually missing our old X10 alarm clock controller (remember those?), so I decided to make one from scratch, because… why not?
Although the X10 Powerhouse controller’s faux-wood styling and 7-segment LED had a certain… charm, I decided to go more modern and use a touchscreen.  I also designed a 3D printed enclosure with simple geometric shapes and used it as a further excuse to play with 3D print finishing techniques.  Here is the final result:
And here it is in action:
If this seems interesting, read on for details. The source code for everything is available on GitHub. Edit: You can also check the Hackaday.io project page for occasional updates.
Component selection. There are several boards with the ESP8266, most of them using the ESP-12 module. I decided to go with the SparkFun Thing (which incorporates the ESP chip directly), as it also includes a LiPo charge controller.  Perhaps overkill for battery backup, but nice to have.  If you do use the charge controller, then the price is very reasonable (e.g., an Adafruit ESP breakout and Micro-LiPo combo will cost about the same, although the flash is 4x larger and the ESP-12 module is FCC-approved). It is also a very nice board for experimentation and it has become my go-to ESP board: nice header layout, and the easiest to program (tip: instead of fiddling with the DTR jumper on the Thing, cut your DTR wire and insert a pin header pair; once esptool starts uploading, just pull the pin and… done!).
For the display, modules with a parallel interface were out of the question, since the ESP does not have enough pins. After some googling, I found Digole’s IPS touchscreen, which incorporates a PIC MCU and can connect over UART, I2C, or SPI (selectable via a solder jumper). Several users seem to really like Digole’s display modules, and their older models in particular appear quite popular. The display itself is very nice.  However, touchscreen support appears to be relatively recent and isn’t that great (more later).  It is also a bit on the expensive side, the firmware is not upgradeable (so you’re basically stuck with whatever version your module comes loaded with — I got one with an older version that has some bugs with 18-bit color support), and manufacturing quality could have been a bit better (mine had poor reflow).  Still, for prototype experimentation, this isn’t a bad module and the company is generally responsive to customer inquiries.
I also picked up a DS3231 RTC module off of Amazon, but I ended up not using it; periodically synchronizing with an NTP server is more than good enough.
Total cost. The first version of this device comes to about $45 including everything: SparkFun Thing ($15), touchscreen (the highest-cost item at $21.50), and a 500mAh LiPo cell ($8.50 off eBay). However, in retrospect, it could be done for much less: about $13 total (!) if you skip the LiPo (and charge controller), use a $5-6 ESP module instead, and also get a much cheaper ILI9341 touchscreen module (not IPS, but just $7 off eBay; I have one on the way from China). This does not include plastic filament (maybe a dollar’s worth?), paint (it doesn’t take much, assuming you already have some), or labor.
3D-printed enclosure. I mocked up a couple of profiles in 2D CAD to see what I liked, and then did the actual design in OpenSCAD, which is my go-to CAD tool, because… code! (Who has time for point-and-click? :)  It’s a fairly standard affair, with simple geometric shapes, designed in multiple pieces for printing.
The picture above shows the parts in their printing orientation. The standoffs are conical to eliminate the need for supports (alternatively, I could have printed them as separate parts, but getting them inserted is too fiddly, especially in tight spaces like this). The cylindrical sections (middle right) are support ribs which I ABS-glued to the main enclosure’s vertices. They serve two purposes. First, to hold the endcaps in place (the ribs are slightly shorter than the enclosure). Second, to provide some extra support (after gluing, rib layers are perpendicular to enclosure layers). Printing them separately may be overkill (there is also a version of the enclosure and ribs in one piece, which requires some extra support to print, but not much), but gluing them is easy enough so… why not. The little clip (top right) is for holding the LiPo cell in place, and is also glued inside the main enclosure.  It’s printed separately to eliminate the need for support (and, at the prototyping stage, it also makes it a little easier to try different battery sizes, without having to re-print the whole thing or do separate test prints).
One part of the design that does require a lot of support is the opening for the display.  I entertained the idea of printing the main enclosure in three vertical sections (and gluing them together), but eventually decided against it. Printing that successfully took a bit of trial and error.  I use both Cura and Slic3r.  For most parts I used Slic3r (mainly because it produces smoother outer perimeters, and also integrates better with OctoPrint).  However, for the life of me, I haven’t managed to get Slic3r to print supports that break off easily.  Even with the new pillar mode, most of a part comes out fine, but there is always some bit of support somewhere that’s just too close to the print to separate!  Cura, on the other hand, always does an excellent job with supports.
Finally, when designing cases like this, one of the (many!) things I like about open-source hardware is that I can download the PCB layout and get precise component positions and dimensions; no calipers and almost-there test prints! Sometimes it’s the little things…
You will note, however, that there are holes to insert hex nuts, which are visible from the outside and would have been rather ugly (you don’t see exposed fasteners in “real” products). Which brings me to the next trick.
Friction welding: dos and don’ts. I first heard about friction welding through Make magazine’s excellent article on 3D print post-processing and since then it has become one of my favorite techniques. I’ve seen a few tutorials on YouTube about friction welding plastics using a rotary tool.  However, at least the ones I found seem to be from people who recently learned the technique themselves and are excited to share it.  I wish Make: had placed more emphasis on step 2c (pre-heating the surfaces); it would have saved me several failed attempts. Do not skip that step; it is crucial! And, no matter what you do, do not immediately press the spinning, cold filament onto the cold pieces (as some of these tutorials appear to suggest), since you’ll most likely gouge them.  An alternative I’ve found to pre-heating with a heat gun is to use friction itself to do the preheating. Initially, just barely touch the spinning filament onto the plastic surfaces (not on the metal).  Without applying any pressure, wait until you see a tiny bit of plastic start flowing.  Only then gradually increase the pressure and start moving the filament, to keep a consistent flow.  Also, if blobs form on the tip of the filament, it’s best to stop and lightly spin it against some sandpaper to clean them off.
Embedding nuts with friction welding. Using friction welding to embed nuts is a trick I came upon by accident.  When I was building my Kossel-based printer, I overtightened the screws holding the rods to the effector, stripping the cutouts for the nylocs.  I was too lazy/impatient to print another effector, so I just quickly filled the gaps using friction welding.  I’m still using that effector, which has held the nuts very nicely for over a year (even after taking the effector apart several times to tweak various things).
I now regularly use this technique to also hold magnets and, generally, anything inserted that needs to stay put.  Superglue is the easiest, but it develops stress cracks and invariably fails over time (and, if you’re thinking threadlocker, don’t: it will craze the plastic, especially ABS).  Next easiest is using a soldering iron to press the nuts/items into the plastic, which I use very often (I regularly design all my holes undersized and do this anyway).  However: (i) you can’t use it on non-metal items; (ii) you can’t use it on magnets (the necessary heat will demagnetize them); (iii) if you don’t have a steady hand, you may loosen the hole enough to cause the part to fall out eventually, even if it seems fine at first.  Friction welding takes a bit more time, but it’s the best solution I’ve found so far and it’s also very easy after just a little bit of practice.  I haven’t tried threaded inserts yet. The McMaster-Carr “heat-set inserts for plastics” (it appears their site does not support direct linking!?) that Werner Berry uses look really nice and I’ve been itching to try them, but that’s another piece of hardware I need to keep around.
Another nice thing is that you can use this trick to embed blind nuts that are not visible from the outside. This is a trick that is rather obvious (once you’ve done all the above :). First, insert the nuts (I used the soldering iron) and make sure the surface is flat (lightly file, if necessary):
Make sure that the fastener axis is oriented properly (if not, adjust). Then, fully thread the holes from the outside. Use a proper tap (not a screw) to cut threads, especially for finer pitches. Do not skip this step (more later).
Finally, apply molten plastic, starting from the outside (i.e., touching the perimeter of the hole, plastic-to-plastic) and working your way towards the center. Once you’re done (if you do it right, you should end up with a very clean-looking plug, without any gouges or streaks), lightly file to make the surface flat.
Done! Now the nuts are not visible from the outside, and you have a very clean finish. Additional advantages of this approach: you do not need easy access to the hole from the fastener side (in this enclosure it would have been very difficult to insert the nuts and/or tap the holes from the inside), and you can use a regular taper or plug tap (rather than a bottom tap).
3D-print finishing. Although I often like the surface finish of 3D printed layers, in this case I wanted a smoother, more “product-like” finish.  Some time ago I bought some XTC-3D and this was a good opportunity to play with it a little more.  Overall, XTC works very well; especially on organic/curved shapes, you’re pretty much done after applying. Do follow the instructions about applying very thin coats (it will even out, even if it does not look like it at first).  However, in this case (no pun intended) there were two issues. First, I used an older printer (a Solidoodle 2; my Kossel is not yet set up for long ABS prints) which has significant banding. XTC is good, but it’s not magic; I did some initial sanding (and cleaning with denatured alcohol) before applying the XTC resin.  Second, on large flat surfaces, you will get some minor unevenness and some tiny bubbles here and there. Light sanding (with a sanding block!) will address most of these issues, but in some places you may need to use a little filler.  One-part Bondo spot putty is sufficient for this.  Apply it generously, and after it is dry, sand most of it off (it sands very easily).  Do wait for it to set, though.  Especially on thick coats, the manufacturer’s recommended set time (25 minutes) may not be sufficient; a good rule of thumb is to wait until it turns light pink everywhere and then wait some more.
All things considered, XTC-3D works great (unfortunately, I forgot to take pictures after applying just the XTC-3D). It definitely beats plain sanding (it substantially reduces the amount needed), as well as two-part body fillers (which I haven’t used on prints, but have used in another project a long time ago).  And for smaller surfaces or organic shapes, you’re pretty much done after applying it.
Spray painting. I’m very new at this; I had done it once before (also for this project) and, surprisingly, it had gone very smoothly.  I still don’t know why (maybe too much false confidence?), but it’s always the second time that gets you burned, isn’t it?  To cut a long story short, I learned about the difference between lacquers and enamels (simplifying, the first just dry by evaporation, the second cure by reacting with air), got distracted by paint chemistry (if you’re curious look, e.g., here or here, and if you’re really curious try this), and found the following paint compatibility chart, which is worth its weight in gold:
Furthermore, in the past I had used Krylon, which is not available at big box stores (we have one two blocks from home), so I decided to try Rustoleum instead.  Although people often seem happier with Rustoleum (and, these days, it’s also cheaper), for the life of me I couldn’t get an even spray with their nozzles. Maybe they work well on large items like chairs and tables, or maybe it’s my (lack of) technique, but on this small enclosure I couldn’t get even coverage, and always got spots with too much paint (not enough to cause drips, but enough to affect the surface finish). More importantly, Rustoleum takes forever to dry and, if you’re doing your spraying in all sorts of weird places with temporary setups (we live in an apartment), that’s an issue.
So, I wiped it all off (tip perhaps worth sharing: I found that, at least if the paint hasn’t completely cured, white spirit works well and it doesn’t attack the plastic at all), went to an auto parts store, and got some Krylon.  I think their newer non-rotating nozzles spray a bit more like a firehose (just have to live with overspray), but other than that, the second attempt went pretty well.
I chose a satin finish both because I like it, and also because it’s a bit more forgiving of improper spraying distance (you can err on the side of keeping the nozzle too far from the surface, and it won’t have an ill effect, within reason). Skipping the intermediate steps (nothing to be proud of :), here is the end result — not bad for a rookie:
Putting it together. The last bits were easy: soldering headers on the Thing (whatever fits in the enclosure, some straight and some raised right-angle pins) and on the display module.  Also, the right-angle JST header soldered onto the Thing wouldn’t work in this enclosure (the LiPo wire collides with the endcap), so I desoldered it and replaced it with a vertical JST header.  Finally, I had to solder a wire to the reset pads on the Digole module (the reset signal is not broken out, but it’s accessible through an unpopulated reset pushbutton).
After fiddling with the screws (long nose pliers and balldrive Allen keys FTW!) and wires, the mechanical assembly was done — whewww!
Epic fail(s). So far I’ve omitted an epic fail from the story.  The enclosure shown above is actually the second attempt.  The first one ended in disaster, all within a couple of hours.  The first attempt was printed in PLA.  First fail and lesson: PLA really does melt under the sun, and it takes less than you’d think.  I sprayed the endcaps first and temporarily set them down on a piece of cardboard on top of a metal outdoor table under the sun.  In the few minutes it took me to spray the first coat on the main enclosure, the endcaps had seriously warped!  You can see this in the picture on the left (and that is after I spent half an hour re-shaping them with a temperature-controlled hot air gun at low heat!).  The second fail was even worse: for the first attempt, I did not have an M2 tap, so I decided to cut the threads with the M2 screws themselves (in slightly oversized holes).  Unfortunately, this does not cut the threads properly, and the screws still meet substantial resistance.  Since the nuts will never be perfectly aligned, when inserting the screws from the inside, what happened was what you see in the photo on the left. Doh!
So, definitely use a tap to properly cut threads (or, make the holes really oversize, and make sure you clean any molten plastic if you use a soldering iron). Furthermore, measuring your fastener lengths twice and hand-tightening them is not a bad idea either.
You may also notice that the finish here is a little glossier; that’s what happens when you over-apply paint and/or spray from too close.
Finally, to top it all off, I hadn’t realized that the standoffs for the RTC module were on the wrong side (double-doh!) and, when test-fitting, it also turned out that the wires I had crimped were a couple of cm too short. Oh well, it had been a while since I had an epic fail like this! :)
Protocol sniffing. On to the software part.  First thing was to reverse-engineer the WiFi bulbs’ protocol.  It appears that, although there are several variants of the hardware that look identical, not all of them run the same protocol (e.g., see links in the sniffing notes on GitHub).  I’m not even sure all are made by the same OEM (FWIW, MAC vendor lookup on my bulbs says Hi-flying Electronics). Of course, none of these protocols are published, but all of them are very similar and quite simple.  In my case, since I’m running OpenWRT on our router, I just installed ngrep and sniffed the iOS app’s traffic.  I’m pretty sure it’s possible to sniff traffic even if you don’t have access to the router (but I didn’t have to find out).  Edit: Root access on the router makes sniffing much easier (otherwise you’ll probably need a sniffer on your tablet/phone).
For on and off commands, I can just copy them verbatim. For commands to set color, the structure is easy to figure out. First is an opcode byte, followed by RGBW values (the bulbs have both RGB as well as warm-white LEDs, and it seems you can turn on either one or the other), a constant(?), and a checksum byte. Nothing too fancy.
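To make that concrete, here is a minimal sketch of building such a “set color” packet. The byte layout follows the structure described above, but the specific opcode and constant values, and the checksum rule (assumed here to be a simple sum of the preceding bytes, modulo 256), are placeholders; different bulb variants differ, so check your own sniffing notes.

```python
# Hypothetical packet builder for the "set color" command described above.
# The opcode/constant values and the sum-mod-256 checksum are assumptions;
# adjust them to match what you capture from your own bulbs.
def color_command(r, g, b, w=0, opcode=0x31, constant=0x00):
    payload = bytes([opcode, r & 0xFF, g & 0xFF, b & 0xFF, w & 0xFF, constant])
    checksum = sum(payload) & 0xFF          # assumed: sum of preceding bytes, mod 256
    return payload + bytes([checksum])

print(color_command(255, 0, 0).hex())       # e.g., full red
```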
The iOS app uses UDP broadcast for bulb discovery (that protocol is also easy to figure out). This step does take some time (and was one of the annoyances with the user experience, since this information is not cached by the iOS app). However, after that, all communication happens over TCP. To keep things simple, I decided to skip the device discovery step (at least for now), and just assign fixed hostnames/IPs to the bulbs.
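With fixed hostnames, sending a command is then just a short TCP exchange. A rough sketch (the hostname, port, and command bytes below are placeholders; the actual on/off bytes are whatever you captured verbatim with ngrep):

```python
import socket

BULB_HOST = "bulb-bedroom.lan"   # fixed hostname assigned to a bulb (placeholder)
BULB_PORT = 5577                 # placeholder; use the port seen in your capture

def send_command(packet: bytes, host: str = BULB_HOST, port: int = BULB_PORT) -> None:
    # Skip UDP discovery entirely: open a TCP connection to a known address
    # and send the raw command bytes.
    with socket.create_connection((host, port), timeout=2.0) as s:
        s.sendall(packet)

# ON_PACKET / OFF_PACKET would be the verbatim byte strings captured from the app.
# send_command(ON_PACKET)
```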
Firmware. The firmware is fairly standard stuff. It’s written using the ESP port of Arduino (many thanks, @igrr et al!), and it currently occupies about 70% of the Thing’s flash.
First, the display driver and UI code. Touch handling uses a combination of interrupts to detect the first touch, and then polling and debouncing to detect finger down/move/up events, and update the UI accordingly (this is probably the most complex bit here, and it’s actually pretty simple). While at it, I did a gratuitous rewrite of the Digole library, inspired by a very cool hack I had seen. Then the NTP client, WiFi bulb client, and a webserver for configuring the device over a web browser. Settings are stored in “EEPROM” (which, on the ESP, is just a sector of the flash memory). The web UI is pretty simple for now:
Arduino on ESP has a great set of libraries for networking stuff, which makes all this quite easy! I decided to write a proper HTML5 frontend and a simple REST API (using the excellent ArduinoJson library), with basic Bootstrap and Knockout.js to make it look a little pretty. However, upon first boot, the device has no Internet access.  If it fails to connect to WiFi, the firmware switches the device to AP mode, so initial configuration can be done by connecting to it directly.  Since the flash chip is not large enough to store Bootstrap and Knockout locally, there is a separate, minimal UI (not shown) that uses regular HTML forms (no AJAX) and just allows setting the SSID and password.
One problem (that I eventually worked around, rather than solved) was getting the Digole module to talk back to the ESP.  I2C was a fail (and it cost me a couple of days; I’m still not sure if the problem is on the display’s end, with the Arduino clock-stretching implementation, or something else) and SPI I didn’t really try. I finally got UART to work (except that you can’t turn off the ESP’s 74,880-baud boot messages, hence the need for access to the reset signal on the Digole).  The downside is that reflashing the firmware is now a PITA (I have to fiddle inside the case to disconnect the display and connect my FTDI), but that happens relatively infrequently (the display stuff is mostly done, and the network stuff I test on a spare Thing first).
Conclusion. After all this, I think the result is not bad for a completely home-made device. Could I have gotten a used Chumby for a comparable price (they go for about $60 used), or just used an old/cheap Android tablet (and perhaps just 3D print a stand)?  Aside from the Chumby service’s ups and downs… sure, but where’s the fun in that? :)  Also, there is no way to reduce the cost of these alternatives down to $13.
What’s next? Well, you may have noticed there is a zipcode setting. That’s for weather information (I’m planning to use OpenWeatherMap, which returns reasonably-sized responses — parsing anything more than 1KB, maybe 2KB, is probably a bit iffy).  Also, a web UI to control the lamps would be nice (the REST API endpoints are there, I just need to get around to writing and refactoring the HTML bits).  Maybe adapt the whole thing to a cheaper display module (as discussed in the beginning; I’ve already started a port of the ucglib library to the ESP, but need an actual device to finish it). Finally, one could perhaps re-write it in Lua (NodeMCU?) with support for pluggable modules (a-la a true Chumby). That probably won’t be me, though; by that time, I’m pretty sure a new hack will have “distracted” me. :)
Before I continue, let me say that, yes, I know Matlab has cell arrays and even objects, but still… you wouldn’t really use Matlab for, e.g., text processing or web scraping. Yes, I know Matlab has distributed computing toolboxes, but I’m only considering main memory here; these days 256GB RAM is not hard to come by and that’s good enough for 99% of (non-production) data exploration tasks. Finally, yes, I know you can interface Java to Matlab, but that’s still two languages and two codebases.
Storing matrix data in Matlab is easy.  The .MAT format works great: it is pretty efficient, and can be used with almost any language (including Python).  At the other extreme, arbitrary objects can be stored in Python as pickles (the de-facto Python standard?); however, (i) they are notoriously inefficient (even with cPickle), and (ii) they are not portable.  I could perhaps live with (ii), but (i) is a problem.  At some point, I tried out SqlAlchemy (on top of sqlite), which is quite feature-rich, but also quite inefficient, since it does a lot of things I don’t need. I had expected to pay a performance penalty, but hadn’t realized how large it was until I measured it.  So, I decided to do some quick-and-dirty measurements of various options.
The goal was to compare Python overheads (due to the interpreted nature of Python, the GIL, etc etc), not raw I/O performance. Furthermore, I’m looking for a simple data storage solution, not for a distributed, fault-tolerant, scalable solution (so DHTs, column stores, etc like memcached, Riak, Redis, HBase, Cassandra, Impala, MongoDB, Neo4j, etc etc etc, are out). Also, I’m looking for something that’s as “Pythonic” as possible and with reasonably mature options (so I left things like LevelDB and Tokyo Cabinet out). And, in any case, this is not meant to be an exhaustive list (or a comprehensive benchmark, for that matter); I had to stop somewhere.
In the end, I compared the following storage options:
Furthermore, I also wanted to get an idea of how easily Python code can be optimized.  In the past, I’d hand-coded C extensions when really necessary, I had played a little bit with Cython, and I had heard of PyPy (but never tried it).  So, while at it, I also considered the following Python implementations and tools:
The dataset used was very simple, consisting of five columns/fields of random floating point numbers (so the data are, hopefully, incompressible), with sizes of up to 500M records.  The dataset size is quite modest, but should be sufficient for the goals stated above (comparing relative Python overheads, not actual disk I/O performance). File sizes (relative to sqlite, again) are shown below.  For the record, the ‘raw’ data (500,000 rec x 5 floats/rec x 8 bytes/float) would have stood at 0.74, same as PyTables which has zero overhead (well, 64KB to be exact); sqlite has a 36% overhead.  ZODB size includes the index, but that’s just 2.7% of the total (caveat: although records were only added, never deleted, I’m not familiar with ZODB and didn’t check if I should still have done any manual garbage collection).
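For a flavor of what these micro-benchmarks look like (this is not the actual harness, which is linked below): generate the five-column random-float dataset and time a bulk write and a full read through sqlite’s DB-API.

```python
# Rough sketch of the kind of micro-benchmark used here (not the actual harness):
# five columns of random floats, bulk-inserted into sqlite and read back, timed.
import sqlite3
import time
import numpy as np

def make_data(n_records, n_cols=5):
    # Random floats, so the data is (hopefully) incompressible.
    return np.random.rand(n_records, n_cols)

def bench_sqlite(data, path="bench.sqlite"):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS t (c0 REAL, c1 REAL, c2 REAL, c3 REAL, c4 REAL)")
    t0 = time.time()
    conn.executemany("INSERT INTO t VALUES (?,?,?,?,?)", data.tolist())
    conn.commit()
    t_write = time.time() - t0
    t0 = time.time()
    _ = conn.execute("SELECT * FROM t").fetchall()
    t_read = time.time() - t0
    conn.close()
    return t_write, t_read

print(bench_sqlite(make_data(100_000)))
```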
Runs were performed on a machine with an ext4-formatted Samsung EVO850 1TB SSD, Ubuntu 14.04LTS and, FWIW, a Core i7-480K at 3.7GHz. RAM was 64GB and, therefore, the buffercache was more than large enough to fit all dataset sizes.  One run was used to warm up the cache, and results shown are from a second run.  Note that, particularly in this setting (i.e., reading from memory), many (most?) of the libraries appear to be CPU-bound (due to serialization/deserialization and object construction overheads), not I/O-bound.  I cautiously say “appear to be” since this statement is based on eyeballing “top” output, rather than any serious profiling.
For full disclosure, here’s a dump of the source files and timing data, provided as-is (so, caveat: far from release quality, not intended for reuse, really messy, undocumented, etc etc—some bits need to be run manually through an iPython prompt and/or commented-in/commented-out, don’t ask me which, I don’t remember :). If, however, anyone finds anything stupid there, please do let me know.
First a sanity check, wall clock time vs. dataset size is perfectly linear, as expected:
The next plot shows average wall-clock time (over all dataset sizes) for both cpython and pypy, normalized against that of raw sqlite with cpython:
As usual, I procrastinated several weeks before posting any of this. In the meantime, I added a second EVO850 and migrated from ext4 to btrfs with RAID-0 for data and LZO compression.  Out of curiosity I reran the code.  While at it, I added ZODB to the mix. Here are the results (cpython only, normalized against sqlite on btrfs):
PyTables is, oddly, faster! For completeness, here are the speedups observed with striping across two disks, vs. a single disk.
Remember that these are (or should be) hot buffercache measurements, so disk I/O bandwidth should not matter, only memory bandwidth.  Not quite sure what is going on here; I don’t believe PyTables uses multiple threads in its C code (and, even if it did, why would the number of threads depend on the… RAID level??).  Maybe some profiling is in order (and, if you have any ideas, please let me know).
Comparing Python implementations. Woah, look at PyPy go!  When it works, it really works.  See SqlAlchemy go from being 2.5x slower when using the low-level APIs or 25x slower with all the heavyweight ORM machinery, to almost directly competitive with raw sqlite or 6x slower (a 4x speedup), respectively.  Similarly, manual object re-construction on top of raw sqlite now has negligible overhead.  However, most libraries unfortunately do not (yet) run on PyPy.  More importantly, the frameworks I need for data analysis also do not support PyPy (I’m aware there is a special version of NumPy, but matplotlib, SciPy, etc are still far from being compatible).  Also, I’m not quite sure why pickles were noticeably slower with PyPy.
Comparing data formats. Sqlite is overall the most competitive option.  This is good; you can never really go wrong with a relational data format, so it should serve as a general basis. PyTables is also impressively fast (it’s pretty much the only option that beats raw sqlite, for this simple table of all-floats).  Finally, I was somewhat surprised that NumPy’s CSV I/O is that slow (yes, it has to deal with all the funky quotation, escapes, and formatting variations, and CSV text is not exactly a high-performance format, but still…).
For the time being, I’ll probably stick with sqlite, but get rid of the SqlAlchemy ORM bits that I’ve been using (or, perhaps, keep them for small datasets). The nice thing is that I can keep my data files and perhaps look for a better abstraction layer than DB-API, but the sqlite “core” itself appears reasonably efficient. Eventually, however, I’d like to have something like the relationship feature of the ORM (but without all the other heavyweight machinery for sessions, syncing, etc), so I can easily persist graph data, with arbitrary node and edge attributes (FWIW, I currently use NetworkX once the data is loaded; I know it’s slow, but it’s very Pythonic and convenient, and I rarely resort to iGraph or other libraries, at least so far—but that’s another story).
TL;DR: I went from the PCB on the left, to the device on the right, without ever leaving home. Design files are available here (caveat: I’m not an EE, but I sometimes play one on the web! :).
In addition to the plastic enclosure (designed and 3D printed at home), I also added a boost converter and a LiPo charge controller, so that the device can run off a LiPo battery and can be recharged via a standard micro-USB port.  These days a computer, the right tools, a fair amount of googling, and some common sense go a long way. Much of this is possible by standing on the shoulders of open source, both software (e.g., OpenSCAD and Slic3r) and hardware (e.g., Adafruit’s designs).  Also, CAD and common data formats make it easy to manufacture components, from circuits, to enclosures, to mechanical assemblies (example of this in another post), with just a few mouse clicks (e.g., with a 3D printer or through online services like OSHPark).  Just, say, five years ago, very little of this would be as easy as it is today.  Even Jonathan Jaglom, son of Stratasys’s chairman and CEO of Makerbot, seems to recognize this (via Hackaday), although he doesn’t actually say the “o” (for opensource) word.
Measuring things out. Instead of getting off-the-shelf breakout boards and jamming them in a large enclosure, I decided to streamline everything onto a single PCB, which would fit the overall round shape of the W-Ear. First, I needed precise dimensions of the W-Ear PCB.  Some information (microphone and mounting hole locations) is available on the W-Ear website, but I also needed the board outline and component locations to make the add-on LiPo PCB fit as tightly as possible.  Therefore, I scanned the W-Ear PCB on a flatbed scanner, and traced the outlines using Inkscape (an opensource vector drawing application).  After marking the locations of taller components (capacitors, transistor, and LM386 IC), I also drew the add-on board outline, saved it as DXF, and imported it into Eagle.
Designing the voltage regulator and charge controller PCB. Working with the W-Ear PCB imposed some constraints that are somewhat artificial, the most important of which is that supply voltage needs to be 9V.  The LM386-4 has a minimum supply voltage of 5V, and I also wasn’t sure if the rest of the microphone array circuit would work properly with anything different.  A single-cell LiPo supplies about 3.7V, so a voltage converter was necessary.  I decided to go with the MIC2288, and basically copied the datasheet example circuit (including component placement guidelines, as much as possible).
Next, I needed a charge controller for the LiPo battery.  Adafruit has several, and I chose one of their older designs, based on the MCP73833 IC.  Since this is open source hardware, I could download the schematic, tweak it for my needs (e.g., remove a few headers I didn’t need, change some resistors and thermistors, and switch to an MSOP package so it’s easier to hand-solder), and then lay out my custom PCB.  Isn’t that nice?  In the meantime, I had chosen a couple of LiPo cells off of eBay, and had them shipped from China.  Finally, I laid out the PCB, using the traced board outline and leaving empty space for the LiPo.
In the meantime, I also soldered the W-Ear board and printed my charge controller PCB on paper and cut it out, to make sure that the outline was correct and that it would fit snugly around the various components.  After tweaking the outline’s cutouts by a few fractions of a mm here and there, I shipped the design files off to OSHPark, to have a set of three prototype boards made.  Here are the bare boards (including the add-on fix; see below):
Designing the enclosure. While I was waiting for those to arrive (it takes about 10-14 days), I started designing the 3D printed enclosure, using the actual W-Ear PCB and the paper mockup of my PCB.  I made the enclosure’s CAD design parametric (e.g., total height, slack around the board, position and size of microphone, LED, and socket cutouts, etc), so I could easily tweak it.  A couple of test prints later, I was almost done.  The enclosure measures 79mm in diameter (basically constrained by the diameter of the W-Ear PCB), and 19.5mm thick, which is significantly thinner than would have been possible with the originally supplied 9V battery. I was actually surprised to realize that the total height is constrained by the electrolytic caps, not by my extra PCB + LiPo “sandwich”!  Much better than I had expected.
One thing that bothered me was the huge volume knob that shipped with the W-Ear kit, so I quickly designed and 3D printed a smaller, nicer-looking one. Finally, somewhere at this point, I placed an order for all the necessary SMD components from DigiKey (these arrive quickly, in just a couple of days).
If you haven’t worked with 3D printing before, it can be like magic at first, but for me it’s now almost routine. Although there are a number of details in designing a CAD model like this, I’m glossing over them. Here is a render of the final CAD model for all enclosure pieces:
PCB mounting standoffs are part of the enclosure, and the tabs on the back cover (tapered, to make them less likely to break) are meant to hold the LiPo cell in place.
Assembly and initial testing. When everything arrived in the mail, I was ready to put it all together and test. I assembled the charge controller board using hot air reflow soldering. If you’re interested, there are several example videos on YouTube; here is one by Dave Jones, demonstrating on much smaller and trickier (QFN instead of MSOP) components than I used. Everything fit together almost perfectly (except for the battery’s JST connector, which protruded by about 0.5mm and was easy enough to trim). The “measure twice (or thrice, or more), cut once” mantra paid off, as usual.
Working around ripple issues. The circuit worked correctly the first time, much to my surprise (can you tell I have no EE training, or anything beyond high-school physics when it comes to circuits — e.g., see the redundant caps… :).  Except for one thing, which I had feared: there was too much ripple on the switching regulator’s output, and the W-Ear requires a very clean power supply.  After some googling, it seems I had two options: (i) design an appropriate output filter, or (ii) add a linear LDO regulator after the switching regulator. I decided against the first option, for two reasons.  First, it would probably take too much time (days?) and trial-and-error to get a clue about filter design.  Second, I wasn’t entirely sure that, even after all that, I’d end up with a passive low-pass filter whose components were small enough to fit in the enclosure.  Therefore, I searched DigiKey for an appropriate LDO, and came up with the ADP7102, which has a very high power-supply rejection ratio (PSRR; a term I hadn’t even heard of before :) and could probably serve as a kind of active filter in this case, I guess.  It ain’t cheap, but that wasn’t a concern, since this is a one-off circuit, mainly for fun.
Re-doing the main PCB from scratch would have cost quite a bit, so I decided to make a tiny add-on board (basically, a breakout for the ADP7102, plus the datasheet-recommended input and output caps), which could be soldered onto the main board with a pin header.  So, instead of paying $29 for another batch of the entire board, I paid only $1.50.  SMD components made the add-on small enough to stay below the top of the LiPo battery.  I designed this tiny board quickly and shipped the files off to OSHPark, once again.  When the boards came back, I assembled them (hot air reflow again), changed the feedback resistor on the switching regulator to increase its output voltage by about 0.2-0.3V (to compensate for the dropout), and put everything together. And it actually worked!  No more hiss and distortion.  Here’s what the final assembly looks like:
Almost everything you see in the picture (except the green printed circuit) was designed and manufactured “at home” (or at least without leaving it)! Yay for opensource and CAD.
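As an aside, the feedback-resistor tweak mentioned above is just the generic adjustable-regulator relation Vout = Vref × (1 + R_top/R_bottom). A back-of-the-envelope sketch; the reference voltage and resistor values below are placeholders, not the actual design values (take the real ones from the MIC2288 datasheet and the schematic):

```python
# Back-of-the-envelope check for bumping the boost converter's output by ~0.2-0.3 V
# to absorb the LDO's dropout, using Vout = Vref * (1 + R_top / R_bottom).
# V_REF and R_BOTTOM are placeholder assumptions, not values from the actual design.
V_REF = 1.24          # assumed feedback reference voltage; check the datasheet
R_BOTTOM = 10_000     # assumed lower feedback resistor (ohms)

def r_top_for(v_out, v_ref=V_REF, r_bottom=R_BOTTOM):
    return r_bottom * (v_out / v_ref - 1)

print(round(r_top_for(9.0)))   # resistor for the original ~9 V target
print(round(r_top_for(9.3)))   # ~0.3 V higher, to compensate for the LDO dropout
```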
There is one more shortcoming in the design: the switching+linear regulator portion is always enabled, and the quiescent current is enough to kill the battery within a few days, even if the W-Ear is switched off. However, I didn’t want (or, rather, I was afraid?) to touch the W-Ear circuit in any way (e.g., tapping into its volume potentiometer’s on-off switch). I can live with this anyway.
The finishing touch was a piece of paracord (cut to length, inserted into the enclosure’s holes for it, then knotted and slightly melted with a lighter to make it stay put), so the finished device could be worn around the neck. Mission accomplished!
Conclusion. This side-project was completed over time during the summer of 2014. If I had to guess how long it would have taken if I’d worked exclusively on this, I’d say less than a week (excluding the time waiting for PCBs, but including time spent googling, learning, and collecting all necessary information). Is this a finished product, or even production-ready?  No, but it’s a pretty darn convincing prototype (and would have been even more so if I hadn’t been too lazy to apply a coat of XTC-3D and spraypaint; one of these days :). More so if you consider that it was done in a short period of time, by someone who has no formal training in design or EE, largely by re-using opensource designs on the web, and relying on freely available tools!  And all of this without ever leaving home, and without any major investments in equipment!  Not bad.
The overview of SVMs was centered around the observation that the decision function is, eventually, a weighted additive superposition (linear combination) of evaluations of “things that behave like projections in a higher-dimensional space via a non-linear mapping” (kernel functions) over the support vectors (a subset of the training samples, chosen based on the idea of “fat margins”).
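Written out, that superposition is the standard SVM decision function

\[ f(\mathbf{x}) = \sum_{i \in \mathit{SV}} \alpha_i\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b, \]

with the predicted class given by the sign of \(f(\mathbf{x})\); the \(\alpha_i\) are the learned weights on the support vectors \(\mathbf{x}_i\) (with labels \(y_i\)), and \(K\) is the kernel.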
Most of the explanations and pictures were based on linear functions, but I wanted to give an idea of what these kernels look like, what their “superposition” looks like, and how kernel parameters vary the picture (and may relate to overfitting).  For that I chose radial basis functions. I found myself doing a lot of handwaving in the process, until I realized that I could whip up an animation.  Following that class, I had 1.5 hours during another midterm, so I did just that (Python with Matplotlib animations, FTW!!).  The result follows.
Here is how the decision boundary changes as the bandwidth becomes narrower:
For large radii, there are fewer support vectors and kernel evaluations cover a large swath of the space.  As the radii shrink, all points become support vectors, and the SVM essentially devolves into a “table model” (i.e., the “model” is the data, and only the data, with no generalization ability whatsoever).
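This is not the animation code (that’s linked below), but the same effect is easy to reproduce with scikit-learn: for an RBF kernel \(K(\mathbf{x}_i,\mathbf{x}) = \exp(-\gamma\,\lVert \mathbf{x}_i - \mathbf{x}\rVert^2)\) with \(\gamma = 1/(2\sigma^2)\), a narrower bandwidth corresponds to a larger \(\gamma\), and you can watch the support-vector count grow as the boundary wraps itself around individual points:

```python
# Quick reproduction of the effect with scikit-learn (not the original animation code):
# train an RBF-kernel SVM at several bandwidths and look at the zero-crossing of the
# decision function. gamma = 1/(2*sigma^2), so a narrower bandwidth means larger gamma.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # toy non-linear dataset

xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

for gamma in [0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    Z = clf.decision_function(grid).reshape(xx.shape)
    print(f"gamma={gamma}: {clf.n_support_.sum()} support vectors")
    plt.contour(xx, yy, Z, levels=[0])                # decision boundary = zero-crossing
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.show()
```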
This decision boundary is the zero-crossing of the decision function, which can also be fully visualized in this case. Â One way to understand this is that the non-linear feature mapping “deforms” the 2D-plane into a more complex surface (where, however, we can still talk about “projections”, in a way), in such a way that I can still use a plane (z=0) to separate the two classes. Â Here is how that surface changes, again as the bandwidth becomes narrower:
Finally, in order to justify that, for this dataset, a really large radius is the appropriate choice, I ran the same experiments with multiple random subsets of the training data and showed that, for large radii, the decision boundaries are almost the same across all subsets, but for smaller radii, they start to diverge significantly.
Here is the source code I used (warning: this is raw and uncut, no cleanup for release!).  One of these days (or, at least, by next year’s class) I’ll get around to making multiple concurrent, alpha-blended animations for different subsets of the training set, to illustrate the last point better (I used static snapshots instead) and also give a nice visual illustration of model testing and ideas behind cross-validation; of course, feel free to play with the code. ;)
Despite the daily buzz around 3D printing, very few studies have looked at the digital content of physical things, and the processes that generate it. I collected data some time ago, and started off with this visualization, which I wrote about before. A further initial analysis of the data has some interesting stories to tell.
Exponential growth rates. The total number of things over time (blue) exhibits exponential growth, with a compound doubling time of 6.1 months. Furthermore, if we consider only remixes (green), the growth rate far outpaces the overall rate, with a compound doubling time of 4.6 months. Consequently, the relative ratio of remixes is also growing at an exponential pace (red) and, although obviously this cannot continue forever, there is little evidence that the growth rate of remixing is abating (in fact, after the introduction of the Thingiverse Customizer, which is excluded from this plot, the rate has picked up even further).
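For reference, a compound doubling time like the 6.1 months above falls out of a straight-line fit of log-counts against time; a generic sketch with made-up data:

```python
# Generic sketch of estimating a compound doubling time: fit log(count) linearly
# against time, then doubling_time = ln(2) / slope. The data below is made up
# purely for illustration (it is not the Thingiverse data).
import numpy as np

months = np.arange(0, 24)                      # time axis, in months
counts = 1000 * np.exp(0.11 * months)          # fake exponentially-growing counts

slope, _ = np.polyfit(months, np.log(counts), 1)
print("doubling time (months):", np.log(2) / slope)   # ~6.3 for this fake series
```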
Popularity: views vs. likes vs. makes.  The following table summarizes the results of least-squares regression on measures of user actions, showing the top-3 best predictive features (\(p < 0.01\), ranked by \(t\)-test scores) with 95% confidence intervals of the corresponding regression coefficients, as well as the bottom-2 worst features.
| Variable | Best predictors | Worst predictors |
|---|---|---|
| \(\mathit{\#Views}\) | \(\mathit{\#Likes}\!: 43.1\text{–}44.6, \mathit{\#DLs}\!: 0.35\text{–}0.38, \mathit{\#Views}'\!: 0.28\text{–}0.31\) | \(\mathit{\#Make}'\, (p=0.48), \mathit{\#Remix}'\, (p=0.06)\) |
| \(\mathit{\#DLs}\) | \(\mathit{\#Likes}\!: 43.1\text{–}44.6, \mathit{\#DLs}\!: 0.35\text{–}0.38, \mathit{\#Views}'\!: 0.28\text{–}0.31\) | \(\mathit{\#Remix}\, (p=0.66), \mathit{\#Remix}'\, (p=0.51)\) |
| \(\mathit{\#Likes}\) | \(\mathit{\#Views}\!: 0.006, \mathit{\#Make}\!: 2.72\text{–}2.83, \mathit{\#Likes}'\!: 0.42\text{–}0.46\) | \(\mathit{\#Remix}'\, (p=0.59), \mathit{\#DLs}'\, (p=0.27)\) |
| \(\mathit{\#Makes}\) | \(\mathit{\#Likes}\!: 0.074\text{–}0.077, \mathit{\bf\#Files}\!: -0.13 \text{ to } -0.11, \mathit{\#Makes}'\!: 0.28\text{–}0.33\) | \(\mathit{\#Remix}'\, (p=0.99), \mathit{\#DLs}'\, (p=0.51)\) |
| \(\mathit{\#Remix}\) | \(\mathit{\#Views}\!: 0.0003, \mathit{\bf\#Remix}'\!: 0.18\text{–}0.27, \mathit{\bf\#Sources}\!: 0.19\text{–}0.39\) | \(\mathit{\bf\#Make}'\, (p=0.71), \mathit{\#DLs}\, (p=0.66)\) |
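The regressions behind this table are plain least-squares fits; roughly, the procedure looks like the sketch below (the CSV file and column names are hypothetical placeholders, and the real feature set also includes the derived counterparts denoted with a prime).

```python
# Rough sketch of the kind of regression summarized in the table above.
# The input file and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("things.csv")
y = df["views"]
X = sm.add_constant(df[["likes", "downloads", "makes", "remixes"]])

model = sm.OLS(y, X).fit()
ranked = model.tvalues.abs().sort_values(ascending=False)   # rank predictors by |t|
print(ranked.head(3))                                       # top-3 predictors
print(model.conf_int(alpha=0.05).loc[ranked.index[:3]])     # their 95% CIs
```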
The relative incidence of user actions depends on the relative effort required to take those actions. Therefore, we observe that roughly (order of magnitude) 100 views “contribute” one like in our linear models, and roughly 10 likes “contribute” a make. The first is not particularly surprising. However, the fact that only 10\(\times\) likes contribute a make seems to suggest that users are actively seeking things, and have the means and motivation to actually print things that they have liked.
Another intuitive, in retrospect, observation is that the number of files has a negative effect on makes. This provides evidence for the hypothesis that simpler things (consisting of fewer parts) are more likely to be made.
Sublinearities and power-laws. The first figure below shows the number of likes vs. makes, and the second figure shows views vs. likes (both smoothed using exponential-size buckets).  The emerging relationships are that \(\mathit{\#Likes} \propto \mathit{\#Makes}^{0.70}\) and \(\mathit{\#Views} \propto \mathit{\#Likes}^{0.85}\).  Similar relationships have been observed in other domains.  However, if we look at remixes vs. makes, no such pattern emerges, which brings us to a last point.
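Exponents like these come from straight-line fits in log-log space; a generic sketch (with made-up data; the real analysis also bucketed points into exponential-size bins first):

```python
# Generic sketch of fitting a power-law exponent: #Likes ~ #Makes^k implies
# log(likes) = k*log(makes) + c, so fit a line in log-log space.
# The data below is made up for illustration.
import numpy as np

makes = np.logspace(0, 3, 50)
likes = 5.0 * makes ** 0.70 * np.exp(np.random.normal(0, 0.1, 50))   # fake data

k, c = np.polyfit(np.log(makes), np.log(likes), 1)
print("fitted exponent:", k)   # close to 0.70 for this fake data
```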
Popular vs. Generative.  Perhaps the most surprising observation is that typical measures of general popularity have little relation to whether a thing is remixed or not: (i) makes are, in fact, the worst predictor of number of remixes (table and last figure above); and (ii) in fact, the number of remixes is a bad predictor of almost everything, except of other remixes (table above). This suggests that aspects of a design that make it broadly appealing are distinct from aspects that make it inspiring and, furthermore, agrees with the author’s personal experience that following remix links is more useful when looking for ideas, than when looking for utilitarian or fun things to print.
What next?  As a “bonus”, here is a visualization of the evolution of the largest connected component of the remix graph (with Customizer outputs excluded).  The last frame is essentially the same data as in our interactive visualization.  The video was hacked together using Matplotlib’s basic animation facilities and laid out using a simple breadth-first traversal of the graph.  Not as pretty as it could be, but it still shows an interesting picture.
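For the curious, the layout idea is simple enough to sketch with NetworkX (this is not the original script): place each node at an x-position given by its breadth-first depth from a root, and spread nodes at the same depth along y.

```python
# Rough sketch of a breadth-first layout (not the original animation script):
# x = BFS depth from a chosen root, y = position within that depth level.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.gnm_random_graph(60, 80, seed=1)            # stand-in for a remix component
depths = nx.single_source_shortest_path_length(G, 0)

by_depth = {}
for node, d in depths.items():
    by_depth.setdefault(d, []).append(node)

pos = {}
for d, nodes in by_depth.items():
    for i, node in enumerate(nodes):
        pos[node] = (d, i - len(nodes) / 2.0)      # fan out each BFS level vertically

nx.draw(G.subgraph(depths), pos=pos, node_size=30)
plt.show()
```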
Perhaps the NVR industry is ripe for “disruption”, but I wasn’t willing to wait. Last time I did that (for car stereos) was almost three years ago… and I’m still waiting.  Luckily, an NVR is a much simpler build than a custom car stereo (this was enough for me, thank you :).  There are several low-cost hardware options and ZoneMinder is a great open-source surveillance system that was originally built to scratch an itch (the original author’s power tools were stolen from his garage, and he couldn’t find any reasonably-priced commercial surveillance solutions he liked).  Here is what I got after about a day:
In addition to some familiarity with installing Linux, a 3D printer, and my case design from Thingiverse, you’ll also need:
Total cost comes to $120 if you have some spare parts around, or about $140 if you get everything and add shipping too. That’s about half the price of just the software licenses for a NAS box, and an order of magnitude cheaper than NVR boxes on the market.  Plus, there are CPU cycles to spare for more cameras, and it leaves the ReadyNAS Atom CPU free to handle its main tasks (file and media serving).
Although I love the Raspberry Pi and already have a couple for various tasks, I went with the Cubieboard since it has a much more powerful CPU (AllWinner A20 dual-core ARM) and built-in SATA, for just $20 more. Adding a powered USB hub and SATA-to-USB adaptor to a RasPi would probably have cost more (plus require funky wiring solutions); the Cubieboard was mostly plug-and-play.
The A20 can handle all four cameras in “modect” mode (motion detection triggered recording) at 1fps with one alarm zone per camera, without problems. The load average can be high (between 0.5 and 1.5) probably due to the continuous I/O, but actual utilization per core seems to peak around 20-25% and is typically in the single digits. Not bad at all for a low-power (10W max) single-board computer!
There are several Linux distributions for the Cubieboard (including Android) and the documentation is a bit messy, so I installed Linux a few times before settling on Cubian (basically Debian Wheezy, in the spirit of Raspbian), which is great.  It can be installed on either an SD card or the built-in NAND flash (I went with the former).  There are already DEBs for ZoneMinder, so this is a fairly standard Linux install.  The only additional steps were moving the data directories for ZoneMinder and MySQL, as well as temporary files and logs (to minimize flash wear), over to the hard drive; see the brief instructions on Thingiverse.
If you have a 3D printer and some basic Linux skills, perhaps this might save you a few hundred to a few thousand dollars. YMMV with other video formats (e.g., H.264 HD cameras). Let me know if it works for you.
If you haven’t heard of it before, 3D printing refers to a family of manufacturing methods, originally developed for rapid prototyping, the first of which appeared almost three decades ago. Much like mainframe computers in the 1960s, professional 3D printers cost up to hundreds of thousands of dollars. Starting with the RepRap project a few years ago, home 3D printers are now becoming available, in the few-hundred to couple-of-thousand dollar price range.  For now, these are targeted mostly at tinkerers, much closer to an Altair or, at best, an Apple II, than a MacBook. Despite the hype that currently surrounds 3D printing, empowering average users to turn bits into atoms (and vice versa) will likely have profound effects, similar to those witnessed when content (music, news, books, etc) went digital, as Chris Anderson eloquently argues with his usual, captivating dramatic flair. Personally, I’m about as excited about this as I was about “big data” (for lack of a better term) around 2006 and mobile around 2008, so I’ll take this as a good sign. :)
One of the key challenges, however, is finding things to print!  This is crucial for 3D printing to really take off. Learning CAD software and successfully designing 3D objects takes substantial time, effort, and skill. Affordable 3D scanners (like the ones from Matterform, CADscan, and Makerbot) are beginning to appear. However, the most common way to find things is via online sharing of designs. Thingiverse is the most popular online community for “thing” sharing. Thingiverse items are freely available (usually under Creative Commons licenses), but there is also commercial potential: companies like Shapeways offer both manufacturing (using industrial 3D printers and manual post-processing) and marketing services for “thing” designs.
I’ve become a huge fan of Thingiverse.  You can check out my own user profile to find things that I’ve designed myself, or things that I’ve virtually “collected” because I thought they were really cool or useful (or both). Thingiverse is run by MakerBot, which manufactures and sells 3D printers, and needs to help people find things to print. It’s a social networking site centered around “thing” designs. Consequently, the main entities are people (users) and things, and links/relationships revolve around people creating things, people liking things, people downloading and making things, people virtually collecting things, and so on. Other than people-thing relationships, links can also represent people following other people (a-la Twitter or Facebook), and things remixing other things (more on this soon). Each thing also has a number of associated files (polygon meshes for 3D printing, vector paths for lasercutting, original CAD files—anything that’s needed to make the thing).
The data is quite rich and interesting. I chose to start with the remix relationships. When a user uploads a new design, they can optionally enter one or more things that their design “remixes”. In a sense, a remix is similar to a citation, and it conflates a few related meanings. It can indicate an original source of inspiration; e.g., I see a design for 3D printable chainmail and decide that I could use a similar link shape and pattern to make a chain link bracelet.  I could design the bracelet from scratch, using just the chainmail idea, or perhaps I could download the original chainmail CAD files (if their creator made them available) and re-use part of the code/design.  A remix could also indicate partial relatedness: I download and make a 3D printer (yes, it’s possible, if you have the time—or, in this case, you can buy it instead) and decide to design a small improvement to a part.  Finally, a remix may indicate use of a component library (e.g., for embossed text, gears, electronic components, and much more).
Remix links can also be created automatically by apps. Like any good social networking platform, Thingiverse also has an API for 3rd party apps. The most popular Thingiverse app is the Customizer: anyone who can write parametric CAD designs may upload them and allow other users to create custom instances of the general design by choosing specific parameter values (which can be dimensions, angles, text or photos to emboss, etc).  For example, the customizable iPhone case allows you to choose your iPhone model, the case thickness, and the geometric patterns on the back.  Another popular parametric design is the wall plate customizer, which allows you to choose the configuration of various cutouts (for power outlets, switches, Ethernet jacks, etc) and print a custom-fit wallplate. A parametric design is essentially a simple computer program that describes a solid shape (via constructive solid geometry and extrusion operators). The Customizer will execute this program and render the solid on a server, generating a new thing, which will automatically have a remix link to the original parametric design.
So let’s get back to the remix relationship.  While I was waiting for my 3D printer to arrive, I spent some time browsing Thingiverse.  I noticed that I was frequently following remix hyperlinks to find related things, but following a trail was getting tedious and I was losing track.  So, I decided to make something that gives a bird’s-eye view of those relationships. What are people creating, and how are they reusing both ideas and their digital representations? Last week I hacked together a visualization (using the absolutely amazing D3 libraries) to begin answering this question. Here is the largest connected component of the remix graph, which consists of about 3,500 things (nodes). If you think about it, it’s pretty amazing: more than 5% of the things (or at least those in my data) are somehow related to one another.  It may not seem like much at first, but check out the variety of things and you’ll see some pretty unexpected relationships (direct or indirect).
Clicking on the hyperlink or the image above will take you to an interactive visualization (if you’re on an iPad, you may want to grab your laptop for this component; D3 is pretty darn fast, but 3,500 nodes on an iPad is pushing it a bit :).  You can click-drag (on a blank area) to pan, and turn your scroll wheel (or two-finger scroll on a touchpad, or pinch on an iPad) to zoom. Nodes in red are things that a site editor/curator chose to feature on Thingiverse. Each featured thing is prominently displayed on the site’s frontpage for about a day. Graph links (edges) are directed and represent remix relationships (from a source thing to a derived thing).  If you mouse over a node, you’ll see some basic information in a tooltip, and outgoing links (i.e., links to things inspired or derived from that node) will be highlighted in green, whereas incoming links will be highlighted in orange. You can open the corresponding Thingiverse page to check out a thing by clicking on a graph node.  Finally, on the right-hand panel you can tweak a few more visualization parameters, or choose another connected component of the remix graph.
Before moving on to other components, a few remarks on the graph above: Although cycles are conceivable (I see thing X and it inspires me to remix it into thing Y, then the creator of X sees the remix action in his newsfeed, checks out my thing, and incorporates some of my ideas back into X, adding a remix link annotation in the process), it seems that this never happens in practice: the remix graph (or at least the fraction in this visualization, which is substantial) is a DAG (directed, acyclic graph). Next, many of the large star subgraphs are centered around customizable (parametric) things; for example, the large star on the left is the iPhone case (noted above) and its variations. Most of the remixes are simple instances of the parametric design, but some sport more involved modifications (e.g., cases branded with company logos). However, not all stars fall in this category. For example, the star graph with many red nodes near the bottom left is centered around a 3D scan of Stephen Colbert, made on the set of the show. This has inspired many remixes, into things like ColberT-Rex, or Cowbert. Most of these remixes have one parent node, but some combine more than one 3D model; for example, a cross between Colbert and the Stanford bunny is the Colberabbit, and a cross between Colbert and an octopus is Colberthulu. The original Colbert scan and most of its remixes were featured on Thingiverse’s frontpage (apparently the site editors are huge Colbert fans?).
So, anyway, how about the other connected components? The distribution of component sizes follows a power law (again, click on the image for an interactive plot—singleton components are not included), no surprises here:
Components beyond the giant one are also interesting (as always, click on each image for the interactive visualization).  For example, the component on the left below consists of things inspired by a 3D-printable chainmail design, which also includes things like link bracelets, etc.  The component on the right contains various designs for… catapults!
Some components contain pretty useful stuff, such as the one with items for kids’ parties (e.g., coasters, cookie cutters) — on the left.  Since many people in the community are tinkerers, there are many 3D-printable parts for… 3D printers!  An example is the component on the right, which is centered around the design files for the original MakerBot Replicator, with related items (like covers and other modifications) around it.
Other components contain cool, geeky things, such as the small but well-featured component on the left, with figures and items from the Star Wars universe (including Darth Vader, as well as Yoda, remixed into a “gangsta” and other things). Finally, not all components consist of 3D-printable things. The component on the right has designs for lasercutting plywood so it can be folded, which was remixed into book covers, Kindle covers, and other things:
All this is just a fraction of what’s out there. Thingiverse is also growing at an amazing pace: around March, when I collected some of this data, there were about 60,000 things and now there are over 100,000 (the latter number is based simply on what appear to be linearly assigned thing IDs). That’s roughly a doubling in four months; the exponential trend is still going strong! This is quite impressive given the small (but fast-growing) size of the home 3D printer market.
Visualizing just the remix aspect of the Thingiverse is only a start. For example, another thing I found myself doing when browsing Thingiverse was following indirect same-collection links (rather than direct remix links) to find related items. Once I get over gawking at the graph and all the stuff on Thingiverse (some of which I’ve printed on my Solidoodle), there are a few things to try, both in terms of data/graph properties and in terms of improving the visualization as a tool to browse the Thingiverse and discover interesting and useful things more effectively. If anyone is actually reading this :) and there is something you’d like to see, please chip in with a comment.
Postscript: My favorite cluster among those I spotted in the visualization is probably the one related to Colbert (see above), with the Colberabbit (“a godless killing machine”) a particular favorite. I’ll be printing one of those soon. :)
If it’s technically possible to infer my identity (given a long enough period of observation, and enough resources and time to piece the various, possibly inaccurate, pieces of information together), someone (with enough patience and resources) will likely do it. Therefore, as the amount of data about me tends to infinity (which, on the Internet, it probably does), the fraction that I have to hide in order to maintain my privacy tends to one: you have long-term privacy only if you never reveal anything.  There are various ways of not revealing anything.  One is to simply not do it.  Another might be to keep it to yourself and never put it in any digital media.  Yet another might be encrypting the information.
However, not revealing anything isn’t really a solution (if a tree falls in the forest and nobody hears it… the tree has privacy, I guess).  There is an alternative, of course: precise access control. Your privacy can be safeguarded by a centralized, trusted gatekeeper that controls all access to data. This leads to something of a paradox: guaranteeing privacy (access control) implies zero privacy from the trusted gatekeeper: they (have to) know and control everything.  Many people are still confused about this. For example, a form of this dichotomy can be seen in peoples’ reactions towards Facebook: on one hand, people complain about giving Facebook complete control and ownership of their data, but they also complain when Facebook essentially gives up that control by making something “public” in one way or another. [Note: there is the valid issue of Facebook changing its promises here, but that’s not my point—people post certain information on Facebook and not on, say, Twitter or the “open web” precisely because they believe that Facebook guarantees them access control which, by the way, is a very tall order, leading to confusion on all sides, as I hope to convince you.]
Although I learned not to worry about what can be inferred about me, I am perhaps somewhat worried about knowing who is accessing my data (and making inferences), and how they are using it. Particularly if this is done by parties that have far more resources and determination than myself.  However, who uses my information and how is itself another piece of information (data).  Although everything is information, there seems to be an asymmetry: when my information is revealed and used, it may be called “intelligence”, but when the fact that it was used is revealed, it may be called “whistleblowing” or even “treason”.  This asymmetry does not seem to have any technical grounding—one might make valid arguments on political, legal, moral, etc. grounds, but not on technical grounds. Seen in this context, Zuckerberg’s calls for “more transparency” make perfect sense—he’s calling for less asymmetry.
More generally, privacy does not really seem to be a technical problem, much like DRM isn’t really a technical problem.  That privacy can be guaranteed by technical means seems to be a delusion and, perhaps, a dangerous one, because it gives a false sense of security. Privacy is, for the most part, a social, political and legal problem about how data can be used (any and all data!) and by whom. The apparent technical infeasibility of privacy had led me to believe that people will, eventually, get over the idea. After all, privacy is a 200-300 year old concept (at least in the western world; interestingly, Greek did not have a corresponding word until very recently). I may have missed something obvious, however: if privacy is attainable via a centralized, trusted gatekeeper, then perhaps privacy is the “killer app” for centralization and “walled gardens”. “I want full control over your data” is tougher to sell than “I want to protect your privacy”. Which is why Eric Schmidt’s recent backpedaling is somewhat worrying, even if the goal is noble (and there currently isn’t any evidence to believe otherwise).
I don’t think there are any (technical) solutions to privacy.  Also, enforcing transparency is perhaps almost as hard as enforcing privacy, although I have slightly more hope for the former—but that’s a separate discussion.  Privacy is a cat-and-mouse game, much like “piracy” and DRM. However, our expectations should be tempered by the reality of near-zero-cost transmission, collection, and storage of “infinitely” growing amounts of information, and we should perhaps re-examine existing notions of privacy in this light. I find that many non-technical people are still surprised when I explain the simple example in the opening paragraph, even though they consider it obvious in retrospect.
Personally, I find it safer to just assume that I have no privacy. Saves me the aggravation.
At least in data mining, “fully automatic” is an often unquestioned holy grail.  There are certainly several valid reasons for this, such as if you’re trying to scan huge collections of books such as this, or index images from your daily life like this.  In this case, you use all the available processing power to make as few errors as possible (i.e., maximize accuracy).
However, if the user is sitting right in front of your program, watching your algorithms and their output, things are a little different. No matter how smart your algorithm is, some errors will occur. This tends to annoy users. In that sense, actively involved users are a liability. However, they can also be an asset: since they’re sitting there anyway, waiting for results, you may as well get them really involved. If you have cheap but intelligent labor ready and willing, use it! The results will be better or, at the very least, no worse. Also, users tend to remember the failures. So, even if end results were similar on average, allowing users to correct failures as early as possible will make them happier.
Instead of making algorithms as smart as possible, the goal now is to make them as fast as possible, so that they produce near-realtime results that don’t have to be perfect; they just shouldn’t be total garbage. When I started playing with the idea for WordSnap, I was thinking how to make the algorithms as smart as possible.  However, for the reasons above, I soon changed tactics.
The rest of this post describes some of the successful design decisions but, more importantly, the failures in the balance between “automatic” and “realtime guidance”. The story begins with the following example image:
Incidentally, this image was the inspiration for WordSnap: I wanted to look up “inimical” but I was too lazy to type. Also, for the record, WordSnap uses camera preview frames, which are semi-planar YUV data at HVGA resolution (480×320). This image is a downsampled (512×384) full-resolution photograph taken with the G1 camera (2048×1536); most experiments here were performed before WordSnap existed in any usable form. Finally, I should point out that OCR isn’t really my area; what I describe below is based on common sense rather than knowledge of prior art, although just before writing this post I did try a quick review of the literature.
A basic operation for OCR is binarization: mapping grayscale intensities between 0 and 255 to just two values: black (0) and white (1).  Only then can we start talking about shapes (lines, words, characters, etc).  One of the most widely used binarization algorithms is Otsu’s method.  It picks a single, global threshold that minimizes the within-class (black/white) variance or, equivalently, maximizes the between-class variance. This is very simple to implement, very fast, and works well for flatbed scans, which have uniform illumination.
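For reference, here is roughly what Otsu’s method boils down to, given a 256-bin grayscale histogram (a minimal Java sketch for illustration, not the code that ZXing or WordSnap actually use):

// Pick the global threshold that maximizes the between-class variance
// (equivalently, minimizes the within-class variance).
static int otsuThreshold(int[] hist) {
    long total = 0, sumAll = 0;
    for (int v = 0; v < 256; v++) {
        total += hist[v];
        sumAll += (long) v * hist[v];
    }
    long wB = 0, sumB = 0;          // weight and intensity sum of the "black" class
    double bestVar = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; t++) {
        wB += hist[t];
        if (wB == 0) continue;
        long wF = total - wB;       // weight of the "white" class
        if (wF == 0) break;
        sumB += (long) t * hist[t];
        double mB = (double) sumB / wB;
        double mF = (double) (sumAll - sumB) / wF;
        double between = (double) wB * wF * (mB - mF) * (mB - mF);
        if (between > bestVar) { bestVar = between; bestT = t; }
    }
    return bestT;                   // intensities <= bestT map to black, the rest to white
}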
However, camera images are not uniformly illuminated. The example image may look fine to human eyes, but it turns out that even for this image no global threshold is suitable (click on image for animation showing various global thresholds):
If you looked at the animation carefully, you might have noticed that at some point, at least the word of interest (“inimical”) is correctly binarized in this picture. However, if the lighting gradient were steeper, this would not be possible. Incidentally, ZXing uses Otsu’s method for binarization, because it is fast. So, if you wondered why barcode scanning sometimes fails, now you know.
So, a slightly smarter approach is needed: instead of using one global threshold, the threshold should be determined individually for each pixel (i,j). A natural threshold t(i,j) is the mean intensity μw(i,j) of pixels within a w×w neighborhood around pixel (i,j).  The key operation here is mean filtering: convolving the original image with a w×w matrix with constant entries 1/w².
The problem is that, using pure Java running on Dalvik, mean filtering is prohibitively slow. First, Dalvik is fully interpreted (no JIT, yet). Furthermore, the fact that Java bytes are always signed doesn’t help: casting to int and masking off the 24 most significant bits almost doubles running time.
Method  | Dalvik (msec)   | JNI (msec)  | Speedup
--------|-----------------|-------------|--------
Naïve   | 109,882 ± 4,813 | 1,712 ± 261 | 64×
Sliding | 2,435 ± 141     | 71 ± 19     | 34×
JNI to the rescue. The table above shows speedups for two implementations. The naïve approach uses a triple nested loop and has complexity O(w²mn), where m and n are the image height and width, respectively (m = 384, n = 512 in this example). The 1-D equivalent would simply be:
for i = 0 to N-1:
    s = 0
    for j = max(i-r,0) to min(i+r,N-1):
        s += a[j]
where w = 2r+1 is the window size. The second implementation updates the sums incrementally, based on the values of adjacent windows; the complexity is now just O(mn). An interesting aside is the relative performance of two implementations for the sliding window sums. The first checks border conditions inside each iteration:
Initialize s = sum(a[0]..a[r])
for i = 1 to N-1:
    if i > r: s -= a[i-r-1]
    if i < N-r: s += a[i+r]
The second moves the border condition checks outside the loop which, if you think about it for a second, amounts to:
Initialize s = sum(a[0]..a[r])
for i = 1 to r:
    s += a[i+r]
for i = r+1 to N-r-1:
    s -= a[i-r-1]
    s += a[i+r]
for i = N-r to N-1:
    s -= a[i-r-1]
Among these two, the first one is faster, at least on a laptop running Sun’s JVM with JIT (I didn’t time Dalvik or JNI). I’m guessing that the second one messes up loop unrolling, but I haven’t checked my guess.
It turns out that there is a very similar approach in the literature, called Sauvola’s method. Furthermore, there are efficient methods to compute it, using integral images. These are simply the 2-D generalization of partial sums. In 1-D, if partial sums are pre-computed, window sums can be estimated in O(1) time using the simple observation that sum(i..j) = sum(1..j) - sum(1..i-1).
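In case the integral image trick is not obvious, here is a minimal sketch (a hypothetical helper, in Java; as explained below, WordSnap itself skips the extra buffer):

// Build an integral image with a one-pixel zero border, then answer any
// rectangular window sum in O(1).
static long[][] integralImage(int[][] img, int m, int n) {
    long[][] ii = new long[m + 1][n + 1];
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            ii[i][j] = img[i - 1][j - 1] + ii[i - 1][j] + ii[i][j - 1] - ii[i - 1][j - 1];
    return ii;
}

// Sum over rows r0..r1 and columns c0..c1 (inclusive, 0-based).
static long windowSum(long[][] ii, int r0, int c0, int r1, int c1) {
    return ii[r1 + 1][c1 + 1] - ii[r0][c1 + 1] - ii[r1 + 1][c0] + ii[r0][c0];
}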
Sauvola’s method also computes the local variance σw(i,j), and uses a relative threshold t(i,j) = μw(i,j)(1 + λσw(i,j)/127). WordSnap uses the global variance and an additive threshold t(i,j) = μw(i,j) + λσglobal, but after doing a contrast stretch of the original image (i.e., linearly mapping the minimum intensity to 0 and the maximum to 255). Doing floating point math or 64-bit integer arithmetic is much more expensive, hence the additive threshold. Furthermore, WordSnap does not use integral images because the same runtime can be achieved without the need to allocate a large buffer. Memory allocation on a mobile device is not cheap: the time needed to allocate a 480×320 buffer of 32-bit integers (about 600KB total) varies significantly depending on how much system memory is available, whether the garbage collector is triggered and so on, but on average it’s about half a second on the G1. Even though most buffers can be allocated once, startup time is important for this application: if it takes more than 2-3 seconds to start scanning, the user might as well have typed the result.
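To make this concrete, the per-pixel decision is essentially the following (an illustrative sketch; the names and the floating-point λ are placeholders, since, as discussed above, the real code would stick to cheap integer arithmetic):

// Linear contrast stretch: map [min, max] to [0, 255].
static int stretch(int v, int min, int max) {
    return (v - min) * 255 / Math.max(1, max - min);
}

// Additive local threshold: black iff intensity < localMean + lambda * globalSigma.
static boolean isBlack(int stretched, int localMean, double lambda, double globalSigma) {
    return stretched < localMean + lambda * globalSigma;
}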
Anyway, here is the final result of locally adaptive thresholding:
Conclusion: In this case we needed the slightly smarter approach, so we invested the time to implement it efficiently. WordSnap currently uses a 21×21 neighborhood.  Altogether, binarization takes under 100ms.
Another problem is that the orientation of the text lines may not be aligned with the image edges. This is called skew and makes recognition much harder.
Initially, I set out to find a way to correct for skew.  After a few searches on Google, I came across the Hough transform.  The idea is simple.  Say you want to detect a curve described by a set of parameters; e.g., for a line, those would be the distance ρ from the origin and the angle θ. For each black pixel, find the parameter values for all possible curves to which this pixel may belong. For a line, that’s all angles θ from 0 to 180 degrees, and all distances ρ from 0 to sqrt(m²+n²).  Then, compute the density distribution of parameter tuples.  If a line (ρ0,θ0) is present in the image, then the parameter density distribution should have a local maximum at (ρ0,θ0).
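In code, the brute-force accumulator looks roughly like this (purely illustrative, with 1-degree angle bins and 1-pixel distance bins; as explained below, I never actually implemented this on the phone):

// Vote for every line passing through every black pixel, using the standard
// rho = x*cos(theta) + y*sin(theta) parametrization; the largest accumulator
// cell gives the dominant line (and hence the text angle).
static int[][] houghAccumulate(boolean[][] black, int m, int n) {
    int rhoMax = (int) Math.ceil(Math.sqrt((double) m * m + (double) n * n));
    int[][] acc = new int[180][2 * rhoMax + 1];   // theta in [0,180), rho offset by rhoMax
    for (int y = 0; y < m; y++)
        for (int x = 0; x < n; x++)
            if (black[y][x])
                for (int t = 0; t < 180; t++) {
                    double th = Math.toRadians(t);
                    int rho = (int) Math.round(x * Math.cos(th) + y * Math.sin(th));
                    acc[t][rho + rhoMax]++;
                }
    return acc;
}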
If we apply this approach to our example image, the first maximum is detected at an angle of 20 degrees. Here is the image counter-rotated by that amount:
Success!  However, computing the Hough transform is too slow!  Typical implementations bucketize the parameter space. This would require a buffer of about 180×580 32-bit integers (for a 480×320 image), or about 410KB. In addition, it would require trigonometric operations or lookups to find the buckets for each pixel, not to mention the counter-rotation itself. There are obvious optimizations one can try, such as computing histograms at multiple resolutions to progressively prune the parameter space.  Still, the cost implied by back-of-the-envelope calculations put me off from even trying to implement this on the phone. Instead, why not just use the users:
Conclusion: A simple approach with help from the user wins, and the computer doesn’t even have to do anything to solve the problem! Incidentally, the guideline width is determined by the size of typical newsprint text at the smallest distance at which the G1’s camera can focus.
Next, we need to detect individual words.  The approach WordSnap uses is to dilate the binary image with a rectangular structuring element (in the following image, of size 7×7), and then expand a rectangle (shown in green) until it covers the connected component which, presumably, is one word.
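For reference, dilation with a rectangular structuring element amounts to the following (a naive sketch for clarity; WordSnap’s actual implementation is optimized, as noted further below):

// Binary dilation with a (2r+1)x(2r+1) rectangle: an output pixel is set if any
// input pixel within its neighborhood is set. Naive O(r^2 * m * n) version.
static boolean[][] dilate(boolean[][] in, int m, int n, int r) {
    boolean[][] out = new boolean[m][n];
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            boolean set = false;
            for (int di = -r; di <= r && !set; di++)
                for (int dj = -r; dj <= r && !set; dj++) {
                    int y = i + di, x = j + dj;
                    if (y >= 0 && y < m && x >= 0 && x < n && in[y][x]) set = true;
                }
            out[i][j] = set;
        }
    return out;
}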
However, the size of the structuring element should really depend on the inter-word spacing, which in turn depends on the typeface as well as the distance of the camera from the text.  For example, if we use a 5×5 element, we would get the following:
I briefly toyed with two ideas for font size detection. The first is to do a Fourier transform. Presumably the first spatial frequency mode would correspond to inter-word and/or inter-line spacing and the second mode to inter-character spacing. But that assumes we apply Fourier to a “large enough” portion of the image, and things start becoming complicated. Not to mention computationally expensive.
The second approach (which also appears to be the most common?) is to do hierarchical grouping. First expand rectangles to cover individual letters (or, sometimes, ligatures), then compute a histogram of horizontal distances and re-group into word rectangles, and so on. This is also non-trivial.
Instead, WordSnap uses a fixed dilation radius. The implementation is optimized to allow near-realtime annotation of the detected word extent. This video should give you an idea:
Conclusion: Simple wins again, but this time we have to do something (and let the user help with the rest). But, instead of trying to be smart and find the best parameters given the camera position, we try to be fast: fix the parameters and let the user find the camera position that works given the parameters. WordSnap uses a 5×5 rectangular structuring element, although you can change that to 3×3 or 7×7 in the preferences screen. Altogether, word extent detection takes about 150-200ms, although it could be significantly optimized, if necessary, by using JNI only, instead of a mix of pure Java and JNI calls.
I’m now looking into the possibility of moving OCR into the “live” loop: as you move the camera, the phone shows not only the word extent rectangle, but also the recognized word.  Perhaps as a hyperlink to Google, or along with Google Translate results.  Then I can justifiably use the buzzword of the day, “augmented reality”!  It looks like it might just be possible, but let me get back to you in a week or two.  :)
Postscript: Some of the papers referenced were pointed out to me by Hideaki Goto, who started and maintains WeOCR. Also, skew detection and correction experiments are based on this quick-n-dirty Python script (needs OpenCV and it ain’t pretty!). Update (9/2): Fixed really stupid mistake in parametrization of line.
Overall, the Android APIs are quite impressive, even though some edges are still rough. It was reasonably easy to get up to speed, even though my prior experience on mobile application frameworks was zero. The toughest part was getting used to the heavily event-based programming style, as well as the idea that your code may be interrupted, killed and restarted at any time.
Activity lifecycle. Although Android supports multitasking and concurrency, on a mobile device with limited memory and no swap it’s likely that the O/S will have to kill some or all of your tasks to reclaim resources needed by higher-priority, user-visible processes (e.g., an incoming phone call). If you have non-persistent or external state, such as open database connections or separate threads that fetch data in the background, things may get a little tricky. Although Android has auxiliary features such as managed cursors and dialogs, you still need to know they exist and use them properly.
However, even things like screen orientation changes are handled by terminating and restarting any affected activities. At first, while spending a couple of hours to figure out why my app was crashing when I opened the keyboard, I bitched about this. Apparently, I wasn’t the only one who was confused. To my surprise, I found that many Android Market apps crash when the screen is rotated.  Some Market apps even come with grave-sounding warnings that, e.g., “the life counter [sic] resets on screen orientation change =/ Will fix for new version.” Luckily, I also found numerous good posts about orientation changes, such as this or this (the series by Mark Murphy is pretty good, by the way), as well as a post on the official blog.
In retrospect, handling orientation changes in this way is a good thing: it forces app developers to be prepared. After I fixed my code to handle orientation changes gracefully, I found that I was also ready to properly handle other sources of interruption: when an incoming call came as I was testing my app, everything worked out beautifully.
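For the record, the standard pattern is to stash transient state before the activity is torn down and restore it when the activity is re-created, roughly like this (field and key names are made up, not from my app):

import android.app.Activity;
import android.os.Bundle;

public class ExampleActivity extends Activity {
    private String pendingQuery;   // some transient state worth surviving a restart

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        if (savedInstanceState != null) {
            pendingQuery = savedInstanceState.getString("pendingQuery");
        }
    }

    @Override
    protected void onSaveInstanceState(Bundle outState) {
        super.onSaveInstanceState(outState);
        outState.putString("pendingQuery", pendingQuery);
    }
}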
Now, whenever I download an app, I perform the following test: I flip the keyboard open when the app executes a background operation, even if I don’t need to type anything. If the app crashes or gets into an inconsistent state (something that happens surprisingly often), that’s a strong indication that the code is not very robust.
Event handling. For APIs that are so heavily event-based, one of my gripes was that some (but not all) event handlers are based on inheritance rather than delegation. These design choices are probably due to performance reasons that may be specific to Dalvik, the Android VM, which is itself motivated partly by non-technical reasons.
However, inheritance sometimes complicates things. For example, Android supports managed cursors and dialogs via methods in the base Activity class. On more than one occasion I found that managed threads would also be nice.  Implementing this requires hooking into the activity lifecycle events (and has, on occasion, been over-engineered to death). Because there are several Activity subclasses (e.g., ListActivity, PreferenceActivity, etc.), there is no simple way to extend them all. If lifecycle events were handled via delegates, it would be possible to implement a background UI thread manager as, say, an activity decorator that can be added to any activity instance.
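To illustrate what I mean, a delegation-based hook might look something like the following; to be clear, nothing like this exists in the SDK, it is purely hypothetical:

// Hypothetical lifecycle delegate: an activity would register any number of these
// instead of overriding the corresponding methods in a subclass.
interface LifecycleListener {
    void onResume();
    void onPause();
    void onDestroy();
}

// e.g., activity.addLifecycleListener(new ManagedThreadHelper());
// A ManagedThreadHelper could then pause or cancel its background threads in onPause(),
// no matter whether the host is a plain Activity, ListActivity, or PreferenceActivity.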
The delegation-based event model was introduced in Java 1.1 precisely to address such shortcomings of the inheritance-based model. But, being pragmatic about performance on current mobile devices, I should probably not complain too much.  Still, some API design choices seem a bit arbitrary, perhaps even Microsoft-esque: why would performance be an issue with lifecycle events (which are presumably rare, but handlers use inheritance) but not with click events (which are presumably more frequent, but handlers use delegation)?
Data sync and caching. Another gripe was the lack of syncable content providers, something I’ve mentioned before. Also, content providers aren’t really appropriate for network-hosted data. The requirement that content providers use an integer primary key (row ID) is reasonable for local databases and simplifies the APIs, but requires some book-keeping when that’s not the “natural” primary key.
Ideally, I’d like to see some support for caching remote data on the SD card (which would require gracefully handling card removal, and transparently fetching data either from the cache or the network). Although the core APIs provide all that is necessary to implement this from scratch, it was getting too complicated for my simple “weekend hack” app, so I decided to drop it.
I hope that, in the near future, porting web apps to mobile devices will become easier with the support for offline applications and client-side storage in HTML5, as well as the proposed geolocation APIs (all of which are already part of Google Gears). An application manifest might include “web activities”, translating intents into HTTP POST requests, while granting device access permissions to those activities (e.g., see promising hacks such as OilCan). Porting might then involve little more than writing a new stylesheet. Perhaps that’s where Palm is going with its WebOS, which apparently supports both “native application” and “web application” models, but information is rather thin at the moment.
Epilogue. My first Android app was an interesting learning experience, not only from a technical standpoint (perhaps more on this in another post). I also found that Android is quite stable. I sometimes used my phone for live debugging, forcefully killing threads and processes through ADB.  Let me put it this way: if it wasn’t for the RC33 OTA update, my phone would now have an uptime of a few months. For a piece of software that barely existed a year ago, this is impressive.
There is plenty of documentation available, but at times it can take some searching to find the necessary information. Â However, since Android is open-source, it’s always possible to consult the source code itself (which is fairly well-written and documented).
Note: This post was mostly written sometime around February. Since then I had no time to try SDK v1.5, but I believe most points above are still relevant.
First, in a networked environment, it is common standards, rather than a single, common software platform, which further enable information sharing. So, Google may be doing Android for precisely the opposite reason than I originally suggested: to avoid the emergence of a single, dominant, proprietary platform. Chrome may exist for a similar reason. After all, Android serves a purpose similar to a browser, but for mobile devices with various sensing modalities.
Finally, mobile is arguably an important area and Google probably wants to encourage diversity and experimentation which, as I wrote in a previous post, is a pre-requisite for innovation. This is in contrast to the established mentality summarized by the quote I previously mentioned, to “find an idea and ask yourself: is the potential market worth at least one billion dollars? If not, then walk away.” In fairness, this approach is appropriate for preserving the status quo. (By the way, in the same public speech, the person who gave this advice also responded to a question about competition by saying with commendable directness that “Look: we’ll all be dead some day. But there’s a lot of money to be made until then.”) But for innovation of any kind, one should “ask ‘why not?’ instead of ‘why should we do it?’” as Jeff Bezos said, or “innovate toward the light, not against the darkness” as Ray Ozzie said.
This is certainly true of the end-products of intellectual labor, such as the article you are reading. However, it is also true of more mundane things, such as checkbook register entries or credit card activity. Whenever you pay a bill or purchase an item, you implicitly “create” a piece of content: the associated entry in your statement. This has two immediately identifiable “creators”: the payer (you) and the payee. The same is true for, e.g., your email, your IM chats, your web searches, etc. Interesting tidbit: over 20% of search terms entered daily in Google are new, which would imply roughly 20 million new pieces of content per day, or over 7 billion (over twice the earth’s population) per year—all this from just one activity on one website.
When I spend a few weeks working on, say, a research paper, I have certain expectations and demands about my rights as a “creator.” However, I give almost no thought to my rights over the trail of droppings (digital or otherwise) that I “create” each day, by searching the web, filling up the gas tank, getting coffee, going through a toll booth, swiping my badge, and so on. Yet, with the increasing ease of data collection and distribution in digital form, we should re-think our attitudes towards “authorship”.
People call me “Spiros”, my identity documents list me as “Spyridon Papadimitriou” and on most online sites I’m registered as spapadim. However, sometimes I’m s_papadim or spiros_papadimitriou, and so on. Like most people, I lost track of all my accounts a long time ago. Vice versa, I’m not the only “Spiros Papadimitriou” in the real world. For example, I occasionally get confused with my cousin, and receive comments about my interesting architectural designs! Nor am I the only spapadim on the net.
What is missing is a framework and mechanisms that allow (but do not enforce) asserting and verifying which of those labels (i.e., names, userids, etc.) refer to the same entity (i.e., me). However, this is a prerequisite: how can we talk about data ownership and tackle portability, transparency and accountability, if we have to jump through countless hoops just to prove identity?
Some people, especially in the US, may object or even outright panic at the thought of such a global identifier. In Greece, and in much of Europe, we’ve had national identity cards for decades. Which is fine, as long as you know they exist and what the permissible uses are; in other words, as long as transparency is ensured. Furthermore, the illusion of privacy should not be confused with privacy itself—if in doubt, I suggest reading “Database Nation” (official site). Its examples are largely US-centric, but the lessons are not.
OpenID (despite some shortcomings) and OAuth are emerging as open standards for authentication and authorization. OpenID allows reuse of authentication credentials from one site on others: I can reuse, say, my Google username and password to log in to other sites (e.g., to leave a comment on this blog), without having to create yet another account from scratch. OAuth resembles Kerberos’s ticket granting service but for the web, permitting other web services to ask for access to a subset of personal information: I could allow Facebook to access only my Google addressbook and not, potentially, all of my data on any Google service. OpenID and OAuth can, at least in principle, work together.
Both high-profile individual developers and major companies are involved in these efforts. For example, Yahoo! already supports OpenID and plans to support OAuth as well, while Google supports OAuth directly and OpenID indirectly in various ways. Wide adoption of these standards would be a major step forward for data portability and web interoperability. However, I suspect they fall slightly short of providing a truly permanent and global personal identity. What if, for any reason, my Yahoo! account disappears, either because I decided to shut it down or because Yahoo! went bust?
I was going to suggest a DNS-based solution and I was surprised when I found that the generic top-level domain .name has existed since 2001 to provide URIs for personal identities. You can register for a free three-month trial on FreeYourID (after that, it’s $11/year). What’s more, their service already provides OpenID authentication. In principle, this should allow easy switching of authentication and authorization service providers. Just as I can still keep the “label” for this site even if I move to a different web host, I can still keep my personal “label” no matter who I choose to manage my personal information. So, now my universal username is spiros.papadimitriou.name, any emails sent to spiros@papadimitriou.name will find their way to me, you can call me on Skype using spiros.papadimitriou.name/call, and so on.
With such a unique identity tied to authorization and authentication services, the Giant Global Graph and its materializations would be one step closer to becoming really useful. If I want to use my identity to log and control access to my data, I should be able to prove my claims. Currently, FOAF and XFN allow asserting relationships but provide no way to verify them.
The point of this mental exercise so far is the following: A unique identity that can be verifiably associated with each and every data item that I produce is a prerequisite for making data ownership claims. Subsequently, we need to ask what fundamental rights should be associated with data ownership. The first is the right to keep my information with me or, in other words, “data portability”. Just as I can freely move my money from one financial institution to another, I should be able to move any of my information from one data warehouse to another.
For example, consider my web search history. I don’t think I need to argue about the importance of historical information to improve search quality. If I decide for any reason to move to another search provider, I should be able to carry along all the information that’s directly associated with me. This should include my search keyword history, as well as any additional information I may have contributed.
The actual details, however, may not be that straightforward. Take, say, the third hit on a Google search. Who is the “creator”? Me by entering the search keywords, Google by producing the search results in response to those keywords, or the person who wrote the web page that contains them in the first place? Similarly, when I buy gas, who is the “creator” of the transaction entry: me, Mobil, or American Express?
Even though intuition can often be wrong, my intuitive response to the Google search example would be that both I and Google have an ownership claim on this particular search, which includes the query keywords as well as a ranking of URLs. On the other hand, the person who wrote the contents of, say, the third URL has ownership claims only on those, and not the search results. Furthermore, the thousands of people that provided feedback to Google’s ranking algorithms by clicking on this URL on similar searches have ownership claims on those searches, but not on mine.
Finally, those two ownership claims (on keywords and on rankings) should probably not be treated the same. If they were, then, say, MS Live could effectively copy Google by getting many users to move. It seems reasonable to have the right to move my search history, but not the actual search results. However, I can imagine that some form of ownership claim on the rankings may be useful for other personal rights.
This is a highly idealized example and I’m not sure what an appropriate litmus test for ownership is, but some form of legal consensus must be in place.
The second fundamental right is that I should know who is using my personal information and how. For example, if an insurance company accesses my credit history to give me a rate quote, I can find this out. It may not be a completely painless process but it is certainly possible today, with a regulatory framework that ensures this. Similar regulations should be instituted to cover any and all forms of access to personal information.
Data access should be fully transparent to all parties involved. If an insurance company accesses my medical records, I should know this. If the government does a background check on me, I should know this too. Transparency is a prerequisite for accountability. Otherwise, individuals have very limited power to protect themselves from improper uses of their personal information.
Much of the privacy research in computer science seems to assume that we can keep the existing legal and regulatory frameworks intact. Computer scientists taking such a position is even sadder than lawyers doing so; we have no excuse for failing to understand the technical issues. We cannot and should not make this assumption. Technical solutions should be subsidiary to new regulations. But that doesn’t mean technologists cannot lead. We should work towards supporting full transparency (for individuals, as well as governments and corporations) rather than opacity, and I’m currently in favor of a “shoot first, ask questions later” approach (and helping lawmakers figure out the answers). After all, if there is anything that the DRM wars have taught us, it’s that information really wants to be free. Why do we think it’s technically hard (to say the least) to prevent copying of music, movies and software, but still think it may be possible to prevent copying of personal information? As I pointed out in an older post, it’s usually the use and not the possession of information that’s the problem.
My point in this post is simple: we should not fight the wrong war. Instead, we need an easy way to make data ownership claims, and use this to enforce at least two fundamental rights: the ability to keep any personal data with us, and the ability to know who is using this data and how.
Postscript. This post was languishing for a while as a draft (originally separated from this post, then forgotten). Since then, a recent MIT TR article discusses some aspects of data ownership. Even better, I have since found an excellent short piece in the same issue by Esther Dyson, with which I could not agree more.
Update. After posting this last night, I did some further Googling and found another piece by Esther Dyson in Scientific American. If you’ve read through my ramblings so far, then I’d urge you to read her article; she’s a much better writer than I am, and has apparently been thinking about these issues for almost a decade, way before many people even knew what the Internet was. I should probably follow her more closely myself, as I agree disturbingly often with what I’ve read from her so far.
I recently upgraded to a T-Mobile G1 (aka. HTC Dream), running Android. The G1 is a very nice and functional device. It’s also compact and decent looking, but perhaps not quite a fashion statement: unlike the iPhone my girlfriend got last year, which was immediately recognizable and a stare magnet, I pretty much have to slap people on the face with the G1 to make them look at it. Also, battery life is acceptable, but just barely. But this post is not about the G1, it’s about Android, which is Google’s Linux-based, open-source mobile application platform.
I’ll start with some light comments, by one of the greatest entertainers out there today: Monkey Boy made fun of the iPhone in January, stating that “Apple is selling zero phones a year“. Now he’s making similar remarks about Android, summarized by his eloquent “blah dee blah dee blah” argument. Less than a year after that interview, the iPhone is ahead of Windows Mobile in worldwide market share of smartphone operating systems (7M versus 5.5M devices). Yep, this guy sure knows how to entertain—even if he makes a fool of himself and Microsoft.
Furthermore, Monkey Boy said that “if I went to my shareholder meeting […] and said, hey, we’ve just launched a new product that has no revenue model! […] I’m not sure that my investors would take that very well. But that’s kind of what Google’s telling their investors about Android.”  Even if this were true, perhaps no revenue model is better than a simian model.
Anyway, someone from Microsoft should really know better—and quite likely he does, but can’t really say it out loud. There are some obvious parallels between Microsoft MS-DOS and Google Android:
An executive once said that money is really made by controlling the middleware platform. Lower levels of the stack face high competition and have low profit margins. Higher levels of the stack (except perhaps some key applications) are too special-purpose and more of a niche. The sweet-spot lies somewhere in the middle. This is where MS-DOS was and where Android wants to be.
Microsoft established itself by providing the platform for building applications on the “revolution” of its day, the personal computer. MS-DOS became the de-facto standard, much more open than anything else at that time. Subsequently, Microsoft took a cut of the profits out of each PC sold ever since. Taiwanese “PC-compatibles” helped fuel Microsoft’s (as well as Intel’s) growth. The rest is history.
In “cloud” computing, the ubiquitous, commodity infrastructure is the network. This enables access to applications and information from any networked device. Even though individual components matter, it is common standards, rather than a single, common software platform, which further enable information sharing. If you believe that the future will be the same as the past, i.e., selling shrink-wrapped applications and software licenses, then Android not only has no revenue model, but has no hope of ever coming up with one. Ballmer would be absolutely right. But if there is a shift towards network-hosted data and applications, money can be made whenever users access those. There are plenty of obvious examples which could be profitable: geographically targeted advertising, smart shopping broker/assistant (see below), mobile office and add-on services, online games (location based or not), and so on. It’s not clear whether Google plans to get directly involved in those (I would doubt it), or just stay mostly on the back end and provide an easy-to-use “cloud infrastructure” for application developers.
The services provided by network operators are becoming commodities. This is nothing new. A quote I liked is that “ISPs have nothing to offer other than price and speed“. I wouldn’t really include security in their offerings, as it is really an end-to-end service. As for devices, there is already evidence that commoditization similar to that of PC-compatibles may happen. Just one month after Android was open-sourced, Chinese manufacturers have started deploying it on smartphones. Even big manufacturers are quickly getting in the game; for example, Huawei recently announced an Android phone. Most cellphones are already manufactured in China anyway. The iPhone is assembled in Shenzhen, where Huawei’s headquarters are also located (coincidence?). The Chinese already have a decent track record when it comes to building hardware and it’s only a matter of time until they fully catch up.
So, it’s quite simple: Android wants to be for ubiquitous services what MS-DOS was for personal computers. But did Microsoft in the 80s really start out by saying “our revenue model is this: we’ll build a huge user base at all costs, which will subsequently allow us to get $200 out of each and every PC sold”? Not really. Similarly, Google is not going to say that “we want to build a user base, so we can make a profit from all services hosted on the [our?] cloud and accessed via mobile devices [and set-top boxes, and cars, and…].” Such an announcement would be premature, and one of the surest ways to scare off your user base: unless Google first provides more evidence that it means no evil, the general public will tend to assume the worst.
The most interesting feature of Android is its component-based architecture, as pointed out by some of the more insightful blog posts. Components are like iGoogle gadgets, only Android calls them “activities.” Applications themselves are built using a very browser-like metaphor: a “task” (which is Android-speak for a running application) is simply a stack of activities, which users can navigate backwards and forwards. The platform already has a set of basic activities that handle, e.g., email URLs, map URLs, calendar URLs, Flickr URLs, Youtube URLs, photo capture, music files, and so on. Any application can seamlessly invoke any of these reusable activities, either directly or via a registry of capabilities (which, roughly speaking, are called “intents”). The correspondence between a task and an O/S process is not necessarily one-to-one. Processes are used behind the scenes, for security and resource isolation purposes. Activities invoked by the same task may or may not run in the same process.
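For example, asking for “something that can display this location” is a one-liner, and Android dispatches it to whichever activity has registered for that capability (an illustrative snippet, with a made-up location):

import android.app.Activity;
import android.content.Intent;
import android.net.Uri;

public class MapLauncher extends Activity {
    void showOnMap() {
        // Routed to any activity that registered an intent filter for geo: URIs.
        Intent intent = new Intent(Intent.ACTION_VIEW, Uri.parse("geo:37.422,-122.084"));
        startActivity(intent);
    }
}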
In addition to activities and intents, Android also supports other types of components, such as “content providers” (to expose data sources, such as your calendar or todo list, via a common API), “services” (long-running background tasks, such as a music player, which can be controlled via remote calls) and “broadcast receivers” (handlers for external events, such as receiving an SMS).
I think that Google is really pushing Android because it needs a component-based platform, and not so much to avoid the occasional snafu. If embraced by developers, this is the major ace up Android’s sleeve. Furthermore, the open source codebase is the strongest indication (among several) that Google has no intention to tightly regulate application frameworks like Apple, or to leverage its position to attack the competition like Microsoft has done in the past. Google wants to give itself enough leverage to realize its cloud-based services vision. If others benefit too, so much the better—Google is still too young to be “evil“. After all, as Jeff Bezos said, “like our retail business, [there] is not going to be one winner. […] Important industries are rarely made by single companies.” I find the comparison to retail interesting. In fact, it is quite likely that many “cloud services” themselves will also become commodities.
I’d wager that really successful Android applications won’t be just applications, but components with content provided over the network. A shopping list app is nice. It was exciting in the PalmPilot era, a decade ago. But a shopping list component, accessible from both my laptop and my cellphone, able to automatically pull good deals from a shopping component, and allow a navigation component to alert me that the supermarket I’m about to drive by has items I need—well, that would be great! Android is built with that vision in mind, even though it’s not quite there yet.
It’s kind of disappointing, but not surprising, that many app developers do not yet think in terms of this component-based architecture. In fairness, there are already efforts, such as OpenIntents, to build collections of general-purpose intents. Furthermore, the sync APIs are not (yet) for the faint of heart. Even Google-provided services could perhaps be improved. For example, Google Maps does not synchronize stored locations with the web-based version. When I recently missed a highway exit on the way to work and needed to get directions, I had to pull over and re-type the full address. Neither does it expose those locations via a data provider. When I installed Locale, I had to manually re-enter most of “My Locations” from the web version of Google Maps. So, there are clearly some rough edges that I’m sure will be smoothed out. After all, there have been other rough edges, such as forgotten debugging hooks, something I find more amusing than alarming or embarrassing and certainly not the “Worst. Bug. Ever.”
Android has a lot of potential, but it still needs work and Google should move fast. The top two items on my wish list would be:
I suspect it might not be that hard to build a Google gadget container for Android. Google Gears is already there and some form of interaction with the local device via Javascript is already allowed. Many gadgets don’t need that much screen real estate anyway, so this may be an interesting hack to try out.
But not many people will buy an Android device for what it could do some day. Google has created a lot of positive buzz, backed by a few actual features. Now it needs some sexy devices and truly interesting apps, to really jumpstart the necessary network effect. Building the smart shopping list app should be as easy as building the dumb one. In the longer run, the set of devices on which Android is deployed should be expanded. Move beyond cell phones, to in-car computers, set-top boxes, and so on (Microsoft Windows does both cars and set-top boxes already, but with limited success so far)—in short, anything that can be used to access network-hosted data and applications.
What about advice for CS teachers and professors?
That it’s time for us to start being more honest with ourselves about what our field is and how we should approach teaching it. Personally, I think that if we had named the field “Information Engineering” as opposed to “Computer Science,” we would have had a better culture for the discipline. For example, CS departments are notorious for not instilling concepts like testing and validation the way many other engineering disciplines do.
Is there anything you wish someone had told you before you began your own studies?
Just that being technically strong is only one aspect of an education.
[…]
Alice has proven phenomenally successful at teaching young women, in particular, to program. What else should we be doing to get more women engaged in computer science?
Well, it’s important to note that Alice works for both women and men. I think female-specific “approaches” can be dangerous for lots of reasons, but approaches like Alice, which focus on activities like storytelling, work across gender, age, and cultural background. It’s something very fundamental to want to tell stories. And Caitlin Kelleher’s dissertation did a fantastic job of showing just how powerful that approach is.
The interview was conducted a few weeks before his death. I’ll just say that, somehow, I suspect someone not in his position would never have said at least one of these things. It’s a sad thought, but Randy’s message is, as always, positive.
“The combine harvester, […] is a machine that combines the tasks of harvesting, threshing and cleaning grain crops.” If you have acres upon acres of wheat and want to separate the grain from the chaff, a group of combines is what you really want. If you have a bonsai tree and want to trim it, a harvester may be less than ideal.
MapReduce is like a pack of harvesters, well-suited for weeding through huge volumes of data residing on a distributed storage system. However, a lot of machine learning work is more akin to trimming bonsai into elaborate patterns. Vice versa, it’s not uncommon to see trimmers used to harvest a wheat field. Well-established and respected researchers, as recently as this year, write in their paper “Planetary Scale Views on a Large Instant-messaging Network“:
We gathered data for 30 days of June 2006. Each day yielded about 150 gigabytes of compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eight-processor server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a pipeline of four threads that parse the data in parallel, collapse the session join/leave events into sets of conversations, and save the data in a compact compressed binary format. This process compressed the data down to 45 gigabytes per day. Processing the data took an additional 4 to 5 hours per day.
Doing the math, that’s five full days of processing to parse and compress the data on a beast of a machine. Even more surprisingly, I found this exact quote singled out among all the interesting results in the paper! Let me make clear that I’m not criticizing the study; in fact, both the dataset and the exploratory analysis are interesting in many ways. However, it is somewhat surprising that, at least among the research community, such a statement is still treated more like a badge of honor rather than an admission of masochism.
The authors should be applauded for their effort. Me, I’m an impatient sod. Wait one day for the results, I think I can do that. Two days, what the heck. But five? For an exploratory statistical analysis? I’d be long gone before that. And what if I found a serious bug half way down the road? That’s after more than two days of waiting, in case you weren’t counting. Or what if I decided I needed a minor modification to extract some other statistic? Wait another five days? Call me a Matlab-spoiled brat, but forget what I said just now about waiting one day. I changed my mind already. A few hours, tops. But we need a lot more studies like this. Consequently, we need the tools to facilitate them.
Hence my decision to frolic with Hadoop. This post focuses on exploratory data analysis tasks: the kind I usually do with Matlab or IPython/SciPy scripts, which involve many iterations of feature extraction, data summarization, model building and validation. This may be contrary to Hadoop’s design priorities: it is not intended for quick turnaround or interactive response times with modestly large datasets. However, it can still make life much easier.
First, we start with a very simple benchmark, which scans a 350GB text log. Each record is one line, consisting of a comma-separated list of key=value pairs. The job extracts the value for a specific key using a simple regular expression and computes the histogram of the corresponding values (i.e., how many times each distinct value appears in the log). The input consists of approximately 500M records and the chosen key is associated with about 130 distinct values.
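For reference, the whole job is little more than the following sketch (written from memory against the 0.16-era mapper/reducer interfaces; the key name and regex are placeholders, not the actual fields in the logs):

import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ValueHistogram {
    public static class ExtractMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Pattern P = Pattern.compile("(?:^|,)someKey=([^,]*)");
        private static final LongWritable ONE = new LongWritable(1);
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            Matcher m = P.matcher(line.toString());
            if (m.find()) out.collect(new Text(m.group(1)), ONE);   // emit (value, 1)
        }
    }

    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text value, Iterator<LongWritable> counts,
                           OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            long sum = 0;
            while (counts.hasNext()) sum += counts.next().get();    // occurrences of this value
            out.collect(value, new LongWritable(sum));
        }
    }
}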
The plot above shows aggregate throughput versus number of nodes. HDFS and MapReduce cluster sizes are always equal, with HDFS rebalanced before each run. The job uses a split size of 256MB (or four HDFS blocks) and one reducer. All machines have a total of four cores (most Xeon, a few AMD) and one local disk. Disks range from ridiculously slow laptop-type drives (the most common type), to ridiculously fast SAS drives. Hadoop 0.16.2 (yes, this post took a while to write) and Sun’s 1.6.0_04 JRE were used in all experiments.
For such an embarrassingly parallel task, scaleup is linear. No surprises here, but it’s worth pointing out some numbers. As you can see from the plot, extracting simple statistics from this 350GB dataset took less than ten minutes with 39 nodes, down from several hours on one node. Without knowing the details of how the data were processed, if we assume similar throughput, then processing time of the raw instant messaging log could be roughly reduced from five days to just a few hours. In fact, when parsing a document corpus (about 1TB of raw text) to extract a document-term graph, we witnessed similar scale-up, going down from well over a day on a beast of a machine, to a couple of hours on the Hadoop cluster.
Hadoop is also reasonably simple to program with. Its main abstraction is natural, even if your familiarity with functional programming concepts is next to none. Furthermore, most distributed execution details are, by default, hidden: if the code runs correctly on your laptop (with a smaller dataset, of course), then it will run correctly on the cluster.
Linear scaleup is good, but how about absolute performance? I implemented the same simple benchmark in C++, using Boost for regex matching. For a rough measure of sustained sequential disk throughput, I simply cat the same large file to /dev/null.
I collected measurements from various machines I had access to: (i) a five year old Mini-ITX system I use with my television at home, running Linux FC8 for this experiment, (ii) a two year old desktop at work, again with FC8, (iii) my three year old Thinkpad running Windows XP and Cygwin, and (iv) a recent IBM blade running RHEL4.
The hand-coded version in C++ is about 40% faster on the older machines and 33% faster on the blade [Note: I’m missing the C++ times for my laptop and its drive has crashed since then — I was too lazy to reload the data and rerun everything, so I simply extrapolated from single-thread Hadoop assuming a 40% improvement, which seems reasonable enough for these back-of-the-envelope calculations]. Not bad, considering that Hadoop is written in Java and also incurs additional overheads to process each file split separately.
Perhaps I’m flaunting my ignorance but, surprisingly, this workload was CPU-bound and not I/O-bound—with the exception of my laptop, which has a really crappy 2.5″ drive (and Windows XP). Scanning raw text logs is a rather representative workload for real-world data analysis (e.g., AWK was built at AT&T for this purpose).
The blade has a really fast SAS drive (suspiciously fast, except perhaps if it runs at 15K RPM) and the results are particularly instructive. The drive reaches 120MB/sec sustained read throughput. Stated differently, the 3GHz CPU can only dwell on each byte for 24 cycles on average, if it’s to keep up with the drive’s read rate. Even on the other machines, the break-even point is between 30-60 cycles [Note: The laptop drive seems to be an exception, but I wouldn’t be so sure that at least part of the inefficiency isn’t due to Cygwin].
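For what it’s worth, the break-even arithmetic above is nothing more than dividing the clock rate by the sustained disk throughput:

// Average CPU cycles available per byte if the CPU is to keep up with the disk;
// e.g., roughly 3e9 cycles/sec over 1.2e8 bytes/sec gives the couple dozen
// cycles per byte quoted above.
static double cyclesPerByte(double clockHz, double diskBytesPerSec) {
    return clockHz / diskBytesPerSec;
}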
On the other hand, the benchmark throughput translates into 150-500 cycles per byte, on average. If I get the chance, I’d like to instrument the code with PAPI, validate these numbers, and perhaps obtain a breakdown (average cycles per byte for regex state machine transitions, average cycles per record for hash updates, etc.). I would never have guessed the numbers would be so high, and I still don’t quite believe it. In any case, if we believe these measurements, at least 4-6 cores are needed to handle the sequential read throughput from a single drive!
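To spell out the back-of-the-envelope arithmetic behind these statements, using the blade’s numbers from the previous paragraph (and treating MB as 2^20 bytes):

    break-even budget:  3×10^9 cycles/s ÷ (120 × 2^20 bytes/s)  ≈  24 cycles per byte
    measured demand:    150-500 cycles per byte
    cores to keep up:   150 ÷ 24  ≈  6  (more at the high end of the measured range)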
The common wisdom in algorithms and databases textbooks, as far as I remember, was that when disk I/O is involved, CPU cycles can be more or less treated as a commodity. Perhaps this is an overstatement, but I didn’t expect it to be so off the mark.
This also raises another interesting question, which was the original motivation for measuring on a broad set of machines: what is the appropriate cost-performance balance between CPU and disk for a purpose-built machine? I thought one might get away with a setup similar to active disks: a really cheap, power-efficient Mini-ITX board attached to a couple of moderately priced drives, for example this configuration, which was once used in the WayBack machine (I just found out that the VIA-based models have apparently been withdrawn, but the pages are still up for now). That does not seem to be the case.
The blades may be ridiculously expensive, perhaps even a complete waste of money for a moderately tech-savvy person. However, you can’t just throw together any old motherboard and hard disk, and magically turn them into a “supercomputer.” This is common sense, but some of the hype might have you believe the opposite.
Once the original, raw data is processed, the representation of the features relevant to the analysis task typically occupies much less space. In this case, a bipartite graph extracted from the same 350GB text logs (the details don’t really matter for this discussion) takes up about 3GB, or two orders of magnitude less space.
The graph shows aggregate throughput for one iteration of an algorithm similar to k-means clustering. This is fundamentally very similar to computing a simple histogram. In both cases, the output size is very small compared to the input size: the histogram has size proportional to the number of distinct values, whereas the cluster centers occupy space proportional to k. Furthermore, both computations iterate over the entire dataset and perform a hash-based group-by aggregation. For k-means, each point is “hashed” based on its distance to the closest cluster center, and the aggregation involves a vector sum.
Nothing much to say here, except that the linear scaleup tapers off after about 10-15 nodes, essentially due to lack of data: the fixed per-split overheads start dominating the total processing time. Hadoop is not really built to process datasets of modest size, but fundamentally I see nothing to prevent MapReduce from doing so. More importantly, when the dataset becomes really huge, I would expect Hadoop to scale almost-linearly with more nodes.
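For concreteness, here is a minimal, hypothetical sketch of one such iteration in the same old-API style as the earlier snippet. The input format (one whitespace-separated point per line) and the trick of passing centroids through the JobConf are simplifications for illustration; real code would more likely ship centroids via a side file in HDFS or the DistributedCache.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class KMeansIteration {
      // One whitespace-separated point per line.
      static double[] parsePoint(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] p = new double[parts.length];
        for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
        return p;
      }

      // Map: "hash" each point to its nearest centroid, exactly like a group-by key.
      public static class AssignMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centroids;

        public void configure(JobConf job) {
          String[] specs = job.get("kmeans.centroids").split(";");  // e.g. "1.0 2.0;3.5 0.1"
          centroids = new double[specs.length][];
          for (int i = 0; i < specs.length; i++) centroids[i] = parsePoint(specs[i]);
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<IntWritable, Text> out, Reporter reporter)
            throws IOException {
          double[] p = parsePoint(value.toString());
          int best = 0;
          double bestDist = Double.MAX_VALUE;
          for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < p.length; i++) {
              double diff = p[i] - centroids[c][i];
              d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
          }
          out.collect(new IntWritable(best), value);
        }
      }

      // Reduce: the per-group aggregate is a vector sum plus a count;
      // dividing gives the new centroid.
      public static class RecenterReducer extends MapReduceBase
          implements Reducer<IntWritable, Text, IntWritable, Text> {
        public void reduce(IntWritable clusterId, Iterator<Text> points,
                           OutputCollector<IntWritable, Text> out, Reporter reporter)
            throws IOException {
          double[] sum = null;
          long n = 0;
          while (points.hasNext()) {
            double[] p = parsePoint(points.next().toString());
            if (sum == null) sum = new double[p.length];
            for (int i = 0; i < p.length; i++) sum[i] += p[i];
            n++;
          }
          StringBuilder sb = new StringBuilder();
          for (int i = 0; i < sum.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(sum[i] / n);
          }
          out.collect(clusterId, new Text(sb.toString()));  // the new centroid
        }
      }
    }

The structure is the same as the histogram job: scan everything, group by a small key, aggregate.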
Hadoop can clearly help pre-process the raw data quickly. Once the relevant features are extracted, they may occupy at least an order of magnitude less space. It may be possible to get away with single-node processing on the appropriate representation of the features, at least for exploratory tasks. Sometimes it may even be better to use a centralized approach.
My focus is on exploratory analysis of large datasets, which is a prerequisite for the design of mining algorithms. Such tasks typically involve (i) raw data pre-processing and feature extraction stages, and (ii) model building and testing stages. Distributed data processing platforms, and Hadoop in particular, are well-suited for such tasks, especially the feature extraction stages. In fact, tools such as Sawzall (which is akin to AWK, but on top of Google’s MapReduce and protocol buffers) excel at the feature extraction and summarization stages.
The original, raw data may reside in a traditional database, but more often than not they don’t: packet traces, event logs, web crawls, email corpora, sales data, issue-tracking ticket logs, and so on. Hadoop is especially well-suited for “harvesting” those features out of the original data. In its present form, it can also help in model building stages, if the dataset is really large.
In addition to reducing processing time, Hadoop is also quite easy to use. My experience is that the programming effort compares very favorably to the usual approach of writing my own, quick Python scripts for data pre-processing. Furthermore, there are ongoing efforts for even further simplification (e.g., Cascading and Pig).
I was somewhat surprised by the CPU versus I/O trade-offs for what I would consider real-world data processing tasks. Perhaps I was also influenced by the original work on active disks (one of the inspirations for MapReduce), which suggested using the disk controller to process data. However, there is a cross-over point between the performance of active disks and centralized processing, and I was way off in my initial guess of how much CPU power it takes to reach a reasonably low cross-over point (which is workload-dependent, of course; any results herein should be treated as indicative, not conclusive).
Footnote: For what it’s worth, I’ve put up some of the code (and hope to document it sometime). Also, thanks to Stavros Harizopoulos for pointing out the simple cycles-per-byte metric.
However, the article offers no concrete examples at all, so I’ll venture a suggestion. In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, MapReduce is emerging as a predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. It has attracted a lot of attention, proving both a source of inspiration and a target of polemic for prominent database researchers.
In database terminology, MapReduce is an execution engine, largely unconcerned about data models and storage schemes. In the simplest case, data reside on a distributed file system (e.g., GFS, HDFS, or KFS), but nothing prevents pulling data from a large data store like BigTable (or HBase, or Hypertable), or any other storage engine, as long as it can expose its records for parallel reading (in Hadoop terms, as long as an appropriate InputFormat exists).
Arguably, MapReduce is powerful both for the features it provides and for the features it omits in order to keep the programming abstraction clean and simple, which facilitates usability, efficiency, and fault tolerance.
Most of the fundamental ideas for distributed data processing are not new. For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that “people think there is something new in all this; there isn’t, it’s all Gamma”—and he’s probably right. None of the original Google papers claims any fundamental discoveries. Focusing on “academic novelty” (whatever that may mean) is irrelevant. Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted blog post and its followup miss the point. The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer’s vision on clusters of commodity hardware.
Michael Stonebraker and David DeWitt do have a valid point: we should not fixate on MapReduce; greater things are happening. So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?
Data types: In database speak, these are known as “schemas.” Google’s protocol buffers are the underlying API for data storage and exchange. This is also nothing radically new; in essence, it is a binary XML representation, somewhere between the simple XTalk protocol which underpins Vinci and the WBXML tokenized representation (both slightly predating protocol buffers and both now largely defunct). In fact, if I had to name a major weakness in the open source versions of Google’s infrastructure (Hadoop, HBase, etc.), it would be the lack of such a common data representation format. Hadoop has Writable, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to replication of effort in many projects that rely on Hadoop (such as Nutch, Pig, Cascading, and so on). Interestingly, the rcc record compiler component (which seems to have fallen into disuse) was once called Jute, possibly with grander plans than what came to be. So, I was pleasantly surprised when Google decided to open-source protocol buffers a few days ago—although it may now turn out to be too little, too late.
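To illustrate just how bare-bones Writable is, here is a hypothetical record type (the class and fields are invented): every field must be written and read back by hand, in matching order, with no schema, versioning, or cross-language story, which is exactly the gap a shared format like protocol buffers fills.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // Hypothetical (user, messageCount) record, hand-rolled as a Writable.
    public class UserCount implements Writable {
      private String user;
      private long messageCount;

      public void write(DataOutput out) throws IOException {
        // The field order below is the implicit "schema"; nothing enforces it.
        out.writeUTF(user);
        out.writeLong(messageCount);
      }

      public void readFields(DataInput in) throws IOException {
        user = in.readUTF();
        messageCount = in.readLong();
      }
    }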
Data access: In the beginning there was BigTable, which has recently been followed by HBase and Hypertable. It started out fairly simple, as “a sparse, distributed, persistent multidimensional sorted map,” to quote the original paper. It is now part of the Google App Engine and even has support for general transactions. HBase, at least as of version 0.1, was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team’s excellent track record so far. While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. It turns out this has already been done (JIRA entry and wiki entries). I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don’t have their place).
Job and schema management: Pig, from the database community, is described as a parallel dataflow engine and employs yet another special-purpose language that tries to look a little like SQL (but it is no secret that it isn’t). Cascading has received no attention in the research community, but it merits a closer look. It is based on a “build system” metaphor, aiming to be the equivalent of Make or Ant for distributed processing of huge datasets. Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (at the moment, Groovy). As I have used neither Cascading nor Pig yet, I will reserve further comparisons. It is worth noting that both projects build upon Hadoop core and do not integrate, at the moment, with other components such as HBase. Finally, Sawzall deserves an honorable mention, but I won’t discuss it further as it is a closed technology.
Indexing: Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic. I suspect that IR-style indices, such as Lucene, have much to offer (something that has not gone unnoticed)—more on this in another post.
A number of other projects are also worth keeping an eye on, such as CouchDB, Amazon’s S3, Facebook’s Hive, and JAQL (and I’m sure I’m missing many more). Most of them are, of course, open source.