marcua's blog

Fun with Voter Data

Sun, 28 Feb 2016 11:57:47 -0500

Since elections are on everyone’s mind, I played around with some voter data. Your name/address/phone/party affiliation/participation is available in public voter data. I created an example of how it can be used in a Jupyter/IPython notebook.

This was my first use of Jupyter notebooks to write a data analysis piece. It was a lot of fun, and I hope all of you do it too!

Some fun findings: 1) two 115-year-old active voters, and 2) an example of how a campaign can create a call/mailing list of active voters to ask them for additional help.

Thanks to Meredith Blumenstock and Derek Willis for reading an early draft of this!

Crowdsourced Data Management: Industry and Academic Perspectives

Thu, 28 Jan 2016 14:41:02 -0500

A few years ago, Aditya and I were catching up at Voltage cafe in Kendall Square when he asked if I’d be interested in writing a book on crowd-powered data processing systems. At the time, he was a postdoc at MIT and I was in startupland at Locu, and in the time that he became a professor at UIUC and I co-founded a company, we went ahead and finished the book.

The book, which is freely available as a PDF, has two parts. In the first half, we review the state of academic research in crowdsourcing, with a special eye for data processing. The first half was a natural follow-on to our research in grad school. The second half of the book features summaries of 13 interviews with industry users of crowd work and 4 operators of crowdsourcing marketplaces. This half is filled with summary statistics and rich quotes from folks at companies like Google, Facebook, and Microsoft on how they manage large crowd workforces, what their use cases are, which aspects of the research literature they benefit from, and where they could use a little more help from researchers.

I really enjoyed two aspects of working on this book. First, it was wonderful to work with Aditya, who I never got to collaborate with in grad school. Second, the experience opened Aditya and me up to just how much you can learn from qualitative work like the interviews and surveys in the second part of the book. Both of us felt that this second lesson would have a lasting impact on how we approach learning a new topic, and how to keep industry and academia in sync on the most important problems in a field.

My only regret with the book is that, due to the formatting guidelines of our publisher, the Acknowledgements section is at the end of the book. One of my not-so-secret delights is reading the acknowledgements that people put in their Ph.D. theses, and I like it when they can be front and center. Nonetheless, it’s there, and I’m grateful!

To make the book more accessible, we’ll be putting together summaries of our favorite sections as blog posts. You can read the first one on Aditya’s blog.

Argonaut: Processing Complex Work with the Crowd

Tue, 01 Sep 2015 13:37:59 -0400

One of my favorite times of year at a company is when interns join for the summer. Internships are great avenues for those fun projects you’ve had in the back of your mind but haven’t had time to test out. Two summers ago at Locu, I had the great fortune to work with Daniel Haas, a grad student at Berkeley’s AMP Lab. His three months of work laid the foundation for a paper on a framework called Argonaut that he and Lydia Gu, my Locu/GoDaddy colleague who joined the project, are presenting today at VLDB.

While we primarily used Argonaut for structured data extraction, Argonaut’s concepts can be applied to other areas of complex crowd work. It’s amazing to think our learnings come from over half a million hours of worker contributions. Without them, none of these learnings would be possible.

As a sneak peek, I’ll highlight a few fun learnings we report on in more detail in the paper. I’ll also editorialize a bit on what I think the findings mean for crowd work, especially as people do more interesting and complex things with it. Here’s the scoop:

Complex work. Traditionally in crowd work, we’re told to design microtasks: simple yes/no or multiple choice questions that are well defined. You can imagine this is pretty dehumanizing and not inspiring to workers. With the Argonaut model, we send large, meaty tasks to workers. Tasks might take upward of an hour to complete, and are generally easier to design since there’s no microtask decomposition to think about. They are closer to what you’d imagine knowledge work being like: we trust humans to do what they’re good at on challenging tasks.

Review, don’t repeat. To avoid workers making mistakes in traditional microtask work, we send multiple workers each task, and use voting-based schemes (like majority vote or expectation maximization) to identify the correct answer. With Argonaut, we do something different: only one worker completes each complex task. Entry-level workers are sometimes reviewed by trusted ones, which allows us to catch mistakes, and also allows us to send tasks back to workers so that they can correct them and learn by example. In the paper we show that review works: a large majority of tasks that are reviewed end up of higher quality, and workers get to see how to improve their own work, unlike the opaque voting-based schemes of the microtask world.

Spotcheck with help from models. The technical heart of the Argonaut paper is a TaskGrader model. We built a regression on a few hundred features of each task, like the worker’s previous work history, the length of the task, the time of day in the worker’s timezone, etc. The regression predicted, based on these features, how much a review might change/improve a worker’s work. Given a fixed budget for review or a fixed number of reviewers, we can now identify which tasks the reviewers should look at for maximal task quality improvement. In the paper, we find that for a practical review budget, you can catch around 50% more errors with the same amount of review, just by pointing reviewers at the tasks that will benefit most from their attention.

Optimizing for longevity and upward mobility changes everything. One topic/hypothesis that doesn’t receive enough attention in the paper is that having long-term relationships with crowd workers changes everything. Half of the crowd workers contributed to our system for more than 2.5 years. The ones that performed the best ended up being promoted to reviewer status, and were selected to do more interesting work when it came up. This had pretty drastic effects on our worker and task models. Hidden in Figure 7 of the paper is a neat finding: almost the entirety of the predictive power of the TaskGrader comes from task-specific features. Worker-specific features on their own don’t appear to be too predictive: by the time you establish long-term relationships with workers, the discerning properties of a task’s quality are not the trusted people you’re working with, but the difficulty of the task they are completing. This is in stark contrast to traditional microtask crowd work, where the most celebrated work quality algorithms identify the trusted workers and weigh their responses more heavily.
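
The spot-checking idea described above can be sketched in a few lines of Python. This is only an illustration, not the TaskGrader from the paper: it assumes some model has already produced a predicted improvement score for each task, and the names (select_for_review, predicted_gain, the task ids and scores) are all hypothetical. Given a fixed review budget, it picks the tasks where review is predicted to help the most.

```python
from typing import Dict, List


def select_for_review(predicted_gain: Dict[str, float], budget: int) -> List[str]:
    """Return the ids of up to `budget` tasks with the highest predicted gain.

    `predicted_gain` maps a task id to a model's estimate of how much a
    review would improve that task. The model itself is out of scope here.
    """
    if budget <= 0:
        return []
    # Sort by predicted gain (descending); break ties by task id so the
    # result is deterministic.
    ranked = sorted(predicted_gain.items(), key=lambda kv: (-kv[1], kv[0]))
    return [task_id for task_id, _ in ranked[:budget]]


# Hypothetical model outputs, made up for illustration only.
predicted = {"task-a": 0.02, "task-b": 0.40, "task-c": 0.15, "task-d": 0.31}

# With room to review two tasks, spend the budget where it should help most.
chosen = select_for_review(predicted, budget=2)
print(chosen)
```

A random spot check would pick any two of the four tasks; ranking by predicted gain is what lets the same amount of review catch more errors, provided the predictions are any good.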

While this paper brings us to the tip of the iceberg of complex work and hierarchical machine-mediated review, there are a ton of questions we have yet to answer. Most important to me are questions around just how complex the work we do with these models can be. Can we support high-quality creative and analytical tasks beyond structured data extraction? How generalizable is the TaskGrader to other tasks? Finally, what does it mean for crowd work if longevity and upward mobility matter as much as they do in traditional employment scenarios?

A data differ to help journalists

Tue, 02 Jun 2015 22:14:47 -0400

I recently read an article that reminded me of a type of reporting I’ve seen a few times now. In this article, the reporters compare a medical expenses dataset from this year to the one from last year. They report how some aggregates (e.g., average price) grouped by various fields (e.g., treatment type) have changed over time.

It would be nice to have a utility that, given two datasets (e.g., two csv files) that are schema-aligned, returns a report of how they differ from one another in various ways. The utility could take hints of interesting grouping or aggregate columns, or just randomly explore the pairwise combinations of (grouping, aggregate) and sort them by various measures like largest deviation from their own group/across groups.

There are a few challenges with the “just show me interesting combinations” version of this:

The approach suffers from multiple hypothesis testing and you’re likely to end up finding differences where they might not actually exist.
The system is going to present a bunch of different combinations to the user, resulting in overload. We’d have to think up some interface to present the various findings for them to be useful.
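
To make the idea concrete, here is a minimal sketch of the "hinted" version of such a differ, using only the Python standard library. Everything in it is an assumption for illustration: the function names (group_means, diff_report), the column names (treatment, price), and the tiny inline datasets are made up. It computes the mean of one aggregate column per group in each dataset and sorts the groups by how much that mean changed.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def group_means(rows, group_col, value_col):
    """Mean of value_col for each distinct value of group_col."""
    values = defaultdict(list)
    for row in rows:
        values[row[group_col]].append(float(row[value_col]))
    return {group: mean(vals) for group, vals in values.items()}


def diff_report(old_rows, new_rows, group_col, value_col):
    """List (group, old_mean, new_mean, delta), largest absolute change first.

    Only groups present in both datasets are compared.
    """
    old = group_means(old_rows, group_col, value_col)
    new = group_means(new_rows, group_col, value_col)
    report = []
    for group in sorted(old.keys() & new.keys()):
        report.append((group, old[group], new[group], new[group] - old[group]))
    report.sort(key=lambda r: abs(r[3]), reverse=True)
    return report


# Two tiny schema-aligned datasets, invented for this example.
LAST_YEAR = """treatment,price
xray,100
xray,120
mri,900
mri,1100
"""
THIS_YEAR = """treatment,price
xray,115
xray,125
mri,1500
mri,1300
"""

old_rows = list(csv.DictReader(io.StringIO(LAST_YEAR)))
new_rows = list(csv.DictReader(io.StringIO(THIS_YEAR)))
report = diff_report(old_rows, new_rows, "treatment", "price")
for group, before, after, delta in report:
    print(f"{group}: {before:.2f} -> {after:.2f} ({delta:+.2f})")
```

This sketch does nothing about the two challenges listed above: looping it over every (grouping, aggregate) pair would run straight into the multiple hypothesis testing problem, and a sorted list is about the crudest possible interface for presenting findings.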

update with related work:

Manasi Vartak, Aditya Parameswaran and friends are working on SeeDB. SeeDB optimizes for findings that would visualize well, so its goal might be slightly different. It also has a notion of how a query (subset of the dataset) differs from the rest of the dataset, which we could use for comparing two schema-aligned datasets.
Michael Bernstein suggested a look at this paper, which says: “We found that long-term correlation data provided users with new insights about systematic wellness trends that they could not make using only the time series graphs provided by the sensor manufacturers.”

The N=1 Guide to Startups after Grad School

Mon, 11 May 2015 22:40:17 -0400

About 2.5 years ago, I finished a Ph.D. with Sam Madden, David Karger, and Rob Miller, with my thesis at the intersection of Databases and Crowdsourcing. April marked my third year since joining Locu, which was acquired by GoDaddy about a year and a half ago. Having recently moved on from the company, I feel like this is a decent point for reflection.

Doing computer science research as a grad student can be a pretty amazing experience: you have several years to make cool things and scratch a bunch of mental itches. You get to think long and hard about problems, create new ones, and spend healthy amounts of time thinking about solutions.

Really early stage startups are also a place to make cool things, but tend to offer a frenetic whiplash-inducing experience as you try to find products that meaningfully improve the lives of some set of customers. Until you find something that fits, your job at a startup is to iterate quickly and favor fast failures over elegant contributions.

One commonality of grad school research and an early stage startup is the opportunity to work on and build cool and interesting things. Since they differ in their approach substantially, a natural question arises: Are startups a good place to go after you finish grad school?

Hopefully my experience arc of finishing a Ph.D., joining a startup, experiencing an acquisition, participating in an IPO, and deciding to move on can serve as some sort of lesson to others. It’s an N=1 guide and your experiences will be different, so caveat emptor and other warnings in dead languages. I’ve packaged the guide into an incomplete series of lessons targeted at grad students and other folks in the research world who might also be interested in exploring work at a startup.

Expect some funny reactions

I get the sense that by the time I left grad school, startups were a more common post-graduation choice than when I started school. My advisors, family, and friends were really supportive of my decision, but I sometimes got the sense that other people were surprised or taken aback by my decision. This was particularly poignant at conferences I attended a few months before and after defending my thesis. Although most folks took the news as they’d take any other exciting career choice, I also observed a decent amount of fidgeting and uncertainty in some people’s reactions.

It makes sense: the objectives you optimize for as a successful grad student are best suited for establishing a career in academia or industry research. As a result, a natural gut reaction to news of someone’s entering startups is that they are throwing away the work they did in grad school. In my research area, it’s more widely accepted that a natural conclusion of the systems one builds might be their commercialization, but this is less accepted in other research areas. At computer science departments that have had professors or alumni with successful startups, it’s easier to accept a post-grad school startup as a logical next step. What I’ve unscientifically heard from folks at departments with fewer examples of professors or alumni diving into startups is that there’s a stigma attached to it.

To be clear, entering a startup after grad school will likely reduce your chances of re-entering academia down the line, as you’re spending less time than your peers focusing on the sorts of things that academia values. This notion of narrowing down one’s career choices will naturally result in discomfort or concern. Luckily, we’re increasingly seeing examples of people living at the intersection of both worlds, and I hope the stigma/false dichotomy fades with time.

Everyone’s story is different

As I neared the end of grad school, my advisors recommended that I wrap up my thesis and spend the following year conducting an academic job search. That sounded nice: I loved my time in academia, and still love teaching and mentoring people, so I planned on trying that out.

Then the opportunity at Locu opened up. Locu’s founders, who I met during our overlap in grad school, approached me to head up the company’s data and crowdsourcing efforts, which was broadly my area of research. In grad school, I did a lot of systems building around human processes, and I felt like I could do really interesting research in an environment where we could harness more than half a million hours of crowd worker contributions to clean and structure data.

Just like an applied particle physicist would probably benefit a lot from working at CERN, it makes sense that an applied social computing + data systems builder would benefit a lot from working at a place like Locu. If I were, for example, a theoretical computer scientist, that decision would be pretty different.

Managing a team that built systems in my research area made a lot of sense in my situation, but you should be skeptical of people who blindly peddle advice that startups are good for everyone coming out of grad school.

You can try before you buy

I finished up at MIT about 2.5 years ago, but started at Locu part time before that, about 3 years ago. The day after I submitted the last paper (chapter) of my thesis to a conference, I started spending around two days a week at Locu while compiling and editing the document that became my thesis.

This arrangement provided the benefit of trying out a startup before I bought into it fully. I got to learn two important things through it. Most importantly, I learned whether I worked well with the team, and whether the founders and I could work well together. I also got to see whether I enjoyed the type of work I’d be doing. As I describe later, things were initially too hectic to do research, so I had to take a bit of a leap of faith on the research opportunities that would pop up. Given the team’s early excitement at my doing other activities like blogging about what we did, I felt like they would be receptive to answering deeper questions as well.

The arrangement had the downside of being pretty stressful. Finishing up the thesis and putting together a defense was more time consuming than I planned. Startups are hectic places, and the combination of the two made for half a year of stressful times. I enjoyed both sets of activities, but my mind was never at rest, and I had a little scare when, after my defense, my thesis committee asked for a bit more work than I expected. I had to scramble to put in final edits while already working full time at Locu.

It worked out, and I’m grateful I got to try Locu out before I bought in completely. That said, there are likely saner ways to do this: consider interning at a startup for a summer rather than trying to multitask two complex sets of responsibilities.

You can do research at startups

Most people will advise you that it’s hard or impossible to do meaningful research at a startup. I agree that it’s hard, but it’s definitely not impossible. In fact, the kind of research I was able to do at Locu would have been challenging to work through in academia. Amazing crowdsourcing research comes out of academia all the time, but it’s rare that academics have established multi-year relationships with hundreds of paid crowd workers. This long-term, high-resource relationship had an impact on what I view as meaningful research in the area of practical systems to support crowd work.

That said, my job title and job descriptions at Locu or GoDaddy didn’t contain the word “Researcher.” My bosses were very supportive of introspection, though, and were happy if I spent some time on researchy questions. My day-to-day job, however, involved leading teams that made an impact on our customers and products, and when we were a young startup in particular, most of the sleep I lost was on things that weren’t research-related. For the first year and a half or so at Locu, it was hard to do what one might consider academic research.

As the company matured and our business processes became more sane, and in particular after we were acquired and were given the space to think more broadly, it became easier to do what looks more traditionally like research. We’ve got two full papers in submission on the work we’ve done, and it feels nice to be contributing back to the community I came from.

At peak research throughput, my weekly work schedule was still almost entirely not research-related. In general, I spent early mornings and weekends working on things like writing and making figures. In the weeks leading up to paper deadlines, I would spend about half of my working day implementing analyses. That said, as someone who wasn’t explicitly a researcher, I typically kept these activities outside of 9 AM - 5 PM.

In short, if you’re building something interesting at a startup, it’s possible that with time you can ask and answer questions that are interesting to your research community. There are two hacks I’ve found for doing this effectively:

Just like academic advisors hire grad students to augment their research efforts, you can do the same! Hire a graduate student as a summer intern: they get access to some amazing datasets, and you benefit from having an immensely talented person to collaborate with on hard problems for a summer. I was so lucky to work with Daniel Haas from Berkeley when he joined us for a summer as an intern. One of those in-submission papers I mentioned is thanks to the work he did with us one summer on automatically identifying crowd worker output that could stand to be reviewed by more experienced workers. In addition to saving us money and improving work quality, it turned out to be one of my favorite papers to write!
Collaborate with folks that are still in academia. These are people who have the drive and external incentives to deliver meaningful research contributions, and at least in my field are quite interested in working with people embedded in industry. Aditya Parameswaran and I just submitted a first draft of a book on Crowdsourced Data Management to our editor, and I can’t imagine either of us would have been able to do it without the other. Plus, it’s a great excuse to hang out with some wonderfully cool folks like Aditya.

The community might be interested in your work even if you don’t publish

Even if you don’t publish papers on your work, the research community might still be quite interested in you and your research. Luckily, there are several ways to stay involved.

One way to stay involved is to participate in program committees, where you review paper submissions and discuss which papers should be presented at conferences. Greedily speaking, it’s a fun way to keep in touch with your research friends, but it’s also a nice way to stay in touch with what’s happening in the research world. Reviewing papers also serves as a mentorship opportunity: find some coworkers that are thinking about research/grad school and mentor them in how to review papers. They get to kick the tires on a new experience, and you get help reviewing your papers!

Another way to stay involved is to participate in industry panels or submit industry track papers to conferences. Most academic conferences want to learn what’s relevant to practitioners, and as someone that’s been in both worlds, you’re a perfect person to help bridge that gap! One slightly frustrating problem we’ve run into is that our papers have a harder time getting accepted to purely academic research tracks at conferences. But the reviewers always encourage us to submit to industry tracks, which seem a lot more welcoming to our line of work.

There’s also a whole world of non-academic conferences that I didn’t really participate in before graduating. You picked up all of those presentation skills in grad school, and these are nice opportunities to use them! O’Reilly runs a set of conferences like Strata. There are also a bunch of smaller but more intimate communities like Craft Conference, or !!con that might be nice venues for your work. Strata was fun, and while I haven’t been brave enough to submit proposals to the tighter-knit conferences, I hope to join those communities one day and you should too!

One word of warning: These conferences tend to be expensive to attend. By giving a talk, you generally don’t have to pay a registration fee, and the conferences will often fund travel/lodging, so make sure to ask!

A Ph.D. isn’t a perfect degree for startups

Your objectives and reward systems are different in academia and startups, and as a result, you have to hone a different set of skills.

In academia, you optimize for clean, elegant, novel solutions to broad problems that you evaluate in a deep way. In a Ph.D., you spend your time honing your question asking, answering, and presentation skills, and are critiqued on your clarity, generalizability, depth, and novelty. If you can present an interesting question, answer, and evaluation mostly through papers and slides, people are willing to look past code that’s poorly documented, untested, runs on only a single machine, and would likely need to be reimplemented if it were commercialized. (None of these comments are meant to come off as disparaging: researchers are simply optimizing for different objectives.)

In startups, you’re building an ecosystem around some core technological or societal insight. You’re not rewarded for novelty as much as you are for some combination of utility and revenue. Early on, this means rapidly iterating with customers and dropping solutions that don’t work for ones that do. People care less for why something works than for if it works. As you come upon a solution, you focus less on presenting it to your community and more on stabilizing it, automating it, scaling it, and reporting on it.

The human aspects are quite different as well. Whereas academia rewards the degree to which you establish your unique identity and build relationships across institutions, startups optimize for growing a product and a team. Both academia and startups could stand to improve how they think about team health, professional growth, and people’s sense of self-worth, but that’s for another day.

In academia, your initial time horizon for hitting something interesting is a bit longer than in startups, and as a result, you can afford to take the long road to your next experiment. If you discover a tangentially interesting thing along the way, the reward might be your next research project. In startup land, after an initial set of iterations and discovery, it’s your job to set up a process that will keep your company useful, growing, and relevant for years or decades, and only a small part of that involves coming up with something brand new or arguing for its novelty.

Several times at Locu, I found myself short circuiting a tangent that, toward the end of grad school, I would have identified as the start of a new project. At the startup, we had to avoid these tangents because they would have significantly distracted from our core focus at the time. I didn’t always ignore the tangents though, and some resulted in internships that released open source code or papers into the world.

If your academic interests and the focus of a startup align, that might help motivate you to deliver good things in startup land. But don’t confuse your interests for your skills. You’ll have to learn a new set of objectives and approaches if you transition from grad school to a startup.

A Ph.D. is a useful degree for startups

Perhaps if I had gotten six years of experience in industry rather than going to grad school, I’d be in the same position professionally that I am in now. For what it’s worth, I don’t think that I would be in the same position. At both Locu and GoDaddy, I used the skills I picked up in grad school to solve problems and collaborate with and mentor other people. I had to learn a ton beyond what I learned in grad school, but I’m grateful for the useful skills that the Ph.D. offered me.

To start, here’s a list of things you do in grad school, biased toward systems builders: find a good way to state problems, identify solutions to those problems, work on certain problems for months or years, architect and build systems, identify reasonable algorithms, measure things that are hard to measure, mentor some undergrads, collaborate with humans, and communicate clearly in written, spoken, and visual forms. Every single one of these skills is useful basically anywhere you will go, including startups.

At Locu, I mentored and managed some amazing fresh-out-of-undergrad computer scientists. It’s not an exaggeration to say that every single one of them was faster than me at solving well-stated problems. That said, they needed help thinking through problems and solutions, keeping the higher level objective in their heads, and getting comfortable with uncertainty. If there’s one thing grad school prepares you for well, it’s smacking your head and keyboard against an uncertain problem for several months on end until something meaningful falls out, all the while making sure you’re contributing to some cohesive story. A combination of teaching, mentoring, problem-solving, and presentation experiences I picked up in grad school helped me team up with these amazingly talented builders so that we could do something nice.

The other area grad school really helps with is in external communication. At Locu, our founders were certainly the most active communicators. After them, it seems like the more academic members of the team tended to put ourselves out there the most, both in terms of offering to make presentations and in finding paper- or blog post-writing opportunities. These are skills you get comfortable with in academic life, and they translate well to startups and beyond.

Fin.

Don’t listen to any one person’s advice on the ideal life after grad school. Some of my friends from grad school love their lives as professors, and others quite enjoy their lives at larger corporations or industry labs. For me, startups provided a nice way to keep pursuing my grad school interests while working in a different context and at a different scale, and I’m grateful for the experience.

Many thanks to Aditya Parameswaran, Arvind Thiagarajan, Jean Yang, Lydia Gu, Meredith Blumenstock, Michele Catasta, Neha Narula, Nitesh Banta, and Philip Guo for reading a rough draft of this.

Reproducibility in the age of Mechanical Turk: We’re not there yet

Sun, 13 Apr 2014 13:05:25 -0400

There’s been increasing interest in the computer science research community in exploring the reproducibility of our research findings. One such project recently received quite a bit of attention for exploring the reproducibility of 613 papers in ACM conferences. The effort hit close to home: hundreds of authors were named and shamed, including those of us behind the VLDB 2012 paper Human-powered Sorts and Joins, because we did not provide instructions to reproduce the experiments in our paper. I’m grateful to Collberg et al. for their work, as it started quite a bit of discussion, and in our particular scenario, resulted in us posting the code and instructions for our VLDB 2012 and 2013 papers on github.

In cleaning up the code and writing up the instructions, I had some time to think through what reproducibility means for crowd computing. Can we, as Collberg et al. suggest, hold crowd research to the following standard:

Can a CS student build the software within 30 minutes...without bothering the authors?

My current thinking is a strong no: not only can crowd researchers not hold their work to this standard of reproducibility, but it would be irresponsible for our community to reach that goal. In fact, even if we opt for a different interpretation of reproducibility that requires an independent reconstruction of the research, making crowdsourcing research reproducible requires care.

Reproducibility is a laudable goal for all sciences. For computer science systems research, it makes a lot of sense. Systems builders are in the business of designing abstractions, automating processes, and proving properties of the systems they build. In general, these skills should lend themselves nicely to building standalone reproductions-in-a-box that make it easy to rerun the work of other researchers.

So why does throwing a crowd into the mix make reproducibility harder? It’s the humans. Crowd research draws on human-oriented social sciences like psychology and economics as much as it does on computer science, and as a result we have to draw on approaches and expectations that those communities set for themselves. The good thing is that in figuring out an appropriate standard for reproducibility, we can borrow lessons from these more established communities, so the solutions do not need to be novel.

How does crowd research challenge the laudable goal of reproducibility?

From here, I’ll spell out what makes crowd research reproducibility hard. It won’t be a complete list, and I won’t pose many solutions. As a community, I hope we can have a larger discussion around these points to define our own standards for reproducibility.

Humans don’t fit nicely into virtual machines. You pay crowd workers to do work for you when you want to add a human touch: there’s some creativity or decisionmaking that you couldn’t automate, but a human could do quite nicely on your behalf. Whereas you might package a system/experiment with a complex environment into a virtual machine for reproducibility’s sake, you can’t quite do the same with human creativity.
Cost. Crowdsourcing is not unique in costing money to reproduce. Some research requires specialized hardware that is notoriously costly to acquire, install, and administer. Even when the hardware isn’t proprietary, costs can be prohibitive: some labs have horror stories of researchers that accidentally left too many Amazon EC2 machines running for several days, incurring bills in the tens of thousands of dollars. Still, compared to responsibly spinning up a few tens of machines on EC2 for a few hours, crowdsourced workflows can bankrupt you faster.

Each of our VLDB papers cost around $1000 to run: each paper saw about 1000 Turker IDs complete around 65,000 HIT assignments at 1.5 cents apiece. This expense included errors we made along the way, but our errors were nowhere near what they could have been. For example, accidentally creating too many pairwise comparison tasks could have easily increased our costs by an order of magnitude in just a few hours. Reproducing crowdsourced systems research requires an upfront cost in the platform you’re using, but it also requires a nontrivial budget.

This expense is not insurmountable in the way that replicating the recruiting strategy of a psychology experiment or the wetlab setup of a biology experiment might be, but it should at least make you wary of getting up and running in a half hour. Whereas providing researchers with a single script to reproduce all of your experiments would be great for most systems, providing a single executable that spends a thousand dollars in a few hours might be irresponsible.
Investing in the IRB process. One of the warnings we put in our reproducibility instructions was that you’d be risking future government agency funding if you ran our experiments without seeking Institutional Review Board approval to work with human subjects. Getting human subjects training and experiment approval takes on the order of a month for good reason: asking humans to sort some images seems harmless, but researchers have a history of poor judgement when it comes to running experiments on other people. Working with your IRB is a great idea, but it’s another cost of reproducibility.
Data sharing limitations. We could save researchers interested in reproducing and improving on our work a lot of money if we released our worker traces, allowing other scientists to inspect the responses workers gave us when we sent tasks their way. Other researchers could vet our analyses without having to incur the cost of crowdsourcing for themselves.

There are many benefits to such data sharing, but in releasing worker traces, you risk compromising worker anonymity. Turker IDs, while opaque, are not anonymous. As the AOL search log fiasco shows, even if we obscured identifiers further, it’s still possible to identify seemingly anonymous users from usage logs. IRBs are pretty serious about protecting personally identifiable information, and our IRB application does not cover sharing our data for these reasons. This limitation, like the others, is not unsolvable, but it will require the community to come together to figure out best practices for keeping worker identities safe.
Tiny details matter. Crowdsourced workflows have people at their core. Providing workers with slightly different instructions can result in drastically different results. When a worker is confused, they might reach out to you and ask for clarification. How do you control for variance in experimenter responses or worker confusion? What if, instead of requiring informed consent on only the first page a worker sees, as our IRB requested, your IRB asks you to display the agreement on every page? These little differences matter with humans in the loop. Separating the effects of these differences in experimental execution is important to understanding whether an experiment reproduced another lab’s results.
Crowds change over time. When we ran our experiments for our VLDB 2012 paper, we followed the reasonably rigorous CrowdDB protocol for vetting our results. We ran each experiment multiple times during the east coast business hours of different weekdays, trusting only experiments that we could reproduce ourselves. This process helped eliminate some irreproducible results. Several months later, Eugene and I re-ran all of our experiments before the paper’s camera ready deadline. No dice: some of the results had changed, and we had to remove some findings we were no longer confident in. As Mechanical Turk sees changing demographic patterns, you can expect your results to change as the underlying crowd does. These changes will compound the noise that you will already see across different workers. This is no excuse for avoiding reproducibility: every experimental field has to account for diverse sources of variance, but it makes me wary of the one-script-to-reproduce-them-all philosophy that you might expect of other areas of systems research.
Platforms change over time. Even after all of the work we put into documenting our experiments for future generations, they won’t run out of the box. Between the time that we ran our experiments and the time we released the reproducibility code, Amazon added an SSL requirement for servers hosting external HITs. This is a wonderful improvement as far as security goes, but underscores the fact that relying on an external marketplace for your experiments is one more factor to compound the traditional bit rot that software projects see.
Industry-specific challenges. Our VLDB 2012 and 2013 research was performed solely in academia. I’ve since moved to do crowdsourcing research and development in industry. This new environment poses new challenges to reproducibility. In industry, it’s not just the code powering machine learning and workflow design for crowd work that is proprietary: so is the crowd. For our work on the Locu team, we’ve got a few hundred workers that we’ve established long-term relationships with. We’ve had relationships with many crowd workers for over two years. Open sourcing the code behind our tools is one thing, but imagining other researchers bootstrapping the relationships and workflows we’ve developed for the purposes of reproducibility is near impossible. Still, I believe industry has a lot to contribute to discussions around crowd-powered systems: the mechanism design, incentives, models, and interfaces we develop are of value to the larger community. If industry is going to contribute to the discussion, we’ll have to work through some tradeoffs, including less-than-randomized evaluations, difficult-to-independently-reproduce conclusions, and as a result, more contributions to engineering than to science.

As crowd research matures, it will be important for us to ask what reproducibility means to our community. The answer will look pretty different from that of other areas of computer science. What are your thoughts?

Thank you to Peter Bailis and Michael Bernstein for providing feedback on drafts of this piece, and to my coauthors for helping get our work to a reproducible state.

Web Scraping Tools for Non-developers

Sun, 26 Jan 2014 17:41:11 -0500

I recently spoke with a resource-limited organization that is investigating government corruption and wants to access various public datasets to monitor politicians and law firms. They don’t have developers in-house, but feel pretty comfortable analyzing datasets in CSV form. While many public datasources are available in structured form, some sources are hidden in what we data folks call the deep web. Amazon is a nice example of a deep website, where you have to enter text into a search box, click on a few buttons to narrow down your results, and finally access relatively structured data (prices, model numbers, etc.) embedded in HTML. Amazon has a structured database of their products somewhere, but all you get to see is a bunch of webpages trapped behind some forms.

A developer usually isn’t hindered by the deep web. If we want the data on a webpage, we can automate form submissions and key presses, and we can parse some ugly HTML before emitting reasonably structured CSVs or JSON. But what can one accomplish without writing code?

This turns out to be a hard problem. Lots of companies have tried, to varying degrees of success, to build a programmer-free interface for structured web data extraction. I had the pleasure of working on one such project, called Needlebase, at ITA before Google acquired it and closed things down. David Huynh, my wonderful colleague from grad school, prototyped a tool called Sifter that did most of what one would need, but like all good research from 2006, the lasting impact is his paper rather than his software artifact.

Below, I’ve compiled a list of some available tools. The list comes from memory, the advice of some friends that have done this before, and, most productively, a question on Twitter that Hilary Mason was nice enough to retweet.

The bad news is that none of the tools I tested would work out of the box for the specific use case I was testing. To understand why, I’ll break down the steps required for a working web scraper, and then use those steps to explain where various solutions broke down.

The anatomy of a web scraper

There are three steps to a structured extraction pipeline:

Authenticate yourself. This might require logging in to a website or filling out a CAPTCHA to prove you’re not…a web scraper. Because the source I wanted to scrape required filling out a CAPTCHA, all of the automated tools I’ll review below failed step 1. It suggests that as a low bar, good scrapers should facilitate a human in the loop: automate the things machines are good at automating, and fall back to a human to perform authentication tasks the machines can’t do on their own.
Navigate to the pages with the data. This might require entering some text into a search box (e.g., searching for a product on Amazon), or it might require clicking “next” through all of the pages that results are split over (often called pagination). Some of the tools I looked at allowed entering text into search boxes, but none of them correctly handled pagination across multiple pages of results.
Extract the data. On any page you’d like to extract content from, the scraper has to help you identify the data you’d like to extract. The cleanest example of this that I’ve seen is captured in a video for one of the tools below: the interface lets you click on some text you want to pluck out of a website, asks you to label it, and then allows you to correct mistakes as it learns how to extract the other examples on the page.

As you’ll see in a moment, the steps at the top of this list are hardest to automate.
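
For readers who do write code, the extraction step on its own is the easy part. Here is a minimal Python sketch that uses only the standard library to turn an HTML table into CSV. The `TableExtractor` class, the `html_table_to_csv` helper, and the sample page are all made up for illustration; authentication and navigation (the hard steps) are not handled at all.

```python
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def html_table_to_csv(html: str) -> str:
    """Parse table rows out of an HTML string and serialize them as CSV."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()

# Made-up page content standing in for a real search result.
SAMPLE_PAGE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$9.99</td></tr>
  <tr><td>Gadget, large</td><td>$24.50</td></tr>
</table>
"""

csv_text = html_table_to_csv(SAMPLE_PAGE)
print(csv_text)
```

The `csv` module takes care of quoting the cell that contains a comma, which is the kind of detail that hand-rolled string joins tend to get wrong.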

What are the tools?

Here are some of the tools that came highly recommended, and my experience with them. None of them passed the CAPTCHA test, so I’ll focus on their handling of navigation and extraction.

Web Scraper is a Chrome plugin that allows you to build navigable site maps and extract elements from those site maps. It would have done everything necessary in this scenario, except the source I was trying to scrape captured click events on links (I KNOW!), which tripped things up. You should give it a shot if you’d like to scrape a simpler site, and the YouTube video that comes with it helps get around the slightly confusing user interface.
import.io looks like a clean webpage-to-api story. The service views any webpage as a potential data source to generate an API from. If the page you’re looking at has been scraped before, you can access an API or download some of its data. If the page hasn’t been processed before, import.io walks you through the process of building connectors (for navigation) or extractors (to pull out the data) for the site. Once at the page with the data you want, you can annotate a screenshot of the page with the fields you’d like to extract. After you submit your request, it appears to get queued for extraction. I’m still waiting for the data 24 hours after submitting a request, so I can’t vouch for the quality, but the delay suggests that import.io uses crowd workers to turn your instructions into some sort of semi-automated extraction process, which likely helps improve extraction quality. The site I tried to scrape requires an arcane combination of javascript/POST requests that threw import.io’s connectors for a loop, and ultimately made it impossible to tell import.io how to navigate the site. Despite the complications, import.io seems like one of the more polished website-to-data efforts on this list.
Kimono was one of the most popular suggestions I got, and is quite polished. After installing the Kimono bookmarklet in your browser, you can select elements of the page you wish to extract, and provide some positive/negative examples to train the extractor. This means that unlike import.io, you don’t have to wait to get access to the extracted data. After labeling the data, you can quickly export it as CSV/JSON/a web endpoint. The tool worked seamlessly to extract a feed from the Hacker News front page, but I’d imagine that failures in the automated approach would make me wish I had access to import.io’s crowd workers. The tool would be high on my list except that navigation/pagination is coming soon, and will ultimately cost money.
Dapper, which is now owned by Yahoo!, provides about the same level of scraping capabilities as Kimono. You can extract content, but like Kimono it’s unclear how to navigate/paginate.
Google Docs was an unexpected contender. If the data you’re extracting is in an HTML table/RSS Feed/CSV file/XML document on a single webpage with no navigation/authentication, you can use one of the Import* functions in Google Docs. The IMPORTHTML macro worked as advertised in a quick test.
iMacros is a tool that I could imagine solves all of the tasks I wanted, but costs more than I was willing to pay to write this blog post. Interestingly, the free version handles the steps that the other tools on this list don’t do as well: navigation. Through your browser, iMacros lets you automate filling out forms, clicking on “next” links, etc. To perform extraction, you have to pay at least $495.
A friend has used Screen-scraper in the past with good outcomes. It handles navigation as well as extraction, but costs money and requires a small amount of programming/tokenization skills.
Winautomation seems cool, but it’s only available for Windows, which was a dead end for me.

So that’s it? Nothing works?

Not quite. None of these tools solved the problem I had on a very challenging website: the site clearly didn’t want to be crawled given the CAPTCHA, and the javascript-submitted POST requests threw most of the tools that expected navigation through links for a loop. Still, most of the tools I reviewed have snazzy demos, and I was able to use some of them for extracting content from sites that were less challenging than the one I initially intended to scrape.

All hope is not lost, however. Where pure automation fails, a human can step in. Several proposals suggested paying people on oDesk, Mechanical Turk, or CrowdFlower to extract the content with a human touch. This would certainly get us past the CAPTCHA and hard-to-automate navigation. It might get pretty expensive to have humans copy/paste the data for extraction, however. Given that the tools above are good at extracting content from any single page, I suspect there’s room for a human-in-the-loop scraping tool to steal the show: humans can navigate and train the extraction step, and the machine can perform the extraction. I suspect that’s what import.io is up to, and I’m hopeful they keep the tool available to folks like the ones I initially tried to help.

While we’re on the topic of human-powered solutions, it might make sense to hire a developer on oDesk to just implement the scraper for the site this organization was looking at. While a lot of the developer-free tools I mentioned above look promising, there are clearly cases where paying someone for a few hours of script-building just makes sense.

Locu has a new home

Thu, 22 Aug 2013 12:15:00 -0400

On Monday, we announced that Locu has been acquired by GoDaddy. As a friend, technologist, or researcher, the acquisition might initially surprise you. Rather than repeat myself a thousand times, I figured I’d share some thoughts on the topic. Standard caveat: these words represent my thoughts, not my employer’s.

I’m personally excited about the acquisition. We’ve been working with the folks from GoDaddy for several months now, and the team is sharp and energized about helping hundreds of millions of local merchants find their home on the web.
Locu remains Locu as a team, a set of offices, a product, and a mission. For the most part, Locu will be bringing new technology and design to the table, and GoDaddy will be bringing a level of scale that would take years to build up on our own. Locu offers a healthy dose of data structuring and crowdsourcing technology alongside the design chops to make previously complicated things simple. GoDaddy is the largest privately held company in the world that focuses on helping small businesses with their web presence, and brings years of sales and marketing experience to Locu’s products. GoDaddy also has a deep understanding of scale both in terms of the tens of millions of people they work with, and the billions of dollars of revenue they bring in.
Aside from the business side of things, we’re still very excited to be releasing open source projects and publishing more about our approach to structured data extraction and crowd work. The open source and research communities have been so fundamental to what we do, and I’m excited we can continue to repay that debt.
As a human being, I care a lot about the values of the company I work for. It would be ignorant to ignore the fact that previous incarnations of GoDaddy have been responsible for sexist Super Bowl commercials, and have supported web-endangering efforts like SOPA. We’ve been assured that the people who were behind these efforts are no longer working at GoDaddy. In fact, an entirely new leadership team (including CEO, COO, CTO, Chief Architect, etc.) has been put in place since these controversies, and I count myself as one of the folks that expects a lot of them in the coming years.

From everything I’ve heard, I know that acquisitions are hard to execute well. If we pull this off, we’ll be improving the lives of local merchants and crowd workers alike, and putting new force behind structured data. I’m excited to give it a shot!

Many thanks to Rene Reinsberg for giving me feedback on many things in life, including this post.

My N=1 Guide to Grad School

Thu, 22 Aug 2013 11:45:15 -0400

A little delayed, but I put together a guide of advice I’ve given other students in grad school. Send feedback, or write your own!

Locu

Thu, 06 Dec 2012 16:18:36 -0500

Life update: I’ve defended my thesis and I’m now the Director of Data at Locu. This doesn’t change much on the blog, as I’ll still periodically update it with random thoughts. I’m also doing a bit of blogging on the Locu blog on topics like our technology workflow, designing for crowds, and the human side of crowdsourcing.

It’s an exciting and very different next step for me. I’m still very excited about introducing new students to data and computer science, and will keep that up as well.

What Should be Included in a Data Science Curriculum?

Wed, 11 Apr 2012 15:02:00 -0400

(I recently wrote an answer to What Should be Included in a Data Science Curriculum? on Quora. Here’s a subset of that answer)

Eugene Wu and I recently taught a 6-day (3 hours per day) course on data literacy basics targeted at computer science undergraduates. Our initial motivation was selfish: as database researchers, we didn’t have a lot of experience with an end-to-end raw data->data product pipeline. After a few trial runs of our own, we realized certain data processing patterns kept showing up, and saw that we had a small course worth of content on our hands. The important thing here is that even with undergraduate- and graduate-level machine learning, statistics, and database courses under our belts, we still had a lot to learn about working with honest-to-goodness dirty data.

Each module of our course could have had an entire semester dedicated to it, and so we favored basic skills with lots of hands-on experience over intellectual depth and rigor. We kept lectures to 20-30 minutes, giving students the remaining 2.5 hours to go through the labs we set up while we walked around answering questions. Lectures allowed students to know what they were in for at a high level, and the lab portion allowed them to cement those concepts with real datasets, code, and diagrams. All of the course content is available on GitHub, and as an example, here is a direct link to day 1’s lab.

The syllabus we covered was:

Day 1: an end-to-end experience in downloading campaign contribution data from the Federal Election Commission, cleaning it up, and programmatically displaying it using basic charts.
Day 2: visualization/charting skills using election and county health data.
Day 3: statistics to take the hunches they got on day 2 and quantify them, learning about t-tests and linear regression along the way.
Day 4: text processing/summarization using the Enron email corpus.
Day 5: MapReduce to scale up Day 4’s analysis using Elastic MapReduce on Amazon Web Services. This felt a bit forced, but the students were clamoring for distributed data processing experience.
Day 6: the students teach us something they learned on their own datasets using techniques we’ve taught them.
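
To give a flavor of the Day 4 to Day 5 jump, here is a toy, in-process Python sketch of the map/shuffle/reduce pattern applied to word counting. It is not the course's lab code and it does not touch Elastic MapReduce; the function names and the two sample messages are invented for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word.strip(".,!?;:\"'"), 1

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        if key:  # skip tokens that were pure punctuation
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return key, sum(values)

# Made-up messages standing in for an email corpus.
emails = [
    "Please review the attached report.",
    "The report is late, please advise.",
]

mapped = chain.from_iterable(map_phase(doc) for doc in emails)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["report"], counts["please"], counts["the"])
```

On a real cluster the mapper and reducer would run on different machines and the shuffle would be handled by the framework, but the division of labor is the same.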

While we set out to give computer science students with familiarity in Python programming a dive into data, we ended up with folks from the physical sciences, doctors, and a few social scientists who had their own datasets to answer questions about. The last day allowed them to experiment with their new skills on their own data. Attendance on this day was lower than the previous days: the majority of the folks in attendance on day 6 were on the more experienced end, and I suspect that the undergrads, who were not yet exposed to data problems of their own, didn’t find it as engaging. It would be interesting to see how to develop course content that allows self-directed data science for students who still need a bit more inspiration.

I should also say that our attempt is not the first one to bring data to the classroom. Jeff Hammerbacher and Mike Franklin at Berkeley have a wonderful semester-length course on data science. The high-level outline of the course seems similar, but they get farther into data product design, and jump into each topic in more depth. Their resources page has a nice set of links to other educational efforts worth checking out.

One Gray Lady

Sun, 01 Apr 2012 14:07:42 -0400

I consume content through many aggregators, but The New York Times (The Gray Lady) is the single source of content I go to directly at least daily to know what’s happening in the world. While it’s good for news, what sets The Times apart from other content sources is its depth of reporting. There’s one problem, though: by default, longer NYT articles do not appear in Single Page mode. This has caused me problems in the past, ranging in severity from slightly annoying (having to click Next Page) to pretty frustrating (loading articles for offline reading only to realize I only had the first page).

So I created One Gray Lady, a Greasemonkey plugin that loads all NYT content in single page mode.

To install it in Google Chrome or Firefox with the Greasemonkey plugin, click here.

I have only tested the code in Chrome, and while I did a bit of testing on various URLs, I’m sure I missed something. Feel free to send updates or suggestions!

John Glaser on Healthcare Information Technology

Thu, 15 Dec 2011 11:21:00 -0500

I recently sat in on a lecture for Professor Peter Szolovits’s Biomedical Computing course. The lecture was open to a greater audience, given the prominence of the speaker. As a non-expert, I found it to be a useful look into the current state of healthcare IT and the coming legislative and technical challenges facing the industry. My notes are below.

John Glaser, Ph.D.
Formerly CIO of Partners/Brigham And Women’s Hospital
Currently CEO of Siemens Health Services

Free advice: get a healthcare proxy and power of attorney set up. Easier to do now than have someone else guess later how you want to live/die.

Why does Health IT suck?

Not for lack of money put into the system
Not for lack of smart people working on the problem

Current model

Insurance companies/patients pay per volume (per birth, per surgery, etc.) almost regardless of quality
Boards of directors are very conservative. Don’t want to be the board that made an IT decision that made a huge hospital fail.

U.S. Numbers to give context

60% of hospitals are <= 100 beds
Of 500K physicians, majority work in 2-3-doctor practice (not IT-savvy, or modestly interested in IT at best)
2/3 of medical decisions are heuristic/not scientific, and many have a difficult-to-verify outcome
volatile knowledge domain: 700k academic articles have come out in the last (decade?)
20% of doctors are a decade away from retirement, so perhaps newer doctors will bring IT mentality with them?
PricewaterhouseCoopers survey: 58% of (independent?) doctors considering quitting, selling practice, or joining a larger practice
various societies are discussing requirements: to become board (re-)certified (oncology, etc.), you have to show facility in technology.

Health IT Services

huge fragmentation: the 3rd largest health IT services company has 7% of market. if they win every open engagement from now until (?), they will have 11% of the market.
lots of players: 300 electronic health record providers in US, 25% exit and 25% enter per year
engagements are long: bringing up a new hospital IT system takes 2-4 years. from the moment you decide to change IT systems, you will continue to use your old one for the next 4-5 years as you transition.

Affordable Care Act (ACA)

costs are projected to go up 26% in the next decade. ACA stipulates that govt. will compensate 12% more in the next decade: providers have to make up the difference.
to incentivize quality care, govt. will hold on to 10% of payments until you prove treatment was effective (hard to define).
currently, for a single procedure (e.g., total hip replacement) you might get 12 different bills (e.g., surgeon, materials, anesthesia). new system: govt. pays a single provider one bill, with a fixed amount. incentivizes a holistic view.
risk: hospitals go out of business. potential future doctors don’t enter medicine. doctors “fire” bad patients to make their numbers look good.

Consolidation

doctors in small practices joining larger networks to avoid managing the ACA requirements.
single payment requirement will cause groups of doctors to more tightly collaborate (contractually).

Transition challenges

ACA is rolling out over the course of a decade.
need to be careful, since some patients will be handled by old rules, and some by new rules. so do you not apply decision support-based treatment to patients on old rules, or just do fee-for-service? lots of mental overhead for doctors.

Fixed fee challenges

paying a fixed amount per treatment doesn’t work for everything. Diabetes is sort of predictable, but a trauma might range from a broken toe to severe burns on 90% of body.
(Adam’s note) perhaps large pools of insured patients will smooth over the individual spikes in cost of care.

Information Technology needs

systems must span inpatient, outpatient, emergency care, rehab
need revenue cycle + contract management system that handles continuum of care. this is complex: medicare + blue cross might pay diff amounts for “good” diabetes treatment, and “good” might be defined differently.
systems should manage individuals and populations: how did all 100 people w/ respiratory problems do last month? which patients strayed from predicted path? what should have happened? why/why not?
sophisticated business intelligence + analysis: predict who will get worse, etc.
interoperability w/ different providers
rules+workflow engines to ensure followups/next steps/help primary care doctors coordinate care, manage exceptions, follow up properly. also allow this in collaborative care environment w/ lots of specialists checking in and out.
high availability + low total cost of ownership
engage patients

New challenges for primary care physicians (PCPs)

At the moment, PCP moves from one patient to the next every 15 minutes, sees 100s of lab results per day
Only 25% of data from specialists comes back to a PCP within a month
In future, PCPs will be responsible for closing the loop on specialists, tests, etc., with more accountability, but still be given just as much or more information, with similar delays. Workflow management systems are key here!

Interesting technical challenges

filtering patient care notes: 10s of pages of patient care history. No doctor can read them all before seeing patient. how to help doctors find relevant notes across different doctors, annotations, etc.
supporting collaboration between multiple providers
parsing notes to remind providers. e.g., “Ask about patient’s daughter next time.”
cleaning up conflicting medical record data: was it type 1 or type 2 diabetes? was it a heart attack, or just a test for one?

Human-powered Sorts and Joins

Thu, 08 Dec 2011 11:53:55 -0500

(Cross-posted on the Crowd Research Blog)

There has been a lot of excitement in the database community about crowdsourced databases. At first blush, it sounds like databases are yet another application area for crowdsourcing: if you have data in a database, a crowd can help you process it in ways that machines cannot. This view of crowd-powered databases misses the point. The real benefit of thinking of human computation as a databases problem is that it helps you manage complex crowdsourced workflows.

Many crowd-powered tasks require complicated workflows in order to be effective, as we see in algorithms like Soylent’s Find-Fix-Verify. These custom workflows require thousands of lines of code to curry data between services like MTurk and business logic in several languages (1000-2000 in the case of Find-Fix-Verify!). If we provide workflow developers with a set of common operators, like filters and sorts, and a declarative interface to combine those operators, such as SQL or PigLatin, we can reduce the painful crowdsourced plumbing code while focusing on a set of operators to improve as a community.

This is not an academic argument: Find-Fix-Verify can be implemented with a FOREACH-FOREACH-SORT in PigLatin, or a SELECT-SELECT-ORDERBY in SQL, resulting in several tens of lines of code. All told, we can get a two-orders-of-magnitude reduction in workflow code. The task at hand is thus to make the best-of-breed reusable operators for crowd-powered workflows. In our VLDB 2012 paper, we look at two such operators: Sorts and Joins.
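
As a rough illustration of what composing reusable operators buys you, here is a Python analogue of the FOREACH-FOREACH-SORT shape. This is not Qurk's or Soylent's code: `foreach` and `sort_by` are hypothetical operators, and `find_patches`, `propose_fixes`, and `score_fix` are deterministic stand-ins for tasks that a real system would send to crowd workers.

```python
from typing import Callable, Iterable, List, Tuple

def foreach(items: Iterable, task: Callable) -> List:
    """Apply a (crowd) task to each item and flatten the results."""
    return [result for item in items for result in task(item)]

def sort_by(items: Iterable, score: Callable) -> List:
    """Order items by a (crowd-provided) score, best first."""
    return sorted(items, key=score, reverse=True)

# Hypothetical stand-ins for crowd tasks.
def find_patches(paragraph: str) -> List[str]:
    """'Find': flag sentences that look too long (here: more than 8 words)."""
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    return [s for s in sentences if len(s.split()) > 8]

def propose_fixes(patch: str) -> List[Tuple[str, str]]:
    """'Fix': offer candidate rewrites of a flagged sentence."""
    words = patch.split()
    return [(patch, " ".join(words[:n])) for n in (4, 6)]

def score_fix(candidate: Tuple[str, str]) -> int:
    """'Verify': a fake vote count that happens to prefer shorter rewrites."""
    _, rewrite = candidate
    return -len(rewrite.split())

paragraph = (
    "Crowds are useful. "
    "This particular sentence rambles on for far too many words to keep."
)
patches = foreach([paragraph], find_patches)      # FOREACH
candidates = foreach(patches, propose_fixes)      # FOREACH
ranked = sort_by(candidates, score_fix)           # SORT
print(ranked[0][1])
```

The workflow itself is the last three assignments; everything else is the operator library and the task definitions, which is where the community can share and improve code.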

Sorts

Human-powered sorts are everywhere. When you submit a product review with a 5-star rating, you’re implicitly contributing a datapoint to a large product ranking algorithm. In addition to rating-based sorts, there are also comparison-based ones, where a user is asked to compare two or more items along some axis. For a particularly cute example of comparison-based sorting, see The Cutest, a site that identifies the cutest animals in the world by getting pairwise comparisons from heartwarmed visitors.

The two sort-input methods can be found in the image below. On the left, users compare five squares by size. On the right, users rate each square on a scale from one to seven by size after seeing 10 random examples.

In our paper, we show that comparisons provide accurate rankings, but are expensive: they require a number of comparisons quadratic in the number of items being compared. Rating is quite accurate, and cheaper than comparisons: it’s linear in the number of items rated. We also propose a hybrid of the two that balances cost and accuracy, where we first rate all items, and then compare items with similar ratings.
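
Here is a simplified Python sketch of the cost trade-off. It is not the algorithm from the paper: in this toy version "similar ratings" means "identical ratings", the workers are simulated and noise-free, and the squares and function names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def comparison_tasks(n: int) -> int:
    """All-pairs comparison: n choose 2 tasks, quadratic in n."""
    return n * (n - 1) // 2

def rating_tasks(n: int) -> int:
    """One rating per item: linear in n."""
    return n

def hybrid_sort(items, rate, compare):
    """Rate every item, then use pairwise comparisons only to break ties
    among items that received the same rating.

    Assumes items are unique and hashable.
    Returns (items in ascending order, number of comparisons used).
    """
    buckets = defaultdict(list)
    for item in items:
        buckets[rate(item)].append(item)
    ordered, comparisons = [], 0
    for rating in sorted(buckets):
        group = buckets[rating]
        wins = {item: 0 for item in group}
        for a, b in combinations(group, 2):
            comparisons += 1
            wins[a if compare(a, b) else b] += 1
        ordered.extend(sorted(group, key=lambda item: wins[item]))
    return ordered, comparisons

# Simulated "workers": squares are identified by their side length.
squares = [37, 12, 58, 41, 15, 33, 52, 19]

def rate(side: int) -> int:
    return side // 10   # coarse rating

def compare(a: int, b: int) -> bool:
    return a > b        # True if a is the larger square

order, used = hybrid_sort(squares, rate, compare)
print(order)
print("hybrid tasks:", rating_tasks(len(squares)) + used,
      "all-pairs tasks:", comparison_tasks(len(squares)))
```

With real workers the ratings are noisy, so a practical hybrid would compare items whose ratings are merely close rather than equal, and would ask several workers per task.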

These techniques can reduce the cost of sorting a list of items by 2-10x. Human-powered sorts are valuable for a variety of tasks. Want to know which animals are most dangerous? From least to most dangerous, a crowd of Turkers said:

flower, ant, grasshopper, rock, bee, turkey, dolphin, parrot, baboon,
rat, Tasmanian devil, lemur, camel, octopus, dog, eagle,
elephant seal, skunk, hippo, hyena, great white shark, moose,
komodo dragon, wolf, tiger, whale, panther

The different sort implementations highlight another benefit of declaratively defined workflows. A system like Qurk can take user constraints into account (linear costs? quadratic costs? something in between?) and identify a comparison-, rating-, or hybrid-based sort implementation to meet their needs.

Joins

Human-powered Joins are equally pervasive. The area of Entity Resolution has captured the attention of researchers and practitioners for decades. In the space of finance, is IBM the same as International Business Machines? Intelligence analysis runs into a combinatorial explosion in the number of ways to say Muammar Muhammad Abu Minyar al-Gaddafi’s name. And most importantly, how can I tell if Justin Timberlake is the person in the image I’m looking at?

We explored three interfaces for solving the celebrity matching problem (and more broadly, the human-powered entity resolution problem). The first is a simple join interface, asking users if the same celebrity is displayed in two images. The second employs batching, asking Turkers to match several pairs of celebrity images. The third interface employs more complex batching by asking Turkers to match celebrities arrayed in two columns.

As we batch more pairs to match per task, cost goes down, but so does Turker accuracy. Still, we found that we can achieve around a 10x cost reduction without significantly losing in result quality. We can achieve even more savings by having workers identify features of the celebrities, so that we don’t, for example, try to match up males with females.
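The two cost levers above, batching and feature filtering, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the interfaces from the paper: the image records, the `gender` feature, and the batch size are all hypothetical.

```python
from itertools import product

def candidate_pairs(left, right, feature):
    """Keep only pairs whose (crowd-provided) feature values agree."""
    return [(a, b) for a, b in product(left, right) if feature(a) == feature(b)]

def batch(pairs, batch_size):
    """Group pairs into tasks of at most `batch_size` pairs each."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

# Hypothetical data: (image id, gender feature supplied by workers).
left = [("a1", "m"), ("a2", "f"), ("a3", "m"), ("a4", "f")]
right = [("b1", "f"), ("b2", "m"), ("b3", "f"), ("b4", "m")]

all_pairs = list(product(left, right))
filtered = candidate_pairs(left, right, feature=lambda img: img[1])
tasks = batch(filtered, batch_size=4)
print(len(all_pairs), len(filtered), len(tasks))
```

In this toy case, a simple join would post one task for each of the 16 pairs, while filtering on the feature and batching four pairs per task leaves two tasks. The accuracy cost of larger batches is not modeled here.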

We’re Not Done Yet

We now have insight into how to effectively design two important human-powered operators, sorts and joins. There are two directions to go from here: bring in learning models, and design more reusable operators.

Our paper shows how to achieve more than an order-of-magnitude reduction in join and sort costs, but this is often not enough. To further reduce costs while maintaining accuracy, we’re looking at training machine learning classifiers to perform simple join and sort tasks, like determining that Cambridge Brewing Co. is likely the same as Cambridge Brewing Company. We’ll still need humans to handle the really tricky work, like figuring out which of the phone numbers for the brewing company is the right one.

Sorts and joins aren’t the only reusable operators we can implement. Next up: human-powered aggregates. In groups, humans are surprisingly accurate at estimating quantities (jelly beans in a jar, anyone?). We’re building an operator that takes advantage of this ability to count with a crowd.

For more, see our full paper, Human-powered Sorts and Joins.
This is joint work with Eugene Wu, David Karger, Sam Madden, and Rob Miller.

I'm a (STEM) Graduate Student: Please Tax Me

Thu, 17 Nov 2011 10:43:21 -0500

Over the past month, a petition has been circulating asking the Obama administration to bring graduate student stipends back to their pre-1986 tax-exempt status. I urge you to not sign this petition, as it is misguided and damaging to our image. If you believe graduate student researchers are more valuable than their compensation, then demand more compensation, not a tax loophole.

First, the caveat: I can only speak for the STEM fields. In these fields, a combination of government, corporate, and university grants support research-track students in the lab and classroom. This compensation usually comes in the form of full tuition coverage and a stipend in the range of $1500-$2500 per month, and sometimes includes health coverage.

Our stipends put our yearly income at $18,000-$30,000/year. Compare this to a poverty threshold of $18,530 for a family of three, or $29,990 for a family of six. In computer science, you can double your income with a summer internship, placing you above the median 2009 household income. At first glance, it seems like we are reasonably compensated before we take into account the education, advising, networking, and travel opportunities our life decision has earned us.

Of course, the argument in the petition is more nuanced than one of unreasonable taxation. The petition speaks to the value of our “innovative, cutting-edge thinking” relative to “bankers, lobbyists, or hedge-fund managers.” The comparison is certainly timely, but sweeps under the rug other valuable fields, like Nursing or Carpentry. Both of these fields earn more than the median graduate student in STEM, but optimistically, we are in a position of higher upward mobility once we graduate.

Perhaps a better comparison is what we could earn if we had not chosen graduate studies. With a B.S. in Computer Science, my undergraduate colleagues at large technology firms and startups are earning 3-5x what I earn through my stipend. Am I more valuable as a researcher than I would be in their shoes? This seems like a good conversation to have.

This discussion is one of relative value. In the absolute sense, graduate students in STEM are not poor, and should pay taxes in whatever tax bracket we fall. Perhaps we’re not compensated enough for what we provide to society. I would like to believe that STEM’s contribution to social and economic development is significant. If we’re seeing a dearth of STEM researchers and our value to society is high, the market failure should be corrected by the government. Not in the form of yet another tax break, but as an increase in the number of stipends or the amount of compensation distributed per researcher.

STEM is under attack. We should elevate its image by discussing how valuable our work is, not by asking for pity. Demand what you are worth, but remember how lucky you are.

Database papers at CHI

Thu, 26 May 2011 22:05:10 -0400

There is little I like more than a fine cheese and fresh-baked bread. Still, to fill the rest of my day without expanding my waistline, I go for a mix of databases and human-computer interaction. That’s why I was excited to see several database-oriented papers presented at CHI. While many papers contained some amount of data, I’ll stick to the three that are unquestionably of interest to the databases community.

The first paper was for the social scientist in all of us. Amy Voida, Ellie Harmon, and Ban Al-Ani presented Homebrew Databases: Complexities of Everyday Information Management in Nonprofit Organizations. Nonprofits are arguably some of the most difficult database users to design for. They have minimal resources, rarely employ full-time technical staff, and solve non-core problems as they show up. This practice leads to homebrew, just-functional-enough solutions to many data management problems. The authors provide an interesting qualitative study of how nonprofits manage volunteer demographic and contact information. They describe the homebrewed, often fractured collections of data stored in several locations. Reading this paper, I couldn’t help but think of how perfectly these homebrewed databases resembled Franklin, Halevy, and Maier’s dataspaces.

Sean Kandel presented Wrangler, a project he’s been working on with Andreas Paepcke, Joe Hellerstein, and Jeff Heer. Wrangler lets users specify transformations on datasets by example. Each time a user shows Wrangler how to modify a record (or line of unstructured text), Wrangler updates its rank-ordered list of potential transformations that could have led to this modification. Wrangler borrows concepts such as interactive transformation languages from Vijayshankar Raman and Joe Hellerstein’s Potter’s Wheel. Its interface has a taste of David Huynh and Stefano Mazzocchi’s Refine as well as Huynh’s Potluck. Wrangler’s novelty comes in combining the interfaces and transformation languages with an inference and ranking engine. Since Wrangler is hosted, it is also capable of learning which transformations users prefer and improving its rankings over time!

The last slot goes to our own Eirik Bakke, who presented Related Worksheets along with David Karger and Rob Miller. Related worksheets make foreign key references a first-class citizen in the world of spreadsheets. Just as spreadsheets secretly made every office worker capable of maintaining a single-user, single-table relational database, Eirik has secretly enabled those workers to make references between spreadsheets without having to program. While adding foreign key references to a spreadsheet requires a simple user interface modification, its implications on how to display multi-valued cells in the spreadsheet are significant. Read the paper to see Eirik’s hierarchical solution to this problem!

Keep it up, data nerds! Soon we’ll be able to start a data community at CHI!

Evening Project: What Would Hacker News Say?

Thu, 03 Feb 2011 20:16:00 -0500

What Would Hacker News Say (WWHNS) is a bookmarklet that allows you to see if there is a Hacker News (HN) discussion about a page you are currently viewing.

I often find a link through a feed reader or Twitter and want to know if there is an HN thread discussing the link. This happens more often now that I have moved over to following @newsyc20 on Twitter rather than visiting the HN website directly. I batch up a bunch of stories to read at once, and lose context of which HN thread pointed to that page.

The WWHNS bookmarklet, when clicked, looks the current page up in Ronnie Roller’s wonderful HN API, and adds a link to the top right of the current page to any existing HN comment threads.

I tested it in Chrome and Firefox. Let me know if it works in other browsers.

Caveat: This bookmarklet will work for links you followed by way of HN or another source which replicates it. It may not work if you arrived at a page from a source outside of HN, since that link might be slightly different from the one posted to HN.

To use WWHNS

Easy

Drag this WWHNS bookmarklet to your bookmark toolbar.
For any page, click on the WWHNS button in your bookmark toolbar.

Hard

Check out the WWHNS git repository
Type make
Open wwhns.html in a browser
Copy the WWHNS link to your bookmark toolbar
For any page, click on the WWHNS button in your bookmark toolbar

To edit the bookmarklet

Fork this git repository
Edit wwhns.js
Type make
Open wwhns.html in a browser
Copy the WWHNS link to your bookmark toolbar
For any page, click on the WWHNS button in your bookmark toolbar
Push the changes back to me. I’d love to see what you do with it!

License

BSD

Shoutouts

Ben Alman–For the jQuery bookmarklet HOWTO
Ronnie Roller–For an awesome HN API
YUI Compressor–Makes JavaScript small
Hacker News–For having comment threads worth reading

Comments as content: The medium hinders the message

Wed, 09 Jun 2010 09:33:40 -0400

When articles were published in hard-copy newspapers, reader response was left to the ultimate in asynchronous communication: letters to the editor for differences of opinion, and corrections when a mistake was discovered. As brick-and-mortar newspapers moved into the digital realm, the static publishing model initially stuck, albeit with an easier method for correcting mistakes.

When we digest a story published by a large newspaper, be it in digital or dead tree form, we assign the strongest signal to the content of the article. In exchange for giving the journalist our full attention, we expect that the news organization has put significant effort into researching, writing, and editing the story. Newspapers rarely put uncurated content front-and-center, in part because they trust their own vetted content more, and in part to justify the expense that went into their refined content.

Along the path from single-source hard copies of stories to the everyone-gets-a-voice world of microblogging, we got comments. Blogs frequently display discussion threads following each entry, and sites such as Digg, Reddit, and Hacker News provide us with another forum to chat with the community about articles we find interesting.

Many blogging outfits, including those run by organizations as large as the New York Times, now employ comment systems beyond their purpose as a meta-article discussion medium. One often finds blog entries that end with prompts such as “What has your experience been? Let us know in the comments!” Or “If you know more about this late-breaking story, leave a comment below!” In the same way that live-blogging has taken blog entries from static entities to up-to-the-minute documents, comments sometimes become a necessary part of the stories which they adjoin. Slashdot sometimes takes this one step further: when a topic of wide interest appears, the editors open an essentially content-free story with the express purpose of leaving a place for comments.

If comments can sometimes be the content of a story, then why are they always relegated to the bottom of the story? What is the user interface for displaying articles where readers are assigned a reporter’s role? How do we assign prominence to the most informative fragments of story and user-generated content? Flickr and Facebook have figured this out to some extent: you can annotate photos and witness the result in situ. YouTube lets users embed annotations in videos. How do we apply this concept to text media? What tools already do this, and what ideas do you have for improvement? Leave your comment below!

Twitter Papers at the WWW 2010 Conference

Sun, 02 May 2010 16:54:25 -0400

This past week at WWW 2010 has resulted in quite the spread of Twitter papers. Topics included systems, novel uses, and studies of tweets and users. I’ve made an attempt to provide a taste of each paper/presentation I experienced. Feel free to comment if I missed anything!

At the web science conference on Monday, we saw two presentations on Twitter. Devin Gaffney presented a paper entitled #iranElection: quantifying online activism. Devin collected around 766,000 tweets across nearly 74,000 users around the time of the #iranElection. He first showed that there was a spike in signups around the time that #iranElection became a trending topic, seemingly for the purpose of adding #iranElection updates to the tweet stream. A retweet analysis showed that as more users became interested in the #iranElection, users with influence (as measured by follower count or retweet count) lost influence relative to the entirety of relevant users.

Panagiotis Metaxas presented the other paper at the Web Science workshop, entitled From Obscurity to Prominence in Minutes: Political Speech and Real-Time Search. In this work, the authors studied the recent Massachusetts special election between Scott Brown and Martha Coakley through the lens of Twitter. Metaxas presented the notion of twitterbombing, where, similar to googlebombing, sneaky twitter users abuse various mechanisms to appear in the relatively prominent real-time search results that search engines have recently added. 32% of tweets were repeated several times by the same account, presumably in an attempt to increase the ranking of their tweets’ content by naive real-time ranking algorithms. The authors described how they identified Republicans and Democrats through follower and retweet analysis, and showed an example where twitterbombing was used to lead searchers to a page designed to dissuade voters from voting for Coakley.

Next, at the Linked Data Workshop, Joshua Shinavier presented Real-time #SemanticWeb in <=140 characters. Joshua’s goal is to extract structured data from tweets using his TwitLogic system. Instead of extracting data from all tweets, Joshua’s system looks for tweets that follow a format called nanotations and are identified by hashtags. It is unclear what sort of adoption this format will see, but the value in such annotations (as well as those in the up-and-coming twitter annotation system) is that with precise structure, the extracted data can be a far more rich data source for the linked data web.

Moving into the main WWW2010 presentation tracks, Yi Chang and his colleagues at Yahoo! presented Time is of the essence: Improving Recency Ranking using Twitter Data, which studied how to turn relevant and popular tweets into search results. Crawling for real-time content is typically resource-intensive for search engines, which have to frequently revisit many sources of such content, and it burdens the servers of the content providers if they are recrawled too frequently. The authors of this paper studied how to use streaming Twitter results to discover URLs and avoid having to actively recrawl for new content. In a 5-hour sampling of tweets, Chang and team found 1M URLs, and after cleaning these results to avoid spam, adult content, or self-promoting tweets, approximately 5.9% of the URLs remained. From here, the authors describe how various features including tweet content, retweets, and social network topology can be used to rank the discovered URLs. Finally, the authors found that they can use the tweet text describing a URL in much the same way that search engines traditionally use the contents of anchor text linking to a webpage to index discovered URLs.

Next, Haewoon Kwak presented What is Twitter: Social Network or News Media? One impressive contribution of this work is the large dataset that the authors collected, featuring 41.7M user profiles, 1.47B following relations, 4262 trending topics, and 106M tweets mentioning these trending topics. The authors presented some interesting network structure statistics. Twitter has an asymmetric following model, and only 22% of user pairings are symmetric, compared to a symmetric follower rate of around 70-85% on other asymmetric social networks. This should not suggest that Twitter is more a news medium than a social network. For example, Twitter may be a different medium to different users, and the high rate of updates might discourage users from following everyone that follows them. Other interesting factoids presented by the authors included that 96% of retweet trees are of height 1, 35% of retweets occur within 10 minutes of the original tweet, and 55% occur within 1 hour.
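For readers who want to compute a symmetry figure like the 22% above on their own data, here is a small Python sketch. The definition used (the share of connected user pairs that follow each other) is my assumption, and the paper's exact definition may differ. The toy follow list is made up.

```python
def reciprocity(follows):
    """Fraction of connected user pairs in which both users follow each other.

    follows: iterable of (follower, followee) tuples.
    """
    # Deduplicate edges and ignore self-follows.
    edges = {(a, b) for a, b in follows if a != b}
    # Every unordered pair with at least one follow edge between them.
    pairs = {frozenset(edge) for edge in edges}
    # Pairs where the reverse edge also exists.
    mutual = {frozenset((a, b)) for a, b in edges if (b, a) in edges}
    return len(mutual) / len(pairs) if pairs else 0.0

# Made-up follow graph: only ann and bob follow each other.
follows = [
    ("ann", "bob"), ("bob", "ann"),
    ("cat", "ann"), ("dan", "ann"), ("dan", "bob"),
]
print(reciprocity(follows))
```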

Finally, Takeshi Sakaki presented Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors, which described how to build an earthquake detection and location system with the tweetstream as its input. The authors passed all tweets with the term ‘earthquake’ or ‘shaking’ to a classifier, and showed which features of tweets helped classify positive and negative instances of tweets relating to an earthquake. They then built a temporal model to identify when the earthquake-positive tweets strayed from the norm. Finally, they compared several spatial methods for using geotagged tweets to determine the epicenter of an earthquake. The authors point out one weakness in their location logic: their algorithms have a hard time identifying an accurate location of earthquakes which have an epicenter in the ocean.

A summary of the Twitter analysis papers would not be complete without a hat tip to danah boyd, who gave a wonderful keynote which touched on the intersection of big data analysis and privacy. boyd pushed researchers with access to content outside of the context in which it was created, such as a message sent to a friend or a tweet directed at a tight social network, to be ethical with their handling of that data. Doing her talk justice would take a blog post of its own, so I will just mention one point that danah made toward the beginning of her talk. When confronted with a large dataset, big data hackers sometimes equate aggregate statistics to facts that need not be backed or understood by social models, and sometimes fail to think about the limitations of their population samples. Little things matter: sampling 5% of all tweets biases toward users that tweet more frequently. Similarly, sampling 5% of twitter accounts does not properly account for people with multiple accounts/identities or lurkers with no accounts. Social streams are a wonderful data source for data scientists, but we should ford the streams responsibly.
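The sampling point can be made concrete with a little arithmetic. The Python sketch below compares how much of a uniform tweet sample versus a uniform account sample is expected to come from each user. The tweet counts are invented for illustration.

```python
from fractions import Fraction

def expected_share_of_tweet_sample(tweet_counts):
    """Sampling tweets uniformly: a user's expected share is proportional
    to how many tweets they wrote."""
    total = sum(tweet_counts.values())
    return {user: Fraction(n, total) for user, n in tweet_counts.items()}

def expected_share_of_account_sample(tweet_counts):
    """Sampling accounts uniformly: every account is equally likely,
    regardless of how much it tweets."""
    return {user: Fraction(1, len(tweet_counts)) for user in tweet_counts}

# Invented population: one heavy tweeter and two casual ones.
tweet_counts = {"heavy": 90, "casual_1": 5, "casual_2": 5}

by_tweet = expected_share_of_tweet_sample(tweet_counts)
by_account = expected_share_of_account_sample(tweet_counts)
print(by_tweet["heavy"], by_account["heavy"])
```

In this made-up population the heavy tweeter is one account in three, but is expected to contribute nine tenths of a tweet sample. Neither calculation says anything about people with several accounts or with none.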

256 colors in your xterm!

Mon, 29 Mar 2010 12:25:00 -0400

Have you ever used emacs or vim from the command line in GNU/Linux and been offended by the horrible color scheme you saw? I’m embarrassed to admit that I’ve been through tons of vim color schemes and have never been able to understand why the colors did not show up as desired.

Yang’s blog post has changed my life. See here for more notes on which color schemes work well for vim. I’ve been enjoying wombat256.

On Ubuntu on my laptop, I added “export TERM=xterm-256color” to the end of my “~/.bashrc”. After saving your bashrc, you will have to open a new terminal to see the results, or type “source ~/.bashrc” in your current terminal if you’re too antsy.
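If you want a quick check that the terminal itself can draw all 256 colors, a short Python script like the one below prints the palette using the standard 256-color background escape sequence. This is my own sketch, not something from Yang's post, and the function names are made up.

```python
import os

def swatch(n):
    """Two spaces drawn with background color n from the 256-color palette."""
    return f"\033[48;5;{n}m  \033[0m"

def palette(columns=16):
    """All 256 swatches, laid out in rows of `columns` colors."""
    rows = []
    for start in range(0, 256, columns):
        rows.append("".join(swatch(n) for n in range(start, start + columns)))
    return "\n".join(rows)

# Show what the shell reports, then draw the palette.
print("TERM =", os.environ.get("TERM", "(unset)"))
print(palette())
```

If the terminal supports 256 colors you should see a grid of distinct shades. If many of the cells look identical or blank, the terminal emulator is probably the limiting factor rather than the TERM setting.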