Joining Cloudera

August 10, 2009

I will be leaving Yahoo! at the end of this month to join Cloudera.

About five years ago I was working with Mike Cafarella on Apache Nutch, an open-source web-search engine. Initially we were able to crawl and index on four machines in parallel, but only with a lot of manual steps. Inspired by two Google papers, we built a distributed filesystem and a MapReduce implementation that automated most of these steps. Operation became much simpler, and we were then able to easily run Nutch on twenty machines, with near-linear scaling.

But to scale to the many billions of pages on the web, we'd need to be able to run it on thousands of machines. And the more we worked on it, the more I realized that making this happen would take far more developers and resources than we had.

Yahoo! proposed to fill this gap. Eric Baldeschwieler led a team of talented folks, including Owen O'Malley, Sameer Paranjpye, and Nigel Daley, and he said he'd dedicate that team to scaling this system to process the full web. So, three and a half years ago, I joined Yahoo! to help make this happen.

The results exceeded my dreams. First we moved the distributed computing code out of Nutch into a new Apache project christened Hadoop. Then we set out to improve scalability, performance, and reliability, all the while adding many features. After one year, Hadoop was used daily by many research groups within Yahoo!. After two years, it generated Yahoo!'s web-search index, achieving web scale. Now, after three years, Hadoop holds the big-data sort record, and the project has become a de facto industry standard for big-data computing, used by scores of companies. The recent Hadoop Summit was attended by over 750 people from around the world.

Many folks at Yahoo! were instrumental in this story, including: Raymie Stata, Dhruba Borthakur, Arun C Murthy, Devaraj Das, Raghu Angadi, Hairong Kuang, Konstantin Shvachko, Runping Qi, Chris Douglas, Allen Wittenauer, Sharad Agarwal and Hemanth Yamijala, to name just a few. Yahoo! deserves enormous and ongoing thanks for the key role it plays in making Hadoop useful.

Now Hadoop is a thriving open-source project, with large and diverse developer and user communities. Going forward, Cloudera presents an opportunity to work with a wider range of Hadoop users. I hope to help synthesize these many voices into a project that best serves all.

Hadoop has very quickly grown into a large, active project, but it is still young. At Cloudera I will be well positioned to help it mature. This move will not fundamentally change my day-to-day activities: I will continue to work on Hadoop, collaborating closely with developers from Yahoo! and elsewhere to build great software.

Some early Avro benchmarks

May 12, 2009

Avro is my current project. It’s a slightly different take on data serialization.

Most data serialization systems, like Thrift and Protocol Buffers, rely on code generation, which can be awkward with dynamic languages and with datasets whose schemas aren't known until runtime. For example, many folks write MapReduce programs in languages like Pig and Python, generating datasets whose schemas are determined by the scripts that produce them. One of the goals for Avro is to let such applications achieve high performance without forcing them to run external compilers.
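
To make the no-code-generation path concrete, here is a minimal sketch using Avro's generic Java API: the schema is supplied as JSON at runtime and a record is built and serialized against it, with no generated classes anywhere. The record schema is invented for illustration, and the class and method names reflect later Avro releases than the one benchmarked here, so treat this as a sketch rather than the exact API of the day.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroGenericSketch {
      public static void main(String[] args) throws IOException {
        // The schema is ordinary JSON, parsed at runtime -- no compiler, no generated classes.
        // "PageView" and its fields are made up for this example.
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"PageView\", \"fields\": ["
            + "{\"name\": \"url\", \"type\": \"string\"},"
            + "{\"name\": \"hits\", \"type\": \"long\"}]}");

        // Build a record against that schema with the generic API.
        GenericRecord record = new GenericData.Record(schema);
        record.put("url", "http://example.com/");
        record.put("hits", 42L);

        // Serialize it to Avro's compact binary format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();

        System.out.println("serialized " + out.size() + " bytes");
      }
    }

The specific API does the same job through classes generated from the schema; the benchmarks below compare these two paths.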

A few early Avro benchmarks are now in. A month ago, Johan Oskarsson (of Last.fm) ran his serialization size benchmark using Avro. And today, Sharad Agarwal (my Avro collaborator) ran an existing Java serialization benchmark using Avro, and the initial results look decent. Curiously, the performance of Avro's generic (no code generation) and specific (generated classes) APIs diverged significantly and unexpectedly, despite the two sharing much of their implementation. This suggests that both might be easily improved.

Hadoop Sorts a Petabyte

May 12, 2009

Woot! Owen and Arun have posted new Hadoop sort benchmark results. This is a great milestone for both throughput (a petabyte in ~16 hours) and latency (a terabyte in ~1 minute).

Cloud: commodity or proprietary?

April 9, 2008

A few days ago Google announced its App Engine, which lets folks build applications that run in Google’s cloud. Amazon has for a while had a number of services to let folks run applications in Amazon’s cloud. But in both of these cases, one must use their proprietary APIs.

For example, Google provides a datastore API that applications must use to persist state, while Amazon similarly provides its SimpleDB API. Amazon's services are generally lower-level and easier to adopt à la carte, while Google provides one-stop shopping. Either way, one's application code becomes dependent on a particular vendor. This is in contrast to most web applications today, where, with things like the LAMP stack, folks can build vendor-neutral applications from free (as in beer) parts and select from a competitive, commodity hosting market.

As we shift applications to the cloud, do we want our code to remain vendor-neutral? Or would we rather work in silos, where some folks build things to run in the Google cloud, some for the Amazon cloud, and others for the Microsoft cloud? Once an application becomes sufficiently complex, moving it from one cloud to another becomes difficult, placing folks at the mercy of their cloud provider.

I think most would prefer not to be locked in, and would rather that cloud providers sold commodity services. But how can we ensure that?

If we develop standard, non-proprietary cloud APIs with open-source implementations, then cloud providers can deploy these and compete on price, availability, performance, etc., giving developers usable alternatives. But such APIs won’t be developed by the cloud providers. They have every incentive to develop proprietary APIs in order to lock folks into their services. Good open-source implementations will only come about if the community makes them a priority and builds them.

Hadoop is a big initial step in this direction. Its current focus is on batch computing, but several of its components are also key to cloud hosting. HDFS provides a scalable, distributed filesystem. It doesn't yet meet the high-availability requirements of cloud hosting, but it will once the folks who need that pitch in and build it. HBase provides a database comparable to Amazon's SimpleDB and Google's datastore API. It's still young, but, if folks want, it could become a solid competitor to these.
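
As a rough illustration of what a vendor-neutral datastore API looks like, here is a minimal sketch of writing and reading a row with HBase's Java client. The table and column names are invented, and the class and method names come from later HBase releases than existed when this was written; the point is only that the same code runs against any cluster you (or any provider) stand up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
      public static void main(String[] args) throws Exception {
        // Connect to whichever HBase cluster the local configuration points at;
        // nothing here ties the application to a particular vendor.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("pages"))) {

          // Store a row: a key plus a value in a column family.
          // "pages", "info", and "title" are example names, not a real schema.
          Put put = new Put(Bytes.toBytes("http://example.com/"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("title"), Bytes.toBytes("Example"));
          table.put(put);

          // Read it back.
          Get get = new Get(Bytes.toBytes("http://example.com/"));
          Result result = table.get(get);
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("info"), Bytes.toBytes("title"))));
        }
      }
    }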

Moral: if you want commodity cloud hosting, pitch in now.

MapReduce cookbook for machine learning

July 30, 2007

Here’s a paper from Stanford showing how to use MapReduce to scalably implement ten different machine learning algorithms!