Myrrix Joins Cloudera to Bring "Big Learning" to Hadoop

What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.

And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.

What is Old is New Again

Why the fuss now about machine learning, a decades-old field? I started working on recommender systems relatively late, in 2005, as the open-source project Taste. In 2008, this was merged into the open-source machine learning project Apache Mahout, and rebuilt on top of a nascent Hadoop project. Since then, as a committer and part of the Mahout PMC, I have watched interest in machine learning suddenly reignite, and skyrocket, just as interest in this new Hadoop thing did.

That's because the two go together well. Hadoop and cheap hardware have made big data analysis so much more feasible. With cheap disks and CPUs, and mature open-source databases and computation frameworks, startups and even individuals can afford to run terribly complex computations over terabytes.

This is great for machine learning. Generally, learning works better with more data. If the price of collecting and processing data is falling, while the value of learning from it is increasing, then the number of situations where learning is profitable to deploy is exploding. Large-scale machine learning used to be something only a few big specialized companies bothered with; now, data-savvy companies of all sizes can accomplish many viable machine learning projects. And large companies can improve their existing learning by adding orders of magnitude more data into a system that was previously limited by scale.

Making Big Learning Accessible

Cheap infrastructure doesn’t help without accessible applications on top. And machine learning gets surprisingly hard to implement at scale. Most research assumes a world in which all data fits on one machine. Adjusting these ideas to Hadoop’s data-parallel world takes some clever reinvention. This began most visibly in the Mahout project, where many algorithms have been parallelized for Hadoop.

There is still so much to be done from these beginnings before learning on Hadoop is as accessible as it can be. After all, in the early days, Hadoop itself was a ball of source code that only adventurous specialists could effectively embrace. However, Cloudera has shown how to extend it, package it, support it and make it far more accessible to a much bigger audience. The same will happen for applications like Big Learning — that’s always been the Myrrix vision too, and now, we’re working together within Cloudera to start building this out for you, the bigger audience.

That’s All For Now

Exactly what form that will take is to be determined. There are no new products to announce at this point, as we’re busy in the lab figuring out how to incorporate the technology into CDH in just the right way.

Finally, a public word of thanks to users and customers of Myrrix, who also share credit in evolving the vision of what large-scale learning should look like. As you’ve heard, while the software will eventually be discontinued in its current form, it is now freely available and remains fully supported in the medium term.

This should be a new and interesting trip — watch this space.