Ph.D. Interns at Cloudera: Bringing Big Data Back to School

The following is a series of stories from people who recently worked as engineering interns at Cloudera. Their experiences illustrate concretely how collaboration between academia and commercial companies like Cloudera, here in the form of internships, helps promote big data research at universities. (These experiences were previously published in the ACM student journal, XRDS.)

Yanpei Chen (Intern 2011)

I interned with Cloudera during my last summer of grad school. My dissertation was on “Workload Driven Design and Evaluation of Large-Scale Data-Centric Systems”, and I already had collaborations with Facebook and NetApp, two other big data companies. The goal of my work was to develop and demonstrate a set of empirical, workload-driven design and evaluation methods that complemented the traditional, subjective approach of designing by intuition and experience. It was very important that these methods generalized across many types of customer workloads. Hence, when Cloudera offered me an internship, I leapt at the unique opportunity to collect insights from customers in traditional industries who were still dealing with big data.

My internship project was to collect and analyze CDH traces from Cloudera customers. Cloudera’s cutting-edge knowledge allowed it to realize that several MapReduce “benchmarks”, popular even now, were not at all representative of real-world use cases. As Cloudera’s customers increased their MapReduce expertise, they voiced similar concerns. The lack of empirical, real-life cluster traces thus created huge barriers for quality assurance and performance testing of the core CDH product, and for technology certification of Cloudera’s partner vendors. Furthermore, Cloudera’s customers and prospects in non-technology industries were beginning to lament the limited attention paid to their use cases, so empirical insights into real-life use cases also assisted Cloudera’s customer support and marketing efforts.

The actual process of collecting customer cluster traces proved necessarily difficult. Customers were rightly concerned about leaking proprietary information. I was fortunate to have members of the support, marketing, and partner-relations teams helping me initiate and moderate my discussions with customers. Cloudera’s internal infrastructure teams helped set up special file transfers to comply with our customers’ firewall policies. Cloudera executives also occasionally stepped in to offer encouragement and support. In the end, we collected an unprecedented set of real-world MapReduce cluster traces from both technology and traditional enterprises. Insights from this data set have led to key publications in my dissertation, while helping Cloudera’s ongoing efforts in quality assurance, performance testing, technology certification, customer support, and marketing.

The internship was truly a collaborative, multi-disciplinary experience. It also led to a full-time job offer, which I accepted after finishing my dissertation.

Andrew Wang (Intern 2012)

As a part of the AMPLab at Berkeley, much of my research revolves around big data and the components of the Hadoop software ecosystem. More specifically, I’m interested in providing high-level service-level objectives (SLOs) for distributed storage systems.

Working at Cloudera has been an eye-opening experience in three major ways. First, my internship has provided incredible perspective on how storage systems like HDFS and Apache HBase are used in practice. Getting to talk directly with customers and developers has helped me refine my understanding of the problems faced in practice, sometimes in surprising ways. This has influenced my research agenda in terms of both problem selection and approach.

Second, my internship solidified in my mind the importance of academic systems research. The continual stream of support tickets and ship dates in industry can preclude full examination of a design space. Researchers have the luxury of a more measured approach to problem solving which focuses on examining fundamental tradeoffs, methodology, and quantifying differences with other solutions. Part of what impresses me about Cloudera is how closely they watch the output of academic research conferences. If a paper thoroughly solves a real-world problem, it’s likely to be quickly applied to actual code used in production.

Third, Cloudera is a great place to learn and practice open-source software development. Too often, research code is left to succumb to “bit rot” after the associated paper is published. Open-source is an opportunity for researchers to have additional impact and is also a way of further publicizing your research.

Overall, I thoroughly enjoyed my experience at Cloudera. I was able to disseminate my research ideas within the open-source community, as well as work on directly applying them to a product that will ultimately be used by hundreds of companies.

[Ed: Andrew’s internship also led to a full-time offer, which he accepted.]

Brian Martin (Intern 2012)

At UMass, I am a student in machine learning and natural language processing; specifically, I research parallel inference and learning in graphical models for large-scale information extraction.

Cloudera is not only about systems design. In my first month, working directly with Director of Data Science Josh Wills, I have developed several new statistics and machine-learning tools for advanced analytics on Hadoop.

First, I wrote a tool for calculating distance correlation over giant tables of data (e.g. all Chicago crimes and building permits in the last decade grouped by location, or all purchase histories grouped by various demographic variables). Distance correlation is a recent statistical measure of dependence, linear and nonlinear, between variables. This tool will soon be available open-source and as a Cloudera product.
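
[Ed: For readers unfamiliar with distance correlation, the following is a minimal single-machine sketch in Python of the standard sample statistic, computed from double-centered pairwise distance matrices. It is not Brian's tool: the function names are illustrative assumptions, the code handles only two one-dimensional variables, and it builds full n-by-n matrices in memory, so it would not scale to the giant tables described above.]

```python
import math


def _centered_distances(values):
    """Return the double-centered matrix of pairwise absolute distances."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Average of the element-wise product of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(xs, ys):
    """Sample distance correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    a = _centered_distances(xs)
    b = _centered_distances(ys)
    dcov2_xy = _mean_product(a, b)  # squared distance covariance
    dvar2_x = _mean_product(a, a)   # squared distance variance of xs
    dvar2_y = _mean_product(b, b)   # squared distance variance of ys
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        # At least one variable is constant; report no dependence.
        return 0.0
    # Clamp at zero to guard against tiny negative rounding errors.
    return math.sqrt(max(dcov2_xy, 0.0) / denom)


if __name__ == "__main__":
    xs = [-2, -1, 0, 1, 2]
    print(distance_correlation(xs, [2 * x + 1 for x in xs]))  # linear
    print(distance_correlation(xs, [x * x for x in xs]))      # nonlinear
```

In this sketch, an exactly linear relationship gives a value of 1 up to rounding. The quadratic example has a Pearson correlation of zero, yet its distance correlation lies strictly between 0 and 1, which is the nonlinear dependence the measure is meant to pick up.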

Second, I implemented a recently proposed solver for very large-scale linear regression problems using the new Hadoop feature, YARN.  YARN allows for running non-MapReduce applications on a Hadoop cluster, while playing well with the resource manager.

The biggest advantage of doing these projects at Cloudera was the insight into customer needs. Coming from academia, it is difficult to know what companies are doing behind closed doors. With so many customers and so much experience, Cloudera has provided me with a more complete picture of the diversity of industry’s data and needs.  In academia it is all too easy to pursue a novel or cute idea over one with more tangible benefit.

Another advantage of doing this work at Cloudera is the amount of input I was able to have on the tools I was using. For example, I used Apache Crunch, a library for composing pipelines of MapReduce jobs, which was written by my manager. This makes for a very productive loop.  That is, while the project is open-source, being able to meet with the lead developer in person reduces a lot of the usual friction of submitting to open-source. Whenever I was missing a feature or found a bug, it was very easy to write up the fix and have it integrated quickly.

Andrew Ferguson (Intern 2012)

As a student at Brown University, I work on software defined networks (SDNs) and platforms for Big Data processing, such as Hadoop and Microsoft’s Dryad/Cosmos. In both these areas, the core technologies were developed in academic labs and Internet-based companies, and are now reaching new markets via more traditional companies. As technologies are adopted by new users, new use cases and new problems arise — opening fresh avenues for systems research. This exposure to new types of Hadoop customers drew me to Cloudera for the Summer of 2012.

My internship at Cloudera also allows me to study a young company that transformed a small project into a mature product, and is using it to disrupt a large and established market for data processing. While the company may have started with a few engineers and sales staff, it now employs teams dedicated to all phases of a product’s lifecycle, from training, marketing, sales, and installation, to support and development, all within an organization still small enough for an intern to get to know. As the development of software-defined networks is still several years behind that of Big Data platforms, my summer at Cloudera lets me look into the future of SDN companies.

Finally, I would encourage any Ph.D. student, and particularly those in systems research, to consider spending a summer at a start-up or other small company, even if they are set on joining the academy after graduation. Numerous faculty members start companies during their careers, as it can be an effective way to change the world through research. And even for those who don’t start companies, the experience will help when advising future students on career options and selecting their own internships. As graduate students, we often have the twin luxuries of unstructured time and an ease of moving, so pick a city and an interesting company, and explore a new side of your life and research!

Patrick Wendell (Intern 2012)

At U.C. Berkeley, I work on resource management for large-scale data processing systems. My summer work at Cloudera, however, was off the beaten path for most academic engineers: I spent three months travelling out in the field with Cloudera’s engineers and working directly with customers as they assess, prototype, and deploy Hadoop in live environments. This experience put me right where the “rubber meets the road” in large-scale data management, and led to several insights about the problems faced day-to-day in big data deployments. It also provided perspective on which types of engineering solutions are most successful in the wild, which I will take back with me as I continue my research degree.

The most salient lesson I took from the summer is that when companies are evaluating a new technology, performance with respect to alternative solutions is but one of many criteria considered. Factors like deployment complexity, interoperability with existing systems, cost, fitness for a particular business problem, and overall user-friendliness combine to influence adoption of new technologies. This is even ignoring “human” elements like trust in particular brands, history of prior relationships, and internal company politics.

Interoperability, simplicity, and ease-of-use are rarely stated goals in systems research projects (and indeed, these are partially the responsibility of “productizing” engineers at companies like Cloudera), but they should be treated as first-class goals by any researchers who want to have impact. As de-facto standards arise around storage and processing of big data, the responsibility falls on researchers to make new technology interoperate with existing solutions, or at least to propose a path towards integration or evolution in the long term. Simplicity also reigns supreme: given the unavoidable complexity of administering and deploying distributed systems, users will always opt for a simpler, more stable design at the cost of some performance. Finally, ease-of-use remains a major pain point for state-of-the-art big data solutions such as Hadoop. At a minimum, Hadoop’s processing abstraction, MapReduce, is too low-level for most technology consumers. Higher-level abstractions and languages exist, but these are still nascent and don’t sufficiently take advantage of obvious performance optimizations available lower in the stack. Finding the right abstractions for data processing at scale remains an open problem, and one that I plan to focus on directly this coming year.

The question, then, is how to reconcile the canonical requirements for great research (innovative and groundbreaking progress) with the more mundane requirements for viable technology solutions (compatibility and simplicity). A key challenge for any great systems researcher is to walk this line adeptly!

These experiences illustrate how Ph.D. students benefit from understanding real-life big data problems, and accessing a broad spectrum of industrial engineers, partners, and customers. Conversely, Cloudera benefits from the industry-academia cross-pollination of ideas, and the methodical approach to problem solving brought by the Ph.D. interns.

Here is a list of current internship and full-time job openings at Cloudera.