Cloudera Developer Blog · Careers Posts
The following is a series of stories from people who recently worked as engineering interns at Cloudera. These experiences concretely illustrate how collaboration between commercial companies like Cloudera and academia, such as through these internships, helps promote big data research at universities. (These experiences were previously published in XRDS, the ACM student journal.)
Yanpei Chen (Intern 2011)
I interned with Cloudera during my last summer of grad school. My dissertation was on “Workload-Driven Design and Evaluation of Large-Scale Data-Centric Systems,” and I already had collaborations with Facebook and NetApp, two other companies working on big data. The goal of my work was to develop and demonstrate a set of empirical, workload-driven design and evaluation methods that complemented the traditional, subjective approach of designing by intuition and experience. It was very important that these methods generalized across many types of customer workloads. Hence, when Cloudera offered me an internship, I leapt at the unique opportunity to collect insights from customers in traditional industries who were still dealing with big data.
Data science has been a ubiquitous topic of conversation in the IT and business worlds throughout November. In this brief post, I’ll bring you just a small cross-section of the data science meme on the Interwebs from the past four weeks:
Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.
Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.
Now, it’s your turn to tell us your bigger questions story! Cloudera University is seeking tales of Apache Hadoop success originating with training and certification. How has an investment in your education paid dividends for your company, team, customer, or career?
You may have noticed that Harvard Business Review is calling data science “the sexiest job of the 21st century.” So our answer to the question is: Hot. Definitely hot.
If you need an explanation, watch the “Definition of a Data Scientist” talk embedded below from Cloudera data science director Josh Wills, which was recently hosted by Cloudera partner Lilien LLC in Portland, Ore. The key take-away: you don’t literally have to be a “scientist,” just someone with the curiosity of one.
This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the Apache HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.
The amount of memory available on a commodity server has increased drastically, in step with Moore’s law. Today, it’s quite feasible to have up to 96 gigabytes of RAM on a mid-range commodity server. This extra memory is a boon for databases such as HBase, which rely on in-memory caching to boost read performance.
However, despite the availability of high-memory servers, the garbage collection algorithms available in production-quality JDKs have not caught up. Attempting to use a large heap results in occasional stop-the-world pauses long enough to cause stalled requests and timeouts, noticeably disrupting latency-sensitive user applications.
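For illustration (this sketch is not from the original post, and the values are hypothetical), operators at the time often worked around these pauses by capping the HBase heap rather than using all available RAM, and by switching to the concurrent CMS collector in hbase-env.sh:

```shell
# hbase-env.sh -- hypothetical settings; tune for your own hardware

# Cap the RegionServer heap (in MB) instead of giving it all 96 GB;
# a smaller heap keeps worst-case GC pauses shorter
export HBASE_HEAPSIZE=8000

# Use the concurrent CMS collector and start collecting early,
# to reduce the chance of a long stop-the-world full GC
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```

Even with tuning like this, CMS can still fall back to a full stop-the-world collection under fragmentation, which is why heap size, not just collector choice, matters.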
This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
A year ago I rolled my first Apache Hadoop system into production. Since then, I’ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to Hadoop Meetups in San Francisco, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.
I studied computer science in school and have worked on a wide variety of computer systems in my career, with a lot of focus on server-side Java. I learned a bit about building distributed systems and working with large amounts of data when I built a pay-per-click (PPC) ad network in 2004. The system is still in operation and at one point was handling several thousand searches per second. As the sole technical resource on the system, I had to educate myself very quickly about how to scale up.
David joined us as part of our intern program and built the prototype for the distributed log search functionality that’s available as part of Cloudera Manager 3.7. He did an awesome job and wrote the following blog post, which, now that CM 3.7 has been released, we’re pleased to publish.
My intern project was to build a log search tool specialized for Apache Hadoop. My mini-app lets Hadoop cluster admins and operators search their error logs across many machines, filtering by time range, by text in the log message, or by source machine (the NameNode, for example). The results are then ordered by time and shown to the user.
This project was inspired by the extreme wizardry required to search logs with traditional tools such as grep and ssh (or parallel ssh), especially since those tools do not order the results by time. Ordering by time is very important: it lets you triage the sources of failures across your cluster and figure out where it all started.
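To make the contrast concrete, here is a minimal sketch (not David’s actual implementation; the function and field names are made up) of what the tool does that grep-over-ssh does not: interleaving log lines from several machines in timestamp order, with optional time-range and text filters.

```python
from datetime import datetime

def parse_ts(line):
    # Hadoop log lines typically start with "2011-08-15 12:00:01,500 ..."
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def merge_logs(logs_by_host, start=None, end=None, needle=None):
    """Interleave log lines from many hosts in timestamp order,
    optionally filtering by time range and by substring."""
    tagged = []
    for host, lines in logs_by_host.items():
        for line in lines:
            ts = parse_ts(line)
            if start and ts < start:
                continue
            if end and ts > end:
                continue
            if needle and needle not in line:
                continue
            tagged.append((ts, host, line))
    # Sorting on the parsed timestamp gives a single cluster-wide timeline
    return [f"{host} {line}" for ts, host, line in sorted(tagged)]
```

For example, `merge_logs({"namenode": nn_lines, "datanode1": dn_lines}, needle="ERROR")` would return matching lines from both machines, oldest first, so a cascade of failures reads top to bottom.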
How do I feel about my project in retrospect?
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive
This post was written by Daniel Jackoway following his internship at Cloudera during the summer of 2011.
When I started my internship at Cloudera, I knew almost nothing about systems programming or Apache Hadoop, so I had no idea what to expect. The most important lesson I learned is that structured data is great as long as it is perfect, with the addendum that it is rarely perfect.
My project was to develop a unified view of our customer data. The requirements were simple: pull in data from a variety of systems, group it by customer, and display it. The goal is that when someone at Cloudera needs to see all of the key information about our customers, it is available in one place. In addition, downloading and grouping data will make performing analysis much easier, allowing us to draw new insights about our business and our customers.
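As a sketch of the grouping step described above (the field names and helper are hypothetical, not the actual project code), records pulled from several systems can be folded into a single view per customer:

```python
from collections import defaultdict

def unify_by_customer(*sources):
    """Merge record lists from several systems into one view per
    customer, keyed on a shared 'customer' field (hypothetical name)."""
    view = defaultdict(dict)
    for source in sources:
        for record in source:
            # Later sources overwrite earlier ones on conflicting fields
            view[record["customer"]].update(record)
    return dict(view)

# Example: one record stream from a CRM, one from a support system
crm = [{"customer": "Acme", "contact": "ops@acme.example"}]
support = [{"customer": "Acme", "open_tickets": 2}]
unified = unify_by_customer(crm, support)
```

A real version would also have to reconcile keys that don’t match exactly across systems, which is where the “structured data is rarely perfect” lesson bites.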
The consensus among the Cloudera attendees of last week’s O’Reilly Strata Conference was that the data-focused event was nearly pitch-perfect for the data scientists, practitioners, and enthusiasts who attended. It was filled with educational and sometimes entertaining sessions, provided ample time for mingling with vendors and attendees, and was well run in general.
One of the cool activities at the conference was live streaming video brought to us by the good folks at SiliconAngle. Using a mobile production system called The Cube, SiliconAngle hosts John Furrier (@furrier) and Dave Vellante interviewed industry luminaries and up-and-comers while bringing their own perspective. Even after streaming live for nearly two days, the hosts were still able to keep the energy high and the tone light.
In the interviews below, John and Dave talk with Amr Awadallah, CTO and Co-Founder of Cloudera (@awadallah), and John Kreisa, VP of Marketing at Cloudera (@marked_man), followed by an interview with Sarah Sproehnle, Director of Education at Cloudera. During the interviews they cover many different aspects of Cloudera and Apache Hadoop.