Cloudera Engineering Blog · Careers Posts
The following is a series of stories from people who recently worked as Engineering Interns at Cloudera. These experiences concretely illustrate how collaboration between commercial companies like Cloudera and academia, such as in the form of these internships, helps promote big data research at universities. (These experiences were previously published in the ACM student journal, XRDS.)
Yanpei Chen (Intern 2011)
Data science has been a ubiquitous topic of conversation in the IT and business worlds across the month of November. In this brief post, I’ll bring you just a small cross-section of the data science meme on the Interwebs in the past 4 weeks:
Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.
Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.
You may have noticed that Harvard Business Review is calling data science “the sexiest job of the 21st century.” So our answer to the question is: Hot. Definitely hot.
If you need an explanation, watch the “Definition of a Data Scientist” talk embedded below from Cloudera data science director Josh Wills, which was hosted by Cloudera partner Lilien LLC recently in Portland, Ore. The key takeaway: you don’t literally have to be a “scientist,” just someone with the curiosity of one.
This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I received from the Apache HBase community. I started off in June with very limited knowledge of both HBase and distributed systems in general, and by September I managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.
The amount of memory available on a commodity server has increased drastically, in line with Moore’s law. Today, it’s very feasible to have up to 96 gigabytes of RAM on a mid-range commodity server. This extra memory is good for databases such as HBase, which rely on in-memory caching to boost read performance.
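As a concrete illustration, the fraction of the region server heap devoted to HBase’s read-side block cache is controlled by `hfile.block.cache.size` in `hbase-site.xml`. The value below is illustrative only, not a tuning recommendation, and the right setting depends on your heap size and workload:

```xml
<!-- hbase-site.xml: devote 40% of the region server heap to the
     block cache (the value 0.4 here is illustrative, not advice).
     A larger block cache lets more HFile blocks be served from RAM,
     improving read performance on memory-rich servers. -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
```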
This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
A year ago I rolled my first Apache Hadoop system into production. Since then, I’ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to Hadoop Meetups in San Francisco, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.
David joined us as part of our intern program, and built the prototype for the distributed log search functionality that’s available as part of Cloudera Manager 3.7. He did an awesome job, and wrote the following blog post which, now that CM3.7 has been released, we’re pleased to publish.
My intern project was to build a log searching tool specialized for Apache Hadoop. My mini-app lets Hadoop cluster admins and operators search their logs across many machines, filtering by time range, by text in the log message, or by source machine (the namenode, for example). The results are then ordered by time and shown to the user.
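The core of that kind of filtering can be sketched in a few lines. The snippet below is not the Cloudera Manager implementation, just a minimal illustration assuming a typical Log4j-style line format (`YYYY-MM-DD HH:MM:SS,ms LEVEL class: message`):

```python
import re
from datetime import datetime

# Assumed log line shape, e.g.:
# "2011-08-01 12:00:03,123 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: edit log failed"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ (\w+) (\S+): (.*)$")

def search_logs(lines, start, end, text=None, level=None):
    """Return log lines whose timestamp falls in [start, end], optionally
    restricted to a level (e.g. "ERROR") and a substring of the message,
    ordered by time."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that don't parse (stack traces, etc.)
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if not (start <= ts <= end):
            continue
        if level is not None and m.group(2) != level:
            continue
        if text is not None and text not in m.group(4):
            continue
        hits.append((ts, line))
    return [line for ts, line in sorted(hits)]
```

In the real tool the lines would be collected from many machines before this merge-and-sort step; here they are simply passed in as one list.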
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
This post was written by Daniel Jackoway following his internship at Cloudera during the summer of 2011.
When I started my internship at Cloudera, I knew almost nothing about systems programming or Apache Hadoop, so I had no idea what to expect. The most important lesson I learned is that structured data is great as long as it is perfect, with the addendum that it is rarely perfect.
The consensus among the Cloudera attendees of the O’Reilly Strata Conference last week was that the data-focused conference was nearly pitch perfect for the data scientists, practitioners, and enthusiasts who attended. It was filled with educational and sometimes entertaining sessions, provided ample time for mingling with vendors and attendees, and was well run in general.
One of the cool activities happening at the conference was live streaming video brought to us by the good folks at SiliconAngle. Using a mobile production system called The Cube, SiliconAngle hosts John Furrier (@furrier) and Dave Vellante interviewed industry luminaries and up-and-comers while bringing their own perspective. Even after streaming live for nearly two days, the hosts were still able to keep the energy high and the tone light.
We blogged about 104 different topics in 2010, and we recently decided to take a look back and see what folks were most interested in reading. The featured topics ranged from Cloudera’s Distribution for Apache Hadoop technical updates (CDH3b3 being the most recent) to highlighting upcoming Hadoop-related events and activities to sharing practical insights for implementing Hadoop. We also featured a number of guest blog posts.
Here are the top 10 blog posts from 2010:
- How to Get a Job at Cloudera
Cloudera is hiring around the clock, and this blog highlights the best course of action to increase your chances of becoming a Clouderan.
- Why Europe’s Largest Ad Targeting Platform Uses Hadoop
“As data volumes increased and performance suffered, we recognized a new approach was needed (Hadoop).” –Richard Hutton, Nugg.ad CTO
- What’s New in CDH3b2 Flume
Flume, our data movement platform, was introduced to the world and into the open source environment.
- What’s New in CDH3b2 Hue
Hue, a web UI for Hadoop, is a suite of web applications as well as a platform for building custom applications with a nice UI library.
- Natural Language Processing with Hadoop and Python
Data volumes are increasing naturally from text (blogs) and speech (YouTube videos), posing new questions for Natural Language Processing: making sense of lots of data in different forms and extracting useful insights.
- How Raytheon BBN Technologies Researchers are Using Hadoop to Build a Scalable, Distributed Triple Store
Raytheon BBN Technologies built a cloud-based triple-store technology, known as SHARD, to address scalability issues in the processing and analysis of Semantic Web data.
- Cloudera’s Support Team Shares Some Basic Hardware Recommendations
The Cloudera support team discusses workload evaluation and the critical role it plays in hardware selection.
- Integrating Hive and HBase
Facebook explains integrating Hive and HBase to keep their warehouse up to date with the latest information published by users.
- Pushing the Limits of Distributed Processing
Google built a 100,000 node Hadoop cluster running on Nexus One mobile phone hardware and powered by Android. The environmental cost of this solution is 1/100th the equivalent of running it within their data center. (April Fools)
- Using Flume to Collect Apache 2 Web Server Logs
This post presents the common use case of using a Flume node to collect Apache 2 web server logs and deliver them to HDFS.
Here at Cloudera we embraced the holiday spirit with the light-heartedness that is Halloween by hosting several activities, including an engineering hack-a-thon, a hack-a-pumpkin-a-thon, and a costume competition.
If there is one thing that chefs are proud of, it’s their kitchens. Whether cavernous top-of-the-line affairs or cramped New York apartments, kitchens are the place where raw ingredients are combined with talent and hard work to produce results. The only difference in the world of software is what you will find in our kitchens. (more…)
We’re doing a lot of hiring at Cloudera — we have jobs open in operations, sales, engineering and elsewhere. Hiring well is hard work. We spend a lot of time on it, and have learned a lot about the kind of people we want to bring in. One of the best ways for us to do a good job of hiring is to help you do a good job of applying for a job here.
I’ll begin the post, though, by telling you what doesn’t work. Several times a day, we get an unsolicited email or phone message from a contingency recruiter like this one: