My Internship at Cloudera

David joined us as part of our intern program, and built the prototype for the distributed log search functionality that’s available as part of Cloudera Manager 3.7. He did an awesome job, and wrote the following blog post which, now that CM3.7 has been released, we’re pleased to publish.

The project

My intern project was to build a log searching tool, specialized for Apache Hadoop. My mini-app allows Hadoop cluster admins and operators to search their error logs across many machines, filtering by time range, by text in the log message, and by machine (to find the namenode machine, for example). The results are then ordered by time, and shown to the user.

This project was inspired by the extreme wizardry required to search logs with traditional tools, such as grep and ssh (or parallel ssh), especially since these tools do not order the results by time. Ordering by time is very important, as it allows you to triage the sources of failures across your cluster, and figure out where it all started.

How do I feel about my project in retrospect?

I had a ton of independence when it came to building my app. I was part of the Enterprise Team, which wrote Cloudera Manager, a tool that lets one easily deploy Hadoop to a whole cluster within a few clicks. I wrote the REST API available on the individual machines of the cluster. I wrote the master server code that makes requests in parallel to each of the cluster machines, asking for their search results. I designed and wrote the UI for the app, in addition to conceiving ways to make life easier for users who interact with it (with some help from user-testing, of course).

A couple of the niceties I added to the search page include:

  • Search without page refresh.
  • Saves the current search’s options to the URL bar, so that if you send the URL to a fellow admin, they can run the exact same search and see exactly what you’re seeing. Or you can save the URL and re-run the search at a later date.
  • A context view that does the same thing as the above, but lets you browse a single log file, with pagination (a single day’s log file can get as big as 1GB, so it wouldn’t be a great idea to send it all to the client at the same time).
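To make the "search options in the URL" idea concrete, here is a minimal sketch of the round trip. It is written in Python for illustration only (the real UI did this in the browser), and the option names `query`, `role`, `start`, and `end` are assumptions, not the app's actual parameters.

```python
from urllib.parse import urlencode, parse_qs


def encode_search(options):
    """Serialize search options into a query string (sorted for stable URLs)."""
    return urlencode(sorted(options.items()))


def decode_search(query):
    """Recover the options dict from a query string (single-valued keys only)."""
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical search options; anyone opening a URL with this query string
# can rebuild the same options and re-run the same search.
options = {
    "query": "helloworld",
    "role": "datanode",
    "start": "2011-08-01T00:00",
    "end": "2011-08-02T00:00",
}
query = encode_search(options)
restored = decode_search(query)
```

Because the whole search is captured in the query string, sharing or bookmarking the URL is enough to reproduce it later.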

In this process I learned more about Python, the difficulty that Python’s built-in date capabilities can cause, and how it can be quite fun to run code distributed across hundreds or thousands of machines.

I also spent some time profiling the internal Cloudera log searching library (written by Adam Warrington) which is the workhorse of the REST API (the master server communicates with its minions over HTTP). We were able to cut the worst-case run time on sample data by ~88%, which made me happy. During the process, I learned that, when possible, it’s best to meta-program other people’s code by asking them to make it faster, as learning and reading all the code they’ve written can take some time, especially when you only need to make what you hope is a small change. It’s really great to arrive at work and hear that someone else has just finished coding up the optimizations your app needed.

Technical Portion: how the log search feature works.

Whenever you run a search, the main page of the search UI makes a request to a JSON endpoint, asking for log search results from, say, yesterday on all datanodes in the cluster. This request reaches the master server (SCM), which knows all about the machines in the cluster. The master server has a number of threads which make requests in parallel to each of the applicable cluster machines, each of which exposes a JSON endpoint. Each individual cluster machine then runs some Python code that searches the relevant log files and returns the result as JSON. The master server collates the results from each cluster machine, and returns them to the browser. The results are then displayed to the user in what can sometimes be a very long list.
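The fan-out-and-collate step on the master server can be sketched roughly as follows. This is not the Cloudera Manager code: `search_cluster` and `fetch` are hypothetical names, the HTTP request to each machine's JSON endpoint is replaced by an in-memory stand-in so the example runs on its own, and merging pre-sorted per-host lists is just one plausible way to get a time-ordered result.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_cluster(hosts, fetch, max_workers=8):
    """Query every host in parallel and merge the per-host results by time.

    `fetch(host)` is a hypothetical callable returning that host's matching
    events as a list of (timestamp, host, message) tuples sorted by timestamp.
    In a real deployment it would issue an HTTP request and decode the JSON.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_host = list(pool.map(fetch, hosts))
    # Each per-host list is already sorted, so a k-way merge is enough.
    return list(heapq.merge(*per_host))


# Stand-in for the cluster machines' JSON endpoints (made-up data).
FAKE_LOGS = {
    "node1": [
        ("2011-08-01 10:00:00", "node1", "starting"),
        ("2011-08-01 10:05:00", "node1", "error X"),
    ],
    "node2": [
        ("2011-08-01 10:02:00", "node2", "error Y"),
    ],
}

results = search_cluster(list(FAKE_LOGS), lambda host: FAKE_LOGS[host])
for timestamp, host, message in results:
    print(timestamp, host, message)
```

Threads suit this job because the master spends nearly all of its time waiting on the network rather than computing.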

We decided it would be best not to maintain an index on the log files, as there can be many terabytes of data to sift through. For this reason, searches are done on demand by each individual cluster machine. Searches which include a time range are quite fast, as binary search is used to find the relevant time range, and then the first 20 results are returned. We also made an effort to optimize searches similar to grep helloworld, which filter on a particular word: we scan each raw line for the word, and skip the line without parsing it into an event if it does not contain helloworld. We made this optimization because parsing each log event into date, message, and source was quite slow when searching large files.
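The two tricks described above (binary search to the start of the time range, and a cheap substring check before parsing) can be sketched like this. It is a simplified illustration, not the internal library: it works on an in-memory list of lines rather than seeking within a file, and it assumes every line begins with a 19-character "YYYY-MM-DD HH:MM:SS" timestamp followed by a source and a message. The function names and the sample log are made up.

```python
def lower_bound(lines, start):
    """Binary search for the index of the first line with timestamp >= start."""
    lo, hi = 0, len(lines)
    while lo < hi:
        mid = (lo + hi) // 2
        if lines[mid][:19] < start:
            lo = mid + 1
        else:
            hi = mid
    return lo


def parse_event(line):
    """Split a line into (timestamp, source, message); the costly step in practice."""
    timestamp, rest = line[:19], line[20:]
    source, _, message = rest.partition(" ")
    return timestamp, source, message


def search(lines, start, end, word=None, limit=20):
    """Return up to `limit` parsed events with start <= timestamp < end."""
    results = []
    for i in range(lower_bound(lines, start), len(lines)):
        line = lines[i]
        if line[:19] >= end:
            break  # past the requested time range
        if word is not None and word not in line:
            continue  # skip the line without parsing it
        results.append(parse_event(line))
        if len(results) == limit:
            break
    return results


LOG = [
    "2011-08-01 09:59:00 namenode starting up",
    "2011-08-01 10:00:00 datanode helloworld received",
    "2011-08-01 10:01:00 datanode heartbeat",
    "2011-08-01 10:02:00 datanode helloworld again",
    "2011-08-01 11:00:00 namenode helloworld late",
]
hits = search(LOG, "2011-08-01 10:00:00", "2011-08-01 10:30:00", word="helloworld")
```

Comparing timestamps as strings only works because this format sorts lexicographically in time order; a different log format would need real date parsing in `lower_bound`.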

Because I wrote the three components that make the search work (UI, JSON route on master server, JSON route on cluster machines), I got a good overview of many aspects of the code base.

A brief overview of the indirectly dev-related skills I learned at Cloudera

I’ve learned git really well, which I totally love now. I can rebase, cherry-pick, :/search, and reflog like the best of them. My git skills could be considered quite fetching among certain branches of society.

While here I’ve also had the opportunity to really flesh out my dotfiles, especially my .gitconfig and my .profile (aka .bashrc).

I also got a real feel for Dojo while I’ve been here, and I’d say that my next choice of JavaScript toolkit will be much better informed because of this.

Code reviews kept my code quality up, helping me catch little things like comments that were no longer relevant. I have yet to master the art of spotting and vetoing changes that will break what I’ve written. Maybe next year!

Dev environment

From the web developer’s point of view, there are a couple of times when you’d normally need to do a couple of extra alt-tabs and/or refreshes. Of course, I’m pampered because I’ve never had to build anything that goes into production in a massive and/or distributable way, as we do here at Cloudera with our management tools. Also, I’ve traditionally done my work with web-servers that don’t require one to compile source code.

I found the following things quite annoying about the dev environment:

  1. static files, when changed, would not show their changes upon refresh
  2. .less files, an advanced and superior alternative to CSS, would not auto-recompile upon change

A solution to the first one was found by a coworker. I fixed the second problem by writing a little node script that watches .less files and recompiles when any of them change.
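The core of such a watcher is small. The sketch below is not the original node script: it is a Python stand-in that only shows the change-detection half, using polled modification times, and `changed_files` is a hypothetical helper name.

```python
import os


def changed_files(paths, last_seen):
    """Return the paths whose modification time differs from what was last seen.

    `last_seen` maps path -> mtime and is updated in place, so calling this
    repeatedly reports each change once.
    """
    changed = []
    for path in paths:
        mtime = os.stat(path).st_mtime
        if last_seen.get(path) != mtime:
            last_seen[path] = mtime
            changed.append(path)
    return changed
```

A watcher would call this in a loop with a short sleep and run the Less compiler on whatever comes back. The first call reports every file, which conveniently triggers an initial compile.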

While I was at it, I also made it easy for devs to use vogue, which reloads stylesheets whenever they’ve changed, without requiring a page refresh. This further improves development, as pages can get quite heavy when in development mode, where every JavaScript file is loaded individually, and it’s nice to have CSS changes automatically reflected in the UI.

Thanks Cloudera!

That’s about all I have on my mind when it comes to my internship. I learned a ton and enjoyed the free lunches a lot, as well as the 30-inch monitors. These things make a big difference, and also make me feel way cooler than some of my friends who don’t get these things.

So long Cloudera and thanks for all the fish! Now I’m off to another planet for a year in the world of academia!

David Trejo
Software Engineer Intern
Cloudera Summer ’11 Enterprise Team
Brown University Computer Science ’13
