In this installment, meet Cloudera Software Engineer/Apache Bigtop Committer Mark Grover (@mark_grover).
What do you do at Cloudera and in which Apache project are you involved?
I’m a Software Engineer at Cloudera, involved mostly with Apache Bigtop, an open source project aimed at building a community around packaging and interoperability testing of projects in the Apache Hadoop ecosystem. In addition, I contribute to Apache Hive, a data warehousing system built on top of Apache Hadoop that allows users to structure and query their Hadoop data using familiar SQL-like syntax. I have also written a section in O’Reilly’s book on Hive, Programming Hive.
Why do you enjoy your job?
The Hadoop ecosystem comprises many different projects, each with its own problem space and use cases. Not only do I get a chance to dive deep into some of these projects and the complex technical problems associated with them, but I also get an opportunity to integrate these projects so they interface well with each other, providing our customers with an easily deployable and well-integrated platform. While one day I may be working on creating a new datatype in Hive, the next day I may be looking into compatibility issues between two projects in the ecosystem. It is this unique mix of depth and breadth that makes me enjoy my job.
On top of that, I also get to contribute to open source software, collaborate with smart people both within and outside of Cloudera, and share and present open source projects at conferences and meetups.
What is your favorite thing about Hadoop?
Hadoop has become the de facto framework for scalable storage and processing of large datasets. It allows users to store, access, and process data that they previously could not work with. Consequently, it lets them gain more insight into their data and make better, data-driven business decisions. Gaining new insights for a business is almost like giving eyeglasses to someone who never had them. What’s more important, though, is that Hadoop brings this power to a much larger market. It enables users who don’t have the resources to invest in expensive data warehousing systems to make the most of their data in a much more cost-effective manner. Today, Hadoop is used in finance, healthcare, bioinformatics, advertising, business intelligence, retail, government, social sciences, and many other fields.
Hadoop doesn’t just enable users to make better use of their data; it also opens this capability up to a much larger section of the population, one that hasn’t been catered to until now.
What is your advice for someone who is interested in participating in any open source project for the first time?
When I first got involved with Apache Hive, I was dealing with the problem of scalably storing web click and impression logs in a data warehouse. We were using MySQL but soon realized that it wouldn’t scale. I analyzed the various open source technologies out there, and Hive (along with Hadoop) emerged as the winner. Consequently, I deployed Hadoop and Hive on a cluster and became a user of those projects. Eventually, our data warehouse moved over entirely to Hadoop and Hive, and scalability was just a matter of adding more hardware. During this process, I ran into certain pain points with Hive. I filed a few JIRA issues for them and uploaded patches wherever I could. In the meantime, I also started helping other users on the mailing lists with their questions.
Therefore, based on my experience, I would suggest becoming a user of the project first. It helps to have a problem that the project tries to address, but it’s also completely fine if you are using the project just for the sake of learning it. Join the user mailing lists and the IRC channel so you can post any questions that you have along the way, become aware of the problems other users are having, and maybe even help others out whenever possible. Soon enough, you will discover pain points, both through your own use of the project and from other users on the mailing lists. Some of these pain points are low-hanging fruit and easy to fix, so get started on those.
I have also had conversations with other committers on the projects and asked them which issues matter most to them. Create issues on the project JIRA for such problems and start posting patches. You may get some feedback from other committers; be open to it, because their intent is the same as yours: to make the project better. Be a good community citizen, be genuinely interested in improving the project, and show it by posting patches, helping other users, and expanding the community. You will be a committer before you know it!
On a related note, projects like Apache Bigtop are always looking for new and exciting ideas to make things better for our users. If you have any ideas, or would like to help shape an open source integrated distribution of projects in the Hadoop ecosystem, I would strongly encourage you to try out Apache Bigtop.
At what age did you become interested in programming, and why?
It all started for me in high school. I was learning C/C++, and to do so I wrote a Point-of-Sale system. The vision there was to build a system that would be used in the retail industry by cashiers. It was a rather old-school console application that allowed the cashier to add items to an invoice, print the invoice, and save/retrieve the invoice on demand, backed by a simple file-based database. That project provided me with a great holistic introduction to the software craft. I learned not just what a programmer does but also about the roles of an architect, program manager, tester, and release manager. From there, I never looked back!
If you’re attending Big Data TechCon in Boston (April 8-10), you can catch Mark’s half-day tutorial “Introduction and Best Practices for Storing and Analyzing Your Data with Apache Hive” on April 8. (See full list of Cloudera speakers here.)