The Hadoop FAQ for Oracle DBAs

Oracle DBAs, get answers to many of your most common questions about getting started with Hadoop.

As a former Oracle DBA, I get a lot of questions (most welcome!) from current DBAs in the Oracle ecosystem who are interested in Apache Hadoop. Here are a few of the most frequently asked questions, along with my most common replies.

How much does the IT industry value Oracle DBA professionals who have switched to Hadoop administration, or added it to their skill set?

Right now, a lot. There are not many experienced Hadoop professionals around (yet)!

In many of my customer engagements, I work with the DBA team there to migrate parts of their data warehouse from Teradata or Netezza to Hadoop. They don’t realize it at the time, but while working with me to write Apache Sqoop export jobs, Apache Oozie workflows, Apache Hive ETL actions, and Cloudera Impala reports, they are learning Hadoop. A few months later, I’m gone, but a new team of Hadoop experts who used to be DBAs is left in place.

My solutions architect team at Cloudera also hires ex-DBAs as solutions consultants or system engineers. We view DBA experience as invaluable for those roles.

What do you look for when hiring people with no Hadoop experience?

I strongly believe that DBAs have the skills to become excellent Hadoop experts – but not just any DBAs. Here are some of the characteristics I look for:

  • Comfort with the command line. Point-and-click DBAs and ETL developers need not apply.
  • Experience with Linux. Hadoop runs on Linux, so that’s where much of the troubleshooting will happen. You need to be very comfortable with the Linux OS, filesystem, tools, and command line. You should understand OS concepts around memory management, CPU scheduling, and I/O.
  • Knowledge of networks. The OSI layers, what ssh really does, name resolution, and a basic understanding of switching.
  • Good SQL skills. You know SQL and you are creative in your use of it. Experience with data warehouse basics such as partitioning and parallelism is a huge plus (see the short example after this list). ETL experience is a plus. Tuning skills are a plus.
  • Programming skills. Not necessarily Java (see below). But can you write a bash script? Perl? Python? Can you solve a few simple problems in pseudo-code? If you can’t code at all, that’s a problem.
  • Troubleshooting skills. This is huge, as Hadoop is far less mature than Oracle. You’ll need to Google error messages like a pro, but also be creative and knowledgeable about where to look when Google isn’t helpful.
  • For senior positions, we look for systems and architecture skills too. Prepare to explain how you’ll design a flight-scheduling system or something similar.
  • And since our team is customer facing, communication skills are a must. Do you listen? Can you explain a complex technical point? How do you react when I challenge your opinion?
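
For a taste of the SQL side, here is a minimal, hypothetical sketch of Hive partitioning, a concept that maps closely to Oracle range partitioning (the table and column names below are made up for illustration):

    -- Hypothetical Hive DDL: daily partitions, similar in spirit to
    -- Oracle range partitioning by date
    CREATE TABLE page_views (
      user_id BIGINT,
      url     STRING
    )
    PARTITIONED BY (view_date STRING)
    STORED AS TEXTFILE;

    -- Filtering on the partition column prunes the scan to the matching
    -- directories in HDFS, much like partition pruning in Oracle
    SELECT COUNT(*)
    FROM page_views
    WHERE view_date = '2014-01-01';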

Is that maybe too much to ask? Possibly. But I can’t think of anything I could remove and still expect success with our team.

How do I start learning Hadoop?

The first task we give new employees is to set up a five-node cluster in the AWS cloud. That’s a good place to start. Neither Cloudera Manager nor Apache Whirr is allowed; they make things too easy.

The next step is to load data into your cluster and analyze it. I recommend following Cloudera’s tutorials on loading Twitter data with Apache Flume and analyzing it with Hive.
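
To give a flavor of the analysis step, here is a hedged sketch of a Hive query over a simplified tweets table with an array-of-strings hashtags column (the tutorial’s actual schema, built from the raw JSON, is richer):

    -- Assumes a simplified tweets table with a hashtags array<string> column;
    -- LATERAL VIEW EXPLODE flattens the array so each hashtag becomes a row
    SELECT LOWER(hashtag) AS tag,
           COUNT(*)       AS mentions
    FROM tweets
    LATERAL VIEW EXPLODE(hashtags) h AS hashtag
    GROUP BY LOWER(hashtag)
    ORDER BY mentions DESC
    LIMIT 10;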

Also, Cloudera’s QuickStart VM (download here) includes TPC-H data and queries. You can run your own TPC-H benchmarks in the VM.
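
For example, a trimmed-down version of TPC-H query 1 (the pricing summary report) against the standard lineitem table is a good first test of aggregation performance:

    -- Simplified TPC-H Q1; the full query computes a few more aggregates
    SELECT l_returnflag,
           l_linestatus,
           SUM(l_quantity)      AS sum_qty,
           SUM(l_extendedprice) AS sum_base_price,
           AVG(l_discount)      AS avg_disc,
           COUNT(*)             AS count_order
    FROM lineitem
    WHERE l_shipdate <= '1998-09-02'
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus;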

There are also some good books to help you get started. My favorite is Eric Sammer’s Hadoop Operations – it’s concise and practical, and I think DBAs will find it very useful. The chapter on troubleshooting is very entertaining. Other books that DBAs will find useful are Hadoop: The Definitive Guide, Programming Hive, and Apache Sqoop Cookbook (all of which are authored or co-authored by Clouderans).

I also recommend taking a Cloudera University training course or two and perhaps even getting certified. Talking to a live instructor often provides insights that you can’t find on your own.

For even more resources, see the “New to Hadoop” page on cloudera.com.

Do I need to know Java?

Yes and no :)

You don’t need to be a master Java programmer. I’m not, and many of my colleagues are not. Some never write Java code at all.

You do need to be comfortable reading Java stack traces and error messages. You’ll see many of those. You’ll also need to understand basic concepts like jars and classpath.

Being able to read Java source code is useful. Hadoop is open source, and digging into the code often helps you understand why something works the way it does.

Even though mastery isn’t required, the ability to write Java is often useful. For example, Hive UDFs are typically written in Java (and it’s easier than you think).
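
Once the Java class is compiled into a jar, wiring it into Hive takes only a few lines of HiveQL. The jar path, function name, and class name below are placeholders, not a real library:

    -- Register and call a custom Java UDF from the Hive shell
    -- (/tmp/my-udfs.jar and com.example.Lower are hypothetical names)
    ADD JAR /tmp/my-udfs.jar;
    CREATE TEMPORARY FUNCTION my_lower AS 'com.example.Lower';
    SELECT my_lower(ename) FROM emp LIMIT 10;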

Conclusion

If you’re an Oracle DBA interested in learning Hadoop (or working for Cloudera), this post should get you started.

I’m happy to answer any other questions in comments!

Gwen Shapira is a Solutions Architect for Cloudera, and a former Oracle DBA.

7 Responses
  • Rakesh Tripathi / January 07, 2014 / 11:29 PM

    Good and relevant information for Database Administrators. Thanks !

  • Egidio Ndabagoye / January 08, 2014 / 1:00 AM

    Thanks Gwen. Very informative post.

  • Sridhar Govardhanan / January 15, 2014 / 8:56 AM

    Thanks Gwen, it’s a wonderful post. I am an Oracle DBA with 4 years of experience, and I am stuck in a dilemma over whether to choose the path of data analytics or Hadoop admin. Given my DBA experience, can you guide me on which would be my best option? Thank you…

  • Ramnath / February 06, 2014 / 11:40 PM

    Thanks Gwen. I am an Oracle DBA struggling to find a job, but I got an offer from a Big Data company. This information will help me a lot…

  • Amp / February 08, 2014 / 7:11 PM

    Hi Gwen,

    I am a programmer with 8+ years of experience in Java and open source technology, looking to move into the Hadoop ecosystem.

    Will getting into Hadoop work for me, given that I don’t have any DW or ETL-related experience?

  • Robin Dong / March 09, 2014 / 7:56 PM

    Hi Gwen,
    Your article here is very encouraging; I am looking for a Hadoop admin/developer job right now.

    I have been an Oracle prod/dev DBA and SQL developer for a long time, and I just got my Hadoop admin and developer certifications.

    I have a question for you. I tried to set up a two-node Hadoop cluster at home. I have home Internet (all IPs are dynamic).
    I had no problem installing CentOS 6.2 and Cloudera Manager 4.5. However, once I log onto Cloudera Manager and add these two nodes to Hadoop, an error always pops up: ‘Can’t find scm server’…

    1. First of all, I’d like to know whether I can just use my home Internet and three machines to set up this Hadoop cluster. What else do I need for this setup?

    2. It seems that when I log onto Cloudera Manager and add nodes, CM connects to archive.cloudera.com to find packages to install; I am afraid the dynamic IPs on my Internet service would not work that way.

    3. If I am wrong about the above, please let me know what I can do to get this two-node cluster installed.

    4. If I don’t use Cloudera Manager, just a plain Hadoop/Java install, do you think I can make it work?

    Anyhow, basically, I am looking for a way to install a multi-node Hadoop cluster at home. I hope I don’t need static IPs for this.

    BTW, I had no problem installing single-node Hadoop on rackspace.com, just not multiple nodes yet.

    Do you have any suggestions on how to find a Hadoop admin/developer job?

    Your help is greatly appreciated.
