Exploring Compression for Hadoop: One DBA’s Story

Categories: General, Guest, HDFS, Ops and DevOps

This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).

Most Western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.

Sometimes solving technical problems is similar to navigating a city without many street names: Once you arrive at the desired location, the path seems obvious, but on the way there are many detours and interesting sights to be seen.

Here’s an example: during a client engagement in Tokyo, we needed to move about 10TB of data from Hadoop into an Oracle Exadata box as a one-time load.

The options we considered were:

  1. Use FUSE to mount HDFS on Exadata and use Oracle external tables (an Oracle Database feature through which you can access external data as if it were stored in a table) to load the files.
  2. Install Hadoop on Oracle Exadata, use HDFS to copy the files to the compute nodes, and load them from there.
  3. Write a process on an existing server that would read data from Hadoop and load it into Oracle Exadata.
  4. Use Sqoop to insert data into Exadata directly from Hadoop.
  5. Use a pre-packaged Oracle connector.

FUSE and Sqoop seemed like the least intrusive options, requiring minimal changes to either Exadata or Hadoop, and we debated the pros and cons of each. Ultimately, using FUSE and external tables, while not a common approach, seemed the best choice for our situation. Once we had the data in text files, we could load it into our tables using any method we wanted, using all of Oracle Exadata’s features to optimize compression, parallelism, and load times.

Being a consultant, I sent the customer system administrators links to relevant Hadoop-FUSE documentation and asked them to get cranking.

The system administrator called me back to say that Hadoop is “backed up” nightly to NFS, and that it would be easier if we just used that data. Since I didn’t mind getting data that was a few hours old (I know how to close the data gap), I agreed to their plan.

I used "less" to peek at one of the files. What I saw was definitely not readable text.

An experienced Hadoop-er would know exactly what she was looking at, but I was still in learning mode. In a discussion with the application developer, I was told that the file was compressed.

Just for fun, I decided to write the decompression Java code myself. Read the compressed file, output text to STDOUT, use it as an external table pre-processor: How difficult could it be? I just needed some code samples of how to use the LZO libraries, and we would be set.

Or so I thought. Almost immediately, I got as lost as a drunk tourist in Shinjuku at night. Hadoop in fact has more than one LZO library, and I was not sure whether they were mutually compatible. Files can be block compressed or record compressed, and I did not know which my file was. The libraries were used only as plug-ins to Hadoop; apparently no one used them on files outside Hadoop. There was also something called LZOP, and a lot of C libraries that I tried using but that kept claiming my file was not in fact LZO. At that point, I called it a day.

But before going to sleep, I took a look at Hadoop: The Definitive Guide (2nd edition), by Cloudera’s Tom White, hoping for some ideas and insights. As expected, the book proved as useful as Google Maps in Tokyo.

Two hours and three glasses of Kirin later, I discovered that:

  • The file is not just compressed – it is in fact a SequenceFile. I heard about SequenceFiles at my Cloudera developer course, but I had not actually used one before.
  • SequenceFiles have very descriptive headers that include the key data type (BytesWritable), value data type (Text), whether the file is compressed (YES), block compressed (YES) and codec (LZO).
  • Tom White had thoughtfully included example code that reads SequenceFiles. The book didn’t mention compression, but from my readings I gathered that SequenceFile.Reader might be able to handle that data automatically if I include the LZO libraries in my class path.
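Those headers are simple enough that you can peek at them with plain Java and no Hadoop libraries at all. The sketch below is a deliberately simplified parser, not Hadoop's real one: it assumes class names are short enough that Hadoop's VInt length prefix fits in a single byte, and it stops reading right after the codec name.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Minimal SequenceFile header peeker -- a sketch, not Hadoop's real parser.
// Assumes class names shorter than 128 bytes, so the Hadoop VInt length
// prefix is a single byte (real Hadoop uses WritableUtils.readVInt).
public class SeqHeaderPeek {

    public static class Header {
        public String keyClass, valueClass, codecClass;
        public boolean compressed, blockCompressed;
    }

    static String readShortString(DataInputStream in) throws IOException {
        int len = in.readByte();          // single-byte VInt (see assumption)
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static Header peek(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q')
            throw new IOException("not a SequenceFile");
        in.readByte();                        // format version, e.g. 6
        Header h = new Header();
        h.keyClass = readShortString(in);     // e.g. org.apache.hadoop.io.BytesWritable
        h.valueClass = readShortString(in);   // e.g. org.apache.hadoop.io.Text
        h.compressed = in.readBoolean();
        h.blockCompressed = in.readBoolean();
        if (h.compressed)
            h.codecClass = readShortString(in);  // the compression codec class
        return h;
    }
}
```

Pointed at the NFS copy of one of those files, something like this would have revealed the BytesWritable/Text/LZO combination above without any classpath gymnastics.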

This looked easy enough. I just needed to get Tom’s example working in my environment: on Exadata, with no Hadoop installed, producing output to be used as input for external tables.

The following steps turned out to be the correct ones:

Step 1: Prepare the environment. To do this I needed the Hadoop core package and the Hadoop-LZO package. Hadoop core (hadoop-0.20.2-cdh3u2-core.jar) was copied from the Hadoop server (also Linux 64-bit, so it was compatible); LZO was installed from hadoop-gpl-packing, an excellent little project that saved me much pain. The locations of the packages were added to the CLASSPATH environment variable. You will also need commons-logging.jar somewhere in the class path, as Hadoop uses it for logging.

I also needed native libraries for Hadoop and LZO. I copied libhadoop.so from the same server; the hadoop-gpl-packing package included native LZO libraries. The locations of both went into LD_LIBRARY_PATH.

Step 2: Compile SequenceFileReaderDemo (from the book) and test it on a sample file.
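The reader itself, adapted from the book’s pattern, looks roughly like this. This is a sketch against the CDH3-era (Hadoop 0.20) API; it needs the Hadoop and LZO jars from Step 1 on the classpath, so it will not compile standalone:

```java
// Sketch only: requires hadoop-core (and the LZO codec jar) on the
// classpath; reads a SequenceFile and prints each value to STDOUT.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReaderDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = FileSystem.getLocal(conf);  // files live on NFS, not HDFS
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      BytesWritable key = (BytesWritable)
          ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Text value = (Text)
          ReflectionUtils.newInstance(reader.getValueClass(), conf);
      // The reader picks the codec from the header, so LZO decompression
      // happens automatically as long as the codec jar is on the classpath.
      while (reader.next(key, value)) {
        System.out.println(value);   // values only, for the external table
      }
    } finally {
      reader.close();
    }
  }
}
```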

To my joy and disbelief, this "just worked". I had to add import org.apache.hadoop.io.BytesWritable to support our particular key, but once I set the environment right, javac SequenceFileReaderDemo.java ran without issues. Testing the reader on a sample file also worked, with no special problems other than those that led to the discovery of the native libraries.

Step 3: Use SequenceFileReaderDemo as pre-processor for external tables. This was probably the most difficult step (relatively speaking). Integration usually is.

First, I had to tweak the output of SequenceFileReaderDemo to match the expectations of Oracle Database: don’t output row numbers, or keys, or alignment marks – just the values and nothing but the values. Also, Hadoop will try to send info messages to STDERR, which causes Oracle Database to assume something is wrong with the external table and quit. Rather than learn how to configure log4j parameters for my non-existent Hadoop, I just piped STDERR to /dev/null, which although sloppy would do for a POC.
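Since Oracle’s access driver reads only STDOUT, the same effect as the shell redirect can be had in-process. Here is a minimal sketch – QuietOutput and emitValue are hypothetical names, not part of the original demo – of silencing STDERR and emitting one value per line:

```java
import java.io.OutputStream;
import java.io.PrintStream;

// Sketch: silence anything Hadoop (or log4j) writes to STDERR so the
// external table pre-processor emits values on STDOUT and nothing else.
public class QuietOutput {

    // Swallow all STDERR output for the rest of the process.
    public static void silenceStderr() {
        System.setErr(new PrintStream(new OutputStream() {
            @Override public void write(int b) { /* discard */ }
        }));
    }

    // Hypothetical emitter: print only the value, never the key,
    // one record per line, as the access driver expects.
    public static void emitValue(String value) {
        System.out.println(value);
    }
}
```

Calling silenceStderr() once at startup keeps stray log messages from ever reaching Oracle, without touching any log4j configuration.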

Step 4: Load the data from the external table into a real table. I used Create Table as Select for the POC, but there are many other options. (That’s one of the nicest things about external tables; the feature is so flexible that you can even read sequence files!)

To test the output, I also had to set one more environment variable: export NLS_LANG=JAPANESE_JAPAN.UTF8. Otherwise, most of the text came out as junk.

Of course, this was just enough to show the reluctant developers how things are done. I think they were happy with the results.

But this was not real production code. The most glaring issue is that the error handling is dubious at best. More worrying for me was that most of the data was stored in many small files – not exactly the best way to use Hadoop. The overhead of creating a new JVM to load each file would become an issue.

I could have had SequenceFileReaderDemo accept entire directories as input and process all the files in each directory at once, but directories have highly variable sizes and there are not enough of them. I decided that the input would have to be a range of dates or filenames within a directory, and used this to partition the loading process manually.
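The manual partitioning idea can be sketched as a simple lexicographic filter over a directory listing. The date-stamped file-name pattern below is a hypothetical example, not the client’s actual naming scheme:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the manual partitioning: given a directory listing, select
// only the files whose (date-stamped) names fall inside a lexicographic
// [from, to] range, so each loader JVM handles one slice of the data.
public class LoadPartitioner {

    public static List<String> filesInRange(List<String> names,
                                            String from, String to) {
        List<String> slice = new ArrayList<String>();
        for (String name : names) {
            // Lexicographic comparison works because the hypothetical
            // names embed dates as fixed-width yyyyMMdd strings.
            if (name.compareTo(from) >= 0 && name.compareTo(to) <= 0) {
                slice.add(name);
            }
        }
        return slice;
    }
}
```

Each invocation of the loader then gets a from/to pair, which keeps the per-file JVM overhead bounded by the size of the slice rather than the whole dataset.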

Maybe I was obsessing over the details of a one-time load, but no one complained. The Japanese are famous for getting the small details right!