Cloudera Development Kit (CDK): Hadoop Application Development Made Easier
- by Eric Sammer & Tom White
- May 07, 2013
At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that the platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.
So, we’ve made Hadoop much easier to “consume” for admins and other operators — but what about for developers, whether working for ISVs, SIs, or users? Until now, they’ve largely been on their own.
That’s why we’re really excited to announce the Cloudera Development Kit (CDK), a new open source project designed to help developers build applications on CDH, Cloudera’s open source distribution including Hadoop, more quickly and easily than before. The CDK is a collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with the platform. Just like CDH, the CDK is 100% free, open source, and licensed under the same permissive Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.
The CDK lives on GitHub, where users can freely browse, download, fork, and contribute back to the source. Community contributions are not only welcome but strongly encouraged. Since most Java developers use Maven (or tools that are compatible with Maven repositories), artifacts are also available from the Cloudera Maven Repository for easy project integration.
What’s In There Today
Our goal is to release a number of CDK modules over time. The first module, available in the current release, is the CDK Data module: a set of APIs that drastically simplify working with datasets in Hadoop filesystems such as HDFS and the local filesystem. The Data module handles automatic serialization and deserialization of Java POJOs as well as Avro Records, automatic compression, file and directory layout and management, automatic partitioning based on configurable functions, and a metadata provider plugin interface to integrate with centralized metadata management systems (including HCatalog). All Data APIs are fully documented with javadoc, and a reference guide walks you through the important parts of the module. Additionally, a set of examples is provided to help you see the APIs in action immediately.
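To make "automatic partitioning based on configurable functions" a little more concrete, here is a minimal, self-contained sketch in plain Java. It does not use the CDK Data API. The names PartitionSketch, FieldPartitioner, Event, partitionPath, and examplePartitioners are all hypothetical, and the name=value directory layout is an assumption made purely for illustration. The idea is only that each record is run through an ordered list of functions, and the results decide which directory the record lands in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative only: NOT the CDK Data API. All names here are hypothetical.
public class PartitionSketch {

    /** A named function that maps a record to one partition value. */
    public static final class FieldPartitioner<T> {
        final String name;
        final Function<T, Object> fn;

        public FieldPartitioner(String name, Function<T, Object> fn) {
            this.name = name;
            this.fn = fn;
        }
    }

    /** A tiny example record type. */
    public static final class Event {
        final long userId;
        final String country;

        public Event(long userId, String country) {
            this.userId = userId;
            this.country = country;
        }
    }

    /**
     * Applies each partitioner in order and joins the results into a
     * relative directory path of name=value segments (an assumed layout).
     */
    public static <T> String partitionPath(T record, List<FieldPartitioner<T>> partitioners) {
        StringBuilder path = new StringBuilder();
        for (FieldPartitioner<T> p : partitioners) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(p.name).append('=').append(p.fn.apply(record));
        }
        return path.toString();
    }

    /** Partition first by country, then by the user id spread over 4 buckets. */
    public static List<FieldPartitioner<Event>> examplePartitioners() {
        List<FieldPartitioner<Event>> partitioners = new ArrayList<>();
        partitioners.add(new FieldPartitioner<Event>("country", e -> e.country));
        partitioners.add(new FieldPartitioner<Event>("user_bucket", e -> Math.floorMod(e.userId, 4L)));
        return partitioners;
    }

    public static void main(String[] args) {
        List<FieldPartitioner<Event>> partitioners = examplePartitioners();
        // Prints the relative directory each record would be written under.
        System.out.println(partitionPath(new Event(42L, "US"), partitioners));
        System.out.println(partitionPath(new Event(7L, "DE"), partitioners));
    }
}
```

Because the partitioners are just data (a name plus a function), changing the layout means changing the list, not the writing code. That separation is the kind of thing the Data module is described as handling for you; the real API and its on-disk layout may well differ from this sketch.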
The current version of the CDK is 0.2.0, with maintenance releases rolling out monthly, so you should expect rapid evolution as we build toward a 1.0.0 release. What you see today is just the tip of the iceberg — a framework and long-term initiative for bringing more codified best practices, docs, examples, and APIs to developers.
To get a jump-start, take a look at the CDK Data module javadoc.
What You Can Expect
- Features and functionality driven by the collective experience and requirements of Cloudera’s users and partners, as well as its own solution architects
- A fast path to get up and running for the most common use cases
- Frequent releases with new features, bug fixes, and your contributions
- Docs, examples, and guides for all modules
- Well-defined API compatibility guarantees for public APIs
- All open source, all Apache License v2, all the time
Can I contribute to the CDK?
Yes, please! As explained above, we welcome and encourage contributions to the CDK, and look forward to your pull requests.
On the other side of the coin, feel free to fork and modify the CDK for your own purposes, if that’s your desire.
Where do I go for CDK discussion, questions, and help?
For now, we’re going to direct all CDK discussion to the cdk-dev discussion group. If you’re not a member, please join the group at https://groups.google.com/a/cloudera.org/d/forum/cdk-dev.
Where do I file bugs and feature requests?
There’s a dedicated public JIRA project for the CDK.
Where can I see the road map?
The best place to look is the road map view in the CDK JIRA project. All work we do will have public JIRAs so you can see what’s coming and participate.
Eric Sammer is an Engineering Manager at Cloudera and a CDK project co-lead. He is also a Committer/PMC Member on the Apache Flume and Apache MRUnit projects and the author of the O’Reilly book, Hadoop Operations.
Tom White is a Software Engineer at Cloudera and a CDK project co-lead. He is also a Committer/PMC Member on the Apache Avro, Apache Hadoop, and Apache Whirr projects and the author of the O’Reilly book, Hadoop: The Definitive Guide.
To learn more about the CDK, register for this webinar with Eric Sammer airing on May 21, 2013.