Cloudera Developer Blog · Guest Posts
Thanks to Marshall Bockrath-Vandegrift of advanced threat detection/malware company (and CDH user) Damballa for the following post about his Parkour project, which offers libraries for writing MapReduce jobs in Clojure. Parkour has been tested (but is not supported) on CDH 3 and CDH 4.
Clojure is Lisp-family functional programming language which targets the JVM. On the Damballa R&D team, Clojure has become the language of choice for implementing everything from web services to machine learning systems. One of Clojure’s key features for us is that it was designed from the start as an explicitly hosted language, building on rather than replacing the semantics of its underlying platform. Clojure’s mapping from language features to JVM implementation is frequently simpler and clearer even than Java’s.
Parkour is our new Clojure library that carries this philosophy to the Apache Hadoop’s MapReduce platform. Instead of hiding the underlying MapReduce model behind new framework abstractions, Parkour exposes that model with a clear, direct interface. Everything possible in raw Java MapReduce is possible with Parkour, but usually with a fraction of the code.
Our thanks to Databricks, the company behind Apache Spark (incubating), for providing the guest post below. Cloudera and Databricks recently announced that Cloudera will distribute and support Spark in CDH. Look for more posts describing Spark internals and Spark + CDH use cases in the near future.
Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.
Our thanks to Telvis Calhoun, Zach Hanif, and Jason Trost of Endgame for the guest post below about their BinaryPig application for large-scale malware analysis on Apache Hadoop. Endgame uses data science to bring clarity to the digital domain, allowing its federal and commercial partners to sense, discover, and act in real time.
Over the past three years, Endgame received 40 million samples of malware equating to roughly 19TB of binary data. In this, we’re not alone. McAfee reports that it currently receives roughly 100,000 malware samples per day and received roughly 10 million samples in the last quarter of 2012. Its total corpus is estimated to be about 100 million samples. VirusTotal receives between 300,000 and 600,000 unique files per day, and of those roughly one-third to half are positively identified as malware (as of April 9, 2013).
Our thanks to Concurrent Inc. for the how-to below about using Cascading Pattern with CDH. Cloudera recently tested CDH 4.4 with the Cascading Compatibility Test Suite verifying compatibility with Cascading 2.2.
Cascading Pattern is a machine-learning project within the Cascading development framework used to build enterprise data workflows. Cascading provides an abstraction layer on top of Apache Hadoop and other computing topologies that allows enterprises to leverage existing skills and resources to build data processing applications on Hadoop, without the need for specialized Hadoop skills.
Pattern, in particular, leverages an industry standard called Predictive Model Markup Language (PMML), which allows data scientists to leverage their favorite statistical and analytics tools (such as R, SAS, Oracle, and so on) to export predictive models and quickly run them on data sets stored in Hadoop. Pattern’s benefits include reduced development costs, time savings, and reduced licensing issues at scale – all while leveraging Hadoop clusters, core competencies of analytics staff, and existing intellectual property in the predictive models.
Thanks to Victor Bittorf, a visiting graduate computer science student at Stanford University, for the guest post below about how to use the new prebuilt analytic functions for Cloudera Impala.
Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.
Having recently completed my masters degree while working in the database systems group at University of Madison Wisconsin, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.
The following Parquet blog post was originally published by Salesforce.com Lead Engineer and Apache Pig Committer Prashant Kommireddi (@pRaShAnT1784). Prashant has kindly given us permission to re-publish below. Parquet is an open source columnar storage format co-founded by Twitter and Cloudera.
Parquet is a columnar storage format for Apache Hadoop that uses the concept of repetition/definition levels borrowed from Google Dremel. It provides efficient encoding and compression schemes, the efficiency being improved due to application of aforementioned on a per-column basis (compression is better as column values would all be the same type, encoding is better as values within a column could often be the same and repeated). Here is a nice blog post from Julien Le Dem of Twitter describing Parquet internals.
Parquet can be used by any project in the Hadoop ecosystem, there are integrations provided for MR, Pig, Hive, Cascading, and Cloudera Impala.
Our thanks to Kishore Gopalakrishna, staff engineer at LinkedIn and one of the original developers of Apache Helix (incubating), for the introduction below. Cloudera’s Patrick Hunt is a mentor for the project.
With the trend of exploding data growth and the systems in the NoSQL and Big Data space, the number of distributed systems has grown significantly. At LinkedIn, we have built a number of distributed systems over the years. Such systems run on a cluster of multiple servers and need to handle the problems that come with distributed systems. Fault tolerance – that is, availability in the presence of server failures and network problems — is critical to any such system. Horizontal scalability and seamless cluster expansion to handle increasing workloads are also essential properties.
Without a framework that provides these capabilities, developers of distributed platforms have to continually re-implement features that support them. From LinkedIn’s own experiences building various distributed systems and having to solve the same problems repeatedly, we decided to build a generic platform to simply this process – which we eventually called Helix (and which is now in the Apache Incubator).
The guest post below is provided by Justin Langseth, Founder & CEO of Zoomdata, Inc. Thanks, Justin!
What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.
The traditional Apache Hadoop approach — in which you store all your data in HDFS and do batch processing through MapReduce — works well for data geeks and data scientists, who can write MapReduce jobs and wait hours for them to run before asking the next question. But many businesses have never even heard of Hadoop, don’t employ a data scientist, and want their data questions answered in a second or two — not in hours.
The following guest post is re-published here courtesy of Gerd König, a System Engineer with YMC AG. Thanks, Gerd!
Cloudera Manager is a great tool to orchestrate your CDH-based Apache Hadoop cluster. You can use it from cluster installation, deploying configurations, restarting daemons to monitoring each cluster component. Starting with version 4.6, the manager supports the integration of Cloudera Search, which is currently in Beta state. In this post I’ll show you the required steps to set up a Hadoop cluster via Cloudera Manager and how to integrate Cloudera Search.
Furthermore, I want to introduce a small level of automation using the tool Ansible, to avoid performing the same manual steps again and again – thereby being able to replay them whenever you want, or to just use it for any upcoming cluster installations.
The following guest post, from Mike Pittaro of Dell’s Cloud Software Solutions team, describes his team’s use of the Dell Crowbar tool in conjunction with the Cloudera Manager API to automate cluster provisioning. Thanks, Mike!
Deploying, managing, and operating Apache Hadoop clusters can be complex at all levels of the stack, from the hardware on up. To hide this complexity and reduce deployment time, since 2011, Dell has been using Dell Crowbar in conjunction with Cloudera Manager to deploy the Dell | Cloudera Solution for Apache Hadoop for joint customers.
Cloudera Manager does a great job of deploying and managing the Hadoop layers of a cluster, but it depends on an operating system to be in place first. Meanwhile, to complement those capabilities, Dell Crowbar is a complete automated operations platform, designed to deploy layers of infrastructure on bare-metal servers and all the way up the stack.