Cloudera Engineering Blog

Big Data best practices, how-to's, and internals from Cloudera Engineering and the community


New in CDH 5.1: Apache Spark 1.0

Spark 1.0 reflects a lot of hard work from a very diverse community.

Cloudera’s latest platform release, CDH 5.1, includes Apache Spark 1.0, a milestone release for the Spark project that locks down APIs for Spark’s core functionality. The release reflects the work of hundreds of contributors (including our own Diana Carroll, Mark Grover, Ted Malaska, Colin McCabe, Sean Owen, Hari Shreedharan, Marcelo Vanzin, and me).

New in Cloudera Manager 5.1: Direct Active Directory Integration for Kerberos Authentication

With this new release, setting up a separate MIT KDC for cluster authentication services is no longer necessary.

Kerberos (initially developed by MIT in the 1980s) has been adopted by every major component of the Apache Hadoop ecosystem. Consequently, Kerberos has become an integral part of the security infrastructure for the enterprise data hub (EDH).

New in CDH 5.1: Document-level Security for Cloudera Search

Cloudera Search now supports fine-grain access control via document-level security provided by Apache Sentry.

In my previous blog post, you learned about index-level security in Apache Sentry (incubating) and Cloudera Search. Although index-level security is effective when the access control requirements for documents in a collection are homogenous, often administrators want to restrict access to certain subsets of documents in a collection.

New Apache Spark Developer Training: Beyond the Basics

While the new Spark Developer training from Cloudera University is valuable for developers who are new to Big Data, it’s also a great call for MapReduce veterans.

When I set out to learn Apache Spark (which ships inside Cloudera’s open source platform) about six months ago, I started where many other people do: by following the various online tutorials available from UC Berkeley’s AMPLab, the creators of Spark. I quickly developed an appreciation for the elegant, easy-to-use API and super-fast results, and was eager to learn more.

Cloudera Enterprise 5.1 is Now Available

Cloudera Enterprise’s newest release contains important new security and performance features, and offers support for the latest innovations in the open source platform.

We’re pleased to announce the release of Cloudera Enterprise 5.1 (comprising CDH 5.1, Cloudera Manager 5.1, and Cloudera Navigator 2.0).

Jay Kreps, Apache Kafka Architect, Visits Cloudera

It was good to see Jay Kreps (@jaykreps), the LinkedIn engineer who is the tech lead for that company’s online data infrastructure, visit Cloudera Engineering yesterday to spread the good word about Apache Kafka.

Kafka, of course, was originally developed inside LinkedIn and entered the Apache Incubator in 2011. Today, it is being widely adopted as a pub/sub framework that works at massive scale (and which is commonly used to write to Apache Hadoop clusters, and even data warehouses).

The New Hadoop Application Architectures Book is Here!

There’s an important new addition coming to the Apache Hadoop book ecosystem. It’s now in early release!

We are very happy to announce that the new Apache Hadoop book we have been writing for O’Reilly Media, Hadoop Application Architectures, is now available as an early release! It contains the first two chapters and can be found in O’Reilly’s Catalog and via Safari.        

Estimating Financial Risk with Apache Spark

Learn how Spark facilitates the calculation of computationally-intensive statistics such as VaR via the Monte Carlo method.

Under reasonable circumstances, how much money can you expect to lose? The financial statistic value at risk (VaR) seeks to answer this question. Since its development on Wall Street soon after the stock market crash of 1987, VaR has been widely adopted across the financial services industry. Some organizations report the statistic to satisfy regulations, some use it to better understand the risk characteristics of large portfolios, and others compute it before executing trades to help make informed and immediate decisions.

This Month in the Ecosystem (June 2014)

Welcome to our 10th edition of “This Month in the Ecosystem,” a digest of highlights from June 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Pretty busy for early Summer:

Jeff Dean’s Talk at Cloudera

Google’s Jeff Dean — among the original architects of MapReduce, Bigtable, and Spanner — revealed some fascinating facts about Google’s internal environment at Cloudera HQ recently.

Earlier this week, we were pleased to welcome Google Senior Fellow Jeff Dean to Cloudera’s Palo Alto HQ to give an overview of some of his group’s current research. Jeff has a peerless pedigree in distributed computing circles, having been deeply involved in the design and implementation of Google’s original advertising serving system, MapReduce, Bigtable, Spanner, and a host of other projects.

Newer Posts Older Posts