Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
Spark 1.0 reflects a lot of hard work from a very diverse community.
Cloudera’s latest platform release, CDH 5.1, includes Apache Spark 1.0, a milestone release for the Spark project that locks down APIs for Spark’s core functionality. The release reflects the work of hundreds of contributors (including our own Diana Carroll, Mark Grover, Ted Malaska, Colin McCabe, Sean Owen, Hari Shreedharan, Marcelo Vanzin, and me).
With this new release, setting up a separate MIT KDC for cluster authentication services is no longer necessary.
Kerberos (initially developed by MIT in the 1980s) has been adopted by every major component of the Apache Hadoop ecosystem. Consequently, Kerberos has become an integral part of the security infrastructure for the enterprise data hub (EDH).
Cloudera Search now supports fine-grained access control via document-level security provided by Apache Sentry.
In my previous blog post, you learned about index-level security in Apache Sentry (incubating) and Cloudera Search. Although index-level security is effective when the access control requirements for documents in a collection are homogeneous, administrators often want to restrict access to certain subsets of documents in a collection.
While the new Spark Developer training from Cloudera University is valuable for developers who are new to Big Data, it’s also a great call for MapReduce veterans.
When I set out to learn Apache Spark (which ships inside Cloudera’s open source platform) about six months ago, I started where many other people do: by following the various online tutorials available from UC Berkeley’s AMPLab, the creators of Spark. I quickly developed an appreciation for the elegant, easy-to-use API and super-fast results, and was eager to learn more.
Cloudera Enterprise’s newest release contains important new security and performance features, and offers support for the latest innovations in the open source platform.
We’re pleased to announce the release of Cloudera Enterprise 5.1 (comprising CDH 5.1, Cloudera Manager 5.1, and Cloudera Navigator 2.0).
It was good to see Jay Kreps (@jaykreps), the LinkedIn engineer who is the tech lead for that company’s online data infrastructure, visit Cloudera Engineering yesterday to spread the good word about Apache Kafka.
Kafka, of course, was originally developed inside LinkedIn and entered the Apache Incubator in 2011. Today, it is being widely adopted as a pub/sub framework that works at massive scale, and it is commonly used to write to Apache Hadoop clusters and even data warehouses.
There’s an important new addition coming to the Apache Hadoop book ecosystem. It’s now in early release!
We are very happy to announce that the new Apache Hadoop book we have been writing for O’Reilly Media, Hadoop Application Architectures, is now available as an early release! It contains the first two chapters and can be found in O’Reilly’s Catalog and via Safari.
Learn how Spark facilitates the calculation of computationally intensive statistics such as VaR via the Monte Carlo method.
Under reasonable circumstances, how much money can you expect to lose? The financial statistic value at risk (VaR) seeks to answer this question. Since its development on Wall Street soon after the stock market crash of 1987, VaR has been widely adopted across the financial services industry. Some organizations report the statistic to satisfy regulations, some use it to better understand the risk characteristics of large portfolios, and others compute it before executing trades to help make informed and immediate decisions.
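To make the idea concrete, here is a minimal single-machine sketch of a Monte Carlo VaR estimate in plain Python. It is not the Spark implementation the post refers to. It assumes a single portfolio whose one-period return is normally distributed, which is a strong simplification, and the function name `monte_carlo_var` and all of its parameters are hypothetical, chosen only for illustration.

```python
import random


def monte_carlo_var(portfolio_value, mean_return, return_stddev,
                    confidence=0.95, trials=100_000, seed=None):
    """Estimate one-period value at risk by simulation.

    Assumption (not from the original post): returns are drawn from a
    normal distribution with the given mean and standard deviation.
    """
    rng = random.Random(seed)

    # Each trial simulates one possible return and records the resulting
    # loss (a positive number means money lost).
    losses = sorted(
        -portfolio_value * rng.gauss(mean_return, return_stddev)
        for _ in range(trials)
    )

    # VaR at the given confidence level is the loss that is not exceeded
    # in that fraction of the simulated trials.
    index = min(int(confidence * trials), trials - 1)
    return losses[index]


if __name__ == "__main__":
    # Hypothetical inputs: a 1,000,000 portfolio with 1% return volatility.
    var_95 = monte_carlo_var(1_000_000, 0.0, 0.01, confidence=0.95, seed=42)
    print(f"Estimated one-period 95% VaR: {var_95:,.0f}")
```

The trials are independent of one another, which is what makes this kind of computation a natural fit for a parallel framework such as Spark: each worker can simulate its own batch of trials, and only the resulting losses need to be combined.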
Pretty busy for early summer:
Google’s Jeff Dean, among the original architects of MapReduce, Bigtable, and Spanner, revealed some fascinating facts about Google’s internal environment at Cloudera HQ recently.
Earlier this week, we were pleased to welcome Google Senior Fellow Jeff Dean to Cloudera’s Palo Alto HQ to give an overview of some of his group’s current research. Jeff has a peerless pedigree in distributed computing circles, having been deeply involved in the design and implementation of Google’s original advertising serving system, MapReduce, Bigtable, Spanner, and a host of other projects.