The first thing that comes to mind when talking about synergy is how 2+2=5. Being the writer that he is, Mark Twain described it a lot more eloquently as “the bonus that is achieved when things work together harmoniously”. There is a multitude of product and business examples to illustrate the point and I particularly […]
This blog was originally published on Medium The Data Cloud — Powered By Hadoop One key aspect of the Cloudera Data Platform (CDP), which is just beginning to be understood, is how much of a recombinant-evolution it represents, from an architectural standpoint, vis-à-vis Hadoop in its first decade. I’ve been having a blast showing CDP to […]
Apache Hadoop Ozone was designed to address the scale limitation of HDFS with respect to small files and the total number of file system objects. On current data center hardware, HDFS has a limit of about 350 million files and 700 million file system objects. Ozone’s architecture addresses these limitations[4]. This article compares the performance […]
We are all hungry for data. Not just more data… also new types of data so that we can best understand our products, customers, and markets. We are looking for real-time insight on the newest available data in all shapes and sizes, structured and unstructured. We want to embrace the new generation of business and […]
Machine learning (ML) has become one of the most critical capabilities for modern businesses to grow and stay competitive today. From automating internal processes to optimizing the design, creation and marketing processes behind virtually every product consumed, ML models have permeated almost every aspect of our work and personal lives — and for businesses, the […]
The next generation of big data warehousing is being built on transactional tables. Transactions, of course, enable new use cases that require updating, deleting, and merging rows of data. But more importantly, a transaction-centric design enables advanced features such as materialized views, aggressive data caching and efficient replication between warehouses which are critical for modern […]
With every release, Hive’s built-in replication is expanding its territory by improving support for different table types. In this blog post, we will discuss the recent additions i.e. replication of transactional tables (a.k.a ACID tables), external tables and statistics associated with all kinds of tables. Transactional table replication Transactional tables in Hive support ACID properties. […]
The annual ACM SIGMOD/PODS Conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. This year ACM SIGMOD/PODS will be held in Amsterdam, The Netherlands on June 30th – July 5th, 2019, and Cloudera will be present in the conference, contributing […]
HDFS erasure coding (EC), a major feature delivered in Apache Hadoop 3.0, is also available in CDH 6.1 for use in certain applications like Spark, Hive, and MapReduce. The development of EC has been a long collaborative effort across the wider Hadoop community. Including EC with CDH 6.1 helps customers adopt this new feature by […]
Guest blog post written by Adir Mashiach In this post I’ll talk about the problem of Hive tables with a lot of small partitions and files and describe my solution in details. A little background In my organization, we keep a lot of our data in HDFS. Most of it is the raw data but […]