One of the principal features used in analytic databases is table partitioning. This feature is so frequently used because of its ability to significantly reduce query latency by allowing the execution engine to skip reading data that is not necessary for the query. For example, consider a table of events partitioned on the event time using calendar day granularity. If the table contained 2 years of events and a user wanted to find the events for a given 7-day window,
Contributors from Intel, Cloudera, and the rest of the community have been making strong progress on the Hive-on-Spark initiative. This post provides an update.
[Editor’s note (April 20, 2016): Hive-on-Spark is now GA/shipping starting in CDH 5.7.]
Since its inception about one year ago, the community initiative to make Apache Spark a data processing engine for Apache Hive (HIVE-7292) has attracted widespread interest from developers around the world and gone through phases of rapid development,
Starting in Cloudera Enterprise 5.5, Cloudera Navigator offers interactive visual analytics that help answer important questions about the data that’s in your CDH clusters.
The new analytics system in Cloudera Navigator shows the distribution of data along various metadata dimensions and supports interactive filtering and grouping with a simple point-and-click interface. This new functionality a great complement to Cloudera Navigator’s search capabilities and is integrated with Navigator’s policy engine, so you can easily understand the impact of data management policies before applying them to your data.
Cloudera Enterprise 5.5 (comprising CDH 5.5, Cloudera Manager 5.5, and Cloudera Navigator 2.4) has been released.
Cloudera is excited to bring you news of Cloudera Enterprise 5.5. Our persistent emphasis on quality is especially pronounced in this release, with more than 500 issues identified and triaged during its development.
A highlight of this release is the inclusion of Cloudera Navigator Optimizer (available in limited beta for select Cloudera Enterprise customers;
Impala is designed to deliver insight on data in Apache Hadoop in real time. As data often lands in Hadoop continuously in certain use cases (such as time-series analysis, real-time fraud detection, real-time risk detection, and so on), it’s desirable for Impala to query this new “fast” data with minimal delay and without interrupting running queries.
In this blog post, you will learn an approach for continuous loading of data into Impala via HDFS,