Category Archives: CDH

Deploy Cloudera EDH Clusters Like a Boss Revamped – Part 2

Categories: CDH Hadoop HDFS

In Part 1: Infrastructure Considerations in this three part revamped series on deploying clusters like a boss, we provided a general explanation for how nodes are classified, disk layout configurations and network topologies to think about when deploying your clusters.

In this Part 2: Service and Role Layouts segment of the series, we take a step higher up the stack looking at the various services and roles that make up your Cloudera Enterprise deployment.

Read more

Large-Scale Health Data Analytics with OHDSI

Categories: CDH Data Science

Data analytics is increasingly being brought to bear to treat human disease, but as more and more health data is stored in computer databases, one significant challenge is how to perform analyses across these disparate databases. In this post I take a look at the Observational Health Data Sciences and Informatics (or OHDSI, pronounced “Odyssey”) program that was formed to address this challenge, and which today accounts for 1.26 billion patient records collectively stored across 64 databases in 17 countries.

Read more

Faster Performance for Selective Queries

Categories: CDH Impala

One of the principal features used in analytic databases is table partitioning. This feature is so frequently used because of its ability to significantly reduce query latency by allowing the execution engine to skip reading data that is not necessary for the query. For example, consider a table of events partitioned on the event time using calendar day granularity. If the table contained 2 years of events and a user wanted to find the events for a given 7-day window,

Read more

Hadoop Delegation Tokens Explained

Categories: CDH Hadoop HDFS Platform Security & Cybersecurity

Apache Hadoop’s security was designed and implemented around 2009, and has been stabilizing since then. However, due to a lack of documentation around this area, it’s hard to understand or debug when problems arise. Delegation tokens were designed and are widely used in the Hadoop ecosystem as an authentication method. This blog post introduces the concept of Hadoop Delegation Tokens in the context of Hadoop Distributed File System (HDFS) and Hadoop Key Management Server (KMS),

Read more

Apache Impala is now a Top-Level Apache Project

Categories: CDH Hadoop Impala

Five years ago, Cloudera shared with the world our plan to transfer the lessons from decades of relational database research to the Apache Hadoop platform via a new SQL engine — Apache Impala — the first and fastest open source MPP SQL engine for Hadoop.  Impala enabled SQL users to operate on vast amounts of data in open formats, stored on HDFS originally (with Apache Kudu, Amazon S3, and Microsoft ADLS now also native storage options),

Read more