Cloudera Engineering Blog · Impala Posts
Thanks to Chen Song, Data Team Lead at AppNexus, for allowing us to republish the following post about his company’s use case for Apache Parquet (incubating at this writing), the open standard for columnar storage across the Apache Hadoop ecosystem.
At AppNexus, over 2MM log events are ingested into our data pipeline every second. Log records are sent from upstream systems in the form of Protobuf messages. Raw logs are compressed in Snappy when stored on HDFS. That said, even with compression, this still leads to over 25TB of log data collected every day. On top of logs, we also have hundreds of MapReduce jobs that process and generate aggregated data. Collectively, we store petabytes of data in our primary Hadoop cluster.
Cloudera customers can now install, launch, and monitor CDAP directly from Cloudera Manager. This post from Nitin Motgi, Cask CTO, explains how.
Today, Cloudera and Cask are very happy to introduce the integration of Cloudera’s enterprise data hub (EDH) with the Cask Data Application Platform (CDAP). CDAP is an integrated platform for developers and organizations to build, deploy, and manage data applications on Apache Hadoop. This initial integration will enable CDAP to be installed, configured, and managed from within Cloudera Manager, a component of Cloudera Enterprise. Furthermore, it will simplify data ingestion for a variety of data sources, as well as enable interactive queries via Impala. Starting today, you can download and install CDAP directly from Cloudera’s downloads page.
Thanks to Jesus Centeno of Qlik for the post below about using Impala alongside Qlik Sense.
Cloudera and Qlik (which is part of the Impala Accelerator Program) have revolutionized the delivery of insights and value to every business stakeholder for “small data,” to something more powerful in the Big Data world—enabling users to combine Big Data and “small data” to yield actionable business insights.
Thanks to Michael Williams, BIRT Product Evangelist & Forums Manager at analytics software specialist Actuate Corp. (now OpenText), for the guest post below. Actuate is the primary builder and supporter of BIRT, a top-level project of the Eclipse Foundation.
The Actuate (now OpenText) products BIRT Designer Professional and BIRT iHub allow you to connect to multiple data sources to create and deliver meaningful visualizations securely, with scalability reaching millions of users and devices. And now, with Impala emerging as a standard Big Data query engine for many of Actuate’s customers, solid BIRT integration with Impala has become critical.
Strata + Hadoop World San Jose 2015 (Feb. 17-20) is a focal point for learning about production-izing Hadoop.
Strata + Hadoop World sessions have always been indispensable for learning about Hadoop internals, use cases, and admin best practices. When deep learning is needed, however—and deep dives are a necessity if you’re running Hadoop in production, or aspire to—tutorials are your ticket.
Bookmark this new living document to ensure use of current and proper configuration, sizing, management, and measurement practices.
Impala, the open source MPP analytic database for Apache Hadoop, is now firmly entrenched in the Big Data mainstream. How do we know this? For one, Impala is now the standard against which alternatives measure themselves, based on a proliferation of new benchmark testing. Furthermore, Impala has been adopted by multiple vendors as their solution for letting customers do exploratory analysis on Big Data, natively and in place (without the need for redundant architecture or ETL). Also significant, we’re seeing the emergence of best practices and patterns out of customer experiences.
The number of powerful data query tools in the Apache Hadoop ecosystem can be confusing, but understanding a few simple things about your needs usually makes the choice easy.
Ah, the good old days. I recall vividly that in 2007, I was faced to store 1 billion XML documents and make them accessible as well as searchable. I had few choices on a given shoestring budget: build something one my own (it was the rage back then—and still is), use an existing open source database like PostgreSQL or MySQL, or try this thing that Google built successfully and that was now implemented in open source under the Apache umbrella: Hadoop.
Impala authentication can now be handled by a combination of LDAP and Kerberos. Here’s why, and how.
Impala, the open source analytic database for Apache Hadoop, supports authentication—the act of proving you are who you say you are—using both Kerberos and LDAP. Kerberos has been supported since release 1.0, LDAP support was added more recently, and with CDH 5.2, you can use both at the same time.
This new feature, jointly developed by Cloudera and Intel engineers, makes management of role-based security much easier in Apache Hive, Impala, and Hue.
Apache Sentry (incubating) provides centralized authorization for services and applications in the Apache Hadoop ecosystem, allowing administrators to set up granular, role-based protection on resources, and to review them in one place. Previously, Sentry only designated administrators to
REVOKE privileges on an authorizable object. In Apache Sentry 1.5.0 (shipping inside CDH 5.2), we have implemented a new feature (SENTRY-327) that allows admin users to delegate the
GRANT privilege to other users using
WITH GRANT OPTION. If a user has the
GRANT OPTION privilege on a specific resource, the user can now grant the
GRANT privilege to other users on the same resource. Apache Hive, Impala, and Hue have all been updated to take advantage of this new Sentry functionality.
Impala 2.0 is the most SQL-complete/SQL-compatible release yet.
As we reported in the most recent roadmap update (“What’s Next for Impala: Focus on Advanced SQL Functionality”), more complete SQL functionality (and better SQL compatibility with other vendor extensions) is a major theme in Impala 2.0.