Cloudera Engineering Blog · How-to Posts
Learn how to read FIX message files directly with Hive, create a view to simplify user queries, and use a flattened Apache Parquet table to enable fast user queries with Impala.
The Financial Information eXchange (FIX) protocol is used widely by the financial services industry to communicate various trading-related activities. Each FIX message is a record that represents an action by a financial party, such as a new order or an execution report. As the raw point of truth for much of the trading activity of a financial firm, it makes sense that FIX messages are an obvious data source for analytics and reporting in Apache Hadoop.
Installing Cloudera Navigator Encrypt on SUSE is a one-off process, but we have you covered with this how-to.
Cloudera Navigator Encrypt, which is integrated with Cloudera Navigator governance software, provides massively scalable, high-performance encryption for critical Apache Hadoop data. It leverages industry-standard AES-256 encryption and provides a transparent layer between the application and filesystem. Navigator Encrypt also includes process-based access controls, allowing authorized Hadoop processes to access encrypted data, while simultaneously preventing admins or super-users like root from accessing data that they don’t need to see.
The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization.
Apache Spark is rising in popularity as an alternative to MapReduce, in a large part due to its expressive API for complex data processing. A few months ago, my colleague, Sean Owen wrote a post describing how to translate functionality from MapReduce into Spark, and in this post, I’ll extend that conversation to cover additional functionality.
Learn how to set up Hue, the open source GUI that makes Apache Hadoop easier to use, on your Mac.
You might have already all the prerequisites installed but we are going to show how to start from a fresh Yosemite (10.10) install and end up with running Hue on your Mac in almost no time!
In the conclusion to this series, learn how resource tuning, parallelism, and data representation affect Spark job performance.
In this post, we’ll finish what we started in “How to Tune Your Apache Spark Jobs (Part 1)”. I’ll try to cover pretty much everything you could care to know about making a Spark program run fast. In particular, you’ll learn about resource tuning, or configuring Spark to take advantage of everything the cluster has to offer. Then we’ll move to tuning parallelism, the most difficult as well as most important parameter in job performance. Finally, you’ll learn about representing the data itself, in the on-disk form which Spark will read (spoiler alert: use Apache Avro or Apache Parquet) as well as the in-memory format it takes as it’s cached or moves through the system.
Tuning Resource Allocation
Use the scripts and screenshots below to configure a Kerberized cluster in minutes.
Kerberos is the foundation of securing your Apache Hadoop cluster. With Kerberos enabled, user authentication is required. Once users are authenticated, you can use projects like Apache Sentry (incubating) for role-based access control via GRANT/REVOKE statements.
Set up your own, or even a shared, environment for doing interactive analysis of time-series data.
Although software engineering offers several methods and approaches to produce robust and reliable components, a more lightweight and flexible approach is required for data analysts—who do not build “products” per se but still need high-quality tools and components. Thus, recently, I tried to find a way to re-use existing libraries and datasets stored already in HDFS with Apache Spark.
Learn techniques for tuning your Apache Spark jobs for optimal efficiency.
(Editor’s note: Sandy presents on “Estimating Financial Risk with Spark” at Spark Summit East on March 18.)
Providing Hadoop-as-a-Service to your internal users can be a major operational advantage.
Cloudera Director (free to download and use) is designed for easy, on-demand provisioning of Apache Hadoop clusters in Amazon Web Services (AWS) environments, with support for other cloud environments in the works. It allows for provisioning clusters in accordance with the Cloudera AWS Reference Architecture.
Cloudera recently announced formal support for Apache Kafka. This simple use case illustrates how to make web log analysis, powered in part by Kafka, one of your first steps in a pervasive analytics journey.
If you are not looking at your company’s operational logs, then you are at a competitive disadvantage in your industry. Web server logs, application logs, and system logs are all valuable sources of operational intelligence, uncovering potential revenue opportunities and helping drive down the bottom line. Whether your firm is an advertising agency that analyzes clickstream logs for customer insight, or you are responsible for protecting the firm’s information assets by preventing cyber-security threats, you should strive to get the most value from your data as soon as possible.