Category Archives: General

How-to: Index and Search Multilingual Documents in Hadoop

Categories: General Guest Search

Learn how to use Cloudera Search along with RBL-JE to search and index documents in multiple languages.

Our thanks to Basis Technology for providing the how-to below!

Basis Technology’s Rosette Base Linguistics for Java (RBL-JE) provides a comprehensive multilingual text analytics platform for improving search precision and recall. RBL provides tokenization, lemmatization, POS tagging, and de-compounding for Asian, European, Nordic, and Middle Eastern languages, and has just been certified for use with Cloudera Search.

Read More

How-to: Write and Run Apache Giraph Jobs on Apache Hadoop

Categories: General Graph Processing Hadoop

Create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets.

Read More

Impala Performance Update: Now Reaching DBMS-Class Speed

Categories: General Hive Impala Parquet

Impala’s speed now beats the fastest SQL-on-Hadoop alternatives. Test for yourself!

Since the initial beta release of Cloudera Impala more than one year ago (October 2012), we’ve been committed to regularly updating you about its evolution into the standard for running interactive SQL queries across data in Apache Hadoop and Hadoop-based enterprise data hubs. To briefly recap where we are today:

  • Impala is being widely adopted.

Read More

Doing DevOps with Cloudera Manager

Categories: Cloudera Manager General Ops and DevOps

More and more customers are using automation/configuration management frameworks alongside Cloudera Manager.

As Apache Hadoop clusters continue to grow in size, complexity, and business importance as the foundational infrastructure for an Enterprise Data Hub, the use cases for a robust and mature management console expand. 

Dev Ops

As those clusters become larger and more complex, many operators are looking to use configuration management/automation frameworks like Ansible,

Read More

Migrating to MapReduce 2 on YARN (For Operators)

Categories: General Hadoop MapReduce YARN

Cloudera Manager lets you add a YARN service in the same way you would add any other Cloudera Manager-managed service.

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala;

Read More