Cloudera Engineering Blog · Ops And DevOps Posts
Get an overview of the available mechanisms for backing up data stored in Apache HBase, and how to restore that data in the event of various data recovery/failover scenarios
With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.
StackIQ takes a “software defined infrastructure” approach to provision and manage cluster infrastructure that sits below Big Data platforms such as Apache Hadoop. In the guest post below, StackIQ co-founder and VP Engineering Greg Bruno explains how to install Cloudera Enterprise on top of StackIQ’s management system so they can work together.
The hardware used for this deployment is a small cluster: one node (i.e. one server) for the StackIQ Cluster Manager and four nodes as backend/data nodes. Each node has two disks and all nodes are connected via 1Gb Ethernet on a Private Network. The Cluster Manager node is also connected to a Public Network using its second NIC. (StackIQ Cluster Manager is used in similar deployments between two nodes and 4,000+ nodes in size.)
The following guest post is re-published here courtesy of Gerd König, a System Engineer with YMC AG. Thanks, Gerd!
Cloudera Manager is a great tool to orchestrate your CDH-based Apache Hadoop cluster. You can use it from cluster installation, deploying configurations, restarting daemons to monitoring each cluster component. Starting with version 4.6, the manager supports the integration of Cloudera Search, which is currently in Beta state. In this post I’ll show you the required steps to set up a Hadoop cluster via Cloudera Manager and how to integrate Cloudera Search.
At Cloudera, we believe that Cloudera Manager is the best way to install, configure, manage, and monitor your Apache Hadoop stack. Of course, most users prefer not to take our word for it — they want to know how Cloudera Manager works under the covers, first.
In this post, I’ll explain some of its inner workings.
The Vocabulary of Cloudera Manager
One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. In this post, I’ll describe an approach to cluster automation that works for us, as well as many of our customers and partners.
At Cloudera engineering, we have a big support matrix: We work on many versions of CDH (multiple release trains, plus things like rolling upgrade testing), and CDH works across a wide variety of OS distros (RHEL 5 & 6, Ubuntu Precise & Lucid, Debian Squeeze, and SLES 11), and complex configuration combinations — highly available HDFS or simple HDFS, Kerberized or non-secure, using YARN or MR1 as the execution framework, etc. Clearly, we need an easy way to spin-up a new cluster that has the desired setup, which we can subsequently use for integration, testing, customer support, demos, and so on.
Vagrant is a very nice tool for programmatically managing many virtual machines (VMs) on a single physical machine. It natively supports VirtualBox and also provides plugins for VMware Fusion and Amazon EC2, supporting the management of VMs in those environments as well.
Editor’s Note (added Feb. 25, 2015): For releases beyond 4.5, Cloudera recommends the use of Cloudera Director for deploying CDH in cloud environments.
Cloudera Manager includes a new express installation wizard for Amazon Web Services (AWS) EC2. Its goal is to enable Cloudera Manager users to provision CDH clusters and Cloudera Impala (the open source distributed query engine for Apache Hadoop) on EC2 as easily as possible (for testing and development purposes only, not supported for production workloads) - and thus is currently the fastest way to provision a Cloudera Manager-managed cluster in EC2.
I am very pleased to announce the availability of Cloudera Manager 4.1. This release adds support for the Cloudera Impala beta release, and management and monitoring of key CDH features.
Here are the highlights of Cloudera Manager 4.1:
This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).
Most western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.
API access was a new feature introduced in Cloudera Manager 4.0 (download free edition here.). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.
Cloudera Manager API Basics
The CM API is an HTTP REST API, using JSON serialization. The API is served on the same host and port as the CM web UI, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Web UI. API users have the same privileges as they do in the web UI world.