Tag Archives: secondarynamenode

A Guide to Checkpointing in Hadoop

Categories: Hadoop HDFS Ops and DevOps

Understanding how checkpointing works in HDFS can make the difference between a healthy cluster and a failing one.

Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health. However, checkpointing can also be a source of confusion for operators of Apache Hadoop clusters.

In this post, I’ll explain the purpose of checkpointing in HDFS,

Read more

What is Hadoop Metrics2?

Categories: CDH Hadoop

Metrics are collections of information about Hadoop daemons, events, and measurements; for example, DataNodes collect metrics such as the number of blocks replicated and the number of read requests from clients. Metrics are therefore an invaluable resource for monitoring Apache Hadoop services and an indispensable tool for debugging system problems.

This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel,

Read more

How-to: Automate Your Cluster with Cloudera Manager API

Categories: Cloudera Manager Hadoop How-to MapReduce Ops and DevOps Tools

API access was a new feature introduced in Cloudera Manager 4.0. Although not visible in the UI, this feature is very powerful: it provides programmatic access to cluster operations (such as configuration and restart) and to monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.
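To give a feel for what "programmatic access" means here, the following is a minimal sketch of a read-only call that lists the clusters a CM server knows about. It is not taken from the article: the host name, the credentials, the default port 7180, the `v1` API version, and both helper function names are assumptions made for illustration, so adjust them to your own deployment.

```python
import base64
import json
import urllib.request


def build_clusters_request(host, user, password, port=7180, api_version="v1"):
    """Build (but do not send) a GET request for the CM API cluster list.

    Hypothetical helper: host, port, and api_version are assumptions,
    not values taken from the article.
    """
    url = f"http://{host}:{port}/api/{api_version}/clusters"
    # The CM API uses HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/json")
    return req


def list_cluster_names(req):
    """Send the request and return the cluster names from the JSON reply.

    Assumes the reply is a JSON object with an "items" list whose
    entries each carry a "name" field.
    """
    with urllib.request.urlopen(req, timeout=10) as resp:
        payload = json.load(resp)
    return [item["name"] for item in payload.get("items", [])]


if __name__ == "__main__":
    # Placeholder host and credentials; replace with real values.
    request = build_clusters_request("cm-host.example.com", "admin", "admin")
    print(request.full_url)
    # Uncomment when a CM server is reachable:
    # print(list_cluster_names(request))
```

Building the request separately from sending it keeps the URL and header logic easy to inspect without a live server.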

Cloudera Manager API Basics

The CM API is an HTTP REST API,

Read more

Processing Rat Brain Neuronal Signals Using an Apache Hadoop Computing Cluster – Part II

Categories: Guest Hadoop Hive Use Case

Background

As mentioned in Part I, although Apache Hadoop and other Big Data technologies are typically applied to I/O-intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU-intensive workloads. In this work, we used Hadoop and Hive to perform digital signal processing on individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation,

Read more

Apache Hadoop Log Files: Where to find them in CDH, and what info they contain

Categories: Hadoop

Apache Hadoop’s JobTracker, NameNode, Secondary NameNode, DataNode, and TaskTracker all generate logs. These include logs from each daemon under normal operation, as well as configuration logs, statistics, standard error, standard out, and internal diagnostic information. Many users aren’t entirely sure how these logs differ, how to analyze them, or even how to handle simple administrative tasks like log rotation. This blog post describes each category of log and then details where the logs can be found for each Hadoop component.

Read more