Category Archives: HBase

implyr: R Interface for Apache Impala

Categories: CDH Data Science HBase HDFS Impala Kudu Tools

New R package implyr enables R users to query Impala using dplyr.

Apache Impala (incubating) enables low-latency interactive SQL queries on data stored in HDFS, Amazon S3, Apache Kudu, and Apache HBase. With the availability of the R package implyr on CRAN and GitHub, it’s now possible to query Impala from R using the popular package dplyr.

dplyr provides a grammar of data manipulation,

Read more

Introducing Apache HBase Medium Object Storage (MOB) compaction partition policies

Categories: HBase

Introduction

The Apache HBase Medium Object Storage (MOB) feature was introduced by HBASE-11339. This feature improves low-latency read and write access for moderately sized values (ideally from 100 KB to 10 MB, based on our testing results), making it well suited for storing documents, images, and other moderately sized objects [1]. The Apache HBase MOB feature achieves this improvement by separating the IO paths for file references and MOB objects and applying different compaction policies to MOBs, which reduces the write amplification created by HBase’s compactions.

Read more

Offheap Read-Path in Production – The Alibaba story

Categories: Hadoop HBase Performance Use Case

This article is syndicated with permission from the Apache HBase blog and highlights a collaboration between our partners at Intel and Alibaba engineering in time for “Singles Day”, the biggest shopping day on the net. For more on HBase, mark your calendars! On June 12, 2017, the Apache HBase community will be hosting its annual HBaseCon.

Introduction

HBase is the core storage system in Alibaba’s Search Infrastructure.

Read more

Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

Categories: Avro Guest Hadoop HBase Kudu Parquet

Zbigniew Baranowski is a database systems specialist and a member of a group that provides and supports central database and Hadoop-based services at CERN. This post was originally published on CERN’s “Databases at CERN” blog, and is syndicated here with CERN’s permission.

TOPIC

This post presents a performance comparison of a few popular data formats and storage engines available in the Apache Hadoop ecosystem: Apache Avro,

Read more

New Study: Evaluating Apache HBase Performance on Modern Storage Media

Categories: Guest Hardware HBase Performance

For the first time, this new study by Intel software engineers analyzes the performance impact of using Apache HBase on various modern storage technologies.

As more “fast” storage technologies (such as SSD and NVMe SSD) emerge, organizations with big data use cases want to make better use of them to achieve higher throughput and lower latency. But to this point, no detailed analyses have been published about the true significance of that performance boost, nor about how to best mix fast and “slow”

Read more