This blog discusses answers to questions like what is the right disk size in datanode and what is the right capacity for a datanode. A few of our customers have asked us about using dense storage nodes. It is certainly possible to use dense nodes for archival storage because IO bandwidth requirements are usually lower […]
Hello world, it’s been a while! We are super excited today to announce the open-sourcing of one of the exciting new projects we’ve been working behind the scenes at the intersection of big-data and computation platforms – YuniKorn! Yunikorn is a new standalone universal resource-scheduler responsible for allocating/managing resources for big-data workloads including batch jobs and […]
Background Authentication is a basic security requirement for any computing environment. In simple terms, users and services must prove their identity (authenticate) to the system before they can use system features. Kerberos provides strong authentication which is used in the exchange between requesting user or process and service during authentication. When a user authenticates to […]
HDFS erasure coding (EC), a major feature delivered in Apache Hadoop 3.0, is also available in CDH 6.1 for use in certain applications like Spark, Hive, and MapReduce. The development of EC has been a long collaborative effort across the wider Hadoop community. Including EC with CDH 6.1 helps customers adopt this new feature by […]
Small files are a common challenge in the Apache Hadoop world and when not handled with care, they can lead to a number of complications. The Apache Hadoop Distributed File System (HDFS) was developed to store and process large data sets over the range of terabytes and petabytes. However, HDFS stores small files inefficiently, leading […]
Guest blog post written by Adir Mashiach In this post I’ll talk about the problem of Hive tables with a lot of small partitions and files and describe my solution in details. A little background In my organization, we keep a lot of our data in HDFS. Most of it is the raw data but […]
This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. (This Blogpost is coauthored by Xun Liu and Quan Zhou from Netease). Introduction Hadoop is the most popular open source framework for the distributed processing of large, enterprise data sets. It is heavily […]
This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. Special thanks to Bill Preachuk and Brandon Wilson for reviewing and providing their expertise Introduction Columnar storage is an often-discussed topic in the big data processing and storage world today – there are […]
This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. Hortonworks announced the general availability of HDP 3.0 this year. You may read more about it here. Bundled with HDP 3.0, Apache Hive 3 with LLAP took a significant leap as a Enterprise […]
This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. According to a recent study from Marketwatch, the Hadoop market is expected to exceed more than $50.0 billion by 2022. The global Hadoop market is positioned for staggering growth in the upcoming years. […]