Apache Hadoop Archives - Page 3 of 12

February 6, 2020 | Technical

Disk and Datanode Size in HDFS

This blog answers questions such as what the right disk size is for a datanode and what the right capacity is for a datanode. A few of our customers have asked us about using dense storage nodes. It is certainly possible to use dense nodes for archival storage because IO bandwidth requirements are usually lower […]

by Lokesh Jain 3 min read

July 17, 2019 | Technical

YuniKorn: a universal resources scheduler

Hello world, it’s been a while! We are super excited today to announce the open-sourcing of one of the exciting new projects we’ve been working on behind the scenes at the intersection of big-data and computation platforms – YuniKorn! YuniKorn is a new standalone universal resource-scheduler responsible for allocating/managing resources for big-data workloads including batch jobs and […]

by WeiWei Yang, Wangda Tan, Vinod Kumar Vavilapalli, Sunil Govindan, Wilfred Spiegelenburg 4 min read

Apache Hadoop Apache Yarn Cloud

July 11, 2019 | Technical

Best Practices Guide for Systems Security Services Daemon Configuration and Installation – Part 1

Background: Authentication is a basic security requirement for any computing environment. In simple terms, users and services must prove their identity (authenticate) to the system before they can use system features. Kerberos provides strong authentication, which is used in the exchange between the requesting user or process and the service during authentication. When a user authenticates to […]

by Gabor Roczei 13 min read

Apache Hadoop Security, Risk, & Compliance

June 10, 2019 | Technical

HDFS Erasure Coding in Production

HDFS erasure coding (EC), a major feature delivered in Apache Hadoop 3.0, is also available in CDH 6.1 for use in certain applications like Spark, Hive, and MapReduce. The development of EC has been a long collaborative effort across the wider Hadoop community. Including EC with CDH 6.1 helps customers adopt this new feature by […]

by Kitti Nansai, Xiao Chen, Sammi Chen, Jian Zhang 15 min read

Apache Hadoop Apache Hive Apache Spark MapReduce Cloudera Enterprise

May 9, 2019 | Technical

Small Files, Big Foils: Addressing the Associated Metadata and Application Challenges

Small files are a common challenge in the Apache Hadoop world and when not handled with care, they can lead to a number of complications. The Apache Hadoop Distributed File System (HDFS) was developed to store and process large data sets over the range of terabytes and petabytes. However, HDFS stores small files inefficiently, leading […]

by Shashank Naik, Bhagya Gummalla 11 min read

Apache Hadoop Apache HDFS

May 7, 2019 | Technical

Partition Management in Hadoop

Guest blog post written by Adir Mashiach. In this post I’ll talk about the problem of Hive tables with a lot of small partitions and files and describe my solution in detail. A little background: In my organization, we keep a lot of our data in HDFS. Most of it is the raw data but […]

by Cloudera 8 min read

Apache Hadoop Apache Hive

December 20, 2018 | Technical

{Submarine} : Running deep learning workloads on Apache Hadoop

This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. (This blog post is coauthored by Xun Liu and Quan Zhou from Netease.) Introduction: Hadoop is the most popular open source framework for the distributed processing of large, enterprise data sets. It is heavily […]

by Wangda Tan 8 min read

Apache Hadoop Apache Yarn Hortonworks Data Platform

December 18, 2018 | Technical

Big Data Processing Engines – Which one do I use?: Part 1

This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. Special thanks to Bill Preachuk and Brandon Wilson for reviewing and providing their expertise. Introduction: Columnar storage is an often-discussed topic in the big data processing and storage world today – there are […]

by Ashish Narasimham 9 min read

Apache Druid Apache Hadoop Apache HBase Apache Hive Apache Phoenix Hortonworks Data Platform

December 17, 2018 | Technical

2x Faster BI Interactive queries with HDP 3.0

This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. Hortonworks announced the general availability of HDP 3.0 this year. You may read more about it here. Bundled with HDP 3.0, Apache Hive 3 with LLAP took a significant leap as an Enterprise […]

by Nita Dembla 5 min read

Apache Hadoop Customer Analytics

November 6, 2018 | Business

Apache Hadoop is Thriving!

This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. According to a recent study from Marketwatch, the Hadoop market is expected to exceed $50.0 billion by 2022. The global Hadoop market is positioned for staggering growth in the upcoming years. […]

by Roni Fontaine 4 min read

Apache Hadoop
