Apache Impala (Incubating) vs. Amazon Redshift: S3 Integration, Elasticity, Agility, and Cost-Performance Benefits on AWS

Categories: Cloud Impala Performance

As measured across multiple dimensions (see analysis below), Impala provides a better cloud-native experience than Redshift for a number of common use cases.

Impala 2.6 brings read/write support for Amazon S3, providing cloud capabilities such as direct querying of data in S3, elastic scaling of compute, and seamless data portability and flexibility that are unique among cloud-based analytic databases. With more and more users looking to deploy and run in public-cloud environments,

Read More

Getting to Know the Apache Hadoop 3 Alpha

Categories: CDH Hadoop

The Apache Hadoop project recently announced its 3.0.0-alpha1 release.

Given the scope of a new major release, the Apache Hadoop community decided to publish a series of alpha and beta releases leading up to 3.0.0 GA. This gives downstream applications and end users an opportunity to test the changes and provide feedback, which can be incorporated during the alpha and beta process.

The 3.0.0-alpha1 release incorporates thousands of new fixes,

Read More

Apache Impala (Incubating) on Amazon: Performance and Cost Considerations for S3 vs. EBS

Categories: Cloud Impala Performance

The benchmark testing results detailed below can help you make an informed decision about AWS storage options for Impala.

In a recent post, you learned how Impala 2.6 on S3 delivers cloud-native features unmatched by other analytic databases in the cloud. With support for reading and writing data in Amazon S3, Impala provides cloud capabilities such as direct querying of data in S3, elastic scaling of compute, and seamless data portability and flexibility not found in other cloud-based analytic databases,

Read More

Solving Real-Life Mysteries with Big Data and Apache Spark

Categories: Data Science Spark

Can using simple statistical techniques in combination with big data help solve the Tamam Shud mystery?

Everyone loves a good real-life mystery. That’s why the three most popular TV shows of the 80s and 90s were Jack Palance’s reboot of Ripley’s Believe It or Not!, Unsolved Mysteries with Robert Stack, and Beyond Belief: Fact or Fiction hosted by Commander Riker.

Read More

Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group

Categories: Data Ingestion Guest Hadoop

In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality.

With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS/curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose.

Read More