Tag Archives: Hive

RecordService: For Fine-Grained Security Enforcement Across the Hadoop Ecosystem

Categories: Hadoop, Impala, Platform Security & Cybersecurity, Sentry

This new core security layer provides a unified data access path for all Hadoop ecosystem components, while improving performance.

We’re thrilled to announce the beta availability of RecordService, a distributed, scalable data access service for unified access control and enforcement in Apache Hadoop. RecordService is Apache-licensed open source software that we intend to transition to the Apache Software Foundation. In this post, we’ll explain the motivation, system architecture,

Read more

How-to: Prepare Unstructured Data in Impala for Analysis

Categories: How-to, Impala

Learn how to build an Impala table around data that comes from non-Impala, or even non-SQL, sources.

As data pipelines start to include more elements such as NoSQL stores or loosely specified schemas, you might encounter data files (particularly in Apache Parquet format) for which you do not know the precise table definition. This tutorial shows how you can build an Impala table around data that comes from non-Impala or even non-SQL sources,

Read more

Meet Cloudera’s Apache Spark Committers

Categories: Community, General, Meet the Engineer, Spark

The super-active Apache Spark community is exerting a strong gravitational pull within the Apache Hadoop ecosystem. I recently had the opportunity to ask Cloudera’s Apache Spark committers (Sean Owen, Imran Rashid [PMC], Sandy Ryza, and Marcelo Vanzin) for their perspectives on how the Spark community has worked and is working together, and on the work to be done via the One Platform initiative to make the Spark stack enterprise-ready.

Recently, Apache Spark has become the most active project in the Apache Hadoop ecosystem (measured by number of contributors/commits over time),

Read more

Using Apache Spark for Massively Parallel NLP at TripAdvisor

Categories: Guest, Spark, Use Case

Thanks to Jeff Palmucci, Director of Machine Learning at TripAdvisor, for permission to republish the following post (which originally appeared on TripAdvisor’s Engineering/Operations blog).

Here at TripAdvisor we have a lot of reviews, several hundred million according to the last announcement. I work with machine learning, and one thing we love in machine learning is putting lots of data to use.

I’ve been working on an interesting problem lately and I’d like to tell you about it.

Read more

Cloudera Engineering Interns Got Talent

Categories: Careers, Cloudera Life, Spark

As is their custom, Cloudera Engineering’s interns made innovation, especially for Apache Spark, the theme of the summer season.

Cloudera has a long tradition of searching far and wide for the smartest summer engineering interns it can find. Alumni of the program have become start-up co-founders, faculty at top-tier CS departments, and employees at other prominent technology companies (including Google, Databricks, Uber, and LinkedIn); many are current employees at Cloudera.

Read more