Recently, GoDataDriven installed a Cloudera Enterprise (CDH + Cloudera Manager) cluster on Microsoft Azure. This two-part series, written by Alexander Bij and Tünde Alkemade and republished with permission, includes information about use case, design, and installation.
Processing large amounts of unstructured data requires serious computing power and also maintenance effort. As load on computing power typically fluctuates due to time and seasonal influences and/or processes running on certain times, a cloud solution like Microsoft Azure is a good option to be able to scale up easily and pay only for what is actually used.
Our thanks to Manuel Spezzani, Indyco Technical Leader, and Edward William Gnudi, Indyco’s Chief of Customer Happiness, for the guest post below about using Indyco alongside Apache Impala.
In this post, you will learn how to automatically design a complete data warehouse solution on top of Impala using Indyco, a tool for designing, exploring, and understand your business model (recently named Cloudera Certificated Partner for the Impala platform).
Thanks to Michal Malohlava, Amy Wang, and Avni Wadhwa of H20.ai for providing the following guest post about building ML apps using Sparkling Water and Apache Spark on CDH.
The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark.
Thanks to Holden Karau (@holdenkarau), Software Engineer at Alpine Data Labs (also a Spark contributor and book author), for providing the following post about her work on new base classes for testing Apache Spark programs.
Testing in the world of Apache Spark has often involved a lot of hand-rolled artisanal code, which frankly is a good way to ensure that developers write as few tests as possible. I’ve been doing some work with Spark Testing Base (also available on Spark Packages) to try and make testing Spark jobs as easy as “normal”
Thanks to Jeff Palmucci, Director of Machine Learning at TripAdvisor, for permission to republish the following (originally appeared in TripAdvisor’s Engineering/Operations blog).
Here at TripAdvisor we have a lot of reviews, several hundred million according to the last announcement. I work with machine learning, and one thing we love in machine learning is putting lots of data to use.
I’ve been working on an interesting problem lately and I’d like to tell you about it.