Category Archives: How-to

How to Distribute your R code with sparklyr and Cloudera Data Science Workbench

Categories: CDH How-to Spark

sparklyr is a great opportunity for R users to leverage the distributed computation power of Apache Spark without a lot of additional learning. sparklyr acts as the backend of dplyr so that R users can write almost the same code for both local and distributed calculation over Spark SQL.

 

Since sparklyr v0.6, we can run R code across our Spark cluster with spark_apply().

Read more

Customizing Docker Images in Cloudera Data Science Workbench

Categories: Altus CDH Cloud Data Science How-to Tools

This article shows how to build and publish a customized Docker image for usage as an engine in Cloudera Data Science Workbench. Such an image or engine customization gives you the benefit of being able to work with your favorite tool chain inside the web based application.

Motivation:

Cloudera Data Science Workbench (CDSW) enables data scientists to use their favorite tools such as R, Python, or Scala based libraries out of the box in an isolated secure sandbox environment.

Read more

Accessing Secure Cluster from Web Applications

Categories: CDH Hadoop How-to

As customers use Apache Hadoop clusters in ways other than through HUE and Hadoop Command Line Interface (CLI) and integrate it closely with the applications they develop, we often get asked how to access their secure Hadoop cluster from within the custom applications. Many customers use a service account in their application and access the cluster with a fixed service account. However, other customers would like to access as the end users who have authenticated to the application.

Read more

Implementing Temporal Graphs with Apache TinkerPop and HGraphDB

Categories: Graph Processing Hadoop HBase How-to

When most people think of Big Data, often they imagine loads of unstructured data. However, there is always some sort of structure or relationships within this data. Based on these relationships there are one or more representation schemes best suited to handle this type of data. A common pattern seen in the field is hierarchy/relationship representation. This form of representation is adept in handling scenarios like complex business models, chain of event or plans, chain of stock orders in banks,

Read more

Quicker Insight into Apache Solr and Collection Health

Categories: CDH Cloudera Manager How-to Search

Successful cluster administration can be very difficult without a real-time view of the state of the cluster. Solr itself does not provide aggregated views about its state or any historical usage data, which is necessary to understand how the service is used and how it is performing. Knowing the throughput and capacities not only helps detect errors and troubleshoot issues, but is also useful for capacity planning.

Questions may arise, such as:

  • What is the size of my cluster and each collection?

Read more