Every Wednesday we spend 15 minutes demoing our Cloudera Data Platform (CDP) as well as individual experiences, such as Machine Learning and Streaming. We keep these demos short while maximizing value by focusing on the product within the context of a specific use case (watch on-demand).
The CDP Weekly Demo series focuses on six topics within the Connect the Data Lifecycle theme:
- Multistage Data Pipelines with Cloudera Data Platform (CDP)
- Security & Governance with Cloudera Shared Data Experience (SDX)
- Streaming Data with Cloudera DataFlow (CDF) [sample]
- Enterprise Machine Learning with Cloudera Machine Learning (CML) [sample]
- Analytics with Cloudera Data Warehouse (CDW) [sample]
- Application Development with Cloudera Operational Database (COD)
Given how short these sessions are, we don't have time to answer every question live, so we answer the remaining ones in this post.
Cloudera Data Platform (CDP)
Q: Which clouds is CDP Public Cloud supported on?
A: AWS, Azure and, soon, GCP.
Q: What’s the difference between Cloudera’s product and the cloud providers’ own offerings?
A: Cloudera Data Platform (CDP) is a Platform-as-a-Service (PaaS) that is cloud infrastructure agnostic and easily portable between multiple cloud providers, including private cloud solutions such as OpenShift.
Q: How do CDP experiences compare to solutions from other cloud service providers?
A: CDP has an SDX layer that stores all policies and metadata for security and governance. This preservation of state is the big differentiating factor, especially when running transient workloads and a variety of experiences. The SDX layer is present across the entire data lifecycle.
Security and Governance (SDX)
Q: Does SDX provide governance for all the data in the cluster that’s in the cloud?
A: Yes. Any cluster that has been built with CDP will have governance applied to it, regardless of whether it’s deployed to a public or private cloud.
Q: How do you set up SDX?
A: SDX is set up automatically when you provision an environment, with wire and at-rest encryption preconfigured. Technical metadata management functionality is also set up automatically. Business metadata and data policies must be implemented according to the customer’s context and requirements.
Q: How do you get an SDX license? How much does it cost?
A: At present, SDX is part of CDP and not licensed separately.
Cloudera DataFlow (CDF)
Q: What is the difference between Cloudera’s NiFi and Apache NiFi?
A: Cloudera Flow Management (CFM) is based on Apache NiFi but comes with all the additional platform integration you’ve just seen in the demo: we make sure it works with CDP’s identity management and integrates with Apache Ranger and Apache Atlas. The original creators of Apache NiFi work for Cloudera.
Q: What types of read/write data does NiFi support?
A: NiFi ships with over 300 processors covering a wide range of sources and destinations.
Q: Do you support cloud native data sources and sinks?
A: Yes, we support many cloud native sources and sinks, with dedicated processors for AWS, Azure, and GCP. These let you interact with the cloud providers’ managed services and object storage solutions (S3, GCS, ADLS, Blob Storage, Event Hubs, Kinesis, Pub/Sub, BigQuery, etc.).
Cloudera Machine Learning (CML)
Q: Is there a level of programming required for a data scientist to use this platform? What languages can developers use?
A: CML enables data scientists to write code in Python, R, or Scala in their editor of choice. Beginner data scientists can easily run sample code in the workbench, and more experienced data scientists can leverage open source libraries for more complex workloads.
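As an illustration of how simple workbench code can be, here is a hypothetical beginner-level Python snippet (the toy dataset is invented for this sketch; a real project would typically read data from the data lake):

```python
import statistics

# Hypothetical toy dataset, purely for illustration.
monthly_sales = [120, 135, 128, 150, 142]

mean = statistics.mean(monthly_sales)
stdev = statistics.stdev(monthly_sales)
print(f"mean={mean:.1f}, stdev={stdev:.1f}")
```

Because sessions run standard interpreters, the same code works unchanged whether it is pasted into the workbench or committed as part of a larger project.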
Q: Can you run SQL-like queries? E.g. with Spark SQL?
A: Yes, Spark SQL can be run from CML.
Q: Do pre-built models come out of the box?
A: While CML does not have a built-in library of pre-built models, it will soon come with reusable and customizable Applied Machine Learning Prototypes. These prototypes are fully built ML projects with all the code, models, and applications needed to leverage best practices and novel algorithms. Additionally, CML is a platform on which you can use the ML libraries and approaches of your choice to build your own models.
Cloudera Data Warehouse (CDW)
Q: Which data warehousing engines are available in CDW?
A: Hive for EDW, complex report building, and dashboarding; Impala for interactive SQL and ad hoc exploration; Kudu for time-series workloads; and Druid for log analytics.
Q: How do you create data warehouses?
A: Step 1, create your CDP environment. Step 2, activate the CDW service. Step 3, create your virtual warehouse. Step 4, define tables, load data, run queries, integrate your BI tool, etc.
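Step 4 is ordinary SQL. As an illustration only, the sketch below uses Python’s built-in sqlite3 module to stand in for a warehouse connection; against a CDW Virtual Warehouse you would run equivalent Hive or Impala SQL through Hue or a JDBC/ODBC client:

```python
import sqlite3

# sqlite3 stands in for a warehouse connection in this sketch;
# the table and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("emea", 100.0), ("emea", 250.0), ("apac", 75.0)])

# Define tables, load data, then query: the same shape of workflow
# you would follow inside a Virtual Warehouse.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)
conn.close()
```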
Q: What’s the relationship between a Database Catalog and Virtual Warehouse?
A: Each Database Catalog can back one or more Virtual Warehouses. The Virtual Warehouses are isolated from one another, yet share the same data and metadata.
Cloudera Operational Database (COD)
Q: What is the relationship between Data Hub & Data Lake?
A: Data Lake houses SDX (governance & authorization). Data Hub is the actual service that hosts the workload, in this case the Operational DB.
Q: When should I use Cloudera OpDB vs a template in Data Hub?
A: For new apps, use Cloudera OpDB (COD), which is self-tuning and automatically improves performance over time. For replicating on-prem environments to the cloud via lift-and-shift, or for disaster recovery, use Data Hub templates.
Q: How does Apache Phoenix relate to Apache HBase?
A: Phoenix is an OLTP SQL engine for OpDB. It adds relational capabilities on top of HBase. Phoenix provides a much more familiar programming paradigm and allows our customers to reach production faster. Think of Phoenix as a SQL persona and HBase as a NoSQL persona.
Watch the latest on-demand demos here.