A common ingest pattern among Cloudera customers is an Apache Spark Streaming application that reads data from Kafka. Streaming data continuously from Kafka has many benefits, such as the ability to gather insights faster. However, users must consider how Kafka offsets are managed in order to recover their streaming application from failures. In this post, we provide an overview of offset management and cover the following topics:
- Storing offsets in external data stores
- Not managing offsets
Overview of Offset Management
Spark Streaming integration with Kafka allows users to read messages from a single Kafka topic or multiple Kafka topics.
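To make the recovery concern concrete, here is a minimal sketch of the "store offsets in an external data store" idea. It is not Spark or Kafka code: `OffsetStore`, `process_batch`, and the list standing in for a topic partition are all hypothetical names invented for illustration, and the in-memory dictionary is an assumed stand-in for a real external store.

```python
from typing import Dict, List, Tuple


class OffsetStore:
    """Hypothetical stand-in for an external offset store (assumption: in-memory dict)."""

    def __init__(self) -> None:
        self._offsets: Dict[Tuple[str, int], int] = {}

    def load(self, topic: str, partition: int) -> int:
        # With no saved offset, start from the beginning of the partition.
        return self._offsets.get((topic, partition), 0)

    def save(self, topic: str, partition: int, next_offset: int) -> None:
        self._offsets[(topic, partition)] = next_offset


def process_batch(
    log: List[str],
    store: OffsetStore,
    topic: str,
    partition: int,
    batch_size: int,
    sink: List[str],
) -> int:
    """Read up to batch_size messages from the saved offset, then record progress."""
    start = store.load(topic, partition)
    batch = log[start:start + batch_size]
    # Process first: here "processing" just appends the messages to the sink.
    sink.extend(batch)
    # Save the next offset only after processing has finished.
    store.save(topic, partition, start + len(batch))
    return len(batch)
```

Because the offset is saved only after the batch is processed, a crash between the two steps means the batch is read again on restart rather than skipped. A restarted application that calls `process_batch` with the same store resumes from the last saved offset.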
In Part 1 of this blog, we covered some common challenges in memory tuning and baseline setup related to a production Solr deployment. In Part 2, you will learn about memory tuning, GC tuning, and some best practices.
We assume you have read Part 1 of the blog and have a stable Solr deployment up and running. The next step is memory tuning to get more out of Solr. Before changing any configuration, please be aware that playing with some tuning knobs can cause unexpected consequences on the system,
Configuring Apache Solr memory properly is critical for production system stability and performance. It can be hard to find the right balance between competing goals. There are also multiple factors, implicit or explicit, that need to be taken into consideration. This blog talks about some common tasks in memory tuning and guides you through the process to help you understand how to configure Solr memory for a production system.
For simplicity, this blog applies to Solr in Cloudera CDH 5.11 running on top of HDFS.
One of the most fundamental aspects a data model can convey is how something changes over time. This makes sense when considering that we build data models to capture what is happening in the real world, and the real world is constantly changing. The challenge is not just that new things occur, but that existing things change too. If our data models overwrite the old state of an entity with the new state, we lose information about the change.
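The difference between overwriting state and keeping history can be shown with a small sketch. The class names, the `as_of` method, and the integer timestamps below are hypothetical choices for illustration, and the sketch assumes updates for a key arrive in increasing timestamp order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    value: str
    valid_from: int  # hypothetical logical timestamp


class OverwritingStore:
    """Keeps only the latest state, so earlier states are lost."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}

    def put(self, key: str, value: str, ts: int) -> None:
        self._state[key] = value  # the previous value is discarded

    def get(self, key: str) -> Optional[str]:
        return self._state.get(key)


class HistoryStore:
    """Appends every state, so earlier states remain queryable."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Version]] = {}

    def put(self, key: str, value: str, ts: int) -> None:
        # Assumption: callers supply increasing timestamps per key.
        self._history.setdefault(key, []).append(Version(value, ts))

    def get(self, key: str) -> Optional[str]:
        versions = self._history.get(key, [])
        return versions[-1].value if versions else None

    def as_of(self, key: str, ts: int) -> Optional[str]:
        # Return the latest value whose valid_from is at or before ts.
        result = None
        for version in self._history.get(key, []):
            if version.valid_from <= ts:
                result = version.value
        return result
```

Both stores give the same answer for the current state, but only the history-keeping one can answer what the state was at an earlier time.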
Learn how to use Cloudera to spin up Apache Hadoop clusters across multiple cloud providers to take advantage of competing prices and avoid infrastructure lock-in.
Why is a multi-cloud strategy important?
In the early days of Cloudera, it was a fair assumption that our software would be running on industry-standard servers that were purchased, owned, and operated by the client in their own data center. In the last few years,