Robust Message Serialization in Apache Kafka Using Apache Avro, Part 2

by Andras Beni

Posted in Technical | July 26, 2018 3 min read

Implementing a Schema Store

In Part 1, we saw the need for an Apache Avro schema provider but did not implement one. In this part we will implement a schema provider that works with Apache Kafka as storage.

In-Memory SchemaStore

First we can implement an in-memory store for schemas. This is useful to understand the requirements for such a store and as the cache of the Kafka backed store. A SchemaStore has to be quick in looking up VersionedSchema entries. That’s why we create a separate map to help each lookup method. Using ConcurrentHashMap allows us to access these maps from multiple threads without locking.

public class InMemorySchemaStore implements SchemaStore {

  private final Map<Integer, VersionedSchema> schemasById = new
      ConcurrentHashMap<>();
  private final Map<SchemaNameWithVersion, VersionedSchema>
      schemasByNameAndVersion = new ConcurrentHashMap<>();
  private final Map<String, VersionedSchema> schemasByParsingForm =
      new ConcurrentHashMap<>();

  //...

  @Override
  public VersionedSchema getMetadata(Schema schema) {
    return schemasByParsingForm
        .get(SchemaNormalization.toParsingForm(schema));
  }
}

Adding a new schema to all three maps does not need to be atomic, because we do not expect the same store to be used in different use cases.

@Override
public void add(VersionedSchema schema) {
  schemasById.put(schema.getId(), schema);
  schemasByNameAndVersion.put(new SchemaNameWithVersion(schema.getName(),
      schema.getVersion()), schema);
  schemasByParsingForm.put(SchemaNormalization
      .toParsingForm(schema.getSchema()), schema);
}

Reading from and Writing to a Kafka Topic

The other half of our Kafka based SchemaProvider is a class that can perform all the communication with Kafka. This does not have to be tied to the concept of schemas, it can be a generic piece of code. We want it to read all schemas at startup and continue polling for new schemas. For this reason, we set up the consumer like this:

enable.auto.commit = false, because we will re-read all schemas at startup anyway
Manually assign all partitions to the consumer to avoid sharing messages with another consumer accidentally having the same group.id
Seek to the oldest message before reading
Poll until the latest record is read before allowing usage of the schema provider
Keep polling in a background thread to receive new schemas

class KafkaTopicStore<T> {

  //...

  public void startAndCatchUp() {
    List<PartitionInfo> partitions = consumer.partitionsFor(topic);
    logger.info("Start working with topic " + partitions);
    List<TopicPartition> topicPartitions = partitions.stream()
        .map(p -> new TopicPartition(topic, p.partition()))
        .collect(Collectors.toList());
    consumer.assign(topicPartitions);
    consumer.seekToBeginning(topicPartitions);
    Map<TopicPartition, Long> initialOffsets = consumer
        .endOffsets(topicPartitions);
    logger.info("Offsets to catch up with: " +  initialOffsets);
    final Set<TopicPartition> partitionsCaughtUp = new HashSet<>();
    if (initialOffsets.values().stream().mapToLong(v -> v).sum() > 0) {
      Predicate<ConsumerRecord<Void, T>> allCaughtUp = r -> {
        TopicPartition topicPartition = new
            TopicPartition(topic, r.partition());
        logger.debug("Read partition, offset: " + r.partition()
            + ",  " + r.offset());
        if (r.offset() >= initialOffsets.get(topicPartition) - 1) {
          partitionsCaughtUp.add(topicPartition);
        }
        return partitionsCaughtUp.size() == initialOffsets.size();
      };
      pollUntil(allCaughtUp);
    }
    pollerThread = new Thread(() -> pollUntil(r -> false), getClass().getSimpleName() + "-poller");
    pollerThread.start();
  }
}

KafkaTopicSchemaProvider does no more than glue together InMemorySchemaStore and KafkaTopicStore. We need to configure the store to send and poll for VersionedSchemas. Also we need to make sure all new schemas are added to the cache. Whenever a new schema is read, it is added to the InMemorySchemaStore.

To ensure that we do not lose schemas over time we also need to set retention.ms=-1 and retention.bytes=-1 for the topic that stores our schema information. This works with Kafka 0.9.0.0 (CDK 2.0) and newer. For older brokers we can just set a “big enough” retention time.

Administering the Schema Store

We need a command line tool to add, list, and print schemas. To achieve this, we create a KafkaTopicSchemaProvider based on the command line arguments and call the appropriate methods to read or write schema information.

OptionSet options = parser.parse(args);

if (options.has(ADD)) {
  // Error handling omitted
   KafkaTopicSchemaTool tool = new KafkaTopicSchemaTool(createProvider(options));
   VersionedSchema schema = tool.addSchema(options.valueOf(NAME).toString(),
       Integer.valueOf(options.valueOf(VERSION).toString()),
       options.valueOf(SCHEMA_FILE).toString());
   System.out.println("Successfully added schema.");
   printSchema(schema);
} else if (options.has(LIST)) {
  // ...
}

One crucial problem is schema identifier generation. In Kafka we do not have Sequence objects like in RDBMSs. Thus, we need to make sure we come up with a unique integer for each schema we add. One simple solution is to find the next available positive integer. In our example, we do not have any way to prevent two administrators to add schemas at about the same time with the same identifier. To prevent this, we can

Use a Kafka topic with a single partition as a sequence: produce a single message and use its offset
Use an ephemeral ZooKeeper node to “lock”
Introduce a service to add schemas. This application can lock in memory.
Allow writing to the topic storing schemas to a principal that only a few have access to and delegate the responsibility

Next Time…

Part 3 will show how to make your Kafka clients aware of schema versions by accessing the schema store.

Andras Beni

More by this author

Editor's Choice

Business

Generative AI for the Enterprise

Technical

Building Trust in Public Sector AI Starts with Trusting Your Data