Building a Kafka playground on AWS — Part 1: Setting the Foundation
As part of my recent studies I decided to explore Apache Kafka. Kafka is an open-source streaming platform used to collect/analyse high volumes of data, and its ecosystem is composed of multiple components. These components are deployed separately from Kafka's core, and while this decoupled architecture has clear benefits, like scalability, it also introduces challenges, especially when planning what a platform deployment would look like.
There are solutions out there that make it easier to get the full architecture up and running, like Confluent's platform for example. However, at this point I am interested in exploring AWS's managed Kafka offering, AWS MSK. The caveat with this service is that it currently does not cover the full ecosystem, leaving the Kafka components to be deployed separately. With this problem in mind I decided to investigate what is involved in running the full ecosystem on AWS and which service(s) I could use alongside AWS MSK to deploy the Kafka components.
This blog series shares details about my learning journey so far and is divided into three parts:
- Part 1 gives you an overview of the Kafka components and the AWS services I am planning to use.
- Part 2 goes into the details of running Kafka with the chosen AWS services, the challenges I overcame and some frustrations.
- Part 3 walks you through running the architecture by deploying my GitHub project, connecting a source Postgres database to Kafka and running real-time queries through KSQL.
Important: This blog series presents my learning journey and is intended to share where I got up to with Kafka. My end goal is to host Kafka on AWS purely for exploration purposes, so keep in mind that a production deployment would involve extra concerns not covered in this series.
Enough said, let’s jump straight to it.
Kafka Components
Before I introduce the AWS services, let me point out the set of components/frameworks we want to ship to the cloud.
1 — Kafka Core: the main component.
Kafka Core is the key part. In Kafka, messages are stored in objects called topics (think of them as categories), where the order in which messages were produced is maintained. Producers and Consumers must always specify a target topic to communicate with, and for redundancy and scalability topics can be split into multiple partitions, each partition keeping its own ordered log of messages.
A Kafka cluster is composed of a set of broker nodes. This means that a single topic may live across multiple brokers if it is configured to store messages in multiple partitions.
Partitions can also be configured to have replicas. Replicas are partition copies, and every partition has an elected leader replica that producers write messages to and consumers read messages from. In case of broker failure, one of the follower replicas assumes the role of leader to reduce (or avoid) downtime.
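To make this more concrete, here is a minimal sketch of creating a multi-partition, replicated topic. It assumes the confluent-kafka Python client; the broker address and topic name are placeholders.

```python
# Minimal sketch: create a replicated, multi-partition topic.
# Assumes the confluent-kafka Python client; broker address and
# topic name are placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})

# 3 partitions spread the topic across brokers; replication factor 2
# keeps a follower copy of each partition on another broker.
topic = NewTopic("orders", num_partitions=3, replication_factor=2)

futures = admin.create_topics([topic])
for name, future in futures.items():
    future.result()  # raises if creation failed
    print(f"Created topic {name}")
```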
Note: I have seen some movement towards making follower replicas available for read operations as well, and I believe this is changing in Kafka.
Partitions also enable faster message consumption. With a multi-partitioned topic you can specify a record key to direct messages to a specific partition rather than letting Kafka perform the default round-robin selection. Consumers can then be scaled horizontally, each reading from its own subset of partitions.
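A sketch of what keyed production and group-based consumption could look like, again assuming the confluent-kafka Python client and made-up broker, topic and group names:

```python
# Sketch: produce keyed messages and consume them as part of a
# consumer group. Assumes the confluent-kafka Python client;
# broker, topic and group names are placeholders.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "broker1:9092"})
# Messages with the same key always land in the same partition,
# so per-key ordering is preserved.
producer.produce("orders", key="customer-42", value='{"amount": 9.99}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "orders-readers",     # scale by starting more members of this group
    "auto.offset.reset": "earliest",  # start from the beginning of the log if no offset yet
})
consumer.subscribe(["orders"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.partition(), msg.key(), msg.value())
consumer.close()
```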
To wrap up Kafka Core, one of the notable advantages I see in Kafka is that messages can be stored indefinitely in the cluster, so applications can process and reprocess messages at their own pace. Keep in mind that storage limitations may arise, so you might want to configure retention periods to free up space.
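For example, a retention period could be applied to an existing topic roughly like this (assuming the confluent-kafka AdminClient; the topic name and broker address are placeholders):

```python
# Sketch: set a retention period on an existing topic so old messages
# are eventually dropped. Assumes the confluent-kafka AdminClient;
# note that alter_configs replaces the topic's non-default config set.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "broker1:9092"})
resource = ConfigResource(
    ConfigResource.Type.TOPIC, "orders",
    set_config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # keep messages for 7 days
)
admin.alter_configs([resource])[resource].result()
```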
2 — Kafka Connect: framework to connect Kafka to external systems;
As a developer you can use programming packages/libraries to write and read messages from Kafka. However, in most scenarios you will want to connect common sources/destinations like databases, S3, files, etc. For these cases, Kafka Connect comes in handy as a framework that offers ready-made connectors to facilitate communicating with Kafka. The main advantage is that connectors only require configuration rather than custom code.
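As an illustration, registering a connector is just a POST of a JSON configuration to the Connect REST API. The sketch below assumes a Connect worker on localhost:8083 and Debezium's Postgres source connector; all connection details are made up.

```python
# Sketch: register a source connector through the Kafka Connect REST API.
# Assumes a Connect worker on localhost:8083 and that the Debezium
# Postgres connector plugin is installed; connection details are placeholders.
import requests

connector = {
    "name": "inventory-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres.example.internal",
        "database.port": "5432",
        "database.user": "kafka",
        "database.password": "secret",
        "database.dbname": "inventory",
        "database.server.name": "pg",  # prefix for the topics the connector writes to
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```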
3 — Schema Registry: framework for schema management;
Kafka Producers and Consumers work separately and can be strategically scaled to accommodate distinct workloads. This implies that they don't know about each other, and consequently Producers could write messages that don't respect the schema Consumers expect on their end. Schema Registry is a framework that creates a contract between Producers and Consumers by using Apache Avro schemas. This guarantees messages are written (serialized) and read (deserialized) with the same structure. It also accounts for schema evolution, so new message formats will not break the Producer/Consumer communication.
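As a rough illustration, a schema can be registered for a topic's values through the Schema Registry REST API (assuming a registry running on localhost:8081; the subject and schema below are made up):

```python
# Sketch: register an Avro schema for a topic's values with the
# Schema Registry REST API. Assumes Schema Registry on localhost:8081;
# the subject name and schema are illustrative only.
import json
import requests

avro_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

resp = requests.post(
    "http://localhost:8081/subjects/orders-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(avro_schema)}),
)
resp.raise_for_status()
print(resp.json())  # e.g. {"id": 1}, the id of the registered schema
```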
4 — KSQL: streaming SQL engine;
KSQL provides an interactive SQL interface on top of streaming data stored in Kafka topics. With KSQL you can write aggregations and also join messages from multiple topics to produce a unified result set.
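A sketch of what that could look like, sending statements to a KSQL server's REST endpoint (assuming a server on localhost:8088; topic, stream and column names are placeholders, and the exact syntax depends on the KSQL version):

```python
# Sketch: declare a stream over an existing topic and build a
# continuously updated aggregate through the KSQL REST API.
# Assumes a KSQL server on localhost:8088; names are placeholders.
import requests

statements = """
CREATE STREAM orders_stream (id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
CREATE TABLE order_totals AS
  SELECT id, SUM(amount) AS total FROM orders_stream GROUP BY id;
"""

resp = requests.post(
    "http://localhost:8088/ksql",
    json={"ksql": statements, "streamsProperties": {}},
)
resp.raise_for_status()
print(resp.json())
```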
5 — REST Proxy: RESTful interface to a Kafka cluster.
The REST Proxy is used to interrogate Kafka about the state of the cluster and to perform administrative tasks, and it also lets clients produce and consume messages over HTTP. It is a useful tool when there is no other interface available to answer a question about the cluster.
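For instance, a couple of read-only questions could be answered like this (assuming a REST Proxy instance on localhost:8082 exposing its v2 API):

```python
# Sketch: ask the REST Proxy about the cluster. Assumes a REST Proxy
# instance on localhost:8082; host and port are placeholders.
import requests

base = "http://localhost:8082"
print(requests.get(f"{base}/topics").json())   # list of topic names
print(requests.get(f"{base}/brokers").json())  # e.g. {"brokers": [1, 2, 3]}
```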
The AWS Services
From the beginning I knew that AWS MSK was going to be the chosen service to run Kafka Core, but I still had to figure out where to deploy the Kafka components. Fortunately all the components I am interested in are open-source and available as Docker images through Docker Hub, so my best choice was to use a container service, like AWS ECS.
AWS MSK
AWS MSK was announced in preview at re:Invent 2018 and became generally available in May 2019. It is a fully managed service that aims to give people a simple and fast way to spin up a Kafka cluster, without having to worry about the operational overhead of running Kafka on their own.
Amazon offers open-source versions of Kafka (at the time of writing, the available versions are 1.1.1, 2.2.1 and 2.3.1) and takes care of spinning up the broker and ZooKeeper nodes. ZooKeeper is a service that acts as a coordinator inside a Kafka cluster, keeping track of which brokers exist and how topic partitions are spread across the cluster.
Moving to networking and security, an MSK cluster must be placed inside an AWS VPC to account for the high availability and security of its broker nodes. It is worth mentioning that even though we provide our own VPC, the Kafka brokers and the ZooKeeper ensemble are actually placed inside an AWS-managed VPC. Elastic Network Interfaces (ENIs) are then created in the customer VPC to represent the broker and ZooKeeper nodes, and these ENIs enable network communication between the two VPCs.
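To give an idea of what provisioning looks like, here is a boto3 sketch that creates a small cluster and later fetches its bootstrap brokers; subnet, security group and instance type values are placeholders.

```python
# Sketch: create a small MSK cluster with boto3 and look up its
# bootstrap brokers once it is active. Subnet/security group ids,
# instance type and names are placeholders.
import boto3

msk = boto3.client("kafka")

cluster = msk.create_cluster(
    ClusterName="kafka-playground",
    KafkaVersion="2.3.1",
    NumberOfBrokerNodes=3,  # one broker per subnet/AZ below
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],
        "SecurityGroups": ["sg-123456"],
    },
)

# Once the cluster reaches the ACTIVE state, this returns the broker
# endpoints exposed through the ENIs in our VPC.
brokers = msk.get_bootstrap_brokers(ClusterArn=cluster["ClusterArn"])
print(brokers["BootstrapBrokerString"])
```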
While the service is relatively new, I believe it perfectly fits what I want to do: exploration. For a production deployment it is worth considering how MSK differs from other managed Kafka services, for example what is available on GCP, Azure and as part of Confluent's Platform.
AWS ECS
Since I kicked off my studies using Docker and had some experience with how the Kafka component containers work, I first considered a container service for my architecture.
An important distinction to make right away is that for containers on AWS, orchestration and compute are isolated from each other: container definition and orchestration is done through either AWS ECS or AWS EKS, while container execution happens on either Fargate or ECS Container Instances (EC2).
Starting with orchestration, ECS is AWS's proprietary service for orchestrating and provisioning Docker containers, while EKS is their Kubernetes offering. I won't go into the details of whether to pick one or the other as they are both sufficient for my use case, but ECS seems to be the first choice for simplicity. Kubernetes might be the right platform to run containers at large scale, but it demands some expertise to operate properly. A notable advantage of Kubernetes compared to ECS is that you are not locked in to AWS.
As mentioned above, there are two available options for compute:
- Fargate: the serverless option. You set up an ECS task and hand it to Fargate to manage its execution. There are no EC2 instances to run and you only pay for running tasks.
- ECS Container Instances: this mode spins up EC2 instances with the ECS container agent installed on them. Tasks are then placed into these instances and executed according to the ECS task definition.
In the end, ECS seems to get me up and running faster, so that's my orchestration service for this project. For compute, even though the Fargate launch type requires less maintenance than ECS Container Instances, I want to be able to interact with the containers more closely, so I decided to use the latter.
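To give an idea of what this looks like in practice, here is a sketch that registers one of the Kafka components (Schema Registry, as an example) as an ECS task definition and runs it with the EC2 launch type via boto3; the cluster name, image tag and environment values are placeholders.

```python
# Sketch: define one of the Kafka components (here Schema Registry) as
# an ECS task and run it on the EC2 launch type. Cluster name, image
# tag and environment values are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="schema-registry",
    containerDefinitions=[{
        "name": "schema-registry",
        "image": "confluentinc/cp-schema-registry:5.3.1",
        "memory": 1024,
        "portMappings": [{"containerPort": 8081, "hostPort": 8081}],
        "environment": [
            {"name": "SCHEMA_REGISTRY_HOST_NAME", "value": "schema-registry"},
            {"name": "SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS",
             "value": "broker1:9092"},  # the MSK bootstrap brokers would go here
        ],
    }],
)

# With the EC2 launch type the task is placed onto one of the cluster's
# container instances rather than onto Fargate.
ecs.run_task(cluster="kafka-playground", taskDefinition="schema-registry",
             launchType="EC2", count=1)
```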
This completes the first part of this blog series. In Part 2 I will share details about the MSK and ECS setup and the kinds of problems I ran into that I didn't expect at the beginning. Stay tuned!