Building a Kafka playground on AWS — Part 2: The AWS Architecture

Maikel Penz
7 min read · Feb 24, 2020

This is part 2 of my blog series about building a Kafka playground on AWS. If you missed part 1, please check it out first.

Now that we know the Kafka components and the AWS services of choice, let's look at a graphical representation of this architecture and walk through how it works.

AWS Architecture for Kafka Deployment

VPC

A VPC is a virtual private network into which AWS resources can be securely placed, restricting access to only the parties we allow. Beyond the security standpoint, a VPC lets us design for high availability by distributing the MSK brokers and Kafka components across distinct physical locations (Availability Zones, or AZs) within the same AWS region.

The image above presents 3 VPC subnets (named Public Subnet A, Public Subnet B and Public Subnet C). The two ECS Container Instances have been placed across subnets A and C, and the MSK cluster in subnet B. This layout is intentional in the graphical representation, simply to make it easier to see which containers communicate with MSK. Once we deploy to AWS, the MSK brokers are actually spread across multiple subnets for high availability.
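To make the layout concrete, here is a rough Terraform sketch of one VPC with three public subnets spread across Availability Zones. The resource names, CIDR ranges and AZ suffixes are hypothetical and are not taken from the actual project:

```hcl
# Hypothetical sketch: three public subnets spread across AZs in one region.
resource "aws_vpc" "kafka_playground" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  for_each = {
    a = "10.0.1.0/24"
    b = "10.0.2.0/24"
    c = "10.0.3.0/24"
  }

  vpc_id            = aws_vpc.kafka_playground.id
  cidr_block        = each.value
  availability_zone = "ap-southeast-2${each.key}" # one subnet per AZ: 2a, 2b, 2c
}
```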

AWS MSK Cluster

Configuration

The MSK cluster configuration is quite simple and can be found in the configuration folder of my GitHub project. However, there are three settings that must be planned together for MSK to work properly:

msk_number_of_brokers: how many brokers the cluster will have.

msk_default_replication_factor: how many copies of the data we want to have by default for any new topic.

msk_min_in_sync_replicas: the minimum number of in-sync copies of the data that must be available at any time for Kafka to continue accepting new messages from producers.

If, for example, we spin up a cluster with 5 brokers and create a topic with a default replication factor of 5, every broker will hold a copy of each message. By setting the minimum in-sync replicas to 4, we ensure that new messages can still be produced even if we lose one node.
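Assuming the project's Terraform variables feed the equivalent Kafka broker properties, the relationship looks roughly like the sketch below. The resource names and Kafka version are hypothetical; only the three variables come from the project:

```hcl
# Illustrative only: the three settings have to agree with each other.
resource "aws_msk_configuration" "kafka" {
  name           = "kad-msk-configuration" # hypothetical name
  kafka_versions = ["2.3.1"]

  server_properties = <<PROPERTIES
default.replication.factor=${var.msk_default_replication_factor}
min.insync.replicas=${var.msk_min_in_sync_replicas}
PROPERTIES
}

resource "aws_msk_cluster" "kafka" {
  cluster_name           = "kad-msk-cluster" # hypothetical name
  kafka_version          = "2.3.1"
  number_of_broker_nodes = var.msk_number_of_brokers # e.g. 5 brokers, RF 5, min ISR 4

  configuration_info {
    arn      = aws_msk_configuration.kafka.arn
    revision = aws_msk_configuration.kafka.latest_revision
  }

  broker_node_group_info {
    instance_type   = "kafka.t3.small"
    ebs_volume_size = 100
    client_subnets  = [for s in aws_subnet.public : s.id] # hypothetical subnet references
    security_groups = [aws_security_group.msk.id]         # hypothetical security group
  }
}
```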

When I first deployed an MSK cluster I completely overlooked these settings and accidentally configured the default replication factor to be lower than the minimum in-sync replicas. This resulted in the cluster refusing new messages from the very beginning, since it had fewer copies of the data than the minimum expected number of in-sync replicas. The error I got was similar to the following:

[Producer clientId=producer-3] Got error produce response with correlation id 906 on topic-partition debezium-connect-source_config-0, retrying (2147482748 attempts left). Error: NOT_ENOUGH_REPLICAS [org.apache.kafka.clients.producer.internals.Sender]

Limitations

A limitation I found when setting up MSK was that Cluster Configurations cannot be deleted alongside the cluster. A Cluster Configuration is a resource attached to the cluster at spin-up time that holds internal Kafka settings like the ones I just mentioned above.

At this point in time, neither the AWS console nor the AWS CLI offers the option to delete them, so my list of Cluster Configurations grows indefinitely, as every new cluster needs a distinct Cluster Configuration name to avoid conflicts. You can see what I mean by checking the Configuration Name column in the image below.

Cluster configuration list

AWS ECS Cluster

Most of the other moving parts of the architecture relate to accessing and communicating with the Kafka components running on ECS.

Configuration

To better understand how ECS works, consider that we have one ECS cluster and that every single Docker container requires a Task Definition and a Service.

  • Task Definition: describes how a container should be launched, including the required memory and CPU, the Docker image, environment variables, etc.
ECS Task Definition
  • Service: controls the state of running tasks. A Service points to a single Task Definition; it controls how many containers based on that task definition must be running at once, replaces failed tasks and exposes containers to external services such as a Load Balancer (a minimal sketch of both resources follows the images below).
ECS Services
ECS Container Instances
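Here is a minimal Terraform sketch of the pair, using hypothetical names, images and sizes rather than the project's actual definitions:

```hcl
# Hypothetical example: one task definition plus the service that keeps it running.
resource "aws_ecs_task_definition" "schema_registry" {
  family = "kad-ecs-kafka-schema-registry" # hypothetical family name

  container_definitions = jsonencode([
    {
      name      = "schema-registry"
      image     = "confluentinc/cp-schema-registry:5.3.1"
      memory    = 512
      cpu       = 256
      essential = true
      portMappings = [
        { containerPort = 8081, hostPort = 8081 }
      ]
    }
  ])
}

resource "aws_ecs_service" "schema_registry" {
  name            = "kad-ecs-kafka-schema-registry" # hypothetical service name
  cluster         = aws_ecs_cluster.kafka.id        # hypothetical cluster reference
  task_definition = aws_ecs_task_definition.schema_registry.arn
  desired_count   = 1 # how many tasks must be running at once
  launch_type     = "EC2"
}
```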

Container Communication

One of the challenges I faced with ECS was figuring out how to make containers reachable. This architecture deploys six containers to ECS and there are three types of communication to configure:

- Communicating containers with Kafka (MSK);
- How to make containers talk to each other;
- How to make containers accessible through the browser.

  • Communicating containers with Kafka (MSK)

AWS MSK outputs a list of available brokers so other services can communicate with the cluster.

List of MSK Brokers

Containers like the Schema Registry, Kafka Connect, the Kafka REST API and KSQL must be aware of the Kafka brokers to operate. To achieve this, the ECS Task Definitions are configured with environment variables that point to the bootstrap servers output provided by MSK.

KSQL Task Definition referencing list of Kafka brokers
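A hedged sketch of what such a reference could look like in Terraform, using a hypothetical KSQL task definition; the environment variable name follows the Confluent KSQL Docker image's convention, and the broker list comes from the MSK resource's bootstrap_brokers attribute:

```hcl
# Hypothetical snippet: the broker list comes straight from the MSK cluster resource.
resource "aws_ecs_task_definition" "ksql" {
  family = "kad-ecs-kafka-ksql" # hypothetical family name

  container_definitions = jsonencode([
    {
      name      = "ksql-server"
      image     = "confluentinc/cp-ksql-server:5.3.1"
      memory    = 1024
      essential = true
      environment = [
        {
          # bootstrap_brokers is the plaintext broker list exposed by the MSK resource.
          name  = "KSQL_BOOTSTRAP_SERVERS"
          value = aws_msk_cluster.kafka.bootstrap_brokers
        }
      ]
    }
  ])
}
```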
  • How to make containers talk to each other

Some containers need to be aware of others to operate. For example, the Schema Registry UI container exposes an interface to manage schemas but it doesn’t store schemas itself. Instead, it needs to link to the Schema Registry container to access them.

In short, an environment variable must be set on the Schema Registry UI Task Definition pointing to the Schema Registry container. However, making this “link” is not that simple. As we know, our ECS cluster is composed of EC2 instances sitting behind an auto scaling group. This means that instances will eventually go down, containers will be relocated to other instances and, consequently, IP addresses will change. Pointing to a static IP address of an EC2 instance will clearly not work.

Fortunately, ECS has a feature called ECS Service Discovery, which enables an ECS service to register itself under a DNS name on Route 53. When an EC2 instance goes down, Service Discovery automatically updates the Route 53 record set with the new instance's IP address and container communication resumes as before.

To make it easier to understand let’s go back to the Schema Registry example.

1 - The Schema Registry Service registers itself with a private DNS through ECS Service Discovery:

ECS Service pointing to a Route 53 endpoint
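In Terraform terms, the wiring behind this step could look roughly like the sketch below. Names are hypothetical; the important pieces are the private DNS namespace, the discovery service and the service_registries block on the ECS service:

```hcl
# Hypothetical sketch of ECS Service Discovery backed by a private DNS namespace.
resource "aws_service_discovery_private_dns_namespace" "kafka" {
  name = "kafka.local"               # hypothetical namespace
  vpc  = aws_vpc.kafka_playground.id # hypothetical VPC reference
}

resource "aws_service_discovery_service" "schema_registry" {
  name = "schema-registry"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.kafka.id

    dns_records {
      type = "A" # A records assume awsvpc networking; bridge mode would need SRV records
      ttl  = 10
    }
  }
}

# Same service as the earlier sketch, now with service discovery attached.
resource "aws_ecs_service" "schema_registry" {
  name            = "kad-ecs-kafka-schema-registry"
  cluster         = aws_ecs_cluster.kafka.id # hypothetical cluster reference
  task_definition = aws_ecs_task_definition.schema_registry.arn
  desired_count   = 1

  service_registries {
    registry_arn = aws_service_discovery_service.schema_registry.arn
  }
}
```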

2 - The Schema Registry UI Task Definition has an environment variable pointing to the Route 53 endpoint rather than a static IP address:

ECS Task Definition

This makes the Schema Registry UI aware of the running Schema Registry container.
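A hedged sketch of that task definition, with a hypothetical image, DNS name and variable value:

```hcl
# Hypothetical sketch of the Schema Registry UI task definition.
resource "aws_ecs_task_definition" "schema_registry_ui" {
  family = "kad-ecs-kafka-schema-registry-ui" # hypothetical family name

  container_definitions = jsonencode([
    {
      name      = "schema-registry-ui"
      image     = "landoop/schema-registry-ui:latest"
      memory    = 256
      essential = true
      environment = [
        {
          # Points at the Route 53 name registered via Service Discovery,
          # never at an individual EC2 instance's IP address.
          name  = "SCHEMAREGISTRY_URL"
          value = "http://schema-registry.kafka.local:8081"
        }
      ]
    }
  ])
}
```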

  • How to make containers accessible through the browser

To make Kafka components easier to visualize and manage, some containers are deployed purely to put a friendly interface on top of the underlying frameworks, such as the Schema Registry UI and the Kafka Connect UI.

In these cases we want to reach the ECS containers from our browser, and to achieve this we register the ECS Services with a Load Balancer (ELB). The ELB provides a URL, and the ECS service makes sure the Container Instances are always registered and up to date in the ELB's target groups.

To exemplify, let's use the Schema Registry example one more time. The port configuration to access the Schema Registry UI from the browser is set as follows:

Schema Registry UI ports

1 - The Load Balancer provides a URL (DNS name column) and Listeners direct requests to target groups. The image below shows that the following URL on port 9000 will hit the kad-ecs-kafka-schema-registry-ui target group:

http://kad-ecs-application-lb-1640776068.ap-southeast-2.elb.amazonaws.com:9000

Load Balancer Listener

The target group forwards the request to port 8001 on EC2 instance i-07cdac8b387c0168a.

2 - The Schema Registry UI ECS Service is registered with a Load Balancer target group. The container port inside the EC2 instance is 8000.

ECS Service link to Load Balancer
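Putting the pieces together, a hypothetical Terraform sketch of this wiring (listener on 9000, target group on host port 8001, container port 8000) could look like this:

```hcl
# Hypothetical sketch: listener on 9000 -> target group on host port 8001,
# which the docker port mapping translates to container port 8000.
resource "aws_lb_target_group" "schema_registry_ui" {
  name     = "kad-ecs-kafka-schema-registry-ui"
  port     = 8001
  protocol = "HTTP"
  vpc_id   = aws_vpc.kafka_playground.id # hypothetical VPC reference
}

resource "aws_lb_listener" "schema_registry_ui" {
  load_balancer_arn = aws_lb.application.arn # hypothetical load balancer reference
  port              = 9000
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.schema_registry_ui.arn
  }
}

resource "aws_ecs_service" "schema_registry_ui" {
  name            = "kad-ecs-kafka-schema-registry-ui" # hypothetical service name
  cluster         = aws_ecs_cluster.kafka.id           # hypothetical cluster reference
  task_definition = aws_ecs_task_definition.schema_registry_ui.arn
  desired_count   = 1
  launch_type     = "EC2"

  load_balancer {
    target_group_arn = aws_lb_target_group.schema_registry_ui.arn
    container_name   = "schema-registry-ui"
    container_port   = 8000 # the container port; ECS maps it to host port 8001
  }
}
```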

3 - Hitting the ELB URL on port 9000 loads the Schema Registry UI interface as we expect.

Accessing Schema Registry UI

And we are done with part 2! In the next post we will deploy the architecture described here, configure Change Data Capture to listen to database events, and query them in real time through KSQL.


Maikel Penz

Data Engineering Team Lead @ Plexure | AWS Certified Solution Architect | LinkedIn: https://www.linkedin.com/in/maikel-alexsander-penz/