Reshaping Data Engineering at Plexure

Maikel Penz · Published in TASK Group · 6 min read · Dec 6, 2022

In technology, it is common to see companies shift focus back and forth between delivery and foundational work. The former keeps customers happy, while potentially introducing technical debt. The latter slows down what matters most, customer delivery, but creates opportunities to achieve greater results in the long term.

At Plexure we went through this shift when we identified that we could no longer keep the data platform running in its current state. To continue delivering to our customers, we directed our efforts into foundational work, taking existing requirements and future aspirations into account to drive the design and development of a new data platform.

This article walks you through the problems we were facing, expands on what we are doing differently in the new platform, and covers the benefits we are gaining from it.

The legacy data platform

There are three sides of the problem worth exploring when talking about the legacy data platform: its purpose, the engineering practices, and the overall architecture.

The purpose

The legacy data platform was built to meet the customer requirements of the time. While shaping it through the eyes of the customer did enable Plexure to deliver on many fronts, the way we built it unintentionally created strong dependencies between our data services and the customer.

A good example of this tightly coupled setup was our data models. Not much thought was put into evolving datasets within the data platform before delivering them to the customer, which resulted in poor reusability and limited our ability to rely on data when troubleshooting incidents.

The engineering practices

Besides the challenges of scaling a bespoke platform, another problem was the lack of adequate engineering practices.

Building pipelines in an ad-hoc manner got in the way of adopting coding best practices. The absence of common engineering elements, like DRY code and automated tests, directly impacted the resilience of data pipelines and the quality of data outputs. The team was spending a significant amount of time fixing broken pipelines rather than delivering value to the business.

The overall architecture

The biggest problem in this area was the lack of consistency and visibility around how we used the cloud to solve our problems. Plexure’s transactional and data platforms run across two cloud vendors, Azure and AWS, and a mishmash of services was adopted across both to build data pipelines.

The service spaghetti in its full form

Without a common understanding of how services were tied together and a big picture view of the entire platform, end-to-end monitoring was difficult. Reacting to pipeline failures and resolving them would take time, as the ability to mitigate downstream impact relied on deep domain knowledge of our platform.

Another missing piece of the architecture was an environment to truly empower people at Plexure to unlock the value of, and better understand, the data processed by the platform. Without a definition of which AWS services to use to interact with data, or good data governance to help make sense of datasets, we were failing to get much value out of our data.

In summary, with so many fundamental areas to rethink, like the engineering practices, architecture and data models, it was clear that building a new data platform to support existing and new data pipelines was the best way forward.

The new data platform emerges

A new data platform was put in place to serve as the core capability powering data requirements at Plexure, making data accessible to the right people and giving us full control over how data flows through our ecosystem. We achieved this by building an exceptional data engineering team, adopting a strong engineering focus and bringing best practices from the software world into the data space.

The technology

Considering the long list of agreed contracts with our customers, we decided to continue hosting the data platform on AWS. We make use of native services in areas like data ingestion, storage and compute, but to power new data pipeline development we settled on Databricks and Prefect as the key engines of our new platform.

Bird’s eye view of the new data platform

Databricks sits at the core of the new architecture. It covers most of our data needs by offering a suite of services that allows us to process, catalog and track dependencies between datasets. It also creates secure interfaces that make data accessible to data engineers, analysts and data scientists, while keeping Apache Spark as our primary framework for interacting with data.
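To make this concrete, here is a minimal sketch of the kind of interaction this enables. The table and schema names are hypothetical; this illustrates the Spark-on-Databricks workflow rather than our actual code.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession is provided as `spark`; getOrCreate()
# keeps this sketch runnable locally as well.
spark = SparkSession.builder.getOrCreate()

# The catalog makes datasets discoverable across teams.
spark.sql("SHOW TABLES IN curated").show()

# Engineers, analysts and data scientists interact with the same
# tables through the same Spark API (table name is hypothetical).
orders = spark.read.table("curated.orders")
orders.filter(orders.status == "completed").show(5)
```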

Prefect is our orchestration tool of choice, and its primary use is to act as a layer around data pipelines. It offers a full view of how the platform operates and helps us define strict SLAs and dependencies across the stack. Besides orchestrating Databricks jobs, we also extend its use to other workloads, like data ingestion, maintenance tasks and integrations with new tools we bring into our stack.
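As a rough illustration, a Prefect 2 flow wrapping a Databricks job might look like the sketch below. The task bodies, job ID and retry settings are hypothetical placeholders, and a real integration could use the prefect-databricks collection instead.

```python
from prefect import flow, task

@task(retries=3, retry_delay_seconds=300)
def run_databricks_job(job_id: int) -> None:
    # Placeholder: in practice this would trigger the job via the
    # Databricks REST API or the prefect-databricks collection and
    # poll until it completes; Prefect retries it on failure.
    print(f"Triggering Databricks job {job_id}")

@task
def ingest_raw_files() -> None:
    # Hypothetical ingestion step that lands raw files before processing.
    print("Ingesting raw files")

@flow(name="daily-customer-spend")
def daily_pipeline() -> None:
    # The flow gives a single place to observe, schedule and define
    # dependencies for the whole pipeline.
    ingest_raw_files()
    run_databricks_job(job_id=123)

if __name__ == "__main__":
    daily_pipeline()
```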

Code & Practices

One of our core principles is to have everything as code. Hence, the platform's infrastructure is built with Terraform, and data pipelines are written with Spark and orchestrated with Python. The codebase lives in a monorepo, keeping everything required to build data pipelines in one place and encouraging reusability in the team. We live and breathe our DevOps principles, so features are deployed through automated CI/CD pipelines to all our environments.

To assess and guarantee the quality of data pipelines, we make sure new functionality is well tested. We have embraced the practice of writing unit, integration and end-to-end tests and running them as part of our release pipelines. To make sure applications meet end users' expectations we have adopted BDD, which allows us to use plain English to describe and test an application's behaviour. Vicky Avison, our staff data engineer, talked about the subject at the 2022 Data and AI Summit.
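As a sketch of the idea, assuming pytest-bdd as the framework (the feature text, step names and logic below are hypothetical stand-ins, not our actual tests), a plain-English scenario is bound to Python step definitions like this:

```python
# features/daily_spend.feature (plain-English specification):
#
#   Feature: Daily customer spend
#     Scenario: Spend is aggregated per customer per day
#       Given purchase events for one customer across two days
#       When the daily spend transformation runs
#       Then one row is produced per customer per day

from pytest_bdd import scenario, given, when, then

@scenario("features/daily_spend.feature",
          "Spend is aggregated per customer per day")
def test_daily_spend():
    pass

@given("purchase events for one customer across two days",
       target_fixture="events")
def events():
    return [
        {"customer_id": 1, "event_date": "2022-12-01", "amount": 5.0},
        {"customer_id": 1, "event_date": "2022-12-02", "amount": 7.5},
    ]

@when("the daily spend transformation runs", target_fixture="result")
def run_transformation(events):
    # Hypothetical pure-Python stand-in for the real Spark transformation.
    totals = {}
    for e in events:
        key = (e["customer_id"], e["event_date"])
        totals[key] = totals.get(key, 0.0) + e["amount"]
    return totals

@then("one row is produced per customer per day")
def check_result(result):
    assert len(result) == 2
```

Because the scenario reads as plain English, analysts can review and contribute to the expected behaviour without touching the step code.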

The data

Besides rethinking the architecture and putting strong engineering practices in place we also took the opportunity to re-evaluate how we store and evolve data models throughout the platform.

We adopted the medallion architecture to conceptually organise data in layers, named bronze, silver and gold. This has allowed us to define clear boundaries between the customer and the data platform and has empowered our users to better understand data transformations.

The medallion architecture
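A minimal sketch of how data might flow through the layers, with hypothetical table names and transformations: bronze holds raw data as ingested, silver cleans and conforms it, and gold serves business-level aggregates.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw events as ingested, nothing discarded.
bronze = spark.read.table("bronze.purchase_events")

# Silver: cleaned and conformed in one small, testable transformation.
silver = (
    bronze
    .dropDuplicates(["event_id"])
    .withColumn("event_date", F.to_date("event_timestamp"))
    .filter(F.col("amount") > 0)
)
silver.write.mode("overwrite").saveAsTable("silver.purchase_events")

# Gold: business-level aggregate ready for analysts and customers.
gold = (
    silver
    .groupBy("customer_id", "event_date")
    .agg(F.sum("amount").alias("total_spend"))
)
gold.write.mode("overwrite").saveAsTable("gold.daily_customer_spend")
```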

The impact and benefits

The last piece of the puzzle was to make data accessible, so we built an area called the Exploration Platform. This is a shared space, powered by Databricks, that empowers all of Plexure to explore our data.

Delivering the exploration space on top of the new data platform was the culmination of our main replatforming efforts and has allowed us to reshape how we do data. The following are areas where we have seen great improvements:

Time to deliver: developing and releasing data pipelines in the new data platform is significantly faster, mainly because most of the code is generic and we can approach new data pipelines with configuration instead of writing large pieces of code (see the sketch after this list).

Platform stability: Databricks has significantly improved our experience developing and maintaining Spark applications, and with Prefect we have been able to wrap these pipelines with self-healing workflows and monitor them in a centralised manner. Adopting the right toolset and following strong engineering practices has delivered a stable data platform and reduced the time on-call engineers spend dealing with production issues.

Cross-team collaboration: having data engineers, analysts and data scientists working on the same platform encourages collaboration and allows us to iterate on solutions together. By introducing BDD as a testing strategy, data engineers get input and test reviews from analysts to validate that outputs match business expectations before they make it into production.

Better data: the introduction of the medallion architecture as a granular system for evolving our data has also brought many benefits. It encourages data engineers to write smaller pieces of code, since data evolves through layers in small transformations, and it empowers people to find their own answers, freeing up data engineers' time to deliver on our roadmap.
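To give a feel for the configuration-driven approach mentioned under "Time to deliver", here is a minimal sketch. The config keys and the generic runner are hypothetical illustrations, not our actual format.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical pipeline definition: a new pipeline is mostly configuration.
pipeline_config = {
    "source_table": "silver.purchase_events",
    "target_table": "gold.daily_customer_spend",
    "group_by": ["customer_id", "event_date"],
    "sum_column": "amount",
}

def run_pipeline(spark: SparkSession, config: dict) -> None:
    """Generic runner: the same code is reused across many pipelines."""
    df = spark.read.table(config["source_table"])
    result = (
        df.groupBy(*config["group_by"])
          .agg(F.sum(config["sum_column"])
               .alias(f"total_{config['sum_column']}"))
    )
    result.write.mode("overwrite").saveAsTable(config["target_table"])

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    run_pipeline(spark, pipeline_config)
```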

And that's it! This is where we are in the process of rethinking data engineering at Plexure. With our data ecosystem in constant evolution, there are many areas we are excited to explore in the near future. I hope you enjoyed reading about our journey!


Maikel Penz

Data Engineering Team Lead @ Plexure | AWS Certified Solutions Architect | LinkedIn: https://www.linkedin.com/in/maikel-alexsander-penz/