Workflow Automation: Empowering teams through our in-house, self-service framework

A Workflow consists of steps configured to run in a predefined order to accomplish a specific business objective. Workflows range from something as simple as an IT request process in a small company to complex data transformations aimed at delivering key business insights.

Leaving complexity aside, some characteristics are common across all of them:

  • Workflows need to run somewhere;
  • Repeatedly;
  • Triggered on a schedule or ad-hoc;
  • And sadly they will not always perform as expected.

This is where the concept of Workflow Automation comes in. It can be thought of as a framework that seeks to standardize and facilitate the development and deployment of workflows across an organization. These frameworks are especially important in mid-size and large companies, where multiple teams have similar requirements but don’t necessarily communicate about their needs.

The Problem

At Trade Me, we identified the following pain points that led us to build our own framework:

  • Data Engineering backlog growing with data requests;

As Data Engineers, we were responsible for delivering data outputs in response to customer requests. We soon realized that this was not sustainable: the number of data analysts and scientists was growing, but our team remained the same size.

  • Analysts and Data Scientists running workflows on their own machines;

While we were spending most of our time going through the backlog we could not properly help the ones who wanted to write their own transformations. This generated frustration from the consumer’s point of view - especially regarding the performance of running things locally - but also on our end as data that should be kept in the cloud was being downloaded to their laptops.

  • Lack of awareness about existing workflows;

Data Analysts and Scientists are spread across the business, working on problems related to their own area. There was no transparency about the code written to interact with our data sources, which resulted in duplicated work. Also, disparate development practices resulted in recurring bugs and slow review processes.

How are we solving the Problem?

We kicked off this project with the goal of making it easy for anyone across the organization to write, deploy and run workflows.

We wanted to move away from people doing everything locally like…

… to a more hands-off approach where data consumers would only worry about their code and let our framework do the rest…

To start simple and enable us to move faster we decided that our framework would support a single/primary programming language: Python. Besides being the go-to language for data workloads it is also heavily used by analysts and data scientists at Trade Me.

The next step was to figure out the Orchestration, Execution, Deployment and Visualization layers.

Orchestration: this is the layer that defines what the workflow looks like. It is also the layer that both data engineers and data consumers need to be comfortable with. As we had decided to go with Python and had been experimenting with Prefect for a few months, it made sense to incorporate this powerful workflow management system into the framework.

Execution: to better manage dependencies and treat workflows as single, separate units, we decided to containerize them. We push every workflow as a Docker image to AWS ECR and use AWS Fargate to run it. This setup works well as our data lives on AWS and Prefect integrates well with Fargate.

Deployment: we use GitLab as our web-based Git repository and take advantage of its CI/CD functionality to build, push and register workflows in our environments.

Visualization: we signed up for Prefect Cloud, which is an interface on top of Prefect’s open-source engine. It allows us to connect our AWS execution environments with its web UI to monitor, troubleshoot and interact with workflows.

There’s just one small missing part: the name. Here I introduce you to DWAAS.

The image below illustrates the architecture from end to end.

How does DWAAS work?

DWAAS pushes workflows to Test and Production environments. Both consist of a combination of Prefect Cloud Projects (where workflows are registered) and AWS accounts (where workflows run). This gives our consumers the ability to validate their code and the data before going to production.

The Deployment Process

Anyone who wants to deploy a Workflow through DWAAS must follow these steps:

  • 1 - Create a branch in a GitLab repository;
  • 2 - Set Test and Production configurations through config files designed by the Data Engineering team. For example:
{
  "name": "my-workflow",
  "schedule-enabled": true,
  "schedule": "cron(0 14 * * ? *)",
  "parameters": {"env": "prod", "account": "12345678910"},
  "cloud-execution-environment": "fargate",
  "memory": 4096,
  "cpu": 1024
}
  • 3 - Write the workflow using Prefect. Below is a simple ETL example.
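from prefect import Flow, task


@task
def extract():
    # Stand-in for reading from a real data source.
    return [1, 2, 3]


@task
def transform(x):
    # Multiply every extracted value by 10.
    return [i * 10 for i in x]


@task
def load(y):
    # Stand-in for writing to a real destination.
    print("Received y: {}".format(y))


# Calling tasks inside the Flow context builds the dependency graph;
# nothing runs until the flow itself is executed.
with Flow("ETL") as flow:
    e = extract()
    t = transform(e)
    l = load(t)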
  • 4 - Commit/Push the code. This will trigger the CI/CD pipeline to push the workflow to Test.
  • 5 - Validate that the workflow works in Test through the Prefect Cloud UI.
  • 6 - Create a Merge Request. Once approved by a reviewer, the code is merged into master and the CI/CD pipeline pushes the workflow to Production.
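Before committing in step 4, it can help to sanity-check the config file from step 2 locally. The sketch below is not part of DWAAS; it is a minimal illustration that assumes every key in the example config is required and that `fargate` workflows must use a CPU/memory pair that AWS Fargate accepts. The function name `validate_config` is hypothetical, and the size table covers only a subset of Fargate task sizes, so check the AWS documentation for the current list.

```python
import json

# Keys and types taken from the example config in step 2. Treating all of
# them as required is an assumption made for this sketch.
REQUIRED_KEYS = {
    "name": str,
    "schedule-enabled": bool,
    "schedule": str,
    "parameters": dict,
    "cloud-execution-environment": str,
    "memory": int,
    "cpu": int,
}

# CPU units -> allowed memory values (MiB) for a subset of Fargate task sizes.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # type(...) is compared directly (rather than using isinstance) so that
    # True/False are not accepted where an integer is expected.
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append("missing key: {}".format(key))
        elif type(config[key]) is not expected_type:
            problems.append(
                "{} should be of type {}".format(key, expected_type.__name__)
            )

    if config.get("cloud-execution-environment") == "fargate":
        cpu, memory = config.get("cpu"), config.get("memory")
        allowed = FARGATE_MEMORY_BY_CPU.get(cpu) if type(cpu) is int else None
        if allowed is None:
            problems.append("unsupported Fargate cpu: {}".format(cpu))
        elif memory not in allowed:
            problems.append(
                "memory {} is not valid for cpu {}".format(memory, cpu)
            )
    return problems


example = json.loads("""
{
  "name": "my-workflow",
  "schedule-enabled": true,
  "schedule": "cron(0 14 * * ? *)",
  "parameters": {"env": "prod", "account": "12345678910"},
  "cloud-execution-environment": "fargate",
  "memory": 4096,
  "cpu": 1024
}
""")
print(validate_config(example))  # []
```

Returning a list of problems instead of raising on the first one lets a developer fix everything in a single pass before the pipeline runs.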

The CI/CD Pipeline

These are the deployment stages:

  • Workflow Docker image is built;

This step builds a Prefect-based Docker image containing the workflow code and installs its dependencies, such as the external Python libraries it uses.

  • Workflow Docker image is pushed to AWS ECR;

An AWS ECR Repository is created for the workflow and the image is pushed to either our Test or Production AWS account.

  • Workflow is registered to Prefect Cloud;

This step registers the flow to Prefect Cloud and creates a Task Definition on AWS ECS. The Task Definition points to the previously built ECR image and has Memory and CPU requirements set based on what is defined in the workflow config files.
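The stages above, combined with the branch rules from the deployment process (feature branches go to Test, master goes to Production), can be sketched as a small planning function. This is an illustration only, not the actual pipeline code: `plan_deployment`, the stage names, the placeholder account IDs, the region, and the choice to tag images with the commit SHA are all assumptions. The image URI follows the standard ECR layout of account, region, repository and tag.

```python
from dataclasses import dataclass

# Stage names paraphrase the three stages listed above (hypothetical labels).
STAGES = ("build_image", "push_image_to_ecr", "register_with_prefect_cloud")


@dataclass(frozen=True)
class DeploymentPlan:
    environment: str
    image_uri: str
    stages: tuple


def plan_deployment(branch, workflow_name, commit_sha, account_ids, region):
    """Pick the target environment from the branch and derive the image URI."""
    # Code merged into master goes to Production; any other branch goes to Test.
    environment = "production" if branch == "master" else "test"
    # ECR image URI layout: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
    # Naming the repository after the workflow and tagging with the commit SHA
    # are assumptions made for this sketch.
    image_uri = "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(
        account_ids[environment], region, workflow_name, commit_sha
    )
    return DeploymentPlan(environment, image_uri, STAGES)


# Placeholder account IDs and region, for illustration only.
accounts = {"test": "111111111111", "production": "222222222222"}
plan = plan_deployment(
    "feature/my-workflow", "my-workflow", "abc1234", accounts, "ap-southeast-2"
)
print(plan.environment)  # test
print(plan.image_uri)
# 111111111111.dkr.ecr.ap-southeast-2.amazonaws.com/my-workflow:abc1234
```

Keeping the environment decision in one function makes it easy to see that the same three stages run for both Test and Production, with only the target account changing.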

The image below illustrates a successful deployment pipeline on GitLab.

Visualizing the Deployed Workflow

Once the Pipeline finishes we can check the Prefect Cloud UI and the workflow will be available inside the project.

The following image shows a graphical representation of a workflow run:

Our Learnings so far

DWAAS is still growing as an internal product at Trade Me, but the engagement and interest we have had from our consumers have been invaluable. Analysts have achieved great results wrangling data from multiple sources and have sped up existing manual processes dramatically. They now own their workflows and have full autonomy to make changes or fix bugs as soon as they occur.

We also see an opportunity for this to be used for other business processes outside of data science and analytics. In a world where we’re constantly looking to reduce manual intervention, having automated workflows across the business could provide many opportunities and improvements.

Finally, we learned that a framework doesn’t solve the problem by itself. Recurring training sessions with consumers and constant involvement with teams across the business are key to identifying opportunities to bring people on board and make use of it. Also, while technology is evolving fast in the data space, we still find that off-the-shelf solutions usually solve only part of the problem. Our job as engineers requires creativity to deliver value.

What’s Next?

The following are three areas I find important to focus on:

AWS Fargate is the current chosen execution environment and it handles our use cases very well. However, to accommodate workflows that require Tasks to run in parallel we might look into how Prefect leverages on-the-fly creation of temporary Dask clusters on top of Fargate and Kubernetes.

  • Improve Documentation and Processes:

The more people use DWAAS across the organization, the more important documentation becomes in helping developers get up to speed with the framework. Also, once the maturity level reaches an acceptable point, we might want to allow reviews and approvals to happen without involving the Data Engineering team.

  • Focus on Quality and Code Sharing:

Explore ways to verify data and code quality as part of the workflow development process. Publish Python libraries and packages to be reused across workflows.
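As one possible shape for a data quality check, the sketch below shows a plain Python function that could sit between a transform and a load step and could later be published as a shared helper. It is an assumption about what such a check might look like, not something DWAAS provides today; `check_rows` and the sample rows are hypothetical.

```python
def check_rows(rows, required_fields):
    """Split rows into valid ones and a list of (index, reason) rejections."""
    valid, rejected = [], []
    for index, row in enumerate(rows):
        # A field counts as missing when it is absent or explicitly None.
        missing = [field for field in required_fields if row.get(field) is None]
        if missing:
            rejected.append((index, "missing: " + ", ".join(missing)))
        else:
            valid.append(row)
    return valid, rejected


# Hypothetical sample data: the second row has no price.
rows = [{"id": 1, "price": 10.0}, {"id": 2, "price": None}]
valid, rejected = check_rows(rows, ["id", "price"])
print(valid)     # [{'id': 1, 'price': 10.0}]
print(rejected)  # [(1, 'missing: price')]
```

Returning the rejected rows with a reason, rather than silently dropping them, gives workflow owners something concrete to investigate when a run looks wrong.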

I would like to give a big shout-out to Jessica Young, Joshua Lowe and Alan Featherston, who along with me have been building DWAAS. #GoDataEngineeringTeam
