
In a world where more and more organizations are data driven, data volumes are growing rapidly, and analytical platforms are moving to the cloud and becoming more complex to deploy, data engineers still want to deliver the best-quality answers to business questions.
Apache Airflow is a well-known tool to author, schedule and monitor data workflows. In this article I describe an improvement introduced in our project which lets us deliver better-quality DAGs faster.

  1. Why do we need a local environment?
  2. A local environment for Apache Airflow
  3. Docker Compose as a solution
  4. Advantages of a local Airflow environment based on Docker Compose compared to a remote environment
  5. Use cases for a local Airflow environment based on Docker Compose

Why do we need a local environment?

As developers and data engineers we want to deliver a high-quality product, ready for production deployment with no bugs or performance issues. To achieve that goal we use local environments to develop, test and optimize our code.

A local environment for Apache Airflow

Production environments are sometimes very complex (for example, they use Kubernetes clusters and connections to cloud services) and deployment takes a lot of time. How can you simplify things reasonably and still keep your code testable? Let’s learn from the example of Apache Airflow.
In a previous article we wrote about the use cases and general purpose of Apache Airflow.
If you are new to this topic, I highly recommend taking a look at https://dsstream.com/when-should-you-consider-using-apache-airflow-use-cases/
I assume that you are familiar with Airflow, use it constantly and wonder whether there is a better way to write and test DAGs. You are not alone; the same topic keeps coming up during our retro meetings 😊
Our production and preprod environments run in the cloud on Kubernetes Engine, and DAG synchronization took around 2 to 5 minutes each time, which is far too long for efficient development.
To deliver quality pipelines faster, we decided to spin up local Airflow environments. Running Kubernetes locally was too much overhead for us, because we want to focus on DAG development, not configuration. We went for a simpler solution and used Docker Compose.

Docker Compose as a solution

Docker Compose is a tool for defining and running multi-container Docker applications. 

With Compose you can define the desired number of containers, their builds and their storage, and then build, run and configure all of them with a single set of commands.
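As a sketch, the day-to-day lifecycle of such an environment comes down to a handful of commands (assuming the Compose file lives in the current directory; `airflow-init` is the one-off initialization service defined in the official Airflow Compose file):

```shell
# One-time step: initialize the metadata database and create the default user
docker compose up airflow-init

# Start all services (webserver, scheduler, metadata database, ...) in the background
docker compose up -d

# Rebuild the image after changing the Dockerfile or dependencies
docker compose build

# Tear everything down, including the volumes that hold the metadata database
docker compose down --volumes
```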

It is a great solution for development and testing.

The Airflow community provides an official Docker Compose file; however, to better mirror our production settings we customized it a little. We added libraries such as black, pylint and pytest to check our code, plus scripts that help us with image rebuilds and cloud connection settings.
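As an illustration (the image tag here is a placeholder, not our exact version), baking such tooling into the image only takes a short Dockerfile on top of the official one, referenced from the `build:` section of the Compose file:

```dockerfile
# Extend the official image with development tooling
# (a sketch; pin versions that match your deployment)
FROM apache/airflow:2.3.0
RUN pip install --no-cache-dir black pylint pytest
```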

Advantages of a local Airflow environment based on Docker Compose compared to a remote environment

The main advantages we discovered using a local Airflow environment on Docker Compose are:
– fast DAG updates – the catalogue with DAGs is mounted as a volume, so updating a DAG is as fast as the value of the scheduler’s dag_dir_list_interval parameter allows, which can be seconds
– easy log access – the catalogue with logs is also mounted as a volume, which gives us the chance to quickly grep, tail or simply read the log files
– easy configuration changes – we keep the configuration in a .env file, which lets us change parameters transparently without overwriting airflow.cfg and gives us the option of keeping many .env files and easily switching between set-ups
– easy image redeployment – we have full control of the environment
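The .env-based configuration mentioned above can look roughly like this (values are illustrative). Airflow maps environment variables of the form `AIRFLOW__<SECTION>__<KEY>` onto the corresponding airflow.cfg entries, so the config file itself never has to change:

```shell
# .env – picked up automatically by Docker Compose (illustrative values)
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=5   # re-scan the DAG folder every 5 seconds
AIRFLOW__CORE__LOAD_EXAMPLES=false            # hide the example DAGs
AIRFLOW_UID=50000                             # host UID used for the mounted volumes
```

Keeping several such files (say, .env.dev and .env.bigquery) and copying the right one to .env is enough to switch between set-ups.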

Read more about the Airflow 2.0 setup with DAG Versioning and Serialization here.

Use cases for a local Airflow environment based on Docker Compose

A local Airflow environment with Docker Compose was helpful not only for DAG development; it was also a major help when we wanted to:
– upgrade the Airflow version,
– add Python dependencies,
– test new versions of libraries,
– test a new integration (like BigQuery); when we were wondering where to start, we used the Cloud SDK with personal credentials, which made development quick and efficient.
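For the BigQuery case, the rough idea was to make personal gcloud credentials visible inside the containers; a minimal sketch (the paths and service name are assumptions that depend on your Compose file):

```yaml
# docker-compose.override.yaml – reuse local gcloud credentials in a container (sketch)
services:
  airflow-scheduler:
    volumes:
      - ~/.config/gcloud:/home/airflow/.config/gcloud:ro
    environment:
      GOOGLE_APPLICATION_CREDENTIALS: /home/airflow/.config/gcloud/application_default_credentials.json
```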

If you face similar challenges in your work and would like to exchange experiences or use our support, please Contact us.

 


Author

  • Milena works as a Product Owner of Airee - the DS Stream Airflow Managed Service. She has over 7 years' experience in managing the development and maintenance of an analytical platform. She likes cooperating with tech specialists and learning new Big Data tools.
