
The smooth flow of business data is essential for a company’s success. Developing and managing data pipelines often requires a significant budget, the right tools and experienced data engineers. The Airflow Celery Executor may make it easier for your experts to create a scalable application. Learn what it is and what advantages it offers.

You probably know something about Apache Airflow already if you are interested in a comparison of Airflow vs. Celery. It is an open-source tool that enables users to programmatically author, schedule and monitor workflows. It is quite popular among data engineers around the world, as it has a rather rich user interface that makes it easier to work efficiently with the workflows. You can view pipelines running in production, check progress and spot potential issues fast. Workflows in Airflow are in the form of DAGs (Directed Acyclic Graphs). The Airflow Celery Executor improves the efficiency of scaling and distributing tasks. While Apache Airflow has many executors, Airflow Celery is one of the most widely leveraged. Learn why.

What is an Executor in Airflow?

Once a DAG is defined, the tasks inside it have to be executed somehow. That is what executors do in Apache Airflow. Executors are simply the mechanisms that make task instances run. Because they share a common API, you can switch between the executors available in Airflow if your requirements change.

The main types of executors in Apache Airflow are:

Local Executors:

  • Debug Executor
  • Local Executor
  • Sequential Executor

Remote Executors: 

  • Dask Executor 
  • Kubernetes Executor
  • Celery Executor
  • CeleryKubernetes Executor

Airflow Celery Executor

Only some of Airflow’s executors can run many tasks in parallel across multiple machines. One of them is the Airflow Celery Executor. With this solution, tasks can be efficiently distributed among many workers (their number, as well as the resources they need, can be defined in advance) and carried out in parallel.

What do you need to know about it?

First things first. A comparison of Airflow vs Celery is pointless because the Celery executor is actually a part of Apache Airflow. The right question to ask is: how does Celery differ from Airflow’s other executors? What makes it special and worth using? 

Celery itself is a distributed task queue that can be used for running Python processes (tasks). It comes with its own tooling, and professional support is available for those who choose to use it in production. The Airflow Celery Executor is one of the most popular executors among Airflow users who need to scale out by distributing workloads to many workers, which can run on different machines.

So, how does it work? To deal with the workloads, the Airflow Celery Executor has to move tasks to multiple workers. It does that using messages. The scheduler puts a message in the queue, and the message is later picked up by the worker that will execute the task. It is also worth mentioning what happens when a failure occurs: if the worker assigned to a task fails to complete it, the task is assigned to a different worker rather quickly.
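The message flow described above can be illustrated with a toy simulation that uses only Python’s standard library. This is a sketch of the pattern, not Celery’s actual implementation: `queue.Queue` stands in for the message broker, the `results` dictionary stands in for the result backend, and the function and worker names are made up for illustration. Failure handling is left out to keep the example short.

```python
import queue
import threading


def run_workers(task_ids, num_workers=3):
    """Toy model: a 'scheduler' enqueues task messages and workers consume them."""
    broker = queue.Queue()   # stands in for the message broker (RabbitMQ or Redis)
    results = {}             # stands in for the result backend
    lock = threading.Lock()

    def worker(name):
        while True:
            task_id = broker.get()
            if task_id is None:          # sentinel: no more work for this worker
                broker.task_done()
                return
            with lock:
                results[task_id] = name  # record which worker "executed" the task
            broker.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()

    # The "scheduler" side: one message per task, then one sentinel per worker.
    for task_id in task_ids:
        broker.put(task_id)
    for _ in threads:
        broker.put(None)

    broker.join()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_workers(["extract", "transform", "load"]))
```

Which worker picks up which task is not deterministic. As with real Celery workers, the queue decides.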

There are many executors and, of course, Airflow Celery is not the only one that can run multiple tasks in parallel. Why should you choose it for production, then? Well, each executor excels at something, and you should analyze your needs first, as companies should always do before selecting a tech stack or an approach for a given project. Local executors allow users to test applications efficiently in terms of resource management (even when dealing with really heavy workloads). Airflow Celery, though, seems to be a much better option for running DAGs in production when tasks need to be processed fast.

Challenges with using Celery with Airflow

The Airflow Celery Executor is used by data pipeline engineers all over the world. As with any other solution, it has its weak spots. The hardest part of using Celery with Airflow is properly assessing the amount of resources required to carry out all the scheduled tasks. Adjusting the number of workers is the key to success. Still, an experienced team can use it efficiently. One of the most frequently mentioned drawbacks is the lack of an out-of-the-box mechanism for scaling Celery workers based on actual task queue usage. This problem can be solved by using KEDA, though.
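To make the sizing problem more concrete, here is a rough sketch of the kind of rule a queue-based autoscaler can apply: count the running and queued tasks and divide by the number of tasks a single worker handles at once. The function name and the `max_workers` cap are hypothetical. The default of 16 mirrors Airflow’s default `worker_concurrency` setting, and the exact query used in your KEDA setup may differ.

```python
import math


def desired_workers(running, queued, worker_concurrency=16, max_workers=10):
    """Estimate how many Celery workers are needed for the current load.

    Every running or queued task needs one slot, and each worker offers
    `worker_concurrency` slots. `max_workers` is an illustrative upper bound.
    """
    needed = math.ceil((running + queued) / worker_concurrency)
    return min(needed, max_workers)


if __name__ == "__main__":
    for running, queued in [(0, 0), (10, 7), (40, 100)]:
        print(running, queued, desired_workers(running, queued))
```

With no running or queued tasks the estimate drops to zero workers, which is exactly the scale-to-zero behavior that a fixed pool of Celery workers cannot offer on its own.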

In general, using Celery requires solid knowledge of one of two message brokers – RabbitMQ or Redis. There is also the matter of setting up and maintaining the worker queues.
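As an illustration of what the broker setup involves, the sketch below builds the environment variables that point Airflow at the Celery executor, a broker and a result backend. The variable names follow Airflow’s `AIRFLOW__SECTION__KEY` convention. The helper function, host names and credentials are placeholders invented for this example, so adjust them to your own deployment.

```python
def celery_executor_env(broker_url, result_backend):
    """Return environment variables that switch Airflow to the Celery executor."""
    return {
        "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
        "AIRFLOW__CELERY__BROKER_URL": broker_url,          # Redis or RabbitMQ URL
        "AIRFLOW__CELERY__RESULT_BACKEND": result_backend,  # where task states are stored
    }


# Placeholder values: a Redis broker and a PostgreSQL result backend.
env = celery_executor_env(
    broker_url="redis://redis:6379/0",
    result_backend="db+postgresql://airflow:airflow@postgres/airflow",
)

if __name__ == "__main__":
    for key, value in env.items():
        print(f"export {key}={value}")
```

The same settings can be placed in `airflow.cfg` instead. The scheduler and every worker have to see identical values so that they talk to the same broker.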

Benefits of using the Airflow Celery Executor

One of the most important advantages of Celery is that it runs tasks in parallel across many workers, which only some of Airflow’s executors can do. The second reason to consider it is horizontal scalability. With Airflow Celery, you can add new workers whenever you need them, and they are ready to use right away, which significantly improves Apache Airflow’s overall efficiency. Users can also prioritize the tasks they consider more critical than others.

To sum it up, the main benefits of using the Airflow Celery Executor are:

  • the capability of running tasks in parallel
  • horizontal scalability
  • efficiency
  • the prioritization option
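The prioritization option can be pictured with a small standard-library sketch. In Airflow, tasks accept a `priority_weight` parameter, and tasks with a higher weight are picked up first when worker slots are scarce. The function below is only a toy model of that ordering: it ignores Airflow’s weight rules, and the task names are invented.

```python
import heapq


def execution_order(tasks):
    """Return task ids ordered by priority weight, highest first.

    `tasks` is a list of (task_id, priority_weight) pairs. Tasks with equal
    weight keep the order in which they were submitted.
    """
    # heapq is a min-heap, so the weight is negated to pop the largest first.
    heap = [(-weight, index, task_id) for index, (task_id, weight) in enumerate(tasks)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, task_id = heapq.heappop(heap)
        order.append(task_id)
    return order


if __name__ == "__main__":
    print(execution_order([("report", 1), ("billing", 10), ("cleanup", 1)]))
```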

You should also remember that the potential challenges mentioned above can be eased with the help of Kubernetes – Airflow with Celery can be deployed in Kubernetes, and Airflow also offers a Kubernetes executor. Additionally, the new version of Airflow (2.0) brings a new executor – CeleryKubernetes. It provides users with the best of both worlds.

The CeleryKubernetes Executor – what is it?

The Kubernetes Executor runs each task instance in its own Kubernetes pod – this way every task receives its own dedicated resource space. Thanks to this, misjudging the resources needed for one task does not affect the execution of another. Kubernetes, like Celery, has its advantages and disadvantages. Celery seems to be great for handling a large number of tasks that need a similar amount of resources. Kubernetes is an ideal solution for creating a separate environment for each task and allows Airflow to deal with more demanding tasks. It is also important to mention that while Kubernetes does not keep unused pods (those with no tasks in them), Celery keeps a predefined number of workers running whether they are needed or not.

CeleryKubernetes is a combination of Celery and Kubernetes. To put it simply, by picking it, you gain the opportunity to use both of these executors at the same time, which means benefiting from all the good they offer. Naturally, to use this combination, you have to configure both executors, and that may take a bit of time. Still, you’ll see for yourself that it is worth it. The CeleryKubernetes Executor is recommended for processes that consist of many undemanding tasks that can be performed using Celery, but that also contain some resource-intensive tasks or tasks that should be run in isolation.
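How the combined executor decides where a task goes can be sketched in a few lines. In Airflow, the CeleryKubernetes Executor looks at the task’s `queue`: tasks sent to the queue named by the `kubernetes_queue` setting (which, to our knowledge, defaults to `kubernetes`) run in their own pod, and everything else goes to Celery. The function below is a simplified model of that rule with invented queue names, not Airflow’s source code.

```python
KUBERNETES_QUEUE = "kubernetes"  # assumed default of [celery_kubernetes_executor] kubernetes_queue


def pick_executor(task_queue, kubernetes_queue=KUBERNETES_QUEUE):
    """Model the routing rule: one special queue goes to Kubernetes, the rest to Celery."""
    if task_queue == kubernetes_queue:
        return "KubernetesExecutor"
    return "CeleryExecutor"


if __name__ == "__main__":
    # Hypothetical tasks: two light ones and one resource-hungry one.
    for task_id, task_queue in [
        ("send_email", "default"),
        ("load_table", "default"),
        ("train_model", "kubernetes"),
    ]:
        print(task_id, "->", pick_executor(task_queue))
```

In a real DAG, the choice comes down to setting `queue="kubernetes"` on the operators that need isolation and leaving the other tasks on their default queue.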

Apache Airflow 2.0 came with various changes and improvements. It takes time to learn how to use them to your advantage, but you should certainly try out some of the new solutions. Combining the Airflow Celery and Kubernetes Executors allows companies to perform their work more efficiently. It also gives them more flexibility, as they are no longer limited to using only one of these executors. Do you want to start using the new version of Apache Airflow? Perhaps you’d like to learn how to use this tool more efficiently? We will be glad to assist you or advise you on improving your business processes. Contact us for more information.


Author

  • Tomasz Stachera

    Tomasz is a Kubernetes Team Leader and CI/CD expert, evangelizing DevOps culture at DS Stream. For our customers, Tomasz delivers end-to-end MLOps solutions on GCP and architects the Airflow as a Service multi-cloud product. He never stops learning new technologies and spreading them in the organization. In a previous life he was a Barça and Premier League fan; currently he spends all his free time preparing his 2-year-old son to be Robert Lewandowski's successor.
