28 October 2021
As one of the most popular open-source data pipeline management tools, Apache Airflow has users all around the world. Many like it, but like any other software it has weak points. To keep the system up to date and address these problems, the Airflow team runs Community Surveys and works on new versions of the software. Airflow 2.0 was recently released with many new features, including a more efficient scheduler.
Those who decide to move to Airflow 2.0 will soon realize that the scheduler was the Airflow team’s main area of focus. Although Airflow’s task execution was always relatively good, improving the scheduler’s efficiency was crucial for achieving higher overall performance of the entire system. Are you curious how the new Airflow scheduler differs from the old one?
Apache Airflow scheduler — what’s it for?
Apache Airflow has been transformed. Airflow’s team has added numerous new features and improved the performance of various components of the software. The component that has undergone the most significant modification is the scheduler. Its purpose is to read pipelines of tasks, which are represented as DAGs (Directed Acyclic Graphs). It schedules the tasks contained within DAGs, triggers them once their dependencies are met, and monitors their execution.
Running hundreds of tasks is not easy, and companies that run Airflow in production cannot afford to be held back. If you used the previous version of Airflow, you probably experienced DAG execution problems, such as delays in picking up tasks. Sometimes the scheduler crashed for an unknown reason, which could leave tasks hanging in the queued state. A scheduler crash could also be caused by a very short network outage (a couple of seconds), because there was no retry mechanism in Airflow core. These issues make horizontal scalability and availability of the scheduler very important for users. The scheduler also has to schedule and execute tasks as quickly and efficiently as possible, without missing tasks while scheduler replication is in progress. How has Airflow’s team addressed these challenges?
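To make the idea of triggering tasks once their dependencies are met more concrete, here is a minimal sketch in plain Python. It does not use Airflow’s API: the DAG is a hypothetical dictionary that maps each task to its upstream tasks, and the function names are made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical DAG: each task maps to the tasks it depends on (its upstream tasks).
DAG: Dict[str, List[str]] = {
    "extract": [],
    "transform": ["extract"],
    "validate": ["extract"],
    "load": ["transform", "validate"],
}


def ready_tasks(dag: Dict[str, List[str]], done: Set[str]) -> List[str]:
    """Return tasks that have not run yet and whose dependencies are all done."""
    return sorted(
        task
        for task, upstream in dag.items()
        if task not in done and all(dep in done for dep in upstream)
    )


def run_order(dag: Dict[str, List[str]]) -> List[List[str]]:
    """Simulate scheduling rounds until every task has run."""
    done: Set[str] = set()
    rounds: List[List[str]] = []
    while len(done) < len(dag):
        batch = ready_tasks(dag, done)
        if not batch:
            # No task is ready but some remain: the graph is not acyclic.
            raise ValueError("cycle detected: no task is ready")
        rounds.append(batch)
        done.update(batch)  # stand-in for actually executing the tasks
    return rounds


if __name__ == "__main__":
    for number, batch in enumerate(run_order(DAG), start=1):
        print(f"round {number}: {batch}")
```

Tasks in the same round have no dependencies on each other, which is why a scheduler can hand them out in parallel.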
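For readers wondering what a retry mechanism for short outages looks like, the sketch below shows the general pattern: retry with exponential backoff. It is an illustration only, not Airflow code. The function name, the parameters, and the choice of `ConnectionError` as the failure signal are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(); on ConnectionError wait and retry with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With a wrapper like this, an outage of a couple of seconds costs a short delay instead of a crashed process.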
The Airflow scheduler in version 2.0
The Airflow team has certainly succeeded in optimizing resource usage. The scheduler now works much faster with no increase in CPU or memory usage. Each scheduler operates independently, so when the system runs more than one scheduler, there is zero downtime. Performance when executing many DAGs in parallel has been significantly improved. Would you like to learn more? Let’s focus on the details.
Multiple schedulers in the Active Model
Airflow 2.0 allows users to run multiple schedulers. Each of them is “active” and can handle numerous tasks, including those not related to scheduling, such as task execution monitoring and task error handling. All running schedulers have access to the same relational database. Once one of the schedulers picks a DAG to process from the database, it locks that DAG so the other schedulers cannot process it. After the processing is done, the DAG is released.
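The pick, lock, process, release cycle can be sketched with a small simulation. Airflow itself relies on locks in the shared database. This sketch swaps in in-memory `threading` locks, and `DagStore`, `scheduler_loop`, and `run_schedulers` are hypothetical names, so treat it as an illustration of the pattern rather than Airflow’s implementation.

```python
import threading
from typing import Dict, Iterable, List, Optional


class DagStore:
    """In-memory stand-in for the shared database (hypothetical)."""

    def __init__(self, dag_ids: Iterable[str]) -> None:
        ids = list(dag_ids)
        self._locks = {dag_id: threading.Lock() for dag_id in ids}
        self._pending = set(ids)  # DAGs still waiting to be processed

    def claim_next(self) -> Optional[str]:
        """Lock and return a pending DAG that no other scheduler holds."""
        for dag_id in sorted(self._locks):
            lock = self._locks[dag_id]
            # Non-blocking: skip DAGs that another scheduler has locked.
            if lock.acquire(blocking=False):
                if dag_id in self._pending:
                    return dag_id
                lock.release()  # already processed, nothing to do
        return None

    def finish(self, dag_id: str) -> None:
        """Mark the DAG as processed and release it for other schedulers."""
        self._pending.discard(dag_id)
        self._locks[dag_id].release()


def scheduler_loop(name: str, store: DagStore, log: Dict[str, List[str]]) -> None:
    """Keep claiming DAGs until none are left to claim."""
    while True:
        dag_id = store.claim_next()
        if dag_id is None:
            return
        log[name].append(dag_id)  # stand-in for the real scheduling work
        store.finish(dag_id)


def run_schedulers(dag_ids: Iterable[str], scheduler_count: int) -> Dict[str, List[str]]:
    """Run several schedulers against one store and report who processed what."""
    store = DagStore(dag_ids)
    log: Dict[str, List[str]] = {f"scheduler-{i}": [] for i in range(scheduler_count)}
    threads = [
        threading.Thread(target=scheduler_loop, args=(name, store, log))
        for name in log
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    print(run_schedulers([f"dag_{i}" for i in range(6)], scheduler_count=3))
```

How the DAGs are split between schedulers varies from run to run, but each DAG is processed exactly once, because a scheduler only works on a DAG while it holds that DAG’s lock.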
Horizontal Scalability
Having multiple schedulers in the active model allows Airflow 2.0 to achieve horizontal scalability. Put simply, if the load on one scheduler is too high, you can scale Airflow horizontally by running several schedulers across nodes. If you no longer need them, you can reduce the number of schedulers to minimize resource usage. The schedulers are identical, so scaling down has no negative effects.
Zero recovery time and no crashed schedulers
The active model in Airflow 2.0’s scheduler results in zero downtime and no recovery time in case of a scheduler failure. That is because the other fully active schedulers keep running and take over the work when one of the schedulers crashes. Of course, throughput will probably be reduced, as there is more work left for the remaining schedulers, but no recovery step is needed.
Each scheduler independent
Maintaining complex software like Apache Airflow is not easy, but having it work without delays is very important, as it is often part of a much bigger system. Installing the necessary patches and updates is still crucial, because the entire system’s security depends on it. In Airflow 2.0, updating one of the schedulers does not affect the performance of the others.
The scheduler’s overall optimization
The new version of Airflow comes with an interesting “fast-follow” feature that makes the scheduler more efficient. Some describe it as a “mini-scheduler” in the workers. After a task completes in an Airflow worker, the system checks whether the DAG currently being processed has any follow-on task. If one is available, the current worker takes it and executes it right away, so the follow-on task does not have to be scheduled to a worker. Processing time is reduced and efficiency is increased.
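The fast-follow behaviour described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Airflow’s internal code. The DAG dictionary, the task names, and both functions are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical DAG: each task maps to its upstream tasks.
DAG: Dict[str, List[str]] = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "audit": [],
    "notify": ["load", "audit"],
}


def downstream_ready(dag: Dict[str, List[str]], done: Set[str], finished: str) -> List[str]:
    """Direct follow-on tasks of `finished` whose upstream tasks are all done."""
    return sorted(
        task
        for task, upstream in dag.items()
        if finished in upstream
        and task not in done
        and all(dep in done for dep in upstream)
    )


def worker_run(dag: Dict[str, List[str]], first_task: str, done: Set[str]) -> List[str]:
    """Run first_task, then keep taking a ready follow-on task on the same worker."""
    executed: List[str] = []
    task: Optional[str] = first_task
    while task is not None:
        done.add(task)  # stand-in for actually executing the task
        executed.append(task)
        followers = downstream_ready(dag, done, task)
        # Take one follow-on task directly; the rest go through the scheduler.
        task = followers[0] if followers else None
    return executed


if __name__ == "__main__":
    finished: Set[str] = set()
    print(worker_run(DAG, "extract", finished))
    print(worker_run(DAG, "audit", finished))
```

In this example the first worker runs `extract`, `transform`, and `load` back to back, then stops because `notify` still waits for `audit`. The worker that finishes `audit` picks up `notify` directly.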
Airflow 2.0 installation and customization
Apache Airflow has gone through a significant transformation, and the types of changes carried out by Airflow’s team suggest that this system will, in time, be suitable for a wide variety of projects. Think of your company’s needs and ask yourself whether this flexible software would be a good fit for your future business endeavours.
We offer assistance and advice on Airflow 2.0 installation, and we can help you configure it according to your organization’s requirements. We can also prepare and run Airflow 2.0 training for your in-house team, so your employees can make the most of the new features as soon as possible. Contact us for more information.