9 November 2021
- Airflow
The release of a new software version can create significant challenges for data engineers. Those who use Apache Airflow and have already worked with Airflow 2.0 will surely agree that even minor modifications can change how DAGs behave or even block them. Has Airflow changed for the better? How can DAG versioning and serialization simplify its setup?
Although some functionality of earlier versions has been preserved, the new Airflow brings important changes; for example, it comes with a complete REST API. These changes can pose new challenges when upgrading. Fortunately, there are also modifications that simplify the day-to-day work of your data engineers. Read below about DAG versioning and serialization.
Airflow 2.0 — what has changed?
A new version of any software is usually met with a mix of excitement and concern by the professionals who use it daily. Will the changes affect the team’s efficiency in a positive or negative way? Will the new version suit your company’s needs? Will it be easy to get used to? There are a lot of questions to be answered, but Airflow 2.0 is already here, so you can try to answer them yourself or join the discussion. You can always contact us for support, but before that, here are some of the most noticeable changes you should know about:
- A redesigned user interface: the new, clear and easy-to-read dashboard is certainly a positive change.
- An efficient scheduler: the scheduler is one of Airflow’s core components, and after the modifications it performs much better than before. It is also possible to run multiple scheduler instances in an active/active model, which improves availability and failover, both of which are crucial for the stability of an Airflow deployment.
- Complete REST API: the new, fully supported API can create some issues when upgrading, but it certainly makes integration with third-party platforms easier.
- Smart sensors: in the new Airflow, long-running sensor tasks run more efficiently thanks to DAG centralization and batch processing.
- DAG serialization: in the new version, the web server no longer parses DAG files itself, as only the scheduler needs access to them.
- DAG versioning: users gain additional support for storing many versions of serialized DAGs.
- Task Groups: instead of SubDAGs, which caused performance issues, you can use Task Groups to organize tasks within a DAG’s graph view. The grouping happens in the Airflow UI, so it does not affect performance, and it brings less complexity with less code.
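To make the REST API item above more concrete, here is a minimal sketch that builds (but does not send) a request listing DAGs through the stable API under `/api/v1`. The base URL, the credentials, and the use of basic authentication are assumptions for illustration; your deployment may use a different host and a different API auth backend.

```python
import base64
import urllib.request

# Assumed base URL for illustration only; adjust to your deployment.
BASE_URL = "http://localhost:8080/api/v1"


def build_list_dags_request(username: str, password: str, limit: int = 10) -> urllib.request.Request:
    """Build a GET request for the DAG list endpoint of the stable REST API."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{BASE_URL}/dags?limit={limit}",
        headers={
            "Authorization": f"Basic {token}",  # assumes basic auth is enabled
            "Accept": "application/json",
        },
        method="GET",
    )


# Placeholder credentials; nothing is sent over the network here.
request = build_list_dags_request("admin", "admin")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually call the web server; the sketch stops short of that so it can be read and run without a live Airflow instance.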
There have certainly been many improvements, and it is normal that users need some time to get used to them. In this article, we’d like to focus on two of these changes, DAG serialization and DAG versioning, and explain how they make Airflow 2.0 setup easier.
DAG serialization before and now
Serialization is an important part of Apache Airflow. The term refers to storing a serialized representation of the DAGs in the database, in a lightweight JSON format. Put simply, the scheduler parses the DAG files and keeps a representation in the database, which the web server can later fetch to populate the user interface. Parsing DAGs on both the web server and the scheduler is inefficient because the work is duplicated, which hurts the overall performance of Airflow.
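As a rough illustration of what a "serialized representation in JSON" means, the sketch below round-trips a tiny DAG description through JSON. The dictionary layout and the helper names are hypothetical; Airflow’s real serialized schema is richer and differs from this.

```python
import json

# Hypothetical, simplified DAG description (not Airflow's actual schema).
dag = {
    "dag_id": "daily_sales_report",
    "schedule_interval": "@daily",
    "tasks": [
        {"task_id": "extract", "downstream": ["transform"]},
        {"task_id": "transform", "downstream": ["load"]},
        {"task_id": "load", "downstream": []},
    ],
}


def serialize_dag(dag_dict: dict) -> str:
    """Turn the DAG description into a compact, deterministic JSON string."""
    return json.dumps(dag_dict, sort_keys=True, separators=(",", ":"))


def deserialize_dag(payload: str) -> dict:
    """Rebuild the DAG description from its JSON form."""
    return json.loads(payload)


payload = serialize_dag(dag)
restored = deserialize_dag(payload)
task_ids = [task["task_id"] for task in restored["tasks"]]
print(task_ids)
```

Because the JSON string carries everything needed to draw the DAG, a consumer such as a UI can work from the stored text alone, without executing the Python file that defined it.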
In the old version of Airflow, both the web server and the scheduler required access to the DAG files in order to read and parse them. In Airflow 2.0, only the scheduler parses the DAG files; it serializes them and writes the result to the metadata database, while the web server reads only from that database. This improves Airflow’s efficiency by reducing the load on the web server, because it no longer needs to parse DAGs from the DAG files. Serialized DAGs are simply retrieved from the database. Just as importantly, because the web server no longer needs access to the DAG files, Airflow setup and DAG deployment are much easier than before.
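The division of labour described above can be sketched in a few lines. This is a toy model under stated assumptions: an in-memory SQLite database stands in for the metadata database, and the table layout and the functions `scheduler_sync` and `webserver_load` are hypothetical, not Airflow’s own code.

```python
import json
import sqlite3

# In-memory SQLite stands in for the metadata database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
)


def scheduler_sync(connection, parsed_dags):
    """Scheduler side: store each parsed DAG as JSON, replacing older rows."""
    for dag in parsed_dags:
        connection.execute(
            "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
            (dag["dag_id"], json.dumps(dag, sort_keys=True)),
        )
    connection.commit()


def webserver_load(connection, dag_id):
    """Web server side: read the stored JSON; no DAG file is opened."""
    row = connection.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


scheduler_sync(conn, [{"dag_id": "example_etl", "tasks": ["extract", "load"]}])
loaded = webserver_load(conn, "example_etl")
print(loaded["tasks"])
```

Only `scheduler_sync` ever sees parsed DAG objects; `webserver_load` depends on nothing but the database connection, which is why the web server in such a setup does not need the DAG folder mounted.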
DAG versioning — what has changed?
As you know, data pipelines are represented in Airflow by DAGs. A company is like a living organism: it changes over time, and after a while its business needs may differ from what they once were. DAGs evolve along with those business requirements. It’s no secret to those who work with Airflow daily that adding tasks to an existing DAG had one peculiar side effect: “no-status” tasks appeared in the history overview of earlier runs. This could cause problems when analysing logs and viewing the code associated with a given DagRun.
It is important for Airflow users to be able to check how a given DAG was run in the past. Thankfully, the new version of Apache Airflow offers solutions to many previous problems. After upgrading to version 2.0, you gain additional support for storing many versions of serialized DAGs, so that the relations between DagRuns and DAGs are presented correctly.
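The idea of tying each run to the DAG version it actually used can be sketched as follows. This is a conceptual model only: the `DagVersionStore` class, its methods, and the hash-based version key are hypothetical and do not describe how Airflow implements versioning.

```python
import hashlib
import json


class DagVersionStore:
    """Hypothetical store that keeps every distinct serialized DAG version."""

    def __init__(self):
        self.versions = {}     # (dag_id, version_hash) -> serialized JSON
        self.run_version = {}  # run_id -> (dag_id, version_hash)

    def save(self, dag: dict) -> str:
        """Store the DAG if this exact content is new; return its hash."""
        payload = json.dumps(dag, sort_keys=True)
        version_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.versions.setdefault((dag["dag_id"], version_hash), payload)
        return version_hash

    def record_run(self, run_id: str, dag: dict) -> None:
        """Remember which DAG version a run was executed with."""
        self.run_version[run_id] = (dag["dag_id"], self.save(dag))

    def dag_for_run(self, run_id: str) -> dict:
        """Return the DAG exactly as it looked when the run happened."""
        return json.loads(self.versions[self.run_version[run_id]])


store = DagVersionStore()
store.record_run("run_1", {"dag_id": "etl", "tasks": ["extract", "load"]})
# A task is added later; the earlier run keeps pointing at the old version.
store.record_run("run_2", {"dag_id": "etl", "tasks": ["extract", "transform", "load"]})
print(store.dag_for_run("run_1")["tasks"])
```

With a mapping like this, the history view for `run_1` can be drawn from the two-task version, so the task added later does not show up there as a "no-status" entry.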
Can you transition smoothly from the previous version to Airflow 2.0?
Apache Airflow 2.0 provides users with interesting modifications and new features, and it would be a shame not to leverage them to improve your company’s efficiency. Remember that Airflow 2.0 does not have a monolithic architecture. Its functionality is split between the core and 61 provider packages, each of which is meant for a specific external service or database. Thanks to this, you can perform a custom Airflow 2.0 installation and tailor the tool to your specific needs.
Perhaps you’re not sure how to properly install the new version of Airflow or how to set it up to get the most out of it. Fortunately, you are not on your own. Our experienced team can help you with the installation and configuration process and prepare training for your in-house team. Let us know if you need our assistance.