Airflow Plugin Tutorial: Building the EDW Grants Plugin
Introduction
There are times when the built-in features of Apache Airflow don’t quite cover everything you need. Whether you want to add custom functionality or extend the user interface to suit a specific workflow, Airflow plugins offer a powerful way to do it. Plugins let you tailor Airflow to your requirements without modifying its core.
In this tutorial, we’ll walk you through how we built the airflow-edw-grants plugin, which simplifies the management of Redshift roles and users directly within Airflow. This plugin empowers teams to create users, assign roles, and manage access permissions easily, all from within the familiar Airflow interface. Even if you're not a database administrator, this plugin streamlines the process of handling permissions, making it accessible to more people in your team.
By the end of this guide, you’ll have a clear understanding of how to build your own Airflow plugin, and how you can leverage this flexibility to extend Airflow for your own use cases.
You can find the source code on GitHub and install it from PyPI.
Understanding Airflow Plugins
Airflow’s web interface is built on Flask, a lightweight web framework for building and deploying web applications quickly. Flask’s modular structure suits plugins well: through Flask’s blueprint and view systems, Airflow can integrate custom UI components and APIs, so developers can add new web pages and functionality to the Airflow UI.
How Airflow Imports Plugins
You can add plugins to Airflow in two different ways: through installation using pip or by placing custom-built plugins in the designated plugins folder.
For plugins installed using pip, Airflow relies on Python entry points: the package registers its AirflowPlugin subclass under the airflow.plugins entry point group, and Airflow’s plugin discovery mechanism loads every class registered there. Once loaded, these AirflowPlugin classes and their components are accessible across the application.
For local, custom-built plugins, developers should place their Python modules in the $AIRFLOW_HOME/plugins folder. Airflow scans this directory for plugin files on startup, integrating these custom plugins alongside those installed via pip.
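For the pip route, Airflow's documentation describes discovery through the airflow.plugins entry point group. As a rough way to see what a given environment advertises, the sketch below lists those entry points using only the standard library. The helper name list_airflow_plugin_entry_points is our own and not part of Airflow.

```python
from importlib.metadata import entry_points

PLUGIN_GROUP = "airflow.plugins"  # entry point group used for packaged plugins


def list_airflow_plugin_entry_points():
    """Return sorted (name, target) pairs for installed plugin entry points."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions return an object with a select() method
        selected = eps.select(group=PLUGIN_GROUP)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name
        selected = eps.get(PLUGIN_GROUP, [])
    return sorted((ep.name, ep.value) for ep in selected)


if __name__ == "__main__":
    for name, target in list_airflow_plugin_entry_points():
        print(f"{name} -> {target}")
```

An empty result simply means no installed package registers a plugin in that group.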
Core Plugin Components
Flask View
In Flask, a View is a function or method that handles requests and returns a response. Views are central to the web application’s routing system, and they determine how to respond to different URL endpoints. Each view function is typically associated with a specific route, allowing it to process incoming requests, perform any necessary logic (such as database queries or data processing), and return the appropriate output, often as HTML, JSON, or other content types.
In the context of Airflow plugins, views are essential for defining the behavior of custom pages within the Airflow UI. By creating views, developers can extend the functionality of Airflow, enabling users to interact with the application in new ways. For instance, you can create views for data visualization, management interfaces, or user interactions.
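To make the idea of a view concrete, here is a minimal sketch in plain Flask, outside Airflow. The route, the function name, and the in-memory ROLES list are all hypothetical and chosen only for illustration.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory data standing in for a real lookup
ROLES = ["analyst", "engineer"]


@app.route("/roles")
def list_roles():
    """A view: handles requests for /roles and returns a JSON response."""
    return jsonify(roles=ROLES)
```

A request to /roles is routed to list_roles, which runs its logic and returns the response body. An Airflow plugin view follows the same request-in, response-out shape.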
Blueprint
A Blueprint is a Flask feature that allows Airflow plugins to register additional routes on the web server. By creating a blueprint, plugins can add custom routes and views that integrate with the Airflow UI, enabling them to create custom dashboards or pages.
Thinking of blueprints as reusable components for Flask applications can help clarify their purpose. This modular approach allows developers to organize related functionality, making it easier to manage and maintain complex applications. Each blueprint can encapsulate its routes, handlers, and templates, providing a clean separation of concerns.
For instance, if you have a plugin that manages user interactions, you could create a blueprint specifically for user-related routes and views. This keeps user functionality distinct from other areas of the application, enhancing readability and maintainability.
You can learn more about blueprints in Flask by referring to the official documentation.
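As an illustration of the user-related blueprint described above, the following sketch groups two routes under a /users prefix and registers them on an application. The blueprint name users_sketch and the route handlers are assumptions made for the example, not code from the plugin.

```python
from flask import Blueprint, Flask

# Hypothetical blueprint grouping user-related routes under /users
users_bp = Blueprint("users_sketch", __name__, url_prefix="/users")


@users_bp.route("/")
def list_users():
    return "user list page"


@users_bp.route("/<username>")
def show_user(username):
    return f"details for {username}"


def create_app():
    """Build an application and attach the blueprint's routes to it."""
    app = Flask(__name__)
    app.register_blueprint(users_bp)
    return app
```

In an Airflow plugin you would not create the Flask application yourself. Airflow registers the blueprints that the plugin class lists.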
AppBuilderBaseView
AppBuilderBaseView is the name Airflow plugins conventionally give to Flask AppBuilder’s BaseView class, which is used to create custom views in the Airflow UI. By inheriting from this base view, you can define new pages, tabs, and menu items within the UI. The methods you expose on your subclass become the routes of that view, allowing you to handle various HTTP requests and implement custom logic for managing data, user interactions, and more.
AirflowPlugin
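This is the core registration class for plugins. It declares the plugin’s name and lists the components (e.g., Flask blueprints, AppBuilder views, menu items, and macros) that Airflow integrates when it loads the plugin. Since Airflow 2.0, operators, sensors, and hooks are no longer registered through plugins and are imported as regular Python modules instead. This structure makes it easy to add and modify custom functionality across the Airflow instance.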
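A registration class can be very small. The sketch below shows the general shape under a few assumptions: the names edw_grants_sketch, grants_bp, and EdwGrantsSketchPlugin are hypothetical, and the try/except fallback exists only so the example can be imported where Airflow is not installed. It is not the plugin's actual source.

```python
from flask import Blueprint

try:
    from airflow.plugins_manager import AirflowPlugin
except ImportError:
    # Assumption for this sketch only: fall back to a plain base class so the
    # example can be read and imported without Airflow. A real plugin would
    # import AirflowPlugin unconditionally.
    AirflowPlugin = object

# Hypothetical blueprint that would serve the plugin's templates and routes
grants_bp = Blueprint(
    "edw_grants_sketch",
    __name__,
    template_folder="templates",
    url_prefix="/edw_grants_sketch",
)


class EdwGrantsSketchPlugin(AirflowPlugin):
    # Unique plugin name, as listed by the `airflow plugins` command
    name = "edw_grants_sketch"
    # Components Airflow registers when it loads the plugin
    flask_blueprints = [grants_bp]
    # Each entry is a dict with "name", "category", and "view" keys
    appbuilder_views = []
```

A Flask AppBuilder view would be added to appbuilder_views as a dict, where "category" selects the top-level menu and "name" the entry under it.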
Models
You can query Airflow’s metadata database without writing raw SQL by importing models such as DagRun and DagModel and querying them through SQLAlchemy. The same practice applies to tables outside Airflow’s metadata, such as ones a plugin manages itself: prefer SQLAlchemy models and queries over hand-built SQL strings.
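The original example is not reproduced here, so the following is a stand-in sketch under stated assumptions. It uses a hypothetical GrantAudit model and an in-memory SQLite database (SQLAlchemy 1.4 or later) to show how a model-based query treats user input as a bound parameter.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class GrantAudit(Base):
    """Hypothetical plugin-owned table, standing in for a real model."""

    __tablename__ = "grant_audit"
    id = Column(Integer, primary_key=True)
    username = Column(String, nullable=False)
    role = Column(String, nullable=False)


def roles_for(session, username):
    """Fetch role names for one user; the value is sent as a bound parameter."""
    rows = session.query(GrantAudit).filter(GrantAudit.username == username).all()
    return sorted(row.role for row in rows)


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all(
        [
            GrantAudit(username="alice", role="analyst"),
            GrantAudit(username="bob", role="engineer"),
        ]
    )
    session.commit()
    alice_roles = roles_for(session, "alice")
    # A classic injection string matches nothing: it is compared as a literal
    injected = roles_for(session, "alice' OR '1'='1")
```

With Airflow installed, the same query shape applies to the built-in models, for example filtering DagRun (imported from airflow.models) on DagRun.dag_id.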
Because SQLAlchemy sends filter values as bound parameters rather than splicing them into the SQL text, this approach helps prevent SQL injection and keeps database access consistent across the plugin.
Templates and Jinja in Airflow Plugins
Jinja is a templating engine for Python that generates dynamic HTML pages from templates containing Python-like expressions. It is commonly used in Flask applications, including Airflow, to render views with dynamic content. When defining custom views in an Airflow plugin, you can use Jinja templates to create the HTML displayed in the Airflow UI. Templates are rendered with Flask’s render_template function (or the render_template method of a Flask AppBuilder view), which accepts context variables as keyword arguments and passes them to the template.
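To show how context variables reach a template, the sketch below renders a small user list. To stay self-contained it calls the jinja2 package directly rather than Flask's render_template. The template text, the users.html name, and the render_users helper are hypothetical.

```python
from jinja2 import DictLoader, Environment, select_autoescape

# Hypothetical template; a real plugin would keep it in its templates folder
TEMPLATES = {
    "users.html": (
        "<ul>"
        "{% for user in users %}"
        "<li>{{ user.name }}: {{ user.roles | join(', ') }}</li>"
        "{% endfor %}"
        "</ul>"
    )
}

env = Environment(
    loader=DictLoader(TEMPLATES),
    autoescape=select_autoescape(["html"]),  # escape values in .html templates
)


def render_users(users):
    """Render the user list, passing `users` as a context variable."""
    return env.get_template("users.html").render(users=users)


html = render_users(
    [
        {"name": "alice", "roles": ["analyst", "admin"]},
        {"name": "<bob>", "roles": []},
    ]
)
```

Inside a Flask view the equivalent call would be render_template("users.html", users=users). With autoescaping enabled, values such as "<bob>" are escaped before they reach the page.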
HTTP Methods in EDW Grants Plugin
Flask supports several HTTP methods that allow your views to handle different types of requests. In Airflow plugins, you can define routes that respond to these methods:
GET: Used to retrieve data from the server, often to display data on a web page. For example, this method is used to serve the main page with users and roles.
POST: Utilized to submit data to the server, such as form submissions. This method is used to create new users and roles.
PUT: Employed to update existing resources on the server. This idempotent method allows for changing existing users and roles.
DELETE: Used to remove roles and users from the system.
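The four methods above can be sketched as routes on a single blueprint. This is an illustration only: the in-memory ROLES dictionary stands in for Redshift, and the blueprint name, URLs, and payload fields are assumptions rather than the plugin's real API.

```python
from flask import Blueprint, Flask, jsonify, request

# Hypothetical in-memory store standing in for Redshift roles
ROLES = {"analyst": {"description": "read-only"}}

roles_bp = Blueprint("roles_sketch", __name__, url_prefix="/roles")


@roles_bp.route("/", methods=["GET"])
def list_roles():
    return jsonify(ROLES)


@roles_bp.route("/", methods=["POST"])
def create_role():
    payload = request.get_json()
    ROLES[payload["name"]] = {"description": payload.get("description", "")}
    return jsonify(name=payload["name"]), 201


@roles_bp.route("/<name>", methods=["PUT"])
def update_role(name):
    if name not in ROLES:
        return jsonify(error="unknown role"), 404
    # Repeating the same PUT leaves the role in the same state (idempotent)
    ROLES[name]["description"] = request.get_json().get("description", "")
    return jsonify(ROLES[name])


@roles_bp.route("/<name>", methods=["DELETE"])
def delete_role(name):
    ROLES.pop(name, None)
    return "", 204


app = Flask(__name__)
app.register_blueprint(roles_bp)
```

Sending the same PUT twice produces the same stored state, which is what idempotent means here. Repeating a POST to a real backend would typically try to create the resource again.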
Note: We wrote two decorators, @admin_only and @failure_tolerant_front_end, and applied them to every route handler. The @admin_only decorator ensures that only users with the admin role can access the plugin, adding a layer of security for sensitive operations. The @failure_tolerant_front_end decorator displays an error message and redirects users to the main plugin page instead of showing a broken Airflow page.
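The plugin's own decorators are not shown here, so the following is only one plausible way to write decorators with this behavior in Flask. It assumes role names are available on flask.g, and the X-Roles header hook is a demonstration shim. In Airflow the current user would come from the web server's login system.

```python
from functools import wraps

from flask import Blueprint, Flask, abort, flash, g, redirect, request, url_for
from werkzeug.exceptions import HTTPException

bp = Blueprint("grants_sketch", __name__, url_prefix="/grants")


def admin_only(view):
    """Reject the request with 403 unless the current user has the Admin role."""
    @wraps(view)
    def wrapper(*args, **kwargs):
        # Hypothetical: assumes an earlier hook stored role names on flask.g
        if "Admin" not in getattr(g, "role_names", ()):
            abort(403)
        return view(*args, **kwargs)
    return wrapper


def failure_tolerant_front_end(view):
    """Turn unexpected errors into a flashed message plus a redirect."""
    @wraps(view)
    def wrapper(*args, **kwargs):
        try:
            return view(*args, **kwargs)
        except HTTPException:
            raise  # let deliberate responses such as 403 pass through
        except Exception as exc:
            flash(f"Operation failed: {exc}", "error")
            return redirect(url_for("grants_sketch.index"))
    return wrapper


@bp.route("/")
def index():
    return "main plugin page"


@bp.route("/roles/<name>", methods=["DELETE"])
@failure_tolerant_front_end
@admin_only
def delete_role(name):
    # Simulate a database failure to exercise the error path
    raise RuntimeError(f"cannot drop role {name}")


app = Flask(__name__)
app.secret_key = "dev-only"  # flash() stores messages in the session


@app.before_request
def load_roles():
    # Demonstration shim only: read roles from a header instead of a login
    g.role_names = request.headers.get("X-Roles", "").split(",")


app.register_blueprint(bp)
```

The route decorator stays outermost so that Flask registers the fully wrapped function. Re-raising HTTPException keeps the 403 from being swallowed by the error handler.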
Troubleshooting Plugins
If a plugin doesn’t behave as expected, run the airflow plugins command. It prints information about the plugins Airflow has loaded, which helps identify loading issues or misconfigurations.
Plugin Loading and Refreshing
Lazy Loading: Plugins in Airflow are lazily loaded, meaning they are loaded when first needed. Once a plugin is loaded, it is not reloaded unless specifically configured.
Refreshing Plugins: To force plugins to load at the start of each Airflow process, set lazy_load_plugins = False in the [core] section of airflow.cfg. Even with lazy loading disabled, loaded plugins are not reloaded, so any change to a plugin still requires a restart of both the Airflow web server and the scheduler before it takes effect.
Task Execution and Plugin Updates: For tasks that use plugins, updates to the plugin code are not reflected in running tasks until the worker or scheduler is restarted. By default, each task executes in a process forked from the parent process, which avoids the cost of starting a new Python interpreter. You can instead enforce a new Python interpreter for each task by setting execute_tasks_new_python_interpreter = True in the [core] section of airflow.cfg, so that new tasks pick up the latest plugin code, at the price of some performance overhead.
Automatic Plugin Reloading: Setting AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE=True (the reload_on_plugin_change option in the [webserver] section) makes the web server reload automatically whenever it detects changes in the plugins folder. This smooths development, since you no longer need to restart the web server by hand after each change. The trade-off is that reloads can degrade performance if plugins are modified frequently during heavy usage, whereas explicit restarts keep that under your control.
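The settings in this section can be written either in airflow.cfg or as environment variables named AIRFLOW__{SECTION}__{KEY}. The sketch below parses a sample configuration fragment with the standard library and derives the matching variable names. The helper env_var_name and the sample values are ours and only illustrate the naming rule.

```python
import configparser

# Sample airflow.cfg fragment holding the three options discussed above
SAMPLE_CFG = """
[core]
lazy_load_plugins = False
execute_tasks_new_python_interpreter = True

[webserver]
reload_on_plugin_change = True
"""


def env_var_name(section, key):
    """Build the environment variable name that overrides [section] key."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

# Map every option in the fragment to its environment variable form
overrides = {
    env_var_name(section, key): value
    for section in parser.sections()
    for key, value in parser[section].items()
}

if __name__ == "__main__":
    for name, value in overrides.items():
        print(f"{name}={value}")
```

Environment variables are convenient in containerized deployments, where editing airflow.cfg inside the image is awkward.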
EDW Grants Plugin
The plugin is available on GitHub and PyPI. To install it, add airflow-edw-grants to your requirements and restart your web server. Then either add the redshift_connection_grants_name variable with the name of your Redshift connection, or create a Redshift connection named edw_con. Navigate to the EDW -> Permissions section to access the Roles and Users tables, where you can create roles and users and attach roles to users.
Note: Only users with the admin role can access this plugin, ensuring that sensitive permissions and role management features are securely managed.
Conclusion
Creating custom plugins for Apache Airflow can greatly enhance its functionality and allow you to tailor the platform to fit your specific needs. Whether you're managing database roles, building user-friendly interfaces, or integrating new features, understanding the core components and how to troubleshoot your plugins is crucial. By following the guidelines outlined in this tutorial, you'll be well on your way to developing effective and efficient plugins that elevate your data orchestration workflows. With the knowledge gained, you can explore even more advanced features and expand the capabilities of your Airflow instance to meet the evolving demands of your data projects.