code for article pfeilbr/amazon-managed-workflows-for-apache-airflow-playground
learn Amazon Managed Workflows for Apache Airflow
Install Steps
- create an S3 bucket with versioning enabled
- create a VPC - this is where the RDS Postgres for airflow data lives; the airflow web UI runs in a container here
- create an airflow environment
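The install steps above can be sketched with boto3's `mwaa.create_environment` call. All names and ARNs below are placeholder assumptions, not values from this repo; the helper just assembles the request so the shape of the config is visible.

```python
# Sketch: create an MWAA environment once the versioned S3 bucket and the VPC
# (subnets + security group) exist. Placeholder names/ARNs throughout.

def build_environment_config(name, bucket_arn, role_arn, subnet_ids, sg_ids):
    """Assemble the kwargs for mwaa.create_environment."""
    return {
        "Name": name,
        "SourceBucketArn": bucket_arn,   # the versioning-enabled S3 bucket
        "DagS3Path": "dags",             # DAG .py files live under this prefix
        "ExecutionRoleArn": role_arn,
        "NetworkConfiguration": {        # the VPC created in step 2
            "SubnetIds": subnet_ids,
            "SecurityGroupIds": sg_ids,
        },
        "EnvironmentClass": "mw1.small",
        "MaxWorkers": 5,
        "WebserverAccessMode": "PUBLIC_ONLY",
    }

if __name__ == "__main__":
    import boto3  # imported here so the helper above stays dependency-free
    mwaa = boto3.client("mwaa")
    cfg = build_environment_config(
        "my-airflow-env",
        "arn:aws:s3:::my-airflow-bucket",
        "arn:aws:iam::123456789012:role/my-mwaa-execution-role",
        ["subnet-aaaa", "subnet-bbbb"],
        ["sg-cccc"],
    )
    print(mwaa.create_environment(**cfg))
```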
Notes
- you create named MWAA environments, and each has its own configuration
- airflow UI can be exposed publicly or kept private within the VPC
- log in to the airflow UI with an IAM user
- can create a web login token via the AWS CLI
- upload DAG .py files to the dags path in S3
- run DAGs via the airflow cli, REST endpoint, or boto3
- invoke a DAG via lambda
- install additional deps by providing a requirements.txt
- auto scales up and down via the Apache Celery Executor by adding / removing worker containers as needed
- scale via the number of workers and the environment class
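Triggering a DAG via boto3 (e.g. from a Lambda handler) can be sketched by requesting an MWAA CLI token and POSTing an Airflow CLI command to the environment's /aws_mwaa/cli endpoint. The environment and DAG names are placeholders; for Airflow 1.10.x the command would be `trigger_dag <dag_id>` instead of `dags trigger <dag_id>`.

```python
# Sketch: run an Airflow CLI command against MWAA via the CLI-token endpoint.
import base64
import json
import urllib.request

def build_cli_request(hostname, cli_token, dag_id):
    """Build the POST that runs `dags trigger <dag_id>` (Airflow 2 CLI syntax)."""
    return urllib.request.Request(
        url=f"https://{hostname}/aws_mwaa/cli",
        data=f"dags trigger {dag_id}".encode(),
        headers={
            "Authorization": f"Bearer {cli_token}",
            "Content-Type": "text/plain",
        },
        method="POST",
    )

def trigger_dag(env_name, dag_id):
    import boto3  # lazy import: only needed when actually calling AWS
    token = boto3.client("mwaa").create_cli_token(Name=env_name)
    req = build_cli_request(token["WebServerHostname"], token["CliToken"], dag_id)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # the endpoint returns base64-encoded stdout/stderr from the CLI run
    return base64.b64decode(body["stdout"]).decode()

if __name__ == "__main__":
    print(trigger_dag("my-airflow-env", "example_dag"))
```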
Airflow
Plugins
- Hooks: Hooks are basically Python modules that enable tasks to access external platforms, like AWS, Azure, GCP, and many more.
- Sensors: Sensors are Python modules used to create watcher tasks (in the most basic sense); for example, an S3 sensor is used to create an S3 file-watcher task. A sensor stays in the running state until a specific condition appears.
- Operators: Operators are typically execution engines. Operators are used to create tasks that execute some process, based on the type of operator. For example, PythonOperator can be used to create a task that will run a specific Python method.
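A minimal DAG file tying the pieces together might look like the sketch below: a sensor (S3KeySensor) gating an operator (PythonOperator). Bucket, key, and DAG names are placeholders; the Airflow imports are deferred into build_dag() here only so the plain-Python task logic stays testable without Airflow installed, whereas a real DAG file would import at module top and assign `dag = build_dag()` at import time.

```python
# Sketch of a DAG using a sensor + operator (Airflow 1.10-era import paths,
# matching the 1.10.12 environment shown later in these notes).

def summarize(key):
    """Plain-Python task logic executed by the PythonOperator."""
    return f"processed {key}"

def build_dag():
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.sensors.s3_key_sensor import S3KeySensor

    dag = DAG("s3_watch_example", start_date=datetime(2021, 1, 1),
              schedule_interval=None)
    # sensor: stays in running state until the S3 key appears
    wait = S3KeySensor(task_id="wait_for_file", bucket_name="my-bucket",
                       bucket_key="incoming/data.csv", dag=dag)
    # operator: runs the Python callable once the sensor succeeds
    process = PythonOperator(task_id="process",
                             python_callable=lambda: summarize("incoming/data.csv"),
                             dag=dag)
    wait >> process  # the sensor gates the operator
    return dag

# a real DAG file would end with: dag = build_dag()
```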
XCom
- share small bits of data between tasks
- stored in airflow postgres db
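The push/pull flow can be illustrated with a small runnable sketch. In real Airflow the TaskInstance (`ti`) reads and writes XComs in the Postgres metadata DB; the FakeTI below is a stand-in that only mimics the push/pull API so the flow runs standalone.

```python
# Sketch of XCom semantics: one task pushes a small value, a downstream
# task pulls it. FakeTI is a hypothetical stand-in, not an Airflow class.

class FakeTI:
    """Mimics TaskInstance.xcom_push / xcom_pull with an in-memory dict."""
    _store = {}

    def xcom_push(self, key, value):
        self._store[key] = value

    def xcom_pull(self, key, task_ids=None):
        return self._store[key]

def extract(ti):
    ti.xcom_push(key="row_count", value=42)  # keep XCom payloads small

def report(ti):
    count = ti.xcom_pull(key="row_count", task_ids="extract")
    return f"extracted {count} rows"

ti = FakeTI()
extract(ti)
print(report(ti))  # -> extracted 42 rows
```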
Example Environment Attributes (output of aws mwaa get-environment)
{
  "Environment": {
    "AirflowConfigurationOptions": {},
    "AirflowVersion": "1.10.12",
    "Arn": "arn:aws:airflow:eu-west-1:xxxxxxxxxxxx:environment/airflow-blogpost-dublin",
    "CreatedAt": 1610632127.0,
    "DagS3Path": "dags",
    "EnvironmentClass": "mw1.medium",
    "ExecutionRoleArn": "arn:aws:iam::xxxxxxxxxxxx:role/airflow-demo-mwaa-eks-iamrole",
    "LastUpdate": {
      "CreatedAt": 1611137820.0,
      "Status": "SUCCESS"
    },
    "LoggingConfiguration": {
      "DagProcessingLogs": {
        "CloudWatchLogGroupArn": "arn:aws:logs::xxxxxxxxxxxx:log-group:airflow-ricsue-dublin-DAGProcessing",
        "Enabled": true,
        "LogLevel": "INFO"
      },
      "SchedulerLogs": {
        "CloudWatchLogGroupArn": "arn:aws:logs::xxxxxxxxxxxx:log-group:airflow-ricsue-dublin-Scheduler",
        "Enabled": true,
        "LogLevel": "INFO"
      },
      "TaskLogs": {
        "CloudWatchLogGroupArn": "arn:aws:logs::xxxxxxxxxxxx:log-group:airflow-ricsue-dublin-Task",
        "Enabled": true,
        "LogLevel": "INFO"
      },
      "WebserverLogs": {
        "CloudWatchLogGroupArn": "arn:aws:logs::xxxxxxxxxxxx:log-group:airflow-ricsue-dublin-WebServer",
        "Enabled": true,
        "LogLevel": "INFO"
      },
      "WorkerLogs": {
        "CloudWatchLogGroupArn": "arn:aws:logs::xxxxxxxxxxxx:log-group:airflow-ricsue-dublin-Worker",
        "Enabled": true,
        "LogLevel": "INFO"
      }
    },
    "MaxWorkers": 5,
    "Name": "ricsue-dublin",
    "NetworkConfiguration": {
      "SecurityGroupIds": [
        "sg-0c88ef4755c295zzz"
      ],
      "SubnetIds": [
        "subnet-0493dffd0282f4xxx",
        "subnet-08f416023356ffyyy"
      ]
    },
    "RequirementsS3Path": "requirements/requirements.txt",
    "ServiceRoleArn": "arn:aws:iam::xxxxxxxxxxxx:role/aws-service-role/airflow.amazonaws.com/AWSServiceRoleForAmazonMWAA",
    "SourceBucketArn": "arn:aws:s3:::airflow-mybucket",
    "Status": "AVAILABLE",
    "Tags": {},
    "WebserverAccessMode": "PUBLIC_ONLY",
    "WebserverUrl": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee.c5.eu-west-1.airflow.amazonaws.com",
    "WeeklyMaintenanceWindowStart": "SUN:14:00"
  }
}
Resources
- Amazon Managed Workflows for Apache Airflow
- Interacting with Amazon Managed Workflows for Apache Airflow via the command line
- AWS CLI Command Reference | mwaa
- Airflow — Custom Plugins
- Airflow XCOM : The Ultimate Guide