veda-data-pipelines

Important

The US GHG Center has started using the veda-data-airflow repository directly for its data processing and STAC metadata creation. This fork of veda-data-airflow is therefore no longer maintained, and the repository has been archived.

This repo houses function code and deployment code for producing cloud-optimized data products and STAC metadata for interfaces such as https://github.com/NASA-IMPACT/delta-ui.
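For context, the following is a minimal sketch of the kind of STAC item metadata such a pipeline produces, written with the pystac library (this repo's ingest code may construct items differently; the item id, geometry, and asset values below are hypothetical):

```python
# Minimal STAC item sketch, assuming pystac is installed (pip install pystac).
# All ids, geometry, and asset values are hypothetical.
from datetime import datetime, timezone

import pystac

item = pystac.Item(
    id="example-co2-granule",  # hypothetical item id
    geometry={
        "type": "Polygon",
        "coordinates": [[[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]],
    },
    bbox=[-180, -90, 180, 90],
    datetime=datetime(2023, 1, 1, tzinfo=timezone.utc),
    properties={},
)

# Point the item at a (hypothetical) cloud-optimized GeoTIFF asset
item.add_asset(
    "data",
    pystac.Asset(
        href="s3://example-bucket/example-co2-granule.tif",
        media_type=pystac.MediaType.COG,
    ),
)

print(item.to_dict())
```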

Project layout

  • dags: Contains the Directed Acyclic Graphs (DAGs) which constitute the Airflow state machines. This includes the Python code for running each task as well as the Python definitions of the structure of these DAGs (see the sketch after this list)
  • pipeline_tasks: Contains utility functions used in the Python DAGs
  • data: Contains JSON files which define ingests of collections and items
  • docker_tasks: Contains definitions of tasks which we want to run in Docker containers, either because these tasks have special, unique dependencies or for the sake of performance (e.g. using multiprocessing)
  • infrastructure: Contains the Terraform modules necessary to deploy all resources to AWS
  • custom policies: Contains custom policies for the MWAA environment execution role
  • scripts: Contains bash and Python scripts useful for deploying and for running ingests
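As a rough illustration, a DAG definition in dags/ might look like the following minimal sketch (the dag_id, task ids, and callables are hypothetical; the real DAGs implement the actual ingest and STAC-metadata steps):

```python
# Minimal Airflow DAG sketch; all names here are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def discover_items():
    """Hypothetical task: discover new files/items to ingest."""


def build_stac_metadata():
    """Hypothetical task: build STAC metadata for the discovered items."""


with DAG(
    dag_id="example_ingest",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,           # triggered on demand with an ingest payload
    catchup=False,
) as dag:
    discover = PythonOperator(task_id="discover", python_callable=discover_items)
    build = PythonOperator(task_id="build_stac", python_callable=build_stac_metadata)

    # The >> operator defines the edges of the state machine
    discover >> build
```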

Fetching Submodules

First time setting up the repo: git submodule update --init --recursive

Afterwards: git submodule update --recursive --remote

Requirements

Docker

See get-docker

Terraform

See terraform-getting-started

AWS CLI

See getting-started-install

Poetry

See poetry-landing-page

pip install poetry

Deployment

This project uses Terraform modules to deploy Apache Airflow and related AWS resources via Amazon Managed Workflows for Apache Airflow (MWAA).

Make sure that environment variables are set

[.env.example](./.env.example) contains the environment variables which are necessary to deploy. Copy this file and update its contents with actual values. The deploy script will `source` and use this file during deployment when it is provided through the command line:

# Copy .env.example to a new file
$ cp .env.example .env
# Fill in values for the environment variables

# Init terraform modules
$ bash ./scripts/deploy.sh .env <<< init

# Deploy
$ bash ./scripts/deploy.sh .env <<< deploy

Note: Be careful not to check in .env (or whatever you called your env file) when committing work.

Gitflow Model

[Diagram: VEDA pipeline gitflow]

License

This project is licensed under Apache 2.0; see the LICENSE file for more details.
