As an orchestration engine, Apache Airflow let us quickly build pipelines in our data infrastructure. However, as our business grew to 2 billion orders delivered, scalability became an issue. Our solution came from a new Airflow version which let us pair it with Kubernetes, ensuring that our data infrastructure could keep up with our business growth.

Data not only powers workflows in our infrastructure, such as sending an order from a customer to a restaurant, but also supplies models with fresh data and enables our Data Science and Business Intelligence teams to analyze and optimize our services. As we grew to cover more than 4,000 cities, all the data became more complex and voluminous, making orchestration hard to manage.

Managing data in our infrastructure to make it usable for the DoorDash teams who need it requires various pipelines and ETL jobs. Initially, we used Airflow for orchestration to build data pipelines and set up a single node to get started quickly. When scalability became an issue, we looked for another orchestration solution. The open source community came to the rescue with a new Airflow version adding support for Kubernetes pod operators. This solution was perfect for our needs, as we already use Kubernetes clusters and the combination scaled to handle our traffic.

How Airflow helped orchestrate our initial data delivery

To contextualize our legacy system, we will dive into how Apache Airflow was set up to orchestrate all the ETLs that power DoorDash's data platform. Apache Airflow's platform helps programmatically create, schedule, and monitor complex workflows; as shown in Figure 1, below, the DAG can get extremely complicated. It provides an easy-to-read UI and simple ways to manage dependencies in these workflows (a minimal example appears at the end of this section). Airflow also comes with built-in operators for frameworks like Apache Spark, Google Cloud's BigQuery, Apache Hive, Kubernetes, and AWS' EMR, which helps with various integrations.

When we began using Airflow for scheduling our ETL jobs, we set it up to run in a single-node cluster on an AWS EC2 machine using the LocalExecutor. This configuration was easy to set up and provided us with the framework needed to run all our ETL jobs. While this was a great place to start, it did not scale to meet our business' continued growth.

Figure 1: Airflow is able to handle the kind of ordering complexities which would otherwise be impossible to schedule correctly. This diagram demonstrates how complex our DAG can be.

The challenges of scaling ETL pipelines

DoorDash has significant data infrastructure requirements to analyze its business functions. These analytics cover everything from product development to sales, and all depend on getting data from Snowflake, our data warehouse vendor. Our data infrastructure powers thousands of dashboards that fetch data from this system every five to ten minutes during the peak hours of usage. These ETL pipelines, depicted in Figure 2, below, ensure that data for analysis is delivered accurately and in a timely manner.
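To make the workflow definitions above concrete, here is a minimal sketch of how an Airflow DAG is declared programmatically. The pipeline name, task names, and commands are hypothetical placeholders, not DoorDash's actual jobs, and the import paths assume a recent Airflow 2.x release (Airflow 1.10 used slightly different module paths):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Default settings applied to every task in this DAG.
default_args = {
    "owner": "data-infra",          # hypothetical owning team
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

# A DAG is just Python code: a set of tasks plus explicit ordering constraints.
with DAG(
    dag_id="example_etl",           # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",    # run once per hour
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Airflow resolves these dependencies into a graph like the one in
    # Figure 1 and runs each task once its upstream tasks have succeeded.
    extract >> transform >> load
```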
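The single-node setup described above comes down to one executor setting in Airflow's configuration. A sketch of the relevant airflow.cfg lines, assuming an Airflow 1.10-era layout; the connection URI and parallelism value are illustrative only:

```ini
[core]
# Run tasks as parallel subprocesses on the single scheduler machine.
# There are no external workers, so total capacity is capped by the
# CPU and memory of the one EC2 instance.
executor = LocalExecutor

# Metadata database connection (illustrative URI, not a real endpoint).
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

# Upper bound on task instances running at once on the node.
parallelism = 32
```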
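Finally, here is a sketch of the Kubernetes pod operator approach that addressed the scaling problem: each task runs in its own pod on an existing Kubernetes cluster instead of as a subprocess on the single node. The image, namespace, and command are hypothetical, and the exact import path varies across Airflow and cncf.kubernetes provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="example_etl_on_k8s",    # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task gets its own pod, so compute scales with the Kubernetes
    # cluster rather than with a single EC2 machine.
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform-pod",
        namespace="etl",                           # hypothetical namespace
        image="registry.example.com/etl-job:1.0",  # hypothetical image
        cmds=["python", "-m", "jobs.transform"],   # hypothetical entrypoint
        get_logs=True,                  # stream pod logs into the Airflow UI
        is_delete_operator_pod=True,    # tear the pod down after it finishes
    )
```

Because the pod is scheduled by Kubernetes, adding capacity becomes a cluster-sizing question rather than an Airflow one, which is why the combination scaled with our traffic.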