It’s challenging to compete with an open source project that has a large community and is “good enough” for many use cases. It takes an ambitious team and a leap from the status quo. We at Next47 are eager to back founders building the next step-change in technology for data practitioners.
We help connect our startups to enterprises around the world, and data pipelines are present in all of them. These pipelines range from straightforward tasks, like combining CRM and product usage data on a dashboard, to complex endeavors, like using LiDAR point cloud data to drive an autonomous vehicle. In a Fivetran survey, the majority of respondents said it took over a day to fix a broken data pipeline. Reliability matters. The right data orchestrator helps teams build workflows that rarely break and fail gracefully when they do.
Over the last two years, tech macro headwinds have frozen budgets, and data teams are managing compute and storage costs more carefully. Some data practitioners are reconsidering the practice of loading raw data into their data warehouse before transforming it. It can be more cost-effective to prune data before ingest, and a data orchestrator can assist with this pattern.
30% of respondents to Fivetran’s survey use an orchestration tool. Orchestrators schedule workflows, implement parallel processing, visualize dependencies, handle errors and retry logic, and provide observability into issues. Data engineers are some of the most common users, representing 54% of respondents in Airflow’s 2022 survey. Over 73% of Airflow users work with the tool daily.
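To make the role of the orchestrator concrete, here is a minimal sketch of a pipeline in Airflow’s TaskFlow API. The DAG name, schedule, and task bodies are placeholders of our own; the point is that scheduling, retries, and dependency ordering are declared once and handled by the framework.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="@daily",                        # the orchestrator owns scheduling
    start_date=datetime(2023, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,                         # retry logic without extra code
        "retry_delay": timedelta(minutes=5),
    },
)
def sales_report():
    @task
    def extract() -> list[dict]:
        # placeholder: pull rows from a CRM or product-usage source
        return [{"account": "acme", "events": 42}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # placeholder: join, clean, or aggregate the raw rows
        return [r for r in rows if r["events"] > 0]

    @task
    def load(rows: list[dict]) -> None:
        # placeholder: write the result to a warehouse or dashboard store
        print(f"loaded {len(rows)} rows")

    # dependencies are inferred from the data flow and rendered as a DAG
    load(transform(extract()))


sales_report()
```

When a task fails, the orchestrator retries it, surfaces the error in its UI, and leaves the rest of the graph intact, which is exactly the reliability gap the Fivetran respondents were describing.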
When we spoke to practitioners about the pros and cons of Airflow, consistent themes emerged.
These weaknesses leave room for disruption; the strengths set the bar a challenger must clear. Next, we’ll break down how founders are positioning their companies in the data orchestration market.
Data Orchestration
Data orchestration stands out as the most mature and crowded category in this sector. These vendors primarily serve data engineers. Airflow is the giant in the category, and its community continues to make incremental improvements. Astronomer is well suited to enterprises that want to keep using Airflow but with more automated deployment and management, plus dedicated customer support. Google, Amazon, and Microsoft offer managed Airflow services as well.
Alternatives to Airflow often pair an open source project with a paid cloud version. One of the most distinctive examples is Dagster, which uses data assets rather than tasks as its base unit. This approach makes it easier to create policies (e.g., ensuring a report never uses data more than a day old) and to track data lineage. Prefect is another Airflow alternative; it uses Python decorators to simplify pipeline configuration and works well with streaming architectures. Over time we expect these vendors to add more features tailored to machine learning engineers and merge with the next category in our landscape.
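To illustrate the asset-centric model mentioned above, here is a rough Dagster sketch with two hypothetical assets; the names and logic are ours, not from any particular production pipeline.

```python
from dagster import Definitions, asset


@asset
def raw_orders() -> list[dict]:
    # placeholder: ingest raw order records from a source system
    return [{"order_id": 1, "amount": 99.0}]


@asset
def daily_revenue_report(raw_orders: list[dict]) -> float:
    # the dependency on raw_orders is declared by the parameter name,
    # which is what gives Dagster its built-in data lineage
    return sum(order["amount"] for order in raw_orders)


# register the assets so the orchestrator can schedule and materialize them
defs = Definitions(assets=[raw_orders, daily_revenue_report])
```

Because the graph is expressed in terms of assets rather than tasks, a policy like “this report should never be built from data more than a day old” can be attached to the downstream asset instead of being scattered through imperative task code.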
Machine Learning Orchestration
Machine learning workflows are proliferating throughout the software ecosystem. Over 40,000 people have ML Engineer as their title on LinkedIn. They build pipelines that prepare data, engineer features, train and select models, and run inference. ML orchestration tools are tailored to the highly iterative process of productionizing an ML system. ZenML, for instance, versions data, code, and models. Union maintains multiple open source projects across ML workflow orchestration and data quality. MLflow offers a format for packaging a model and its dependencies. Emerging workflows around foundation models will drive further demand for ML orchestration tools.
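As a small example of that packaging step, here is roughly how a trained scikit-learn model might be logged with MLflow; the toy data, parameter, and metric are placeholders.

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# toy training data standing in for a real feature pipeline
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.1, 1.9, 3.2]

with mlflow.start_run():
    model = LinearRegression().fit(X, y)

    # track the experiment: a parameter and a metric for this run
    mlflow.log_param("model_type", "linear_regression")
    mlflow.log_metric("train_r2", model.score(X, y))

    # package the model and its dependencies in MLflow's MLmodel format,
    # so a serving layer or downstream pipeline can reload it later
    mlflow.sklearn.log_model(model, "model")
```

An ML orchestrator then wires steps like this together with the data preparation and inference stages, and re-runs them as the data, features, or model code change.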
Microservice Orchestration
All of the aforementioned tools provide an abstraction for defining workflows while typically leaving the computation itself to other systems. Microservice orchestration tools provide the same “separation of concerns” for software development and can help manage long-running processes with ephemeral infrastructure. Teams working with Kubernetes can use a tool like Argo Workflows to decouple application code from the error handling, queues, and conditional logic between services. Restate is a new entrant for building resilient applications and RPC handlers without hand-written failure-handling code. Microservice orchestrators are generally flexible enough to perform data orchestration, but they may not see as much adoption for data workflows as the Python-based tools data engineers are accustomed to.
Conclusion
A startup can succeed as the commercial successor to Airflow, but the bar is set high. The winner of this category will be used by engineers of all types to build data pipelines with the same best practices used in software development. Considering the limited budget data engineers have outside of warehouse and BI tools, data orchestration tools may need to incorporate features from adjacent categories like ETL, data quality monitoring, ML experiment tracking, and data lineage to build a large business. By offering features that help manage costs, like intelligent materializations and partitioning, orchestration tools can improve customer ROI. They will need to continuously adapt to the latest workflow patterns, integrations, and languages. And they will need allies, one source of which is the open source community.
Lastly, if you want to join other members of the broader open source community, we’re hosting our first Open Source Day in London on Sep 21, 2023. Learn from leading open-source experts and connect with fellow devs, enthusiasts, and professionals. Space is limited, so reserve your spot here. See you there!