Marquez is an open-source metadata service for collecting, aggregating, and visualizing data lineage, serving as the reference implementation of the OpenLineage standard. In this Marquez review, we examine how the platform provides real-time lineage collection from running jobs and applications, and how it compares to alternatives like DataHub, OpenMetadata, and commercial lineage tools.
Overview
Marquez was originally developed at WeWork and is now a Linux Foundation AI & Data project. It provides a metadata server with an OpenLineage-compatible API endpoint that collects lineage information in real time from running jobs and applications. As the reference implementation of OpenLineage, Marquez works out of the box with any tool that emits OpenLineage events, including Apache Airflow, Apache Spark, dbt, Dagster, Flink, and Great Expectations.
The platform stores job and dataset metadata, tracks lineage relationships (which jobs read from and write to which datasets), and provides a web UI for visualizing lineage graphs. Marquez is designed to be simple to deploy and operate: a single service with a PostgreSQL backend, compared to the multi-component architectures of DataHub or OpenMetadata.
Key Features and Architecture
OpenLineage-Compatible API
Marquez exposes an API endpoint that accepts OpenLineage events: standardized JSON messages describing job runs, dataset reads/writes, and schema information. Any tool that emits OpenLineage events (Airflow, Spark, dbt, Dagster, Flink) can send lineage data to Marquez without custom integration.
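To make the event format concrete, here is a minimal sketch of building an OpenLineage run event by hand and posting it to Marquez using only the Python standard library. The server address, the producer URI, and the spec version in `schemaURL` are assumptions for illustration; in practice the integrations listed above emit these events for you.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

# Assumption: a local Marquez server listening on port 5000.
MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"
# Hypothetical URI identifying the code that produced the event.
PRODUCER = "https://example.com/my-pipeline"
# Assumption: the spec version shown here; use the one your tooling targets.
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def build_run_event(event_type, namespace, job_name, inputs=(), outputs=(), run_id=None):
    """Return a minimal OpenLineage run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
    }


def send_event(event, url=MARQUEZ_URL):
    """POST one event as JSON and return the HTTP status code."""
    req = request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status
```

With a server running, `send_event(build_run_event("COMPLETE", "example-namespace", "load_orders", inputs=["raw.orders"], outputs=["staging.orders"]))` would record one job with one input and one output dataset. The namespace, job, and dataset names are made up.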
Real-Time Lineage Collection
Lineage data is collected in real time as jobs run, not through periodic crawling or manual registration. When an Airflow DAG executes, each task emits OpenLineage events that Marquez captures immediately, building the lineage graph as pipelines execute.
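The run lifecycle behind this can be sketched as a wrapper that emits a START event before a task runs and a COMPLETE or FAIL event afterwards, all sharing one run ID. This is an illustration of the pattern, not how the Airflow integration is implemented; `run_with_lineage`, the namespace, the producer URI, and the spec version are all hypothetical or assumed.

```python
import uuid
from datetime import datetime, timezone

NAMESPACE = "example-namespace"               # hypothetical namespace
PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer URI
# Assumption: the spec version shown here.
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def make_event(event_type, run_id, job_name):
    """Build a minimal run event; datasets are omitted to keep the sketch short."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": NAMESPACE, "name": job_name},
    }


def run_with_lineage(job_name, task, emit):
    """Run task() and report its lifecycle through emit(event)."""
    run_id = str(uuid.uuid4())  # the same ID ties all events of one run together
    emit(make_event("START", run_id, job_name))
    try:
        result = task()
    except Exception:
        emit(make_event("FAIL", run_id, job_name))
        raise
    emit(make_event("COMPLETE", run_id, job_name))
    return result
```

Here `emit` can be any callable that takes an event dict, for example a function that POSTs it to the Marquez lineage endpoint, or `list.append` when testing locally.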
Lineage Visualization
The web UI provides an interactive lineage graph showing datasets, jobs, and their relationships. Users can trace data flow upstream (where did this data come from?) and downstream (what depends on this dataset?), enabling impact analysis and root cause investigation.
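Conceptually, upstream and downstream tracing is a graph traversal over jobs and datasets. The sketch below shows the idea on a small in-memory graph; the job and dataset names are invented, and this is not Marquez's internal implementation.

```python
from collections import defaultdict, deque

# Hypothetical pipeline: (job, input datasets, output datasets)
RUNS = [
    ("ingest_orders", ["raw.orders"], ["staging.orders"]),
    ("build_revenue", ["staging.orders", "staging.customers"], ["marts.revenue"]),
    ("export_report", ["marts.revenue"], ["reports.weekly"]),
]


def build_graph(runs):
    """Return (upstream, downstream) adjacency maps over job and dataset nodes."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for job, inputs, outputs in runs:
        job_node = f"job:{job}"
        for name in inputs:
            downstream[f"dataset:{name}"].add(job_node)
            upstream[job_node].add(f"dataset:{name}")
        for name in outputs:
            downstream[job_node].add(f"dataset:{name}")
            upstream[f"dataset:{name}"].add(job_node)
    return upstream, downstream


def reachable(start, edges):
    """Breadth-first search: every node reachable from start along edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


upstream, downstream = build_graph(RUNS)
# Impact analysis: what depends on staging.orders?
impacted = reachable("dataset:staging.orders", downstream)
# Root cause investigation: where did marts.revenue come from?
sources = reachable("dataset:marts.revenue", upstream)
```

Following `downstream` edges answers the impact question, and following `upstream` edges answers the provenance question.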
Job and Dataset Metadata
Beyond lineage, Marquez stores metadata about jobs (run history, duration, status, facets) and datasets (schema, location, quality metrics). This provides operational context alongside lineage: not just "what connects to what" but "when did it last run and did it succeed?"
Facets System
OpenLineage facets allow attaching arbitrary metadata to jobs and datasets, such as data quality metrics, schema changes, SQL queries, and Spark execution plans. Marquez stores and indexes these facets, enabling rich queries beyond basic lineage relationships.
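As a rough illustration, facets are named JSON objects nested under a `facets` key on a dataset or job. The sketch below attaches a schema facet to a dataset and a SQL facet to a job. The `_schemaURL` values and the producer URI are assumptions for illustration and should be checked against the OpenLineage spec version you target; the dataset, job, and column names are made up.

```python
PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer URI


def schema_facet(fields):
    """Dataset facet describing columns as a list of (name, type) pairs."""
    return {
        "_producer": PRODUCER,
        # Assumed facet schema URL; verify against the spec.
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
        "fields": [{"name": name, "type": col_type} for name, col_type in fields],
    }


def sql_facet(query):
    """Job facet carrying the SQL text the job executed."""
    return {
        "_producer": PRODUCER,
        # Assumed facet schema URL; verify against the spec.
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SqlJobFacet.json",
        "query": query,
    }


# A dataset entry as it could appear in an event's "outputs" list.
dataset = {
    "namespace": "example-namespace",
    "name": "staging.orders",
    "facets": {"schema": schema_facet([("order_id", "BIGINT"), ("amount", "DECIMAL")])},
}

# A job entry as it could appear in an event's "job" field.
job = {
    "namespace": "example-namespace",
    "name": "load_orders",
    "facets": {"sql": sql_facet("INSERT INTO staging.orders SELECT * FROM raw.orders")},
}
```

Because each facet is keyed by name, a producer can add further facets without changing the rest of the event.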
Simple Architecture
Marquez runs as a single Java service backed by PostgreSQL. This is dramatically simpler than DataHub (Kafka + Elasticsearch + MySQL + GMS) or OpenMetadata (4 components). Deployment is a single Docker container or JAR file.
Ideal Use Cases
Organizations Adopting OpenLineage
Teams standardizing on OpenLineage for cross-tool lineage collection use Marquez as the lineage backend. Since Airflow, Spark, dbt, and Dagster all support OpenLineage natively, Marquez provides lineage visibility across the entire pipeline stack without custom integration.
Lightweight Lineage for Small-to-Medium Data Teams
Teams that need lineage visualization without the complexity of a full data catalog (DataHub, OpenMetadata) deploy Marquez for focused lineage tracking. The single-service architecture means one engineer can deploy and maintain it.
Pipeline Debugging and Impact Analysis
When a data quality issue is discovered, engineers use Marquez's lineage graph to trace upstream to the root cause. Before making schema changes, they check downstream dependencies to assess impact.
Compliance and Audit Trail
Organizations needing to demonstrate data provenance for regulatory compliance (GDPR, SOX) use Marquez's lineage records as an audit trail showing how data flows through the organization.
Pricing and Licensing
Marquez is completely free and open-source under the Apache 2.0 license:
| Option | Cost | Includes |
|---|---|---|
| Open Source (Self-Hosted) | $0 + PostgreSQL | Full lineage server, web UI, OpenLineage API, community support |
Infrastructure costs are minimal: Marquez runs as a single service backed by PostgreSQL. A typical deployment costs $50–$150/month for a small PostgreSQL instance and the Marquez server. For comparison, commercial lineage tools like Atlan ($50,000+/year), Collibra ($150,000+/year), and Monte Carlo ($30,000+/year) include lineage as part of broader platforms. DataHub and OpenMetadata are also free but require more infrastructure.
Pros and Cons
Pros
- OpenLineage reference implementation: guaranteed compatibility with the lineage standard; works with Airflow, Spark, dbt, Dagster, and Flink out of the box
- Simple architecture: single service + PostgreSQL; dramatically easier to deploy and operate than DataHub or OpenMetadata
- Real-time lineage: collects lineage as jobs run, not through periodic crawling; always up-to-date
- Free and open-source (Apache 2.0): no licensing costs, no open-core restrictions
- Focused scope: does lineage well without the complexity of a full data catalog
- Facets system: extensible metadata model for attaching quality metrics, schemas, and custom data to lineage events
Cons
- Lineage only: no data discovery, governance, quality testing, or collaboration features; you'll need additional tools for a complete data catalog
- Smaller community: less active than DataHub or OpenMetadata; fewer contributors and slower feature development
- No managed cloud offering: self-hosted only; no SaaS option for teams that want zero infrastructure management
- Limited UI: the web UI provides basic lineage visualization but lacks the polish and features of commercial tools or DataHub's UI
- Java-based: the server is written in Java, which may not align with Python-centric data teams for customization and contribution
Getting Started
Getting started with Marquez means running it yourself: it is self-hosted only, so there is no account to create and no hosted trial. Deploy the Marquez server with a PostgreSQL database, then point an OpenLineage-enabled tool such as Airflow, Spark, or dbt at the Marquez API endpoint and run a pipeline. Lineage appears in the web UI as events arrive. For teams evaluating Marquez against alternatives, we recommend a 2-week evaluation on a representative pipeline to assess whether the lineage-only feature set and the UI fit your workflow. The project documentation and community resources cover initial setup and configuration.
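Once a server is running, one quick way to confirm that lineage is being recorded is to query the REST API directly. The sketch below builds a lineage query URL and fetches it with the standard library. The base URL assumes a local server on port 5000, and the `nodeId` format (`dataset:<namespace>:<name>` or `job:<namespace>:<name>`) and `depth` parameter reflect our understanding of the Marquez API; check the API reference for your version.

```python
import json
from urllib import request
from urllib.parse import urlencode

# Assumption: a local Marquez server listening on port 5000.
BASE_URL = "http://localhost:5000/api/v1"


def lineage_url(node_type, namespace, name, depth=5, base_url=BASE_URL):
    """Build the lineage query URL for a dataset or job node."""
    if node_type not in ("dataset", "job"):
        raise ValueError("node_type must be 'dataset' or 'job'")
    node_id = f"{node_type}:{namespace}:{name}"
    return f"{base_url}/lineage?{urlencode({'nodeId': node_id, 'depth': depth})}"


def fetch_json(url):
    """GET a URL and decode the JSON response body."""
    with request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `fetch_json(lineage_url("dataset", "example-namespace", "staging.orders"))` would return the lineage graph around that dataset, where the namespace and dataset name are placeholders for your own.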
Alternatives and How It Compares
DataHub
DataHub provides lineage plus data discovery, governance, and observability in a comprehensive open-source platform. DataHub is more feature-rich but requires significantly more infrastructure (Kafka, Elasticsearch, MySQL). Choose Marquez for focused lineage with simple deployment; DataHub for a full data catalog.
OpenMetadata
OpenMetadata offers lineage alongside data discovery, quality testing, and governance. It has a simpler architecture than DataHub (4 components) but is still more complex than Marquez's single service. Choose OpenMetadata if you need lineage plus catalog; Marquez if you only need lineage.
Atlan
Atlan (~$50,000+/year) is a commercial data catalog with lineage, discovery, and collaboration. Atlan provides a polished experience with dedicated support but at significant cost. Marquez is free but lineage-only; Atlan is comprehensive but expensive.
OpenLineage (Standard Only)
OpenLineage is the standard, not a tool: it defines the event format. You need a backend to collect and visualize OpenLineage events. Marquez is the reference backend, but DataHub and Atlan also accept OpenLineage events.
Frequently Asked Questions
What is Marquez?
Marquez is an open-source metadata service for data lineage, designed to help organizations understand and manage their data across various systems.
How much does Marquez cost?
Marquez is free and open-source under the Apache 2.0 license. There is no paid tier or enterprise pricing. The only cost is the infrastructure to run it, typically $50–$150/month for a small PostgreSQL instance and the Marquez server.
Is Marquez better than Apache Atlas?
Both tools track data lineage, but they target different environments. Marquez collects lineage in real time through OpenLineage events and runs as a single service backed by PostgreSQL. Apache Atlas is a broader metadata and governance framework that grew out of the Hadoop ecosystem. The choice between the two ultimately depends on your organization's specific needs and infrastructure.
Is Marquez suitable for small businesses?
Yes. Marquez is free, and its single-service architecture means one engineer can deploy and maintain it, which makes it a reasonable fit for small-to-medium data teams that need lineage without a full data catalog. It is self-hosted only, so teams that want zero infrastructure management should factor that in.
Can I use Marquez with cloud-based storage systems?
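Yes, indirectly. Marquez does not connect to storage systems itself; it records the datasets that jobs report in their OpenLineage events. If a Spark or Airflow job reads from or writes to cloud storage and emits OpenLineage events, those datasets appear in the Marquez lineage graph.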