
Data Engineering Glossary

Your comprehensive guide to data engineering terminology. Learn about ETL, data warehouses, data quality, and the modern data stack.

47 Terms · 10 Categories (A-Z)
B

Batch Processing

A data processing approach that collects and processes data in large groups at scheduled intervals. More cost-efficient than stream processing for workloads that don't require real-time results.

Example: A nightly batch job processes all of yesterday's orders, computes daily metrics, and refreshes the executive dashboard by 6am.
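A batch job like the nightly run above can be sketched as a single pass over the whole day's data. This is a minimal illustration with made-up order records and a hypothetical `run_nightly_batch` function, not any particular framework's API:

```python
from collections import defaultdict

# Hypothetical stand-in for "all of yesterday's orders".
orders = [
    {"region": "EU", "amount": 40.0},
    {"region": "US", "amount": 100.0},
    {"region": "EU", "amount": 60.0},
]

def run_nightly_batch(orders):
    """Process the whole day's orders in one pass and emit daily metrics."""
    revenue_by_region = defaultdict(float)
    for order in orders:
        revenue_by_region[order["region"]] += order["amount"]
    return {
        "order_count": len(orders),
        "total_revenue": sum(o["amount"] for o in orders),
        "revenue_by_region": dict(revenue_by_region),
    }

daily_metrics = run_nightly_batch(orders)
```

The defining trait is that the job sees a complete, bounded dataset at once, which is what makes batch metrics cheap to compute and easy to rerun.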

Business Intelligence

Technologies, practices, and tools for collecting, integrating, analyzing, and presenting business data. BI helps organizations make data-driven decisions through dashboards, reports, and interactive visualizations.

Example: A sales VP opens a Tableau dashboard each morning showing pipeline velocity, win rates by region, and forecasted quarterly revenue.

C

CDC

(Change Data Capture)

A technique for tracking and capturing changes (inserts, updates, deletes) in a database, enabling real-time or near-real-time data replication without full table scans.

Example: Debezium reads the MySQL binlog and streams row-level changes to Kafka, so the data warehouse stays in sync within seconds.
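The events a CDC tool emits can be illustrated with a toy snapshot diff. Real CDC reads the database's write-ahead log (as Debezium does with the binlog) rather than comparing snapshots, but the resulting insert/update/delete events look similar; `capture_changes` here is purely illustrative:

```python
def capture_changes(old, new):
    """Toy change capture: diff two table snapshots keyed by primary key.

    Real CDC tools tail the database's change log instead of comparing
    snapshots, which avoids full table scans.
    """
    events = []
    for key, row in new.items():
        if key not in old:
            events.append({"op": "insert", "key": key, "row": row})
        elif old[key] != row:
            events.append({"op": "update", "key": key, "row": row})
    for key in old:
        if key not in new:
            events.append({"op": "delete", "key": key})
    return events

before = {1: {"email": "a@x.com"}, 2: {"email": "b@x.com"}}
after = {1: {"email": "a@y.com"}, 3: {"email": "c@x.com"}}
changes = capture_changes(before, after)
```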

Columnar Storage

A data storage format that stores data by columns rather than rows. Highly efficient for analytical queries that typically access a subset of columns across many rows. Used by most modern data warehouses.

Example: Querying 'SELECT AVG(price) FROM sales' only reads the price column, skipping all other columns -- 10x faster than row-based storage.
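The row-vs-column trade-off can be shown with plain Python data structures (a rough analogy for on-disk layout, not a real storage engine):

```python
# Row-oriented layout: one record per entry, all fields together.
rows = [
    {"order_id": 1, "price": 10.0, "customer": "a"},
    {"order_id": 2, "price": 20.0, "customer": "b"},
    {"order_id": 3, "price": 30.0, "customer": "c"},
]

# Column-oriented layout: one list per column.
columns = {
    "order_id": [1, 2, 3],
    "price": [10.0, 20.0, 30.0],
    "customer": ["a", "b", "c"],
}

# AVG(price) on the columnar layout touches one contiguous list; the row
# layout has to walk every record and pluck the field out of each one.
avg_price = sum(columns["price"]) / len(columns["price"])
```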

Tools & guides: Snowflake, BigQuery
D

DAG

(Directed Acyclic Graph)

A graph structure where nodes represent tasks and directed edges represent dependencies, with no circular references. DAGs are the backbone of modern data orchestration, defining the order in which pipeline steps execute.

Example: An Airflow DAG defines: extract_salesforce >> extract_stripe >> [transform_revenue, transform_churn] >> build_dashboard.
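The "acyclic" property is what lets an orchestrator derive a valid execution order. As a sketch (not Airflow's internals), here is Kahn's algorithm applied to a dependency map mirroring the example above:

```python
from collections import deque

# Each task maps to the tasks that must finish before it can start.
deps = {
    "extract_salesforce": [],
    "extract_stripe": ["extract_salesforce"],
    "transform_revenue": ["extract_stripe"],
    "transform_churn": ["extract_stripe"],
    "build_dashboard": ["transform_revenue", "transform_churn"],
}

def execution_order(deps):
    """Kahn's algorithm: repeatedly run tasks whose dependencies are done.

    Raises on a cycle, which is exactly what the 'acyclic' in DAG rules out.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    ready = deque(sorted(t for t, d in remaining.items() if not d))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for other, d in remaining.items():
            if task in d:
                d.remove(task)
                if not d:
                    ready.append(other)
    if len(order) != len(deps):
        raise ValueError("cycle detected -- not a DAG")
    return order

order = execution_order(deps)
```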

Tools & guides: Apache Airflow, Dagster

Data Catalog

A centralized inventory of data assets in an organization. Catalogs help users discover, understand, and trust data by providing metadata, documentation, usage statistics, and ownership information.

Example: A new analyst searches the data catalog for 'customer lifetime value' and finds the canonical table, its owner, freshness SLA, and column descriptions.

Data Contract

A formal agreement between data producers and consumers that defines the structure, semantics, and quality expectations of data. Helps prevent breaking changes and ensures data reliability across teams.

Example: The contract specifies: 'user_id is a non-null UUID, updated within 1 hour of the event, and never deleted' -- any violation triggers an alert.

Tools & guides: Data Quality Tools

Data Democratization

The practice of making data accessible to all employees in an organization, regardless of technical skill. Aims to remove bottlenecks where only data teams can answer business questions.

Example: Instead of filing a Jira ticket and waiting 2 weeks, a product manager uses a self-service BI tool to answer their own retention question in 10 minutes.

Data Fabric

An architecture that provides a unified, intelligent layer for managing data across diverse environments (on-premise, cloud, multi-cloud). Uses metadata, AI, and automation to simplify data access and governance at scale.

Example: A data fabric automatically catalogs data across 3 cloud providers and on-premise systems, recommending the best source for each analytics query.

Tools & guides: Data Quality Tools

Data Freshness

A measure of how recently data was updated relative to the source system. A key data quality dimension -- stale data leads to decisions based on outdated information.

Example: The SLA requires the orders table to be no more than 1 hour behind production. Data observability tools alert if freshness exceeds this threshold.

Data Governance

The overall management of data availability, usability, integrity, and security. Includes policies, processes, and standards for managing data as a strategic asset across the organization.

Example: Data governance rules require PII columns to be tagged and masked, so analysts see 'j***@email.com' instead of real addresses.

Data Integration

The process of combining data from different sources into a unified view. Encompasses ETL, ELT, data replication, and API-based data exchange to create a single source of truth.

Example: A data integration layer pulls customer data from Salesforce, Zendesk, and Intercom into one unified customer_360 table.

Data Lake

A storage repository that holds vast amounts of raw data in its native format until needed. Unlike data warehouses, data lakes can store unstructured and semi-structured data alongside structured data.

Example: An S3-based data lake stores raw JSON API responses, CSV exports, Parquet files, and PDF invoices, all queryable via Athena.

Data Lakehouse

A modern data architecture that combines the best features of data lakes and data warehouses. Provides the flexibility of a data lake with the performance and ACID transactions of a data warehouse.

Example: Databricks Delta Lake stores raw and curated data in the same platform, with ACID transactions, time travel, and fast SQL queries.

Tools & guides: Databricks, Snowflake

Data Lineage

The tracking of data's origins, movements, and transformations throughout its lifecycle. Lineage helps understand where data comes from, how it's transformed, and what downstream assets depend on it.

Example: When a revenue number looks wrong, lineage traces it back through 4 dbt models to reveal a join condition was changed last Tuesday.

Data Mart

A subject-specific subset of a data warehouse, designed for a particular business unit or use case. Data marts simplify access by presenting only the data relevant to a specific team.

Example: The marketing data mart contains only campaign, attribution, and conversion data -- simpler and faster for the marketing team than querying the full warehouse.

Tools & guides: Data Warehouse Tools

Data Mesh

A decentralized data architecture that treats data as a product, with domain teams owning and serving their data. Emphasizes four principles: domain ownership, data as a product, self-serve infrastructure, and federated governance.

Example: The payments team owns and publishes a 'payments' data product with SLAs, while the marketing team consumes it without depending on a central data team.

Tools & guides: Data Quality Tools

Data Observability

The ability to understand the health and state of data in your system. Includes monitoring for data freshness, volume, schema changes, and distribution anomalies to detect issues before they impact downstream users.

Example: Monte Carlo alerts the team at 6am that the orders table hasn't been updated since midnight -- before anyone opens a broken dashboard.

Data Orchestration

The automated coordination and management of data pipelines, ensuring tasks run in the correct order, handling dependencies, retries, and monitoring. Tools like Airflow and Dagster are popular orchestrators.

Example: Airflow orchestrates a nightly pipeline: first extract from Salesforce, then from Stripe, then run dbt transformations, then refresh Looker dashboards.

Data Pipeline

A series of data processing steps that move data from one or more sources to a destination. Pipelines automate the flow of data and can include extraction, transformation, validation, and loading steps.

Example: A pipeline runs every hour: it pulls new events from Kafka, validates schemas, deduplicates records, and writes them to a data warehouse.

Data Quality

The measure of data's fitness for its intended use. High-quality data is accurate, complete, consistent, timely, and valid. Data quality tools help monitor and improve these dimensions automatically.

Example: A data quality check flags that 15% of email addresses in yesterday's import are missing '@' -- the pipeline pauses until the source is fixed.
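A check like the email one above amounts to computing a failure ratio and comparing it to a threshold. This is a minimal sketch with a hypothetical `check_email_completeness` helper, not a specific tool's API:

```python
def check_email_completeness(records, max_bad_ratio=0.05):
    """Fail the batch if too many email values are missing an '@'."""
    bad = [r for r in records if "@" not in r.get("email", "")]
    ratio = len(bad) / len(records)
    return {"bad_ratio": ratio, "passed": ratio <= max_bad_ratio}

batch = [
    {"email": "a@x.com"},
    {"email": "b@x.com"},
    {"email": "oops"},      # malformed: no '@'
    {"email": "c@x.com"},
]
result = check_email_completeness(batch)
# A failing check would pause the pipeline rather than load bad data.
```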

Data Replication

The process of copying data from one database or system to another, keeping them in sync. Can be done in real-time (via CDC) or in scheduled batches, depending on latency requirements.

Example: A read replica is maintained via continuous replication so analytics queries don't impact the production database's performance.

Tools & guides: Fivetran, Airbyte

Data Silo

An isolated repository of data controlled by one department that is not easily accessible to other parts of the organization. Silos lead to duplicated effort, inconsistent metrics, and incomplete analysis.

Example: Marketing tracks revenue in HubSpot, Finance in NetSuite, and Product in Amplitude -- three different 'revenue' numbers that never match.

Data Transformation

The process of converting data from one format, structure, or value to another. Includes cleaning, aggregating, joining, and enriching data to make it suitable for analysis.

Example: dbt models join raw Stripe payments with Salesforce accounts, calculate MRR, and create a clean 'monthly_revenue' table.

Tools & guides: dbt, dbt vs Dataform

Data Vault

A data modeling methodology designed for enterprise data warehouses that need to handle changing business requirements. Uses hubs (business keys), links (relationships), and satellites (descriptive data) for maximum flexibility and auditability.

Example: The customer hub stores only customer_id. Satellites store customer attributes with load dates, so you can see any customer's state at any point in history.

Data Warehouse

A centralized repository optimized for analytical queries and reporting. Data warehouses store structured, historical data from multiple sources and are designed for fast query performance on large datasets.

Example: Snowflake stores 3 years of transaction data from 12 source systems, enabling analysts to run complex joins and aggregations in seconds.

dbt

(data build tool)

A popular open-source tool that enables data analysts and engineers to transform data in their warehouse using SQL. dbt handles dependency management, testing, and documentation, bringing software engineering practices to analytics.

Example: A data team writes SQL models in dbt that transform raw Shopify data into a clean 'orders' table, with automated tests ensuring no null order IDs.

Dimensional Modeling

A data modeling technique optimized for data warehousing and business intelligence. Organizes data into facts (measurements) and dimensions (context), making it intuitive for business users to query and understand.

Example: A fact_orders table stores order amounts, linked to dim_customer, dim_product, and dim_date dimensions for flexible slicing.

E

ELT

(Extract, Load, Transform)

A modern data integration approach where raw data is first extracted and loaded into a target system (like a data warehouse), then transformed using the processing power of the target system. Popular with cloud data warehouses that offer cheap, scalable compute.

Example: Airbyte extracts raw JSON from a REST API, loads it into BigQuery, then dbt transforms it into analytics-ready tables using SQL.

Embedded Analytics

The integration of analytical capabilities directly into business applications, portals, or products. Allows users to access insights without leaving their workflow.

Example: A SaaS product embeds Metabase dashboards directly in its admin panel so customers can see their own usage analytics.

Tools & guides: Metabase, Superset

ETL

(Extract, Transform, Load)

A data integration process that extracts data from source systems, transforms it to fit operational needs, and loads it into a target database or data warehouse. Traditional ETL transforms data before loading.

Example: A retail company extracts sales data from its POS system, transforms currency values and date formats, then loads clean records into Snowflake for reporting.

I

Idempotency

A property where running an operation multiple times produces the same result as running it once. Critical in data pipelines to ensure retries and reruns don't create duplicate or corrupted data.

Example: An idempotent pipeline uses MERGE/UPSERT instead of INSERT, so rerunning yesterday's job doesn't double-count revenue.
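The MERGE/UPSERT pattern can be sketched with a keyed table: writes overwrite by primary key, so replaying a batch changes nothing. `load_idempotent` is a hypothetical name for illustration:

```python
def load_idempotent(table, rows, key="order_id"):
    """MERGE/UPSERT-style load: keyed writes overwrite rather than append,
    so replaying the same batch leaves the table unchanged."""
    for row in rows:
        table[row[key]] = row
    return table

table = {}
batch = [{"order_id": 1, "amount": 50}, {"order_id": 2, "amount": 75}]
load_idempotent(table, batch)
load_idempotent(table, batch)  # rerun: no duplicates, no double-counting
total = sum(r["amount"] for r in table.values())
```

An append-only `INSERT` run twice would have doubled the total; the keyed write makes the rerun a no-op.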

Tools & guides: Data Pipeline Tools
M

Materialized View

A database object that stores the pre-computed results of a query. Unlike regular views, materialized views persist results to disk, trading storage for dramatically faster read performance on complex aggregations.

Example: A materialized view pre-computes daily revenue by region, so the dashboard loads in 200ms instead of running a 30-second aggregation.
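The storage-for-speed trade can be simulated with SQLite, which lacks true materialized views: persisting the aggregation into a plain table plays the same role here (a sketch, not production practice):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", "2024-01-01", 100.0), ("EU", "2024-01-01", 50.0),
     ("US", "2024-01-01", 200.0)],
)

# SQLite has no materialized views, so persist the aggregation into a
# plain table -- the same trade: extra storage, instant reads.
conn.execute(
    "CREATE TABLE daily_revenue AS "
    "SELECT region, day, SUM(amount) AS revenue "
    "FROM sales GROUP BY region, day"
)

# The dashboard reads pre-computed rows instead of re-aggregating.
rows = conn.execute(
    "SELECT region, revenue FROM daily_revenue ORDER BY region"
).fetchall()
```

Real materialized views add what this sketch omits: a refresh policy that keeps the stored results in sync with the base table.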

Metrics Layer

A centralized system for defining and managing business metrics. Ensures consistent metric definitions across all tools and teams, preventing conflicting numbers in different reports.

Example: The metrics layer defines 'churn rate' once, and both the executive dashboard and the product team's Slack bot use the same formula.

Tools & guides: dbt, Looker

Modern Data Stack

A collection of cloud-native tools that work together to collect, store, transform, and analyze data. Characterized by modularity, scalability, and ease of use. Typically includes a cloud warehouse, ELT tools, transformation layer, and BI platform.

Example: A typical stack: Fivetran (extract/load) + Snowflake (warehouse) + dbt (transform) + Looker (BI) + Monte Carlo (observability).

Tools & guides: Snowflake, dbt, Fivetran
O

OLAP

(Online Analytical Processing)

A computing approach optimized for complex analytical queries on large datasets. OLAP systems are designed for read-heavy workloads like reporting, dashboards, and ad-hoc analysis, as opposed to OLTP systems optimized for transactional writes.

Example: A BI analyst runs a query that aggregates revenue by region, product, and quarter across 2 billion rows -- OLAP makes this fast.

R

Real-time Analytics

The ability to analyze and act on data as soon as it is generated, with latencies measured in seconds rather than hours. Requires stream processing infrastructure and specialized databases.

Example: An e-commerce site shows a live dashboard of orders per minute, trending products, and current inventory levels updating every 5 seconds.

Reverse ETL

The process of moving data from a data warehouse back to operational systems like CRMs, marketing platforms, or customer support tools. Enables teams to activate their warehouse data in business applications.

Example: A marketing team syncs customer segments computed in Snowflake back to HubSpot so sales reps see each lead's product usage score.

Tools & guides: Census, Hightouch
S

Schema Drift

Unexpected changes to the structure of data, such as new columns, renamed fields, or changed data types. Schema drift can break downstream pipelines and reports if not detected and handled.

Example: A third-party API renames 'user_id' to 'userId' -- schema drift detection catches this before it breaks 12 downstream models.
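Drift detection reduces to comparing an expected schema against what actually arrived. A minimal sketch, with `detect_drift` as a hypothetical helper:

```python
def detect_drift(expected, actual):
    """Compare an expected column->type schema against what arrived."""
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "retyped": sorted(
            c for c in set(expected) & set(actual) if expected[c] != actual[c]
        ),
    }

expected = {"user_id": "string", "amount": "float"}
actual = {"userId": "string", "amount": "int"}  # renamed + retyped
drift = detect_drift(expected, actual)
```

A rename surfaces as one column removed and another added; catching this before the load runs is what saves the downstream models.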

Schema on Read

An approach where data is stored in its raw format and structure is applied only when the data is read or queried. Offers flexibility but shifts data quality responsibility to consumers. Contrasts with Schema on Write used by traditional databases.

Example: Raw JSON is dumped into a data lake. When analysts query it, they define the schema in their SQL -- flexible but risky if the JSON structure changes.

Tools & guides: Databricks

Self-Service Analytics

An approach that enables business users to access and analyze data without requiring technical expertise or IT assistance. Modern BI tools focus on making data accessible to non-technical users through intuitive interfaces.

Example: A product manager drags and drops fields in Metabase to build a funnel analysis, without writing any SQL or asking the data team.

Tools & guides: Metabase, Looker

Semantic Layer

A business abstraction layer that sits between raw data and end users. It provides consistent definitions for metrics, dimensions, and business logic, ensuring everyone uses the same calculations regardless of which BI tool they use.

Example: The semantic layer defines 'revenue' as 'SUM(amount) WHERE status != refunded', so Looker, Tableau, and the API all report the same number.

Slowly Changing Dimensions

(SCD)

A methodology for tracking historical changes in dimension tables. SCD Type 1 overwrites old values, Type 2 adds new rows with validity dates to preserve history, and Type 3 adds columns for previous values.

Example: When a customer moves from New York to London, SCD Type 2 keeps both rows with date ranges so historical reports still show the correct location at the time of each order.
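The Type 2 mechanics -- expire the current row, append a new one, query by date range -- can be sketched on the New York/London example above (dicts stand in for dimension rows; `scd2_update` and `city_on` are illustrative names):

```python
from datetime import date

def scd2_update(history, key, new_attrs, today):
    """Close the current row and append a new one, preserving history."""
    for row in history:
        if row["customer_id"] == key and row["valid_to"] is None:
            row["valid_to"] = today  # expire the old version
    history.append(
        {"customer_id": key, **new_attrs, "valid_from": today, "valid_to": None}
    )
    return history

history = [
    {"customer_id": 42, "city": "New York",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]
scd2_update(history, 42, {"city": "London"}, date(2024, 6, 1))

def city_on(history, key, day):
    """Point-in-time lookup: which city was current on a given date?"""
    for row in history:
        if (row["customer_id"] == key and row["valid_from"] <= day
                and (row["valid_to"] is None or day < row["valid_to"])):
            return row["city"]
```

A historical report joins each order to the dimension row whose validity range contains the order date, so old orders still show New York.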

Snowflake Schema

A dimensional modeling pattern similar to star schema, but with normalized dimension tables. Dimensions are split into sub-dimensions, reducing redundancy at the cost of more complex queries.

Example: dim_product links to dim_subcategory, which links to dim_category -- normalized but requires extra joins compared to a star schema.

Tools & guides: Data Warehouse Tools

Star Schema

A dimensional modeling pattern where a central fact table connects to multiple dimension tables. Called 'star' because the diagram resembles a star with the fact table at the center. Optimized for query performance.

Example: The fact_sales table joins to dim_product, dim_store, dim_date, and dim_customer -- simple, fast queries for BI dashboards.

Tools & guides: Data Warehouse Tools

Stream Processing

A data processing paradigm that handles data in real-time as it arrives, rather than in scheduled batches. Enables low-latency analytics, real-time dashboards, and event-driven architectures.

Example: Kafka Streams processes clickstream events in real-time to detect fraud within 500ms of a transaction occurring.
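The core contrast with batch is per-event processing over a bounded window of recent state rather than a complete dataset. A minimal sketch (a rolling average, standing in for any windowed computation; `stream_counter` is an illustrative name):

```python
from collections import deque

def stream_counter(window_size=3):
    """Process events one at a time, keeping a bounded window of recent
    events instead of waiting for a complete batch."""
    window = deque(maxlen=window_size)

    def process(event):
        window.append(event)
        # Rolling average, updated immediately on every arriving event.
        return sum(window) / len(window)

    return process

process = stream_counter()
results = [process(x) for x in [10, 20, 30, 40]]
```

Each call returns an up-to-date answer the moment an event arrives -- the low-latency property that batch jobs, by design, give up.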

Tools & guides: Data Pipeline Tools
V

Vector Database

A database optimized for storing and querying high-dimensional vector embeddings. Essential for AI/ML applications including semantic search, recommendation systems, and retrieval-augmented generation (RAG).

Example: Product descriptions are converted to embeddings and stored in Pinecone. When a user searches 'cozy winter jacket', the vector DB finds semantically similar products.
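The core query is nearest-neighbor search by vector similarity. This toy uses 3-dimensional vectors and exact cosine search; a real vector database uses model-generated embeddings with hundreds of dimensions and an approximate index for speed. All names and vectors here are made up:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for a few product descriptions.
index = {
    "wool winter coat": [0.9, 0.1, 0.0],
    "running shorts":   [0.0, 0.2, 0.9],
    "down jacket":      [0.8, 0.3, 0.1],
}

def search(query_vec, index, k=2):
    """Exact k-nearest-neighbor search by cosine similarity."""
    ranked = sorted(index, key=lambda n: cosine(query_vec, index[n]),
                    reverse=True)
    return ranked[:k]

# A query embedding near the "warm outerwear" region of the space.
hits = search([0.85, 0.2, 0.05], index)
```

Semantically similar items rank highest even with no shared keywords -- which is why 'cozy winter jacket' can match a wool coat.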


Ready to Build Your Data Stack?

Now that you understand the terminology, explore our tool reviews and comparisons to find the right solutions for your team.