Lineage

Understand how data flows through your pipelines. Track dependencies, see data transformations, and trace data back to its source.

What is Data Lineage?

Data lineage shows:

  • Where data comes from (source)
  • How it's transformed (steps)
  • Where it goes (destination)
  • All dependencies and connections

Example flow:

Customer Database
         ↓
  Extract (Query 1)
         ↓
  Transform (Clean nulls, format)
         ↓
  Aggregate (Group by region)
         ↓
  Load (Insert to Analytics DB)
         ↓
  Reports (Dashboard displays)

Viewing Lineage

Lineage Graph

  1. Click Lineage in the sidebar
  2. View the visual diagram of the data flow
  3. Read nodes as data sources, processes, and destinations
  4. Read arrows as data movement and transformations

Graph Features

  • Click nodes to see details
  • Hover over arrows to see data volume
  • Color coding:
    • Green: Successful
    • Yellow: In progress
    • Red: Failed
    • Blue: Not yet run

Exploring Lineage

  • Zoom in/out to see details
  • Filter by source, type, or status
  • Search for specific datasets or processes
  • Expand paths to see nested pipelines

Tracing Data

Forward Tracing

Follow data downstream:

  • Start at source
  • See all transformations
  • Find all destinations
  • Understand impact of changes

Use case: "If I change this customer field, what reports are affected?"

  1. Find customer source table
  2. Follow arrows downstream
  3. See all dependent queries and reports
  4. Know what could break
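
The downstream walk above can be sketched as a breadth-first search over a graph of lineage edges. This is an illustration only, not the product's API: the DOWNSTREAM mapping and the trace_downstream helper are hypothetical, and the edge from reports_db.user_summary to the dashboard is an assumption.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes listed as its value.
# Which destination feeds the dashboard is an assumption for this sketch.
DOWNSTREAM = {
    "customers_db.users": ["clean_and_validate"],
    "clean_and_validate": ["analytics_db.users_clean", "reports_db.user_summary"],
    "reports_db.user_summary": ["Sales performance dashboard"],
}


def trace_downstream(start, edges):
    """Return every node reachable from start, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = trace_downstream("customers_db.users", DOWNSTREAM)
print(affected)
```

Everything in the returned list is a candidate for breakage when the source field changes.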

Reverse Tracing

Trace data upstream:

  • Start at any point
  • See all dependencies
  • Find original sources
  • Understand data quality issues

Use case: "Why does this report show wrong numbers?"

  1. Start at report/dashboard
  2. Trace back to source data
  3. Check each transformation
  4. Find where issue originated
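
As a rough sketch of the same idea in code, upstream tracing inverts the edges and walks from the report back to nodes that have no parents. The EDGES list mirrors the example flow at the top of this page; the trace_upstream helper is hypothetical and not part of the product.

```python
# Hypothetical edges as (upstream, downstream) pairs, taken from the example flow.
EDGES = [
    ("Customer Database", "Extract"),
    ("Extract", "Transform"),
    ("Transform", "Aggregate"),
    ("Aggregate", "Load"),
    ("Load", "Reports"),
]


def trace_upstream(start, edges):
    """Return (all upstream nodes, original sources) for the start node."""
    # Invert the edges: map each node to the nodes that feed it.
    parents = {}
    for up, down in edges:
        parents.setdefault(down, []).append(up)

    seen, stack, path = set(), [start], []
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                path.append(parent)
                stack.append(parent)

    # A node with no parents of its own is an original source.
    sources = [node for node in path if node not in parents]
    return path, sources


path, sources = trace_upstream("Reports", EDGES)
print(path)
print(sources)
```

The path gives the order in which to check each transformation, starting with the step closest to the report.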

Data Relationships

Source to Destination Mapping

See which sources feed which destinations:

Source: customers_db.users
  → Transform: Clean and validate
  → Destination: analytics_db.users_clean
  → Destination: reports_db.user_summary
  → Dashboard: Sales performance

Transformation Details

Click a transformation to see:

  • Input data volume
  • Output data volume
  • Records affected (added/removed)
  • Data quality checks applied

Dependencies

Understand what depends on what:

  • Query A depends on Table B
  • Report C depends on Query A
  • Alert D triggers if Report C changes

Monitoring Data Flow

Data Quality

Each step shows:

  • Records in
  • Records out
  • Error rate
  • Processing time

Example:

Extract: 10,000 records
Transform: 10,000 → 9,998 (2 invalid removed)
Load: 9,998 records inserted
Pass rate: 99.98% (error rate: 0.02%)
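
The percentage in the figures above follows directly from the record counts. A minimal sketch of the arithmetic, assuming the error rate is defined as dropped records divided by records in (the step_summary helper is hypothetical):

```python
def step_summary(records_in, records_out):
    """Summarize one step from its input and output record counts."""
    dropped = records_in - records_out
    # Guard against division by zero for an empty input.
    error_rate = dropped / records_in if records_in else 0.0
    return {
        "in": records_in,
        "out": records_out,
        "dropped": dropped,
        "error_rate": error_rate,
    }


summary = step_summary(10_000, 9_998)
print(f"{summary['dropped']} dropped, error rate {summary['error_rate']:.2%}")
```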

Performance

Track how long each step takes:

  • Extract: 2 seconds
  • Transform: 8 seconds (slowest step)
  • Load: 3 seconds
  • Total: 13 seconds
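
Finding the bottleneck from timings like these is a small calculation. The sketch below uses the numbers above; the durations dictionary is illustrative and is not a product export format.

```python
# Step durations in seconds, copied from the example above.
durations = {"Extract": 2, "Transform": 8, "Load": 3}

total = sum(durations.values())
slowest = max(durations, key=durations.get)
share = durations[slowest] / total

print(f"Total: {total} s; slowest: {slowest} ({share:.0%} of runtime)")
```

A step that accounts for well over half the runtime is usually the first one worth optimizing.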

Handling Changes

Impact Analysis

Before making changes:

  1. Click dataset/process
  2. Select "Show Impact"
  3. See all downstream consumers
  4. Understand risk

Example: Before removing a column:

  1. Find column in source
  2. Select "Show Impact"
  3. Discover 5 reports use it
  4. Decide if safe to remove
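
Conceptually, a column-level impact check is a lookup of which consumers reference the column. The sketch below shows the idea under stated assumptions: the report names, the column sets, and the reports_using helper are all hypothetical, not data or functions from the product.

```python
# Hypothetical mapping of reports to the source columns they read.
REPORT_COLUMNS = {
    "Sales performance": {"user_id", "region", "status"},
    "Churn summary": {"user_id", "status", "created_at"},
    "Signup trend": {"created_at"},
}


def reports_using(column, report_columns):
    """Return the names of reports that read the given column, sorted."""
    return sorted(
        name for name, columns in report_columns.items() if column in columns
    )


consumers = reports_using("status", REPORT_COLUMNS)
safe_to_remove = not consumers  # Only safe when nothing downstream reads it.
print(consumers, safe_to_remove)
```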

Change Tracking

Lineage shows changes:

  • Added new source
  • Modified transformation
  • Removed step
  • Updated schedule

Rollback Capability

If a change breaks something:

  1. See previous version
  2. Understand what changed
  3. Roll back if needed
  4. Rerun affected pipelines

Common Tasks

Data Quality Investigation

Dashboard shows wrong numbers
↓
Click on dashboard in lineage
↓
Trace back to source
↓
See that a transformation removed nulls
↓
Check if nulls should be included
↓
Fix transformation if needed
↓
Rerun pipeline

Adding New Data Source

Want to add new source
↓
Check lineage to find similar pipelines
↓
Reuse transformations if possible
↓
See where to insert new source
↓
Update lineage with new path

Compliance and Audit

Need to show data flow for audit
↓
Export lineage diagram
↓
Document transformations
↓
Show data handling at each step
↓
Demonstrate compliance

Visualization Options

View Types

  • Graph - Interactive diagram
  • Table - List all connections
  • Tree - Hierarchical view
  • Timeline - When things ran

Customization

  • Show/hide specific types of nodes
  • Color by status, performance, or category
  • Layout options: horizontal, vertical, hierarchical
  • Export as image or PDF

Advanced Features

Data Contracts

Define expected data format:

Expected schema:
- user_id: integer (required)
- email: string (required)
- created_at: timestamp (required)
- status: enum (active, inactive)

If violated: Fail the pipeline and alert
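
A contract like the one above can be checked per record before loading. This is a minimal sketch, not the product's contract syntax: the CONTRACT structure and the violations helper are hypothetical, and status is treated as optional because the schema above does not mark it as required.

```python
from datetime import datetime

# field name -> (expected Python type, required?)
# Assumption: status is optional, since the schema does not mark it required.
CONTRACT = {
    "user_id": (int, True),
    "email": (str, True),
    "created_at": (datetime, True),
    "status": (str, False),
}
ALLOWED_STATUS = {"active", "inactive"}


def violations(record):
    """Return a list of contract violations for one record (empty if valid)."""
    problems = []
    for field, (expected_type, required) in CONTRACT.items():
        value = record.get(field)
        if value is None:
            if required:
                problems.append(f"{field}: missing required field")
            continue
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    status = record.get("status")
    if isinstance(status, str) and status not in ALLOWED_STATUS:
        problems.append("status: not one of active, inactive")
    return problems


good = {
    "user_id": 42,
    "email": "a@example.com",
    "created_at": datetime(2024, 1, 1),
    "status": "active",
}
bad = {"user_id": "42", "email": "a@example.com", "status": "paused"}

print(violations(good))
print(violations(bad))
```

A pipeline step could raise an error and send an alert whenever the returned list is non-empty, which matches the "fail and alert" behavior described above.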

SLA Monitoring

Track if pipelines meet SLAs:

Extract must complete in < 5 minutes
Transform must complete in < 10 minutes
Load must complete in < 2 minutes

Alert if SLA violated
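
The SLA limits above translate into a simple comparison against measured step durations. The sketch below is illustrative: the SLA mapping, the sla_breaches helper, and the sample run are hypothetical. Because each limit is written as "less than", a step that takes exactly the limit counts as a breach.

```python
from datetime import timedelta

# Limits copied from the SLAs above.
SLA = {
    "Extract": timedelta(minutes=5),
    "Transform": timedelta(minutes=10),
    "Load": timedelta(minutes=2),
}


def sla_breaches(durations):
    """Return the steps whose measured duration is not under its limit."""
    return [
        step
        for step, limit in SLA.items()
        if step in durations and durations[step] >= limit
    ]


# Hypothetical measurements for one pipeline run.
run = {
    "Extract": timedelta(minutes=4),
    "Transform": timedelta(minutes=12),
    "Load": timedelta(seconds=90),
}
breaches = sla_breaches(run)
print(breaches)
```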

Metadata Enrichment

Add business context to lineage:

  • Owner: "Data team"
  • Purpose: "Revenue reporting"
  • Classification: "Confidential"
  • Update frequency: "Daily"

Troubleshooting

Missing Data

Data appears in source but not in destination
↓
Check lineage for breaks
↓
Look for failed transformation
↓
Check data quality rules
↓
Verify load process completed

Unexpected Changes

Data changed without pipeline run
↓
Check lineage for manual edits
↓
Look for unscheduled runs
↓
Review access logs
↓
Investigate source

Performance Issues

Pipeline slower than usual
↓
Check lineage nodes for delays
↓
Find bottleneck
↓
Optimize that step
↓
Rerun to verify improvement

Best Practices

  • Document why each transformation exists
  • Monitor data volumes at each step
  • Alert on unexpected changes
  • Test lineage paths before production
  • Audit regularly for compliance