Lineage
Understand how data flows through your pipelines. Track dependencies, see data transformations, and trace data back to its source.
What is Data Lineage?
Data lineage shows:
- Where data comes from (source)
- How it's transformed (steps)
- Where it goes (destination)
- All dependencies and connections
Example flow:
Customer Database
↓
Extract (Query 1)
↓
Transform (Clean nulls, format)
↓
Aggregate (Group by region)
↓
Load (Insert to Analytics DB)
↓
Reports (Dashboard displays)
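The example flow can also be read as a small directed graph. The sketch below is a hypothetical in-memory model, not the product's API: the `lineage` dictionary and the `linear_path` helper are illustrative names, and each step is stored with the step it feeds.

```python
# Hypothetical model of the example flow: each key is a node and each
# value lists the nodes that receive its output.
lineage = {
    "Customer Database": ["Extract"],
    "Extract": ["Transform"],
    "Transform": ["Aggregate"],
    "Aggregate": ["Load"],
    "Load": ["Reports"],
    "Reports": [],
}


def linear_path(graph, start):
    """Follow a single chain of steps from start to the final node."""
    path = [start]
    while graph[path[-1]]:
        # This simple flow has exactly one outgoing edge per node.
        path.append(graph[path[-1]][0])
    return path


print(" -> ".join(linear_path(lineage, "Customer Database")))
```

Real pipelines usually branch, so a node can list more than one consumer. The dictionary shape stays the same in that case, but the traversal needs to visit every outgoing edge.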
Viewing Lineage
Lineage Graph
- Click Lineage in the sidebar
- See visual diagram of data flow
- Nodes show data sources, processes, destinations
- Arrows show data movement and transformations
Graph Features
- Click nodes to see details
- Hover over arrows to see data volume
- Color coding:
  - Green: Successful
  - Yellow: In progress
  - Red: Failed
  - Blue: Not yet run
Exploring Lineage
- Zoom in/out to see details
- Filter by source, type, or status
- Search for specific datasets or processes
- Expand paths to see nested pipelines
Tracing Data
Forward Tracing
Follow data downstream:
- Start at source
- See all transformations
- Find all destinations
- Understand impact of changes
Use case: "If I change this customer field, what reports are affected?"
- Find customer source table
- Follow arrows downstream
- See all dependent queries and reports
- Know what could break
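Following arrows downstream amounts to a breadth-first walk over the lineage graph. The sketch below assumes a hypothetical adjacency map (`DOWNSTREAM`) with made-up node names. It illustrates the idea only and is not how the product stores or exposes lineage.

```python
from collections import deque

# Hypothetical edges: each node maps to its direct downstream consumers.
DOWNSTREAM = {
    "customers_db.users": ["clean_and_validate"],
    "clean_and_validate": ["analytics_db.users_clean", "reports_db.user_summary"],
    "analytics_db.users_clean": [],
    "reports_db.user_summary": ["Sales performance dashboard"],
    "Sales performance dashboard": [],
}


def trace_downstream(graph, start):
    """Return every node reachable from start, in breadth-first order."""
    seen = set()
    order = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a node can be reached along more than one path
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


for affected in trace_downstream(DOWNSTREAM, "customers_db.users"):
    print("affected:", affected)
```

Everything in the returned list is something that could break if the starting table changes.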
Reverse Tracing
Trace data upstream:
- Start at any point
- See all dependencies
- Find original sources
- Understand data quality issues
Use case: "Why does this report show wrong numbers?"
- Start at report/dashboard
- Trace back to source data
- Check each transformation
- Find where issue originated
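Tracing upstream is the same walk with the edges reversed. This minimal sketch assumes hypothetical edges written as source to consumers; the node names (including `orders_db.orders`) and the helpers `invert`, `trace_upstream`, and `original_sources` are invented for illustration.

```python
# Hypothetical edges, written as source -> list of consumers.
EDGES = {
    "customers_db.users": ["clean"],
    "orders_db.orders": ["clean"],
    "clean": ["aggregate"],
    "aggregate": ["revenue_report"],
}


def invert(edges):
    """Build a node -> set of direct parents map from source -> consumers."""
    parents = {}
    for src, consumers in edges.items():
        parents.setdefault(src, set())
        for dst in consumers:
            parents.setdefault(dst, set()).add(src)
    return parents


def trace_upstream(edges, start):
    """Return every node that start depends on, directly or indirectly."""
    parents = invert(edges)
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def original_sources(edges, start):
    """Upstream nodes that have no parents of their own."""
    parents = invert(edges)
    return {n for n in trace_upstream(edges, start) if not parents[n]}


print(sorted(original_sources(EDGES, "revenue_report")))
```

When a report shows wrong numbers, the set from `trace_upstream` is the list of steps worth checking, and `original_sources` narrows it to the raw inputs.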
Data Relationships
Source to Destination Mapping
See which sources feed which destinations:
Source: customers_db.users
→ Transform: Clean and validate
→ Destination: analytics_db.users_clean
→ Destination: reports_db.user_summary
→ Dashboard: Sales performance
Transformation Details
Click a transformation to see:
- Input data volume
- Output data volume
- Records affected (added/removed)
- Data quality checks applied
Dependencies
Understand what depends on what:
- Query A depends on Table B
- Report C depends on Query A
- Alert D triggers if Report C changes
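A dependency chain like this one also fixes the order in which things must be refreshed. As a rough sketch, Python's standard `graphlib` module (Python 3.9+) can compute that order; the `depends_on` mapping below simply restates the three bullets above and is not read from the product.

```python
from graphlib import TopologicalSorter

# Each key depends on the items in its set (taken from the list above).
depends_on = {
    "Query A": {"Table B"},
    "Report C": {"Query A"},
    "Alert D": {"Report C"},
}

# static_order() yields dependencies before the things that need them.
refresh_order = list(TopologicalSorter(depends_on).static_order())
print(refresh_order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a loop, which is a useful sanity check on any hand-maintained dependency map.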
Monitoring Data Flow
Data Quality
Each step shows:
- Records in
- Records out
- Error rate
- Processing time
Extract: 10,000 records
Transform: 10,000 → 9,998 (2 invalid removed)
Load: 9,998 records inserted
Accuracy: 99.98%
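The 99.98% figure is the share of extracted records that survive the Transform step. The arithmetic can be sketched as follows; `step_metrics` is a hypothetical helper, not a product function.

```python
def step_metrics(records_in, records_out):
    """Summarize how many records a step dropped and what share passed."""
    if records_in <= 0:
        raise ValueError("records_in must be positive")
    removed = records_in - records_out
    pass_rate_pct = round(100 * records_out / records_in, 2)
    return {"removed": removed, "pass_rate_pct": pass_rate_pct}


# Numbers from the example above: 10,000 records in, 9,998 out.
print(step_metrics(10_000, 9_998))
```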
Performance
Track how long each step takes:
- Extract: 2 seconds
- Transform: 8 seconds (slowest step)
- Load: 3 seconds
- Total: 13 seconds
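Given per-step timings like these, finding the bottleneck is a matter of picking the largest value and comparing it with the total. A minimal sketch, assuming the durations are already available as a plain dictionary (`bottleneck` is an illustrative name):

```python
def bottleneck(durations):
    """Return (step, seconds, share of total run time in percent)."""
    total = sum(durations.values())
    step = max(durations, key=durations.get)
    share_pct = round(100 * durations[step] / total, 1)
    return step, durations[step], share_pct


# Timings from the list above, in seconds.
timings = {"Extract": 2, "Transform": 8, "Load": 3}
step, seconds, share_pct = bottleneck(timings)
print(f"{step} takes {seconds}s of {sum(timings.values())}s ({share_pct}%)")
```

Here Transform accounts for well over half of the 13-second run, so it is the first step to optimize.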
Handling Changes
Impact Analysis
Before making changes:
- Click a dataset or process
- Select "Show Impact"
- See all downstream consumers
- Understand risk
Example: Before removing a column:
- Find the column in the source
- Select "Show Impact"
- Discover that 5 reports use it
- Decide whether it is safe to remove
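Conceptually, a column-level impact check is a lookup over which consumers read which columns. The sketch below assumes a hypothetical usage map (`REPORT_COLUMNS`) with invented report and column names; the product's "Show Impact" view may gather this information differently.

```python
# Hypothetical usage map: report name -> columns it reads.
REPORT_COLUMNS = {
    "Sales performance": {"user_id", "region"},
    "Churn summary": {"user_id", "status"},
    "Signup trends": {"created_at"},
}


def reports_using(column, usage=REPORT_COLUMNS):
    """List the reports that would be affected if column were removed."""
    return sorted(name for name, cols in usage.items() if column in cols)


affected = reports_using("user_id")
print(f"{len(affected)} report(s) use user_id: {affected}")
```

An empty result suggests the column has no known consumers, which is the case where removal is least risky.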
Change Tracking
Lineage shows changes:
- Added new source
- Modified transformation
- Removed step
- Updated schedule
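One way to think about change tracking is as a diff between two versions of the lineage edges. This is a sketch under that assumption; the edge lists and the `diff_lineage` helper are hypothetical and do not reflect how the product records versions.

```python
def diff_lineage(old_edges, new_edges):
    """Compare two edge lists and report which connections changed."""
    old, new = set(old_edges), set(new_edges)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Hypothetical versions: an aggregate step was inserted before the load.
before = [("customers_db.users", "clean"), ("clean", "load")]
after = [
    ("customers_db.users", "clean"),
    ("clean", "aggregate"),
    ("aggregate", "load"),
]

print(diff_lineage(before, after))
```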
Rollback Capability
If a change breaks something:
- See the previous version
- Understand what changed
- Roll back if needed
- Rerun affected pipelines
Common Tasks
Data Quality Investigation
Dashboard shows wrong numbers
↓
Click on dashboard in lineage
↓
Trace back to source
↓
See that a transformation removed nulls
↓
Check if nulls should be included
↓
Fix transformation if needed
↓
Rerun pipeline
Adding New Data Source
Want to add new source
↓
Check lineage to find similar pipelines
↓
Reuse transformations if possible
↓
See where to insert new source
↓
Update lineage with new path
Compliance and Audit
Need to show data flow for audit
↓
Export lineage diagram
↓
Document transformations
↓
Show data handling at each step
↓
Demonstrate compliance
Visualization Options
View Types
- Graph - Interactive diagram
- Table - List all connections
- Tree - Hierarchical view
- Timeline - When things ran
Customization
- Show/hide specific types of nodes
- Color by status, performance, or category
- Layout options: horizontal, vertical, hierarchical
- Export as image or PDF
Advanced Features
Data Contracts
Define expected data format:
Expected schema:
- user_id: integer (required)
- email: string (required)
- created_at: timestamp (required)
- status: enum (active, inactive)
If violated: Fail the pipeline and alert
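A contract like this can be checked record by record. The following is a minimal sketch, not the product's contract syntax: the `CONTRACT` table and `violations` function are hypothetical, and `status` is treated as optional because the schema above does not mark it as required.

```python
from datetime import datetime

# Hypothetical encoding of the schema above: field -> (type, required).
CONTRACT = {
    "user_id": (int, True),
    "email": (str, True),
    "created_at": (datetime, True),
    "status": (str, False),  # assumed optional: not marked required above
}
ALLOWED_STATUS = {"active", "inactive"}


def violations(record):
    """Return a list of contract problems; an empty list means valid."""
    problems = []
    for field, (expected_type, required) in CONTRACT.items():
        if field not in record or record[field] is None:
            if required:
                problems.append(f"{field}: missing required field")
            continue
        if not isinstance(record[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    status = record.get("status")
    if isinstance(status, str) and status not in ALLOWED_STATUS:
        problems.append("status: not one of active, inactive")
    return problems


bad = {"user_id": "7", "email": "a@example.com", "status": "paused"}
for problem in violations(bad):
    print(problem)
```

A pipeline step could call `violations` on each record and fail the run when the list is non-empty, which matches the "fail the pipeline and alert" behavior described above.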
SLA Monitoring
Track if pipelines meet SLAs:
Extract must complete in < 5 minutes
Transform must complete in < 10 minutes
Load must complete in < 2 minutes
Alert if SLA violated
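The SLA rules above reduce to comparing each step's run time against a limit. A minimal sketch, assuming run times are available as `timedelta` values; the `SLA` table restates the limits above, and `sla_violations` is an illustrative helper.

```python
from datetime import timedelta

# Limits from the rules above; a step must finish in strictly less time.
SLA = {
    "Extract": timedelta(minutes=5),
    "Transform": timedelta(minutes=10),
    "Load": timedelta(minutes=2),
}


def sla_violations(durations):
    """Return the steps whose run time met or exceeded their limit."""
    return [
        step
        for step, limit in SLA.items()
        if step in durations and durations[step] >= limit
    ]


# Hypothetical run: Transform overran, and Load hit its limit exactly.
run = {
    "Extract": timedelta(minutes=4),
    "Transform": timedelta(minutes=12),
    "Load": timedelta(minutes=2),
}
print(sla_violations(run))
```

Because the rules say "less than", a step that takes exactly its limit counts as a violation in this sketch.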
Metadata Enrichment
Add business context to lineage:
- Owner: "Data team"
- Purpose: "Revenue reporting"
- Classification: "Confidential"
- Update frequency: "Daily"
Troubleshooting
Missing Data
Data appears in source but not in destination
↓
Check lineage for breaks
↓
Look for failed transformation
↓
Check data quality rules
↓
Verify load process completed
Unexpected Changes
Data changed without pipeline run
↓
Check lineage for manual edits
↓
Look for unscheduled runs
↓
Review access logs
↓
Investigate source
Performance Issues
Pipeline slower than usual
↓
Check lineage nodes for delays
↓
Find bottleneck
↓
Optimize that step
↓
Rerun to verify improvement
Best Practices
- Document why each transformation exists
- Monitor data volumes at each step
- Alert on unexpected changes
- Test lineage paths before production
- Audit regularly for compliance
Related Topics
- Orchestration - Run data pipelines
- Activity Logs - Track all data movements
- System Monitoring - Monitor pipeline health