Data Workflow Guide
Learn how Notebook, ML Experiments, Orchestration, Lineage, and Metadata work together as a complete data platform.
The Data Platform Workflow
graph TD
A["EXPLORE<br/>━━━━━<br/>Notebook<br/>Load & Analyze"] --> B["EXPERIMENT<br/>━━━━━━<br/>ML Experiments<br/>Train Models"]
B --> C["AUTOMATE<br/>━━━━━━<br/>Orchestration<br/>Run Pipelines"]
C --> D["TRACK<br/>━━━━<br/>Lineage + Metadata<br/>Monitor Flow"]
D --> E["SHARE<br/>━━━━<br/>Documentation<br/>Collaborate"]
E -.-> A
style A fill:#1f97d4,stroke:#0b5394,stroke-width:2px,color:#fff
style B fill:#ff9900,stroke:#ec7211,stroke-width:2px,color:#fff
style C fill:#37475a,stroke:#1f1f1f,stroke-width:2px,color:#fff
style D fill:#1f97d4,stroke:#0b5394,stroke-width:2px,color:#fff
style E fill:#ff9900,stroke:#ec7211,stroke-width:2px,color:#fff
Phase 1: Explore Data in Notebook
Start here when:
- You have a new dataset
- You want to understand patterns
- You are testing hypotheses
- You are prototyping an analysis
Notebook Is Your Lab
What you do:
✓ Load data from databases
✓ Write analysis code
✓ Visualize findings
✓ Test approaches
✓ Iterate quickly
Tools:
- Python/SQL cells
- Instant feedback
- Live visualizations
- Easy data exploration
Example: Customer Analysis
# Cell 1: Load customer data
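import pandas as pd
# `connection` is assumed to be an existing database connection
# (for example, a SQLAlchemy engine) created earlier in the notebook.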
customers = pd.read_sql("""
SELECT * FROM customers
WHERE created_at > '2024-01-01'
""", connection)
# Cell 2: Explore
print(f"Total customers: {len(customers)}")
print(customers.head())
# Cell 3: Analyze (share of customers whose status is 'inactive')
churn_rate = (customers['status'] == 'inactive').mean() * 100
print(f"Churn rate: {churn_rate:.1f}%")
# Cell 4: Visualize
import matplotlib.pyplot as plt
customers.groupby('region')['revenue'].sum().plot(kind='bar')
plt.show()
Next step: If analysis works and is valuable → Move to ML Experiments
Phase 2: Track Experiments
Move here when:
- The analysis is solid
- You want to try variations
- You need to compare results
- You are building a model
ML Experiments Track Your Progress
What you do:
✓ Try multiple approaches
✓ Log metrics from each run
✓ Compare performance
✓ Keep best version
✓ Document what worked
Tools:
- Parameter logging
- Metric tracking
- Automatic comparison
- Model versioning
Example: Building Customer Churn Model
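# Note: `data`, `train_logistic_regression`, `train_random_forest`, and `evaluate`
# are placeholders for your own dataset and training code.
from credvault import experiment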
# Start experiment
exp = experiment.start('Churn Prediction')
# Try approach 1: Simple logistic regression
exp.log_params({'model': 'logistic_regression', 'features': 10})
model1 = train_logistic_regression(data)
accuracy1 = evaluate(model1)
exp.log_metric('accuracy', accuracy1)
# Try approach 2: Random forest with more features
exp.log_params({'model': 'random_forest', 'features': 25})
model2 = train_random_forest(data)
accuracy2 = evaluate(model2)
exp.log_metric('accuracy', accuracy2)
# Compare in ML Experiments dashboard
# Decide: Random forest (92% accuracy) > Logistic (87%)
exp.save_as_model('v1.0-production')
Results in ML Experiments:
Run 1: Accuracy 87% ← Logistic Regression
Run 2: Accuracy 92% ← Random Forest (BEST)
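The comparison above can be sketched in plain Python. This is an illustration only: the `runs` list and the `best_run` helper are hypothetical and are not part of the `credvault` API. The accuracy values are the ones shown in the two runs above.

```python
# Hypothetical run records mirroring the two runs listed above.
runs = [
    {"run": 1, "model": "logistic_regression", "features": 10, "accuracy": 0.87},
    {"run": 2, "model": "random_forest", "features": 25, "accuracy": 0.92},
]


def best_run(runs, metric="accuracy"):
    """Return the run with the highest value for the given metric."""
    return max(runs, key=lambda r: r[metric])


winner = best_run(runs)
print(f"Best: run {winner['run']} ({winner['model']}), accuracy {winner['accuracy']:.0%}")
# Best: run 2 (random_forest), accuracy 92%
```

Picking the maximum of a single metric is the simplest possible rule. In practice you may also want to weigh training cost or the number of features.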
Next step: Model works well → Automate it with Orchestration
Phase 3: Automate with Orchestration
Move here when:
- The model or analysis is finalized
- It needs to run regularly
- It processes many datasets
- It involves multiple steps
Orchestration Runs on Schedule
What you do:
✓ Define workflow steps
✓ Schedule daily/hourly
✓ Handle dependencies
✓ Manage retries
✓ Monitor execution
Pipeline:
1. Extract data from databases
2. Transform and validate
3. Run trained model
4. Load results
5. Generate reports
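To make "handle dependencies" and "manage retries" concrete, here is a minimal sketch of a sequential pipeline runner in plain Python. It is not how Orchestration is implemented. The `run_with_retries` and `run_pipeline` helpers and the five step functions are hypothetical stand-ins, and each step simply receives the previous step's output.

```python
import time


def run_with_retries(step, payload, max_attempts=3, delay_seconds=0.0):
    """Call step(payload); retry on failure, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(payload)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the pipeline fail loudly
            time.sleep(delay_seconds)


def run_pipeline(steps, payload=None):
    """Run steps in order; each step receives the previous step's output."""
    for step in steps:
        payload = run_with_retries(step, payload)
    return payload


# Hypothetical stand-ins for the five steps listed above.
def extract(_):
    return [{"customer_id": 1, "email": "a@example.com"}]


def validate(rows):
    assert all(r["email"] for r in rows), "email must not be empty"
    return rows


def predict(rows):
    return [{**r, "churn_risk": 0.5} for r in rows]  # placeholder score


def load(rows):
    return rows  # a real step would write to a database


def report(rows):
    return f"{len(rows)} prediction(s) generated"


result = run_pipeline([extract, validate, predict, load, report])
print(result)  # 1 prediction(s) generated
```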
Example: Daily Churn Prediction Pipeline
9:00 AM:
↓
Step 1: Extract fresh customer data (5 min)
↓
Step 2: Validate data quality (2 min)
↓
Step 3: Run churn model (3 min)
↓
Step 4: Load predictions to database (1 min)
↓
Step 5: Generate report & send to team (1 min)
↓
9:12 AM: Done! Results ready
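The 9:12 AM finish time follows from adding up the step durations, assuming the steps run strictly one after another with no waiting in between. A quick check in Python (the `durations` mapping is just the timeline above written as data):

```python
from datetime import datetime, timedelta

# Step durations in minutes, taken from the timeline above.
durations = {"extract": 5, "validate": 2, "predict": 3, "load": 1, "report": 1}

start = datetime(2024, 6, 12, 9, 0)  # the date is arbitrary; only the time matters
finish = start + timedelta(minutes=sum(durations.values()))
print(finish.strftime("%H:%M"))  # 09:12
```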
In Orchestration:
name: Daily Churn Prediction
schedule: "0 9 * * *"  # 9 AM every day
steps:
  - name: extract
    query: SELECT * FROM customers WHERE active = true
  - name: validate
    checks:
      - no_nulls: email
      - unique: customer_id
  - name: predict
    model: churn_v1.0
    input: extract.output
  - name: load
    destination: predictions_daily
    input: predict.output
  - name: report
    template: churn_daily_report
    recipients: [team@company.com]
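The two checks in the `validate` step (`no_nulls: email` and `unique: customer_id`) could look roughly like this in pandas. This guide does not say how Orchestration evaluates them, so treat the helper names `check_no_nulls` and `check_unique` and the sample rows as assumptions made for illustration.

```python
import pandas as pd


def check_no_nulls(df: pd.DataFrame, column: str) -> bool:
    """True when the column has no missing values."""
    return not df[column].isna().any()


def check_unique(df: pd.DataFrame, column: str) -> bool:
    """True when every value in the column appears exactly once."""
    return not df[column].duplicated().any()


# Made-up sample rows: one missing email and one repeated customer_id.
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 2],
        "email": ["a@example.com", None, "c@example.com"],
    }
)

print(check_no_nulls(customers, "email"))      # False
print(check_unique(customers, "customer_id"))  # False
```

Running checks like these before the `predict` step means bad input stops the pipeline early, before it can produce misleading predictions.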
Next step: Pipeline running → Track data flow with Lineage
Phase 4: Track with Lineage
Use alongside Orchestration to:
- Understand data flow
- Debug data issues
- See impact of changes
- Ensure quality
Lineage Maps Your Data Journey
Raw Customer DB
↓ (Extract)
Cleaned Data
↓ (Transform)
Features Table
↓ (ML Model)
Predictions
↓ (Load)
Production Ready
↓ (Reports & Dashboards)
Business Insights
Example: Churn Prediction Lineage
Production Database
├─ customers table
│  └─ Extract step (daily)
│     ↓
├─ features_cleaned
│  └─ Transform step (validation)
│     ↓
├─ model_churn_v1.0
│  └─ Train (from ML Experiments)
│     ↓
├─ predictions
│  └─ Score step (daily 9 AM)
│     ↓
└─ churn_risk_report
   └─ Report generation
      └─ Sent to Sales team
Questions Lineage Answers:
"Why are predictions different this week?"
→ Check lineage: Data source changed? Transform updated?
"Which reports use customer data?"
→ Check lineage: Trace forward from customers table
"If I delete this column, what breaks?"
→ Check lineage: See all downstream dependencies
"Where did this number come from?"
→ Check lineage: Trace backward to source
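Tracing forward and backward are both graph searches. The sketch below stores the example lineage as a dictionary of upstream-to-downstream edges and walks it breadth-first. The `EDGES` structure and both helper functions are hypothetical, not the Lineage API. The edge from `model_churn_v1.0` to `predictions` is one reading of the example above.

```python
from collections import deque

# Hypothetical edges (upstream -> downstream) based on the example lineage.
EDGES = {
    "customers": ["features_cleaned"],
    "features_cleaned": ["predictions"],
    "model_churn_v1.0": ["predictions"],
    "predictions": ["churn_risk_report"],
}


def reachable(graph, start):
    """Breadth-first search: every node reachable from start, in visit order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def invert(graph):
    """Flip every edge so the same search can walk upstream."""
    flipped = {}
    for src, targets in graph.items():
        for dst in targets:
            flipped.setdefault(dst, []).append(src)
    return flipped


# "If I change the customers table, what breaks?" -> trace forward.
downstream = reachable(EDGES, "customers")
# "Where did this report's numbers come from?" -> trace backward.
upstream = reachable(invert(EDGES), "churn_risk_report")

print(downstream)  # ['features_cleaned', 'predictions', 'churn_risk_report']
print(upstream)    # ['predictions', 'features_cleaned', 'model_churn_v1.0', 'customers']
```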
Phase 5: Document in Metadata
Document everything for team reuse:
- What data exists
- How to use it
- Who owns it
- Data quality
- Business meaning
Metadata Catalog
Asset: churn_predictions
Owner: Data Science Team
Description: Daily churn risk scores for all customers
Update frequency: Daily at 9:15 AM
Schema:
- customer_id: Unique customer identifier
- churn_risk: Probability (0-100%)
- risk_level: Category (low/medium/high)
Quality SLA: 99% completeness
Last updated: 2024-06-12 09:15 UTC
Used by:
- Sales Dashboard
- Customer Retention Team
- Marketing Campaigns
Related:
- churn_model_v1.0 (ML model)
- customers (source table)
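A quality SLA such as "99% completeness" is only useful if it can be measured. The sketch below assumes that completeness means the share of rows in which every schema field is present. The catalog entry does not define the term, so both that definition and the `completeness` helper are assumptions.

```python
REQUIRED_FIELDS = ("customer_id", "churn_risk", "risk_level")  # schema from the entry above
SLA_COMPLETENESS = 0.99


def completeness(rows, fields=REQUIRED_FIELDS):
    """Fraction of rows in which every required field is present and not None."""
    if not rows:
        return 0.0
    complete = sum(all(row.get(f) is not None for f in fields) for row in rows)
    return complete / len(rows)


# Made-up sample: 199 complete rows and 1 row missing its risk level.
rows = [{"customer_id": i, "churn_risk": 0.1, "risk_level": "low"} for i in range(199)]
rows.append({"customer_id": 199, "churn_risk": 0.8, "risk_level": None})

score = completeness(rows)
print(f"Completeness: {score:.1%}, meets SLA: {score >= SLA_COMPLETENESS}")
```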
Complete Example: End-to-End
Week 1: Exploration Phase
Monday: Load customer data in Notebook
Tuesday: Analyze churn patterns
Wednesday: Test different models
Thursday: Build simple visualizations
Friday: Present findings
Week 2: Experimentation Phase
Monday: Set up ML Experiments
Tuesday-Thursday: Try 15 different model approaches
Friday: Compare all runs, select best
Week 3: Production Phase
Monday: Build Orchestration pipeline
Tuesday: Test pipeline end-to-end
Wednesday: Run in production
Thursday: Verify results in database
Friday: Set up dashboards from results
Week 4: Operations Phase
Every day: Pipeline runs at 9 AM
Monitor: Check Lineage for data quality
Update: Metadata with new insights
Share: Results with stakeholders
Common Workflows
Scenario 1: Improving an Existing Model
Current: ML model runs daily, accuracy 85%
Step 1 (Notebook):
- Load current predictions
- Analyze errors
- Identify patterns
- Test improvements
Step 2 (ML Experiments):
- Try new feature engineering
- Test new algorithm
- Compare accuracy
- New approach: 92%
Step 3 (Orchestration):
- Update pipeline with new model
- Test end-to-end
- Deploy when ready
- Keep old as fallback
Step 4 (Lineage):
- Monitor data flows
- Verify impact across dashboards
- Document changes
Step 5 (Metadata):
- Update model documentation
- Note improvement (85% → 92%)
- Record change date
Scenario 2: Adding New Data Source
Business Need: Include recent transactions in churn model
Step 1 (Notebook):
- Load transactions data
- Explore patterns
- Test impact on model
Step 2 (ML Experiments):
- Retrain with new feature
- Compare old vs new
- Validate improvement
Step 3 (Orchestration):
- Add new query to pipeline
- Extract transactions
- Join with existing features
Step 4 (Lineage):
- See new data source
- Verify it flows to predictions
- Check downstream impact
Step 5 (Metadata):
- Document transactions table
- Add to data catalog
- Explain usage
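The "join with existing features" step in this scenario is worth prototyping in Notebook first. Here is a minimal pandas sketch, assuming both tables share a `customer_id` key. The other column names (`tenure_months`, `amount`, `tx_count`, `tx_total`) and the sample rows are invented for illustration.

```python
import pandas as pd

# Made-up sample data; column names other than customer_id are assumptions.
features = pd.DataFrame({"customer_id": [1, 2, 3], "tenure_months": [12, 3, 30]})
transactions = pd.DataFrame(
    {"customer_id": [1, 1, 3], "amount": [20.0, 35.0, 10.0]}
)

# Aggregate to one row per customer so the join cannot duplicate feature rows.
tx_features = (
    transactions.groupby("customer_id")
    .agg(tx_count=("amount", "count"), tx_total=("amount", "sum"))
    .reset_index()
)

# Left join keeps customers with no transactions; fill their aggregates with 0.
enriched = features.merge(tx_features, on="customer_id", how="left")
enriched[["tx_count", "tx_total"]] = enriched[["tx_count", "tx_total"]].fillna(0)

print(enriched)
```

Aggregating before joining matters here. Joining raw transactions directly would produce one row per transaction and silently inflate the training set.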
Scenario 3: Debugging Data Issues
Problem: Dashboard shows wrong numbers
Step 1 (Metadata):
- Find data source
- Check last update time
- See data quality metrics
Step 2 (Lineage):
- Trace backward from dashboard
- Find transformation that feeds it
- Check for recent changes
Step 3 (Orchestration):
- Check pipeline logs
- See if step failed
- Look at error messages
Step 4 (Notebook):
- Load raw data
- Verify manually
- Test transformations
- Find the issue
Step 5 (Fix & Rerun):
- Fix code
- Rerun pipeline
- Verify results
- Update documentation
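The first check in this scenario, looking at the last update time, can be automated. The sketch below flags an asset as stale when its last refresh is older than its expected update interval. The `is_stale` helper and the 24-hour threshold are assumptions. The timestamp is the "Last updated" value from the Metadata Catalog example.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, now, max_age=timedelta(hours=24)):
    """True when the asset has not been refreshed within max_age."""
    return now - last_updated > max_age


# "Last updated" value from the catalog example: 2024-06-12 09:15 UTC.
last_updated = datetime(2024, 6, 12, 9, 15, tzinfo=timezone.utc)

# Checked the next morning before the daily run: still fresh.
print(is_stale(last_updated, now=datetime(2024, 6, 13, 8, 0, tzinfo=timezone.utc)))   # False
# Checked two days later: a daily run was missed.
print(is_stale(last_updated, now=datetime(2024, 6, 14, 10, 0, tzinfo=timezone.utc)))  # True
```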
Best Practices Across Phases
Notebook Phase
- ✓ Use descriptive cell names
- ✓ Document your thinking
- ✓ Keep experiments organized
- ✓ Save reusable code snippets
- ✗ Don't push to production directly
Experiments Phase
- ✓ Log all parameters
- ✓ Compare fairly (same data)
- ✓ Save model artifacts
- ✓ Document assumptions
- ✗ Don't use in production until tested
Orchestration Phase
- ✓ Test thoroughly first
- ✓ Set up monitoring
- ✓ Plan for failures
- ✓ Document workflow
- ✗ Don't change production pipelines without testing
Lineage & Metadata Phase
- ✓ Keep documentation current
- ✓ Document changes
- ✓ Tag data appropriately
- ✓ Set quality rules
- ✗ Don't let documentation lag behind code
Related Topics
- Notebook - Start exploring
- ML Experiments - Compare approaches
- Orchestration - Automate workflows
- Lineage - Track data flows
- Metadata - Document data