Discovery
Understand data sources, schemas, volumes, and SLAs
Hat Sequence
Data Architect
Focus: Map the data landscape — sources, targets, volumes, latency requirements, and system constraints. Define the high-level data flow architecture and identify integration patterns (batch, streaming, CDC) appropriate for each source-target pair.
Produces: Source catalog with connection details, volume estimates, freshness requirements, and a data flow diagram showing the intended pipeline topology.
Reads: Intent problem statement, existing infrastructure documentation, source system APIs or schema definitions.
Anti-patterns:
- Designing the target schema before understanding source constraints
- Assuming all sources can support real-time extraction without verifying
- Ignoring volume growth projections and designing only for current scale
- Skipping SLA negotiation with source system owners
- Treating all data sources as equally reliable or consistent
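The source catalog the architect produces can be sketched as a small typed record per source. A minimal sketch in Python; the field names and the example values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from enum import Enum

class IntegrationPattern(Enum):
    BATCH = "batch"
    STREAMING = "streaming"
    CDC = "cdc"

@dataclass
class SourceCatalogEntry:
    # Illustrative catalog fields; extend per your environment.
    name: str
    connection_type: str          # e.g. "jdbc", "s3", "kafka"
    estimated_rows: int
    daily_growth_rows: int        # capture growth, not just current scale
    freshness_sla_minutes: int    # max acceptable staleness at the target
    pattern: IntegrationPattern

# Hypothetical entry for an orders database replicated via CDC.
orders = SourceCatalogEntry(
    name="orders_db",
    connection_type="jdbc",
    estimated_rows=50_000_000,
    daily_growth_rows=200_000,
    freshness_sla_minutes=60,
    pattern=IntegrationPattern.CDC,
)
```

Recording `daily_growth_rows` and `freshness_sla_minutes` alongside the connection details bakes the volume-growth and SLA anti-patterns above directly into the catalog's shape.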
Schema Analyst
Focus: Profile source schemas in detail — column types, nullability, cardinality, encoding, and semantic meaning. Identify type conflicts, naming inconsistencies, and data quality issues that will affect downstream transformation.
Produces: Schema analysis report with field-level profiling, type conflict inventory, and a mapping of semantic equivalences across sources (e.g., "customer_id" in system A = "cust_num" in system B).
Reads: Data architect's source catalog, raw schema definitions from source systems.
Anti-patterns:
- Accepting schema documentation at face value without sampling actual data
- Ignoring edge cases in data types (e.g., timestamps without timezone, numeric precision loss)
- Not profiling for null rates, distinct counts, and value distributions
- Treating schema discovery as a one-time activity rather than validating against live data
- Missing implicit schemas in semi-structured sources (JSON, XML, CSV without headers)
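Field-level profiling against sampled live data, rather than documentation alone, can be sketched with the standard library. `profile_column` and the sample rows below are hypothetical; a real profiler would also check type drift and value distributions:

```python
from collections import Counter

def profile_column(rows, column):
    """Compute null rate, distinct count, and top values for one column.

    `rows` is a list of dicts (e.g. a sample pulled from the source).
    This is a sketch of field-level profiling, not a full profiler.
    """
    values = [r.get(column) for r in rows]
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": nulls / total if total else 0.0,
        "distinct_count": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical sample: documentation may claim customer_id is NOT NULL,
# but profiling the actual rows reveals a 25% null rate.
sample = [
    {"customer_id": "C1"},
    {"customer_id": "C1"},
    {"customer_id": None},
    {"customer_id": "C2"},
]
stats = profile_column(sample, "customer_id")
```

Running this per column across every source feeds the null-rate, cardinality, and distribution figures the schema analysis report needs.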
Criteria Guidance
Good criteria examples:
- "Source catalog documents every known data source with connection type, schema, and estimated row count"
- "SLA requirements are captured for each target table including freshness, completeness, and acceptable error rates"
- "Schema analysis identifies all nullable fields, data type mismatches, and encoding inconsistencies across sources"
Bad criteria examples:
- "Sources are documented"
- "Schemas are understood"
- "Requirements are gathered"
Completion Signal
Source catalog exists with connection details, schema snapshots, volume estimates, and data freshness requirements for every source. Schema analysis identifies type conflicts, nullability patterns, and encoding issues. SLA targets are defined for latency, completeness, and error tolerance. Data lineage from source to intended target is mapped.
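The SLA targets named in the completion signal can be made checkable rather than aspirational. A sketch assuming one target table; the table name, metric names, and thresholds are examples only:

```python
# Hypothetical SLA targets per target table; thresholds are examples only.
sla_targets = {
    "analytics.orders": {
        "freshness_minutes": 60,      # latency: data no older than 1 hour
        "completeness_pct": 99.5,     # rows landed vs. rows expected
        "max_error_rate_pct": 0.1,    # share of rejected/failed rows tolerated
    },
}

def sla_met(table, observed):
    """Return True if observed pipeline metrics satisfy the table's SLA."""
    t = sla_targets[table]
    return (
        observed["staleness_minutes"] <= t["freshness_minutes"]
        and observed["completeness_pct"] >= t["completeness_pct"]
        and observed["error_rate_pct"] <= t["max_error_rate_pct"]
    )

ok = sla_met(
    "analytics.orders",
    {"staleness_minutes": 30, "completeness_pct": 99.9, "error_rate_pct": 0.05},
)
```

Expressing latency, completeness, and error tolerance as explicit thresholds makes the "SLA targets are defined" criterion verifiable by monitoring rather than by inspection.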