Discovery
Understand data sources, schemas, volumes, and SLAs
Hat Sequence
Data Architect
Focus: Map the data landscape — sources, targets, volumes, latency requirements, and system constraints. Define the high-level data flow architecture and identify integration patterns (batch, streaming, CDC) appropriate for each source-target pair.
Produces: Source catalog with connection details, volume estimates, freshness requirements, and a data flow diagram showing the intended pipeline topology.
Reads: Intent problem statement, existing infrastructure documentation, source system APIs or schema definitions.
Anti-patterns:
- Designing the target schema before understanding source constraints
- Assuming all sources can support real-time extraction without verifying
- Ignoring volume growth projections and designing only for current scale
- Skipping SLA negotiation with source system owners
- Treating all data sources as equally reliable or consistent
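The source catalog the architect produces can be sketched as a small typed record per source. A minimal sketch in Python; the field names and the example values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from enum import Enum

class IntegrationPattern(Enum):
    BATCH = "batch"
    STREAMING = "streaming"
    CDC = "cdc"

@dataclass
class SourceCatalogEntry:
    # Illustrative catalog fields; extend per your environment.
    name: str
    connection_type: str          # e.g. "jdbc", "s3", "kafka"
    estimated_rows: int
    daily_growth_rows: int        # capture growth, not just current scale
    freshness_sla_minutes: int    # max acceptable staleness at the target
    pattern: IntegrationPattern

# Hypothetical entry for an orders database replicated via CDC.
orders = SourceCatalogEntry(
    name="orders_db",
    connection_type="jdbc",
    estimated_rows=50_000_000,
    daily_growth_rows=200_000,
    freshness_sla_minutes=60,
    pattern=IntegrationPattern.CDC,
)
```

Recording `daily_growth_rows` and `freshness_sla_minutes` alongside the connection details bakes the volume-growth and SLA anti-patterns above directly into the catalog's shape.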
Schema Analyst
Focus: Profile source schemas in detail — column types, nullability, cardinality, encoding, and semantic meaning. Identify type conflicts, naming inconsistencies, and data quality issues that will affect downstream transformation.
Produces: Schema analysis report with field-level profiling, type conflict inventory, and a mapping of semantic equivalences across sources (e.g., "customer_id" in system A = "cust_num" in system B).
Reads: Data architect's source catalog, raw schema definitions from source systems.
Anti-patterns:
- Accepting schema documentation at face value without sampling actual data
- Ignoring edge cases in data types (e.g., timestamps without timezone, numeric precision loss)
- Not profiling for null rates, distinct counts, and value distributions
- Treating schema discovery as a one-time activity rather than validating against live data
- Missing implicit schemas in semi-structured sources (JSON, XML, CSV without headers)
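Field-level profiling against sampled live data, rather than documentation alone, can be sketched with the standard library. `profile_column` and the sample rows below are hypothetical; a real profiler would also check type drift and value distributions:

```python
from collections import Counter

def profile_column(rows, column):
    """Compute null rate, distinct count, and top values for one column.

    `rows` is a list of dicts (e.g. a sample pulled from the source).
    This is a sketch of field-level profiling, not a full profiler.
    """
    values = [r.get(column) for r in rows]
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": nulls / total if total else 0.0,
        "distinct_count": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical sample: documentation may claim customer_id is NOT NULL,
# but profiling the actual rows reveals a 25% null rate.
sample = [
    {"customer_id": "C1"},
    {"customer_id": "C1"},
    {"customer_id": None},
    {"customer_id": "C2"},
]
stats = profile_column(sample, "customer_id")
```

Running this per column across every source feeds the null-rate, cardinality, and distribution figures the schema analysis report needs.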
Criteria Guidance
Good criteria examples:
- "Source catalog documents every known data source with connection type, schema, and estimated row count"
- "SLA requirements are captured for each target table including freshness, completeness, and acceptable error rates"
- "Schema analysis identifies all nullable fields, data type mismatches, and encoding inconsistencies across sources"
Bad criteria examples:
- "Sources are documented"
- "Schemas are understood"
- "Requirements are gathered"
Completion Signal
Source catalog exists with connection details, schema snapshots, volume estimates, and data freshness requirements for every source. Schema analysis identifies type conflicts, nullability patterns, and encoding issues. SLA targets are defined for latency, completeness, and error tolerance. Data lineage from source to intended target is mapped.
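The SLA targets named in the completion signal can be made checkable rather than aspirational. A sketch assuming one target table; the table name, metric names, and thresholds are examples only:

```python
# Hypothetical SLA targets per target table; thresholds are examples only.
sla_targets = {
    "analytics.orders": {
        "freshness_minutes": 60,      # latency: data no older than 1 hour
        "completeness_pct": 99.5,     # rows landed vs. rows expected
        "max_error_rate_pct": 0.1,    # share of rejected/failed rows tolerated
    },
}

def sla_met(table, observed):
    """Return True if observed pipeline metrics satisfy the table's SLA."""
    t = sla_targets[table]
    return (
        observed["staleness_minutes"] <= t["freshness_minutes"]
        and observed["completeness_pct"] >= t["completeness_pct"]
        and observed["error_rate_pct"] <= t["max_error_rate_pct"]
    )

ok = sla_met(
    "analytics.orders",
    {"staleness_minutes": 30, "completeness_pct": 99.9, "error_rate_pct": 0.05},
)
```

Expressing latency, completeness, and error tolerance as explicit thresholds makes the "SLA targets are defined" criterion verifiable by monitoring rather than by inspection.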