Extraction

Design and implement data extraction from sources

Hats: 2
Review: Ask
Unit Types: Extraction
Inputs: Discovery
Dependencies: Discovery (source-catalog)

Hat Sequence

1. Connector Reviewer

Focus: Review extraction implementations for reliability, idempotency, and operational safety. Verify that connectors handle schema drift, network failures, and partial extractions without data loss or duplication.

Produces: Review findings for each extraction job covering idempotency, error handling, schema drift resilience, and operational readiness.

Reads: Extractor's implementation, source catalog from discovery.

Anti-patterns:

  • Approving extraction logic without verifying idempotency (re-run safety)
  • Not testing what happens when a source schema changes mid-extraction
  • Ignoring partial failure scenarios (e.g., network timeout after 80% of records)
  • Treating retry logic as optional for "reliable" sources
  • Not verifying that extraction metadata is sufficient for debugging production issues
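The re-run-safety check above can be made concrete. A minimal sketch of the pattern the reviewer should look for, with hypothetical names (`batch_id`, `land_batch`) standing in for whatever the connector actually uses: batch IDs are derived deterministically from the source and extraction window, so re-running the same job lands nothing twice.

```python
import hashlib

def batch_id(source: str, window: str) -> str:
    # Deterministic ID: re-extracting the same window yields the same batch ID.
    return hashlib.sha256(f"{source}:{window}".encode()).hexdigest()[:16]

def land_batch(staging: dict, source: str, window: str, records: list) -> bool:
    # Idempotent landing: an already-landed batch is skipped, never duplicated.
    bid = batch_id(source, window)
    if bid in staging:
        return False
    staging[bid] = {"source": source, "window": window, "records": records}
    return True

staging = {}
first = land_batch(staging, "crm", "2024-01-01/2024-01-02", [{"id": 1}])
rerun = land_batch(staging, "crm", "2024-01-01/2024-01-02", [{"id": 1}])
```

The second call returns False and leaves the staging area unchanged, which is exactly the behavior a reviewer should verify before approving a connector.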
2. Extractor

Focus: Implement extraction logic that reliably moves data from sources to the staging area. Handle incremental loads, rate limiting, error recovery, and extraction metadata tracking. Prioritize correctness and idempotency over speed.

Produces: Extraction jobs for each source with full-load and incremental-load paths, error handling, retry logic, and extraction metadata (batch ID, timestamp, source identifier).

Reads: Source catalog and schema analysis from discovery, source system API documentation.

Anti-patterns:

  • Building only full-load extraction when incremental is feasible
  • Ignoring source system rate limits or connection pool constraints
  • Silently dropping records on extraction errors instead of dead-lettering
  • Not tracking extraction metadata (when, what, how much) for auditability
  • Hardcoding connection strings or credentials instead of using config/secrets management
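Several of the behaviors above (incremental pull from a watermark, retry with backoff, dead-lettering instead of silent drops) combine into one small pattern. A sketch under assumed names; `fetch` stands in for whatever client the real connector uses:

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (timeout, rate limit, etc.)."""

def extract_with_retry(fetch, since, max_attempts=4, base_delay=0.01):
    # Incremental pull from a watermark; transient failures are retried with
    # exponential backoff, and exhausted batches go to a dead-letter list
    # rather than being silently dropped.
    dead_letter = []
    for attempt in range(max_attempts):
        try:
            return fetch(since), dead_letter
        except TransientError:
            if attempt == max_attempts - 1:
                dead_letter.append({"since": since, "error": "retries exhausted"})
                return [], dead_letter
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

# Usage: a source that fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(since):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError
    return [{"id": 7, "updated_at": since}]

rows, dlq = extract_with_retry(flaky_fetch, "2024-01-01")
```

In production the dead-letter list would be a durable queue or table, but the structure is the same: failed work is recorded somewhere inspectable, never discarded.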

Criteria Guidance

Good criteria examples:

  • "Extraction logic handles incremental loads using watermark columns identified in discovery"
  • "Connector includes retry logic with exponential backoff and dead-letter handling for failed records"
  • "Schema drift detection raises alerts rather than silently dropping or truncating columns"

Bad criteria examples:

  • "Extraction works"
  • "Data is pulled from sources"
  • "Connectors are configured"

Completion Signal

Extraction jobs exist for all sources identified in discovery. Each job handles full and incremental loads, includes error handling and retry logic, respects source system rate limits, and lands raw data in the staging area with extraction metadata (timestamp, source, batch ID). Connector reviewer has verified idempotency and schema drift handling.
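The extraction metadata named in the completion signal (timestamp, source, batch ID) can be captured as a single immutable record attached to every landed batch. A sketch with assumed field names:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class ExtractionMetadata:
    # Minimal audit record: enough to answer "when, what, how much" later.
    batch_id: str
    source: str
    extracted_at: str
    record_count: int

def make_metadata(batch_id: str, source: str, record_count: int) -> ExtractionMetadata:
    return ExtractionMetadata(
        batch_id, source, datetime.now(timezone.utc).isoformat(), record_count
    )

meta = make_metadata("a1b2c3", "crm", 100)
```

Serializing with `asdict` lets the same record land alongside the raw data and feed whatever audit or lineage store the pipeline uses.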