We shift quality checks "left" (earlier in the pipeline).
We enforce constraints at the point of ingestion (Landing → Bronze).
- Schema Adherence: Does the JSON match the expected format?
- Non-Nulls: Are critical fields (such as `student_id`) present?
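The two ingestion checks above can be sketched as a single record-level validator. This is a minimal illustration, not the project's actual implementation; the field names other than `student_id` and the `REQUIRED_FIELDS` mapping are hypothetical:

```python
# Hypothetical record validator for ingestion-time checks (Landing -> Bronze).
# Required schema: field name -> expected JSON type. Only student_id comes
# from the document; the other fields are illustrative assumptions.
REQUIRED_FIELDS = {"student_id": str, "event_type": str, "timestamp": str}

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            # Non-null check: critical fields must be present and populated.
            errors.append(f"missing critical field: {field}")
        elif not isinstance(record[field], expected_type):
            # Schema adherence check: value must match the expected type.
            errors.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    return errors
```

A record passes only when every required field is present, non-null, and of the expected type; otherwise the returned error list explains exactly which constraints failed.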
We use tooling to run automated checks:
- Great Expectations / Soda: planned for a future implementation, to validate data distributions.
- Simple Validation: Scripts that verify file integrity before promoting from Landing to Bronze.
- Bad Data Handling: Data that fails validation is not deleted; it is moved to a Quarantine folder within the Landing zone for manual inspection.
- Alerting: The Data Engineering team is notified (via Slack/email) when quality checks fail.