This is the "brain" of the data lake. Without it, the storage is just a file dump.
We automate the harvesting of metadata:
- Technical Metadata: File size, format, creation date, schema (if applicable).
- Business Metadata: Ownership, definitions, context (e.g., "Student ID", "Course Level").
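As a rough illustration of automated harvesting, the sketch below walks a directory and records the technical fields listed above. The function names are hypothetical, the format is guessed from the file extension only, and last-modified time stands in for creation date because true creation time is not portable across operating systems. Schema extraction is left out.

```python
from datetime import datetime, timezone
from pathlib import Path


def harvest_technical_metadata(path: Path) -> dict:
    """Collect basic technical metadata for a single file."""
    stat = path.stat()
    return {
        "path": str(path),
        "size_bytes": stat.st_size,
        # Assumption: the format is inferred from the extension alone.
        "format": path.suffix.lstrip(".").lower() or "unknown",
        # Assumption: last-modified time is used in place of creation date.
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


def harvest_directory(root: Path) -> list[dict]:
    """Harvest metadata for every regular file under root."""
    return [
        harvest_technical_metadata(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
```

Business metadata (ownership, definitions) cannot be derived from the file itself, so it would have to be supplied by a person or an upstream system and merged into the same record.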
We need a search engine for our data estate. At our scale, this is managed via:
- Structured Naming Conventions: yyyy/mm/dd paths.
- Metadata Files: Companion .json or .meta files alongside raw assets (sidecar pattern).
- Future State: A dedicated catalog tool (such as Amundsen or DataHub).
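The naming convention and sidecar pattern can be sketched in a few lines. This is a minimal example under stated assumptions: the helper names are hypothetical, the sidecar is named by appending .meta to the asset's file name, and its content is assumed to be JSON. The search is a plain scan of sidecar files, which is only reasonable at small scale.

```python
import json
from datetime import date
from pathlib import Path

SIDECAR_SUFFIX = ".meta"  # assumption: sidecars hold JSON under a .meta suffix


def partition_path(root: Path, day: date, filename: str) -> Path:
    """Build a yyyy/mm/dd path for an asset under the given root."""
    return root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / filename


def write_sidecar(asset: Path, metadata: dict) -> Path:
    """Write metadata to a companion file next to the asset."""
    sidecar = asset.with_name(asset.name + SIDECAR_SUFFIX)
    sidecar.write_text(
        json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8"
    )
    return sidecar


def search_sidecars(root: Path, key: str, value: str) -> list[Path]:
    """Return the assets whose sidecar has metadata[key] == value."""
    matches = []
    for sidecar in sorted(root.rglob("*" + SIDECAR_SUFFIX)):
        try:
            metadata = json.loads(sidecar.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # skip unreadable sidecars rather than fail the search
        if isinstance(metadata, dict) and metadata.get(key) == value:
            asset_name = sidecar.name[: -len(SIDECAR_SUFFIX)]
            matches.append(sidecar.with_name(asset_name))
    return matches
```

Keeping the sidecar next to the asset means a copy or move of the dated folder carries the metadata with it, which is the main appeal of the pattern before a catalog tool is in place.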
Role-Based Access Control is integrated with our Identity Management.
- Principle of Least Privilege: Users only see the zones they need.
- Public Web App: Read-only access to specific Silver paths.
- Tutorbot: Read/Write to Bronze and Silver.
- Analyst: Read-only to Gold.
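The role-to-zone rules above can be expressed as a small deny-by-default policy table. This is an illustrative sketch, not the actual integration with Identity Management: the role keys, the `Grant` type, and the path prefixes (in particular `silver/public/` for the web app's "specific Silver paths") are hypothetical placeholders.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    prefix: str         # path prefix the grant applies to
    actions: frozenset  # subset of {"read", "write"}


# Roles and zones follow the list above; the prefixes are hypothetical.
POLICY = {
    "public_web_app": [Grant("silver/public/", frozenset({"read"}))],
    "tutorbot": [
        Grant("bronze/", frozenset({"read", "write"})),
        Grant("silver/", frozenset({"read", "write"})),
    ],
    "analyst": [Grant("gold/", frozenset({"read"}))],
}


def is_allowed(role: str, action: str, path: str) -> bool:
    """Deny by default; allow only if a grant for the role covers the path."""
    # Normalize first so "silver/public/../x" cannot escape its prefix.
    normalized = posixpath.normpath(path)
    return any(
        normalized.startswith(grant.prefix) and action in grant.actions
        for grant in POLICY.get(role, [])
    )
```

For example, `is_allowed("analyst", "read", "gold/report.parquet")` is true, while the same role asking to write, or to read anything under `silver/`, is refused. Unknown roles have no grants and are refused everywhere.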
- PII Tagging: Personally Identifiable Information (names, phone numbers) must be flagged.
- Lineage: We track the flow Raw (Bronze) → Refined (Silver) → Aggregated (Gold) to facilitate root cause analysis.
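Both governance points can be illustrated with a short sketch. Everything here is assumed for the example: the asset names in `LINEAGE`, the name-based hints in `PII_HINTS`, and both function names. Flagging by field name is only a heuristic and would miss PII stored under neutral names.

```python
from collections import deque

# Hypothetical lineage records: each asset lists the assets it derives from.
LINEAGE = {
    "gold/enrolment_summary": ["silver/enrolments"],
    "silver/enrolments": ["bronze/enrolments_raw", "bronze/courses_raw"],
    "bronze/enrolments_raw": [],
    "bronze/courses_raw": [],
}

# Hypothetical substrings that suggest a field holds PII.
PII_HINTS = ("name", "phone")


def flag_pii_fields(fields: list[str]) -> list[str]:
    """Return the fields whose names suggest PII (heuristic only)."""
    return [f for f in fields if any(h in f.lower() for h in PII_HINTS)]


def trace_upstream(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """List every upstream asset of `asset`, nearest first (breadth-first)."""
    seen, order, queue = {asset}, [], deque([asset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:  # guards against cycles and duplicates
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order
```

When a Gold figure looks wrong, `trace_upstream("gold/enrolment_summary", LINEAGE)` gives the Silver and Bronze inputs to inspect, in order of distance from the faulty asset.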