© 2026 Stephen Adei. All rights reserved.
This work is protected by copyright. For permission requests or inquiries, please contact the author.
View full license at LICENSE.md.
This project contains a structured, interview-ready submission for the Ohpen case study:
```
ohpen-case-2026/
├── README.md      # This file
├── .gitignore     # Git ignore rules
├── Makefile       # Test automation (make test, make test-task1, etc.)
├── docs/          # All documentation
│   ├── submission/   # Executive summary, design decisions, handout
│   ├── technical/    # Architecture, testing, runbooks
│   └── ... (other docs)
├── docker/        # Docker helpers (Gitea, etc.)
├── scripts/       # Repo scripts (test report aggregation, sync, etc.)
├── dist/          # Build artifacts & archives
└── tasks/         # Task implementations
    ├── 01_data_ingestion_transformation/   # Python ETL
    ├── 02_data_lake_architecture_design/   # Architecture
    ├── 03_sql/                             # SQL solution
    ├── 04_devops_cicd/                     # CI/CD workflow
    └── 05_communication_documentation/     # Stakeholder comms
```
Full documentation (architecture, design decisions, runbooks): ohpen.stephenadei.nl. For a map of the repository and its scripts, see the Repository & scripts (Gitea) page on the docs site.
Gitea (scripts repository): To run Gitea locally and point the docs at it, see docker/GITEA_SETUP.md. Build the docs with GITEA_REPO_URL=<your-repo-url> so the repository links page uses your Gitea URL.
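For example, a local build might look like this (the Gitea URL and the `npm run build` command are assumptions based on the Docusaurus docs site mentioned below; adjust both to your setup and see docker/GITEA_SETUP.md for the actual instructions):

```shell
# Hypothetical example: point the docs' repository links at a local Gitea
# instance before building. The exact URL and build command depend on your
# Gitea setup and docs toolchain.
export GITEA_REPO_URL=http://localhost:3000/your-user/ohpen-case-2026
npm run build   # Docusaurus build, run from the docs project directory
```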
- tasks/01.../ — Ingestion: Python ETL (CSV in S3 → validate → Parquet partitions)
- tasks/02.../ — Architecture: Data lake design + schema evolution
- tasks/03.../ — Analytics: SQL solution for month-end balance history
- tasks/04.../ — DevOps: CI/CD workflow + IaC artifacts list + Terraform stubs
- tasks/05.../ — Communication: Stakeholder email + technical reference document

Shape Convention: Process-First — no storage nodes; shapes group by component type (Repository structure, Processes).
This mindmap provides a comprehensive overview of the project structure and components. It is also published on the documentation site: Project Mind Map (same content, shared with Docusaurus).
All diagrams in this project follow a consistent Shape-First, Color-Second convention for accessibility and semantic clarity.
Design Philosophy:
Shapes encode primary function (mandatory), colors add semantic meaning (strategic). This dual encoding keeps diagrams legible without color (print, monochrome, colorblind readers) while color layers on data-state and component-type semantics.
For complete diagram conventions, see Diagram Conventions
Shapes:

- `[(name)]`: Bronze/Silver/Gold layers, S3 buckets, databases, catalogs
- `{{name}}`: ETL jobs, validation engines, Lambda functions, Glue jobs
- `[name]`: Infrastructure, orchestration, API calls, generic services
- `{name}`: Validation gates, approval checks, conditional routing
- `([name])`: Success, failure, condemned, archived states

Colors:

- `#FF9900`: Bronze layer — raw, unvalidated data
- `#66BB6A`: Silver layer — validated, quality-assured data
- `#4D9BF0`: Gold layer — authoritative, business-ready data
- `#D13212`: Quarantine/Error — invalid data, pipeline stops
- `#4FA83D`: ETL processing — active data transformation
- `#7B2CBF`: Infrastructure — platform services (EventBridge, Step Functions)
- `#FFC107`: Decision points — conditional logic, validation gates
- `#4A90E2`: CI/CD — automation, deployment
- `#5BC0DE`: Observability — monitoring, logs, metrics

Rationale: Shape-first design ensures diagrams remain accessible and readable in all contexts (colorblind, print, monochrome). Strategic color use (5–7 per diagram) adds semantic richness for data state (Bronze/Silver/Gold) and component type (ETL/Infrastructure) without overwhelming the reader.
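As an illustration, the shape syntax above corresponds to Mermaid flowchart nodes. A minimal sketch of the convention (node names and labels are hypothetical, not taken from the project's diagrams):

```mermaid
flowchart LR
    bronze[("Bronze: raw CSV")]:::bronze --> etl{{"Validate & transform"}}:::etl
    etl --> gate{"Schema valid?"}:::decision
    gate -->|yes| silver[("Silver: Parquet")]:::silver
    gate -->|no| q[("Quarantine")]:::quarantine
    classDef bronze fill:#FF9900
    classDef silver fill:#66BB6A
    classDef etl fill:#4FA83D
    classDef decision fill:#FFC107
    classDef quarantine fill:#D13212
```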
Note: Where requirements were ambiguous, assumptions follow industry practice; see ASSUMPTIONS_AND_EDGE_CASES per task.
Example bucket names (replace with your deployment's buckets): ohpen-bronze (raw input), ohpen-silver (validated output), ohpen-quarantine (rejected records).
From the project root:
```bash
cd /path/to/ohpen-case-2026

# (Optional) create venv and install deps
python3 -m venv .venv
source .venv/bin/activate
pip install -r tasks/01_data_ingestion_transformation/requirements.txt

# Configure AWS credentials (choose one method):
# Method 1: AWS CLI
aws configure
# Method 2: environment variables
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key

# Run the ETL pipeline: S3 CSV -> validated Parquet -> S3
# (from the task directory; see tasks/01_data_ingestion_transformation/src/etl/README.md)
cd tasks/01_data_ingestion_transformation
PYTHONPATH=src python3 -m etl.ingest_transactions \
    --input-bucket ohpen-bronze \
    --input-key transactions/transactions.csv \
    --output-bucket ohpen-silver \
    --output-prefix silver/transactions \
    --quarantine-bucket ohpen-quarantine \
    --quarantine-prefix quarantine/transactions
```
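For intuition, the validate-and-quarantine split behind the `--output-*` and `--quarantine-*` flags can be sketched in pandas. This is an illustrative sketch only, not the actual tasks/01 implementation: the column names come from the Task 1 output schema, but the specific validation rules here (required fields present, parseable amount and timestamp, ISO-4217-style currency code) are assumptions.

```python
import pandas as pd

# Required columns per the Task 1 output schema.
REQUIRED = ["TransactionID", "CustomerID", "TransactionAmount",
            "Currency", "TransactionTimestamp"]

def split_valid(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split rows into (silver, quarantine) frames.

    Validation rules here are illustrative assumptions, not the real
    tasks/01 rules: non-null required fields, numeric amount, parseable
    timestamp, three-uppercase-letter currency code.
    """
    ts = pd.to_datetime(df["TransactionTimestamp"], errors="coerce")
    amount = pd.to_numeric(df["TransactionAmount"], errors="coerce")
    ok = (
        df[REQUIRED].notna().all(axis=1)
        & ts.notna()
        & amount.notna()
        & df["Currency"].astype(str).str.fullmatch(r"[A-Z]{3}")
    )
    return df[ok], df[~ok]

# Hypothetical input rows: t1 is valid, t2 has a non-numeric amount.
rows = pd.DataFrame({
    "TransactionID": ["t1", "t2"],
    "CustomerID": ["c1", "c2"],
    "TransactionAmount": ["10.50", "oops"],
    "Currency": ["EUR", "EUR"],
    "TransactionTimestamp": ["2026-01-31T10:00:00", "2026-01-31T11:00:00"],
})
silver, quarantine = split_valid(rows)
```

The valid frame would then be written as Parquet under the output prefix and the invalid frame under the quarantine prefix.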
Workflows run from the repo root (.github/workflows/):
- CI (ci.yml): On push/PR to main or master — lint (Ruff, SQLFluff), unit tests (Pandas + PySpark ETL, SQL), integration tests (MinIO).
- CD (cd.yml): On push to main or master — build ETL artifacts, upload to S3, Terraform plan then apply.

CD setup (required for deploy): In the repo Settings → Secrets and variables → Actions, set:

- AWS_ROLE_ARN_CD (variable): IAM role ARN for OIDC (GitHub Actions assumes this role to run Terraform and upload to S3).
- TF_STATE_BUCKET (variable, optional): S3 bucket for Terraform state; defaults to ohpen-terraform-state if unset.

Optional manual approval: To gate Terraform apply behind approval, create an environment (e.g. production) under Settings → Environments with required reviewers, then in .github/workflows/cd.yml uncomment `environment: production` on the deploy job.
Pipeline schema (raw → balance report): Task 1 ETL outputs TransactionID, CustomerID, TransactionAmount, Currency, TransactionTimestamp. Task 3 SQL expects id, account_id, amount, new_balance, tx_date. Map columns (e.g. CustomerID → account_id) and derive new_balance (e.g. cumulative sum per account). See tasks/03_sql/ASSUMPTIONS_AND_EDGE_CASES.md.
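The mapping and balance derivation described above can be sketched in pandas (illustrative only — the real solution is the SQL in tasks/03_sql, and the sample data here is hypothetical):

```python
import pandas as pd

# Hypothetical Task 1 output, using the column names from the schema note.
task1_out = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "CustomerID": ["A", "A", "B"],
    "TransactionAmount": [100.0, -40.0, 25.0],
    "Currency": ["EUR"] * 3,
    "TransactionTimestamp": pd.to_datetime(
        ["2026-01-10", "2026-01-20", "2026-01-15"]),
})

# Rename to the columns the Task 3 SQL expects.
tx = task1_out.rename(columns={
    "TransactionID": "id",
    "CustomerID": "account_id",          # CustomerID -> account_id
    "TransactionAmount": "amount",
    "TransactionTimestamp": "tx_date",
}).sort_values(["account_id", "tx_date"])

# Derive new_balance as a running (cumulative) sum per account.
tx["new_balance"] = tx.groupby("account_id")["amount"].cumsum()
```

Account A ends the month at 60.0 and account B at 25.0 in this toy data; the month-end balance history query then picks the last `new_balance` per account per month.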
| Component | Case Study Requirement | Location |
|---|---|---|
| Ingestion | Task 1: ETL Pipeline | tasks/01../ETL_FLOW.md (Flow) & SCRIPTS.md (Code) |
| Architecture | Task 2: Data Lake Design | tasks/02../ARCHITECTURE.md (Design) |
| Analytics | Task 3: SQL Aggregation | tasks/03../SQL_BREAKDOWN.md (Query Analysis) |
| DevOps | Task 4: CI/CD & Infra | tasks/04../CI_CD_WORKFLOW.md (Workflow) |
| Communication | Task 5: Stakeholder Comms | tasks/05../STAKEHOLDER_EMAIL.md |
For a detailed status report, see Implementation Status.