Shared datalake utilities for Medallion Architecture.
import { MedallionBuckets, MedallionPaths, extractSubjectFromDatalakePath } from '@stephen/datalake';
// Use bucket names
const bucket = MedallionBuckets.BRONZE_EDUCATION;
// Use path helpers
const path = MedallionPaths.notabilityPriveles('VO', 'Amirah');
// Extract subject from path
const subject = extractSubjectFromDatalakePath('notability/Priveles/VO/Amirah/');
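To illustrate what the extraction call above might do, here is a minimal sketch. It assumes the layout notability/Priveles/{subject}/{student}/, which is inferred from the example path and may not match the package's actual rules. sketchExtractSubject is a hypothetical stand-in, not the exported function.

```typescript
// Hypothetical stand-in for extractSubjectFromDatalakePath.
// Assumption: paths look like notability/Priveles/{subject}/{student}/...
function sketchExtractSubject(path: string): string | null {
  // Drop empty segments produced by leading or trailing slashes.
  const segments = path.split('/').filter((s) => s.length > 0);
  const [root, collection, subject] = segments;
  if (root !== 'notability' || collection !== 'Priveles' || subject === undefined) {
    // Path does not follow the assumed layout.
    return null;
  }
  return subject;
}
```

Under that assumption, the example path notability/Priveles/VO/Amirah/ yields VO, and a path outside the assumed layout yields null.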
The package provides a unified way to create MinIO clients across all projects:
import { createMinioClient, getMinioConfig } from '@stephen/datalake';
// Create a MinIO client with unified configuration
const client = createMinioClient();
// Use the client
const exists = await client.bucketExists('my-bucket');
The utility reads configuration from environment variables with sensible defaults:
MINIO_ENDPOINT: Endpoint hostname or URL (default: localhost)
MINIO_PORT: Port number (default: 9000)
MINIO_SECURE: Use SSL/TLS (default: false, set to 'true' to enable)
MINIO_ACCESS_KEY: Access key (default: minioadmin)
MINIO_SECRET_KEY: Secret key (default: minioadmin)
import { getMinioConfig } from '@stephen/datalake';
import * as Minio from 'minio';
// Get configuration object for custom client setup
const config = getMinioConfig();
// Create custom client with modifications
const customClient = new Minio.Client({
...config,
// Override specific settings if needed
useSSL: true,
});
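The environment variables and defaults listed earlier can be sketched as a small function. This is an illustration of the documented defaults, not the package source: sketchMinioConfig and MinioConfigSketch are hypothetical names, the field names are assumed to match the options of the minio Client constructor, and the sketch treats MINIO_ENDPOINT as a plain hostname (it does not parse a full URL).

```typescript
// Field names assumed to match the minio Client constructor options.
interface MinioConfigSketch {
  endPoint: string;
  port: number;
  useSSL: boolean;
  accessKey: string;
  secretKey: string;
}

type Env = Record<string, string | undefined>;

// Hypothetical re-implementation of the documented defaults.
function sketchMinioConfig(env: Env): MinioConfigSketch {
  const port = Number.parseInt(env.MINIO_PORT ?? '9000', 10);
  return {
    endPoint: env.MINIO_ENDPOINT ?? 'localhost',
    // Fall back to the default port if the variable is not a number.
    port: Number.isNaN(port) ? 9000 : port,
    // Only the exact string 'true' enables SSL/TLS.
    useSSL: env.MINIO_SECURE === 'true',
    accessKey: env.MINIO_ACCESS_KEY ?? 'minioadmin',
    secretKey: env.MINIO_SECRET_KEY ?? 'minioadmin',
  };
}

// Overrides are applied with object spread, as in the custom client example.
const sslConfig = { ...sketchMinioConfig({ MINIO_PORT: '9005' }), useSSL: true };
```

In a Node.js script the environment would typically be passed in as process.env. Taking it as a parameter keeps the sketch easy to test.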
For services that need presigned URLs with public endpoints (e.g., datalake-simple.ts), the utility provides the base configuration which can be extended:
import { getMinioConfig } from '@stephen/datalake';
const baseConfig = getMinioConfig();
// Use baseConfig for internal client
// Create separate presigned client with public endpoint logic
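One possible way to derive the presigned client's configuration from the base configuration is sketched below. Everything here is an assumption for illustration: toPublicConfig is a hypothetical helper, the public URL is a placeholder, and the config field names are assumed to match the minio Client constructor options. The idea is that a presigned URL is signed for a specific host, so the client that generates it should be configured with the host that callers will actually reach.

```typescript
// Assumed shape of the object returned by getMinioConfig.
interface BaseConfig {
  endPoint: string;
  port: number;
  useSSL: boolean;
  accessKey: string;
  secretKey: string;
}

// Hypothetical helper: keep the credentials, swap in the public host details.
function toPublicConfig(base: BaseConfig, publicUrl: string): BaseConfig {
  const url = new URL(publicUrl);
  const useSSL = url.protocol === 'https:';
  // URL.port is '' when the URL uses the scheme's default port.
  const port = url.port !== '' ? Number(url.port) : useSSL ? 443 : 80;
  return { ...base, endPoint: url.hostname, port, useSSL };
}
```

The result could then be passed to new Minio.Client(...) for presigning only, while the internal client keeps using the base configuration unchanged.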
Generate thumbnails for PDFs in the Bronze layer and store them in the Silver layer, using Poppler (pdftoppm) for fast, stream-based processing.
Requirements:
Poppler utilities (sudo apt-get install poppler-utils)
Node.js dependencies (npm install)
Usage:
# From datalake package directory
cd /home/stephen/packages/datalake
# Set environment variables for local MinIO connection
export MINIO_ENDPOINT_LOCAL=127.0.0.1
export MINIO_PORT=9005
export MINIO_SECURE=false
# Process all PDFs
npm run process-thumbnails
# Process specific folder
node scripts/process-thumbnails.mjs --folder="notability/Priveles/VO/StudentName"
# Process specific subject
node scripts/process-thumbnails.mjs --subject="VO"
# Force regenerate existing thumbnails
node scripts/process-thumbnails.mjs --force
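The flags shown above could be parsed along these lines. This is a sketch only: parseFlags and ThumbnailOptions are hypothetical names, and the real script may handle its arguments differently.

```typescript
interface ThumbnailOptions {
  folder?: string;
  subject?: string;
  force: boolean;
}

// Hypothetical parser for --folder=..., --subject=... and --force.
function parseFlags(argv: string[]): ThumbnailOptions {
  const options: ThumbnailOptions = { force: false };
  for (const arg of argv) {
    if (arg === '--force') {
      options.force = true;
    } else if (arg.startsWith('--folder=')) {
      options.folder = arg.slice('--folder='.length);
    } else if (arg.startsWith('--subject=')) {
      options.subject = arg.slice('--subject='.length);
    }
    // Unknown arguments are ignored in this sketch.
  }
  return options;
}
```

The shell removes the double quotes before the script sees the argument, so --folder="notability/Priveles/VO/StudentName" arrives as --folder=notability/Priveles/VO/StudentName. In a Node.js script the input would typically be process.argv.slice(2).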
Thumbnail Storage:
Thumbnails are stored in silver-education/thumbnails/{sanitized_file_path}/{size}.png
Sizes: small (200x200), medium (400x400), large (800x800)
Performance Benefits (Poppler):
Requires the poppler-utils system package
Generate AI analysis metadata for PDFs in the Bronze layer and store it in the Silver layer, using LangChain with structured output validation.
Requirements:
OpenAI API key (OPENAI_API_KEY environment variable)
Node.js dependencies (npm install)
LangChain packages (langchain, @langchain/openai, zod)
Usage:
# From datalake package directory
cd /home/stephen/packages/datalake
# Set environment variables for local MinIO connection
export MINIO_ENDPOINT_LOCAL=127.0.0.1
export MINIO_PORT=9005
export MINIO_SECURE=false
export OPENAI_API_KEY=your-api-key-here
# Process all PDFs
npm run process-ai-analysis
# Process specific folder
node scripts/process-ai-analysis.mjs --folder="notability/Priveles/VO/StudentName"
# Process specific subject
node scripts/process-ai-analysis.mjs --subject="VO"
# Force regenerate existing metadata
node scripts/process-ai-analysis.mjs --force
Metadata Storage:
Metadata is stored in silver-education/{file_path}.metadata.json
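The Silver-layer layouts described for thumbnails and metadata can be sketched as two key builders. Several assumptions apply: silver-education is treated as the bucket name and the remainder as the object key, the sanitization rule (replace anything outside letters, digits, dot, underscore, slash and hyphen with an underscore) is a guess, and sketchSanitize, thumbnailKey and metadataKey are hypothetical names rather than package exports.

```typescript
type ThumbnailSize = 'small' | 'medium' | 'large';

// Assumption: sanitizing replaces characters outside [A-Za-z0-9._/-] with '_'.
function sketchSanitize(filePath: string): string {
  return filePath.replace(/[^A-Za-z0-9._\/-]/g, '_');
}

// Object key (inside the assumed silver-education bucket) for one thumbnail size.
function thumbnailKey(filePath: string, size: ThumbnailSize): string {
  return `thumbnails/${sketchSanitize(filePath)}/${size}.png`;
}

// Object key (inside the assumed silver-education bucket) for the AI analysis metadata.
function metadataKey(filePath: string): string {
  return `${filePath}.metadata.json`;
}
```

Deriving both keys from the Bronze file path lets a script check whether output already exists before regenerating it, which is the kind of check the --force flag would bypass.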
LangChain Benefits: