Production Data Platform

Kafka · Spark · Delta Lake · Airflow · MinIO · Snowflake

End-to-end data engineering platform — raw events to analytics-ready Snowflake tables
Raw data from four live sources (clickstream, orders CDC, restaurants, users) flows through Kafka into a three-layer medallion architecture (Bronze → Silver → Gold) and is loaded into Snowflake. Every stage is idempotent and supports both late-arriving data correction and full historical backfills.
23 Docker services · 4 data domains · 10 Kafka topics · 06:00 UTC daily run · 3 medallion layers
Exactly-once semantics · Late-data auto-reprocess · Full backfill support · GCP Phase 3 ready · Plugin architecture: new domain = 1 YAML + 1 package
Ingestion Layer — 4 Sources · 4 Producers
🖱️
Clickstream Generator
Faker · synthetic food-delivery events · 11 event types
streaming · ~10 events/s · 50 virtual users
clickstream-producer
🛍️
Orders (Postgres CDC)
psycopg2 · poll every 30s · WHERE updated_at > watermark
micro-batch · CDC watermark · nested items[]
orders-cdc-producer
🍽️
Restaurants (Batch)
Full snapshot · hourly · restaurants + menu_items nested
batch · hourly · nested JSON
restaurants-batch-producer
👥
Users (Batch)
Incremental extract · daily · updated_at watermark · email masked
batch · daily · email masked
users-batch-producer
🐘
PostgreSQL 16
pipeline_db · shared source DB
orders · users · restaurants · menu_items
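The orders CDC producer above is a plain poll loop. A minimal sketch, assuming the order_id/updated_at columns referenced later, an illustrative status and total_amount column, the watermark file path used in the runbooks below, and SASL credentials supplied via env vars:

import json, os, time
from pathlib import Path

import psycopg2
from kafka import KafkaProducer

WATERMARK_FILE = Path("/var/lib/extraction/watermarks/orders_cdc.txt")

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    security_protocol="SASL_PLAINTEXT",
    sasl_mechanism="PLAIN",
    sasl_plain_username=os.environ["KAFKA_CLIENT_USER"],
    sasl_plain_password=os.environ["KAFKA_CLIENT_PASSWORD"],
    value_serializer=lambda v: json.dumps(v, default=str).encode(),
)
conn = psycopg2.connect(host="postgres", dbname="pipeline_db",
                        user=os.environ["POSTGRES_USER"],
                        password=os.environ["POSTGRES_PASSWORD"])
conn.autocommit = True

while True:
    # Resume from the persisted watermark (epoch start on the first run).
    watermark = WATERMARK_FILE.read_text().strip() if WATERMARK_FILE.exists() else "1970-01-01"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT order_id, status, total_amount, updated_at "
            "FROM orders WHERE updated_at > %s ORDER BY updated_at",
            (watermark,),
        )
        rows = cur.fetchall()
    for order_id, status, total_amount, updated_at in rows:
        producer.send("raw-orders", {"order_id": order_id, "status": status,
                                     "total_amount": float(total_amount),
                                     "updated_at": updated_at})
    if rows:
        producer.flush()
        # Advance the watermark only after the batch has been handed to Kafka.
        WATERMARK_FILE.write_text(str(rows[-1][3]))
    time.sleep(30)  # micro-batch poll interval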
Messaging Layer — Apache Kafka 3.7.0 · KRaft · SASL_PLAINTEXT
kafka-1:9092 kafka-2:9092 KRaft quorum · controller:9093 · no ZooKeeper
raw-events · 6 partitions · RF=2 · 7d retention
raw-orders · 6 partitions · RF=2 · 7d retention
raw-restaurants · 3 partitions · RF=2 · 7d retention
raw-users · 3 partitions · RF=2 · 7d retention
processed-events · 6 partitions · RF=2 · 30d retention
schema-changes · 3 partitions · compacted · ∞ retention
pipeline-alerts · 3 partitions · 1d retention
dlq-raw-events · DLQ · 30d retention
dlq-raw-orders · DLQ · 30d retention
📋
Schema Registry
Confluent 7.6.1 · :8081
JSON Schema · FULL_TRANSITIVE · RF=2
schema-registry
Sink Consumer — Kafka → Parquet → MinIO (at-least-once)
🔄
S3 Sink Consumer
kafka-python · multi-topic · manual commit after S3 PUT · SIGTERM graceful flush
buffer: 1,000 msgs OR 60s · group: s3-sink-consumer · auto_offset_reset: earliest · snappy Parquet · enable_auto_commit: false
s3-sink-consumer
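A minimal sketch of that poll/buffer/flush loop with kafka-python (SASL settings omitted); flush_to_s3() stands in for the Parquet-write helper sketched later in the walkthrough:

import time
from collections import defaultdict
from kafka import KafkaConsumer

def flush_to_s3(topic: str, messages: list[dict]) -> None:
    """Write messages as a Snappy Parquet file and PUT to MinIO (sketched in step 4 below)."""
    ...

consumer = KafkaConsumer(
    "raw-events", "raw-orders", "raw-restaurants", "raw-users",
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    group_id="s3-sink-consumer",
    auto_offset_reset="earliest",
    enable_auto_commit=False,          # offsets committed only after a successful S3 PUT
)

BUFFER_SIZE, FLUSH_INTERVAL = 1000, 60
buffers = defaultdict(list)            # topic -> list of raw message values
last_flush = time.monotonic()

while True:
    records = consumer.poll(timeout_ms=1000)
    for tp, msgs in records.items():
        buffers[tp.topic].extend(m.value for m in msgs)

    total = sum(len(b) for b in buffers.values())
    if total >= BUFFER_SIZE or (time.monotonic() - last_flush) >= FLUSH_INTERVAL:
        for topic, msgs in buffers.items():
            if msgs:
                flush_to_s3(topic, msgs)
        consumer.commit()              # at-least-once: commit only after the PUT succeeded
        buffers.clear()
        last_flush = time.monotonic()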
Bronze Layer — Landing Zone · MinIO S3-Compatible · Hive Partitioned
MinIO RELEASE.2024-04-18 · data-lake bucket · 4-drive erasure coding · :9000 API · :9001 Console
bronze/events/event_date=YYYY-MM-DD/
part-{ts}-{uuid8}.parquet · snappy
bronze/orders/order_date=YYYY-MM-DD/
part-{ts}-{uuid8}.parquet · snappy
bronze/restaurants/snapshot_date=YYYY-MM-DD/
part-{ts}-{uuid8}.parquet · snappy
bronze/users/snapshot_date=YYYY-MM-DD/
part-{ts}-{uuid8}.parquet · snappy
Orchestration Layer — Apache Airflow 2.9.2 · CeleryExecutor
🌊
Apache Airflow 2.9.2 CeleryExecutor
DAG Factory @ 06:00 UTC daily · Partition State Machine · Late-data detection · Auto-retry failed partitions
data_pipeline_events · data_pipeline_orders · Partition State Machine · SLA monitoring · Slack failure alerts · SparkSubmitOperator · S3KeySensor
webserver · scheduler · worker · triggerer
🗃️
Metadata Store
PostgreSQL (Airflow DB) · Delta Lake (partition_meta)
partition_state · lineage · run_log
Redis 7
Celery broker · 1 GB · allkeys-lru
task queue · AOF
redis
Processing Layer — Apache Spark 3.5.1 · Delta Lake 3.2.0
spark-master · spark-worker-1 · spark-worker-2 · hadoop-aws · Delta Lake 3.2.0 · Snowflake connector
bronze_to_silver.py
Schema validation · type cast · from_json nested items · SHA-256 record_hash · CDC latest-record Window · to_date partition key
row_number() over CDC key · desc(updated_at) · from_json items schema
silver_to_gold.py
Business aggregations · ACID Delta MERGE · idempotent · late-data reprocess · DeltaTable.forPath
MERGE INTO · idempotent · partition prune
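A minimal PySpark sketch of the two pieces named above for the orders domain, assuming an existing SparkSession (spark) with Delta enabled; total_amount and the aggregation columns are illustrative:

from delta.tables import DeltaTable
from pyspark.sql import Window
from pyspark.sql.functions import col, count, desc, row_number, to_date
from pyspark.sql.functions import sum as spark_sum

# bronze_to_silver: keep only the latest CDC record per order_id
bronze_df = spark.read.parquet("s3a://data-lake/bronze/orders/order_date=2024-01-15/")
w = Window.partitionBy("order_id").orderBy(desc("updated_at"))
silver_df = (bronze_df
    .withColumn("order_date", to_date(col("updated_at")))
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") == 1)
    .drop("rn"))

# silver_to_gold: aggregate, then idempotent ACID MERGE into the Gold Delta table
daily_agg = (silver_df.groupBy("order_date")
    .agg(count("*").alias("order_count"),
         spark_sum("total_amount").alias("revenue")))

gold = DeltaTable.forPath(spark, "s3a://data-lake/gold/orders")
(gold.alias("t")
    .merge(daily_agg.alias("s"), "t.order_date = s.order_date")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())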
Medallion Architecture — Delta Lake · ACID Transactions
🥈
Silver Layer
Cleaned · Deduplicated · Typed · Hive-partitioned
Delta Lake ACID · events_silver · orders_silver · record_hash
silver/events/event_date=...
silver/orders/order_date=...
🥇
Gold Layer
Aggregated · Business-ready · ACID MERGE
Delta Lake ACID · events_gold · orders_gold · idempotent MERGE
gold/events/event_date=...
gold/orders/order_date=...
Data Warehouse — Snowflake
❄️
Snowflake Data Warehouse
Staging MERGE → Gold schema · fully idempotent loads · consumed by BI teams and analysts
MERGE INTO staging · events schema · orders schema · no duplicates · BI ready
🐳
Docker Compose v2
23 services · single host
🌐
4 Isolated Networks
backend · storage · messaging · processing
🔒
Security
SASL · no-new-privileges · .env secrets
♻️
Exactly-Once
Late-data · Backfill · State machine
📊
Observability
Airflow UI · Spark UI · MinIO · Flower
🔌
Plugin Architecture
New domain = 1 YAML + 1 plugin pkg
☁️
GCP Phase 3 — Scalable Production
GKE Autopilot · GCS · Cloud SQL · Secret Manager · Workload Identity · spark-on-k8s-operator · ArgoCD GitOps · Terraform IaC
Auto-scale pods · GitOps (ArgoCD) · Zero static workers · Workload Identity (no JSON keys) · Zero pipeline code change
Ingestion Layer GKE Pods
🖱️
Clickstream Deployment
replicas:1 · HPA cpu>70% · same image
extraction:latest
🛍️
Orders CDC Deployment
PVC for watermark · Cloud SQL source
PersistentVolumeClaim
🍽️
Restaurants CronJob
schedule: "0 * * * *"
hourly
👥
Users CronJob
schedule: "0 6 * * *"
daily
🗄️
Cloud SQL
Postgres 16 · HA · private IP · auto-backup · replaces local PG
Messaging GKE StatefulSet — 3 replicas
Apache Kafka 3.7.0 · KRaft · StatefulSet 3 replicas · PVC per broker · or swap to Confluent Cloud (bootstrap URL change only)
raw-events · 6 partitions · RF=3 · 7d retention
raw-orders · 6 partitions · RF=3 · 7d retention
raw-restaurants · 3 partitions · RF=3
raw-users · 3 partitions · RF=3
dlq-raw-events · DLQ · 30d retention
dlq-raw-orders · DLQ · 30d retention
📋
Schema Registry Deployment
ClusterIP service · FULL_TRANSITIVE · rf=3
Sink → GCS Bronze GKE Deployment
🔄
S3 Sink Consumer → GCS zero code change
Same Python code · S3_ENDPOINT_URL removed · AWS creds = GCS HMAC key or Workload Identity
Deployment replicas:1 · Workload Identity · GCS bucket: data-lake-{project}
Bronze Layer Google Cloud Storage
gs://data-lake-{project}/ · Versioning · Lifecycle: Bronze >90d → Nearline · Silver >180d → Coldline · Gold >365d → Archive
bronze/events/event_date=YYYY-MM-DD/
bronze/orders/order_date=YYYY-MM-DD/
bronze/restaurants/snapshot_date=YYYY-MM-DD/
bronze/users/snapshot_date=YYYY-MM-DD/
Orchestration GKE · KubernetesExecutor
🌊
Airflow 2.9.2 KubernetesExecutor
Helm chart · webserver + scheduler · one GKE pod per task · no Celery workers · GCS remote logs
Helm chart · pod-per-task · no Redis needed · ArgoCD sync
🔐
Secret Manager
Replaces .env · auto-rotation · external-secrets-operator
Workload Identity · no JSON keys
🚀
ArgoCD
GitOps · git push → cluster sync · Helm values
Processing spark-on-k8s-operator · SparkApplication CRD
Driver: general GKE pool · Executors: Spark node pool (NO_SCHEDULE taint) · auto-scale 2→10 pods
bronze_to_silver.py SparkApplication
GCS connector · Workload Identity · same script logic
silver_to_gold.py SparkApplication
Delta Lake on GCS · same ACID MERGE logic
Medallion Architecture Delta Lake on GCS
🥈
Silver GCS
gs://data-lake-{project}/silver/ · same Delta format
GCS connector · Workload Identity
🥇
Gold GCS
gs://data-lake-{project}/gold/ · Coldline archive >365d
lifecycle archival · ACID MERGE
Warehouse (unchanged)
❄️
Snowflake GCS external stage replaces S3 stage
Same MERGE logic · GCS external stage · zero pipeline code change
MERGE idempotent · stage URL change only · BI / Analysts
Component | Current (Docker Compose) | GCP Phase 3 | Code change?
Runtime | Docker Compose · single host | GKE Autopilot | No
Object storage | MinIO (S3-compatible) | Google Cloud Storage | Env var only
Secrets | .env file | GCP Secret Manager | K8s secret sync
PostgreSQL | postgres:16-alpine container | Cloud SQL Postgres 16 · HA | DSN only
Airflow executor | CeleryExecutor + Redis | KubernetesExecutor · no Redis | Helm values
Spark | Standalone · 3 containers | spark-on-k8s-operator + CRD | CRD YAML only
Kafka | 2 containers · KRaft | StatefulSet 3 pods or Confluent Cloud | Bootstrap URL
Scaling | Manual docker compose scale | HPA + node auto-provisioning | No
Pipeline logic | ✓ Identical — bronze_to_silver.py, silver_to_gold.py, all DAGs, schemas unchanged | None
Follow a Single Event
From user tap on a food delivery app → queryable row in Snowflake
1
User opens the food delivery app
The app fires an app_open event. The clickstream producer generates a Python dataclass with event_id = uuid4(), event_timestamp (ISO-8601 UTC), device type, city, and session ID.
⏱ t=0ms   event generated in memory
2
Published to Kafka raw-events
The producer serialises the event to JSON and calls producer.send("raw-events", value=msg_bytes). The KafkaProducer batches and flushes to one of 6 partitions on kafka-1 or kafka-2 using the default murmur2 partitioner on user_id.
⏱ t < 100ms   in Kafka, replicated to RF=2
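A minimal sketch of steps 1–2, assuming the fields listed in step 1 plus an event_type and a user_id key (SASL settings omitted); keying by user_id is what drives the default murmur2 partitioner mentioned above:

import json, uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from kafka import KafkaProducer

@dataclass
class ClickEvent:
    event_id: str
    event_type: str
    event_timestamp: str
    user_id: str
    device: str
    city: str
    session_id: str

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

event = ClickEvent(
    event_id=str(uuid.uuid4()),
    event_type="app_open",
    event_timestamp=datetime.now(timezone.utc).isoformat(),
    user_id="user-0042", device="ios", city="Berlin",
    session_id=str(uuid.uuid4()),
)
# Keyed by user_id so all of a user's events land on the same partition.
producer.send("raw-events", key=event.user_id, value=asdict(event))
producer.flush()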
3
S3 Sink Consumer buffers the event
The sink consumer polls all subscribed topics. The event lands in an in-memory buffer keyed by topic and partition date. The buffer flushes when 1,000 messages accumulate (high-throughput) or 60 seconds elapse (low-throughput), whichever comes first. After a successful S3 PUT, the consumer manually commits the Kafka offset — guaranteeing at-least-once delivery even if the process crashes mid-flight.
⏱ t = 0–60s   buffer window
4
Parquet file written to MinIO Bronze
The consumer converts the buffer to a pandas.DataFrame, writes it as Parquet with Snappy compression via PyArrow, and PUTs the file to s3://data-lake/bronze/events/event_date=2024-01-15/part-{ts}-{uuid8}.parquet. The partition date is derived from the event_timestamp field in the message.
⏱ t ≈ 60s   file visible in MinIO Bronze
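A minimal sketch of the flush path for the events domain: buffer → DataFrame → Snappy Parquet → S3 key. The bucket and key layout follow the Bronze section above; the helper name and MinIO endpoint are illustrative:

import io, time, uuid
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3", endpoint_url="http://minio:9000")  # endpoint dropped when moving to GCS

def flush_to_s3(topic: str, messages: list[dict]) -> None:
    df = pd.DataFrame(messages)
    # The partition date comes from the event payload, not the wall clock,
    # so late-arriving messages land in the correct Bronze partition.
    df["event_date"] = pd.to_datetime(df["event_timestamp"]).dt.date.astype(str)
    domain = topic.removeprefix("raw-")          # raw-events -> events
    for event_date, part in df.groupby("event_date"):
        buf = io.BytesIO()
        pq.write_table(pa.Table.from_pandas(part), buf, compression="snappy")
        key = (f"bronze/{domain}/event_date={event_date}/"
               f"part-{int(time.time())}-{uuid.uuid4().hex[:8]}.parquet")
        s3.put_object(Bucket="data-lake", Key=key, Body=buf.getvalue())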
5
Airflow DAG fires at 06:00 UTC
The data_pipeline_events DAG starts. The Partition State Machine scans Bronze S3, discovers all partition dates with new files, computes a file-list hash for each, and compares against the partition_meta Delta table. New partitions enter state PENDING. Previously completed partitions whose file hash changed are re-marked LATE.
⏱ t ≈ 18–24h   next scheduled run (next day)
6
Spark: bronze_to_silver.py
SparkSubmitOperator submits the job. Spark reads all Parquet files for the partition date from Bronze, casts types, validates schema, parses nested fields with from_json(), computes a SHA-256 record_hash, deduplicates on event_id, and writes a Delta table partition to Silver. Partition state advances IN_FLIGHT → DONE.
⏱ +5–15 min   Silver partition written
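A minimal PySpark sketch of step 6 for the events domain, assuming an active SparkSession (spark) with Delta enabled; the nested-items schema is illustrative, and replaceWhere is one way to make the per-partition write repeatable:

from pyspark.sql.functions import col, concat_ws, from_json, sha2, to_date, to_timestamp
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

items_schema = ArrayType(StructType([
    StructField("item_id", StringType()),
    StructField("name", StringType()),
]))

silver_df = (
    spark.read.parquet("s3a://data-lake/bronze/events/event_date=2024-01-15/")
    .withColumn("event_timestamp", to_timestamp(col("event_timestamp")))
    .withColumn("event_date", to_date(col("event_timestamp")))
    .withColumn("items", from_json(col("items"), items_schema))
    .withColumn("record_hash",
                sha2(concat_ws("|", col("event_id"), col("event_type"),
                               col("event_timestamp").cast("string")), 256))
    .dropDuplicates(["event_id"])
)

(silver_df.write.format("delta").mode("overwrite")
    .option("replaceWhere", "event_date = '2024-01-15'")   # re-runs replace the same partition
    .save("s3a://data-lake/silver/events"))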
7
Spark: silver_to_gold.py
Reads the Silver Delta partition, computes business aggregations (hourly event counts, funnel conversions, revenue metrics), and writes to Gold using DeltaTable.merge(). The MERGE is fully idempotent — re-running replaces the same rows rather than duplicating them. This is critical for late-data reprocessing.
⏱ +5–10 min   Gold partition merged
8
Loaded into Snowflake
The Snowflake task reads the Gold Delta partition, inserts into a staging table, then runs MERGE INTO gold.events USING staging …. Duplicates are impossible. The load is idempotent by design — running it twice produces the same result.
⏱ +2–5 min   rows visible to BI tools
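A minimal sketch of the Snowflake load in step 8 using snowflake-connector-python. The stage name, warehouse, and column list are illustrative; only the staging-then-MERGE shape follows the description above:

import snowflake.connector

conn = snowflake.connector.connect(
    account="...", user="...", password="...",
    warehouse="PIPELINE_WH", database="ANALYTICS", schema="GOLD",
)
cur = conn.cursor()
cur.execute("CREATE TEMPORARY TABLE staging_events LIKE gold.events")
# Load the exported Gold partition into staging from an external stage.
cur.execute("""
    COPY INTO staging_events
    FROM @gold_stage/events/event_date=2024-01-15/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
# Idempotent upsert: running this twice yields the same rows, never duplicates.
cur.execute("""
    MERGE INTO gold.events AS t
    USING staging_events AS s
      ON t.event_date = s.event_date AND t.event_hour = s.event_hour
    WHEN MATCHED THEN UPDATE SET t.event_count = s.event_count
    WHEN NOT MATCHED THEN INSERT (event_date, event_hour, event_count)
      VALUES (s.event_date, s.event_hour, s.event_count)
""")
conn.commit()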
Event is queryable by analysts in Snowflake
Total end-to-end latency: ~06:25–06:35 UTC for data produced the previous day. If the event was produced after the DAG cut-off and arrives in Bronze the same day, it appears in Bronze within 60 seconds but is processed in the next morning's run — or flagged as late data and reprocessed automatically.
⏱ End-to-End Latency Summary
Stage | Latency
Clickstream → Kafka | < 100 ms
Kafka → Bronze Parquet | ≤ 60 s
Bronze → Airflow trigger | ~18–24 h (next run)
Airflow → Silver (Spark) | 5–15 min
Silver → Gold (Spark) | 5–10 min
Gold → Snowflake (MERGE) | 2–5 min
Total (happy path) | ~06:25–06:35 UTC
📊 Approximate Data Volumes
Domain | Daily volume
Events (clickstream) | ~860K events/day
Orders (CDC) | ~2,880 msgs/day
Restaurants (batch) | 24 snapshots/day
Users (batch) | 1 snapshot/day
Based on default config: 10 events/s, 30s CDC poll, 50 virtual users
🔌 Shared Framework — What Airflow Sees
Every DAG task group calls the same framework code. Domain-specific logic lives entirely in the plugin package. The orchestration layer never changes when a new domain is added.
DAG factory reads YAML · SparkSubmitOperator is generic · MetadataManager is shared · S3KeySensor path from YAML
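A minimal sketch of that factory pattern, assuming pipeline.yaml mounted at an illustrative path and ignoring sensors, metadata tasks, and alerting:

import yaml
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with open("/opt/airflow/config/pipeline.yaml") as f:
    config = yaml.safe_load(f)

for domain, cfg in config["domains"].items():
    with DAG(
        dag_id=f"data_pipeline_{domain}",
        schedule="0 6 * * *",                # 06:00 UTC daily
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        bronze_to_silver = SparkSubmitOperator(
            task_id="bronze_to_silver",
            application="/opt/spark/jobs/bronze_to_silver.py",
            application_args=["--domain", domain,
                              "--bronze-prefix", cfg["bronze"]["prefix"]],
            conn_id="spark_default",
        )
        silver_to_gold = SparkSubmitOperator(
            task_id="silver_to_gold",
            application="/opt/spark/jobs/silver_to_gold.py",
            application_args=["--domain", domain],
            conn_id="spark_default",
        )
        bronze_to_silver >> silver_to_gold
    globals()[dag.dag_id] = dag              # register the generated DAG with Airflow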
🔌 Domain Plugin Architecture
Adding a new data domain (e.g. drivers, payments, reviews) requires zero changes to the shared framework. You provide two things — a YAML config and a Python plugin package — and the platform handles the rest.
① YAML Config
One file per domain in airflow/config/pipeline.yaml
# pipeline.yaml (excerpt)
domains:
  drivers:                    # new domain
    bronze:
      bucket: data-lake
      prefix: bronze/drivers
      partition_key: event_date
      timestamp_col: event_ts
    silver:
      prefix: silver/drivers
      primary_key: driver_id
    gold:
      prefix: gold/drivers
+
② Plugin Package
Two classes in spark/domains/drivers/
# transform.py
class Transform(BaseDomainTransform):
    def transform(self, df, cfg):
        return (df
            .withColumn("event_date", to_date(col("event_ts")))
            .dropDuplicates(["driver_id"])
            .withColumn("ingested_at", current_timestamp()))

# aggregations.py
class Aggregation(BaseDomainAggregation):
    def aggregate(self, df, cfg, dt):
        return (df
            .groupBy("event_date", "city")
            .agg(count("*").alias("trips")))
What the Platform auto-creates
No framework changes needed:
# Airflow DAG factory (reads YAML)
data_pipeline_drivers            # new DAG, auto-generated

# Kafka topic (kafka-init)
raw-drivers                      # add to the init script

# S3 paths (auto from YAML)
bronze/drivers/  silver/drivers/  gold/drivers/

# Metadata tracking (auto by MetadataManager)
domain=drivers rows in the partition_meta Delta table

# Snowflake schema
# add a MERGE task to the DAG
Partition Lifecycle & Metadata Tracking
Every (domain, partition_date) pair is a tracked entity. The state machine ensures exactly-once processing, automatic late-data correction, and safe retries.
🔄 Partition State Machine
Happy path →
no files yet → UNDISCOVERED
files found in Bronze → PENDING
Spark job running → IN_FLIGHT
Silver + Gold written → DONE ✓
Failure →
Spark job running → IN_FLIGHT
Spark error / timeout → FAILED
next Airflow run picks it up → RETRYING
re-submitted → IN_FLIGHT → …
max_retries exceeded → DEAD (manual)
Late data →
previously completed → DONE
new files arrived in Bronze → LATE ⚠
re-queued for next run → REPROCESSING
re-submitted → IN_FLIGHT
updated aggregations → DONE ✓
States: UNDISCOVERED · PENDING · IN_FLIGHT · DONE · FAILED · RETRYING · LATE · REPROCESSING · DEAD (max retries hit)
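The same transitions written as a small lookup table, the kind of guard a MetadataManager-style class might enforce (a sketch; state names follow the paths above):

ALLOWED_TRANSITIONS = {
    "UNDISCOVERED": {"PENDING"},
    "PENDING":      {"IN_FLIGHT"},
    "IN_FLIGHT":    {"DONE", "FAILED"},
    "FAILED":       {"RETRYING", "DEAD"},
    "RETRYING":     {"IN_FLIGHT"},
    "DONE":         {"LATE"},             # new Bronze files detected after completion
    "LATE":         {"REPROCESSING"},
    "REPROCESSING": {"IN_FLIGHT"},
    "DEAD":         set(),                # manual intervention only
}

def transition(current: str, new: str) -> str:
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new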
🔍 Late Data Detection Algorithm
Runs at the start of every Airflow DAG run, before any Spark jobs are submitted. No manual intervention required.
1
Scan Bronze S3 — list all partition paths for the domain: bronze/events/event_date=*/
2
Compute file hash — for each partition date, sort the list of Parquet file keys and hash them: md5(sorted(file_keys))
3
Query partition_meta — fetch all rows where domain = "events" and status = "DONE"
4
Compare hashes — if computed_hash ≠ stored file_hash for a completed partition, new files have arrived since it was last processed
5
Mark LATE — update partition_meta.status = "LATE", store new file_hash, log detected delta
6
Build run queue — combine PENDING (new) + LATE + RETRYING partitions into a single ordered list
7
Process all — each partition in the queue gets its own Spark Bronze→Silver→Gold job in the same DAG run. No separate backfill DAG needed.
Why this works: Bronze is append-only and immutable. Files are never deleted or overwritten. Late-arriving Kafka messages that cross a date boundary are written to the correct Bronze partition date by the sink consumer. The next Airflow run detects the new files and reprocesses. No data is lost.
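A minimal sketch of steps 1–6 of the detection algorithm, assuming boto3 against the MinIO endpoint and a dict of partition_meta rows already loaded from the Delta table; helper names are illustrative:

import hashlib
import boto3

s3 = boto3.client("s3", endpoint_url="http://minio:9000")

def scan_bronze(domain: str) -> dict[str, str]:
    """Return {partition_date: file_hash} for every Bronze partition of a domain."""
    paginator = s3.get_paginator("list_objects_v2")
    files_by_date: dict[str, list[str]] = {}
    for page in paginator.paginate(Bucket="data-lake", Prefix=f"bronze/{domain}/"):
        for obj in page.get("Contents", []):
            # key: bronze/events/event_date=2024-01-15/part-...parquet
            date = obj["Key"].split("=")[1].split("/")[0]
            files_by_date.setdefault(date, []).append(obj["Key"])
    return {d: hashlib.md5("".join(sorted(keys)).encode()).hexdigest()
            for d, keys in files_by_date.items()}

def build_run_queue(domain: str, meta_rows: dict[str, dict]) -> list[str]:
    """meta_rows: {partition_date: {"status": ..., "file_hash": ...}} from partition_meta."""
    queue = []
    for date, file_hash in sorted(scan_bronze(domain).items()):
        row = meta_rows.get(date)
        if row is None:
            queue.append(date)                                  # PENDING: never seen before
        elif row["status"] in ("PENDING", "RETRYING"):
            queue.append(date)
        elif row["status"] == "DONE" and row["file_hash"] != file_hash:
            queue.append(date)                                  # LATE: new files since last run
    return queue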
🗂 partition_meta Delta Table Schema
Stored as a Delta Lake table in MinIO/GCS. Single source of truth for partition lineage, retry logic, and idempotency checks. Path: s3://pipeline-metadata/partition_meta/
Column | Type | Description
domain | string | "events", "orders", "restaurants", "users"
partition_date | date | The calendar date this partition covers
status | string | Current state machine state (see diagram above)
run_id | string | Airflow DAG run ID that last touched this row
record_count | bigint | Records written to Silver on the last successful run
source_files | array<string> | Bronze Parquet file paths included in the last run
file_hash | string | md5(sorted(source_files)); the late-data detection key
silver_records | bigint | Deduplicated rows in the Silver output
gold_records | bigint | Rows merged into the Gold output
retry_count | int | Number of FAILED → RETRYING cycles
error_message | string | Last Spark error stacktrace (null on success)
created_at | timestamp | When the partition was first discovered
updated_at | timestamp | Last state-transition timestamp
processed_at | timestamp | When the most recent DONE state was reached
silver_path | string | S3 path of the Silver output (for lineage)
gold_path | string | S3 path of the Gold output (for lineage)
🛡 Idempotency Guarantees Per Layer
Bronze Layer At-Least-Once
Bronze is append-only. Duplicate Parquet files can exist if the sink consumer crashes between a successful S3 PUT and a Kafka commit. This is by design — duplicates are resolved in Silver. Bronze stores raw truth; it is never cleaned or overwritten.
duplicates possible · immutable files · append-only
Silver Layer Exactly-Once
Event data: deduplication on event_id (UUID4). CDC data (orders): Window function row_number() OVER (PARTITION BY order_id ORDER BY updated_at DESC) keeps only the latest status per order. A record_hash (SHA-256) detects schema-level corruption. Re-running Silver for the same partition always produces the same output.
deduplicated · latest CDC record kept · safe to rerun
Gold + Snowflake Idempotent MERGE
Both Gold (Delta Lake) and Snowflake use MERGE INTO … WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT. Running the same partition twice replaces rows rather than duplicating them. This is the key property that makes late-data reprocessing safe — previously computed aggregations are simply overwritten with corrected values.
MERGE idempotent · safe late-data overwrite · no manual cleanup
🔁 How to Trigger a Backfill
To reprocess a historical date range (e.g. after a pipeline bug fix), delete or update the rows in partition_meta. The next Airflow run will re-discover the partitions as PENDING and reprocess them.
Option 1: SQL on partition_meta
-- Mark a range as PENDING for reprocessing
UPDATE partition_meta
SET status = 'PENDING', retry_count = 0
WHERE domain = 'events'
  AND partition_date BETWEEN '2024-01-10' AND '2024-01-15';
-- The next DAG run at 06:00 UTC picks them up automatically
Option 2: Airflow UI
# 1. Open the Airflow UI → data_pipeline_events DAG
# 2. Click the failed / target task group
# 3. Select "Clear" → confirm
# 4. Airflow re-queues the task
#    MetadataManager resets the state to PENDING
#    The Spark job re-runs against the same Bronze data
OPERATIONS GUIDE

Run, monitor, and recover the platform

Service URLs, startup commands, daily cadence, common Kafka/Airflow ops, and failure runbooks — everything on-call needs in one place.

Service Endpoints
Airflow UI PUBLIC
http://localhost:8080
DAG management, task logs, trigger backfills, monitor runs.
Login: AIRFLOW_ADMIN_USER / AIRFLOW_ADMIN_PASSWORD
airflow-webserverCeleryExecutor
MinIO Console PUBLIC
http://localhost:9001
Browse Bronze / Silver / Gold buckets, inspect Parquet files, manage lifecycle policies.
S3 API: :9000
platform-minioerasure 4-drive
Spark Master UI LOOPBACK
http://localhost:18080
Active jobs, worker status, executor memory, completed application history.
Submit endpoint: spark://spark-master:7077
spark-master127.0.0.1 only
Spark Workers LOOPBACK
Worker-1: :8082  |  Worker-2: :8083
Per-worker executor logs, running tasks, memory / core allocation. Each worker: up to 6g RAM, 4 cores (configurable via .env).
spark-worker-1spark-worker-2
Celery Flower LOOPBACK
http://localhost:5555
Real-time Celery worker monitoring: task rates, queue depths, worker heartbeats, task history per worker.
airflow-flowerRedis broker
Schema Registry LOOPBACK
http://localhost:8081
REST API to list/inspect Avro, Protobuf, JSON schemas. Compatibility mode: FULL_TRANSITIVE.
schema-registryConfluent 7.6.1
PostgreSQL LOOPBACK
localhost:5432
Databases: airflow (Airflow meta), pipeline_db (domain tables — orders, users, restaurants).
platform-postgres16-alpine
Quick Start
First-time setup
# 1. Copy the env template and fill in secrets
cp .env.example .env
# vim .env — set AIRFLOW_FERNET_KEY, passwords, etc.

# 2. Build custom images (Airflow + Spark + Extraction)
docker compose build

# 3. Start everything (init containers run automatically)
docker compose up -d

# 4. Watch init containers complete
docker compose logs -f airflow-init kafka-init minio-init

# 5. Verify all services are healthy
docker compose ps
Day-to-day operations
# Restart a single service after a code change
docker compose restart airflow-scheduler

# Rebuild the extraction layer after a code change
docker compose build clickstream-producer
docker compose up -d --no-deps clickstream-producer

# Tail logs from any service
docker compose logs -f --tail=100 airflow-worker

# Stop everything (preserves volumes / data)
docker compose down

# Full teardown including volumes (DESTRUCTIVE)
docker compose down -v
Daily Pipeline Cadence (UTC)
Every day the pipeline runs a coordinated sequence. All times UTC. Extraction layer (Kafka producers) runs continuously — the Airflow DAG processes the accumulated Bronze data at 06:00.
Time (UTC) | Event | Service | What happens
00:00 → | Clickstream events flowing | clickstream-producer | ~10 events/s continuously written to the raw-events topic
00:00 → | Order CDC polling | orders-cdc-producer | Every 30 s: reads orders WHERE updated_at > watermark, publishes to raw-orders
hourly | Restaurant catalog snapshot | restaurants-batch-producer | Full join of restaurants + menu_items published to raw-restaurants
daily | User profile extract | users-batch-producer | Incremental extract of changed users (watermark on updated_at) → raw-users
continuous | S3 sink flush | s3-sink-consumer | Every 60 s or 1,000 messages: flushes buffered Parquet files to s3://data-lake/bronze/
~05:55 | Airflow scheduler fires | airflow-scheduler | Evaluates the DAG schedule, queues data_pipeline_events task instances
06:00 | Bronze → Silver (Spark) | airflow-worker · spark-worker-1/2 | SparkSubmitOperator runs bronze_to_silver.py per domain; deduplication, typing, Delta merge
~06:10 | Silver → Gold (Spark) | spark-worker-1/2 | Aggregations run after Silver succeeds; writes daily metrics to Gold Delta tables
~06:20 | Gold → Snowflake (optional) | airflow-worker | Snowflake MERGE via silver_to_gold.py; idempotent, safe to re-run
~06:30 | Late-data detection | airflow-scheduler | On the next run: Bronze file hashes compared to partition_meta; DONE partitions with new files → LATE → reprocess
Common Kafka Operations
Shell into a broker first
# Exec into kafka-1 and write the client config
docker exec -it platform-kafka-1 bash

# Inside the container: env vars are available directly
cat > /tmp/client.properties <<EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="${KAFKA_CLIENT_USER}" \
  password="${KAFKA_CLIENT_PASSWORD}";
EOF

# Verify credentials are set
echo "User: $KAFKA_CLIENT_USER"
List topics & consumer lag
# List all topics
/opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --list

# Describe a topic (partition leaders, ISR)
/opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --describe --topic raw-events

# Check consumer group lag
/opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --describe --group s3-sink-consumer
Peek at topic messages
# Consume 5 messages from the beginning of raw-events
/opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --topic raw-events \
  --from-beginning \
  --max-messages 5

# Peek at the dead-letter queue
/opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --topic dlq-raw-events \
  --from-beginning \
  --max-messages 10
Reset consumer offset (backfill)
# Reset s3-sink-consumer to the beginning (re-sink all data)
# Stop the consumer first!
docker compose stop s3-sink-consumer

/opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --group s3-sink-consumer \
  --reset-offsets \
  --to-earliest \
  --topic raw-events \
  --execute

docker compose start s3-sink-consumer

# Reset to a specific datetime instead:
# --to-datetime 2024-01-15T06:00:00.000
Failure Runbooks
KAFKA BROKER UNHEALTHY exit 127 or healthcheck timeout
Symptoms
  • docker compose ps shows kafka-1/2 unhealthy
  • kafka-init stuck in "not ready" retry loop
  • airflow-worker can't produce to Kafka topics
Resolution
# Check healthcheck logs
docker inspect --format \
  "{{json .State.Health.Log}}" \
  platform-kafka-1 | python3 -m json.tool

# Recreate to pick up the latest image/config
docker compose up -d --force-recreate \
  kafka-1 kafka-2
AIRFLOW TASK FAILED Spark job error or Bronze data missing
Diagnosis steps
  • Airflow UI → failed task → View Log
  • Spark UI (localhost:18080) → failed application → Stderr
  • Check MinIO for Bronze Parquet files (did s3-sink-consumer flush?)
  • Check partition_meta table for FAILED state
Re-run a specific DAG run
# Via the Airflow UI
# Grid view → failed run → Clear

# Via the CLI (inside airflow-webserver)
docker exec -it platform-airflow-webserver \
  airflow dags trigger \
  data_pipeline_events \
  --exec-date 2024-01-15T06:00:00+00:00 \
  --conf '{"domains":["events"]}'
S3 SINK LAG / STALE BRONZE Consumer group falling behind or no Parquet files appearing
Check lag
# Check consumer group lag
docker exec -it platform-kafka-1 \
  /opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --describe --group s3-sink-consumer

# Check consumer logs
docker compose logs --tail=50 s3-sink-consumer
Resolution
# Restart the consumer
docker compose restart s3-sink-consumer

# Scale up if lag is high
# (production: split into per-topic consumers;
#  see the s3_sink_consumer.py header for the pattern)

# Force a flush: restart with buffer=1
SINK_FLUSH_BUFFER_SIZE=1 \
  docker compose up -d s3-sink-consumer
POSTGRES / CDC PRODUCER ERROR psycopg2 connection refused or watermark drift
Check DB health
# Check Postgres is accepting connections
docker exec -it platform-postgres \
  pg_isready -U $POSTGRES_USER -d pipeline_db

# List domain tables
docker exec -it platform-postgres \
  psql -U $POSTGRES_USER -d pipeline_db \
  -c "\dt"
Reset CDC watermark
# Reset the orders watermark to replay all orders
docker exec platform-orders-cdc-producer \
  sh -c "rm -f /var/lib/extraction/watermarks/orders_cdc.txt"
docker compose restart orders-cdc-producer

# Reset the users watermark
docker exec platform-users-batch-producer \
  sh -c "rm -f /var/lib/extraction/watermarks/users_batch.txt"
docker compose restart users-batch-producer
REFERENCE

Technology stack, networks, security & glossary

Versions, network topology, security model, glossary of every term used in this platform, and the GCP migration roadmap.

Technology Stack
Component | Image / Version | Role | Notable config
PostgreSQL | 16-alpine | Airflow metadata DB + pipeline domain DB (orders, users, restaurants) | PGDATA=/var/lib/postgresql/data/pgdata
Redis | 7-alpine | Celery broker & result backend; Airflow task queue | maxmemory 1gb, LRU eviction, AOF persistence
MinIO | 2024-04-18T19-09-19Z | S3-compatible object store; Bronze / Silver / Gold / metadata buckets | Erasure coding across /data{1...4}; survives 2 drive read failures
Apache Kafka | apache/kafka:3.7.0 | Event streaming bus; 10 topics, KRaft mode (no ZooKeeper) | SASL_PLAINTEXT, 2-broker dev / 3-broker prod, 2 GB JVM heap
Confluent Schema Registry | cp-schema-registry:7.6.1 | Avro / Protobuf / JSON schema enforcement; FULL_TRANSITIVE compatibility | SASL auth to Kafka, RF=2 internal topic
Apache Spark | apache/spark:3.5.1 | Bronze→Silver dedup/typing, Silver→Gold aggregations, Snowflake MERGE | Delta Lake 3.2.0, hadoop-aws, Kafka connector, Snowflake connector
Apache Airflow | 2.9.2 | DAG orchestration; CeleryExecutor; S3 remote logs; SparkSubmitOperator | 4 workers (AIRFLOW__CELERY__WORKER_CONCURRENCY=8), 4 webserver workers
Extraction layer | data-platform/extraction:latest | Python producers (clickstream, CDC, batch) + S3 sink consumer | kafka-python, boto3, psycopg2-binary, faker, pyarrow, pydantic
Delta Lake | 3.2.0 | ACID table format for Silver and Gold; MERGE for idempotent upserts | Snappy compression, Z-order on partition date columns
Network Topology
BACKEND-NET
platform-backend  |  172.20.0.0/24
Internal services: auth, queuing, task state.
postgres · redis · airflow-webserver · airflow-scheduler · airflow-worker · airflow-triggerer · airflow-flower
STORAGE-NET
platform-storage  |  172.20.1.0/24
Object store access: read/write Bronze, Silver, Gold, metadata, Spark event logs.
minio · minio-init · spark-master · spark-worker-1/2 · airflow-worker · s3-sink-consumer
MESSAGING-NET
platform-messaging  |  172.20.2.0/24
Kafka cluster communication; producers publish, consumer sinks read.
kafka-1 · kafka-2 · schema-registry · kafka-init · clickstream-producer · orders-cdc-producer · restaurants-batch-producer · users-batch-producer · s3-sink-consumer · airflow-worker · spark-worker-1/2
PROCESSING-NET
platform-processing  |  172.20.3.0/24
Spark cluster internal communication and job submission from Airflow worker.
spark-master · spark-worker-1 · spark-worker-2 · airflow-worker
NETWORK ISOLATION NOTE
airflow-worker is the only service bridging all four networks — it's the integration hub. Spark workers are on processing-net + storage-net + messaging-net (need Kafka for streaming jobs). Producers are messaging-net only. The Postgres and Redis instances are reachable only from backend-net, keeping them isolated from untrusted networks.
Security Model
Kafka Authentication
  • Protocol: SASL_PLAINTEXT
  • Mechanism: PLAIN
  • Inter-broker: separate credentials from client
  • JAAS file generated at container start from env vars — never hardcoded in image
  • Every CLI command requires --command-config
Container Hardening
  • All containers: no-new-privileges:true
  • Airflow: runs as UID 50000 (non-root)
  • Spark: runs as user spark (non-root)
  • Extraction: non-root Python service user
  • Postgres / Redis: official images, non-root by default
  • No secrets in image layers — all via .env
Network Exposure
  • Airflow UI (:8080) and MinIO Console (:9001) — bound to 0.0.0.0
  • All other ports: 127.0.0.1 loopback only
  • MinIO service account (MINIO_PIPELINE_USER) has least-privilege bucket policy
  • Airflow connections store credentials in encrypted Postgres column (Fernet key)
  • Docker bridge networks: services can only reach peers on shared network
Glossary
Bronze Layer
Raw, unmodified data as produced by the S3 sink consumer. Parquet files partitioned by date in s3://data-lake/bronze/{domain}/. At-least-once: may contain duplicates. Never modified after write.
Silver Layer
Deduplicated, typed, validated records in Delta Lake format. Bronze duplicates removed using event_id / CDC watermark window. Each domain has a Silver table managed by a domain plugin.
Gold Layer
Business-level aggregations: daily revenue, order counts, funnel metrics. Written as Delta Lake tables. MERGE-based upsert makes re-runs idempotent. Final destination before Snowflake.
CDC (Change Data Capture)
Technique for capturing row-level changes in a source database. Here implemented as micro-batch polling: SELECT … WHERE updated_at > watermark every 30s, not log-based CDC.
Watermark
A persisted checkpoint timestamp (stored in a Docker volume). Producers read the watermark on startup to resume from where they left off, avoiding re-processing all historical records.
KRaft
Kafka Raft — the built-in consensus protocol replacing ZooKeeper (available since Kafka 3.3). Brokers manage their own metadata via a Raft quorum; eliminates the ZooKeeper dependency entirely.
DLQ (Dead-Letter Queue)
A Kafka topic where unprocessable messages are routed instead of being discarded. Examples: dlq-raw-events, dlq-raw-orders. 30-day retention for manual triage.
Medallion Architecture
A data lakehouse design pattern with three layers: Bronze (raw), Silver (clean), Gold (aggregated). Each layer adds quality and semantic meaning. Coined by Databricks, implemented here with Delta Lake.
Partition State Machine
A per-(domain, date) state tracker in the partition_meta Delta table. States: UNDISCOVERED → PENDING → IN_FLIGHT → DONE, with FAILED and LATE paths. Enables idempotent re-runs and late-data detection.
CeleryExecutor
Airflow executor that distributes tasks to Celery workers via Redis. Enables horizontal scaling: add more airflow-worker replicas to increase concurrency without changing the scheduler.
Delta Lake MERGE
MERGE INTO target USING source ON key WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT — a single atomic SQL statement that upserts rows. Makes re-processing late data safe with no duplicates.
At-least-once / Exactly-once
Delivery guarantees. Bronze is at-least-once (Kafka commit after S3 PUT — may duplicate on crash). Silver achieves exactly-once via deduplication on event_id. Gold/Snowflake via idempotent MERGE.
GCP Migration Phases
Phase 1
CURRENT
Local Docker Stack
All services on a single machine (or small VM). KRaft Kafka 2-broker dev cluster, MinIO 4-drive erasure coding, Spark standalone, Airflow CeleryExecutor. Full pipeline parity — no cloud required. Cost: ~$0/month local, ~$100-300/month on a GCP e2-standard-8.
running now · Docker Compose · MinIO · Kafka KRaft · Spark standalone
Phase 2
HYBRID
Managed Kafka + GCS Storage
Replace self-managed Kafka with Confluent Cloud or Google Pub/Sub. Replace MinIO with Google Cloud Storage. Airflow + Spark remain on GCE VMs (Compute Engine) or GKE. Producers/consumers require minimal code changes — swap bootstrap server and S3 endpoint env vars. Airflow connections updated to GCS.
Confluent Cloud / Pub/Sub · GCS · GCE / GKE · Spark standalone → Dataproc
Phase 3
FULL GCP
Fully Managed Cloud-Native
All components replaced with GCP-managed equivalents. Cloud Composer 2 (managed Airflow on GKE Autopilot). Dataproc Serverless for Spark jobs (no cluster management, pay-per-job). BigQuery replaces Snowflake (or alongside it). Pub/Sub + Dataflow for streaming. GCS as the unified data lake. IAM service accounts replace username/password secrets.
Cloud Composer 2 · Dataproc Serverless · BigQuery · Pub/Sub · Dataflow · GCS · IAM Workload Identity
Migration Comparison
Component | Phase 1 (Current) | Phase 2 | Phase 3
Event bus | Kafka KRaft (Docker) | Confluent Cloud / Pub/Sub | Pub/Sub + Dataflow
Object store | MinIO (Docker) | Google Cloud Storage | GCS
Processing | Spark standalone (Docker) | Dataproc on GCE | Dataproc Serverless
Orchestration | Airflow CeleryExecutor | Airflow on GCE/GKE | Cloud Composer 2
Analytics DB | Snowflake (external) | Snowflake on GCP | BigQuery
Secrets | .env + Fernet | Secret Manager | IAM Workload Identity
Approx. cost | $0 local / ~$200/mo VM | ~$500–1,200/mo | ~$800–2,000/mo