Production Data Platform

Kafka · Spark · Delta Lake · Airflow · MinIO · Snowflake

End-to-end data engineering platform — raw events to analytics-ready Snowflake tables
Raw data from four live sources (clickstream, orders CDC, restaurants, users) flows through Kafka into a three-layer medallion architecture (Bronze → Silver → Gold) and is loaded into Snowflake. Every stage is idempotent and supports both late-arriving data correction and full historical backfills.
23 Docker services · 4 data domains · 10 Kafka topics · 06:00 UTC daily run · 3 medallion layers
Exactly-once semantics · Late-data auto-reprocess · Full backfill support · GCP Phase 3 ready · Plugin architecture: new domain = 1 YAML + 1 package
Ingestion Layer — 4 Sources · 4 Producers
🖱️
Clickstream Generator
Faker · synthetic food-delivery events · 11 event types
streaming · ~10 events/s · 50 virtual users
clickstream-producer
🛍️
Orders (Postgres CDC)
psycopg2 · poll every 30s · WHERE updated_at > watermark
micro-batch · CDC watermark · nested items[]
orders-cdc-producer
🍽️
Restaurants (Batch)
Full snapshot · hourly · restaurants + menu_items nested
batch · hourly · nested JSON
restaurants-batch-producer
👥
Users (Batch)
Incremental extract · daily · updated_at watermark · email masked
batch · daily · email masked
users-batch-producer
🐘
PostgreSQL 16
pipeline_db · shared source DB
orders · users · restaurants · menu_items
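The orders CDC producer above is a plain poll loop. A minimal sketch, assuming the order_id/updated_at columns referenced later, an illustrative status and total_amount column, the watermark file path used in the runbooks below, and SASL credentials supplied via env vars:

import json, os, time
from pathlib import Path

import psycopg2
from kafka import KafkaProducer

WATERMARK_FILE = Path("/var/lib/extraction/watermarks/orders_cdc.txt")

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    security_protocol="SASL_PLAINTEXT",
    sasl_mechanism="PLAIN",
    sasl_plain_username=os.environ["KAFKA_CLIENT_USER"],
    sasl_plain_password=os.environ["KAFKA_CLIENT_PASSWORD"],
    value_serializer=lambda v: json.dumps(v, default=str).encode(),
)
conn = psycopg2.connect(host="postgres", dbname="pipeline_db",
                        user=os.environ["POSTGRES_USER"],
                        password=os.environ["POSTGRES_PASSWORD"])
conn.autocommit = True

while True:
    # Resume from the persisted watermark (epoch start on the first run).
    watermark = WATERMARK_FILE.read_text().strip() if WATERMARK_FILE.exists() else "1970-01-01"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT order_id, status, total_amount, updated_at "
            "FROM orders WHERE updated_at > %s ORDER BY updated_at",
            (watermark,),
        )
        rows = cur.fetchall()
    for order_id, status, total_amount, updated_at in rows:
        producer.send("raw-orders", {"order_id": order_id, "status": status,
                                     "total_amount": float(total_amount),
                                     "updated_at": updated_at})
    if rows:
        producer.flush()
        # Advance the watermark only after the batch has been handed to Kafka.
        WATERMARK_FILE.write_text(str(rows[-1][3]))
    time.sleep(30)  # micro-batch poll interval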
Messaging Layer — Apache Kafka 3.7.0 · KRaft · SASL_PLAINTEXT
kafka-1:9092 kafka-2:9092 KRaft quorum · controller:9093 · no ZooKeeper
raw-events · 6 partitions · RF=2 · 7d retention
raw-orders · 6 partitions · RF=2 · 7d retention
raw-restaurants · 3 partitions · RF=2 · 7d retention
raw-users · 3 partitions · RF=2 · 7d retention
processed-events · 6 partitions · RF=2 · 30d retention
schema-changes · 3 partitions · compacted · ∞ retention
pipeline-alerts · 3 partitions · 1d retention
dlq-raw-events · DLQ · 30d retention
dlq-raw-orders · DLQ · 30d retention
📋
Schema Registry
Confluent 7.6.1 · :8081
JSON Schema · FULL_TRANSITIVE · RF=2
schema-registry
Sink Consumer — Kafka → Parquet → MinIO (at-least-once)
🔄
S3 Sink Consumer
kafka-python · multi-topic · manual commit after S3 PUT · SIGTERM graceful flush
buffer: 1,000 msgs OR 60s · group: s3-sink-consumer · auto_offset_reset: earliest · snappy Parquet · enable_auto_commit: false
s3-sink-consumer
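A minimal sketch of that poll/buffer/flush loop with kafka-python (SASL settings omitted); flush_to_s3() stands in for the Parquet-write helper sketched later in the walkthrough:

import time
from collections import defaultdict
from kafka import KafkaConsumer

def flush_to_s3(topic: str, messages: list[dict]) -> None:
    """Write messages as a Snappy Parquet file and PUT to MinIO (sketched in step 4 below)."""
    ...

consumer = KafkaConsumer(
    "raw-events", "raw-orders", "raw-restaurants", "raw-users",
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    group_id="s3-sink-consumer",
    auto_offset_reset="earliest",
    enable_auto_commit=False,          # offsets committed only after a successful S3 PUT
)

BUFFER_SIZE, FLUSH_INTERVAL = 1000, 60
buffers = defaultdict(list)            # topic -> list of raw message values
last_flush = time.monotonic()

while True:
    records = consumer.poll(timeout_ms=1000)
    for tp, msgs in records.items():
        buffers[tp.topic].extend(m.value for m in msgs)

    total = sum(len(b) for b in buffers.values())
    if total >= BUFFER_SIZE or (time.monotonic() - last_flush) >= FLUSH_INTERVAL:
        for topic, msgs in buffers.items():
            if msgs:
                flush_to_s3(topic, msgs)
        consumer.commit()              # at-least-once: commit only after the PUT succeeded
        buffers.clear()
        last_flush = time.monotonic()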
Bronze Layer — Landing Zone · MinIO S3-Compatible · Hive Partitioned
MinIO RELEASE.2024-04-18 · data-lake bucket · 4-drive erasure coding · :9000 API · :9001 Console
bronze/events/event_date=YYYY-MM-DD/
part-{ts}-{uuid8}.parquet · snappy
bronze/orders/order_date=YYYY-MM-DD/
part-{ts}-{uuid8}.parquet · snappy
bronze/restaurants/snapshot_date=YYYY-MM-DD/
part-{ts}-{uuid8}.parquet · snappy
bronze/users/snapshot_date=YYYY-MM-DD/
part-{ts}-{uuid8}.parquet · snappy
Orchestration Layer — Apache Airflow 2.9.2 · CeleryExecutor
🌊
Apache Airflow 2.9.2 CeleryExecutor
DAG Factory @ 06:00 UTC daily · Partition State Machine · Late-data detection · Auto-retry failed partitions
data_pipeline_events · data_pipeline_orders · Partition State Machine · SLA monitoring · Slack failure alerts · SparkSubmitOperator · S3KeySensor
webserver · scheduler · worker · triggerer
🗃️
Metadata Store
PostgreSQL (Airflow DB) · Delta Lake (partition_meta)
partition_state · lineage · run_log
Redis 7
Celery broker · 1 GB · allkeys-lru
task queue · AOF
redis
Processing Layer — Apache Spark 3.5.1 · Delta Lake 3.2.0
spark-master · spark-worker-1 · spark-worker-2 · hadoop-aws · Delta Lake 3.2.0 · Snowflake connector
bronze_to_silver.py
Schema validation · type cast · from_json nested items · SHA-256 record_hash · CDC latest-record Window · to_date partition key
row_number() over CDC key · desc(updated_at) · from_json items schema
silver_to_gold.py
Business aggregations · ACID Delta MERGE · idempotent · late-data reprocess · DeltaTable.forPath
MERGE INTO · idempotent · partition prune
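A minimal PySpark sketch of the two pieces named above for the orders domain, assuming an existing SparkSession (spark) with Delta enabled; total_amount and the aggregation columns are illustrative:

from delta.tables import DeltaTable
from pyspark.sql import Window
from pyspark.sql.functions import col, count, desc, row_number, to_date
from pyspark.sql.functions import sum as spark_sum

# bronze_to_silver: keep only the latest CDC record per order_id
bronze_df = spark.read.parquet("s3a://data-lake/bronze/orders/order_date=2024-01-15/")
w = Window.partitionBy("order_id").orderBy(desc("updated_at"))
silver_df = (bronze_df
    .withColumn("order_date", to_date(col("updated_at")))
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") == 1)
    .drop("rn"))

# silver_to_gold: aggregate, then idempotent ACID MERGE into the Gold Delta table
daily_agg = (silver_df.groupBy("order_date")
    .agg(count("*").alias("order_count"),
         spark_sum("total_amount").alias("revenue")))

gold = DeltaTable.forPath(spark, "s3a://data-lake/gold/orders")
(gold.alias("t")
    .merge(daily_agg.alias("s"), "t.order_date = s.order_date")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())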
Medallion Architecture — Delta Lake · ACID Transactions
🥈
Silver Layer
Cleaned · Deduplicated · Typed · Hive-partitioned
Delta Lake ACID · events_silver · orders_silver · record_hash
silver/events/event_date=...
silver/orders/order_date=...
🥇
Gold Layer
Aggregated · Business-ready · ACID MERGE
Delta Lake ACID · events_gold · orders_gold · idempotent MERGE
gold/events/event_date=...
gold/orders/order_date=...
Data Warehouse — Snowflake
❄️
Snowflake Data Warehouse
Staging MERGE → Gold schema · fully idempotent loads · consumed by BI teams and analysts
MERGE INTO staging · events schema · orders schema · no duplicates · BI ready
🐳
Docker Compose v2
23 services · single host
🌐
4 Isolated Networks
backend · storage · messaging · processing
🔒
Security
SASL · no-new-privileges · .env secrets
♻️
Exactly-Once
Late-data · Backfill · State machine
📊
Observability
Airflow UI · Spark UI · MinIO · Flower
🔌
Plugin Architecture
New domain = 1 YAML + 1 plugin pkg
☁️
GCP Phase 3 — Scalable Production
GKE Autopilot · GCS · Cloud SQL · Secret Manager · Workload Identity · spark-on-k8s-operator · ArgoCD GitOps · Terraform IaC
Auto-scale pods · GitOps (ArgoCD) · Zero static workers · Workload Identity (no JSON keys) · Zero pipeline code change
Ingestion Layer GKE Pods
🖱️
Clickstream Deployment
replicas:1 · HPA cpu>70% · same image
extraction:latest
🛍️
Orders CDC Deployment
PVC for watermark · Cloud SQL source
PersistentVolumeClaim
🍽️
Restaurants CronJob
schedule: "0 * * * *"
hourly
👥
Users CronJob
schedule: "0 6 * * *"
daily
🗄️
Cloud SQL
Postgres 16 · HA · private IP · auto-backup · replaces local PG
Messaging GKE StatefulSet — 3 replicas
Apache Kafka 3.7.0 · KRaft · StatefulSet 3 replicas · PVC per broker · or swap to Confluent Cloud (bootstrap URL change only)
raw-events · 6 partitions · RF=3 · 7d retention
raw-orders · 6 partitions · RF=3 · 7d retention
raw-restaurants · 3 partitions · RF=3
raw-users · 3 partitions · RF=3
dlq-raw-events · DLQ · 30d retention
dlq-raw-orders · DLQ · 30d retention
📋
Schema Registry Deployment
ClusterIP service · FULL_TRANSITIVE · rf=3
Sink → GCS Bronze GKE Deployment
🔄
S3 Sink Consumer → GCS zero code change
Same Python code · S3_ENDPOINT_URL removed · AWS creds = GCS HMAC key or Workload Identity
Deployment replicas:1 · Workload Identity · GCS bucket: data-lake-{project}
Bronze Layer Google Cloud Storage
gs://data-lake-{project}/ · Versioning · Lifecycle: Bronze >90d → Nearline · Silver >180d → Coldline · Gold >365d → Archive
bronze/events/event_date=YYYY-MM-DD/
bronze/orders/order_date=YYYY-MM-DD/
bronze/restaurants/snapshot_date=YYYY-MM-DD/
bronze/users/snapshot_date=YYYY-MM-DD/
Orchestration GKE · KubernetesExecutor
🌊
Airflow 2.9.2 KubernetesExecutor
Helm chart · webserver + scheduler · one GKE pod per task · no Celery workers · GCS remote logs
Helm chart · pod-per-task · no Redis needed · ArgoCD sync
🔐
Secret Manager
Replaces .env · auto-rotation · external-secrets-operator
Workload Identity · no JSON keys
🚀
ArgoCD
GitOps · git push → cluster sync · Helm values
Processing spark-on-k8s-operator · SparkApplication CRD
Driver: general GKE pool · Executors: Spark node pool (NO_SCHEDULE taint) · auto-scale 2→10 pods
bronze_to_silver.py SparkApplication
GCS connector · Workload Identity · same script logic
silver_to_gold.py SparkApplication
Delta Lake on GCS · same ACID MERGE logic
Medallion Architecture Delta Lake on GCS
🥈
Silver GCS
gs://data-lake-{project}/silver/ · same Delta format
GCS connector · Workload Identity
🥇
Gold GCS
gs://data-lake-{project}/gold/ · Coldline archive >365d
lifecycle archival · ACID MERGE
Warehouse (unchanged)
❄️
Snowflake GCS external stage replaces S3 stage
Same MERGE logic · GCS external stage · zero pipeline code change
MERGE idempotent · stage URL change only · BI / Analysts
Component | Current (Docker Compose) | GCP Phase 3 | Code change?
Runtime | Docker Compose · single host | GKE Autopilot | No
Object storage | MinIO (S3-compatible) | Google Cloud Storage | Env var only
Secrets | .env file | GCP Secret Manager | K8s secret sync
PostgreSQL | postgres:16-alpine container | Cloud SQL Postgres 16 · HA | DSN only
Airflow executor | CeleryExecutor + Redis | KubernetesExecutor · no Redis | Helm values
Spark | Standalone · 3 containers | spark-on-k8s-operator + CRD | CRD YAML only
Kafka | 2 containers · KRaft | StatefulSet 3 pods or Confluent Cloud | Bootstrap URL
Scaling | Manual docker compose scale | HPA + node auto-provisioning | No
Pipeline logic | ✓ Identical — bronze_to_silver.py, silver_to_gold.py, all DAGs, schemas unchanged | None
Follow a Single Event
From user tap on a food delivery app → queryable row in Snowflake
1
User opens the food delivery app
The app fires an app_open event. The clickstream producer generates a Python dataclass with event_id = uuid4(), event_timestamp (ISO-8601 UTC), device type, city, and session ID.
⏱ t=0ms   event generated in memory
2
Published to Kafka raw-events
The producer serialises the event to JSON and calls producer.send("raw-events", value=msg_bytes). The KafkaProducer batches and flushes to one of 6 partitions on kafka-1 or kafka-2 using the default murmur2 partitioner on user_id.
⏱ t < 100ms   in Kafka, replicated to RF=2
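A minimal sketch of steps 1–2, assuming the fields listed in step 1 plus an event_type and a user_id key (SASL settings omitted); keying by user_id is what drives the default murmur2 partitioner mentioned above:

import json, uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from kafka import KafkaProducer

@dataclass
class ClickEvent:
    event_id: str
    event_type: str
    event_timestamp: str
    user_id: str
    device: str
    city: str
    session_id: str

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

event = ClickEvent(
    event_id=str(uuid.uuid4()),
    event_type="app_open",
    event_timestamp=datetime.now(timezone.utc).isoformat(),
    user_id="user-0042", device="ios", city="Berlin",
    session_id=str(uuid.uuid4()),
)
# Keyed by user_id so all of a user's events land on the same partition.
producer.send("raw-events", key=event.user_id, value=asdict(event))
producer.flush()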
3
S3 Sink Consumer buffers the event
The sink consumer polls all subscribed topics. The event lands in an in-memory buffer keyed by topic and partition date. The buffer flushes when 1,000 messages accumulate (high-throughput) or 60 seconds elapse (low-throughput), whichever comes first. After a successful S3 PUT, the consumer manually commits the Kafka offset — guaranteeing at-least-once delivery even if the process crashes mid-flight.
⏱ t = 0–60s   buffer window
4
Parquet file written to MinIO Bronze
The consumer converts the buffer to a pandas.DataFrame, writes it as Parquet with Snappy compression via PyArrow, and PUTs the file to s3://data-lake/bronze/events/event_date=2024-01-15/part-{ts}-{uuid8}.parquet. The partition date is derived from the event_timestamp field in the message.
⏱ t ≈ 60s   file visible in MinIO Bronze
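A minimal sketch of the flush path for the events domain: buffer → DataFrame → Snappy Parquet → S3 key. The bucket and key layout follow the Bronze section above; the helper name and MinIO endpoint are illustrative:

import io, time, uuid
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3", endpoint_url="http://minio:9000")  # endpoint dropped when moving to GCS

def flush_to_s3(topic: str, messages: list[dict]) -> None:
    df = pd.DataFrame(messages)
    # The partition date comes from the event payload, not the wall clock,
    # so late-arriving messages land in the correct Bronze partition.
    df["event_date"] = pd.to_datetime(df["event_timestamp"]).dt.date.astype(str)
    domain = topic.removeprefix("raw-")          # raw-events -> events
    for event_date, part in df.groupby("event_date"):
        buf = io.BytesIO()
        pq.write_table(pa.Table.from_pandas(part), buf, compression="snappy")
        key = (f"bronze/{domain}/event_date={event_date}/"
               f"part-{int(time.time())}-{uuid.uuid4().hex[:8]}.parquet")
        s3.put_object(Bucket="data-lake", Key=key, Body=buf.getvalue())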
5
Airflow DAG fires at 06:00 UTC
The data_pipeline_events DAG starts. The Partition State Machine scans Bronze S3, discovers all partition dates with new files, computes a file-list hash for each, and compares against the partition_meta Delta table. New partitions enter state PENDING. Previously completed partitions whose file hash changed are re-marked LATE.
⏱ t ≈ 18–24h   next scheduled run (next day)
6
Spark: bronze_to_silver.py
SparkSubmitOperator submits the job. Spark reads all Parquet files for the partition date from Bronze, casts types, validates schema, parses nested fields with from_json(), computes a SHA-256 record_hash, deduplicates on event_id, and writes a Delta table partition to Silver. Partition state advances IN_FLIGHT → DONE.
⏱ +5–15 min   Silver partition written
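A minimal PySpark sketch of step 6 for the events domain, assuming an active SparkSession (spark) with Delta enabled; the nested-items schema is illustrative, and replaceWhere is one way to make the per-partition write repeatable:

from pyspark.sql.functions import col, concat_ws, from_json, sha2, to_date, to_timestamp
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

items_schema = ArrayType(StructType([
    StructField("item_id", StringType()),
    StructField("name", StringType()),
]))

silver_df = (
    spark.read.parquet("s3a://data-lake/bronze/events/event_date=2024-01-15/")
    .withColumn("event_timestamp", to_timestamp(col("event_timestamp")))
    .withColumn("event_date", to_date(col("event_timestamp")))
    .withColumn("items", from_json(col("items"), items_schema))
    .withColumn("record_hash",
                sha2(concat_ws("|", col("event_id"), col("event_type"),
                               col("event_timestamp").cast("string")), 256))
    .dropDuplicates(["event_id"])
)

(silver_df.write.format("delta").mode("overwrite")
    .option("replaceWhere", "event_date = '2024-01-15'")   # re-runs replace the same partition
    .save("s3a://data-lake/silver/events"))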
7
Spark: silver_to_gold.py
Reads the Silver Delta partition, computes business aggregations (hourly event counts, funnel conversions, revenue metrics), and writes to Gold using DeltaTable.merge(). The MERGE is fully idempotent — re-running replaces the same rows rather than duplicating them. This is critical for late-data reprocessing.
⏱ +5–10 min   Gold partition merged
8
Loaded into Snowflake
The Snowflake task reads the Gold Delta partition, inserts into a staging table, then runs MERGE INTO gold.events USING staging …. Duplicates are impossible. The load is idempotent by design — running it twice produces the same result.
⏱ +2–5 min   rows visible to BI tools
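A minimal sketch of the Snowflake load in step 8 using snowflake-connector-python. The stage name, warehouse, and column list are illustrative; only the staging-then-MERGE shape follows the description above:

import snowflake.connector

conn = snowflake.connector.connect(
    account="...", user="...", password="...",
    warehouse="PIPELINE_WH", database="ANALYTICS", schema="GOLD",
)
cur = conn.cursor()
cur.execute("CREATE TEMPORARY TABLE staging_events LIKE gold.events")
# Load the exported Gold partition into staging from an external stage.
cur.execute("""
    COPY INTO staging_events
    FROM @gold_stage/events/event_date=2024-01-15/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
# Idempotent upsert: running this twice yields the same rows, never duplicates.
cur.execute("""
    MERGE INTO gold.events AS t
    USING staging_events AS s
      ON t.event_date = s.event_date AND t.event_hour = s.event_hour
    WHEN MATCHED THEN UPDATE SET t.event_count = s.event_count
    WHEN NOT MATCHED THEN INSERT (event_date, event_hour, event_count)
      VALUES (s.event_date, s.event_hour, s.event_count)
""")
conn.commit()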
Event is queryable by analysts in Snowflake
Total end-to-end latency: ~06:25–06:35 UTC for data produced the previous day. If the event was produced after the DAG cut-off and arrives in Bronze the same day, it appears in Bronze within 60 seconds but is processed in the next morning's run — or flagged as late data and reprocessed automatically.
⏱ End-to-End Latency Summary
Stage | Latency
Clickstream → Kafka | < 100 ms
Kafka → Bronze Parquet | ≤ 60 s
Bronze → Airflow trigger | ~18–24 h (next run)
Airflow → Silver (Spark) | 5–15 min
Silver → Gold (Spark) | 5–10 min
Gold → Snowflake (MERGE) | 2–5 min
Total (happy path) | ~06:25–06:35 UTC
📊 Approximate Data Volumes
Domain | Daily volume
Events (clickstream) | ~860K events/day
Orders (CDC) | ~2,880 msgs/day
Restaurants (batch) | 24 snapshots/day
Users (batch) | 1 snapshot/day
Based on default config: 10 events/s, 30s CDC poll, 50 virtual users
🔌 Shared Framework — What Airflow Sees
Every DAG task group calls the same framework code. Domain-specific logic lives entirely in the plugin package. The orchestration layer never changes when a new domain is added.
DAG factory reads YAML · SparkSubmitOperator is generic · MetadataManager is shared · S3KeySensor path from YAML
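A minimal sketch of that factory pattern, assuming pipeline.yaml mounted at an illustrative path and ignoring sensors, metadata tasks, and alerting:

import yaml
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with open("/opt/airflow/config/pipeline.yaml") as f:
    config = yaml.safe_load(f)

for domain, cfg in config["domains"].items():
    with DAG(
        dag_id=f"data_pipeline_{domain}",
        schedule="0 6 * * *",                # 06:00 UTC daily
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        bronze_to_silver = SparkSubmitOperator(
            task_id="bronze_to_silver",
            application="/opt/spark/jobs/bronze_to_silver.py",
            application_args=["--domain", domain,
                              "--bronze-prefix", cfg["bronze"]["prefix"]],
            conn_id="spark_default",
        )
        silver_to_gold = SparkSubmitOperator(
            task_id="silver_to_gold",
            application="/opt/spark/jobs/silver_to_gold.py",
            application_args=["--domain", domain],
            conn_id="spark_default",
        )
        bronze_to_silver >> silver_to_gold
    globals()[dag.dag_id] = dag              # register the generated DAG with Airflow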
🔌 Domain Plugin Architecture
Adding a new data domain (e.g. drivers, payments, reviews) requires zero changes to the shared framework. You provide two things — a YAML config and a Python plugin package — and the platform handles the rest.
① YAML Config
One file per domain in airflow/config/pipeline.yaml
# pipeline.yaml (excerpt)
domains:
  drivers:                    # new domain
    bronze:
      bucket: data-lake
      prefix: bronze/drivers
      partition_key: event_date
      timestamp_col: event_ts
    silver:
      prefix: silver/drivers
      primary_key: driver_id
    gold:
      prefix: gold/drivers
+
② Plugin Package
Two classes in spark/domains/drivers/
# transform.py
class Transform(BaseDomainTransform):
    def transform(self, df, cfg):
        return (df
            .withColumn("event_date", to_date(col("event_ts")))
            .dropDuplicates(["driver_id"])
            .withColumn("ingested_at", current_timestamp()))

# aggregations.py
class Aggregation(BaseDomainAggregation):
    def aggregate(self, df, cfg, dt):
        return (df
            .groupBy("event_date", "city")
            .agg(count("*").alias("trips")))
What the Platform auto-creates
No framework changes needed:
# Airflow DAG factory (reads YAML)
data_pipeline_drivers            # new DAG, auto-generated

# Kafka topic (kafka-init)
raw-drivers                      # add to the init script

# S3 paths (auto from YAML)
bronze/drivers/  silver/drivers/  gold/drivers/

# Metadata tracking (auto by MetadataManager)
domain=drivers rows in the partition_meta Delta table

# Snowflake schema
# add a MERGE task to the DAG
Partition Lifecycle & Metadata Tracking
Every (domain, partition_date) pair is a tracked entity. The state machine ensures exactly-once processing, automatic late-data correction, and safe retries.
🔄 Partition State Machine
Happy path →
no files yet → UNDISCOVERED
files found in Bronze → PENDING
Spark job running → IN_FLIGHT
Silver + Gold written → DONE ✓
Failure →
Spark job running → IN_FLIGHT
Spark error / timeout → FAILED
next Airflow run picks it up → RETRYING
re-submitted → IN_FLIGHT → …
max_retries exceeded → DEAD (manual)
Late data →
previously completed → DONE
new files arrived in Bronze → LATE ⚠
re-queued for next run → REPROCESSING
re-submitted → IN_FLIGHT
updated aggregations → DONE ✓
States: UNDISCOVERED · PENDING · IN_FLIGHT · DONE · FAILED · RETRYING · LATE · REPROCESSING · DEAD (max retries hit)
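The same transitions written as a small lookup table, the kind of guard a MetadataManager-style class might enforce (a sketch; state names follow the paths above):

ALLOWED_TRANSITIONS = {
    "UNDISCOVERED": {"PENDING"},
    "PENDING":      {"IN_FLIGHT"},
    "IN_FLIGHT":    {"DONE", "FAILED"},
    "FAILED":       {"RETRYING", "DEAD"},
    "RETRYING":     {"IN_FLIGHT"},
    "DONE":         {"LATE"},             # new Bronze files detected after completion
    "LATE":         {"REPROCESSING"},
    "REPROCESSING": {"IN_FLIGHT"},
    "DEAD":         set(),                # manual intervention only
}

def transition(current: str, new: str) -> str:
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new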
🔍 Late Data Detection Algorithm
Runs at the start of every Airflow DAG run, before any Spark jobs are submitted. No manual intervention required.
1
Scan Bronze S3 — list all partition paths for the domain: bronze/events/event_date=*/
2
Compute file hash — for each partition date, sort the list of Parquet file keys and hash them: md5(sorted(file_keys))
3
Query partition_meta — fetch all rows where domain = "events" and status = "DONE"
4
Compare hashes — if computed_hash ≠ stored file_hash for a completed partition, new files have arrived since it was last processed
5
Mark LATE — update partition_meta.status = "LATE", store new file_hash, log detected delta
6
Build run queue — combine PENDING (new) + LATE + RETRYING partitions into a single ordered list
7
Process all — each partition in the queue gets its own Spark Bronze→Silver→Gold job in the same DAG run. No separate backfill DAG needed.
Why this works: Bronze is append-only and immutable. Files are never deleted or overwritten. Late-arriving Kafka messages that cross a date boundary are written to the correct Bronze partition date by the sink consumer. The next Airflow run detects the new files and reprocesses. No data is lost.
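A minimal sketch of steps 1–6 of the detection algorithm, assuming boto3 against the MinIO endpoint and a dict of partition_meta rows already loaded from the Delta table; helper names are illustrative:

import hashlib
import boto3

s3 = boto3.client("s3", endpoint_url="http://minio:9000")

def scan_bronze(domain: str) -> dict[str, str]:
    """Return {partition_date: file_hash} for every Bronze partition of a domain."""
    paginator = s3.get_paginator("list_objects_v2")
    files_by_date: dict[str, list[str]] = {}
    for page in paginator.paginate(Bucket="data-lake", Prefix=f"bronze/{domain}/"):
        for obj in page.get("Contents", []):
            # key: bronze/events/event_date=2024-01-15/part-...parquet
            date = obj["Key"].split("=")[1].split("/")[0]
            files_by_date.setdefault(date, []).append(obj["Key"])
    return {d: hashlib.md5("".join(sorted(keys)).encode()).hexdigest()
            for d, keys in files_by_date.items()}

def build_run_queue(domain: str, meta_rows: dict[str, dict]) -> list[str]:
    """meta_rows: {partition_date: {"status": ..., "file_hash": ...}} from partition_meta."""
    queue = []
    for date, file_hash in sorted(scan_bronze(domain).items()):
        row = meta_rows.get(date)
        if row is None:
            queue.append(date)                                  # PENDING: never seen before
        elif row["status"] in ("PENDING", "RETRYING"):
            queue.append(date)
        elif row["status"] == "DONE" and row["file_hash"] != file_hash:
            queue.append(date)                                  # LATE: new files since last run
    return queue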
🗂 partition_meta Delta Table Schema
Stored as a Delta Lake table in MinIO/GCS. Single source of truth for partition lineage, retry logic, and idempotency checks. Path: s3://pipeline-metadata/partition_meta/
Column | Type | Description
domain | string | "events", "orders", "restaurants", "users"
partition_date | date | The calendar date this partition covers
status | string | Current state machine state (see diagram above)
run_id | string | Airflow DAG run ID that last touched this row
record_count | bigint | Records written to Silver on the last successful run
source_files | array<string> | Bronze Parquet file paths included in the last run
file_hash | string | md5(sorted(source_files)); the late-data detection key
silver_records | bigint | Deduplicated rows in the Silver output
gold_records | bigint | Rows merged into the Gold output
retry_count | int | Number of FAILED → RETRYING cycles
error_message | string | Last Spark error stacktrace (null on success)
created_at | timestamp | When the partition was first discovered
updated_at | timestamp | Last state-transition timestamp
processed_at | timestamp | When the most recent DONE state was reached
silver_path | string | S3 path of the Silver output (for lineage)
gold_path | string | S3 path of the Gold output (for lineage)
🛡 Idempotency Guarantees Per Layer
Bronze Layer At-Least-Once
Bronze is append-only. Duplicate Parquet files can exist if the sink consumer crashes between a successful S3 PUT and a Kafka commit. This is by design — duplicates are resolved in Silver. Bronze stores raw truth; it is never cleaned or overwritten.
duplicates possible · immutable files · append-only
Silver Layer Exactly-Once
Event data: deduplication on event_id (UUID4). CDC data (orders): Window function row_number() OVER (PARTITION BY order_id ORDER BY updated_at DESC) keeps only the latest status per order. A record_hash (SHA-256) detects schema-level corruption. Re-running Silver for the same partition always produces the same output.
deduplicated · latest CDC record kept · safe to rerun
Gold + Snowflake Idempotent MERGE
Both Gold (Delta Lake) and Snowflake use MERGE INTO … WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT. Running the same partition twice replaces rows rather than duplicating them. This is the key property that makes late-data reprocessing safe — previously computed aggregations are simply overwritten with corrected values.
MERGE idempotent · safe late-data overwrite · no manual cleanup
🔁 How to Trigger a Backfill
To reprocess a historical date range (e.g. after a pipeline bug fix), delete or update the rows in partition_meta. The next Airflow run will re-discover the partitions as PENDING and reprocess them.
Option 1: SQL on partition_meta
-- Mark a range as PENDING for reprocessing
UPDATE partition_meta
SET status = 'PENDING', retry_count = 0
WHERE domain = 'events'
  AND partition_date BETWEEN '2024-01-10' AND '2024-01-15';
-- The next DAG run at 06:00 UTC picks them up automatically
Option 2: Airflow UI
# 1. Open the Airflow UI → data_pipeline_events DAG
# 2. Click the failed / target task group
# 3. Select "Clear" → confirm
# 4. Airflow re-queues the task
#    MetadataManager resets the state to PENDING
#    The Spark job re-runs against the same Bronze data
OPERATIONS GUIDE

Run, monitor, and recover the platform

Service URLs, startup commands, daily cadence, common Kafka/Airflow ops, and failure runbooks — everything on-call needs in one place.

Service Endpoints
Airflow UI PUBLIC
http://localhost:8080
DAG management, task logs, trigger backfills, monitor runs.
Login: AIRFLOW_ADMIN_USER / AIRFLOW_ADMIN_PASSWORD
airflow-webserverCeleryExecutor
MinIO Console PUBLIC
http://localhost:9001
Browse Bronze / Silver / Gold buckets, inspect Parquet files, manage lifecycle policies.
S3 API: :9000
platform-minioerasure 4-drive
Spark Master UI LOOPBACK
http://localhost:18080
Active jobs, worker status, executor memory, completed application history.
Submit endpoint: spark://spark-master:7077
spark-master127.0.0.1 only
Spark Workers LOOPBACK
Worker-1: :8082  |  Worker-2: :8083
Per-worker executor logs, running tasks, memory / core allocation. Each worker: up to 6g RAM, 4 cores (configurable via .env).
spark-worker-1spark-worker-2
Celery Flower LOOPBACK
http://localhost:5555
Real-time Celery worker monitoring: task rates, queue depths, worker heartbeats, task history per worker.
airflow-flowerRedis broker
Schema Registry LOOPBACK
http://localhost:8081
REST API to list/inspect Avro, Protobuf, JSON schemas. Compatibility mode: FULL_TRANSITIVE.
schema-registryConfluent 7.6.1
PostgreSQL LOOPBACK
localhost:5432
Databases: airflow (Airflow meta), pipeline_db (domain tables — orders, users, restaurants).
platform-postgres16-alpine
Quick Start
First-time setup
# 1. Copy the env template and fill in secrets
cp .env.example .env
# vim .env — set AIRFLOW_FERNET_KEY, passwords, etc.

# 2. Build custom images (Airflow + Spark + Extraction)
docker compose build

# 3. Start everything (init containers run automatically)
docker compose up -d

# 4. Watch init containers complete
docker compose logs -f airflow-init kafka-init minio-init

# 5. Verify all services are healthy
docker compose ps
Day-to-day operations
# Restart a single service after a code change
docker compose restart airflow-scheduler

# Rebuild the extraction layer after a code change
docker compose build clickstream-producer
docker compose up -d --no-deps clickstream-producer

# Tail logs from any service
docker compose logs -f --tail=100 airflow-worker

# Stop everything (preserves volumes / data)
docker compose down

# Full teardown including volumes (DESTRUCTIVE)
docker compose down -v
Daily Pipeline Cadence (UTC)
Every day the pipeline runs a coordinated sequence. All times UTC. Extraction layer (Kafka producers) runs continuously — the Airflow DAG processes the accumulated Bronze data at 06:00.
Time (UTC) | Event | Service | What happens
00:00 → | Clickstream events flowing | clickstream-producer | ~10 events/s continuously written to the raw-events topic
00:00 → | Order CDC polling | orders-cdc-producer | Every 30 s: reads orders WHERE updated_at > watermark, publishes to raw-orders
hourly | Restaurant catalog snapshot | restaurants-batch-producer | Full join of restaurants + menu_items published to raw-restaurants
daily | User profile extract | users-batch-producer | Incremental extract of changed users (watermark on updated_at) → raw-users
continuous | S3 sink flush | s3-sink-consumer | Every 60 s or 1,000 messages: flushes buffered Parquet files to s3://data-lake/bronze/
~05:55 | Airflow scheduler fires | airflow-scheduler | Evaluates the DAG schedule, queues data_pipeline_events task instances
06:00 | Bronze → Silver (Spark) | airflow-worker · spark-worker-1/2 | SparkSubmitOperator runs bronze_to_silver.py per domain; deduplication, typing, Delta merge
~06:10 | Silver → Gold (Spark) | spark-worker-1/2 | Aggregations run after Silver succeeds; writes daily metrics to Gold Delta tables
~06:20 | Gold → Snowflake (optional) | airflow-worker | Snowflake MERGE via silver_to_gold.py; idempotent, safe to re-run
~06:30 | Late-data detection | airflow-scheduler | On the next run: Bronze file hashes compared to partition_meta; DONE partitions with new files → LATE → reprocess
Common Kafka Operations
Shell into a broker first
# Exec into kafka-1 and write the client config
docker exec -it platform-kafka-1 bash

# Inside the container: env vars are available directly
cat > /tmp/client.properties <<EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="${KAFKA_CLIENT_USER}" \
  password="${KAFKA_CLIENT_PASSWORD}";
EOF

# Verify credentials are set
echo "User: $KAFKA_CLIENT_USER"
List topics & consumer lag
# List all topics
/opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --list

# Describe a topic (partition leaders, ISR)
/opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --describe --topic raw-events

# Check consumer group lag
/opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --describe --group s3-sink-consumer
Peek at topic messages
# Consume 5 messages from the beginning of raw-events
/opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --topic raw-events \
  --from-beginning \
  --max-messages 5

# Peek at the dead-letter queue
/opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --topic dlq-raw-events \
  --from-beginning \
  --max-messages 10
Reset consumer offset (backfill)
# Reset s3-sink-consumer to the beginning (re-sink all data)
# Stop the consumer first!
docker compose stop s3-sink-consumer

/opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --group s3-sink-consumer \
  --reset-offsets \
  --to-earliest \
  --topic raw-events \
  --execute

docker compose start s3-sink-consumer

# Reset to a specific datetime instead:
# --to-datetime 2024-01-15T06:00:00.000
Failure Runbooks
KAFKA BROKER UNHEALTHY exit 127 or healthcheck timeout
Symptoms
  • docker compose ps shows kafka-1/2 unhealthy
  • kafka-init stuck in "not ready" retry loop
  • airflow-worker can't produce to Kafka topics
Resolution
# Check healthcheck logs
docker inspect --format \
  "{{json .State.Health.Log}}" \
  platform-kafka-1 | python3 -m json.tool

# Recreate to pick up the latest image/config
docker compose up -d --force-recreate \
  kafka-1 kafka-2
AIRFLOW TASK FAILED Spark job error or Bronze data missing
Diagnosis steps
  • Airflow UI → failed task → View Log
  • Spark UI (localhost:18080) → failed application → Stderr
  • Check MinIO for Bronze Parquet files (did s3-sink-consumer flush?)
  • Check partition_meta table for FAILED state
Re-run a specific DAG run
# Via the Airflow UI
# Grid view → failed run → Clear

# Via the CLI (inside airflow-webserver)
docker exec -it platform-airflow-webserver \
  airflow dags trigger \
  data_pipeline_events \
  --exec-date 2024-01-15T06:00:00+00:00 \
  --conf '{"domains":["events"]}'
S3 SINK LAG / STALE BRONZE Consumer group falling behind or no Parquet files appearing
Check lag
# Check consumer group lag
docker exec -it platform-kafka-1 \
  /opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server kafka-1:9092 \
  --command-config /tmp/client.properties \
  --describe --group s3-sink-consumer

# Check consumer logs
docker compose logs --tail=50 s3-sink-consumer
Resolution
# Restart the consumer
docker compose restart s3-sink-consumer

# Scale up if lag is high
# (production: split into per-topic consumers;
#  see the s3_sink_consumer.py header for the pattern)

# Force a flush: restart with buffer=1
SINK_FLUSH_BUFFER_SIZE=1 \
  docker compose up -d s3-sink-consumer
POSTGRES / CDC PRODUCER ERROR psycopg2 connection refused or watermark drift
Check DB health
# Check Postgres is accepting connections
docker exec -it platform-postgres \
  pg_isready -U $POSTGRES_USER -d pipeline_db

# List domain tables
docker exec -it platform-postgres \
  psql -U $POSTGRES_USER -d pipeline_db \
  -c "\dt"
Reset CDC watermark
# Reset the orders watermark to replay all orders
docker exec platform-orders-cdc-producer \
  sh -c "rm -f /var/lib/extraction/watermarks/orders_cdc.txt"
docker compose restart orders-cdc-producer

# Reset the users watermark
docker exec platform-users-batch-producer \
  sh -c "rm -f /var/lib/extraction/watermarks/users_batch.txt"
docker compose restart users-batch-producer
REFERENCE

Technology stack, networks, security & glossary

Versions, network topology, security model, glossary of every term used in this platform, and the GCP migration roadmap.

Technology Stack
Component | Image / Version | Role | Notable config
PostgreSQL | 16-alpine | Airflow metadata DB + pipeline domain DB (orders, users, restaurants) | PGDATA=/var/lib/postgresql/data/pgdata
Redis | 7-alpine | Celery broker & result backend; Airflow task queue | maxmemory 1gb, LRU eviction, AOF persistence
MinIO | 2024-04-18T19-09-19Z | S3-compatible object store; Bronze / Silver / Gold / metadata buckets | Erasure coding across /data{1...4}; survives 2 drive read failures
Apache Kafka | apache/kafka:3.7.0 | Event streaming bus; 10 topics, KRaft mode (no ZooKeeper) | SASL_PLAINTEXT, 2-broker dev / 3-broker prod, 2 GB JVM heap
Confluent Schema Registry | cp-schema-registry:7.6.1 | Avro / Protobuf / JSON schema enforcement; FULL_TRANSITIVE compatibility | SASL auth to Kafka, RF=2 internal topic
Apache Spark | apache/spark:3.5.1 | Bronze→Silver dedup/typing, Silver→Gold aggregations, Snowflake MERGE | Delta Lake 3.2.0, hadoop-aws, Kafka connector, Snowflake connector
Apache Airflow | 2.9.2 | DAG orchestration; CeleryExecutor; S3 remote logs; SparkSubmitOperator | 4 workers (AIRFLOW__CELERY__WORKER_CONCURRENCY=8), 4 webserver workers
Extraction layer | data-platform/extraction:latest | Python producers (clickstream, CDC, batch) + S3 sink consumer | kafka-python, boto3, psycopg2-binary, faker, pyarrow, pydantic
Delta Lake | 3.2.0 | ACID table format for Silver and Gold; MERGE for idempotent upserts | Snappy compression, Z-order on partition date columns
Network Topology
BACKEND-NET
platform-backend  |  172.20.0.0/24
Internal services: auth, queuing, task state.
postgres · redis · airflow-webserver · airflow-scheduler · airflow-worker · airflow-triggerer · airflow-flower
STORAGE-NET
platform-storage  |  172.20.1.0/24
Object store access: read/write Bronze, Silver, Gold, metadata, Spark event logs.
minio · minio-init · spark-master · spark-worker-1/2 · airflow-worker · s3-sink-consumer
MESSAGING-NET
platform-messaging  |  172.20.2.0/24
Kafka cluster communication; producers publish, consumer sinks read.
kafka-1 · kafka-2 · schema-registry · kafka-init · clickstream-producer · orders-cdc-producer · restaurants-batch-producer · users-batch-producer · s3-sink-consumer · airflow-worker · spark-worker-1/2
PROCESSING-NET
platform-processing  |  172.20.3.0/24
Spark cluster internal communication and job submission from Airflow worker.
spark-master · spark-worker-1 · spark-worker-2 · airflow-worker
NETWORK ISOLATION NOTE
airflow-worker is the only service bridging all four networks — it's the integration hub. Spark workers are on processing-net + storage-net + messaging-net (need Kafka for streaming jobs). Producers are messaging-net only. The Postgres and Redis instances are reachable only from backend-net, keeping them isolated from untrusted networks.
Security Model
Kafka Authentication
  • Protocol: SASL_PLAINTEXT
  • Mechanism: PLAIN
  • Inter-broker: separate credentials from client
  • JAAS file generated at container start from env vars — never hardcoded in image
  • Every CLI command requires --command-config
Container Hardening
  • All containers: no-new-privileges:true
  • Airflow: runs as UID 50000 (non-root)
  • Spark: runs as user spark (non-root)
  • Extraction: non-root Python service user
  • Postgres / Redis: official images, non-root by default
  • No secrets in image layers — all via .env
Network Exposure
  • Airflow UI (:8080) and MinIO Console (:9001) — bound to 0.0.0.0
  • All other ports: 127.0.0.1 loopback only
  • MinIO service account (MINIO_PIPELINE_USER) has least-privilege bucket policy
  • Airflow connections store credentials in encrypted Postgres column (Fernet key)
  • Docker bridge networks: services can only reach peers on shared network
Glossary
Bronze Layer
Raw, unmodified data as produced by the S3 sink consumer. Parquet files partitioned by date in s3://data-lake/bronze/{domain}/. At-least-once: may contain duplicates. Never modified after write.
Silver Layer
Deduplicated, typed, validated records in Delta Lake format. Bronze duplicates removed using event_id / CDC watermark window. Each domain has a Silver table managed by a domain plugin.
Gold Layer
Business-level aggregations: daily revenue, order counts, funnel metrics. Written as Delta Lake tables. MERGE-based upsert makes re-runs idempotent. Final destination before Snowflake.
CDC (Change Data Capture)
Technique for capturing row-level changes in a source database. Here implemented as micro-batch polling: SELECT … WHERE updated_at > watermark every 30s, not log-based CDC.
Watermark
A persisted checkpoint timestamp (stored in a Docker volume). Producers read the watermark on startup to resume from where they left off, avoiding re-processing all historical records.
KRaft
Kafka Raft — the built-in consensus protocol replacing ZooKeeper (available since Kafka 3.3). Brokers manage their own metadata via a Raft quorum; eliminates the ZooKeeper dependency entirely.
DLQ (Dead-Letter Queue)
A Kafka topic where unprocessable messages are routed instead of being discarded. Examples: dlq-raw-events, dlq-raw-orders. 30-day retention for manual triage.
Medallion Architecture
A data lakehouse design pattern with three layers: Bronze (raw), Silver (clean), Gold (aggregated). Each layer adds quality and semantic meaning. Coined by Databricks, implemented here with Delta Lake.
Partition State Machine
A per-(domain, date) state tracker in the partition_meta Delta table. States: UNDISCOVERED → PENDING → IN_FLIGHT → DONE, with FAILED and LATE paths. Enables idempotent re-runs and late-data detection.
CeleryExecutor
Airflow executor that distributes tasks to Celery workers via Redis. Enables horizontal scaling: add more airflow-worker replicas to increase concurrency without changing the scheduler.
Delta Lake MERGE
MERGE INTO target USING source ON key WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT — a single atomic SQL statement that upserts rows. Makes re-processing late data safe with no duplicates.
At-least-once / Exactly-once
Delivery guarantees. Bronze is at-least-once (Kafka commit after S3 PUT — may duplicate on crash). Silver achieves exactly-once via deduplication on event_id. Gold/Snowflake via idempotent MERGE.
GCP Migration Phases
Phase 1
CURRENT
Local Docker Stack
All services on a single machine (or small VM). KRaft Kafka 2-broker dev cluster, MinIO 4-drive erasure coding, Spark standalone, Airflow CeleryExecutor. Full pipeline parity — no cloud required. Cost: ~$0/month local, ~$100-300/month on a GCP e2-standard-8.
running now · Docker Compose · MinIO · Kafka KRaft · Spark standalone
Phase 2
HYBRID
Managed Kafka + GCS Storage
Replace self-managed Kafka with Confluent Cloud or Google Pub/Sub. Replace MinIO with Google Cloud Storage. Airflow + Spark remain on GCE VMs (Compute Engine) or GKE. Producers/consumers require minimal code changes — swap bootstrap server and S3 endpoint env vars. Airflow connections updated to GCS.
Confluent Cloud / Pub/Sub · GCS · GCE / GKE · Spark standalone → Dataproc
Phase 3
FULL GCP
Fully Managed Cloud-Native
All components replaced with GCP-managed equivalents. Cloud Composer 2 (managed Airflow on GKE Autopilot). Dataproc Serverless for Spark jobs (no cluster management, pay-per-job). BigQuery replaces Snowflake (or alongside it). Pub/Sub + Dataflow for streaming. GCS as the unified data lake. IAM service accounts replace username/password secrets.
Cloud Composer 2 · Dataproc Serverless · BigQuery · Pub/Sub · Dataflow · GCS · IAM Workload Identity
Migration Comparison
Component | Phase 1 (Current) | Phase 2 | Phase 3
Event bus | Kafka KRaft (Docker) | Confluent Cloud / Pub/Sub | Pub/Sub + Dataflow
Object store | MinIO (Docker) | Google Cloud Storage | GCS
Processing | Spark standalone (Docker) | Dataproc on GCE | Dataproc Serverless
Orchestration | Airflow CeleryExecutor | Airflow on GCE/GKE | Cloud Composer 2
Analytics DB | Snowflake (external) | Snowflake on GCP | BigQuery
Secrets | .env + Fernet | Secret Manager | IAM Workload Identity
Approx. cost | $0 local / ~$200/mo VM | ~$500–1,200/mo | ~$800–2,000/mo