| Component | Current (Docker Compose) | GCP Phase 3 | Code change? |
|---|---|---|---|
| Runtime | Docker Compose · single host | GKE Autopilot | No |
| Object Storage | MinIO (S3-compatible) | Google Cloud Storage | Env var only |
| Secrets | .env file | GCP Secret Manager | K8s secret sync |
| PostgreSQL | postgres:16-alpine container | Cloud SQL Postgres 16 · HA | DSN only |
| Airflow Executor | CeleryExecutor + Redis | KubernetesExecutor · no Redis | Helm values |
| Spark | Standalone 3 containers | spark-on-k8s-operator + CRD | CRD YAML only |
| Kafka | 2 containers · KRaft | StatefulSet 3 pods or Confluent Cloud | Bootstrap URL |
| Scaling | Manual docker compose scale | HPA + node auto-provisioning | No |
| Pipeline Logic | bronze_to_silver.py, silver_to_gold.py, all DAGs, schemas | ✓ Identical — unchanged | None |
1. A user action in the app produces an `app_open` event. The clickstream producer builds a Python dataclass with `event_id = uuid4()`, `event_timestamp` (ISO-8601 UTC), device type, city, and session ID.
2. The message is published to the `raw-events` topic via `producer.send("raw-events", value=msg_bytes)`. The KafkaProducer batches and flushes to one of 6 partitions on `kafka-1` or `kafka-2`, using the default murmur2 partitioner on `user_id`.
3. A Bronze writer collects a batch of messages into a `pandas.DataFrame`, writes it as Parquet with Snappy compression via PyArrow, and PUTs the file to `s3://data-lake/bronze/events/event_date=2024-01-15/part-{ts}-{uuid8}.parquet`. The partition date is derived from the `event_timestamp` field in the message.
4. At the next scheduled run the `data_pipeline_events` DAG starts. The Partition State Machine scans Bronze S3, discovers all partition dates with new files, computes a file-list hash for each, and compares it against the `partition_meta` Delta table. New partitions enter state `PENDING`. Previously completed partitions whose file hash changed are re-marked `LATE`.
5. Spark parses the raw events with `from_json()`, computes a SHA-256 `record_hash`, deduplicates on `event_id`, and writes a Delta table partition to Silver. Partition state advances `IN_FLIGHT` → `DONE`.
6. Gold aggregations are written with `DeltaTable.merge()`. The MERGE is fully idempotent — re-running replaces the same rows rather than duplicating them. This is critical for late-data reprocessing.
7. Snowflake is loaded with `MERGE INTO gold.events USING staging …`. Duplicates are impossible. The load is idempotent by design — running it twice produces the same result.

| Stage | Latency |
|---|---|
| Clickstream → Kafka | < 100ms |
| Kafka → Bronze Parquet | ≤ 60s |
| Bronze → Airflow trigger | ~18–24h (next run) |
| Airflow → Silver (Spark) | 5–15 min |
| Silver → Gold (Spark) | 5–10 min |
| Gold → Snowflake (MERGE) | 2–5 min |
| Total (happy path) | complete by ~06:25–06:35 UTC |
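The producer side of this flow can be sketched as follows. The dataclass fields and topic name come from the description above; the helper names and exact JSON serialization are illustrative assumptions:

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class ClickEvent:
    # Fields described for the clickstream producer.
    event_id: str         # uuid4()
    event_timestamp: str  # ISO-8601 UTC
    user_id: str          # partition key for the murmur2 partitioner
    device_type: str
    city: str
    session_id: str
    event_type: str = "app_open"

def make_event(user_id: str, device_type: str, city: str, session_id: str) -> ClickEvent:
    return ClickEvent(
        event_id=str(uuid.uuid4()),
        event_timestamp=datetime.now(timezone.utc).isoformat(),
        user_id=user_id,
        device_type=device_type,
        city=city,
        session_id=session_id,
    )

def serialize(event: ClickEvent) -> bytes:
    # These bytes are what producer.send("raw-events", value=msg_bytes)
    # would carry on the wire.
    return json.dumps(asdict(event)).encode("utf-8")
```

With kafka-python, sending would look roughly like `producer.send("raw-events", key=event.user_id.encode(), value=serialize(event))`, with batching and flushing left to the client.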
| Domain | Daily volume |
|---|---|
| Events (clickstream) | ~860K events/day |
| Orders (CDC) | ~2,880 msgs/day |
| Restaurants (batch) | 24 snapshots/day |
| Users (batch) | 1 snapshot/day |
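These volumes are modest. A quick back-of-envelope conversion to steady-state rates, using the numbers from the table above:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_volume = {
    "events": 860_000,  # clickstream events/day
    "orders": 2_880,    # CDC messages/day
}

per_second = {name: count / SECONDS_PER_DAY for name, count in daily_volume.items()}
# events: ~10/s spread across 6 partitions; orders: one message every
# 30 s, which lines up with the 30-second CDC poll interval.
```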
Domains are declared in `airflow/config/pipeline.yaml` and implemented as plugins under `spark/domains/drivers/`. Late-data detection works as follows:

1. List Bronze files under `bronze/events/event_date=*/` and compute `md5(sorted(file_keys))` per partition date.
2. Look up rows in `partition_meta` where `domain = "events" and status = "DONE"`.
3. If `computed_hash ≠ stored file_hash` for a completed partition, new files have arrived since it was last processed.
4. Set `partition_meta.status = "LATE"`, store the new `file_hash`, and log the detected delta.
5. Combine `PENDING` (new), `LATE`, and `RETRYING` partitions into a single ordered list for processing.

The `partition_meta` Delta table lives at `s3://pipeline-metadata/partition_meta/`:

| Column | Type | Description |
|---|---|---|
| domain | string | "events", "orders", "restaurants", "users" |
| partition_date | date | The calendar date this partition covers |
| status | string | Current state machine state (see diagram above) |
| run_id | string | Airflow DAG run ID that last touched this row |
| record_count | bigint | Records written to Silver on last successful run |
| source_files | array<string> | Bronze Parquet file paths included in last run |
| file_hash | string | md5(sorted(source_files)) — late-data detection key |
| silver_records | bigint | Deduplicated rows in Silver output |
| gold_records | bigint | Rows merged into Gold output |
| retry_count | int | Number of FAILED → RETRYING cycles |
| error_message | string | Last Spark error stacktrace (null on success) |
| created_at | timestamp | When partition was first discovered |
| updated_at | timestamp | Last state transition timestamp |
| processed_at | timestamp | When most recent DONE state was reached |
| silver_path | string | S3 path of Silver output (for lineage) |
| gold_path | string | S3 path of Gold output (for lineage) |
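Two mechanics of this table can be sketched in a few lines: the `file_hash` fingerprint (`md5(sorted(source_files))`) and the legal status transitions. The transition map below is inferred from the states named in this document, not taken from the actual code:

```python
import hashlib

# Status transitions inferred from the states described here:
# UNDISCOVERED -> PENDING -> IN_FLIGHT -> DONE, with FAILED -> RETRYING
# retry cycles and DONE -> LATE re-entry when late data arrives.
TRANSITIONS = {
    "UNDISCOVERED": {"PENDING"},
    "PENDING": {"IN_FLIGHT"},
    "IN_FLIGHT": {"DONE", "FAILED"},
    "FAILED": {"RETRYING"},
    "RETRYING": {"IN_FLIGHT"},
    "DONE": {"LATE"},
    "LATE": {"IN_FLIGHT"},
}

def advance(current: str, target: str) -> str:
    # Guard a state change; the real pipeline persists this in Delta.
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

def file_list_hash(file_keys: list[str]) -> str:
    # md5(sorted(source_files)): order-independent fingerprint of the
    # Bronze files backing a partition.
    return hashlib.md5("\n".join(sorted(file_keys)).encode("utf-8")).hexdigest()

def is_late(stored_hash: str, current_keys: list[str]) -> bool:
    # A DONE partition whose hash no longer matches has received new
    # files since it was last processed.
    return file_list_hash(current_keys) != stored_hash
```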
Event data (clickstream): deduplicated on `event_id` (UUID4). CDC data (orders): a window function `row_number() OVER (PARTITION BY order_id ORDER BY updated_at DESC)` keeps only the latest status per order. A `record_hash` (SHA-256) detects schema-level corruption. Re-running Silver for the same partition always produces the same output.
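The CDC dedup rule (keep row 1 of `row_number() OVER (PARTITION BY order_id ORDER BY updated_at DESC)`) can be mimicked in plain Python for illustration; the record shape here is hypothetical:

```python
def latest_per_order(records: list[dict]) -> list[dict]:
    # Equivalent of keeping rn = 1 from row_number() OVER
    # (PARTITION BY order_id ORDER BY updated_at DESC):
    # retain only the most recent record per order_id.
    latest: dict[str, dict] = {}
    for rec in records:
        key = rec["order_id"]
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: r["order_id"])
```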
`MERGE INTO … WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT`. Running the same partition twice replaces rows rather than duplicating them. This is the key property that makes late-data reprocessing safe — previously computed aggregations are simply overwritten with corrected values.
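The idempotency property is easy to see in miniature: an upsert keyed on the merge key yields the same target no matter how many times the same batch is applied. A toy model, with a plain dict standing in for the Gold table:

```python
def merge_upsert(target: dict, staging: list[dict], key: str = "order_id") -> dict:
    # WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT:
    # each staged row lands exactly once, keyed on the merge key.
    for row in staging:
        target[row[key]] = row
    return target
```

Applying the same staging batch twice leaves the target unchanged after the first application, which is exactly why late-data re-runs are safe.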
Recovery: reset or delete the affected rows in `partition_meta`. The next Airflow run will re-discover the partitions as `PENDING` and reprocess them.

Service URLs, startup commands, daily cadence, common Kafka/Airflow ops, and failure runbooks — everything on-call needs in one place.
- Airflow UI credentials: `AIRFLOW_ADMIN_USER` / `AIRFLOW_ADMIN_PASSWORD`
- MinIO API port: `:9000`
- Spark master URL: `spark://spark-master:7077`
- Schema Registry compatibility: `FULL_TRANSITIVE`
- Postgres databases: `airflow` (Airflow meta), `pipeline_db` (domain tables — orders, users, restaurants)

Versions, network topology, security model, glossary of every term used in this platform, and the GCP migration roadmap.
- Kafka listeners use `SASL_PLAINTEXT` with the `PLAIN` mechanism; CLI tools need `--command-config` to authenticate.
- Containers run with `no-new-privileges:true`; Spark runs as the `spark` (non-root) user.
- Secrets live in `.env`.
- Services bind `0.0.0.0` inside the Docker network but are exposed on the host via `127.0.0.1` loopback only.
- The pipeline's MinIO user (`MINIO_PIPELINE_USER`) has a least-privilege bucket policy.

Glossary:

- **Bronze**: `s3://data-lake/bronze/{domain}/`. At-least-once: may contain duplicates. Never modified after write.
- **Silver**: deduplicated on `event_id` / CDC watermark window. Each domain has a Silver table managed by a domain plugin.
- **CDC (here)**: polling `SELECT … WHERE updated_at > watermark` every 30s, not log-based CDC.
- **DLQ**: `dlq-raw-events`, `dlq-raw-orders`. 30-day retention for manual triage.
- **Partition State Machine**: `partition_meta` Delta table. States: `UNDISCOVERED` → `PENDING` → `IN_FLIGHT` → `DONE`, with `FAILED` and `LATE` paths. Enables idempotent re-runs and late-data detection.
- **Scaling**: add `airflow-worker` replicas to increase concurrency without changing the scheduler.
- **MERGE**: `MERGE INTO target USING source ON key WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT` — a single atomic SQL statement that upserts rows. Makes re-processing late data safe with no duplicates.
- **Idempotency**: Silver dedups on `event_id`. Gold/Snowflake via idempotent MERGE.

| Component | Phase 1 (Current) | Phase 2 | Phase 3 |
|---|---|---|---|
| Event bus | Kafka KRaft (Docker) | Confluent Cloud / Pub/Sub | Pub/Sub + Dataflow |
| Object store | MinIO (Docker) | Google Cloud Storage | GCS |
| Processing | Spark standalone (Docker) | Dataproc on GCE | Dataproc Serverless |
| Orchestration | Airflow CeleryExecutor | Airflow on GCE/GKE | Cloud Composer 2 |
| Analytics DB | Snowflake (external) | Snowflake on GCP | BigQuery |
| Secrets | .env + Fernet | Secret Manager | IAM Workload Identity |
| Approx. cost | $0 local / ~$200/mo VM | ~$500–1,200/mo | ~$800–2,000/mo |
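As a footnote on the glossary's CDC definition (a `SELECT … WHERE updated_at > watermark` poll every 30 seconds, not log-based capture), the loop body can be sketched against an illustrative `orders` table; the schema and function name are assumptions:

```python
import sqlite3

def poll_changes(conn: sqlite3.Connection, watermark: str):
    # One CDC poll: fetch rows updated since the last watermark and
    # advance the watermark to the newest updated_at seen.
    rows = conn.execute(
        "SELECT order_id, status, updated_at FROM orders"
        " WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark
```

In the pipeline this would run on a 30-second timer and publish each fetched row to the raw orders topic. ISO-8601 timestamps sort lexicographically, which is what makes the plain string comparison on `updated_at` safe.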