We’re streaming haul truck telemetry (payload, fuel, GPS) from MineStar via MQTT into Kafka, then Spark into a Delta Lake on S3, and I’m torn between per-asset narrow tables vs a wide event schema and how to dedupe late packets with 2–5 minute drift. Has anyone in production mines settled on a reliable pattern for watermarks/late data and soft deletes, and did you land raw in Iceberg and feed Snowflake every 15 minutes or stick with the lakehouse and serve via Databricks SQL?
We stuck with Delta on S3; what made late/dupe handling reliable was keying Kafka by asset_id+seq_no and running a 12‑min withWatermark in Structured Streaming, then MERGE into silver on (asset_id, seq_no) so newer packets upsert and old ones flip is_deleted=true (docs: Table deletes, updates, and merges | Delta Lake).
What worked for us: wide event table in Delta with bronze append-only, a silver stream deduping on truck+packet_uid and an event-time watermark about 3× observed drift; soft deletes stayed as a flag in silver, then we emitted real deletes via Delta CDF (Change data feed | Delta Lake) and did weekly VACUUM — also ZORDER on truck_id,event_ts or MERGEs crawl. @timothy2584’s note resonates; add an ingest_ts tie-breaker for those “radio hiccup” moments — does MineStar give you a stable packet UID?
We ended up using the device timestamp as event time with a 15‑min watermark (your “2–5 minute drift” blows out at shift change), deduping on unit + msg_id from headers and upserting into a single wide‑ish log. For soft deletes, we emit a tombstone to a compacted side topic (Apache Kafka) and mirror it as deleted=true in the log; if you really need per-asset narrow tables, @timothy2584’s 3× rule still held up but the schema churn was a pain.
Partition Kafka by truck with idempotent producers; fallback watermark on LogAppendTime +2m for S3 lag when MQTT misorders, @sadams85.