Contacts
Get in touch
Close

Contacts

USA, Washington D.C

+ (1) 240-380-7545

info@zorost.com

Pull-quote: “Streaming pipelines that wake people at 3 AM are not real-time. They’re real-painful.”

Why this matters

Real-time pipelines are easy to demo and hard to operate. The pattern that fails: a clever Spark Structured Streaming job that works in dev, struggles in prod under skew, and breaks at the first schema evolution. The pattern that survives: Auto Loader for ingestion, DLT for transformations, expectations for quality, and SLOs that the team monitors like uptime.

The reference architecture

   Sources                Ingestion              Transformation          Consumption
   ───────                ─────────              ──────────────          ───────────
   Cloud storage  ──►  Auto Loader (cloudFiles) ──►  Bronze
   Kafka / EH     ──►  Structured Streaming    ──►  Bronze
   CDC (Debezium) ──►  Auto Loader / SS        ──►  Bronze
                                                        │
                                              DLT expectations
                                              (drop / quarantine)
                                                        ▼
                                                     Silver
                                                        │
                                              joins / aggregations
                                                        ▼
                                                      Gold ──►  BI · ML · Apps

Auto Loader: incremental, schema-evolving, exactly-once

Auto Loader is the foundation. For file-based ingestion at scale, it handles:

  • Incremental discovery of new files
  • Schema inference with versioned schema files
  • Schema evolution with rescued data column for unexpected fields
  • Exactly-once semantics via durable file tracking

For event streams, Structured Streaming directly from Kafka, Event Hubs, or Kinesis covers the same role.

DLT: declarative streaming with managed dependencies

DLT lets you describe what the pipeline computes, not how. The runtime handles dependency ordering, retry semantics, schema validation, and metric capture. Expectations express data-quality contracts:

-- Pseudocode
CREATE STREAMING LIVE TABLE silver_orders
  CONSTRAINT valid_id  EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
  CONSTRAINT valid_amt EXPECT (amount > 0)           ON VIOLATION DROP ROW
  CONSTRAINT plausible EXPECT (amount < 1e7)         ON VIOLATION QUARANTINE
  AS SELECT ... FROM STREAM(LIVE.bronze_orders);

The metrics on those expectations become part of the pipeline’s observability surface.

SLOs that survive production

SLO Target
End-to-end latency P95 < 60 s for “near-real-time” use cases
Drop rate < 0.5% of input records
Quarantine rate < 2% of input records
Pipeline uptime 99.9% monthly
Backfill capability < 24 h for last-7-day reprocessing

These are the right targets to commit to, not the latency benchmarks vendors quote in marketing.

Closing

Streaming on the Lakehouse is operationally feasible when you adopt Auto Loader, DLT, and expectations as the standard pattern. The team’s job becomes monitoring SLOs and reviewing quarantine, not babysitting jobs.