<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Databricks Modernization Archives - Zorost Intelligence | AI, Cloud &amp; Data Experts</title>
	<atom:link href="https://zorost.com/category/databricks-modernization/feed/" rel="self" type="application/rss+xml" />
	<link>https://zorost.com/category/databricks-modernization/</link>
	<description>Production AI systems for aviation, manufacturing, pharma, government, finance, freight, and geopolitical intelligence.</description>
	<lastBuildDate>Wed, 20 May 2026 18:52:40 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://zorost.com/wp-content/uploads/2025/08/ZOROST-Intel-Logo3_512-150x150.png</url>
	<title>Databricks Modernization Archives - Zorost Intelligence | AI, Cloud &amp; Data Experts</title>
	<link>https://zorost.com/category/databricks-modernization/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">81719879</site>	<item>
		<title>Databricks Cost Optimization &#038; FinOps: Where the Real Savings Are</title>
		<link>https://zorost.com/databricks-cost-optimization-finops/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 21 Apr 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Cost Optimization]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[FinOps]]></category>
		<category><![CDATA[Performance Tuning]]></category>
		<guid isPermaLink="false">https://zorost.com/databricks-cost-optimization-finops/</guid>

					<description><![CDATA[<p>A practical FinOps playbook for Databricks. Cluster types, file compaction, caching, serverless, and BI rationalization — with realistic savings ranges.</p>
<p>The post <a href="https://zorost.com/databricks-cost-optimization-finops/">Databricks Cost Optimization &#038; FinOps: Where the Real Savings Are</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Cost optimization is not a one-time project. It&#8217;s a recurring discipline. The tooling is there. The discipline is the ask.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Most Databricks deployments have 30–60% slack in their spend within twelve months of go-live. Some of it is unavoidable (early-stage discovery). Some of it is technical (file layout, cluster sizing). Most of it is organizational (no cost ownership, no tagging, no review cadence).</p>
<h4>Where the real savings are</h4>
<table>
<thead>
<tr>
<th>Lever</th>
<th>Typical impact</th>
</tr>
</thead>
<tbody>
<tr>
<td>Right-sized cluster types (Photon, autoscaling, spot)</td>
<td>15–30%</td>
</tr>
<tr>
<td>Job orchestration (concurrent runs, dependencies, retries)</td>
<td>5–15%</td>
</tr>
<tr>
<td>File compaction and layout (<code>OPTIMIZE</code>, <code>ZORDER BY</code>, liquid clustering)</td>
<td>10–25% on read-heavy workloads</td>
</tr>
<tr>
<td>Caching strategies (disk cache, formerly Delta cache; query result cache)</td>
<td>5–15%</td>
</tr>
<tr>
<td>Workload migration to Serverless SQL where appropriate</td>
<td>10–25%</td>
</tr>
<tr>
<td>BI semantic-model rationalization</td>
<td>10–20% on Power BI / Tableau queries</td>
</tr>
<tr>
<td>Autoscaling thresholds</td>
<td>5–10%</td>
</tr>
<tr>
<td>Tombstone management (<code>VACUUM</code>)</td>
<td>Storage cleanup rather than a compute saving; keeps storage costs from creeping up</td>
</tr>
</tbody>
</table>
<blockquote>
<p>Ranges are typical for engagements where the team has not previously focused on cost. Mature deployments have less to find.</p>
</blockquote>
<h4>Tagging and ownership — the prerequisite</h4>
<p>Without tagging, you can&#8217;t optimize. Required tags:</p>
<ul>
<li><code>cost_center</code></li>
<li><code>environment</code> (dev / stage / prod)</li>
<li><code>owner</code> (team or person)</li>
<li><code>workload</code> (training / serving / ETL / BI / ad-hoc)</li>
</ul>
<p>These flow into the <strong>system tables</strong> for cost reporting (<code>system.billing.usage</code>).</p>
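<p>As a rough illustration of why the tags matter, the sketch below groups usage rows by <code>cost_center</code> and lists which required tags each row is missing. The row shape (a <code>custom_tags</code> dict plus a <code>cost</code> figure) and the sample values are assumptions made for the example, not the exact schema of <code>system.billing.usage</code>.</p>

```python
from collections import defaultdict

# The four required tags named in the list above.
REQUIRED_TAGS = ("cost_center", "environment", "owner", "workload")


def cost_by_tag(rows, tag):
    """Sum cost per value of `tag`; rows missing the tag land under 'UNTAGGED'."""
    totals = defaultdict(float)
    for row in rows:
        key = row.get("custom_tags", {}).get(tag, "UNTAGGED")
        totals[key] += row["cost"]
    return dict(totals)


def missing_tags(row):
    """Return the required tags that a usage row does not carry."""
    tags = row.get("custom_tags", {})
    return [t for t in REQUIRED_TAGS if t not in tags]


# Hypothetical usage rows; real rows would come from the billing system table.
usage_rows = [
    {"cost": 120.0, "custom_tags": {"cost_center": "cc-101", "environment": "prod",
                                    "owner": "data-eng", "workload": "ETL"}},
    {"cost": 80.0, "custom_tags": {"cost_center": "cc-101", "environment": "dev"}},
    {"cost": 50.0, "custom_tags": {}},
]

print(cost_by_tag(usage_rows, "cost_center"))
print([missing_tags(r) for r in usage_rows])
```

<p>Spend that lands under <code>UNTAGGED</code> is spend nobody can be asked about, which is why tagging comes before any tuning.</p>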
<h4>The audit, in twelve hours</h4>
<p>A typical audit takes about twelve hours of senior engineering time:</p>
<ol>
<li>Pull <code>system.billing.usage</code> for the last 90 days, joined with cluster metadata</li>
<li>Identify the top 10 jobs by cost</li>
<li>For each, evaluate: is the cluster the right type? Is autoscaling tuned? Are files compacted? Is the workload running at the right cadence?</li>
<li>Identify candidates for serverless migration</li>
<li>Identify candidates for materialized view replacement</li>
<li>Produce a prioritized list with estimated savings</li>
</ol>
<p>Most teams find five to ten actions that together deliver 20–40% savings.</p>
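<p>The ranking and prioritization steps of the audit can be sketched in a few lines. Job names, dollar figures, and the assumed reduction percentages below are hypothetical and exist only to show the arithmetic.</p>

```python
def top_jobs_by_cost(job_costs, n=10):
    """Return the n most expensive (job, cost) pairs, highest cost first."""
    return sorted(job_costs.items(), key=lambda kv: kv[1], reverse=True)[:n]


def prioritize(actions):
    """Order candidate actions by estimated saving, largest first."""
    return sorted(actions, key=lambda a: a["est_saving"], reverse=True)


# Hypothetical 90-day cost per job, in dollars.
job_costs = {"nightly_batch": 9000.0, "stream_ingest": 14000.0, "adhoc_bi": 3000.0}

# Hypothetical findings; each saving is job cost times an assumed reduction.
actions = [
    {"job": "stream_ingest", "action": "tune autoscaling",
     "est_saving": job_costs["stream_ingest"] * 0.10},
    {"job": "nightly_batch", "action": "smaller Photon-enabled cluster",
     "est_saving": job_costs["nightly_batch"] * 0.25},
]

total_cost = sum(job_costs.values())
total_saving = sum(a["est_saving"] for a in actions)

print(top_jobs_by_cost(job_costs, n=2))
print(prioritize(actions)[0]["job"], round(total_saving / total_cost, 3))
```

<p>Note that the most expensive job is not necessarily the one with the largest estimated saving, which is why the final list is ordered by saving rather than by cost.</p>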
<h4>Common findings</h4>
<ul>
<li>A nightly batch job running on an oversized cluster when a smaller Photon-enabled cluster would do</li>
<li>A streaming pipeline running with a cluster sized for peak when traffic is bimodal</li>
<li>A Power BI model in which 80% of the imported data is never queried</li>
<li>A <code>SELECT *</code> materialized in a downstream view, doubling storage cost on a hot dataset</li>
<li>An ad-hoc cluster left running over a weekend</li>
</ul>
<h4>Cost ownership cadence</h4>
<p>The discipline that makes savings stick is a monthly cost review with data leadership and the FinOps lead. Each owner explains anomalies. Tags get fixed. Wasteful patterns get retired.</p>
<h4>Closing</h4>
<p>Cost optimization on Databricks is not a one-time project. It is a recurring discipline backed by tagging, system tables, and a monthly review. The platform tooling is there. The discipline is the ask.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24309</post-id>	</item>
		<item>
		<title>Power BI Direct Lake on Databricks SQL: A Modernization Playbook</title>
		<link>https://zorost.com/power-bi-direct-lake-databricks-sql/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 31 Mar 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[BI]]></category>
		<category><![CDATA[Databricks SQL]]></category>
		<category><![CDATA[Direct Lake]]></category>
		<category><![CDATA[Power BI]]></category>
		<category><![CDATA[Semantic Model]]></category>
		<guid isPermaLink="false">https://zorost.com/power-bi-direct-lake-databricks-sql/</guid>

					<description><![CDATA[<p>Migrate Power BI semantic models from import / DirectQuery to Direct Lake on Databricks SQL. Performance, governance, and migration patterns.</p>
<p>The post <a href="https://zorost.com/power-bi-direct-lake-databricks-sql/">Power BI Direct Lake on Databricks SQL: A Modernization Playbook</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Direct Lake is not faster DirectQuery. It is a different mode that eliminates a class of refreshes that should never have existed.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Power BI has been deployed in three modes for a decade: <strong>Import</strong>, <strong>DirectQuery</strong>, and <strong>Composite</strong>. Each has trade-offs. Import is fast but stale; DirectQuery is fresh but slow; Composite is a compromise. Direct Lake — Power BI talking directly to Delta tables in Databricks SQL — is a fourth mode that eliminates a class of refresh problems that should never have existed.</p>
<h4>The four modes</h4>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Freshness</th>
<th>Performance</th>
<th>When to use</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Import</strong></td>
<td>Stale until next refresh</td>
<td>Fast</td>
<td>Small models, infrequent updates</td>
</tr>
<tr>
<td><strong>DirectQuery</strong></td>
<td>Live</td>
<td>Slow on large fact tables</td>
<td>Real-time-ish dashboards over modest volume</td>
</tr>
<tr>
<td><strong>Composite</strong></td>
<td>Mixed</td>
<td>Mixed</td>
<td>Hybrid scenarios</td>
</tr>
<tr>
<td><strong>Direct Lake</strong></td>
<td>Live (on Delta)</td>
<td>Fast</td>
<td>Lakehouse-native consumption</td>
</tr>
</tbody>
</table>
<h4>Why Direct Lake works</h4>
<p>Direct Lake reads Delta files directly into Power BI&#8217;s analytics engine without import. There is no refresh schedule. There is no DirectQuery overhead. The semantic model points at Unity Catalog tables and the engine handles the rest.</p>
<p>The conditions for it to work:</p>
<ul>
<li>Source data must be in Delta format</li>
<li>Tables must be in Unity Catalog</li>
<li>Model size must fit in the engine&#8217;s memory budget for the SKU</li>
<li>DAX must be Direct Lake-compatible (most is; some isn&#8217;t)</li>
</ul>
<h4>Migration playbook</h4>
<table>
<thead>
<tr>
<th>Phase</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td>Discovery</td>
<td>Catalog of existing Power BI models · usage telemetry</td>
</tr>
<tr>
<td>Source landing in Delta</td>
<td>Sources moved to Delta tables in Unity Catalog</td>
</tr>
<tr>
<td>Semantic model rebuild</td>
<td>New model on Direct Lake</td>
</tr>
<tr>
<td>Visual rebuild</td>
<td>Reports and dashboards rebuilt against the new model</td>
</tr>
<tr>
<td>Parallel run</td>
<td>Old and new models in production simultaneously</td>
</tr>
<tr>
<td>Cutover</td>
<td>Old retired</td>
</tr>
</tbody>
</table>
<h4>Governance benefits</h4>
<ul>
<li>Row and column security live in Unity Catalog <strong>dynamic views</strong>, not in the semantic model. One source of truth for security.</li>
<li>Lineage covers the entire path from source through Delta to Power BI.</li>
<li>Performance tuning happens at the Delta layer (liquid clustering, OPTIMIZE, Z-order) and benefits every consumer, not just Power BI.</li>
</ul>
<h4>Closing</h4>
<p>Direct Lake is the modern Power BI mode for Lakehouse-native consumption. The migration is methodical, the trade-offs are clear, and the result is faster, fresher dashboards with simpler operations.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24308</post-id>	</item>
		<item>
		<title>Production ML on Databricks: MLflow, Feature Store, Calibration</title>
		<link>https://zorost.com/production-ml-databricks-mlflow-feature-store-calibration/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 03 Mar 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Calibration]]></category>
		<category><![CDATA[Feature Store]]></category>
		<category><![CDATA[MLflow]]></category>
		<category><![CDATA[MLOps]]></category>
		<category><![CDATA[Mosaic AI]]></category>
		<guid isPermaLink="false">https://zorost.com/production-ml-databricks-mlflow-feature-store-calibration/</guid>

					<description><![CDATA[<p>A reference MLOps stack on Databricks — MLflow Model Registry, Feature Store with online serving, calibration-first model evaluation, and Mosaic AI Model Serving.</p>
<p>The post <a href="https://zorost.com/production-ml-databricks-mlflow-feature-store-calibration/">Production ML on Databricks: MLflow, Feature Store, Calibration</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Production ML is not training a model. It&#8217;s the disciplines around training, registering, serving, monitoring, retraining, and retiring.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Most teams shipping their first ML model on Databricks underestimate the discipline required. Training is the small part. The system around training is the large part.</p>
<h4>The reference stack</h4>
<pre><code>   Data ──►  Feature Store  ◄────  online + offline serving
                  │
                  ▼
   Training pipeline (Databricks Job)
                  │
                  ▼
   MLflow Model Registry  ◄────  versions, stages, approvals
                  │
                  ▼
   Mosaic AI Model Serving  ◄────  A/B + canary
                  │
                  ▼
   Monitoring (drift, calibration, performance)
                  │
                  ▼
   Retraining trigger (event, schedule, drift threshold)</code></pre>
<h4>Feature Store — point-in-time correctness</h4>
<p>The Feature Store enforces <strong>point-in-time correctness</strong>: training features are joined as they were at the historical point in time the label was generated. This eliminates leakage that destroys offline evaluation reliability. Online serving uses the same feature definitions to keep training and serving consistent.</p>
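<p>To make the idea concrete, here is a minimal sketch of a point-in-time lookup in plain Python. It is not the Feature Store API; the feature history, timestamps, and labels are hypothetical, and treating a feature recorded at exactly the label time as visible is an assumption of this example.</p>

```python
from bisect import bisect_right


def point_in_time_lookup(history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for one entity: (event time, avg_delay_minutes).
feature_history = [(10, 4.0), (20, 7.5), (30, 12.0)]

# Hypothetical labels: (label time, label). Each label may only see the past.
labels = [(5, 0), (20, 1), (25, 0)]

training_rows = [
    (label_ts, point_in_time_lookup(feature_history, label_ts), label)
    for label_ts, label in labels
]
print(training_rows)
```

<p>The label at time 25 is joined to the value recorded at time 20, never to the later value from time 30. A naive join on the latest value would leak that future observation into training.</p>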
<h4>MLflow Model Registry — lifecycle stages</h4>
<p>Models progress through stages with explicit gates:</p>
<table>
<thead>
<tr>
<th>Stage</th>
<th>Gate</th>
</tr>
</thead>
<tbody>
<tr>
<td>Staging</td>
<td>Passes regression suite + calibration checks</td>
</tr>
<tr>
<td>Production</td>
<td>Passes A/B + canary criteria</td>
</tr>
<tr>
<td>Archived</td>
<td>Replaced by a newer Production model</td>
</tr>
</tbody>
</table>
<p>Every stage transition is logged with the user, the reason, and the metrics that justified it.</p>
<h4>Calibration-first evaluation</h4>
<p>We require every model to ship with <strong>Expected Calibration Error (ECE)</strong> and <strong>conformal prediction</strong> intervals (LACP). Headline accuracy is reported but is not the gate.</p>
<table>
<thead>
<tr>
<th>Gate</th>
<th>Default threshold</th>
</tr>
</thead>
<tbody>
<tr>
<td>ECE</td>
<td>&lt; 0.02 on holdout</td>
</tr>
<tr>
<td>Reliability diagram</td>
<td>No bin &gt; 0.05 deviation</td>
</tr>
<tr>
<td>Conformal coverage</td>
<td>Within 2pp of stated coverage</td>
</tr>
<tr>
<td>Performance regression</td>
<td>No metric below the prior production model</td>
</tr>
</tbody>
</table>
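<p>For readers who have not computed ECE before, the sketch below shows one common formulation: equal-width confidence bins, with the gap between accuracy and mean confidence in each bin weighted by bin size. The bin count and the sample predictions are assumptions; the 0.02 default comes from the table above.</p>

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


def passes_ece_gate(confidences, correct, threshold=0.02):
    """True when ECE is strictly below the threshold."""
    return threshold > expected_calibration_error(confidences, correct)


# Hypothetical holdout predictions: top-class confidence and whether it was right.
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hits = [1, 1, 1, 0, 1, 0]

print(round(expected_calibration_error(confs, hits), 4))
print(passes_ece_gate(confs, hits))
```

<p>In this toy sample the model claims 95% confidence on four predictions but is right on only three, so it fails the gate even though its headline accuracy looks acceptable.</p>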
<h4>Mosaic AI Model Serving — A/B and canary</h4>
<p>Traffic splits and canary rollouts are first-class. New versions get 5% of traffic, observed for SLAs and metrics, then ramp. Rollback is one click.</p>
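<p>The mechanics of a percentage-based canary can be illustrated with a deterministic hash split. This is a sketch of the general idea only: a managed serving endpoint handles the split through its own configuration, and the function name, request ids, and bucket scheme here are hypothetical.</p>

```python
import hashlib


def route(request_id, canary_percent=5):
    """Deterministically route a request to 'canary' or 'stable' by hashing its id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "canary" if canary_percent > bucket else "stable"


# Hypothetical request ids; the same id always lands on the same version.
ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(r) == "canary" for r in ids) / len(ids)
print(round(share, 3))
```

<p>Ramping is then a matter of raising <code>canary_percent</code>, and rollback is setting it to zero.</p>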
<h4>Monitoring — drift, calibration, performance</h4>
<p>Three things to monitor:</p>
<ul>
<li><strong>Feature drift</strong> — input distribution shift</li>
<li><strong>Calibration drift</strong> — ECE moving</li>
<li><strong>Performance drift</strong> — labeled outcomes degrading</li>
</ul>
<p>Monitoring runs as a Databricks Job. Alerts go to Slack / Teams / PagerDuty.</p>
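<p>As one possible way to quantify feature drift, the sketch below computes the Population Stability Index (PSI) between a baseline histogram and a current one. PSI is a common choice, not a metric this post prescribes, and the bin counts are hypothetical.</p>

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # eps keeps the logarithm finite when a bin is empty.
        p = max(e / exp_total, eps)
        q = max(a / act_total, eps)
        score += (q - p) * math.log(q / p)
    return score


# Hypothetical per-bin counts for one input feature.
baseline = [100, 200, 400, 200, 100]
same_shape = [10, 20, 40, 20, 10]   # same distribution, fewer rows
shifted = [50, 100, 300, 300, 250]  # mass moved toward the upper bins

print(round(psi(baseline, same_shape), 6))
print(psi(baseline, shifted) > psi(baseline, same_shape))
```

<p>The alert threshold is a team decision. Whatever value is chosen, it should be tuned against historical data before it pages anyone.</p>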
<h4>Closing</h4>
<p>Production ML on Databricks is straightforward when the stack is right: Feature Store for consistency, MLflow Registry for lifecycle, Mosaic AI Model Serving for delivery, calibration-first evaluation, and disciplined monitoring. The training is the easy part.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24307</post-id>	</item>
		<item>
		<title>Building Multi-Agent Workflows on Databricks (Mosaic AI Agent Framework)</title>
		<link>https://zorost.com/multi-agent-databricks-mosaic-ai-agent-framework/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 24 Feb 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Agent Framework]]></category>
		<category><![CDATA[Agentic AI]]></category>
		<category><![CDATA[MLflow]]></category>
		<category><![CDATA[Mosaic AI]]></category>
		<category><![CDATA[Multi-Agent]]></category>
		<guid isPermaLink="false">https://zorost.com/multi-agent-databricks-mosaic-ai-agent-framework/</guid>

					<description><![CDATA[<p>Multi-agent workflows native to the Lakehouse — designed, built, evaluated, and deployed on the Mosaic AI Agent Framework with typed tools and an evaluation harness.</p>
<p>The post <a href="https://zorost.com/multi-agent-databricks-mosaic-ai-agent-framework/">Building Multi-Agent Workflows on Databricks (Mosaic AI Agent Framework)</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Agents on the Lakehouse mean tools that read and write Delta tables, models that serve under MLflow, and evaluations that ship as Delta tables themselves.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Agentic workflows are the next layer on the Lakehouse — agents that reason, plan, call tools, and produce verifiable artifacts. The Mosaic AI Agent Framework provides the runtime. The architectural decisions still belong to you.</p>
<h4>Reference architecture</h4>
<pre><code>┌──────────────────────────────────────────────────────────────────┐
│                    AGENT (LangGraph / LlamaIndex / Custom)        │
│                                                                    │
│   Planner ──► Executor ──► Critic ──► Referee                    │
└─────────────────────┬────────────────────────────────────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Typed Tools                 │ ◄── Tool catalog
       │   - read Delta tables         │     (Unity Catalog)
       │   - write Delta tables        │
       │   - call MLflow models        │
       │   - call REST APIs            │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Mosaic AI Model Serving     │
       │   - foundation models         │
       │   - fine-tuned models         │
       │   - per-agent traffic split   │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Evaluations as Delta tables │ ◄── Versioned
       │   - golden datasets           │
       │   - regression suite          │
       │   - hallucination detection   │
       └──────────────────────────────┘</code></pre>
<h4>What &#8220;typed tools&#8221; means</h4>
<p>Every tool has a JSON schema for its inputs and outputs. The agent cannot call a tool with invalid inputs, because schema validation rejects the call before the tool runs. This eliminates an entire class of failure that plagues unconstrained agents.</p>
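<p>A minimal sketch of the pattern, in plain Python: a tool is registered with an input schema, and a call is refused before execution if the payload does not match. The validator covers only a small subset of JSON Schema, and the tool name, schema, and body are hypothetical. A real deployment would more likely rely on a full schema validator or on the framework's own tool definitions.</p>

```python
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate(payload, schema):
    """Return a list of violations of a small JSON-Schema-like object schema."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in payload:
            errors.append(f"missing required field: {name}")
    for name, value in payload.items():
        if name not in props:
            errors.append(f"unexpected field: {name}")
            continue
        declared = props[name]["type"]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_mismatch = isinstance(value, bool) and declared != "boolean"
        if bool_mismatch or not isinstance(value, TYPE_MAP[declared]):
            errors.append(f"wrong type for {name}: expected {declared}")
    return errors


def call_tool(tool, payload):
    """Run the tool only if the payload satisfies its input schema."""
    errors = validate(payload, tool["input_schema"])
    if errors:
        raise ValueError("; ".join(errors))
    return tool["fn"](**payload)


# Hypothetical tool: the name, schema, and body are illustrative only.
read_delays = {
    "input_schema": {
        "type": "object",
        "properties": {"airport": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["airport"],
    },
    "fn": lambda airport, limit=10: f"top {limit} delays for {airport}",
}

print(call_tool(read_delays, {"airport": "ORD", "limit": 5}))
```

<p>The rejected call surfaces as a structured error the planner can react to, rather than as a malformed query reaching a Delta table.</p>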
<h4>What &#8220;evaluations as Delta tables&#8221; means</h4>
<p>Evaluation results are stored as rows in versioned Delta tables. Each row is <code>(agent_version, input, expected_output, actual_output, score, metadata)</code>. Regression analysis is a <code>JOIN</code> between two <code>agent_version</code> slices. New versions don&#8217;t promote unless they pass.</p>
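<p>The regression check described above can be sketched without a cluster. The rows below are a trimmed, hypothetical version of the evaluation schema (only <code>agent_version</code>, <code>input</code>, and <code>score</code>), and the in-memory join stands in for the SQL <code>JOIN</code> on the Delta table.</p>

```python
def regressions(rows, baseline, candidate, tolerance=0.0):
    """Join two agent_version slices on `input`; list inputs whose score dropped."""
    base = {r["input"]: r["score"] for r in rows if r["agent_version"] == baseline}
    cand = {r["input"]: r["score"] for r in rows if r["agent_version"] == candidate}
    shared = sorted(key for key in base if key in cand)
    return [
        (key, base[key], cand[key])
        for key in shared
        if base[key] - cand[key] > tolerance
    ]


def can_promote(rows, baseline, candidate):
    """A candidate promotes only when no shared input regressed."""
    return not regressions(rows, baseline, candidate)


# Hypothetical evaluation rows for two agent versions.
eval_rows = [
    {"agent_version": "v1", "input": "q1", "score": 0.90},
    {"agent_version": "v1", "input": "q2", "score": 0.80},
    {"agent_version": "v2", "input": "q1", "score": 0.95},
    {"agent_version": "v2", "input": "q2", "score": 0.70},
]

print(regressions(eval_rows, "v1", "v2"))
print(can_promote(eval_rows, "v1", "v2"))
```

<p>Here <code>v2</code> improves on one input and regresses on another, so it does not promote. Whether any single regression should block promotion, or only an aggregate drop, is a policy choice the sketch leaves to the <code>tolerance</code> parameter.</p>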
<h4>The agent / human contract</h4>
<p>Where humans fit:</p>
<ul>
<li><strong>High-risk operations</strong> require human-in-the-loop checkpoints. Agents can propose; humans approve.</li>
<li><strong>Critic disagreements with the executor</strong> route to humans when the referee cannot adjudicate.</li>
<li><strong>Periodic spot-checks</strong> on agent decisions are scheduled into the evaluation harness.</li>
</ul>
<p>This is not &#8220;manual override.&#8221; This is a designed-in contract about which decisions are agent-final and which are human-final.</p>
<h4>Common architectural decisions</h4>
<table>
<thead>
<tr>
<th>Decision</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of executors</td>
<td>One unless sub-goals are independent</td>
</tr>
<tr>
<td>Critic per executor or shared</td>
<td>Shared unless executors are heterogeneous</td>
</tr>
<tr>
<td>Memory model</td>
<td>Working memory in agent state; long-term memory in Delta table</td>
</tr>
<tr>
<td>Tool call timeout</td>
<td>30 s default, with retries on idempotent tools</td>
</tr>
<tr>
<td>Cost ceiling per session</td>
<td>Configurable; defaults to a hard cap</td>
</tr>
</tbody>
</table>
<h4>Closing</h4>
<p>Multi-agent workflows on Databricks are productive when the framework is paired with discipline: typed tools, deterministic logging, evaluations as Delta tables, and a designed-in agent / human contract. The Mosaic AI Agent Framework is the runtime; the architecture is yours.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24306</post-id>	</item>
		<item>
		<title>Production-Grade RAG on the Lakehouse with Mosaic AI Vector Search</title>
		<link>https://zorost.com/production-rag-mosaic-ai-vector-search/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 03 Feb 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Hybrid Retrieval]]></category>
		<category><![CDATA[Mosaic AI]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[Vector Search]]></category>
		<guid isPermaLink="false">https://zorost.com/production-rag-mosaic-ai-vector-search/</guid>

					<description><![CDATA[<p>How to design, build, and evaluate a production RAG system on Databricks using Mosaic AI Vector Search, hybrid retrieval, and a real evaluation harness.</p>
<p>The post <a href="https://zorost.com/production-rag-mosaic-ai-vector-search/">Production-Grade RAG on the Lakehouse with Mosaic AI Vector Search</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;RAG works in demos. RAG that works in production requires hybrid retrieval, a re-ranker, citation grounding, and an evaluation harness.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Most RAG projects pilot well and disappoint in production. The pattern is the same: embed the corpus, run vector search, ship. Production-grade RAG requires more.</p>
<h4>The production RAG architecture</h4>
<pre><code>                     ┌────────────────────┐
        Question ───►│  AI Gateway        │  ← key mgmt, routing, observability
                     └──────────┬─────────┘
                                ▼
        ┌────────────────────────────────────────────┐
        │                Retrieval                    │
        │  ┌────────────────┐  ┌────────────────┐   │
        │  │ Mosaic AI      │  │ BM25 (lexical) │   │
        │  │ Vector Search  │  │ on Delta SQL   │   │
        │  │ (Delta-synced) │  │                │   │
        │  └───────┬────────┘  └────────┬───────┘   │
        │          └──── merge (RRF) ───┘           │
        │                  │                          │
        │              cross-encoder                  │
        │              re-rank                        │
        └────────────────┬─────────────────────────────┘
                         ▼
              top-K (typically 6–10)
                         │
                         ▼
              Citation-grounded generation
              (Mosaic AI Model Serving)
                         │
                         ▼
              Validated answer with source links</code></pre>
<h4>Why Mosaic AI Vector Search specifically</h4>
<p>Mosaic AI Vector Search <strong>synchronizes with Delta tables</strong>. Update the source table, the index updates. No orchestration glue. Tagging, ACLs, and lineage flow through Unity Catalog. For RAG over enterprise data that changes, this matters more than people initially appreciate.</p>
<h4>Hybrid retrieval is the pattern</h4>
<p>Pure vector search is the most common production RAG mistake. Pure BM25 is the second most common. Hybrid — vector + BM25 + filters + re-rank — is the answer that actually works.</p>
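<p>The merge step in the architecture above is reciprocal rank fusion (RRF). A minimal sketch follows; the document ids are hypothetical, and <code>k=60</code> is the constant commonly used with RRF rather than a value taken from this post.</p>

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the two retrievers, best hit first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]

print(rrf_merge([vector_hits, bm25_hits]))
```

<p>RRF works on ranks rather than raw scores, so it needs no calibration between cosine similarity and BM25 scores. The fused list is what gets passed to the cross-encoder re-ranker.</p>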
<h4>Citation grounding as a structural fix</h4>
<p>Constrain the model to write with bracketed citation tokens. Validate every citation against the retrieval set. Reject answers that fail validation. This is a small structural change with a large operational impact.</p>
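<p>The validation step can be as small as the sketch below. It assumes citations are numeric markers such as <code>[1]</code> that index into the retrieval set; that marker format, and the choice to reject answers with no citations at all, are assumptions of this example.</p>

```python
import re

# Matches bracketed numeric citation markers such as [1] or [12].
CITATION = re.compile(r"\[(\d+)\]")


def validate_citations(answer, retrieved_ids):
    """Return (ok, cited, unknown) for an answer with bracketed numeric citations."""
    cited = [int(m) for m in CITATION.findall(answer)]
    unknown = sorted(set(cited) - set(retrieved_ids))
    # Reject answers that cite nothing or cite outside the retrieval set.
    ok = bool(cited) and not unknown
    return ok, cited, unknown


# Hypothetical retrieval set and model outputs.
retrieved = {1, 2, 3}
good = "Delays rose in Q3 [1], driven by weather [3]."
bad = "Delays rose in Q3 [1], per an internal memo [7]."

print(validate_citations(good, retrieved))
print(validate_citations(bad, retrieved))
```

<p>This check only confirms that each citation points at a retrieved document. Whether the cited passage actually supports the claim is a separate question for the evaluation harness.</p>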
<h4>Evaluation harness — non-negotiable</h4>
<p>A production RAG system without an evaluation harness is a guess. The harness has three components:</p>
<ol>
<li><strong>Golden Q&amp;A dataset</strong> — questions paired with the documents that should ground the answers</li>
<li><strong>Grounding rate</strong> — what fraction of generated claims are supported by retrieved documents</li>
<li><strong>Hallucination detection</strong> — flagging unsupported claims</li>
</ol>
<p>The harness runs as a Databricks Job on every model or retrieval change. Regressions are caught before deployment.</p>
<h4>Closing</h4>
<p>Production RAG on the Lakehouse with Mosaic AI is straightforward when you adopt the architecture: hybrid retrieval, re-ranker, citation grounding, evaluation harness. The result is a RAG system analysts trust enough to use.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24305</post-id>	</item>
		<item>
		<title>Unity Catalog: Governance Done Right</title>
		<link>https://zorost.com/unity-catalog-governance-done-right/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 13 Jan 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Data Mesh]]></category>
		<category><![CDATA[Governance]]></category>
		<category><![CDATA[Lineage]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Unity Catalog]]></category>
		<guid isPermaLink="false">https://zorost.com/unity-catalog-governance-done-right/</guid>

					<description><![CDATA[<p>Most governance projects fail because they start with policy. The good ones start with structure. Here is a reference Unity Catalog deployment that supports both governance and data-mesh patterns.</p>
<p>The post <a href="https://zorost.com/unity-catalog-governance-done-right/">Unity Catalog: Governance Done Right</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Governance that the team can&#8217;t navigate is governance that the team will route around.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Most data-governance projects fail because they start with policy. The good ones start with <strong>structure</strong>. Unity Catalog&#8217;s hierarchy (Catalog → Schema → Table) is the structural foundation that makes policy enforceable.</p>
<h4>Reference layout (data-mesh)</h4>
<pre><code>catalog: zorost
├── domain_aviation
│   ├── flights_silver
│   ├── delays_gold
│   └── safety_rag
├── domain_manufacturing
│   ├── spc_silver
│   └── capability_gold
├── domain_freight
│   ├── corridors_silver
│   └── emissions_gold
├── domain_finance
│   └── ...
└── domain_governance       ← cross-cutting
    ├── audit_logs
    ├── pii_register
    └── data_quality_metrics</code></pre>
<h4>Permission model</h4>
<table>
<thead>
<tr>
<th>Principal</th>
<th>What they get</th>
</tr>
</thead>
<tbody>
<tr>
<td>Domain Steward</td>
<td>OWNER on <code>domain_X.*</code></td>
</tr>
<tr>
<td>Domain Engineer</td>
<td>USE CATALOG on the parent catalog + USE SCHEMA on <code>domain_X.*</code> + CREATE on <code>domain_X.*</code></td>
</tr>
<tr>
<td>Cross-domain Analyst</td>
<td>SELECT on Gold tables only</td>
</tr>
<tr>
<td>Auditor</td>
<td>SELECT on <code>domain_governance.*</code></td>
</tr>
<tr>
<td>Service Principal (apps)</td>
<td>SELECT on specific Gold tables · scoped by token</td>
</tr>
</tbody>
</table>
<h4>Row and column security with dynamic views</h4>
<p>Unity Catalog supports <strong>dynamic views</strong> — views whose behavior depends on the current user. A typical pattern:</p>
<pre><code>CREATE VIEW domain_aviation.flights_secure AS
SELECT
  flight_id,
  origin_airport,
  destination_airport,
  CASE WHEN is_member('phi_authorized') THEN passenger_count ELSE NULL END
    AS passenger_count,
  ...
FROM domain_aviation.flights_silver
WHERE
  CASE
    WHEN is_member('all_regions') THEN TRUE
    ELSE region IN (SELECT region FROM domain_governance.user_region_grants
                     WHERE user = current_user())
  END;</code></pre>
<p><code>is_member()</code> and <code>current_user()</code> in dynamic views, combined with Unity Catalog row filters and column masks, cover row-level, column-level, and attribute-based access control (ABAC) patterns.</p>
<h4>Tags and classification</h4>
<p>Every column and table can carry tags. We standardize a tag taxonomy:</p>
<table>
<thead>
<tr>
<th>Tag</th>
<th>Values</th>
<th>Use</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>pii_class</code></td>
<td><code>pii</code>, <code>pii_sensitive</code>, <code>phi</code>, <code>pci</code>, <code>none</code></td>
<td>Drives masking and access policy</td>
</tr>
<tr>
<td><code>data_owner</code></td>
<td>domain steward email</td>
<td>Clear accountability</td>
</tr>
<tr>
<td><code>freshness_sla</code></td>
<td><code>realtime</code>, <code>1h</code>, <code>1d</code>, <code>1w</code></td>
<td>Drives monitoring</td>
</tr>
<tr>
<td><code>retention</code></td>
<td><code>30d</code>, <code>1y</code>, <code>7y</code>, <code>permanent</code></td>
<td>Drives lifecycle</td>
</tr>
</tbody>
</table>
<p>Tags make policy queryable: &#8220;show me all PII-tagged columns in domain_finance&#8221; returns a result set, not an email thread.</p>
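<p>To show what a queryable policy looks like in practice, the sketch below filters a list of column-tag records for sensitive columns in one schema. The record shape, table names, and column names are hypothetical stand-ins for whatever tag metadata the catalog exposes.</p>

```python
# Values of pii_class that count as sensitive, taken from the taxonomy above.
SENSITIVE_VALUES = {"pii", "pii_sensitive", "phi", "pci"}


def columns_with_tag(column_tags, schema, tag, values):
    """Return 'table.column' names in `schema` whose `tag` is one of `values`."""
    return sorted(
        f"{row['table']}.{row['column']}"
        for row in column_tags
        if row["schema"] == schema and row["tag"] == tag and row["value"] in values
    )


# Hypothetical tag records, one per tagged column.
column_tags = [
    {"schema": "domain_finance", "table": "invoices_gold",
     "column": "customer_email", "tag": "pii_class", "value": "pii"},
    {"schema": "domain_finance", "table": "invoices_gold",
     "column": "amount", "tag": "pii_class", "value": "none"},
    {"schema": "domain_aviation", "table": "flights_silver",
     "column": "passenger_name", "tag": "pii_class", "value": "pii_sensitive"},
]

print(columns_with_tag(column_tags, "domain_finance", "pii_class", SENSITIVE_VALUES))
```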
<h4>Lineage and audit</h4>
<p>Unity Catalog captures column-level lineage across SQL, Python, ML, and BI consumption. Audit logs go to a sink the security team owns. Both are queryable via <code>system.access.audit</code> and <code>system.access.column_lineage</code>.</p>
<h4>Closing</h4>
<p>Governance done right starts with structure. Unity Catalog&#8217;s hierarchy + permission model + tagging + dynamic views + lineage + audit are the primitives. The implementation is workshop-driven, but the building blocks are stable and the patterns are reproducible.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24304</post-id>	</item>
		<item>
		<title>Streaming on the Lakehouse: Auto Loader + DLT in Practice</title>
		<link>https://zorost.com/streaming-lakehouse-auto-loader-dlt/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 30 Dec 2025 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Auto Loader]]></category>
		<category><![CDATA[DLT]]></category>
		<category><![CDATA[Real-Time]]></category>
		<category><![CDATA[Streaming]]></category>
		<category><![CDATA[Structured Streaming]]></category>
		<guid isPermaLink="false">https://zorost.com/streaming-lakehouse-auto-loader-dlt/</guid>

					<description><![CDATA[<p>A reference architecture for real-time pipelines on Databricks. Auto Loader, DLT, expectations, and SLOs that survive production.</p>
<p>The post <a href="https://zorost.com/streaming-lakehouse-auto-loader-dlt/">Streaming on the Lakehouse: Auto Loader + DLT in Practice</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Streaming pipelines that wake people at 3 AM are not real-time. They&#8217;re real-painful.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Real-time pipelines are easy to demo and hard to operate. The pattern that fails: a clever Spark Structured Streaming job that works in dev, struggles in prod under skew, and breaks at the first schema evolution. The pattern that survives: Auto Loader for ingestion, DLT for transformations, expectations for quality, and SLOs that the team monitors like uptime.</p>
<h4>The reference architecture</h4>
<pre><code>   Sources                Ingestion              Transformation          Consumption
   ───────                ─────────              ──────────────          ───────────
   Cloud storage  ──►  Auto Loader (cloudFiles) ──►  Bronze
   Kafka / EH     ──►  Structured Streaming    ──►  Bronze
   CDC (Debezium) ──►  Auto Loader / SS        ──►  Bronze
                                                        │
                                              DLT expectations
                                              (drop / quarantine)
                                                        ▼
                                                     Silver
                                                        │
                                              joins / aggregations
                                                        ▼
                                                      Gold ──►  BI · ML · Apps</code></pre>
<h4>Auto Loader: incremental, schema-evolving, exactly-once</h4>
<p>Auto Loader is the foundation. For file-based ingestion at scale, it handles:</p>
<ul>
<li><strong>Incremental discovery</strong> of new files</li>
<li><strong>Schema inference</strong> with versioned schema files</li>
<li><strong>Schema evolution</strong> with a rescued data column for unexpected fields</li>
<li><strong>Exactly-once semantics</strong> via durable file tracking</li>
</ul>
<p>For event streams, Structured Streaming reading directly from Kafka, Event Hubs, or Kinesis fills the same role.</p>
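<p>The file-based case can be sketched as follows. The option keys are standard Auto Loader options; the helper names, the paths, and the choice of <code>addNewColumns</code> as the evolution mode are assumptions for illustration. <code>read_bronze</code> only works with a Databricks <code>SparkSession</code>.</p>

```python
def autoloader_options(file_format, schema_location):
    """Build Auto Loader reader options for a schema-evolving Bronze ingest."""
    return {
        "cloudFiles.format": file_format,
        # Where inferred schema versions are persisted between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Evolve the tracked schema when new columns appear in the source.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def read_bronze(spark, source_path, options):
    """Return a streaming DataFrame over newly arrived files."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)
```

<p>File tracking is held in the stream checkpoint, so the exactly-once behavior also depends on setting a stable <code>checkpointLocation</code> when the stream is written to Bronze.</p>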
<h4>DLT: declarative streaming with managed dependencies</h4>
<p>DLT lets you describe <strong>what</strong> the pipeline computes, not how. The runtime handles dependency ordering, retry semantics, schema validation, and metric capture. Expectations express data-quality contracts:</p>
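<pre><code>-- Lakeflow Declarative Pipelines (DLT) SQL. Constraints go in a parenthesized,
-- comma-separated list. Built-in violation actions are DROP ROW and FAIL UPDATE;
-- with no ON VIOLATION clause the row is kept and the violation is recorded.
-- Quarantine is not a built-in action: route flagged rows to a separate table
-- downstream using the inverse of the rule.
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_id  EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_amt EXPECT (amount &gt; 0)           ON VIOLATION DROP ROW,
  CONSTRAINT plausible EXPECT (amount &lt; 1e7)
)
AS SELECT ... FROM STREAM(LIVE.bronze_orders);</code></pre>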
<p>The metrics on those expectations become part of the pipeline&#8217;s observability surface.</p>
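<p>To make the drop and quarantine semantics concrete, here is a plain-Python emulation of the three rules on <code>silver_orders</code>. It is a sketch of the routing logic only, not the DLT runtime. Treating a missing <code>amount</code> as a violation is an assumption of this sketch.</p>

```python
from collections import Counter

# Each rule: (name, predicate that must hold, action on violation).
# Names and thresholds mirror the silver_orders example above.
RULES = [
    ("valid_id", lambda r: r.get("order_id") is not None, "drop"),
    ("valid_amt", lambda r: r.get("amount") is not None and r["amount"] > 0, "drop"),
    ("plausible", lambda r: r.get("amount") is not None and 1e7 > r["amount"], "quarantine"),
]


def apply_expectations(rows):
    """Split rows into passed and quarantined lists; count violations per rule."""
    passed, quarantined, metrics = [], [], Counter()
    for row in rows:
        failed = [(name, action) for name, check, action in RULES if not check(row)]
        for name, _ in failed:
            metrics[name] += 1
        actions = {action for _, action in failed}
        if "drop" in actions:
            continue  # dropped rows are counted in metrics but kept nowhere
        if "quarantine" in actions:
            quarantined.append(row)
        else:
            passed.append(row)
    return passed, quarantined, metrics
```

<p>The per-rule counters are the same kind of signal the pipeline exposes as expectation metrics: they show which rule is failing, not just how many rows were lost.</p>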
<h4>SLOs that survive production</h4>
<table>
<thead>
<tr>
<th>SLO</th>
<th>Target</th>
</tr>
</thead>
<tbody>
<tr>
<td>End-to-end latency P95</td>
<td>&lt; 60 s for &#8220;near-real-time&#8221; use cases</td>
</tr>
<tr>
<td>Drop rate</td>
<td>&lt; 0.5% of input records</td>
</tr>
<tr>
<td>Quarantine rate</td>
<td>&lt; 2% of input records</td>
</tr>
<tr>
<td>Pipeline uptime</td>
<td>99.9% monthly</td>
</tr>
<tr>
<td>Backfill capability</td>
<td>&lt; 24 h for last-7-day reprocessing</td>
</tr>
</tbody>
</table>
<p>These are the right targets to commit to, not the latency benchmarks vendors quote in marketing.</p>
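<p>The first three targets are simple to evaluate mechanically. The sketch below assumes you already collect per-record latencies and row counts for a window; the function names and the nearest-rank percentile method are choices made for illustration.</p>

```python
# Targets copied from the SLO table above.
LATENCY_P95_TARGET_S = 60.0
DROP_RATE_TARGET = 0.005
QUARANTINE_RATE_TARGET = 0.02


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n, 1-based
    return ordered[rank - 1]


def evaluate_slos(latencies_s, input_rows, dropped_rows, quarantined_rows):
    """Return a dict mapping each SLO name to True (met) or False (breached)."""
    return {
        "latency_p95": LATENCY_P95_TARGET_S > p95(latencies_s),
        "drop_rate": DROP_RATE_TARGET > dropped_rows / input_rows,
        "quarantine_rate": QUARANTINE_RATE_TARGET > quarantined_rows / input_rows,
    }
```

<p>Uptime and backfill capability are measured over longer windows and are left out of this sketch.</p>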
<h4>Closing</h4>
<p>Streaming on the Lakehouse is operationally feasible when you adopt Auto Loader, DLT, and expectations as the standard pattern. The team&#8217;s job becomes monitoring SLOs and reviewing quarantined rows, not babysitting jobs.</p>
<hr>
<p>The post <a href="https://zorost.com/streaming-lakehouse-auto-loader-dlt/">Streaming on the Lakehouse: Auto Loader + DLT in Practice</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24303</post-id>	</item>
		<item>
		<title>Dimensional Modeling on Delta Lake — and When to Choose Data Vault Instead</title>
		<link>https://zorost.com/dimensional-modeling-delta-data-vault/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 09 Dec 2025 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Data Vault]]></category>
		<category><![CDATA[Delta Lake]]></category>
		<category><![CDATA[Dimensional Modeling]]></category>
		<category><![CDATA[Lakehouse Federation]]></category>
		<guid isPermaLink="false">https://zorost.com/dimensional-modeling-delta-data-vault/</guid>

					<description><![CDATA[<p>Star schema, data vault, one-big-table, federation. Each has a different shape and different trade-offs on Delta Lake. A decision framework.</p>
<p>The post <a href="https://zorost.com/dimensional-modeling-delta-data-vault/">Dimensional Modeling on Delta Lake — and When to Choose Data Vault Instead</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;There is no single right model. There is the right model for the workload.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Dimensional modeling is a forty-year-old discipline. Lakehouse architecture is a five-year-old discipline. Most teams import their old habits into the new platform and produce models that work but underperform — or models that look modern but break under load.</p>
<p>The right approach is workload-driven.</p>
<h4>Four patterns to choose from</h4>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>When to use</th>
<th>Strengths</th>
<th>Weaknesses</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Star schema</strong></td>
<td>Reporting and dashboards dominate; Power BI / Tableau is the primary consumer</td>
<td>Familiar; BI-tool friendly; fast slicing on Photon-enabled Delta</td>
<td>Less agile to change; many-to-many requires bridge tables</td>
</tr>
<tr>
<td><strong>Data Vault 2.0</strong></td>
<td>Many sources; auditability is required; the model needs to evolve continuously</td>
<td>Auditable; agile; handles many sources; clear separation of business keys, satellites, and links</td>
<td>More tables; queries usually need a presentation layer</td>
</tr>
<tr>
<td><strong>One Big Table</strong></td>
<td>API-driven sub-second queries dominate; consumers are applications, not analysts</td>
<td>Sub-second queries; simple semantics for app developers</td>
<td>Joins move into ETL; updates can be expensive</td>
</tr>
<tr>
<td><strong>Lakehouse Federation</strong></td>
<td>Cross-system reporting on sources whose governance you do not own</td>
<td>No data movement; fast to deliver</td>
<td>Performance depends on source; governance has to be explicit</td>
</tr>
</tbody>
</table>
<h4>Decision tree</h4>
<pre><code>Primary consumer of the model?
   ├── Analysts / BI tools  ──► Star schema (consider Direct Lake)
   ├── Apps / APIs          ──► One Big Table or Star with caching
   ├── Many sources, audit  ──► Data Vault 2.0
   └── Cross-system reporting, no copy possible ──► Lakehouse Federation</code></pre>
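<p>The same tree can be kept as data, which makes the decision reviewable in an architecture repo. This is a small sketch; the consumer labels used as keys are hypothetical names introduced here.</p>

```python
# Branches copied from the decision tree above; the keys are hypothetical labels.
PATTERN_BY_CONSUMER = {
    "bi_tools": "Star schema",
    "apps_apis": "One Big Table or Star with caching",
    "many_sources_audit": "Data Vault 2.0",
    "cross_system_no_copy": "Lakehouse Federation",
}


def choose_pattern(primary_consumer):
    """Return the recommended modeling pattern for a primary consumer label."""
    try:
        return PATTERN_BY_CONSUMER[primary_consumer]
    except KeyError:
        known = ", ".join(sorted(PATTERN_BY_CONSUMER))
        raise ValueError(f"unknown consumer {primary_consumer!r}; expected one of: {known}")
```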
<h4>How we structure the medallion architecture</h4>
<p>Regardless of model pattern, we maintain a Bronze/Silver/Gold separation:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Purpose</th>
<th>Typical retention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bronze</td>
<td>Raw + arrival timestamp + source ID; immutable</td>
<td>Long (years)</td>
</tr>
<tr>
<td>Silver</td>
<td>Parsed, conformed, deduplicated; data quality enforced</td>
<td>Medium (months to years)</td>
</tr>
<tr>
<td>Gold</td>
<td>Business-ready aggregates / dimensions / facts</td>
<td>Short to medium</td>
</tr>
</tbody>
</table>
<p>The model pattern (star, vault, OBT) lives in <strong>Gold</strong>.</p>
<h4>When to mix</h4>
<p>Mixing is normal. A typical enterprise customer ends up with:</p>
<ul>
<li><strong>Data Vault 2.0</strong> for the foundational integration of multiple sources</li>
<li><strong>Star schema</strong> in Gold for analytical consumers</li>
<li><strong>One Big Table</strong> in Gold for app consumers</li>
<li><strong>Lakehouse Federation</strong> for occasional cross-system reporting</li>
</ul>
<h4>Closing</h4>
<p>Dimensional modeling on Delta Lake is dimensional modeling, with new physics. Photon changes the execution engine, while liquid clustering and Z-order change the storage layout; together they shift query performance economics. The choice of model still depends on the workload, but the trade-offs are different now than they were a decade ago.</p>
<hr>
<p>The post <a href="https://zorost.com/dimensional-modeling-delta-data-vault/">Dimensional Modeling on Delta Lake — and When to Choose Data Vault Instead</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24302</post-id>	</item>
		<item>
		<title>Modernizing ETL: Informatica/SSIS/DataStage to Lakeflow + DLT</title>
		<link>https://zorost.com/modernizing-etl-lakeflow-dlt/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 25 Nov 2025 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Auto Loader]]></category>
		<category><![CDATA[DataStage]]></category>
		<category><![CDATA[DLT]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Informatica]]></category>
		<category><![CDATA[Lakeflow]]></category>
		<category><![CDATA[SSIS]]></category>
		<guid isPermaLink="false">https://zorost.com/modernizing-etl-lakeflow-dlt/</guid>

					<description><![CDATA[<p>A practical conversion playbook for legacy ETL — Informatica, SSIS, DataStage — to Databricks Lakeflow + DLT with Auto Loader. Patterns, expectations, and how to handle SCD types.</p>
<p>The post <a href="https://zorost.com/modernizing-etl-lakeflow-dlt/">Modernizing ETL: Informatica/SSIS/DataStage to Lakeflow + DLT</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;If your modernization plan doesn&#8217;t replace the ETL tool, you didn&#8217;t modernize. You just changed where the data lands.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>A migration that moves data into Databricks but leaves Informatica running is an incomplete migration. Much of the cost and operational pain of legacy stacks lives in the ETL tool: license fees, scheduling brittleness, lineage gaps, and fragile dependencies on legacy connectors.</p>
<p>The right modernization replaces the ETL tool. <strong>Lakeflow Declarative Pipelines (DLT)</strong>, <strong>Auto Loader</strong>, and <strong>Databricks Jobs</strong> together cover the full surface area.</p>
<h4>The conversion table</h4>
<table>
<thead>
<tr>
<th>Legacy pattern</th>
<th>Lakehouse equivalent</th>
</tr>
</thead>
<tbody>
<tr>
<td>Source-to-stage mapping</td>
<td><strong>Auto Loader</strong> (<code>cloudFiles</code>) — incremental, schema-evolving, exactly-once</td>
</tr>
<tr>
<td>Slowly Changing Dimension Type 1</td>
<td>DLT <code>APPLY CHANGES INTO</code> with <code>STORED AS SCD TYPE 1</code> (Python: <code>apply_changes</code> with <code>stored_as_scd_type=1</code>)</td>
</tr>
<tr>
<td>SCD Type 2 with effective dating</td>
<td>DLT <code>APPLY CHANGES INTO</code> with <code>STORED AS SCD TYPE 2</code> (Python: <code>apply_changes</code> with <code>stored_as_scd_type=2</code>)</td>
</tr>
<tr>
<td>Aggregations &amp; roll-ups</td>
<td><strong>Materialized views</strong> in Databricks SQL</td>
</tr>
<tr>
<td>Workflow scheduling</td>
<td><strong>Databricks Jobs</strong> with retries, alerts, lineage</td>
</tr>
<tr>
<td>Data quality rules</td>
<td><strong>DLT expectations</strong> with quarantine and metric capture</td>
</tr>
<tr>
<td>Custom logging &amp; audit</td>
<td><strong>Unity Catalog lineage</strong> + the <code>system.access.audit</code> system table</td>
</tr>
<tr>
<td>Reusable transformations</td>
<td>DLT pipelines with shared notebooks/libraries</td>
</tr>
</tbody>
</table>
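<p>For teams converting hand-built SCD Type 2 mappings, it helps to state the target semantics precisely before trusting the managed implementation. The following plain-Python sketch illustrates effective dating on an in-memory history; it is not the DLT API, and the field names (<code>id</code>, <code>ts</code>, <code>start</code>, <code>end</code>) are hypothetical. As a simplification, late-arriving records are ignored rather than slotted into history.</p>

```python
def apply_scd2(history, change, key="id", seq="ts"):
    """Apply one change record to an SCD Type 2 history list, in place.

    Open rows have end == None. A change closes the open row for its key
    and appends a new open row; unchanged or late records are ignored.
    """
    current = next(
        (r for r in history if r[key] == change[key] and r["end"] is None), None
    )
    attrs = {k: v for k, v in change.items() if k not in (key, seq)}
    if current is not None:
        if current["start"] >= change[seq]:
            return history  # late or duplicate record: keep existing history
        if current["attrs"] == attrs:
            return history  # no attribute changed: nothing to version
        current["end"] = change[seq]  # close the previous version
    history.append({key: change[key], "attrs": attrs, "start": change[seq], "end": None})
    return history
```

<p>Running the legacy mapping and a reference like this over the same change feed during the parallel run gives a concrete way to compare histories row by row.</p>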
<h4>A reference DLT pipeline</h4>
<pre><code>                    ┌────────────────────────────┐
   Cloud Storage ──►│ Auto Loader (schema evol.) │──► Bronze
                    └────────────────────────────┘
                                                      │
                          DLT expectations           ▼
                          (drop / quarantine)    Silver
                                                      │
                          Aggregations / joins        ▼
                                                  Gold</code></pre>
<h4>How we treat data quality</h4>
<p>Data quality is part of the pipeline, not bolted on after. Every Silver table has DLT expectations that:</p>
<ul>
<li><strong>Drop</strong> obviously bad rows (null business keys, malformed dates)</li>
<li><strong>Quarantine</strong> suspicious rows (range violations, referential gaps) for review</li>
<li><strong>Capture metrics</strong> so dashboards show data-quality trends, not just data volume</li>
</ul>
<p>Quality is a first-class output of the pipeline. The data team monitors it like they monitor latency.</p>
<h4>Migration sequence</h4>
<table>
<thead>
<tr>
<th>Phase</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inventory</td>
<td>Mappings, jobs, sessions, schedules, lineage gaps</td>
</tr>
<tr>
<td>Pattern library</td>
<td>Templates for the top 8–12 conversion patterns in your stack</td>
</tr>
<tr>
<td>Iteration 1 (highest-volume sources)</td>
<td>First migrated DLT pipelines · parallel run</td>
</tr>
<tr>
<td>Iterations 2–N</td>
<td>Wave-by-wave conversion with parallel run, cutover, decommission</td>
</tr>
<tr>
<td>Hyper-care</td>
<td>30/60/90 day stabilization</td>
</tr>
</tbody>
</table>
<h4>Closing</h4>
<p>ETL modernization done right replaces the legacy tool, not just the destination. Lakeflow + DLT + Auto Loader covers the full surface. The savings are measurable in license fees, operational toil, and time-to-insight.</p>
<hr>
<p>The post <a href="https://zorost.com/modernizing-etl-lakeflow-dlt/">Modernizing ETL: Informatica/SSIS/DataStage to Lakeflow + DLT</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24301</post-id>	</item>
		<item>
		<title>OBIEE to Databricks: A Practical Migration Pattern</title>
		<link>https://zorost.com/obiee-to-databricks-migration-pattern/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 04 Nov 2025 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Migration]]></category>
		<category><![CDATA[OBIEE]]></category>
		<category><![CDATA[Semantic Layer]]></category>
		<category><![CDATA[Unity Catalog]]></category>
		<guid isPermaLink="false">https://zorost.com/obiee-to-databricks-migration-pattern/</guid>

					<description><![CDATA[<p>Move Oracle OBIEE / OAS to Databricks SQL with a clear semantic-layer methodology. RPD reconstruction, security translation, ETL conversion, and report rebuild — without losing business logic.</p>
<p>The post <a href="https://zorost.com/obiee-to-databricks-migration-pattern/">OBIEE to Databricks: A Practical Migration Pattern</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;The RPD is not a black box. It is a graph of joins, hierarchies, and security predicates. Treat it that way and migration becomes tractable.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Oracle BI EE is one of the most widely deployed enterprise BI platforms. It has also accumulated technical debt: schema drift, layered RPDs, undocumented session variables, and report logic split between the BMM and the report itself. Most &#8220;migration&#8221; projects start by trying to lift-and-shift everything, get blocked, and stall.</p>
<p>The right approach is methodological. The RPD can be treated as three layers, each of which has a clean Databricks SQL equivalent.</p>
<h4>The three-layer translation</h4>
<pre><code>   OBIEE                                Databricks
   ─────                                ──────────
   Physical layer    ─────►  Delta Lake tables in Unity Catalog
                              + Lakehouse Federation for live sources

   BMM (logical)     ─────►  Databricks SQL semantic model
                              (Lakehouse views with row/column security)

   Presentation      ─────►  Power BI / Tableau on Databricks SQL
                              (dimensions, measures, time intelligence)</code></pre>
<h4>Migration sequence</h4>
<table>
<thead>
<tr>
<th>Phase</th>
<th>Length (typical)</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>1. Discovery</strong></td>
<td>2–4 wks</td>
<td>Catalog of subject areas, RPDs, repositories, presentation catalogs · usage telemetry · report criticality matrix</td>
</tr>
<tr>
<td><strong>2. Source mapping</strong></td>
<td>2–4 wks</td>
<td>Mapping of physical layer to landing tables in Bronze/Silver Delta · federated sources documented</td>
</tr>
<tr>
<td><strong>3. Semantic model design</strong></td>
<td>4–8 wks</td>
<td>Logical-to-Databricks-SQL semantic model with row/column security</td>
</tr>
<tr>
<td><strong>4. ETL conversion</strong></td>
<td>parallel with 3</td>
<td>Native ETL → Lakeflow / DLT / Spark with DLT expectations</td>
</tr>
<tr>
<td><strong>5. Report rebuild</strong></td>
<td>4–10 wks</td>
<td>Top reports rebuilt in Power BI Direct Lake or Tableau</td>
</tr>
<tr>
<td><strong>6. Cutover &amp; decom.</strong></td>
<td>2–6 wks</td>
<td>Parallel run · UAT · sign-off · legacy decommissioning</td>
</tr>
<tr>
<td><strong>7. Hyper-care</strong></td>
<td>30/60/90 days</td>
<td>Stabilization with SLA-backed support</td>
</tr>
</tbody>
</table>
<h4>Security translation</h4>
<table>
<thead>
<tr>
<th>OBIEE security primitive</th>
<th>Databricks equivalent</th>
</tr>
</thead>
<tbody>
<tr>
<td>Application Roles</td>
<td>Unity Catalog groups (Entra/IDP-mapped)</td>
</tr>
<tr>
<td>Data filters on logical tables</td>
<td>Dynamic views with <code>current_user()</code> and <code>is_member()</code></td>
</tr>
<tr>
<td>Column-level filters</td>
<td><code>mask()</code> functions in dynamic views</td>
</tr>
<tr>
<td>Session variables</td>
<td>Catalog-scoped configuration tables</td>
</tr>
<tr>
<td>Init blocks</td>
<td>Replaced by IDP/Entra group claims</td>
</tr>
</tbody>
</table>
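<p>The row and column rules in this table can be prototyped before any view is written. Below is a plain-Python emulation of what a dynamic view does with group membership; it is a sketch for reasoning about the rules, not Databricks code. The group names (<code>region_*</code>, <code>pii_readers</code>) and the masking format are hypothetical.</p>

```python
def mask_email(value):
    """Hide the local part of an email address, keeping the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain


def secure_view(rows, user_groups):
    """Emulate a dynamic view: filter rows by region group, mask email if needed."""
    can_see_pii = "pii_readers" in user_groups  # hypothetical group name
    visible = []
    for row in rows:
        # Row-level rule: a user sees a region only if in the matching group.
        if f"region_{row['region']}" not in user_groups:
            continue
        out = dict(row)
        if not can_see_pii:
            out["email"] = mask_email(row["email"])  # column-level rule
        visible.append(out)
    return visible
```

<p>Expressing each OBIEE data filter as a small predicate like this makes it easier to test the translated rules against known users before cutover.</p>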
<h4>Common pitfalls</h4>
<ul>
<li><strong>Trying to lift-and-shift the BMM.</strong> Some logic in the BMM is a workaround for OBIEE limitations. Rebuild as Lakehouse views; don&#8217;t translate one-for-one.</li>
<li><strong>Skipping usage telemetry.</strong> Half the reports in a typical OBIEE deployment are unused. Don&#8217;t migrate them.</li>
<li><strong>Translating session variables literally.</strong> Most session variables become dynamic-view predicates or IDP claims.</li>
<li><strong>Building the semantic model in Power BI instead of Databricks SQL.</strong> Power BI imports work in the short term and create future modernization debt. Direct Lake is the target.</li>
</ul>
<h4>Closing</h4>
<p>The OBIEE → Databricks migration pattern is reproducible when you treat the RPD as a graph of joins, hierarchies, and security predicates rather than as a black box. The result is a cleaner semantic model on a platform that supports SQL, ML, streaming, and agentic AI — instead of a single-purpose BI server.</p>
<hr>
<p>The post <a href="https://zorost.com/obiee-to-databricks-migration-pattern/">OBIEE to Databricks: A Practical Migration Pattern</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24300</post-id>	</item>
	</channel>
</rss>
