The Evolution of Data Modeling
An interactive guide to the Kimball methodology and its application with Apache Iceberg and Delta Lake.
Kimball Fundamentals
This section introduces the foundational concepts of the Kimball methodology. It's a business-focused, bottom-up approach to data warehousing that prioritizes user understandability, query performance, and adaptability, making it an enduring framework for business intelligence.
Core Principles and Philosophy
At its heart, the Kimball methodology is a business process-oriented approach. Development commences by focusing on critical business processes offering the highest value. The primary objective is to structure data to directly facilitate analysis by business users, optimizing for retrieval speed even with substantial data volumes.
This business-first orientation means data models are built around articulated business needs, not just existing data structures. An incremental build-out of the data warehouse, starting with individual data marts, allows for rapid delivery of analytical capabilities. These marts integrate into a broader enterprise data warehouse via conformed dimensions and facts, fostering a "bottom-up" style. Such early value delivery secures stakeholder buy-in and aligns with agile principles.
A foundational principle is "ease of understanding." The clear separation between quantitative measures (facts) and descriptive context (dimensions) makes the data warehouse accessible to a broad audience, empowering data democratization and reducing reliance on specialized IT/BI teams. This fosters a culture of self-service analytics and enhances data literacy.
Key Advantages of the Kimball Approach
- Optimal Query Performance: Denormalized dimensional models (star schemas) minimize joins, facilitating rapid query execution even against very large datasets.
- Flexibility and Extensibility: Modular design allows adding new dimensions or facts without disruptive changes, accommodating evolving business needs and ensuring long-term viability.
- Ease of Understanding: Intuitive demarcation between facts and dimensions makes the structure comprehensible for non-technical users, a consistently cited major benefit.
- Scalability: Dimensional structures inherently handle increasing data volumes and query complexity, provided the underlying infrastructure is also scalable.
- Data Consistency and Integrity: Structured models with conformed dimensions promote better data quality and consistency across subject areas.
- Cost Reduction: Long-term benefits from ease of maintenance, scalability, flexibility, and improved user productivity lower the total cost of ownership (TCO). This includes reduced training overhead, faster development cycles, and potentially lower computational resource consumption.
- Incremental Development: The iterative approach often leads to lower upfront costs compared to methodologies requiring a comprehensive enterprise-wide model first.
Kimball vs. Inmon: Two Philosophies
Kimball (Bottom-Up)
Philosophy: Focuses on individual business processes, developing data marts with dimensional models (star schemas). Enterprise DW emerges from integrated marts via conformed dimensions. Emphasizes rapid delivery of value.
Modeling: Employs dimensional modeling (facts/dimensions), leading to denormalized structures optimized for querying.
Implementation: Generally quicker cycles, greater flexibility to adapt to changing business requirements due to incremental deployment.
Inmon (Top-Down)
Philosophy: Proposes a centralized, normalized Enterprise Data Warehouse (EDW) as the single source of truth (often 3NF). Departmental data marts are derived from this EDW.
Modeling: Utilizes normalized data models (e.g., 3NF) for the central EDW to reduce redundancy and improve integrity, potentially leading to complex queries.
Implementation: Prioritizes enterprise-wide consistency from the outset, typically a longer and more costly initial undertaking.
Considerations & Hybrids:
The choice often reflects strategic priorities: Kimball for speed-to-value and departmental needs, Inmon for comprehensive central governance. Hybrid architectures (e.g., Data Vault feeding Kimball marts) are increasingly common, aiming to combine strengths.
A critique of Kimball regarding a "loss of a single source of truth" often stems from improper implementation. Well-executed Kimball architectures with rigorously governed conformed dimensions create an integrated, consistent view across data marts.
Anatomy of the Kimball Model
Explore the core components: fact tables (measurements) and dimension tables (context). Understand their characteristics, common types, and how they form various schema patterns.
Fact Tables: The Heart of the Model
Fact tables store the performance measurements from business processes. They contain quantitative data for analysis.
- Numerical Measures: Hold quantitative data (e.g., sales revenue, units sold).
- Foreign Keys to Dimensions: Link to dimension tables, providing context for facts.
- Composite Primary Key: Usually formed by a subset (often all) of foreign keys from linked dimensions.
- Grain Definition: Crucially defines what a single row represents (e.g., "an individual line item on a sales transaction"). This is the most critical design decision, dictating detail level and analytical capabilities.
Measures are classified by their additivity:
- Additive: Can be summed across all associated dimensions (e.g., `sales_quantity`). Most flexible.
- Semi-Additive: Can be summed across some dimensions but not others (e.g., account balances, inventory levels - not additive across time).
- Non-Additive: Cannot be meaningfully summed (e.g., ratios, percentages, temperatures). Best practice is to store the numerator and denominator as additive facts and calculate the ratio in BI tools or queries (see the sketch after this list).
Common fact table types:
- Transaction Fact Tables: Most common; each row is a single transaction/event (e.g., sales line item). Most granular.
- Periodic Snapshot Fact Tables: Capture status at predefined intervals (e.g., monthly account balances).
- Accumulating Snapshot Fact Tables: Track process lifecycle with multiple milestones (e.g., order fulfillment). A single row represents the entire lifecycle, updated as milestones are met.
- Factless Fact Tables: No numeric measures; record event occurrence or coverage (e.g., student attendance, promotion eligibility). The "fact" is the row's existence.
- Aggregated Fact Tables (Summary Tables): Contain pre-summarized facts from lower-grain tables to improve query performance.
- Consolidated Fact Tables (Supermarket Tables): Combine facts from different processes sharing the same grain and dimensionality.
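As a minimal sketch of the ratio guidance above (the `discount_amount` and `gross_amount` columns and the `dw.dim_product` table are hypothetical), the additive components live in the fact table and the non-additive rate is derived at query time, so it aggregates correctly at any grain:
-- Store additive pieces; compute the ratio only after summation
SELECT
  p.category,
  SUM(f.discount_amount) / NULLIF(SUM(f.gross_amount), 0) AS discount_rate
FROM dw.fact_sales f
JOIN dw.dim_product p
  ON p.product_key = f.product_key
GROUP BY p.category;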
Dimension Tables: Providing Context
Dimension tables provide descriptive context (who, what, where, when, why, how) for the facts.
- Descriptive Attributes: Rich in textual attributes (e.g., `product_name`, `customer_address`).
- Denormalization: Often deliberately denormalized (combining attributes from related entities) to optimize query performance by reducing joins and simplifying the model.
- Hierarchies: Contain natural drill-down paths (e.g., Category > Subcategory > Product).
- Surrogate Keys: System-generated unique integer primary keys. Insulate from operational key changes, integrate data from multiple sources, and are essential for SCD Type 2.
Common dimension patterns:
- Degenerate Dimensions: Dimension keys (e.g., `order_number`) stored in the fact table without a corresponding dimension table, because the related attributes live elsewhere or the key itself is the only information needed.
- Junk Dimensions: Group miscellaneous, low-cardinality flags/indicators into a single dimension to avoid many small foreign keys in the fact table.
- Role-Playing Dimensions: A single physical dimension (e.g., `DimDate`) referenced multiple times in a fact table, each with a different logical role (e.g., `OrderDateKey`, `ShipDateKey`).
- Conformed Dimensions: Defined once and shared consistently across multiple fact tables/data marts (same structure, attributes, definitions, content). Cornerstone of integrated enterprise DW, enabling "drill-across" reporting. Requires strong governance.
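A minimal sketch of the role-playing pattern, using hypothetical tables `dw.fact_orders` and `dw.dim_date` with `order_date_key` and `ship_date_key` foreign keys: the single physical date dimension is joined twice under different aliases, once per role.
SELECT
  order_dt.calendar_month AS order_month,
  ship_dt.calendar_month  AS ship_month,
  COUNT(*)                AS order_lines
FROM dw.fact_orders f
JOIN dw.dim_date order_dt ON order_dt.date_key = f.order_date_key  -- role: order date
JOIN dw.dim_date ship_dt  ON ship_dt.date_key  = f.ship_date_key   -- role: ship date
GROUP BY order_dt.calendar_month, ship_dt.calendar_month;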
Schema Types in Dimensional Modeling
Star Schema: The Foundation
The most common pattern. A central fact table connects directly to denormalized dimension tables. This structure is simple, intuitive for users, offers high query performance by minimizing joins, and is well-supported by BI tools.
Example star schema: a central Sales fact table linked directly to the Product, Date, Customer, and Store dimension tables.
Schema Comparison Summary
| Feature | Star Schema | Snowflake Schema | Fact Constellation (Galaxy) |
|---|---|---|---|
| Key Characteristics | Central fact table, directly connected dimensions | Fact table, normalized (hierarchical) dimension tables | Multiple fact tables sharing common dimension tables |
| Dimension Structure | Denormalized, single table per dimension | Normalized, multiple related tables per logical dimension | Can be a mix; shared dimensions are key |
| Number of Tables | Fewer | More | Many (multiple stars interconnected) |
| Join Complexity | Low | High | Variable, can be high for cross-fact analysis |
| Query Performance | Generally faster | Generally slower due to more joins | Variable, depends on query complexity |
| Data Redundancy | Higher in dimensions | Lower in dimensions | Variable, depends on individual star components |
| Ease of Understanding | High | Moderate | Moderate to High |
| Typical Use Cases | General BI, data marts, departmental analytics | Complex hierarchies, very large dimensions, storage limits (less common now) | Integrated enterprise analytics, cross-process analysis |
Handling Changes: Slowly Changing Dimensions (SCD)
Dimension attributes change over time (e.g., a customer moves to a new address). SCDs are techniques to manage these historical changes: Type 1 overwrites the old value, Type 2 adds a new row with effective dates and a current-row flag to preserve full history, and Type 3 adds a column to retain the prior value. The choice of SCD type is a critical business decision based on the analytical need to track history.
The Kimball Design Process
Kimball's methodology relies on a structured lifecycle and a pragmatic four-step process for designing dimensional models, ensuring relevance and effectiveness.
The Kimball Lifecycle Overview
A comprehensive "bottom-up" methodology for DW/BI systems, outlining tasks from planning to growth. Key phases include:
- Program/Project Planning: Define scope, plan project, establish management.
- Business Requirements Definition: Critical phase engaging users to understand analytical needs, KPIs, often resulting in an enterprise bus matrix.
- Technology Track (Parallel): Technical architecture design, product selection, and installation.
- Data Track (Parallel): Core modeling and ETL:
- Dimensional Modeling: Translating requirements into logical star schemas (the 4-step process).
- Physical Design: Translating logical models to physical database design.
- ETL Design and Development: Extracting, transforming, and loading data (Kimball et al. suggest 34 ETL subsystems).
- Business Intelligence Application Track (Parallel): Developing end-user reports, dashboards, OLAP cubes.
- Deployment: Testing, documentation, user training, establishing support.
- Maintenance: Ongoing data reconciliation, performance tuning, security.
- Growth: Anticipating future projects (new data marts, new sources), aligning with the iterative nature.
A notable aspect is parallelism: after requirements, Technology, Data, and BI Application tracks can proceed concurrently, shortening timelines and delivering value faster.
The 4-Step Dimensional Design Process
This pragmatic, business-driven sequence ensures models are relevant:
1. Select the Business Process: Choose the measurement event to model (e.g., sales order line items).
2. Declare the Grain: State exactly what a single fact table row represents, before choosing dimensions or facts.
3. Identify the Dimensions: Determine the descriptive context (who, what, where, when, why, how) that applies at the declared grain.
4. Identify the Facts: Determine the numeric measurements, each consistent with the declared grain.
This four-step process is typically iterative, with refinements made as understanding of business requirements and data realities deepens.
Applicability and Limitations
While highly effective, it's important to understand Kimball's strengths, ideal use cases, and potential limitations.
Strengths & Ideal Use Cases
- User-Centricity & Ease of Understanding: Intuitive star schemas align with business thinking, lowering adoption barriers and empowering self-service.
- Query Performance: Denormalized structures optimize for fast queries crucial for interactive analysis.
- Agile & Incremental Development: Bottom-up approach delivers value quickly through individual data marts.
- Flexibility & Extensibility: Easily extend models without disruption.
- Broad BI Tool Support: Most tools are optimized for star schemas.
- Ideal For: Departmental data marts, BI for specific subject areas (sales, marketing), scenarios where speed-to-market is key.
Potential Limitations & Considerations
- Enterprise Integration Challenges: Risk of data silos if conformed dimensions and enterprise bus architecture are not diligently managed (often a governance issue, not a methodology flaw).
- Data Redundancy: Denormalization leads to repeated attributes in dimensions. Impact diminished by cheap storage and columnar database compression.
- Handling Highly Complex Relationships: Extremely complex many-to-many relationships may require intricate ETL (e.g., bridge tables) or risk being oversimplified in a pure dimensional model.
- Update Anomalies/Latency (Historically): Updates to dimensions (SCDs) require careful ETL. Older architectures could face issues if not handled meticulously.
- Initial Enterprise-Wide View: May not provide a complete enterprise view from day one unless guided by an overall information strategy (e.g., bus matrix).
- Legacy Data Integration: Can present significant ETL challenges for complex legacy data.
- Ambiguity in Process Mapping: Challenging if business processes are poorly defined or data ownership is unclear.
Many limitations can be mitigated with careful planning, strong data governance, and disciplined application of Kimball's full recommended practices.
The Modern Data Lakehouse & Key Technologies
The Data Lakehouse combines data lake flexibility with data warehouse management features. Open table formats like Apache Iceberg and Delta Lake are pivotal, enabling reliable Kimball models directly on the lake.
Apache Iceberg 🧊
An open table format for petabyte-scale analytical datasets in data lakes, bringing database-like reliability via robust metadata.
Core Concepts:
- Table Format, Not Storage Engine: Defines file organization (Parquet, ORC, Avro) for engines like Spark, Trino, Flink.
- Snapshots: Tracks table state; each modification creates a new snapshot representing the table at a point in time (enables time travel, atomicity).
- Hierarchical Metadata: Metadata files (schema, partition spec) -> Manifest lists (pointers to manifests, partition ranges) -> Manifest files (track data files, paths, stats).
- Iceberg Catalog: Tracks current metadata pointer (e.g., Hive Metastore, Glue, Nessie); atomic updates enable ACID.
Key Benefits:
- ACID Transactions: Reliable concurrent reads/writes, consistent table states.
- Schema Evolution: Robust changes (add, drop, rename, reorder, type change) often without rewriting data.
- Time Travel & Rollback: Query past versions or revert errors (see the sketch after this list).
- Efficient Partitioning: Hidden partitioning (query by value, auto-pruning), partition evolution without rewrite.
- Performance Optimization: Aggressive predicate pushdown and data skipping.
- Decoupled Storage/Compute: Use various engines on same data in open storage.
- Open Standard & Ecosystem: Apache project, broad support, avoids vendor lock-in.
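A few Spark SQL sketches of how these capabilities surface to users. They assume the `dw.fact_sales` table defined in the implementation section later in this guide, a Spark session configured with an Iceberg catalog and the Iceberg SQL extensions, and illustrative snapshot IDs and timestamps:
-- Inspect snapshot history through Iceberg's metadata tables
SELECT snapshot_id, committed_at, operation FROM dw.fact_sales.snapshots;
-- Time travel: query the table as of a snapshot or a point in time
SELECT COUNT(*) FROM dw.fact_sales VERSION AS OF 1234567890123456789;
SELECT COUNT(*) FROM dw.fact_sales TIMESTAMP AS OF '2025-01-01 00:00:00';
-- Partition evolution: new writes use monthly partitions, existing files keep their old layout
ALTER TABLE dw.fact_sales ADD PARTITION FIELD months(transaction_date);
ALTER TABLE dw.fact_sales DROP PARTITION FIELD days(transaction_date);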
Delta Lake 💧
An open-source storage layer adding ACID, scalable metadata, and reliability to data lakes via a transaction log for Parquet files.
Key Features:
- ACID Transactions: Ensures data reliability via a file-based transaction log.
- Scalable Metadata Handling: Efficiently manages metadata for very large tables.
- Time Travel (Data Versioning): Query previous versions for audits, rollbacks.
- Schema Enforcement & Evolution: Enforces schema on write, supports adding columns and other compatible changes.
- Full DML Support: `UPDATE`, `DELETE`, `MERGE INTO` crucial for SCDs.
- Unified Batch & Streaming: Serves as sink for streaming and source for batch.
- Optimizations: Data skipping, Z-Ordering, and Liquid Clustering (Databricks) enhance query performance (see the sketch after this list).
- Strong Spark Integration: Natural choice for Spark-centric ETL workloads.
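A minimal Spark SQL sketch of these features against a hypothetical Delta table `sales.fact_sales`; the time-travel syntax assumes Spark 3.3+ with Delta 2.1+ (or Databricks), and `OPTIMIZE ... ZORDER BY` assumes Delta 2.0+ (or Databricks):
-- Audit the table's change history from the transaction log
DESCRIBE HISTORY sales.fact_sales;
-- Time travel: query an earlier version or point in time
SELECT COUNT(*) FROM sales.fact_sales VERSION AS OF 42;
SELECT COUNT(*) FROM sales.fact_sales TIMESTAMP AS OF '2025-01-01';
-- Compact small files and co-locate rows on a common join/filter key
OPTIMIZE sales.fact_sales ZORDER BY (customer_key);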
Implementing Kimball Models with Apache Iceberg
Iceberg provides the warehouse-grade guarantees a Kimball model needs directly on the lake:
- Fact/Dimension Tables: Created as Iceberg tables and queryable by any supporting engine (Spark, Trino, Flink, and others).
- DML for SCDs: `MERGE INTO`, `UPDATE`, and `DELETE` executed through those engines support SCD processing.
- Schema & Partition Evolution: Add dimension attributes or change a fact table's partition spec without rewriting data.
- Time Travel: Snapshot-based queries support auditing and reproducible historical analysis.
The "Practical Implementation with Iceberg & Spark SQL" section later in this guide walks through concrete examples.
Implementing Kimball Models on Delta Lake
Delta Lake is also highly suitable for Kimball models:
- Fact/Dimension Tables: Created as Delta tables (default in Databricks).
- DML for SCDs: Robust `UPDATE`, `DELETE`, and especially `MERGE INTO` simplify SCD Type 2 logic declaratively in SQL, reducing ETL complexity (a Type 1 sketch follows this list; a Type 2 pattern appears in the Iceberg implementation section).
- Schema Evolution/Enforcement: Tables adapt to changes (e.g., new facts/attributes) without full rewrites; enforcement ensures data quality.
- Time Travel: Beneficial for auditing dimension changes and historical analysis.
- Performance Optimizations: Features like Liquid Clustering (Databricks Delta) automate data layout for fact/dimension joins, and `ANALYZE TABLE` collects statistics for the query optimizer.
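A minimal sketch of the simpler SCD Type 1 (overwrite-in-place) pattern with `MERGE INTO`, assuming a hypothetical Delta dimension `sales.dim_customer` and a staging view `staging.customer_updates` keyed by `customer_id`; surrogate key generation is omitted for brevity, and the Type 2 pattern shown later for Iceberg works the same way on Delta:
MERGE INTO sales.dim_customer AS t
USING staging.customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET customer_name = s.customer_name,  -- overwrite changed attributes; no history kept
             address = s.address,
             city = s.city,
             state = s.state
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name, address, city, state)
  VALUES (s.customer_id, s.customer_name, s.address, s.city, s.state);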
Delta Lake UniForm: Bridging Formats
UniForm allows Delta tables to be read by Iceberg (and Hudi) clients, promoting interoperability.
- Mechanism: Leverages common Parquet data files. UniForm asynchronously generates Iceberg metadata (JSONs, manifests) after Delta writes. Both Delta and Iceberg metadata point to the same data files (no data duplication).
- Benefit: Organizations using Delta can allow Iceberg ecosystem tools to access data without migration or duplicate pipelines. E.g., Delta tables on AWS queried by Redshift (via Iceberg support).
- Consideration: Iceberg access is read-only (writes via Delta). Asynchronous metadata generation means a brief delay (seconds to minutes) before changes are visible to Iceberg clients.
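A minimal enablement sketch for a new table. The property names below reflect Databricks documentation for UniForm at the time of writing and should be treated as illustrative; check current docs before relying on them:
CREATE TABLE sales.dim_customer_uniform (
  customer_id INT,
  customer_name STRING
) USING DELTA
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',             -- required compatibility mode
  'delta.universalFormat.enabledFormats' = 'iceberg'  -- generate Iceberg metadata alongside the Delta log
);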
Technology Comparison: Iceberg vs. Delta Lake
Both Iceberg and Delta Lake enable Kimball models on the data lake, but each has different strengths. The feature table below provides a detailed comparison of their capabilities for dimensional modeling.
Feature Breakdown for Kimball Modeling
| Feature | Apache Iceberg | Delta Lake | Implications for Kimball Modeling |
|---|---|---|---|
| Core Architecture | Tracks table state via snapshots, hierarchical metadata (metadata files, manifest lists, manifest files). | Extends Parquet with a file-based transaction log (Delta Log). | Both provide reliable structures for facts/dimensions. Iceberg's metadata offers fine-grained file tracking. Delta Log is central. |
| ACID Transactions | Yes, via atomic metadata swaps in catalog. | Yes, via transaction log. | Essential for integrity in fact/dimension tables, especially during ETL and SCD processing. |
| Schema Evolution | Highly flexible: add, drop, rename, reorder columns, change types, update partition spec, often without data rewrite. | Strong: add columns, change some types (limitations for complex changes), schema enforcement on write. | Iceberg's flexibility is advantageous for evolving dimension attributes/fact measures. Delta's enforcement is key for data quality. |
| Partitioning | Hidden partitioning, partition evolution (change spec without rewrite). | Standard Hive-style partitioning; Liquid Clustering (Databricks) for adaptive, automatic data layout. | Iceberg's partition evolution is powerful for long-term fact table management. Liquid Clustering offers advanced physical optimization. |
| SCD Handling Support | DML via engines (MERGE, UPDATE, DELETE); change log view concept for SCD Type 2. | Mature DML support (MERGE, UPDATE, DELETE) deeply integrated with Spark. | Both enable robust SCDs. Delta's MERGE is well-established. Iceberg's engine-based DML and potential CDC offer flexibility. |
| Time Travel | Yes, query by snapshot ID or timestamp. | Yes, query by version or timestamp. | Crucial for auditing, debugging ETL, historical analysis of dimension changes, complementing explicit SCD versioning. |
| Concurrency Control | Optimistic Concurrency Control (OCC). | Optimistic Concurrency Control (OCC), with conflict resolution. | Both support concurrent reads/writes, vital for busy DW environments. |
| Ecosystem/Query Engine Support | Broad native support: Spark, Trino, Flink, Presto, Hive, Impala, Dremio, Snowflake, BigQuery, Redshift. | Strongest with Apache Spark and Databricks; growing support in other engines (Trino, Flink, Presto). | Iceberg's wider native compatibility offers more flexibility. Delta excels in Spark-centric environments. |
| Openness/Governance | Apache Software Foundation project, open standard, community-driven. | Linux Foundation project, open-source, initially developed by Databricks. | Both open-source. Iceberg often seen as more vendor-neutral. Delta has strong Databricks backing. |
| UniForm-like Interoperability | Native format, read by various engines. | Delta Lake UniForm allows Delta tables to be read as Iceberg/Hudi tables (read-only). | UniForm enables Delta-based Kimball models to be accessed by Iceberg-native tools without data duplication. |
Governance & Conformed Dimensions in the Lakehouse
Conformed dimensions are the cornerstone of integrated analytics in Kimball's methodology. Their effective management is even more critical in the diverse data lakehouse landscape, requiring strong governance and strategic implementation.
The Critical Role of Conformed Dimensions
Conformed dimensions (e.g., `DimDate`, `DimProduct`, `DimCustomer`) have consistent interpretations, attributes, and meanings across multiple fact tables or data marts. They are the semantic glue enabling integrated analysis ("drill-across") and a holistic business view.
Key Benefits:
- Analytic Consistency: Ensures organization-wide use of same definitions for key entities.
- Reduced Development Costs: Reusing established dimensions speeds up new data mart development.
- Simplified Integration: Provides a framework for integrating new data sources.
Without disciplined conformity, a lakehouse risks becoming isolated data silos, undermining integrated analytics.
Strategies for Lakehouse Implementation (Iceberg/Delta Lake)
- Centralized, Shared Tables: Implement conformed dimensions as dedicated, centrally managed Iceberg or Delta Lake tables, serving as the single source of truth.
- Schema Enforcement & Evolution: Leverage table format features to maintain structural integrity and adapt dimensions (e.g., add attributes) controllably.
- ACID Transactions for Updates: Ensure reliable and consistent updates (new members, SCD changes) to conformed dimensions.
- Centralized Catalog & Metadata Management: Use a robust catalog (Hive Metastore, Glue, Nessie, Unity Catalog) for discoverability, versioning, and managing current state pointers.
- Data Governance & Stewardship: Crucial for success. Involves:
- Cross-functional agreement on definitions, attributes, hierarchies.
- Clear ownership and responsibility for data quality.
- Processes for managing and communicating changes. (Tools like Unity Catalog can aid here).
Technology provides means, but organizational consensus and governance are paramount.
Medallion Architecture: A Layered Best Practice
The Medallion architecture (Bronze, Silver, Gold) provides a natural framework for managing conformed dimensions. They are typically mastered in the Silver layer and consumed by Gold layer data marts.
🥉 Bronze
Raw, unchanged source data.
🥈 Silver
Cleansed, integrated, conformed data. Authoritative conformed dimensions (e.g., `dim_customer_silver`) are mastered here from various sources.
🥇 Gold
Business-centric, aggregated, application-specific data marts (Kimball star schemas) built using Silver layer conformed dimensions for consistent analytics.
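A minimal sketch of the Silver-to-Gold handoff, using hypothetical table names (`silver.sales_cleansed`, `silver.dim_customer`, `gold.fact_sales`) and Iceberg syntax (the same pattern applies to Delta): the Gold fact table resolves surrogate keys against the conformed Silver dimension at load time.
CREATE TABLE gold.fact_sales
USING iceberg
PARTITIONED BY (days(transaction_date))
AS
SELECT
  s.transaction_id,
  s.transaction_date,
  d.customer_key,           -- surrogate key from the conformed dimension
  s.product_id,
  s.sales_amount
FROM silver.sales_cleansed s
JOIN silver.dim_customer d
  ON d.customer_id = s.customer_id
 AND d.is_current = true;   -- for late-arriving facts, join on start_date/end_date ranges instead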
Challenges & Further Best Practices
Challenges:
- Maintaining Consistency: Difficult in large, distributed environments; risk of "semantic drift."
- Organizational Buy-in: Gaining enterprise-wide agreement on definitions requires collaboration and sponsorship.
- Complexity/Maintenance: ETL for conformed dimensions integrating many sources can be complex.
- Performance at Scale: Very large conformed dimensions require query optimization.
Best Practices (Continued):
- Centralized Data Governance Body: Essential for defining, approving, and overseeing conformed dimensions.
- Utilize a Robust Data Catalog: Critical for discovery, lineage, versioning, access control (e.g., Nessie, Unity Catalog).
- Automate ETL & Data Quality: Implement robust, automated processes with data quality checks.
- Iterative Development: Start with critical dimensions, expand over time.
- Clear Documentation & Communication: Document definitions, hierarchies, rules; communicate changes effectively.
Practical Implementation with Iceberg & Spark SQL
Spark SQL examples for a Kimball model using Apache Iceberg: table creation, SCD Type 2, schema evolution, and historical querying.
Define `DimCustomer` (SCD Type 2) and `FactSales` (partitioned by date).
-- DimCustomer: Tracks customer history (SCD Type 2)
CREATE TABLE dw.dim_customer (
customer_key BIGINT,
customer_id INT,
customer_name STRING,
address STRING,
city STRING,
state STRING,
is_current BOOLEAN,
start_date DATE,
end_date DATE
) USING iceberg;
-- FactSales: Partitioned by transaction_date for performance
CREATE TABLE dw.fact_sales (
transaction_id BIGINT,
transaction_date DATE,
customer_key BIGINT, -- Surrogate key from DimCustomer
customer_id INT, -- Natural key, useful for some joins
product_key INT,
sales_amount DECIMAL(10, 2)
) USING iceberg
PARTITIONED BY (days(transaction_date));
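The sketches below complete the workflow described above: an SCD Type 2 load for `dw.dim_customer`, a schema evolution, and a historical query. They assume Spark with the Iceberg SQL extensions enabled and a hypothetical staging view `staging.customer_updates` carrying the natural key and current attribute values; the hash-based surrogate key is a simplification for illustration only.
-- SCD Type 2, step 1: expire the current row for customers whose tracked attributes changed
MERGE INTO dw.dim_customer t
USING staging.customer_updates s
  ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND (t.address <> s.address OR t.city <> s.city OR t.state <> s.state) THEN
  UPDATE SET is_current = false, end_date = current_date();
-- SCD Type 2, step 2: insert a new current row for changed and brand-new customers
INSERT INTO dw.dim_customer
SELECT
  xxhash64(s.customer_id, current_date()) AS customer_key,  -- simplistic surrogate key for the sketch
  s.customer_id,
  s.customer_name,
  s.address,
  s.city,
  s.state,
  true               AS is_current,
  current_date()     AS start_date,
  CAST(NULL AS DATE) AS end_date
FROM staging.customer_updates s
LEFT JOIN dw.dim_customer t
  ON t.customer_id = s.customer_id AND t.is_current = true
WHERE t.customer_id IS NULL;  -- no current row exists: the customer is new or was just expired above
-- Schema evolution: add a new dimension attribute without rewriting data
ALTER TABLE dw.dim_customer ADD COLUMNS (loyalty_tier STRING);
-- Historical querying: reconstruct the dimension as of a past point in time
SELECT * FROM dw.dim_customer TIMESTAMP AS OF '2025-01-01 00:00:00';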
Conclusion & Future Outlook
Kimball's dimensional modeling, with its focus on business processes, user understandability, and query performance, remains remarkably relevant. Its principles provide essential semantic clarity for BI, even in modern data lakehouses. While some advocate for flatter "One Big Table" (OBT) structures for specific workloads, OBTs are often best derived from an underlying, well-structured dimensional model which serves as the foundational source of truth and business logic.
Open table formats like Apache Iceberg and Delta Lake are transformative, bringing ACID transactions, schema evolution, and time travel to data lakes. This allows robust Kimball models to be implemented directly on cost-effective, scalable lake infrastructure, realizing the data lakehouse vision. They are foundational for a new, flexible paradigm for analytical platforms.
Recommendations for Adoption
1. Embrace Layered Architectures: Adopt structured approaches like the Medallion architecture (Bronze, Silver, Gold). Implement Kimball models in the Gold layer, sourcing cleansed, integrated, and conformed data (including conformed dimensions) from the Silver layer. This promotes organization, data quality, and reusability.
2. Prioritize Conformed Dimension Governance: Integrated enterprise analytics hinges on well-governed conformed dimensions. Establish strong data governance practices, cross-functional agreement, and leverage data catalogs (e.g., Unity Catalog, Glue, Nessie) to manage definitions, schema, quality, and accessibility.
3. Leverage Specific Strengths of Table Formats:
- Apache Iceberg: Excellent for open stacks with diverse query engine needs, requiring flexible schema/partition evolution. Offers broad compatibility and a community-driven standard.
- Delta Lake: Ideal for Spark/Databricks-centric ecosystems, offering deep integration, mature DML support (especially `MERGE` for SCDs), and platform-specific optimizations (e.g., Liquid Clustering).
- Consider Delta Lake UniForm for interoperability if primarily using Delta but needing read access for Iceberg-native tools.
4. Design for Evolvability and Performance: Utilize schema evolution capabilities. Continuously monitor query patterns and data volumes to optimize physical design (partitioning, file sizes via compaction, indexing techniques like Z-Ordering or Bloom filters).
5. Focus on Business Value: Technology is an enabler. The core Kimball principles (starting with business processes, defining clear grains, and building intuitive models) should remain paramount to ensure the data warehouse delivers tangible organizational value.
The choice between Iceberg and Delta Lake (or using them together) should be guided by existing ecosystem, compute engines, technical needs, team skills, and long-term interoperability goals. Kimball's logical design principles can be effectively implemented on either format.