Performance

Database Denormalization

Strategically duplicating data to accelerate read queries at the cost of write complexity.

Overview

Denormalisation intentionally introduces redundancy into a relational schema by storing derived or duplicated data to improve read performance. Normalised schemas (3NF) minimise redundancy but require JOIN operations; denormalised schemas duplicate data to serve common queries from a single table. The trade-off is faster reads at the cost of more complex writes, storage, and risk of inconsistency.

Origin

Relational normalisation theory was developed by Edgar Codd at IBM (1970-1974). Third Normal Form (3NF) became the standard for OLTP databases. Denormalisation as a deliberate performance technique emerged in the 1980s as relational databases scaled. Data warehousing (star/snowflake schemas) in the 1990s institutionalised denormalisation for analytics. Cassandra and DynamoDB enforced denormalisation by design in the 2010s.

Examples

Counter cache and pre-computed totals in Rails migration

class AddDenormalisedFields < ActiveRecord::Migration[7.1]
  def change
    # Counter cache: avoid COUNT query on every post display
    # Without: Post.all.map { |p| [p.id, p.comments.count] } = N+1
    # With: Post.select(:id, :comments_count) = single query, no JOIN
    add_column :posts, :comments_count, :integer, default: 0, null: false
    add_column :posts, :likes_count, :integer, default: 0, null: false

    # Backfill existing data
    Post.find_each do |post|
      Post.where(id: post.id).update_all(
        comments_count: post.comments.count,
        likes_count: post.likes.count
      )
    end

    # Pre-computed order total: avoid SUM(line_items.price * qty) on every display
    add_column :orders, :subtotal_cents, :integer, null: false, default: 0
    add_column :orders, :total_cents, :integer, null: false, default: 0

    # Customer order summary: avoid expensive aggregation per customer page load
    add_column :customers, :total_orders_count, :integer, default: 0, null: false
    add_column :customers, :lifetime_value_cents, :integer, default: 0, null: false
  end
end

ActiveRecord counter_cache: true in the belongs_to declaration automatically updates the parent's counter column on association create/destroy. For custom counters, use after_create/after_destroy callbacks or DB triggers for consistency.

Materialised view for complex reporting query in PostgreSQL

// Materialised views: a pre-computed result set stored as a table
// Refreshing is an explicit operation (REFRESH MATERIALIZED VIEW)

const createMaterialisedView = async (pool: import('pg').Pool) => {
  await pool.query(`
    CREATE MATERIALIZED VIEW daily_revenue_summary AS
    SELECT
      DATE_TRUNC('day', o.placed_at) AS day,
      p.category,
      COUNT(o.id) AS order_count,
      SUM(o.total_cents) AS revenue_cents,
      AVG(o.total_cents)::int AS avg_order_cents,
      COUNT(DISTINCT o.customer_id) AS unique_customers
    FROM orders o
    JOIN order_line_items li ON li.order_id = o.id
    JOIN products p ON p.id = li.product_id
    WHERE o.status = 'fulfilled'
    GROUP BY 1, 2
    WITH DATA
  `);

  // Index on the materialised view for fast date-range queries
  await pool.query(
    'CREATE INDEX idx_revenue_summary_day ON daily_revenue_summary (day DESC)'
  );
};

// Refresh daily (in a cron job or pg_cron)
const refreshView = async (pool: import('pg').Pool) => {
  // CONCURRENTLY does not block reads but requires a unique index
  await pool.query('REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue_summary');
};

REFRESH MATERIALIZED VIEW CONCURRENTLY (PostgreSQL 9.4+) creates a new version of the view in the background and swaps atomically, preventing read locks during refresh. Without CONCURRENTLY, reads are blocked for the duration of the refresh.

Use Cases

  • 01Dashboard and reporting queries that aggregate millions of rows; materialised views pre-compute the result
  • 02Counter caches on association counts displayed in list views (post.comments_count, product.reviews_count)
  • 03Analytics databases (warehouse/OLAP) where the read:write ratio is very high and normalisation overhead is unjustified
  • 04Search indexes (Elasticsearch, Algolia) that are denormalised copies of database records optimised for full-text and faceted search

When Not to Use

  • //Do not denormalise OLTP tables with high write volumes where maintaining consistency across duplicated columns adds significant overhead and risk
  • //Do not denormalise prematurely; normalise first, identify slow queries via profiling, then denormalise only the specific access patterns that are bottlenecks
  • //Do not denormalise without implementing mechanisms to keep the redundant data consistent; inconsistent denormalised data is worse than a slow normalised query

Technical Notes

  • PostgreSQL REFRESH MATERIALIZED VIEW WITH NO DATA creates the view structure without populating it; the first REFRESH populates it. CONCURRENTLY requires a unique index on the view
  • Apache Parquet and Apache ORC are columnar storage formats used in data lakes and warehouses; they are inherently denormalised and compressed, enabling very fast analytical queries via column pruning and predicate pushdown
  • Event sourcing enables denormalisation without consistency risk: the event log is the source of truth; materialised views (read models) are derived from events and can be rebuilt deterministically. This is the CQRS + event sourcing pattern
  • Write-time fan-out (used by Twitter's timeline service) is an extreme form of denormalisation: when a tweet is posted, it is written to the follower timelines of all N followers immediately, enabling O(1) timeline reads at the cost of O(N) writes