Database Denormalization
Strategically duplicating data to accelerate read queries at the cost of write complexity.
Overview
Denormalisation intentionally introduces redundancy into a relational schema by storing derived or duplicated data to improve read performance. Normalised schemas (3NF) minimise redundancy but require JOIN operations; denormalised schemas duplicate data to serve common queries from a single table. The trade-off is faster reads at the cost of more complex writes, storage, and risk of inconsistency.
Origin
Relational normalisation theory was developed by Edgar Codd at IBM (1970-1974). Third Normal Form (3NF) became the standard for OLTP databases. Denormalisation as a deliberate performance technique emerged in the 1980s as relational databases scaled. Data warehousing (star/snowflake schemas) in the 1990s institutionalised denormalisation for analytics. Cassandra and DynamoDB enforced denormalisation by design in the 2010s.
Examples
Counter cache and pre-computed totals in Rails migration
class AddDenormalisedFields < ActiveRecord::Migration[7.1]
def change
# Counter cache: avoid COUNT query on every post display
# Without: Post.all.map { |p| [p.id, p.comments.count] } = N+1
# With: Post.select(:id, :comments_count) = single query, no JOIN
add_column :posts, :comments_count, :integer, default: 0, null: false
add_column :posts, :likes_count, :integer, default: 0, null: false
# Backfill existing data
Post.find_each do |post|
Post.where(id: post.id).update_all(
comments_count: post.comments.count,
likes_count: post.likes.count
)
end
# Pre-computed order total: avoid SUM(line_items.price * qty) on every display
add_column :orders, :subtotal_cents, :integer, null: false, default: 0
add_column :orders, :total_cents, :integer, null: false, default: 0
# Customer order summary: avoid expensive aggregation per customer page load
add_column :customers, :total_orders_count, :integer, default: 0, null: false
add_column :customers, :lifetime_value_cents, :integer, default: 0, null: false
end
endActiveRecord counter_cache: true in the belongs_to declaration automatically updates the parent's counter column on association create/destroy. For custom counters, use after_create/after_destroy callbacks or DB triggers for consistency.
Materialised view for complex reporting query in PostgreSQL
// Materialised views: a pre-computed result set stored as a table
// Refreshing is an explicit operation (REFRESH MATERIALIZED VIEW)
const createMaterialisedView = async (pool: import('pg').Pool) => {
await pool.query(`
CREATE MATERIALIZED VIEW daily_revenue_summary AS
SELECT
DATE_TRUNC('day', o.placed_at) AS day,
p.category,
COUNT(o.id) AS order_count,
SUM(o.total_cents) AS revenue_cents,
AVG(o.total_cents)::int AS avg_order_cents,
COUNT(DISTINCT o.customer_id) AS unique_customers
FROM orders o
JOIN order_line_items li ON li.order_id = o.id
JOIN products p ON p.id = li.product_id
WHERE o.status = 'fulfilled'
GROUP BY 1, 2
WITH DATA
`);
// Index on the materialised view for fast date-range queries
await pool.query(
'CREATE INDEX idx_revenue_summary_day ON daily_revenue_summary (day DESC)'
);
};
// Refresh daily (in a cron job or pg_cron)
const refreshView = async (pool: import('pg').Pool) => {
// CONCURRENTLY does not block reads but requires a unique index
await pool.query('REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue_summary');
};REFRESH MATERIALIZED VIEW CONCURRENTLY (PostgreSQL 9.4+) creates a new version of the view in the background and swaps atomically, preventing read locks during refresh. Without CONCURRENTLY, reads are blocked for the duration of the refresh.
Use Cases
- 01Dashboard and reporting queries that aggregate millions of rows; materialised views pre-compute the result
- 02Counter caches on association counts displayed in list views (post.comments_count, product.reviews_count)
- 03Analytics databases (warehouse/OLAP) where the read:write ratio is very high and normalisation overhead is unjustified
- 04Search indexes (Elasticsearch, Algolia) that are denormalised copies of database records optimised for full-text and faceted search
When Not to Use
- //Do not denormalise OLTP tables with high write volumes where maintaining consistency across duplicated columns adds significant overhead and risk
- //Do not denormalise prematurely; normalise first, identify slow queries via profiling, then denormalise only the specific access patterns that are bottlenecks
- //Do not denormalise without implementing mechanisms to keep the redundant data consistent; inconsistent denormalised data is worse than a slow normalised query
Technical Notes
- PostgreSQL REFRESH MATERIALIZED VIEW WITH NO DATA creates the view structure without populating it; the first REFRESH populates it. CONCURRENTLY requires a unique index on the view
- Apache Parquet and Apache ORC are columnar storage formats used in data lakes and warehouses; they are inherently denormalised and compressed, enabling very fast analytical queries via column pruning and predicate pushdown
- Event sourcing enables denormalisation without consistency risk: the event log is the source of truth; materialised views (read models) are derived from events and can be rebuilt deterministically. This is the CQRS + event sourcing pattern
- Write-time fan-out (used by Twitter's timeline service) is an extreme form of denormalisation: when a tweet is posted, it is written to the follower timelines of all N followers immediately, enabling O(1) timeline reads at the cost of O(N) writes
More in Performance