Performance

Database Indexing & Query Optimization

How indexes work, when to add them, and how to read query execution plans.

Overview

Database indexes are data structures (typically B-trees or hash tables) that allow the query planner to locate rows without scanning the entire table. Without an index, a query performs a sequential scan: O(n). With a B-tree index, lookups are O(log n). Indexes come at a cost: write overhead (every INSERT, UPDATE, DELETE must update the index), storage, and stale statistics that mislead the planner.

Origin

B-tree indexes were introduced to database systems in the early 1970s, formalised by Bayer and McCreight (1972). IBM DB2 (1983) and Oracle (1979) made them standard. PostgreSQL's extended index types (GiST, GIN, BRIN, partial, expression indexes) appeared through the 1990s-2000s. Covering indexes (INCLUDE clause in PostgreSQL 11, 2018) enable index-only scans for frequent query patterns.

Examples

Composite and partial index strategies in PostgreSQL

# db/migrate/20250601_add_performance_indexes.rb
class AddPerformanceIndexes < ActiveRecord::Migration[7.1]
  def change
    # Composite index: column order matters
    # Query: WHERE customer_id = ? AND status = ? ORDER BY created_at DESC
    # Most selective column first (customer_id), then filter (status), then sort (created_at)
    add_index :orders, [:customer_id, :status, :created_at],
      name: 'idx_orders_customer_status_created'

    # Partial index: indexes only active subscriptions, not the full table
    # Dramatically smaller index for queries that always filter on active = true
    add_index :subscriptions, :plan_id,
      where: "status = 'active'",
      name: 'idx_subscriptions_plan_active'

    # Covering index (PostgreSQL 11+): includes non-key columns to avoid table fetch
    # Query: SELECT id, email FROM users WHERE company_id = ? - no heap fetch needed
    add_index :users, :company_id,
      include: [:id, :email],
      name: 'idx_users_company_covering'

    # GIN index for full-text search and JSONB containment queries
    add_index :products, :metadata, using: :gin,
      name: 'idx_products_metadata_gin'

    # Expression index: index on lower(email) for case-insensitive lookups
    add_index :users, 'lower(email)',
      unique: true, name: 'idx_users_email_lower_unique'
  end
end

Composite index column order follows the rule: equality conditions first (customer_id = ?), then range/filter conditions (status IN ?), then sort columns (created_at DESC). The index only accelerates queries that use a prefix of its columns.

Using EXPLAIN ANALYZE to validate index use

# In Rails console or a migration debug session
# Run EXPLAIN ANALYZE to see actual execution plan

result = ActiveRecord::Base.connection.execute(<<~SQL)
  EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
  SELECT id, total_cents, status
  FROM orders
  WHERE customer_id = 42
    AND status = 'confirmed'
  ORDER BY created_at DESC
  LIMIT 20
SQL

plan = JSON.parse(result.first['QUERY PLAN']).first
# Look for:
# "Node Type": "Index Scan" (good) vs "Seq Scan" (potentially bad)
# "Actual Rows": vs "Plan Rows" (large discrepancy = stale statistics)
# "Shared Hit Blocks": data served from buffer cache (fast)
# "Shared Read Blocks": data read from disk (slow)
# Run: ANALYZE orders; to update statistics after large data changes

puts "Plan type: #{plan.dig('Plan', 'Node Type')}"
puts "Actual rows: #{plan.dig('Plan', 'Actual Rows')}"
puts "Total cost: #{plan.dig('Plan', 'Total Cost')}"

BUFFERS shows whether data was served from PostgreSQL's shared buffer cache (Shared Hit Blocks) or disk (Shared Read Blocks). A large ratio of reads to hits indicates the working set does not fit in shared_buffers and the system is I/O-bound.

Use Cases

  • 01Foreign key columns used in JOIN conditions (customer_id, order_id) must be indexed to avoid sequential scans of the related table
  • 02Columns used in WHERE clauses in high-frequency queries (status, created_at range, email lookups)
  • 03Unique constraints implemented via unique indexes enforce data integrity and provide O(log n) uniqueness checks
  • 04Full-text search on text columns using GIN indexes with to_tsvector() eliminates the need for LIKE %query% patterns

When Not to Use

  • //Do not index columns with very low cardinality (boolean flags with 50/50 split); the query planner will choose a sequential scan anyway because reading the entire index plus heap is slower than a heap-only scan
  • //Do not add indexes speculatively on tables that are write-heavy; each extra index adds overhead to INSERT/UPDATE/DELETE operations
  • //Do not create indexes on small tables (under 1,000 rows); the planner will correctly choose sequential scans and the index adds overhead without benefit

Technical Notes

  • PostgreSQL's VACUUM ANALYZE updates table statistics used by the planner; autovacuum runs automatically but may lag on high-write tables. Manual ANALYZE after large bulk imports is often necessary
  • Index bloat accumulates over time from UPDATE and DELETE operations; dead tuples remain in the index until VACUUM runs. REINDEX CONCURRENTLY rebuilds the index without locking the table (PostgreSQL 12+)
  • Covering indexes (INCLUDE columns) are only useful for index-only scans; PostgreSQL must still check the visibility map to verify the row's transaction visibility. The visibility map is updated by VACUUM
  • BRIN (Block Range Index) indexes store min/max values per block range; they are tiny and effective for naturally-ordered data (time-series tables, monotonically increasing IDs) but ineffective for randomly distributed values