Batch Processing
Grouping operations together to reduce per-unit overhead and network round-trips.
Overview
Batch processing executes operations in groups rather than one-by-one. Database batch inserts/updates reduce query round-trips from N to O(N/batch_size). Application-level batching groups units of work (emails, webhooks, reports) for efficient processing with bounded parallelism. The Sidekiq batch (pro feature), ActiveJob, and BullMQ are common Ruby/Node.js implementations.
Origin
Batch processing predates interactive computing; IBM mainframes in the 1950s-1960s ran jobs from punched cards overnight. The concept carried into database operations with BULK INSERT (SQL Server), COPY (PostgreSQL), and INSERT...SELECT. Apache Spark (2014) and Hadoop (2006) brought distributed batch processing to big data. ActiveRecord's find_each (2010) introduced memory-safe batch iteration for ORMs.
Examples
Bulk insert and upsert in TypeScript with Prisma
interface ProductImport {
sku: string;
name: string;
priceCents: number;
stock: number;
}
// Single inserts: N round-trips to the database
async function importProductsSlow(products: ProductImport[]): Promise<void> {
for (const product of products) {
await prisma.product.create({ data: product }); // N queries
}
}
// Bulk insert: 1 round-trip regardless of N
async function importProductsFast(products: ProductImport[]): Promise<void> {
const BATCH_SIZE = 1000;
for (let i = 0; i < products.length; i += BATCH_SIZE) {
const batch = products.slice(i, i + BATCH_SIZE);
await prisma.product.createMany({
data: batch,
skipDuplicates: true, // Ignore rows with conflicting unique constraints
});
}
}
// Upsert: insert or update on conflict
async function syncProducts(products: ProductImport[]): Promise<void> {
const BATCH_SIZE = 500;
for (let i = 0; i < products.length; i += BATCH_SIZE) {
await prisma.$executeRaw`
INSERT INTO products (sku, name, price_cents, stock)
VALUES ${Prisma.join(
products.slice(i, i + BATCH_SIZE).map(
p => Prisma.sql``
)
)}
ON CONFLICT (sku) DO UPDATE SET
name = EXCLUDED.name,
price_cents = EXCLUDED.price_cents,
stock = EXCLUDED.stock,
updated_at = NOW()
`;
}
}BATCH_SIZE of 500-1000 is a typical sweet spot; too small wastes round-trips, too large risks exceeding PostgreSQL's 65535 parameter limit or causing lock contention. Measure with EXPLAIN ANALYZE at your target batch size.
Memory-safe batch processing with find_each in Rails
# find_each: loads records in batches of 1000 (default)
# Memory usage: O(batch_size), not O(total_records)
class OrderExportJob < ApplicationJob
BATCH_SIZE = 500
def perform(export_id)
export = Export.find(export_id)
csv_rows = []
Order
.includes(:customer, line_items: :product)
.where(status: 'fulfilled', fulfilled_at: export.date_range)
.find_each(batch_size: BATCH_SIZE) do |order|
csv_rows << [
order.id,
order.customer.email,
order.total_cents / 100.0,
order.fulfilled_at.iso8601
]
if csv_rows.size >= BATCH_SIZE
export.append_csv_rows(csv_rows)
csv_rows = []
end
end
export.append_csv_rows(csv_rows) unless csv_rows.empty?
export.mark_complete!
end
endfind_each uses LIMIT/OFFSET under the hood (ActiveRecord 6+ uses primary key keyset for performance). includes inside find_each is applied per-batch, not globally, keeping memory bounded. Accumulating rows and flushing per-batch prevents building a huge in-memory array.
Use Cases
- 01Bulk data imports (CSV uploads, API sync jobs) where processing one record at a time is prohibitively slow
- 02Email and notification sending where queuing individual emails per record creates unnecessary Sidekiq jobs; batch the sends into groups
- 03Nightly aggregate jobs (daily revenue summaries, user activity digests) that process large datasets
- 04Database migrations that update values for millions of rows; batching in transactions prevents long-running transactions and allows progress tracking
When Not to Use
- //Do not batch operations that have per-item side effects (sending an email per record) without idempotency; a failed batch may partially execute with no way to know which items succeeded
- //Do not batch database updates inside a single transaction when the batch is very large; a transaction that updates 100,000 rows holds locks for the full duration
- //Do not use batch processing for real-time operations where individual record latency matters; batch introduces inherent delay equal to the batch collection window
Technical Notes
- PostgreSQL COPY command is the fastest bulk load mechanism: 10-100x faster than batched INSERT statements for large datasets. ActiveRecord does not have native COPY support; the activerecord-import gem provides it via INSERT multi-row syntax
- INSERT ... ON CONFLICT DO UPDATE (PostgreSQL 9.5+, "UPSERT") is atomic and avoids the SELECT-then-INSERT-or-UPDATE race condition. MySQL uses INSERT ... ON DUPLICATE KEY UPDATE; SQLite uses INSERT OR REPLACE
- Batch size optimisation: PostgreSQL's parameter limit is 65535 per statement. For multi-row INSERT, each row with N columns uses N parameters; max batch size = 65535 / N. At 5 columns, max is 13,107 rows per statement
- Sidekiq Pro's Batch feature groups jobs and fires a callback when all complete; it is useful for parallel batch processing with a completion notification. Open-source alternative: maintain a counter in Redis
More in Performance