A PostgreSQL 17 Migration Postmortem: WAL Recycling, Silent Timeouts, and the False Economy of Conservative Configs

TL;DR: I migrated tens of GB of data into a PostgreSQL 17 cluster with async replication and discovered the hard way that pg_dump → psql under load is a recipe for disaster if you haven't tuned WAL retention, timeouts, and resource allocation. The replica fell behind, WAL got recycled, replication broke permanently. Index builds got killed by timeouts. Conservative container limits meant the migration ran so slowly it caused more problems than it prevented. This post is everything I wish I'd known before starting.

Let me tell you about a migration that should have been straightforward and was anything but.

The plan was simple: pg_dump from the source, pipe it through psql into a target PostgreSQL cluster, let replication handle the replica. I'd done this before at smaller scales. How hard could it be?

Two broken replicas, three midnight rebuilds, and one very long week later, I had my answer.

This isn't a post about a perfect migration. It's about a messy, humbling one that exposed exactly how fragile async replication is when you're writing data faster than your standby can consume it. It's about timeouts silently murdering index builds mid-restore and leaving you with a partially-loaded database. And it's about how my well-intentioned efforts to protect production resources with conservative configs actually made everything worse.

If you're planning a large PostgreSQL migration with async replication, read this first. It'll save you the night I lost.

The Setup

Data volume: Tens of GB across millions of records
Source: Production PostgreSQL database
Target: PostgreSQL 17 in Docker with two-node async streaming replication (primary + standby for backups)
Pattern: pg_dump → pipe → psql restore
Constraints: Production stays online, data integrity, replication must survive

The async standby's job was straightforward: receive WAL from the primary, replay it, stay ready for backups. It was a plain primary_conninfo and standby_mode setup feeding a backup script. Nothing fancy — just a standard two-node async replication topology.

The migration script ran inside Docker containers with deliberately conservative limits:

DOCKER_RES=(--cpus=1.0 --cpu-shares=256 --memory=512m --blkio-weight=100)
PG_RESTORE_OPTS="-c maintenance_work_mem=256MB -c statement_timeout=30s -c lock_timeout=30s"

Those limits were meant to protect the production workloads running alongside the migration container. In theory, that's the right instinct. In practice, they created a cascade of failures I didn't see coming.

How Replication Broke (and Why I Didn't Notice Until It Was Too Late)

The first sign of trouble was subtle. About thirty minutes into the restore, I checked pg_stat_replication on the primary:

SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS lag_mb,
       state,
       sync_state
FROM pg_stat_replication;

 application_name | lag_mb  |  state  | sync_state
------------------+---------+---------+------------
 standby          |    1247 | catchup | async

Over a gigabyte of lag. And climbing.

The restore was writing data at a furious pace — every COPY statement generating WAL records, every index rebuild churning through write-ahead log segments. The primary was producing WAL far faster than the async standby could download and replay it.

I kept watching, hoping it would stabilize. It didn't.

-- Checked again 15 minutes later
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS lag_mb,
       state
FROM pg_stat_replication;

 application_name | lag_mb  |  state
------------------+---------+---------
 standby          |    4102 | catchup

4 GB of lag. Then, at some point between checks, the standby just... disappeared.

SELECT count(*) FROM pg_stat_replication;

 count
-------
     0

No rows. The replica was gone from the replication view.

The Mechanics of WAL Recycling

Here's what happened under the hood.

PostgreSQL on the primary has a WAL budget. Controlled by max_wal_size (default 1 GB), the server allows WAL segments to accumulate up to that limit before forcing a checkpoint and recycling the oldest segments. During a bulk restore, the primary was generating WAL at many megabytes per second. It hit max_wal_size quickly, triggered a checkpoint, and started recycling WAL segments that had already been checkpointed.

The problem: the async standby hadn't replayed those segments yet.

PostgreSQL does not keep WAL segments around forever just because a replica is behind. Once a segment is before the oldest checkpoint's redo point, it's eligible for recycling. The standby, struggling to keep up, was requesting segments that the primary had already deleted.

The replica log told the story:

2026-05-22 14:32:18 UTC LOG:  started streaming WAL from primary at 3C/AB000000
2026-05-22 14:32:18 UTC FATAL:  requested WAL segment 000000010000003C000000AB has already been removed
2026-05-22 14:32:23 UTC LOG:  started streaming WAL from primary at 3C/AB000000
2026-05-22 14:32:23 UTC FATAL:  requested WAL segment 000000010000003C000000AB has already been removed

Once that error appears, the replica is dead. It cannot catch up. The only fix is to rebuild the standby from scratch — pg_basebackup the entire primary, set up replication again, wait hours for the base backup to transfer and apply. During a migration window. At 2 AM.

I rebuilt that standby twice.

Why Async Replication Is Fragile During Bulk Writes

The root cause is a mismatch in timing. The restore generates WAL at a rate the replica cannot match. The primary, following its checkpoint schedule, recycles WAL segments the replica still needs. Once those segments are gone, the replica has no way to get them back.

Three PostgreSQL settings control this behavior:

-- How much WAL to accumulate before forcing a checkpoint
-- Default: 1GB. During restore: set much higher.
max_wal_size = '1GB'
 
-- Minimum WAL to retain (helps avoid premature recycling)
min_wal_size = '80MB'
 
-- How many megabytes of WAL to retain for replica consumption
-- THIS IS THE CRITICAL ONE
wal_keep_size = '0'    -- Default: 0 = no extra retention

wal_keep_size is the safety net. It tells PostgreSQL to retain that many megabytes of WAL segments beyond what checkpoints would normally recycle. With wal_keep_size = 0 (the default), the primary will recycle any WAL segment as soon as the checkpoint says it's safe — regardless of whether a replica has consumed it.

Set it to a non-zero value, and you give your replica a buffer:

wal_keep_size = '2048MB'   -- Keep 2 GB of extra WAL for lagging replicas

But wal_keep_size has a limit. If the replica falls behind by more than this, it still breaks. The real solution is replication slots:

-- Create a physical replication slot on the primary
SELECT pg_create_physical_replication_slot('standby_slot');
 
-- Configure primary to retain WAL for unused slots
max_slot_wal_keep_size = '8GB'   -- How much WAL to keep for slots

A replication slot tells PostgreSQL: "Keep all WAL segments this slot hasn't consumed, even if it means exceeding max_wal_size." The primary will grow its WAL directory as needed to prevent the replica from falling off.

-- Check slot status
SELECT slot_name, slot_type,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS lag_mb,
       active
FROM pg_replication_slots;

  slot_name   | slot_type | lag_mb | active
--------------+-----------+--------+--------
 standby_slot | physical  |   4102 | t

The risk: If the replica goes down permanently and you have a replication slot, the primary will never recycle WAL for that slot. The WAL directory grows until disk fills. Always monitor slot lag and have a procedure to drop stale slots.

-- Drop a slot if the standby is permanently lost (do this before disk fills!)
SELECT pg_drop_replication_slot('standby_slot');

For my migration, the fix was a combination:

# On the primary, before starting the restore
psql -c "ALTER SYSTEM SET wal_keep_size = '4096MB';"
psql -c "ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';"
psql -c "ALTER SYSTEM SET max_wal_size = '16GB';"
psql -c "SELECT pg_reload_conf();"
 
# Create a replication slot for the standby
psql -c "SELECT pg_create_physical_replication_slot('standby_slot');"
 
# On the standby, configure to use the slot
# In postgresql.conf or recovery.conf:
primary_slot_name = 'standby_slot'

This gave the replica enough breathing room. WAL would be retained even if the primary hit max_wal_size. The replica could lag without breaking irrecoverably.

But WAL retention alone wasn't enough. I also needed to slow down the WAL generation rate so the replica had a chance to keep up. Which leads to the next problem.

When Timeouts Murder Index Builds

After getting replication stable, the next disaster was quieter but equally destructive.

During the psql restore phase, PostgreSQL follows a specific sequence:

Create schema (tables, sequences, constraints without indexes)
Load data via COPY statements
Build indexes (all CREATE INDEX statements at the end)
Add constraints (foreign keys, check constraints)
Run post-load SQL functions

Step 3 is where I got burned.

The pg_dump output places index definitions after all the data. Once the COPY statements finish loading tens of GB of data, PostgreSQL starts building indexes — one at a time, for each index defined in the dump. For a table with 5-6 indexes on tens of GB of data, each index build takes minutes to hours.

And I had set statement_timeout = 5min.

Here's how that plays out:

Data finishes loading. PostgreSQL starts CREATE INDEX idx_users_email ON users(email);
The table has tens of millions of rows. Indexing takes 45 minutes.
At the 5-minute mark, PostgreSQL kills the statement.
The restore session is not dead — just the statement. But psql was running with -v ON_ERROR_STOP=1, so it exits on the first error.
The restore is incomplete. Indexes are missing. The database is in a half-baked state.

The error in the logs:

ERROR:  canceling statement due to statement timeout
CONTEXT:  CREATE INDEX idx_users_email

I didn't catch it immediately because the restore script was running in a screen session and the exit just looked like a generic failure. I restarted the restore from scratch, assuming a transient error. It failed again. Same time. Same index.

Why Index Builds Take So Long

PostgreSQL creates indexes by scanning the entire table, sorting the values, and writing the index structure. For a B-tree index on a column with tens of millions of rows:

Sequential scan of the heap: reads the entire table
External sort of the indexed column values: writes temp files to pgsql_tmp
Index page writes: sequential I/O building the leaf pages

The total time is roughly:

index_time = (table_scan_time + sort_time + index_write_time) / maintenance_work_mem_efficiency

With maintenance_work_mem = 256MB, the sort phase has limited memory and spills to disk aggressively. With maintenance_work_mem = 4GB, the entire sort fits in memory and the index build is significantly faster.

-- During migration: no timeouts, plenty of memory
SET statement_timeout = '0';
SET lock_timeout = '0';
SET idle_in_transaction_session_timeout = '0';
SET maintenance_work_mem = '4GB';

The index that took 45 minutes with maintenance_work_mem = 256MB completed in under 8 minutes with 4GB. The timeout that was killing it would never have fired if I'd tuned memory first.

The fix is brutally simple but easy to forget:

SET statement_timeout = '0';
SET lock_timeout = '0';
SET idle_in_transaction_session_timeout = '0';

Every single one of these needs to be zero during a migration. Not "high". Zero. Any value, even 30min, will eventually be hit by a large index build.

Important caveat: maintenance_work_mem is per-session. If you have multiple sessions creating indexes (e.g., from parallel pg_restore --jobs), multiply the budget. 4GB * 4 sessions = 16GB potential allocation. Make sure your system has the RAM.

The Conservative Config Trap

Here's the irony: I set conservative Docker limits and conservative PostgreSQL configs to protect the production workloads sharing the host. My reasoning was sound — don't let a migration starve production of CPU, memory, or I/O.

But I set them too conservative, and the migration ran so slowly that it caused all the problems I was trying to prevent:

Slower restore → more time with high WAL generation → more replication lag → WAL recycling → broken replica
Slower index builds → statements exceed timeouts → failed restore → restart from scratch → more total time under load
More total migration time → more opportunity for something to break

The Docker container was constrained to --memory=512m with maintenance_work_mem = 256MB. Let's do the math:

256MB  maintenance_work_mem
+ 128MB shared_buffers (default)
+   4MB wal_buffers
+  32MB work_mem (default)
+  64MB OS + filesystem cache overhead
= 484MB → fits in 512MB with 28MB headroom

Twenty-eight megabytes of headroom. Any spike — a sort that exceeds work_mem spilling to temp, a momentary memory allocation for a hash join, even a background process — and the container hits its cgroup limit.

The kernel's OOM killer does not care about PostgreSQL's careful memory management. It terminates the largest process when the cgroup limit is breached. In Docker with PID 1 being PostgreSQL, that means the entire container dies.

# Visible in dmesg on the host:
[12345.678901] oom-kill: constraint=CONSTRAINT_MEMCG
[12345.678902] oom-kill: process=postgres (pid=1234) victim=1234
[12345.678903] oom-kill: memory cgroup=/docker/<container_id>

And the restore logs show nothing useful:

psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: Connection terminated unexpectedly

No "OOM killed," no "out of memory," no hint at all. Just a dropped connection that looks like a network issue.

The Balance I Found

# Migration-specific Docker limits (temporary, not permanent)
docker run \
  --cpus=2.0 \
  --memory=4g \
  --memory-swap=4g \
  --cpuset-cpus=0,1 \
  -e POSTGRES_PASSWORD=... \
  postgres:17

-- Migration-specific PostgreSQL settings
SET maintenance_work_mem = '2GB';
SET shared_buffers = '1GB';
SET work_mem = '128MB';
SET effective_cache_size = '3GB';
SET max_wal_size = '16GB';
SET min_wal_size = '4GB';
SET checkpoint_completion_target = 0.95;
SET wal_buffers = '128MB';
SET statement_timeout = '0';
SET lock_timeout = '0';
SET idle_in_transaction_session_timeout = '0';
SET autovacuum = 'off';

The key insight: temporary resource allocation during a migration is not the same as permanent over-provisioning. You are allocating 4 GB of memory and 2 CPU cores for the duration of the migration, not forever. The migration finishes faster, which means less total resource consumption, which means less impact on production.

The conservative configs were a false economy. They made the migration slow, which made every problem worse, which made the migration take even longer. Breaking that cycle meant giving the migration enough resources to finish quickly.

How All Three Problems Compound

What made this migration genuinely difficult was not any single bottleneck — it was how they amplified each other.

Here's the compound failure cascade I experienced:

1. Conservative Docker limits → slow restore
2. Slow restore → sustained high WAL generation rate
3. High WAL generation → primary hits max_wal_size → recycles WAL
4. Replica behind → can't fetch recycled WAL → replication breaks
5. Fix: rebuild replica from scratch (hours)
6. Restore still slow → index builds take 45+ minutes each
7. statement_timeout = 30s → kills long index builds
8. Failed restore → restart from beginning
9. Repeat steps 1-8 until you fix the root causes

Each restart added hours. Each restart generated more WAL. Each restart put more strain on the replica. I was fighting the symptoms (broken replication) without addressing the causes (too-conservative configs, no WAL retention, timeouts set too low).

Breaking the cascade required fixing all three simultaneously:

WAL retention: Replication slots and wal_keep_size so the replica could lag safely
Timeouts: All timeouts to zero so long-running operations would not be killed
Resource allocation: Higher container limits and tuned PostgreSQL settings so the migration ran fast enough to minimize WAL generation duration

Config Cheat Sheet: The Settings That Actually Matter

Here are the exact settings I now use for PostgreSQL migrations of tens of GB.

WAL Retention Settings

These prevent replication from breaking during bulk writes:

-- WAL sizing: accommodate burst writes without forced checkpoints
max_wal_size = '16GB';                    -- Default: 1GB
min_wal_size = '4GB';                     -- Default: 80MB
checkpoint_completion_target = 0.95;      -- Default: 0.5, spread I/O
wal_buffers = '128MB';                    -- Default: 4MB or -1 (auto)
 
-- WAL retention: don't recycle WAL the replica needs
wal_keep_size = '4096MB';                 -- Default: 0, keep 4GB extra
 
-- Replication slots: explicit WAL retention per standby
-- Create on primary:
-- SELECT pg_create_physical_replication_slot('standby_slot');
max_slot_wal_keep_size = '10GB';          -- Default: -1 (unlimited)

What each does:

max_wal_size: Higher means checkpoints are less frequent but each one moves more data. For bulk loads, higher is better — fewer I/O spikes.
checkpoint_completion_target: Spreads checkpoint writes across 95% of the interval instead of blasting the disk at 50%.
wal_keep_size: Extra WAL retention for lagging replicas. The safety net.
max_slot_wal_keep_size: Limits how much WAL a replication slot can retain before the primary starts recycling anyway. Prevents disk-full scenarios if a standby is down permanently.

Timeout Settings

All to zero during migration:

SET statement_timeout = '0';
SET lock_timeout = '0';
SET idle_in_transaction_session_timeout = '0';

These accept the risk of a hung session in exchange for not killing legitimate long-running operations. The trade-off is acceptable during a controlled maintenance window.

Memory Settings

Match these to your available memory:

-- Per-operation memory for index builds and sorts
maintenance_work_mem = '2GB';     -- Per session. Count parallel workers.
 
-- Shared buffer cache
shared_buffers = '1GB';           -- 25% of available RAM is a starting point
 
-- Per-operation sort memory
work_mem = '128MB';               -- Per sort operation. Lower if many concurrent ops.

Budget calculation:

shared_buffers + maintenance_work_mem + work_mem * concurrent_ops + OS overhead
= 1GB + 2GB + 128MB * 2 + 512MB
= ~3.75 GB

Your Docker container memory must exceed this. I use 4 GB minimum.

Other Critical Settings

-- Disable temporarily during migration (re-enable after)
autovacuum = 'off';                     -- Prevent VACUUM from competing for I/O
 
-- Optional: risk/reward trade-offs for maximum speed
synchronous_commit = 'off';             -- Risk: data loss on crash after commit but before WAL flush
full_page_writes = 'off';               -- Risk: unrecoverable torn page on crash

About full_page_writes: After a checkpoint, PostgreSQL writes a full 8 KB page image to WAL for the first modification to each page — 8 KB instead of ~50 bytes for the diff. For a bulk load touching hundreds of thousands of pages, that translates to gigabytes of extra WAL. Disabling it eliminates this overhead entirely.

The risk: If the server crashes mid-write, a torn page (partially written) is unrecoverable without full_page_writes. Only disable this if:

You have a full base backup taken before the migration
You can re-run the entire migration if the server crashes
No concurrent writes from other applications

I leave full_page_writes = on in production and only disable it for controlled bulk loads.

Pre-Migration Checklist

Every item here cost me real time to learn. Run this checklist before any migration of significant size.

Replication Protection

Create a physical replication slot for each standby:

SELECT pg_create_physical_replication_slot('standby_slot');

Configure standby to use the slot:
```
primary_slot_name = 'standby_slot'
```
Set wal_keep_size high enough for expected max lag:
```
ALTER SYSTEM SET wal_keep_size = '4096MB';
```
Set max_slot_wal_keep_size to prevent unbounded WAL growth:
```
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
```

Verify replication before starting:

SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS lag_mb,
       slot_name, slot_type
FROM pg_stat_replication
FULL JOIN pg_replication_slots USING (slot_name);

Document the drop-slot procedure in case of emergency:

-- If standby fails permanently, release WAL before disk fills:
SELECT pg_drop_replication_slot('standby_slot');

Config Verification

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_wal_size', 'min_wal_size',
  'checkpoint_completion_target', 'wal_buffers',
  'wal_keep_size', 'max_slot_wal_keep_size',
  'maintenance_work_mem', 'work_mem', 'shared_buffers',
  'statement_timeout', 'lock_timeout',
  'idle_in_transaction_session_timeout', 'autovacuum')
ORDER BY name;

Capacity Checks

Disk: Available space > 3x expected database size (data + WAL + temp files + dump file)
```
df -h /var/lib/postgresql/data
```

Memory: Container limits > calculated budget with 25% headroom

docker inspect pg_container --format '{{.HostConfig.Memory}}'
# Convert to human-readable: echo "<bytes>" | numfmt --to=iec

Docker resource limits: Temporary, appropriate for migration window, documented for rollback

Timeout Hardening

# Run via PGOPTIONS before connecting psql
export PGOPTIONS="-c statement_timeout=0 -c lock_timeout=0 -c idle_in_transaction_session_timeout=0 -c maintenance_work_mem=2GB"

Rollback Plan

Full base backup taken before migration
Procedure documented for restoring from backup if migration fails
Standby rebuild procedure documented (you will likely need it)
Migration script idempotent (can be re-run safely)

What I'd Do Differently

If I could go back and do this migration again, knowing what I know now:

Configure WAL retention before touching the database. Replication slots and wal_keep_size are the first thing I set, not something I add after the replica breaks.
Set all timeouts to zero. Not "high." Zero. The only way to guarantee a multi-hour index build does not get silently killed.
Give the migration enough resources to finish fast. Temporary over-provisioning is not the same as permanent waste. The faster the migration runs, the less total resource consumption, the less disruption to production.

Monitor replication lag in real-time during the entire restore. A simple watch command in another terminal window:

watch -n 30 'psql -c "SELECT application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS lag_mb FROM pg_stat_replication;"'

Test the restore with production-sized data on dev first. This is the cardinal sin. I tested with a small subset and assumed it would scale linearly. It does not. The bottlenecks only appear at scale.

The Lessons, Condensed

Async replication without WAL retention configuration is fragile during bulk writes. Default max_wal_size and wal_keep_size = 0 guarantee that a fast-writing primary will recycle WAL before a lagging replica can consume it. Replication slots are your safety net.
Timeouts are the enemy of large index builds. A CREATE INDEX on a table with tens of millions of rows and multiple indexes takes minutes to hours. Set statement_timeout, lock_timeout, and idle_in_transaction_session_timeout to zero during migration.
Conservative configs create a false economy. Undersizing resources makes the migration slow, which makes every problem worse, which makes the migration take even longer. Temporary resource allocation for a migration is not permanent over-provisioning.
WAL retention requires explicit configuration. The default settings assume the primary can recycle WAL freely. If you have async replicas, you must configure wal_keep_size and replication slots to prevent them from falling off the WAL timeline.
Monitor replication lag continuously. A broken replica does not always error loudly. Sometimes it just disappears from pg_stat_replication. Watch it in real-time.
Account for index rebuild time in your migration window. The data load is only half the work. Index creation after the load can take as long or longer.
Run the checklist before every migration. Boring and repeatable beats exciting and broken every time.