Building Replication-Safe LSM Trees in Postgres
By Stu Hood, Ming Ying, Mathew Pregasen, and Olive Ratliff on June 30, 2025
In a vanilla Postgres implementation, full-text search is backed by a B-tree or GIN (Generalized Inverted Index) structure. These indexes are optimized for relatively fast lookups with minimal churn.
When we built pg_search, the Postgres extension for search and analytics, our priorities were different. To be an effective alternative to Elasticsearch we needed to support very efficient scans in real time. We opted for a data structure better suited to dense posting list bitmaps and high-ingest workloads: a Log-Structured Merge (LSM) tree.
However, when we tested our LSM tree under physical replication — the mechanism that allows Postgres to replicate data from a primary node across one or more read replicas — we encountered several twists and turns. Most surprisingly, we learned that Postgres' out-of-the-box support for physical replication, built on Write-Ahead Log (WAL) shipping, isn't quite enough for an advanced data structure like an LSM tree to be replication-safe. In this post, we’ll do a deep dive into:
- What it means for an LSM tree to be replication-safe
- How Postgres' WAL shipping guarantees physical consistency
- Why atomic logging was necessary for physical consistency
- How we leveraged a little-known but powerful Postgres setting called hot_standby_feedback
What is an LSM Tree?
A Log-Structured Merge Tree (LSM tree) is a write-optimized data structure commonly used in systems like RocksDB and Cassandra.
The core idea behind an LSM tree is to turn random writes into sequential ones. Incoming writes are first stored in an in-memory buffer called a memtable, which is fast to update. Once the memtable fills up, it is flushed to disk as a sorted, immutable segment file (often called an SSTable).
These segment files are organized by size into layers or levels. Newer data is written to the topmost layer. Over time, data is gradually pushed down into lower levels through a process called compaction, where data from smaller segments is merged, deduplicated, and rewritten into larger segments.
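To make the write path concrete, here is a minimal, single-threaded sketch of an LSM tree in Rust. This is not pg_search's implementation (pg_search stores its segments in Postgres buffers, and its memtable handling is far more involved); the type names and in-memory layout here are purely illustrative.

```rust
use std::collections::BTreeMap;

/// A toy LSM tree: a sorted in-memory memtable plus a stack of immutable,
/// sorted segments (newest last). Illustrative only.
struct LsmTree {
    memtable: BTreeMap<String, String>,
    segments: Vec<Vec<(String, String)>>,
    memtable_limit: usize,
}

impl LsmTree {
    fn new(memtable_limit: usize) -> Self {
        Self { memtable: BTreeMap::new(), segments: Vec::new(), memtable_limit }
    }

    /// Writes land in the memtable first, turning random writes into an
    /// in-memory sorted buffer.
    fn put(&mut self, key: String, value: String) {
        self.memtable.insert(key, value);
        if self.memtable.len() >= self.memtable_limit {
            self.flush();
        }
    }

    /// Flush the memtable into an immutable, sorted segment (an "SSTable").
    fn flush(&mut self) {
        let segment: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
        self.segments.push(segment);
    }

    /// Naive compaction: merge every segment into one, letting newer segments
    /// shadow older ones so duplicate keys are deduplicated.
    fn compact(&mut self) {
        let mut merged = BTreeMap::new();
        for segment in self.segments.drain(..) {
            for (k, v) in segment {
                merged.insert(k, v); // later (newer) segments overwrite earlier ones
            }
        }
        self.segments.push(merged.into_iter().collect());
    }

    /// Reads check the memtable first, then segments from newest to oldest.
    fn get(&self, key: &str) -> Option<&String> {
        if let Some(v) = self.memtable.get(key) {
            return Some(v);
        }
        for segment in self.segments.iter().rev() {
            if let Ok(idx) = segment.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
                return Some(&segment[idx].1);
            }
        }
        None
    }
}

fn main() {
    let mut tree = LsmTree::new(2);
    tree.put("a".into(), "1".into());
    tree.put("b".into(), "2".into()); // hits the limit and triggers a flush
    tree.put("a".into(), "3".into());
    tree.flush();
    tree.compact(); // the newer value of "a" survives, the older one is dropped
    assert_eq!(tree.get("a"), Some(&"3".to_string()));
}
```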
What Do We Mean by Replication Safety?
A reliable distributed datastore (one which guarantees “replication safety”) must demonstrate both physical and logical consistency [1] across database replicas.
- Physical consistency means the replica contains structurally valid data — each page or block on disk is well-formed and corresponds to a state that did exist on the primary at some point.
- Logical consistency ensures that the data on the replica reflects a coherent and stable view of the database, something that could have been seen by a transaction on the primary.
A physically consistent state is not always a logically consistent state. Specifically, if you take a snapshot of a physically consistent replica while replicating an in-flight transaction, it may not be logically consistent. A good analogy is to imagine replicating a book. Physical consistency is like copying every page exactly, even if you're in the middle of a chapter — you're guaranteed to have real pages, but you might end up with half a sentence or a missing footnote. Logical consistency is like waiting until the chapter is finished before copying it, ensuring the result makes sense to a reader.
WAL Shipping: How Postgres Guarantees Physical Consistency
In a primary-standby replication setup, a primary server is paired with a standby server that acts as a read-only copy of its leader. The servers remain in sync through a Write-Ahead Log (WAL), which records every binary-level change to storage blocks on the primary before those changes are applied to the underlying data files. Entries in this append-only log are then streamed to the standby (a process called “log shipping”) and applied in the order they are received. This process enables near-realtime data synchronization between the two servers, hence the phrase “hot standby”.
Why Atomicity is a Requirement for Physical Consistency
Atomicity is a requirement for physical consistency because Postgres locks are not replayed on replica servers. Replaying every lock taken on the primary would require strict timing synchronization, significantly impacting performance and hindering the ability of the standby to serve reads. Instead, WAL replay works one buffer at a time, applying each record in the order it was written: it acquires an exclusive lock on the buffer, makes its change, and then releases it.
The problem arises when modifying data structures that span many Postgres buffers. Without the guarantee that operations are atomic over the entire structure, these modifications can lead to structural corruption.
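To see what that replay discipline looks like, here is a rough sketch with hypothetical types (this is not Postgres' recovery code): each record targets a single buffer, and the standby applies records strictly in order, taking and releasing one exclusive buffer lock per record. Any invariant that spans several buffers therefore has to hold after every individual record.

```rust
use std::sync::Mutex;

/// A toy buffer pool: each page has its own lock, mirroring the per-buffer
/// locking used during WAL replay on a standby. Hypothetical, illustrative types.
struct BufferPool {
    buffers: Vec<Mutex<Vec<u8>>>,
}

/// A toy WAL record: one change to one buffer. Real records are richer, but
/// the one-buffer-at-a-time discipline is the point here.
struct WalRecord {
    buffer_id: usize,
    offset: usize,
    bytes: Vec<u8>,
}

impl BufferPool {
    /// Replay records strictly in the order received: exclusively lock the
    /// target buffer, apply the change, release. No lock taken on the primary
    /// is ever re-acquired here, so a structure spanning many buffers must be
    /// valid after every single record.
    fn replay(&self, records: &[WalRecord]) {
        for rec in records {
            let mut page = self.buffers[rec.buffer_id].lock().unwrap();
            page[rec.offset..rec.offset + rec.bytes.len()].copy_from_slice(&rec.bytes);
        }
    }
}

fn main() {
    let pool = BufferPool {
        buffers: (0..2).map(|_| Mutex::new(vec![0u8; 8192])).collect(),
    };
    pool.replay(&[WalRecord { buffer_id: 1, offset: 4, bytes: vec![7, 7] }]);
    assert_eq!(pool.buffers[1].lock().unwrap()[4], 7);
}
```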
For example: pg_search uses an unrolled linked list of Postgres buffers where each node holds the read validity of a batch of segments in our LSM tree. To ensure that the primary can never observe a broken linked list, we use hand-over-hand locking (also known as lock coupling) to guarantee that the list remains physically consistent on the primary. After each buffer in the list is modified, its WAL entry becomes visible atomically on the replica.
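Hand-over-hand locking is easiest to see in a sketch. The Rust below is illustrative only (pg_search couples locks on Postgres buffers, not heap-allocated nodes): the next node's lock is always acquired before the previous node's lock is released, so a reader can never land on a link that a writer has left half-updated.

```rust
use std::sync::{Arc, Mutex, MutexGuard};

/// A toy unrolled linked list: each node holds a batch of segment ids and a
/// link to the next node. In pg_search each node is a Postgres buffer.
struct Node {
    segment_ids: Vec<u64>,
    next: Option<Arc<Mutex<Node>>>,
}

/// Hand-over-hand (lock-coupled) traversal: lock the next node *before*
/// dropping the lock on the previous one, so a writer following the same
/// discipline can never expose a broken link to this reader.
fn walk(node: Arc<Mutex<Node>>, prev: Option<MutexGuard<'_, Node>>, out: &mut Vec<u64>) {
    let guard = node.lock().unwrap(); // take the next lock first...
    drop(prev);                       // ...then release the previous one
    out.extend_from_slice(&guard.segment_ids);
    if let Some(next) = guard.next.clone() {
        walk(next, Some(guard), out);
    }
}

fn main() {
    let tail = Arc::new(Mutex::new(Node { segment_ids: vec![3, 4], next: None }));
    let head = Arc::new(Mutex::new(Node { segment_ids: vec![1, 2], next: Some(tail) }));
    let mut out = Vec::new();
    walk(head, None, &mut out);
    assert_eq!(out, vec![1, 2, 3, 4]);
}
```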
But what happens when we want to edit multiple entries in the list “at once” (atomically), such as when a newly compacted segment replaces multiple old segments? If only the primary server mattered, then we could protect the logical integrity of multiple list nodes by applying a global lock on the list itself, ensuring that the contents of the list were only ever visible in a valid state. But replicas don’t have access to global locks, so there’s no way to coordinate edits across multiple nodes (and multiple buffers) at once.
Instead, for multi-node operations pg_search uses a Copy-on-Write (CoW) clone of the list and atomically swaps in the new head. More generally, atomic operations keep replicas safe by eliminating reliance on coarse-grained locks that the standby cannot honor.
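Sketched in the same illustrative style (again, not the buffer-level implementation: on disk, the swap is a single WAL-logged write of the head, which replays atomically), the copy-on-write pattern looks roughly like this:

```rust
use std::sync::{Arc, RwLock};

/// An immutable snapshot of segment metadata. Once published it is never
/// mutated, so any reader holding it sees a consistent list.
#[derive(Clone)]
struct SegmentList {
    segments: Vec<u64>, // segment ids; the real structure spans many buffers
}

/// The "head": a single pointer that can be replaced in one step.
struct SegmentIndex {
    head: RwLock<Arc<SegmentList>>,
}

impl SegmentIndex {
    /// Readers grab the current snapshot; even if compaction publishes a new
    /// list afterwards, this snapshot stays valid and internally consistent.
    fn snapshot(&self) -> Arc<SegmentList> {
        Arc::clone(&self.head.read().unwrap())
    }

    /// Compaction builds a complete new list off to the side (copy-on-write),
    /// then swaps it in by replacing the head in one step. No reader ever
    /// observes a half-replaced list.
    fn replace_segments(&self, remove: &[u64], add: u64) {
        let old = self.snapshot();
        let mut new_list = (*old).clone();
        new_list.segments.retain(|id| !remove.contains(id));
        new_list.segments.push(add);
        *self.head.write().unwrap() = Arc::new(new_list); // the "head swap"
    }
}

fn main() {
    let index = SegmentIndex {
        head: RwLock::new(Arc::new(SegmentList { segments: vec![1, 2, 3] })),
    };
    let reader_view = index.snapshot();  // a reader pins the old list
    index.replace_segments(&[1, 2], 4);  // compaction swaps in a new head
    assert_eq!(reader_view.segments, vec![1, 2, 3]); // old snapshot unchanged
    assert_eq!(index.snapshot().segments, vec![3, 4]);
}
```

The key property is that readers that started on the old list finish on the old list; the new list only becomes reachable through the head, and it becomes reachable all at once.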
A Problem: Vacuums Undermine Logical Consistency
Adapting algorithms to work atomically at the block level is table stakes for physical replication: if you don’t do it, your data structures are broken, and you won’t be able to use them consistently.
But even when individual WAL operations and data structures are atomically compatible, VACUUM can interfere with the execution of concurrent transactions spread across multiple WAL entries and compromise logical consistency.
To illustrate this problem, let's imagine the primary has a table with some number of rows. To let readers and writers proceed without blocking each other, Postgres uses a mechanism called Multi-Version Concurrency Control (MVCC), which creates multiple versions of a modified row (or tuple) instead of updating the tuple in place. When a row is updated or deleted, the previous tuple is not immediately removed but instead flagged as “dead”.
Those “dead” tuples remain on disk until a periodic maintenance operation called VACUUM is run. Like any other operation, VACUUMs are recorded in the WAL, shipped to the standby, and replayed there.
Since the “deadness” of a tuple is tracked locally on each server but the VACUUM operation is replayed globally, faults can occur when a tuple is VACUUMed away on the standby prematurely. The standby might be mid-read, with a query iterating over tuples it still considers valid across several WAL entries, while the primary concurrently decides to VACUUM one of those tuples out of existence. The standby, lacking any locking coordination or awareness of concurrent operations, replays the VACUUM while that query is still in progress. This can lead to query failure when a long-running query attempts to access a VACUUMed tuple.
Why LSM Trees are Particularly Vulnerable to this Problem
If your Postgres is configured with read replicas and experiences a high volume of writes, you may have already seen this problem, even when using B-tree indexes. If a VACUUM is running on the primary at the same time that a query hits a read replica, it's possible for Postgres to abort the read. However, these errors are likely infrequent and tolerable in a typical Postgres setup where VACUUMs run once every few hours.
The same cannot be said for an LSM tree, where compaction is a core, ongoing part of the system. In a high-write throughput system, compaction can happen many times per minute, even per second. This increases the number of opportunities for conflicts to occur.
The Logical Solution: Hot Standby Feedback
This is where an optional Postgres setting called hot_standby_feedback comes in. When enabled, hot_standby_feedback allows the standby to tell the primary what data is safe to clean up from the replica’s point of view. This information significantly reduces the likelihood of a tuple being prematurely VACUUMed.
To understand what information hot_standby_feedback actually relays, we must first understand how tuple versioning works in Postgres. Every tuple in Postgres has two key metadata attributes: xmin and xmax. xmin stores the Transaction ID (XID) of the transaction that created or inserted that specific tuple version, while xmax stores the XID of the transaction that either updated or deleted the tuple version, effectively marking it as obsolete. When a tuple is deleted, the xmax value is updated with the XID of the deleting transaction. Since XIDs are assigned sequentially, later transactions receive larger XIDs, so another way to think about xmin is as a proxy for the tuple’s “creation time” and xmax for its “last updated or deleted time”.
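As a rough mental model, here is a heavily simplified visibility check in Rust. Real Postgres visibility rules also consult commit status, in-progress transaction lists, and hint bits; this sketch only captures the “creation time” / “deletion time” intuition above.

```rust
/// Simplified tuple version metadata, as described above. Illustrative only.
struct TupleVersion {
    xmin: u64,         // XID of the transaction that created this version
    xmax: Option<u64>, // XID of the transaction that updated/deleted it, if any
}

/// Drastically simplified visibility: a version is visible to a snapshot if it
/// was created "before" the snapshot and not yet deleted "before" it.
fn visible_to(tuple: &TupleVersion, snapshot_xid: u64) -> bool {
    let created_before = tuple.xmin <= snapshot_xid;
    let not_yet_deleted = match tuple.xmax {
        None => true,
        Some(xmax) => xmax > snapshot_xid,
    };
    created_before && not_yet_deleted
}

/// A dead version can be vacuumed once no active snapshot can still see it,
/// i.e. once every active snapshot is newer than the deleting transaction.
fn safe_to_vacuum(tuple: &TupleVersion, oldest_active_xid: u64) -> bool {
    matches!(tuple.xmax, Some(xmax) if xmax < oldest_active_xid)
}

fn main() {
    // Created by XID 100, deleted by XID 200.
    let t = TupleVersion { xmin: 100, xmax: Some(200) };
    assert!(visible_to(&t, 150));      // a snapshot taken in between still sees it
    assert!(!visible_to(&t, 250));     // newer snapshots do not
    assert!(!safe_to_vacuum(&t, 150)); // ...so it cannot be vacuumed yet
    assert!(safe_to_vacuum(&t, 300));  // once every snapshot is newer, it can be
}
```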
When hot_standby_feedback is enabled, the replica will periodically communicate the smallest xmin (oldest “creation time”) that any of its active queries is currently pinned to. This xmin identifies the oldest tuple still in use on the standby.
Armed with this information, the primary can make smarter decisions about when to permit cleanup operations (i.e. VACUUMs). If it sees that a standby query is still operating on a tuple that would otherwise be considered “dead,” it can defer cleanup until that query has finished.
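Putting the two halves together, the feedback loop can be sketched like this (a conceptual model with made-up names, not Postgres' wire protocol): the standby reports the smallest xmin across its active queries, and the primary folds that value into the horizon below which VACUUM is allowed to remove dead tuples.

```rust
/// Conceptual sketch of hot_standby_feedback; names and types are hypothetical.
struct Standby {
    active_snapshot_xmins: Vec<u64>, // xmin pinned by each in-flight query on the replica
}

impl Standby {
    /// With hot_standby_feedback on, the standby periodically reports the
    /// oldest xmin that any of its active queries still depends on.
    fn feedback_xmin(&self) -> Option<u64> {
        self.active_snapshot_xmins.iter().copied().min()
    }
}

struct Primary {
    local_oldest_xmin: u64,             // horizon from the primary's own backends
    standby_feedback_xmin: Option<u64>, // last value reported by the standby
}

impl Primary {
    /// VACUUM may only remove tuple versions that are dead to both the
    /// primary's own queries and every query still running on the standby.
    fn vacuum_horizon(&self) -> u64 {
        match self.standby_feedback_xmin {
            Some(standby_xmin) => self.local_oldest_xmin.min(standby_xmin),
            None => self.local_oldest_xmin,
        }
    }

    fn can_vacuum(&self, deleting_xid: u64) -> bool {
        deleting_xid < self.vacuum_horizon()
    }
}

fn main() {
    let standby = Standby { active_snapshot_xmins: vec![220, 180, 305] };
    let primary = Primary {
        local_oldest_xmin: 400,
        standby_feedback_xmin: standby.feedback_xmin(),
    };
    // A tuple deleted by XID 200 is dead on the primary, but the standby's
    // oldest query (xmin 180) may still need it, so cleanup is deferred.
    assert_eq!(primary.vacuum_horizon(), 180);
    assert!(!primary.can_vacuum(200));
}
```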
Final Thoughts
Even with the help of hot_standby_feedback, standby servers are fundamentally at the mercy of the WAL to provide instructions that are safe to execute in the order and at the moment they are received. The tension between local incentives and global requirements is just one dimension of the difficulty of achieving full replication safety in a distributed Postgres system.
To achieve physical consistency, pg_search implements an atomically logged LSM tree, and to achieve logical consistency, we rely on hot_standby_feedback.
This challenge was worth tackling because it enables the fastest possible search performance, without sacrificing consistency. To see it in action, check out our documentation or our open source project!
[1] Physical and logical consistency are also referred to as structural integrity and transactional consistency, respectively.