# ParadeDB

> ParadeDB is a transactional alternative to Elasticsearch built on Postgres. It is a PostgreSQL-native search and analytics engine - a columnstore index for OLAP and a BM25 inverted index for full-text search, all inside Postgres as an extension.

Full text of ParadeDB's blog, customer stories, and learn articles for LLM consumption. For a linked index, see https://www.paradedb.com/llms.txt.

# Blog

# Why We Picked AGPL

> ParadeDB has been licensed under AGPL from day one. Here's our thought process and case study on why we picked AGPL.

Published: 2024-08-03 · Author: Philippe Noël

Source: https://www.paradedb.com/blog/agpl

---

[ParadeDB](https://github.com/paradedb/paradedb) has been licensed under the [GNU Affero General Public License 3.0](https://www.gnu.org/licenses/agpl-3.0.html) — also known as AGPL — from day one. We reached this decision after evaluating four licenses and interviewing dozens of other open source companies. Since then, we've received many questions about our license. We've decided to detail our thought process and offer up a case study on why we picked AGPL.

## What is AGPL?

The AGPL license permits free use, modification, and distribution of software, provided that distributed, derivative works of the software are released under the same license. This provision, called the copyleft provision, ensures that modifications remain open and discourages proprietary forks.

## What is ParadeDB?

ParadeDB is an open-source Elasticsearch alternative built on Postgres. The advice we’re about to share is most applicable to companies like us: startups aiming to build a sustainable business on top of open source.

## Why the Right License Matters

We see open-source licenses as a contract among authors, community contributors, and users. By defining what each party can expect when they use, modify, or distribute a project, the license facilitates successful cooperation among strangers.

The license is also one of the first things that companies read when deciding whether to adopt a project. Companies are much more likely to use projects with licenses that are already approved by their legal team.

Over the years, we’ve seen authors of large open source projects leave the choice of license as an afterthought, only to realize too late that they are unhappy with their license. If the authors change the project’s license, they risk alienating their users, who feel that the contract they implicitly signed has been broken. We wanted to avoid this by being thoughtful about our choice of license early on.

## What We Optimized For

There were three goals we wanted to accomplish with our license:

1. **Be familiar**. We’re developers, not lawyers. Neither are our users and contributors. Our license should be easy to understand and common enough that most developers are already familiar with it. Users should not need to read a custom license to decide whether they can adopt a project.
2. **Be future-proof**. We wanted to protect our IP from cloud vendors early on so that we would never need to relicense. Our users should trust that our project will stand the test of time, so they can feel comfortable adopting it for the long term.
3. **Be community-driven**. We wanted our project to be true open-source with a license approved by the Open Source Initiative (OSI). We believe that approved licenses incentivize developers to contribute to a project. On the flip side, we've seen many projects that start with an unrestrictive license but switch to a more restrictive license get forked, which fragments the project’s community.

## Licenses We Evaluated

We evaluated licenses from four categories:

1. **Fully permissive**: Apache 2.0
2. **Eventually permissive**: Business Source License (BSL)
3. **Copyleft permissive**: AGPL
4. **Non-permissive**: Elastic License 2.0 (ELv2)

Here is how they stacked up:

<table
  id="table">
  <tbody>
    <tr>
      <th align="center"></th>
      <th
        align="center">
        Apache 2.0
      </th>
      <th
        align="center">
        BSL
      </th>
      <th
        align="center">
        AGPLv3
      </th>
      <th
        align="center">
        ELv2
      </th>
    </tr>
    <tr>
      <td
        align="center">
        Familiar
      </td>
      <td
        align="center">
        ✅
      </td>
      <td
        align="center">
        ❌
      </td>
      <td
        align="center">
        ✅
      </td>
      <td
        align="center">
        ❌
      </td>
    </tr>
    <tr>
      <td
        align="center">
        Future-Proof
      </td>
      <td
        align="center">
        ❌
      </td>
      <td
        align="center">
        ✅
      </td>
      <td
        align="center">
        ✅
      </td>
      <td
        align="center">
        ✅
      </td>
    </tr>
    <tr>
      <td
        align="center">
        Community-Driven
      </td>
      <td
        align="center">
        ✅
      </td>
      <td
        align="center">
        ❌
      </td>
      <td
        align="center">
        ✅
      </td>
      <td
        align="center">
        ❌
      </td>
    </tr>
  </tbody>
</table>

Whether each license meets these criteria is subjective and open to debate, so we’ll explain how we evaluated each.

### AGPL

AGPL was the only license that met all our criteria.

1. **Familiar.** When it comes to businesses the AGPL license has been hugely helpful. It's been tested and used for years by companies like [MinIO](https://github.com/minio/minio), [Grafana](https://github.com/grafana/grafana), [Citus](https://github.com/citusdata/citus), and [Quickwit](https://github.com/quickwit-oss/quickwit). As a result, most companies with legal teams will already have a policy in place for whether or not they can use AGPL licensed code.
2. **Future-Proof**: Thanks to the copyleft provision, cloud vendors cannot easily resell our project without our consent. This gives us confidence in our ability to monetize without fear of predatory competition. ParadeDB just turned one year old, and we’ve already been contacted by four cloud providers who, had we not chosen the AGPL license, may have privately forked and distributed ParadeDB.
3. **Community-Driven**: AGPL is officially approved by the Open Source Initiative. When we started ParadeDB, we wanted to be honest with our community. We did not feel comfortable calling ourselves open source under an unapproved license but wanted to acknowledge our ambition to build a sustainable open-source business. AGPL struck the perfect balance.

### BSL

BSL restricts the use of software in production for a specified period, usually up to four years. After this period, the software is released under an open-source license, allowing for unrestricted use and distribution.

While BSL can successfully deter managed providers from reselling the software, we believe that BSL’s temporary restrictions on production use contradict the principles of open source. Developers who feel similarly may choose to not contribute to the project. Enterprises concerned about BSL’s licensing costs and compliance issues may choose not to use the software.

### ELv2

ELv2 prohibits use of the software as part of a managed or hosted service. Its primary purpose is to prevent cloud vendors from reselling a project and undercutting the original authors.

We liked the defensibility that ELv2 provides and its rising popularity. Since its inception in 2021, ELv2 has been adopted by projects like [Airbyte](https://github.com/airbytehq/airbyte) and [MindsDB](https://github.com/mindsdb/mindsdb).

Ultimately, however, ELv2 is not approved by the OSI and we could not convince ourselves that it was truly open source. Prior to building ParadeDB, we worked on a project licensed under ELv2 that has since been deprecated. From this experience, we learned that many organizations have policies which disallow usage of projects with non-approved licenses like ELv2.

### Apache 2.0

Apache 2.0 is a fully permissive license that allows for free use, modification, and distribution for any purpose. While this license fosters goodwill among developers and helps grow the community, we were worried about its long-term defensibility. We observed how Elasticsearch began with an Apache 2.0 license and relicensed years later, which led AWS to create the OpenSearch fork. Today, OpenSearch has siphoned significant mindshare and business away from Elasticsearch. While this concern may not be applicable to all open source projects, we felt that it was particularly relevant to us since we are building an Elasticsearch alternative.

Over the past few months, we have been contacted by several Apache-licensed projects who want to integrate with ParadeDB but are worried about licensing conflicts with AGPL. To address these concerns, we issued commercial licenses to these companies (please [contact us](mailto:support@paradedb.com) if you’re interested in doing the same).

## Results: One Year Later

One year after picking AGPL, we are incredibly happy with the results. While many other factors are at play, we believe that AGPL has helped in several areas:

1. **Open source traction**. ParadeDB has gained 5K stars in the past year, making it one of the fastest-growing open source database projects. We have an active community of members and contributors.
2. **Usage**. ParadeDB has been deployed 40K times and is used by several Fortune 1000 companies.
3. **Defensibility**. We have been contacted by and are working with several large cloud providers to integrate ParadeDB with their managed Postgres service.
4. **Monetization**. We have monetized through support contracts and issuing commercial, non-AGPL licenses, which also contain a few closed-source enterprise features.

We invite you to check out our [GitHub repo](https://github.com/paradedb/paradedb) to learn more or to join our [**Slack community**](/slack) if you have any questions. And please don’t hesitate to show your support by [**giving us a star**](https://github.com/paradedb/paradedb)!

---

# A New Postgres Block Storage Layout for Full Text Search

> How we implemented a new block storage layout in Postgres for full text search performance.

Published: 2025-01-16 · Author: Ming Ying

Source: https://www.paradedb.com/blog/block-storage-part-one

---

We recently completed one of our biggest engineering bets to date: migrating [`pg_search`](https://github.com/paradedb/paradedb/tree/dev/pg_search), a
Postgres extension for full text search and analytics, to Postgres' block storage system. In doing so, `pg_search` is the first-ever extension<sup>1</sup> to port an
external file format to Postgres block storage.

For context - block storage is Postgres’ storage API that backs all of Postgres’ tables and built-in index types. Prior to this migration, `pg_search` operated outside of block storage. This means that the extension created files which were not managed by Postgres and could read the contents of those files directly from disk.
While it’s not uncommon<sup>2</sup> for Postgres extensions to do this, block storage has enabled `pg_search` to simultaneously achieve:

1. Postgres write-ahead log (WAL) integration, which is necessary for physical replication of the index
2. Crash and point-in-time recovery
3. Full support for Postgres MVCC (multi-version concurrency control)
4. Integration with Postgres’ buffer cache, which has led to massive improvements in index creation times and write throughput

_Image: Comparison of index creation times before and after migrating pg_search to block storage_
_Image: Comparison of insert throughput in transactions per second before and after migrating pg_search to block storage_

At first, we weren’t sure if reconciling the data access patterns and concurrency model of Postgres and [Tantivy](https://github.com/quickwit-oss/tantivy) — `pg_search`'s underlying search library — was possible without drastic changes to Tantivy<sup>3</sup>.
In this blog post, we’ll dive into how we architected `pg_search`'s new block storage layout and data access patterns.

In the near future, we'll release two more posts: one to discuss how we designed and tested `pg_search` to be MVCC-safe in update-heavy scenarios, and another to dive into how we customized the block storage layout for analytical workloads (e.g. faceted search, aggregates).

## What Is Block Storage?

The fundamental unit of block storage is a block: a chunk of 8192 bytes. When executing a query, Postgres reads blocks into buffers, which are stored in Postgres’ buffer cache.

DML (`INSERT`, `UPDATE`, `DELETE`, `COPY`) statements do not modify the physical block. Instead, their changes are written to the underlying buffers, which are later flushed to disk when evicted from the buffer cache or during a checkpoint.

If Postgres crashes, modifications to buffers that have not been flushed can become lost. To guard against this, any changes to the index must be written to the write-ahead log (WAL). During crash recovery, Postgres replays the WAL to restore the database to its most recent state.

## What Is `pg_search`?

pg_search is a Postgres extension that implements a custom index for full text search and analytics. The extension is powered by Tantivy, a search library written in Rust and inspired by Lucene.

## Why Migrate To Block Storage?

A custom Postgres index has two choices for persistence: use Postgres block storage or the filesystem. At first, using the filesystem may seem like the easier option. Integrating with block storage requires solving a series of problems:

1. Some data structures may not fit within a single 8KB block. Splitting data across multiple blocks can create lock contention, garbage collection, and concurrency challenges.
2. Once a block has been allocated to a Postgres index, it cannot be physically deleted — only recycled. This means that the size of an index strictly increases until a `VACUUM FULL` or `REINDEX` is run. The index must be careful to return blocks that have been tombstoned by deletes or vacuums to Postgres’ free space map for reuse.
3. In update-heavy scenarios, the index can become dominated by space that once belonged to dead (i.e. deleted) rows. This may increase the number of I/O operations required for searches and updates, which degrades performance. The index must find ways to reorganize and compact the index during vacuums.
4. Because Postgres is single-threaded, multiple threads cannot concurrently read from block storage<sup>4</sup>. The index may need to leverage Postgres’ parallel workers.

Once the index overcomes these hurdles, however, Postgres block storage does an incredible amount of heavy lifting. After a year of working with the filesystem, it became clear that block storage was the way forward.

1. Being able to use the buffer cache means a huge reduction in disk I/O and massive improvements to read and write throughput.
2. Postgres provides a simple, battle-tested API to write buffers to the WAL. Without block storage, extensions must define custom WAL record types and implement their own WAL replay logic, which drastically increases complexity and surface area for bugs.
3. Postgres handles the physical creation and cleanup of files for us. The index doesn’t need to clean up after aborted transactions or `DROP INDEX` statements.

## Tantivy’s File-Based Index Layout

_Image: Diagram showing Tantivy’s file-based index layout_

The first challenge was migrating [Tantivy’s](https://github.com/quickwit-oss/tantivy) file-based index layout to block storage. Let’s quickly examine how Tantivy’s index is structured.

### Segments

A Tantivy index is comprised of multiple segments. A segment is like a database shard — it contains a subset of the documents in the index. Each segment, in turn, is made up of multiple files:

1. **Postings**: Stores a mapping of terms to document IDs and term frequencies, allowing Tantivy to efficiently retrieve documents that contain a specific term. This is the backbone of the inverted index.
2. **Positions**: Tracks the positions of terms within documents, enabling phrase queries by identifying where terms appear in relation to each other.
3. **Terms**: Contains the list of unique terms in the index and metadata for each term, such as document frequency and offsets into the postings file.
4. **Fieldnorms**: Stores normalization factors for each field in a document, which are used to adjust term scores during ranking.
5. **Fast fields**: Columnar storage for numeric and categorical fields, enabling fast filtering and sorting.
6. **Deletes**: A bitset that tracks which documents in the segment have been deleted.
7. **Store**: Stores the original document. This file is not used by `pg_search` since the heap table already contains the original value.

New segments are created whenever new documents are committed to the index. To maintain a target segment count, Tantivy’s merge process combines smaller segments into a single, larger segment.

### **Metadata**

When a segment is created, Tantivy assigns it a unique UUID. Segments are tracked across two files. The first file contains a `Vec<PathBuf>` of all files in the index. The second file contains a list of segment UUIDs that are currently visible. If a segment is present in first file but not the second, that means that the segment has been tombstoned by the merge process and is subject to being removed by garbage collection.

In addition, the second file also stores index’s schema and settings.

### **Locks**

Tantivy uses a file-based locking approach — if a lock file exists, then the lock is being held by another process. Locks are important for Tantivy because Tantivy is not a database capable of handling concurrent readers and writers. They ensure that only one writer exists per index, and that reads and writes to the metadata files are atomic. In Part 2, we’ll discuss how we used Postgres MVCC controls to lift Tantivy’s “one writer per index” limitation.

## Migrating to a Block Storage Layout

_Image: Diagram of pg_search’s new block storage layout_

### **Segments**

Rather than being written to a file, segments are serialized and written to blocks. Large segments that spill past a single block are stored in a linked list of blocks.

### **Metadata**

Separate blocks are used to store the index schema, settings, and list of segment UUIDs.

Postgres MVCC visibility information is stored alongside each segment UUID. At query time, the extension uses MVCC visibility rules to construct a snapshot of the list of all visible segments, which eliminates the need for a second list of visible segments<sup>4</sup>.

### **Locks**

Tantivy’s lock files are no longer needed since Postgres provides buffer-level, interprocess locking mechanisms.

## Challenge 1: Large Files Can Spill Over a Single Block

_Image: Diagram explaining how a large file is stored as a linked list of Postgres blocks_

A segment file can exceed an 8KB block. To accommodate these files, we implement a linked list over block storage, where each block is a node.

The linked list starts with a header block that contains a bitpacked representation of all subsequent block numbers. This structure enables O(1) lookups by directly mapping the starting offset of any byte range to its position in the list.

After the header block, the next block stores the file’s serialized data. Once the block becomes full, a new block is allocated. In Postgres, every block has an area reserved for metadata called special data. The current block's special data section is updated to store the block number of the newly allocated block, forming the linked list.

## Challenge 2: Blocks Cannot Be Memory Mapped

Tantivy’s data access patterns for fast fields assume that the underlying file can be memory mapped, which means that Tantivy can leverage zero-copy access for the entire fast field. This is not the case for block storage — the buffer cache can only provide a pointer to the contents of a single block. If a fast field spans multiple blocks, each block must be copied into memory, introducing significant overhead.

To address this problem, we [modified Tantivy](https://github.com/paradedb/tantivy/commit/b42a45dd4aa29ce880864c95f2b6e69ad26cdc06) to defer dereferencing large slices of bytes up front. Instead, bytes are dereferenced lazily and
cached in memory to avoid re-reading blocks that have been previously accessed.

## Challenge 3: The Segment Count Explodes in Update-Heavy Scenarios

Because segments are immutable, every DML statement in Tantivy creates at least one new segment. Having too many segments can degrade performance because there is a cost to opening a segment reader, searching over the segment, and merging the results with other segments. While the ideal number of segments depends on the dataset and the underlying hardware, having more than a hundred segments is generally not optimal.

If a table experiences a high volume of updates, the number of segments quickly explodes. To address this, we introduce a step called `merge_on_insert` , which looks for merge opportunities after an `INSERT` completes.

It is critical that only one merge process runs concurrently. If two merge processes run at the same time, they could both see the same segments, merge them together, and create duplicate segments. To guard against this, every merge process atomically writes its transaction ID to a metadata block. Subsequent merge attempts first read this transaction ID, and are only allowed to proceed if the effects of that transaction ID are MVCC-visible.

## Wrapping Up

The following parts of this blog post series will dive into some more exciting challenges we faced with block storage, with a focus on concurrency and analytical performance.

In the meantime, you can check out the [GitHub repo](https://github.com/paradedb/paradedb) for more details or [join our community](/slack) with questions.
And please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).

<hr />

<ol>
  <li>That we know of.</li>
  <li>
    [ZomboDB](https://github.com/zombodb/zombodb) and
    [pg_parquet](https://github.com/CrunchyData/pg_parquet) are examples of
    Postgres extensions that directly interact with the filesystem.
  </li>
  <li>Fortunately this ended up not being the case.</li>
  <li>This will be covered in detail in Parts 2 and 3.</li>
</ol>

---

# The ACID Test: Why We Think Search Needs Transactions

> A developer's look at how Elasticsearch and Postgres stack up against the ACID test

Published: 2025-09-29 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/blog/elasticsearch-acid-test

---

When developers talk about databases, one acronym comes up again and again: **ACID**. Atomicity, consistency, isolation, and durability are the safety guarantees that define what it means to be a transactional system. If your database is the source of truth, these guarantees keep your data safe.

Search engines were built with a different goal. Systems like [Elasticsearch](https://www.elastic.co/elasticsearch) prioritize distributed indexing and scale. Transactional guarantees were never part of the design. By contrast, relational databases like [PostgreSQL](https://www.postgresql.org/) were born in the transactional era, with ACID at their core from the beginning.

Many Postgres users already combine these two worlds by treating their favorite database as a search engine, leaning on built-in full-text search or extensions like [`pg_search`](https://github.com/paradedb/paradedb) (providing bm25 relevance scoring). By doing this, they get something Elasticsearch cannot offer: an ACID-compliant search engine. Or, as we like to call it, a **search database**.

But to understand why ACID matters in this context, and why it draws such a sharp line between databases and search engines, we need to revisit what the acronym actually means.

## What is ACID?

The idea of transactions dates back to IBM’s [System R](https://dl.acm.org/doi/10.1145/358769.358784) in the 1970s, the project that also gave us SQL. The key insight was that developers shouldn’t have to reason about half-completed operations: a bundle of reads and writes should either all succeed, or none do. This model made it possible to build reliable applications on top of unreliable hardware.

A decade later, [Theo Härder and Andreas Reuter coined the acronym **ACID**](https://cs-people.bu.edu/mathan/reading-groups/papers-classics/recovery.pdf) to describe the four essential properties of these transactions:

- **Atomicity** means all-or-nothing. If one part of an action fails, the whole transaction is rolled back.
- **Consistency** means that transactions preserve invariants. If the database starts valid, it ends valid. It’s worth two things. Firstly, consistency is the most overloaded term: this isn’t the same as distributed / eventual consistency or CAP theorem consistency. Secondly, it's up for debate is this is something applications or databases should be responsible for (although it's nice when databases can help).
- **Isolation** means concurrent transactions don’t interfere with each other; the outcome is as if they ran one after another.
- **Durability** means that once a transaction or action commits, its changes survive crashes or power loss.

For many developers today, the clearest introduction to these ideas comes from Martin Kleppmann’s _Designing Data-Intensive Applications._

<figure>
  _Image: Designing Data-Intensive Applications book cover_

  <figcaption>
    ❤️ Kleppmann, M. (2017). Designing data-intensive applications: The big
    ideas behind reliable, scalable, and maintainable systems. O’Reilly Media
  </figcaption>
</figure>

Kleppmann notes that ACID has become slippery over time. Different systems interpret isolation differently, vendors sometimes use “ACID compliant” as a pure marketing slogan, and often consistency feels like an application goal instead of a database one. Still, the four letters remain a useful guide when comparing systems with different design goals.

## Why We Think Search Needs Transactions

At first glance, it might seem odd to even suggest a relationship between ACID and search systems. Isn’t search about relevance and ranking? In practice, search is almost always tied to the mutable, relational data it’s sourced from. The difference between a search database and a search engine is the amount of trust you can put in these two worlds being in sync.

This matters where the searcher is motivated by precision. Financial records and legal documents are constantly changing; and when a search is executed the result should reflect now, not some stage in the past. Imagine searching for a record that was just deleted. In Elasticsearch it may still show up, even though the source of truth has moved on. In Postgres, that would be impossible. These results can be driving important business decisions.

[Modern Treasury](https://www.moderntreasury.com/), one of ParadeDB's early design partners, sums it up well:

> Read-after-write consistency is fundamental to payments. It ensures transactions are visible the instant they’re persisted, which is critical for reconciliation, ledger accuracy and trust in our services.

This is what ACID brings to search: trust. You can rely on your search results to reflect the system of record in real time, without dangling references or stale indexes. As a side-effect it also reduces technology sprawl, removes sync jobs, and eliminates reconciliation logic. A search database is a simpler and safer foundation for user-facing search workloads which prioritize correctness.

## ACID in Elasticsearch

[Elasticsearch was never a database](https://www.paradedb.com/blog/elasticsearch-was-never-a-database); it was built as a distributed search engine first, not as a transactional system of record.

**Atomicity** in Elasticsearch stops at the document boundary. Indexing or updating a single document is atomic (with no option to group writes), but multi-document operations like the bulk API aren’t transactional. If operation five fails, operations one through four are already committed and cannot be rolled back.

**Consistency** is enforced only at the field level. A numeric field won’t accept text, but Elasticsearch won’t prevent you from inserting an order pointing to a non-existent customer. Invariants like foreign keys or uniqueness constraints must be enforced in application code, with all the complexity that entails.

**Isolation** is absent. Elasticsearch has no concept of multi-statement transactions, so traditional isolation levels don’t exist. Each indexing or update request is applied independently, and concurrent operations can interleave without guarantees about ordering or serializability.

**Durability** is the strongest property for Elasticsearch. Every write is appended to the [`translog`](https://www.elastic.co/docs/reference/elasticsearch/index-settings/translog) on the primary shard and fsynced before acknowledgment. In default mode, the write isn’t confirmed until all in-sync replicas have also persisted it. That ensures crashes don’t lose any acknowledged writes.

It’s also worth noting that Elasticsearch doesn’t provide full read-your-writes for search. A newly indexed document is immediately retrievable by ID, but it won’t appear in search results until a background refresh makes it part of a Lucene segment. Because of the distributed nature of Elastic sharding dirty-reads are also possible under failure. These don’t fall under any of the letters of ACID (although it may be tempting to slip them into C), but are still relevant when we are talking about correctness.

None of these trade-offs are flaws. They’re exactly what make Elasticsearch fast, distributed, and scalable. The pain comes when developers try to treat it as something it isn’t: a system of record.

## ACID in PostgreSQL

PostgreSQL, by contrast, was built with transactions at its core. Its entire model assumes ACID from the ground up.

**Atomicity** spans multiple statements, tables, and rows. If you update ten tables in a transaction and the last one fails, Postgres rolls back the entire unit of work. Developers never have to wonder which changes applied and which didn’t.

**Consistency** is actively guarded. Postgres enforces foreign keys, uniqueness, and check constraints. It will block an order without a valid customer or a duplicate email address. Instead of pushing invariants into application logic, Postgres makes them first-class.

**Isolation** is explicit. Postgres exposes three levels (`read committed`, `repeatable read`, and `serializable)` each with well-defined semantics. Behind the scenes, multi-version concurrency control (MVCC) ensures concurrent transactions don’t step on each other’s toes.

**Durability** is guaranteed by the write-ahead log (WAL). A transaction isn’t acknowledged until its WAL record is safely flushed to disk. If Postgres says the transaction committed, that fact will survive crashes and power failures.

Outside of ACID Postgres also guarantees strict _read-your-writes_ semantics: once you commit a change, your subsequent reads will always reflect it. Because Postgres isn’t a distributed system, there’s no ambiguity: visibility is immediate and consistent.

Postgres’s extensibility means new workloads can live directly alongside its transactional core: from time-series, to geospatial, to vector search. ParadeDB follows in that tradition, bringing full-text search and analytics into the database while keeping ACID guarantees intact.

## Summary

Elasticsearch optimizes for scalability. Postgres optimizes for trust. For years, developers stitched them together: Postgres for correctness, Elastic for search, with pipelines in between. That works, but keeping the two systems aligned creates data lag (or even differences), stale indexes, consistency gaps, and operational overhead.

ParadeDB removes that complexity. By extending Postgres with BM25-powered search and a dedicated SQL search API, it becomes something new: a **search database**. You get the safety of ACID transactions and the power of a search engine (think bm25, tokenization, fuzzy matching, boosting, proximity queries, and faceting) in one system.

_And you don’t have to replace what you already run. ParadeDB can serve as your primary database, or as a logical replica that delivers real-time search next to your existing Postgres server._

Ready to try ACID-compliant search? [Get started with ParadeDB](https://docs.paradedb.com/documentation/getting-started/install), and please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).

---

# Full Text Search over Postgres: Elasticsearch vs. Alternatives

> A comprehensive comparison between Elasticsearch and PostgreSQL for search workloads.

Published: 2024-07-31 · Author: Ming Ying

Source: https://www.paradedb.com/blog/elasticsearch-vs-postgres

---

Over the past six months, we’ve spoken to 50+ companies who are looking for a full text search (FTS) solution over data stored
in Postgres. For many of these companies, the decision boils down to Elasticsearch vs. native Postgres full text search.

By working with these companies, we've identified the areas where each solution shines. Here’s what we’ve learned.

## What is Full Text Search (FTS)?

Full text search is a technique that finds entries in a collection of text based on the presence of specific keywords and phrases.
Most search engines like Elasticsearch use the BM25 algorithm to rank search results. BM25 considers how often a term appears and how unique that term is across all documents.

Full text search is different from similarity search, also known as vector search, which searches for and ranks results by semantic meaning.
Many modern applications use a combination of full text and similarity search. This practice is called hybrid search and can yield more accurate results.

## Postgres Full Text Search

[Postgres FTS](https://www.postgresql.org/docs/current/textsearch-intro.html) is a native functionality available to all Postgres
databases. It leverages the `tsvector` data type, which stores text as searchable tokens, and the GIN index, which improves search
speeds.

### The Good

1. **Simplicity**. Postgres FTS requires no additional infrastructure and is available on all managed Postgres services like AWS RDS. In the long run, not having to orchestrate and manage an external search engine saves significant time and headache.
2. **Real-Time Search**. With Postgres FTS, data is searchable immediately upon commit. This can be incredibly useful to companies building user-facing or latency-sensitive search experiences — for instance, e-commerce sites or fintechs.
3. **Postgres Transactions and MVCC**. Postgres’ ACID transactions and multi-version concurrency control (MVCC) ensure that FTS results are reliable under concurrent access and frequent updates.

### The Bad

1. **Feature Incomplete**. The limited feature set of Postgres FTS may be a deal breaker for some companies. Missing features include BM25 scoring, relevance tuning, custom tokenizers, and faceting.
2. **Poor Performance Over Large Datasets**. While Postgres FTS performs well over tables with a few million rows, performance degrades considerably over tables with tens of millions of rows.
3. **Transactional Overhead**. Creating a GIN index over a column adds a small amount (usually milliseconds) of latency to transactions that affect that column.

### Key Takeaway

Postgres FTS is ideal for search over small to medium-sized tables that do not require sophisticated FTS queries. We’re being intentionally vague about what “medium-sized” and “sophisticated” entail — the specifics vary depending on your performance requirements. Luckily, testing and migrating to/from Postgres FTS is fairly straightforward.

## Elasticsearch

While Elastic today offers a wide variety of products, its core product, Elasticsearch, is a data store and full text search engine.

### The Good

1. **Comprehensive Feature Set**. Elasticsearch is capable of handling almost any FTS query. The Elastic Query DSL (domain-specific language) is the gold standard for full text search capabilities.
2. **Performant**. Our [benchmarks](https://github.com/paradedb/paradedb/tree/dev/benchmarks#results) show that Elasticsearch can query billions of rows in milliseconds thanks to its underlying battle-tested Lucene search engine and distributed architecture.
3. **More than Search**. Beyond FTS, Elasticsearch is also an analytical query engine, vector database, security and observability platform. Many organizations enjoy the simplicity of consolidating several services within Elasticsearch.

### The Bad

1. **Not a Reliable Data Store**. We’ve talked to many companies who have tried and regretted their decision to use Elasticsearch as their primary data store. This is not a practice we recommend — Elasticsearch’s lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging.
2. **Requires ETL Pipelines**. Because Elasticsearch is not a reliable data store, organizations that use Postgres typically extract, transform, and load (ETL) data from Postgres to Elasticsearch. Because failures in ETL pipelines can lead to all sorts of production outages, these pipelines need to be carefully maintained to avoid breaking changes in the underlying Postgres schema.
3. **Loss of Data Freshness**. ETL jobs are time-consuming and run at periodic intervals. Data that reaches Elasticsearch often lags several hours behind Postgres, which may be prohibitive for applications that perform real-time search over Postgres tables.
4. **Expensive**. We were surprised to hear from several enterprises that Elasticsearch had grown to become their largest software expense. As the cost of their Elasticsearch clusters ballooned, many of these enterprises switched from Elasticsearch Cloud to self-managed. While this reduced their cloud spend, it introduced a new problem: Elasticsearch is notoriously difficult to run, tune, and manage. These organizations then hired (expensive) engineers to manage their Elasticsearch clusters.

### Key Takeaway

Elasticsearch delivers excellent search performance at the price of operational overhead and data freshness. We recommend Elasticsearch if more lightweight alternatives cannot get the job done or if you intend on using other Elasticsearch services.

## Alternative Search Engines

Over the years, a modern breed of search engines like Algolia, Meilisearch, and Typesense has emerged. These engines are commonly used to build user-facing search experiences. [Hacker News search](https://hn.algolia.com/), for instance, is built on Algolia.

While each service differentiates around the edges, there’s an important caveat for developers looking for search over Postgres: none of these solutions are built specifically for Postgres. Users of Postgres will likely experience many of the same problems with these services as they would with Elasticsearch.

## Is the Best of Both Worlds Possible?

ParadeDB is a full text search engine built for Postgres. Powered by an extension called `pg_search`, ParadeDB embeds Tantivy, a Rust-based Lucene alternative, inside Postgres. Like native Postgres FTS, ParadeDB plugs into any existing, self-managed Postgres database with no additional infrastructure. Like Elasticsearch, ParadeDB provides the capabilities of an advanced full text search engine.

Teams making this tradeoff have already made the switch: [INSA Strasbourg replaced Elasticsearch with ParadeDB](/customers/case-study-insa) to simplify their research database stack, and [Bilt cut Postgres query timeouts by 95%](/customers/case-study-bilt) without standing up a separate search cluster.

Compatibility with managed Postgres services like Amazon RDS is coming soon. To get notified when it’s ready, we invite you to fill out our [interest form](https://form.typeform.com/to/jHkLmIzx?typeform-source=paradedb.typeform.com).
In the meantime, you can follow our [GitHub repository](https://github.com/paradedb/paradedb) and give us a star!

---

# Elasticsearch Was Never a Database

> Elasticsearch is a search engine, not a database. Here's why it falls short as a system of record.

Published: 2025-09-18 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/blog/elasticsearch-was-never-a-database

---

[Elasticsearch](https://www.elastic.co/elasticsearch) was never a database. It was built as a search engine API over [Apache Lucene](https://lucene.apache.org/) (an incredibly powerful full-text search library), but not as a system of record. Even Elastic’s own guidance has long suggested that your source of truth should live somewhere else, with Elasticsearch serving as a secondary index. Yet, over the last decade, many teams have tried to stretch the search engine into being their primary database, usually with unexpected results.

## What Do We Mean by “Database”?

Just to be clear up front, when we say _database_ in this context we mean a system you can use as your primary datastore for OLTP transactional workloads: the place where your application’s truth lives. Think Postgres (voted [most loved database](https://survey.stackoverflow.co/2025/technology#most-popular-technologies-database-prof) three years running), MySQL, or even Oracle.

## How Did We Get Here?

The story often begins with a simple need: search. A team is already using Postgres or MySQL to store their application data, but the built-in text search features don’t scale. Elasticsearch looks like the perfect solution; it’s fast, flexible, and easy to spin up.

At first, it’s just an index. Documents live in the database, and a copy lives in Elastic for search. But over time the line starts to blur. If documents are already in Elasticsearch, why bother writing them to the database at all? The process to keep the two stores in sync is the most brittle part of the stack, so why not get rid of it? Now the search index is also the database. The system of record has quietly shifted.

That’s where the trouble begins. A database isn’t just a place to keep JSON, text documents, and some metadata. It’s the authoritative source of truth, the arbiter that keeps your application data safe. This role carries expectations: atomic transactions, predictable updates, the ability to evolve schema safely, rich queries that let you ask questions beyond retrieval, and reliability under failure. Elasticsearch wasn’t built to solve this set of problems. It’s brilliant as an index, but brittle as a database.

## Transactions That Never Were

The first cracks appear around consistency. In a relational database, transactions guarantee that related writes succeed or fail together. If you insert an order and decrement inventory, those two operations are atomic. Either both happen, or neither does.

Elasticsearch can’t make that guarantee beyond a single document. Writes succeed independently, and potentially out of order. If one fails from a logical group, you’re left with half an operation applied. At first, teams add retries or reconciliation jobs, trying to patch over the gaps. But this is the moment Elasticsearch stops behaving like a database. A system of record shouldn’t ever let inconsistencies creep in over time.

You can see the same problem on the read side. Elasticsearch actually has two kinds of reads: `GET` by ID and `SEARCH`. A `GET` always returns the latest acknowledged version of a document, mirroring how databases work (although under failure-cases [dirty reads are possible](https://www.elastic.co/docs/deploy-manage/distributed-architecture/reading-and-writing-documents)). A `SEARCH`, however, only looks at Lucene segments, which are refreshed asynchronously. That means a recently acknowledged write may not show up until the next refresh.

Databases solve these issues with transaction boundaries and isolation levels. Elasticsearch has neither, because it doesn’t need them to be an effective search engine.

## Schema Migrations That Need Reindexes

Then the application changes. A field that was once an integer now needs decimals. A text field is renamed. In Postgres or MySQL, this would be a straightforward `ALTER TABLE`. In Elasticsearch, [index mappings](https://www.elastic.co/docs/manage-data/data-store/mapping#mapping-manage-update) are immutable once set, so sometimes the [only option](https://www.elastic.co/docs/manage-data/migrate) is to create a new index with the updated mapping and transfer every document into it.

When Elasticsearch is downstream of another database this is painful (a full network transfer) but safe, you can replay from the real source of truth. But when Elasticsearch is the only store, schema migrations require moving the entire system of record into a new structure, under load, with no safety net (other than a restore). What should be a routine schema change can become a high-risk operation.

## **Queries Without Joins**

Once Elasticsearch is the primary store, developers naturally want more than just search. They want to ask questions of the data. This is where you start to hit another wall.

Elasticsearch’s JSON-based Query DSL is powerful for full-text queries and aggregations, but limited for relational workloads. In Elastic’s [own words](http://elastic.co/docs/explore-analyze/query-filter/languages/querydsl), it “enables complex searching, filtering, and aggregations,” but if you want to move beyond that, the cracks show. Features you’d expect from a system of record (like basic joins) are either missing or only partially supported.

Consider the following SQL query:

```sql
-- What are the top ten products by average review rating,
-- only considering products with at least 50 reviews

SELECT p.id, p.name, AVG(r.rating) AS avg_rating
FROM products p
JOIN reviews r ON r.product_id = p.id
GROUP BY p.id, p.name
HAVING COUNT(r.id) >= 50
ORDER BY avg_rating DESC
LIMIT 10;
```

In Postgres, this is routine. In Elasticsearch, your options are clumsy: denormalize reviews into each product document (rewriting the product on every new review), embed reviews in products as children, or query both indexes separately and stitch the results back together in application code.

Elastic has been working on this gap. The more recent ES|QL introduces a [similar feature called lookup joins](https://www.elastic.co/blog/esql-lookup-join-elasticsearch), and Elastic SQL provides a more familiar syntax (with [no joins](https://www.elastic.co/docs/reference/query-languages/sql/sql-syntax-select)). But these are still bound by Lucene’s underlying index model. On top of that, developers now face a confusing sprawl of overlapping query syntaxes (currently: [Query DS](https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl)L, [ES|QL](https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql-kibana), [SQL](https://www.elastic.co/docs/explore-analyze/query-filter/languages/sql), [EQL](https://www.elastic.co/docs/explore-analyze/query-filter/languages/eql), [KQL](https://www.elastic.co/docs/explore-analyze/query-filter/languages/kql)), each suited to different use cases, and with different strengths and weaknesses.

It is progress, but not parity with a relational database.

## Reliability That Can Fall Short

Eventually every system fails. The difference between an index and a database is how they recover. Databases use write-ahead or redo logs to guarantee that once a transaction is committed, _all_ of its changes are durable and will replay cleanly after a crash.

Under normal operation Elasticsearch is also durable at the level it was designed for: individual document writes. The [translog](https://www.elastic.co/docs/reference/elasticsearch/index-settings/translog) ensures acknowledged docs are fsynced on the primary shard, can survive crashes, and can be replayed on recovery. But, as we saw with transactions, that durability doesn’t extend beyond a single document. There are no transaction boundaries to guarantee that related writes survive or fail together (because that concept simply doesn’t exist). A failure can leave half-applied operations, and recovery won’t roll them back the way a database would.

That assumption is fine when Elasticsearch is an index layered on top of a database. If it’s your only store, though, the gap in transactional durability becomes a gap in correctness. Outages don’t just slow down search, they put your system of record at risk.

## Operations That Strain Stability

Operating Elasticsearch at scale introduces another reality check. Databases are supposed to be steady foundations: you run them, monitor them, and trust they’ll keep your data safe. Elasticsearch was designed for a different priority: elasticity. Shards can move, clusters can grow and shrink, and data can be reindexed or rebalanced. That flexibility is powerful, but distributed systems come with operational tradeoffs. Shards drift out of balance, JVM heaps demand careful tuning, reindexing consumes cluster capacity, and rolling upgrades can stall traffic.

Elastic has added tools to ease these challenges, and many teams do run large clusters successfully. But the baseline expectation is different. A relational database is engineered for stability and correctness because it assumes it will be your source of truth. Elasticsearch is [“optimized for speed and relevance”](https://www.elastic.co/docs/get-started/), and running it also as a system of record means accepting more operational risk than a database would impose.

## The Cost of Misuse

Elasticsearch is already complex to operate and heavy on resources. When you try to make it your primary database as well, both of those costs are magnified. Running on a single system feels like a simplification, but it often makes everything harder because you have two different optimization goals.

Transaction gaps, brittle migrations, limited queries, complex operations, and workarounds all pile up. Instead of reducing complexity, you’ve concentrated it in the most fragile place possible. The result is worse than your original solution: increased engineering effort, higher operational cost, and still none of the guarantees you would expect from a source of truth.

## So Where Does That Leave Elasticsearch?

Honestly, that leaves it right where it should be, and where it started: a search engine. Elasticsearch (and Apache Lucene under it) is an incredible achievement, bringing world-class search to developers everywhere. As long as you’re not trying to use it as a system of record, it does exactly what it was built for.

Even when used “correctly”, though, the hardest part often isn’t search itself, it’s everything around it. ETL pipelines, sync jobs, and ingest layers quickly become the most fragile parts of the stack.

That’s where ParadeDB comes in. Run it as your primary database, combining OLTP and full-text search in one system, or keep your existing Postgres database and eliminate ETL by deploying it as a logical follower.

This is the path teams like [INSA Strasbourg](/customers/case-study-insa) and [Modern Treasury](/customers/case-study-modern-treasury) have taken — replacing brittle search pipelines with ParadeDB running alongside Postgres.

If you want open-source search with correctness, simplicity, and world-class performance, [get started with ParadeDB](https://www.paradedb.com), and please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).

---

---

# Syncing with Postgres: Logical Replication vs. ETL

> Comparing ETL pipelines with PostgreSQL logical replication for data synchronization.

Published: 2025-08-08 · Author: Philippe Noël

Source: https://www.paradedb.com/blog/etl-vs-logical-replication

---

If your organization uses [Postgres](https://www.postgresql.org/) as its primary data store, you’re in good company. Many teams gravitate towards Postgres for its performance, stability, and familiarity. But even the strongest Postgres champions recognize its limits when it comes to powering specialized workloads like search, analytics, or warehousing.

To support these kinds of workloads, most organizations introduce a dedicated system isolated from the primary. Typical examples include Elasticsearch for search; ClickHouse for analytics; Snowflake for data warehousing. These systems rely on ETL (Extract, Transform, Load) pipelines to receive data from Postgres.

Interestingly, there are also use cases where these systems can be Postgres themselves. A common example is for reporting, where a separate Postgres instance tuned for OLAP (online analytical processing) can be used. In such a world, you aren’t confined to using ETL. An alternative opportunity is to use logical replication — a feature native to Postgres — to directly mirror changes from the primary.

When choosing a specialized system, the method of data ingestion is an often-overlooked factor that has long-term implications for maintainability, reliability, and performance. Over the past two years, we’ve interviewed and worked with dozens of companies who have dealt with ETL and logical replication. In this piece, we’ll explore the advantages and some non-obvious but common pitfalls of each approach.

## What is ETL?

ETL stands for Extract, Transform, and Load. When used as the interface between Postgres and non-Postgres systems:

1. **Extract** means pulling data out of the source system (Postgres).
2. **Transform** refers to reshaping that data to fit the schema or model of the destination (like Elasticsearch documents).
3. **Load** is the final step: writing the transformed data into the sink system.

_Image: Diagram of an ETL pipeline moving data from a source database into a target database through extract, transform, and load stages_

## Where Does ETL Shine?

  **TLDR**: Use when the sink isn't Postgres or you need heavy
  reshaping/multi-destination delivery.

ETL is the right tool when flexibility and interoperability between Postgres and non-Postgres systems are priorities. It offers several compelling advantages:

1. **Opportunity to enrich source data:** ETL presents a natural entry point for you to tailor incoming data exactly as needed downstream by applying custom transformations. This is valuable in use cases that involve remodeling or augmenting data. Tools like [dbt](https://www.getdbt.com/) are often used to manage these transformations.
2. **Flexibility to optimize data model**: ETL lets you reshape relational data into a structure optimized for the target system. Consider the advantages of denormalizing for document search or pre-aggregating for analytics: in these cases, ETL may be the most practical way to hit performance targets without contorting the source schema.
3. **Batch and stream support**: ETL pipelines are most commonly run in batch mode, which can be the simplest and most cost-effective approach when real-time data isn’t required. That said, many tools also support real-time streaming when freshness is a priority.
4. **Ability to manage schema drift explicitly**: With a dedicated transform layer, teams can design logic to handle schema changes (e.g., ignore, rename, or reshape fields) before they hit the sink.

## Where Does ETL Falter?

1. **Maintenance can be daunting:** There’s a common misconception that ETL is “set it and forget it.” In reality, ETL pipelines are fragile. Small schema changes or version mismatches between Postgres and downstream systems like Elasticsearch can silently break data flows. As a result, ETL pipelines require ongoing vigilance, so much so that the prospect of expanding the ETL surface area can have a [chilling effect](https://news.ycombinator.com/item?id=44633933#:~:text=Since%20we%20both,plane%20database%20stuff%3F).
2. **Lack of built-in transaction handling:** Most event-based ETL systems do not have an out-of-the-box method for handling deduplication, resolving update races, or replaying failed batches. With ETL, you’re handed a firehose of changes — and it's on you to ensure the downstream system doesn’t flood. Some teams embrace this control, building custom idempotency keys, versioning schemes, or last-write-wins logic. But these solutions are easy to get subtly wrong, especially in real-time systems where race conditions and partial failures are common.
3. **Backfill complexity is high**: With logical replication, everything is inherently replayable. With ETL, this is not the case. Replaying often means re-extracting from the source and manually reapplying transformations. These backfills are time-consuming and error-prone — avoiding duplication or stale data requires careful coordination across every stage of the pipeline.

## What is Logical Replication?

Logical replication is a built-in Postgres feature that streams row-level transactional changes from one Postgres instance to another. It works by decoding changes recorded in the Postgres write-ahead log (WAL) and sending them over a replication slot to replicas, where the changes are replayed in the order received.

_Image: Diagram of Postgres logical replication streaming row-level changes from a primary database to a replica over a replication slot_

## Where Does Logical Replication Shine?

  **TLDR**: Use when you want a faithful replica and minimal pipeline
  complexity.

If your sink is Postgres-based and you can leave the source data model untouched, logical replication promises a far simpler stack. Under logical replication, the sink is a literal replica of the source, which means:

1. **No third-party service required:** Logical replication is built into Postgres, so you don’t need to manage external connectors like Kafka, Debezium, or an ETL tool. This reduces your dependency footprint and simplifies your system architecture.
2. **Lower operational costs:** With fewer moving parts to deploy and monitor, teams spend less time firefighting and more time building. There's also no need to provision separate infrastructure for ETL pipelines or intermediate messaging layers.
3. **Near real-time data sync by default:** Changes are streamed from the source Postgres instance to the replica as they happen. This allows downstream systems to access fresh data almost immediately, without waiting for a batch job or a transformation window. If real-time is an absolute requirement, logical replication can also be configured to be fully synchronous.
4. **Transactional and ordered**: Logical replication will automatically pause if it encounters an unexpected change — like an insert that would create a primary key conflict — making it easier to detect and debug divergence. Additionally, changes are applied in commit order and within the boundaries of each transaction, which eliminates many classes of bugs that arise in event-based pipelines.
5. **Minimal backfill complexity**: Backfilling tables in a Postgres replica can be configured to happen automatically when a subscription is created. Postgres bulk-copies each table from a consistent snapshot, then streams any changes that occurred during the copy until the replica is fully caught up.

## Where Does Logical Replication Falter?

Logical replication isn’t for every use case, and has some limitations of its own.

1. **Schema coupling limits flexibility**: Aside from the ability to specify [row filters](https://www.postgresql.org/docs/current/logical-replication-row-filter.html) and [column lists](https://www.postgresql.org/docs/current/logical-replication-col-lists.html), the replica must mirror the source schema exactly. There’s no sophisticated transformation layer to reshape data for any downstream need.
2. **Single-thread apply process limits throughput**<sup>1</sup>: Each subscription applies changes in a single thread, regardless of how many tables are replicated inside a subscription. A common pitfall is placing too many tables into one subscription instead of splitting across multiple subscriptions.
3. **It doesn’t replicate DDL**<sup>2</sup>: Only row-level changes (INSERT, UPDATE, DELETE) are replicated—schema changes like `CREATE INDEX` or `ALTER TABLE` are not. The workaround is to run the same migration scripts on both your primary Postgres and the ParadeDB replica.

## What Databases Support Postgres Logical Replication?

Some databases advertise “Postgres wire compatibility,” meaning they can accept SQL over the Postgres frontend/backend protocol.
However, these systems typically still require ETL pipelines for ingestion. Logical replication relies on the replication sub-protocol and Postgres’s write-ahead log (WAL) format, which most wire-compatible systems do not implement.

A much smaller set of systems actually speak the replication protocol and can consume Postgres logical decoding output. In one camp are ETL/CDC tools themselves, whose job is to tap the logical replication stream and
hand the changes off to another system. In another camp are databases built on actual Postgres internals that inherit full logical replication support:

<div>
  <table>
    <thead>
      <tr>
        <th>Use Case</th>
        <th>Postgres Variant</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Data Warehouse</td>
        <td>Redshift, Greenplum</td>
      </tr>
      <tr>
        <td>Search and Analytics</td>
        <td>ParadeDB</td>
      </tr>
      <tr>
        <td>Vector Search</td>
        <td>pgvector</td>
      </tr>
      <tr>
        <td>Time Series</td>
        <td>Timescale</td>
      </tr>
    </tbody>
  </table>
</div>

For instance, [ParadeDB](https://www.paradedb.com) is an Elasticsearch alternative built on Postgres. It’s designed for Elastic-style search and analytical queries and is fully compatible with Postgres logical replication.

If you’re curious how it works, check out our [architecture](https://docs.paradedb.com/welcome/architecture) — and don’t hesitate to give us a [star](https://github.com/paradedb/paradedb).

<hr />

<ol>
  <li>
    Postgres mailing lists hint at improvements to parallel logical apply
    workers coming in Postgres 19.
  </li>
  <li>
    DDL replication is one of Postgres’ most highly requested features, so it
    may land in future releases.
  </li>
</ol>

---

# Teaching Postgres to Facet Like Elasticsearch

> Introducing faceted search in ParadeDB - bringing the power of search engine faceting to PostgreSQL with single-query aggregations.

Published: 2025-12-10 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/blog/faceting

---

Picture this: you've built a perfectly reasonable search app. Users type "dinosaur," get results, everybody's happy. Then someone asks, "but how many of these are carnivores?" So you add filters. Then they want to know how many results each filter would give them *before* they click it. Welcome to ~Jurassic park~ faceted search.

In this post, we'll break down how we brought **Elasticsearch-style faceting directly into PostgreSQL;** the syntax, the planner integration, and the execution strategy that makes it possible. We'll show you how we structured the SQL API using window functions, and the results we can get from pushing as much work as possible down into our underlying search library (spoiler alert: [over an order of magnitude faster](#performance-why-this-matters) for large result sets).

  If you aren't familiar with [ParadeDB](https://www.paradedb.com), we're a
  Postgres extension that brings modern search capabilities directly into your
  database. Instead of managing separate search infrastructure, you get BM25
  full-text search, vector search, and real-time analytics while keeping all
  your data in Postgres with full ACID guarantees.

## How Search Engines Count

Faceting is a way of summarizing the contents of a search result. When you type a query like "dinosaur", you’re not just asking for a list of matching documents, you’re also thinking, _what kinds of dinosaurs exist in this result set?_ Maybe there are 87 carnivores, 112 herbivores, and 41 omnivores. Each item in that breakdown, the counts of results grouped by attribute, is a **facet**.

Facets turn search from a one-dimensional list into something that can help visualize navigation. They set the stage to let you explore results by filtering on structured fields (like _era_, _habitat_, or _height range_) while still searching free text (like _large dinosaur_). In other words, faceting is where structured data meets unstructured search.

<figure>
  _Image: Faceted search powering real-world insights: an example from our customer, DemandScience._

  <figcaption>
    Faceted search powering real-world insights: an example from our customer,
    [DemandScience](https://demandscience.com/).
  </figcaption>
</figure>

Behind the scenes, each facet corresponds to a column or field in your dataset. Modern search engines use columnar stores to computes how many matching documents belong to each unique value or bucket of that field, then returns both the top results and these aggregate counts in a single response. Because a search result usually returns results that are spread across our dataset the database needs a quick way to look up values for each document.

That duality, returning ranked search hits _and_ category counts, is what makes faceting so powerful. It's also what makes it tricky to implement efficiently in a database built for rows, not documents.

## The Problem with Traditional Approaches

If you're thinking, "I'll just run a few extra COUNT queries, how hard can it be?", you're about to discover why search engines exist in the first place.

The thing is, faceting _looks_ simple: it's just grouping and counting. But try to make it fast in a traditional row-based database, and you'll run into serious performance challenges. Want to show search results _and_ category counts from a single query? That's either two index scans or a full index scan and a lot of data transferred. Want it to be responsive? That becomes increasingly difficult as the number of results increases.

Let's take a look at how much faster our approach is, and also how we make this possible (by pipelining the search query and faceting into a single index scan, and making use of a columnar format for value lookups).

## Performance: Our Benchmarking Results

  As of v0.22.0, the performance of MVCC-accurate faceting is roughly 40%
  faster.

To demonstrate the performance benefit of our new faceting, we benchmarked traditional PostgreSQL faceting against ParadeDB's integrated approach using a dataset of 46 million Hacker News posts and comments (details on the queries and schema can be found [later in the post](#comparing-the-approaches)).

Both approaches use ParadeDB's BM25 search (PostgreSQL's built-in tsvector simply doesn't cut it, already being 27x slower at 200,000 results), but they differ in how faceting is implemented. The results show just how significant the performance difference can be:

As the chart shows, manual faceting performance degrades dramatically with larger result sets, while ParadeDB's Top K faceting maintains consistent performance by executing both search ranking and aggregation in a single pass through the index. This efficiency comes from making use of ParadeDB's columnar index, which allows fast per-document value lookups during aggregation. At scale, this represents well over an order of magnitude improvement. The orange bar shows ParadeDB with MVCC[^2] disabled, demonstrating that an additional 3× performance gain is possible when strict transactional consistency isn't required for the facets.

So how does ParadeDB achieve these results? Before we jump into planning and execution let's take a look at our syntax.

## Faceting in a Row-Based World

When we set out to design the syntax for faceting, we had three goals:

1. Make it intuitive for both SQL users and Elasticsearch users.
2. Return both the search results and the facet counts in the same payload as fast as possible
3. Integrate cleanly into PostgreSQL's planner and executor.

We experimented with a few approaches: full JSONB result sets, LATERAL subqueries, and a behind-the-scenes UNION. The one we kept coming back to was using PostgreSQL's window functions to express faceting instructions, passing a JSON DSL that mirrors the Elasticsearch aggregation API.

That led us to a [new window function, `pdb.agg()`](https://docs.paradedb.com/documentation/aggregates/overview) (also usable in aggregate position, blog coming soon), which accepts JSONB that enables a range of aggregations. Most notably for faceting it supports [terms](https://docs.paradedb.com/documentation/aggregates/bucket/terms) (counting unique values), [histograms](https://docs.paradedb.com/documentation/aggregates/bucket/histogram) (bucketing unique values), and [date_histogram](https://docs.paradedb.com/documentation/aggregates/bucket/datehistogram) (bucketing timestamps).

Instead of complex CTEs and multiple queries, you get both search results and facet counts in a single, clean query:

```sql
SELECT
  id, title, score, pdb.score(id) AS rank,
  pdb.agg('{"histogram": {"field": "score", "interval": 50}}')
    OVER () as facets
FROM hn_items
WHERE (text ||| 'postgresql')
ORDER BY rank DESC
LIMIT 5;
```

The magic is the `OVER ()` syntax: this means "over the entire result set, ignoring the LIMIT". This window syntax computes both the top K ranked results and the facet counts in a single pass through the data—exactly the pattern most search applications need.

The result includes both your search hits alongside the faceting data in a JSONB column (effectively overloading this column to contain all the facets). This approach gives you structured, self-contained aggregation results that are easy to parse on the client side.

### MVCC and Performance Tradeoffs

By default, ParadeDB maintains full ACID guarantees by checking transaction visibility for every aggregated document. However, for large result sets where approximate counts are acceptable, you can disable MVCC checking for significant performance gains:

```sql
-- Default: MVCC enabled (ACID compliant, slower)
pdb.agg('{"terms": {"field": "category"}}') OVER ()

-- MVCC disabled: ~3x faster for large result sets
pdb.agg('{"terms": {"field": "category"}}', false) OVER ()
```

When MVCC is disabled, the aggregation operates directly on the search index without visibility checks. This is useful for analytics workloads on large datasets where some lag in reflecting the latest updates (bounded by the number of dead tuples since the last VACUUM) is acceptable. But this optional optimization can be great when you need maximum performance or have immutable INSERT only data.

### A Complete Example

Let's walk through how this works with a concrete example using our HackerNews dataset:

```sql
-- HackerNews table structure
CREATE TABLE hn_items (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER,
    by TEXT,
    text TEXT,
    title TEXT,
    url TEXT,
    type TEXT,
    time TIMESTAMPTZ,
    score INTEGER
    -- ... other columns
);
```

After creating a ParadeDB search index[^1], you can run faceted search queries that return both ranked results and category counts:

```sql
SELECT
  id, title, score,
  pdb.agg('{"terms": {"field": "type"}}') OVER () as facets
FROM hn_items
WHERE (text @@@ 'paradedb')
ORDER BY score DESC
LIMIT 10;
```

When you see `OVER ()` alongside an `ORDER` and `LIMIT`, it signals: _"compute this aggregation over the entire search result set, not just the top K rows returned."_ This Top K pattern—showing ranked results with facet counts—is exactly what most search interfaces need.

This design intentionally bridges two paradigms. It is designed to be familiar to both Elasticsearch (making use of the Elasticsearch DSL) and SQL (with the WINDOW syntax) users. PostgreSQL's parser treats it as a standard window function call, allowing us to implement custom aggregation logic within the familiar SQL framework.

This design intentionally bridges two paradigms, with the structure of SQL and the flexibility of search engines. It combines the expressiveness of Elasticsearch’s JSON DSL, which allows rich and composable aggregations, with the familiar simplicity of SQL’s WINDOW syntax, which makes it easy to use and reason about. PostgreSQL’s parser treats `pdb.agg()` as a standard window function call, letting ParadeDB implement custom aggregation logic directly within PostgreSQL’s query execution framework. The result is an interface that feels natural to both SQL and Elasticsearch users.

The first row of the output looks like this:

```sql
id     | 41178808
title  | The future of F/OSS continues to be AGPL
score  | 34
facets | {
       |     "buckets": [
       |         { "key": "comment", "doc_count": 185 },
       |         { "key": "story",   "doc_count": 2   }
       |     ],
       | }
```

The faceting information is added alongside the search query using a JSONB column. JSON is the perfect wrapper for the faceting information: it's structured, self-contained, and easy to parse on the client. Today we attach this to all rows, but in the future, we may optimize this further by returning the facet JSON only on the first row instead of duplicating it.

## Comparing the Approaches

Let's see how this compares to a more traditional PostgreSQL approach with the same output, but using complex queries with CTEs and manual aggregations. Both approaches below use ParadeDB for BM25 search, differing in how the faceting is handled. Here's what a very optimized version of "manual faceting" for a histogram of post ratings looks like:

```sql
WITH
-- get the id, score and bm25 rank for the search query
hits AS  (
  SELECT
    id,
    pdb.score(id) AS rank,
    score
  FROM hn_items
  WHERE (text ||| 'postgresql')
),
-- do our faceting on the full result-set and build up a JSON object
facets AS (
  SELECT jsonb_build_object(
           'buckets',
           jsonb_agg(
             jsonb_build_object('key', score_bucket, 'doc_count', cnt)
             ORDER BY score_bucket
           )
         ) AS facets
  FROM (
    SELECT
      (score / 50) * 50 AS score_bucket,
      COUNT(*) AS cnt
    FROM hits
    GROUP BY score_bucket
  ) t
),
-- order the original hits by rank and return the top results
limited AS (
  SELECT id, rank FROM hits ORDER BY rank DESC LIMIT 10
)
-- join everything together
SELECT h.id, title, score ,facets
FROM limited h
JOIN hn_items i on i.id = h.id
CROSS JOIN facets f;
```

Here's what **ParadeDB faceting** looks like for the same result:

```sql
SELECT
  id, title, score, pdb.score(id) AS rank,
  pdb.agg('{"histogram": {"field": "score", "interval": 50}}')
    OVER () as facets
FROM hn_items
WHERE (text ||| 'postgresql')
ORDER BY rank DESC
LIMIT 5;
```

Same result, simpler syntax, and as we've already seen [dramatically better performance](#performance-our-benchmarking-results).

## **How Faceting Hooks Into the PostgreSQL Planner**

Now we have our syntax, but it isn’t as simple as just writing the `pdb.agg()` function and calling it a day. Because PostgreSQL aggregates (including those in window functions) usually run **after** the data is fetched from indexes and loaded to shared buffers we have a problem. We want to do our search and our faceting in a single pass, but they are executed in fundamentally different places during query execution.

PostgreSQL provides a custom scan API that allows extensions to implement their own execution nodes. However, to fully integrate faceting with window functions, we also need to use planner hooks to intercept the planning process before PostgreSQL creates standard window aggregation nodes.

We inject a custom execution node at plan time, intercepting the planning process when we detect a `pdb.agg()` function used as a window. We do this using a combination of planner hooks and the custom scan API in three stages:

1. **Query Interception.** The planner hook scans the query tree for `WindowFunc` nodes using `pdb.agg()`. It replaces them with ParadeDB placeholders _before_ PostgreSQL’s `grouping_planner()` runs. This prevents Postgres from creating real `WindowAgg` nodes we’d later need to undo.
2. **Custom Scan Injection.** When queries combine full-text search, `ORDER BY`, `LIMIT`, and our placeholder `WindowAgg` node we take over planning. This is the entry point for Top K faceting queries. The planner swaps in a `PdbScan` node capable of executing both the search and the aggregation in one pass, making sure that the parts of the query that aren’t handled by ParadeDB (non-search WHERE clauses, outer clauses / CTEs, FILTERS) are delegated back to PostgreSQL as normal.
3. **WindowAgg Extraction**. Extracts and converts the `pdb.agg()` placeholder into a form Tantivy (our underlying search library) understands, adding the information to the custom `PdbScan`. This is where we parse the JSONB definition and build the aggregation tree (terms, range, etc.) to run inside the search engine.

The key insight is timing, this integration happens *before* PostgreSQL attempts standard window function execution. By execution time the window function doesn’t exist and PostgreSQL is running what it sees as a normal custom scan node.

## **How Faceting Executes Inside ParadeDB**

Once PostgreSQL hands off the plan, the real work begins inside ParadeDB's search layer. At this point, we're no longer working with rows, we're working with **search indexes** and **column data** inside Tantivy. The key idea is that ParadeDB executes both **ranking** and **aggregation** in the _same pass_ through the index.

### **1. Searching and Collecting**

Every faceting query starts by traversing the search index using the search criteria from the query. We walk through the index entries for matching terms, answering the search part of the query and producing a stream of matching document IDs. This stream forms an ordered iterator representing the candidate documents.

This candidate document stream is then passed into a **compound collector**, which coordinates all work done over the same stream of documents. This collector runs multiple sub-collectors in parallel, so we can compute different kinds of results (like ranked hits and aggregations) without ever rereading the index.

### **2. TopDocs Collector (ORDER and LIMIT)**

The **TopDocs collector** is responsible for ranking and limiting the result set. The ranking could be by BM25 score (`pdb.score(id`), or it could be by a column value (which gets looked up from the columnar storage).

Tantivy uses a quickselect buffer to maintain only the top K highest ordered documents, traversing the document stream and only keeping K records at any one time using a very small amount of memory.

### **3. Aggregation Collector (Building Facets)**

The Aggregation collector is where the actual faceting magic happens. While TopDocs is busy ordering documents, this collector is busy counting things.

For every matching document, the facet collector fetches and aggregates only the fields you asked for (category, price, rating etc.) directly from Tantivy's columnar storage. This matters because we never need to reconstruct full rows like PostgreSQL would, we can pull just the specific column values needed for aggregation, reducing the amount of data we need to read into memory.

Tantivy's columnar storage is designed for exactly this pattern: fast per-document, single-column lookups. For string fields, it uses dictionary encoding, so the entire aggregation happens on integers rather than strings, which are only converted back to strings at the very end. This makes the aggregation process extremely efficient, we're counting integer IDs rather than comparing string values millions of times per second.

Here's where things get interesting from a correctness perspective: this same collector also checks and applies **MVCC filtering** to maintain full ACID guarantees, only aggregating values that are visible to your current transaction. When the MVCC flag is set to `false`, we skip these visibility checks entirely for maximum performance, which is useful for analytics workloads where approximate counts are acceptable.

### **5. Materializing the Final Result**

Once collection finishes, ParadeDB has two outputs:

- The **ranked hits** from the TopDocs collector.
- The **facet aggregation result** from the Aggregation collector.

The Aggregation collector serializes its in-memory facet map into JSON and attaches it as a single facets column next to the search the query output. This will be mapped back to the value for the `pdb.agg() OVER ()` window.

## Conclusion

Faceting is one of those features that looks simple on the surface, but is surprisingly hard to make both easy for developers to use and efficient under the hood. ParadeDB's approach solves this, bringing Elasticsearch-style faceting directly into PostgreSQL. By integrating faceting at the search engine level, we can execute both ranking and aggregation in a single pass through the index, delivering order-of-magnitude performance improvements without forcing you to manage separate search infrastructure.

The result is a simple SQL interface that gives you the best of both worlds: PostgreSQL's ACID guarantees and transactional semantics, combined with the performance and developer experience of modern search engines.

Ready to try faceted search in your PostgreSQL database? [Get started with ParadeDB](https://docs.paradedb.com/documentation/getting-started/install) to see how faceting can transform your search experience, and please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).

---

[^1]: To create a ParadeDB search index, you would run: `CREATE INDEX ON hn_items USING BM25 (id, by, text, title, (url::pdb.literal), (type::pdb.literal), time, score) WITH (key_field=id);`. All non-text and `literal` tokenized text fields are available to use in faceting. Full documentation on index creation can be found under our [CREATE INDEX documentation](https://docs.paradedb.com/documentation/indexing/create-index).

[^2]: [MVCC (Multi-Version Concurrency Control)](https://www.postgresql.org/docs/current/mvcc.html) is PostgreSQL's mechanism for handling concurrent transactions by maintaining multiple versions of data when rows are modified. The built in [VACUUM](https://www.postgresql.org/docs/current/routine-vacuuming.html) asynchronously cleans up these old versions. Accessing rows includes visibility checks in both tables and indexes to ensure each transaction sees the correct version of data.

---

# Hybrid Search in PostgreSQL: The Missing Manual

> Build production-ready hybrid search that combines BM25 lexical matching with vector similarity search, all inside PostgreSQL

Published: 2025-10-22 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/blog/hybrid-search-in-postgresql-the-missing-manual

---

PostgreSQL's search capabilities go much further than most developers realize. While many reach for external search engines like Elasticsearch or specialized vector databases like Pinecone, PostgreSQL can deliver the same functionality, you just need to know which extensions unlock its potential.

[ParadeDB](https://github.com/paradedb/paradedb) brings production-ready full text search with BM25 scoring for lexical relevance, while [pgvector](https://github.com/pgvector/pgvector) delivers vector similarity for semantic understanding. But how do you combine these into a hybrid system?

**This is the missing manual for hybrid search in PostgreSQL**.

We'll walk through building a complete solution that combines lexical precision with semantic understanding, using Reciprocal Rank Fusion (RRF) to create rankings that are both accurate and meaningful. By the end, you'll have a production-ready search strategy that lives entirely within your database, with no external dependencies or synchronization headaches.

## Why Not PostgreSQL's Native Full-Text Search?

Before diving into hybrid search, let's address the elephant in the room: PostgreSQL already has [full-text search](/learn/search-concepts/full-text-search) built-in. Why not just use that?

PostgreSQL's native [full-text search](/learn/search-in-postgresql/full-text-search) with [`tsvector`](https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSVECTOR) and [`tsquery`](https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSQUERY) is functional for basic text matching, but it has a fundamental limitation that becomes apparent as soon as you order results. The ranking functions like [`ts_rank`](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING) only consider individual documents in isolation: they don't understand global corpus statistics.

Here's what this means in practice. When `ts_rank` sees "PostgreSQL" appears three times in a document, it scores that document higher than one where it appears once. That makes sense locally. But it can't tell you whether "PostgreSQL" is a common term (appearing in 80% of your documents) or a rare, discriminating term (appearing in just 5%).

It might seem simple, but this lack of global context is the fundamental difference between basic text search and modern relevance ranking.

## Lexical Search with BM25

This is where [BM25](/learn/search-concepts/bm25) enters the picture. BM25 was developed by information retrieval researchers to solve this global context problem. It's the algorithm that powers almost all modern search engines, including Elasticsearch and Solr.

BM25 solves the ranking problem by combining three intelligent signals:

- **Term Frequency:** Documents mentioning query terms more often are more relevant, but with diminishing returns. The difference between mentioning "PostgreSQL" once vs. twice is significant, but the difference between 20 times vs. 21 times is negligible.

- **Inverse Document Frequency:** Rare terms matter more than common ones. Finding "pgvector" in a document is much more discriminating than finding "PostgreSQL", it tells you something specific about what this document covers.

- **Document Length Normalization:** Shorter, concise documents are preferred over long ones that might mention terms incidentally. A focused paragraph about PostgreSQL performance is probably more relevant than a 50-page manual that mentions it in passing.

While there are [a few ways](/learn/search-in-postgresql/bm25) to get BM25 into Postgres (and a few ways of using an external search engine from Postgres), the easiest and most performant approach is with the ParadeDB [pg_search extension](https://github.com/paradedb).

ParadeDB brings production-ready BM25 directly into PostgreSQL as a native index type, with none of the operational complexity of external search systems.

### BM25 With ParadeDB

Getting started with ParadeDB is straightforward. Once you have the extension installed, you can create BM25 indexes just like any other PostgreSQL index type, then query them with a simple operator syntax.

  To add `pg_search` to your PostgreSQL instance (or to spin up a new instance
  with Docker) check out our [deployment
  guides](https://docs.paradedb.com/deploy/overview).

```sql
CREATE EXTENSION pg_search IF NOT EXISTS;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT
);

-- Create BM25 index using the english tokenizer with stemming enabled
CREATE INDEX idx_documents_bm25 ON documents
USING bm25 (
  id,
  title::pdb.simple('stemmer=english'),
  content::pdb.simple('stemmer=english')
)
WITH (key_field=id);

-- Rank relevance with BM25 across documents that mention "postgresql" or "search"
-- in the title or content.
-- Boost the title matches 2x (as they are more likely to be relevant)
SELECT id, title, pdb.score(id) AS bm25_score
FROM documents
WHERE
 title   ||| 'postgresql search'::boost(2) OR
 content ||| 'postgresql search'
ORDER BY bm25_score DESC;
```

The `|||` operator above is for match disjunction, but many other options are available. Conjunction, phrase matching, highlighting, more like this, regex queries and word proximity are all supported. ParadeDB is also smart about query optimization. It can push down `WHERE` clauses and faceting aggregations into the BM25 index where possible, making complex filtered searches much faster than traditional approaches.

This kind of lexical search excels at exact or similar matches: it's perfect when users search for specific terms, product codes, or technical concepts they know by name (or at close, see our blog on [tokenization and stemming](/blog/when-tokenization-becomes-token) for a deep dive on how this works).

But BM25 has a fundamental limitation. When someone searches for "database performance optimization," BM25 won't find documents that only mention "PostgreSQL tuning" or "query optimization", even though these documents might be exactly what the user needs. This is where we need the second piece of our hybrid search puzzle.

## Vector Similarity Search

[Vector search](/learn/search-concepts/vector-search) solves the semantic understanding problem by working with meaning instead of words. While BM25 sees "postgresql" and "database" as completely different terms, vector search understands they're conceptually related.

The magic happens through embedding models, AI networks trained to convert text into high-dimensional vectors (typically 768, 1024, or 1536 dimensions). These vectors have a remarkable property: semantically similar concepts end up close together in vector space. "PostgreSQL" and "database" will have similar embeddings, as will "performance" and "optimization."

  PostgreSQL doesn't generate embeddings automatically. You need to create
  vectors using external embedding models (like OpenAI's text-embedding-ada-002
  or open-source alternatives) and store them in your database. This can happen
  at write time when inserting documents, or through batch updates of existing
  content.

This means vector search can find relevant documents even when they don't contain your exact search terms. Search for "database speed improvements" and you'll find documents about "PostgreSQL performance tuning" or "query optimization techniques", something lexical search would miss entirely.

Getting vector search into PostgreSQL used to require external vector databases or complex custom solutions. But the [pgvector](https://github.com/pgvector/pgvector) extension changed this, bringing vector operations directly into PostgreSQL. Like ParadeDB for BM25, pgvector makes vector search a first-class PostgreSQL feature with optimized indexes and distance operators.

### Vector Search With pgvector

Adding vector search to our existing documents table requires installing pgvector, adding a vector column for embeddings, and creating an index for similarity queries.

  To add pgvector to your Postgres instance check out their [installation
  instructions](https://github.com/pgvector/pgvector?tab=readme-ov-file#installation).
  If you're using the [ParadeDB Docker
  image](https://hub.docker.com/r/paradedb/paradedb) then pgvector will be
  installed and ready to go.

```sql
-- Enable pgvector
CREATE EXTENSION vector IF NOT EXISTS;

-- Add embedding column
ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- Create vector index
CREATE INDEX idx_documents_vector ON documents
USING hnsw (embedding vector_cosine_ops);

-- Search by similarity
-- The vector is the vector representation of your query text, which would be created in your app
SELECT
    id,
    title,
    1 - (embedding <=> '[0.1,0.2,...]'::vector) AS similarity
FROM documents
ORDER BY similarity DESC;
```

Vector search excels at semantic similarity, making conceptual connections that lexical search misses. But this strength is also its weakness. Vector search trades precision for understanding.

When someone searches for a specific product code like "PG-15.4" or an exact technical term like "pg_stat_statements," vector search might return documents about related but different concepts. The embedding model doesn't distinguish between "PostgreSQL 15.4" and "PostgreSQL 14.2", they're all just "PostgreSQL versions" in vector space.

This is the fundamental trade-off: BM25 gives you precision but misses semantics, while vector search gives you semantic understanding but can be fuzzy on exact matches. What if you could have both?

## Hybrid Search: The Best of Both Worlds

[Hybrid search](/learn/search-concepts/hybrid-search) combines lexical and semantic approaches to get the benefits of both: the precision of BM25 and the semantic understanding of vector search. But combining two different search approaches isn't trivial, you can't just add BM25 scores to vector similarity scores. They're measured on completely different scales.

This is where [Reciprocal Rank Fusion (RRF)](/learn/search-concepts/reciprocal-rank-fusion) provides an elegant solution. Instead of trying to normalize and combine raw scores, RRF focuses on rankings. The insight is brilliant in its simplicity: if a document ranks highly in multiple different search systems, it's probably very relevant.

## Reciprocal Rank Fusion

RRF works by converting rankings into scores using a simple formula:

$$RRF(document) = Σ 1 / (k + rank_i(document))$$

Where `k` is a constant (typically 60) that controls how quickly scores decrease with rank, and `rank_i` is the document's rank in system `i`. Documents that rank highly in multiple systems get the highest combined scores.

The beauty of RRF is that it's scale-independent. Whether your BM25 scores range from 0-10 or 0-1000, and whether your vector similarities are between 0-1 or -1 to 1, RRF only cares about the relative rankings. This makes it incredibly robust and easy to tune.

### Basic RRF Implementation

```sql
WITH
--- Full-text search, using pg_search and BM25 for ranking
fulltext AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY pdb.score(id) DESC) AS r
  FROM documents
  WHERE content ||| 'keyboard'
  LIMIT 20
),

--- Semantic search, using pgvector and cosine distance for ranking
semantic AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> '[1,2,3]') AS r
  FROM documents
  LIMIT 20
),

-- Calculate RRF contributions from each ranker
rrf AS (
  SELECT id, 1.0 / (60 + r) AS s FROM fulltext
  UNION ALL
  SELECT id, 1.0 / (60 + r) AS s FROM semantic
)

-- Sum the RRF scores, order by them, and join back the original data
SELECT
  m.id,
  SUM(s) AS score,
  m.content
FROM rrf
JOIN documents AS m USING (id)
GROUP BY m.id, m.content
ORDER BY score DESC
LIMIT 5;
```

  Notice the `LIMIT 20` in each ranking CTE. RRF works best with the top
  candidates from each system rather than feeding entire result sets. This
  focuses the fusion on the most promising documents while keeping computation
  efficient.

### Weighted RRF

The basic RRF implementation treats both search systems equally, but in practice you might want to emphasize one approach over the other. Maybe your users tend to search for specific technical terms (favor BM25), or perhaps they use more conversational queries (favor vector search).

Weighted RRF lets you balance lexical vs semantic influence by applying different weights to each system:

```sql
rrf AS (
  SELECT id, 0.7 * 1.0 / (60 + r) AS s FROM fulltext -- 70% weight
  UNION ALL
  SELECT id, 0.3 * 1.0 / (60 + r) AS s FROM semantic -- 30% weight
)
```

This configuration emphasizes lexical matching over semantic similarity, which works well for technical documentation where users often search for specific terms, function names, or error messages. Adjust the weights based on your use case and user behavior.

## Beyond Search: Adding Signals

One of the most powerful aspects of RRF is how naturally it extends beyond just search relevance. Real-world search systems need to balance relevance with business requirements like popularity, freshness, user preferences, or content quality.

Traditional search engines solve this with complex scoring functions that are difficult to tune and debug. RRF takes a different approach: treat each business requirement as a separate ranking system, then fuse them all together.

Want to boost popular content? Create a ranking based on view counts. Need to surface recent articles? Add a recency ranking. Want to promote high-quality content? Include editorial ratings. RRF lets you compose these signals naturally:

```sql
--- Popularity ranking based on view counts
popularity AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY view_count DESC) AS r
  FROM documents
  LIMIT 1000
),

--- Recency ranking based on creation date
recency AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY created_at DESC) AS r
  FROM documents
  LIMIT 1000
),
```

Notice the higher limits (1000) for non-search signals compared to search results (20). Non-search rankings like popularity and recency can afford larger candidate sets since they don't depend on query-specific relevance filtering.

This multi-signal approach gives you incredible flexibility. You can adjust weights seasonally (boost recency during news events), by user type (emphasize popularity for new users, relevance for power users), or by content type (prioritize freshness for blog posts, authority for documentation).

The key insight is that each signal is independent and interpretable. If your search results seem too focused on popular content, just reduce the popularity weight. If recent content isn't surfacing enough, increase the recency weight. No complex debugging of scoring functions, just intuitive weight adjustments.

## Building Better Search

Hybrid search in PostgreSQL combines BM25's lexical precision with vector embeddings' semantic understanding using RRF fusion. ParadeDB and pgvector make this possible without external dependencies, everything runs in your existing database with ACID guarantees and transactional consistency.

The SQL-based approach means you can see exactly how rankings work, adjust weights intuitively, and add business logic as needed. No black-box algorithms or complex external systems to manage.

Ready to get started? [Install ParadeDB](https://docs.paradedb.com/documentation/getting-started/install) and use RRF to combine BM25 with pgvector into hybrid search that actually works, and please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).

---

# Postgres as a Search Engine: The Write Performance Problem

> How ParadeDB achieved 10x improved write throughput through searchable buffers, background merging, and handling Postgres HOT chains.

Published: 2025-12-15 · Author: Ming Ying, James Blackwood-Sewell

Source: https://www.paradedb.com/blog/increased-write-performance

---

Supporting update-heavy workloads is one of the most challenging database problems in the search space. The most popular search libraries, [Lucene](https://github.com/apache/lucene) (which powers Elasticsearch and Solr) and [Tantivy](https://github.com/quickwit-oss/tantivy) (which powers ParadeDB and Quickwit), use read-optimized, immutable data structures that excel at query performance but struggle with high throughput non-batched writes.

This presents a fundamental trade-off: updating a single field requires reindexing and serializing all fields which are stored alongside it, and merging these data structures during compaction adds significant write overhead. For applications that need both fast search and high write throughput (real-time dashboards, e-commerce catalogs, and recommendation systems) this bottleneck can be a deal-breaker.

We set out to solve this problem in ParadeDB. Through two key optimizations, searchable buffers and background merging, we achieved more than a **10x improvement in write throughput** while maintaining search performance. In this post we'll dive into how we built these optimizations into Postgres, and the engineering challenges we encountered along the way.

## The Challenge: Write-Optimized vs. Read-Optimized Data Structures

Almost all search engines use LSM (log structured merge) trees or WORM (write once, read many) architectures. In both patterns, once data is written it's immutable with no in-place updates possible.

Take Lucene's segment-based architecture, which Tantivy closely mirrors. Data is arranged into immutable segments, essentially sub-indexes that contain compressed data optimized for read performance. Writing new data requires:

1. Buffering writes in memory until reaching a threshold (or a user-initiated commit)
2. Flushing the entire buffer to create a new segment (which could be tiny if it represents a single operation)
3. Periodically merging smaller segments into larger ones to maintain performance at query time

This approach works well for append-only workloads, but it breaks down when you need to support frequent updates or high throughput non-batched writes. Each write operation touches multiple data structures, and the periodic merge operations can create significant write amplification as thousands of tiny segments get rolled up into bigger ones. Reads also suffer, with results not reflecting recently written data until after the buffers are flushed.

This is particularly problematic for ParadeDB because we respect [Postgres' ACID guarantees](/blog/elasticsearch-acid-test). We can't rely on in-memory buffers that might be lost on crash, so we need to persist index changes within each transaction. If a transaction inserts or updates a single value, we create a segment with just one or a few rows. With many non-batched write operations, this results in thousands of tiny segments.

While Elastic still has this issue, it's partially mitigated by not fully supporting ACID semantics. The values of writes are durable but they aren't immediately searchable with the search indexes being updated lazily. This means that a value which is written and committed may not show up in fast follow search results.

## Building Searchable Buffers

Our first optimization was creating our own mutable write buffer inside Postgres, and making it searchable. Traditional search engines maintain an in-memory buffer that gets flushed periodically, but this buffer typically isn't queryable or durable meaning there's no read-after-write or even durability guarantee. That was never an option for us. Previously, we were aggressively committing Tantivy’s in-memory buffer on every Postgres transaction, which produced an excessive number of tiny segments.

We quickly learned maintaining a persistent mutable buffer for full document payloads on top of the Tantivy structures would be expensive, essentially double writing everything. Luckily Postgres comes with a built in row identifier we can store, and then use as a lookup key. Every row in Postgres has a `ctid`, a `(u32, u16)` tuple that identifies its physical location. By maintaining a fixed-size buffer of `ctid` values rather than full document content, we could dramatically reduce the cost of each write operation.

_Image: ParadeDB mutable segments provide a search buffer which offsets the negative effects of high-throughput insert and update workloads_

Here's how it works:

1. **Minimal write overhead**: Instead of indexing and serializing all fields for each write, we simply append a single `(u32, u16)` ctid to a per-connection buffer. On commit, this buffer is persisted in a Postgres block-backed linked list.
2. **Search-Time index construction**: When executing a query, we construct an in-memory index over the contents of all visible buffers to provide read-your-writes. For typical buffer sizes (1000 records), the additional overhead of querying the buffer is minimal compared to searching large persistent segments.
3. **Immutable segment optimization**: When buffers are full they can be written out to larger Tanitvy immutable segments, reading the actual data using the `ctid` as a key,

This approach gives us the best of both worlds: minimal write overhead and immediate searchability without waiting for mutable segment buffers to flush to immutable segments.

## The HOT Chain Challenge

Implementing searchable buffers seemed straightforward until we encountered Postgres's Heap-Only Tuple (HOT) optimization. This is where our integration with Postgres's storage layer created an unexpected challenge.

In Postgres, rows are never updated in-place. Instead, each update creates a new tuple version which is appended to the table (old versions are cleaned up later by the vacuum process). To reduce write amplification HOT chains form when:

1. No indexed columns are modified in the update
2. The new tuple version fits on the same page as the original

When this happens, Postgres creates a "heap-only" tuple that's linked to the original but not propagated to indexes, saving having to update all indexes for all tuples:

When a heap only tuple is read by an index it still points at the original index entry, and the chain is followed to find the most recent version. This happens without any additional IO, as the 8KB page has already been read into memory.

_Image: Heap only tuples (HOT) are a write time optimization in Postgres that uses inter-page linking to avoid updating indexes during UPDATE queries._

The problem arises when our searchable buffer contains a `ctid` which points to outdated tuple versions. If we try to fetch data from these stale ctids, we might encounter:

- Invalid pointers to TOAST (large object) data
- Corrupted reads from deallocated memory
- Inconsistent search results

To fix this issue we implemented Postgres's HOT chain following mechanism. Before indexing any tuple data, we now call into Postgres's internal functions to "follow the chain" and ensure we're always working with the current tuple version.

## Background Merging for Write Isolation

Our second major optimization addresses the expensive merge operations that compact smaller segments into larger ones. These merges are crucial for maintaining read performance, but they traditionally block writes and can cause significant latency spikes. In the past ParadeDB had several layers of foreground merging, combined with several layers of background merging.

Our solution runs merges using Postgres dynamic background workers: separate on-demand processes that handle compaction without blocking the main write path. Luckily, we build ParadeDB using [pgrx](https://github.com/pgcentralfoundation/pgrx) (a Rust framework for developing Postgres extensions) which provides a great [API](https://docs.rs/pgrx/latest/pgrx/bgworkers/struct.DynamicBackgroundWorker.html) for managing these workers. [James, our Marketing guy](https://github.com/pgcentralfoundation/pgrx/pull/5), wrote the original pgrx `BGworker` implementation, then [Yurii Rashkovskii](https://github.com/pgcentralfoundation/pgrx/commit/8aa148a697e01d403296d344e152395f6de6d3ac) extended it for `DynamicBackgroundWorker` management.

The background workers operate independently, merging segments based on size while the main database continues processing writes. This approach provides several benefits:

1. **Write Isolation**: Foreground writes never block on merge operations
2. **Resource Control**: Background workers can be throttled or prioritized based on system load
3. **Fault Tolerance**: Failed merge operations don't impact the main write path
4. **Parallelization**: Multiple workers (defined by a configuration option) can handle different merge operations concurrently

## Why Faster Writes Matter

These optimizations aren't just about raw performance, they solve real architectural problems that our users face. Many ParadeDB deployments serve as logical replicas of production Postgres databases, essentially acting as dedicated search instances that stay in sync with the primary.

In this setup, write performance directly impacts:

- **Replication Lag**: Slow indexing causes the search replica to fall behind the primary
- **WAL Accumulation**: Backlogged writes can lead to disk space issues on the primary
- **Search Freshness**: Users expect recently updated data to be immediately searchable
- **Operational Stability**: Write bottlenecks can cascade into broader system issues

By achieving an order of magnitude faster write throughput improvements, we've enabled ParadeDB to keep pace with even the most write-heavy production workloads while maintaining the search performance that users expect.

## Performance Results

Our write optimizations delivered significant improvements across multiple benchmarks:

- **10x write throughput** compared to our previous implementation
- **Immediate search visibility** for new writes (vs. eventual consistency)
- **Reduced replication lag** in logical replica configurations
- **Lower resource utilization** during high-write periods

As we've mentioned in previous blog posts the difference between v0.18.0 (no mutable segments, primarily foreground merging), v0.19.0 (mutable segments and background merging), and v0.20.0 (small improvements, and both features on by default) is large. The following test runs batched updates in a single thread at the same time as inserts and Top K queries to simulate a realistic workload (see [Stressgres, our load testing tool](https://github.com/paradedb/paradedb/tree/main/stressgres)).

These improvements make ParadeDB viable for use cases that were previously challenging, where full-text search, correctness, and speed are all critical.

## Looking Forward

Write performance is just one piece of the larger puzzle. ParadeDB already offers a compelling Elasticsearch alternative built on Postgres foundations, and we're continuing to invest in optimizations that push even further. With v0.20.0 we focused on performance and developer experience (with our new [v2 API](/blog/v2api)), and coming into 2026 we've got our sights on making joins better (if that sounds interesting to you then please reach out).

Our combination of Postgres's ACID guarantees with Elasticsearch-level search performance opens up exciting possibilities for application architectures. Instead of managing separate systems for transactions and search, developers can build on a single, consistent foundation.

Ready to experience faster search indexing? [Get started with ParadeDB](https://docs.paradedb.com/documentation/getting-started/install) to see how write-optimized search can simplify your architecture, and please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).

---

# Introducing ParadeDB

> We're excited to announce ParadeDB: a PostgreSQL database optimized for search.

Published: 2023-08-31 · Author: Ming Ying

Source: https://www.paradedb.com/blog/introducing-paradedb

---

We're excited to announce ParadeDB: a PostgreSQL database optimized for search.
ParadeDB is the first Postgres database built to be an Elasticsearch alternative, engineered for lightning-fast full
text, semantic, and hybrid search over Postgres tables.

## Why We Built ParadeDB

For many organizations, search remains an unsolved problem. Despite the existence of giants like Elasticsearch, most developers who have worked with Elasticsearch know how incredibly painful it is to run, tune, and manage. While alternative search engines exist, gluing these external services on top of an existing database introduces the headache and costs of reindexing and duplicating data.

Many developers seeking a unified source of truth and search engine turn towards Postgres, which offers basic full text search via `tsvector` and semantic search via `pgvector`. These tools may work for simple use cases and medium-sized datasets, but break down when tables get large or queries become complex:

1. Slow ranking and phrase search over large tables
2. No support for BM25 calculations
3. No support for hybrid search, a technique that combines vector search with full-text search
4. No real-time search — data must manually be re-indexed or re-embedded
5. Limited support for complex queries like faceting or relevance tuning

By now, we’ve witnessed dozens of engineering teams who have begrudgingly stitched Elasticsearch on top of Postgres,
only to ditch it later because it was too bloated, expensive, or convoluted. We asked ourselves: what if Postgres
itself was built for Elastic-quality search? What if developers didn’t need to choose between one unified Postgres database
with limited search capabilities or two separate services, one as the source of truth and one as the search engine?

## Who ParadeDB is For

Elasticsearch has many use cases, and we aren’t trying to tackle all of them — at least not yet. Instead, we’re focused
on nailing a core set of use cases for Postgres users that want to search over their database. ParadeDB is a good fit
for you if:

1. You want a single, Postgres-based source of truth and hate duplicating data across multiple services
2. You want to perform full-text search over large volumes of documents stored in Postgres without compromising on performance or scalability
3. You want to combine ANN/similarity search with full text search for improved semantic matching

## The Product

ParadeDB is gives Postgres the capabilities of a dedicated search engine.

1. **BM25 scoring**: Full text search with support for boolean, fuzzy, boosted, and keyword queries. Results are scored using the BM25 algorithm.
2. **Faceted search**: Easy bucketing and collection of metrics over full text search results.
3. **Hybrid search**: Results are scored with a combination of semantic relevance (i.e. vector search) and full text relevance (i.e. BM25).
4. **Real time search**: Text indexes and vector columns are automatically kept in sync with the underlying data.

## How ParadeDB Is Built

The core of ParadeDB is a vanilla Postgres database with an extension, written in Rust, that introduce enhanced search
capabilities.

ParadeDB’s search engine is built on Tantivy, an open-source, Rust-based search library heavily inspired by Apache Lucene.
Search indexes are stored in Postgres as native Postgres indexes, which obviates the need to pipe data out of Postgres
and duplicate it in foreign services and guarantees transaction safety.

ParadeDB introduces a new extension to the Postgres ecosystem: `pg_search`. `pg_search` implements
Rust-based full text search in Postgres using the BM25 scoring algorithm. This extension comes pre-installed
with ParadeDB.

## What's Next

We are currently building a cloud version of ParadeDB, and already offer a commercial self-hosted version with
support and enterprise features. If you would like to request access to the ParadeDB commercial offerings, we invite you to
[join our waitlist](https://form.typeform.com/to/jHkLmIzx).

The focus of the core team is on developing the open-source version of ParadeDB, which we will be launching in winter 2023.

We’re building in public and are excited to share ParadeDB with the community. Stay tuned for updates — in future blog
posts, we'll be covering some of the interesting technical challenges behind ParadeDB.

---

# pg_search: Elastic-Quality Full Text Search Inside Postgres

> Introducing the search capabilities of ParadeDB.

Published: 2023-11-15 · Author: Ming Ying

Source: https://www.paradedb.com/blog/introducing-search

---

With the release of `v0.6.0`, which marks the first stable release of the
  ParadeDB full-text search extension, we have renamed `pg_bm25` to `pg_search`.

We're unveiling `pg_search`: a Rust-based Postgres extension that significantly improves Postgres' full text
search capabilities. `pg_search` is named after BM25, the algorithm used by modern search engines to calculate the
relevance scores of search results.

Today, Postgres' native full text search, which uses the `tsvector` type, has two main problems:

1. **Performance**: Searching and ranking over large tables is sluggish. When tables grow to millions of rows, a single full text search can take several minutes.
2. **Functionality**: Postgres has no support for operations like fuzzy search, relevance tuning, or
   BM25 relevance scoring, which are the bread and butter of modern search engines.

`pg_search` aims to bridge the gap between the native capabilities of Postgres’ full text search and those of a specialized search engine like Elasticsearch.
The goal is to eliminate the need to bring a cumbersome service like Elasticsearch into the data stack.

Some features of `pg_search` include:

- 100% Postgres native, with zero dependencies on an external search engine
- Built on top of Tantivy, a Rust-based alternative to the Apache Lucene search library
- Query times over 1M rows are 20x faster compared to `tsquery` and `ts_rank`, Postgres' built-in full text search and sort functions
- Support for fuzzy search, aggregations, highlighting, and relevance tuning
- Relevance scoring uses BM25, the same algorithm used by Elasticsearch
- Real-time search — new data is immediately searchable without manual reindexing

`pg_search` stands on the shoulders of several open-source giants. The goal of this blog post is to recognize these projects
and to share how `pg_search` was built.

## The Shoulders of Giants

Putting a search engine inside of Postgres is hard. A few projects have attempted it, but with one caveat: every
single one has relied on an external Elasticsearch instance. This means introducing a
complex and expensive piece of infrastructure into the data stack. Perhaps the best-known example of this kind of design
is a Postgres extension called [ZomboDB](https://github.com/zombodb/zombodb).

In 2016, an open source search library called [Tantivy](https://github.com/quickwit-oss/tantivy) emerged. Tantivy
was designed as a Rust-based alternative to Apache Lucene, the search library that powers Elasticsearch.
Three years later, a library called [pgrx](https://github.com/pgcentralfoundation/pgrx) — built by the same
author of ZomboDB — made it possible to build Postgres extensions in Rust.
Combined, these projects laid the groundwork for a Postgres extension that could create Elastic-quality
search experiences within Postgres.

## Creating the Inverted Index

Like Elasticsearch, the backbone of Tantivy's search engine is a data structure called the inverted index,
which stores a mapping from words to their locations in a set of documents. An inverted index
is like the table of contents of a book — without it, you might have to examine every page to find
a specific chapter.

Rather than creating this inverted index externally, `pg_search` stores the
index inside Postgres as a new, Postgres-native index type, which we call the BM25 index. This is made possible
through Postgres' [index access method](https://www.postgresql.org/docs/current/indexam.html) API.

When a BM25 index is created, Postgres automatically updates it as new data arrives
or is deleted in the underlying SQL table. In this way, `pg_search` enables real-time search without any
additional reindexing logic.

## Building the SQL Interface

Following index creation, the next step was to expose an intuitive SQL interface for users to write search queries.
This was accomplished through the Postgres [operator API](https://www.postgresql.org/docs/current/sql-createoperator.html),
which enables the creation of custom Postgres operators. We chose the `@@@` operator to signify the beginning
of a query to the BM25 index in homage to the `@@` operator used by Postgres' native full text search.

The end result is the ability to search any table with a single SQL query.

```sql
SELECT *
FROM my_table
WHERE my_table @@@ ('idx_mock_items', '"my query string"')
```

Wherever possible, we designed the SQL interface, ParadeQL, to transparently mirror Tantivy's API. For instance,
the right-hand side of the `@@@` operator accepts Tantivy's mini query language and configuration options.

```sql
SELECT *
FROM my_table
WHERE my_table @@@ 'description:keyboard^2 OR electronics:::fuzzy_fields=description&distance=2'
```

## Performance Benchmarks

On a table with one million rows, `pg_search` indexes 50 seconds faster than `tsvector` and ranks results
20x faster. Indexing and search times are nearly identical to those of a dedicated Elasticsearch instance.
With further optimizations, we're aiming to reduce the query times compared to Elasticsearch
by an additional 2x.

More detailed benchmark results can be found in the [README](https://github.com/paradedb/paradedb/blob/main/benchmarks/README.md).

## Wrapping Up

`pg_search` is ready for use today. There are two ways to try it: [installing it](https://github.com/paradedb/paradedb/tree/main/pg_search#installation) inside an existing, self-hosted Postgres instance,
or [running the Postgres Docker image](https://github.com/paradedb/paradedb#from-self-hosted-postgres).

`pg_search` is open-source and licensed under AGPL. If you'd like to contribute, the best place to start is our
[Slack community](/slack). And please don't hesitate to show your support by
[giving us a star](https://github.com/paradedb/paradedb)!

---

# Similarity Search with SPLADE Inside Postgres

> Introducing sparse vector support in ParadeDB for efficient vector search.

Published: 2024-09-05 · Author: Ming Ying

Source: https://www.paradedb.com/blog/introducing-sparse

---

With the release of `v0.7.0`, sparse vector support was added to `pgvector`.
  As a result, we have deprecated `pg_sparse` in favor of the upstream
  `pgvector` implementation.

We’re excited to unveil `pg_sparse`: the first extension to enable efficient storage and retrieval of sparse vectors using HNSW inside Postgres. `pg_sparse` is to sparse vectors what `pgvector` is to dense vectors.

Generated by new models like SPLADE, sparse vectors can detect the presence of exact keywords while also capturing semantic similarity between terms. Unlike dense vectors, sparse vectors contain significantly more entries, most of which are zero. For instance, OpenAI’s `text-embedding-ada-002` model outputs dense vectors with 1536 entries, whereas SPLADE outputs sparse vectors with over 30,000 entries.

`pg_sparse` is a fork of `pgvector`, which means that it leverages `pgvector`'s existing vector storage and HNSW implementation. It consists of two major changes:

1. A new Postgres data type called `svector` that stores sparse vectors by their non-zero entries
2. A modification to the way that `pgvector` allocates Postgres pages to support vectors with variable numbers of non-zero entries

You can easily get started by [installing `pg_sparse`](https://github.com/paradedb/paradedb#from-self-hosted-postgres) inside an existing, self-hosted Postgres instance
or by [running our Docker image](https://github.com/paradedb/paradedb#quickstart). Run the following query to get started:

```sql
-- Load extension
CREATE EXTENSION svector;
-- Create test data
CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(4));
INSERT INTO items (embedding) VALUES ('[1,0,3,0]'), ('[0,0,5,6]');
-- Create HNSW index for cosine similarity
CREATE INDEX ON items USING shnsw (embedding svector_cosine_ops);
-- Run query
SELECT * FROM items ORDER BY embedding <=> '[3,0,1,0]';
```

## SPLADE Inside Postgres: An Example

Next, let’s walk through a more complex example that inserts, indexes, and searches sparse vectors generated by SPLADE using `pg_sparse`. To create our sparse vectors, let’s install the dependencies:

```python
!pip install -U transformers torch datasets pandas tqdm
```

Next, let’s run Python code that

1. Loads an example Huggingface dataset containing the descriptions of 50,000 movies
2. Generates a SPLADE vector for each description
3. Saves the dataset as a CSV file

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from tqdm import tqdm
from datasets import load_dataset

model_id = 'naver/splade-cocondenser-ensembledistil'
dataset_name = 'SandipPalit/Movie_Dataset'

def create_splade(text):
    tokens = tokenizer(text, return_tensors='pt')
    output = model(**tokens)
    vec = torch.max(
        torch.log(
            1 + torch.relu(output.logits)
        ) * tokens.attention_mask.unsqueeze(-1),
    dim=1)[0].squeeze()

    return vec

def clean_text(text):
    return re.sub(r'[\r\n\t]+', ' ', text)

# Initialize SPLADE model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Load dataset
dataset = load_dataset(dataset_name)
training_dataset = dataset["train"]

# Generate SPLADE vectors
# Note this will take a long time, consider reducing the size of
# training_dataset to reduce the time
texts = []
vectors = []

for example in tqdm(training_dataset, desc="Processing..."):
    text = clean_text(example['Overview'])
    texts.append(text)
    vector = create_splade(text)
    vectors.append(vector.tolist())

# Save as dataframe
df = pd.DataFrame({
    'text': texts,
    'splade_vector': vectors
})

# Generate another SPLADE vector for querying
query = "Space exploration"
df.at[0, 'text'] = query
df.at[0, 'splade_vector'] = create_splade(query).tolist()

# Save to CSV
df.to_csv("splade_vectors.csv", index=False)
```

Using a Postgres client like `psql`, we can now load in our movie dataset. This will take a few minutes due to the size of the dataset.

```sql
CREATE TABLE movies (description text, splade_vector svector(30522));
\copy movies FROM '/path/to/splade_vectors.csv' DELIMITER ',' CSV HEADER;
```

Let’s also enable timing to examine the performance of our queries.

```sql
\timing
```

Now we’re ready to execute our first search. To begin, we’ll look for the ten rows that are most similar to the first row (”Space exploration") based on cosine similarity:

```sql
SELECT description
FROM movies
ORDER BY splade_vector <=> (SELECT splade_vector FROM movies LIMIT 1)
LIMIT 10;
```

Here are the results:

```csv
                                                                                                                         description
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Space exploration
 A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.
 Delhi Boys Going To Space
 1962. The US Genesis 1491 space mission launches with a unique objective, based upon mankind's ongoing quest to seek answers to existential questions. It is the dawn of the Golden Age of Space exploration, and the possibilities seem limitless...
 Prepare for liftoff as we explore NASA's Space Shuttle program's legacy, featuring rare footage and testimonies from the people who made it all possible.
 Join the StoryBots and the space travelers of the historic Inspiration4 mission as they search for answers to kids' questions about space.
 A loner inventor who dreams of exploring space works in a math lab until a female astronaut goes missing and he can use a space helmet to save her.
 An exploration of the Alien presence on Earth and the reality of suppressed free energy technology.
 An exploration into the fate of the post-modern man.
 No matter how clear the night sky is, no matter how many millions of stars are within view, looking up at the sky on a clear night still hides the halo of man-made debris around Earth that threatens the future of space exploration and endangers us all.
(10 rows)

Time: 154.361 ms
```

This query executed a Postgres sequential scan that took 155ms to return the first ten rows. Now, let’s create an HNSW index to speed up the search results. This may take a few minutes to run.

```sql
CREATE INDEX ON movies
USING shnsw (splade_vector svector_cosine_ops);
```

With the index created, let’s re-run the same search query as above. We get back the same results, but with a query time of 2.740 ms — a 50X speedup in query time!

```csv
                                                                                                                         description
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Space exploration
 A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.
 Delhi Boys Going To Space
 1962. The US Genesis 1491 space mission launches with a unique objective, based upon mankind's ongoing quest to seek answers to existential questions. It is the dawn of the Golden Age of Space exploration, and the possibilities seem limitless...
 Prepare for liftoff as we explore NASA's Space Shuttle program's legacy, featuring rare footage and testimonies from the people who made it all possible.
 Join the StoryBots and the space travelers of the historic Inspiration4 mission as they search for answers to kids' questions about space.
 A loner inventor who dreams of exploring space works in a math lab until a female astronaut goes missing and he can use a space helmet to save her.
 An exploration of the Alien presence on Earth and the reality of suppressed free energy technology.
 An exploration into the fate of the post-modern man.
 No matter how clear the night sky is, no matter how many millions of stars are within view, looking up at the sky on a clear night still hides the halo of man-made debris around Earth that threatens the future of space exploration and endangers us all.
(10 rows)

Time: 2.704 ms
```

## Benchmark Results

We measured index creation and query time over a dataset of 100K sparse vectors generated by SPLADE, each with 30,522 dimensions.
We set Postgres' `maintenance_working_mem` configuration to 512MB so that the entire HNSW graph could fit into memory during the build.

```sql
ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();
```

`ef_construction` is a parameter that creates a higher quality graph and more accurate search results, at the expense of longer
index creation times. At the default `ef_construction` value of `64`, the index takes 200 seconds to build (500 vectors / second).

Next, we compared the time to return the top 10 results with and without the HNSW index. We set `m=16`, `ef_construction=64`,
and `ef_search=40`. With HNSW, this search took 6ms. Without HNSW, this search took 150ms.

## Wrapping Up

`pg_sparse` is the latest installment in our roadmap to building an Elasticsearch alternative on Postgres. Now, Postgres
is capable of full-text search (via `pg_search`), dense vector search (via `pgvector`), and sparse vector
search.

`pg_sparse` is compatible alongside both `pg_search` and `pgvector`. In the future, we’d love to have our work merged upstream as part of one `pgvector` extension. A unified extension for all types of vector search is simpler to use and maintain, and would also enable sparse vector search in managed Postgres databases like AWS RDS that already support `pgvector`.

Until then, the best way to use `pg_sparse` is by [installing it](https://github.com/paradedb/paradedb#from-self-hosted-postgres) inside an existing, self-hosted Postgres instance,
or [running the Postgres Docker image](https://github.com/paradedb/paradedb#quickstart).

`pg_sparse` is licensed under the [PostgreSQL license](https://opensource.org/licenses/PostgreSQL) in
accordance with `pgvector`'s license. If you'd like to contribute, the best place to start is our
[Slack community](/slack). And please don't hesitate to show your support by
[giving us a star](https://github.com/paradedb/paradedb)!

---

# We Made Postgres Writes Faster, but it Broke Replication

> Exploring LSM tree implementation in PostgreSQL for better write performance.

Published: 2025-06-30 · Author: Stu Hood, Ming Ying

Source: https://www.paradedb.com/blog/lsm-trees-in-postgres

---

When we built `pg_search`, the Postgres extension for search and analytics, write throughput was a priority. To be an effective alternative to Elasticsearch we needed to support high ingest workloads in real time.
This is because many Elasticsearch use cases — like real-time dashboards, e-commerce search, and recommendation systems — involve continuous writes that must be indexed and made searchable immediately.

In a vanilla Postgres implementation, full-text search is backed by a [B-tree](https://www.postgresql.org/docs/current/btree.html) or [GIN](https://www.postgresql.org/docs/current/gin.html) (Generalized Inverted Index) structure. These indexes are good for relatively fast lookups, but they aren’t so fast for writes.

We opted for a data structure optimized for writes: a **Log-Structured Merge ([LSM](https://en.wikipedia.org/wiki/Log-structured_merge-tree)) tree**. That was great for write throughput, but it broke Postgres replication! Specifically, it broke physical replication, one of the two mechanisms that allow Postgres to replicate data from a primary node across one or more read replicas.
The other mode is logical replication, which sends individual row changes to replicas instead of copying the database byte-for-byte.

It turned out that Postgres's out-of-the-box support for physical replication, built on Write-Ahead Log (WAL) shipping, isn't quite enough for an advanced data structure like an LSM tree to be replication-safe. We were surprised to learn this, so we decided to write up our experience and describe how we fixed the problem.
In this post, we'll do a deep dive into:

1. What is an LSM tree?
2. What it means for an LSM tree to be replication-safe
3. How Postgres' WAL shipping guarantees physical consistency
4. Why atomic logging was necessary for logical consistency
5. How we leveraged a little-known but powerful Postgres setting called `hot_standby_feedback`

## What is an LSM Tree?

_Image: Diagram of an LSM tree showing the segment hierarchy and compaction process_

A Log-Structured Merge Tree (LSM tree) is a write-optimized data structure commonly used in systems like RocksDB and Cassandra.

The core idea behind an LSM tree is to turn random writes into sequential ones. Incoming writes are first stored in an in-memory buffer called a memtable, which is fast to update. Once the memtable fills up, it is flushed to disk as a sorted, immutable segment file (often called an SSTable).

These segment files are organized by size into layers or levels. Newer data is written to the topmost layer. Over time, data is gradually pushed down into lower levels through a process called compaction, where data from smaller segments is
merged, deduplicated, and rewritten into larger segments.

## What Do We Mean by Replication Safety?

A reliable distributed datastore (one which guarantees “replication safety”) must demonstrate both **physical** and **logical** consistency<sup>1</sup><sup>2</sup> across database replicas.

1. **Physical consistency** means the replica contains structurally valid data — each page or block on disk is well-formed and corresponds to a state that did exist on the primary at some point.
2. **Logical consistency** ensures that the data on the replica reflects a coherent and stable view of the database, something that could have been seen by a transaction on the primary.

A physically consistent state is _not_ always a logically consistent state. Specifically, if you take a snapshot of a physically consistent replica while replicating an in-flight transaction, it may not be logically consistent.
A good analogy is to imagine replicating a book. Physical consistency is like copying every page exactly, even if you're in the middle of a chapter — you're guaranteed to have real pages, but you might end up with half a sentence or a missing footnote.
Logical consistency is like waiting until the chapter is finished before copying it, ensuring the result makes sense to a reader.

## WAL Shipping: How Postgres Guarantees Physical Consistency

_Image: Illustration of Postgres WAL shipping from a primary to a standby to maintain physical consistency between replicas_

In a primary-standby physical replication setup, a primary server is paired with a standby server that acts as a read-only copy of its leader. The servers remain in sync by using a Write-Ahead Log (WAL) to record all binary-level changes to storage blocks on the primary server before they occur. Changes to this append-only WAL file are then streamed to the standby (a process called “log shipping”) and applied in the order received. This process enables near-realtime data synchronization between the two servers, hence the phrase “hot standby”.

## Why Atomicity is a Requirement for Physical Consistency

Atomicity is a requirement for physical consistency because Postgres locks are not replayed on replica servers. This is because replaying every lock taken on the primary would require strict timing synchronization, significantly impacting performance and hindering the ability of the standby to serve reads. Instead, the WAL uses per-buffer locks to incrementally replay edits in some particular order: it acquires an exclusive lock on the buffer (the in-memory representation of a block), makes its change, and then releases it.

The problem arises when modifying data structures that span many Postgres buffers. Without the guarantee that operations are atomic over the entire structure, these modifications can lead to structural corruption.

For example: `pg_search` uses [an unrolled linked list](https://github.com/paradedb/paradedb/blob/fa86f5635ea30b4c94608624591a624cefee83bd/pg_search/src/postgres/storage/linked_items.rs) of Postgres buffers where each node holds the read validity of a batch of segments in our LSM tree. To ensure that the primary can never observe a broken linked list, we use hand-over-hand locking (also known as lock coupling) to guarantee that the list remains physically consistent on the primary. After each buffer in the list is modified, its WAL entry becomes visible atomically on the replica.

But what happens when we want to edit multiple entries in the list “at once” (atomically), such as when a newly compacted segment replaces multiple old segments? If only the primary server mattered, then we could protect the logical integrity of multiple list nodes by applying a _global lock on the list itself_, ensuring that the contents of the list were only visible in valid state. But replicas don’t have access to global locks, so there’s no way to coordinate edits across multiple nodes (and multiple buffers) at once.

Instead, for multi-node operations `pg_search` uses a Copy-on-Write (CoW) clone of the list, and atomically swaps in the head. More generally, atomic operations insulate you from danger by eliminating reliance on coarse-grained locks.

_Image: Diagram of a copy-on-write linked list where a new version is built and the head pointer is atomically swapped to preserve consistency_

## A Problem: Vacuums Undermine Logical Consistency

Adapting algorithms to work atomically at the block level is table stakes for physical replication: if you don’t do it, your data structures are broken, and you won’t be able to use them consistently.

But even when individual WAL operations and data structures are atomically compatible, [VACUUM](https://www.postgresql.org/docs/current/sql-vacuum.html) can interfere with the execution of concurrent transactions spread across multiple WAL entries and compromise logical consistency.

To illustrate this problem, let's imagine the primary has a table with some number of rows. To ensure that concurrent writes can safely write without blocking each other, Postgres uses a mechanism called Multi-Version Concurrency Control ([MVCC](https://www.postgresql.org/docs/7.1/mvcc.html)), which creates multiple versions of a modified row (or _tuple_) instead of updating the tuple in place. When a row is updated or deleted, the previous tuple is not immediately removed but instead flagged as “dead”.

Those “dead” tuples remain on disk until a periodic maintenance operation called VACUUM is run. Like any other operation, VACUUMs are recorded in the WAL, shipped to the standby, and replayed there.

Since the “deadness” of a tuple is local to the server but the VACUUM operation is replayed globally, faults can occur when a tuple is VACUUMed from the standby prematurely. The standby might be mid-read, iterating (across multiple WAL entries) over a tuple that it considers valid, while the primary might concurrently decide to VACUUM that tuple out of existence. The standby, lacking any locking coordination or awareness of concurrent operations, replays the VACUUM while the prior transaction is still in progress. This can lead to query failure if a long-running query attempts to access a VACUUMed tuple.

## Why LSM Trees are Particularly Vulnerable to this Problem

If your Postgres is configured with read replicas and experiences a high volume of writes, you may have already seen this problem, even when using B-tree indexes. If a VACUUM is running on the primary at the same time that a query
hits a read replica, it's possible for Postgres to abort the read. However, these errors are likely infrequent and tolerable in a typical Postgres setup where VACUUMs run once every few hours.

The same cannot be said for an LSM tree, where compaction is a core, ongoing part of the system. In a high-write throughput system, compaction can happen many times per minute, even per second.
This increases the number of opportunities for conflicts to occur.

Similar to a VACUUM, a compaction rewrites data on the primary server, and needs to know when ongoing queries will no longer need that data in order to be able to safely delete old segments.

## The Logical Solution: Hot Standby Feedback

This is where an optional Postgres setting called `hot_standby_feedback` comes in. When enabled, `hot_standby_feedback` allows the standby to tell the primary what data is safe to clean up from the replica’s point of view. This information significantly reduces the likelihood of a tuple being prematurely VACUUMed, and allows `pg_search` to determine when it is safe to delete segments.

To understand what information `hot_standby_feedback` actually relays, we must first understand how tuple versioning works in Postgres. Every tuple in Postgres has two key metadata attributes: `xmin` and `xmax`. `xmin` stores the Transaction ID (XID) of the transaction that *created* or *inserted* that specific tuple version, while `xmax` stores the XID of the transaction that either *updated* or *deleted* the tuple version, effectively marking it as obsolete. When a tuple is deleted, the `xmax` value is updated with the XID of the deleting transaction. Since XID are assigned sequentially, later transactions are assigned a larger number for their XID, so another way to think about `xmin` is as a proxy for the tuple’s “creation time” and `xmax` for its “last updated or deleted time”.

_Image: Diagram of a Postgres tuple showcasing the xmin and xmax columns_

When `hot_standby_feedback` is enabled, the replica will periodically communicate the smallest `xmin` (oldest “creation time”) that any of its active queries is currently pinned to. This `xmin` identifies the oldest tuple still in use on the standby.

Armed with this information, the primary can make smarter decisions about when to permit cleanup operations (i.e. VACUUMs). If it sees that a standby query is still operating on a tuple that would otherwise be considered “dead,” it can defer cleanup until that query has finished.

## Final Thoughts

Even with the help of `hot_standby_feedback`, standby servers are fundamentally at the mercy of the WAL to provide instructions that are safe to execute in the order and moment they are received. The tension between the local incentives and global requirements is just one challenging dimension of the difficulty in achieving full replication safety in a distributed Postgres system.

To achieve both physical and logical consistency, `pg_search` implements an atomically logged LSM tree, and to achieve logical consistency, we rely on `hot_standby_feedback`.

This challenge was worth tackling because it enables the fastest possible search performance, without sacrificing consistency.
To see it in action, check out [our documentation](https://docs.paradedb.com/welcome/introduction) or our [open source project](https://github.com/paradedb/paradedb)!

<hr />

<ol>
  <li>
    Postgres supports both "physical replication" and "logical replication", and
    ParadeDB does too! This post only discusses physical replication.
  </li>
  <li>
    Physical and logical consistency are also referred to as structural
    integrity and transactional consistency.
  </li>
</ol>

---

# How We Optimized Top K in Postgres

> How ParadeDB uses principles from search engines to optimize Postgres' Top K performance.

Published: 2026-02-13 · Author: Ming Ying

Source: https://www.paradedb.com/blog/optimizing-top-k

---

In databases, _Top K_ means “give me the _K_ best rows, ordered by some column or value.” Commonly that means “the most recent rows,” “the highest scores,” or “the largest values.”

It feels like a basic problem that Postgres should solve. After all, can’t we just create an index? Yet in many production Postgres deployments, Top K is deceptively hard. This post examines where Postgres’ Top K optimizations shine, where they falter, and why search libraries like Lucene/Tantivy or databases like ParadeDB that specialize in Top K take a fundamentally different approach.

## Postgres' Solution: A Sorted B-Tree

Let’s begin with a single table containing 100M rows.

```sql
CREATE TABLE benchmark_logs (
    id SERIAL PRIMARY KEY,
    message TEXT,
    country VARCHAR(255),
    severity INTEGER,
    timestamp TIMESTAMP,
    metadata JSONB
);
```

We want to return the top 10 most recent rows by `timestamp`:

```sql
SELECT *
FROM benchmark_logs
ORDER BY timestamp DESC
LIMIT 10;
```

Without an index, this query takes 15 seconds. To accelerate it, we can create a B-Tree index on `timestamp`.

```sql
CREATE INDEX ON benchmark_logs (timestamp);
```

B-Trees are ideal for Top K queries because they are sorted structures, so the retrieval of Top K results is trivially O(K). Easy and done, right? Our query drops to an impressive 5ms.

  Under the hood, a B-tree is a hierarchy of Postgres pages (think of pages as
  nodes). Upper pages guide navigation, while the actual indexed values live in
  the leaf pages at the bottom in sorted order. To answer “give me the top K
  largest rows,” Postgres jumps from the root down to the leaf with the largest
  value and then walks backward through the linked leaves, reading entries until
  it has collected K rows. Because a B-tree is a balanced tree, the path from
  the root to any leaf is short and predictable.

## But Wait, We Need Filters Too

There’s a constraint hiding in the example above: it only works when the index fully matches the query shape. The moment you add filters that aren’t part of the index, things get complicated.

Consider a more realistic query with the filter `WHERE severity < 3`:

```sql
SELECT *
FROM benchmark_logs
WHERE severity < 3
ORDER BY timestamp DESC
LIMIT 10;
```

Now Postgres faces a dilemma. It can use the B-tree index on `timestamp` to get rows in the right `timestamp` order, but it has no way to skip directly to rows matching `severity < 3`. It must walk the index one entry at a time, checking the severity for each row and discarding most of them. In the worst case, this means walking the entire index. Or it can scan rows that match `severity < 3` first, but then it loses the ordering and must sort the results afterward.

Either way, Postgres ends up scanning far more rows than K, doing extra work to filter or sort, and suddenly we’re back to queries taking up to 15 seconds in the worst case.

## Sorting and Filtering = Combinatorial Explosion

A natural response is to add a composite B-tree index on both `severity` and `timestamp`.

```sql
CREATE INDEX ON benchmark_logs (severity, timestamp);
```

A composite B-tree is still a sorted tree, but now entries are ordered first by `severity`, and then by `timestamp` within each `severity` value. This works great for this query shape: Postgres can jump directly to the portion of the tree matching `severity < 3` and then walk the timestamps in descending order to get the top K rows.

The problem is that this solution doesn’t generalize. For instance, imagine we now also want to filter by `country`:

```sql
SELECT *
FROM benchmark_logs
WHERE country = 'United States'
AND severity < 3
ORDER BY timestamp DESC
LIMIT 10;
```

Or, if we change the sort columns:

```sql
SELECT *
FROM benchmark_logs
WHERE country = 'United States'
ORDER BY severity ASC, timestamp DESC
LIMIT 10;
```

Supporting every realistic combination of filters and sort orders requires a growing set of indexes. These indexes cause storage bloat, slower writes, and query plans that are hard to reason about.

## Postgres’ Top K Falls Apart with Search

Everything so far assumed that filters are simple predicates that can be expressed as equality or range conditions inside a B-tree. Full text search breaks that assumption.

Consider a query that combines Postgres’ native text search (which performs token matching instead of full string matching) with a range filter, and then
returns results sorted by relevance (i.e. a [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)-like score):

```sql
SELECT *,
       ts_rank(to_tsvector('english', message), q) AS rank
FROM benchmark_logs,
     plainto_tsquery('english', 'research team') AS q
WHERE to_tsvector('english', message) @@ q
  AND severity < 3
ORDER BY rank DESC
LIMIT 10;
```

This query may look similar to the earlier examples: filter rows, then return the top 10 by score. But internally, Postgres has no single structure that can satisfy all of these constraints at once.

Postgres can use a GIN (generalized inverted) index for the `tsvector` text search predicate and a B-tree for ordering or numeric filters. But because GIN does not preserve ordering and Postgres cannot combine ordering guarantees across different indexes, the planner must decompose the query into phases:

1. Use the GIN index to produce a (potentially large) set of matching row IDs
2. Fetch those rows from the “heap” (i.e. the underlying table storage)
3. Apply additional filters like `severity < 3`
4. Sort the surviving rows
5. Return the top 10 results

If the result set produced by GIN is large, the repeated heap fetches in Step 2 become very expensive. To demonstrate,
let's see how well this query can perform if we optimize it as much as possible. First, we'll precompute the
`to_tsvector` expression, which avoids some query-time overhead and helps the Postgres planner collect better statistics:

```sql
ALTER TABLE benchmark_logs
ADD COLUMN message_fts tsvector
GENERATED ALWAYS AS (to_tsvector('english', coalesce(message,''))) STORED;
```

Next, we'll create a GIN index on the generated column:

```sql
CREATE INDEX ON benchmark_logs USING gin (message_fts);
```

Finally, we'll replace the `to_tsvector` expressions in our query with the generated column:

```sql
SELECT *,
       ts_rank(message_fts, q) AS rank
FROM benchmark_logs,
     plainto_tsquery('english', 'research team') AS q
WHERE message_fts @@ q
  AND severity < 3
ORDER BY rank DESC
LIMIT 10;
```

With our GIN index in place, this query still takes 37 seconds to execute, largely because the query `research team`
returns millions of matches that must be checked against the filter and sorted.

As an additional optimization, we can create a partial GIN index on the predicate `severity < 3`.

```sql
CREATE INDEX ON benchmark_logs USING gin (message_fts) WHERE severity < 3;
```

Unfortunately, the query still takes roughly the same amount of time (33 seconds) because a large candidate set is still being returned
even with the `severity < 3` predicate<sup>1</sup>. Achieving further improvements would require altering the query itself to make it more selective
— which may or may not be feasible in real-world scenarios.

## Search Databases Think About Top K Differently

These examples highlight two problems with Top K in Postgres:

1. B-trees force you to pre-commit to a query shape, which is at odds with the idea of ad-hoc Top K queries.
2. B-trees don't play well with text search queries/GIN indexes, especially when the Top K candidate sets are large.

Search databases like ParadeDB take a fundamentally different approach. Rather than requiring many tailored indexes, we use a single compound index that contains all fields required for Top K filtering and sorting.
Unlike a B-tree, this index is not necessarily sorted. This means that the goal isn't to beat a B-tree optimized for one specific query — it's to make all
shapes of Top K queries, including those with text search and scoring, reasonably fast with low variance across different shapes<sup>2</sup>.

Additionally, because the sort keys are not known up front, we must accept that large candidate sets are inevitable. The optimization target becomes making the actual work of scanning and filtering extremely cheap, and to prune aggressively when selecting the Top K to avoid extra work.

## The Basic Data Structures: Inverted Index and Columnar Arrays

ParadeDB's index, like most search indexes, is built on two core structures. The first is an inverted index, which maps each term (i.e. “research”) to a “posting list” — a sorted list of document IDs that contain that term.

```ini
Documents                     Inverted Index
[1]: "us research"            research → [1, 2]
[2]: "ca research"            us       → [1, 3]
[3]: "us database"            ca       → [2]
                              database → [3]
```

The second is a columnar layout, which stores a single field in a contiguous, compact array. Columnar arrays are well known for accelerating analytical queries, but they are also a natural fit for Top K queries because they allow for cheap, cache-friendly lookups.

```ini
Row Store
+------------------------------+
|  id | country | severity     |
| [1] | [US]    | [2]          |
| [2] | [CA]    | [9]          |
| [3] | [US]    | [1]          |
+------------------------------+

Column Store
+------------------------------+
| id:       [1][2][3]          |
| country:  [US][CA][US]       |
| severity: [2][9][1]          |
+------------------------------+
```

In ParadeDB, these structures are provided by [Tantivy](https://github.com/quickwit-oss/tantivy), the Rust-based search library inspired by Lucene.

## A Compound Index Eliminates Expensive Row Lookups

In Postgres, every indexed row contains a pointer to where the row lives in the table. If a Top K filter cannot be fully answered by the index, Postgres must follow that pointer and materialize the entire row from the underlying table storage. This operation is expensive and prone to cache misses when performed for millions of candidates.

ParadeDB solves this by storing all searchable, filterable, and sortable fields inside a single index. Every indexed row is assigned a compact internal `u32` identifier called a document ID, and every data structure in the index refers to that same ID.

This document ID design is very efficient for boolean queries (i.e. multiple `WHERE` clauses). For example, consider the boolean `WHERE country = 'United States' AND severity < 3`. The text condition produces a stream of document IDs from the inverted index, while the range filter becomes a direct lookup into a columnar array at that same ID. Evaluating the `AND` condition reduces to intersecting `u32` streams without materializing any intermediate rows.

## Columnar Arrays Make Filters Cheap

Because candidates produced by the text search index may not be in contiguous order, columnar arrays must have true O(1) random access. In Tantivy’s columnar format, this is achieved by setting the row ID of a columnar value to its ordinal (i.e. its position in the array). So accessing a value from a column to evaluate a filter is simply:

```rust
value = column[row_id]
```

Additionally, columns are annotated with minimum and maximum value metadata. This allows range filters like `severity < 3` to skip entire columns that cannot satisfy the filter. For columns that do overlap the range, values are processed in batches instead of individually. By applying comparisons to vectors of values instead of scalars, the engine can use SIMD instructions to evaluate many values in a single CPU operation.

## Block WAND Enables Early Pruning

```ini
Current score threshold: 8.5

+-------------------+   max_score=12 → Evaluate
| Block A           |
| docs: [1 2 3 4]   |
+-------------------+

+-------------------+   max_score=7 → Skip
| Block B           |
| docs: [5 6 7 8]   |
+-------------------+
```

For queries that are ordered by a relevance score (i.e. a BM25 score<sup>3</sup>), Tantivy goes one step further with an optimization called Block WAND. Conceptually, this means skipping the evaluation of entire chunks of documents unless they have a mathematical possibility of entering the Top K.

Block WAND works by maintaining an upper bound on how much score any document within a block of document IDs could possibly achieve. As the engine fills its Top K heap, it establishes a threshold: the lowest score currently in the heap. Before evaluating a block, the engine checks the block’s maximum score. If that maximum is below the threshold, the entire block is skipped without scoring individual documents.

In ParadeDB, the analogous query to the relevance-sorted query above would be:

```sql
SELECT *, pdb.score(id) FROM benchmark_logs
WHERE severity < 3 AND message ||| 'research team'
ORDER BY pdb.score(id) DESC
LIMIT 10;
```

  `|||` is ParadeDB's [match
  disjunction](https://docs.paradedb.com/documentation/full-text/match)
  operator.

This query now drops to a very reasonable 300ms<sup>4</sup>.

## A Recent Improvement to ParadeDB's Top K Performance

In `0.21.0` Top K performance improved by up to 30% for certain benchmarked Top K queries. For instance, the following Top K query dropped from 90ms to 70ms over a 100M row dataset:

```sql
SELECT * FROM benchmark_logs
WHERE message === 'research' AND country === 'Canada'
ORDER BY severity, timestamp LIMIT 10;
```

  `===` is ParadeDB's [term
  search](https://docs.paradedb.com/documentation/full-text/term) operator.

This improvement was thanks to an upstream change in [Tantivy](https://github.com/quickwit-oss/tantivy), which changes how doc ID iterators are advanced when executing a boolean query.

Previously, in order to determine if a doc ID was present in all clauses of a boolean `AND`, Tantivy had to advance all iterators to the _next_ matching document using `seek` . For instance, consider two doc ID iterators:

```ini
Iterator A (country = 'United States'): [100, 101, 102, 103, ...]
Iterator B (country = 'Canada'): [50, 10000, ...]
```

Let’s assume iterator `B` is currently positioned on `50`, and iterator `A` is positioned on `100`. In order to determine if `100` is in `B`, we have to advance `B` to `10000` . This can be an expensive operation because advancing to the next match may require scanning several blocks containing values between `50` and `10000`.

With this change, Tantivy can instead perform a cheaper membership check for a specific doc ID without actually advancing the iterator.

## Wrapping Up

Postgres' Top K approach is a bit like all or nothing — the happy path is nearly instant, but the worst case can take seconds or even minutes.
ParadeDB, on the other hand, is designed to make any Top K query reasonably fast as long as all filters and sort keys are present in the index.

Moving forward, we still see meaningful headroom to optimize our Top K performance by pruning work earlier in the execution pipeline. One direction is index partitioning and segment-level ordering, where data is physically grouped or sorted by commonly queried dimensions (such as time ranges or coarse score buckets). With this layout, entire segments whose maximum possible score or ordering value cannot beat the current Top K threshold can be skipped.

We're also currently working on optimizing Top K joins — i.e. "search and filter over multiple tables, join the results, and return the Top K."
If this kind of work excites you, [we’re hiring](https://paradedb.notion.site/?pvs=143) for engineers who enjoy solving these kinds of database internals problems.

And please don’t hesitate to [give our open-source project](https://github.com/paradedb/paradedb) a star!

<hr />

1. The `EXPLAIN ANALYZE` output of the Top K text search query with GIN:

```sql
                                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=22944.38..22944.41 rows=10 width=289) (actual time=33747.662..33747.667 rows=10.00 loops=1)
   Buffers: shared hit=11851 read=5890349
   ->  Sort  (cost=22944.38..22946.87 rows=995 width=289) (actual time=33747.660..33747.664 rows=10.00 loops=1)
         Sort Key: (ts_rank(benchmark_logs.message_fts, '''research'' & ''team'''::tsquery)) DESC
         Sort Method: top-N heapsort  Memory: 35kB
         Buffers: shared hit=11851 read=5890349
         ->  Bitmap Heap Scan on benchmark_logs  (cost=18964.29..22922.88 rows=995 width=289) (actual time=4791.339..25655.017 rows=10000000.00 loops=1)
               Recheck Cond: ((message_fts @@ '''research'' & ''team'''::tsquery) AND (severity < 3))
               Heap Blocks: exact=5882353
               Buffers: shared hit=11851 read=5890349
               ->  Bitmap Index Scan on benchmark_logs_message_fts_idx  (cost=0.00..18964.04 rows=995 width=0) (actual time=3669.878..3669.881 rows=10000000.00 loops=1)
                     Index Cond: (message_fts @@ '''research'' & ''team'''::tsquery)
                     Index Searches: 1
                     Buffers: shared hit=11851 read=7996
 Planning:
   Buffers: shared read=2
 Planning Time: 1.342 ms
 Execution Time: 33813.810 ms
(18 rows)
```

2. See our [benchmarks](https://github.com/paradedb/paradedb/tree/main/benchmarks) of 100M rows, published [on our public benchmarks](https://paradedb.github.io/paradedb/benchmarks/). See the `logs (100M rows)` results.
3. ParadeDB uses BM25 instead of TF-IDF for relevance scoring. BM25 is generally considered to be a more accurate relevance score.
4. The `EXPLAIN ANALYZE` output of the Top K text search query with ParadeDB:

```sql
    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1010.12..1011.35 rows=10 width=257) (actual time=297.430..298.912 rows=10.00 loops=1)
   Buffers: shared hit=20920 read=13732
   ->  Gather Merge  (cost=1010.12..4877326.97 rows=39806668 width=257) (actual time=297.428..298.910 rows=10.00 loops=1)
         Workers Planned: 7
         Workers Launched: 7
         Buffers: shared hit=20920 read=13732
         ->  Parallel Custom Scan (ParadeDB Scan) on benchmark_logs  (cost=10.00..10.02 rows=1 width=257) (actual time=249.554..249.605 rows=10.00 loops=8)
               Table: benchmark_logs
               Index: benchmark_logs_idx
               Segment Count: 8
               Heap Fetches: 10
               Exec Method: TopNScanExecState
               Scores: true
                  TopN Order By: pdb.score() desc
                  TopN Limit: 10
                  Queries: 8
               Tantivy Query: {"boolean":{"must":[{"range":{"field":"severity","lower_bound":null,"upper_bound":{"excluded":3},"is_datetime":false}},{"with_index":{"query":{"match":{"field":"message","value":"research team","tokenizer":null,"distance":null,"transposition_cost_one":null,"prefix":null,"conjunction_mode":false}}}}]}}
               Buffers: shared hit=20856 read=13732
 Planning:
   Buffers: shared hit=576 read=7
 Planning Time: 5.927 ms
 Execution Time: 299.393 ms
(22 rows)
```

---

# ParadeDB 0.20.0: Simpler and Faster

> Introducing search aggregation, V2 API as default, and performance improvements that eliminate the complexity between search and analytics in a single Postgres-native system.

Published: 2025-11-26 · Author: Philippe Noël

Source: https://www.paradedb.com/blog/paradedb-0-20-0

---

Every developer building modern applications faces the same dilemma: your users want real-time search and analytics, but your database architecture forces you to choose between consistency and performance. The story often begins with "just a simple search box," but inevitably evolves into a complex distributed architecture with eventual consistency, operational overhead, and the constant fear that your search results aren't reflecting reality.

When Ming and I started ParadeDB, we had a simple vision: developers shouldn't have to choose between the safety of ACID transactions and the power of modern search. Today, with ParadeDB `0.20.0` we are making that reality more performant and easier to use.

ParadeDB `0.20.0` delivers on three fronts that matter for real applications: search aggregations that eliminate the analytics gap, a cleaner API that reduces cognitive overhead, and write performance that scales with your application instead of against it.

## Search Aggregations: Analytics Without the Architecture

Modern applications often require more than just search: they need [facets](/learn/search-concepts/faceting), counts, histograms, and other analytics on their search results. While dedicated search engines excel at aggregations, when you're building on a transactional database you often need to do a bit of gymnastics with the query planner to get things working without visiting your indexes twice.

We have had fast aggregations on top of search queries for a while now, but it's not something we have talked about a lot. With `0.20.0` we have overhauled this feature, making [search faceting](/learn/search-concepts/faceting) (calculating counts for groups of records) a first-class citizen.

Our search analytics are powered by a new [`pdb.agg()`](https://docs.paradedb.com/documentation/aggregates/overview) function can be used in two ways:

- as an aggregate function, running a search query and providing aggregate results
- as a window function, running a search query and providing aggregate results **alongside** the Top K search result

Both variants take control of parts of query planning and execution, pushing as much work as possible down into the ParadeDB index. This removes the need for multiple queries, CTEs, or manual aggregations.

Our aggregations excel for large search result sets (millions of records), with search faceting performing **an order of magnitude faster** than the next best method (a ParadeDB search returning all results, feeding into a PostgreSQL aggregate) over more than 40 million results (full details to come in a search aggregations post).

`pdb.agg()` can also be used with a [JSON argument](https://docs.paradedb.com/documentation/aggregates/overview) that closely mirrors the [Elasticsearch aggregations API](https://www.elastic.co/docs/explore-analyze/query-filter/aggregations), and in many cases is mapped to plain SQL constructs (like `COUNT(*)`).

Let's have a look at some examples.

The most basic call doesn't use `pdb.agg()` at all, it just uses a `COUNT(*)` which is caught and planned using a search aggregation.

```sql
SELECT count(*)
FROM mock_items
WHERE description ||| 'shoes';

 count
-------
     3
(1 row)
```

This can also be done with a window function, which adds an extra column containing the count for the **whole result set** (not just the returned rows).

```sql
SELECT id, description,
       count(*) OVER (),
FROM mock_items
WHERE description ||| 'shoes'
ORDER BY rating DESC
LIMIT 2;

 id |     description     | count
----+---------------------+-------
  3 | Sleek running shoes |     3
  5 | Generic shoes       |     3
(2 rows)
```

A more complex example uses the JSON API to express a terms facet over a Top K query, which is then also added as an extra JSON column (this data is duplicated per row, we may optimize this in the future). This is useful when you want to display the Top K search results, but also show the distribution of the whole search query, doing this in one efficient index pass is one of the niceties of the Elasticsearch API that we loved and made possible in SQL.

```sql
SELECT id, description,
       pdb.agg('{"terms": {"field": "rating"}}') AS facets
FROM mock_items
WHERE description ||| 'shoes'
ORDER BY rating DESC
LIMIT 2;

 id |     description     |                   facets
----+---------------------+---------------------------------------------
  3 | Sleek running shoes | {"buckets": [{"key": 4, "doc_count": 1},
                                         {"key": 5, "doc_count": 1},
                                         {"key": 3, "doc_count": 1} ]}
  3 | Generic shoes       | {"buckets": [{"key": 4, "doc_count": 1},
                                         {"key": 5, "doc_count": 1},
                                         {"key": 3, "doc_count": 1} ]}
(2 rows)
```

Behind the scenes, these queries use our [BM25 indexes](/learn/search-concepts/bm25) for both [full-text search](/learn/search-concepts/full-text-search) and columnar analytics. You get sub-second facet calculations across millions of documents, all while maintaining perfect consistency with your transactional data.

## v2 API: Less Magic, More Clarity

None of these aggregation capabilities would matter if they were difficult to use, so alongside our performance work, we've also promoted our [V2 API](https://docs.paradedb.com/documentation/indexing/create-index) to the default experience, and the difference is immediately apparent with a much more intuitive SQL UX.

One of the most significant advantages of the v2 API is that it eliminates the schema duplication that is present in most other search systems. With Elasticsearch, you must define your data structure twice: once in your application database and again in your Elasticsearch mapping:

```json
// Elasticsearch mapping - separate from your database schema
{
  "mappings": {
    "properties": {
      "id": { "type": "integer" },
      "description": { "type": "text", "analyzer": "english" },
      "category": { "type": "keyword" }
    }
  }
}
```

Even our own v1 API required separate JSON configuration for the indexes. The v2 API leans into our definition of 'search database' and infers most of your index schema from your PostgreSQL table schema. You define text processing and search behavior directly in the index creation statement, eliminating the need to maintain parallel schema definitions.

We've had many users see unexpected behaviors in their first few queries due to incorrectly configured tokenizers. In v0.20.0, we've revamped our tokenizer defaults and made them a first-party experience when creating indexes, removing any ambiguity. Index creation is now cleaner, specifying the tokenizers and options inline:

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (id,
           title,
           (description::pdb.simple('stemmer=english',
                                    'stopwords_language=english')),
           category::pdb.literal)
WITH (key_field='id');
```

And search has improved too, with new conjunction (`&&&`) and disjunction (`|||`) operators, and simple ways of accessing common features like boosting.

```sql
SELECT id, pdb.score(id), description, category
FROM mock_items
WHERE description ||| 'shoes'::pdb.boost(2) OR category ||| 'footwear'
ORDER BY score DESC
LIMIT 5;

 id |   score   |     description      | category
----+-----------+----------------------+----------
  5 |  7.660918 | Generic shoes        | Footwear
  3 |  6.872712 | Sleek running shoes  | Footwear
  4 |  6.872712 | White jogging shoes  | Footwear
 13 | 1.9362776 | Sturdy hiking boots  | Footwear
 23 | 1.9362776 | Comfortable slippers | Footwear
(5 rows)
```

Keep an eye out for another post showing off all the new v2 API features. For now, you can explore the full API reference in our [documentation](https://docs.paradedb.com/documentation/full-text/overview).

## Write Performance: Updates That Scale With You

Our performance story from `0.19.0` to `0.20.0` reveals a truth about how we think search should work in transactional systems. Traditional search engines treat updates as expensive operations that require careful batching and scheduled maintenance windows. At first glance this might seem reasonable, because search indexes are complex data structures that need time to reorganize.

The underlying issue stems from how search engines like Lucene and Tantivy handle data. These systems use immutable data structures for their segments, which means updating a single field in a document necessitates rewriting all other fields in that segment, creating write amplification.

But that's exactly backwards for real applications. Most search workloads aren't bulk imports; they're streams of individual updates as users create content, modify records, and interact with your system. ParadeDB treats these small, frequent updates as first-class citizens that should perform as well as inserts.

Our solution is based on two key optimizations from `0.19.0`, now enabled by default in our [LSM engine](/blog/lsm-trees-in-postgres): mutable segments and background merging.

- **Mutable segments** introduce a small buffer optimized specifically for writes. Instead of immediately writing changes to immutable segments, we store a pointer to the new row in this buffer, eliminating the write amplification problem entirely. Once the buffer becomes full, background merging serializes it to an immutable structure without blocking ongoing write operations.
- **Background merging** moves all merging (including mutable segment merging) to the background, and increases the number of workers which we dedicate to this task.

Since we first introduced these features in `0.19.0` we've increased single row update performance by more than **two orders of magnitude**; the kind of improvement that can change how you think about transactional search.

Again, you can expect a follow up post with all the gory data-structure details.

## Why This Matters

Version `0.20.0` strengthens ParadeDB as an alternative to Elasticsearch by adding fast search aggregations, a streamlined v2 API, and significantly improved write performance directly into PostgreSQL. For teams that value transactional consistency and operational simplicity, we provide a path to modern search capabilities without the complexity of managing separate systems.

Ready to see search aggregations and the new API in action? [Try ParadeDB 0.20.0](https://docs.paradedb.com/documentation/getting-started/install) and experience what it's like when search, analytics, and transactions work together seamlessly.

---

# Postgres as a Recommender Engine

> Build a production-grade personalized search engine entirely within Postgres using BM25 retrieval and vector-based reranking, no external infrastructure required.

Published: 2026-01-21 · Author: Ankit Mittal

Source: https://www.paradedb.com/blog/personalized-search-in-postgresql

---

When I search for "King", I'm looking for _LOTR - The Return of the King_. You might be looking for _The King's Speech_. A generic search engine returns the same results for both of us, likely a mix of Fantasy and Drama, which means it fails both of us.

I first presented a solution to this shortcoming at [PGConf NYC 2023](https://postgresql.us/events/pgconfnyc2023/schedule/session/1412-postgres-as-personalization-engine/). Having [engineered search infrastructure systems at **Instacart**](https://tech.instacart.com/how-instacart-built-a-modern-search-infrastructure-on-postgres-c528fa601d54), I've seen firsthand how complex search systems crumble under their own weight. Similar patterns power search at giants like [Target](https://cloud.google.com/blog/topics/retail/from-query-to-cart-inside-targets-search-bar-overhaul-with-alloydb-ai) and [Alibaba](https://www.alibabacloud.com/blog/learn-how-alibaba-engineers-accommodated-for-face-masks-in-their-face-recognition-algorithms_596040).

_Image: Presenting at PGConf NYC 2023_

Personalization can take a "good enough" search experience to a magical one. Traditionally, building personalized search involved a complex Rube Goldberg machine: syncing data to Elasticsearch, building a separate ML recommendation service, dragging data to Python workers for re-ranking, then stitching it all back together. It works, but the cost is high: architectural complexity, network latency, and data synchronization nightmares.

What if you could build a production-grade personalized search engine _entirely_ within Postgres? No new infrastructure. No network hops. Just SQL.

In this post, we'll build a personalized movie search engine using just Postgres, with ParadeDB installed. We'll implement a "Retrieve and Rerank" pipeline that combines the speed of BM25 full-text search with the intelligence of vector-based personalization, all without the data ever leaving PostgreSQL.

  This post is based on a [functional prototype
  concept](https://github.com/ankitml/paradedb-reranker) I created which
  demonstrates in-database personalization patterns. If you're the type who
  reads the last page of a novel first, feel free to start there.

## The Architecture: Retrieve & Rerank

**The Retrieve and Rerank approach** breaks the problem into two stages, utilizing the strengths of different algorithms:

1. **Retrieval**: We use [**BM25**](https://www.paradedb.com/learn/search-concepts/bm25) (standard full-text search) for lexical search to find the top K candidates (100 here) matching the user's query. This is extremely fast and computationally cheap, filtering millions of rows down to a relevant subset.
2. **Reranking**: We use Cosine Similarity and [semantic / vector search](https://www.paradedb.com/learn/search-concepts/vector-search) to re-rank those 100 candidates ordering them based on the user's personal profile.

Why this approach? Because running vector similarity search on your _entire_ dataset for every query is overkill. For $$n=100$$ items (or even 1000), it would be better to ignore vector indexes, and instead brute force our way to a result.

## The Intuition: Vectors Match Vibes

Before we look at code, it helps to understand _why_ this works.

_Image: Diagram showing the Retrieve and Rerank pipeline: BM25 retrieval finds keyword matches, then vector similarity reranks results based on user preferences_

Traditional search matches **keywords**. If you search for "King", a keyword engine simply looks for the string "King" in titles. It returns _The Return of the King_, _The King's Speech_, and _The Lion King_ with equal confidence, but it has no way to know which one _you_ want.

A **User Vector** represents aggregated taste. It can be built from **explicit signals** (like the 5-star ratings we look at in this post) or **implicit signals** (like clickstream data or watch time).

In our example, if a user dislikes _The King's Speech_ but loves _Star Wars_, we steer their vector _towards_ fantasy and _away_ from drama in the embedding space.

## Implementation

Let's build it. We need two core entities: **Movies** (the content) and **Users** (the personalization context).

### 1. Core Schema

We use standard Postgres tables for movies and users, plus a ratings table for likes and dislikes.

Personalization requires user-specific signals. Here we use explicit ratings: 4-5 stars signals a like, 1-2 stars signals a dislike. You could also use implicit signals like watch time or click behavior.

```sql
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS pg_search;
CREATE EXTENSION IF NOT EXISTS vector;

-- The content we want to search
CREATE TABLE movies (
    movie_id INTEGER PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    year SMALLINT,
    genres TEXT[],
    imdb_id VARCHAR(20),
    tmdb_id INTEGER,
    -- Embedding representing the movie's semantic content
    -- (e.g. from OpenRouter/HuggingFace)
    -- We used all-MiniLM-L12-v2 which has 384 dims
    content_embedding vector(384),
    created_at TIMESTAMP DEFAULT now()
);

-- The user profiles
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    -- A vector representing the user's taste
    embedding vector(384)
);
```

Movies get a ParadeDB BM25 index.

```sql
CREATE INDEX movies_search_idx ON movies
  USING bm25 (movie_id, title, year, imdb_id, tmdb_id, genres)
  WITH (key_field='movie_id');
```

We use pgvector’s vector column to store movie and users embeddings. Movie embeddings can be created by calling OpenRouter or Huggingface APIs. For the demo we used `sentence-transformers/all-minilm-l12-v2` model.

### 2. Generating User Profiles

How do we get a vector that represents a user's taste? In a complex stack, you'd have a background worker processing datastreams in Python. We can do that too, but since all our movie embeddings are already in Postgres, we can use standard SQL aggregation to get user embeddings.

```sql
-- Generate a user's profile vector:
-- Move towards movies they like (>= 4 stars)
-- Move AWAY from movies they dislike (< 3 stars)
UPDATE users SET embedding = (
    SELECT SUM(
        CASE
            WHEN r.rating >= 4.0 THEN m.content_embedding
            ELSE m.content_embedding * -1
        END
    )
    FROM ratings r
    JOIN movies m ON r.movie_id = m.movie_id
    WHERE r.user_id = users.user_id
    AND (r.rating >= 4.0 OR r.rating < 3.0) -- Ignore mediocre ratings
);
```

This is the power of "In-Database AI". You don't need a separate pipeline to keep user profiles up to date. You can update them transactionally as ratings come in.

### 3. The Unified Search and Recommender Engine

Now for the main event. We can execute the entire **Retrieve & Rerank** pipeline in a single SQL query using Common Table Expressions (CTEs).

Instead of a black box, let's break this down into four logical steps: **Retrieve**, **Normalize**, **Personalize**, and **Fuse**.

### Step 1: The Retrieval (Fast)

First, we use BM25 to find the top 100 candidates. This is our "coarse filter" that casts a wide net to find anything relevant to the text query.

```sql
WITH first_pass_retrieval AS (
    SELECT
        movie_id,
        title,
        paradedb.score(movie_id) as bm25_score
    FROM movies
    WHERE title @@@ 'action'
    ORDER BY bm25_score DESC
    LIMIT 100
),
```

### Step 2: The Normalization (Math)

BM25 scores are unbounded (they can be 10.5 or 100.2), while Vector Cosine Similarity is always between -1 and 1. To combine them fairly, we normalize the BM25 scores to a 0-1 range using a standard Min-Max scaler.

```sql
normalization AS (
    SELECT
        *,
        (bm25_score - MIN(bm25_score) OVER()) /
        NULLIF(MAX(bm25_score) OVER() - MIN(bm25_score) OVER(), 0) as normalized_bm25
    FROM first_pass_retrieval
),
```

### Step 3: The Personalization (Smart)

Now we bring in the user context. We calculate the similarity between our _normalized_ candidates and the current user's taste profile.

```sql
personalized_ranking AS (
    SELECT
        n.movie_id,
        n.title,
        n.normalized_bm25,
        -- Cosine similarity: 1 - cosine_distance transforms distance to a 0-1 similarity score
        (1 - (u.embedding <=> m.content_embedding)) as personal_score
    FROM normalization n
    JOIN movies m ON n.movie_id = m.movie_id
    CROSS JOIN users u
    WHERE u.user_id = 12345 -- The Current User
)
```

### Step 4: The Fusion (Final Polish)

Finally, we combine the signals. We take a weighted average: 50% for the text match (relevance) and 50% for the user match (personalization).

```sql
SELECT
    title,
    (0.5 * normalized_bm25 + 0.5 * personal_score) as final_score
FROM personalized_ranking
ORDER BY final_score DESC
LIMIT 10;
```

This composable approach allows you to tune the knobs. Want personalization to matter more? Change the `0.5` weight. Want to filter candidates differently? Change the `first_pass` CTE. It's just SQL.

## Other Recommender Engine Workloads

There are few more workloads that can be supported in a similar way inside Postgres.

- **Collaborative filtering (Users Like Me)**: Find other users who are similar to a user id. This can be directly used or used as a subquery to find items liked by similar users.
- **Real-time Updates via Triggers**: Whenever a user's behavior changes or a new rating arrives, a trigger can update the user's embedding in real time.

## Trade-offs: The Landscape

No architecture is perfect. While the "In-Database" approach offers unbeatable simplicity, it helps to understand where it sits in the broader landscape of personalization strategies.

### 1. In Application Layer Inference (Python/Node.js)

Fetch data and re-rank it in your application code or a dedicated microservice.

- **Pro**: **Maximum Flexibility**. Use arbitrary logic and complex ML models (PyTorch/TensorFlow) that might be hard to express in SQL.
- **Con**: **The "Data Transfer Tax"**. You ship thousands of candidate rows over the network just to discard most of them, adding latency and serialization overhead.

### 2. Dedicated Inference Platforms (e.g. Ray Serve, NVIDIA Triton)

Perform retrieval in a database, then send candidate IDs to a dedicated inference cluster for GPU-accelerated reranking.

- **Pro**: **State-of-the-Art Accuracy**. Run massive deep learning models too heavy for a standard database.
- **Con**: **Infrastructure Sprawl**. You're managing a database _and_ an ML cluster, plus paying latency for every network hop.

### 3. The "Cross-Encoder" Approach (e.g. Cohere Re-ranker)

Instead of comparing pre-computed embeddings (a Bi-Encoder), a Cross-Encoder feeds _both_ query and document into the model simultaneously.

- **Pro**: **Maximum Accuracy**. Cross-encoders understand word interactions between query and document.
- **Con**: **The Latency & Cost Cliff**. You must run this heavy model at query time for every candidate — typically 200-500ms extra latency, plus third-party API costs and privacy concerns.

### The ParadeDB Sweet Spot

The landscape above shows a spectrum: application-layer reranking offers flexibility but adds latency; dedicated inference platforms offer accuracy but add infrastructure; cross-encoders offer precision but add cost and privacy concerns.

The in-database approach hits a sweet spot for 95% of use cases. You get hybrid search and personalization with **zero added infrastructure**. No inference clusters, no reranker API calls, no ETL pipelines. Just Postgres. For the 5% of cases where squeezing out the final drops of relevance is worth the extra latency and vendor cost, a cross-encoder may be the right choice.

The main consideration? **Resource Management**. Because search runs on your database, it shares CPU and RAM with your transactional workload. For most applications, this is a non-issue given Postgres's efficiency. For high-scale deployments, you can run search on a **dedicated read replica** to isolate inference load from your primary writes. ParadeDB supports two approaches:

- **Physical replication**: The BM25 index exists on the primary. Small write cost, but read queries can be routed to specific full replicas.
- **Logical replication**: The BM25 index is isolated to a replica subset only. No transactionality guarantees, but proper workload isolation.

The best approach depends on your infrastructure and requirements.

## Conclusion

By pushing this logic into the database, we achieve **Simplicity**.

There is no "synchronization lag" where a user rates a movie but their recommendations don't update for an hour. There is no fragile ETL pipeline to debug when recommendations look weird. There is no network latency adding 50ms to every search request while data travels to a ranker.

This pattern isn't just a PostgreSQL trick; it's a fundamental optimization principle called **Compute Pushdown**. You see it everywhere in high-performance computing:

- **Big Data**: Modern data warehouses like BigQuery and Snowflake push filtering logic down to the storage layer to avoid scanning petabytes of data.
- **Edge Computing**: IoT devices process sensor data locally (at the "edge") rather than sending every raw byte to the cloud, saving massive bandwidth.

ParadeDB applies this same principle to search: **Move the compute to the data, not the data to the compute.** Treat personalization as a database query rather than an application workflow. You simplify the stack, reduce latency, and get a unified engine for both search and recommendations.

Have a recommender workload you'd like to see in Postgres? Write to us in the [ParadeDB Slack community](/slack).

---

# ParadeDB is Officially on Railway

> Deploy ParadeDB on Railway with one click. Full-text search, vector search, and hybrid search over Postgres — now available on your favorite cloud platform.

Published: 2026-04-14 · Author: Ming Ying

Source: https://www.paradedb.com/blog/railway

---

We're excited to announce that [Railway](https://railway.com) now supports ParadeDB as an official integration.
This means one-click deployment of ParadeDB onto Railway, allowing you to go from zero to a running ParadeDB instance in under a minute.

## Why Railway?

Railway is one of the leading platforms for deploying AI applications, databases, and backend services. Its developer experience is exceptional: push-to-deploy, automatic environment variables, and a clean dashboard for managing services.

## Deploy ParadeDB in One Click

Click the button below to deploy ParadeDB on Railway. Railway will spin up a Docker container running the latest ParadeDB image with sensible defaults.

[![Deploy on Railway](https://railway.com/button.svg)](https://railway.com/deploy/paradedb?referralCode=l5qxN4&utm_medium=integration&utm_source=button&utm_campaign=paradedb)

Once deployed, Railway automatically configures the following environment variables for you:

| Variable              | Default                   |
| --------------------- | ------------------------- |
| `POSTGRES_USER`       | `postgres`                |
| `POSTGRES_PASSWORD`   | Auto-generated            |
| `POSTGRES_DB`         | `paradedb`                |
| `PGPORT`              | `5432`                    |
| `DATABASE_URL`        | Private connection string |
| `DATABASE_PUBLIC_URL` | Public connection string  |

## Connect to Your Instance

After deployment, navigate to the **Variables** tab in your Railway service dashboard to find your connection strings.

To connect from another Railway service on the same project, use the private connection string:

```bash
psql $DATABASE_URL
```

To connect from your local machine, use the public connection string:

```bash
psql $DATABASE_PUBLIC_URL
```

## Try a Quick Search Query

Once connected, you can immediately start using ParadeDB's search features. Load the sample table, create a BM25 index, and run a full-text search:

```sql
CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range)
WITH (key_field='id');

SELECT description, pdb.score(id)
FROM mock_items
WHERE description ||| 'running shoes' AND rating > 2
ORDER BY score DESC
LIMIT 5;
```

## What's Next

To learn more about deploying ParadeDB on Railway, visit our [documentation](https://docs.paradedb.com/deploy/cloud-platforms/railway).

---

# We've Rebranded

> Announcing our rebrand and new visual identity for ParadeDB.

Published: 2024-08-28 · Author: Ming Ying

Source: https://www.paradedb.com/blog/rebrand

---

We're excited to unveil our rebranded [landing page](https://www.paradedb.com) and [docs](https://docs.paradedb.com/welcome/introduction). For the past year, the ParadeDB team has
been focused on growing adoption of our open source project among the developer community. The old branding — dark mode with bright emerald as the primary color — reflected that.

Today, we're reaching an inflection point where ParadeDB has been deployed across thousands of production environments on multi-terabyte clusters. Our revamped landing page
and brand aesthetics highlight our commitment to supporting these larger customers.

In parallel, we've been working on a managed version of ParadeDB called Parade BYOC. Built for customers who use managed Postgres providers like Amazon
RDS, Parade BYOC runs as a search replica directly in your organization's AWS/Azure/Google Cloud account. To request early access, we invite you to [speak with our team](https://calendly.com/paradedb).

That's all for now — back to building!

---

# ParadeDB is Officially on Render

> Deploy ParadeDB on Render with one click. Full-text search, vector search, and hybrid search over Postgres — now available on your favorite cloud platform.

Published: 2026-05-12 · Author: Ming Ying

Source: https://www.paradedb.com/blog/render

---

We're excited to announce that [Render](https://render.com) now supports ParadeDB as an official integration.
This means one-click deployment of ParadeDB onto Render, allowing you to go from zero to a running ParadeDB instance in under a minute.

## Why Render?

Render is one of the leading platforms for deploying AI applications, databases, and backend services. Its developer experience is exceptional: push-to-deploy, automatic environment variables, and a clean dashboard for managing services.

## Deploy ParadeDB in One Click

The ParadeDB Render Blueprint deploys ParadeDB Community as a private service with persistent SSD storage. Click the button below to provision it in your account, or use Render's [official ParadeDB template](https://render.com/docs/deploy-paradedb).

[![Deploy to Render](https://render.com/images/deploy-to-render-button.svg)](https://render.com/deploy?repo=https://github.com/paradedb/render-blueprint)

Behind the scenes, the Blueprint will:

- Create a private service named `paradedb` running the official ParadeDB Docker image.
- Attach a 10 GB persistent disk for your database data, mounted at `/var/lib/postgresql`.
- Configure the following environment variables for you:

| Variable            | Default        |
| ------------------- | -------------- |
| `POSTGRES_USER`     | `postgres`     |
| `POSTGRES_PASSWORD` | Auto-generated |
| `POSTGRES_DB`       | `paradedb`     |

## Connect to Your Instance

ParadeDB runs as a private service on Render, which means it is not exposed to the public internet.

To connect from another service in your Render private network, use the service name as the host:

```bash
psql -h paradedb -U postgres -d paradedb
```

To connect from your local machine, SSH into the service and run `psql` from inside the container:

```bash
ssh srv-XXXXXXXXXXXXX@ssh.<region>.render.com
psql -U postgres -d paradedb
```

Replace `srv-XXXXXXXXXXXXX` with your service ID from the Render dashboard, and make sure you've added an SSH key to your Render account first.

## Try a Quick Search Query

Once connected, you can immediately start using ParadeDB's search features. Load the sample table, create a BM25 index, and run a full-text search:

```sql
CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range)
WITH (key_field='id');

SELECT description, pdb.score(id)
FROM mock_items
WHERE description ||| 'running shoes' AND rating > 2
ORDER BY score DESC
LIMIT 5;
```

## What's Next

To learn more about deploying ParadeDB on Render, visit our [documentation](https://docs.paradedb.com/deploy/cloud-platforms/render).

---

# Announcing Our $12M Series A

> Announcing our Series A funding round to accelerate ParadeDB development.

Published: 2025-07-14 · Author: Ming Ying

Source: https://www.paradedb.com/blog/series-a-announcement

---

Read this announcement on
  [TechCrunch](https://techcrunch.com/2025/07/15/paradedb-takes-on-elasticsearch-as-interest-in-postgres-explodes-amid-ai-boom/).

We're excited to announce our $12M Series A fundraising round, led by [Craft Ventures](https://craftventures.com/) and joined by existing investors including [Y Combinator](https://ycombinator.com). This brings our total capital raised to $14M and accelerates our mission to build a modern Elasticsearch alternative on Postgres.

Our Series A comes on the heels of success stories with our early customers. These companies range from Fortune 500s like [Alibaba Cloud](https://alibabacloud.com); fast-growing startups like [Bilt Rewards](https://biltrewards.com), [Modern Treasury](https://moderntreasury.com), and [UnifyGTM](https://unifygtm.com); legal tech giants like [TCDI](https://tcdi.com). While these companies build different products, they share a similar story: they use Postgres as the source of truth, experience high volumes of updates, and require Elastic-quality search and analytics with zero ETL.

## Bringing Elastic-Quality Search and Analytics to Postgres

Postgres is on the rise. As the world’s most-loved relational database, Postgres commands an $80B+ market size that is expected to double in the coming years. This momentum hasn’t gone unnoticed: Neon and Crunchy Data, two Postgres startups, were recently [acquired by Databricks](https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-neon-help-developers-deliver-ai-systems) and Snowflake to deepen their AI platforms.

However, Postgres’ search and analytical capabilities are nascent. On the search front, Postgres lacks first-class support for Elastic-style boolean, fuzzy, phrase, and BM25-scored queries. On the analytics front, Postgres’ storage and query engines were designed for transactional — not analytical — workloads.

As a result, many Postgres users adopt an external search engine like Elasticsearch. While powerful, Elasticsearch is a cumbersome piece of technology that is painful to run, tune, and manage. Syncing Postgres and Elastic requires denormalization and is error-prone for update-heavy workloads. Elastic is not a reliable data store, causing downtime or incorrect results in production, and can become incredibly expensive at scale.

## Our Story So Far

In late 2023, we launched the first version of ParadeDB as an open-source Postgres extension for full text search. Within months, ParadeDB became one of the [fastest-growing Postgres projects](https://github.com/paradedb/paradedb), reaching 7K+ GitHub stars and 100K+ installs.

In 2024, we shipped several critical features that brought ParadeDB closer to a production-grade Elastic replacement. These included a columnar index to support certain flavors of analytical queries and integration with the Postgres write-ahead log (WAL), which delivered crash recovery and high availability.

Around the same time, we began onboarding large enterprise customers. Our first large customer was Alibaba Cloud, who reached out because they needed a text search engine that was compatible with their Postgres-based data warehouse. Shortly after, we onboarded Modern Treasury, the payment processing API. Modern Treasury’s engineers had previously spent years wrangling Elastic clusters and understood the pain of syncing Postgres with Elastic. Their application consumes financial transactions, which require a search engine that’s relational, transaction-safe, and always in sync with their primary Postgres. ParadeDB’s architecture was a perfect fit for their use case.

## What’s Next

For the past two years, we’ve been focused on building world-class text search inside Postgres. That work laid the foundation for a search experience that rivals Elasticsearch. We’ve pushed the boundaries of what’s possible inside Postgres, enabling complex filtered, sorted, and faceted search queries while maintaining Postgres’ transactional consistency and query language.

Now, we’re setting our sights on the next frontier: making Postgres a first-class analytical (OLAP) engine that rivals Elastic’s analytical performance. Our goal is to dramatically accelerate analytical queries written in plain SQL, the kind used to power dashboards, drill-downs, and real-time insights.

This means rethinking everything from Postgres’ file formats, query planner, and execution engine. We’re drawing inspiration from existing OLAP engines like Elastic while staying true to the Postgres ethos.

To tackle this deeply technical challenge, we’ve assembled a world-class team. Engineers on our team have decades of experience in Postgres internals, search, and analytical query engines. They’ve spent their careers writing core infrastructure for the Postgres ecosystem or building database systems at companies like MongoDB, Meta, Google, and Twitter.

Thank you to everyone who’s been part of our journey so far. If you’d like to help shape the journey ahead, [we're hiring](https://paradedb.notion.site/)!

---

# A Conversation with Paul Masurel, Creator of Tantivy

> We sat down with Paul Masurel (creator of Tantivy and co-founder of Quickwit) to talk about building a search engine in Rust, following the competitive happy-path with Lucene, open-source maintenance, and the lessons learned along the way.

Published: 2026-04-02 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/blog/tantivy-interview

---

For most of its history, full-text search has been synonymous with one library: Apache Lucene. Written in Java and battle-tested over two decades, Lucene powers Elasticsearch, Solr, and much of the search infrastructure the industry relies on today. It's the kind of project that makes you think the problem is solved, that there's no room left to rethink the fundamentals.

Then in 2017, a Rust library called [Tantivy](https://github.com/quickwit-oss/tantivy) appeared on GitHub. Built by a single developer who wanted to test his understanding of search engines and his new favorite language, it was small, fast, and unapologetically modular. Within a few years it had grown into the engine behind [Quickwit](https://quickwit.io/), [ParadeDB](https://paradedb.com/), [LNX](https://github.com/lnx-search/lnx), and a growing number of search products, and had sparked a genuine performance collaboration with the Lucene team itself.

That developer is Paul Masurel. After creating Tantivy, Paul co-founded Quickwit in 2020 to build a cloud-native log search engine on top of it. Quickwit was acquired by Datadog in 2024, where Paul now works on search infrastructure at massive scale. We sat down with him to talk about the origins of Tantivy, the philosophy behind its design, what it's like competing (and collaborating) with Lucene, and the lessons he's learned along the way.

## Origins

  You've spent much of your career rethinking how search infrastructure is
  built. How did you first get interested in search systems?

  My history with search engines started at a small French enterprise search
  company called Exalead. I was a front-end engineer at the time. I like to
  think my frustration grew from not being part of the core team. Long-fermented
  frustration is an underrated driver in my opinion.

## Building Tantivy

Paul’s frustration soon led to a job as a backend engineer for a search product at Indeed in Japan, where he worked on the core search engine itself. But the idea of building something from scratch kept pulling at him. The catalyst came on a long-haul flight in 2016. The first version was “a bit silly” and only took a couple of months of spare time, but it was enough to prove the idea had legs.

What made you decide to build a search engine from scratch, and did Rust shape its architecture?

I read the Rust book during a flight from Tokyo to Paris in 2016. Being very familiar  with C++, all of the ideas were very enticing to me. I quickly worked through the [exercism.io](https://exercism.io/) Rust track and then wanted to test the language on a real-life project — something with IO, error management, multithreading. What project should I have picked?

At the time, I was working at Indeed in the search quality team. Our search engine was based on Lucene 2.4. Building a search engine was a perfect way for me to test my understanding of search engines and try out Rust on a real-life project.

Rust did impact the way the code is organized, but I wouldn't say it shaped the architecture. Lucene is really the inspiration there.

Paul didn't set out to build a Lucene replacement. He set out to build something small enough that developers could actually own it. He has described becoming more productive in Rust within two weeks than he had been after five years of C++, and experiencing "a degree of confidence that my code was not buggy, that I had never experienced in any other language".

Architecturally, Tantivy follows Lucene’s model: it is a library, not a server. It handles indexing, compression, and search, but leaves distribution and orchestration to whatever system embeds it.

Were there particular trade-offs or priorities you focused on?

Originally I wanted to keep the project small but modular. Batteries included, but minimal.

My target was companies for which search was so central to their product that they would eventually build their own engine instead of using off-the-shelf solutions like Elasticsearch. Usually such companies would have to use Lucene — and therefore Java. I thought Tantivy could be a good alternative for them.

For this reason, I wanted Tantivy to be small and modular as opposed to featureful. No matter how sophisticated Tantivy might become, these users would rather be able to plug in their own tokenizer and their own query parser.

Today, Tantivy users come in all sizes and shapes. We still sometimes refuse PRs that add niche features in order to keep the library simple, and we prefer PRs that make it possible to implement those features outside of Tantivy. Generally speaking though, this principle isn't as strong as it used to be.

## The Benchmark Game

It quickly became apparent that Tantivy was not only competitive with other search engines, in many cases it outperformed them. Under the hood, Tantivy uses finite state transducers for its term dictionary, SIMD-accelerated compression for its inverted index, and a memory-mapped I/O layer that keeps resident memory remarkably low. The result is a library that can handle indexes larger than available RAM without breaking a sweat.

Jason Wolfe, Paul's manager at Indeed, created [Search Benchmark, the Game](https://github.com/quickwit-oss/search-benchmark-game), a reproducible benchmark suite that showed Tantivy was often 2x faster than Lucene. Over the years the benchmark sparked a genuine back-and-forth between the two projects. Adrien Grand, a Lucene committer, published "[Why is Tantivy Faster than Lucene?](https://jpountz.github.io/2025/04/12/why-is-Tantivy-faster-than-Lucene.html)" and a [follow-up analysis](https://jpountz.github.io/2025/05/12/analysis-of-Search-Benchmark-the-Game.html), and Lucene has since landed patches that close the gap across [most areas](https://tantivy-search.github.io/bench/).

Can you see this game of cat and mouse between Tantivy and Lucene continuing?

Let me first talk a little bit about that benchmark. We wanted it honest and reproducible, and designed it to help us find out where Tantivy's performance was lacking.

Lucene developers approached it with genuine curiosity. Adrien Grand (at Elastic at the time) and Mike McCandless (Amazon) used it to investigate how to improve Lucene's performance. We kept a channel of communication open for the benefit of both projects. Some patches inspired by Tantivy made it into Lucene. After that, Adrien kept finding new optimizations to improve Lucene's search performance. He shared his progress with us and even left [tickets](https://github.com/quickwit-oss/tantivy/issues?q=is%3Aissue%20state%3Aopen%20author%3Ajpountz) in Tantivy's GitHub. We're trailing behind on implementing them. Today, Lucene outperforms Tantivy in many places. Overall, both projects benefited from the collaboration.

Now I'll hijack this question for a second and jump on my soapbox. Some people get the wrong idea about the nature of competition in software. Of course, Quickwit and Tantivy are projects competing with Elasticsearch and Lucene. But **competition does not have to translate into a feud** — especially in open source.

Through Tantivy, I've interacted with many companies. A good number of them turn the "fake it until you make it" saying into a culture of pettiness and hypocrisy. I truly appreciate how special Lucene and Elastic are, and how lucky we were to compete with them.

## The Open-Source Life

Paul's respect for Lucene is striking, especially given how tribal the open-source search space can get. But open source comes with its own challenges. As Tantivy's adoption grew, so did the weight of maintaining it.

Maintaining an open-source project at Tantivy's scale isn't easy. What have been the hardest challenges in guiding its growth?

Reviewing contributions — and sometimes simply saying no to them — is, I think, the hardest part. I review a lot of code, both open source and proprietary. It's very difficult to decide whether a PR should be merged, should be improved, or should be dropped. A lot of contributors are excited about the idea of contributing to open source and can submit features that don't solve any actual problem they have, should be kept external to Tantivy, introduce too much complexity, or are just not of sufficient quality.

Every time I review code, I'm torn between:

- The responsibility to not let the project drift into a terrible state
- Wanting to please people
- Quickwit's interests
- The nihilist's thesis: code quality is subjective, after all
- Managing my time and the energy required for context switching

Since Quickwit's acquisition, it's even more difficult for me to find the time and energy to review PRs. Like many developers, I struggle with having to deal with several problems in parallel, and my work at Datadog is already very challenging in that regard.

Despite that tension, the ecosystem around Tantivy has grown into something Paul never designed for. The library-only architecture that he borrowed from Lucene turned out to be its greatest asset: because Tantivy handles indexing and search but stays silent on how to shard, replicate, and coordinate across nodes, downstream projects can wrap it in radically different ways. What started as a library for companies who wanted to build their own search engine is now the foundation for products with very different goals: scale-out log search, typo-tolerant lookup, edge databases, and transactional search inside Postgres.

Tantivy now serves as the foundation for Quickwit, LNX, ParadeDB, Turso, and others. Have you seen contributions or design ideas flow back upstream?

Someone contributed support for [geo search](https://github.com/quickwit-oss/tantivy/pull/2729) seemingly out of nowhere. This is something we had wanted to add for a while. The PR is large but its quality is very impressive. I hope I will eventually find time to get it merged.

ParadeDB has also been actively contributing great PRs and deep ideas lately.

While the story of Tantivy started with you on a plane, many others have been involved over the years. Who else has helped shape your vision?

Fellow Quickwiots first! In particular, Pascal ([@pseitz](https://github.com/pseitz)) and Trinity ([@trinity-1686a](https://github.com/trinity-1686a)) probably had a strong impact on Tantivy itself. Outside Quickwit, recently Stu ([@Stuhood](https://github.com/Stuhood)) has been very impactful. Harrison ([@ChillFish8](https://github.com/ChillFish8)) also comes to mind. But also, many other people helped along the way: Evance, Pasha, Shikhar, Dru, etc.

In France, there is a monument called "Tomb of the Unknown Soldier". We should celebrate the large crowd of people who come, contribute with an idea, a nice word, an answer to a question, and disappear. Their flow is really what kept the project alive, especially in the beginning.

If we extend the question to Quickwit, I was sharing the architect cap with [Adrien](https://github.com/guilload) and [Francois](https://github.com/fmassot) was the product owner.

Did Tantivy's adoption ever surprise you? Was there a moment when you realized it had "made it"?

Tantivy's adoption is slow and steady. It's like watching my daughter grow. My parents are surprised whenever they see her, but I don't "see" her grow. I realize it when she uses a word I've never heard her use before, expresses a thought that's new to me, or grows an interest in something new.

For Tantivy, the real signs are exotic bug reports. For instance, someone reported an overflow in the 64-bit nanosecond datetime representation — they were indexing events in science fiction literature.

## Quickwit

A library growing beyond what its creator can fully track is usually the point where the story takes its next turn. For Paul, that turn was Quickwit, a bet that the same technology powering search inside applications could be reimagined to work at infrastructure scale, running directly off object storage.

After building Tantivy, you co-founded Quickwit in 2020. What was the original vision, and what gap did you see in the market?

The original vision was actually very different. There was a real-time, large-scale, search-based analytics tool we were using at Indeed that was incredibly powerful. We want to replicate a similar experience.

As we started building it, we noticed that traditional search engines were perfectly suited to run off S3, so we opportunistically pivoted to building a log search engine on S3.

The pivot is a detail that's easy to gloss over, but it says something important about how Paul works: follow the architecture, not the roadmap. If the underlying technology points somewhere interesting, go there.

How was the transition from maintaining an open-source library to running a startup?

For me, the transition was not as brutal as you might think. In the first two years, Quickwit was as close to a "just code" company as you can get. Everyone was very self-driven, so there was little to no management involved. For marketing, we quickly discovered that writing a few high-quality engineering blog posts was the right approach for us.

The hardest part was dealing with the isolation of remote work and having to accept meetings across different time zones. I was living in Japan, so it was common for me to have meetings at 8 AM or 11 PM. I never got used to either.

Early on, Quickwit was adopted by heavy-hitters like Binance, who built a 100 PB log search service indexing 1.6 PB per day. How did working with large-scale users influence Quickwit's roadmap?

Binance suggested they would be open to signing a contract eventually but never actually signed one. We had already burned our fingers with a similar company and knew we shouldn't prioritize their feature requests until we had a signed contract. We just fixed the scale road bumps they reported — we always welcome bug reports as long as they aren't too specific to an exotic workload.

Mezmo is the company that influenced our roadmap the most. They put a lot of trust in us and signed a contract for us to develop features for their product. The impact was overall positive for our roadmap — it pushed us to implement features we kept postponing because we judged them too hard.

One feature, however — the ingestion API — was implemented too fast and built to fit their specific requirements. We still suffer today from the technical debt we accumulated on that one.

## Datadog

By 2024, Quickwit had proven that Tantivy-based infrastructure could handle production workloads at serious scale. The team was preparing to raise a Series A when Datadog came calling, offering something a funding round couldn't: immediate access to some of the largest observability workloads in the world.

What made Datadog the right home for Quickwit?

  To be honest, this was a very difficult decision. One benefit we expected was
  that our technology could now be pushed to more companies, at even larger
  scales.

Was it difficult to secure a commitment to keep Tantivy and Quickwit open source?

Datadog offered to relicense Quickwit under Apache. This allowed four (and probably more) companies to build their products around Quickwit.

That said, the product we're building at Datadog is not open source. We push all improvements to Quickwit and maintain a private fork with the Datadog-specific code. Apart from that, we cannot afford to spend much time dealing with support or contributions that aren't aligned with our product's agenda.

It's a pragmatic arrangement, and a generous one by acquisition standards. The open-source projects stay open, the proprietary product stays proprietary, and the line between them is clean. What's more interesting is how working at Datadog's scale has reshaped Paul's thinking about search itself.

Working on search inside a much larger platform now, has that changed your perspective? Has any of that work fed back into Tantivy?

I'm more convinced than ever that the wall between columnar databases and search engines is entirely artificial.

We've pushed several massive optimizations into Tantivy, and improvements to Quickwit's stability have been motivated by workloads observed on Datadog's customers. Generally speaking, whenever possible, we push all of our changes back to the Tantivy and Quickwit open-source projects.

## Advice for Developers

Paul's arc, from frustrated front-end engineer to the creator of search infrastructure used by companies worldwide, is in many ways the story of someone who looked at a "solved" problem and decided it wasn't. We closed by asking what he'd tell developers who want to do the same.

If I had looked at the lexical search and BM25 space in 2016, I would have said it was solved, and that catching up would be nearly impossible. You proved otherwise. What advice would you give to developers who are eyeing "solved" problem spaces with fresh eyes?

Keep an eye on the academic world. The new ideas often come from there, and they won't make it into the industry without our help — and our sweat.

Keep refining your mental models about how systems work. Software is a collection of abstraction matryoshka dolls. Identify these abstractions and study them. It will make you a better developer: you'll start building beautiful abstractions yourself. But you'll also notice that there's a lot of value to be delivered where abstractions leak.

And keep a critical view of why the industry converged on a given solution. Maybe your problem is singular enough to not match the vanilla solution. Or maybe the industry made choices in the past when hardware and software looked very different from what we have today.

---

# Deep Dive into ParadeDB's v2 API: The Future of SQL Search

> Explore ParadeDB's v2 API that eliminates schema duplication, simplifies tokenization, and provides transparent search operators for intuitive SQL-based full-text search.

Published: 2025-12-04 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/blog/v2api

---

When you build a search engine inside PostgreSQL, you start with the fundamentals: storage integration, transaction safety, write-ahead logging, crash recovery, and index consistency. None of it is glamorous, but all of it is required. If the engine isn't solid, everything above it is wasted effort. That's where we've focused for the last year.

Our API, meanwhile, evolved as the engine took shape. It worked, but it wasn't the API we wanted to give developers long-term.

With the core now stable and fast, we can finally fix the surface.

The v2 API gives ParadeDB a clearer, more predictable model for defining indexes and writing search queries. Field definitions, tokenization, and ranking live in SQL. We have different operators to do different things. Aggregations stay familiar thanks to a light touch of Elasticsearch’s DSL where it helps.

In this post, we’ll walk through what’s new, and how the v2 API brings ParadeDB’s developer experience up to the standard of its engine. Before we start, here’s a side-by-side comparison of our legacy and v2 APIs for a non-trivial `CREATE INDEX` statement.

_Image: Diagram of an LSM tree showing the segment hierarchy and compaction process_

## Starting with Schema

The foundation of the v2 API is improved schema inference. When you create a search index, ParadeDB examines your existing table structure and automatically configures appropriate search behavior for each column type (rather than maintaining a separate JSON configuration blob). We believe your PostgreSQL schema should be the authoritative source of truth, so we built the system to work with what you already have.

All the code examples in this post use the ParadeDB getting started demo table (which you can easily install by [following the SQL steps here](https://docs.paradedb.com/documentation/getting-started/install)).

```ini
Table "public.mock_items"
        Column         |            Type
-----------------------+-----------------------------
 id                    | integer (Primary Key)
 description           | text
 rating                | integer
 category              | character varying(255)
 in_stock              | boolean
 metadata              | jsonb
 created_at            | timestamp without time zone
 weight_range          | int4range
```

With the v2 API, creating a [BM25 search index](https://docs.paradedb.com/documentation/indexing/create-index) is straightforward:

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, rating, in_stock,
            created_at, metadata, weight_range)
WITH (key_field='id');
```

This single statement creates a comprehensive search index without requiring you to specify how each field should be processed. ParadeDB understands the semantics of different PostgreSQL data types and configures search behavior accordingly, with all text fields becoming available for full-text search and all non-text fields becoming available for use in search aggregates.

  We know that the `WITH (key_field=...)` parameter is annoying, especially as
  it must be the first column in the index definition, and must have a unique
  index configured (or be the primary key). We've got a [PR open to remove
  it](https://github.com/paradedb/paradedb/pull/3643), instead validating it
  automatically.

## Understanding Tokenization

One of the features we're most excited about in the v2 API is making [tokenization](https://docs.paradedb.com/documentation/tokenizers/overview) transparent and testable. We believe [understanding how text gets processed](/blog/when-tokenization-becomes-token) is crucial for building effective search experiences. The v2 API allows you to cast text to a tokenizer, then to a text array (`text[]`) to see exactly how it will be processed before you commit to an index configuration.

```sql
SELECT 'Wireless noise-cancelling headphones'::pdb.unicode_words::text[];

                    unicode_words
------------------------------------------------------
 {wireless,noise,cancelling,headphones}
(1 row)
```

This transparency extends to different tokenization strategies, each optimized for specific use cases. Our default tokenizer is now `unicode_words`, but we have a many other options for slightly different use cases:

- [**Unicode Words**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/unicode): The unicode tokenizer splits text according to word boundaries defined by the [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/) rules.
- [**Literal**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/literal): indexes the text in its raw form, without any splitting or processing (and allows the field to be used for analytics)
- [**Literal Normalized**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/literal-normalized): like the literal tokenizer, but allows for token filters
- [**Whitespace**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/whitespace): splits on whitespace
- [**Ngram**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/ngrams): splits text into small chunks called grams, useful for partial matching
- [**Simple**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/simple): splits on any non-alphanumeric character
- [**Regex Patterns**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/regex): tokenizes text using a regular expression
- [**Chinese Compatible**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/chinese-compatible): a simple tokenizer for Chinese, Japanese, and Korean characters
- [**Lindera**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/lindera): uses prebuilt dictionaries to tokenize Chinese, Japanese, and Korean text
- [**ICU**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/icu): splits text according to the Unicode standard (often with more accuracy for Chinese, Japanese, and Korean text than `unicode_words`)
- [**Jieba**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/jieba): the most advanced Chinese tokenizer that uses both a dictionary and statistical models
- [**Source Code**](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/source-code): tokenizes text that is actually code

You can test out other tokenizers using the same approach:

```sql
-- N-grams for partial matching
SELECT 'iPhone'::pdb.ngram(3,3)::text[];

      ngram
------------------
 {iph,pho,hon,one}
(1 row)

-- Literal for exact matching
SELECT 'iPhone 16'::pdb.literal::text[];

 literal
---------
 {iPhone 16}
(1 row)

-- ICU for international text
SELECT 'Café München'::pdb.icu::text[];

       icu
-------------------
 {café,münchen}
(1 row)
```

Being able to test tokenization strategies in SQL means you can experiment with different approaches and understand their implications before making decisions that affect your production search experience. We think this experimentation capability is essential for building search experiences that work the way you expect.

## Configuring Search Behavior

The v2 API's approach to index configuration combines automation with flexibility. While the system makes intelligent defaults based on your schema, you retain full control over how each field is processed. This balance allows you to start quickly with sensible defaults while providing the customization options needed for sophisticated search applications. Once you've selected your tokenization strategy you can apply these customisations with token filters.

One feature we really like is the ability to tokenize the same column multiple ways:

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (
    id,
    description,                                        -- Standard word tokenization
    (description::pdb.ngram(3,3, 'alias=desc_ngram'))   -- Partial matching
) WITH (key_field='id');
```

This example shows how you can apply different tokenization strategies to the same field for different search patterns. The description field is tokenized both for standard word matching and partial matching (using an alias). This multi-tokenizer approach allows a single field to support multiple search behaviors without data duplication in the base table.

[Token filters](https://docs.paradedb.com/documentation/token-filters/overview) provide another layer of search quality improvement. These filters process tokens after initial tokenization to improve search accuracy and relevance. The v2 API makes these filters configurable through declarative parameters inside the index column list rather than using JSON in an external blob.

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (
    id,
    (description::pdb.simple(
        'stemmer=english',              -- "running" matches "runs"
        'ascii_folding=true',           -- "café" matches "cafe"
        'stopwords_language=english'    -- Filter common words
    )),
    category
) WITH (key_field='id');
```

The stemming filter reduces tokens to their root forms, allowing searches for "running" to match documents containing "runs" or "runner." ASCII folding removes diacritical marks, ensuring that searches work consistently across different character encodings. Stopword filtering removes common tokens that rarely contribute to search relevance. These filters work consistently at both index time and search time, ensuring that query processing matches the indexed data. We have [seven token filters](https://docs.paradedb.com/documentation/token-filters/overview) available today, with some others like synonyms on the way.

## Handling JSON Data

Modern applications increasingly rely on semi-structured data stored in JSON (or JSONB) fields, along with other complex PostgreSQL data types like arrays and ranges. The v2 API handles these data types naturally, automatically indexing nested structures and making them searchable without requiring manual schema mapping.

```sql
-- All JSON fields are automatically indexed
CREATE INDEX search_idx ON mock_items
USING bm25 (id, metadata) WITH (key_field='id');

-- Query nested properties directly
SELECT description FROM mock_items
WHERE metadata->'color' ||| 'blue'
```

When you index a JSON field, ParadeDB automatically analyzes the nested structure and creates searchable fields for each property. This automatic indexing extends to arbitrarily nested JSON structures, allowing you to search deep into complex documents without manual configuration. The system handles type detection for JSON values (including values in arrays), applying appropriate tokenization to text properties while making numeric and boolean values filterable.

You can also index only specific parts of a JSON schema:

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (
    id,
    description,
    ((metadata->'color')::pdb.literal('alias=json_color'))
) WITH (key_field='id');
```

We designed this automatic handling because we believe complex data types should just work without manual configuration (we handle ranges and arrays as well). With the v2 API, adding new properties to your JSON documents immediately makes them searchable without index rebuilds or configuration changes.

## Search Operators

The v2 API introduces a set of [SQL operators](https://docs.paradedb.com/documentation/full-text/overview) that make search query intent immediately clear. Instead of learning JSON query syntax or domain-specific languages, you use operators that communicate the type of matching you want to perform.

```sql
-- Match disjunction: match any of these words (OR behavior)
SELECT description FROM mock_items
WHERE description ||| 'wireless bluetooth';

-- Match conjunction: match all of these words (AND behavior)
SELECT description FROM mock_items
WHERE description &&& 'wireless bluetooth headphones';

-- Exact phrase matching, respects order and position of tokens
SELECT description FROM mock_items
WHERE description ### 'wireless bluetooth';

-- Exact token matching for categorical fields
SELECT description FROM mock_items
WHERE category === 'Electronics';

-- Complex query expressions (proximity, regex, parsing, etc.)
SELECT description FROM mock_items
WHERE description @@@ ('wireless' ## 2 ## 'headphones');
```

Each operator has a specific semantic meaning:

- `|||`: performs disjunctive matching, finding documents that contain any of the query terms
- `&&&`: requires all terms to be present (conjunction)
- `###`: enforces both term presence and positional requirements, perfect for phrase matching
- `===`: performs exact token matching, ideal for categorical fields or identifier lookups
- `@@@`: evaluates complex query expressions like proximity searches, regex patterns, and parsed queries

## Proximity Search

[Proximity search](https://docs.paradedb.com/documentation/full-text/proximity) is about finding terms that occur near each other, even when they aren't adjacent. This matters in domains like legal or compliance text, where meaning depends on concepts appearing together. For example, contract and obligation appearing within a few words of each other is very different from those words appearing in unrelated parts of a long document.

A simple OR search (`contract ||| obligation`) is too broad here — it returns documents where the terms appear anywhere, not necessarily together.

The example below uses sleek shoes, which is a lighter example than legal text but illustrates the mechanics:

```sql
-- "sleek" within 2 words of "shoes"
SELECT description FROM mock_items
WHERE description @@@ ('sleek' ## 2 ## 'shoes');

-- Require "sleek" before "shoes"
SELECT description FROM mock_items
WHERE description @@@ ('sleek' ##> 1 ##> 'shoes');
```

Proximity uses the `@@@` operator with one of two proximity operators to indicate distance and order:

- `##` — matches terms within a given token distance, in any order.
- `##>` — matches terms within a given distance and enforces order (left term must appear first).
- The number between the two represents the token distance to search over.

Proximity gives you a more meaningful match than OR (`|||`) but also a more flexible than an exact phrase (`###`). It's especially useful in legal text, long-form descriptions, and any content where related ideas span multiple words.

## Fuzzy Search

[Fuzzy search](https://docs.paradedb.com/documentation/full-text/fuzzy) handles the reality that users make typos. Rather than returning no results for misspelled queries, fuzzy matching finds terms that are similar to what the user typed. This is essential for real-world search applications where perfect spelling can't be assumed.

Fuzzy search works by measuring edit distance — the number of single-character operations needed to transform one word into another. These operations include insertions ("shoe" → "shoes"), deletions ("running" → "running"), and transpositions ("shose" → "shoes").

```sql
-- Basic fuzzy matching with edit distance of 1
SELECT description FROM mock_items
WHERE description ||| 'sheos'::pdb.fuzzy(1);  -- Finds "shoes"

-- Handle multiple typos with higher edit distance
SELECT description FROM mock_items
WHERE description ||| 'runing shose'::pdb.fuzzy(2);  -- Finds "running shoes"

-- Fuzzy matching works with conjunction too
SELECT description FROM mock_items
WHERE description &&& 'wireles bluetoth'::pdb.fuzzy(2);  -- Finds "wireless bluetooth"
```

Fuzzy search transforms failed searches into successful ones. A user searching for "headfones" with `fuzzy(2)` will find documents containing "headphones," even though the exact spelling doesn't match. This capability dramatically improves search usability without requiring users to learn specialized syntax or retry their searches with different spellings.

## Relevance and Scoring

Search without relevance ranking is simply filtered data. The v2 API provides comprehensive relevance [scoring capabilities](https://docs.paradedb.com/documentation/sorting/score) that allow you to understand and control how search results are ranked. This transparency is crucial for building search applications where result ordering significantly impacts user experience.

ParadeDB uses the industry standard [BM25 algorithm](/learn/search-concepts/bm25) for relevance scoring, which [outperforms the Postgres built-in `tsvector`](/learn/search-in-postgresql/bm25) method. BM25 considers both term frequency within documents and term frequency across the entire corpus, ensuring that common terms have less impact on relevance than rare, distinctive terms.

```sql
-- View relevance scores
SELECT description, pdb.score(id)
FROM mock_items
WHERE description ||| 'wireless headphones'
ORDER BY pdb.score(id) DESC
LIMIT 5;
```

The `pdb.score()` function (which takes the index `key_field` as its first argument) lets you see the actual relevance scores that determine result ordering. This visibility allows you to understand why results are ranked in a particular order and debug relevance issues when building search experiences.

For more sophisticated relevance tuning, the v2 API supports query-time boosting:

```sql
SELECT description, pdb.score(id) FROM mock_items
WHERE description ||| 'shoes'::pdb.boost(2.0) OR category ||| 'footwear'
ORDER BY pdb.score(id) DESC;
```

Boosting allows you to emphasize certain terms or fields in relevance calculations. In this example, matches in the description field receive twice the weight of matches in the category field. This capability enables you to implement business logic directly in your search queries, prioritizing results based on factors beyond simple text matching.

## Search Highlighting

Building effective search interfaces often requires showing users which parts of documents match their queries. [Highlighting](https://docs.paradedb.com/documentation/full-text/highlight) in ParadeDB is as simple as including the `pdb.snippet()` function call:

```sql
-- Basic highlighting
SELECT pdb.snippet(description) FROM mock_items
WHERE description ||| 'wireless bluetooth'
LIMIT 5;

                    snippet
-----------------------------------------------
 "Compact <b>wireless</b> <b>bluetooth</b> speaker"
(1 row)

-- Custom highlighting tags
SELECT pdb.snippets(description,
    start_tag => '<mark>',
    end_tag => '</mark>',
    max_num_chars => 100
) FROM mock_items
WHERE description ||| 'running'
LIMIT 5;

                         snippets
----------------------------------------------------------
 "Sleek <mark>running</mark> shoes with premium materials"
(1 row)
```

The highlighting system generates contextual snippets that show matching terms within their surrounding text. This approach provides users with enough context to understand why a particular document matched their query. The ability to customize highlighting tags allows you to integrate highlighting with your application's styling requirements.

The `pdb.snippets` function provides additional control over snippet generation, allowing you to specify the maximum number of characters, custom highlighting tags, and other parameters that affect snippet quality and appearance.

## Search Analytics and Faceting

Modern search applications require more than just returning matching documents. Users expect [faceted navigation](/learn/search-concepts/faceting), result counts, statistical summaries, and other analytical insights that help them understand and refine their searches. The v2 API integrates analytics capabilities directly into the search infrastructure, allowing you to compute these insights efficiently in a single index pass.

The [`pdb.agg()`](https://docs.paradedb.com/documentation/aggregates/overview) function can be used in two ways:

- As an aggregate function, running a search query and providing aggregate results
- As a window function, running a search query and providing aggregate results **alongside** the Top K search results

Both variants push as much work as possible down into the ParadeDB index, eliminating the need for multiple queries, CTEs, or manual aggregations.

```sql
SELECT count(*)
FROM mock_items
WHERE description ||| 'shoes';

 count
-------
     3
(1 row)
```

Window functions add analytical data alongside search results:

```sql
SELECT id, description,
       count(*) OVER (),
FROM mock_items
WHERE description ||| 'shoes'
ORDER BY rating DESC
LIMIT 2;

 id |     description     | count
----+---------------------+-------
  3 | Sleek running shoes |     3
  5 | Generic shoes       |     3
(2 rows)
```

These examples demonstrate faceted search using window functions — returning Top K search results while providing analytics for the entire result set:

```sql
SELECT id, description,
       pdb.agg('{"terms": {"field": "rating"}}') AS facets
FROM mock_items
WHERE description ||| 'shoes'
ORDER BY rating DESC
LIMIT 2;

 id |     description     |                   facets
----+---------------------+---------------------------------------------
  3 | Sleek running shoes | {"buckets": [{"key": 4, "doc_count": 1},
                                         {"key": 5, "doc_count": 1},
                                         {"key": 3, "doc_count": 1} ]}
  3 | Generic shoes       | {"buckets": [{"key": 4, "doc_count": 1},
                                         {"key": 5, "doc_count": 1},
                                         {"key": 3, "doc_count": 1} ]}
(2 rows)
```

Both queries use window functions (`OVER` clauses) to return only the top 2 results while computing analytics over all matching documents. This eliminates the need for separate queries and makes it possible to build rich search interfaces with a single efficient operation using ParadeDB's BM25 indexes.

We'll have a full post coming up on how we built this search aggregations system and its performance characteristics.

## Advanced Query Types

For applications with sophisticated search requirements, the v2 API includes several [advanced query capabilities](https://docs.paradedb.com/documentation/query-builder/overview) that address complex use cases without requiring external systems or custom development. These features integrate seamlessly with the core search functionality, allowing you to build powerful search experiences using familiar SQL syntax.

```sql
-- "More Like This" for recommendations like document with an ID of 3
SELECT id, description, rating, category
FROM mock_items
WHERE id @@@ pdb.more_like_this(3)
ORDER BY id;

-- Regex matching
SELECT description FROM mock_items
WHERE description @@@ pdb.regex('.*shoes.*');

-- Query parsing for complex search strings
SELECT description, rating, category FROM mock_items
WHERE id @@@ pdb.parse('
    description:(running OR jogging) AND
    rating:>3 AND
    category:footwear
', lenient => true);
```

The v2 API provides three advanced capabilities for sophisticated search requirements:

- **[More Like This](https://docs.paradedb.com/documentation/query-builder/specialized/more-like-this)**: finds documents similar to a given document by analyzing term patterns and frequencies, creating representative terms and matching documents that contain those terms
- **[Regex matching](https://docs.paradedb.com/documentation/query-builder/term/regex)**: provides power users with precise control over search patterns, enabling complex text matching that goes beyond simple term searching
- **[Query parsing](https://docs.paradedb.com/documentation/query-builder/compound/query-parser)**: accepts raw user-provided query strings using [Tantivy's query language](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html), supporting field-specific searches, boolean operators, and range queries with optional lenient mode for best-effort parsing

## Trying the v2 API

This post covers the core features of the v2 API, but there's much more to explore. For the complete feature set, including advanced ranking options, custom analyzers, and specialized query types, check out the [full documentation](https://docs.paradedb.com/documentation/full-text/overview).

All the examples in this post use ParadeDB's `mock_items` table, which you can install and try yourself following the [getting started guide](https://docs.paradedb.com/documentation/getting-started/install). Every query shown here works with the demo data, allowing you to explore the v2 API's capabilities hands-on.

The v2 API represents our vision for how search should work in SQL databases: powerful, flexible, and integrated rather than separate and complex. Because ParadeDB runs inside PostgreSQL, you get all the PostgreSQL guarantees you rely on: ACID transactions, backup and recovery, security, and the entire ecosystem of extensions and tools.

[Get started with ParadeDB](https://docs.paradedb.com/documentation/getting-started/install) to see how the v2 API can simplify your search architecture while expanding your search capabilities, and please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).

---

# What We Think About When We Think About Benchmarking

> Publishing a benchmark has two hard problems: telling a story worth reading and getting the mechanics right. Here's how we think about both.

Published: 2026-05-12 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/blog/what-we-think-about-when-we-think-about-benchmarking

---

We're building a search engine inside Postgres, so of course we need to show people how it performs. That means publishing benchmarks: both against ourselves, and against other engines like Elasticsearch.

Benchmarking boils down to two concerns: the workload (what you measure) and the execution (how you measure it). The two shape each other, so where do you start? Do you begin with the workload you want to highlight and build the execution around it? Or do you build the execution first and let the workloads come later? This post is about the path we picked, and the runner we built around it.

The title is a nod to Haruki Murakami's [_What I Talk About When I Talk About Running_](https://en.wikipedia.org/wiki/What_I_Talk_About_When_I_Talk_About_Running), a book about repetition, routine, and what long stretches of disciplined practice reveal over time.

Benchmarking works the same way. You learn what your system actually is by measuring it honestly, repeatedly, in conditions that aren't designed to flatter you.

## Why We Built Our Runner First

The clearest example of the workload-first path is [ClickBench](https://benchmark.clickhouse.com/), a well respected analytical benchmark. One dataset, one query set, dozens of backends, and a public leaderboard. It works because it's _simple_. A single problem, a single procedure, a single result format. Nothing varies but the database itself. The runner inherits the same simplicity: shell-script automation built for one job, with no concurrent users, no ramping load, no custom workloads, just best-of-three single shots.

Being incredibly easy to run, understand, and contribute to is what made ClickBench a [Schelling point](<https://en.wikipedia.org/wiki/Focal_point_(game_theory)>) for analytical workloads. The tradeoff is that the workload and the runner are baked together. Push past that shape and you're building a new runner.

We could have taken the same path for search. Pick a representative workload, call it TextBench, build a runner around it, ship a leaderboard. That’s a viable approach, but it would also bind us to that initial workload. And once we started expanding beyond it, we’d lose the simplicity that made the approach attractive in the first place. We wanted a runner that we could grow with, not one which constrained us.

So we took the less-trodden path. We built the runner first, and kept benchmark workloads as a separate concern. ParadeDB Benchmarker is backend-agnostic, scenario-agnostic, and dataset-agnostic. Instead of fixing the workload and simplifying the runner around it, we worked to simplify the experience of defining new workloads. Because of this, the same runner can carry many benchmarks, including ones that have nothing to do with search and ones our future users haven't imagined yet.

## How We Think About Execution

Building a runner worth reusing means getting the mechanics right for any benchmark you throw at it. Most of the ways benchmarks go wrong are mechanical, and the same problems show up over and over.

### Environment and Resource Isolation

Every backend needs to start from the same place every time: identical resource constraints, the same setup steps, the same starting state. When you compare multiple backends in one run, they shouldn't be competing with each other for resources either.

### Configuration Capture

A poorly-tuned Postgres will lose to a well-tuned Elasticsearch, and vice versa. Configuration is part of the benchmark whether you acknowledge it or not, and a striking number of published benchmarks don't post their full configs. A runner needs to save the full context alongside the results: backend tunables (PostgreSQL GUCs, Elasticsearch JVM settings, versions), setup scripts (`VACUUM ANALYZE`, force merge), and resource allocation. If you can't reproduce a result, you can't trust it.

### Scenario Variety

Different questions need different test shapes. "How fast is a single query?" is a different scenario from "what happens at 1,000 QPS?", which is different again from "what does latency look like when reads and writes hit the system at the same time?" A runner should support all of these: constant throughput, ramping load, concurrent virtual users, mixed read/write workloads, and the ability to compose them.

### Rich Statistics

Tail latency over many runs is usually where systems actually fail. An average or minimum response time hides almost everything important. You need full latency distributions (P50, P95, P99 at minimum), throughput over time, and error rates, not point-in-time summaries.

### Application-Shaped Queries

A benchmark should send queries the way a real application would: over the network, using the same client libraries production code uses. CLI tools and developer interfaces (psql, mongosh, the Elasticsearch dev console) often take shortcuts that production traffic doesn't have access to: different connection pooling, different transports, optimizations for interactive use. If you benchmark through them, you're measuring the CLI as much as the database.

### Driver Separation

You're probably not an expert in every system you're benchmarking. We know Postgres well; we don't know the best way to construct an Elasticsearch query as well as someone who has been writing them in production for years. The code that talks to each database should be separated from the benchmark scripts, so experts in each system can audit and improve how their database is being driven without touching the benchmark itself. Each backend should run in its own native syntax, with no abstraction layer quietly translating one system into another’s strengths.

## ParadeDB Benchmarker

Once we decided the runner mattered more than any single benchmark, we needed infrastructure that could survive changing workloads, changing databases, and changing questions. The result was [ParadeDB Benchmarker](https://github.com/paradedb/benchmarker). It's open source, and built as a layer on top of [k6](https://k6.io/), Grafana's load testing framework (which is usually known for front-end testing).

k6 is an incredible benchmarking tool, and already solves most of the hard execution-engine problems: virtual user scheduling, request firing, latency measurement, ramping load, and scenario orchestration. It already has a database plugin with [xk6-sql](https://github.com/grafana/xk6-sql), but we needed to be able to work with databases that were outside the SQL camp (mainly Elasticsearch, OpenSearch, and MongoDB).

We also wanted better benchmarking ergonomics: dashboards, run exports, dataset loading, and reproducible local environments. This led to us building the `xk6-database` extension.

Here's how the pieces fit together: backends, datasets, query scripting, and the built-in dashboard.

### Backends and Environments

From the start, we wanted PostgreSQL, Elasticsearch, OpenSearch, ClickHouse, MongoDB Atlas Search, and ParadeDB itself to run under the same framework. But we didn't want to force them through a shared abstraction layer; each backend gets its own driver and runs in its own native syntax.

If you’d like to run locally, we provide Docker Compose profiles that bring up isolated backend environments with a single command.

### Datasets and the Loader

Datasets are treated as separate units: schema, raw data, and per-backend setup scripts live together in one directory. Adding a new dataset means dropping a new directory in `./datasets/` rather than touching the runner.

Data goes in through a separate loader CLI, which handles schema setup, bulk ingest, and post-load steps like VACUUM ANALYZE or force merge automatically.

### A First Benchmark

You write benchmark strategies in standard k6 JavaScript. A minimal single-backend workload targeting a local ParadeDB container would look like this:

```javascript

const backends = db.backends({ backends: ["paradedb"] });
const terms = db.terms(open("./search_terms.json"));

const scenarios = {
  paradedb: {
    executor: "constant-vus",
    vus: 5,
    duration: "30s",
    exec: "paradedbQuery",
  },
};

// Collect metrics from the Docker containers
export const collectMetrics = backends.addDockerMetricsCollector(
  scenarios,
  "30s",
);

export const options = { scenarios };

export function paradedbQuery() {
  backends
    .get("paradedb")
    .query(
      `SELECT id, title FROM documents WHERE content ||| $1 LIMIT 10`,
      terms.next(),
    );
}
```

Each VU runs `paradedbQuery` in a loop for 30 seconds. The runner times every call, tags it with the backend, and pushes the metrics to the dashboard.

Swap `constant-vus` for `ramping-vus` to ramp load up and down across stages, or `constant-arrival-rate` to fire a fixed number of requests per second regardless of how fast the system answers.

### Composing Scenarios

A single script can compose multiple scenarios across multiple backends. You might run search at 200 QPS, an aggregation at 100 QPS, and an ingest stream at 1,000 rows per second, all hitting the same backend at the same time. A built-in phase timer staggers a second backend so the two runs don't compete for resources.

Comparing raw ingest throughput across databases with different write semantics rarely tells you anything useful; comparing query latency under sustained write pressure tells you a lot.

### Streaming Results

Results stream to the browser as the test runs: latency percentiles per backend (P50, P90, P95, P99), query and ingest throughput, and per-container CPU and memory pulled from Docker.

Backend configs, setup scripts, and query patterns are captured alongside the timeline, so a result can be reproduced and audited later.

The code can be found at [github.com/paradedb/benchmarker](https://github.com/paradedb/benchmarker), with a full walkthrough coming soon.

## What's Next

We've shipped drivers for the backends we get compared to most often. We'd value experts in any of them auditing how Benchmarker drives them, so the comparisons we publish go up against each system at its best. And if there's a backend you'd like to see added, contributions are welcome.

Stay tuned for our upcoming performance blogs, where we'll put Benchmarker to work: real comparisons against other search engines, and the optimization journeys that came from improving those numbers.

---

_Ready to run your own database benchmarks?_ Check out [ParadeDB Benchmarker](https://github.com/paradedb/benchmarker).

---

# From Text to Token: How Tokenization Pipelines Work

> Understanding how search engines transform text into tokens through character filtering, tokenization, stemming, and stopword removal.

Published: 2025-10-10 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/blog/when-tokenization-becomes-token

---

When you type a sentence into a search box, it’s easy to imagine the search engine seeing the same thing you do. In reality, search engines (or [search databases](https://www.paradedb.com/blog/elasticsearch-acid-test)) don’t store blobs of text, and they don’t store sentences. They don’t even store words in the way we think of them. They dismantle input text (both indexed and query), scrub it clean, and reassemble it into something slightly more abstract and far more useful: tokens. These tokens are what you search with, and what is stored in your [inverted indexes](https://www.paradedb.com/learn/search-concepts/full-text-search#indexing) to search over.

Let’s slow down and watch that pipeline in action, pausing at each stage to see how language is broken apart and remade, and how that affects results.

We’ll use a twist on "The quick brown fox jumps over the lazy dog" as our test case. It has everything that makes tokenization interesting: capitalization, punctuation, an accent, and words that change as they move through the pipeline. By the end, it’ll look different, but be perfectly prepared for search.

<p />

  This isn’t a complete pipeline, just a look at some of the common filters
  you’ll find in lexical search systems. Different databases and search engines
  expose many of these filters as composable building blocks that you can
  enable, disable, or reorder to suit your needs.

The same general ideas apply whether you're using [Lucene](https://github.com/apache/lucene)/[Elasticsearch](https://github.com/elastic/elasticsearch),
[Tantivy](https://github.com/quickwit-oss/tantivy)/[ParadeDB](https://github.com/paradedb/paradedb), or [Postgres](https://postgresql.org) full-text search.

## Filtering Text With Case and Character Folding

Before we even think about breaking our text down we need to think about filtering out anything which isn’t useful. This usually means auditing the characters which make up our text string: transforming all letters to lower-case, and if we know we might have them folding any diacritics (like in résumé, façade, or Noël) to their base letter.

This step ensures that characters are normalized and consistent before tokenization begins. Café becomes cafe, and résumé becomes resume, allowing searches to match regardless of accents. Lowercasing ensures that database matches Database, though it can introduce quirks: like matching Olive (the name) with olive (the snack). Most systems accept this trade-off: false positives are better than missed results. Code search is a notable exception, since it often needs to preserve symbols and respect casing like _camelCase_ or _PascalCase_.

Let’s take a look at how our input string is transformed. We are replacing the capital T with a lower-case one, and also folding the `é` to an `e`. Nothing too surprising here. All of these boxes are interactive, so feel free to put in your own sentences to see the results.

Of course, there are many more filters that can be applied here, but for the sake of brevity, let’s move on.

## Splitting Text Into Searchable Pieces with Tokenization

The tokenization phase takes our filtered text and splits it up into indexable units. This is where we move from dealing with a sentence as a single unit to treating it as a collection of discrete, searchable parts called tokens.

The most common approach for English text is simple whitespace and punctuation tokenization: split on spaces and marks, and you’ve got tokens. But even this basic step has nuances: tabs, line breaks, or hyphenated words like full-text can all behave differently. Each system has its quirks, the default Lucene tokenizer turns `it’s` into `[it's]`, while the Tantivy splits into `[it, s]`[^1].

Generally speaking there are three classes of tokenizers:

1. **Word oriented tokenizers** break text into individual words at word boundaries. This includes simple whitespace tokenizers that split on spaces, as well as more sophisticated language-aware tokenizers that understand non-English character sets[^2] . These work well for most search applications where you want to match whole words.

2. **Partial Word Tokenizers** split words into smaller fragments, useful for matching parts of words or handling compound terms. [N-gram](https://en.wikipedia.org/wiki/N-gram) tokenizers create overlapping character sequences, while edge n-gram tokenizers focus on prefixes or suffixes. These are powerful for autocomplete features and fuzzy matching but can create noise in search results.

3. **Structured Text Tokenizers** are designed for specific data formats like URLs, email addresses, file paths, or structured data. They preserve meaningful delimiters and handle domain-specific patterns that would be mangled by general-purpose tokenizers. These can be essential when your content contains non-prose text that needs special handling.

For our example we will be using a simple tokenizer, but you can also toggle to a trigram (an n-gram with a length of 3) tokenizer below to get a feel for how different the output would be (don't forget you can change the text in the top box to play round).

## Throwing Out Filler With Stopwords

Some words carry little weight. They appear everywhere, diluting meaning: "the", "and", "of", "are". These are stopwords. Search engines often throw them out entirely[^3], betting that what remains will carry more signal.

This is not without risk. In _The Who_, "the" matters. That's why stopword lists are usually configurable[^4] and not universal. In systems which support [BM25](https://www.paradedb.com/learn/search-concepts/bm25) they are often left out altogether because the ranking formula gives less weight to very common terms, but in systems which don't support BM25 (like Postgres tsvector) stopwords are critically important.

Notice how removing stopwords immediately makes our token list more focused? We've gone from ten tokens to eight, and what remains carries more semantic weight.

## Cutting Down to the Root with Stemming

`Jump`, `jumps`, `jumped`, `jumping`. Humans see the connection instantly. Computers don't, unless we give them a way[^5].

Enter stemming. A stemmer is a rule-based machine that chops words down to a common core. Sometimes this happens elegantly, and sometimes it happens brutally. The foundation for most modern English stemming comes from [Martin Porter's 1980 algorithm](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf), which defined the approach that gave search engines consistent rules for stripping suffixes while respecting word structure. Today many stemmers are based on the [Snowball](https://snowballstem.org/) variant.

The results can look odd. `Database` becomes `databas,` `lazy` becomes `lazi.` But that's okay because stemmers don't care about aesthetics, they care about consistency. If every form of `lazy` collapses to `lazi,` the search engine can treat them as one[^6]. There's also lemmatization, which uses linguistic knowledge to convert words to their dictionary forms, but it's more complex and computationally expensive than stemming's "good enough" approach[^7]. {/* * codespell:ignore databas * */}

Here's the final transformation: our tokens have been reduced to their essential stems. `Jumped` becomes `jump,` `lazy` becomes `lazi,` and `database` becomes `databas.` These stems might not look like real words, but they serve a crucial purpose: they're consistent. Whether someone searches for `jumping,` `jumped,` or `jumps,` they'll all reduce to `jump` and match our indexed content. This is the power of stemming: bridging the gap between the many ways humans express the same concept. {/* * codespell:ignore databas * */}

## The Final Tokens

Our sentence has traveled through the complete pipeline. What started as _"The full-text database jumped over the lazy café dog"_ has been transformed through each stage: stripped of punctuation and capitalization, split into individual words, filtered of common stopwords, and finally reduced to stems.

The result is a clean set of eight tokens:

{/* prettier-ignore */}
 {/* * codespell:ignore databas * */}

This transformation is applied to any data we store in our inverted index, and also to our queries. When someone searches for "databases are jumping," that query gets tokenized: lowercased, split, stopwords removed, and stemmed. It becomes `databas` and `jump`, which will match our indexed content perfectly. {/* * codespell:ignore databas * */}

## Why Tokenization Matters

Tokenization doesn’t get the glory. Nobody brags about their stopword filter at conferences. But it’s the quiet engine of search. Without it, `dogs` wouldn’t match `dog`, and `jumping` wouldn’t find `jump`.

Every search engine invests heavily here because everything else (scoring, ranking, relevance) depends on getting tokens right. It’s not glamorous, but it’s precise, and when you get this part right, everything else in search works better.

[Get started with ParadeDB](https://docs.paradedb.com/documentation/getting-started/install) to see how modern search databases handle tokenization for you, and please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).

---

[^1]: Which is better? That depends: `it's` seems more correct and skips storing a useless `s` token, but it wouldn't match on a search for `it`.

[^2]: General purpose morphological libraries like [Jieba](https://github.com/fxsjy/jieba) and [Lindera](https://github.com/lindera/lindera) are often used to provide tokenizers that can deal with Chinese, Korean, Japanese and characters.

[^3]: When we remove stopwords we still keep the original position of each token in the document. This allows positional queries ("find `cat` within five words of `dog`" even though we have discarded words.

[^4]: Lucene and Tantivy both have stopwords off by default, and when enabled for English they use the same default list: `[a, an, and, are, as, at, be, but, by, for, if, in,into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with]`

[^5]: Another method is vector search, which trades lexical stemming for searching over semantic meaning.

[^6]: Words can also be overstemmed, consider `university` and `universe` which both stem to `univers`, but have very different meanings.

[^7]: [Lemmatization](https://en.wikipedia.org/wiki/Lemmatization) uses actual word lists to make sure it only makes real words. To do this it accurately it needs to know the "part of speech" the source word is (noun, verb, etc..).

---

# Customers

# Alibaba Picks ParadeDB to Bring Full Text Search to its Postgres-Based Data Warehouse

> How Alibaba Cloud integrated ParadeDB for full text search in their Postgres-based data warehouse.

Published: 2024-09-24 · Author: Ming Ying

Source: https://www.paradedb.com/customers/case-study-alibaba

---

<p>Overview</p>

## About Alibaba Cloud

Alibaba Cloud is the cloud computing arm of Chinese tech giant Alibaba Group and is the largest Asia Pacific cloud provider.

One of Alibaba Cloud’s products, AnalyticDB for PostgreSQL, is a data warehouse built on Postgres. AnalyticDB for PostgreSQL uses a distributed architecture known as MPP (massively parallel processing) to process petabytes of data with high concurrency and low latency.

<p>Challenge</p>

## Alibaba Needed a Full Text Search Solution That Could Handle Petabytes of Data

Prior to integrating with ParadeDB, AnalyticDB for PostgreSQL’s full text search (FTS) capabilities were limited to Postgres native full text search, which uses the `tsvector` type and GIN indexing.

As large enterprise customers onboarded with AnalyticDB for PostgreSQL, many of them had sophisticated full text search needs that were not met by `tsvector`:

1. Over multi-terabyte tables, `tsvector` performance degraded significantly. Alibaba's customers needed a search engine that could meet or exceed the query latency and throughput of Lucene in high concurrency scenarios over multi-terabyte tables.
2. Without BM25 scoring, `tsvector` ranked results poorly because it does not factor in important variables like term frequency.
3. Common advanced search queries like fuzzy search, relevance tuning, and faceting were not supported.

<p>Solution</p>

## ParadeDB Delivers Full Postgres Compatibility with Zero Infrastructure Overhead

Because AnalyticDB for PostgreSQL is fully Postgres compatible, using an external service like Elasticsearch was not a viable option.

With ParadeDB, AnalyticDB's search indexes are baked directly into Postgres. This means that ParadeDB gives Alibaba Elastic-quality search and all the
features of Postgres, including

1. Full integration with Postgres’ native backup, restore, and high availability functionalities.
2. First-class JOIN support so data does not need to be denormalized.
3. Careful integration with Postgres’ query planner so that complex full text search queries can be inspected and optimized.

<div>
  <span>
    ParadeDB's SQL-like query syntax is friendly to our users. Data development
    engineers can quickly master and apply it to application systems.
  </span>
  <span>
    Pang Bo, Product Manager
  </span>
</div>

<p>Outcomes</p>

## ParadeDB Delivers 5X Better Performance per Core Compared to Lucene

Alibaba ran an extensive 60-day evaluation process, during which they benchmarked ParadeDB against Lucene (Elasticsearch's underlying search engine) over a corpus of 100 million Wikipedia documents.
Alibaba measured four criteria: index build time, index size on disk, throughput (queries per second), and latency (round trip time in milliseconds).

### Test Environment

Both Lucene and ParadeDB were run on identical machines with 4 CPU cores, 16GB RAM, and PL1 ESSDs. 4 data nodes were dedicated to ParadeDB and Lucene each.

### Index Build Time

ParadeDB indexed 100 million documents over than twice as quickly as Lucene.

### Index Size

Lucene and ParadeDB indexes consumed a similar amount of disk space.

### Throughput

With 40 concurrent readers, ParadeDB was able to process 5X queries per second compared to Lucene. This difference was magnified as the number of concurrent readers grew.

### Latency

With 40 concurrent readers, ParadeDB's round trip query times were 5X faster than Lucene.

This means that, with ParadeDB, Alibaba is able to meet the business-critical workloads of its most demanding customers.

<div>
  <span>
    ParadeDB has excellent performance and throughput in the field of Full Text
    Search, helping our clients achieve structured analysis and full-text
    retrieval using a pure Postgres engine.
  </span>
  <span>
    Pang Bo, Product Manager
  </span>
</div>

---

# Bilt Reduces Postgres Query Timeouts by 95% with ParadeDB

> How Bilt improved their Postgres performance with ParadeDB, reducing query timeouts by 95%.

Published: 2025-07-28 · Author: Ming Ying

Source: https://www.paradedb.com/customers/case-study-bilt

---

<b>
  Bilt Rewards deployed ParadeDB as a logical replica of their primary Postgres,
  hosted in GCP. ParadeDB handles BILT’s most expensive faceted and fuzzy search
  queries, improving accuracy and reducing query timeouts by 95%.
</b>
<div>
  {/* Desktop Table */}
  <table>
    <thead>
      <tr>
        <th>5ms</th>
        <th>300GB</th>
        <th>7K rows/s</th>
        <th>40 QPS</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>P95 Query Latency</td>
        <td>Data Stored</td>
        <td>Write Volume</td>
        <td>Read Volume</td>
      </tr>
    </tbody>
  </table>
  {/* Mobile Vertical List */}
  <div>
    <div>
      <span>P95 Query Latency</span>
      <span>5ms</span>
    </div>
    <div>
      <span>Data Stored</span>
      <span>300GB</span>
    </div>
    <div>
      <span>Write Volume</span>
      <span>7K rows/s</span>
    </div>
    <div>
      <span>Read Volume</span>
      <span>40 QPS</span>
    </div>
  </div>
</div>

### Overview

Bilt is the housing and neighborhood commerce network that turns everyday housing and neighborhood spend into valuable rewards and benefits. It’s the first program that lets members earn rewards on rent and HOA payments while helping them build a path to homeownership. As of December 2024, Bilt was processing over $36 billion in annual rent and HOA payments.

The Bilt Alliance — developed in partnership with some of the nation’s largest residential owners and operators — is a network of more than 4.5 million homes across the country.
This all-in-one platform enables property managers to monitor resident and payment information across all lease units,
manage loyalty incentive programs, and streamline rent payments.

### The Problem

The platform’s original search functionality — built on Postgres — was lackluster and prevented Bilt from delivering a premium user experience.

The first problem was accuracy of search results. Postgres' limited tokenizers and query types meant that users often couldn’t find what they were looking for. Bilt's system relied on exact token matching that compared the hashed values of the query with the hashed, underlying text data. Bilt also tried Postgres' built-in trigram index, which improved results but remained unsatisfactory.

The second issue was speed. Queries frequently encountered high latency and timeouts, particularly for users who needed to search millions of data records. Operations like filters, `ORDER BY` clauses, and cross-schema `JOIN`s were especially time-intensive. As a stopgap solution, Bilt prohibited engineers from joining across schemas in a single query. Instead, the system orchestrated multiple separate API calls and combined their results on the client side — a workaround which created a tangled mess of spaghetti code.

### The Solution

When Bilt brought in ParadeDB, they saw an instant speedup to their search queries.
This is because ParadeDB's modern query execution engine can efficiently parallelize and execute SQL `WHERE`, `ORDER BY`, `JOIN`, and other clauses. ParadeDB outperformed Bilt's vanilla Postgres setup, which handles these clauses as separate post-processing steps.

Search accuracy also improved, thanks to ParadeDB’s built-in support for fuzzy matching. The results were especially noticeable in Bilt's "property reviews" page,
which implements fuzzy search over property and property manager information.

<div>
  <div>
    <span>
      Using ParadeDB has unlocked the ability to rapidly launch new search
      capabilities across our products—something that previously would have
      taken weeks of effort.
    </span>
    <span>
      John King, Backend Engineer
    </span>
  </div>
</div>

### Results

Bilt chose ParadeDB because it required the least amount of infrastructure changes and engineering effort to unlock these search functionalities. Unlike any other search database, syncing a primary Postgres with ParadeDB requires almost zero operational effort. This is because ParadeDB can use Postgres’ built-in logical replication to ingest data from the primary Postgres.

ParadeDB also enabled Bilt to leverage its existing Postgres knowledge and deploy within its cloud provider, GCP. Implementation required only minor adjustments to existing Postgres search queries, which saved weeks of engineering effort compared to migrating the query logic to an alternate solution.

In Bilt’s Resident Loyalty Portal — where property managers can search for resident, payment, and unit identifiers — query timeouts dropped by 95% and P95 query latency fell by 50%. At the same time, Bilt was able to move the most expensive queries out of their primary Postgres and into ParadeDB.

Ultimately, ParadeDB's robust, performant architecture has enabled Bilt to rapidly launch new search capabilities across its suite of products. Bilt’s success story demonstrates why platforms built on Postgres demand a Postgres-native search solution.

---

# Cofactr Migrates from MongoDB Atlas to ParadeDB

> How Cofactr replaced MongoDB Atlas Search with ParadeDB to unify its search and primary data stores on Postgres in AWS GovCloud.

Published: 2026-04-21 · Author: Ming Ying

Source: https://www.paradedb.com/customers/case-study-cofactr

---

<b>
  After three high-impact outages on a managed search vendor, Cofactr
  consolidated its stack onto Postgres and has run disruption-free on ParadeDB
  ever since.
</b>
<div>
  {/* Desktop Table */}
  <table>
    <thead>
      <tr>
        <th>10M+</th>
        <th>7.8K/s</th>
        <th>245 QPS</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Total Rows</td>
        <td>Peak Write Rate</td>
        <td>Peak Read Rate</td>
      </tr>
    </tbody>
  </table>
  {/* Mobile Vertical List */}
  <div>
    <div>
      <span>Total Rows</span>
      <span>10M+</span>
    </div>
    <div>
      <span>Peak Write Rate</span>
      <span>7.8K/s</span>
    </div>
    <div>
      <span>Peak Read Rate</span>
      <span>245 QPS</span>
    </div>
  </div>
</div>

<div>
  <div>
    <span>
      ParadeDB lets us unify our search and primary data stores around Postgres
      without giving up performance. Its operational stability has made it a
      trusted part of our stack.
    </span>
    <span>
      Noah Trueblood, Director, Data & AI
    </span>
  </div>
</div>

### Overview

[Cofactr](https://cofactr.com) is a full-service electronics purchasing platform for hardware teams. Customers upload a bill of materials, and Cofactr handles the rest: its AI agents source components, negotiate pricing, and place orders, while Cofactr's warehouses store the parts and assemble custom kits.

Cofactr's platform is also compliant with the International Traffic in Arms Regulations (ITAR), the US government framework governing the export of defense articles and technical data. That means data access must be restricted to US Persons, and the entire stack runs inside AWS GovCloud under strict security controls.

### The Problem

Before ParadeDB, Cofactr's search architecture was split across two database systems. An internal service aggregated and normalized part data from many sources into MongoDB Atlas, using Atlas Search to explore a large parts dataset of structured and unstructured fields. Cleaned records were then synced into a Postgres database behind the customer-facing platform, which relied on a mix of native full-text search and custom query patterns.

Maintaining two search approaches across two database systems made synchronization operationally complex and gave the team less control over search index freshness and consistency than they wanted. Atlas was also fully managed: when issues came up, Cofactr was dependent on MongoDB's support staff to take action. Across 2023 and 2024, three high-impact service disruptions required MongoDB support to escalate and resolve.

MongoDB Atlas for Government also carried a steep markup and ran outside Cofactr's AWS account, which added another cloud environment to the compliance surface area.

Cofactr came to realize that its workload and query patterns were a natural fit for Postgres, and that consolidating around a Postgres-native search engine would simplify the stack without compromising on capabilities.

### The Solution

Cofactr's technical requirements were clear: tight control over replication from the primary database to the search database, low replication latency, and the ability to absorb bursts of traffic without degrading. The team evaluated MongoDB Atlas, OpenSearch, and native Postgres full-text search (`tsvector`/`tsquery`/`ts_rank`) alongside ParadeDB.

MongoDB Atlas offered low read latency, but replication lag from the primary database was opaque and outside Cofactr's control, and the GovCloud markup plus extra compliance surface area made it a poor fit. OpenSearch offered more control and an expansive feature set but added operational complexity to the stack. Native Postgres full-text search could run inside RDS, but initial benchmarking suggested that index and query performance wouldn't meet Cofactr's long-term needs, and `ts_rank` was too limited.

ParadeDB stood out for two reasons. First, consolidating search into Postgres collapsed the stack around a single database system the team already knew well. Second, by using `dblink` to pull data from the primary database into ParadeDB, the time from ingesting new data into the primary database to having it searchable in ParadeDB was small.

### Results

Cofactr deployed ParadeDB as a standalone database in AWS GovCloud — the same environment as the rest of its ITAR-compliant infrastructure — to power freeform search across its electronic components catalog.

The ParadeDB database today spans roughly 11 million rows. Cofactr deliberately keeps this lean: it only contains data for parts customers are likely to access, and only the fields that go into the search index. When additional context is needed, the team enriches the result set at query time.

In the last 90 days, the highest peak write rate was roughly 7,800 rows/s on a 60-second rollup. Peak read throughput hit 245 QPS on a 30-second rollup. ParadeDB has absorbed these bursts without issue.

The operational story is where the difference is starkest. Since adopting ParadeDB, Cofactr has had zero major disruptions.

### ParadeDB

ParadeDB is the simplest way to add Elastic-quality search to your Postgres. If you have a use case around large scale search in Postgres,
we'd [love to chat](mailto:hello@paradedb.com).

---

# INSA Strasbourg Powers New Research Database with ParadeDB

> How INSA replaced Elasticsearch with ParadeDB for better performance and simplicity.

Published: 2024-08-29 · Author: Ming Ying

Source: https://www.paradedb.com/customers/case-study-insa

---

<p>Overview</p>

## About INSA Strasbourg

INSA Strasbourg (National Institute of Applied Sciences) is a leading engineering institution located in France. INSA Strasbourg's engineering sciences research division consists of 200 scientists who primarily conduct health and environmental research.

<p>Problem</p>

## INSA Strasbourg Needed a Reliable Data Store and Search Engine to Back Cutting-Edge Scientific Research

INSA Strasbourg was building an internal system that could be used by researchers to search over a massive corpus of scientific publications. Their dataset was 1.5 terabytes, consisting of 66 million documents alongside metadata like publication dates, numeric, and boolean flags.

Their engineers outlined five requirements for their document retrieval system:

1. **BM25 support**: The BM25 algorithm is the established standard for full text retrieval.
2. **SQL support**: Since INSA Strasbourg's engineers were familiar with SQL, they wanted to avoid query languages that strayed from SQL.
3. **Python compatibility**: INSA Strasbourg wanted to easily interact with the chosen solution from Python code.
4. **Kubernetes compatibility**: This would make it easy for INSA Strasbourg to deploy their solution.
5. **Open source**: INSA Strasbourg's engineering philosophy is centered around the adoption of open source, which enables them to experiment with software on their own infrastructure without long approval processes.

<p>Solution</p>

## ParadeDB Wins on Operational Simplicity and Postgres Compatibility

<div>
  <span>
    We wanted to avoid going with the obvious Elasticsearch route because we
    wanted to avoid JVM overhead. We also wanted an extensible ecosystem.
  </span>
  <span>
    Iliass Ayaou, Research Engineer
  </span>
</div>

In addition to ParadeDB, INSA Strasbourg also evaluated Elasticsearch, Manticore Search, Meilisearch, Typesense and Quickwit.

After an initial evaluation process, INSA Strasbourg narrowed its options down to ParadeDB and Elasticsearch. However, the operations team was concerned about Elastic. They wanted to avoid the overhead that Elastic’s JVM (Java virtual machine), which is notoriously resource-intensive and difficult to manage, would add to their infrastructure.

Unlike Elasticsearch, ParadeDB is built on Postgres, a battle-tested relational database that INSA Strasbourg already knew how to operate. This meant that INSA Strasbourg could leverage the entire Postgres ecosystem, including other Postgres extensions and Postgres-compatible tooling. INSA Strasbourg trusted that the Postgres ecosystem would continue to grow and evolve in the coming years.

<p>Outcomes</p>

## INSA Strasbourg Supercharges Their Document Search System

<div>
  <span>
    We have been more than satisfied with ParadeDB's BM25 performance for
    search.
  </span>
  <span>
    Iliass Ayaou, Research Engineer
  </span>
</div>

Integrating ParadeDB with INSA Strasbourg's existing on-prem Kubernetes cluster took a matter of days. From an operational perspective, INSA’s database administrator was already experienced with configuring and scaling Postgres. For engineers, ParadeDB integrated seamlessly with the existing code base, since interacting with ParadeDB from any programming language is the same as interacting with a Postgres database.

Since moving ParadeDB into production, INSA Strasbourg has supercharged their document search experience. The engineering team has been extremely satisfied with ParadeDB’s sub-100ms query speeds, BM25 scoring, and hybrid search capabilities. Using ParadeDB saves several hours of engineering time per week, since their engineers know how to use Postgres and don’t need to learn a new query language.

---

# ParadeDB Powers Modern Treasury's Core UI and Search APIs

> How Modern Treasury replaced legacy indexes and roll-up tables with ParadeDB, eliminating query timeouts and increasing write speeds by 2-3x.

Published: 2026-03-16 · Author: Ming Ying

Source: https://www.paradedb.com/customers/case-study-modern-treasury

---

<b>
  ParadeDB powers Modern Treasury's core UI and search APIs, eliminating query
  timeouts and increasing write speeds by 3x.
</b>
<div>
  {/* Desktop Table */}
  <table>
    <thead>
      <tr>
        <th>10TB</th>
        <th>500M+</th>
        <th>50K/min</th>
        <th>3x</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Data Stored</td>
        <td>Largest Table Rows</td>
        <td>Peak Write Rate</td>
        <td>Write Speed Increase</td>
      </tr>
    </tbody>
  </table>
  {/* Mobile Vertical List */}
  <div>
    <div>
      <span>Data Stored</span>
      <span>10TB</span>
    </div>
    <div>
      <span>Largest Table Rows</span>
      <span>500M+</span>
    </div>
    <div>
      <span>Peak Write Rate</span>
      <span>50K/min</span>
    </div>
    <div>
      <span>Write Speed Increase</span>
      <span>2-3x</span>
    </div>
  </div>
</div>

<div>
  <div>
    <span>
      Built on the solid foundation of Postgres, ParadeDB provides the good
      parts of Elasticsearch we actually need, without the infrastructure and
      hassle of data synchronization.
    </span>
    <span>
      Matt Higgins, Head of Engineering
    </span>
  </div>
</div>

### Overview

Modern Treasury is the payments operations platform for businesses that move money at scale. The platform offers APIs, dashboards, and workflow tooling to manage the full lifecycle of payments — from initiation and approval through reconciliation and reporting. Modern Treasury's customers span financial services, fintech, and enterprise, relying on the platform to process high volumes of transactions with accuracy down to the penny.

### The Problem

As Modern Treasury's customer base grew, so did the complexity of powering its core product UIs. The platform's flexible filter interfaces allow customers to search, sort, and aggregate across payments, transactions, and ledger entries using any combination of dozens of attributes. With many customers storing over 100M records, the permutations of possible query plans made it impossible to cover every case with traditional B-tree indexes.

  Our [Top K in Postgres]() blog post explores why B-tree indexes don't
  generalize well to workloads that combine many filters with flexible sorting.

The team endlessly added B-tree indexes to cover different customer search patterns, but there was no reasonable way to index every permutation without causing write performance degradation. As a workaround, Modern Treasury built summary roll-up tables to support basic aggregations — a common pattern for workloads where Postgres struggles at high record counts. However, these roll-up tables introduced data delays and added significant complexity to the system.

The result was a long tail of edge-case query timeouts. Approximately 5% of queries failed entirely due to unoptimized query plans, and the engineering team was stuck in a reactive cycle of adding indexes to patch individual failures.

### The Solution

Modern Treasury needed a backend that could power subsecond responses across its core UIs without requiring a different index for every query pattern. The requirements boiled down to filtering records based on any of dozens of attributes, sorting by any attribute, and returning the top results. Crucially, the system had to be accurate to the penny.

The team evaluated Elasticsearch but concluded that the data consistency challenges and deployment complexity it would introduce were not worth the trade-off. While Elasticsearch is mature and feature-rich, the value of Postgres replication and SQL-native infrastructure offered by ParadeDB — as a native Postgres extension — outweighed the bells and whistles.

Modern Treasury deployed ParadeDB as a logical replica of their primary Postgres across 7 clusters. This meant the team could leverage its existing Postgres knowledge and infrastructure. No new data pipelines, no sync jobs, and no new query language. Implementation required only minor adjustments to existing SQL queries.

### Results

<div>
  <div>
    <span>
      Switching to ParadeDB allowed us to replace as many as 26 B-tree and GIN
      indexes over one table with a single ParadeDB index. This has massively
      improved our write throughput and simplified our operational overhead.
    </span>
    <span>
      Eric Barendt, Software Engineer
    </span>
  </div>
</div>

ParadeDB resolved an entire class of edge-case query timeouts for Modern Treasury. The team no longer needs to think about the many permutations of query plans that previously caused failures. The 5% of queries that previously timed out are now subsecond.

ParadeDB's performance on counting and summing operations enabled Modern Treasury to add real-time counts and aggregations to user interfaces that previously could not support them. The platform now powers grouped aggregations and transaction-to-payment reconciliation workflows that were previously too expensive to run inline.

Perhaps most significantly, ParadeDB allowed Modern Treasury to drop a large number of legacy B-tree indexes that the UI no longer depended on. With fewer indexes competing for write throughput, raw write speeds increased by 2-3x on the platform's most important database tables.

Modern Treasury now uses ParadeDB across its payments, transactions, and ledger products, servicing both customer-facing search APIs and internal analytic processes.

### ParadeDB

ParadeDB is the simplest way to add Elastic-quality search to your Postgres. If you have a use case around large scale search in Postgres,
we'd [love to chat](mailto:hello@paradedb.com).

---

# Sweetspot Unifies Hybrid Search on Postgres with ParadeDB

> How SweetSpot leveraged ParadeDB for their search infrastructure.

Published: 2024-06-09 · Author: Ming Ying

Source: https://www.paradedb.com/customers/case-study-sweetspot

---

## Sweetspot Overview

[Sweetspot](https://sweetspot.so) is an AI-driven platform for government procurement and contracting. Sweetspot streamlines the entire contracting process from
opportunity identification to bid/no-bid decision making to proposal development.

## The Challenge

Sweetspot needed a hybrid search — the combination of full-text and vector search — solution to power its contract search and AI-assisted proposal drafting products.

Since Sweetspot was already using Postgres, it was important that their search solution interoperated seamlessly with Postgres. Before using ParadeDB, Sweetspot
experienced significant friction synchronizing large volumes of data between Postgres and external databases like Elasticsearch.
This synchronization involved writing and maintaining custom connectors, which were prone to bugs and data inconsistencies. Additionally,
many of the vector databases that Sweetspot evaluated had unreliable infrastructure which caused frequent downtime in their application.

## The Solution

Sweetspot needed a search solution that was easy to integrate, compatible with existing Postgres architecture, and offered reliable performance over
15 million documents and 2 million vectors.

After an extensive review, Sweetspot selected ParadeDB. With ParadeDB, Sweetspot unified its data stack on Postgres, which removed the need to manage a
separate database or use custom connectors. ParadeDB's advanced full-text and vector search capabilities enabled Sweetspot to build powerful hybrid search
experiences without additional operational complexity.

## Outcomes

Integrating with ParadeDB took less than a week since Sweetspot was already using Postgres and could reuse its existing code.

Post-integration, Sweetspot observed a 50% overall reduction in query latency. Hybrid search queries achieved a P99 latency of 50 milliseconds and had improved
precision and recall compared to pure vector search.

ParadeDB also simplifies the process of achieving FedRAMP and NIST compliance since Postgres has robust authentication, access control, logging, and disaster recovery built in.

If you think that ParadeDB can solve a use case for your organization, we invite you to [contact us](mailto:sales@paradedb.com).

---

# Terrapin Finance Cuts Bond Search Latency by 25x with ParadeDB

> How Terrapin Finance replaced pg_trgm with ParadeDB to power sub-200ms full-text search across 1.1 billion rows of fixed-income data.

Published: 2026-03-17 · Author: Ming Ying

Source: https://www.paradedb.com/customers/case-study-terrapin-finance

---

<b>
  Terrapin Finance deployed ParadeDB inside their self-hosted Postgres to power
  sub-200ms full-text search across 1.1 billion rows of bond data — a 25x
  improvement over their previous vanilla Postgres setup.
</b>
<div>
  {/* Desktop Table */}
  <table>
    <thead>
      <tr>
        <th>1.3TB</th>
        <th>1.1B</th>
        <th>8K QPS</th>
        <th>25x</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Data Stored</td>
        <td>Total Rows</td>
        <td>Peak Read Rate</td>
        <td>Faster Queries</td>
      </tr>
    </tbody>
  </table>
  {/* Mobile Vertical List */}
  <div>
    <div>
      <span>Data Stored</span>
      <span>1.3TB</span>
    </div>
    <div>
      <span>Total Rows</span>
      <span>1.1B</span>
    </div>
    <div>
      <span>Peak Read Rate</span>
      <span>8.26K QPS</span>
    </div>
    <div>
      <span>Faster Queries</span>
      <span>25x</span>
    </div>
  </div>
</div>

<div>
  <div>
    <span>
      ParadeDB is the most mature search solution for Postgres available, and it
      delivered exactly what we needed.
    </span>
    <span>
      Juozas Norkus, CTO
    </span>
  </div>
</div>

### Overview

[Terrapin Finance](https://terrapinfinance.com) is a fixed-income data provider specializing in reference and pricing data APIs for US municipal, government, and corporate bonds. The company serves institutional clients who depend on accurate, real-time bond data for trading and analysis. Terrapin is also building a municipal credit intelligence platform that combines document search, credit risk analytics, and AI-driven research tools.

### The Problem

Terrapin's bond reference data APIs serve data derived from massive source tables spanning over a billion rows. A key requirement from their clients is the ability to filter bonds by issuer name, a natural-language field that doesn't lend itself to exact-match lookups. On the product side, Terrapin's credit risk analysis web UI features a type-ahead search bar where users enter a unique identifier or bond issuer name and see matching results as they type.

For a financial data platform, search responsiveness is non-negotiable: analysts expect instant feedback as they narrow down results across millions of securities.

The team initially built this functionality with `pg_trgm`, Postgres' built-in trigram matching extension. While trigram indexes are a common starting point for fuzzy text search in Postgres, they struggled under the weight of Terrapin's data. Search queries routinely took 5 seconds to return results, far too slow for a type-ahead interface. This was also painful for API consumers running batch queries against issuer names.

### The Solution

Terrapin's requirements were straightforward: they needed a search solution that was fast and that didn't require introducing new infrastructure. The team's entire stack was built on Postgres, and they had no interest in bolting on a separate search engine with its own cluster, sync pipeline, and query language.

When evaluating options, ParadeDB stood out as the most mature and complete Postgres-native alternative to `pg_trgm`. Because ParadeDB runs as an extension inside Postgres, Terrapin could deploy it directly into their existing self-hosted instance — no sidecars, no replication targets, and no new operational burden. The migration was surgical: swap out the trigram queries for ParadeDB's full-text search syntax, and the existing tables, schemas, and deployment pipeline stayed exactly as they were.

### Results

The impact was immediate. Search queries that previously took 3 to 5 seconds dropped to 100–200ms, a 25x improvement in query execution time. The type-ahead search bar in Terrapin's credit risk UI went from sluggish to instantaneous, and API consumers saw the same gains across their programmatic queries.

Beyond the core bond data APIs, Terrapin leveraged ParadeDB to build something more [ambitious](https://www.linkedin.com/posts/miguel-jaques-8026ba205_blazingly-fast-municipal-document-search-activity-7363610999263834112-Ys9S?utm_source=share&utm_medium=member_desktop&rcm=ACoAABvY5gUBFGMKpKGjG8yDwIIVOWaX4xRk0_w): a full-text document search engine spanning millions of official statements, corporate actions, and financial disclosures from the municipal bond market. This capability is a cornerstone of Terrapin's upcoming municipal intelligence platform, powering the AI-based research tools their clients will use to surface insights from the vast landscape of municipal bond documentation.

### ParadeDB

ParadeDB is the simplest way to add Elastic-quality search to your Postgres. If you have a use case around large-scale search in Postgres,
we'd [love to chat](mailto:hello@paradedb.com).

---

# Learn

# pgvector Limitations

> Understand the practical limits of pgvector: index cost, filtered search performance, missing features, and operational tradeoffs.

Published: 2026-05-14 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/postgresql/pgvector-limitations

---

[pgvector](/learn/postgresql/what-is-pgvector) brings [vector similarity search](/learn/search-concepts/vector-search) into PostgreSQL with HNSW and IVFFlat indexes, distance operators, and tight integration with the rest of your relational schema. That integration is its main strength, but pgvector still has practical limits around scale, filtering, index maintenance, and retrieval features. Some limits come from running inside PostgreSQL, while others come from pgvector's index design and feature set.

Knowing these limits up front helps you size hardware, plan your index strategy, and decide which parts of the workload belong elsewhere.

## Index Build Time and Memory Cost

HNSW indexes give good recall and query latency when the working set fits in memory, but query performance degrades as the corpus grows beyond cache. Large HNSW graphs involve more random memory access, larger candidate sets, and a longer p99 tail once pages come from disk instead of RAM.

They are also slow to build and expensive to keep resident. A 10M-row table with 1,536-dimensional embeddings can take hours to index on a single core, and the resulting index is typically tens of gigabytes. For acceptable query latency the index needs to live in `shared_buffers` or the OS page cache. Once it spills to disk, p99 latency climbs sharply. [Tuning pgvector Performance](/learn/postgresql/tuning-pgvector) covers the parameters that control build cost and memory footprint.

IVFFlat indexes build faster but require a representative sample for k-means clustering. If you build the index before loading most of your data, the centroids end up misplaced and recall suffers.

## Filtered Search Performance

Combining vector search with metadata filters is one of pgvector's weakest spots. Given a query like:

```sql
-- Find the 10 most similar documents for a specific tenant
SELECT id FROM documents
WHERE tenant_id = 42
ORDER BY embedding <=> $1
LIMIT 10;
```

pgvector walks the HNSW graph and then applies the filter. If `tenant_id = 42` matches only a small fraction of rows, the index may surface candidates that all fail the filter, causing recall to drop unless the scan goes deeper. Iterative scans help by continuing through more of the HNSW graph or IVFFlat lists until enough filtered rows are found or a scan limit is reached. That improves recall, but it also increases CPU, memory use, and tail latency.

The tradeoff is workload-dependent. Moderate filters can often be handled with `hnsw.iterative_scan`, `ivfflat.iterative_scan`, and higher candidate settings. Very selective filters may still need partitioning, partial indexes, or a different retrieval plan because the index spends too much work finding rows the filter will reject. pgvector supports partial indexes and table partitioning, but it does not have the same automatic filtered-ANN planning that some dedicated vector engines provide.

## Dimensionality and Vector Size Limits

The indexable `vector` type tops out at 2,000 dimensions. The `halfvec` type (16-bit floats) extends this to 4,000 dimensions and roughly halves storage, `bit` vectors extend to 64,000 dimensions through binary quantization, and `sparsevec` stores up to 1,000 nonzero elements. Using these types usually requires choosing the storage format up front, casting in your queries, or building expression indexes for quantized representations. pgvector supports binary quantization, but it does not provide native product quantization or automatic scalar quantization like many dedicated vector engines.

Each `vector(1536)` value occupies about 6KB on disk before row overhead. At 100M rows, raw vector storage alone exceeds 600GB before any index is built.

Indexes also skip `NULL` vectors, and cosine indexes skip zero vectors. If those values can appear in your data, account for them during ingestion or recall testing.

## Update and Delete Behavior

HNSW indexes do not reclaim space when rows are deleted: tombstoned nodes stay in the graph until the index is rebuilt. Workloads with frequent re-embeddings or churn cause index bloat and a slow drop in recall. `REINDEX CONCURRENTLY` is available but expensive on large indexes.

IVFFlat indexes have a related problem. Centroids are fixed at build time, so as data drifts the clustering becomes less representative and recall falls until you rebuild.

## Missing Features

Several capabilities that come standard in dedicated vector systems are absent or require extra work in pgvector:

- **[Hybrid search](/learn/search-concepts/hybrid-search)**: no native [BM25](/learn/search-concepts/bm25). Combining [`tsvector`](/learn/search-in-postgresql/full-text-search) ranking with vector distance requires manual score fusion, typically via [Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion).
- **Reranking**: no built-in cross-encoder or learned ranker integration.
- **Multi-vector documents**: no native late-interaction document abstraction. ColBERT-style retrieval needs multiple rows, multiple vector columns, or application-side aggregation.
- **Query-time quantization**: no automatic product or scalar quantization at query time. You choose the storage type, cast explicitly, or maintain expression indexes for quantized search.
- **Embedding quality**: pgvector stores and searches vectors, but it cannot fix poor embeddings, weak chunking, or missing document context.

  Extensions like [pg_search](/learn/search-in-postgresql/bm25) add BM25 to
  PostgreSQL alongside pgvector, which is one way to build hybrid search without
  leaving the database.

## Operational Tradeoffs

Because pgvector runs inside PostgreSQL, vector workloads share CPU, memory, and I/O with your OLTP traffic. A burst of vector queries can starve transactional queries of buffer cache and parallel workers, and the reverse is also true. Vertical scaling is the primary lever; native sharding of vector data across nodes is not supported.

A dedicated replica is a common compromise. It keeps vector reads away from the primary's OLTP workload while preserving PostgreSQL as the system of record. Physical replicas usually have low lag but still receive the primary's write and WAL volume. Logical replicas can decouple more of the read workload, but they usually add more replication lag and operational complexity. Very large corpora may still belong in a separate engine.

## Summary

pgvector is the right choice when your dataset fits comfortably on one PostgreSQL instance, your filter selectivity is moderate, and you value transactional consistency with the rest of your relational data. The edges to watch are index build cost, filtered query performance, missing hybrid and reranking primitives, and the absence of horizontal scaling. Knowing where those edges sit makes it easier to decide whether pgvector is sufficient on its own, whether to pair it with [full-text search](/learn/search-concepts/full-text-search) for [hybrid retrieval](/learn/search-concepts/lexical-vs-semantic), or whether part of the workload belongs in a dedicated engine.

---

# Tuning pgvector Performance

> Learn how to optimize pgvector performance for efficient vector similarity search in PostgreSQL.

Published: 2026-05-14 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/postgresql/tuning-pgvector

---

[pgvector](/learn/postgresql/what-is-pgvector) ships with useful defaults, but [vector similarity search](/learn/search-concepts/vector-search) is expensive enough that a few well-chosen parameters make the difference between a query that returns in milliseconds and one that scans the whole table. Tuning pgvector means picking the right index, sizing it for your data, and matching the runtime knobs to the recall you actually need.

The parameters that move performance most are index type, HNSW and IVFFlat settings, memory configuration, filtered-search controls, and query-plan verification.

## Choosing an Index

Without an index, pgvector falls back to a sequential scan that computes the distance for every row. That is fine for tens of thousands of vectors and untenable beyond that. pgvector offers two approximate indexes:

- **HNSW** (Hierarchical Navigable Small Worlds) builds a layered graph. It usually gives stronger recall and query latency, at the cost of slower builds and more memory.
- **IVFFlat** clusters vectors into lists with k-means. It builds faster and uses less memory, but recall is sensitive to the data present at build time.

HNSW is the right default for most workloads. Reach for IVFFlat only when build time or memory is the binding constraint.

## Reranking Candidate Sets

Not every pgvector workload needs an ANN index. A common pattern is to use full-text search or another structured query to retrieve a small candidate set, then run exact vector scoring over those rows in memory. For example, you might take the top 100 lexical matches and rerank them by embedding distance.

This works well when another retrieval step has already narrowed the search space. It avoids HNSW or IVFFlat build cost, keeps recall exact within the candidate set, and is often simpler for [hybrid search](/learn/search-concepts/hybrid-search) pipelines. It does not replace an ANN index when the vector search itself must scan a large corpus.

## Tuning HNSW

HNSW has two build-time parameters and one query-time parameter:

```sql
CREATE INDEX ON items
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

- **`m`** (default 16): maximum connections per node. Higher values improve recall and increase index size and build time. 16 is a strong default; raise to 24 or 32 only if recall is insufficient.
- **`ef_construction`** (default 64): candidate list size during the build. Larger values improve graph quality and slow the build. 64 to 200 is a useful range.
- **`hnsw.ef_search`** (default 40): candidate list size at query time. This is the main recall/latency dial you actually turn in production.

```sql
SET hnsw.ef_search = 100;
```

Increase `ef_search` until recall meets your target, then stop. The cost grows roughly linearly with this value.

## Tuning IVFFlat

IVFFlat has one build parameter and one query parameter:

```sql
CREATE INDEX ON items
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
```

- **`lists`**: number of k-means clusters. A common starting point is `rows / 1000` up to one million rows, and `sqrt(rows)` beyond that.
- **`ivfflat.probes`** (default 1): clusters searched at query time. Higher values improve recall at proportional cost. A reasonable starting point is `sqrt(lists)`.

```sql
SET ivfflat.probes = 10;
```

  IVFFlat indexes should be built after representative data is loaded. Building
  on an empty or small table produces poor clusters that hurt recall even after
  more data arrives.

## Memory and Build Settings

Index builds are bounded by `maintenance_work_mem`. If the build does not fit, it spills and slows down sharply. For non-trivial vector tables, raise this before building:

```sql
SET maintenance_work_mem = '4GB';
SET max_parallel_maintenance_workers = 4;
```

For long builds, track progress through PostgreSQL's index build view:

```sql
SELECT phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```

At query time, performance depends on the index staying in cache. Size `shared_buffers` and OS cache so the working set of the HNSW graph stays resident. A vector index that spills to disk has a long latency tail no parameter will fix.

## Tuning Filtered Searches

Filtered vector searches are harder than unfiltered vector queries because the index may find close vectors that do not match your `WHERE` clause. If too many candidates are filtered out, the query can return fewer rows than requested or miss relevant matches.

Start by enabling iterative scans. This tells pgvector to keep scanning the index after filtering removes candidates:

```sql
SET hnsw.iterative_scan = strict_order;
SET hnsw.ef_search = 100;
```

Use `strict_order` when exact distance ordering matters. Use `relaxed_order` when you can accept slightly looser ordering for better speed. HNSW supports both modes; IVFFlat supports `relaxed_order`:

```sql
SET ivfflat.iterative_scan = relaxed_order;
SET ivfflat.probes = 10;
```

If recall is still too low, raise the candidate count before raising scan limits. For HNSW, `hnsw.ef_search` controls the candidate list size. For IVFFlat, `ivfflat.probes` controls how many lists are searched.

When you need to cap the worst-case cost, use scan limits:

```sql
SET hnsw.max_scan_tuples = 20000;
SET hnsw.scan_mem_multiplier = 2;
SET ivfflat.max_probes = 100;
```

Tune these settings against measured recall and latency. Higher values recover more filtered results, but they also increase CPU, memory, and tail latency. If a filter is extremely selective, partitioning or a partial index may work better than asking one large ANN index to scan deeper.

## Choosing the Distance Operator

Use the operator that matches how your embeddings were generated:

```sql
SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 10;  -- cosine
SELECT id FROM items ORDER BY embedding <#> $1 LIMIT 10;  -- inner product
SELECT id FROM items ORDER BY embedding <-> $1 LIMIT 10;  -- L2
```

The index must be created with the matching opclass (`vector_cosine_ops`, `vector_ip_ops`, or `vector_l2_ops`). A mismatch produces a sequential scan with no warning. For normalized embeddings (OpenAI, Cohere), cosine and inner product are equivalent; inner product is slightly faster.

## Verifying the Index Is Used

Always check with `EXPLAIN`:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM items
ORDER BY embedding <=> $1
LIMIT 10;
```

Look for an `Index Scan` on the vector index. A `Seq Scan` means the planner ignored it, usually because the operator does not match the opclass, the `LIMIT` is missing, or a `WHERE` predicate forced a different plan.

## Summary

Most pgvector performance comes from three decisions: use HNSW unless you have a reason not to, tune `ef_search` (or `probes`) to the recall you need, and give the index enough memory to stay resident. Everything else is refinement.

For the failure modes that no amount of tuning will fix, see [pgvector Limitations](/learn/postgresql/pgvector-limitations).

---

# What is pgrx?

> Learn about pgrx, the Rust framework for building PostgreSQL extensions with memory safety and modern tooling.

Published: 2026-02-24 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/postgresql/what-is-pgrx

---

PostgreSQL extensions have traditionally been written in C, which gives full access to the database's internals but requires manual memory management and careful handling of allocation contexts. A bug in a C extension can crash the entire database server. [pgrx](https://github.com/pgcentralfoundation/pgrx) is a Rust framework that changes this: you write Rust functions annotated with macros, and pgrx handles the FFI bindings, type conversions, SQL generation, and memory management. Memory errors, null pointer dereferences, and data races are caught at compile time rather than in production.

## How pgrx Works

pgrx sits between your Rust code and PostgreSQL's C extension API. When you annotate a function with `#[pg_extern]`, pgrx generates the C-compatible wrapper function, the SQL `CREATE FUNCTION` statement, and the type conversion logic at compile time.

A minimal example:

```rust
use pgrx::prelude::*;

// Tells PostgreSQL this is a loadable extension
pgrx::pg_module_magic!();

// Exposed to PostgreSQL as: CREATE FUNCTION hello(name text) RETURNS text
#[pg_extern]
fn hello(name: &str) -> String {
    format!("Hello, {}!", name)
}
```

After building and installing the extension, this function is callable from SQL:

```sql
SELECT hello('world');
-- Returns: Hello, world!
```

The `#[pg_extern]` macro handles converting PostgreSQL's `text` datum to a Rust `&str` on input and a Rust `String` back to a `text` datum on output. This type mapping extends to integers, floats, arrays, JSON, composite types, and custom types you define.

## Memory Safety Across the Boundary

PostgreSQL manages memory through a hierarchy of memory contexts: short-lived contexts for individual expressions, transaction-scoped contexts, and long-lived contexts for cached data. C extensions must allocate in the correct context and avoid holding references across context resets, or risk use-after-free bugs.

pgrx maps these contexts to Rust's ownership model. When a pgrx function returns a value, the framework allocates the result in the appropriate PostgreSQL memory context. Rust's borrow checker prevents you from holding references to data that PostgreSQL might free. If your Rust code panics, pgrx catches it at the FFI boundary and converts it into a PostgreSQL error, aborting the current transaction cleanly instead of crashing the server.

For expected error conditions, pgrx functions can return a `Result`. The `Err` variant is converted into a PostgreSQL `ERROR`, rolling back the current transaction:

```rust
#[pg_extern]
fn safe_divide(a: f64, b: f64) -> Result<f64, &'static str> {
    if b == 0.0 {
        // Becomes a PostgreSQL ERROR, rolling back the transaction
        Err("division by zero")
    } else {
        Ok(a / b)
    }
}
```

## Development Workflow

pgrx includes a CLI tool, `cargo-pgrx`, that manages the full lifecycle:

```bash
# Install the pgrx toolchain
cargo install cargo-pgrx

# Download and compile the last five PostgreSQL versions for testing
cargo pgrx init

# Create a new extension project
cargo pgrx new my_extension

# Compile, install, and open a psql session with the extension loaded
cargo pgrx run pg18

# Run tests against a real PostgreSQL instance
cargo pgrx test
```

`cargo pgrx run` compiles the extension, installs it into a local PostgreSQL instance, and drops you into a `psql` session where the extension is already loaded. `cargo pgrx test` runs your test suite against a real PostgreSQL instance, not a mock, so tests exercise the actual extension behavior including SQL generation and type conversions.

```rust
#[cfg(any(test, feature = "pg_test"))]
#[pg_schema]
mod tests {
    use pgrx::prelude::*;

    #[pg_test]
    fn test_hello() {
        // Runs inside a real PostgreSQL transaction
        let result = Spi::get_one::<String>("SELECT hello('world')");
        assert_eq!(result, Ok(Some("Hello, world!".to_string())));
    }
}
```

  pgrx tests start a real PostgreSQL instance and execute queries against it.
  This catches issues that unit tests with mocked interfaces would miss, like
  incorrect SQL generation or type conversion errors.

## Beyond Simple Functions

pgrx supports the full range of PostgreSQL extension capabilities:

- **Custom types** via `#[derive(PostgresType)]`, which generates the input/output functions and type definitions
- **Custom operators** via `#[pg_operator]`, letting you define new SQL operators backed by Rust functions
- **Set-returning functions** that yield rows one at a time using Rust iterators
- **SPI access** for executing SQL queries from within extension functions
- **Aggregate functions** with custom state types and transition logic
- **Background workers** for long-running processes managed by PostgreSQL

This coverage means most extensions that could be written in C can be written in pgrx instead, with the same level of PostgreSQL integration.

## Extensions Built with pgrx

Several production extensions use pgrx:

- [ParadeDB](https://github.com/paradedb/paradedb): search and analytics extension for PostgreSQL, built on Tantivy
- [pgrag](https://github.com/neondatabase/pgrag): RAG pipeline extensions from Neon
- [pg_graphql](https://github.com/supabase/pg_graphql): GraphQL query engine embedded in PostgreSQL, from Supabase
- [pg_jsonschema](https://github.com/supabase/pg_jsonschema): JSON Schema validation as a PostgreSQL function

pgrx has also been adopted by major cloud and data platforms, including Microsoft, Amazon, Databricks, and Snowflake.

## History and Governance

pgrx was created by [Eric Ridge](https://github.com/eeeebbbbrrrr), who had been building PostgreSQL extensions in C since version 8.0. The framework grew out of his work on [ZomboDB](https://github.com/zombodb/zombodb), an Elasticsearch-backed indexing extension (now deprecated) — after years of writing C extensions, he built pgrx to present PostgreSQL's internals through Rust's idioms instead. The project was originally named "pgx" and [renamed to pgrx](https://github.com/pgcentralfoundation/pgrx/issues/1106) in April 2023.

The project lives under the [PgCentral Foundation](https://pgcentral.org/), a 501(c)(3) nonprofit, though day-to-day development is still led by Ridge and other core maintainers.

## When to Use pgrx

pgrx is a good fit when you need to extend PostgreSQL with logic that benefits from Rust's performance or safety characteristics. Compute-heavy functions, custom index types, integrations with Rust libraries, and any extension where a crash would be unacceptable are natural candidates. You may also choose to use PGRX when you benefit from Rust's ecosystem, one of the reasons ParadeDB choose Rust was because the amazing Tantivy library existed to help with full-text search.

For simpler logic (data validation, lightweight transformations, glue code), PL/pgSQL may be sufficient and easier to deploy, since it doesn't require compiling native code.

## Summary

pgrx lets you write PostgreSQL extensions in Rust instead of C, providing memory safety, automatic type conversion, SQL generation, and an integrated test workflow. It covers the full PostgreSQL extension API (functions, types, operators, aggregates, SPI, and background workers) while keeping the compile-time safety guarantees that make Rust extensions more reliable than their C equivalents. If your extension needs to be both safe and fast, pgrx is how you get there.

---

# What is pgvector?

> Learn about pgvector, the PostgreSQL extension that adds vector data types and similarity search for embeddings and AI workloads.

Published: 2026-05-14 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/postgresql/what-is-pgvector

---

Modern AI applications produce embeddings: fixed-length arrays of floating-point numbers called vectors that represent the semantic content of text, images, or audio. Similar inputs map to vectors that sit near each other in this space, so retrieval becomes a [vector search](/learn/search-concepts/vector-search) problem rather than a keyword match. For example, a vector search for "reset my password" can match a document about "account recovery" even if the exact words do not overlap, unlike [lexical search](/learn/search-concepts/lexical-vs-semantic), which matches terms directly.

[pgvector](https://github.com/pgvector/pgvector) is a PostgreSQL extension that stores and searches those embeddings with a native `vector` data type, distance operators, and approximate nearest neighbor indexes. It does not create embeddings itself; you generate them with an embedding model, then insert the resulting vectors into PostgreSQL.

## How pgvector Works

pgvector registers a new column type with PostgreSQL. You declare the dimensionality up front, insert vectors as arrays, and query them with distance operators:

```sql
CREATE EXTENSION vector;

CREATE TABLE documents (
    id      bigserial PRIMARY KEY,
    content text,
    -- A 1536-dimensional embedding
    embedding vector(1536)
);

INSERT INTO documents (content, embedding)
VALUES ('PostgreSQL is a relational database', '[0.12, -0.04, ...]');
```

Four distance operators cover the common similarity measures. The right one depends on how your embeddings were generated:

- `<=>` for cosine distance, which compares vector direction and is the usual choice for normalized text embeddings
- `<#>` for negative inner product, which favors larger dot products and is often equivalent to cosine distance when vectors are normalized
- `<->` for L2 (Euclidean) distance, which compares raw geometric distance and is useful when magnitude carries meaning
- `<+>` for L1 (taxicab) distance, which sums absolute coordinate differences and is less common for modern text embeddings

A nearest-neighbor query is an `ORDER BY` on one of these operators with a `LIMIT`:

```sql
SELECT id, content
FROM documents
ORDER BY embedding <=> '[0.10, -0.02, ...]'::vector  -- input embedding
LIMIT 5;
```

Because the result is just rows from a regular table, you can combine vector search with `WHERE` filters, joins, transactions, and any other SQL feature in the same query. The operator in the query must match the operator class on the index, or PostgreSQL cannot use that index for nearest-neighbor search.

## Indexes for Approximate Nearest Neighbor Search

Without an index, pgvector runs an exact scan: every row is compared against the query vector. That is accurate but scales linearly with table size. For larger tables, pgvector provides two approximate nearest neighbor (ANN) index types that trade a small amount of recall for much faster queries.

**IVFFlat** partitions vectors into lists using k-means clustering. At query time, only the most relevant lists are scanned:

```sql
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

In pgvector, IVFFlat should be built after data is loaded, since its clusters depend on the existing vector distribution.

**HNSW** builds a multi-layer graph where each node links to nearby neighbors. Searches walk the graph from coarse to fine layers:

```sql
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

HNSW indexes are slower and more memory-intensive to build than IVFFlat, but they typically deliver better recall-versus-latency tradeoffs and can be built incrementally as rows arrive.

Each index type supports separate operator classes for L2, inner product, cosine, and L1 distance. The operator class in the index must match the operator used in queries, or PostgreSQL will fall back to a sequential scan. For a deeper walk-through of parameter selection and workload tuning, see [Tuning pgvector Performance](/learn/postgresql/tuning-pgvector).

  Recall in pgvector is tunable at query time. For IVFFlat, set
  `ivfflat.probes`; for HNSW, set `hnsw.ef_search`. Higher values inspect more
  candidates and improve recall at the cost of latency.

## Filtered Vector Search

Real applications rarely search the entire corpus. You usually want the nearest vectors that also match some metadata filter, such as a tenant, category, or date range:

```sql
SELECT id, content
FROM documents
WHERE tenant_id = 42
  AND created_at > now() - interval '30 days'
ORDER BY embedding <=> $1
LIMIT 10;
```

With approximate indexes, pgvector applies filters after the index scan chooses candidate rows. If a filter removes too many candidates, the query can return fewer than `LIMIT` rows.

Recent pgvector versions support iterative index scans through `hnsw.iterative_scan` and `ivfflat.iterative_scan`. When enabled, pgvector keeps scanning more of the HNSW graph or IVFFlat lists until it finds enough filtered rows or reaches a configured scan limit. This improves filtered vector search without requiring a separate query pattern, but it still trades more work for better recall. The [limitations](/learn/postgresql/pgvector-limitations) guide covers the tradeoffs in more detail.

## Storage and Quantization

A single 1536-dimensional `vector` column stores 6 KB per row at full precision. For very large tables, pgvector supports lower-precision types that reduce storage and memory usage:

- `halfvec` stores each component as a 16-bit float, up to 4,000 dimensions
- `bit` stores binary vectors up to 64,000 dimensions for Hamming and Jaccard distance
- `sparsevec` stores up to 1,000 nonzero elements, useful for sparse embeddings

Both IVFFlat and HNSW can be built on these lower-precision or sparse types, so you can fit larger indexes in memory while keeping the same nearest-neighbor query shape. For very large indexes, binary quantization plus reranking can reduce the working set while preserving final result quality.

## When to Use pgvector

pgvector is a strong fit when your embeddings live alongside relational data you already store in PostgreSQL. Keeping vectors in the same database avoids dual-write problems, lets you join across tables, and reuses existing backup, replication, and access control. It is widely supported by managed Postgres providers and often used as the retrieval layer in [RAG](/learn/search-concepts/retrieval-augmented-generation) systems.

Because vectors capture semantic meaning but miss exact terms like product codes or filenames, most production retrieval pipelines pair pgvector with a [full-text search](/learn/search-in-postgresql/full-text-search) index and combine the two results with [hybrid search](/learn/search-concepts/hybrid-search). The [lexical vs. semantic](/learn/search-concepts/lexical-vs-semantic) article covers the tradeoffs in more detail.

A dedicated vector database may make more sense when you need billions of vectors, very high write throughput on the vector path alone, or features such as native multi-vector ranking and learned indexes that pgvector does not yet provide.

## Summary

pgvector turns PostgreSQL into a capable vector database by adding a `vector` type, distance operators, and IVFFlat and HNSW indexes for approximate nearest neighbor search. It integrates cleanly with the rest of SQL, so embeddings can be filtered, joined, and transacted on like any other column. For many applications that combine structured data with AI-generated embeddings, pgvector provides production-quality similarity search without running a separate system.

---

# What is Block WAND?

> Learn about Block-Max WAND, the dynamic pruning algorithm that accelerates top-k document retrieval in search engines.

Published: 2026-02-27 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/block-wand

---

Block-Max WAND (BMW) is a dynamic pruning algorithm that accelerates Top K document retrieval in search engines, the kind of query you express in SQL as `ORDER BY score DESC LIMIT k`. It works with scoring functions like [BM25](/learn/search-concepts/bm25) by dividing inverted index posting lists into fixed-size blocks, precomputing the maximum possible score within each block, and using those scores to skip entire blocks of documents that cannot appear in the final results. In most implementations, the algorithm returns exactly the same results as exhaustive scoring while doing a fraction of the work.

The technique was introduced by [Shuai Ding and Torsten Suel in 2011](https://dl.acm.org/doi/10.1145/2009916.2010048) and has since been adopted by [Apache Lucene](https://github.com/apache/lucene), [Tantivy](https://github.com/quickwit-oss/tantivy), and other search engines as a core query optimization.

## From WAND to Block WAND

Block WAND builds on two earlier pruning techniques:

- **MaxScore** ([Turtle & Flood, 1995](https://www.sciencedirect.com/science/article/abs/pii/030645739500020H)) stores a single global maximum score per term across the entire posting list. It uses this to classify terms as "essential" or "non-essential" and skip documents that cannot beat the current threshold.
- **WAND** ([Broder et al., 2003](https://dl.acm.org/doi/10.1145/956863.956944)) introduces a pivot mechanism that sorts term pointers by document ID and calculates whether the combined upper-bound scores of aligned terms can exceed the threshold. If not, it advances past those documents without scoring them.

Both techniques rely on a single global upper bound per term: the highest score that term produces across the entire index. This bound can be inflated by a single outlier document, making it loose and limiting how aggressively documents can be skipped.

Block WAND is not a different algorithm; it is WAND with finer-grained upper bounds. Instead of one maximum score for a posting list of millions of documents, it records the maximum score for each block of typically 128 documents. These tighter bounds plug directly into WAND's existing pivot mechanism, allowing far more aggressive pruning.

## How Block WAND Works

### Index Structure

During indexing, each posting list is divided into fixed-size blocks. For each block, the index stores:

- The **block-max score**: the highest score any document in this block can contribute for this term
- The **maximum document ID** in the block

These values are stored alongside the compressed posting data with minimal overhead, typically 5-8% additional index size.

### Query Processing

When you execute a query for the Top K results, the algorithm maintains a result heap and a dynamic threshold: the minimum score needed to enter the current Top K. Processing follows these steps:

1. **Sort term pointers** by their current document ID, smallest first.
2. **Find the pivot document.** Walk through the sorted pointers, accumulating block-max scores until their sum exceeds the threshold. The document where the sum first crosses the threshold is the pivot, the earliest document that could possibly score high enough to enter the Top K.
3. **Check block bounds**: for each term's posting list, look up the block-max score for the block containing the pivot document. If a term's block-max is too low to help the pivot exceed the threshold, skip that term's iterator ahead to its next block. Different terms may skip different blocks.
4. **Score the document**: if the combined block-max bounds survive the check, fully score the pivot document. If it exceeds the threshold, insert it into the Top K heap and raise the threshold.
5. **Repeat** until all posting lists are exhausted.

As higher-scoring documents are found, the threshold rises, and pruning becomes more aggressive; the algorithm accelerates as it runs.

### Why Per-Block Bounds Matter

Consider a common term like "the" with a posting list spanning millions of documents. Its global maximum score might be 3.5 (from one short document where "the" appears disproportionately often). In most blocks of 128 documents, the actual maximum contribution is much lower, perhaps 0.8. Using 0.8 instead of 3.5 as the upper bound allows the algorithm to skip far more blocks.

This effect compounds across multiple query terms. When every term has a tighter bound, the combined upper-bound estimate drops substantially, and entire regions of the index are eliminated without decompressing a single document ID.

## Example: Block WAND in Action

Suppose you search for **"machine learning"** and want the top 10 results. The index has separate posting lists for each term, and each posting list is divided into blocks. The algorithm walks both lists in parallel, checking whether each block pair can produce a combined score above the current threshold.

```text
threshold = 0.0

Block 1 (docs 1–128):
  "machine" block-max = 4.1, "learning" block-max = 3.8, sum = 7.9
  -> sum > threshold, score documents
  -> threshold rises to 5.3

Block 2 (docs 129–256):
  "machine" block-max = 1.2, "learning" block-max = 2.4, sum = 3.6
  -> sum < threshold (3.6 < 5.3), skip entire block

Block 3 (docs 257–384):
  "machine" block-max = 3.9, "learning" block-max = 4.5, sum = 8.4
  -> sum > threshold, score documents
  -> threshold rises to 6.8

Block 4 (docs 385–512):
  "machine" block-max = 0.9, "learning" block-max = 1.8, sum = 2.7
  -> sum < threshold (2.7 < 6.8), skip entire block
```

Notice that each term has its own block-max score, and the algorithm sums them to decide whether a block is worth scoring. As the threshold rises, more blocks are pruned.

## Performance

Benchmarks across academic and industry implementations show consistent results:

- **3x to 7x speedup** on standard term queries compared to exhaustive scoring ([Grand et al., ECIR 2020](https://link.springer.com/chapter/10.1007/978-3-030-45442-5_3))
- In practice, Block WAND typically scores only 10-40% of matching documents, with the rest pruned at the block level
- Greatest impact on **disjunctive (OR) queries** with high-frequency terms, where the most documents can be skipped
- Benefits scale with collection size; larger indexes see proportionally more pruning

The algorithm only applies to Top K queries by score, not exhaustive counts or Top K on fast fields, since it relies on a threshold from the result heap to make pruning decisions. The speedup is greatest when queries contain common terms with long posting lists, where most blocks can be skipped.

## Scoring Constraints

Block WAND's pruning is safe only when the scoring function is additive across terms and each term's contribution is non-negative. This holds for [BM25](/learn/search-concepts/bm25), TF-IDF, and most classical retrieval functions. Scoring models that produce negative term contributions or non-additive scores (such as some neural rerankers) cannot use Block WAND for safe pruning, since a negative contribution could make a skipped document score higher than the upper bound suggests.

## Adoption in Search Engines

- **Apache Lucene 8.0** (2019) adopted Block WAND for Top K disjunctive queries, bringing the optimization to [Elasticsearch](https://github.com/elastic/elasticsearch) and [Solr](https://github.com/apache/solr)
- **[Tantivy](https://github.com/quickwit-oss/tantivy)** implements Block WAND for boolean queries with `should` clauses, inherited by [Quickwit](https://github.com/quickwit-oss/quickwit) and [ParadeDB](https://github.com/paradedb/paradedb). Tantivy's block-max values are not always exact: in rare cases a block-max may underestimate the true maximum, which can cause a marginally lower-scoring document to be returned instead of the true top result

## Summary

Block WAND accelerates Top K retrieval by replacing loose global score bounds with tight per-block bounds, allowing the search engine to skip large portions of the index without sacrificing result quality. It processes the same query with a fraction of the scoring work, and the speedup grows as the threshold tightens during execution. For search engines handling disjunctive queries over large collections, it is one of the most effective optimizations available.

---

# What is BM25?

> Learn about BM25, the ranking algorithm that powers modern search engines.

Published: 2024-10-05 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/bm25

---

[Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a **ranking function** used in search engines to score documents by relevance to a query. The BM is short for `best matching`, and `25` is the version of the function that worked best in testing. It improves on earlier methods like [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) by accounting for document length and diminishing returns from repeated terms. It powers Apache Lucene-based search systems like [Elasticsearch](https://github.com/elastic/elasticsearch) and [Solr](https://github.com/apache/solr), and Tantivy-based search systems like [ParadeDB](https://github.com/paradedb/paradedb).

Even though it's over 30 years old, BM25 remains the default ranking algorithm in most search engines because it’s simple, explainable, and performs well in practice.

## How BM25 Works

It's important to note that BM25 is what’s known as a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) retrieval function. It doesn’t look at term order, phrase structure, or proximity: only at which words appear and how frequently they occur.

BM25 starts by breaking the search query into individual terms. For each term, it scores how well every document matches based on three main signals:

- **Term frequency (TF):** documents that mention a query term more often score higher.
- **Inverse document frequency (IDF):** rare terms are weighted more than common ones.
- **Document length normalization:** shorter documents are preferred to long ones that mention the term in passing.

Each term contributes its own score, and BM25 sums those scores to produce an overall relevance value for the document.

In simplified form, BM25 can be thought of as:

$$
\text{score} = \sum \text{IDF} \times \text{adjusted\_term\_frequency}
$$

where the adjusted term frequency reduces the impact of very frequent terms and normalizes for document length.

  The full [BM25
  formula](https://en.wikipedia.org/wiki/Okapi_BM25#The_ranking_function) is
  slightly more complex. It includes tunable parameters $$k₁$$ (how quickly term
  frequency saturates) and $$b$$ (how strongly document length is normalized).
  These controls make BM25 adaptable across different datasets and document
  types.

## Why BM25 Works Well

BM25 has three key strengths which have allowed it to stay relevant for so long:

- **Simplicity:** easy to understand and implement.
- **Efficiency:** fast enough to run in real-time over large datasets.
- **Explainability:** each part of the score can be traced to a clear factor.

Because of these traits, BM25 remains the baseline for relevance in modern search systems.

## When to Use BM25

BM25 excels in retrieval workloads where specific keywords carry significant information. For example:

- Searching for a brand name in a product catalog
- Looking up a stock ticker in a trading app
- Matching a merchant name in a credit card transaction ledger
- Finding names in a legal document
- Finding a diagnosis code in a medical report
- Retrieving files to feed into an LLM in a RAG application

Because BM25 relies on simple term statistics, it is accurate, consistent and extremely fast. That makes it a popular choice in applications where low query latency and keyword relevance are critical.

## Example: Scoring a Query with BM25

Imagine a user searches for **"inverted index"**.

BM25 first breaks the query into two terms: **“inverted”** and **“index.”**

Then, for each document, it scores how well those terms match:

| **Document** | **Length** | **Term counts**            | **TF–IDF signals**                                          | **Relative Score** |
| ------------ | ---------- | -------------------------- | ----------------------------------------------------------- | ------------------ |
| **Doc A**    | 60 words   | inverted: 1<br />index: 1  | rare → high IDF<br />common → low IDF                       | **Higher**         |
| **Doc B**    | 200 words  | inverted: 1<br />index: 10 | rare → high IDF<br />common → low IDF<br />+ length penalty | **Lower**          |

Even though _Doc B_ repeats “index” more times, BM25 favors _Doc A_ because:

1. “inverted” is a rare term with higher IDF,
2. both documents mention it once (similar TF), and
3. _Doc A_ is shorter, so its matches are more concentrated.

**BM25 rewards focused, relevant mentions of rare terms, not repetition in long text.**

## Summary

BM25 remains the standard ranking algorithm for keyword search.

It combines frequency, rarity, and length normalization into a single scoring model that quickly produces balanced and predictable results.

It’s the foundation that modern lexical hybrid search systems continue to build on.

---

# What is Faceted Search?

> Learn about faceting and faceted search, how they relate to aggregations, and when to use them.

Published: 2024-10-31 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/faceting

---

**Faceted search**, also known as faceting, is a [search technique](https://en.wikipedia.org/wiki/Faceted_search) that organizes [full-text search](/learn/search-concepts/full-text-search) results into filterable categories, known as _facets_. Each facet represents a property of the data (such as color, brand, or price range) and lets users narrow down large result sets by selecting multiple values simultaneously. It’s the foundation of interactive filtering in e-commerce, document search, and analytics platforms.

Faceting is an interesting problem because while it can in theory be computed client-side or across multiple queries, search responses typically need to be fast. That means facets are usually calculated efficiently within the same query that produces the search results in a single index pass.

## What is a Facet?

Mathematically, a facet is one of the faces of a many-sided object. In computer science, facets describe how information about an object is organized and grouped by properties or attributes. A facet refers to a category or dimension of data (for example, color), while a facet value represents a specific option within that category (for example, brown or blue).

Each facet value is adjoined with an integer denoting the number of filtered items that belong to that value. For example, a facet might be represented as the following:

```jsx
brown(10);
blue(7);
white(4);
```

Facets are additive: a user might check multiple facet values (e.g. `brown shoes` and `cheap shoes`) to cut down search results. Facets can categorize sets of values (e.g. `brown`, `blue`, `white` etc.) and also buckets of ranges (e.g. `$0 to $99`, `$100 to $999`, `+$1,000` etc.).

In practice, facets travel with Top K searches. Showing facet counts alongside the first N results lets users refine immediately, understand distribution at a glance, and avoid extra round-trips. Executing ranking and aggregation in the same pass keeps latency low and proportional to matches, not the total document set.

## Faceted Search Implementations

At a small scale, faceted search can be computed client-side or with simple SQL queries. For example, a web application could fetch all matching records and compute facet counts in JavaScript, or run a few GROUP BY queries in SQL. But both approaches fall apart as datasets grow.

A **client-side implementation** requires transferring every matching record to the client so counts can be calculated locally. Even a modest search returning tens of thousands of results would quickly become hard to work with.

A **naïve SQL implementation** would need multiple passes over the same data, first running the search query and then running the faceting on the full result set before returning the top K records.

In production systems, faceting is usually performed in tandem with the search query itself. Rather than running separate aggregation queries, search engines compute facet counts during the same index scan that retrieves matching documents. Search databases and search engines such as Elasticsearch, Solr, and ParadeDB do this by maintaining auxiliary data structures (often columnar or bitmap-based) that allow category counts to be aggregated without rereading all documents. Each facet value is mapped to a compact integer ordinal or bitset, so counts can be computed in memory as part of the query execution path.

Faceted search also operates over the entire result set even when the current query is limited. Even if only a handful of documents are displayed, facet counts must reflect all matching documents. Because it’s impossible to predict what combinations of filters users will apply, facets can’t be precomputed in advance. Instead, indexes are organized so that these aggregations can be calculated efficiently (often in logarithmic time) during a single index pass.

## Faceted Search Examples

In Elasticsearch users can perform faceting alongside a search query with the [aggregations API](https://www.elastic.co/docs/reference/aggregations):

```json
POST /products/_search
{
  "size": 10,
  "query": {
    "multi_match": {
      "query": "wireless headphones",
      "fields": ["title", "description"]
    }
  },
  "aggs": {
    "price_histogram": {
      "histogram": {
        "field": "price",
        "interval": 50
      }
    }
  }
}
```

In ParadeDB the same thing can be accomplished using the `pdb.agg` function as a window.

```sql
SELECT
  *,
  pdb.score(id) AS bm25_score,
  pdb.agg('{"histogram": { "field": price, "interval": 50 }}') OVER () AS facets
FROM products
WHERE
  title       ||| 'wireless headphones' OR
  description ||| 'wireless headphones'
ORDER BY bm25_score
LIMIT 10
```

## What are Common Use Cases for Faceting?

Faceting becomes essential whenever users need to explore and filter large datasets while maintaining search relevance. The key scenarios fall into several categories:

### E-commerce and Product Discovery

E-commerce platforms like Amazon and Wayfair are the most visible examples, but the pattern extends beyond retail. Users searching through thousands of products need to quickly narrow results by brand, price range, customer ratings, and availability. Faceting allows exploratory browsing—users can start with a broad search like "laptop" and progressively refine by processor type, screen size, and budget without losing the relevance ranking of their original query.

### Content and Knowledge Management

Organizations with large content repositories face similar challenges. A support agent searching 50,000 help tickets needs to filter by product area, severity level, and date range while maintaining the relevance of their keyword search. Documentation platforms like Confluence allow users to find relevant articles while filtering by team ownership, document type, or last modified date. Without faceting, users would either get overwhelming result sets or lose important context by switching to pure categorical browsing.

### Analytics and Data Exploration

Analytics platforms like PostHog or Mixpanel enable users to search and filter event data across multiple dimensions simultaneously. A product manager investigating user behavior can search for specific actions while faceting by device type, geographic region, and session length. The combination of search relevance with dimensional filtering allows for both targeted investigation and serendipitous discovery.

### Specialized Search Applications

Recruiting platforms like LinkedIn demonstrate faceting in talent search, where recruiters need to find candidates matching job descriptions while filtering by location, experience level, and specific skills. Legal research platforms allow lawyers to search case law while filtering by jurisdiction, date range, and legal area. Real estate platforms let buyers search for properties while filtering by price, location, and amenities.

In all these cases, faceting solves the fundamental problem of information overload: how to maintain the power of full-text search while giving users intuitive ways to navigate large result sets.

## Summary

Faceted search organizes search results into filterable categories called facets, allowing users to narrow down results by selecting specific attributes like color, price, or other characteristics.

Faceted search is natively available in search-first solutions like ParadeDB or Elasticsearch.

---

# What is Full-Text Search?

> Learn about full-text search capabilities, how they work, and when to use them in your applications.

Published: 2024-10-06 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/full-text-search

---

[Full-text search](https://en.wikipedia.org/wiki/Full-text_search) (FTS) is designed to find documents based on the words they contain rather than just structured fields or metadata. It brings intelligence to text retrieval by analyzing and indexing language, allowing queries to return results ranked by relevance instead of simple matches.

This capability transforms a datastore into a search engine. Rather than forcing users to remember exact values or field names, full-text search lets them describe what they're looking for in natural terms, and still find it.

Full-text search operates through two core components: **indexing** and **querying**. Indexing transforms raw text into structured, searchable data, while querying provides flexible ways to retrieve and rank relevant documents.

## Indexing

Indexing is the foundation of full-text search performance. Instead of storing entire documents as opaque blocks that must be scanned linearly, indexing decomposes them into smaller, searchable parts and rebuilds them into structures optimized for retrieval.

### Text Analysis Pipeline

The indexing process begins with text analysis, where raw text is transformed into a form that computers can index efficiently. The pipeline typically includes:

1. **Tokenization:** Splitting text into discrete terms or tokens, such as words or phrases.
2. **Normalization:** Lowercasing terms, removing punctuation and stop words, and often applying stemming so that `running`, `ran`, and `run` are treated as related.
3. **Indexing:** Storing each token alongside the documents that contain it in a structure called an inverted index.

  Systems like [Elasticsearch](https://github.com/elastic/elasticsearch),
  [Solr](https://github.com/apache/solr),
  [ParadeDB](https://github.com/paradedb/paradedb), and [PostgreSQL
  ts_vector](https://www.postgresql.org/docs/current/datatype-textsearch.html)
  manage this entire pipeline automatically, allowing developers to add
  high-quality search to their applications without reinventing the underlying
  mechanics.

### Inverted Indexes

The foundation of full-text search is the [inverted index](https://en.wikipedia.org/wiki/Inverted_index), a data structure that enables very fast lookups by reversing the usual document-to-terms relationship.

Instead of asking _"what terms appear in this document?"_, an inverted index can be used to answer _"which documents contain this term?"_

The core structure is a dictionary where each term maps to a postings list, describing every document that contains the term.

Each entry in a postings list typically includes:

- The **document ID**
- The **term frequency** (how often the term appears in that document)
- Optionally, the **positions** where the term occurs (enabling phrase and proximity queries)

#### Example: Building an Inverted Index

Consider indexing two simple documents:

| **ID** | **Text**                     |
| ------ | ---------------------------- |
| 1      | "PostgreSQL supports search" |
| 2      | "Search engines use indexes" |

After tokenization and normalization, the inverted index looks like this (using an [english language tokenizer](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/simple) with no stemming):

| **Term**   | **Postings List**                          |
| ---------- | ------------------------------------------ |
| postgresql | `[(1, tf=1, pos=[0])]`                     |
| supports   | `[(1, tf=1, pos=[1])]`                     |
| search     | `[(1, tf=1, pos=[2]), (2, tf=1, pos=[0])]` |
| engines    | `[(2, tf=1, pos=[1])]`                     |
| indexes    | `[(2, tf=1, pos=[3])]`                     |

When you search for "search", the system looks up the term and finds documents `[1, 2]` without scanning any text.

#### Performance Benefits

Inverted indexes provide:

- **Fast retrieval:** $$O(1)$$ term lookup followed by efficient postings list intersection
- **Space efficiency:** Each unique stemmed term stored only once, regardless of corpus size
- **Flexible scoring:** Rich statistics support advanced ranking algorithms

Modern implementations include optimizations like compression, skip lists, and index partitioning to handle large-scale deployments efficiently.

## Querying

Once text is indexed, querying provides flexible ways to retrieve and rank relevant documents. When a user submits a query, the same analysis steps used during indexing are applied to their search terms, which are then looked up in the index.

Modern search engines provide powerful query builder APIs that allow developers to programmatically construct complex search queries and integrate business logic directly into their search
operations. This pushes as much work as possible down into the query engine (where it can happen the most efficiently), so the client doesn’t have to post-filter or reprocess results.

### Query Types and Processing

Different query types use inverted indexes in different ways:

- **Boolean queries** ("PostgreSQL AND search") use set intersection of postings lists to find documents that match logical conditions.
- **Phrase queries** use position information to ensure terms appear consecutively or in proximity to each other. For example, you could search for an exact match of "search engines" or for when "search" is within five words of "indexes".
- **Ranked queries** apply algorithms like [BM25](/learn/search-concepts/bm25) using term frequency and document frequency statistics to give the most relevant results first.

### Advanced Query Features

Full-text search provides much more than the ability to find exact matches. It introduces a set of features that make search both flexible and forgiving:

- **Fuzzy matching** corrects small errors and typos, so a query for `databse` still returns `database`. {/* * codespell:ignore databse * */}
- **Boolean operators** give power users fine-grained control with logic like `python AND (api OR web)`, filtering results to match complex conditions.
- **Field weighting** acknowledges that some parts of a document matter more than others—a match in a title often carries more importance than one in the body text.
- **Proximity queries** find words that are located close to other words (`dogs within 5 words of cats`).
- **Faceting** enables users to filter search results by categories or attributes (like "brand," "price range," or "date") while maintaining text-based relevance scoring.

## When Full-Text Search Shines

Full-text search becomes essential whenever large volumes of unstructured or semi-structured text need to be queried quickly and intuitively. It thrives in scenarios where users think in terms of language, not schema.

Some common examples include:

### Content and Communication Search

- **Content discovery:** Searching across articles, blog posts, or documentation pages
- **Communication search:** Navigating chat histories, emails, or support tickets
- **E-commerce:** Product catalogs where users search by description, features, or brand names

### Technical Search

- **Code search:** Finding functions, variables, or patterns across codebases
- **Log search:** Analyzing application logs, error messages, and system events for debugging and monitoring

### AI and Machine Learning

- **Retrieval Augmented Generation (RAG):** Powering AI systems that need to find relevant context from knowledge bases before generating responses

In all of these workloads, users aren't just filtering, they're searching. They know what they want to find conceptually but not exactly where it's stored. Full-text search bridges that gap by turning text into a structured, ranked representation of meaning.

## When Full-Text Search Is Not Enough

While full-text search excels at lexical matching, it has limitations when users search by meaning rather than keywords. Full-text search uses these methods to capture linguistic variation, but its focus remains lexical: it matches words, not meanings.

Consider these scenarios where traditional full-text search struggles:

- **Semantic similarity:** A search for "automobile" won't match documents about "cars" unless explicit synonyms are configured
- **Cross-language search:** Keywords in one language won't match semantically equivalent content in another language
- **Domain-specific terminology:** Technical concepts may be expressed differently across documents but carry the same meaning

For these use cases, **vector search** provides a complementary approach. Vector search represents text as high-dimensional vectors that capture semantic meaning, allowing searches based on conceptual similarity rather than exact keyword matches.

Many modern search systems use **hybrid approaches** that combine both full-text and vector search:

- Full-text search for precise keyword matching and boolean logic
- Vector search for semantic similarity and concept-based retrieval
- Ranking algorithms that blend both approaches for optimal relevance

## Summary

Full-text search brings lexical structure to unstructured data. By analyzing, indexing, and ranking text, it makes information findable through natural language rather than rigid filters or exact matches.

It remains one of the most effective tools for large-scale text retrieval: fast, reliable, and deeply optimized for keyword relevance. Its success lies in bridging how people think about information with how computers store it.

---

# What is Hybrid Search?

> Learn about hybrid search systems that combine lexical and semantic search for comprehensive results.

Published: 2024-10-06 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/hybrid-search

---

Hybrid search combines [full-text search](/learn/search-concepts/full-text-search) and [vector search](/learn/search-concepts/vector-search) to deliver both precise keyword matching and semantic understanding in a single query. This approach solves the fundamental trade-off between finding exact terms and discovering conceptually related content, making it particularly valuable for applications where users express queries in diverse ways but expect comprehensive, relevant results.

## Why Combine Search Methods?

Each search approach has distinct strengths and weaknesses that make combination valuable:

- [Full-text search](/learn/search-concepts/full-text-search) relies on exact keyword matching and struggles when users express concepts differently than documents. A search for "automobile" won't find documents about "cars" unless explicit synonyms are configured. Technical terms, product codes, and precise specifications work well, but natural language queries often fail to match semantically related content.
- [Vector search](/learn/search-concepts/vector-search) understands semantic relationships but may miss critical exact matches. Searching for a specific product model like "iPhone 15 Pro" might return results about "premium smartphones" that are semantically similar but factually incorrect. Product codes, technical specifications, and precise terminology can be overlooked in favor of conceptually related but wrong results.

## Combining Rankings

The core challenge in hybrid search is combining scores from different ranking algorithms. [Full-text search](/learn/search-concepts/full-text-search) systems like [BM25](/learn/search-concepts/bm25) produce relevance scores based on term frequency and document statistics, while [vector search](/learn/search-concepts/vector-search) generates similarity scores from mathematical distance calculations. These scores operate on different scales and can't be directly combined.

[Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion) (RRF) is one way of solving this problem. Instead of trying to normalize different scoring systems, RRF assigns scores based on where documents appear in each ranked list. Documents that rank highly in both search methods receive the highest combined scores.

  For a full discussion of RRF, including why it works, how it can be weighted,
  and a working SQL example, see our [Reciprocal Rank
  Fusion](/learn/search-concepts/reciprocal-rank-fusion) article.

## Query Processing in Hybrid Systems

Hybrid search systems process queries through processing pipelines that can execute simultaneously:

1. **Text processing:** The query undergoes traditional text analysis (tokenization, stemming, and stopword removal) before matching against the inverted index using algorithms like [BM25](/learn/search-concepts/bm25)
2. **Vector generation:** The same query is converted into a dense vector embedding using machine learning models, then compared against document embeddings using similarity metrics like cosine distance
3. **Result fusion:** Both result sets are combined using techniques like [Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion) to produce a unified ranking

This parallel processing allows hybrid systems to capture different aspects of relevance. A query for "wireless headphones with noise cancellation" will match documents containing those exact terms while also finding content about "Bluetooth earbuds with ANC" that shares semantic meaning but uses different terminology. {/* * codespell:ignore anc * */}

## Implementation Considerations

Building effective hybrid search requires careful attention to several technical factors:

- **Index synchronization** ensures that text and vector indexes remain consistent as documents are added, updated, or deleted. Inconsistencies between indexes can cause confusing results where documents appear in one search method but not the other.
- **Embedding model selection** significantly impacts semantic search quality. Models trained on domain-specific data typically outperform general-purpose models for specialized applications. A medical search system benefits from embeddings trained on medical literature rather than general web content.
- **Fusion weight tuning** requires experimentation with real user queries and feedback. The relative weighting between text and vector search depends on user behavior. Technical users might prefer exact keyword matching, while casual users benefit from semantic flexibility.

## When Hybrid Search Excels

Hybrid search provides clear advantages in applications requiring both precision and semantic understanding:

- **E-commerce search:** Find exact product names and models while understanding descriptive queries like "comfortable running shoes"
- **Technical documentation:** Match precise API names and error codes while finding conceptually related troubleshooting content
- **Enterprise search:** Locate specific documents by title or ID while discovering semantically related information across different departments
- **Customer support:** Match exact error messages while understanding natural language problem descriptions
- **Academic research:** Find papers with specific methodologies while discovering conceptually related work using different terminology

In each case, users benefit from both the precision of keyword matching and the flexibility of semantic understanding.

## Limitations

While hybrid search offers superior relevance, it introduces significant complexity:

- **Computational overhead:** Running both [full-text](/learn/search-concepts/full-text-search) and [vector search](/learn/search-concepts/vector-search) simultaneously increases computational requirements and storage needs
- **Implementation complexity:** Managing two different indexing systems and tuning fusion parameters requires engineering effort
- **Query latency:** Processing queries through multiple systems and combining results can increase response times
- **Cost implications:** Maintaining both keyword indexes and vector embeddings increases infrastructure costs

For many applications, a single search method may provide sufficient results with much lower complexity. Hybrid search makes sense when the improved relevance justifies the additional overhead.

## Summary

Hybrid search represents the current state-of-the-art in information retrieval by combining the precision of [full-text search](/learn/search-concepts/full-text-search) with the semantic understanding of [vector search](/learn/search-concepts/vector-search).

While the added complexity and computational overhead are significant, if they are tuned well hybrid systems can deliver superior relevance across diverse query types.

---

# Lexical vs Semantic Search

> Learn the differences between lexical and semantic search methods, when to use each approach, and how hybrid search combines both.

Published: 2024-11-28 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/lexical-vs-semantic

---

Different search workloads have different relevancy requirements. For example, some search problems only depend on exact or approximate string matching, where the presence (or absence) of a substring determines relevance (e.g. `“cat”` and `“black cat”`). Others depend on semantic similarity, where meaning determines relevance (e.g. `“cat”` and `“feline”`).

From an implementation perspective, these two patterns are orthogonal:

- Lexical methods (including full-text search) match tokens precisely. They only succeed when the query and document share vocabulary.
- Semantic methods capture intended meaning, but are poorly suited for exact identifiers (IDs, numbers, proper names) that lexical systems handle precisely.

This difference is why no single retrieval method performs well across all workloads (however, using both together with hybrid search offers a compromise).

Understanding when to use lexical, semantic, or hybrid search depends on your specific use case and the type of queries your users perform.

## Lexical Search

Lexical search has been part of information systems for decades. Early workloads depended on deterministic matching. Most queries issued by databases and other internal tools referenced exact terms like IDs or error codes. For example, a SQL query using `LIKE '%error%'` searches for any text containing the substring "error".

Modern lexical search is much more advanced and determines relevance based on token presence and frequency. Documents are scored higher when query terms appear more frequently or are rarer across the entire corpus.

### How does Lexical Search Work?

Most lexical search implementations use inverted indexes. An inverted index maps each token to the documents that contain it. This enables fast lookups even when datasets grow.

However, results also need to be ranked. [BM25](/learn/search-concepts/bm25) (Best Matching 25) is the most widely used ranking function for lexical retrieval. When indexing, text is passed through an analyzer that tokenizes the input. These tokens might be individual words, phrases, or lexemes (root words of each word). Each token becomes an entry in the inverted index, along with how many documents contain that token and how frequently it appears.

Lexical search can be enhanced with techniques like stemming (reducing words to root forms) and synonym expansion (matching predefined word lists). However, these improvements still rely on exact token matching rather than understanding contextual meaning.

### What is Full-Text Search?

[Full-text search](/learn/search-concepts/full-text-search) extends this model by adding structure to the query itself. Instead of treating tokens independently, full-text search lets developers configure how tokens should relate to one another.

For example, phrase search requires tokens to appear in a specific sequence, which is useful whenever order might affect meaning (such as error messages or function signatures). Proximity search, on the other hand, allows tokens to appear near each other but not in an exact order; this better reflects how related terms appear in documentation or logs.

Full-text search is still purely lexical. It uses an inverted index and extends the same token statistics. However, full-text search provides a more expressive query toolkit.

## Semantic Search

Semantic search (also known as [vector search](/learn/search-concepts/vector-search)) addresses a fundamental limitation of lexical search. Two pieces of text can describe the same idea without overlapping words. If a query uses different phrasing than the sought-after document, search results might not be relevant (although lexical with synonyms can help, it still can't index concepts).

Dense vector retrieval closes this gap by approximating meaning instead of relying on matching terms. Embedding models (like text-embedding-ada-002, sentence-transformers, or E5) learn relationships between concepts from large datasets; these concepts are then mapped into a continuous vector space.

### How does Semantic Search Work?

When indexing, an embedding model converts each document into a dense vector. These vectors represent semantic information learned from the model’s training data. Because this representation is continuous, documents that express similar ideas can be placed near one another even if they use completely different vocabulary.

Searching across all vectors directly would be very expensive, so databases use approximate nearest neighbor (ANN) indexes like HNSW or IVF. These structures organize vectors into partitions, making it possible to retrieve a small candidate set. ANN indexes trade a small amount of recall for significant gains in throughput and latency.

At query time, the system embeds the query into the same vector space. It then evaluates how close the query vector is to each candidate vector using similarity metrics such as cosine similarity, the dot product, or the L2 distance.

## Hybrid Search

[Hybrid search](/learn/search-concepts/hybrid-search) does not replace lexical or semantic search. Instead, it coordinates them. Lexical methods are reliable when queries contain structured tokens. Semantic methods are reliable when the query expresses intent in natural language. Hybrid search combines these strengths.

In hybrid search, both lexical and semantic methods retrieve their own set of candidates. Each list reflects a different assumption about what the user meant: one grounded in exact terms, the other grounded in contextual meaning. Documents surfaced by either method may be relevant. Merging these lists into a single ranking creates a more stable relevance signal across query types that may change.

### How does Hybrid Search Work?

Hybrid search begins by running multiple retrieval methods in parallel. Typically, lexical search retrieves and ranks candidates using BM25, which is almost always the best baseline for lexical retrieval. At the same time, a vector search retrieves candidates using dense embeddings. Each method produces its own ranked list based on its internal scoring rules.

At query time, these lists need to be merged into a single final ranking. The scores from each method cannot be directly compared as lexical scores and vector measures use varying scales. However, techniques like [Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion) (RRF) ignore raw scores entirely and strictly use the rank position of each document within its own list. RRF assigns each document a fused score.

## Summary

Lexical, semantic, and hybrid search methods exist because users express intent in different ways. Some queries rely on exact tokens, while others rely on context. Each retrieval method optimizes for a different type of relevance.

Beyond choosing retrieval methods, query rewriting (expanding or modifying queries before search) and reranking (adjusting results after retrieval) are equally important for improving search quality. These techniques can significantly enhance any underlying search approach. Modern applications like [Retrieval-Augmented Generation (RAG)](/learn/search-concepts/retrieval-augmented-generation) rely heavily on these search fundamentals to provide relevant context to language models.

One practical challenge that many teams face is when their applications require search in a separate system from their primary database (e.g. Elasticsearch and Postgres). In cases like these, the burden falls on developers to keep both systems consistent.

Modern database engines that support both lexical and semantic search natively help by running retrieval methods directly on database rows, eliminating the need for separate index-sync pipelines.

---

# What is Reciprocal Rank Fusion?

> Learn about Reciprocal Rank Fusion (RRF), a technique for combining multiple search result rankings.

Published: 2024-10-06 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/reciprocal-rank-fusion

---

[Reciprocal Rank Fusion](https://plg.uwaterloo.ca/%7Egvcormac/cormacksigir09-rrf.pdf) (RRF) is a technique for combining multiple ranked lists of search results into a single ranking. Originally developed for information retrieval research, RRF has become essential in modern search systems to allow ranking across multiple retrieval methods, such as combining lexical and semantic search results in hybrid search applications.

RRF's elegance lies in its simplicity: rather than normalizing scores within each scoring system, it works directly with document positions in the ranked lists.

## How RRF Works

RRF operates on a simple principle: documents that appear highly ranked across multiple search methods are likely to be genuinely relevant.

The standard RRF formula treats all ranking sources equally, but _Weighted RRF_ allows you to assign different weights to each method.

For example, when combining BM25 and vector search you might assign weights of 1.0 and 0.7 respectively to show that you value lexical over semantic meaning.

$$
\text{WeightedRRF}(d) = \sum_{r \in R} w_r \cdot \frac{1}{k + rank_r(d)}
$$

Where

- $$d$$ is a document
- $$R$$ is the set of ranked lists
- $$rank_r(d)$$ is the rank of document $$d$$ in ranking $$r$$
- $$w_r$$ is the weight for ranking $$r$$ (typically between 0 and 1). In non-weighted/traditional RRF this is always 1.
- $$k$$ is a constant (typically 60) that controls the fusion behavior

The fusion process works as follows:

1. Start with multiple ranked lists from different retrieval methods (discard the actual ranking numbers).
1. Calculate RRF scores for each document that appears in any list and multiply by the weight (if used).
1. Merge all documents and sort by their combined RRF scores

Documents missing from a ranking contribute zero to that ranking's sum.

  The constant $$k=60$$ is almost always used because it has been empirically shown to work well across different datasets.

$$k$$ can be tuned based on your specific use case and data characteristics. Lower values (20-40) will give top results more influence, higher values (80-100) will give a more gradual contribution difference.

## Why RRF Works Well

RRF has three key strengths:

- **Simplicity:** No training data, complex optimization, or normalization required. Easy to implement and fast to compute.
- **Robustness:** Works with different score scales without normalization. Handles partial results gracefully.
- **Effectiveness:** Often outperforms more sophisticated fusion techniques, especially when combining complementary methods.

Research consistently shows RRF's effectiveness when combining different retrieval approaches like keyword search with semantic similarity.

## When to Use RRF

While RRF can be used any time there are multiple scoring systems which need to be combined in a single query it excels in several scenarios:

**Hybrid Search Systems**
Combining lexical search ([BM25](/learn/search-concepts/bm25)) with semantic vector search:

- BM25 finds specific keywords and technical terms
- Vector search captures conceptual similarity
- RRF harnesses both strengths

**Multi-Field Search**
Search across different document parts with separate rankings:

- Title search results
- Body content results
- Metadata results

**Personalization**
Merge general relevance with personalized signals based on user behavior or preferences.

## Example: RRF in Action

Consider a search for `machine learning tutorial` using both lexical and semantic search:

**Lexical Search Results:**

1. "Complete Machine Learning Tutorial Guide"
2. "Tutorial: Introduction to ML Algorithms"
3. "Python Machine Learning Handbook"

**Semantic Search Results:**

1. "AI and Deep Learning Fundamentals"
2. "Complete Machine Learning Tutorial Guide"
3. "Beginner's Guide to Neural Networks"

**RRF Calculation (k=60):**

- "Complete ML Tutorial" appears in both lists (rank 1 and 2): `1/61 + 1/62 = 0.0326`
- "AI and Deep Learning" appears only in semantic (rank 1): `1/61 = 0.0164`
- "Tutorial: Intro to ML" appears only in lexical (rank 2): `1/62 = 0.0161`

Weights allow you to emphasize one retrieval method over another. For instance, in a hybrid setup:

- BM25 might have a weight of 1.0 for lexical precision.
- A vector search model might have 0.7 for semantic similarity.

The effect is simple but powerful: RRF still rewards agreement across rankers, but weights let you encode which signals you trust most. This makes weighted RRF especially useful in production systems where retrieval sources vary in quality or purpose.

## Implementation in SQL

A variety of databases and search engines have pre-implemented RRF, but it's a very simple formula and can be easily demonstrated in SQL as follows:

```sql
WITH fulltext AS (
  SELECT id, RANK() OVER (ORDER BY score DESC) AS rank
  FROM (
    SELECT id, pdb.score(id) AS score
    FROM mock_items
    WHERE description @@@ 'keyboard'
    ORDER BY pdb.score(id) DESC
    LIMIT 20
  )
),
--- Semantic search, using pgvector and cosine distance for ranking
semantic AS (
  SELECT
    id,
    RANK() OVER (ORDER BY embedding <=> '[1,2,3]') AS rank
  FROM mock_items
  ORDER BY embedding <=> '[1,2,3]'
  LIMIT 20
),

-- Calculate RRF contributions from each ranker
rrf AS (
  SELECT id, 1.0 / (60 + rank) AS s FROM fulltext
  UNION ALL
  SELECT id, 1.0 / (60 + rank) AS s FROM semantic
)

-- Sum the RRF scores, order by them, and join back the original data
SELECT
  m.id,
  sum(s),
  m.description
FROM rrf
JOIN mock_items AS m USING (id)
GROUP BY m.id, m.description
ORDER BY sum(s) DESC
LIMIT 5;
```

## Summary

Reciprocal Rank Fusion provides an elegant solution for combining multiple search rankings without the complexity that comes from score based systems. Its simplicity, robustness, and proven effectiveness make it the standard approach for hybrid search systems that need to merge results from different retrieval techniques.

---

# What is RAG?

> Learn about Retrieval-Augmented Generation (RAG) and how it uses search to increase LLM effectiveness.

Published: 2024-10-21 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/retrieval-augmented-generation

---

Retrieval-Augmented Generation (RAG) is an architecture that retrieves knowledge from external sources to improve the accuracy of responses from AI models. Without RAG, AI models only have access to training data. With RAG, prompts are appended with relevant context to assist the AI model’s output.

Originally, RAG emerged as a strategy to tackle an LLM’s limited context window. Passing all available context (such as a company’s internal knowledge base or a customer database) would have surpassed an LLM’s input limit. Modern LLMs have significantly larger context windows; however, RAG remains as a technique to filter context that will assist, and not distract, an LLM’s reasoning.

## How does RAG work?

RAG consists of three steps that correspond to its name:

1. **Retrieval**: Information is retrieved from external data sources. Usually the user query will be used, with the RAG pipeline employing vector search or full-text search to surface relevant information.
2. **Augmentation**: Retrieved information is appended to the user’s query and demarcated as context.
3. **Generation:** The AI model (e.g. GPT-5, Claude, etc.) generates an output from the augmented query.

The final step of RAG is straightforward. However, you must decide on the optimal retrieval and augmentation strategies for your application.

## Strategies for RAG retrieval

RAG can work with any external data source. This includes an existing database, an API, or a document repository. However, the most common implementations of RAG are ones that are purpose-built around both storage and retrieval. If data is not efficiently indexed in storage, then it is difficult to find the correct data to retrieve.

There are two common strategies (and a hybridized option) for efficiently storing and retrieving data: vector search and full-text search.

### Vector Search

[Vector search](/learn/search-concepts/vector-search) retrieves information based on semantic similarity using multi-dimensional vector embeddings. Vector search works best when users need conceptually related information, even if the exact words differ. It's particularly effective for multimodal applications and finding information that relates to the query's meaning rather than exact text matches.

### Full-Text Search

[Full-Text search](/learn/search-concepts/full-text-search) (FTS) retrieves information based on matching textual content, usually using an algorithm like [BM25](/learn/search-concepts/bm25). Full-text search works best when users know exactly what they’re looking for: like a product code, a filename, or a specific phrase. It’s precise, deterministic, and fast.

### Hybrid Search

Because vector search struggles to retrieve textually similar information (e.g. a shared book title) and full-text search struggles to retrieve semantically similar information (e.g. two books on the same topic), an alternative is hybrid search. [Hybrid search](/learn/search-concepts/hybrid-search) employs both vector search and full-text search to provide a cumulative approach.

Using [reciprocal rank fusion](/learn/search-concepts/reciprocal-rank-fusion), hybrid search retrieves documents that highly rank on both search approaches.

## Augmentation

Once a RAG application retrieves relevant information, it must combine it with the user's original query before sending the final prompt to the AI model. This augmentation step determines how effectively the model can use the retrieved context.

The most common approach is to append the retrieved documents to the original prompt with clear delimiters. For example:

```text
Context:
[Retrieved document 1]
[Retrieved document 2]

User Query: What is our return policy?

Please answer the user's question based on the provided context.
```

You can improve augmentation by:

- **Ranking and filtering**: Only include the most relevant retrieved documents to avoid overwhelming the model with excessive context
- **Summarization**: Condense lengthy retrieved documents into key points before inclusion
- **Structured formatting**: Organize retrieved information with headers, bullet points, or numbered lists for easier model comprehension
- **Source attribution**: Include document titles or metadata so the model can reference specific sources in its response

The goal is to provide enough relevant context for accurate responses while maintaining clarity and staying within the model's context window limits.

## What are the benefits of RAG?

RAG is a necessary architecture for most business applications of AI. AI models aren’t trained on private information, such as a company’s internal policies, product documentation, or customer data. Instead, they must be provided with that context at query time to effectively answer prompts.

An alternative (or supplementary) strategy to RAG is fine-tuning, where AI models are further trained on contextual information before query time. This approach must happen ahead of time and is much more complex; but it can be ideal for applications where context is fixed and not dynamic, where queries are predictably centered around a specific domain, and where context isn’t excessively abundant.

## Summary

RAG is an architecture for retrieving relevant information to improve the performance of LLM model responses. Usually, the most challenging aspect of RAG is choosing the right retrieval strategy that optimizes for semantic similarity, textual similarity, or a combination of the two.

---

# What is Vector Search?

> Learn about vector search, how it enables semantic similarity matching using high-dimensional embeddings.

Published: 2024-10-06 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-concepts/vector-search

---

Vector search, also known as similarity search and semantic search, focuses on finding similar content based on semantic meaning rather than exact keyword matches. Unlike traditional [full-text search](/learn/search-concepts/full-text-search) that asks "which documents contain these tokens?", vector search asks "which documents mean something similar to this?"

This semantic understanding makes it particularly valuable for natural language queries, content recommendation, and applications where meaning matters more than exact terminology.

Vector search operates through three core components: **embedding generation**, **indexing**, and **querying**. Embedding generation converts text into numerical vectors, indexing stores these vectors in specialized data structures, and querying uses mathematical similarity calculations to find the most relevant content.

  Vector search and vector indexes can be provided either by specialized vector
  databases (like [Pinecone](https://www.pinecone.io/) or, by search engines
  with vector indexes (like
  [Elasticsearch](https://github.com/elastic/elasticsearch)), or by general
  purpose databases with vector extensions (like
  [PostgreSQL](https://postgresql.org) with
  [pgvector](https://github.com/pgvector/pgvector)).

## Embedding Generation

Embedding generation is the foundation that lets vector search appear to understand intent. Machine learning models convert text into dense numerical vectors that capture semantic meaning. This isn't an exact science, just as two people can have different understandings of sentences, so can different embedding models.

### The Embedding Pipeline

The embedding process transforms raw text into searchable vectors through several steps:

1. **Text preprocessing:** Cleaning, normalizing, and chunking input text. Because embedding systems can't handle large documents, input text will be broken down into smaller chunks which will be embedded separately.
2. **Model inference:** Running chunks through local or remote embedding models to generate high-dimensional vectors.
3. **Vector normalization:** Normalizing vectors for consistent similarity calculations.

All vector search systems must decide whether to run embedding models locally or use remote APIs.

- **Local models** like [BGE-M3](https://bge-model.com/bge/bge_m3.html) or open-source alternatives provide full control over latency, costs, and data privacy. You can optimize inference speed and avoid per-request API costs, but require computational resources and model management.

- **Remote APIs** like OpenAI or Anthropic's embedding services offer convenience without infrastructure overhead. However, they introduce network latency, per-request costs, and potential privacy concerns when sending data to external services.

Embedding generation speed directly affects user experience. Local models can be optimized for your hardware and batched for efficiency, while remote APIs face network round-trip delays. For real-time applications, embedding pre-computation or caching strategies become essential regardless of the approach.

The model's dimensionality also impacts both embedding speed and storage requirements: higher dimensions often mean better accuracy, but bring increased computational and storage costs.

## Indexing

Indexing stores the generated vectors in data structures optimized for fast similarity search. Unlike traditional search that builds inverted indexes of terms, vector search creates specialized indexes for high-dimensional numerical data. These indexes allow queries to efficiently find similar vectors without comparing every vector pair. Different indexing strategies optimize for various performance characteristics:

- **HNSW (Hierarchical Navigable Small World):** Builds multi-layer graph structure for fast approximate search with good recall
- **IVF (Inverted File Index):** Partitions vectors into clusters using k-means, searches only relevant clusters
- **DiskANN:** Optimized for SSD storage, enables vector search on datasets larger than RAM

Many other index types exist for specialized use cases, each balancing search speed, accuracy, and memory requirements.

### Example: Building a Vector Index

Consider indexing two simple documents:

| **ID** | **Text**                    |
| ------ | --------------------------- |
| 1      | "machine learning tutorial" |
| 2      | "AI programming guide"      |

After embedding generation, each document becomes a high-dimensional vector:

| **Document** | **Vector (simplified 3D representation)** |
| ------------ | ----------------------------------------- |
| 1            | `[0.8, 0.3, 0.1]`                         |
| 2            | `[0.7, 0.4, 0.2]`                         |

The vector index stores these vectors alongside their document IDs, optimized for fast similarity search operations. The choice of index type (HNSW, IVF, etc.) determines how these vectors are organized and accessed during search.

## Querying

Vector search converts user queries into vectors and finds the most similar documents through mathematical distance calculations.

The process is straightforward: convert the query to a vector using the same embedding model, calculate similarity scores against all indexed vectors, and return results ranked by similarity.

**Similarity measures** determine how close vectors are to each other:

- **Cosine similarity:** Measures angle between vectors (most common)
- **Dot product:** Considers both angle and magnitude
- **Euclidean distance:** Straight-line distance in vector space

Vector search systems often provide additional query capabilities like filtering by metadata, approximate search for speed, and threshold filtering to ensure quality results.

It's important to note that vector systems have no concept of tokens or words, so concepts like exact matching and proximity don't exist. The correctness of the results also relies on the embedding model's semantic understanding matching the end users.

## Where Vector Search Excels

Vector search excels in applications where semantic understanding provides clear advantages:

- **Content recommendation:** Find conceptually similar content regardless of exact wording
- **Semantic search:** Natural language queries like "Why is my website slow?" find articles about optimization, performance, and CDNs
- **Retrieval-Augmented Generation (RAG):** Find relevant context documents for language models to provide accurate, grounded responses
- **Multimodal applications:** Visual similarity search, cross-modal search between text and images, audio/video content matching

## When Vector Search Is Not Enough

While vector search excels at semantic similarity, it has limitations when users need exact keyword matching or precise terminology:

- **Exact keyword requirements:** Product codes, technical specifications, or proper names often require precise lexical matching.
- **Positional queries:** Vector indexes have no concept of words or position in the document.

Vector search requires more storage and compute resources than traditional search. Even at query time, generating embeddings and computing similarity scores can add noticeable latency compared to [BM25](/learn/search-concepts/bm25).

In many scenarios, **[hybrid search](/learn/search-concepts/hybrid-search)** approaches that combine both vector and traditional [full-text search](/learn/search-concepts/full-text-search) often provide the best solution:

- Full-text search for precise keyword matching and boolean logic
- Vector search for semantic similarity and concept-based retrieval
- Ranking algorithms like [Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion) that blend both approaches for optimal relevance

## Summary

Vector search represents a divergent advancement in information retrieval, enabling the discovery of semantically similar content by representing data as high-dimensional numerical vectors and using mathematical similarity measures for retrieval. This approach excels at understanding meaning and context, but can't search for exact words or phrases.

The technology has become essential for recommendation systems and semantic search implementations that require nuanced understanding of content relationships.

---

# Implementing BM25 in PostgreSQL

> Learn how to implement BM25 ranking in PostgreSQL using various approaches from SQL to ParadeDB extensions.

Published: 2024-10-15 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-in-postgresql/bm25

---

Postgres is a powerful general-purpose database: but its built-in full-text
search has limits, especially around ranking. It supports basic term matching,
but lacks a native implementation of BM25, the probabilistic ranking function
that underpins most modern search systems (including Elasticsearch).
[BM25](https://www.paradedb.com/learn/search-concepts/bm25) (short for *Best
Matching 25*, often called *Okapi BM25*) measures relevance by balancing term
frequency, document rarity, and length normalization, an approach that
consistently outperforms both simpler and more complex ranking methods.

In this guide, you will learn a few approaches to implementing BM25 in Postgres databases.

## How Does BM25 Work?

BM25’s goal is to rank the results of a search query by relevancy.

To do this BM25 considers three variables:

- (i) the term frequency (TF) of each word in the document.
- (ii) the inverse document frequency (IDF), which quantifies and weights the rarity of each word across all documents
- (iii) The document’s length. By factoring in the document’s length, BM25 can promote shorter documents that might be more centered around the search query.

BM25 positively weights (i) shorter documents, (ii) those with a high frequency of query words, and (iii) terms that occur more rarely across documents. If you’re curious about the algorithm’s precise mechanics, you can learn more through our [BM25 chapter](https://www.paradedb.com/learn/search-concepts/bm25) in our [Search Concepts](https://www.paradedb.com/learn/search-concepts/full-text-search) collection.

## Postgres Supports Full-Text Search, But Not BM25

Despite not implementing BM25, Postgres does support full-text search via [`tsvector`](https://www.postgresql.org/docs/current/datatype-textsearch.html) and its family of functions.

Using [`ts_rank`](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING), Postgres developers can rank documents by relevance, but all scoring is not the same. The built in ranking doesn’t perform as well as BM25 because it only accounts for term frequency within a document and document length, but not inverse document frequency to capture the rarity of words in the full corpus. It’s also slower, needing to read from every tsvector in the set to return ranking results from the index (even for a LIMIT/Top K query).

## Routes to Implement BM25 in Postgres

Of course where there is a will there is a way, Postgres enthusiasts have [found several methods to adding BM25 support,](https://www.postgresql.eu/events/pgconfeu2024/sessions/session/6012/slides/592/Di_Qi_LT_BM25.pdf) from raw SQL to a dedicated BM25 extension.

Let’s break apart some of these approaches in detail.

### Implementing BM25 Using Just SQL

The most rudimentary (and unscalable) approach is using just SQL at query time.

BM25’s three variables—document length, term frequency, and IDF can be calculated [manually](https://www.alibabacloud.com/blog/keyword-analysis-with-postgresql-cosine-and-linear-correlation-algorithms-for-text-analysis_595793) using `tsvector` objects and the `ts_stat` data which Postgres collects, however it's both very convoluted and very slow. It could potentially be an option for a static corpus that never changes, but still feels more like a proof-of-concept that a production ready method.

### Implementing BM25 Using PL/PGSQL and Auxiliary Tables

A more usable path to implement BM25 is to use custom PL/PSQL and auxiliary tables. With each term occupying a row, table lookups are reduced to one per term in your query. These results can then be aggregated with a custom function.

A library that takes this approach is [plpgsql_bm25](https://github.com/jankovicsandras/plpgsql_bm25) by [András Jankovics](https://github.com/jankovicsandras). plpgsql_bm25 implements BM25 using just SQL and auxiliary tables. The function signature to create a table that holds BM25 records looks like this:

```sql
bm25createindex(
  tablename TEXT,
  columnname TEXT,
  algo TEXT DEFAULT '',
  stopwordslanguage TEXT DEFAULT '')
RETURNS VOID
```

Unfortunately this approach is still very slow, and the tables (and their indexes) need to be updated after every source data change which leads to write amplification.

### Extending Postgres to Support BM25 Indexes

A more scalable approach is using Postgres's index access method (IAM) to create a native BM25 index that is created on the source table.

[ParadeDB](https://www.paradedb.com) uses this method to implement BM25 indexes in Rust, adding them to PostgreSQL though the open-source `pg_search` extension.

When an index is added to a table it can cover multiple columns, providing advanced full-text search DX, as well as fast columnar lookups and aggregations for non-text fields.

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, rating)
WITH (key_field='id');
```

The BM25 index is laid out on disk as an [**LSM tree**](https://docs.paradedb.com/welcome/architecture#lsm-tree), where each segment in the tree consists of both an inverted index and columnar index. The inverted and columnar indexes optimize for fast reads, while the LSM tree optimizes for high-frequency writes.

ParadeDB users can access full-text search and BM25 relevancy ranking by using custom operators and functions. For example, a search using match disjunction (find all documents which match one or more terms), ordering by BM25 scores, and then returning the top 5 would look like this:

```sql
SELECT id, paradedb.score(id)
FROM mock_items
WHERE description ||| 'running shoes'
ORDER BY paradedb.score(id)
LIMIT 5;
```

In most cases ParadeDB performs similarly to Elasticsearch, returning results in milliseconds even across large datasets.

## **What Are the Benefits of BM25 Support in Postgres?**

Bringing BM25 directly into Postgres has two major effects: it improves ranking quality and simplifies system design.

Postgres’s built-in full-text search—based on tsvector, tsquery, and ts_rank is capable but rudimentary. Its scoring function uses a linear weighting of term frequency and field importance, but it ignores inverse document frequency (IDF) and applies no document-length normalization.

Running BM25 inside Postgres also eliminates the need for a separate search engine and the ETL processes that keep it synchronized. There’s no risk of stale indexes, no duplicate infrastructure, and no latency gap between writes and searchable data. Each change to a table immediately updates the BM25 index, keeping search results transactionally consistent with the source of truth.

Finally, because BM25 operates as a native index access method, it participates in Postgres’s planner and transaction model. Queries benefit from caching, parallelism, and predictable performance characteristics: something external search engines can’t easily guarantee.

## Getting Started with BM25 in Postgres

Postgres doesn’t natively support BM25, but as we’ve seen, there are several ways to bring it into the database, from SQL-based prototypes to full extensions like ParadeDB’s pg_search. Each approach trades off simplicity, performance, and maintenance effort, but all share the same goal: better, more meaningful search results inside Postgres itself.

---

# Full-Text Search in PostgreSQL

> Learn how to implement powerful text search directly in PostgreSQL using built-in full-text search capabilities.

Published: 2024-10-16 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/search-in-postgresql/full-text-search

---

PostgreSQL includes powerful built-in [full-text search](/learn/search-concepts/full-text-search) capabilities that transform the database into a search engine. Rather than relying on external services, you can implement sophisticated text search directly within PostgreSQL using the same ACID transactions, security model, and operational infrastructure that manage your application data.

PostgreSQL's full-text search handles the complete pipeline from text analysis to ranked results. It tokenizes documents, removes stop words, applies stemming, builds searchable indexes, and ranks results by relevance, all with the performance and reliability you expect from PostgreSQL. This article discusses what it can do, and where it falls short.

**We love using PostgreSQL for full-text search, but we built ParadeDB because we don't think the built in full-text search is enough for a modern search application.**

That's not to say that the built in features aren't great, but for when you need a streamlined developer experience, [BM25](/learn/search-concepts/bm25) ranking, powerful facets, a feature set that matches Elasticsearch, or increased performance then ParadeDB is a better fit.

## Core Components

PostgreSQL's full-text search is built around specialized data types and functions that work together to provide comprehensive text search capabilities. These are combined to both store and query text vectors.

### Text Search Data Types

PostgreSQL provides two fundamental data types for full-text search. The [`tsvector`](https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSVECTOR) type represents processed, searchable text as a sorted list of distinct tokens with position information. The [`tsquery`](https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSQUERY) type represents search queries with boolean operators, phrase matching, and proximity requirements.

These types handle the complexity of text analysis while exposing a clean interface for indexing and querying.

#### Understanding tsvector

A `tsvector` transforms raw text into a structured, searchable format:

```sql
-- Convert text to tsvector
SELECT to_tsvector
  ('english',
   'PostgreSQL provides full-text search capabilities in database'
  );
```

```txt
 'capabl':7
 'databas':9
 'full':4
 'full-text':3
 'postgresql':1
 'provid':2
 'search':6
 'text':5
```

Notice how the result:

- Normalizes words to lowercase
- Stems the words ("provides" → "provid", "capabilities" → "capabl") {/* * codespell:ignore provid * */}
- Removes stop words like "in"
- Records position information (`:1`, `:2`, etc.)
- Explodes hyphenated terms like "full-text" while keeping the original

One limitation of `tsvector` is that it represents the text of a single row as a single vector. To combine multiple text fields, you typically concatenate them (`title || ' ' || description || ' ' || body`) into one `tsvector`. This works, but it flattens field boundaries, so you can’t search or rank fields independently. Alternatively, you could store separate `tsvector` columns for each field, but then each is indexed and scored independently, fragmenting your corpus.

  ParadeDB uses a BM25 covering index avoids this trade-off: it can maintain
  field-level statistics and weighting within a single index while still
  computing global document statistics.

#### Building tsquery

A `tsquery` represents search conditions with boolean logic that operates on a `tsvector` type:

```sql
-- Simple term query
SELECT to_tsquery('english', 'postgresql');

-- Boolean operators
SELECT to_tsquery('english', 'postgresql & search');

-- Phrase search
SELECT to_tsquery('english', 'full <-> text');

-- Prefix matching
SELECT to_tsquery('english', 'post:*');
```

### Text Search Configurations

PostgreSQL ships with predefined configurations for multiple languages, each with appropriate tokenization rules, stop word lists, and stemming algorithms. You can see the configurations available on your system with the following query:

```sql
-- Available configurations
SELECT cfgname FROM pg_ts_config;
```

Common configurations include `english`, `spanish`, `french`, `german`, and `simple` (no stemming). You can also create custom configurations for specialized domains. While other systems like Elasticsearch operate on filter chaining, if you wanted to change the way Postgres full-text search works you'd create another configuration with different options enabled.

## Indexing for Performance

Full-text search performance depends entirely on proper indexing. In practice, you'll almost always use functional indexes rather than storing `tsvector` data in columns. Functional indexes avoid duplicating your text data, eliminate the need for synchronization triggers, and provide the same query performance as materialized columns.

PostgreSQL supports two index types for text search, with GIN indexes being the standard choice for most applications.

### GIN Indexes

GIN (Generalized Inverted Index) indexes are the preferred approach for full-text search. Rather than adding a separate `tsvector` column, create the index directly on your text processing function:

```sql
-- Create functional GIN index (recommended approach)
CREATE INDEX idx_articles_search ON articles
USING GIN(to_tsvector('english', title || ' ' || content));
```

This approach provides several advantages over storing `tsvector` in a column. You avoid duplicating your text data in the database, which saves storage space and reduces backup sizes. There's no need for triggers or other synchronization mechanisms to keep the search data current with your source text. The index automatically reflects any changes to the underlying text columns. Query performance remains similar to using a materialized `tsvector` column since PostgreSQL can use the index directly. Even if you did store `tsvector` data in a column, you'd still need to create an index on that column for acceptable performance.

GIN indexes excel at fast lookups for any query term, efficient boolean operations, and compact storage for large vocabularies. They're optimized for read-heavy workloads where the text content doesn't change frequently.

### GiST Indexes

GiST (Generalized Search Tree) indexes offer an alternative that can be more suitable for specific scenarios:

```sql
-- Create functional GiST index
CREATE INDEX idx_articles_search_gist ON articles
USING GiST(to_tsvector('english', title || ' ' || content));
```

GiST indexes perform better when you need proximity or phrase queries, when updates are frequent (GiST updates faster than GIN), or when index size is a concern. However, for most full-text search applications, GIN indexes provide better overall performance.

## Basic Text Search Operations

With your functional index in place, you can perform sophisticated text searches without any additional setup or data synchronization.

### Executing Searches

Use the `@@` operator to match text against search queries:

```sql
-- Find articles about PostgreSQL
SELECT title, content
FROM articles
WHERE to_tsvector('english', title || ' ' || content) @@
      to_tsquery('english', 'postgresql');

-- Boolean search
SELECT title, content
FROM articles
WHERE to_tsvector('english', title || ' ' || content) @@
      to_tsquery('english', 'postgresql & (search | query)');

-- Phrase search
SELECT title, content
FROM articles
WHERE to_tsvector('english', title || ' ' || content) @@
      to_tsquery('english', 'full <-> text');
```

### Ranking Results

PostgreSQL provides [ranking functions](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING) that score documents by relevance, though with important limitations compared to modern search engines:

```sql
-- Rank by relevance using ts_rank
SELECT title, content,
  ts_rank(to_tsvector('english', title || ' ' || content), query) as rank
FROM articles, to_tsquery('english', 'postgresql & search') query
WHERE to_tsvector('english', title || ' ' || content) @@ query
ORDER BY rank DESC;
```

PostgreSQL's `ts_rank` function provides basic relevance scoring, but it's significantly less sophisticated than modern ranking algorithms like [BM25](/learn/search-concepts/bm25) used by Elasticsearch. The `ts_rank` function only considers term frequency within documents and lacks global document frequency information that would indicate how rare or common terms are across your entire corpus. This means common words aren't properly downweighted, and rare, potentially more relevant terms don't receive appropriate boosts.

PostgreSQL also offers `ts_rank_cd` with cover density algorithms for alternative ranking approaches, but these still fall short of BM25's comprehensive approach to relevance scoring. For applications where search ranking quality is critical, this represents a significant limitation compared to dedicated search engines.

## Advanced Query Techniques

### Highlighting Results

PostgreSQL can highlight matching terms in search results:

```sql
SELECT title,
  ts_headline('english', content, query) as highlighted_content
FROM articles, to_tsquery('english', 'postgresql & search') query
WHERE search_vector @@ query;
```

### Multi-Column Search

Search across multiple text fields with different weights

```sql
-- Weight title more heavily than content
SELECT title, content,
  ts_rank_cd(
    setweight(to_tsvector('english', title), 'A') ||
    setweight(to_tsvector('english', content), 'B'),
    query
  ) as rank
FROM articles, to_tsquery('english', 'postgresql') query
WHERE (
  setweight(to_tsvector('english', title), 'A') ||
  setweight(to_tsvector('english', content), 'B')
) @@ query
ORDER BY rank DESC;
```

### Fuzzy Matching

The `pg_trgm` extension can be added to handle typos and variations with similarity search:

```sql
-- Enable pg_trgm extension for trigram similarity
CREATE EXTENSION pg_trgm;

-- Find similar terms
SELECT word FROM dictionary
WHERE similarity(word, 'postgresgl') > 0.3
ORDER BY similarity(word, 'postgresgl') DESC;
```

## When to Use PostgreSQL Full-Text Search

PostgreSQL's built-in search excels when search needs to be transactional with other database operations, maintaining ACID guarantees across your entire application. It performs well for applications with millions of documents rather than billions, making it suitable for most business applications. The language-specific configurations work well for multi-language content, and the ability to join search results with other tables simplifies complex queries that would require multiple systems in other architectures.

The operational simplicity of managing a single database system rather than separate search infrastructure makes PostgreSQL full-text search attractive for teams that want to avoid the complexity of distributed systems. PostgreSQL full-text search provides excellent performance for most applications while maintaining the operational simplicity of a single database system.

## Limitations and Alternatives

While PostgreSQL's full-text search is powerful, it has limitations compared to specialized search engines. Performance degrades with very large document collections (millions of documents), and the system provides limited support for faceting, fuzzy matching, or complex highlighting. The ranking algorithms offer fewer options for customization compared to dedicated search engines, and real-time analytics capabilities are less sophisticated than what dedicated search engines provide.

For applications requiring these advanced features, consider hybrid approaches that use PostgreSQL for transactional data and specialized search engines like Elasticsearch.

  Another option is [ParadeDB](https://www.paradedb.com), which uses an
  extension to bring the feature set and performance of Elasticsearch inside
  Postgres.

## Summary

PostgreSQL's built-in full-text search transforms your database into a capable search engine. With `tsvector`, `tsquery`, and proper indexing, you can implement sophisticated text search that integrates seamlessly with your application's data and transactions.

The combination of SQL's expressiveness with full-text search capabilities makes PostgreSQL an excellent choice for applications where search is important, but doesn't justify the complexity of separate search infrastructure.

---

# Introduction to Tantivy

> Introduction to Tantivy, a fast full-text search engine library written in Rust

Published: 2025-10-22 · Author: James Blackwood-Sewell

Source: https://www.paradedb.com/learn/tantivy/introduction

---

**[Tantivy](https://github.com/quickwit-oss/tantivy)** is a fast [full-text search](/learn/search-concepts/full-text-search) library written in Rust that reimagines Apache Lucene's design with modern performance optimizations. Like Lucene, Tantivy is a library that you embed directly into your applications rather than a search engine that runs as a separate service. Think of it as a modern, Rust-based evolution of Lucene's core principles.

The project emerged from creator [Paul Masurel](https://fulmicoton.com/)'s decade-long fascination with search engines and his discovery that "Rust was solving all of the pain points I experienced with C++ or Java with mindblowing elegance." What started as a [Rust learning exercise](https://fulmicoton.com/posts/behold-tantivy/) became a production-ready search library that challenges the assumption that you need complex infrastructure to achieve powerful search capabilities.

Tantivy follows Lucene's fundamental architecture and algorithms, including [BM25](/learn/search-concepts/bm25) scoring and similar indexing strategies, but takes advantage of Rust's memory safety and performance characteristics, plus some [innovative optimizations](https://fulmicoton.com/posts/tantivy-stacker/) that push the boundaries of what's possible with modern hardware. The result is a search library that starts up in under 10 milliseconds and runs approximately twice as fast as Lucene in benchmarks while providing equivalent search capabilities.

## How Tantivy Works

Like Lucene, Tantivy follows a schema-based approach where you define the structure and types of your searchable data upfront. This allows the library to build specialized indexes optimized for different kinds of queries, including inverted indexes for text search, and columnar fast fields for faceting and point lookup queries.

### The Segmented Architecture

Tantivy organizes indexes into multiple small "segments" rather than maintaining one monolithic index structure. This segmented approach enables several key capabilities:

- **Concurrent Indexing:** Multiple threads can index documents simultaneously without blocking
- **Memory Management:** Large datasets can be indexed without loading everything into RAM
- **Incremental Updates:** New documents create new segments, avoiding expensive full index rebuilds

Each segment operates as an independent, immutable unit containing its own inverted indexes, document storage, and metadata. During search operations, Tantivy queries all segments in parallel and merges results, while background processes periodically merge smaller segments into larger ones for optimal performance.

### Search and Retrieval

Search operations work by parsing queries into execution plans against the inverted indexes and columnar fast fields. Tantivy uses [BM25](/learn/search-concepts/bm25) scoring for relevance ranking and supports boolean queries, phrase matching, and fuzzy search. The library employs Finite State Transducers (FSTs) for efficient term dictionary storage and implements sophisticated integer compression techniques to minimize memory footprint.

Results are returned in relevance order with minimal memory allocation overhead, taking advantage of Rust's zero-cost abstractions and careful memory management throughout the search pipeline.

## Key Features

Tantivy includes the core features you'd expect from a modern search library:

**Text Analysis:**

- Configurable tokenization pipeline with language support
- Stemming to match word variations (running, runs, ran)
- Stop word removal and custom text processing
- Built-in configurations for common languages
- Ability to extend with new tokenizers and behaviors (eg: [Lindera Tantivy](https://github.com/lindera/lindera-tantivy))

**Query Capabilities:**

- [BM25](/learn/search-concepts/bm25) relevance scoring for ranking results
- Fuzzy search for handling typos and variations
- Phrase queries with proximity matching
- Boolean logic (AND, OR, NOT) combining multiple terms and fields
- Range queries for numeric and date fields
- Faceted search for filtering and aggregation

**Performance Features:**

- Fast fields which store associated data in a columnar format for efficient sorting and aggregation without full document retrieval
- Compressed document storage for space efficiency
- Multithreaded indexing with configurable memory usage

## When to Choose Tantivy

Tantivy works well when you need search functionality embedded directly in your application. It's commonly used in desktop applications, command-line tools, and web applications where managing separate search infrastructure isn't practical.

**Single-Process Applications:** Desktop software, CLI tools, and single-server web applications benefit from Tantivy's embedded nature. The search index lives alongside your application data, eliminating network latency and infrastructure complexity.

**Resource-Constrained Environments:** Edge computing, IoT devices, and environments with limited memory benefit from Tantivy's efficiency. The library's minimal resource requirements make sophisticated search possible in contexts where running Elasticsearch would be impractical.

**Development and Testing:** The embedded approach simplifies development workflows. Your test suite runs against the same search implementation as production, without requiring external services or complex test setup.

**Real-Time Search Requirements:** Applications that need immediate search availability after data changes benefit from Tantivy's fast indexing. New documents become searchable within milliseconds rather than requiring separate batch processing.

The library approach means search availability is tied to your application, there's no separate service to manage, monitor, or keep in sync. This also simplifies development since you don't need to coordinate multiple services during testing.

## The Tantivy Ecosystem

If you need a complete search engine rather than just the library, several projects build on Tantivy to provide higher-level functionality. Just as Lucene serves as the foundation for search servers like [Elasticsearch](https://github.com/elastic/elasticsearch) and Solr, Tantivy has become the foundation for specialized search solutions.

[Quickwit](https://github.com/quickwit-oss/quickwit) builds on Tantivy to create a distributed search engine designed for log management and observability data. Where Elasticsearch might struggle with the volume and velocity of log data, Quickwit uses Tantivy's efficiency plus cloud-native architecture to handle massive log ingestion and search workloads. It's particularly well-suited for applications that need to search through petabytes of time-series data.

[ParadeDB](https://github.com/paradedb/paradedb) takes a different approach, embedding Tantivy directly into [PostgreSQL](https://postgresql.org) as an extension. This gives you modern search capabilities, including BM25 scoring and advanced text analysis, without leaving your existing database infrastructure. ParadeDB bridges the gap between traditional SQL databases and modern search engines, letting you run sophisticated search queries alongside your regular database operations.

These ecosystem projects demonstrate Tantivy's flexibility as a foundation for building complete search solutions.

## Getting Started

A typical Tantivy integration follows a simple pattern: define a schema, create an index, add documents, and search. Here's what a minimal implementation looks like:

```rust
use tantivy::*;

// Define schema and create index
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let content = schema_builder.add_text_field("content", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);

// Add documents
let mut index_writer = index.writer(50_000_000)?;
index_writer.add_document(doc!(
    title => "Sample Document",
    content => "This is sample content for searching."
))?;
index_writer.commit()?;

// Search
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, content]);
let query = query_parser.parse_query("sample")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
```

The example uses an in-memory index for simplicity, but production applications typically use disk-based indexes for persistence. The `writer(50_000_000)` call allocates 50MB for the indexing buffer, Tantivy's memory management ensures this space is used efficiently through its custom allocation strategies. For more examples see the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/).

## Summary

Tantivy represents a fundamental shift in how we think about search infrastructure. Born from one developer's exploration of Rust's potential, it has evolved into a production-ready library that challenges the traditional trade-offs between performance, simplicity, and capability.

By reimagining Lucene's proven architecture with modern systems programming principles (and without the JVM), Tantivy demonstrates that you don't need to choose between powerful search capabilities and operational simplicity. As both a standalone library and the foundation for projects like Quickwit and ParadeDB, Tantivy is reshaping what's possible in search technology.

---
	Apache 2.0	BSL	AGPLv3	ELv2
Familiar	✅	❌	✅	❌
Future-Proof	❌	✅	✅	✅
Community-Driven	✅	❌	✅	❌
Use Case	Postgres Variant
Data Warehouse	Redshift, Greenplum
Search and Analytics	ParadeDB
Vector Search	pgvector
Time Series	Timescale