) for analytical workloads. The tradeoff is that the workload and the runner are baked together. Push past that shape and you're building a new runner.
We could have taken the same path for search. Pick a representative workload, call it TextBench, build a runner around it, ship a leaderboard. That’s a viable approach, but it would also bind us to that initial workload. And once we started expanding beyond it, we’d lose the simplicity that made the approach attractive in the first place. We wanted a runner that we could grow with, not one which constrained us.
So we took the less-trodden path. We built the runner first, and kept benchmark workloads as a separate concern. ParadeDB Benchmarker is backend-agnostic, scenario-agnostic, and dataset-agnostic. Instead of fixing the workload and simplifying the runner around it, we worked to simplify the experience of defining new workloads. Because of this, the same runner can carry many benchmarks, including ones that have nothing to do with search and ones our future users haven't imagined yet.
## How We Think About Execution
Building a runner worth reusing means getting the mechanics right for any benchmark you throw at it. Most of the ways benchmarks go wrong are mechanical, and the same problems show up over and over.
### Environment and Resource Isolation
Every backend needs to start from the same place every time: identical resource constraints, the same setup steps, the same starting state. When you compare multiple backends in one run, they shouldn't be competing with each other for resources either.
### Configuration Capture
A poorly-tuned Postgres will lose to a well-tuned Elasticsearch, and vice versa. Configuration is part of the benchmark whether you acknowledge it or not, and a striking number of published benchmarks don't post their full configs. A runner needs to save the full context alongside the results: backend tunables (PostgreSQL GUCs, Elasticsearch JVM settings, versions), setup scripts (`VACUUM ANALYZE`, force merge), and resource allocation. If you can't reproduce a result, you can't trust it.
### Scenario Variety
Different questions need different test shapes. "How fast is a single query?" is a different scenario from "what happens at 1,000 QPS?", which is different again from "what does latency look like when reads and writes hit the system at the same time?" A runner should support all of these: constant throughput, ramping load, concurrent virtual users, mixed read/write workloads, and the ability to compose them.
### Rich Statistics
Tail latency over many runs is usually where systems actually fail. An average or minimum response time hides almost everything important. You need full latency distributions (P50, P95, P99 at minimum), throughput over time, and error rates, not point-in-time summaries.
### Application-Shaped Queries
A benchmark should send queries the way a real application would: over the network, using the same client libraries production code uses. CLI tools and developer interfaces (psql, mongosh, the Elasticsearch dev console) often take shortcuts that production traffic doesn't have access to: different connection pooling, different transports, optimizations for interactive use. If you benchmark through them, you're measuring the CLI as much as the database.
### Driver Separation
You're probably not an expert in every system you're benchmarking. We know Postgres well; we don't know the best way to construct an Elasticsearch query as well as someone who has been writing them in production for years. The code that talks to each database should be separated from the benchmark scripts, so experts in each system can audit and improve how their database is being driven without touching the benchmark itself. Each backend should run in its own native syntax, with no abstraction layer quietly translating one system into another’s strengths.
## ParadeDB Benchmarker
Once we decided the runner mattered more than any single benchmark, we needed infrastructure that could survive changing workloads, changing databases, and changing questions. The result was [ParadeDB Benchmarker](https://github.com/paradedb/benchmarker). It's open source, and built as a layer on top of [k6](https://k6.io/), Grafana's load testing framework (which is usually known for front-end testing).
k6 is an incredible benchmarking tool, and already solves most of the hard execution-engine problems: virtual user scheduling, request firing, latency measurement, ramping load, and scenario orchestration. It already has a database plugin with [xk6-sql](https://github.com/grafana/xk6-sql), but we needed to be able to work with databases that were outside the SQL camp (mainly Elasticsearch, OpenSearch, and MongoDB).
We also wanted better benchmarking ergonomics: dashboards, run exports, dataset loading, and reproducible local environments. This led to us building the `xk6-database` extension.
Here's how the pieces fit together: backends, datasets, query scripting, and the built-in dashboard.
### Backends and Environments
From the start, we wanted PostgreSQL, Elasticsearch, OpenSearch, ClickHouse, MongoDB Atlas Search, and ParadeDB itself to run under the same framework. But we didn't want to force them through a shared abstraction layer; each backend gets its own driver and runs in its own native syntax.
If you’d like to run locally, we provide Docker Compose profiles that bring up isolated backend environments with a single command.
### Datasets and the Loader
Datasets are treated as separate units: schema, raw data, and per-backend setup scripts live together in one directory. Adding a new dataset means dropping a new directory in `./datasets/` rather than touching the runner.
Data goes in through a separate loader CLI, which handles schema setup, bulk ingest, and post-load steps like VACUUM ANALYZE or force merge automatically.
### A First Benchmark
You write benchmark strategies in standard k6 JavaScript. A minimal single-backend workload targeting a local ParadeDB container would look like this:
```javascript
const backends = db.backends({ backends: ["paradedb"] });
const terms = db.terms(open("./search_terms.json"));
const scenarios = {
paradedb: {
executor: "constant-vus",
vus: 5,
duration: "30s",
exec: "paradedbQuery",
},
};
// Collect metrics from the Docker containers
export const collectMetrics = backends.addDockerMetricsCollector(
scenarios,
"30s",
);
export const options = { scenarios };
export function paradedbQuery() {
backends
.get("paradedb")
.query(
`SELECT id, title FROM documents WHERE content ||| $1 LIMIT 10`,
terms.next(),
);
}
```
Each VU runs `paradedbQuery` in a loop for 30 seconds. The runner times every call, tags it with the backend, and pushes the metrics to the dashboard.
Swap `constant-vus` for `ramping-vus` to ramp load up and down across stages, or `constant-arrival-rate` to fire a fixed number of requests per second regardless of how fast the system answers.
### Composing Scenarios
A single script can compose multiple scenarios across multiple backends. You might run search at 200 QPS, an aggregation at 100 QPS, and an ingest stream at 1,000 rows per second, all hitting the same backend at the same time. A built-in phase timer staggers a second backend so the two runs don't compete for resources.
Comparing raw ingest throughput across databases with different write semantics rarely tells you anything useful; comparing query latency under sustained write pressure tells you a lot.
### Streaming Results
Results stream to the browser as the test runs: latency percentiles per backend (P50, P90, P95, P99), query and ingest throughput, and per-container CPU and memory pulled from Docker.
Backend configs, setup scripts, and query patterns are captured alongside the timeline, so a result can be reproduced and audited later.
The code can be found at [github.com/paradedb/benchmarker](https://github.com/paradedb/benchmarker), with a full walkthrough coming soon.
## What's Next
We've shipped drivers for the backends we get compared to most often. We'd value experts in any of them auditing how Benchmarker drives them, so the comparisons we publish go up against each system at its best. And if there's a backend you'd like to see added, contributions are welcome.
Stay tuned for our upcoming performance blogs, where we'll put Benchmarker to work: real comparisons against other search engines, and the optimization journeys that came from improving those numbers.
---
_Ready to run your own database benchmarks?_ Check out [ParadeDB Benchmarker](https://github.com/paradedb/benchmarker).
---
# From Text to Token: How Tokenization Pipelines Work
> Understanding how search engines transform text into tokens through character filtering, tokenization, stemming, and stopword removal.
Published: 2025-10-10 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/blog/when-tokenization-becomes-token
---
When you type a sentence into a search box, it’s easy to imagine the search engine seeing the same thing you do. In reality, search engines (or [search databases](https://www.paradedb.com/blog/elasticsearch-acid-test)) don’t store blobs of text, and they don’t store sentences. They don’t even store words in the way we think of them. They dismantle input text (both indexed and query), scrub it clean, and reassemble it into something slightly more abstract and far more useful: tokens. These tokens are what you search with, and what is stored in your [inverted indexes](https://www.paradedb.com/learn/search-concepts/full-text-search#indexing) to search over.
Let’s slow down and watch that pipeline in action, pausing at each stage to see how language is broken apart and remade, and how that affects results.
We’ll use a twist on "The quick brown fox jumps over the lazy dog" as our test case. It has everything that makes tokenization interesting: capitalization, punctuation, an accent, and words that change as they move through the pipeline. By the end, it’ll look different, but be perfectly prepared for search.
This isn’t a complete pipeline, just a look at some of the common filters
you’ll find in lexical search systems. Different databases and search engines
expose many of these filters as composable building blocks that you can
enable, disable, or reorder to suit your needs.
The same general ideas apply whether you're using [Lucene](https://github.com/apache/lucene)/[Elasticsearch](https://github.com/elastic/elasticsearch),
[Tantivy](https://github.com/quickwit-oss/tantivy)/[ParadeDB](https://github.com/paradedb/paradedb), or [Postgres](https://postgresql.org) full-text search.
## Filtering Text With Case and Character Folding
Before we even think about breaking our text down we need to think about filtering out anything which isn’t useful. This usually means auditing the characters which make up our text string: transforming all letters to lower-case, and if we know we might have them folding any diacritics (like in résumé, façade, or Noël) to their base letter.
This step ensures that characters are normalized and consistent before tokenization begins. Café becomes cafe, and résumé becomes resume, allowing searches to match regardless of accents. Lowercasing ensures that database matches Database, though it can introduce quirks: like matching Olive (the name) with olive (the snack). Most systems accept this trade-off: false positives are better than missed results. Code search is a notable exception, since it often needs to preserve symbols and respect casing like _camelCase_ or _PascalCase_.
Let’s take a look at how our input string is transformed. We are replacing the capital T with a lower-case one, and also folding the `é` to an `e`. Nothing too surprising here. All of these boxes are interactive, so feel free to put in your own sentences to see the results.
Of course, there are many more filters that can be applied here, but for the sake of brevity, let’s move on.
## Splitting Text Into Searchable Pieces with Tokenization
The tokenization phase takes our filtered text and splits it up into indexable units. This is where we move from dealing with a sentence as a single unit to treating it as a collection of discrete, searchable parts called tokens.
The most common approach for English text is simple whitespace and punctuation tokenization: split on spaces and marks, and you’ve got tokens. But even this basic step has nuances: tabs, line breaks, or hyphenated words like full-text can all behave differently. Each system has its quirks, the default Lucene tokenizer turns `it’s` into `[it's]`, while the Tantivy splits into `[it, s]`[^1].
Generally speaking there are three classes of tokenizers:
1. **Word oriented tokenizers** break text into individual words at word boundaries. This includes simple whitespace tokenizers that split on spaces, as well as more sophisticated language-aware tokenizers that understand non-English character sets[^2] . These work well for most search applications where you want to match whole words.
2. **Partial Word Tokenizers** split words into smaller fragments, useful for matching parts of words or handling compound terms. [N-gram](https://en.wikipedia.org/wiki/N-gram) tokenizers create overlapping character sequences, while edge n-gram tokenizers focus on prefixes or suffixes. These are powerful for autocomplete features and fuzzy matching but can create noise in search results.
3. **Structured Text Tokenizers** are designed for specific data formats like URLs, email addresses, file paths, or structured data. They preserve meaningful delimiters and handle domain-specific patterns that would be mangled by general-purpose tokenizers. These can be essential when your content contains non-prose text that needs special handling.
For our example we will be using a simple tokenizer, but you can also toggle to a trigram (an n-gram with a length of 3) tokenizer below to get a feel for how different the output would be (don't forget you can change the text in the top box to play round).
## Throwing Out Filler With Stopwords
Some words carry little weight. They appear everywhere, diluting meaning: "the", "and", "of", "are". These are stopwords. Search engines often throw them out entirely[^3], betting that what remains will carry more signal.
This is not without risk. In _The Who_, "the" matters. That's why stopword lists are usually configurable[^4] and not universal. In systems which support [BM25](https://www.paradedb.com/learn/search-concepts/bm25) they are often left out altogether because the ranking formula gives less weight to very common terms, but in systems which don't support BM25 (like Postgres tsvector) stopwords are critically important.
Notice how removing stopwords immediately makes our token list more focused? We've gone from ten tokens to eight, and what remains carries more semantic weight.
## Cutting Down to the Root with Stemming
`Jump`, `jumps`, `jumped`, `jumping`. Humans see the connection instantly. Computers don't, unless we give them a way[^5].
Enter stemming. A stemmer is a rule-based machine that chops words down to a common core. Sometimes this happens elegantly, and sometimes it happens brutally. The foundation for most modern English stemming comes from [Martin Porter's 1980 algorithm](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf), which defined the approach that gave search engines consistent rules for stripping suffixes while respecting word structure. Today many stemmers are based on the [Snowball](https://snowballstem.org/) variant.
The results can look odd. `Database` becomes `databas,` `lazy` becomes `lazi.` But that's okay because stemmers don't care about aesthetics, they care about consistency. If every form of `lazy` collapses to `lazi,` the search engine can treat them as one[^6]. There's also lemmatization, which uses linguistic knowledge to convert words to their dictionary forms, but it's more complex and computationally expensive than stemming's "good enough" approach[^7]. {/* * codespell:ignore databas * */}
Here's the final transformation: our tokens have been reduced to their essential stems. `Jumped` becomes `jump,` `lazy` becomes `lazi,` and `database` becomes `databas.` These stems might not look like real words, but they serve a crucial purpose: they're consistent. Whether someone searches for `jumping,` `jumped,` or `jumps,` they'll all reduce to `jump` and match our indexed content. This is the power of stemming: bridging the gap between the many ways humans express the same concept. {/* * codespell:ignore databas * */}
## The Final Tokens
Our sentence has traveled through the complete pipeline. What started as _"The full-text database jumped over the lazy café dog"_ has been transformed through each stage: stripped of punctuation and capitalization, split into individual words, filtered of common stopwords, and finally reduced to stems.
The result is a clean set of eight tokens:
{/* prettier-ignore */}
{/* * codespell:ignore databas * */}
This transformation is applied to any data we store in our inverted index, and also to our queries. When someone searches for "databases are jumping," that query gets tokenized: lowercased, split, stopwords removed, and stemmed. It becomes `databas` and `jump`, which will match our indexed content perfectly. {/* * codespell:ignore databas * */}
## Why Tokenization Matters
Tokenization doesn’t get the glory. Nobody brags about their stopword filter at conferences. But it’s the quiet engine of search. Without it, `dogs` wouldn’t match `dog`, and `jumping` wouldn’t find `jump`.
Every search engine invests heavily here because everything else (scoring, ranking, relevance) depends on getting tokens right. It’s not glamorous, but it’s precise, and when you get this part right, everything else in search works better.
[Get started with ParadeDB](https://docs.paradedb.com/documentation/getting-started/install) to see how modern search databases handle tokenization for you, and please don’t hesitate to [give us a star](https://github.com/paradedb/paradedb).
---
[^1]: Which is better? That depends: `it's` seems more correct and skips storing a useless `s` token, but it wouldn't match on a search for `it`.
[^2]: General purpose morphological libraries like [Jieba](https://github.com/fxsjy/jieba) and [Lindera](https://github.com/lindera/lindera) are often used to provide tokenizers that can deal with Chinese, Korean, Japanese and characters.
[^3]: When we remove stopwords we still keep the original position of each token in the document. This allows positional queries ("find `cat` within five words of `dog`" even though we have discarded words.
[^4]: Lucene and Tantivy both have stopwords off by default, and when enabled for English they use the same default list: `[a, an, and, are, as, at, be, but, by, for, if, in,into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with]`
[^5]: Another method is vector search, which trades lexical stemming for searching over semantic meaning.
[^6]: Words can also be overstemmed, consider `university` and `universe` which both stem to `univers`, but have very different meanings.
[^7]: [Lemmatization](https://en.wikipedia.org/wiki/Lemmatization) uses actual word lists to make sure it only makes real words. To do this it accurately it needs to know the "part of speech" the source word is (noun, verb, etc..).
---
# Customers
# Alibaba Picks ParadeDB to Bring Full Text Search to its Postgres-Based Data Warehouse
> How Alibaba Cloud integrated ParadeDB for full text search in their Postgres-based data warehouse.
Published: 2024-09-24 · Author: Ming Ying
Source: https://www.paradedb.com/customers/case-study-alibaba
---
Overview
## About Alibaba Cloud
Alibaba Cloud is the cloud computing arm of Chinese tech giant Alibaba Group and is the largest Asia Pacific cloud provider.
One of Alibaba Cloud’s products, AnalyticDB for PostgreSQL, is a data warehouse built on Postgres. AnalyticDB for PostgreSQL uses a distributed architecture known as MPP (massively parallel processing) to process petabytes of data with high concurrency and low latency.
Challenge
## Alibaba Needed a Full Text Search Solution That Could Handle Petabytes of Data
Prior to integrating with ParadeDB, AnalyticDB for PostgreSQL’s full text search (FTS) capabilities were limited to Postgres native full text search, which uses the `tsvector` type and GIN indexing.
As large enterprise customers onboarded with AnalyticDB for PostgreSQL, many of them had sophisticated full text search needs that were not met by `tsvector`:
1. Over multi-terabyte tables, `tsvector` performance degraded significantly. Alibaba's customers needed a search engine that could meet or exceed the query latency and throughput of Lucene in high concurrency scenarios over multi-terabyte tables.
2. Without BM25 scoring, `tsvector` ranked results poorly because it does not factor in important variables like term frequency.
3. Common advanced search queries like fuzzy search, relevance tuning, and faceting were not supported.
Solution
## ParadeDB Delivers Full Postgres Compatibility with Zero Infrastructure Overhead
Because AnalyticDB for PostgreSQL is fully Postgres compatible, using an external service like Elasticsearch was not a viable option.
With ParadeDB, AnalyticDB's search indexes are baked directly into Postgres. This means that ParadeDB gives Alibaba Elastic-quality search and all the
features of Postgres, including
1. Full integration with Postgres’ native backup, restore, and high availability functionalities.
2. First-class JOIN support so data does not need to be denormalized.
3. Careful integration with Postgres’ query planner so that complex full text search queries can be inspected and optimized.
ParadeDB's SQL-like query syntax is friendly to our users. Data development
engineers can quickly master and apply it to application systems.
Pang Bo, Product Manager
Outcomes
## ParadeDB Delivers 5X Better Performance per Core Compared to Lucene
Alibaba ran an extensive 60-day evaluation process, during which they benchmarked ParadeDB against Lucene (Elasticsearch's underlying search engine) over a corpus of 100 million Wikipedia documents.
Alibaba measured four criteria: index build time, index size on disk, throughput (queries per second), and latency (round trip time in milliseconds).
### Test Environment
Both Lucene and ParadeDB were run on identical machines with 4 CPU cores, 16GB RAM, and PL1 ESSDs. 4 data nodes were dedicated to ParadeDB and Lucene each.
### Index Build Time
ParadeDB indexed 100 million documents over than twice as quickly as Lucene.
### Index Size
Lucene and ParadeDB indexes consumed a similar amount of disk space.
### Throughput
With 40 concurrent readers, ParadeDB was able to process 5X queries per second compared to Lucene. This difference was magnified as the number of concurrent readers grew.
### Latency
With 40 concurrent readers, ParadeDB's round trip query times were 5X faster than Lucene.
This means that, with ParadeDB, Alibaba is able to meet the business-critical workloads of its most demanding customers.
ParadeDB has excellent performance and throughput in the field of Full Text
Search, helping our clients achieve structured analysis and full-text
retrieval using a pure Postgres engine.
Pang Bo, Product Manager
---
# Bilt Reduces Postgres Query Timeouts by 95% with ParadeDB
> How Bilt improved their Postgres performance with ParadeDB, reducing query timeouts by 95%.
Published: 2025-07-28 · Author: Ming Ying
Source: https://www.paradedb.com/customers/case-study-bilt
---
Bilt Rewards deployed ParadeDB as a logical replica of their primary Postgres,
hosted in GCP. ParadeDB handles BILT’s most expensive faceted and fuzzy search
queries, improving accuracy and reducing query timeouts by 95%.
{/* Desktop Table */}
| 5ms |
300GB |
7K rows/s |
40 QPS |
| P95 Query Latency |
Data Stored |
Write Volume |
Read Volume |
{/* Mobile Vertical List */}
P95 Query Latency
5ms
Data Stored
300GB
Write Volume
7K rows/s
Read Volume
40 QPS
### Overview
Bilt is the housing and neighborhood commerce network that turns everyday housing and neighborhood spend into valuable rewards and benefits. It’s the first program that lets members earn rewards on rent and HOA payments while helping them build a path to homeownership. As of December 2024, Bilt was processing over $36 billion in annual rent and HOA payments.
The Bilt Alliance — developed in partnership with some of the nation’s largest residential owners and operators — is a network of more than 4.5 million homes across the country.
This all-in-one platform enables property managers to monitor resident and payment information across all lease units,
manage loyalty incentive programs, and streamline rent payments.
### The Problem
The platform’s original search functionality — built on Postgres — was lackluster and prevented Bilt from delivering a premium user experience.
The first problem was accuracy of search results. Postgres' limited tokenizers and query types meant that users often couldn’t find what they were looking for. Bilt's system relied on exact token matching that compared the hashed values of the query with the hashed, underlying text data. Bilt also tried Postgres' built-in trigram index, which improved results but remained unsatisfactory.
The second issue was speed. Queries frequently encountered high latency and timeouts, particularly for users who needed to search millions of data records. Operations like filters, `ORDER BY` clauses, and cross-schema `JOIN`s were especially time-intensive. As a stopgap solution, Bilt prohibited engineers from joining across schemas in a single query. Instead, the system orchestrated multiple separate API calls and combined their results on the client side — a workaround which created a tangled mess of spaghetti code.
### The Solution
When Bilt brought in ParadeDB, they saw an instant speedup to their search queries.
This is because ParadeDB's modern query execution engine can efficiently parallelize and execute SQL `WHERE`, `ORDER BY`, `JOIN`, and other clauses. ParadeDB outperformed Bilt's vanilla Postgres setup, which handles these clauses as separate post-processing steps.
Search accuracy also improved, thanks to ParadeDB’s built-in support for fuzzy matching. The results were especially noticeable in Bilt's "property reviews" page,
which implements fuzzy search over property and property manager information.
Using ParadeDB has unlocked the ability to rapidly launch new search
capabilities across our products—something that previously would have
taken weeks of effort.
John King, Backend Engineer
### Results
Bilt chose ParadeDB because it required the least amount of infrastructure changes and engineering effort to unlock these search functionalities. Unlike any other search database, syncing a primary Postgres with ParadeDB requires almost zero operational effort. This is because ParadeDB can use Postgres’ built-in logical replication to ingest data from the primary Postgres.
ParadeDB also enabled Bilt to leverage its existing Postgres knowledge and deploy within its cloud provider, GCP. Implementation required only minor adjustments to existing Postgres search queries, which saved weeks of engineering effort compared to migrating the query logic to an alternate solution.
In Bilt’s Resident Loyalty Portal — where property managers can search for resident, payment, and unit identifiers — query timeouts dropped by 95% and P95 query latency fell by 50%. At the same time, Bilt was able to move the most expensive queries out of their primary Postgres and into ParadeDB.
Ultimately, ParadeDB's robust, performant architecture has enabled Bilt to rapidly launch new search capabilities across its suite of products. Bilt’s success story demonstrates why platforms built on Postgres demand a Postgres-native search solution.
---
# Cofactr Migrates from MongoDB Atlas to ParadeDB
> How Cofactr replaced MongoDB Atlas Search with ParadeDB to unify its search and primary data stores on Postgres in AWS GovCloud.
Published: 2026-04-21 · Author: Ming Ying
Source: https://www.paradedb.com/customers/case-study-cofactr
---
After three high-impact outages on a managed search vendor, Cofactr
consolidated its stack onto Postgres and has run disruption-free on ParadeDB
ever since.
{/* Desktop Table */}
| 10M+ |
7.8K/s |
245 QPS |
| Total Rows |
Peak Write Rate |
Peak Read Rate |
{/* Mobile Vertical List */}
Total Rows
10M+
Peak Write Rate
7.8K/s
Peak Read Rate
245 QPS
ParadeDB lets us unify our search and primary data stores around Postgres
without giving up performance. Its operational stability has made it a
trusted part of our stack.
Noah Trueblood, Director, Data & AI
### Overview
[Cofactr](https://cofactr.com) is a full-service electronics purchasing platform for hardware teams. Customers upload a bill of materials, and Cofactr handles the rest: its AI agents source components, negotiate pricing, and place orders, while Cofactr's warehouses store the parts and assemble custom kits.
Cofactr's platform is also compliant with the International Traffic in Arms Regulations (ITAR), the US government framework governing the export of defense articles and technical data. That means data access must be restricted to US Persons, and the entire stack runs inside AWS GovCloud under strict security controls.
### The Problem
Before ParadeDB, Cofactr's search architecture was split across two database systems. An internal service aggregated and normalized part data from many sources into MongoDB Atlas, using Atlas Search to explore a large parts dataset of structured and unstructured fields. Cleaned records were then synced into a Postgres database behind the customer-facing platform, which relied on a mix of native full-text search and custom query patterns.
Maintaining two search approaches across two database systems made synchronization operationally complex and gave the team less control over search index freshness and consistency than they wanted. Atlas was also fully managed: when issues came up, Cofactr was dependent on MongoDB's support staff to take action. Across 2023 and 2024, three high-impact service disruptions required MongoDB support to escalate and resolve.
MongoDB Atlas for Government also carried a steep markup and ran outside Cofactr's AWS account, which added another cloud environment to the compliance surface area.
Cofactr came to realize that its workload and query patterns were a natural fit for Postgres, and that consolidating around a Postgres-native search engine would simplify the stack without compromising on capabilities.
### The Solution
Cofactr's technical requirements were clear: tight control over replication from the primary database to the search database, low replication latency, and the ability to absorb bursts of traffic without degrading. The team evaluated MongoDB Atlas, OpenSearch, and native Postgres full-text search (`tsvector`/`tsquery`/`ts_rank`) alongside ParadeDB.
MongoDB Atlas offered low read latency, but replication lag from the primary database was opaque and outside Cofactr's control, and the GovCloud markup plus extra compliance surface area made it a poor fit. OpenSearch offered more control and an expansive feature set but added operational complexity to the stack. Native Postgres full-text search could run inside RDS, but initial benchmarking suggested that index and query performance wouldn't meet Cofactr's long-term needs, and `ts_rank` was too limited.
ParadeDB stood out for two reasons. First, consolidating search into Postgres collapsed the stack around a single database system the team already knew well. Second, by using `dblink` to pull data from the primary database into ParadeDB, the time from ingesting new data into the primary database to having it searchable in ParadeDB was small.
### Results
Cofactr deployed ParadeDB as a standalone database in AWS GovCloud — the same environment as the rest of its ITAR-compliant infrastructure — to power freeform search across its electronic components catalog.
The ParadeDB database today spans roughly 11 million rows. Cofactr deliberately keeps this lean: it only contains data for parts customers are likely to access, and only the fields that go into the search index. When additional context is needed, the team enriches the result set at query time.
In the last 90 days, the highest peak write rate was roughly 7,800 rows/s on a 60-second rollup. Peak read throughput hit 245 QPS on a 30-second rollup. ParadeDB has absorbed these bursts without issue.
The operational story is where the difference is starkest. Since adopting ParadeDB, Cofactr has had zero major disruptions.
### ParadeDB
ParadeDB is the simplest way to add Elastic-quality search to your Postgres. If you have a use case around large scale search in Postgres,
we'd [love to chat](mailto:hello@paradedb.com).
---
# INSA Strasbourg Powers New Research Database with ParadeDB
> How INSA replaced Elasticsearch with ParadeDB for better performance and simplicity.
Published: 2024-08-29 · Author: Ming Ying
Source: https://www.paradedb.com/customers/case-study-insa
---
Overview
## About INSA Strasbourg
INSA Strasbourg (National Institute of Applied Sciences) is a leading engineering institution located in France. INSA Strasbourg's engineering sciences research division consists of 200 scientists who primarily conduct health and environmental research.
Problem
## INSA Strasbourg Needed a Reliable Data Store and Search Engine to Back Cutting-Edge Scientific Research
INSA Strasbourg was building an internal system that could be used by researchers to search over a massive corpus of scientific publications. Their dataset was 1.5 terabytes, consisting of 66 million documents alongside metadata like publication dates, numeric, and boolean flags.
Their engineers outlined five requirements for their document retrieval system:
1. **BM25 support**: The BM25 algorithm is the established standard for full text retrieval.
2. **SQL support**: Since INSA Strasbourg's engineers were familiar with SQL, they wanted to avoid query languages that strayed from SQL.
3. **Python compatibility**: INSA Strasbourg wanted to easily interact with the chosen solution from Python code.
4. **Kubernetes compatibility**: This would make it easy for INSA Strasbourg to deploy their solution.
5. **Open source**: INSA Strasbourg's engineering philosophy is centered around the adoption of open source, which enables them to experiment with software on their own infrastructure without long approval processes.
Solution
## ParadeDB Wins on Operational Simplicity and Postgres Compatibility
We wanted to avoid going with the obvious Elasticsearch route because we
wanted to avoid JVM overhead. We also wanted an extensible ecosystem.
Iliass Ayaou, Research Engineer
In addition to ParadeDB, INSA Strasbourg also evaluated Elasticsearch, Manticore Search, Meilisearch, Typesense and Quickwit.
After an initial evaluation process, INSA Strasbourg narrowed its options down to ParadeDB and Elasticsearch. However, the operations team was concerned about Elastic. They wanted to avoid the overhead that Elastic’s JVM (Java virtual machine), which is notoriously resource-intensive and difficult to manage, would add to their infrastructure.
Unlike Elasticsearch, ParadeDB is built on Postgres, a battle-tested relational database that INSA Strasbourg already knew how to operate. This meant that INSA Strasbourg could leverage the entire Postgres ecosystem, including other Postgres extensions and Postgres-compatible tooling. INSA Strasbourg trusted that the Postgres ecosystem would continue to grow and evolve in the coming years.
Outcomes
## INSA Strasbourg Supercharges Their Document Search System
We have been more than satisfied with ParadeDB's BM25 performance for
search.
Iliass Ayaou, Research Engineer
Integrating ParadeDB with INSA Strasbourg's existing on-prem Kubernetes cluster took a matter of days. From an operational perspective, INSA’s database administrator was already experienced with configuring and scaling Postgres. For engineers, ParadeDB integrated seamlessly with the existing code base, since interacting with ParadeDB from any programming language is the same as interacting with a Postgres database.
Since moving ParadeDB into production, INSA Strasbourg has supercharged their document search experience. The engineering team has been extremely satisfied with ParadeDB’s sub-100ms query speeds, BM25 scoring, and hybrid search capabilities. Using ParadeDB saves several hours of engineering time per week, since their engineers know how to use Postgres and don’t need to learn a new query language.
---
# ParadeDB Powers Modern Treasury's Core UI and Search APIs
> How Modern Treasury replaced legacy indexes and roll-up tables with ParadeDB, eliminating query timeouts and increasing write speeds by 2-3x.
Published: 2026-03-16 · Author: Ming Ying
Source: https://www.paradedb.com/customers/case-study-modern-treasury
---
ParadeDB powers Modern Treasury's core UI and search APIs, eliminating query
timeouts and increasing write speeds by 3x.
{/* Desktop Table */}
| 10TB |
500M+ |
50K/min |
3x |
| Data Stored |
Largest Table Rows |
Peak Write Rate |
Write Speed Increase |
{/* Mobile Vertical List */}
Data Stored
10TB
Largest Table Rows
500M+
Peak Write Rate
50K/min
Write Speed Increase
2-3x
Built on the solid foundation of Postgres, ParadeDB provides the good
parts of Elasticsearch we actually need, without the infrastructure and
hassle of data synchronization.
Matt Higgins, Head of Engineering
### Overview
Modern Treasury is the payments operations platform for businesses that move money at scale. The platform offers APIs, dashboards, and workflow tooling to manage the full lifecycle of payments — from initiation and approval through reconciliation and reporting. Modern Treasury's customers span financial services, fintech, and enterprise, relying on the platform to process high volumes of transactions with accuracy down to the penny.
### The Problem
As Modern Treasury's customer base grew, so did the complexity of powering its core product UIs. The platform's flexible filter interfaces allow customers to search, sort, and aggregate across payments, transactions, and ledger entries using any combination of dozens of attributes. With many customers storing over 100M records, the permutations of possible query plans made it impossible to cover every case with traditional B-tree indexes.
Our [Top K in Postgres]() blog post explores why B-tree indexes don't
generalize well to workloads that combine many filters with flexible sorting.
The team endlessly added B-tree indexes to cover different customer search patterns, but there was no reasonable way to index every permutation without causing write performance degradation. As a workaround, Modern Treasury built summary roll-up tables to support basic aggregations — a common pattern for workloads where Postgres struggles at high record counts. However, these roll-up tables introduced data delays and added significant complexity to the system.
The result was a long tail of edge-case query timeouts. Approximately 5% of queries failed entirely due to unoptimized query plans, and the engineering team was stuck in a reactive cycle of adding indexes to patch individual failures.
### The Solution
Modern Treasury needed a backend that could power subsecond responses across its core UIs without requiring a different index for every query pattern. The requirements boiled down to filtering records based on any of dozens of attributes, sorting by any attribute, and returning the top results. Crucially, the system had to be accurate to the penny.
The team evaluated Elasticsearch but concluded that the data consistency challenges and deployment complexity it would introduce were not worth the trade-off. While Elasticsearch is mature and feature-rich, the value of Postgres replication and SQL-native infrastructure offered by ParadeDB — as a native Postgres extension — outweighed the bells and whistles.
Modern Treasury deployed ParadeDB as a logical replica of their primary Postgres across 7 clusters. This meant the team could leverage its existing Postgres knowledge and infrastructure. No new data pipelines, no sync jobs, and no new query language. Implementation required only minor adjustments to existing SQL queries.
### Results
Switching to ParadeDB allowed us to replace as many as 26 B-tree and GIN
indexes over one table with a single ParadeDB index. This has massively
improved our write throughput and simplified our operational overhead.
Eric Barendt, Software Engineer
ParadeDB resolved an entire class of edge-case query timeouts for Modern Treasury. The team no longer needs to think about the many permutations of query plans that previously caused failures. The 5% of queries that previously timed out are now subsecond.
ParadeDB's performance on counting and summing operations enabled Modern Treasury to add real-time counts and aggregations to user interfaces that previously could not support them. The platform now powers grouped aggregations and transaction-to-payment reconciliation workflows that were previously too expensive to run inline.
Perhaps most significantly, ParadeDB allowed Modern Treasury to drop a large number of legacy B-tree indexes that the UI no longer depended on. With fewer indexes competing for write throughput, raw write speeds increased by 2-3x on the platform's most important database tables.
Modern Treasury now uses ParadeDB across its payments, transactions, and ledger products, servicing both customer-facing search APIs and internal analytic processes.
### ParadeDB
ParadeDB is the simplest way to add Elastic-quality search to your Postgres. If you have a use case around large scale search in Postgres,
we'd [love to chat](mailto:hello@paradedb.com).
---
# Sweetspot Unifies Hybrid Search on Postgres with ParadeDB
> How SweetSpot leveraged ParadeDB for their search infrastructure.
Published: 2024-06-09 · Author: Ming Ying
Source: https://www.paradedb.com/customers/case-study-sweetspot
---
## Sweetspot Overview
[Sweetspot](https://sweetspot.so) is an AI-driven platform for government procurement and contracting. Sweetspot streamlines the entire contracting process from
opportunity identification to bid/no-bid decision making to proposal development.
## The Challenge
Sweetspot needed a hybrid search — the combination of full-text and vector search — solution to power its contract search and AI-assisted proposal drafting products.
Since Sweetspot was already using Postgres, it was important that their search solution interoperated seamlessly with Postgres. Before using ParadeDB, Sweetspot
experienced significant friction synchronizing large volumes of data between Postgres and external databases like Elasticsearch.
This synchronization involved writing and maintaining custom connectors, which were prone to bugs and data inconsistencies. Additionally,
many of the vector databases that Sweetspot evaluated had unreliable infrastructure which caused frequent downtime in their application.
## The Solution
Sweetspot needed a search solution that was easy to integrate, compatible with existing Postgres architecture, and offered reliable performance over
15 million documents and 2 million vectors.
After an extensive review, Sweetspot selected ParadeDB. With ParadeDB, Sweetspot unified its data stack on Postgres, which removed the need to manage a
separate database or use custom connectors. ParadeDB's advanced full-text and vector search capabilities enabled Sweetspot to build powerful hybrid search
experiences without additional operational complexity.
## Outcomes
Integrating with ParadeDB took less than a week since Sweetspot was already using Postgres and could reuse its existing code.
Post-integration, Sweetspot observed a 50% overall reduction in query latency. Hybrid search queries achieved a P99 latency of 50 milliseconds and had improved
precision and recall compared to pure vector search.
ParadeDB also simplifies the process of achieving FedRAMP and NIST compliance since Postgres has robust authentication, access control, logging, and disaster recovery built in.
If you think that ParadeDB can solve a use case for your organization, we invite you to [contact us](mailto:sales@paradedb.com).
---
# Terrapin Finance Cuts Bond Search Latency by 25x with ParadeDB
> How Terrapin Finance replaced pg_trgm with ParadeDB to power sub-200ms full-text search across 1.1 billion rows of fixed-income data.
Published: 2026-03-17 · Author: Ming Ying
Source: https://www.paradedb.com/customers/case-study-terrapin-finance
---
Terrapin Finance deployed ParadeDB inside their self-hosted Postgres to power
sub-200ms full-text search across 1.1 billion rows of bond data — a 25x
improvement over their previous vanilla Postgres setup.
{/* Desktop Table */}
| 1.3TB |
1.1B |
8K QPS |
25x |
| Data Stored |
Total Rows |
Peak Read Rate |
Faster Queries |
{/* Mobile Vertical List */}
Data Stored
1.3TB
Total Rows
1.1B
Peak Read Rate
8.26K QPS
Faster Queries
25x
ParadeDB is the most mature search solution for Postgres available, and it
delivered exactly what we needed.
Juozas Norkus, CTO
### Overview
[Terrapin Finance](https://terrapinfinance.com) is a fixed-income data provider specializing in reference and pricing data APIs for US municipal, government, and corporate bonds. The company serves institutional clients who depend on accurate, real-time bond data for trading and analysis. Terrapin is also building a municipal credit intelligence platform that combines document search, credit risk analytics, and AI-driven research tools.
### The Problem
Terrapin's bond reference data APIs serve data derived from massive source tables spanning over a billion rows. A key requirement from their clients is the ability to filter bonds by issuer name, a natural-language field that doesn't lend itself to exact-match lookups. On the product side, Terrapin's credit risk analysis web UI features a type-ahead search bar where users enter a unique identifier or bond issuer name and see matching results as they type.
For a financial data platform, search responsiveness is non-negotiable: analysts expect instant feedback as they narrow down results across millions of securities.
The team initially built this functionality with `pg_trgm`, Postgres' built-in trigram matching extension. While trigram indexes are a common starting point for fuzzy text search in Postgres, they struggled under the weight of Terrapin's data. Search queries routinely took 5 seconds to return results, far too slow for a type-ahead interface. This was also painful for API consumers running batch queries against issuer names.
### The Solution
Terrapin's requirements were straightforward: they needed a search solution that was fast and that didn't require introducing new infrastructure. The team's entire stack was built on Postgres, and they had no interest in bolting on a separate search engine with its own cluster, sync pipeline, and query language.
When evaluating options, ParadeDB stood out as the most mature and complete Postgres-native alternative to `pg_trgm`. Because ParadeDB runs as an extension inside Postgres, Terrapin could deploy it directly into their existing self-hosted instance — no sidecars, no replication targets, and no new operational burden. The migration was surgical: swap out the trigram queries for ParadeDB's full-text search syntax, and the existing tables, schemas, and deployment pipeline stayed exactly as they were.
### Results
The impact was immediate. Search queries that previously took 3 to 5 seconds dropped to 100–200ms, a 25x improvement in query execution time. The type-ahead search bar in Terrapin's credit risk UI went from sluggish to instantaneous, and API consumers saw the same gains across their programmatic queries.
Beyond the core bond data APIs, Terrapin leveraged ParadeDB to build something more [ambitious](https://www.linkedin.com/posts/miguel-jaques-8026ba205_blazingly-fast-municipal-document-search-activity-7363610999263834112-Ys9S?utm_source=share&utm_medium=member_desktop&rcm=ACoAABvY5gUBFGMKpKGjG8yDwIIVOWaX4xRk0_w): a full-text document search engine spanning millions of official statements, corporate actions, and financial disclosures from the municipal bond market. This capability is a cornerstone of Terrapin's upcoming municipal intelligence platform, powering the AI-based research tools their clients will use to surface insights from the vast landscape of municipal bond documentation.
### ParadeDB
ParadeDB is the simplest way to add Elastic-quality search to your Postgres. If you have a use case around large-scale search in Postgres,
we'd [love to chat](mailto:hello@paradedb.com).
---
# Learn
# pgvector Limitations
> Understand the practical limits of pgvector: index cost, filtered search performance, missing features, and operational tradeoffs.
Published: 2026-05-14 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/postgresql/pgvector-limitations
---
[pgvector](/learn/postgresql/what-is-pgvector) brings [vector similarity search](/learn/search-concepts/vector-search) into PostgreSQL with HNSW and IVFFlat indexes, distance operators, and tight integration with the rest of your relational schema. That integration is its main strength, but pgvector still has practical limits around scale, filtering, index maintenance, and retrieval features. Some limits come from running inside PostgreSQL, while others come from pgvector's index design and feature set.
Knowing these limits up front helps you size hardware, plan your index strategy, and decide which parts of the workload belong elsewhere.
## Index Build Time and Memory Cost
HNSW indexes give good recall and query latency when the working set fits in memory, but query performance degrades as the corpus grows beyond cache. Large HNSW graphs involve more random memory access, larger candidate sets, and a longer p99 tail once pages come from disk instead of RAM.
They are also slow to build and expensive to keep resident. A 10M-row table with 1,536-dimensional embeddings can take hours to index on a single core, and the resulting index is typically tens of gigabytes. For acceptable query latency the index needs to live in `shared_buffers` or the OS page cache. Once it spills to disk, p99 latency climbs sharply. [Tuning pgvector Performance](/learn/postgresql/tuning-pgvector) covers the parameters that control build cost and memory footprint.
IVFFlat indexes build faster but require a representative sample for k-means clustering. If you build the index before loading most of your data, the centroids end up misplaced and recall suffers.
## Filtered Search Performance
Combining vector search with metadata filters is one of pgvector's weakest spots. Given a query like:
```sql
-- Find the 10 most similar documents for a specific tenant
SELECT id FROM documents
WHERE tenant_id = 42
ORDER BY embedding <=> $1
LIMIT 10;
```
pgvector walks the HNSW graph and then applies the filter. If `tenant_id = 42` matches only a small fraction of rows, the index may surface candidates that all fail the filter, causing recall to drop unless the scan goes deeper. Iterative scans help by continuing through more of the HNSW graph or IVFFlat lists until enough filtered rows are found or a scan limit is reached. That improves recall, but it also increases CPU, memory use, and tail latency.
The tradeoff is workload-dependent. Moderate filters can often be handled with `hnsw.iterative_scan`, `ivfflat.iterative_scan`, and higher candidate settings. Very selective filters may still need partitioning, partial indexes, or a different retrieval plan because the index spends too much work finding rows the filter will reject. pgvector supports partial indexes and table partitioning, but it does not have the same automatic filtered-ANN planning that some dedicated vector engines provide.
## Dimensionality and Vector Size Limits
The indexable `vector` type tops out at 2,000 dimensions. The `halfvec` type (16-bit floats) extends this to 4,000 dimensions and roughly halves storage, `bit` vectors extend to 64,000 dimensions through binary quantization, and `sparsevec` stores up to 1,000 nonzero elements. Using these types usually requires choosing the storage format up front, casting in your queries, or building expression indexes for quantized representations. pgvector supports binary quantization, but it does not provide native product quantization or automatic scalar quantization like many dedicated vector engines.
Each `vector(1536)` value occupies about 6KB on disk before row overhead. At 100M rows, raw vector storage alone exceeds 600GB before any index is built.
Indexes also skip `NULL` vectors, and cosine indexes skip zero vectors. If those values can appear in your data, account for them during ingestion or recall testing.
## Update and Delete Behavior
HNSW indexes do not reclaim space when rows are deleted: tombstoned nodes stay in the graph until the index is rebuilt. Workloads with frequent re-embeddings or churn cause index bloat and a slow drop in recall. `REINDEX CONCURRENTLY` is available but expensive on large indexes.
IVFFlat indexes have a related problem. Centroids are fixed at build time, so as data drifts the clustering becomes less representative and recall falls until you rebuild.
## Missing Features
Several capabilities that come standard in dedicated vector systems are absent or require extra work in pgvector:
- **[Hybrid search](/learn/search-concepts/hybrid-search)**: no native [BM25](/learn/search-concepts/bm25). Combining [`tsvector`](/learn/search-in-postgresql/full-text-search) ranking with vector distance requires manual score fusion, typically via [Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion).
- **Reranking**: no built-in cross-encoder or learned ranker integration.
- **Multi-vector documents**: no native late-interaction document abstraction. ColBERT-style retrieval needs multiple rows, multiple vector columns, or application-side aggregation.
- **Query-time quantization**: no automatic product or scalar quantization at query time. You choose the storage type, cast explicitly, or maintain expression indexes for quantized search.
- **Embedding quality**: pgvector stores and searches vectors, but it cannot fix poor embeddings, weak chunking, or missing document context.
Extensions like [pg_search](/learn/search-in-postgresql/bm25) add BM25 to
PostgreSQL alongside pgvector, which is one way to build hybrid search without
leaving the database.
## Operational Tradeoffs
Because pgvector runs inside PostgreSQL, vector workloads share CPU, memory, and I/O with your OLTP traffic. A burst of vector queries can starve transactional queries of buffer cache and parallel workers, and the reverse is also true. Vertical scaling is the primary lever; native sharding of vector data across nodes is not supported.
A dedicated replica is a common compromise. It keeps vector reads away from the primary's OLTP workload while preserving PostgreSQL as the system of record. Physical replicas usually have low lag but still receive the primary's write and WAL volume. Logical replicas can decouple more of the read workload, but they usually add more replication lag and operational complexity. Very large corpora may still belong in a separate engine.
## Summary
pgvector is the right choice when your dataset fits comfortably on one PostgreSQL instance, your filter selectivity is moderate, and you value transactional consistency with the rest of your relational data. The edges to watch are index build cost, filtered query performance, missing hybrid and reranking primitives, and the absence of horizontal scaling. Knowing where those edges sit makes it easier to decide whether pgvector is sufficient on its own, whether to pair it with [full-text search](/learn/search-concepts/full-text-search) for [hybrid retrieval](/learn/search-concepts/lexical-vs-semantic), or whether part of the workload belongs in a dedicated engine.
---
# Tuning pgvector Performance
> Learn how to optimize pgvector performance for efficient vector similarity search in PostgreSQL.
Published: 2026-05-14 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/postgresql/tuning-pgvector
---
[pgvector](/learn/postgresql/what-is-pgvector) ships with useful defaults, but [vector similarity search](/learn/search-concepts/vector-search) is expensive enough that a few well-chosen parameters make the difference between a query that returns in milliseconds and one that scans the whole table. Tuning pgvector means picking the right index, sizing it for your data, and matching the runtime knobs to the recall you actually need.
The parameters that move performance most are index type, HNSW and IVFFlat settings, memory configuration, filtered-search controls, and query-plan verification.
## Choosing an Index
Without an index, pgvector falls back to a sequential scan that computes the distance for every row. That is fine for tens of thousands of vectors and untenable beyond that. pgvector offers two approximate indexes:
- **HNSW** (Hierarchical Navigable Small Worlds) builds a layered graph. It usually gives stronger recall and query latency, at the cost of slower builds and more memory.
- **IVFFlat** clusters vectors into lists with k-means. It builds faster and uses less memory, but recall is sensitive to the data present at build time.
HNSW is the right default for most workloads. Reach for IVFFlat only when build time or memory is the binding constraint.
## Reranking Candidate Sets
Not every pgvector workload needs an ANN index. A common pattern is to use full-text search or another structured query to retrieve a small candidate set, then run exact vector scoring over those rows in memory. For example, you might take the top 100 lexical matches and rerank them by embedding distance.
This works well when another retrieval step has already narrowed the search space. It avoids HNSW or IVFFlat build cost, keeps recall exact within the candidate set, and is often simpler for [hybrid search](/learn/search-concepts/hybrid-search) pipelines. It does not replace an ANN index when the vector search itself must scan a large corpus.
## Tuning HNSW
HNSW has two build-time parameters and one query-time parameter:
```sql
CREATE INDEX ON items
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
- **`m`** (default 16): maximum connections per node. Higher values improve recall and increase index size and build time. 16 is a strong default; raise to 24 or 32 only if recall is insufficient.
- **`ef_construction`** (default 64): candidate list size during the build. Larger values improve graph quality and slow the build. 64 to 200 is a useful range.
- **`hnsw.ef_search`** (default 40): candidate list size at query time. This is the main recall/latency dial you actually turn in production.
```sql
SET hnsw.ef_search = 100;
```
Increase `ef_search` until recall meets your target, then stop. The cost grows roughly linearly with this value.
## Tuning IVFFlat
IVFFlat has one build parameter and one query parameter:
```sql
CREATE INDEX ON items
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
```
- **`lists`**: number of k-means clusters. A common starting point is `rows / 1000` up to one million rows, and `sqrt(rows)` beyond that.
- **`ivfflat.probes`** (default 1): clusters searched at query time. Higher values improve recall at proportional cost. A reasonable starting point is `sqrt(lists)`.
```sql
SET ivfflat.probes = 10;
```
IVFFlat indexes should be built after representative data is loaded. Building
on an empty or small table produces poor clusters that hurt recall even after
more data arrives.
## Memory and Build Settings
Index builds are bounded by `maintenance_work_mem`. If the build does not fit, it spills and slows down sharply. For non-trivial vector tables, raise this before building:
```sql
SET maintenance_work_mem = '4GB';
SET max_parallel_maintenance_workers = 4;
```
For long builds, track progress through PostgreSQL's index build view:
```sql
SELECT phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```
At query time, performance depends on the index staying in cache. Size `shared_buffers` and OS cache so the working set of the HNSW graph stays resident. A vector index that spills to disk has a long latency tail no parameter will fix.
## Tuning Filtered Searches
Filtered vector searches are harder than unfiltered vector queries because the index may find close vectors that do not match your `WHERE` clause. If too many candidates are filtered out, the query can return fewer rows than requested or miss relevant matches.
Start by enabling iterative scans. This tells pgvector to keep scanning the index after filtering removes candidates:
```sql
SET hnsw.iterative_scan = strict_order;
SET hnsw.ef_search = 100;
```
Use `strict_order` when exact distance ordering matters. Use `relaxed_order` when you can accept slightly looser ordering for better speed. HNSW supports both modes; IVFFlat supports `relaxed_order`:
```sql
SET ivfflat.iterative_scan = relaxed_order;
SET ivfflat.probes = 10;
```
If recall is still too low, raise the candidate count before raising scan limits. For HNSW, `hnsw.ef_search` controls the candidate list size. For IVFFlat, `ivfflat.probes` controls how many lists are searched.
When you need to cap the worst-case cost, use scan limits:
```sql
SET hnsw.max_scan_tuples = 20000;
SET hnsw.scan_mem_multiplier = 2;
SET ivfflat.max_probes = 100;
```
Tune these settings against measured recall and latency. Higher values recover more filtered results, but they also increase CPU, memory, and tail latency. If a filter is extremely selective, partitioning or a partial index may work better than asking one large ANN index to scan deeper.
## Choosing the Distance Operator
Use the operator that matches how your embeddings were generated:
```sql
SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 10; -- cosine
SELECT id FROM items ORDER BY embedding <#> $1 LIMIT 10; -- inner product
SELECT id FROM items ORDER BY embedding <-> $1 LIMIT 10; -- L2
```
The index must be created with the matching opclass (`vector_cosine_ops`, `vector_ip_ops`, or `vector_l2_ops`). A mismatch produces a sequential scan with no warning. For normalized embeddings (OpenAI, Cohere), cosine and inner product are equivalent; inner product is slightly faster.
## Verifying the Index Is Used
Always check with `EXPLAIN`:
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM items
ORDER BY embedding <=> $1
LIMIT 10;
```
Look for an `Index Scan` on the vector index. A `Seq Scan` means the planner ignored it, usually because the operator does not match the opclass, the `LIMIT` is missing, or a `WHERE` predicate forced a different plan.
## Summary
Most pgvector performance comes from three decisions: use HNSW unless you have a reason not to, tune `ef_search` (or `probes`) to the recall you need, and give the index enough memory to stay resident. Everything else is refinement.
For the failure modes that no amount of tuning will fix, see [pgvector Limitations](/learn/postgresql/pgvector-limitations).
---
# What is pgrx?
> Learn about pgrx, the Rust framework for building PostgreSQL extensions with memory safety and modern tooling.
Published: 2026-02-24 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/postgresql/what-is-pgrx
---
PostgreSQL extensions have traditionally been written in C, which gives full access to the database's internals but requires manual memory management and careful handling of allocation contexts. A bug in a C extension can crash the entire database server. [pgrx](https://github.com/pgcentralfoundation/pgrx) is a Rust framework that changes this: you write Rust functions annotated with macros, and pgrx handles the FFI bindings, type conversions, SQL generation, and memory management. Memory errors, null pointer dereferences, and data races are caught at compile time rather than in production.
## How pgrx Works
pgrx sits between your Rust code and PostgreSQL's C extension API. When you annotate a function with `#[pg_extern]`, pgrx generates the C-compatible wrapper function, the SQL `CREATE FUNCTION` statement, and the type conversion logic at compile time.
A minimal example:
```rust
use pgrx::prelude::*;
// Tells PostgreSQL this is a loadable extension
pgrx::pg_module_magic!();
// Exposed to PostgreSQL as: CREATE FUNCTION hello(name text) RETURNS text
#[pg_extern]
fn hello(name: &str) -> String {
format!("Hello, {}!", name)
}
```
After building and installing the extension, this function is callable from SQL:
```sql
SELECT hello('world');
-- Returns: Hello, world!
```
The `#[pg_extern]` macro handles converting PostgreSQL's `text` datum to a Rust `&str` on input and a Rust `String` back to a `text` datum on output. This type mapping extends to integers, floats, arrays, JSON, composite types, and custom types you define.
## Memory Safety Across the Boundary
PostgreSQL manages memory through a hierarchy of memory contexts: short-lived contexts for individual expressions, transaction-scoped contexts, and long-lived contexts for cached data. C extensions must allocate in the correct context and avoid holding references across context resets, or risk use-after-free bugs.
pgrx maps these contexts to Rust's ownership model. When a pgrx function returns a value, the framework allocates the result in the appropriate PostgreSQL memory context. Rust's borrow checker prevents you from holding references to data that PostgreSQL might free. If your Rust code panics, pgrx catches it at the FFI boundary and converts it into a PostgreSQL error, aborting the current transaction cleanly instead of crashing the server.
For expected error conditions, pgrx functions can return a `Result`. The `Err` variant is converted into a PostgreSQL `ERROR`, rolling back the current transaction:
```rust
#[pg_extern]
fn safe_divide(a: f64, b: f64) -> Result {
if b == 0.0 {
// Becomes a PostgreSQL ERROR, rolling back the transaction
Err("division by zero")
} else {
Ok(a / b)
}
}
```
## Development Workflow
pgrx includes a CLI tool, `cargo-pgrx`, that manages the full lifecycle:
```bash
# Install the pgrx toolchain
cargo install cargo-pgrx
# Download and compile the last five PostgreSQL versions for testing
cargo pgrx init
# Create a new extension project
cargo pgrx new my_extension
# Compile, install, and open a psql session with the extension loaded
cargo pgrx run pg18
# Run tests against a real PostgreSQL instance
cargo pgrx test
```
`cargo pgrx run` compiles the extension, installs it into a local PostgreSQL instance, and drops you into a `psql` session where the extension is already loaded. `cargo pgrx test` runs your test suite against a real PostgreSQL instance, not a mock, so tests exercise the actual extension behavior including SQL generation and type conversions.
```rust
#[cfg(any(test, feature = "pg_test"))]
#[pg_schema]
mod tests {
use pgrx::prelude::*;
#[pg_test]
fn test_hello() {
// Runs inside a real PostgreSQL transaction
let result = Spi::get_one::("SELECT hello('world')");
assert_eq!(result, Ok(Some("Hello, world!".to_string())));
}
}
```
pgrx tests start a real PostgreSQL instance and execute queries against it.
This catches issues that unit tests with mocked interfaces would miss, like
incorrect SQL generation or type conversion errors.
## Beyond Simple Functions
pgrx supports the full range of PostgreSQL extension capabilities:
- **Custom types** via `#[derive(PostgresType)]`, which generates the input/output functions and type definitions
- **Custom operators** via `#[pg_operator]`, letting you define new SQL operators backed by Rust functions
- **Set-returning functions** that yield rows one at a time using Rust iterators
- **SPI access** for executing SQL queries from within extension functions
- **Aggregate functions** with custom state types and transition logic
- **Background workers** for long-running processes managed by PostgreSQL
This coverage means most extensions that could be written in C can be written in pgrx instead, with the same level of PostgreSQL integration.
## Extensions Built with pgrx
Several production extensions use pgrx:
- [ParadeDB](https://github.com/paradedb/paradedb): search and analytics extension for PostgreSQL, built on Tantivy
- [pgrag](https://github.com/neondatabase/pgrag): RAG pipeline extensions from Neon
- [pg_graphql](https://github.com/supabase/pg_graphql): GraphQL query engine embedded in PostgreSQL, from Supabase
- [pg_jsonschema](https://github.com/supabase/pg_jsonschema): JSON Schema validation as a PostgreSQL function
pgrx has also been adopted by major cloud and data platforms, including Microsoft, Amazon, Databricks, and Snowflake.
## History and Governance
pgrx was created by [Eric Ridge](https://github.com/eeeebbbbrrrr), who had been building PostgreSQL extensions in C since version 8.0. The framework grew out of his work on [ZomboDB](https://github.com/zombodb/zombodb), an Elasticsearch-backed indexing extension (now deprecated) — after years of writing C extensions, he built pgrx to present PostgreSQL's internals through Rust's idioms instead. The project was originally named "pgx" and [renamed to pgrx](https://github.com/pgcentralfoundation/pgrx/issues/1106) in April 2023.
The project lives under the [PgCentral Foundation](https://pgcentral.org/), a 501(c)(3) nonprofit, though day-to-day development is still led by Ridge and other core maintainers.
## When to Use pgrx
pgrx is a good fit when you need to extend PostgreSQL with logic that benefits from Rust's performance or safety characteristics. Compute-heavy functions, custom index types, integrations with Rust libraries, and any extension where a crash would be unacceptable are natural candidates. You may also choose to use PGRX when you benefit from Rust's ecosystem, one of the reasons ParadeDB choose Rust was because the amazing Tantivy library existed to help with full-text search.
For simpler logic (data validation, lightweight transformations, glue code), PL/pgSQL may be sufficient and easier to deploy, since it doesn't require compiling native code.
## Summary
pgrx lets you write PostgreSQL extensions in Rust instead of C, providing memory safety, automatic type conversion, SQL generation, and an integrated test workflow. It covers the full PostgreSQL extension API (functions, types, operators, aggregates, SPI, and background workers) while keeping the compile-time safety guarantees that make Rust extensions more reliable than their C equivalents. If your extension needs to be both safe and fast, pgrx is how you get there.
---
# What is pgvector?
> Learn about pgvector, the PostgreSQL extension that adds vector data types and similarity search for embeddings and AI workloads.
Published: 2026-05-14 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/postgresql/what-is-pgvector
---
Modern AI applications produce embeddings: fixed-length arrays of floating-point numbers called vectors that represent the semantic content of text, images, or audio. Similar inputs map to vectors that sit near each other in this space, so retrieval becomes a [vector search](/learn/search-concepts/vector-search) problem rather than a keyword match. For example, a vector search for "reset my password" can match a document about "account recovery" even if the exact words do not overlap, unlike [lexical search](/learn/search-concepts/lexical-vs-semantic), which matches terms directly.
[pgvector](https://github.com/pgvector/pgvector) is a PostgreSQL extension that stores and searches those embeddings with a native `vector` data type, distance operators, and approximate nearest neighbor indexes. It does not create embeddings itself; you generate them with an embedding model, then insert the resulting vectors into PostgreSQL.
## How pgvector Works
pgvector registers a new column type with PostgreSQL. You declare the dimensionality up front, insert vectors as arrays, and query them with distance operators:
```sql
CREATE EXTENSION vector;
CREATE TABLE documents (
id bigserial PRIMARY KEY,
content text,
-- A 1536-dimensional embedding
embedding vector(1536)
);
INSERT INTO documents (content, embedding)
VALUES ('PostgreSQL is a relational database', '[0.12, -0.04, ...]');
```
Four distance operators cover the common similarity measures. The right one depends on how your embeddings were generated:
- `<=>` for cosine distance, which compares vector direction and is the usual choice for normalized text embeddings
- `<#>` for negative inner product, which favors larger dot products and is often equivalent to cosine distance when vectors are normalized
- `<->` for L2 (Euclidean) distance, which compares raw geometric distance and is useful when magnitude carries meaning
- `<+>` for L1 (taxicab) distance, which sums absolute coordinate differences and is less common for modern text embeddings
A nearest-neighbor query is an `ORDER BY` on one of these operators with a `LIMIT`:
```sql
SELECT id, content
FROM documents
ORDER BY embedding <=> '[0.10, -0.02, ...]'::vector -- input embedding
LIMIT 5;
```
Because the result is just rows from a regular table, you can combine vector search with `WHERE` filters, joins, transactions, and any other SQL feature in the same query. The operator in the query must match the operator class on the index, or PostgreSQL cannot use that index for nearest-neighbor search.
## Indexes for Approximate Nearest Neighbor Search
Without an index, pgvector runs an exact scan: every row is compared against the query vector. That is accurate but scales linearly with table size. For larger tables, pgvector provides two approximate nearest neighbor (ANN) index types that trade a small amount of recall for much faster queries.
**IVFFlat** partitions vectors into lists using k-means clustering. At query time, only the most relevant lists are scanned:
```sql
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
In pgvector, IVFFlat should be built after data is loaded, since its clusters depend on the existing vector distribution.
**HNSW** builds a multi-layer graph where each node links to nearby neighbors. Searches walk the graph from coarse to fine layers:
```sql
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
HNSW indexes are slower and more memory-intensive to build than IVFFlat, but they typically deliver better recall-versus-latency tradeoffs and can be built incrementally as rows arrive.
Each index type supports separate operator classes for L2, inner product, cosine, and L1 distance. The operator class in the index must match the operator used in queries, or PostgreSQL will fall back to a sequential scan. For a deeper walk-through of parameter selection and workload tuning, see [Tuning pgvector Performance](/learn/postgresql/tuning-pgvector).
Recall in pgvector is tunable at query time. For IVFFlat, set
`ivfflat.probes`; for HNSW, set `hnsw.ef_search`. Higher values inspect more
candidates and improve recall at the cost of latency.
## Filtered Vector Search
Real applications rarely search the entire corpus. You usually want the nearest vectors that also match some metadata filter, such as a tenant, category, or date range:
```sql
SELECT id, content
FROM documents
WHERE tenant_id = 42
AND created_at > now() - interval '30 days'
ORDER BY embedding <=> $1
LIMIT 10;
```
With approximate indexes, pgvector applies filters after the index scan chooses candidate rows. If a filter removes too many candidates, the query can return fewer than `LIMIT` rows.
Recent pgvector versions support iterative index scans through `hnsw.iterative_scan` and `ivfflat.iterative_scan`. When enabled, pgvector keeps scanning more of the HNSW graph or IVFFlat lists until it finds enough filtered rows or reaches a configured scan limit. This improves filtered vector search without requiring a separate query pattern, but it still trades more work for better recall. The [limitations](/learn/postgresql/pgvector-limitations) guide covers the tradeoffs in more detail.
## Storage and Quantization
A single 1536-dimensional `vector` column stores 6 KB per row at full precision. For very large tables, pgvector supports lower-precision types that reduce storage and memory usage:
- `halfvec` stores each component as a 16-bit float, up to 4,000 dimensions
- `bit` stores binary vectors up to 64,000 dimensions for Hamming and Jaccard distance
- `sparsevec` stores up to 1,000 nonzero elements, useful for sparse embeddings
Both IVFFlat and HNSW can be built on these lower-precision or sparse types, so you can fit larger indexes in memory while keeping the same nearest-neighbor query shape. For very large indexes, binary quantization plus reranking can reduce the working set while preserving final result quality.
## When to Use pgvector
pgvector is a strong fit when your embeddings live alongside relational data you already store in PostgreSQL. Keeping vectors in the same database avoids dual-write problems, lets you join across tables, and reuses existing backup, replication, and access control. It is widely supported by managed Postgres providers and often used as the retrieval layer in [RAG](/learn/search-concepts/retrieval-augmented-generation) systems.
Because vectors capture semantic meaning but miss exact terms like product codes or filenames, most production retrieval pipelines pair pgvector with a [full-text search](/learn/search-in-postgresql/full-text-search) index and combine the two results with [hybrid search](/learn/search-concepts/hybrid-search). The [lexical vs. semantic](/learn/search-concepts/lexical-vs-semantic) article covers the tradeoffs in more detail.
A dedicated vector database may make more sense when you need billions of vectors, very high write throughput on the vector path alone, or features such as native multi-vector ranking and learned indexes that pgvector does not yet provide.
## Summary
pgvector turns PostgreSQL into a capable vector database by adding a `vector` type, distance operators, and IVFFlat and HNSW indexes for approximate nearest neighbor search. It integrates cleanly with the rest of SQL, so embeddings can be filtered, joined, and transacted on like any other column. For many applications that combine structured data with AI-generated embeddings, pgvector provides production-quality similarity search without running a separate system.
---
# What is Block WAND?
> Learn about Block-Max WAND, the dynamic pruning algorithm that accelerates top-k document retrieval in search engines.
Published: 2026-02-27 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/block-wand
---
Block-Max WAND (BMW) is a dynamic pruning algorithm that accelerates Top K document retrieval in search engines, the kind of query you express in SQL as `ORDER BY score DESC LIMIT k`. It works with scoring functions like [BM25](/learn/search-concepts/bm25) by dividing inverted index posting lists into fixed-size blocks, precomputing the maximum possible score within each block, and using those scores to skip entire blocks of documents that cannot appear in the final results. In most implementations, the algorithm returns exactly the same results as exhaustive scoring while doing a fraction of the work.
The technique was introduced by [Shuai Ding and Torsten Suel in 2011](https://dl.acm.org/doi/10.1145/2009916.2010048) and has since been adopted by [Apache Lucene](https://github.com/apache/lucene), [Tantivy](https://github.com/quickwit-oss/tantivy), and other search engines as a core query optimization.
## From WAND to Block WAND
Block WAND builds on two earlier pruning techniques:
- **MaxScore** ([Turtle & Flood, 1995](https://www.sciencedirect.com/science/article/abs/pii/030645739500020H)) stores a single global maximum score per term across the entire posting list. It uses this to classify terms as "essential" or "non-essential" and skip documents that cannot beat the current threshold.
- **WAND** ([Broder et al., 2003](https://dl.acm.org/doi/10.1145/956863.956944)) introduces a pivot mechanism that sorts term pointers by document ID and calculates whether the combined upper-bound scores of aligned terms can exceed the threshold. If not, it advances past those documents without scoring them.
Both techniques rely on a single global upper bound per term: the highest score that term produces across the entire index. This bound can be inflated by a single outlier document, making it loose and limiting how aggressively documents can be skipped.
Block WAND is not a different algorithm; it is WAND with finer-grained upper bounds. Instead of one maximum score for a posting list of millions of documents, it records the maximum score for each block of typically 128 documents. These tighter bounds plug directly into WAND's existing pivot mechanism, allowing far more aggressive pruning.
## How Block WAND Works
### Index Structure
During indexing, each posting list is divided into fixed-size blocks. For each block, the index stores:
- The **block-max score**: the highest score any document in this block can contribute for this term
- The **maximum document ID** in the block
These values are stored alongside the compressed posting data with minimal overhead, typically 5-8% additional index size.
### Query Processing
When you execute a query for the Top K results, the algorithm maintains a result heap and a dynamic threshold: the minimum score needed to enter the current Top K. Processing follows these steps:
1. **Sort term pointers** by their current document ID, smallest first.
2. **Find the pivot document.** Walk through the sorted pointers, accumulating block-max scores until their sum exceeds the threshold. The document where the sum first crosses the threshold is the pivot, the earliest document that could possibly score high enough to enter the Top K.
3. **Check block bounds**: for each term's posting list, look up the block-max score for the block containing the pivot document. If a term's block-max is too low to help the pivot exceed the threshold, skip that term's iterator ahead to its next block. Different terms may skip different blocks.
4. **Score the document**: if the combined block-max bounds survive the check, fully score the pivot document. If it exceeds the threshold, insert it into the Top K heap and raise the threshold.
5. **Repeat** until all posting lists are exhausted.
As higher-scoring documents are found, the threshold rises, and pruning becomes more aggressive; the algorithm accelerates as it runs.
### Why Per-Block Bounds Matter
Consider a common term like "the" with a posting list spanning millions of documents. Its global maximum score might be 3.5 (from one short document where "the" appears disproportionately often). In most blocks of 128 documents, the actual maximum contribution is much lower, perhaps 0.8. Using 0.8 instead of 3.5 as the upper bound allows the algorithm to skip far more blocks.
This effect compounds across multiple query terms. When every term has a tighter bound, the combined upper-bound estimate drops substantially, and entire regions of the index are eliminated without decompressing a single document ID.
## Example: Block WAND in Action
Suppose you search for **"machine learning"** and want the top 10 results. The index has separate posting lists for each term, and each posting list is divided into blocks. The algorithm walks both lists in parallel, checking whether each block pair can produce a combined score above the current threshold.
```text
threshold = 0.0
Block 1 (docs 1–128):
"machine" block-max = 4.1, "learning" block-max = 3.8, sum = 7.9
-> sum > threshold, score documents
-> threshold rises to 5.3
Block 2 (docs 129–256):
"machine" block-max = 1.2, "learning" block-max = 2.4, sum = 3.6
-> sum < threshold (3.6 < 5.3), skip entire block
Block 3 (docs 257–384):
"machine" block-max = 3.9, "learning" block-max = 4.5, sum = 8.4
-> sum > threshold, score documents
-> threshold rises to 6.8
Block 4 (docs 385–512):
"machine" block-max = 0.9, "learning" block-max = 1.8, sum = 2.7
-> sum < threshold (2.7 < 6.8), skip entire block
```
Notice that each term has its own block-max score, and the algorithm sums them to decide whether a block is worth scoring. As the threshold rises, more blocks are pruned.
## Performance
Benchmarks across academic and industry implementations show consistent results:
- **3x to 7x speedup** on standard term queries compared to exhaustive scoring ([Grand et al., ECIR 2020](https://link.springer.com/chapter/10.1007/978-3-030-45442-5_3))
- In practice, Block WAND typically scores only 10-40% of matching documents, with the rest pruned at the block level
- Greatest impact on **disjunctive (OR) queries** with high-frequency terms, where the most documents can be skipped
- Benefits scale with collection size; larger indexes see proportionally more pruning
The algorithm only applies to Top K queries by score, not exhaustive counts or Top K on fast fields, since it relies on a threshold from the result heap to make pruning decisions. The speedup is greatest when queries contain common terms with long posting lists, where most blocks can be skipped.
## Scoring Constraints
Block WAND's pruning is safe only when the scoring function is additive across terms and each term's contribution is non-negative. This holds for [BM25](/learn/search-concepts/bm25), TF-IDF, and most classical retrieval functions. Scoring models that produce negative term contributions or non-additive scores (such as some neural rerankers) cannot use Block WAND for safe pruning, since a negative contribution could make a skipped document score higher than the upper bound suggests.
## Adoption in Search Engines
- **Apache Lucene 8.0** (2019) adopted Block WAND for Top K disjunctive queries, bringing the optimization to [Elasticsearch](https://github.com/elastic/elasticsearch) and [Solr](https://github.com/apache/solr)
- **[Tantivy](https://github.com/quickwit-oss/tantivy)** implements Block WAND for boolean queries with `should` clauses, inherited by [Quickwit](https://github.com/quickwit-oss/quickwit) and [ParadeDB](https://github.com/paradedb/paradedb). Tantivy's block-max values are not always exact: in rare cases a block-max may underestimate the true maximum, which can cause a marginally lower-scoring document to be returned instead of the true top result
## Summary
Block WAND accelerates Top K retrieval by replacing loose global score bounds with tight per-block bounds, allowing the search engine to skip large portions of the index without sacrificing result quality. It processes the same query with a fraction of the scoring work, and the speedup grows as the threshold tightens during execution. For search engines handling disjunctive queries over large collections, it is one of the most effective optimizations available.
---
# What is BM25?
> Learn about BM25, the ranking algorithm that powers modern search engines.
Published: 2024-10-05 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/bm25
---
[Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a **ranking function** used in search engines to score documents by relevance to a query. The BM is short for `best matching`, and `25` is the version of the function that worked best in testing. It improves on earlier methods like [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) by accounting for document length and diminishing returns from repeated terms. It powers Apache Lucene-based search systems like [Elasticsearch](https://github.com/elastic/elasticsearch) and [Solr](https://github.com/apache/solr), and Tantivy-based search systems like [ParadeDB](https://github.com/paradedb/paradedb).
Even though it's over 30 years old, BM25 remains the default ranking algorithm in most search engines because it’s simple, explainable, and performs well in practice.
## How BM25 Works
It's important to note that BM25 is what’s known as a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) retrieval function. It doesn’t look at term order, phrase structure, or proximity: only at which words appear and how frequently they occur.
BM25 starts by breaking the search query into individual terms. For each term, it scores how well every document matches based on three main signals:
- **Term frequency (TF):** documents that mention a query term more often score higher.
- **Inverse document frequency (IDF):** rare terms are weighted more than common ones.
- **Document length normalization:** shorter documents are preferred to long ones that mention the term in passing.
Each term contributes its own score, and BM25 sums those scores to produce an overall relevance value for the document.
In simplified form, BM25 can be thought of as:
$$
\text{score} = \sum \text{IDF} \times \text{adjusted\_term\_frequency}
$$
where the adjusted term frequency reduces the impact of very frequent terms and normalizes for document length.
The full [BM25
formula](https://en.wikipedia.org/wiki/Okapi_BM25#The_ranking_function) is
slightly more complex. It includes tunable parameters $$k₁$$ (how quickly term
frequency saturates) and $$b$$ (how strongly document length is normalized).
These controls make BM25 adaptable across different datasets and document
types.
## Why BM25 Works Well
BM25 has three key strengths which have allowed it to stay relevant for so long:
- **Simplicity:** easy to understand and implement.
- **Efficiency:** fast enough to run in real-time over large datasets.
- **Explainability:** each part of the score can be traced to a clear factor.
Because of these traits, BM25 remains the baseline for relevance in modern search systems.
## When to Use BM25
BM25 excels in retrieval workloads where specific keywords carry significant information. For example:
- Searching for a brand name in a product catalog
- Looking up a stock ticker in a trading app
- Matching a merchant name in a credit card transaction ledger
- Finding names in a legal document
- Finding a diagnosis code in a medical report
- Retrieving files to feed into an LLM in a RAG application
Because BM25 relies on simple term statistics, it is accurate, consistent and extremely fast. That makes it a popular choice in applications where low query latency and keyword relevance are critical.
## Example: Scoring a Query with BM25
Imagine a user searches for **"inverted index"**.
BM25 first breaks the query into two terms: **“inverted”** and **“index.”**
Then, for each document, it scores how well those terms match:
| **Document** | **Length** | **Term counts** | **TF–IDF signals** | **Relative Score** |
| ------------ | ---------- | -------------------------- | ----------------------------------------------------------- | ------------------ |
| **Doc A** | 60 words | inverted: 1
index: 1 | rare → high IDF
common → low IDF | **Higher** |
| **Doc B** | 200 words | inverted: 1
index: 10 | rare → high IDF
common → low IDF
+ length penalty | **Lower** |
Even though _Doc B_ repeats “index” more times, BM25 favors _Doc A_ because:
1. “inverted” is a rare term with higher IDF,
2. both documents mention it once (similar TF), and
3. _Doc A_ is shorter, so its matches are more concentrated.
**BM25 rewards focused, relevant mentions of rare terms, not repetition in long text.**
## Summary
BM25 remains the standard ranking algorithm for keyword search.
It combines frequency, rarity, and length normalization into a single scoring model that quickly produces balanced and predictable results.
It’s the foundation that modern lexical hybrid search systems continue to build on.
---
# What is Faceted Search?
> Learn about faceting and faceted search, how they relate to aggregations, and when to use them.
Published: 2024-10-31 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/faceting
---
**Faceted search**, also known as faceting, is a [search technique](https://en.wikipedia.org/wiki/Faceted_search) that organizes [full-text search](/learn/search-concepts/full-text-search) results into filterable categories, known as _facets_. Each facet represents a property of the data (such as color, brand, or price range) and lets users narrow down large result sets by selecting multiple values simultaneously. It’s the foundation of interactive filtering in e-commerce, document search, and analytics platforms.
Faceting is an interesting problem because while it can in theory be computed client-side or across multiple queries, search responses typically need to be fast. That means facets are usually calculated efficiently within the same query that produces the search results in a single index pass.
## What is a Facet?
Mathematically, a facet is one of the faces of a many-sided object. In computer science, facets describe how information about an object is organized and grouped by properties or attributes. A facet refers to a category or dimension of data (for example, color), while a facet value represents a specific option within that category (for example, brown or blue).
Each facet value is adjoined with an integer denoting the number of filtered items that belong to that value. For example, a facet might be represented as the following:
```jsx
brown(10);
blue(7);
white(4);
```
Facets are additive: a user might check multiple facet values (e.g. `brown shoes` and `cheap shoes`) to cut down search results. Facets can categorize sets of values (e.g. `brown`, `blue`, `white` etc.) and also buckets of ranges (e.g. `$0 to $99`, `$100 to $999`, `+$1,000` etc.).
In practice, facets travel with Top K searches. Showing facet counts alongside the first N results lets users refine immediately, understand distribution at a glance, and avoid extra round-trips. Executing ranking and aggregation in the same pass keeps latency low and proportional to matches, not the total document set.
## Faceted Search Implementations
At a small scale, faceted search can be computed client-side or with simple SQL queries. For example, a web application could fetch all matching records and compute facet counts in JavaScript, or run a few GROUP BY queries in SQL. But both approaches fall apart as datasets grow.
A **client-side implementation** requires transferring every matching record to the client so counts can be calculated locally. Even a modest search returning tens of thousands of results would quickly become hard to work with.
A **naïve SQL implementation** would need multiple passes over the same data, first running the search query and then running the faceting on the full result set before returning the top K records.
In production systems, faceting is usually performed in tandem with the search query itself. Rather than running separate aggregation queries, search engines compute facet counts during the same index scan that retrieves matching documents. Search databases and search engines such as Elasticsearch, Solr, and ParadeDB do this by maintaining auxiliary data structures (often columnar or bitmap-based) that allow category counts to be aggregated without rereading all documents. Each facet value is mapped to a compact integer ordinal or bitset, so counts can be computed in memory as part of the query execution path.
Faceted search also operates over the entire result set even when the current query is limited. Even if only a handful of documents are displayed, facet counts must reflect all matching documents. Because it’s impossible to predict what combinations of filters users will apply, facets can’t be precomputed in advance. Instead, indexes are organized so that these aggregations can be calculated efficiently (often in logarithmic time) during a single index pass.
## Faceted Search Examples
In Elasticsearch users can perform faceting alongside a search query with the [aggregations API](https://www.elastic.co/docs/reference/aggregations):
```json
POST /products/_search
{
"size": 10,
"query": {
"multi_match": {
"query": "wireless headphones",
"fields": ["title", "description"]
}
},
"aggs": {
"price_histogram": {
"histogram": {
"field": "price",
"interval": 50
}
}
}
}
```
In ParadeDB the same thing can be accomplished using the `pdb.agg` function as a window.
```sql
SELECT
*,
pdb.score(id) AS bm25_score,
pdb.agg('{"histogram": { "field": price, "interval": 50 }}') OVER () AS facets
FROM products
WHERE
title ||| 'wireless headphones' OR
description ||| 'wireless headphones'
ORDER BY bm25_score
LIMIT 10
```
## What are Common Use Cases for Faceting?
Faceting becomes essential whenever users need to explore and filter large datasets while maintaining search relevance. The key scenarios fall into several categories:
### E-commerce and Product Discovery
E-commerce platforms like Amazon and Wayfair are the most visible examples, but the pattern extends beyond retail. Users searching through thousands of products need to quickly narrow results by brand, price range, customer ratings, and availability. Faceting allows exploratory browsing—users can start with a broad search like "laptop" and progressively refine by processor type, screen size, and budget without losing the relevance ranking of their original query.
### Content and Knowledge Management
Organizations with large content repositories face similar challenges. A support agent searching 50,000 help tickets needs to filter by product area, severity level, and date range while maintaining the relevance of their keyword search. Documentation platforms like Confluence allow users to find relevant articles while filtering by team ownership, document type, or last modified date. Without faceting, users would either get overwhelming result sets or lose important context by switching to pure categorical browsing.
### Analytics and Data Exploration
Analytics platforms like PostHog or Mixpanel enable users to search and filter event data across multiple dimensions simultaneously. A product manager investigating user behavior can search for specific actions while faceting by device type, geographic region, and session length. The combination of search relevance with dimensional filtering allows for both targeted investigation and serendipitous discovery.
### Specialized Search Applications
Recruiting platforms like LinkedIn demonstrate faceting in talent search, where recruiters need to find candidates matching job descriptions while filtering by location, experience level, and specific skills. Legal research platforms allow lawyers to search case law while filtering by jurisdiction, date range, and legal area. Real estate platforms let buyers search for properties while filtering by price, location, and amenities.
In all these cases, faceting solves the fundamental problem of information overload: how to maintain the power of full-text search while giving users intuitive ways to navigate large result sets.
## Summary
Faceted search organizes search results into filterable categories called facets, allowing users to narrow down results by selecting specific attributes like color, price, or other characteristics.
Faceted search is natively available in search-first solutions like ParadeDB or Elasticsearch.
---
# What is Full-Text Search?
> Learn about full-text search capabilities, how they work, and when to use them in your applications.
Published: 2024-10-06 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/full-text-search
---
[Full-text search](https://en.wikipedia.org/wiki/Full-text_search) (FTS) is designed to find documents based on the words they contain rather than just structured fields or metadata. It brings intelligence to text retrieval by analyzing and indexing language, allowing queries to return results ranked by relevance instead of simple matches.
This capability transforms a datastore into a search engine. Rather than forcing users to remember exact values or field names, full-text search lets them describe what they're looking for in natural terms, and still find it.
Full-text search operates through two core components: **indexing** and **querying**. Indexing transforms raw text into structured, searchable data, while querying provides flexible ways to retrieve and rank relevant documents.
## Indexing
Indexing is the foundation of full-text search performance. Instead of storing entire documents as opaque blocks that must be scanned linearly, indexing decomposes them into smaller, searchable parts and rebuilds them into structures optimized for retrieval.
### Text Analysis Pipeline
The indexing process begins with text analysis, where raw text is transformed into a form that computers can index efficiently. The pipeline typically includes:
1. **Tokenization:** Splitting text into discrete terms or tokens, such as words or phrases.
2. **Normalization:** Lowercasing terms, removing punctuation and stop words, and often applying stemming so that `running`, `ran`, and `run` are treated as related.
3. **Indexing:** Storing each token alongside the documents that contain it in a structure called an inverted index.
Systems like [Elasticsearch](https://github.com/elastic/elasticsearch),
[Solr](https://github.com/apache/solr),
[ParadeDB](https://github.com/paradedb/paradedb), and [PostgreSQL
ts_vector](https://www.postgresql.org/docs/current/datatype-textsearch.html)
manage this entire pipeline automatically, allowing developers to add
high-quality search to their applications without reinventing the underlying
mechanics.
### Inverted Indexes
The foundation of full-text search is the [inverted index](https://en.wikipedia.org/wiki/Inverted_index), a data structure that enables very fast lookups by reversing the usual document-to-terms relationship.
Instead of asking _"what terms appear in this document?"_, an inverted index can be used to answer _"which documents contain this term?"_
The core structure is a dictionary where each term maps to a postings list, describing every document that contains the term.
Each entry in a postings list typically includes:
- The **document ID**
- The **term frequency** (how often the term appears in that document)
- Optionally, the **positions** where the term occurs (enabling phrase and proximity queries)
#### Example: Building an Inverted Index
Consider indexing two simple documents:
| **ID** | **Text** |
| ------ | ---------------------------- |
| 1 | "PostgreSQL supports search" |
| 2 | "Search engines use indexes" |
After tokenization and normalization, the inverted index looks like this (using an [english language tokenizer](https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/simple) with no stemming):
| **Term** | **Postings List** |
| ---------- | ------------------------------------------ |
| postgresql | `[(1, tf=1, pos=[0])]` |
| supports | `[(1, tf=1, pos=[1])]` |
| search | `[(1, tf=1, pos=[2]), (2, tf=1, pos=[0])]` |
| engines | `[(2, tf=1, pos=[1])]` |
| indexes | `[(2, tf=1, pos=[3])]` |
When you search for "search", the system looks up the term and finds documents `[1, 2]` without scanning any text.
#### Performance Benefits
Inverted indexes provide:
- **Fast retrieval:** $$O(1)$$ term lookup followed by efficient postings list intersection
- **Space efficiency:** Each unique stemmed term stored only once, regardless of corpus size
- **Flexible scoring:** Rich statistics support advanced ranking algorithms
Modern implementations include optimizations like compression, skip lists, and index partitioning to handle large-scale deployments efficiently.
## Querying
Once text is indexed, querying provides flexible ways to retrieve and rank relevant documents. When a user submits a query, the same analysis steps used during indexing are applied to their search terms, which are then looked up in the index.
Modern search engines provide powerful query builder APIs that allow developers to programmatically construct complex search queries and integrate business logic directly into their search
operations. This pushes as much work as possible down into the query engine (where it can happen the most efficiently), so the client doesn’t have to post-filter or reprocess results.
### Query Types and Processing
Different query types use inverted indexes in different ways:
- **Boolean queries** ("PostgreSQL AND search") use set intersection of postings lists to find documents that match logical conditions.
- **Phrase queries** use position information to ensure terms appear consecutively or in proximity to each other. For example, you could search for an exact match of "search engines" or for when "search" is within five words of "indexes".
- **Ranked queries** apply algorithms like [BM25](/learn/search-concepts/bm25) using term frequency and document frequency statistics to give the most relevant results first.
### Advanced Query Features
Full-text search provides much more than the ability to find exact matches. It introduces a set of features that make search both flexible and forgiving:
- **Fuzzy matching** corrects small errors and typos, so a query for `databse` still returns `database`. {/* * codespell:ignore databse * */}
- **Boolean operators** give power users fine-grained control with logic like `python AND (api OR web)`, filtering results to match complex conditions.
- **Field weighting** acknowledges that some parts of a document matter more than others—a match in a title often carries more importance than one in the body text.
- **Proximity queries** find words that are located close to other words (`dogs within 5 words of cats`).
- **Faceting** enables users to filter search results by categories or attributes (like "brand," "price range," or "date") while maintaining text-based relevance scoring.
## When Full-Text Search Shines
Full-text search becomes essential whenever large volumes of unstructured or semi-structured text need to be queried quickly and intuitively. It thrives in scenarios where users think in terms of language, not schema.
Some common examples include:
### Content and Communication Search
- **Content discovery:** Searching across articles, blog posts, or documentation pages
- **Communication search:** Navigating chat histories, emails, or support tickets
- **E-commerce:** Product catalogs where users search by description, features, or brand names
### Technical Search
- **Code search:** Finding functions, variables, or patterns across codebases
- **Log search:** Analyzing application logs, error messages, and system events for debugging and monitoring
### AI and Machine Learning
- **Retrieval Augmented Generation (RAG):** Powering AI systems that need to find relevant context from knowledge bases before generating responses
In all of these workloads, users aren't just filtering, they're searching. They know what they want to find conceptually but not exactly where it's stored. Full-text search bridges that gap by turning text into a structured, ranked representation of meaning.
## When Full-Text Search Is Not Enough
While full-text search excels at lexical matching, it has limitations when users search by meaning rather than keywords. Full-text search uses these methods to capture linguistic variation, but its focus remains lexical: it matches words, not meanings.
Consider these scenarios where traditional full-text search struggles:
- **Semantic similarity:** A search for "automobile" won't match documents about "cars" unless explicit synonyms are configured
- **Cross-language search:** Keywords in one language won't match semantically equivalent content in another language
- **Domain-specific terminology:** Technical concepts may be expressed differently across documents but carry the same meaning
For these use cases, **vector search** provides a complementary approach. Vector search represents text as high-dimensional vectors that capture semantic meaning, allowing searches based on conceptual similarity rather than exact keyword matches.
Many modern search systems use **hybrid approaches** that combine both full-text and vector search:
- Full-text search for precise keyword matching and boolean logic
- Vector search for semantic similarity and concept-based retrieval
- Ranking algorithms that blend both approaches for optimal relevance
## Summary
Full-text search brings lexical structure to unstructured data. By analyzing, indexing, and ranking text, it makes information findable through natural language rather than rigid filters or exact matches.
It remains one of the most effective tools for large-scale text retrieval: fast, reliable, and deeply optimized for keyword relevance. Its success lies in bridging how people think about information with how computers store it.
---
# What is Hybrid Search?
> Learn about hybrid search systems that combine lexical and semantic search for comprehensive results.
Published: 2024-10-06 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/hybrid-search
---
Hybrid search combines [full-text search](/learn/search-concepts/full-text-search) and [vector search](/learn/search-concepts/vector-search) to deliver both precise keyword matching and semantic understanding in a single query. This approach solves the fundamental trade-off between finding exact terms and discovering conceptually related content, making it particularly valuable for applications where users express queries in diverse ways but expect comprehensive, relevant results.
## Why Combine Search Methods?
Each search approach has distinct strengths and weaknesses that make combination valuable:
- [Full-text search](/learn/search-concepts/full-text-search) relies on exact keyword matching and struggles when users express concepts differently than documents. A search for "automobile" won't find documents about "cars" unless explicit synonyms are configured. Technical terms, product codes, and precise specifications work well, but natural language queries often fail to match semantically related content.
- [Vector search](/learn/search-concepts/vector-search) understands semantic relationships but may miss critical exact matches. Searching for a specific product model like "iPhone 15 Pro" might return results about "premium smartphones" that are semantically similar but factually incorrect. Product codes, technical specifications, and precise terminology can be overlooked in favor of conceptually related but wrong results.
## Combining Rankings
The core challenge in hybrid search is combining scores from different ranking algorithms. [Full-text search](/learn/search-concepts/full-text-search) systems like [BM25](/learn/search-concepts/bm25) produce relevance scores based on term frequency and document statistics, while [vector search](/learn/search-concepts/vector-search) generates similarity scores from mathematical distance calculations. These scores operate on different scales and can't be directly combined.
[Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion) (RRF) is one way of solving this problem. Instead of trying to normalize different scoring systems, RRF assigns scores based on where documents appear in each ranked list. Documents that rank highly in both search methods receive the highest combined scores.
For a full discussion of RRF, including why it works, how it can be weighted,
and a working SQL example, see our [Reciprocal Rank
Fusion](/learn/search-concepts/reciprocal-rank-fusion) article.
## Query Processing in Hybrid Systems
Hybrid search systems process queries through processing pipelines that can execute simultaneously:
1. **Text processing:** The query undergoes traditional text analysis (tokenization, stemming, and stopword removal) before matching against the inverted index using algorithms like [BM25](/learn/search-concepts/bm25)
2. **Vector generation:** The same query is converted into a dense vector embedding using machine learning models, then compared against document embeddings using similarity metrics like cosine distance
3. **Result fusion:** Both result sets are combined using techniques like [Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion) to produce a unified ranking
This parallel processing allows hybrid systems to capture different aspects of relevance. A query for "wireless headphones with noise cancellation" will match documents containing those exact terms while also finding content about "Bluetooth earbuds with ANC" that shares semantic meaning but uses different terminology. {/* * codespell:ignore anc * */}
## Implementation Considerations
Building effective hybrid search requires careful attention to several technical factors:
- **Index synchronization** ensures that text and vector indexes remain consistent as documents are added, updated, or deleted. Inconsistencies between indexes can cause confusing results where documents appear in one search method but not the other.
- **Embedding model selection** significantly impacts semantic search quality. Models trained on domain-specific data typically outperform general-purpose models for specialized applications. A medical search system benefits from embeddings trained on medical literature rather than general web content.
- **Fusion weight tuning** requires experimentation with real user queries and feedback. The relative weighting between text and vector search depends on user behavior. Technical users might prefer exact keyword matching, while casual users benefit from semantic flexibility.
## When Hybrid Search Excels
Hybrid search provides clear advantages in applications requiring both precision and semantic understanding:
- **E-commerce search:** Find exact product names and models while understanding descriptive queries like "comfortable running shoes"
- **Technical documentation:** Match precise API names and error codes while finding conceptually related troubleshooting content
- **Enterprise search:** Locate specific documents by title or ID while discovering semantically related information across different departments
- **Customer support:** Match exact error messages while understanding natural language problem descriptions
- **Academic research:** Find papers with specific methodologies while discovering conceptually related work using different terminology
In each case, users benefit from both the precision of keyword matching and the flexibility of semantic understanding.
## Limitations
While hybrid search offers superior relevance, it introduces significant complexity:
- **Computational overhead:** Running both [full-text](/learn/search-concepts/full-text-search) and [vector search](/learn/search-concepts/vector-search) simultaneously increases computational requirements and storage needs
- **Implementation complexity:** Managing two different indexing systems and tuning fusion parameters requires engineering effort
- **Query latency:** Processing queries through multiple systems and combining results can increase response times
- **Cost implications:** Maintaining both keyword indexes and vector embeddings increases infrastructure costs
For many applications, a single search method may provide sufficient results with much lower complexity. Hybrid search makes sense when the improved relevance justifies the additional overhead.
## Summary
Hybrid search represents the current state-of-the-art in information retrieval by combining the precision of [full-text search](/learn/search-concepts/full-text-search) with the semantic understanding of [vector search](/learn/search-concepts/vector-search).
While the added complexity and computational overhead are significant, if they are tuned well hybrid systems can deliver superior relevance across diverse query types.
---
# Lexical vs Semantic Search
> Learn the differences between lexical and semantic search methods, when to use each approach, and how hybrid search combines both.
Published: 2024-11-28 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/lexical-vs-semantic
---
Different search workloads have different relevancy requirements. For example, some search problems only depend on exact or approximate string matching, where the presence (or absence) of a substring determines relevance (e.g. `“cat”` and `“black cat”`). Others depend on semantic similarity, where meaning determines relevance (e.g. `“cat”` and `“feline”`).
From an implementation perspective, these two patterns are orthogonal:
- Lexical methods (including full-text search) match tokens precisely. They only succeed when the query and document share vocabulary.
- Semantic methods capture intended meaning, but are poorly suited for exact identifiers (IDs, numbers, proper names) that lexical systems handle precisely.
This difference is why no single retrieval method performs well across all workloads (however, using both together with hybrid search offers a compromise).
Understanding when to use lexical, semantic, or hybrid search depends on your specific use case and the type of queries your users perform.
## Lexical Search
Lexical search has been part of information systems for decades. Early workloads depended on deterministic matching. Most queries issued by databases and other internal tools referenced exact terms like IDs or error codes. For example, a SQL query using `LIKE '%error%'` searches for any text containing the substring "error".
Modern lexical search is much more advanced and determines relevance based on token presence and frequency. Documents are scored higher when query terms appear more frequently or are rarer across the entire corpus.
### How does Lexical Search Work?
Most lexical search implementations use inverted indexes. An inverted index maps each token to the documents that contain it. This enables fast lookups even when datasets grow.
However, results also need to be ranked. [BM25](/learn/search-concepts/bm25) (Best Matching 25) is the most widely used ranking function for lexical retrieval. When indexing, text is passed through an analyzer that tokenizes the input. These tokens might be individual words, phrases, or lexemes (root words of each word). Each token becomes an entry in the inverted index, along with how many documents contain that token and how frequently it appears.
Lexical search can be enhanced with techniques like stemming (reducing words to root forms) and synonym expansion (matching predefined word lists). However, these improvements still rely on exact token matching rather than understanding contextual meaning.
### What is Full-Text Search?
[Full-text search](/learn/search-concepts/full-text-search) extends this model by adding structure to the query itself. Instead of treating tokens independently, full-text search lets developers configure how tokens should relate to one another.
For example, phrase search requires tokens to appear in a specific sequence, which is useful whenever order might affect meaning (such as error messages or function signatures). Proximity search, on the other hand, allows tokens to appear near each other but not in an exact order; this better reflects how related terms appear in documentation or logs.
Full-text search is still purely lexical. It uses an inverted index and extends the same token statistics. However, full-text search provides a more expressive query toolkit.
## Semantic Search
Semantic search (also known as [vector search](/learn/search-concepts/vector-search)) addresses a fundamental limitation of lexical search. Two pieces of text can describe the same idea without overlapping words. If a query uses different phrasing than the sought-after document, search results might not be relevant (although lexical with synonyms can help, it still can't index concepts).
Dense vector retrieval closes this gap by approximating meaning instead of relying on matching terms. Embedding models (like text-embedding-ada-002, sentence-transformers, or E5) learn relationships between concepts from large datasets; these concepts are then mapped into a continuous vector space.
### How does Semantic Search Work?
When indexing, an embedding model converts each document into a dense vector. These vectors represent semantic information learned from the model’s training data. Because this representation is continuous, documents that express similar ideas can be placed near one another even if they use completely different vocabulary.
Searching across all vectors directly would be very expensive, so databases use approximate nearest neighbor (ANN) indexes like HNSW or IVF. These structures organize vectors into partitions, making it possible to retrieve a small candidate set. ANN indexes trade a small amount of recall for significant gains in throughput and latency.
At query time, the system embeds the query into the same vector space. It then evaluates how close the query vector is to each candidate vector using similarity metrics such as cosine similarity, the dot product, or the L2 distance.
## Hybrid Search
[Hybrid search](/learn/search-concepts/hybrid-search) does not replace lexical or semantic search. Instead, it coordinates them. Lexical methods are reliable when queries contain structured tokens. Semantic methods are reliable when the query expresses intent in natural language. Hybrid search combines these strengths.
In hybrid search, both lexical and semantic methods retrieve their own set of candidates. Each list reflects a different assumption about what the user meant: one grounded in exact terms, the other grounded in contextual meaning. Documents surfaced by either method may be relevant. Merging these lists into a single ranking creates a more stable relevance signal across query types that may change.
### How does Hybrid Search Work?
Hybrid search begins by running multiple retrieval methods in parallel. Typically, lexical search retrieves and ranks candidates using BM25, which is almost always the best baseline for lexical retrieval. At the same time, a vector search retrieves candidates using dense embeddings. Each method produces its own ranked list based on its internal scoring rules.
At query time, these lists need to be merged into a single final ranking. The scores from each method cannot be directly compared as lexical scores and vector measures use varying scales. However, techniques like [Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion) (RRF) ignore raw scores entirely and strictly use the rank position of each document within its own list. RRF assigns each document a fused score.
## Summary
Lexical, semantic, and hybrid search methods exist because users express intent in different ways. Some queries rely on exact tokens, while others rely on context. Each retrieval method optimizes for a different type of relevance.
Beyond choosing retrieval methods, query rewriting (expanding or modifying queries before search) and reranking (adjusting results after retrieval) are equally important for improving search quality. These techniques can significantly enhance any underlying search approach. Modern applications like [Retrieval-Augmented Generation (RAG)](/learn/search-concepts/retrieval-augmented-generation) rely heavily on these search fundamentals to provide relevant context to language models.
One practical challenge that many teams face is when their applications require search in a separate system from their primary database (e.g. Elasticsearch and Postgres). In cases like these, the burden falls on developers to keep both systems consistent.
Modern database engines that support both lexical and semantic search natively help by running retrieval methods directly on database rows, eliminating the need for separate index-sync pipelines.
---
# What is Reciprocal Rank Fusion?
> Learn about Reciprocal Rank Fusion (RRF), a technique for combining multiple search result rankings.
Published: 2024-10-06 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/reciprocal-rank-fusion
---
[Reciprocal Rank Fusion](https://plg.uwaterloo.ca/%7Egvcormac/cormacksigir09-rrf.pdf) (RRF) is a technique for combining multiple ranked lists of search results into a single ranking. Originally developed for information retrieval research, RRF has become essential in modern search systems to allow ranking across multiple retrieval methods, such as combining lexical and semantic search results in hybrid search applications.
RRF's elegance lies in its simplicity: rather than normalizing scores within each scoring system, it works directly with document positions in the ranked lists.
## How RRF Works
RRF operates on a simple principle: documents that appear highly ranked across multiple search methods are likely to be genuinely relevant.
The standard RRF formula treats all ranking sources equally, but _Weighted RRF_ allows you to assign different weights to each method.
For example, when combining BM25 and vector search you might assign weights of 1.0 and 0.7 respectively to show that you value lexical over semantic meaning.
$$
\text{WeightedRRF}(d) = \sum_{r \in R} w_r \cdot \frac{1}{k + rank_r(d)}
$$
Where
- $$d$$ is a document
- $$R$$ is the set of ranked lists
- $$rank_r(d)$$ is the rank of document $$d$$ in ranking $$r$$
- $$w_r$$ is the weight for ranking $$r$$ (typically between 0 and 1). In non-weighted/traditional RRF this is always 1.
- $$k$$ is a constant (typically 60) that controls the fusion behavior
The fusion process works as follows:
1. Start with multiple ranked lists from different retrieval methods (discard the actual ranking numbers).
1. Calculate RRF scores for each document that appears in any list and multiply by the weight (if used).
1. Merge all documents and sort by their combined RRF scores
Documents missing from a ranking contribute zero to that ranking's sum.
The constant $$k=60$$ is almost always used because it has been empirically shown to work well across different datasets.
$$k$$ can be tuned based on your specific use case and data characteristics. Lower values (20-40) will give top results more influence, higher values (80-100) will give a more gradual contribution difference.
## Why RRF Works Well
RRF has three key strengths:
- **Simplicity:** No training data, complex optimization, or normalization required. Easy to implement and fast to compute.
- **Robustness:** Works with different score scales without normalization. Handles partial results gracefully.
- **Effectiveness:** Often outperforms more sophisticated fusion techniques, especially when combining complementary methods.
Research consistently shows RRF's effectiveness when combining different retrieval approaches like keyword search with semantic similarity.
## When to Use RRF
While RRF can be used any time there are multiple scoring systems which need to be combined in a single query it excels in several scenarios:
**Hybrid Search Systems**
Combining lexical search ([BM25](/learn/search-concepts/bm25)) with semantic vector search:
- BM25 finds specific keywords and technical terms
- Vector search captures conceptual similarity
- RRF harnesses both strengths
**Multi-Field Search**
Search across different document parts with separate rankings:
- Title search results
- Body content results
- Metadata results
**Personalization**
Merge general relevance with personalized signals based on user behavior or preferences.
## Example: RRF in Action
Consider a search for `machine learning tutorial` using both lexical and semantic search:
**Lexical Search Results:**
1. "Complete Machine Learning Tutorial Guide"
2. "Tutorial: Introduction to ML Algorithms"
3. "Python Machine Learning Handbook"
**Semantic Search Results:**
1. "AI and Deep Learning Fundamentals"
2. "Complete Machine Learning Tutorial Guide"
3. "Beginner's Guide to Neural Networks"
**RRF Calculation (k=60):**
- "Complete ML Tutorial" appears in both lists (rank 1 and 2): `1/61 + 1/62 = 0.0326`
- "AI and Deep Learning" appears only in semantic (rank 1): `1/61 = 0.0164`
- "Tutorial: Intro to ML" appears only in lexical (rank 2): `1/62 = 0.0161`
Weights allow you to emphasize one retrieval method over another. For instance, in a hybrid setup:
- BM25 might have a weight of 1.0 for lexical precision.
- A vector search model might have 0.7 for semantic similarity.
The effect is simple but powerful: RRF still rewards agreement across rankers, but weights let you encode which signals you trust most. This makes weighted RRF especially useful in production systems where retrieval sources vary in quality or purpose.
## Implementation in SQL
A variety of databases and search engines have pre-implemented RRF, but it's a very simple formula and can be easily demonstrated in SQL as follows:
```sql
WITH fulltext AS (
SELECT id, RANK() OVER (ORDER BY score DESC) AS rank
FROM (
SELECT id, pdb.score(id) AS score
FROM mock_items
WHERE description @@@ 'keyboard'
ORDER BY pdb.score(id) DESC
LIMIT 20
)
),
--- Semantic search, using pgvector and cosine distance for ranking
semantic AS (
SELECT
id,
RANK() OVER (ORDER BY embedding <=> '[1,2,3]') AS rank
FROM mock_items
ORDER BY embedding <=> '[1,2,3]'
LIMIT 20
),
-- Calculate RRF contributions from each ranker
rrf AS (
SELECT id, 1.0 / (60 + rank) AS s FROM fulltext
UNION ALL
SELECT id, 1.0 / (60 + rank) AS s FROM semantic
)
-- Sum the RRF scores, order by them, and join back the original data
SELECT
m.id,
sum(s),
m.description
FROM rrf
JOIN mock_items AS m USING (id)
GROUP BY m.id, m.description
ORDER BY sum(s) DESC
LIMIT 5;
```
## Summary
Reciprocal Rank Fusion provides an elegant solution for combining multiple search rankings without the complexity that comes from score based systems. Its simplicity, robustness, and proven effectiveness make it the standard approach for hybrid search systems that need to merge results from different retrieval techniques.
---
# What is RAG?
> Learn about Retrieval-Augmented Generation (RAG) and how it uses search to increase LLM effectiveness.
Published: 2024-10-21 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/retrieval-augmented-generation
---
Retrieval-Augmented Generation (RAG) is an architecture that retrieves knowledge from external sources to improve the accuracy of responses from AI models. Without RAG, AI models only have access to training data. With RAG, prompts are appended with relevant context to assist the AI model’s output.
Originally, RAG emerged as a strategy to tackle an LLM’s limited context window. Passing all available context (such as a company’s internal knowledge base or a customer database) would have surpassed an LLM’s input limit. Modern LLMs have significantly larger context windows; however, RAG remains as a technique to filter context that will assist, and not distract, an LLM’s reasoning.
## How does RAG work?
RAG consists of three steps that correspond to its name:
1. **Retrieval**: Information is retrieved from external data sources. Usually the user query will be used, with the RAG pipeline employing vector search or full-text search to surface relevant information.
2. **Augmentation**: Retrieved information is appended to the user’s query and demarcated as context.
3. **Generation:** The AI model (e.g. GPT-5, Claude, etc.) generates an output from the augmented query.
The final step of RAG is straightforward. However, you must decide on the optimal retrieval and augmentation strategies for your application.
## Strategies for RAG retrieval
RAG can work with any external data source. This includes an existing database, an API, or a document repository. However, the most common implementations of RAG are ones that are purpose-built around both storage and retrieval. If data is not efficiently indexed in storage, then it is difficult to find the correct data to retrieve.
There are two common strategies (and a hybridized option) for efficiently storing and retrieving data: vector search and full-text search.
### Vector Search
[Vector search](/learn/search-concepts/vector-search) retrieves information based on semantic similarity using multi-dimensional vector embeddings. Vector search works best when users need conceptually related information, even if the exact words differ. It's particularly effective for multimodal applications and finding information that relates to the query's meaning rather than exact text matches.
### Full-Text Search
[Full-Text search](/learn/search-concepts/full-text-search) (FTS) retrieves information based on matching textual content, usually using an algorithm like [BM25](/learn/search-concepts/bm25). Full-text search works best when users know exactly what they’re looking for: like a product code, a filename, or a specific phrase. It’s precise, deterministic, and fast.
### Hybrid Search
Because vector search struggles to retrieve textually similar information (e.g. a shared book title) and full-text search struggles to retrieve semantically similar information (e.g. two books on the same topic), an alternative is hybrid search. [Hybrid search](/learn/search-concepts/hybrid-search) employs both vector search and full-text search to provide a cumulative approach.
Using [reciprocal rank fusion](/learn/search-concepts/reciprocal-rank-fusion), hybrid search retrieves documents that highly rank on both search approaches.
## Augmentation
Once a RAG application retrieves relevant information, it must combine it with the user's original query before sending the final prompt to the AI model. This augmentation step determines how effectively the model can use the retrieved context.
The most common approach is to append the retrieved documents to the original prompt with clear delimiters. For example:
```text
Context:
[Retrieved document 1]
[Retrieved document 2]
User Query: What is our return policy?
Please answer the user's question based on the provided context.
```
You can improve augmentation by:
- **Ranking and filtering**: Only include the most relevant retrieved documents to avoid overwhelming the model with excessive context
- **Summarization**: Condense lengthy retrieved documents into key points before inclusion
- **Structured formatting**: Organize retrieved information with headers, bullet points, or numbered lists for easier model comprehension
- **Source attribution**: Include document titles or metadata so the model can reference specific sources in its response
The goal is to provide enough relevant context for accurate responses while maintaining clarity and staying within the model's context window limits.
## What are the benefits of RAG?
RAG is a necessary architecture for most business applications of AI. AI models aren’t trained on private information, such as a company’s internal policies, product documentation, or customer data. Instead, they must be provided with that context at query time to effectively answer prompts.
An alternative (or supplementary) strategy to RAG is fine-tuning, where AI models are further trained on contextual information before query time. This approach must happen ahead of time and is much more complex; but it can be ideal for applications where context is fixed and not dynamic, where queries are predictably centered around a specific domain, and where context isn’t excessively abundant.
## Summary
RAG is an architecture for retrieving relevant information to improve the performance of LLM model responses. Usually, the most challenging aspect of RAG is choosing the right retrieval strategy that optimizes for semantic similarity, textual similarity, or a combination of the two.
---
# What is Vector Search?
> Learn about vector search, how it enables semantic similarity matching using high-dimensional embeddings.
Published: 2024-10-06 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-concepts/vector-search
---
Vector search, also known as similarity search and semantic search, focuses on finding similar content based on semantic meaning rather than exact keyword matches. Unlike traditional [full-text search](/learn/search-concepts/full-text-search) that asks "which documents contain these tokens?", vector search asks "which documents mean something similar to this?"
This semantic understanding makes it particularly valuable for natural language queries, content recommendation, and applications where meaning matters more than exact terminology.
Vector search operates through three core components: **embedding generation**, **indexing**, and **querying**. Embedding generation converts text into numerical vectors, indexing stores these vectors in specialized data structures, and querying uses mathematical similarity calculations to find the most relevant content.
Vector search and vector indexes can be provided either by specialized vector
databases (like [Pinecone](https://www.pinecone.io/) or, by search engines
with vector indexes (like
[Elasticsearch](https://github.com/elastic/elasticsearch)), or by general
purpose databases with vector extensions (like
[PostgreSQL](https://postgresql.org) with
[pgvector](https://github.com/pgvector/pgvector)).
## Embedding Generation
Embedding generation is the foundation that lets vector search appear to understand intent. Machine learning models convert text into dense numerical vectors that capture semantic meaning. This isn't an exact science, just as two people can have different understandings of sentences, so can different embedding models.
### The Embedding Pipeline
The embedding process transforms raw text into searchable vectors through several steps:
1. **Text preprocessing:** Cleaning, normalizing, and chunking input text. Because embedding systems can't handle large documents, input text will be broken down into smaller chunks which will be embedded separately.
2. **Model inference:** Running chunks through local or remote embedding models to generate high-dimensional vectors.
3. **Vector normalization:** Normalizing vectors for consistent similarity calculations.
All vector search systems must decide whether to run embedding models locally or use remote APIs.
- **Local models** like [BGE-M3](https://bge-model.com/bge/bge_m3.html) or open-source alternatives provide full control over latency, costs, and data privacy. You can optimize inference speed and avoid per-request API costs, but require computational resources and model management.
- **Remote APIs** like OpenAI or Anthropic's embedding services offer convenience without infrastructure overhead. However, they introduce network latency, per-request costs, and potential privacy concerns when sending data to external services.
Embedding generation speed directly affects user experience. Local models can be optimized for your hardware and batched for efficiency, while remote APIs face network round-trip delays. For real-time applications, embedding pre-computation or caching strategies become essential regardless of the approach.
The model's dimensionality also impacts both embedding speed and storage requirements: higher dimensions often mean better accuracy, but bring increased computational and storage costs.
## Indexing
Indexing stores the generated vectors in data structures optimized for fast similarity search. Unlike traditional search that builds inverted indexes of terms, vector search creates specialized indexes for high-dimensional numerical data. These indexes allow queries to efficiently find similar vectors without comparing every vector pair. Different indexing strategies optimize for various performance characteristics:
- **HNSW (Hierarchical Navigable Small World):** Builds multi-layer graph structure for fast approximate search with good recall
- **IVF (Inverted File Index):** Partitions vectors into clusters using k-means, searches only relevant clusters
- **DiskANN:** Optimized for SSD storage, enables vector search on datasets larger than RAM
Many other index types exist for specialized use cases, each balancing search speed, accuracy, and memory requirements.
### Example: Building a Vector Index
Consider indexing two simple documents:
| **ID** | **Text** |
| ------ | --------------------------- |
| 1 | "machine learning tutorial" |
| 2 | "AI programming guide" |
After embedding generation, each document becomes a high-dimensional vector:
| **Document** | **Vector (simplified 3D representation)** |
| ------------ | ----------------------------------------- |
| 1 | `[0.8, 0.3, 0.1]` |
| 2 | `[0.7, 0.4, 0.2]` |
The vector index stores these vectors alongside their document IDs, optimized for fast similarity search operations. The choice of index type (HNSW, IVF, etc.) determines how these vectors are organized and accessed during search.
## Querying
Vector search converts user queries into vectors and finds the most similar documents through mathematical distance calculations.
The process is straightforward: convert the query to a vector using the same embedding model, calculate similarity scores against all indexed vectors, and return results ranked by similarity.
**Similarity measures** determine how close vectors are to each other:
- **Cosine similarity:** Measures angle between vectors (most common)
- **Dot product:** Considers both angle and magnitude
- **Euclidean distance:** Straight-line distance in vector space
Vector search systems often provide additional query capabilities like filtering by metadata, approximate search for speed, and threshold filtering to ensure quality results.
It's important to note that vector systems have no concept of tokens or words, so concepts like exact matching and proximity don't exist. The correctness of the results also relies on the embedding model's semantic understanding matching the end users.
## Where Vector Search Excels
Vector search excels in applications where semantic understanding provides clear advantages:
- **Content recommendation:** Find conceptually similar content regardless of exact wording
- **Semantic search:** Natural language queries like "Why is my website slow?" find articles about optimization, performance, and CDNs
- **Retrieval-Augmented Generation (RAG):** Find relevant context documents for language models to provide accurate, grounded responses
- **Multimodal applications:** Visual similarity search, cross-modal search between text and images, audio/video content matching
## When Vector Search Is Not Enough
While vector search excels at semantic similarity, it has limitations when users need exact keyword matching or precise terminology:
- **Exact keyword requirements:** Product codes, technical specifications, or proper names often require precise lexical matching.
- **Positional queries:** Vector indexes have no concept of words or position in the document.
Vector search requires more storage and compute resources than traditional search. Even at query time, generating embeddings and computing similarity scores can add noticeable latency compared to [BM25](/learn/search-concepts/bm25).
In many scenarios, **[hybrid search](/learn/search-concepts/hybrid-search)** approaches that combine both vector and traditional [full-text search](/learn/search-concepts/full-text-search) often provide the best solution:
- Full-text search for precise keyword matching and boolean logic
- Vector search for semantic similarity and concept-based retrieval
- Ranking algorithms like [Reciprocal Rank Fusion](/learn/search-concepts/reciprocal-rank-fusion) that blend both approaches for optimal relevance
## Summary
Vector search represents a divergent advancement in information retrieval, enabling the discovery of semantically similar content by representing data as high-dimensional numerical vectors and using mathematical similarity measures for retrieval. This approach excels at understanding meaning and context, but can't search for exact words or phrases.
The technology has become essential for recommendation systems and semantic search implementations that require nuanced understanding of content relationships.
---
# Implementing BM25 in PostgreSQL
> Learn how to implement BM25 ranking in PostgreSQL using various approaches from SQL to ParadeDB extensions.
Published: 2024-10-15 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-in-postgresql/bm25
---
Postgres is a powerful general-purpose database: but its built-in full-text
search has limits, especially around ranking. It supports basic term matching,
but lacks a native implementation of BM25, the probabilistic ranking function
that underpins most modern search systems (including Elasticsearch).
[BM25](https://www.paradedb.com/learn/search-concepts/bm25) (short for *Best
Matching 25*, often called *Okapi BM25*) measures relevance by balancing term
frequency, document rarity, and length normalization, an approach that
consistently outperforms both simpler and more complex ranking methods.
In this guide, you will learn a few approaches to implementing BM25 in Postgres databases.
## How Does BM25 Work?
BM25’s goal is to rank the results of a search query by relevancy.
To do this BM25 considers three variables:
- (i) the term frequency (TF) of each word in the document.
- (ii) the inverse document frequency (IDF), which quantifies and weights the rarity of each word across all documents
- (iii) The document’s length. By factoring in the document’s length, BM25 can promote shorter documents that might be more centered around the search query.
BM25 positively weights (i) shorter documents, (ii) those with a high frequency of query words, and (iii) terms that occur more rarely across documents. If you’re curious about the algorithm’s precise mechanics, you can learn more through our [BM25 chapter](https://www.paradedb.com/learn/search-concepts/bm25) in our [Search Concepts](https://www.paradedb.com/learn/search-concepts/full-text-search) collection.
## Postgres Supports Full-Text Search, But Not BM25
Despite not implementing BM25, Postgres does support full-text search via [`tsvector`](https://www.postgresql.org/docs/current/datatype-textsearch.html) and its family of functions.
Using [`ts_rank`](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING), Postgres developers can rank documents by relevance, but all scoring is not the same. The built in ranking doesn’t perform as well as BM25 because it only accounts for term frequency within a document and document length, but not inverse document frequency to capture the rarity of words in the full corpus. It’s also slower, needing to read from every tsvector in the set to return ranking results from the index (even for a LIMIT/Top K query).
## Routes to Implement BM25 in Postgres
Of course where there is a will there is a way, Postgres enthusiasts have [found several methods to adding BM25 support,](https://www.postgresql.eu/events/pgconfeu2024/sessions/session/6012/slides/592/Di_Qi_LT_BM25.pdf) from raw SQL to a dedicated BM25 extension.
Let’s break apart some of these approaches in detail.
### Implementing BM25 Using Just SQL
The most rudimentary (and unscalable) approach is using just SQL at query time.
BM25’s three variables—document length, term frequency, and IDF can be calculated [manually](https://www.alibabacloud.com/blog/keyword-analysis-with-postgresql-cosine-and-linear-correlation-algorithms-for-text-analysis_595793) using `tsvector` objects and the `ts_stat` data which Postgres collects, however it's both very convoluted and very slow. It could potentially be an option for a static corpus that never changes, but still feels more like a proof-of-concept that a production ready method.
### Implementing BM25 Using PL/PGSQL and Auxiliary Tables
A more usable path to implement BM25 is to use custom PL/PSQL and auxiliary tables. With each term occupying a row, table lookups are reduced to one per term in your query. These results can then be aggregated with a custom function.
A library that takes this approach is [plpgsql_bm25](https://github.com/jankovicsandras/plpgsql_bm25) by [András Jankovics](https://github.com/jankovicsandras). plpgsql_bm25 implements BM25 using just SQL and auxiliary tables. The function signature to create a table that holds BM25 records looks like this:
```sql
bm25createindex(
tablename TEXT,
columnname TEXT,
algo TEXT DEFAULT '',
stopwordslanguage TEXT DEFAULT '')
RETURNS VOID
```
Unfortunately this approach is still very slow, and the tables (and their indexes) need to be updated after every source data change which leads to write amplification.
### Extending Postgres to Support BM25 Indexes
A more scalable approach is using Postgres's index access method (IAM) to create a native BM25 index that is created on the source table.
[ParadeDB](https://www.paradedb.com) uses this method to implement BM25 indexes in Rust, adding them to PostgreSQL though the open-source `pg_search` extension.
When an index is added to a table it can cover multiple columns, providing advanced full-text search DX, as well as fast columnar lookups and aggregations for non-text fields.
```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, rating)
WITH (key_field='id');
```
The BM25 index is laid out on disk as an [**LSM tree**](https://docs.paradedb.com/welcome/architecture#lsm-tree), where each segment in the tree consists of both an inverted index and columnar index. The inverted and columnar indexes optimize for fast reads, while the LSM tree optimizes for high-frequency writes.
ParadeDB users can access full-text search and BM25 relevancy ranking by using custom operators and functions. For example, a search using match disjunction (find all documents which match one or more terms), ordering by BM25 scores, and then returning the top 5 would look like this:
```sql
SELECT id, paradedb.score(id)
FROM mock_items
WHERE description ||| 'running shoes'
ORDER BY paradedb.score(id)
LIMIT 5;
```
In most cases ParadeDB performs similarly to Elasticsearch, returning results in milliseconds even across large datasets.
## **What Are the Benefits of BM25 Support in Postgres?**
Bringing BM25 directly into Postgres has two major effects: it improves ranking quality and simplifies system design.
Postgres’s built-in full-text search—based on tsvector, tsquery, and ts_rank is capable but rudimentary. Its scoring function uses a linear weighting of term frequency and field importance, but it ignores inverse document frequency (IDF) and applies no document-length normalization.
Running BM25 inside Postgres also eliminates the need for a separate search engine and the ETL processes that keep it synchronized. There’s no risk of stale indexes, no duplicate infrastructure, and no latency gap between writes and searchable data. Each change to a table immediately updates the BM25 index, keeping search results transactionally consistent with the source of truth.
Finally, because BM25 operates as a native index access method, it participates in Postgres’s planner and transaction model. Queries benefit from caching, parallelism, and predictable performance characteristics: something external search engines can’t easily guarantee.
## Getting Started with BM25 in Postgres
Postgres doesn’t natively support BM25, but as we’ve seen, there are several ways to bring it into the database, from SQL-based prototypes to full extensions like ParadeDB’s pg_search. Each approach trades off simplicity, performance, and maintenance effort, but all share the same goal: better, more meaningful search results inside Postgres itself.
---
# Full-Text Search in PostgreSQL
> Learn how to implement powerful text search directly in PostgreSQL using built-in full-text search capabilities.
Published: 2024-10-16 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/search-in-postgresql/full-text-search
---
PostgreSQL includes powerful built-in [full-text search](/learn/search-concepts/full-text-search) capabilities that transform the database into a search engine. Rather than relying on external services, you can implement sophisticated text search directly within PostgreSQL using the same ACID transactions, security model, and operational infrastructure that manage your application data.
PostgreSQL's full-text search handles the complete pipeline from text analysis to ranked results. It tokenizes documents, removes stop words, applies stemming, builds searchable indexes, and ranks results by relevance, all with the performance and reliability you expect from PostgreSQL. This article discusses what it can do, and where it falls short.
**We love using PostgreSQL for full-text search, but we built ParadeDB because we don't think the built in full-text search is enough for a modern search application.**
That's not to say that the built in features aren't great, but for when you need a streamlined developer experience, [BM25](/learn/search-concepts/bm25) ranking, powerful facets, a feature set that matches Elasticsearch, or increased performance then ParadeDB is a better fit.
## Core Components
PostgreSQL's full-text search is built around specialized data types and functions that work together to provide comprehensive text search capabilities. These are combined to both store and query text vectors.
### Text Search Data Types
PostgreSQL provides two fundamental data types for full-text search. The [`tsvector`](https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSVECTOR) type represents processed, searchable text as a sorted list of distinct tokens with position information. The [`tsquery`](https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSQUERY) type represents search queries with boolean operators, phrase matching, and proximity requirements.
These types handle the complexity of text analysis while exposing a clean interface for indexing and querying.
#### Understanding tsvector
A `tsvector` transforms raw text into a structured, searchable format:
```sql
-- Convert text to tsvector
SELECT to_tsvector
('english',
'PostgreSQL provides full-text search capabilities in database'
);
```
```txt
'capabl':7
'databas':9
'full':4
'full-text':3
'postgresql':1
'provid':2
'search':6
'text':5
```
Notice how the result:
- Normalizes words to lowercase
- Stems the words ("provides" → "provid", "capabilities" → "capabl") {/* * codespell:ignore provid * */}
- Removes stop words like "in"
- Records position information (`:1`, `:2`, etc.)
- Explodes hyphenated terms like "full-text" while keeping the original
One limitation of `tsvector` is that it represents the text of a single row as a single vector. To combine multiple text fields, you typically concatenate them (`title || ' ' || description || ' ' || body`) into one `tsvector`. This works, but it flattens field boundaries, so you can’t search or rank fields independently. Alternatively, you could store separate `tsvector` columns for each field, but then each is indexed and scored independently, fragmenting your corpus.
ParadeDB uses a BM25 covering index avoids this trade-off: it can maintain
field-level statistics and weighting within a single index while still
computing global document statistics.
#### Building tsquery
A `tsquery` represents search conditions with boolean logic that operates on a `tsvector` type:
```sql
-- Simple term query
SELECT to_tsquery('english', 'postgresql');
-- Boolean operators
SELECT to_tsquery('english', 'postgresql & search');
-- Phrase search
SELECT to_tsquery('english', 'full <-> text');
-- Prefix matching
SELECT to_tsquery('english', 'post:*');
```
### Text Search Configurations
PostgreSQL ships with predefined configurations for multiple languages, each with appropriate tokenization rules, stop word lists, and stemming algorithms. You can see the configurations available on your system with the following query:
```sql
-- Available configurations
SELECT cfgname FROM pg_ts_config;
```
Common configurations include `english`, `spanish`, `french`, `german`, and `simple` (no stemming). You can also create custom configurations for specialized domains. While other systems like Elasticsearch operate on filter chaining, if you wanted to change the way Postgres full-text search works you'd create another configuration with different options enabled.
## Indexing for Performance
Full-text search performance depends entirely on proper indexing. In practice, you'll almost always use functional indexes rather than storing `tsvector` data in columns. Functional indexes avoid duplicating your text data, eliminate the need for synchronization triggers, and provide the same query performance as materialized columns.
PostgreSQL supports two index types for text search, with GIN indexes being the standard choice for most applications.
### GIN Indexes
GIN (Generalized Inverted Index) indexes are the preferred approach for full-text search. Rather than adding a separate `tsvector` column, create the index directly on your text processing function:
```sql
-- Create functional GIN index (recommended approach)
CREATE INDEX idx_articles_search ON articles
USING GIN(to_tsvector('english', title || ' ' || content));
```
This approach provides several advantages over storing `tsvector` in a column. You avoid duplicating your text data in the database, which saves storage space and reduces backup sizes. There's no need for triggers or other synchronization mechanisms to keep the search data current with your source text. The index automatically reflects any changes to the underlying text columns. Query performance remains similar to using a materialized `tsvector` column since PostgreSQL can use the index directly. Even if you did store `tsvector` data in a column, you'd still need to create an index on that column for acceptable performance.
GIN indexes excel at fast lookups for any query term, efficient boolean operations, and compact storage for large vocabularies. They're optimized for read-heavy workloads where the text content doesn't change frequently.
### GiST Indexes
GiST (Generalized Search Tree) indexes offer an alternative that can be more suitable for specific scenarios:
```sql
-- Create functional GiST index
CREATE INDEX idx_articles_search_gist ON articles
USING GiST(to_tsvector('english', title || ' ' || content));
```
GiST indexes perform better when you need proximity or phrase queries, when updates are frequent (GiST updates faster than GIN), or when index size is a concern. However, for most full-text search applications, GIN indexes provide better overall performance.
## Basic Text Search Operations
With your functional index in place, you can perform sophisticated text searches without any additional setup or data synchronization.
### Executing Searches
Use the `@@` operator to match text against search queries:
```sql
-- Find articles about PostgreSQL
SELECT title, content
FROM articles
WHERE to_tsvector('english', title || ' ' || content) @@
to_tsquery('english', 'postgresql');
-- Boolean search
SELECT title, content
FROM articles
WHERE to_tsvector('english', title || ' ' || content) @@
to_tsquery('english', 'postgresql & (search | query)');
-- Phrase search
SELECT title, content
FROM articles
WHERE to_tsvector('english', title || ' ' || content) @@
to_tsquery('english', 'full <-> text');
```
### Ranking Results
PostgreSQL provides [ranking functions](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING) that score documents by relevance, though with important limitations compared to modern search engines:
```sql
-- Rank by relevance using ts_rank
SELECT title, content,
ts_rank(to_tsvector('english', title || ' ' || content), query) as rank
FROM articles, to_tsquery('english', 'postgresql & search') query
WHERE to_tsvector('english', title || ' ' || content) @@ query
ORDER BY rank DESC;
```
PostgreSQL's `ts_rank` function provides basic relevance scoring, but it's significantly less sophisticated than modern ranking algorithms like [BM25](/learn/search-concepts/bm25) used by Elasticsearch. The `ts_rank` function only considers term frequency within documents and lacks global document frequency information that would indicate how rare or common terms are across your entire corpus. This means common words aren't properly downweighted, and rare, potentially more relevant terms don't receive appropriate boosts.
PostgreSQL also offers `ts_rank_cd` with cover density algorithms for alternative ranking approaches, but these still fall short of BM25's comprehensive approach to relevance scoring. For applications where search ranking quality is critical, this represents a significant limitation compared to dedicated search engines.
## Advanced Query Techniques
### Highlighting Results
PostgreSQL can highlight matching terms in search results:
```sql
SELECT title,
ts_headline('english', content, query) as highlighted_content
FROM articles, to_tsquery('english', 'postgresql & search') query
WHERE search_vector @@ query;
```
### Multi-Column Search
Search across multiple text fields with different weights
```sql
-- Weight title more heavily than content
SELECT title, content,
ts_rank_cd(
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', content), 'B'),
query
) as rank
FROM articles, to_tsquery('english', 'postgresql') query
WHERE (
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', content), 'B')
) @@ query
ORDER BY rank DESC;
```
### Fuzzy Matching
The `pg_trgm` extension can be added to handle typos and variations with similarity search:
```sql
-- Enable pg_trgm extension for trigram similarity
CREATE EXTENSION pg_trgm;
-- Find similar terms
SELECT word FROM dictionary
WHERE similarity(word, 'postgresgl') > 0.3
ORDER BY similarity(word, 'postgresgl') DESC;
```
## When to Use PostgreSQL Full-Text Search
PostgreSQL's built-in search excels when search needs to be transactional with other database operations, maintaining ACID guarantees across your entire application. It performs well for applications with millions of documents rather than billions, making it suitable for most business applications. The language-specific configurations work well for multi-language content, and the ability to join search results with other tables simplifies complex queries that would require multiple systems in other architectures.
The operational simplicity of managing a single database system rather than separate search infrastructure makes PostgreSQL full-text search attractive for teams that want to avoid the complexity of distributed systems. PostgreSQL full-text search provides excellent performance for most applications while maintaining the operational simplicity of a single database system.
## Limitations and Alternatives
While PostgreSQL's full-text search is powerful, it has limitations compared to specialized search engines. Performance degrades with very large document collections (millions of documents), and the system provides limited support for faceting, fuzzy matching, or complex highlighting. The ranking algorithms offer fewer options for customization compared to dedicated search engines, and real-time analytics capabilities are less sophisticated than what dedicated search engines provide.
For applications requiring these advanced features, consider hybrid approaches that use PostgreSQL for transactional data and specialized search engines like Elasticsearch.
Another option is [ParadeDB](https://www.paradedb.com), which uses an
extension to bring the feature set and performance of Elasticsearch inside
Postgres.
## Summary
PostgreSQL's built-in full-text search transforms your database into a capable search engine. With `tsvector`, `tsquery`, and proper indexing, you can implement sophisticated text search that integrates seamlessly with your application's data and transactions.
The combination of SQL's expressiveness with full-text search capabilities makes PostgreSQL an excellent choice for applications where search is important, but doesn't justify the complexity of separate search infrastructure.
---
# Introduction to Tantivy
> Introduction to Tantivy, a fast full-text search engine library written in Rust
Published: 2025-10-22 · Author: James Blackwood-Sewell
Source: https://www.paradedb.com/learn/tantivy/introduction
---
**[Tantivy](https://github.com/quickwit-oss/tantivy)** is a fast [full-text search](/learn/search-concepts/full-text-search) library written in Rust that reimagines Apache Lucene's design with modern performance optimizations. Like Lucene, Tantivy is a library that you embed directly into your applications rather than a search engine that runs as a separate service. Think of it as a modern, Rust-based evolution of Lucene's core principles.
The project emerged from creator [Paul Masurel](https://fulmicoton.com/)'s decade-long fascination with search engines and his discovery that "Rust was solving all of the pain points I experienced with C++ or Java with mindblowing elegance." What started as a [Rust learning exercise](https://fulmicoton.com/posts/behold-tantivy/) became a production-ready search library that challenges the assumption that you need complex infrastructure to achieve powerful search capabilities.
Tantivy follows Lucene's fundamental architecture and algorithms, including [BM25](/learn/search-concepts/bm25) scoring and similar indexing strategies, but takes advantage of Rust's memory safety and performance characteristics, plus some [innovative optimizations](https://fulmicoton.com/posts/tantivy-stacker/) that push the boundaries of what's possible with modern hardware. The result is a search library that starts up in under 10 milliseconds and runs approximately twice as fast as Lucene in benchmarks while providing equivalent search capabilities.
## How Tantivy Works
Like Lucene, Tantivy follows a schema-based approach where you define the structure and types of your searchable data upfront. This allows the library to build specialized indexes optimized for different kinds of queries, including inverted indexes for text search, and columnar fast fields for faceting and point lookup queries.
### The Segmented Architecture
Tantivy organizes indexes into multiple small "segments" rather than maintaining one monolithic index structure. This segmented approach enables several key capabilities:
- **Concurrent Indexing:** Multiple threads can index documents simultaneously without blocking
- **Memory Management:** Large datasets can be indexed without loading everything into RAM
- **Incremental Updates:** New documents create new segments, avoiding expensive full index rebuilds
Each segment operates as an independent, immutable unit containing its own inverted indexes, document storage, and metadata. During search operations, Tantivy queries all segments in parallel and merges results, while background processes periodically merge smaller segments into larger ones for optimal performance.
### Search and Retrieval
Search operations work by parsing queries into execution plans against the inverted indexes and columnar fast fields. Tantivy uses [BM25](/learn/search-concepts/bm25) scoring for relevance ranking and supports boolean queries, phrase matching, and fuzzy search. The library employs Finite State Transducers (FSTs) for efficient term dictionary storage and implements sophisticated integer compression techniques to minimize memory footprint.
Results are returned in relevance order with minimal memory allocation overhead, taking advantage of Rust's zero-cost abstractions and careful memory management throughout the search pipeline.
## Key Features
Tantivy includes the core features you'd expect from a modern search library:
**Text Analysis:**
- Configurable tokenization pipeline with language support
- Stemming to match word variations (running, runs, ran)
- Stop word removal and custom text processing
- Built-in configurations for common languages
- Ability to extend with new tokenizers and behaviors (eg: [Lindera Tantivy](https://github.com/lindera/lindera-tantivy))
**Query Capabilities:**
- [BM25](/learn/search-concepts/bm25) relevance scoring for ranking results
- Fuzzy search for handling typos and variations
- Phrase queries with proximity matching
- Boolean logic (AND, OR, NOT) combining multiple terms and fields
- Range queries for numeric and date fields
- Faceted search for filtering and aggregation
**Performance Features:**
- Fast fields which store associated data in a columnar format for efficient sorting and aggregation without full document retrieval
- Compressed document storage for space efficiency
- Multithreaded indexing with configurable memory usage
## When to Choose Tantivy
Tantivy works well when you need search functionality embedded directly in your application. It's commonly used in desktop applications, command-line tools, and web applications where managing separate search infrastructure isn't practical.
**Single-Process Applications:** Desktop software, CLI tools, and single-server web applications benefit from Tantivy's embedded nature. The search index lives alongside your application data, eliminating network latency and infrastructure complexity.
**Resource-Constrained Environments:** Edge computing, IoT devices, and environments with limited memory benefit from Tantivy's efficiency. The library's minimal resource requirements make sophisticated search possible in contexts where running Elasticsearch would be impractical.
**Development and Testing:** The embedded approach simplifies development workflows. Your test suite runs against the same search implementation as production, without requiring external services or complex test setup.
**Real-Time Search Requirements:** Applications that need immediate search availability after data changes benefit from Tantivy's fast indexing. New documents become searchable within milliseconds rather than requiring separate batch processing.
The library approach means search availability is tied to your application, there's no separate service to manage, monitor, or keep in sync. This also simplifies development since you don't need to coordinate multiple services during testing.
## The Tantivy Ecosystem
If you need a complete search engine rather than just the library, several projects build on Tantivy to provide higher-level functionality. Just as Lucene serves as the foundation for search servers like [Elasticsearch](https://github.com/elastic/elasticsearch) and Solr, Tantivy has become the foundation for specialized search solutions.
[Quickwit](https://github.com/quickwit-oss/quickwit) builds on Tantivy to create a distributed search engine designed for log management and observability data. Where Elasticsearch might struggle with the volume and velocity of log data, Quickwit uses Tantivy's efficiency plus cloud-native architecture to handle massive log ingestion and search workloads. It's particularly well-suited for applications that need to search through petabytes of time-series data.
[ParadeDB](https://github.com/paradedb/paradedb) takes a different approach, embedding Tantivy directly into [PostgreSQL](https://postgresql.org) as an extension. This gives you modern search capabilities, including BM25 scoring and advanced text analysis, without leaving your existing database infrastructure. ParadeDB bridges the gap between traditional SQL databases and modern search engines, letting you run sophisticated search queries alongside your regular database operations.
These ecosystem projects demonstrate Tantivy's flexibility as a foundation for building complete search solutions.
## Getting Started
A typical Tantivy integration follows a simple pattern: define a schema, create an index, add documents, and search. Here's what a minimal implementation looks like:
```rust
use tantivy::*;
// Define schema and create index
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let content = schema_builder.add_text_field("content", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
// Add documents
let mut index_writer = index.writer(50_000_000)?;
index_writer.add_document(doc!(
title => "Sample Document",
content => "This is sample content for searching."
))?;
index_writer.commit()?;
// Search
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, content]);
let query = query_parser.parse_query("sample")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
```
The example uses an in-memory index for simplicity, but production applications typically use disk-based indexes for persistence. The `writer(50_000_000)` call allocates 50MB for the indexing buffer, Tantivy's memory management ensures this space is used efficiently through its custom allocation strategies. For more examples see the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/).
## Summary
Tantivy represents a fundamental shift in how we think about search infrastructure. Born from one developer's exploration of Rust's potential, it has evolved into a production-ready library that challenges the traditional trade-offs between performance, simplicity, and capability.
By reimagining Lucene's proven architecture with modern systems programming principles (and without the JVM), Tantivy demonstrates that you don't need to choose between powerful search capabilities and operational simplicity. As both a standalone library and the foundation for projects like Quickwit and ParadeDB, Tantivy is reshaping what's possible in search technology.
---