What We Think About When We Think About Benchmarking
We're building a search engine inside Postgres, so of course we need to show people how it performs. That means publishing benchmarks: both against ourselves, and against other engines like Elasticsearch.
Benchmarking boils down to two concerns: the workload (what you measure) and the execution (how you measure it). The two shape each other, so where do you start? Do you begin with the workload you want to highlight and build the execution around it? Or do you build the execution first and let the workloads come later? This post is about the path we picked, and the runner we built around it.

The title is a nod to Haruki Murakami's What I Talk About When I Talk About Running, a book about repetition, routine, and what long stretches of disciplined practice reveal over time.
Benchmarking works the same way. You learn what your system actually is by measuring it honestly, repeatedly, in conditions that aren't designed to flatter you.
Why We Built Our Runner First
The clearest example of the workload-first path is ClickBench, a well-respected analytical benchmark. One dataset, one query set, dozens of backends, and a public leaderboard. It works because it's simple. A single problem, a single procedure, a single result format. Nothing varies but the database itself. The runner inherits the same simplicity: shell-script automation built for one job, with no concurrent users, no ramping load, no custom workloads, just best-of-three single shots.
Being incredibly easy to run, understand, and contribute to is what made ClickBench a Schelling point for analytical workloads. The tradeoff is that the workload and the runner are baked together. Push past that shape and you're building a new runner.
We could have taken the same path for search. Pick a representative workload, call it TextBench, build a runner around it, ship a leaderboard. That's a viable approach, but it would also bind us to that initial workload. And once we started expanding beyond it, we'd lose the simplicity that made the approach attractive in the first place. We wanted a runner we could grow with, not one that constrained us.
So we took the less-trodden path. We built the runner first, and kept benchmark workloads as a separate concern. ParadeDB Benchmarker is backend-agnostic, scenario-agnostic, and dataset-agnostic. Instead of fixing the workload and simplifying the runner around it, we worked to simplify the experience of defining new workloads. Because of this, the same runner can carry many benchmarks, including ones that have nothing to do with search and ones our future users haven't imagined yet.
How We Think About Execution
Building a runner worth reusing means getting the mechanics right for any benchmark you throw at it. Most of the ways benchmarks go wrong are mechanical, and the same problems show up over and over.
Environment and Resource Isolation
Every backend needs to start from the same place every time: identical resource constraints, the same setup steps, the same starting state. When you compare multiple backends in one run, they shouldn't be competing with each other for resources either.
Configuration Capture
A poorly-tuned Postgres will lose to a well-tuned Elasticsearch, and vice versa. Configuration is part of the benchmark whether you acknowledge it or not, and a striking number of published benchmarks don't post their full configs. A runner needs to save the full context alongside the results: backend tunables (PostgreSQL GUCs, Elasticsearch JVM settings, versions), setup scripts (VACUUM ANALYZE, force merge), and resource allocation. If you can't reproduce a result, you can't trust it.
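For illustration, a captured run context might look something like this. The shape and field names here are hypothetical, not Benchmarker's actual export format:

// Hypothetical shape of a captured run context; not Benchmarker's actual format.
const runContext = {
  backend: "paradedb",
  resources: { cpus: 8, memory: "16g" },                        // container limits
  tunables: { shared_buffers: "4GB", max_parallel_workers: 8 }, // Postgres GUCs
  setup: ["VACUUM ANALYZE documents"],                          // post-load steps
};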
Scenario Variety
Different questions need different test shapes. "How fast is a single query?" is a different scenario from "what happens at 1,000 QPS?", which is different again from "what does latency look like when reads and writes hit the system at the same time?" A runner should support all of these: constant throughput, ramping load, concurrent virtual users, mixed read/write workloads, and the ability to compose them.
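For concreteness, those shapes map onto k6 scenario executors (k6 is the engine our runner builds on; more on that below). The executors shown are standard k6, and the numbers are illustrative:

export const options = {
  scenarios: {
    // "How fast is a single query?" — a handful of isolated iterations.
    single_shot: { executor: "shared-iterations", vus: 1, iterations: 100, exec: "search" },
    // "What happens at 1,000 QPS?" — a fixed arrival rate, regardless of response time.
    fixed_rate: {
      executor: "constant-arrival-rate", rate: 1000, timeUnit: "1s",
      duration: "5m", preAllocatedVUs: 200, exec: "search",
    },
    // Ramping load — climb to 100 virtual users, then back down.
    ramp: {
      executor: "ramping-vus", startVUs: 0, exec: "search",
      stages: [{ duration: "2m", target: 100 }, { duration: "2m", target: 0 }],
    },
  },
};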
Rich Statistics
Tail latency is usually where systems actually fail, and it only shows up when you measure many runs. An average or minimum response time hides almost everything important. You need full latency distributions (P50, P95, P99 at minimum), throughput over time, and error rates, not point-in-time summaries.
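k6 makes most of this configurable out of the box. A sketch, using k6's built-in HTTP metrics (database drivers emit their own trend metrics, but the idea is the same):

export const options = {
  // Report the full percentile spread instead of just an average.
  summaryTrendStats: ["avg", "p(50)", "p(90)", "p(95)", "p(99)", "max"],
  thresholds: {
    http_req_duration: ["p(99)<500"], // fail the run if the tail degrades
    http_req_failed: ["rate<0.01"],   // ...or if errors creep past 1%
  },
};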
Application-Shaped Queries
A benchmark should send queries the way a real application would: over the network, using the same client libraries production code uses. CLI tools and developer interfaces (psql, mongosh, the Elasticsearch dev console) often take shortcuts that production traffic doesn't have access to: different connection pooling, different transports, optimizations for interactive use. If you benchmark through them, you're measuring the CLI as much as the database.
Driver Separation
You're probably not an expert in every system you're benchmarking. We know Postgres well; we don't know the best way to construct an Elasticsearch query as well as someone who has been writing them in production for years. The code that talks to each database should be separated from the benchmark scripts, so experts in each system can audit and improve how their database is being driven without touching the benchmark itself. Each backend should run in its own native syntax, with no abstraction layer quietly translating one system into another’s strengths.
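The shape we're after looks roughly like this (illustrative only, not Benchmarker's actual driver interface): each backend module speaks its own native dialect behind a common entry point, so an Elasticsearch expert can fix the Query DSL without ever reading the benchmark script.

// Illustrative driver separation; the client objects and function names are hypothetical.
export const drivers = {
  elasticsearch: {
    // Native Query DSL, auditable by Elasticsearch experts.
    search: (client, term) =>
      client.search({ index: "documents", query: { match: { content: term } } }),
  },
  paradedb: {
    // Native SQL, auditable by Postgres experts.
    search: (client, term) =>
      client.query("SELECT id, title FROM documents WHERE content ||| $1 LIMIT 10", term),
  },
};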
ParadeDB Benchmarker
Once we decided the runner mattered more than any single benchmark, we needed infrastructure that could survive changing workloads, changing databases, and changing questions. The result was ParadeDB Benchmarker. It's open source, and built as a layer on top of k6, Grafana's load testing framework (best known for testing web applications, not databases).
k6 is an incredible benchmarking tool, and it already solves most of the hard execution-engine problems: virtual-user scheduling, request firing, latency measurement, ramping load, and scenario orchestration. It even has a SQL plugin, xk6-sql, but we needed to drive databases outside the SQL camp (mainly Elasticsearch, OpenSearch, and MongoDB).
We also wanted better benchmarking ergonomics: dashboards, run exports, dataset loading, and reproducible local environments. So we built the xk6-database extension.
Here's how the pieces fit together: backends, datasets, query scripting, and the built-in dashboard.
Backends and Environments
From the start, we wanted PostgreSQL FTS, Elasticsearch, OpenSearch, ClickHouse, MongoDB Atlas Search, and ParadeDB itself to run under the same framework. But we didn't want to force them through a shared abstraction layer; each backend gets its own driver and runs in its own native syntax.
If you’d like to run locally, we provide Docker Compose profiles that bring up isolated backend environments with a single command.
Datasets and the Loader
Datasets are treated as separate units: schema, raw data, and per-backend setup scripts live together in one directory. Adding a new dataset means dropping a new directory in ./datasets/ rather than touching the runner.
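A hypothetical dataset directory, to make the shape concrete (the names here are illustrative, not the repo's actual layout):

datasets/
  my_dataset/
    schema/     # per-backend table and index definitions
    data/       # raw data files
    setup/      # per-backend post-load scripts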
Data goes in through a separate loader CLI, which handles schema setup, bulk ingest, and post-load steps like VACUUM ANALYZE or force merge automatically.
A First Benchmark
You write benchmark strategies in standard k6 JavaScript. A minimal single-backend workload targeting a local ParadeDB container would look like this:
import db from "k6/x/database";

// Connect to the backends under test; here, a single local ParadeDB container.
const backends = db.backends({ backends: ["paradedb"] });

// Search terms come from a file, so every query uses a realistic term.
const terms = db.terms(open("./search_terms.json"));

const scenarios = {
  paradedb: {
    executor: "constant-vus", // a fixed pool of virtual users
    vus: 5,
    duration: "30s",
    exec: "paradedbQuery",
  },
};

// Collect CPU and memory metrics from the Docker containers
export const collectMetrics = backends.addDockerMetricsCollector(
  scenarios,
  "30s",
);

export const options = { scenarios };

export function paradedbQuery() {
  backends
    .get("paradedb")
    .query(
      `SELECT id, title FROM documents WHERE content ||| $1 LIMIT 10`,
      terms.next(),
    );
}
Each VU runs paradedbQuery in a loop for 30 seconds. The runner times every call, tags it with the backend, and pushes the metrics to the dashboard.
Swap constant-vus for ramping-vus to ramp load up and down across stages, or constant-arrival-rate to fire a fixed number of requests per second regardless of how fast the system answers.
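For example, a drop-in replacement for the paradedb scenario above that ramps load up, holds, and ramps back down (standard k6 executor config; the numbers are illustrative):

paradedb: {
  executor: "ramping-vus",
  startVUs: 0,
  stages: [
    { duration: "10s", target: 20 }, // ramp up to 20 VUs
    { duration: "20s", target: 20 }, // hold
    { duration: "10s", target: 0 },  // ramp back down
  ],
  exec: "paradedbQuery",
},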
Composing Scenarios
A single script can compose multiple scenarios across multiple backends. You might run search at 200 QPS, an aggregation at 100 QPS, and an ingest stream at 1,000 rows per second, all hitting the same backend at the same time. A built-in phase timer staggers a second backend so the two runs don't compete for resources.
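A sketch of what that composition might look like (the exec function names are hypothetical; each would issue its query the way paradedbQuery does above):

const scenarios = {
  // Search traffic at a fixed 200 requests per second.
  search:      { executor: "constant-arrival-rate", rate: 200, timeUnit: "1s",
                 duration: "5m", preAllocatedVUs: 100, exec: "searchQuery" },
  // Aggregations at 100 requests per second, against the same backend.
  aggregation: { executor: "constant-arrival-rate", rate: 100, timeUnit: "1s",
                 duration: "5m", preAllocatedVUs: 50, exec: "aggQuery" },
  // A write stream inserting 1,000 rows per second while reads are in flight.
  ingest:      { executor: "constant-arrival-rate", rate: 1000, timeUnit: "1s",
                 duration: "5m", preAllocatedVUs: 50, exec: "insertRows" },
};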
Comparing raw ingest throughput across databases with different write semantics rarely tells you anything useful; comparing query latency under sustained write pressure tells you a lot.
Streaming Results
Results stream to the browser as the test runs: latency percentiles per backend (P50, P90, P95, P99), query and ingest throughput, and per-container CPU and memory pulled from Docker.
Backend configs, setup scripts, and query patterns are captured alongside the timeline, so a result can be reproduced and audited later.
The code can be found at github.com/paradedb/benchmarker, with a full walkthrough coming soon.
What's Next
We've shipped drivers for the backends we get compared to most often. We'd value experts in any of them auditing how Benchmarker drives them, so the comparisons we publish go up against each system at its best. And if there's a backend you'd like to see added, contributions are welcome.
Stay tuned for our upcoming performance blogs, where we'll put Benchmarker to work: real comparisons against other search engines, and the optimization journeys behind those numbers.
Ready to run your own database benchmarks? Check out ParadeDB Benchmarker.