Similarity Search with SPLADE Inside Postgres
By Ming Ying on November 12, 2023
With the release of v0.7.0, sparse vector support was added to pgvector. As a result, we have deprecated pg_sparse in favor of the upstream pgvector implementation.
We’re excited to unveil pg_sparse: the first extension to enable efficient storage and retrieval of sparse vectors using HNSW inside Postgres. pg_sparse is to sparse vectors what pgvector is to dense vectors.
Generated by new models like SPLADE, sparse vectors can detect the presence of exact keywords while also capturing semantic similarity between terms. Unlike dense vectors, sparse vectors contain significantly more entries, most of which are zero. For instance, OpenAI’s text-embedding-ada-002 model outputs dense vectors with 1,536 entries, whereas SPLADE outputs sparse vectors with over 30,000 entries.
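To make that gap concrete, here is a small illustrative sketch (a toy example, not pg_sparse code) that measures how sparse a SPLADE-sized vector is:

import torch

# Toy example: a 30,522-dimensional vector in which, as is typical for
# SPLADE output, only a few hundred entries are non-zero
dim = 30522
vec = torch.zeros(dim)
indices = torch.randint(0, dim, (200,))
vec[indices] = torch.rand(200)

nonzero = int((vec != 0).sum())
print(f"{nonzero} non-zero entries out of {dim} ({nonzero / dim:.2%} dense)")

Storing all 30,522 floats for every row would waste space; storing only the non-zero entries does not.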
pg_sparse is a fork of pgvector, which means that it leverages pgvector's existing vector storage and HNSW implementation. It consists of two major changes:

- A new Postgres data type called svector that stores sparse vectors by their non-zero entries (sketched below)
- A modification to the way that pgvector allocates Postgres pages to support vectors with variable numbers of non-zero entries
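Conceptually, svector keeps just the (index, value) pairs of the non-zero entries. The actual on-disk layout is internal to the extension, but a minimal Python sketch of the idea:

def to_sparse(dense):
    # Keep only the (index, value) pairs of non-zero entries: conceptually
    # what svector stores instead of the full array
    return [(i, v) for i, v in enumerate(dense) if v != 0]

print(to_sparse([1, 0, 3, 0]))  # [(0, 1), (2, 3)]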
You can easily get started by installing pg_sparse inside an existing, self-hosted Postgres instance or by running our Docker image. Run the following queries to get started:
-- Load extension
CREATE EXTENSION svector;
-- Create test data
CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(4));
INSERT INTO items (embedding) VALUES ('[1,0,3,0]'), ('[0,0,5,6]');
-- Create HNSW index for cosine similarity
CREATE INDEX ON items USING shnsw (embedding svector_cosine_ops);
-- Run query
SELECT * FROM items ORDER BY embedding <=> '[3,0,1,0]';
SPLADE Inside Postgres: An Example
Next, let’s walk through a more complex example that inserts, indexes, and searches sparse vectors generated by SPLADE using pg_sparse. To create our sparse vectors, let’s install the dependencies:
!pip install -U transformers torch datasets pandas tqdm
Next, let’s run Python code that:
- Loads an example Hugging Face dataset containing the descriptions of 50,000 movies
- Generates a SPLADE vector for each description
- Saves the dataset as a CSV file
from transformers import AutoModelForMaskedLM, AutoTokenizer
from tqdm import tqdm
from datasets import load_dataset
import pandas as pd
import re
import torch
model_id = 'naver/splade-cocondenser-ensembledistil'
dataset_name = 'SandipPalit/Movie_Dataset'
def create_splade(text):
    # Tokenize, run the masked-LM head, then apply SPLADE pooling:
    # log(1 + ReLU(logits)), masked by attention, max-pooled over tokens
    tokens = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        output = model(**tokens)
    vec = torch.max(
        torch.log(1 + torch.relu(output.logits))
        * tokens.attention_mask.unsqueeze(-1),
        dim=1,
    )[0].squeeze()
    return vec
def clean_text(text):
    # Collapse newlines and tabs so each description fits on one CSV line
    return re.sub(r'[\r\n\t]+', ' ', text)
# Initialize SPLADE model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
# Load dataset
dataset = load_dataset(dataset_name)
training_dataset = dataset["train"]
# Generate SPLADE vectors
# Note: this will take a while; consider reducing the size of
# training_dataset to speed it up
texts = []
vectors = []
for example in tqdm(training_dataset, desc="Processing..."):
    text = clean_text(example['Overview'])
    texts.append(text)
    vector = create_splade(text)
    vectors.append(vector.tolist())
# Collect the results in a dataframe
df = pd.DataFrame({
    'text': texts,
    'splade_vector': vectors
})
# Overwrite the first row with a query ("Space exploration") so we can
# use it as the search vector later
query = "Space exploration"
df.at[0, 'text'] = query
df.at[0, 'splade_vector'] = create_splade(query).tolist()
# Save to CSV
df.to_csv("splade_vectors.csv", index=False)
Using a Postgres client like psql, we can now load in our movie dataset. This will take a few minutes due to the size of the dataset. Note that the 30,522 dimensions below match the vocabulary size of SPLADE’s underlying BERT tokenizer, which determines the dimensionality of every SPLADE vector.
CREATE TABLE movies (description text, splade_vector svector(30522));
\copy movies FROM '/path/to/splade_vectors.csv' DELIMITER ',' CSV HEADER;
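If you’d rather load the data from Python instead of psql, here is a sketch using psycopg2 (the connection details are placeholders; it assumes the movies table above already exists):

import csv
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")  # adjust to your setup
cur = conn.cursor()

with open("splade_vectors.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for text, vector in reader:
        # vector is already a '[...]' literal, which Postgres coerces to svector
        cur.execute(
            "INSERT INTO movies (description, splade_vector) VALUES (%s, %s)",
            (text, vector),
        )

conn.commit()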
Let’s also enable timing to examine the performance of our queries.
\timing
Now we’re ready to execute our first search. To begin, we’ll look for the ten rows that are most similar to the first row (“Space exploration”) based on cosine similarity:
SELECT description
FROM movies
ORDER BY splade_vector <=> (SELECT splade_vector FROM movies LIMIT 1)
LIMIT 10;
Here are the results:
description
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Space exploration
A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.
Delhi Boys Going To Space
1962. The US Genesis 1491 space mission launches with a unique objective, based upon mankind's ongoing quest to seek answers to existential questions. It is the dawn of the Golden Age of Space exploration, and the possibilities seem limitless...
Prepare for liftoff as we explore NASA's Space Shuttle program's legacy, featuring rare footage and testimonies from the people who made it all possible.
Join the StoryBots and the space travelers of the historic Inspiration4 mission as they search for answers to kids' questions about space.
A loner inventor who dreams of exploring space works in a math lab until a female astronaut goes missing and he can use a space helmet to save her.
An exploration of the Alien presence on Earth and the reality of suppressed free energy technology.
An exploration into the fate of the post-modern man.
No matter how clear the night sky is, no matter how many millions of stars are within view, looking up at the sky on a clear night still hides the halo of man-made debris around Earth that threatens the future of space exploration and endangers us all.
(10 rows)
Time: 154.361 ms
This query executed a Postgres sequential scan that took roughly 154ms to return the top ten rows. Now, let’s create an HNSW index to speed up the search. This may take a few minutes to run.
CREATE INDEX ON movies
USING shnsw (splade_vector svector_cosine_ops);
With the index created, let’s re-run the same search query as above. We get back the same results, but with a query time of 2.704 ms, a more than 50X speedup!
description
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Space exploration
A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.
Delhi Boys Going To Space
1962. The US Genesis 1491 space mission launches with a unique objective, based upon mankind's ongoing quest to seek answers to existential questions. It is the dawn of the Golden Age of Space exploration, and the possibilities seem limitless...
Prepare for liftoff as we explore NASA's Space Shuttle program's legacy, featuring rare footage and testimonies from the people who made it all possible.
Join the StoryBots and the space travelers of the historic Inspiration4 mission as they search for answers to kids' questions about space.
A loner inventor who dreams of exploring space works in a math lab until a female astronaut goes missing and he can use a space helmet to save her.
An exploration of the Alien presence on Earth and the reality of suppressed free energy technology.
An exploration into the fate of the post-modern man.
No matter how clear the night sky is, no matter how many millions of stars are within view, looking up at the sky on a clear night still hides the halo of man-made debris around Earth that threatens the future of space exploration and endangers us all.
(10 rows)
Time: 2.704 ms
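In a real application, you would generate the query vector on the fly rather than overwriting a row in the table. Here is a sketch of what that might look like, reusing the create_splade function from the Python script above (the connection details are placeholders):

import psycopg2

# create_splade() is the function defined in the script above; this assumes
# svector accepts pgvector-style '[...]' literals, as in the examples above
query_vec = create_splade("Space exploration").tolist()

conn = psycopg2.connect("dbname=postgres user=postgres")  # adjust to your setup
cur = conn.cursor()
cur.execute(
    "SELECT description FROM movies ORDER BY splade_vector <=> %s::svector LIMIT 10",
    (str(query_vec),),
)
for (description,) in cur.fetchall():
    print(description)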
Benchmark Results
We measured index creation and query time over a dataset of 100K sparse vectors generated by SPLADE, each with 30,522 dimensions.
We set Postgres' maintenance_work_mem configuration to 512MB so that the entire HNSW graph could fit into memory during the build.
ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();
ef_construction is a parameter that trades longer index creation times for a higher-quality graph and more accurate search results. At the default ef_construction value of 64, the index takes 200 seconds to build (500 vectors per second).
Next, we compared the time to return the top 10 results with and without the HNSW index. We set m=16, ef_construction=64, and ef_search=40 (see the sketch below for how these parameters are set). With HNSW, this search took 6ms. Without HNSW, it took 150ms.
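For reference, assuming pg_sparse inherits pgvector's HNSW build options (m, ef_construction) and the hnsw.ef_search session setting, the configuration would look something like the sketch below; the exact option and setting names in pg_sparse may differ:

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")  # adjust to your setup
cur = conn.cursor()

# Assumption: pg_sparse keeps pgvector's HNSW options and settings
cur.execute("""
    CREATE INDEX ON movies
    USING shnsw (splade_vector svector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
cur.execute("SET hnsw.ef_search = 40")
conn.commit()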
Wrapping Up
pg_sparse is the latest installment in our roadmap to building an Elasticsearch alternative on Postgres. Postgres is now capable of full-text search (via pg_search), dense vector search (via pgvector), and sparse vector search.
pg_sparse can be installed alongside both pg_search and pgvector. In the future, we’d love to have our work merged upstream as part of a single pgvector extension. A unified extension for all types of vector search is simpler to use and maintain, and would also enable sparse vector search in managed Postgres databases, like AWS RDS, that already support pgvector.
Until then, the best way to use pg_sparse is to install it inside an existing, self-hosted Postgres instance or to run the Postgres Docker image.

pg_sparse is licensed under the PostgreSQL License, in accordance with pgvector’s license. If you'd like to contribute, the best place to start is our Slack community. And please don't hesitate to show your support by giving us a star!