The PDXearch DuckDB Extension

A state-of-the-art IVF index for lightweight but fast (filtered) vector similarity search.

Why PDXearch?

DuckDB offers vector similarity search (VSS) out of the box, through its fixed-size ARRAY column type and distance functions (docs). These functions return exact results, but are often too slow on large datasets.

The official DuckDB VSS extension introduces a graph-based (HNSW) VSS index. You can create this index on your table to speed up the vector search queries. Unfortunately, although these graph-based indexes deliver fast search, they take up a considerable amount of memory and take long to construct.

The PDXearch extension aims to address these drawbacks. It achieves competitive search performance, while using less memory and being significantly faster to construct. This is made possible by a state-of-the-art partition-based (IVF) index. To be precise, we rely on the CWI's PDX data layout and the accompanying search framework called PDXearch. Furthermore, this extension integrates tightly with DuckDB's internals to parallelize across row groups, allowing us to squeeze more performance out of modern hardware.

Install

Warning

The extension is unstable and experimental. We're actively working on adding features and improving stability. The extension will be made available as a community extension once it's ready. For now the extension has to be built locally.

To build the extension locally, see DEVELOPMENT.md.

Usage

To create an index and run a search, we provide an interface similar to the official VSS extension (VSS docs).

Start a DuckDB instance with an in-memory database and allow loading unsigned extensions.
```
duckdb -unsigned
```

Load the locally built extension by providing a full path to it.

LOAD '<Fill in>/PDXearch/build/release/extension/pdxearch/pdxearch.duckdb_extension';

Set up a table.

CREATE TABLE t1 (id INTEGER, vec FLOAT[512]);

INSERT INTO t1 (id, vec) SELECT i as id, repeat([i], 512) FROM range(20000) t(i);

Create the PDXearch index and set one of the index's options (n_probe).

CREATE INDEX t1_idx ON t1 USING PDXEARCH (vec) WITH (n_probe = 64);

Run an approximate filtered vector similarity search where the top 100 rows are returned.

SELECT * FROM t1 WHERE id < 500
    ORDER BY array_distance(vec, repeat([1000.51], 512)::FLOAT[512]) LIMIT 100;

Warning

If you're executing (filtered) search queries where K <= 50, then please disable DuckDB's late materialization optimization by running the following statement prior to your search: SET late_materialization_max_rows = 0;. Due to the query's low LIMIT (K), DuckDB will apply a late materialization optimization. Unfortunately, the extension does not handle this case optimally yet, leading to a suboptimal query plan when a K <= 50 VSS query is optimized. We aim to address this behavior in the near future.

Limitations

The extension's functionality is limited, as it is still in early development. As mentioned above, we aim to address these limitations soon.

No persistence: The index should only created in in-memory DuckDB databases. For disk-resident databases you'll have to manually drop and rebuild an index when you reload a database.
No maintenance: We currently only support creating an index on static collections. This means the index does not yet support updating the index when a INSERT INTO or DELETE FROM statement is invoked on the table.
Configuration options: The available configuration options are currently limited (e.g., quantization, distance functions, normalization).
Stability

Acknowledgements

The extension would not be possible without the underlying technologies and the lessons learned from other extensions.

PDX: We use the PDX data layout and PDXearch framework.
VSS: We've taken inspiration from the VSS interface and we reuse parts of the VSS extension's code.
Faiss: We currently use Meta's Faiss library for its K-means capabilities.
DuckDB-Faiss: We've taken inspiration from its build setup to include Faiss in the extension. This is yet another alternative DuckDB VSS extension.

License

The extension is licensed under the MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
.github/workflows		.github/workflows
duckdb @ 6ddac80		duckdb @ 6ddac80
extension-ci-tools @ af154d2		extension-ci-tools @ af154d2
scripts		scripts
src		src
test		test
third_party		third_party
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.editorconfig		.editorconfig
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
DEVELOPMENT.md		DEVELOPMENT.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
extension_config.cmake		extension_config.cmake
vcpkg.json		vcpkg.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The PDXearch DuckDB Extension

A state-of-the-art IVF index for lightweight but fast (filtered) vector similarity search.

Why PDXearch?

Install

Usage

Limitations

Acknowledgements

License

About

Uh oh!

Contributors 2

Uh oh!

Languages

License

Noorts/PDXearch

Folders and files

Latest commit

History

Repository files navigation

The PDXearch DuckDB Extension

A state-of-the-art IVF index for lightweight but fast (filtered) vector similarity search.

Why PDXearch?

Install

Usage

Limitations

Acknowledgements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors 2

Uh oh!

Languages