Zvec: A lightweight, fast, in-process vector database

57 points • 2 days • github.com

simonw • 2 hours ago

Their self-reported benchmarks have them outperforming Pinecone by 7x in queries per second: https://zvec.org/en/docs/benchmarks/

I'd love to see those results independently verified, and I'd also love a good explanation of how they're getting such great performance.

ashvardanian • 46 minutes ago

8K QPS is probably quite trivial on their setup and a 10M dataset. I rarely use comparably small instances & datasets in my benchmarks, but on 100M-1B datasets on a larger dual-socket server, 100K QPS was easily achievable in 2023: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search... ;)

Typically, the recipe is to keep the hot parts of the data structure in SRAM (the CPU caches) and to use a lot of SIMD. At the time of those measurements, USearch used ~100 custom kernels for different data types, similarity metrics, and hardware platforms. The upcoming release of the underlying SimSIMD micro-kernels project will push this number beyond 1000. So we should be able to squeeze out a lot more performance later this year.

clemlesne • 3 hours ago

Has anyone compared it with USearch (https://github.com/unum-cloud/USearch)?

neilellis • 1 hour ago

I would like to see that too. USearch is amazingly fast: 44M embeddings in < 100 ms.

cjonas • 35 minutes ago

How does this compare to DuckDB's vector capabilities (vss extension)?

_pdp_ • 2 hours ago

I thought you need memory for these things and CPU is not the bottleneck?

binarymax • 52 minutes ago

I haven’t looked at this repo, but new techniques that take advantage of NVMe and io_uring make on-disk performance really good without needing to keep everything in RAM.

skybrian • 2 hours ago

Are these sorts of similarity searches useful for classifying text?

CuriouslyC • 2 hours ago

Embeddings are good at partitioning document stores at a coarse-grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even fully pull their weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.

neilellis • 1 hour ago

Yes, also for semantic indexes. I use one for person/role/org matches, so that CEO == chief executive ~= managing director. It's good when you have grey data and multiple lookup data sources that use different terms.

esafak • 2 hours ago

You could assign the cluster based on what the k nearest neighbors are, if there is a clear majority. The quality will depend on the suitability of your embeddings.
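A minimal sketch of that k-nearest-neighbor vote, in plain Python. Everything here is an assumption for illustration: the function names, the 2-dimensional toy vectors, the labels, and the `min_share` threshold used to decide what counts as a "clear majority". Real embeddings would come from a model and the search would go through a vector index rather than a full sort.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_classify(query, labeled, k=3, min_share=0.5):
    """Return the majority label among the k nearest neighbors.

    Returns None when no label holds more than `min_share` of the votes,
    i.e. when there is no clear majority.
    """
    # Brute-force ranking by similarity; a vector index would replace this.
    ranked = sorted(
        labeled,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    top = ranked[:k]
    votes = Counter(label for _, label in top)
    label, count = votes.most_common(1)[0]
    return label if count / len(top) > min_share else None


# Hypothetical toy data: (embedding, label) pairs, invented for this sketch.
labeled = [
    ([1.0, 0.0], "sports"),
    ([0.9, 0.1], "sports"),
    ([0.0, 1.0], "politics"),
    ([0.1, 0.9], "politics"),
]

clear = knn_classify([0.8, 0.2], labeled, k=3)      # two of three neighbors agree
ambiguous = knn_classify([0.5, 0.5], labeled, k=2)  # one vote each, no majority
```

Returning None instead of forcing a label lets you route the ambiguous cases to a slower or more expensive classifier.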

OutOfHere • 2 hours ago

It depends entirely on the quality and suitability of the embedding vectors you provide. Even with a long embedding vector from a recent model, my estimate is that the classification will be better than random but not very accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human-labeled dataset and estimate the confusion matrix for each approach.
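A rough sketch of that comparison, in plain Python. The labels and both prediction lists are invented toy data that only show the mechanics; they say nothing about which approach actually performs better. The names `approach_a` and `approach_b` are placeholders for, say, an embedding-based classifier and a large-model classifier.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def accuracy(matrix):
    """Fraction of examples on the diagonal (true label == predicted label)."""
    total = sum(matrix.values())
    correct = sum(n for (true, pred), n in matrix.items() if true == pred)
    return correct / total if total else 0.0


# Hypothetical human-labeled sample and predictions, made up for illustration.
human_labels = ["spam", "spam", "ham", "ham", "ham"]
approach_a = ["spam", "ham", "ham", "ham", "spam"]
approach_b = ["spam", "spam", "ham", "ham", "spam"]

matrix_a = confusion_matrix(human_labels, approach_a)
matrix_b = confusion_matrix(human_labels, approach_b)

for name, matrix in (("approach_a", matrix_a), ("approach_b", matrix_b)):
    print(name, dict(matrix), f"accuracy={accuracy(matrix):.2f}")
```

With a sample this small the estimates are very noisy, so in practice you would want at least a few hundred labeled examples before trusting the difference.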