Back

Zvec: A lightweight, fast, in-process vector database

57 points2 daysgithub.com
simonw2 hours ago

Their self-reported benchmarks have them out-performing pinecone by 7x in queries-per-second: https://zvec.org/en/docs/benchmarks/

I'd love to see those results independently verified, and I'd also love a good explanation of how they're getting such great performance.

ashvardanian46 minutes ago

8K QPS is probably quite trivial on their setup and a 10M dataset. I rarely use comparably small instances & datasets in my benchmarks, but on 100M-1B datasets on a larger dual-socket server, 100K QPS was easily achievable in 2023: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search... ;)

Typically, the recipe is to keep the hot parts of the data structure in SRAM in CPU caches and a lot of SIMD. At the time of those measurements, USearch used ~100 custom kernels for different data types, similarity metrics, and hardware platforms. The upcoming release of the underlying SimSIMD micro-kernels project will push this number beyond 1000. So we should be able to squeeze a lot more performance later this year.

clemlesne3 hours ago

Did someone compared with uSearch (https://github.com/unum-cloud/USearch)?

neilellis1 hour ago

That I would like to see too, usearch is amazingly fast, 44m embeddings in < 100ms

cjonas35 minutes ago

How does this compare to duckdbs vector capabilities (vss extension)?

_pdp_2 hours ago

I thought you need memory for these things and CPU is not the bottleneck?

binarymax52 minutes ago

I haven’t looked at this repo, but new techniques taking advantage of nvme and io_uring make on disk performance really good without needing to keep everything in RAM.

skybrian2 hours ago

Are these sort of similarity searches useful for classifying text?

CuriouslyC2 hours ago

Embeddings are good at partitioning document stores at a coarse grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even fully pull weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.

neilellis1 hour ago

Yes, also for semantic indexes, I use one for person/role/org matches. So that CEO == chief executive ~= managing director good when you have grey data and multiple look up data sources that use different terms.

esafak2 hours ago

You could assign the cluster based on what the k nearest neighbors are, if there is a clear majority. The quality will depend on the suitability of your embeddings.

OutOfHere2 hours ago

It altogether depends on the quality and suitability of the provided embedding vector that you provide. Even with a long embedding vector using a recent model, my estimation is that the classification will be better than random but not too accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human labeled dataset and estimate the error confusion matrix via each approach.