June 26, 2023 • 4 min read

How to choose your vector database?

Written by Noé Achache

Disclaimer: I am not affiliated with any of the databases discussed. I have simply tried to better understand what sets each solution apart, and I am sharing my perspective on these solutions: this is not an exhaustive comparison.
 
With the current hype around LLMs and Generative AI, vector databases are experiencing an increase in popularity, as more and more companies want to be able to query their database using natural language. This article from October 2021 compared vector databases very well, but a lot has changed since then.
 
Many new vector databases have appeared recently and it is becoming increasingly difficult to understand what they do differently. So, which one to choose for your use case?
 

TL;DR

  • If you already have PostgreSQL or Elasticsearch in your stack, use their integrated vector search features (pgvector and kNN search, respectively). When cost and/or latency become an issue, consider using a dedicated vector database.
  • Vector databases with managed clouds and free tiers are ideal for kicking off vector search projects.
  • In my opinion, Qdrant is the best choice for data scientists, because, on top of being very performant, it allows you to use the same tool for your experiments (saving the database as a disk file) and your production pipeline (database properly deployed).

Indexing algorithms

There are many algorithms for building indexes to optimize vector search. Most vector databases implement Hierarchical Navigable Small World (HNSW) and/or Inverted File Index (IVF). Here is a great article by Pinecone explaining them, and the trade-off between speed, memory and quality.
 
PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the expense of speed.
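
To make the trade-off more concrete, here is a minimal sketch using the faiss library (not one of the databases compared below, just a convenient reference implementation of these index types) that builds a flat, an HNSW and an IVF index over the same toy corpus; the dimensions and parameters are arbitrary.

```python
# pip install faiss-cpu numpy
import numpy as np
import faiss

d = 128  # vector dimensionality (arbitrary for this sketch)
corpus = np.random.random((10_000, d)).astype("float32")
queries = np.random.random((5, d)).astype("float32")

# Flat index: exhaustive search, 100% recall, but slowest on large corpora
flat_index = faiss.IndexFlatL2(d)
flat_index.add(corpus)

# HNSW index: graph-based ANN, much faster queries at the cost of some recall
hnsw_index = faiss.IndexHNSWFlat(d, 32)  # 32 = neighbours per node in the graph
hnsw_index.add(corpus)

# IVF index: clusters the corpus, then searches only the closest clusters
n_clusters = 100
quantizer = faiss.IndexFlatL2(d)
ivf_index = faiss.IndexIVFFlat(quantizer, d, n_clusters)
ivf_index.train(corpus)   # IVF requires a training pass to build the clusters
ivf_index.add(corpus)
ivf_index.nprobe = 10     # clusters visited per query: a speed/recall knob

for index in (flat_index, hnsw_index, ivf_index):
    distances, ids = index.search(queries, 5)  # top-5 neighbours per query
```

Most of the databases below expose similar knobs (e.g. the number of graph neighbours for HNSW, or the number of clusters probed for IVF) in their collection or index configuration.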

Comparison of the different vector databases

Dedicated vector databases

Dedicated vector databases are used solely for vector search and are therefore designed and optimised for this use case.
 
Although the various dedicated vector databases present themselves in ways that make them all seem very different (e.g. Pinecone branding itself as “Long-term Memory for AI”), they all fulfil the same purpose: storing vectors, indexing them and making them searchable.

For each database, we will look at the following points:
  • Performance: How does this database compare with others in terms of performance (cf. the ANN benchmark for a good comparison)? PS: performance usually matters if you are handling a large number of vectors (e.g. >100k).
  • Ease of local usage: Local usage is essential to facilitate iterations.
  • Provides a managed Cloud? Deploying your own database may be complex if you do not have a DevOps team, or if you want to get started quickly.
  • Provides a user interface?
  • Recent fund-raising: Fundraising is a good indication of the attractiveness and development potential of the database.
  • Specificity: For a better understanding of which use cases the database should be used for.
 

Pinecone

  • Performance: Seems similar to the other dedicated vector databases
  • Ease of local usage: Not possible as it is not open source
  • Provides a managed Cloud? Yes, with a free-tier
  • Provides a user interface? Yes
  • Recent fund-raising: 100M Series B on 27/04/23
  • Specificity: Among the main competitors, it is the only database that is not open source, which makes local iteration impossible.

Qdrant

  • Performance: Qdrant is written in Rust, and performance appears to be one of its main objectives. In their benchmark, they appear to be significantly faster than their competitors (PS: this is not confirmed by this ANN benchmark, which may not use the same testing conditions; note that the former compares RPS vs precision and the latter RPS vs recall).
  • Ease of local usage: Possible either by deploying the database locally (with docker-compose), by saving it to a disk file (SQLite), or in-memory (changes not persisted)
  • Provides a managed Cloud? Yes, with a free-tier
  • Provides a user interface? Yes
  • Recent fund-raising: 7.5M seed on 24/04/23
  • Specificity: Qdrant’s main advantage seems to be its high performance. Furthermore, the possibility to use it as a disk file or in-memory eases many things, such as running integration tests in a CI, or prototyping and iterating in your local environment (see the sketch after this list). Data scientists who use DVC to orchestrate and version their experiments will find this feature particularly useful, as they will be able to version the corresponding database file for each of their experiments.
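
As an illustration of this workflow, below is a minimal sketch assuming the qdrant-client Python package: the same code can target an in-memory instance, a local disk file, or a deployed server by changing only the client constructor (the collection name, vector size and payloads are arbitrary).

```python
# pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# In-memory mode: nothing is persisted, convenient for tests and quick prototyping
client = QdrantClient(":memory:")
# Disk mode: persists to a local folder, e.g. to version it alongside a DVC experiment
# client = QdrantClient(path="./qdrant_data")
# Deployed mode: point the same code at a Qdrant server
# client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"text": "first doc"}),
        PointStruct(id=2, vector=[0.4, 0.3, 0.2, 0.1], payload={"text": "second doc"}),
    ],
)

hits = client.search(collection_name="docs", query_vector=[0.1, 0.2, 0.3, 0.4], limit=1)
print(hits[0].payload)  # -> {'text': 'first doc'}
```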

Weaviate

  • Performance: Seems similar to the other dedicated vector databases
  • Ease of local usage: Possible either by deploying the database locally (with docker-compose) or by saving it to a disk file.
  • Provides a managed Cloud? Yes, with a free-tier
  • Provides a user interface? Yes
  • Recent fund-raising: 50M Series B on 21/04/23
  • Specificity: Great if you want to query the database with GraphQL (see the sketch after this list).
    All the advantages specified for Qdrant regarding the “possibility to use it as a disk file” also apply to Weaviate.
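
As a sketch of the GraphQL-oriented querying, assuming the v3 weaviate-client Python package and a locally deployed instance (the Article class and title property are hypothetical):

```python
# pip install "weaviate-client<4"  (v3 API, current at the time of writing)
import weaviate

# Assumes a Weaviate instance deployed locally, e.g. with docker-compose
client = weaviate.Client("http://localhost:8080")

# GraphQL-style query built through the client: the 3 objects of the
# (hypothetical) Article class closest to `query_vector`
query_vector = [0.1, 0.2, 0.3, 0.4]
result = (
    client.query
    .get("Article", ["title"])
    .with_near_vector({"vector": query_vector})
    .with_limit(3)
    .do()
)
print(result["data"]["Get"]["Article"])
```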

Milvus

  • Performance: Seems similar to the other dedicated vector databases
  • Ease of local usage: Possible to deploy the database locally (with docker-compose)
  • Provides a managed Cloud? Yes, with a free-tier
  • Provides a user interface? Yes
  • Recent fund-raising: 60M Series B on 24/08/22
  • Specificity: The doyen of vector databases: it has proven itself in many use cases over time. However, its microservices architecture is complex and makes it difficult to manage and debug.

ChromaDB

  • Performance: Written entirely in Python, it is probably not as performant as its competitors (Disclaimer: I have not tried it and could not find any benchmark)
  • Ease of local usage: Possible either by deploying the database locally (with docker-compose), by saving it to a disk file, or in-memory (changes not persisted)
  • Provides a managed Cloud? No
  • Provides a user interface? No
  • Recent fund-raising: 18M seed on 06/04/23
  • Specificity: ChromaDB puts forward its simplicity: written entirely in Python, it is easily customized for specific use cases.
    All the advantages specified for Qdrant regarding the “possibility to use it as a disk file or in-memory” also apply to ChromaDB (see the sketch below).
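
A similar sketch assuming the chromadb Python package, showing the in-memory and on-disk modes (the collection name, embeddings and documents are arbitrary):

```python
# pip install chromadb
import chromadb

# In-memory client: nothing is persisted, handy for prototyping and tests
client = chromadb.Client()
# On-disk client: persists to a local folder
# client = chromadb.PersistentClient(path="./chroma_data")

collection = client.create_collection("docs")
collection.add(
    ids=["1", "2"],
    embeddings=[[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]],
    documents=["first doc", "second doc"],
)

results = collection.query(query_embeddings=[[0.1, 0.2, 0.3, 0.4]], n_results=1)
print(results["documents"])  # -> [['first doc']]
```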

General-purpose databases

pgvector (PostgreSQL) / kNN search (Elasticsearch)

General-purpose databases were not originally designed for vector search and are therefore not as efficient as dedicated vector databases.
However, as mentioned above, if you are using a small number of vectors (e.g. <100k) and are already using one of these databases (likely, since 49% of professional developers use PostgreSQL according to the Stack Overflow 2023 survey), the pragmatic choice is certainly to stick with it and keep your technical stack simple. When cost and/or latency become an issue, consider using a dedicated vector database.
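
As a rough illustration of how vector search can stay inside an existing PostgreSQL setup, here is a minimal sketch assuming a running PostgreSQL instance with the pgvector extension available; it uses psycopg2 with a placeholder connection string and a toy 3-dimensional table.

```python
# pip install psycopg2-binary
import psycopg2

# Placeholder connection string: adapt to your own PostgreSQL instance
conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute(
    "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3));"
)
cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[2,3,4]');")

# Optionally, add an IVF index to speed up larger tables (trades some recall for speed):
# cur.execute("CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);")

# Nearest-neighbour search with pgvector's L2 distance operator (<->)
cur.execute("SELECT id FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 5;")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```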
 
 
Thank you for your time, and do not hesitate to contact us for any vector-database-related project! Although these databases are currently used mainly for Generative AI, they are also very useful in various other fields: for example, in computer vision, to host your support set in few-shot learning projects!

This article was written by

Noé Achache