Chroma - Open-Source Embedding Database
The AI-native database for building LLM applications with memory.
When to use Chroma
Use Chroma when you are:
- Building RAG (retrieval-augmented generation) applications
- Running a local or self-hosted vector database
- Looking for an open-source solution (Apache 2.0)
- Prototyping in notebooks
- Running semantic search over documents
- Storing embeddings with metadata
Metrics:
- 24,300+ GitHub stars
- 1,900+ forks
- v1.3.3 (stable, weekly releases)
- Apache 2.0 license
Use alternatives instead:
- Pinecone: Managed cloud, auto-scaling
- FAISS: Pure similarity search, no metadata
- Weaviate: Production ML-native database
- Qdrant: High performance, Rust-based
Quick start
Installation
# Python
pip install chromadb
# JavaScript/TypeScript
npm install chromadb @chroma-core/default-embed
Basic usage (Python)
import chromadb
# Create an in-memory client (data is lost when the process exits)
client = chromadb.Client()
# Create collection
collection = client.create_collection(name="my_collection")
# Add documents
collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)
# Query
results = collection.query(
    query_texts=["document about topic"],
    n_results=2
)
print(results)
Core operations
1. Create collection
# Simple collection
collection = client.create_collection("my_docs")
# With custom embedding function (an alternative to the simple form above;
# create_collection raises an error if the name already exists)
from chromadb.utils import embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="my_docs",
    embedding_function=openai_ef
)
# Get existing collection
collection = client.get_collection("my_docs")
# Delete collection
client.delete_collection("my_docs")
2. Add documents
# Add documents with metadata and explicit IDs
collection.add(
    documents=["Doc 1", "Doc 2", "Doc 3"],
    metadatas=[
        {"source": "web", "category": "tutorial"},
        {"source": "pdf", "page": 5},
        {"source": "api", "timestamp": "2025-01-01"}
    ],
    ids=["id1", "id2", "id3"]
)
# Add with custom (precomputed) embeddings
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"]
)
3. Query (similarity search)
# Basic query
results = collection.query(
    query_texts=["machine learning tutorial"],
    n_results=5
)
# Query with a single metadata filter
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    where={"source": "web"}
)
# Query with combined metadata filters
results = collection.query(
    query_texts=["advanced topics"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$gte": 3}}
        ]
    }
)
# Access results (each field holds one inner list per query text)
print(results["documents"])  # Matching documents
print(results["metadatas"])  # Metadata for each match
print(results["distances"])  # Distances (lower means more similar)
print(results["ids"])  # Document IDs
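Because query results come back as parallel lists grouped per query text, it can help to regroup them into one record per hit. The following is a minimal sketch, not part of Chroma itself: flatten_query_results is a hypothetical helper, and the sample dictionary is hand-written to mimic the shape of a collection.query() return value, assuming documents, metadatas, and distances are all included in the result.

```python
from typing import Any


def flatten_query_results(results: dict[str, Any]) -> list[list[dict[str, Any]]]:
    """Regroup column-oriented query results into per-query lists of hit records."""
    hits_per_query = []
    for query_index, ids in enumerate(results["ids"]):
        hits = []
        for rank, doc_id in enumerate(ids):
            hits.append({
                "id": doc_id,
                "document": results["documents"][query_index][rank],
                "metadata": results["metadatas"][query_index][rank],
                "distance": results["distances"][query_index][rank],
            })
        hits_per_query.append(hits)
    return hits_per_query


# Hand-written sample shaped like a query result for a single query text
sample = {
    "ids": [["id1", "id2"]],
    "documents": [["This is document 1", "This is document 2"]],
    "metadatas": [[{"source": "doc1"}, {"source": "doc2"}]],
    "distances": [[0.25, 0.75]],
}

hits = flatten_query_results(sample)
for hit in hits[0]:
    print(hit["id"], hit["distance"], hit["document"])
```

The outer list has one entry per query text, so hits[0] holds the matches for the first query in query_texts.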
4. Get documents
# Get by IDs
docs = collection.get(
    ids=["id1", "id2"]
)
# Get with filters
docs = collection.get(
    where={"category": "tutorial"},
    limit=10
)
# Get all documents
docs = collection.get()
5. Update documents
# Update document content and metadata for an existing ID
collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}]
)
6. Delete documents
# Delete by IDs
collection.delete(ids=["id1", "id2"])
# Delete with filter
collection.delete(
    where={"source": "outdated"}
)
Persistent storage
# Persist to disk
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection("my_docs")
collection.add(documents=["Doc 1"], ids=["id1"])
# Data persisted automatically
# Reload later with same path
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_docs")
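Since the persistent client keeps its data under one directory, a backup can be sketched as a plain directory copy. This is a minimal illustration under stated assumptions: backup_chroma_dir is a hypothetical helper, not a Chroma API, and it assumes no process is writing to the directory while the copy runs. The demo uses a throwaway directory standing in for ./chroma_db.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_chroma_dir(db_path: str, backup_root: str) -> Path:
    """Copy the persistent directory into a timestamped folder under backup_root."""
    source = Path(db_path)
    if not source.is_dir():
        raise FileNotFoundError(f"No directory found at {source}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree creates the target (and missing parents) and fails if it already exists
    shutil.copytree(source, target)
    return target


# Demo on a temporary directory that stands in for ./chroma_db
with tempfile.TemporaryDirectory() as tmp:
    db_dir = Path(tmp) / "chroma_db"
    db_dir.mkdir()
    (db_dir / "placeholder.bin").write_bytes(b"data")
    backup = backup_chroma_dir(str(db_dir), str(Path(tmp) / "backups"))
    print(backup.name, sorted(p.name for p in backup.iterdir()))
```

Restoring would be the reverse copy, followed by opening a PersistentClient on the restored path.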
Embedding functions
Default (Sentence Transformers)
# No embedding_function is passed, so Chroma uses its built-in default
collection = client.create_collection("my_docs")
# Default model: all-MiniLM-L6-v2
OpenAI
from chromadb.utils import embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="openai_docs",
    embedding_function=openai_ef
)
HuggingFace
from chromadb.utils import embedding_functions
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)
collection = client.create_collection(
    name="hf_docs",
    embedding_function=huggingface_ef
)
Custom embedding function
from chromadb import Documents, EmbeddingFunction, Embeddings
class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Placeholder: replace with your embedding logic. It must return
        # one vector (a list of floats) per input document.
        embeddings = [[0.0, 0.0, 0.0] for _ in input]
        return embeddings
my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)
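To make the "your embedding logic" step concrete, here is a toy embedder that hashes tokens into a fixed number of buckets and L2-normalizes the result. It is a sketch for illustration only: HashingEmbedder is a hypothetical name, the vectors carry no semantic meaning, and the class is written as a plain callable so it runs without chromadb installed. To plug it into Chroma, the same __call__ body would go inside an EmbeddingFunction subclass like the one above.

```python
import hashlib
import math


class HashingEmbedder:
    """Toy embedder: counts tokens per hash bucket, then L2-normalizes each vector."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def __call__(self, input: list[str]) -> list[list[float]]:
        embeddings = []
        for text in input:
            vector = [0.0] * self.dim
            for token in text.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % self.dim
                vector[bucket] += 1.0
            norm = math.sqrt(sum(v * v for v in vector))
            if norm > 0:
                vector = [v / norm for v in vector]
            embeddings.append(vector)
        return embeddings


embedder = HashingEmbedder(dim=16)
vectors = embedder(["Doc 1", "Doc 2"])
print(len(vectors), len(vectors[0]))
```

The important contract is the shape: one vector per input document, all with the same dimension, and the same text always mapping to the same vector.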
Metadata filtering
# Exact match
results = collection.query(
    query_texts=["query"],
    where={"category": "tutorial"}
)
# Comparison operators
results = collection.query(
    query_texts=["query"],
    where={"page": {"$gt": 10}}  # $gt, $gte, $lt, $lte, $ne
)
# Logical operators
results = collection.query(
    query_texts=["query"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$lte": 3}}
        ]
    }  # Also: $or
)
# Membership: matches when the metadata value is one of the listed values
results = collection.query(
    query_texts=["query"],
    where={"tags": {"$in": ["python", "ml"]}}
)
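When filters are assembled from user input, the nesting above is easy to get wrong. The helper below is a minimal sketch of one way to build a where dictionary from field conditions. build_where is a hypothetical name, not a Chroma function, and it assumes that a single condition should be passed bare rather than wrapped in a one-element $and (Chroma is understood to expect at least two expressions under $and; treat that as an assumption).

```python
from typing import Any, Optional


def build_where(conditions: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Combine field conditions into one filter dictionary.

    Each value is either a scalar (exact match) or an operator
    dictionary such as {"$gte": 3}. Returns None when there is
    nothing to filter on.
    """
    clauses = [{field: value} for field, value in conditions.items()]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


print(build_where({"category": "tutorial"}))
print(build_where({"category": "tutorial", "difficulty": {"$lte": 3}}))
```

The result can be passed as the where argument of collection.query() or collection.get(); a None result corresponds to leaving the filter out.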
LangChain integration
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Split documents (`documents` is a list of LangChain Document objects loaded elsewhere)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)
# Create Chroma vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)
# Query
results = vectorstore.similarity_search("machine learning", k=3)
# As retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
LlamaIndex integration
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import chromadb
# Initialize Chroma
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")
# Create vector store
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Create index (`documents` is a list of LlamaIndex Document objects loaded elsewhere)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")
Server mode
# Run Chroma server
# Terminal: chroma run --path ./chroma_db --port 8000
# Connect to server
import chromadb
from chromadb.config import Settings
client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False)
)
# Use as normal
collection = client.get_or_create_collection("my_docs")
Best practices
- Use persistent client - Don't lose data on restart
- Add metadata - Enables filtering and tracking
- Batch operations - Add multiple docs at once
- Choose right embedding model - Balance speed/quality
- Use filters - Narrow search space
- Unique IDs - Avoid collisions
- Regular backups - Copy chroma_db directory
- Monitor collection size - Scale up if needed
- Test embedding functions - Ensure quality
- Use server mode for production - Better for multi-user
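The batching and unique-ID practices above can be sketched together. This is an illustrative sketch, not a Chroma feature: content_id and batched are hypothetical helpers, the 16-character hash prefix is an arbitrary choice, and deriving IDs from content assumes that identical texts should share an ID (so duplicates should be removed before adding). The collection.add call is shown only as a comment so the example runs without a database.

```python
import hashlib
from typing import Iterator


def content_id(text: str) -> str:
    """Derive a stable ID from the document text (same text gives the same ID)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


documents = [f"Doc {i}" for i in range(1, 8)]
for batch in batched(documents, batch_size=3):
    ids = [content_id(doc) for doc in batch]
    # With a real collection: collection.add(documents=batch, ids=ids)
    print(len(batch), ids[0])
```

Stable IDs also make re-ingestion predictable, because rerunning the loader produces the same IDs for unchanged documents.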
Performance
| Operation | Latency | Notes |
|---|---|---|
| Add 100 docs | ~1-3s | With embedding |
| Query (top 10) | ~50-200ms | Depends on collection size |
| Metadata filter | ~10-50ms | Fast with proper indexing |
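The figures in the table are rough and depend on hardware, embedding model, and collection size, so it is worth measuring on your own setup. Below is a minimal timing sketch using only the standard library. measure_latency_ms is a hypothetical helper, and the lambda is a stand-in workload; in practice it would wrap a call such as collection.query(...).

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(operation: Callable[[], object], repeats: int = 5) -> dict[str, float]:
    """Run operation several times and report min, median, and max latency in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


# Stand-in workload; replace the lambda with the Chroma call to be measured
stats = measure_latency_ms(lambda: sum(range(100_000)), repeats=5)
print(f"median {stats['median']:.2f} ms (min {stats['min']:.2f}, max {stats['max']:.2f})")
```

Reporting the median alongside min and max keeps a single slow first call (for example, model loading) from skewing the picture.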
Resources
- GitHub: https://github.com/chroma-core/chroma ⭐ 24,300+
- Docs: https://docs.trychroma.com
- Discord: https://discord.gg/MMeYNTmh3x
- Version: 1.3.3+
- License: Apache 2.0