ChromaDB is a vector database widely used in personal projects and proofs of concept to experiment with semantic search. It is attractive because it is easy to use and works primarily in memory. Let's see how to start and use this database in a Docker container.
To run ChromaDB with Docker, you only need to execute:
docker volume create chroma_data
docker run -d --name chromadb -p 8000:8000 -v chroma_data:/chromadb/data chromadb/chroma

(Note: the directory where the container persists data depends on the image version, so check the chromadb/chroma image documentation before relying on this volume mount path.) The documentation of the API exposed by the ChromaDB service is available at the following link:
http://localhost:8000/docs

The following program, importdocs.py, ingests all the text files in a directory:
import os
import re

import ollama
import chromadb


def readtextfiles(path):
    """Read every .txt file in a directory into a {filename: content} dict."""
    text_contents = {}
    directory = os.path.join(path)
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            file_path = os.path.join(directory, filename)
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()
            text_contents[filename] = content
    return text_contents
def chunksplitter(text, chunk_size=100):
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = re.findall(r'\S+', text)
    chunks = []
    current_chunk = []
    word_count = 0
    for word in words:
        current_chunk.append(word)
        word_count += 1
        if word_count >= chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            word_count = 0
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
def getembedding(chunks):
    """Embed a list of text chunks with the nomic-embed-text model served by Ollama."""
    embeds = ollama.embed(model="nomic-embed-text", input=chunks)
    return embeds.get('embeddings', [])
chromaclient = chromadb.HttpClient(host="localhost", port=8000)

textdocspath = "../../scripts"
text_data = readtextfiles(textdocspath)

# Start from a clean collection on every run: delete it if it already
# exists, then create it using cosine distance for the HNSW index.
collectionname = "buildragwithpython"
if any(c.name == collectionname for c in chromaclient.list_collections()):
    chromaclient.delete_collection(collectionname)
collection = chromaclient.get_or_create_collection(
    name=collectionname, metadata={"hnsw:space": "cosine"})
for filename, text in text_data.items():
    chunks = chunksplitter(text)
    embeds = getembedding(chunks)
    chunknumber = list(range(len(chunks)))
    ids = [filename + str(index) for index in chunknumber]
    metadatas = [{"source": filename} for index in chunknumber]
    collection.add(ids=ids, documents=chunks, embeddings=embeds, metadatas=metadatas)

The program has the following requirements.txt:
chromadb
ollama

(os and re are part of the Python standard library, so they do not belong in requirements.txt.)
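The chunking step in importdocs.py can be tried on its own. The sketch below is a compact, slicing-based equivalent of chunksplitter (this rewrite is mine, not the original code); it produces the same groups of at most chunk_size words:

```python
import re

def chunksplitter(text, chunk_size=100):
    # Split on whitespace, then group every chunk_size consecutive words.
    words = re.findall(r'\S+', text)
    return [' '.join(words[start:start + chunk_size])
            for start in range(0, len(words), chunk_size)]

# 250 words split into chunks of 100, 100 and 50 words.
sample = ' '.join(f'word{i}' for i in range(250))
print([len(c.split()) for c in chunksplitter(sample)])  # [100, 100, 50]
```

Chunking by a fixed word count is the simplest possible strategy; sentence- or paragraph-aware splitting usually retrieves better, but this keeps the example easy to follow.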
The following program, search.py, shows one way to implement semantic search over our knowledge base stored in ChromaDB:
import sys, chromadb, ollama
chromaclient = chromadb.HttpClient(host="localhost", port=8000)
collection = chromaclient.get_or_create_collection(name="buildragwithpython")
query = " ".join(sys.argv[1:])
queryembed = ollama.embed(model="nomic-embed-text", input=query)['embeddings']
relateddocs = '\n\n'.join(collection.query(query_embeddings=queryembed, n_results=10)['documents'][0])
prompt = f"{query} - Answer that question using the following text as a resource: {relateddocs}"
# Baseline: answer the question without the retrieved context
noragoutput = ollama.generate(model="mistral", prompt=query, stream=False)
print(f"Answered without RAG: {noragoutput['response']}")
print("---")

# RAG: answer using the related chunks retrieved from ChromaDB
ragoutput = ollama.generate(model="llama3.1", prompt=prompt, stream=False)
print(f"Answered with RAG: {ragoutput['response']}")
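The collection was created with metadata={"hnsw:space": "cosine"}, so collection.query ranks chunks by cosine distance, i.e. 1 minus the cosine similarity of the embedding vectors. ChromaDB computes this internally; the function below is only an illustrative stdlib sketch of the metric:

```python
import math

def cosine_distance(a, b):
    # 1 - (a.b) / (|a| * |b|): 0 means same direction, 1 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (identical direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Because cosine distance ignores vector magnitude, two chunks about the same topic score as close even if one is much longer than the other, which is why it is a common default for text embeddings.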
For attribution, please cite this work as
Sosa (2024, Aug. 20). Blog de José R Sosa: ChromaDB base de datos vectorial en Docker. Retrieved from https://josersosa.github.io/personalweb/posts/2026-01-27-chromadb-base-de-datos-vectorial-en-docker/
BibTeX citation
@misc{sosa2024chromadb,
  author = {Sosa, José R},
  title = {Blog de José R Sosa: ChromaDB base de datos vectorial en Docker},
  url = {https://josersosa.github.io/personalweb/posts/2026-01-27-chromadb-base-de-datos-vectorial-en-docker/},
  year = {2024}
}