José R Sosa's blog: ChromaDB vector database in Docker

José R Sosa

Sources:

Starting the ChromaDB service in Docker:

To run ChromaDB with Docker, the following two commands are enough:

docker volume create chroma_data
docker run -d --name chromadb -p 8000:8000 -v chroma_data:/chromadb/data chromadb/chroma
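
Before ingesting anything, it can help to confirm that the container is actually listening. The sketch below is not part of the original setup: it only checks that a TCP connection to the published port succeeds, which says nothing about whether Chroma itself is healthy. The helper name wait_for_port and its timeout values are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("localhost", 8000) right after the docker run command returns True as soon as the port published with -p 8000:8000 accepts connections.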

Service description

The documentation for the API exposed by the ChromaDB service is available at the following link:

http://localhost:8000/docs
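
The /docs page is an interactive view of an OpenAPI description. Assuming the server also publishes the raw description at /openapi.json (a common convention that this post does not confirm and that may differ between Chroma versions), a short sketch like this one lists the available endpoints. The function names list_endpoints and fetch_spec are illustrative.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(spec: dict) -> list[str]:
    """Return sorted 'METHOD /path' strings from a parsed OpenAPI document."""
    endpoints = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return sorted(endpoints)


def fetch_spec(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI document; the /openapi.json path is an assumption."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With the container running, printing each item of list_endpoints(fetch_spec()) gives a plain-text overview of the same API that /docs renders in the browser.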

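Ingesting documents with Python

The following program, importdocs.py, ingests every text file in a directory: it splits each .txt file into chunks of 100 words, embeds the chunks with Ollama's nomic-embed-text model, and stores them in ChromaDB: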

import os
import re
import ollama
import chromadb

def readtextfiles(path):
  # Return a dict mapping each .txt file name in path to its contents.
  text_contents = {}

  for filename in os.listdir(path):
    if filename.endswith(".txt"):
      file_path = os.path.join(path, filename)

      with open(file_path, "r", encoding="utf-8") as file:
        content = file.read()

      text_contents[filename] = content

  return text_contents

def chunksplitter(text, chunk_size=100):
  # Split the text into chunks of at most chunk_size whitespace-separated words.
  words = re.findall(r'\S+', text)

  chunks = []
  current_chunk = []
  word_count = 0

  for word in words:
    current_chunk.append(word)
    word_count += 1

    if word_count >= chunk_size:
      chunks.append(' '.join(current_chunk))
      current_chunk = []
      word_count = 0

  if current_chunk:
    chunks.append(' '.join(current_chunk))

  return chunks

def getembedding(chunks):
  embeds = ollama.embed(model="nomic-embed-text", input=chunks)
  return embeds.get('embeddings', [])

chromaclient = chromadb.HttpClient(host="localhost", port=8000)
textdocspath = "../../scripts"
text_data = readtextfiles(textdocspath)

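# Start from a clean collection: delete any previous one, then create it.
# Depending on the chromadb version, list_collections() returns Collection
# objects or plain names, so both cases are handled.
collectionname = "buildragwithpython"
existing = [getattr(c, "name", c) for c in chromaclient.list_collections()]
if collectionname in existing:
  chromaclient.delete_collection(collectionname)
collection = chromaclient.get_or_create_collection(name=collectionname, metadata={"hnsw:space": "cosine"})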

for filename, text in text_data.items():
  chunks = chunksplitter(text)
  if not chunks:
    continue  # skip empty files: there is nothing to embed or store
  embeds = getembedding(chunks)
  # One id per chunk: the file name followed by the chunk index.
  ids = [filename + str(index) for index in range(len(chunks))]
  metadatas = [{"source": filename} for _ in chunks]
  collection.add(ids=ids, documents=chunks, embeddings=embeds, metadatas=metadatas)
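
chunksplitter cuts the text every 100 words, so a sentence that straddles a boundary ends up divided between two chunks. A common variation, which is not what importdocs.py does, is to let consecutive chunks share a few words. The sketch below shows that sliding-window variant; the function name chunksplitter_overlap and the overlap parameter are illustrative assumptions.

```python
import re


def chunksplitter_overlap(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into chunks of up to chunk_size words, where each chunk
    repeats the last `overlap` words of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = re.findall(r"\S+", text)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

With overlap=0 this produces the same chunks as chunksplitter. Larger overlaps store more redundant text in exchange for keeping boundary sentences intact in at least one chunk.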

It uses the following requirements.txt (os and re are part of the Python standard library and are not installed with pip):

chromadb
ollama
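
importdocs.py imports two third-party packages, chromadb and ollama. As a quick sanity check before running it, a sketch along these lines reports which modules cannot be found in the current environment. The helper name missing_modules is an assumption made for illustration.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]
```

An empty list from missing_modules(["chromadb", "ollama"]) means both packages are importable; any name it returns still needs to be installed.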

Searching ChromaDB with Python

The following program, search.py, shows one way to implement semantic search over the knowledge base stored in ChromaDB:

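import sys, chromadb, ollama

chromaclient = chromadb.HttpClient(host="localhost", port=8000)
collection = chromaclient.get_or_create_collection(name="buildragwithpython")

# The question is taken from the command-line arguments.
query = " ".join(sys.argv[1:])
if not query:
  sys.exit("Usage: python search.py <question>")

# Embed the question with the same model used at ingestion time.
queryembed = ollama.embed(model="nomic-embed-text", input=query)['embeddings']

# Retrieve the 10 chunks closest to the question and join them into one context string.
relateddocs = '\n\n'.join(collection.query(query_embeddings=queryembed, n_results=10)['documents'][0])
prompt = f"{query} - Answer that question using the following text as a resource: {relateddocs}"

# Baseline: the bare question, without any retrieved context.
noragoutput = ollama.generate(model="mistral", prompt=query, stream=False)
print(f"Answered without RAG: {noragoutput['response']}")
print("---")
# Same question with the retrieved chunks in the prompt. This call uses a
# different model, so the two answers are not a like-for-like comparison.
ragoutput = ollama.generate(model="llama3.1", prompt=prompt, stream=False)

print(f"Answered with RAG: {ragoutput['response']}")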
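
search.py concatenates all ten retrieved chunks regardless of how close they are to the question. Because the collection was created with hnsw:space set to cosine, each hit comes with a cosine distance (1 minus the cosine similarity, so lower means closer). The sketch below, which is not part of the original program, drops hits above a cutoff and collects their source file names before the prompt is built. The function name build_context and the 0.5 cutoff are illustrative assumptions, and the input is assumed to have the shape returned by collection.query for a single query: one inner list under each of the documents, metadatas and distances keys.

```python
def build_context(result: dict, max_distance: float = 0.5) -> tuple[str, list[str]]:
    """Join the retrieved chunks whose distance is at most max_distance.

    `result` is assumed to look like the output of collection.query for a
    single query. Returns the context text and the distinct source names.
    """
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]

    kept, sources = [], []
    for document, metadata, distance in zip(documents, metadatas, distances):
        if distance > max_distance:
            continue  # too far from the question to be useful
        kept.append(document)
        source = (metadata or {}).get("source", "unknown")
        if source not in sources:
            sources.append(source)
    return "\n\n".join(kept), sources
```

In search.py this would take the place of the '\n\n'.join(...) expression: call collection.query once, pass its result to build_context, and interpolate the returned text into the prompt. The right cutoff depends on the embedding model and the documents, so it is worth printing the distances for a few known questions before settling on a value.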
