[Bug]ChromaDB Agent Knowledge with multiple pdf url #2129

CodeTilde · 2025-02-14T19:07:15Z

Description

ChromaDb does not add the document chunks from the second provided URL to the collection.

Steps to Reproduce

I tested PDFUrlKnowledgeBase with two PDF URLs and ChromaDb as the vector database.

Agent Configuration (if applicable)

# Import paths assume Agno v1.x
from agno.agent import Agent
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.pdf_url import PDFUrlKnowledgeBase
from agno.models.openai import OpenAIChat
from agno.vectordb.chroma import ChromaDb

vector_db = ChromaDb(
    collection="pdf_knowledge",
    path="./tmp/chromadb",
    persistent_client=True,
    embedder=OpenAIEmbedder(id="text-embedding-3-small"),
)
knowledge_base = PDFUrlKnowledgeBase(
    urls=[
        "https://www-file.huawei.com/-/media/corp2020/pdf/tech-insights/1/6g-white-paper-en.pdf",
        "https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    ],
    vector_db=vector_db,
)

knowledge_base.load()

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    knowledge=knowledge_base,
    search_knowledge=True,
    show_tool_calls=True,
    debug_mode=True,
)

agent.print_response("Give the recipe of Thai Fried Noodles with Shrimps?")

Expected Behavior

What did you expect to happen?
The documents from the second URL should have been added to the knowledge base, and the answer to the question "Give the recipe of Thai Fried Noodles with Shrimps?" should have been extracted from the second PDF.

Actual Behavior

What actually happened instead?
The document chunks of the second PDF (URL) were not added to the Chroma collection.

Screenshots or Logs (if applicable)

Include any relevant screenshots or error logs that demonstrate the issue.
INFO Creating collection
INFO Loading knowledge base
INFO Reading: https://www-file.huawei.com/-/media/corp2020/pdf/tech-insights/1/6g-white-paper-en.pdf
INFO Added 33 documents to knowledge base
INFO Reading: https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf
INFO Added 0 documents to knowledge base
.....
....
DEBUG ============== assistant ==============
DEBUG It seems there was an issue with retrieving the recipe for Thai Fried Noodles with Shrimps from the knowledge
base. However, I can provide you with a general recipe for making this delicious dish:

Environment

OS: Ubuntu
Browser (if relevant): (e.g. Chrome 108, Firefox 107)
Agno Version: (e.g. v1.0.0)
Additional Environment Details: (e.g., Python 3.10)

Possible Solutions (optional)

Suggest any ideas you might have to fix or address the issue.

In agno/knowledge/agent.py, the condition "not self.vector_db.doc_exists(doc)" is always False for the second PDF, so none of its chunks are loaded:

for doc in document_list:
    if doc.content not in seen_content and not self.vector_db.doc_exists(doc):
        seen_content.add(doc.content)
        documents_to_load.append(doc)
self.vector_db.insert(documents=documents_to_load, filters=filters)
num_documents += len(documents_to_load)
logger.info(f"Added {len(documents_to_load)} documents to knowledge base")

It seems that doc_exists has not been implemented properly. Once the chunks of the first PDF have been added, the collection is no longer empty, so the check collection_data.get("documents") != [] passes and doc_exists returns True for every document, whether or not that document is actually stored.

Agno/vector/chroma/chromadb.py:

def doc_exists(self, document: Document) -> bool:
    """Check if a document exists in the collection.
    Args:
        document (Document): Document to check.
    Returns:
        bool: True if document exists, False otherwise.
    """
    if self.client:
        try:
            collection: Collection = self.client.get_collection(name=self.collection_name)
            collection_data: GetResult = collection.get(include=[IncludeEnum.documents])
            if collection_data.get("documents") != []:
                return True
        except Exception as e:
            logger.error(f"Document does not exist: {e}")
    return False
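
The interaction between the loading loop and the existence check can be reproduced without ChromaDB or network access. The sketch below is only an illustration: Doc, FakeVectorDb and load are hypothetical stand-ins written for this example, not Agno classes. With exact_match=False, doc_exists mirrors the reported behaviour and returns True as soon as anything is stored. With exact_match=True it compares the content of the specific document.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    content: str


class FakeVectorDb:
    """Hypothetical in-memory stand-in for the vector store (not the Agno class)."""

    def __init__(self, exact_match: bool):
        self.stored: list[str] = []
        self.exact_match = exact_match

    def doc_exists(self, doc: Doc) -> bool:
        if self.exact_match:
            # Checks whether this specific document's content is stored.
            return doc.content in self.stored
        # Mirrors the reported behaviour: True whenever the collection is non-empty.
        return self.stored != []

    def insert(self, documents: list[Doc]) -> None:
        self.stored.extend(d.content for d in documents)


def load(vector_db: FakeVectorDb, document_lists: list[list[Doc]]) -> list[int]:
    """Same filtering loop as in the excerpt; returns the number added per source."""
    seen_content = set()
    added = []
    for document_list in document_lists:
        documents_to_load = []
        for doc in document_list:
            if doc.content not in seen_content and not vector_db.doc_exists(doc):
                seen_content.add(doc.content)
                documents_to_load.append(doc)
        vector_db.insert(documents_to_load)
        added.append(len(documents_to_load))
    return added


first_pdf = [Doc("6G chunk 1"), Doc("6G chunk 2")]
second_pdf = [Doc("Thai chunk 1"), Doc("Thai chunk 2")]

buggy_counts = load(FakeVectorDb(exact_match=False), [first_pdf, second_pdf])
fixed_counts = load(FakeVectorDb(exact_match=True), [first_pdf, second_pdf])
print(buggy_counts)  # [2, 0]
print(fixed_counts)  # [2, 2]
```

The [2, 0] result has the same shape as the log above, where the first URL adds 33 documents and the second adds 0.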

Additional Context

Add any other context or details about the problem here.


CodeTilde · 2025-02-15T08:36:37Z

The following modification of the doc_exists function in chromadb.py worked for me:

def doc_exists(self, document: Document) -> bool:
    """Check if a specific document exists in the collection.

    Args:
        document (Document): Document to check.
    Returns:
        bool: True if the exact document exists, False otherwise.
    """
    if not self.client:
        return False

    try:
        collection: Collection = self.client.get_collection(name=self.collection_name)
        collection_data: GetResult = collection.get(include=[IncludeEnum.documents])

        # Get existing documents from collection
        existing_docs = collection_data.get("documents", [])

        # Clean document content for comparison
        cleaned_content = document.content.replace("\x00", "\ufffd")

        # Check if exact document exists
        return cleaned_content in existing_docs

    except Exception as e:
        logger.error(f"Error checking document existence: {e}")
        return False
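
One caveat with the fix above is that it fetches every stored document on each call, which may become slow for large collections. A possible alternative, sketched below under stated assumptions, is to look the document up by a deterministic ID. The sketch assumes documents are inserted under an ID derived from their cleaned content (an MD5 hash here, which would have to match whatever the insert path really uses). content_id and FakeCollection are hypothetical names for this example. FakeCollection only imitates the shape of Chroma's Collection.get(ids=...), which returns a dict with an "ids" list.

```python
import hashlib


def content_id(content: str) -> str:
    """Hypothetical helper: deterministic ID derived from the cleaned content."""
    cleaned = content.replace("\x00", "\ufffd")
    return hashlib.md5(cleaned.encode("utf-8")).hexdigest()


class FakeCollection:
    """In-memory stand-in exposing only add(...) and get(ids=...)."""

    def __init__(self):
        self._docs = {}

    def add(self, ids, documents):
        self._docs.update(zip(ids, documents))

    def get(self, ids):
        # Return only the requested IDs that are actually stored.
        found = [i for i in ids if i in self._docs]
        return {"ids": found, "documents": [self._docs[i] for i in found]}


def doc_exists(collection, content: str) -> bool:
    """True only if a document with this content's ID is stored."""
    result = collection.get(ids=[content_id(content)])
    return len(result["ids"]) > 0


collection = FakeCollection()
chunk = "6G white paper, chunk 1"
collection.add(ids=[content_id(chunk)], documents=[chunk])

print(doc_exists(collection, chunk))
print(doc_exists(collection, "Thai recipes, chunk 1"))
```

This keeps the per-document semantics of the fix while asking the store for a single ID instead of the whole collection.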

CodeTilde added the "bug" label (Something isn't working) on Feb 14, 2025
