[Bug]ChromaDB Agent Knowledge with multiple pdf url #2129

Open
CodeTilde opened this issue Feb 14, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@CodeTilde

Description

When a PDFUrlKnowledgeBase is given more than one URL, ChromaDb does not add the document chunks from the second URL to the collection.

Steps to Reproduce

Create a PDFUrlKnowledgeBase with two PDF URLs backed by ChromaDb, load it, and ask a question that can only be answered from the second PDF.

Agent Configuration (if applicable)

# Imports assumed for Agno v1.x; they were not shown in the original report.
from agno.agent import Agent
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.pdf_url import PDFUrlKnowledgeBase
from agno.models.openai import OpenAIChat
from agno.vectordb.chroma import ChromaDb

vector_db = ChromaDb(
    collection="pdf_knowledge",
    path="./tmp/chromadb",
    persistent_client=True,
    embedder=OpenAIEmbedder(id="text-embedding-3-small"),
)
knowledge_base = PDFUrlKnowledgeBase(
    urls=[
        "https://www-file.huawei.com/-/media/corp2020/pdf/tech-insights/1/6g-white-paper-en.pdf",
        "https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    ],
    vector_db=vector_db,
)

knowledge_base.load()

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    knowledge=knowledge_base,
    search_knowledge=True,
    show_tool_calls=True,
    debug_mode=True,
)

agent.print_response("Give the recipe of Thai Fried Noodles with Shrimps?")

Expected Behavior

What did you expect to happen?
The documents from the second URL should have been added to the knowledge base, and the answer to the question "Give the recipe of Thai Fried Noodles with Shrimps?" should have been extracted from the second PDF.

Actual Behavior

What actually happened instead?
The document chunks of the second PDF (URL) were not added to the Chroma collection, so the agent fell back to a general answer instead of using the knowledge base.

Screenshots or Logs (if applicable)

Include any relevant screenshots or error logs that demonstrate the issue.
INFO Creating collection
INFO Loading knowledge base
INFO Reading: https://www-file.huawei.com/-/media/corp2020/pdf/tech-insights/1/6g-white-paper-en.pdf
INFO Added 33 documents to knowledge base
INFO Reading: https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf
INFO Added 0 documents to knowledge base
.....
....
DEBUG ============== assistant ==============
DEBUG It seems there was an issue with retrieving the recipe for Thai Fried Noodles with Shrimps from the knowledge
base. However, I can provide you with a general recipe for making this delicious dish:

Environment

  • OS: Ubuntu
  • Browser (if relevant): (e.g. Chrome 108, Firefox 107)
  • Agno Version: (e.g. v1.0.0)
  • Additional Environment Details: (e.g., Python 3.10)

Possible Solutions (optional)

Suggest any ideas you might have to fix or address the issue.

In agno:knowledge:agent.py, the condition "not self.vector_db.doc_exists(doc)" is always False for the second PDF, so documents_to_load stays empty and nothing is inserted:

for doc in document_list:
    if doc.content not in seen_content and not self.vector_db.doc_exists(doc):
        seen_content.add(doc.content)
        documents_to_load.append(doc)
self.vector_db.insert(documents=documents_to_load, filters=filters)
num_documents += len(documents_to_load)
logger.info(f"Added {len(documents_to_load)} documents to knowledge base")

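It seems that doc_exists is not implemented properly in Agno/vector/chroma/chromadb.py.

The method never looks at the document it is given; it only checks whether the collection is non-empty. Once the chunks of the first PDF have been added, the condition "collection_data.get("documents") != []" holds on every later call, so doc_exists returns True for every chunk of the second PDF:
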
def doc_exists(self, document: Document) -> bool:
    """Check if a document exists in the collection.
    Args:
        document (Document): Document to check.
    Returns:
        bool: True if document exists, False otherwise.
    """
    if self.client:
        try:
            collection: Collection = self.client.get_collection(name=self.collection_name)
            collection_data: GetResult = collection.get(include=[IncludeEnum.documents])
            if collection_data.get("documents") != []:
                return True
        except Exception as e:
            logger.error(f"Document does not exist: {e}")
    return False
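
The effect of this check can be reproduced without Chroma. The sketch below is a minimal stand-in, not Agno's code: FakeVectorDb, load, and the sample chunk strings are hypothetical names invented for illustration. It mirrors the loading loop quoted earlier and compares a collection-level check with a content-level one.

```python
from typing import List


class FakeVectorDb:
    """Hypothetical in-memory stand-in for a vector store (not Agno's ChromaDb)."""

    def __init__(self, content_aware: bool) -> None:
        self.content_aware = content_aware
        self.documents: List[str] = []

    def doc_exists(self, content: str) -> bool:
        if self.content_aware:
            # Content-level check: is this exact chunk already stored?
            return content in self.documents
        # Collection-level check, as in the quoted code: True once anything is stored.
        return self.documents != []

    def insert(self, contents: List[str]) -> None:
        self.documents.extend(contents)


def load(db: FakeVectorDb, document_lists: List[List[str]]) -> List[int]:
    """Mirror the loading loop and return how many chunks were added per source."""
    added = []
    for document_list in document_lists:
        seen_content = set()
        documents_to_load = []
        for content in document_list:
            if content not in seen_content and not db.doc_exists(content):
                seen_content.add(content)
                documents_to_load.append(content)
        db.insert(documents_to_load)
        added.append(len(documents_to_load))
    return added


first_pdf = ["6g chunk 1", "6g chunk 2", "6g chunk 3"]
second_pdf = ["recipe chunk 1", "recipe chunk 2"]

print(load(FakeVectorDb(content_aware=False), [first_pdf, second_pdf]))  # [3, 0]
print(load(FakeVectorDb(content_aware=True), [first_pdf, second_pdf]))   # [3, 2]
```

With the collection-level check, the second source contributes zero chunks, which matches the "Added 0 documents to knowledge base" line in the log.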

Additional Context

Add any other context or details about the problem here.

@CodeTilde CodeTilde added the bug Something isn't working label Feb 14, 2025
@CodeTilde

CodeTilde commented Feb 15, 2025

The following modification of the doc_exists function in chromadb.py worked for me:

def doc_exists(self, document: Document) -> bool:
    """Check if a specific document exists in the collection.

    Args:
        document (Document): Document to check.
    Returns:
        bool: True if the exact document exists, False otherwise.
    """
    if not self.client:
        return False

    try:
        collection: Collection = self.client.get_collection(name=self.collection_name)
        collection_data: GetResult = collection.get(include=[IncludeEnum.documents])
    
        # Get existing documents from collection
        existing_docs = collection_data.get("documents", [])
    
        # Clean document content for comparison
        cleaned_content = document.content.replace("\x00", "\ufffd")
    
        # Check if exact document exists
        return cleaned_content in existing_docs
    
    except Exception as e:
        logger.error(f"Error checking document existence: {e}")
        return False    
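
One possible refinement, offered only as a sketch: the fix above fetches every stored document on each call, which grows with the collection. If the chunks were inserted under IDs derived from their content, existence could be checked by ID instead. The code below assumes exactly that (an MD5 of the cleaned content as the ID, which is an assumption and not confirmed anywhere in this issue). content_id, doc_exists_by_id, and StubCollection are hypothetical names; the stub only imitates the shape of Chroma's collection.get(ids=[...]), which returns a dict containing an "ids" list.

```python
from hashlib import md5
from typing import Dict, List


def content_id(content: str) -> str:
    """Hypothetical ID scheme: MD5 of the cleaned chunk content (an assumption)."""
    cleaned = content.replace("\x00", "\ufffd")
    return md5(cleaned.encode()).hexdigest()


def doc_exists_by_id(collection, content: str) -> bool:
    """Look up a single ID instead of fetching every stored document."""
    result = collection.get(ids=[content_id(content)])
    return len(result.get("ids", [])) > 0


class StubCollection:
    """Hypothetical stand-in that mimics the result shape of a Chroma collection."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, ids: List[str], documents: List[str]) -> None:
        for doc_id, doc in zip(ids, documents):
            self._docs[doc_id] = doc

    def get(self, ids: List[str]) -> Dict[str, List[str]]:
        # Return only the IDs that are actually stored.
        found = [doc_id for doc_id in ids if doc_id in self._docs]
        return {"ids": found, "documents": [self._docs[i] for i in found]}


collection = StubCollection()
chunk = "Thai Fried Noodles with Shrimps"
collection.add(ids=[content_id(chunk)], documents=[chunk])

print(doc_exists_by_id(collection, chunk))                   # True
print(doc_exists_by_id(collection, "6G white paper chunk"))  # False
```

Whether this applies depends on how the IDs are generated at insert time; if they are not content-derived, the content comparison from the comment above remains the safer check.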
