A document QA system built with LangChain and ChromaDB for analyzing DC housing policy documents.

qinip/DC_Housing_Chatbot

DC Housing Policy Chatbot (Demo)

A document QA chatbot demo built with LangChain and ChromaDB for analyzing DC housing policy documents.

Credit

This project is adapted from this tutorial and Anthropic's Contextual Retrieval Cookbook.

Environment Setup

System Requirements

  • Tested on macOS (Apple Silicon)
  • Python 3.12.4

Installation

  1. Install required packages:
pip install -r requirements.txt
  2. (Optional) For local embedding/LLM:
  • Install Ollama (Mac users can use Homebrew):
brew install ollama
ollama pull mxbai-embed-large
ollama pull llama3.2
  3. API Setup:
  • Copy .env.example to .env
  • Add your API keys for:
    • OpenAI (embeddings)
    • Choice of LLM providers:
      • OpenAI
      • Anthropic
      • Groq
      • OpenRouter
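
A quick way to confirm the .env file is filled in before running anything is to parse it and check for the expected keys. The sketch below is only illustrative: the variable names (OPENAI_API_KEY, ANTHROPIC_API_KEY, GROQ_API_KEY, OPENROUTER_API_KEY) are assumptions, so check .env.example for the names this project actually uses. The helper functions are hypothetical and not part of the repository.

```python
# Hypothetical variable names; .env.example is the authority on the real ones.
EMBEDDING_KEY = "OPENAI_API_KEY"
LLM_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GROQ_API_KEY", "OPENROUTER_API_KEY"]


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_keys(values):
    """Return a list of problems; an empty list means the setup looks usable."""
    problems = []
    if not values.get(EMBEDDING_KEY):
        problems.append(f"{EMBEDDING_KEY} is missing (needed for OpenAI embeddings)")
    if not any(values.get(key) for key in LLM_KEYS):
        problems.append("no LLM provider key found")
    return problems
```

For example, check_keys(parse_env_file(open(".env").read())) returns an empty list when an OpenAI key is present. If you use the local Ollama models instead, the embedding warning can be ignored.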

Note on LLM Providers:

  • Groq: Known for its very fast inference speeds on open-weight models such as Llama and Mixtral. Visit Groq to get an API key and see available models.
  • OpenRouter: Acts as a middleware service that provides access to various LLMs (including GPT-4, Claude, Mixtral, etc.) through a unified API. Check OpenRouter for available models and pricing.

Project Structure

.
├── data/           # PDF documents
├── db/             # ChromaDB vector database
├── models.py       # Model configurations
├── ingest.py       # Document processing script
├── chat.py         # Chat interface
└── test_qa.md     # Test questions and answers

Usage

  1. Start document processing:
python ingest.py
  • Monitors ./data folder for new PDFs
  • Processed files are marked with "_" prefix
  • To update the knowledge base, drag and drop new PDFs into ./data while the script is running
  2. Start chat interface:
python chat.py
  • Choose between terminal or Gradio web interface
  • Terminal: Type 'q' to quit
  • Web UI: Use Ctrl+C to stop server
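
The folder-monitoring behaviour described in step 1 could be sketched roughly as follows. This is not the code in ingest.py; it is a minimal illustration of the "_" prefix convention using only the standard library. The function names and the ingest callback are hypothetical.

```python
from pathlib import Path


def pending_pdfs(data_dir):
    """Return PDFs in data_dir that do not yet carry the "_" prefix."""
    return sorted(
        p for p in Path(data_dir).glob("*.pdf") if not p.name.startswith("_")
    )


def mark_processed(pdf_path):
    """Rename report.pdf to _report.pdf and return the new path."""
    pdf_path = Path(pdf_path)
    target = pdf_path.with_name("_" + pdf_path.name)
    pdf_path.rename(target)
    return target


def process_new_files(data_dir, ingest):
    """Run ingest(path) on each pending PDF, then mark it as processed."""
    done = []
    for pdf in pending_pdfs(data_dir):
        ingest(pdf)  # hypothetical callback: load, chunk, embed, store
        done.append(mark_processed(pdf))
    return done
```

A monitoring loop would then call process_new_files("./data", ingest) every few seconds. Because processed files are renamed, restarting the script does not ingest them again.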

Note on Gradio: Gradio is an open-source Python library that makes it easy to create customizable web interfaces for ML models.

  • By default, the UI is accessible only locally
  • To create a public link, set share=True in launch():
    demo.launch(show_api=False, share=True)
  • Gradio will generate a temporary public URL (valid for 72 hours)
  • This allows others to access your chatbot through the internet
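
As a rough illustration of the share setting above, the launch arguments can be kept in one place so that switching between local-only and public access is a single flag. This is a sketch, not the code in chat.py: launch_kwargs, chat_fn and build_and_launch are hypothetical names, and chat_fn is a placeholder for the real QA chain.

```python
def launch_kwargs(public=False):
    """Build arguments for launch(); share=True requests a temporary public URL."""
    return {"show_api": False, "share": public}


def chat_fn(message, history):
    """Placeholder answer function; a real one would query the QA chain."""
    return f"You asked: {message}"


def build_and_launch(public=False):
    # Imported lazily so the helpers above work without Gradio installed.
    import gradio as gr

    demo = gr.ChatInterface(fn=chat_fn)
    demo.launch(**launch_kwargs(public))
```

Calling build_and_launch() keeps the UI local, and build_and_launch(public=True) asks Gradio for a public link.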

Key Parameters

Document Processing

chunk_size = 1000   # Characters per chunk
chunk_overlap = 200 # Overlap between chunks

Adjust based on your documents for optimal QA performance.
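
To see how the two parameters interact, here is a simplified sliding-window splitter. It is only an illustration: the project most likely uses a LangChain text splitter, which also tries to break on paragraph and sentence boundaries, whereas this sketch cuts purely by character count. The function name split_text is hypothetical.

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into chunks of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With the defaults, a 2,500-character document yields three chunks starting at offsets 0, 800 and 1600. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.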

Retrieval Strategy

In chat.py:

  • Change USE_HYBRID to choose between simple vector retrieval or hybrid (vector + BM25) retrieval
  • Adjustable parameters:
    • k value (default: 10) - number of retrieved documents
    • Vector/BM25 weights in hybrid retrieval (default: 0.8/0.2)
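
One common way to combine a vector ranking and a BM25 ranking with weights is weighted Reciprocal Rank Fusion, which is what LangChain's EnsembleRetriever uses. Whether chat.py relies on that class is an assumption. The sketch below shows the idea in plain Python: weighted_rrf is a hypothetical name, and the constant c=60 is the conventional RRF default rather than a value taken from this project.

```python
def weighted_rrf(rankings, weights, k=10, c=60):
    """Fuse ranked lists of document IDs and return the top k IDs.

    Each document scores weight / (rank + c) for every list it appears in,
    so documents ranked highly by the heavier-weighted retriever come first.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    ordered = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ordered[:k]


vector_hits = ["a", "b", "c"]  # best match first, from the vector store
bm25_hits = ["c", "a", "d"]    # best match first, from BM25
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.8, 0.2])
```

Raising the BM25 weight favours exact keyword matches (useful for statute numbers or program names), while raising the vector weight favours semantic similarity.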

Evaluation

See test_qa.md for three types of test questions designed by Claude 3.5 Sonnet:

  1. Factual questions (single-source)
  2. Synthesis questions (multi-source)
  3. Analytical questions (policy recommendations)

Used Documents

| Document Title | Year | Page Count |
|---|---|---|
| A 'Perfect storm' of problems pulls D.C. toward full-blown housing crisis (Washington Post) | 2024 | 8 |
| Single-Family Zoning in the District of Columbia | 2020 | 24 |
| Affordable Housing Policies in the Washington D.C. Metropolitan Area (LMU Honors Thesis) | 2023 | 33 |
| DC Housing Survey Report: A Supplement to the Assessment of the Need for Large Units | 2019 | 19 |
| An Assessment of the Need for Large Units in the District of Columbia | 2019 | 67 |
| Families sue D.C. for ending housing aid in unprecedented case | 2024 | 5 |
| Housing Equity Report: Creating Goals for Areas of Our City | 2019 | 20 |
| Housing Insecurity in the District of Columbia | 2023 | 101 |
