
[Bug]: MultimodalConversableAgent fails when LLM config points to local Ollama/LLaVA proxied by LiteLLM #2528

Open
DarinShapiroMS opened this issue Apr 26, 2024 · 1 comment
Labels
0.2 (Issues which are related to the pre 0.4 codebase), needs-triage

Comments

@DarinShapiroMS

Describe the bug

I'm working from the understanding that LLaVA hosted behind an OpenAI-compatible proxy like LiteLLM and GPT-4V hosted in Azure or OpenAI are both valid options for the MultimodalConversableAgent. My agent workflow works correctly when I point the vision agent at GPT-4V, but I get errors if I switch the llm_config to the locally hosted LLaVA config.

When I switch to LLaVA (hosted via LiteLLM with 'litellm --model ollama_chat/llava --run_gunicorn'), I get:

Traceback (most recent call last):
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/proxy/proxy_server.py", line 3671, in chat_completion
    responses = await asyncio.gather(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3465, in wrapper_async
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3297, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/main.py", line 340, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8665, in exception_type
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8633, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: {"error":"json: cannot unmarshal array into Go struct field Message.messages.content of type string"}
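
The error text looks like it comes from Ollama's Go server: it tried to unmarshal messages[].content as a string but received an array. If I understand the two APIs correctly, the OpenAI vision format sends content as an array of typed parts, while Ollama's native /api/chat wants content as a plain string with the images in a separate field. A sketch of the two shapes as I understand them (base64 payloads truncated for illustration):

# OpenAI vision format (what the agent appears to send):
# content is an ARRAY of typed parts.
openai_style_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "These are the frames of a video..."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}},
    ],
}

# Ollama's native chat format: content is a plain STRING, and images go in a
# separate list as raw base64 (no "data:image/jpeg;base64," prefix).
ollama_style_message = {
    "role": "user",
    "content": "These are the frames of a video...",
    "images": ["/9j/4AAQ..."],
}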

If I start the Ollama model without '_chat', i.e. 'litellm --model ollama/llava --run_gunicorn', I get:

Traceback (most recent call last):
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/proxy/proxy_server.py", line 3671, in chat_completion
    responses = await asyncio.gather(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3465, in wrapper_async
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3297, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/main.py", line 340, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8665, in exception_type
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8633, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: {"error":"illegal base64 data at input byte 4"}
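
For what it's worth, byte 4 of a data URI like "data:image/jpeg;base64,..." is the ':' character, which is outside the base64 alphabet. So my guess is that this route passes the full data URI through to a base64 decoder that expects only the bare payload. A hypothetical helper (illustration only, not a real fix in either library) would strip the prefix:

def strip_data_uri(url: str) -> str:
    # "data:"[4] == ':', which would trip a base64 decoder at input byte 4
    # if the whole URI were decoded instead of just the payload after "base64,".
    marker = "base64,"
    idx = url.find(marker)
    return url[idx + len(marker):] if idx != -1 else url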

One thing to note: I'm including a list of frames in the prompt:

  prompt = """
     context: camera location = "front yard", time = "10:00 AM", date = "March 15, 2022"
     These are the frames of a video. Generate a compelling description that the SecurityAnalysisAgent can evaluate.
    <img frames/frame0.jpg>
    <img frames/frame1.jpg>
    <img frames/frame2.jpg>
    <img frames/frame3.jpg>
    <img frames/frame4.jpg>
    <img frames/frame5.jpg>
    <img frames/frame6.jpg>
    <img frames/frame7.jpg>
    <img frames/frame8.jpg>
    <img frames/frame9.jpg>
    """

It seems that LiteLLM isn't handling the list of images correctly. Is inclusion of multiple frames considered part of the OpenAI spec, or is the MultimodalConversableAgent not constructing the message content correctly?
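
For reference, multiple images per message do appear to be valid in the OpenAI chat completions vision format; each image is just another content part. Something like the following works against GPT-4V for me (the model name and truncated data URIs below are illustrative):

from openai import OpenAI

client = OpenAI()  # point base_url at a proxy instead to test LiteLLM
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what changes across these frames."},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        ],
    }],
)
print(response.choices[0].message.content)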

Steps to reproduce

  1. Point an agent at GPT-4V with a series of frames from a video and ask for a description of the video. The agent returns a valid description.
  2. Change the llm_config of that agent to point to a locally hosted LLaVA vision model, using Ollama with LiteLLM as the proxy (configs sketched below). Errors are returned.
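
For completeness, the two llm_configs differ only in where they point. Roughly (the proxy port and keys are placeholders for my local setup):

from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Works: GPT-4V via OpenAI.
gpt4v_config = {"config_list": [{"model": "gpt-4-vision-preview", "api_key": "sk-..."}]}

# Fails: LLaVA behind the LiteLLM proxy in front of Ollama.
llava_config = {
    "config_list": [{
        "model": "ollama_chat/llava",         # matches the litellm --model flag above
        "base_url": "http://localhost:4000",  # adjust to whatever port the proxy uses
        "api_key": "not-needed",              # local proxy; key is a placeholder
    }],
}

agent = MultimodalConversableAgent("vision_agent", llm_config=llava_config)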

Model Used

GPT-4V & LLaVA 1.6

Expected Behavior

I was expecting to be able to treat the GPT-4V and LLaVA llm_configs as interchangeable, differing only in response quality, performance, and cost.

Screenshots and logs

No response

Additional Information

Latest AutoGen version, both macOS and Windows, Python 3.12.

@ekzhu
Collaborator

ekzhu commented Apr 27, 2024

cc @BeibinLi

@rysweet added the 0.2 and needs-triage labels Oct 2, 2024
@fniedtner removed the bug label Oct 24, 2024