
[Bug]: MultimodalConversableAgent fails when LLM config points to local Ollama/LLaVA proxied by LiteLLM #2528

Open
DarinShapiroMS opened this issue Apr 26, 2024 · 1 comment
Labels
0.2 (Issues which are related to the pre 0.4 codebase), needs-triage

Comments

@DarinShapiroMS

Describe the bug

I'm working from the understanding that LLaVA hosted behind an OpenAI-compatible proxy like LiteLLM and GPT-4V hosted in Azure or OpenAI are both valid options for the MultimodalConversableAgent. My agent workflow works correctly when I point the vision agent at GPT-4V, but I get errors if I switch the llm_config to the locally hosted LLaVA config.

When I switch to LLaVA (hosted via LiteLLM with 'litellm --model ollama_chat/llava --run_gunicorn'), I get:

Traceback (most recent call last):
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/proxy/proxy_server.py", line 3671, in chat_completion
    responses = await asyncio.gather(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3465, in wrapper_async
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3297, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/main.py", line 340, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8665, in exception_type
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8633, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: {"error":"json: cannot unmarshal array into Go struct field Message.messages.content of type string"}
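
The error text looks like it comes from Ollama's Go server: it tried to unmarshal messages[].content as a string but received an array. If I understand the two APIs correctly, the OpenAI vision format sends content as an array of typed parts, while Ollama's native /api/chat wants content as a plain string with the images in a separate field. A sketch of the two shapes as I understand them (base64 payloads truncated for illustration):

# OpenAI vision format (what the agent appears to send):
# content is an ARRAY of typed parts.
openai_style_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "These are the frames of a video..."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}},
    ],
}

# Ollama's native chat format: content is a plain STRING, and images go in a
# separate list as raw base64 (no "data:image/jpeg;base64," prefix).
ollama_style_message = {
    "role": "user",
    "content": "These are the frames of a video...",
    "images": ["/9j/4AAQ..."],
}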

If I start the Ollama model without '_chat', i.e. 'litellm --model ollama/llava --run_gunicorn', I get:

Traceback (most recent call last):
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/proxy/proxy_server.py", line 3671, in chat_completion
    responses = await asyncio.gather(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3465, in wrapper_async
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 3297, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/main.py", line 340, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8665, in exception_type
    raise e
  File "/Users/darinshapiro/Source/AutoGenDocPOC1/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8633, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: {"error":"illegal base64 data at input byte 4"}
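
For what it's worth, byte 4 of a data URI like "data:image/jpeg;base64,..." is the ':' character, which is outside the base64 alphabet. So my guess is that this route passes the full data URI through to a base64 decoder that expects only the bare payload. A hypothetical helper (illustration only, not a real fix in either library) would strip the prefix:

def strip_data_uri(url: str) -> str:
    # "data:"[4] == ':', which would trip a base64 decoder at input byte 4
    # if the whole URI were decoded instead of just the payload after "base64,".
    marker = "base64,"
    idx = url.find(marker)
    return url[idx + len(marker):] if idx != -1 else url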

One thing to note: I'm including a list of frames in the prompt:

  prompt = """
     context: camera location = "front yard", time = "10:00 AM", date = "March 15, 2022"
     These are the frames of a video. Generate a compelling description that the SecurityAnalysisAgent can evaluate.
    <img frames/frame0.jpg>
    <img frames/frame1.jpg>
    <img frames/frame2.jpg>
    <img frames/frame3.jpg>
    <img frames/frame4.jpg>
    <img frames/frame5.jpg>
    <img frames/frame6.jpg>
    <img frames/frame7.jpg>
    <img frames/frame8.jpg>
    <img frames/frame9.jpg>
    """

It seems that LiteLLM isn't handling the list of images correctly. Is inclusion of multiple frames considered part of the OpenAI spec, or is the MultimodalConversableAgent not constructing the message content correctly?
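
For reference, multiple images per message do appear to be valid in the OpenAI chat completions vision format; each image is just another content part. Something like the following works against GPT-4V for me (the model name and truncated data URIs below are illustrative):

from openai import OpenAI

client = OpenAI()  # point base_url at a proxy instead to test LiteLLM
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what changes across these frames."},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        ],
    }],
)
print(response.choices[0].message.content)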

Steps to reproduce

  1. Point an agent at GPT-4V with a series of frames from a video and ask for a description of the video. The agent returns a valid description.
  2. Change the llm_config of that agent to point to a locally hosted LLaVA vision model, using Ollama with LiteLLM as the proxy (configs sketched below). Errors are returned.
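
For completeness, the two llm_configs differ only in where they point. Roughly (the proxy port and keys are placeholders for my local setup):

from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Works: GPT-4V via OpenAI.
gpt4v_config = {"config_list": [{"model": "gpt-4-vision-preview", "api_key": "sk-..."}]}

# Fails: LLaVA behind the LiteLLM proxy in front of Ollama.
llava_config = {
    "config_list": [{
        "model": "ollama_chat/llava",         # matches the litellm --model flag above
        "base_url": "http://localhost:4000",  # adjust to whatever port the proxy uses
        "api_key": "not-needed",              # local proxy; key is a placeholder
    }],
}

agent = MultimodalConversableAgent("vision_agent", llm_config=llava_config)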

Model Used

GPT-4V & LLaVA 1.6

Expected Behavior

I was expecting to be able to treat the GPT-4V and LLaVA llm_configs as interchangeable, differing only in response quality, performance, and cost.

Screenshots and logs

No response

Additional Information

Latest AutoGen version, both macOS and Windows, Python 3.12.

@ekzhu
Collaborator

ekzhu commented Apr 27, 2024

cc @BeibinLi

@rysweet added the 0.2 and needs-triage labels Oct 2, 2024
@fniedtner removed the bug label Oct 24, 2024