
Add vision capability for bots #413

Open · wants to merge 18 commits into base: main
Conversation

gmuffiness

I added support for image input using GPT-4V and GPT-4o, enabling effective image interpretation.
This is an initial implementation, so I would greatly appreciate any feedback or suggestions for improvement. Thank you!

Changelog

  • Two actions added, leveraging mineflayer’s screenshot functionality (as @MaxRobinsonTheGreat suggested in this issue)
    • lookAtPlayer: Allows the bot to focus on the player’s direction or viewpoint for better understanding
    • lookAtPosition: Enables the bot to focus on specific coordinates for targeted image interpretation
  • Added a promptImageConvo method in src/agent/prompter.js.
  • Included examples to demonstrate these new features.
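For context, the core of a lookAtPosition-style action is aiming the bot’s viewpoint at a target coordinate before the frame is captured. A minimal sketch of the yaw/pitch math such an action needs — lookAngles is a hypothetical helper for illustration, not the PR’s actual API, and the yaw convention shown (0 facing +z) is one common Minecraft-style convention:

```javascript
// Hypothetical helper: compute the yaw/pitch needed to face a target
// coordinate from the bot's position (e.g. before taking a screenshot).
function lookAngles(from, to) {
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  const dz = to.z - from.z;
  const horiz = Math.sqrt(dx * dx + dz * dz);
  // Minecraft-style yaw: 0 faces +z, increasing counter-clockwise
  const yaw = Math.atan2(-dx, dz);
  // Pitch: positive looks up, negative looks down
  const pitch = Math.atan2(dy, horiz);
  return { yaw, pitch };
}

// Example: a target directly above the bot pitches straight up (PI/2)
const a = lookAngles({ x: 0, y: 64, z: 0 }, { x: 0, y: 80, z: 0 });
```

In the PR itself, mineflayer’s bot.lookAt handles this aiming internally; the sketch only shows the geometry involved.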

Known Limitations

  • Node.js Compatibility: Using a relatively new Node.js version (in my case, v23.2.0) caused installation errors with the node-canvas-webgl and three packages. Switching to the LTS version (18.20.5) resolved these issues; run nvm use 18 for compatibility.
  • Minecraft Version Support: Works reliably with Minecraft versions up to 1.20.1, as specified in the Prismarine Viewer README. Rendering and execution issues may occur with versions beyond 1.20.1.
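A minimal setup fragment for the Node.js constraint above (assumes nvm is already installed):

```shell
# Pin the LTS line that node-canvas-webgl and three build against
nvm install 18.20.5
nvm use 18.20.5
node --version   # should print v18.20.5
npm install
```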

@gmuffiness
Author

gmuffiness commented Jan 19, 2025

I also made a demo video about this feature.
This video was inspired by your work, Max! I hope you enjoy it :)

https://www.youtube.com/watch?v=gPyFrBs45Es

@uukelele-scratch
Contributor

uukelele-scratch commented Jan 19, 2025

why change default port to 56069?

and why comment out init message?

@gmuffiness
Author

Oh, I hadn’t noticed that settings.js was changed. Thanks for pointing it out!
No particular reason, haha.

@gmuffiness
Author

Currently, the lookAtPlayer and lookAtPosition functions in skills.js handle both 1) taking screenshots and 2) sending requests to the vision model. However, the other functions in skills.js seem to focus solely on controlling Mineflayer’s actions.

This makes me wonder if it might be better to separate these responsibilities by creating a new class, such as VisionInterpreter, to handle the vision-related functionality and use it in agent.js.

I’ll think more about whether this approach would be better. I’d appreciate any feedback or thoughts!

@Lorodn4x

Can I test this with models from the company Mistral "pixtral-large-latest"?

@gmuffiness
Author

Changelog

  • Extracted vision-related functionalities from skills.js into a new VisionInterpreter class to improve code organization.
  • Added an allow_vision option in settings.js to toggle between vision model-based responses and !nearbyBlocks-based logic.
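A rough sketch of how such a split might look. The class shape, constructor arguments, and method names here are illustrative assumptions, not the PR’s exact code — the real class wraps the screenshot step and the provider request, with allow_vision gating the fallback:

```javascript
// Hypothetical sketch of the VisionInterpreter extraction described above.
class VisionInterpreter {
  constructor({ allowVision, camera, model }) {
    this.allowVision = allowVision; // mirrors the allow_vision setting
    this.camera = camera;           // takes screenshots (e.g. via prismarine-viewer)
    this.model = model;             // provider client exposing a vision endpoint
  }

  async lookAtPosition(pos, prompt) {
    if (!this.allowVision) {
      // Vision disabled: fall back to text-only logic (e.g. !nearbyBlocks)
      return { mode: 'text', prompt };
    }
    const image = await this.camera.capture(pos);
    const response = await this.model.sendVisionRequest(prompt, image);
    return { mode: 'vision', response };
  }
}
```

This keeps skills.js focused on controlling mineflayer actions while agent.js owns the vision pipeline.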

@gmuffiness
Author

gmuffiness commented Jan 24, 2025

Can I test this with models from the company Mistral "pixtral-large-latest"?

@Lorodn4x

Changelog

  • Removed promptImageConvo and added a sendVisionRequest method for each provider (currently supports OpenAI and Mistral).

Initially, I only considered OpenAI models, but I’ve updated the implementation to support Mistral vision requests as well, as their format is slightly different. (https://docs.mistral.ai/capabilities/vision/)
Now, vision interactions are possible with the Mistral model too!
(As a side note, the pixtral-12b model seems to be slightly less effective at action selection compared to gpt-4o.)

[screenshot]
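For reference, the provider difference mentioned above is mainly in how the image is attached to the user message: OpenAI wraps the URL in an object, while Mistral (per the linked docs) takes the data URL string directly. A sketch of the two payload shapes — buildVisionMessage is a hypothetical helper, not the PR’s actual function:

```javascript
// Sketch of the per-provider message shapes a sendVisionRequest might build.
function buildVisionMessage(provider, text, base64Jpeg) {
  const dataUrl = `data:image/jpeg;base64,${base64Jpeg}`;
  if (provider === 'openai') {
    // OpenAI: image_url is an object with a url field
    return {
      role: 'user',
      content: [
        { type: 'text', text },
        { type: 'image_url', image_url: { url: dataUrl } },
      ],
    };
  }
  // Mistral (https://docs.mistral.ai/capabilities/vision/):
  // image_url is the data URL string itself
  return {
    role: 'user',
    content: [
      { type: 'text', text },
      { type: 'image_url', image_url: dataUrl },
    ],
  };
}
```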

@Vineethm0410
Contributor

You should consider adding the vision models from Groq as well. That way free users can also test out the vision capability.

@uukelele-scratch
Contributor

You should consider adding the vision models from Groq as well. That way free users can also test out the vision capability.

Also Gemini.

At this point it might be better to add a "vision_model" in profile.json

@gmuffiness
Author

@Vineethm0410 @uukelele-scratch
Two providers added!

[screenshots]

@MaxRobinsonTheGreat
Collaborator

this looks very promising. is it near completion?

@gmuffiness gmuffiness marked this pull request as ready for review February 7, 2025 09:26
@gmuffiness
Author

Yes, I think the basics are all done.
Let me know if anything needs to be modified, and I’ll update it right away.

Note: If npm install doesn’t work, try installing Node.js version 18.20.5, then run nvm use 18 before installing.

@gmuffiness
Author

I noticed you added code_model in a recent PR. To align with this, I’ll update the profile to set the vision model there, as @uukelele-scratch suggested.

@gmuffiness
Author

Changelog

  • Merged the latest changes and added a vision_model profile option for image interpretation. If it is not specified, model is used.
  • Fall back to a text-based description when vision features are requested with a non-vision model (e.g. gpt-3.5-turbo) or an unimplemented provider (e.g. DeepSeek).
  • Updated the README.
  • Currently supports the following providers: Google, OpenAI, Anthropic, Mistral, Groq, XAI.
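The fallback in the first bullet can be sketched in one line (the profile field names follow the existing model / code_model convention; resolveVisionModel is an illustrative name, not the PR’s actual function):

```javascript
// Use the profile's vision_model when present; otherwise reuse the main model.
function resolveVisionModel(profile) {
  return profile.vision_model ?? profile.model;
}

// e.g. { model: 'gpt-4o-mini', vision_model: 'gpt-4o' } resolves to 'gpt-4o'
```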

@MaxRobinsonTheGreat
Collaborator

@gmuffiness Is this ready for review?

@gmuffiness
Author

@MaxRobinsonTheGreat Yes!

@MaxRobinsonTheGreat
Collaborator

I haven’t tested yet; I just have a small request, and this needs to be merged with main.

@@ -1351,3 +1353,77 @@ export async function activateNearestBlock(bot, type) {
log(bot, `Activated ${type} at x:${block.position.x.toFixed(1)}, y:${block.position.y.toFixed(1)}, z:${block.position.z.toFixed(1)}.`);
return true;
}

// export async function lookAtPlayer(agent, bot, player_name, direction) {
Collaborator


remove these comments

@gmuffiness
Copy link
Author

I’ve removed the comments!

Also, I made a small change: sweaterdog ran into an issue during installation, so I updated how the node-canvas-webgl package is imported, following the approach used by the prismarine-viewer library.
If you encounter any errors during installation or elsewhere, please let me know!
