
Add vision capability for bots #413

Open · wants to merge 18 commits into base: main
Conversation

gmuffiness

I added support for image input using GPT-4V and GPT-4o, enabling effective image interpretation.
This is an initial implementation, so I would greatly appreciate any feedback or suggestions for improvement. Thank you!

Changelog

  • Two actions added, leveraging mineflayer’s screenshot functionality (as @MaxRobinsonTheGreat suggested in this issue)
    • lookAtPlayer: Allows the bot to focus on the player’s direction or viewpoint for better understanding
    • lookAtPosition: Enables the bot to focus on specific coordinates for targeted image interpretation
  • Added a promptImageConvo method in src/agent/prompter.js.
  • Included examples to demonstrate these new features.
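For context, the core of a lookAtPosition-style action is aiming the bot’s viewpoint at a target coordinate before the frame is captured. A minimal sketch of the yaw/pitch math such an action needs — lookAngles is a hypothetical helper for illustration, not the PR’s actual API, and the yaw convention shown (0 facing +z) is one common Minecraft-style convention:

```javascript
// Hypothetical helper: compute the yaw/pitch needed to face a target
// coordinate from the bot's position (e.g. before taking a screenshot).
function lookAngles(from, to) {
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  const dz = to.z - from.z;
  const horiz = Math.sqrt(dx * dx + dz * dz);
  // Minecraft-style yaw: 0 faces +z, increasing counter-clockwise
  const yaw = Math.atan2(-dx, dz);
  // Pitch: positive looks up, negative looks down
  const pitch = Math.atan2(dy, horiz);
  return { yaw, pitch };
}

// Example: a target directly above the bot pitches straight up (PI/2)
const a = lookAngles({ x: 0, y: 64, z: 0 }, { x: 0, y: 80, z: 0 });
```

In the PR itself, mineflayer’s bot.lookAt handles this aiming internally; the sketch only shows the geometry involved.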

Known Limitations

  • Node.js Compatibility: Using a relatively new Node.js version (in my case, v23.2.0) caused installation errors with the node-canvas-webgl and three packages. Switching to the LTS version (18.20.5) resolved these issues; run nvm use 18 for compatibility.
  • Minecraft Version Support: Works reliably with Minecraft versions up to 1.20.1, as specified in the Prismarine Viewer README. Rendering and execution issues may occur with versions beyond 1.20.1.
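A minimal setup fragment for the Node.js constraint above (assumes nvm is already installed):

```shell
# Pin the LTS line that node-canvas-webgl and three build against
nvm install 18.20.5
nvm use 18.20.5
node --version   # should print v18.20.5
npm install
```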

@gmuffiness
Author

gmuffiness commented Jan 19, 2025

I also made a demo video about this feature.
This video was inspired by your work, Max! I hope you enjoy it :)

https://www.youtube.com/watch?v=gPyFrBs45Es

@uukelele-scratch
Contributor

uukelele-scratch commented Jan 19, 2025

why change default port to 56069?

and why comment out init message?

@gmuffiness
Author

Oh, I hadn’t noticed that settings.js was changed. Thanks for pointing it out!
No particular reason, haha.

@gmuffiness
Author

Currently, the lookAtPlayer and lookAtPosition functions in skills.js handle both 1) taking screenshots and 2) sending requests to the vision model. However, the other functions in skills.js seem to focus solely on controlling Mineflayer’s actions.

This makes me wonder if it might be better to separate these responsibilities by creating a new class, such as VisionInterpreter, to handle the vision-related functionality and use it in agent.js.

I’ll think more about whether this approach would be better. I’d appreciate any feedback or thoughts!

@Lorodn4x

Can I test this with models from the company Mistral "pixtral-large-latest"?

@gmuffiness
Author

Changelog

  • Extracted vision-related functionalities from skills.js into a new VisionInterpreter class to improve code organization.
  • Added an allow_vision option in settings.js to toggle between vision model-based responses and !nearbyBlocks-based logic.
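A rough sketch of how such a split might look. The class shape, constructor arguments, and method names here are illustrative assumptions, not the PR’s exact code — the real class wraps the screenshot step and the provider request, with allow_vision gating the fallback:

```javascript
// Hypothetical sketch of the VisionInterpreter extraction described above.
class VisionInterpreter {
  constructor({ allowVision, camera, model }) {
    this.allowVision = allowVision; // mirrors the allow_vision setting
    this.camera = camera;           // takes screenshots (e.g. via prismarine-viewer)
    this.model = model;             // provider client exposing a vision endpoint
  }

  async lookAtPosition(pos, prompt) {
    if (!this.allowVision) {
      // Vision disabled: fall back to text-only logic (e.g. !nearbyBlocks)
      return { mode: 'text', prompt };
    }
    const image = await this.camera.capture(pos);
    const response = await this.model.sendVisionRequest(prompt, image);
    return { mode: 'vision', response };
  }
}
```

This keeps skills.js focused on controlling mineflayer actions while agent.js owns the vision pipeline.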

@gmuffiness
Author

gmuffiness commented Jan 24, 2025

Can I test this with models from the company Mistral "pixtral-large-latest"?

@Lorodn4x

Changelog

  • Removed promptImageConvo and added a sendVisionRequest method for each provider (currently supports OpenAI and Mistral).

Initially, I only considered OpenAI models, but I’ve updated the implementation to support Mistral vision requests as well, as their format is slightly different. (https://docs.mistral.ai/capabilities/vision/)
Now, vision interactions are possible with the Mistral model too!
(As a side note, the pixtral-12b model seems to be slightly less effective at action selection compared to gpt-4o.)

[screenshot]
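For reference, the provider difference mentioned above is mainly in how the image is attached to the user message: OpenAI wraps the URL in an object, while Mistral (per the linked docs) takes the data URL string directly. A sketch of the two payload shapes — buildVisionMessage is a hypothetical helper, not the PR’s actual function:

```javascript
// Sketch of the per-provider message shapes a sendVisionRequest might build.
function buildVisionMessage(provider, text, base64Jpeg) {
  const dataUrl = `data:image/jpeg;base64,${base64Jpeg}`;
  if (provider === 'openai') {
    // OpenAI: image_url is an object with a url field
    return {
      role: 'user',
      content: [
        { type: 'text', text },
        { type: 'image_url', image_url: { url: dataUrl } },
      ],
    };
  }
  // Mistral (https://docs.mistral.ai/capabilities/vision/):
  // image_url is the data URL string itself
  return {
    role: 'user',
    content: [
      { type: 'text', text },
      { type: 'image_url', image_url: dataUrl },
    ],
  };
}
```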

@Vineethm0410
Contributor

You should consider adding the vision models from Groq as well. That way free users can also test out the vision capability.

@uukelele-scratch
Contributor

You should consider adding the vision models from Groq as well. That way free users can also test out the vision capability.

Also Gemini.

At this point it might be better to add a "vision_model" in profile.json

@gmuffiness
Author

@Vineethm0410 @uukelele-scratch
Two providers added!

[screenshots]

@MaxRobinsonTheGreat
Collaborator

this looks very promising. is it near completion?

@gmuffiness gmuffiness marked this pull request as ready for review February 7, 2025 09:26
@gmuffiness
Author

Yes, I think the basics are all done.
Let me know if anything needs to be modified, and I’ll update it right away.

Note: If npm install doesn’t work, try installing Node.js version 18.20.5, then run nvm use 18 before installing.

@gmuffiness
Author

I noticed you added code_model in a recent PR. To align with this, I’ll update the profile to set the vision model there, as @uukelele-scratch suggested.

@gmuffiness
Author

Changelog

  • Merged the latest changes and added a vision_model profile option for image interpretation. If it is not specified, model is used.
  • Fall back to a text-based description when vision features are requested with a non-vision model (e.g. gpt-3.5-turbo) or an unimplemented provider (e.g. DeepSeek).
  • Updated the README.
  • Currently supports the following providers: Google, OpenAI, Anthropic, Mistral, Groq, XAI.
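The fallback in the first bullet can be sketched in one line (the profile field names follow the existing model / code_model convention; resolveVisionModel is an illustrative name, not the PR’s actual function):

```javascript
// Use the profile's vision_model when present; otherwise reuse the main model.
function resolveVisionModel(profile) {
  return profile.vision_model ?? profile.model;
}

// e.g. { model: 'gpt-4o-mini', vision_model: 'gpt-4o' } resolves to 'gpt-4o'
```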

@MaxRobinsonTheGreat
Collaborator

@gmuffiness Is this ready for review?

@gmuffiness
Author

@MaxRobinsonTheGreat Yes!

@MaxRobinsonTheGreat
Collaborator

I haven’t tested yet; I just have a small request, and this needs to be merged with main.

@@ -1351,3 +1353,77 @@ export async function activateNearestBlock(bot, type) {
log(bot, `Activated ${type} at x:${block.position.x.toFixed(1)}, y:${block.position.y.toFixed(1)}, z:${block.position.z.toFixed(1)}.`);
return true;
}

// export async function lookAtPlayer(agent, bot, player_name, direction) {
Collaborator


remove these comments

@gmuffiness
Copy link
Author

I’ve removed the comments!

Also, I made a small change: sweaterdog ran into an issue during installation, so I updated how the node-canvas-webgl package is imported, following the approach used by the prismarine-viewer library.
If you encounter any errors during installation or elsewhere, please let me know!
