The Idea
This project is an autonomous Minecraft survival agent. The core question was whether a small set of cooperating services (one for sensing, one for acting, and one for reasoning) could form a perception-decision-action loop that actually keeps a character alive in a vanilla Minecraft world.
This is an early-stage prototype. The scaffolding works end to end, the LLM decision loop is live, and the bot performs basic survival actions. The active work is expanding the agent's capabilities and making its reasoning smarter.
Architecture
Three processes run in parallel. Each one has a narrow responsibility and a small HTTP surface, which makes them easy to restart independently when something goes wrong.
| Service | File | Port | Role |
|---|---|---|---|
| Bot control | mc-agent/bot.js | 3000 | Mineflayer bot that controls the in-game character and exposes an action API |
| Vision | vision.py | 5000 | OpenCV + MSS screen capture of the Minecraft window, exposed as a Flask API |
| Decision | decision.py | n/a | Polls the bot and vision services every 5 seconds, sends the combined state to Groq, and posts the chosen action back to the bot |
The Decision Loop
- decision.py hits the Mineflayer HTTP API for in-world state: position, health, hunger, nearby entities, inventory.
- It hits the vision service for a screen-capture read: what is actually visible in the Minecraft window right now.
- It builds a prompt that packages both views of the world plus the set of actions the agent is allowed to take.
- It calls the Groq LLM (free tier, fast inference) and parses the model's chosen action.
- It POSTs the action back to bot.js, which dispatches to the matching Mineflayer handler.
- Five seconds later, the loop runs again.
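The steps above can be sketched as a single cycle function. This is a minimal illustration, not the actual contents of decision.py: the HTTP calls and the Groq request are represented by injected callables (`get_bot_state`, `get_vision_state`, `ask_llm`, `send_action`), all of which are hypothetical names, and the prompt wording is an assumption.

```python
import json
import time
from typing import Any, Callable, Dict, List

# Action names taken from the "Actions Implemented" table.
ALLOWED_ACTIONS: List[str] = [
    "collect_wood", "explore", "seek_shelter",
    "find_food", "eat", "flee", "idle",
]


def build_prompt(bot_state: Dict[str, Any],
                 vision_state: Dict[str, Any],
                 actions: List[str]) -> str:
    """Package both views of the world plus the allowed actions."""
    return (
        "You control a Minecraft survival agent.\n"
        f"Bot state: {json.dumps(bot_state, sort_keys=True)}\n"
        f"Screen read: {json.dumps(vision_state, sort_keys=True)}\n"
        f"Reply with exactly one of: {', '.join(actions)}"
    )


def run_cycle(get_bot_state: Callable[[], Dict[str, Any]],
              get_vision_state: Callable[[], Dict[str, Any]],
              ask_llm: Callable[[str], str],
              send_action: Callable[[str], None]) -> str:
    """One pass: sense, build the prompt, ask the model, act."""
    bot_state = get_bot_state()        # stands in for the bot's HTTP API
    vision_state = get_vision_state()  # stands in for the vision service
    prompt = build_prompt(bot_state, vision_state, ALLOWED_ACTIONS)
    action = ask_llm(prompt).strip()   # stands in for the Groq call
    send_action(action)                # stands in for the POST to bot.js
    return action


def run_forever(get_bot_state, get_vision_state, ask_llm, send_action,
                interval_s: float = 5.0) -> None:
    """Repeat the cycle on a fixed interval (5 seconds in this project)."""
    while True:
        run_cycle(get_bot_state, get_vision_state, ask_llm, send_action)
        time.sleep(interval_s)
```

Passing the I/O in as callables keeps the cycle testable without a running server; in the real loop each callable would wrap the corresponding HTTP request.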
Actions Implemented
Two reflexes run independently of the LLM: the agent auto-eats when hunger drops below 14/20, and it auto-walks to item drops after mining.
| Action | Description |
|---|---|
| collect_wood | Find and mine nearby logs |
| explore | Move to a random nearby location |
| seek_shelter | Find cover, or dig an emergency hole |
| find_food | Hunt passive mobs or harvest crops |
| eat | Eat food from inventory |
| flee | Move away from a detected threat |
| idle | Wait briefly before the next decision |
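Because the model replies in free text, the reply has to be mapped onto one of the actions in the table before it reaches the bot. The sketch below shows one way to do that, assuming the model is asked to answer with an action name; `parse_action` is a hypothetical helper, and falling back to `idle` on an unrecognized reply is an assumption rather than documented behavior.

```python
import re

# Action names taken from the table above.
ALLOWED_ACTIONS = (
    "collect_wood", "explore", "seek_shelter",
    "find_food", "eat", "flee", "idle",
)


def parse_action(reply: str, default: str = "idle") -> str:
    """Return the first allowed action named in the reply, else the default."""
    # Split the reply into lowercase word-like tokens; underscores stay
    # inside a token so "collect_wood" is matched as a single name.
    for token in re.findall(r"[a-z_]+", reply.lower()):
        if token in ALLOWED_ACTIONS:
            return token
    return default
```

For example, `parse_action("Action: collect_wood")` yields `collect_wood`, while a reply that names no known action yields `idle`, so the bot never receives a command it has no handler for.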
Key Decisions
- Three separate services instead of one monolithic process. Decoupling bot control (Node + Mineflayer) from vision (Python + OpenCV) from reasoning (Python + Groq) means each piece can be restarted without bringing down the others.
- Groq over OpenAI for the LLM. Groq's free tier offers inference that is comfortably fast enough for a 5-second decision cycle.
- Screen capture instead of parsing game packets. Reading the Minecraft window through MSS is much simpler than parsing protocol traffic, at the cost of requiring the window to be visible on the primary monitor.
- Vanilla 1.21.1 server with online-mode=false. Lets the bot join the local world without a premium Minecraft account, which makes iteration much faster.
- Mineflayer for bot control. Mature Node.js library with full protocol support, good pathfinding primitives, and enough hooks to bolt on custom behaviors as the action set grows.
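Independent restarts only help if the decision loop tolerates a sibling service being briefly unavailable. The following is a sketch of one way to handle that, not a description of what decision.py currently does: `fetch_json` and `fetch_or_none` are hypothetical helpers, and the 2-second timeout is an assumed value.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Optional


def fetch_json(url: str, timeout_s: float = 2.0) -> Dict[str, Any]:
    """GET a URL and decode the body as JSON (raises on failure)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_or_none(fetch: Callable[[], Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Run a fetch and return None if the service is down or replies badly.

    Connection errors, timeouts, and urllib's URLError are all OSError
    subclasses; malformed JSON raises a ValueError subclass.
    """
    try:
        return fetch()
    except (OSError, ValueError):
        return None
```

A cycle could then skip the LLM call, or send `idle`, whenever either fetch returns `None`, and pick up normally once the restarted service is back.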
Current Limitations
- Small action vocabulary. The LLM can only pick from the handful of actions currently implemented.
- Vision is hardcoded to 1920x1080 for the primary monitor. It needs a per-machine config before it runs cleanly elsewhere.
- No long-term memory. Each decision is made from the current state only; the agent does not remember what it was trying to accomplish a minute ago.
- No threat model beyond "flee". Hostile mob detection is basic and there is no combat strategy.
- Single agent. No multi-agent coordination or goal decomposition.
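The hardcoded capture size is the most mechanical of these limitations to lift. As a rough sketch, the capture region could be read from environment variables with the current 1920x1080 values as defaults. The variable names (`VISION_LEFT`, `VISION_TOP`, `VISION_WIDTH`, `VISION_HEIGHT`) and the `CaptureConfig` class are hypothetical and not part of vision.py.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class CaptureConfig:
    """Screen region to capture; defaults match the current hardcoded size."""
    left: int = 0
    top: int = 0
    width: int = 1920
    height: int = 1080

    def region(self) -> Dict[str, int]:
        # A dict with these four keys is the region format that
        # mss's grab() accepts.
        return {
            "left": self.left,
            "top": self.top,
            "width": self.width,
            "height": self.height,
        }


def load_capture_config(env: Mapping[str, str] = os.environ) -> CaptureConfig:
    """Build a config from environment variables (names are assumptions)."""
    defaults = CaptureConfig()
    return CaptureConfig(
        left=int(env.get("VISION_LEFT", defaults.left)),
        top=int(env.get("VISION_TOP", defaults.top)),
        width=int(env.get("VISION_WIDTH", defaults.width)),
        height=int(env.get("VISION_HEIGHT", defaults.height)),
    )
```

With MSS installed, the result would be used as `sct.grab(load_capture_config().region())` inside a `with mss.mss() as sct:` block. A non-zero `left` or `top` offset is how a secondary monitor would be addressed.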
Next Work
- Expand the action set: crafting, building, farming, combat with basic tactics.
- Give the LLM a scratchpad for short-term plans and multi-step goals (build shelter before night, stockpile food before exploring).
- Better vision signals: detect specific biomes, hostile mobs, resource locations from the screen capture instead of only passing raw descriptors.
- Make the vision resolution and monitor selection configurable.
- Structured logging of decisions and outcomes to evaluate which actions the LLM chooses well and which it does not.
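The scratchpad item could start as something very small: a current goal plus a bounded list of recent actions and their outcomes, rendered as text and appended to each prompt. The sketch below is one possible shape, not a committed design; the `Scratchpad` class, its method names, and the five-entry default are all assumptions.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class Scratchpad:
    """Short-term memory: one goal plus the last few (action, outcome) pairs."""

    def __init__(self, max_entries: int = 5) -> None:
        self.goal: Optional[str] = None
        # deque(maxlen=...) silently drops the oldest entry when full,
        # which keeps the prompt size bounded.
        self._recent: Deque[Tuple[str, str]] = deque(maxlen=max_entries)

    def record(self, action: str, outcome: str) -> None:
        """Remember what was tried and what happened."""
        self._recent.append((action, outcome))

    def render(self) -> str:
        """Format the scratchpad as text to include in the next prompt."""
        lines = [f"Current goal: {self.goal or 'none'}"]
        for action, outcome in self._recent:
            lines.append(f"- {action}: {outcome}")
        return "\n".join(lines)
```

Usage would look like setting `pad.goal = "build shelter before night"`, calling `pad.record("collect_wood", "got 4 logs")` after each cycle, and adding `pad.render()` to the prompt so the model can see what it was working toward.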
Status
Early-stage working prototype. The three-service architecture is stable, the decision loop runs end to end against a vanilla 1.21.1 server, and the agent performs basic survival actions autonomously. Ongoing work is focused on expanding the action vocabulary and giving the LLM richer context to reason over.