AI-Powered Minecraft Agent

An autonomous Minecraft survival agent built as three cooperating services: a Mineflayer bot on Node.js, an OpenCV screen-capture vision service in Python, and a decision layer that polls both every 5 seconds and asks a Groq-hosted LLM what to do next. It is an early-stage working prototype; the roadmap focuses on smarter reasoning and a larger action set.

The Idea

The goal is a Minecraft survival agent that plays the game on its own. The core question was whether a small set of cooperating services (one for sensing, one for acting, one for reasoning) could form a perception-decision-action loop that actually keeps a character alive in a vanilla Minecraft world.

This is an early-stage prototype. The scaffolding works end to end, the LLM decision loop is live, and the bot performs basic survival actions. The active work is expanding the agent's capabilities and making its reasoning smarter.

Architecture

Three processes run in parallel. Each one has a narrow responsibility and a small HTTP surface, which makes them easy to restart independently when something goes wrong.

Service	File	Port	Role
Bot control	mc-agent/bot.js	3000	Mineflayer bot that controls the in-game character and exposes an action API
Vision	vision.py	5000	OpenCV + MSS screen capture of the Minecraft window, exposed as a Flask API
Decision	decision.py	n/a	Polls bot + vision every 5 seconds, sends the combined state to Groq, posts the action back to the bot

The Decision Loop

decision.py hits the Mineflayer HTTP API for in-world state: position, health, hunger, nearby entities, inventory.
It hits the vision service for a screen-capture read: what is actually visible on the Minecraft window right now.
It builds a prompt that packages both views of the world plus the set of actions the agent is allowed to take.
It calls the Groq LLM (free tier, fast inference) and parses the model's chosen action.
It POSTs the action back to bot.js, which dispatches to the matching Mineflayer handler.
Five seconds later, the loop runs again.
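
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual decision.py: the endpoint paths (/state, /vision, /action), the prompt wording, and every function name are assumptions, and the LLM call is passed in as a plain callable (ask_llm) so that one pass of the loop can be exercised without a network connection or an API key. The ports and the 5-second interval come from the architecture table.

```python
import json
import time
import urllib.request

BOT_URL = "http://localhost:3000"      # bot.js port from the architecture table
VISION_URL = "http://localhost:5000"   # vision.py port from the architecture table
ACTIONS = ["collect_wood", "explore", "seek_shelter", "find_food", "eat", "flee", "idle"]


def get_json(url, timeout=3.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def post_json(url, payload, timeout=3.0):
    """POST a JSON payload and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def build_prompt(bot_state, vision_state, actions=ACTIONS):
    """Package both views of the world plus the allowed actions (wording is illustrative)."""
    return (
        "You control a Minecraft survival agent.\n"
        f"Bot state: {json.dumps(bot_state, sort_keys=True)}\n"
        f"Vision: {json.dumps(vision_state, sort_keys=True)}\n"
        f"Reply with exactly one of: {', '.join(actions)}"
    )


def run_once(fetch_bot, fetch_vision, ask_llm, send_action):
    """One pass of the loop; the four callables are injected so it can run offline."""
    prompt = build_prompt(fetch_bot(), fetch_vision())
    action = ask_llm(prompt).strip().lower()
    if action not in ACTIONS:
        action = "idle"   # unrecognized reply: do nothing this cycle
    send_action(action)
    return action


def main(ask_llm, interval=5.0):
    """Run forever. The paths /state, /vision and /action are hypothetical."""
    while True:
        try:
            run_once(
                lambda: get_json(f"{BOT_URL}/state"),
                lambda: get_json(f"{VISION_URL}/vision"),
                ask_llm,
                lambda a: post_json(f"{BOT_URL}/action", {"action": a}),
            )
        except OSError as err:   # a service is down or restarting; retry next cycle
            print(f"cycle skipped: {err}")
        time.sleep(interval)
```

Catching OSError (which covers urllib's connection errors and timeouts) lets the loop survive a restart of either service, which is the property the three-process split is meant to provide.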

Actions Implemented

Two reflexes run independently of the LLM: the agent auto-eats when hunger drops below 14/20, and it auto-walks to item drops after mining.

Action	Description
collect_wood	Find and mine nearby logs
explore	Move to a random nearby location
seek_shelter	Find cover, or dig an emergency hole
find_food	Hunt passive mobs or harvest crops
eat	Eat food from inventory
flee	Move away from a detected threat
idle	Wait briefly before the next decision
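
Because the LLM may wrap its choice in extra words or JSON, the reply has to be reduced to one of the action names in the table before it is dispatched. The sketch below shows one plausible way to do that; parse_action and its fallback to idle are assumptions for illustration, not the project's actual parser.

```python
import re

# Action names copied from the table above.
ALLOWED_ACTIONS = (
    "collect_wood", "explore", "seek_shelter", "find_food", "eat", "flee", "idle",
)


def parse_action(reply, allowed=ALLOWED_ACTIONS, default="idle"):
    """Return the first allowed action named in the reply, else the default.

    The reply is split into runs of lowercase letters and underscores, so
    'create' never matches 'eat', while 'collect_wood' is kept as one token.
    """
    for token in re.findall(r"[a-z_]+", reply.lower()):
        if token in allowed:
            return token
    return default


print(parse_action('{"action": "collect_wood"}'))
print(parse_action("Night is coming, so I would FLEE."))
print(parse_action("build a castle"))
```

Falling back to idle keeps an out-of-vocabulary reply from reaching bot.js, at the cost of wasting one cycle.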

Key Decisions

Three separate services instead of one monolithic process. Decoupling bot control (Node + Mineflayer) from vision (Python + OpenCV) from reasoning (Python + Groq) means each piece can be restarted without bringing down the others.
Groq over OpenAI for the LLM. Groq offers fast, free-tier inference that is more than quick enough for a 5-second decision cycle.
Screen capture instead of parsing game packets. Reading the Minecraft window through MSS is much simpler than intercepting and parsing protocol packets, at the cost of requiring the window to be visible on the primary monitor.
Vanilla 1.21.1 server with online-mode=false. Lets the bot join the local world without a premium Minecraft account, which makes iteration much faster.
Mineflayer for bot control. Mature Node.js library with full protocol support, good pathfinding primitives, and enough hooks to bolt on custom behaviors as the action set grows.

Current Limitations

Small action vocabulary. The LLM can only pick from the handful of actions currently implemented.
Vision capture is hardcoded to a 1920x1080 region on the primary monitor. It needs a per-machine config before it runs cleanly elsewhere.
No long-term memory. Each decision is made from the current state only; the agent does not remember what it was trying to accomplish a minute ago.
No threat model beyond "flee". Hostile mob detection is basic and there is no combat strategy.
Single agent. No multi-agent coordination or goal decomposition.
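
The hardcoded capture size could be lifted into a small per-machine config. The sketch below is one possible shape, not the project's code: the file name vision_config.json, the config keys, and the helper names are all hypothetical, and monitor index 1 is assumed to be the monitor the game runs on. Only the final helper needs the third-party mss package, which it imports lazily.

```python
import json
import os

# Defaults mirror the current hardcoded 1920x1080 size; monitor index 1 is an assumption.
DEFAULT_CAPTURE = {"monitor": 1, "width": 1920, "height": 1080}


def load_capture_config(path="vision_config.json"):
    """Merge an optional JSON file (hypothetical name) over the defaults."""
    config = dict(DEFAULT_CAPTURE)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            config.update(json.load(fh))
    return config


def capture_region(monitor_info, config):
    """Build an MSS-style region dict anchored at the monitor's top-left corner,
    clamped so it never extends past the monitor's own size."""
    return {
        "left": monitor_info["left"],
        "top": monitor_info["top"],
        "width": min(config["width"], monitor_info["width"]),
        "height": min(config["height"], monitor_info["height"]),
    }


def grab_frame(config):
    """Capture one frame; requires the third-party mss package."""
    import mss  # imported lazily so the helpers above work without it

    with mss.mss() as sct:
        # In mss, monitors[0] is the union of all monitors; 1..n are individual ones.
        monitor_info = sct.monitors[config["monitor"]]
        return sct.grab(capture_region(monitor_info, config))
```

With no config file present, load_capture_config() returns the defaults, so existing 1920x1080 setups would behave as before.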

Next Work

Expand the action set: crafting, building, farming, combat with basic tactics.
Give the LLM a scratchpad for short-term plans and multi-step goals (build shelter before night, stockpile food before exploring).
Better vision signals: detect specific biomes, hostile mobs, and resource locations from the screen capture instead of only passing raw descriptors.
Make the vision resolution and monitor selection configurable.
Structured logging of decisions and outcomes to evaluate which actions the LLM chooses well and which it does not.
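
As a rough idea of what the structured logging item could look like, the sketch below appends one JSON line per decision and computes a per-action success rate from the file. The record fields, the "ok" outcome label, and both function names are assumptions; how an outcome would actually be judged is left open.

```python
import json
import time
from collections import Counter


def log_decision(path, action, state, outcome=None):
    """Append one decision as a JSON line; outcome stays None if it was never judged."""
    record = {"ts": time.time(), "action": action, "state": state, "outcome": outcome}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def success_rates(path):
    """Return {action: fraction of judged decisions whose outcome was 'ok'}."""
    totals, wins = Counter(), Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            if rec["outcome"] is None:
                continue  # skip decisions with no recorded outcome
            totals[rec["action"]] += 1
            wins[rec["action"]] += rec["outcome"] == "ok"
    return {action: wins[action] / totals[action] for action in totals}
```

A JSON-lines file keeps each record independent, so a crash mid-run loses at most the line being written.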

Status

Early-stage working prototype. The three-service architecture is stable, the decision loop runs end to end against a vanilla 1.21.1 server, and the agent performs basic survival actions autonomously. Ongoing work is focused on expanding the action vocabulary and giving the LLM richer context to reason over.