Show HN: I taught LLMs to play Magic: The Gathering against each other
I've been teaching LLMs to play Magic: The Gathering recently, via MCP tools hooked up to the open-source XMage codebase. It's still pretty buggy and I think there's significant room for existing models to get better at it via tooling improvements, but it pretty much works today. The ratings for expensive frontier models are artificially low right now because I've been focusing on cheaper models until I work out the bugs, so they don't have a lot of games in the system.
75 points by GregorStocks - 53 comments
>The anxiety creeps in: What if they have removal? Should I really commit this early?
>However, anxiety kicks in: What if they have instant-speed removal or a combat trick?
It's also interesting that it doesn't seem to be able to understand why things are happening. It attacks with Gran-Gran (attacking taps the creature), which says, "Whenever Gran-Gran becomes tapped, draw a card, then discard a card." Its next thought is:
>Interesting — there's an "Ability" on the stack asking me to select a card to discard. This must be from one of the opponent's cards. Looking at their graveyard, they played Spider-Sense and Abandon Attachments. The Ability might be from something else or a triggered ability.
That said, I reviewed a few of the Legacy games (the format I'm most familiar with and also the hardest by far), and the level of play was so low that I don't think any of the results are valid. It's very possible that for Legacy they would need some assistance playing Blue decks, but they don't seem to grasp even the most basic concepts - Who's the beatdown?
IMO the most important part of current competitive Magic is mulligans, and that's something an LLM should be extremely good at, but none of the games I'm seeing had either player start with fewer than 7 cards... in my experience about 75% of Legacy games have at least one player mulligan their opener.
No, no, no... please think. Human child psychology is not the same as an LLM engine rating. It is both inaccurate and destructive to actual understanding to use that common phrase. Asking politely - consider not saying that about LLM game ratings.
The agents also constantly seem to evaluate whether they're "behind" or "ahead" based on board state, which is a weird way of thinking about most games and often hard to evaluate, especially for decks like control that care more about resources like mana and card advantage and always plan on stabilizing late game.
I don't think there's a perfect way to do this, but I think trying to play 100 games with a deck and getting basic info like this would be super valuable.
https://github.com/spullara/mtg-reanimator
I have also tried evaluating LLMs for playing the game and have found them to be really terrible at it, even the SoTA ones. They would probably be a lot better inside an environment where the rules are enforced strictly, like MTG Arena, rather than having to understand the rules and play correctly on their own. The 3rd LLM acting as judge helps, but even it is wrong a lot of the time.
https://github.com/spullara/mtgeval
You could clone mage-bench https://github.com/GregorStocks/mage-bench and add a new config like https://github.com/GregorStocks/mage-bench/blob/master/confi... pointing at the deck you want to test, and then do `make run CONFIG=my-config`. The logs will get dumped in ~/.mage-bench/logs and you can do analysis on them after the fact with Python or whatever. https://github.com/GregorStocks/mage-bench/tree/master/scrip... has various examples of varying quality levels.
You could also use LLMs, just passing a different `type` in the config file. But then you'd be spending real money for slower gameplay and probably-worse results.
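If it helps, here's the kind of after-the-fact analysis I mean, as a rough Python sketch. The filenames and JSON fields below are assumptions rather than the real mage-bench schema - check what actually lands in ~/.mage-bench/logs and the existing scripts for the real format.

    import json
    from collections import Counter
    from pathlib import Path

    # Hypothetical sketch: assumes each finished game is a JSON file
    # with a "players" list (name + deck) and a "winner" name.
    # The real log layout may differ.
    LOG_DIR = Path.home() / ".mage-bench" / "logs"

    wins, games = Counter(), Counter()
    for path in LOG_DIR.glob("**/*.json"):
        game = json.loads(path.read_text())
        winner = game.get("winner")
        for player in game.get("players", []):
            deck = player.get("deck", "unknown")
            games[deck] += 1
            if player.get("name") == winner:
                wins[deck] += 1

    for deck in sorted(games, key=games.get, reverse=True):
        print(f"{deck}: {wins[deck]}/{games[deck]} wins ({wins[deck] / games[deck]:.0%})")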
The rules aren't embedded into the client; it's "just" a virtual tabletop where you enforce the rules the same way you would playing with a friend in person. Cards have to be imported but it's fairly automatic (basically just clicking a few buttons after startup), so you could either only import the sets you want or just not use the ones you don't want (which is also how it tends to work when playing informally in person; it's not like you usually have a judge to enforce that you or your friends are playing by whatever rules you agree to).
FOSS Magic clients are in a legal gray area at best. My mental model is that Wizards de facto tolerate clients like XMage and Forge because their UX is awful, but if you made something that's actually as user-friendly as MTGO/Arena, they'd sue you and you would lose.
No individual card text (limited to just the mechanics) is copyrightable but the setlist of cards might be. It would come down to how much creativity went into curating the list of cards that is released. It gets especially murky because new cards are always being released and old cards are being retired, so they obviously put a lot of creative energy into that process. You'd have to avoid pre-made decks as well.
Unless you have funding from an eccentric MTG-loving billionaire, I see why you'd comply with the cease-and-desist.
Hasbro has the legal precedent too, as they were involved in the Scrabble lawsuit, which I think is mostly where the concept of not being able to use patent law for game rules comes from, and which also set the trend for aggressive trademark interpretation.
I expect the genie is mostly out of the bottle at this point. I'm fairly confident that if people can do X and Y actually illegal things on the Internet, we can have our card game, but I hope it can happen on a regular site or a decentralized system rather than something that has to hide on Tor.
Best to do this stuff in person I find.
The issue I see is that you'd need a huge number of games to tell who's better (you need that between humans too; the game is very high variance).
Another problem is that giving a positional evaluation to count mistakes is hard because MtG, in addition to having randomness, has private information. It could be rational for both players to believe they're currently winning even if they're both perfect bayesians. You'd need to have something that approximates "this is the probability of winning the game from this position, given all the information I have," which is almost certainly asymmetric and much more complicated than the equivalent for a game with randomness but not private information such as backgammon.
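To make that concrete, here's a toy sketch of what such an estimator has to do. Everything in it (the opponent-hand model, the rollout function) is hypothetical, and building those pieces is of course the hard part:

    def estimate_win_prob(my_state, public_state, opponent_hand_model,
                          rollout, n_samples=1000):
        """Toy Monte Carlo estimate of P(win | my information).

        Marginalize over the opponent's hidden cards: sample plausible
        hidden hands from some inference model, roll the rest of the
        game forward, and average. Both players can run this with their
        own (different) information and both can honestly land above 50%.
        """
        wins = 0
        for _ in range(n_samples):
            # Guess at the opponent's hidden cards, weighted by what
            # their plays so far make likely.
            hidden = opponent_hand_model.sample(public_state)
            # rollout returns 1 if we win this simulated continuation.
            wins += rollout(my_state, public_state, hidden)
        return wins / n_samples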
I'm not trying to compute a chess-style "player X was at 0.4 before this move and at 0.2 afterwards, so it was a -0.2 blunder", but I do have "blunder analysis" where I just ask Opus to second-guess every decision after the game is over - there's a bit more information on the Methodology page. So then you can compare models by looking at how often they blunder, rather than the binary win/loss data. If you look at individual games you can jump to the "blunders" on the timeline - most of the time I agree with Opus's analysis.
(I also thought about pointing it at my personal game logs, but unfortunately there aren't that many, because I'm too busy writing analysis tools to actually play the game.)
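If anyone wants to run their own comparisons off the blunder counts, something like this is probably enough - the numbers in the example are made up:

    import math

    def blunder_rate_ci(blunders, decisions, z=1.96):
        """Blunder rate with a normal-approximation 95% interval.

        Each game contributes many judged decisions instead of a single
        win/loss bit, so this tends to separate models with fewer games
        than raw win rate does.
        """
        p = blunders / decisions
        half = z * math.sqrt(p * (1 - p) / decisions)
        return p, max(0.0, p - half), min(1.0, p + half)

    # e.g. 37 flagged blunders across 900 decisions (made-up numbers)
    rate, lo, hi = blunder_rate_ci(37, 900)
    print(f"{rate:.1%} blunder rate (95% CI {lo:.1%}-{hi:.1%})")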
Once you get solid rankings for the different LLMs, I think a huge feature of a system like this would be to allow LLMs to pilot user decks to evaluate changes to the deck.
I'm guessing the costs of that would be pretty big, but if decent piloting is ever enabled by the cheaper models, it could be a huge change to how users evaluate their deck construction.
Especially for formats like Commander where cooperation and coordination amongst players can't be evaluated through pure simulation, and the singleton nature makes specific card changes very difficult to evaluate as testing requires many, many games.
[1] https://github.com/hansy/drawspell
Of course when you quantize deck quality to such a degree I'd argue it's not fun anymore. YGO is already not fun anymore because of this rampant quantization and it didn't even take LLMs to arrive here.
From the little I have seen they are different beasts (hidden information, number and complexity of rules...).
PS: Does this count as nerdsniping?
You can see the current prompt at https://github.com/GregorStocks/mage-bench/blob/master/puppe...:
They also get a small "personality" on top of that, e.g.:

    "grudge-holder": {
      "name_part": "Grudge",
      "prompt_suffix": "You remember every card that wronged you. Take removal personally. Target whoever hurt you last. Keep a mental scoreboard of grievances. Forgive nothing. When a creature you liked dies, vow revenge."
    },
    "teacher": {
      "name_part": "Teach",
      "prompt_suffix": "You explain your reasoning like you're coaching a newer player. Talk through sequencing decisions, threat evaluation, and common mistakes. Be patient and clear. Point out what the correct play is and why."
    },
Then they also see the documentation for the MCP tools: https://mage-bench.com/mcp-tools/. For now I've tried to keep that concise to avoid "too many MCP tools in context" issues - I expect that as solutions like tool search (https://www.anthropic.com/engineering/code-execution-with-mc...) become widespread I'll be able to add fancier tools for some models.
This is also something I think the MTG community needs in many ways. I have been a relatively happy XMage user, although it has a bit to go, and before that was using GCCG which was great too!
The MTG community overall can benefit a lot from the game having a more entertaining competitive landscape, which has grown stale in many ways; since the Hasbro acquisition, Wizards has done a poor job of doing much besides shitting out product after product, too fast and with poor balance.
I have to imagine that Wizards is already running simulations, but either they aren't working well or Wizards is choosing to disregard them. Hopefully, if they are just bad at running simulations, something like this can make it easier for them, and if not, it will at least improve the community's response time.
Regarding actually doing it under the radar, there are a lot of ways. They're likely catching most of these players because the bots create synthetic input events via the Windows API and similar, which is the same signal used by the CAPTCHAs meant to stop web scraping - the kind that just ask for a button press.
If you must stay on Windows, this can be worked around with a fake mouse driver that is actually controlled by software. It can also be worked around by just running the client on Linux, or by running the client inside qemu and driving it through its built-in VNC, since those inputs arrive as hardware events too =)
In practice they haven't really talked to each other, though. They've mostly just interpreted the prompts as "you should have a running monologue in chat". Not sure how much of this is issues with the harness vs the prompt, but I'm hoping to dig into it in the future.
Can we automate the unpleasantries in life instead of the pleasures?
I get the complaint, but how is this something that removes the human element at all?