AI Diplomacy, Part 2: Optimizing the Prototype (and Watching Tokens Burn)

About three weeks ago, I threw together this “AI Diplomacy” project in a hurry just to test Claude Code’s Remote Control feature. I tinkered with it for another day or two, then went on a business trip to Korea, and after that I got busy building an agent team. Before I knew it, the project had turned into one of those things you quietly forget.

On top of that, model costs became a real problem, so I’ll probably pause here for a while. For now, I just want to leave a record of what I managed to do up to this point.

In the post from three weeks ago, I had already reached the point where a full Diplomacy game could run using only AI players. Over the next few days, I worked mostly on UI polish and optimization. Specifically:

  • I originally had “sleep 3 seconds and retry” logic for rate limits, but it was effectively useless. Like, what rate limit is going to reset in three seconds? So I switched to exponential backoff.
  • Negotiation messages were way too verbose, so I tuned them to be more concise. Since it’s a diplomacy-themed game, the agents tended to open with long-winded diplomatic formalities. Shortening the messages reduced token waste and improved readability, but honestly, they probably still need to be cut down further.
  • I parallelized LLM calls. Before, I was calling one power at a time, but Diplomacy is a game where seven powers act simultaneously, so this is how it should have worked from the beginning. The only reason the early version was sequential was to make debugging easier. On paper, this alone gives up to a 7x speedup per round of calls, though of course it also burns through tokens 7x faster in wall-clock time.
  • The definitions of “turn” vs “phase” were unclear, so I clarified the spec and refactored. In Diplomacy, one year consists of two turns (Spring/Fall), and each turn consists of four or five phases. Somehow, my implementation had ended up treating phases as turns. Probably my own fault for leaving the spec ambiguous.

After I asked Claude to refactor with clearer terminology, it came back with something it called a “minimal change refactoring strategy.” Minimal change? This is the kind of change where you end up rebuilding DB tables. I asked it to explain, and the answer was: the table is named `turn`, but the data inside is actually `phase`. It wanted to keep that for backward compatibility.

No. There’s no need for backward compatibility here, so I told it to just do it properly and rebuild it. It’s funny how it can be “human” in the weirdest ways.
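To make the turn/phase split concrete, here is a minimal sketch of what a rebuilt schema could look like. It is not the actual schema from this project: the table and column names, the use of SQLite, and the phase names in `PHASES` are all assumptions for illustration. The only things taken from the spec above are that a year has two turns (Spring/Fall) and that a turn has four or five phases.

```python
import sqlite3

# Hypothetical schema: `turn` holds one row per season, `phase` holds the
# steps inside that turn. All names here are illustrative assumptions.
SCHEMA = """
CREATE TABLE turn (
    id     INTEGER PRIMARY KEY,
    year   INTEGER NOT NULL,
    season TEXT    NOT NULL CHECK (season IN ('SPRING', 'FALL')),
    UNIQUE (year, season)
);
CREATE TABLE phase (
    id      INTEGER PRIMARY KEY,
    turn_id INTEGER NOT NULL REFERENCES turn(id),
    seq     INTEGER NOT NULL,
    name    TEXT    NOT NULL,
    UNIQUE (turn_id, seq)
);
"""

# Illustrative phase names only (assumed): four phases in Spring, five in Fall.
PHASES = {
    "SPRING": ["NEGOTIATION", "ORDERS", "RESOLUTION", "RETREAT"],
    "FALL": ["NEGOTIATION", "ORDERS", "RESOLUTION", "RETREAT", "ADJUSTMENT"],
}


def add_turn(conn, year, season):
    """Insert one turn and its ordered phases; return the new turn id."""
    cur = conn.execute(
        "INSERT INTO turn (year, season) VALUES (?, ?)", (year, season)
    )
    turn_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO phase (turn_id, seq, name) VALUES (?, ?, ?)",
        [(turn_id, seq, name) for seq, name in enumerate(PHASES[season], start=1)],
    )
    return turn_id


def build_db(first_year, last_year):
    """Create an in-memory database with every turn from first_year to last_year."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for year in range(first_year, last_year + 1):
        for season in ("SPRING", "FALL"):
            add_turn(conn, year, season)
    conn.commit()
    return conn
```

With this shape, the table named `turn` really does hold turns, and anything that happens inside a turn hangs off `phase` through `turn_id`, so there is nothing left to keep “backward compatible.”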

  • I gave the dashboard a proper visualization. I adjusted the UI layout a bit, added the map image, and displayed unit orders using arrows and icons. This part was easier than I expected, because the `diplomacy` package I used already ships the map image and icons as SVGs. Of course, Claude handled analyzing those assets too.
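The backoff and parallelization changes from the list above can be sketched together. This is a rough illustration under stated assumptions, not the project’s actual code: `RateLimitError`, `call_with_backoff`, `get_all_orders`, and the `ask_power` callback are hypothetical names, and the retry count and delay values are made up.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever the LLM client raises on a rate limit."""


async def call_with_backoff(call, *, max_retries=5, base_delay=1.0,
                            max_delay=60.0, sleep=asyncio.sleep):
    """Await call(); on RateLimitError, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel callers do not all retry at once.
            await sleep(delay + random.uniform(0, delay * 0.1))


POWERS = ["AUSTRIA", "ENGLAND", "FRANCE", "GERMANY", "ITALY", "RUSSIA", "TURKEY"]


async def get_all_orders(ask_power, powers=POWERS, **backoff_kwargs):
    """Query every power concurrently; each call has its own backoff loop."""
    async def one(power):
        return await call_with_backoff(lambda: ask_power(power), **backoff_kwargs)

    results = await asyncio.gather(*(one(p) for p in powers))
    return dict(zip(powers, results))
```

Here `ask_power` would be an async function that sends one power’s prompt to its model and returns the parsed orders, run with something like `asyncio.run(get_all_orders(ask_power))`. Because all seven calls are in flight at once, a round takes roughly as long as the slowest call rather than the sum of all seven, which is also why the token quota drains so much faster.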

After these changes, I ran a longer game simulation. The result: by 1920, I had run out of tokens, orders stopped coming in, and the game ended in a stalemate. That was the day I burned through an entire week’s worth of Claude tokens.

Still, getting to 1920 means it ran for about 40 turns (the game starts in 1901, and each year is two turns), which is not nothing. I used Gemini-2.5-pro via the Google API only for Italy, and for the other powers I used Sonnet-4 via Claude Code. When I checked later, the Gemini API usage alone cost $9.84.

Next time, I should try cheaper models, and probably do more optimization.

Honestly, if I wanted to do this “for real,” this is where it would actually start. Building up memory to improve strategy, making different LLMs compete, mixing in humans, and so on.

The problem, as always, is time and money. Hopefully I can come back to it someday.
