7 AIs Play Werewolf: GPT-5 MVP, Kimi's Aggressive Tactics

A group of AIs played Werewolf, and GPT-5 took a commanding lead with an astonishing win rate of 96.7%.

Greg Brockman, President of OpenAI, shared a benchmark in which seven powerful large language models (LLMs), both open-source and closed-source, played 210 full games of Werewolf.

GPT-5 performed exceptionally well, establishing itself as the undisputed MVP among current LLMs.

Among the Chinese models, Qwen3 and Kimi-K2 placed 4th and 6th, respectively.

The official blog post also offered analysis of the distinct personality traits these models exhibited during the game.

For instance, Kimi-K2 showed remarkable adaptability with its “hard re-vote” play: after making a clear mistake as a werewolf, it boldly declared itself the Witch and managed to turn the situation around.

The move was highly audacious and aggressive.

AI Takes on Werewolf

For context, Werewolf is a social deduction game that alternates between night and day phases.

In this benchmark, each game had six players: two werewolves and four villagers, including a Seer and a Witch.

During the night, the werewolves select a target, while the Witch and Seer perform their respective actions. During the day, players discuss and vote to eliminate those they suspect of being werewolves. Villagers win by eliminating all werewolves, and werewolves win by achieving numerical superiority.
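The win conditions described above can be sketched in a few lines of Python. This is only an illustration: the function name `check_winner` and the `parity_counts` flag are assumptions introduced here, not part of the benchmark, and the article does not say whether an equal number of werewolves and villagers already ends the game.

```python
from typing import Optional


def check_winner(wolves_alive: int, villagers_alive: int,
                 parity_counts: bool = False) -> Optional[str]:
    """Return the winning side, or None if the game continues.

    Hypothetical helper. `parity_counts` is an assumption: the source only
    says werewolves win by "numerical superiority", so by default a tie in
    head count does not end the game.
    """
    if wolves_alive == 0:
        return "villagers"
    if wolves_alive > villagers_alive:
        return "werewolves"
    if parity_counts and wolves_alive == villagers_alive:
        return "werewolves"
    return None


# Starting position from the article: two werewolves, four villagers.
print(check_winner(2, 4))   # game continues
print(check_winner(0, 3))   # all werewolves eliminated
print(check_winner(2, 1))   # werewolves outnumber villagers
```

Setting `parity_counts=True` models the common house rule in which werewolves win as soon as they match the villagers in number.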

The official description of the Werewolf benchmark highlights its purpose:

The current benchmarks tell us if models can solve equations or debug code, but they don’t tell us if models will collapse under cross-examination, abandon allies under pressure, or manipulate a room into making a bad decision.

When we deploy AI agents onto human teams, these patterns of behavior are just as important as math and code scores.

Playing Werewolf forces models to engage with trust, deception, and social dynamics – skills critical for their roles as autonomous agents.

In this test, each pair of models played ten matches: five in which one model controlled the werewolves and the other the villagers, and five with the roles swapped.

This setup measures two dimensions: how well a model manipulates others as a werewolf, and how well it resists manipulation as a villager.
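The match count follows directly from this pairing scheme, and a short Python sketch makes the arithmetic explicit. The model names are collected from the article (with “2.5 Flash” read as Gemini 2.5 Flash, which is an assumption); `build_schedule` and the dictionary layout are hypothetical and only serve the illustration.

```python
from itertools import combinations

# Seven models named in the article; "Gemini 2.5 Flash" is an assumed
# expansion of "2.5 Flash".
MODELS = ["GPT-5", "GPT-5-mini", "Gemini 2.5 Pro", "Gemini 2.5 Flash",
          "Qwen3", "Kimi-K2", "GPT-OSS"]
MATCHES_PER_ROLE = 5  # five games per side for each pair


def build_schedule(models):
    """List every game as a dict naming who plays which side."""
    schedule = []
    for a, b in combinations(models, 2):
        for _ in range(MATCHES_PER_ROLE):
            schedule.append({"werewolves": a, "villagers": b})
            schedule.append({"werewolves": b, "villagers": a})
    return schedule


schedule = build_schedule(MODELS)
total_games = len(schedule)  # 21 pairs x 10 matches = 210
gpt5_games = sum("GPT-5" in (g["werewolves"], g["villagers"]) for g in schedule)

print(total_games)  # 210
print(gpt5_games)   # 60
```

Under this schedule each model plays 60 games, so a 96.7% win rate would be consistent with 58 wins out of 60, although the article does not report the raw count.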

Across all pairings, GPT-5 did not lose a single head-to-head series.

The testing team quantified performance using an independent Elo rating system and three complementary metrics: how much damage the villager team inflicted on itself by wrongly eliminating its own Seer or Witch, how quickly coordinating werewolves were identified, and how effectively the werewolf team kept control of the village in multi-day games.

Within the cohort, GPT-5 stood out significantly. The other models formed a second tier, showing different strengths depending on their assigned roles. This is the purpose of role-conditioned Elo: to distinguish manipulators (werewolves) from those resisting manipulation (villagers).
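To make the idea of role-conditioned Elo concrete, here is a minimal sketch that keeps one rating per (model, role) pair and applies the standard Elo update after each game. The K-factor of 32, the starting rating of 1500, and the names `ratings`, `expected_score` and `record_game` are all assumptions; the article does not describe the benchmark's actual parameters or update rule.

```python
from collections import defaultdict

K_FACTOR = 32           # assumed; not given in the article
START_RATING = 1500.0   # assumed; not given in the article

# One rating per (model, role), so werewolf play and villager play
# are scored separately.
ratings = defaultdict(lambda: START_RATING)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_game(wolf_model: str, villager_model: str, wolves_won: bool) -> None:
    """Update the werewolf-side and villager-side ratings after one game."""
    wolf_key = (wolf_model, "werewolf")
    villager_key = (villager_model, "villager")
    expected = expected_score(ratings[wolf_key], ratings[villager_key])
    actual = 1.0 if wolves_won else 0.0
    delta = K_FACTOR * (actual - expected)
    ratings[wolf_key] += delta
    ratings[villager_key] -= delta


# Hypothetical result, for illustration only.
record_game("Model A", "Model B", wolves_won=True)
print(ratings[("Model A", "werewolf")])  # 1516.0
print(ratings[("Model B", "villager")])  # 1484.0
```

Because the two roles are tracked under separate keys, a model can end up with a high werewolf rating and a low villager rating, which is the distinction the role-conditioned approach is meant to expose.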

As werewolves, the strongest models did not merely aim for a single misjudgment; they built momentum over several days, aligning night choices with public narratives, controlling the pace of pressure, and keeping contingency plans ready for new accusations.

GPT-5 dominated by maintaining strict multi-day control, consistently holding the top position. In contrast, Kimi-K2 and Gemini 2.5 Pro displayed high-impact but volatile styles: they could force hands or reverse narratives, but often exposed themselves through errors or overextension.

The remaining models lagged behind: GPT-5-mini, Gemini 2.5 Flash, and Qwen3 could influence votes but rarely sustained deception beyond the second day, while GPT-OSS remained transparent and was easily repelled.

When defending as villagers, the task reverses: filtering out false accusations without becoming paranoid, penalizing inconsistencies, and avoiding tunnel-vision mis-eliminations.

Effective villagers keep information in order: they anchor discussions to public facts, pose targeted questions, and update their beliefs publicly, making it hard for the werewolves’ “story” to mislead them.

In resisting misinformation, GPT-5 again set the standard. Its structured approach to tie-breaking and its real-time public updates made it hard for prolonged deception to succeed.

Gemini 2.5 Pro excelled in defense, firmly refusing to take the bait or walk into traps.

Qwen3 did not always dominate, but it held its positions consistently and avoided catastrophic misjudgments.

Kimi-K2 lacked robustness under pressure: it could sway votes with momentum, but it tended to falter when the situation demanded precision.

GPT-5-mini and Gemini 2.5 Flash performed only moderately and were easily misled under sustained narrative pressure.

GPT-OSS, on the other hand, was completely outmaneuvered and easily manipulated.

The testing team also revealed that its earlier tests covered more than these seven models. It found that capability did not improve linearly but in behavioral leaps, with marked differences between weaker and stronger models:

Weaker models behaved chaotically, with players acting independently and werewolves targeting obvious choices. Stronger models showed discipline: coordinating votes, planning night kills, assigning roles, and even strategically sacrificing werewolf teammates.

Furthermore, not every reasoning-optimized model performed well.

Reasoning-optimized models generally did well, but the label alone did not guarantee capability. In the broader tests, o3 played excellent, highly disciplined games, whereas o4-mini proved fragile: although adept at local debates, it tended to get stuck in fixed patterns, lacked adaptability, and often exposed itself through poor vote timing.

However, netizens are more interested in contestants that were not included, such as Grok and Claude, and hope to see more models in future tests.

The testing team says it is already in contact and that a collaboration may follow.

Models Exhibit Distinct Personalities

Interestingly, each model displayed its own style during this test.

Here are a few examples highlighting these distinct behaviors:

GPT-5 → A calm and composed architect, establishing order in the game, dominating each debate, and dictating the pace for the entire group, demonstrating absolute authority and control.

GPT-OSS → A defensive and hesitant player, often retreating under pressure, exhibiting timid characteristics.

Kimi-K2 → A bold and aggressive high-stakes gambler, rapidly building momentum and adept at forcing opponents to declare their positions prematurely, but showing significant volatility in later stages.

Kimi-K2, in particular, showed remarkable creativity and appetite for risk.

After making a significant error as a werewolf, it daringly used the “hard re-vote” play, publicly claiming to be the Witch and turning the situation around.

Although the initial mistake (leaking key information) kept it from winning this particular game, it still demonstrated a very high level of play.

The testing team stated that the true value of this benchmark lies in helping people understand how LLMs behave in social systems: their personalities, influence patterns, and group dynamics under pressure.

By mapping these behavioral characteristics, it becomes possible to assemble groups of agents with specific personality combinations: some skeptical, some persuasive, some analytical.

This opens up possibilities for simulating complex social interactions.

In the long term, the goal of the Werewolf benchmark is to support AI-driven market research. By running dynamic simulations with carefully selected AI personalities, it aims to predict real-world user responses and thereby improve on costly, inefficient human focus groups.

This objective is still some way off, and because of the high computational costs involved, the team is currently seeking collaborators.

They are willing to share detailed logs, case studies, and role-specific behavioral insights to help partners understand how models perform in social contexts.

GPT-5’s Progress Exceeds Expectations

In this Werewolf benchmark, GPT-5’s performance was undoubtedly outstanding.

Its performance in other benchmark tests has also been impressive.

A new report from Epoch AI finds that GPT-5 improved significantly over GPT-4 across major benchmarks.

The data show that GPT-5 made an 80% leap over GPT-4 on Mock AIME. On MATH Level 5 it scored 98%, an improvement of 75 percentage points over GPT-4’s score of 23%.
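The MATH Level 5 figures are easiest to read as a gain in percentage points rather than as a relative increase, and the two readings differ a lot. The short sketch below works through both using only the scores quoted above; the helper names `point_gain` and `relative_gain` are hypothetical.

```python
def point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute change, in percentage points."""
    return new_pct - old_pct


def relative_gain(old_pct: float, new_pct: float) -> float:
    """Change relative to the old score, in percent."""
    return (new_pct - old_pct) / old_pct * 100.0


gpt4_math_l5 = 23.0  # GPT-4 score quoted in the article
gpt5_math_l5 = 98.0  # GPT-5 score quoted in the article

print(point_gain(gpt4_math_l5, gpt5_math_l5))            # 75.0
print(round(relative_gain(gpt4_math_l5, gpt5_math_l5)))  # 326
```

The 75 in the report matches the percentage-point reading; expressed relative to GPT-4's score, the same jump would be roughly 326%.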

This report has sparked considerable discussion among netizens, who view it as a major advancement.

At its release, GPT-4 was widely considered a significant leap over GPT-3, demonstrating the high returns of scaling up training compute.

However, user reception of GPT-5 has been more mixed. Some feel it is not as dramatic a leap as GPT-4 was, possibly because of the model’s development approach, which prioritized reinforcement learning over scaling up pre-training.

The report indicates that GPT-5 outperforms GPT-4 on several important benchmarks, much as GPT-4 surpassed GPT-3 on the benchmarks widely cited in its era.

While these improvements may not be directly comparable, they do indicate that GPT-5 and GPT-4 each represent a substantial advance over their predecessor.

Some netizens argue that benchmark numbers are not the sole indicator of progress and emphasize the importance of user experience.

However, user experience is subjective.

Epoch AI suggests that this difference in user experience might be related to the frequency of product releases.
