Technology Updates: People are using Super Mario to benchmark AI now
Source: Ars Technica
Summary (generated with Apple Intelligence):
- Game Performance: Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.
- Game Mechanics: The game ran in an emulator integrated with GamingAgent, a framework that gives the models control over Mario: it feeds each model screenshots of the game and translates the model's responses into controller inputs (see the sketch after this list).
- Reasoning Model Performance: Reasoning models performed worse than "non-reasoning" models, despite being generally stronger on most benchmarks. The game runs in real time, so a model that spends seconds deliberating over each move has already missed the moment by the time it acts.
- AI Benchmarking Limitations: Games are useful for benchmarking AI, but they are abstract, comparatively simple, and supply effectively unlimited training data; real-world tasks offer none of those conveniences.
- Evaluation Crisis: Recent gaming benchmarks underscore a wider problem: the field lacks clear metrics for measuring what AI models can actually do, so the true capabilities of current models remain uncertain.
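To make the mechanics concrete, here is a minimal sketch of this kind of agent loop. Everything in it (`FakeEmulator`, `query_model`, the action names, the 500 ms model delay) is an illustrative assumption, not GamingAgent's actual API; the point is the structure: capture a frame, ask the model for a move, press the button, and pay for every second of deliberation in missed frames.

```python
import random
import time

# Everything below is an illustrative stand-in, not GamingAgent's real API.

FRAME_BUDGET_S = 1 / 60          # NES-era games render ~60 frames per second
ACTIONS = ["right", "jump", "run"]

class FakeEmulator:
    """Placeholder for a real emulator binding."""

    def screenshot(self) -> bytes:
        # A real binding would return the rendered frame as image bytes.
        return b""

    def press(self, action: str) -> None:
        # A real binding would inject the controller input here.
        pass

def query_model(frame: bytes) -> str:
    """Stand-in for a vision-language-model call: send the current frame,
    get back a single controller action."""
    time.sleep(0.5)              # pretend the API round trip takes 500 ms
    return random.choice(ACTIONS)

def agent_loop(emulator: FakeEmulator, steps: int = 5) -> None:
    for _ in range(steps):
        frame = emulator.screenshot()
        start = time.perf_counter()
        action = query_model(frame)
        latency = time.perf_counter() - start
        # The game keeps running while the model thinks, so each second of
        # deliberation costs ~60 frames of missed reactions, which is one
        # plausible reason slow "reasoning" models fare worse than faster ones.
        print(f"{action}: decided in {latency:.2f}s "
              f"(~{latency / FRAME_BUDGET_S:.0f} frames behind)")
        emulator.press(action)

if __name__ == "__main__":
    agent_loop(FakeEmulator())
```

Running the sketch prints one line per decision, showing how far behind real time each move lands; swapping the stubs for a real emulator binding and model API would turn it into an actual harness.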