@rajistics: AI agents used to shut down mid-task or hallucinate vending empires. Now? They're beating humans at long-horizon business simulations. From 8% task success with GPT‑4o to 30%+ with Claude and Gemini, benchmarks like TheAgentCompany and Vending-Bench show agents aren't just smarter — they're starting to work.
TheAgentCompany Benchmark (CMU): https://arxiv.org/abs/2412.14161
Vending-Bench (Andon Labs): https://arxiv.org/abs/2502.15840
Project Vend (Anthropic): https://www.anthropic.com/research/project-vend-1
Claude/Gemini benchmark updates: https://x.com/andonlabs/status/1805322416206078341
Honestly, the more complexity there is, the worse these things get at the moment. It would be a good exercise to benchmark again later, once the technology improves.
2025-07-06 16:13:10
1
mon :
My code base gets ruined after using AI, even without any integration, just using web copy-paste and search advice…
2025-07-06 15:51:21
0
ArcaMutant :
For context, the Claudius bot also used Claude Opus 4, and it performed very badly in a real-world test. (It didn't have any extra features though; from what I can tell, it was a raw LLM with the ability to send emails.)
2025-07-08 23:33:27
0