@rajistics: AI agents used to shut down mid-task or hallucinate vending empires.
Now? They're beating humans at long-horizon business simulations. From 8% task success with GPT‑4o to 30%+ with Claude and Gemini,
benchmarks like TheAgentCompany and Vending-Bench show agents aren’t just smarter —
they’re starting to work.
TheAgentCompany Benchmark (CMU): https://arxiv.org/abs/2412.14161
Vending-Bench (Andon Labs): https://arxiv.org/abs/2502.15840
Project Vend (Anthropic): https://www.anthropic.com/research/project-vend-1
Claude/Gemini benchmark updates: https://x.com/andonlabs/status/1805322416206078341

Rajiv Shah | data science & AI
Sunday 06 July 2025 15:35:16 GMT

Comments

miakdot:
Honestly, the more complexity, the worse these things get at the moment. It’s a good exercise to revisit and benchmark later, when the technology improves.
2025-07-06 16:13:10
mon.momomo:
My codebase gets ruined after using AI, even without integration, just using web copy-paste and search advice…
2025-07-06 15:51:21
ArcaMutant:
For context, the Claudius bot also used Claude Opus 4, and it performed very badly in a real-world test. (It didn’t have any extra features though; from what I can tell, it was a raw LLM with the ability to send emails.)
2025-07-08 23:33:27
