New Benchmark Casts Doubts on Agentic AI’s Workplace Readiness

Microsoft CEO Satya Nadella appeared on the Dwarkesh Patel podcast and declared that lawyers, accountants, investment bankers and IT coders would soon become redundant as AI agents replace them in hordes. However, OpenAI co-founder Andrej Karpathy described agentic AI as “slop” in another edition of the same podcast.

So, where does the world stand on this dichotomy today? There has undoubtedly been considerable progress on foundation models, but clear use cases for agentic AI are still few and far between. While AI agents have seen some success in research and planning activities, the white-collar workforce remains largely unaffected.

New research from training-data company Mercor suggests that AI agents remain largely unprepared, or at best under-prepared, for real-life workplaces. The company has released the APEX-Agents benchmark, which exposes critical gaps in AI’s ability to perform complex tasks in the workplace.

Leading models score below 25% accuracy in simulations of work relating to law, investment banking and consulting.

“We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools,” says the paper’s submission on arXiv, Cornell University’s preprint server.

The report suggests that models from every major AI lab failed the test when queried by real professionals from these industry verticals. Even the best models struggled to get the questions right more than a quarter of the time; in most cases, the model gave the wrong answer or, worse still, no answer at all.

Details provided about the new benchmark suggest that it differs from previous evaluations in one key respect: it simulates professional workplaces rather than administering narrowly scoped tests. In doing so, it showed that foundation models struggle with integrated, multi-domain reasoning.

What made AI models perform poorly?

Mercor CEO Brendan Foody was quoted by TechCrunch as saying that the models’ biggest stumbling block was tracking information across multiple domains, something integral to most of the knowledge work performed by humans.

“One of the big changes in this benchmark is that we built out the entire environment, modeled after real professional services. The way we do our jobs isn’t with one individual giving us all the context in one place. In real life, you’re operating across Slack and Google Drive and all these other tools,” Foody was quoted as saying.

The APEX-Agents benchmark could well be the start of newer ways to probe AI models for specific weaknesses. Going forward, similar benchmarks could be built from queries posed by professionals in other fields to test whether AI agents can navigate such complex information landscapes.

The higher the specificity, the poorer the response

Mercor has posted some of the queries and the expected response standards on Hugging Face. The queries are clearly very specific to particular situations that have arisen in practice.
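For readers who want to inspect those queries directly, a minimal sketch using the Hugging Face datasets library might look like the following. The dataset identifier and split used here are assumptions for illustration, not the confirmed names of Mercor’s release; check the company’s Hugging Face page for the actual identifiers.

```python
# Minimal sketch: browse the released APEX-Agents queries on Hugging Face.
# NOTE: the dataset ID "mercor-ai/apex-agents" and the "train" split are
# assumptions; substitute the real identifiers from Mercor's Hugging Face page.
from datasets import load_dataset

dataset = load_dataset("mercor-ai/apex-agents", split="train")

# Print a handful of records to see the task prompts and expected response standards.
for example in dataset.select(range(3)):
    print(example)
```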

TechCrunch refers to one such query involving a production outage and an engineering team’s response to it. While the correct answer is stated as a simple yes, an AI agent cannot get there without an in-depth assessment of the company’s policies and the guidelines set by a country’s or region’s laws.

According to Foody, this level of complexity is enough to befuddle even an expert on the job, and there is no way an LLM could reliably respond to such a query at this stage of its development as a machine-learnt assistant.

“I think this is probably the most important topic in the economy. The benchmark is very reflective of the real work that these people do,” says Foody.

This statement is significant at a time when most AI giants are trying to convince users, as well as their large investors, that agentic AI is up and running. Last September, OpenAI published a blog post titled “Measuring the performance of our models on real-world tasks” introducing its GDPval benchmark.

However, while GDPval tested general knowledge across professions, Foody and his team have created a benchmark that measures a system’s capabilities on tasks encompassing complex queries across multiple areas of a single vertical.

What this tells us, for the moment, is whether agentic AI can automate tasks in these high-value professions. As of now, the answer is a resounding no.

According to information provided by Mercor, Gemini 3 Flash did best with 24% single-shot accuracy, followed by GPT-5.2 at 23%, with the others, including Opus 4.5, Gemini 3 Pro and GPT-5, scoring around 18%. These numbers suggest that, for now, these LLMs still have more to learn.

Source – https://cxotoday.com/ai/new-benchmark-casts-doubts-on-agentic-ais-workplace-readiness/
