Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic

By AI Maestro June 1, 2026 10 min read

Guides have aided humanity throughout history. Early civilizations learned to navigate vast distances on land and at sea by the sun and the moon. Over time, repeated journeys produced maps that enabled better planning and faster travel to known destinations. Centuries later, the compass gave seafarers greater accuracy in seeking unexplored destinations. Today, GPS navigation apps guide our every journey. In the current era of agentic AI, AI agents have the potential to enable scalable AI adoption and transform industries as we know them. Realizing this potential, however, requires an intelligent guide: agent logic, which drives high agent quality, cost-effectiveness, and, in turn, end-user trust.

Enterprise Workflows & Use Cases

Numerous studies have cited the overwhelming failure of AI pilots, while others have highlighted the need for AI to operate at the core of enterprise workflows to enable scalable adoption. [1] [2] To better understand this phenomenon and the associated assertion, some analysis of enterprise workflows is required. These workflows are:

A. Dynamic and long-running
B. Rich in APIs, databases and services
C. Often constrained by business policies and/or regulations

For an agent to function effectively under these characteristics, it naturally needs an expanded model context, which state-of-the-art frontier LLMs certainly provide, but at what tradeoff? More hallucinations and higher token consumption? Further, can LLMs be equipped with an intelligent guide, a GPS of sorts, to enable agentic AI execution at the core of the workflow and drive more desirable outcomes? We tested these hypotheses by designing and building agents, equipped with pertinent agent logic, for IBM offerings that fully reflect the above characteristics. These offerings address some of the most challenging tasks confronting the subject matter experts who own various stages of the enterprise software delivery lifecycle for mission-critical workloads, including:

  • Understanding applications written in legacy code (COBOL / PL/I)
  • Expediting test generation for developers
  • Proactively responding to incidents and enabling shift-left app resiliency
  • Automating compliance modernization for critical environments

Before examining each of these domains in detail, let us define what characterizes agent logic. Agent logic consists of software primitives, such as knowledge graphs, algorithms and program analysis libraries, that operate at the agentic layer (within an agent harness) and intentionally steer the LLM in the direction of the enterprise workflow, reducing the context space. In so doing, they tend to drive more performant outcomes in a more cost-effective manner. Let us now examine how agent logic achieves such outcomes in each of the above four domains.

  • Understanding applications written in legacy code (COBOL / PL/I) – program analysis. [3]

IBM watsonx Code Assistant for Z (WCA4Z), used to accelerate mainframe application development and modernization with AI and automation, is equipped with an App Insights agent for application understanding, one of the primary focus areas of enterprise clients running mission-critical workloads on IBM mainframes. This agent leverages deep static analysis across the application and stores a pre-indexed representation in a database schema that spans hundreds of interrelated tables with complex semantics. The agent can therefore retrieve precise, structured, already available information, which improves answer accuracy, reduces token usage, and minimizes back-and-forth interactions with the language model (Mistral Medium 250B in this instance). Applied to multiple mission-critical legacy systems (up to 1M lines of code and 1K programs), this approach delivers marginally superior app understanding performance with ~30× lower token consumption than a baseline frontier LLM-only approach.
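To make the idea of retrieving pre-indexed facts concrete, here is a minimal sketch in Python. It is not the WCA4Z schema or API: the two tables, the program names and the helper functions are all hypothetical, and a real analysis store would be far larger. The point is only that a structured query can replace feeding raw source code to the model.

```python
import sqlite3


def build_index() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for a pre-indexed static analysis store.

    Table and column names are hypothetical and chosen for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE programs (name TEXT PRIMARY KEY, language TEXT, loc INTEGER);
        CREATE TABLE calls (caller TEXT, callee TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO programs VALUES (?, ?, ?)",
        [("PAYROLL", "COBOL", 4200), ("TAXCALC", "COBOL", 1800), ("RPTGEN", "PL/I", 2500)],
    )
    conn.executemany(
        "INSERT INTO calls VALUES (?, ?)",
        [("PAYROLL", "TAXCALC"), ("RPTGEN", "TAXCALC"), ("PAYROLL", "RPTGEN")],
    )
    return conn


def callers_of(conn: sqlite3.Connection, program: str) -> list[str]:
    """Return the programs that call `program`, sorted by name."""
    rows = conn.execute(
        "SELECT caller FROM calls WHERE callee = ? ORDER BY caller", (program,)
    ).fetchall()
    return [row[0] for row in rows]


def build_context(conn: sqlite3.Connection, program: str) -> str:
    """Assemble a compact, structured context line to place in the LLM prompt."""
    language, loc = conn.execute(
        "SELECT language, loc FROM programs WHERE name = ?", (program,)
    ).fetchone()
    callers = ", ".join(callers_of(conn, program)) or "none"
    return f"{program} ({language}, {loc} LOC) is called by: {callers}"


if __name__ == "__main__":
    index = build_index()
    print(build_context(index, "TAXCALC"))
```

In this sketch the model would receive a single line of verified facts about TAXCALC instead of thousands of lines of source, which is the intuition behind the token savings described above.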

  • Expediting test generation for developers with Aster – program analysis. [4], [5]

Aster is an IBM proprietary library, based on program analysis and data pre- and post-processing, used for agent-based generation of unit, integration, API and change-based tests. Based on analysis of multiple developer communities, it achieves higher developer ratings than various open-source tools or developer-written tests. Given these ratings, and superior line, branch and method coverage benchmarks compared with similar open-source tools (integration tests) and zero-shot LLMs and coding agents (unit tests), all tested on open-source applications, we have been running Aster in pre-production mode on 75+ Java IBM CIO applications (up to 560+ classes and 67K+ lines of code) with the Devstral 24B model. Steady-state results to date show a +20% – 45% improvement in line, branch and method coverage, along with superior performance on a subset of these apps compared with a state-of-the-art coding agent, at up to 15× lower token consumption. The rationale for these results is that the program analysis output (used to prompt and “focus” the LLM), coupled with sub-agents for augmenting coverage and remediating runtime and compilation errors, enables a more performant outcome with significant cost reduction.
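The way program analysis output can “focus” an LLM can be sketched as follows. This is not Aster, which is proprietary and targets Java applications in the deployment described above; it is a toy stand-in that uses Python's standard `ast` module on a made-up function. The sample source, the fact fields and the prompt wording are all assumptions for illustration.

```python
import ast

# Hypothetical code under test, used only to demonstrate the analysis.
SOURCE = '''
def apply_discount(price: float, rate: float) -> float:
    if rate < 0 or rate > 1:
        raise ValueError("rate must be in [0, 1]")
    return price * (1 - rate)
'''


def analyze(source: str) -> list[dict]:
    """Extract name, parameters, branch count and raised exceptions per function."""
    facts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # Count simple branch points (if / for / while) inside the function.
            branches = sum(
                isinstance(n, (ast.If, ast.For, ast.While)) for n in ast.walk(node)
            )
            # Collect exception names from statements of the form `raise Name(...)`.
            raises = sorted(
                {
                    n.exc.func.id
                    for n in ast.walk(node)
                    if isinstance(n, ast.Raise)
                    and isinstance(n.exc, ast.Call)
                    and isinstance(n.exc.func, ast.Name)
                }
            )
            facts.append(
                {
                    "name": node.name,
                    "params": [a.arg for a in node.args.args],
                    "branches": branches,
                    "raises": raises,
                }
            )
    return facts


def focused_prompt(fact: dict) -> str:
    """Turn analysis facts into a narrow test-generation prompt (wording is illustrative)."""
    params = ", ".join(fact["params"])
    raises = ", ".join(fact["raises"]) or "none"
    return (
        f"Write unit tests for {fact['name']}({params}). "
        f"Cover all {fact['branches']} branch point(s) and these exceptions: {raises}."
    )


if __name__ == "__main__":
    for fact in analyze(SOURCE):
        print(focused_prompt(fact))
```

The prompt carries only the facts the model needs to target coverage, rather than the whole code base. A feedback loop that feeds compilation or runtime errors back into a follow-up prompt would sit on top of this and is omitted here.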

  • Proactively responding to incidents and enabling shift-left app resiliency – knowledge graphs, program analysis libraries and investigation (observability)-driven orchestration. [6], [7]

While the LLM context for the app-related use cases described in the first two domains is “restricted” to the app source code, runtime management of apps on deployed infrastructure brings the underlying full IT stack into play. Here we define a knowledge graph (KG) encompassing entities (microservices, database/middleware services, MELT, etc.) coupled with embedded (“tribal”) knowledge from domain experts. With such a graph, and by bounding the LLM to local reasoning for non-deterministic outcomes, an observability-driven approach achieves a reduced context space spanning the IT stack and the underlying app source code (if relevant) for incident root cause analysis (and other use cases). With this approach, leveraging the equivalent Instana data model, we have seen the proprietary Instana “I3” (intelligent incident investigation [8]) agent achieve up to a 4.0× improvement over a ReAct agent with GPT-5.1, as measured using ITBench [9]. With Gemini 3 Flash, the ReAct agent’s performance improves to within 17% of the I3 agent’s while consuming 1.6× more tokens. We have extended this approach to source code with agents for code analysis (leveraging program dependency graphs) and bug remediation (leveraging inference scaling), also tested on ITBench. The source code analysis and bug remediation agents (Gemini 2.5 Flash) outperform a state-of-the-art coding agent both in finding the culpable microservice (3.0×) and in bug repair (1.6×), while consuming 3.7× and 5.9× fewer tokens, respectively. This multi-agent system was announced at IBM Think as part of the newly unveiled IBM Concert Platform for shift-left IT Operations and is also being piloted internally with IBM CIO. [10]
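One simple way to picture how a knowledge graph bounds the model's context is a hop-limited traversal from the entity that raised an alert. The sketch below is an assumption-laden illustration, not the I3 agent or the Instana data model: the topology, the entity names and the `max_hops` parameter are hypothetical.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "cart-db"],
    "catalog": ["catalog-db"],
    "payments": ["payments-db"],
    "cart-db": [],
    "catalog-db": [],
    "payments-db": [],
}


def local_context(graph: dict[str, list[str]], start: str, max_hops: int) -> list[str]:
    """Breadth-first search returning entities within `max_hops` of `start`.

    Only these entities (and their telemetry) would be handed to the LLM,
    instead of the full IT stack.
    """
    hops = {start: 0}  # insertion order doubles as visit order
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for dep in graph.get(node, []):
            if dep not in hops:
                hops[dep] = hops[node] + 1
                queue.append(dep)
    return list(hops)


if __name__ == "__main__":
    # An alert on "checkout": reason only over its immediate neighborhood first.
    print(local_context(DEPENDS_ON, "checkout", max_hops=1))
```

If the one-hop neighborhood does not explain the incident, an orchestrator could widen the budget step by step, so the context grows only as far as the investigation requires.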

  • Automating IT compliance modernization for critical environments – algorithms and adaptive planning and orchestration. [11]

Enterprises face increasingly complex and fragmented compliance requirements, forcing teams to spend considerable time manually creating controls, assessments and remediation plans. No centralized knowledge exists and fixes are written manually, which introduces a risk of errors and security gaps. Because compliance work is complex and multi-step, it requires coordinated, policy-driven automation across specialized agents rather than manual effort or simple AI prompts. Our multi-agent system automates compliance by algorithmically decomposing complex tasks into coordinated steps, using adaptive planning, dynamic decomposition and workflow sequencing with continuous feedback to iteratively identify fixes and expand assessments. As measured using ITBench, it is 1.3 – 2.0× more performant than prior agents (Claude 4 Sonnet) that use fixed planning strategies. This approach transforms compliance into a continuously guided, self-correcting process and dramatically improves outcomes, especially in complex scenarios, boosting success rates from single digits to as high as +80% (Claude 4 Sonnet). This multi-agent system and 16K+ digitized controls mappings were unveiled as part of IBM Sovereign Core at IBM Think, integrated with monitoring and drift detection and providing automated evidence generation, so that audit evidence stays securely within customer control. [12]
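The contrast between fixed and adaptive planning can be illustrated with a small control loop. This is a sketch under stated assumptions rather than the system described above: the step names, the `execute` and `replan` callbacks and the `max_steps` budget are hypothetical. A fixed plan would stop at the first failure, whereas this loop decomposes a failed step and continues.

```python
from typing import Callable


def run_adaptive(
    plan: list[str],
    execute: Callable[[str], bool],
    replan: Callable[[str], list[str]],
    max_steps: int = 20,
) -> list[str]:
    """Run steps in order; when a step fails, splice in its replacement sub-steps.

    Returns the steps that completed successfully. `max_steps` bounds the loop
    in case re-planning keeps producing steps that fail.
    """
    pending = list(plan)
    completed: list[str] = []
    steps_taken = 0
    while pending and steps_taken < max_steps:
        step = pending.pop(0)
        steps_taken += 1
        if execute(step):
            completed.append(step)
        else:
            # Feedback from the failure drives dynamic decomposition.
            pending = replan(step) + pending
    return completed


if __name__ == "__main__":
    # Hypothetical compliance task: the coarse "assess control" step is too
    # broad to succeed and must be decomposed into narrower assessments.
    def execute(step: str) -> bool:
        return step != "assess control"

    def replan(step: str) -> list[str]:
        return [f"{step}: encryption", f"{step}: access"]

    print(run_adaptive(["collect evidence", "assess control", "report"], execute, replan))
```

In a real agent, `execute` would call tools and `replan` would typically consult the LLM with the failure details, but the control flow stays in ordinary code.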

The above examples illustrate the impact of agent logic in reducing LLM context and guiding the LLM through the core of the workflow in a highly performant and cost-effective manner. Additionally, we have applied similar approaches in two case studies: one with a configurable generalist agent and runtime (CUGA) in the healthcare domain, and another on condition-based maintenance of physical assets with IBM Global Real Estate.

Domain Case Studies

Case Study 1: Configurable Generalist Agent (CUGA) Healthcare benchmark – algorithmic policy enforcement. [13]

This health insurance customer care example is a compact illustration of why agentic systems outperform LLM-only conversational models in regulated environments. CUGA’s policy system implements policy-as-code for agent governance, enforced at runtime independent of model prompts and without fine-tuning. Our experiments show that the agent’s policy system closes large gaps in task correctness by enforcing structured workflows, safe intent handling, reliable tool usage, and controlled output formatting across all model families tested (Claude Opus 4.5, GPT OSS 120B and GPT-4.1), with accuracy improvements ranging from 15% to 26%. Authority is enforced through least-privilege disclosure, explicit compliance rules, and human escalation paths. Intelligent actions are proposed by the agent, while authority is exercised by policy and oversight mechanisms. Reasoning is autonomous; decision rights are constrained. CUGA is also a key component in the IBM Think Sovereign Core launch.
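The separation between proposing an action and authorizing it can be sketched as a runtime check that sits outside the prompt. The code below is not CUGA's policy system or its API: the roles, tool names and rule tables are hypothetical, and only the general pattern (least privilege, explicit rules, human escalation) comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    """An action proposed by the model; nothing runs until policy allows it."""
    tool: str
    role: str
    args: dict = field(default_factory=dict)


# Hypothetical policy tables, expressed as plain data so they can be reviewed
# and versioned independently of any prompt.
ALLOWED_TOOLS = {
    "member": {"get_plan_summary"},
    "supervisor": {"get_plan_summary", "get_claim_details"},
}
ESCALATE_TOOLS = {"approve_claim"}


def enforce(call: ToolCall) -> str:
    """Return 'allow', 'deny' or 'escalate' for a proposed tool call."""
    if call.tool in ESCALATE_TOOLS:
        return "escalate"  # always routed to a human, whatever the role
    if call.tool in ALLOWED_TOOLS.get(call.role, set()):
        return "allow"
    return "deny"  # least privilege: anything not listed is refused


if __name__ == "__main__":
    proposals = [
        ToolCall("get_plan_summary", "member"),
        ToolCall("get_claim_details", "member"),
        ToolCall("approve_claim", "supervisor"),
    ]
    for proposal in proposals:
        print(proposal.tool, proposal.role, "->", enforce(proposal))
```

Because the decision is made in code, it holds regardless of which model proposed the action or how the prompt was worded.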

Case Study 2: Condition-based Maintenance of Physical Assets for IBM Global Real Estate – directed acyclic graph. [14],[15]

Enterprise maintenance systems collect copious amounts of asset data but are unable to combine them effectively, leaving experts to manually piece together fragmented signals and make decisions without unified, evidence-based insights. Our recently launched Maximo Condition Insights [16] agent analyzes large-scale asset data across thousands of assets and locations (sensors, work orders, failure modes and events analysis), using structured evidence and validation loops to reliably identify issues, prioritize actions and support decision-making with consistent, traceable insights. We have piloted this agent (using GPT OSS 120B) internally with IBM Global Real Estate (GRE), reducing asset analysis time from 15–20 minutes to 15–30 seconds (a 97% improvement) and increasing asset review coverage from ~1% to ~30% across more than 120 sites and 6K physical assets. On AssetOpsBench, the Condition Insights agent reduced unsupported claims by 57%, cut verbosity by 35%, improved rule compliance by 30%, maintained near-zero contradictions, and lowered token usage by 77% on average, while slightly increasing diagnostic specificity. Equipped with a directed acyclic graph, the agent supplies structural engineering and operational context that reduces the unsupported reasoning seen under naive prompting, while constraint-aware prompting markedly improves rule adherence, reduces verbosity, and lowers overall token consumption without introducing instability.
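As a rough illustration of how a directed acyclic graph can keep conclusions tied to evidence, the sketch below orders hypothetical analysis stages so that a diagnosis is produced only after every input it depends on. The node names are assumptions, and the actual graph used by the Condition Insights agent may represent something different, such as asset structure. The ordering uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical evidence graph: each node maps to the nodes it depends on.
EVIDENCE_DAG = {
    "sensor_readings": set(),
    "work_orders": set(),
    "failure_modes": set(),
    "anomaly_evidence": {"sensor_readings"},
    "maintenance_history": {"work_orders"},
    "diagnosis": {"anomaly_evidence", "maintenance_history", "failure_modes"},
    "recommendation": {"diagnosis"},
}


def evaluation_order(dag: dict[str, set[str]]) -> list[str]:
    """Return an order in which every node follows all of its dependencies.

    TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


def respects_dependencies(order: list[str], dag: dict[str, set[str]]) -> bool:
    """Check that each node appears after everything it depends on."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[dep] < position[node] for node, deps in dag.items() for dep in deps)


if __name__ == "__main__":
    order = evaluation_order(EVIDENCE_DAG)
    print(order)
    print(respects_dependencies(order, EVIDENCE_DAG))
```

An agent that follows such an order cannot issue a recommendation before the supporting evidence nodes have been computed, which is one plausible mechanism for reducing unsupported claims.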

Summary and References

We have benefited from guides for centuries, which have simplified and enhanced our lives. As technology has evolved, so have the guides we use, enabling us to do more and further shrink our global village. With the arrival of this agentic AI era, as we seek to further enhance society in part through economies of scale, we should continue this trend and fully leverage agent logic to simplify model context and intelligently traverse enterprise workflows at the core; only then will scalable adoption at optimal operating costs be truly feasible.

[1] The GenAI Divide: STATE OF AI IN BUSINESS 2025, MIT study, https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf

[2] From AI projects to profits: How agentic AI can sustain financial returns, IBM IBV report, https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/agentic-ai-profits

[3] Understand, IBM watsonx Code Assistant for Z, Feb 27, 2026, https://www.ibm.com/docs/en/watsonx/watsonx-code-assistant-4z/2.x?topic=understand

[4] R. Pan, R. Krishna, R. Pavuluri, et al., ASTER: Natural and multi-language unit test generation with LLMs – IBM Research, Apr 30, 2025, https://research.ibm.com/blog/aster-llm-unit-testing

[5] R. Pan, R. Pavuluri, R. Huang, et al., SAINT: Service-level Integration Test Generation with Program Analysis and LLM-based Agents, Nov 17, 2025, https://arxiv.org/abs/2511.13305

[6] S. Jha, R. Arora, Bhavya, et al., Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation, Jan 25, 2026, https://arxiv.org/abs/2601.17915

[7] S. Cui, R. Krishna, S. Jha, et al., Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications, Dec 26, 2025, https://arxiv.org/html/2512.22113v1

[8] IBM Instana and Intelligent Incident Investigation agent, Use agentic AI to resolve incidents faster with IBM Instana Intelligent Incident Investigation

[9] S. Jha, R. Arora, Y. Watanabe, et al., ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks, Feb 7, 2025, https://arxiv.org/abs/2502.05352

[10] IBM Concert platform https://www.ibm.com/new/announcements/from-insight-to-action-closing-the-gap-in-modern-it-operations

[11] Y. Watanabe, T. Yanagawa, H. Kitahara, A. Sailer, IT Compliance Automation with GenAI CISO Assessment Agent, DZone Tutorial, Dec 12, 2025, https://dzone.com/articles/itbench-part-3-it-compliance-automation-with-genai

[12] IBM Sovereign Core https://newsroom.ibm.com/2026-05-05-think-2026-ibm-makes-digital-sovereignty-operational-with-general-availability-of-ibm-sovereign-core

[13] S. Shlomov, A. Oved, S. Marreed, et al., From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production, Dec 9, 2025, https://arxiv.org/pdf/2510.23856

[14] D. Patel, S. Lin, J. Rayfield, et al., AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance, Jun 4, 2025, https://arxiv.org/abs/2506.03828

[15] F. O’Donncha, N. Zhou, N. Martinez, et al., Evidence-Driven Reasoning for Industrial Maintenance Using Heterogeneous Data, https://arxiv.org/abs/2603.08171

[16] IBM Maximo and Condition Insights agent https://www.ibm.com/new/announcements/maximo-condition-insight

