It is 2am and a junior associate is sitting in a virtual data room with PDFs, Word documents, and spreadsheets open. The task is specific: cross-reference an environmental remediation claim on page 42 of the CIM against a footnote buried in a third-party technical report. She finds the footnote. The numbers do not match. She flags it, writes three sentences of analysis, and moves on to the next document.

Now consider the same task decomposed across a multi-agent AI system. One agent classifies the document type. A second extracts the environmental claim with page-level citations. A third pulls the footnote from the technical report and cross-references the figures. A fourth writes the analysis, noting the discrepancy.

The system is not smarter than the associate. It is organized differently.

1. The black box problem

Most people evaluating AI systems treat them as black boxes. They compare outputs: this system produced a better summary, that one missed a risk. But they never open the box. They cannot distinguish a carefully engineered architecture from a single model wrapped in a nice interface, and that distinction determines whether the system works reliably on the 50th document or only on the demo. This article is our attempt to open the box.

Deterministic workflow vs. autonomous agent: in a workflow the sequence is fixed and AI operates within each step; an autonomous agent decides what to do next, with no checkpoints.

2. One model vs. twenty specialists

No PE or advisory firm would assign a single junior analyst to run an entire due diligence process alone. You would not ask the same person to screen the target, build the financial model, review the environmental reports, assess the management team, and write the IC memo. You would build a deal team, where each person handles a defined scope with clear deliverables feeding into the next stage.

Agent systems work on the same principle. Instead of sending one large language model a massive prompt containing every document and every instruction at once, you split the work into focused tasks. One agent reads financial statements. Another evaluates ESG claims. A third cross-references management assertions against source data. Each one receives only the context it needs for its specific job. The results are measurable. According to Anthropic's engineering team, their multi-agent research system outperformed a single-agent approach by 90.2% on internal quality evaluations. The improvement did not come from using a better model. It came from giving each agent a focused job and clean context.
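The decomposition can be sketched in a few lines. This is a minimal illustration, not a real implementation: the `call_model` stub stands in for an actual LLM call, and every function name is hypothetical. The point is structural: each agent gets a narrow prompt and only the slice of context relevant to its job.

```python
# Sketch of task decomposition: each "agent" is a focused prompt plus only
# the context it needs. call_model and all function names are illustrative.

def call_model(prompt: str, context: str) -> str:
    # Stand-in for a real LLM call; returns a placeholder answer.
    return f"analysis of {len(context)} chars of context"

def classify_document(doc: str) -> str:
    # Only needs the opening pages to decide the document type.
    return call_model("Classify this document type.", doc[:2000])

def extract_claim(section: str) -> str:
    # Receives only the relevant section, not the full data room.
    return call_model("Extract the environmental claim with page citations.", section)

def cross_reference(claim: str, report_excerpt: str) -> str:
    # Receives exactly two things to compare, nothing else.
    return call_model("Do these figures match?", claim + "\n" + report_excerpt)

# The orchestrator routes narrow slices of context between agents.
doc_type = classify_document("CIM ... page 42 ... remediation ...")
claim = extract_claim("page 42: remediation cost EUR 1.8M")
finding = cross_reference(claim, "footnote 12: remediation cost EUR 2.4M")
```

Each handoff is an explicit variable, which is what makes the "focused job, clean context" discipline enforceable rather than aspirational.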

This connects to a broader principle that is becoming clearer as these systems mature: context quality matters more than context quantity. A focused 500-token input to an agent that knows exactly what to look for consistently outperforms a messy 113,000-token input that contains everything but prioritizes nothing. Less information, properly structured, beats more information thrown at the problem.

3. Workflows, not autonomous agents

What's inside the box — agent architecture

According to LangChain's State of AI Agents survey of 1,300 practitioners, 57% of respondents now run AI agents in production, but quality remains the number one barrier, cited by 32% of respondents. Behind that quality gap sits the most misunderstood distinction in the entire space. Most people hear "AI agents" and picture autonomous systems making their own decisions about what to do next. The reality in production is almost the opposite: the best systems are deterministic workflows that use AI at specific, well-defined steps.

LangChain AI Agents survey visualization

Think about how a due diligence process actually runs. The sequence is fixed: initial screening, document extraction, financial analysis, risk assessment, report generation. Nobody debates the order. What varies is the judgment applied within each step. The screening criteria depend on the sector. The financial analysis depends on the business model. The risk assessment depends on what the extraction surfaced. Agent systems mirror this structure. The general pipeline is fixed. Step one always feeds step two, step two always feeds step three. The AI operates within each step, applying judgment to the specific inputs it receives, but the workflow itself, the sequence and the routing, is deterministic code that a human designed and can inspect.
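The shape of such a pipeline fits in a few lines. In this sketch the stage logic is stubbed out and every name is illustrative; what matters is that the stage order is plain code, each stage applies judgment only to what it receives, and every handoff is snapshotted so a bad output can be traced to a single stage.

```python
# Minimal sketch of a deterministic pipeline: fixed stage order, judgment
# inside each stage (stubbed here), and a checkpoint after every handoff.
# All stage names and fields are illustrative.

from typing import Callable

def screen(deal: dict) -> dict:
    return {**deal, "screened": True}

def extract(deal: dict) -> dict:
    return {**deal, "figures": {"revenue_eur_m": 12.0}}

def assess_risk(deal: dict) -> dict:
    return {**deal, "risk": "medium"}

# The sequence is data a human wrote and can inspect, not a model's decision.
PIPELINE: list[Callable[[dict], dict]] = [screen, extract, assess_risk]

def run(deal: dict) -> tuple[dict, list[dict]]:
    checkpoints = []  # snapshot after every stage, for debugging
    for stage in PIPELINE:
        deal = stage(deal)
        checkpoints.append({"stage": stage.__name__, "state": dict(deal)})
    return deal, checkpoints

result, trail = run({"target": "ExampleCo"})
# If the risk assessment looks wrong, inspect trail[1] to see exactly
# what assess_risk received, and re-run only that stage.
```

The checkpoint list is the debuggability argument made concrete: a questionable output at stage three points you at one stage's input, not at an opaque reasoning chain.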

The practical implication is reliability: when step three produces a questionable risk assessment, you do not need to re-run the entire pipeline from scratch. You inspect what step three received, identify whether the issue was bad input or bad judgment, fix the specific problem, and re-run that one stage. Each handoff between stages becomes a natural checkpoint where you can verify that the right information made it through, which means the system is not just auditable in theory but debuggable in practice. Contrast this with an autonomous agent that decides its own path through the problem, where a wrong conclusion at the end gives you almost no way to trace back through the reasoning chain and find where it went off track.

But this does not mean the inside of each step is simple. Within a single step, an agent might call tools iteratively, revise its own output, or orchestrate sub-agents to handle different parts of the analysis. The top-level pipeline is fixed. The inside of each step can be as fluid and adaptive as the problem demands.
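What "fluid inside a fixed step" looks like can also be sketched. Here a single step loops over tool calls until it has covered its inputs, but the loop is bounded and the step's output shape is fixed; the tool lookup and the decision logic are stubbed, and all names are hypothetical.

```python
# Sketch of the fluidity inside one pipeline step: the agent iterates over
# tool calls as needed, but the iteration is bounded and the step still
# returns a fixed-shape result. Tool data and names are illustrative.

def lookup_benchmark(metric: str) -> float:
    # Stand-in for a real benchmark-database tool.
    return {"ebitda_margin": 0.18}.get(metric, 0.0)

def analyze_step(figures: dict, max_iterations: int = 5) -> dict:
    findings: dict = {}
    for _ in range(max_iterations):
        # The "agent" decides which metric still needs a benchmark check.
        missing = [m for m in figures if m not in findings]
        if not missing:
            break  # step complete; hand off a fixed-shape output
        metric = missing[0]
        findings[metric] = figures[metric] - lookup_benchmark(metric)
    return findings

gap = analyze_step({"ebitda_margin": 0.21})
```

From the pipeline's perspective this step is a black box with a stable contract; internally it can call tools as many times as the document demands, up to the bound.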

The art of building these systems lies in finding the right boundary between the two. Too much determinism and the system becomes brittle: it cannot handle a document format it has never seen. Too much autonomy and the system becomes unpredictable: it produces different analyses from the same inputs on different runs.

The best architectures fix the process at the level where reliability matters and let the AI operate freely at the level where judgment matters. Getting that boundary right is most of the engineering work.

Anthropic's own engineering guidance captures this well: start simple, add complexity only when it demonstrably improves outcomes. In practice, this means most production systems are orchestrated workflows with AI components, not autonomous agents navigating open-ended problems.

4. Tools are the real product

An agent without tools is a text generator. It can reason, summarize, and produce fluent language, but it cannot check a number against a benchmark database, calculate a ratio from audited financials, or compare a claim against a regulatory filing. Tools are what connect the model's reasoning to actual data.

The design principle is straightforward: use deterministic tools for deterministic work, and reserve AI for judgment. A risk scoring matrix that applies fixed criteria to extracted data points does not need a language model. It needs a function that takes inputs and returns a score according to defined rules. The AI's job is the part that requires judgment: evaluating whether a management claim is adequately substantiated by the evidence, or whether a revenue projection is consistent with historical performance.
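A risk scoring matrix of that kind is a few lines of ordinary code. The thresholds below are illustrative, not real underwriting criteria; the point is that nothing here needs a model, so nothing here can hallucinate.

```python
# A risk scoring matrix as plain deterministic code: fixed criteria in,
# tier out. Thresholds are illustrative, not real underwriting rules.

def risk_tier(leverage: float, customer_concentration: float) -> str:
    score = 0
    if leverage > 4.0:
        score += 2          # highly levered
    elif leverage > 2.5:
        score += 1
    if customer_concentration > 0.30:
        score += 2          # top customer over 30% of revenue
    return {0: "low", 1: "low", 2: "medium"}.get(score, "high")

tier = risk_tier(3.1, 0.35)   # mid-band leverage + concentrated customers
```

The same inputs always produce the same tier, which is exactly the property you want in the deterministic half of the system.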

At Axion Lab, we built separate tools for evidence retrieval, benchmark comparison, and risk tiering, each with strict input/output contracts. The AI reasons, while tools ground that reasoning in data. This separation turns out to be where most of the engineering effort goes. According to Anthropic's engineering team, they spent more time optimizing their tool interfaces than the overall system prompt. That ratio makes sense once you see it in practice: a well-designed tool eliminates entire categories of errors that no amount of prompt refinement can fix.
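One way to picture a strict input/output contract is a tool whose arguments and result are both typed records. This sketch is a generic illustration of the pattern, not Axion Lab's actual schema; field names and the lookup table are invented.

```python
# Sketch of a tool with a strict input/output contract: the caller can only
# pass validated, typed arguments and always receives the same result shape.
# Field names and data are illustrative, not a real schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkQuery:
    metric: str
    sector: str

@dataclass(frozen=True)
class BenchmarkResult:
    metric: str
    peer_median: float
    sample_size: int

def benchmark_tool(query: BenchmarkQuery) -> BenchmarkResult:
    # Stand-in for a real benchmark-database lookup.
    table = {("ebitda_margin", "industrials"): (0.14, 87)}
    median, n = table.get((query.metric, query.sector), (0.0, 0))
    return BenchmarkResult(query.metric, median, n)

result = benchmark_tool(BenchmarkQuery("ebitda_margin", "industrials"))
```

A malformed call fails at the boundary instead of producing a plausible-looking wrong answer downstream, which is the category of error that prompt refinement cannot fix.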

5. How documents become knowledge

Everything discussed so far, the multi-agent architecture, the workflow design, the tool layer, rests on one assumption: that the documents fed into the system have been processed correctly. This is the invisible step that determines everything downstream, and it is where most systems fail quietly. Consider a concrete example: a target company reports EUR 12M in revenue. The revenue table is split across two pages of the PDF, and a pro forma adjustment footnote sits at the bottom of the second page. The footnote explains that EUR 3.2M of that revenue comes from a one-time licensing deal that will not recur. The system extracts the figure correctly but misses the context that makes it meaningful.

Technically sourced. Materially wrong.

The split-table problem

This is not a hypothetical edge case. Split tables, merged cells, footnotes that modify headline figures: these are the norm in financial documents, not the exception. A system that extracts the number without the qualifying context produces analysis that looks correct and is not. Every model downstream, every risk score, every comparison, inherits that error silently. The field calls the solution "semantic chunking," which means splitting documents into pieces that preserve meaning rather than just cutting at page boundaries. Studies show this approach improves retrieval accuracy by 50-70% over fixed-size splitting. But the label matters less than the outcome: when document processing is done well, the system knows that EUR 12M has a footnote attached. When it is done poorly, the system confidently builds an entire analysis on a number that tells half the story.
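The idea can be shown in miniature. This sketch contrasts fixed-size splitting with a semantic approach that merges a footnote block into the chunk carrying the figure it qualifies; the footnote-marker regex is a deliberately simple heuristic, not a production chunker.

```python
# Sketch of semantic chunking vs. fixed-size splitting: keep a footnote in
# the same chunk as the figure it qualifies instead of cutting at an
# arbitrary boundary. The footnote-marker heuristic is illustrative only.

import re

def fixed_size_chunks(text: str, size: int = 60) -> list[str]:
    # Naive splitting: may separate a figure from its qualifying footnote.
    return [text[i:i + size] for i in range(0, len(text), size)]

def semantic_chunks(text: str) -> list[str]:
    # Split on blank lines, then merge footnote blocks ("(1) ...") into the
    # preceding block so the qualifier travels with the headline figure.
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    merged: list[str] = []
    for block in blocks:
        if merged and re.match(r"^\(\d+\)", block):
            merged[-1] += "\n" + block
        else:
            merged.append(block)
    return merged

page = "Revenue: EUR 12.0M (1)\n\n(1) Includes EUR 3.2M one-time licensing deal."
chunks = semantic_chunks(page)
# The headline figure and its footnote now live in one retrievable chunk.
```

With the footnote in the same chunk, any downstream retrieval of the EUR 12M figure also surfaces the EUR 3.2M qualifier, which is the whole game.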

6. What this means

The associate at 2am and the agent system are doing the same job: reading documents, extracting claims, cross-referencing evidence, and writing analysis. The difference is that the system decomposes each step into something explicit, testable, and auditable. You can inspect what each agent received, what it produced, and whether the handoff to the next stage preserved the right information.

That transparency is the architecture's actual value. Not speed, though it is faster. Not cost, though it is cheaper. The value is that when something goes wrong, you can find it.

But architecture alone does not prevent the system from confidently making things up.