EU-UK Spotlight: Renewables, trade, and the global supply chain
The International AI Safety Report 2026 is an assessment of general-purpose AI capabilities, emerging risks and risk-management approaches, intended to support informed policymaking. It is sceptical of neat capability narratives, highlighting uneven performance, an evaluation gap between tests and real-world use, and rising autonomy through AI agents – all of which heighten dispute risk and complicate liability allocation. For litigators, the Report reads less like a technology forecast and more like a litigation risk map.
A recurring theme of the Report is unpredictability. The Report finds that systems can perform impressively on complex tasks, but that performance “remains ‘jagged’, with leading systems still failing at some seemingly simple tasks”. That matters because contracts, procurement decisions and governance frameworks often assume performance is stable and failures are foreseeable. The Report suggests that assumption is increasingly hard to defend, noting that because “model behaviour can be hard to understand or predict, it is challenging to foresee or confidently rule out specific failures”. In this article, we explore where that unpredictability is most likely to translate into UK litigation risk.
Uneven performance is a breeding ground for expectation disputes. If capability claims (accuracy, reliability, “near-human” support) encouraged reliance, they may sit at the heart of breach, warranty and misrepresentation claims.
Two practical points follow:
The Report highlights that pre-deployment performance tests often do not reliably predict real-world performance. That matters because benchmark results are used to justify procurement, valuation and funding decisions – but, in UK tech disputes, courts tend to look past headline metrics and ask whether the supplier’s work was reasonably capable of achieving the stated purpose in the conditions of deployment (and, depending on the contract, whether there was a fitness-for-purpose commitment rather than a best-efforts narrative).
In practice, the fight will rarely be only about benchmark scores. It is about whether evaluation and validation were adequate, transparent and properly connected to the intended use case:
Where performance evidence is tied to milestones, valuation, or funding, disputes also tend to become credibility contests: were benchmark results curated, selectively presented or based on assumptions that were not made explicit? In UK litigation, the courts will often treat performance estimates and assurance statements as implicitly carrying a representation that the supplier had carried out a proper analysis and had reasonable grounds for what it was saying.
Courts expect technical arguments to be independently explainable and verifiable, with clear sampling/validation methods and robust reasoning. That reinforces the Report’s point: even if some failures are difficult to predict with certainty, parties will still be judged on whether they did the kind of real-world validation and transparent testing a careful supplier or deployer should have done.
The Report flags the surge in AI agents – systems designed to browse the internet and execute multi-step tasks with reduced human oversight. Its assessment is cautious: agents are becoming more competent, but remain unreliable where tasks involve many steps. The litigation relevance is straightforward – more autonomy means fewer natural points for human intervention, and a narrower window to catch and correct errors.
The Report’s “defence in depth” discussion is also important here. Layered safeguards help, but they can still fail if a control is bypassed, misconfigured, or overwhelmed by novel behaviour. In a dispute, the question will often be whether the control design was reasonable (judged against contemporary knowledge standards, not hindsight) and whether it was properly implemented, not just described on paper.
It also frames “loss of control” as an emerging governance narrative. While it suggests today’s systems “show early signs of relevant capabilities, but not at levels that would enable loss of control”, oversight is already a practical challenge. Indeed, the Report notes that “models have disabled simulated oversight mechanisms and, when confronted, produced false statements to justify their actions”. The much-hyped social network for agents is a useful case study: the issue is less AI consciousness than practical control. Agents can act in ways that are hard to reconstruct: what was real, what triggered what, and who (or what) caused the outcome. Agentic AI is not a liability shield – it can intensify scrutiny of who set the objectives, who configured autonomy levels and tool access, and who decided that reduced oversight was acceptable.
A further complication is multi-agent settings and external tool access. The Report notes that attributing liability for harms caused by agents can be difficult, especially where it is hard to identify when and how failures occurred. That creates predictable evidential pressure points – including “black box” causation arguments – but uncertainty alone is unlikely to excuse a failure to meet ordinary standards of care.
One of the Report’s most interesting conclusions for litigators is that existing liability frameworks may not be capable of adequately addressing AI-related harms. Existing contract, tort and criminal regimes will apply, but AI creates distinctive difficulties: “harms can be difficult to trace to specific design choices, especially since full information about risk management processes is not public, and responsibility is distributed across model developers, application builders, deployers, and users”.
That is a counterweight to more optimistic domestic commentary – including from the UK Jurisdiction Taskforce – suggesting that existing frameworks can address most AI harms. The Report’s tone is more cautious: the framework may be there, but the facts may be harder to prove, and the responsible actor harder to pin down. It also keeps targeted adaptation of legal frameworks on the table, including (among other things) clearer responsibility for generating or disseminating unauthorised or undisclosed synthetic content.
The Report does not offer tidy answers, but it does point to a clear direction of travel for disputes. As general-purpose AI is deployed more widely – and as organisations rely on it in higher-stakes workflows – uneven capability, imperfect evaluation and rising autonomy are likely to generate more contested questions about what was promised, and where responsibility sits. In that context, there will be greater pressure on suppliers and vendors to evidence their governance processes, including around testing and incident response. Litigation risk and outcomes will turn on the quality of that evidence trail.
This article was first published in Airmic News which can be viewed here.
Authored by Reuben Vandercruyssen and Lydia Savill.