In January 2026, the Wall Street Journal reported that the United States military had used Anthropic's Claude AI system during Operation Absolute Resolve in Venezuela. The details were sparse, as military operational details tend to be, but the implications were immediate and profound. An advanced large language model, built by a private AI company in San Francisco, had been integrated into the decision-making architecture of a live military operation. The AI was reportedly used to analyze intelligence data, synthesize reports from multiple sources, and generate assessments that informed operational planning.

Let that sit for a moment. The same class of technology that writes marketing copy and summarizes legal briefs was deployed in a context where its outputs could influence targeting decisions, troop movements, and the application of lethal force. Whatever your view of the operation itself, the legal questions are staggering. And the legal system is not ready for them.

This article examines those questions through several lenses: the proposed Federal Rule of Evidence 707 and its application to machine-generated military intelligence, the Daubert framework for evaluating AI expert testimony in military contexts, chain of custody problems unique to AI-generated evidence, congressional oversight gaps, and the implications for international humanitarian law. Each of these areas presents novel challenges. Together, they represent a fundamental test of whether the rule of law can keep pace with the technology of war.

The Intelligence Problem: What Did the AI Actually Do?

To understand the evidentiary challenges, we first need to understand what military AI systems actually do when they are deployed in operational contexts. The reporting on Operation Absolute Resolve suggests that Claude was used in an intelligence analysis capacity, not as an autonomous weapons system making kill decisions, but as a tool that processed and synthesized information to support human decision-makers.

This distinction matters legally, but it is not as clean as it sounds. Modern military intelligence analysis involves ingesting enormous volumes of data from signals intelligence, imagery intelligence, human intelligence, open source intelligence, and other collection methods. An AI system tasked with synthesizing these inputs does not simply compile them. It weights them. It identifies patterns. It draws inferences. It generates assessments that carry an implicit confidence level. When a commander reads an AI-generated intelligence summary that identifies a particular building as a likely weapons cache, that assessment has already passed through multiple layers of algorithmic judgment.

The question is not whether a human made the final decision. The question is how much the AI shaped the information environment in which that decision was made. If the AI filtered out contradictory intelligence, emphasized certain patterns over others, or framed its assessment in language that conveyed greater certainty than the underlying data supported, then the AI was not merely a tool. It was a participant in the decision-making process.

An AI system that curates and synthesizes intelligence for a military commander is not a passive instrument. It is an active shaper of the decision space. The law has not yet reckoned with the difference.

FRE 707: The Proposed Rule That Became Urgent

The proposed Federal Rule of Evidence 707 addresses the admissibility of machine-generated evidence. While it has not yet been formally adopted, its framework has been influential in federal courts and has become the primary reference point for judges confronting AI-generated outputs in litigation. The rule proposes a structured inquiry into the reliability of machine-generated evidence, requiring proponents to demonstrate that the system operated correctly, that its methodology is scientifically valid, and that the specific output at issue was produced under conditions consistent with reliable operation.

In the military context, FRE 707 raises questions that its drafters almost certainly did not anticipate. Consider a scenario that is no longer hypothetical: a military operation results in civilian casualties, and the subsequent investigation, or litigation, requires examining the AI-generated intelligence that informed the targeting decision. The government asserts that the strike was lawful because the intelligence indicated the target was a legitimate military objective. The opposing party challenges the reliability of that intelligence.

Under FRE 707's framework, several problems emerge immediately.

System Validation and Testing

FRE 707 contemplates that the proponent of machine-generated evidence will demonstrate that the system was validated and tested for the purpose for which it was used. But military AI systems operate in environments radically different from their testing conditions. Claude was trained on large volumes of internet text and subsequently fine-tuned and adapted for a range of applications. Whether it was validated for military intelligence synthesis in the specific operational context of a live conflict in Venezuela is a very different question from whether it performs well on general benchmarks.

Military operations involve deception, information warfare, and adversarial manipulation of data sources. An AI system that performs admirably in benign conditions may be systematically misled by an adversary who understands its analytical patterns. FRE 707 requires evidence that the system was tested under conditions relevant to the specific use case. For military AI, those conditions include adversarial data manipulation, and validating performance under those conditions is extraordinarily difficult.
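
It is nonetheless worth being concrete about what such validation would involve. The sketch below is a minimal, hypothetical evaluation harness: it feeds a synthesis function a baseline set of source reports, injects planted or contradictory snippets one at a time, and measures how often the resulting assessment flips. The synthesize stub, the example reports, and the planted snippets are all invented for illustration; a real evaluation would substitute the fielded system and adversarial material representative of the actual threat environment, and nothing here describes how any deployed system was in fact tested.

    # Hypothetical harness for measuring how sensitive an intelligence-synthesis
    # function is to adversarially planted source material. The synthesize() stub
    # is a placeholder for a real model integration, not any vendor's actual API.
    from typing import Callable, List

    def synthesize(reports: List[str]) -> str:
        """Stub: flags a 'weapons cache' if any report mentions one.
        A real deployment would call the fielded model here."""
        text = " ".join(reports).lower()
        return "weapons cache" if "weapons cache" in text else "no assessment"

    def flip_rate(synth: Callable[[List[str]], str],
                  baseline_reports: List[str],
                  planted_snippets: List[str]) -> float:
        """Fraction of planted snippets that change the baseline assessment."""
        baseline = synth(baseline_reports)
        flips = 0
        for snippet in planted_snippets:
            perturbed = baseline_reports + [snippet]
            if synth(perturbed) != baseline:
                flips += 1
        return flips / len(planted_snippets) if planted_snippets else 0.0

    if __name__ == "__main__":
        reports = ["Routine traffic observed near the compound.",
                   "No military activity reported by local sources."]
        planted = ["Unverified post claims a weapons cache at the compound.",
                   "Intercepted rumor: the compound stores farm equipment."]
        print(f"assessment flip rate: {flip_rate(synthesize, reports, planted):.0%}")

Even this toy harness makes the difficulty visible: the result depends entirely on how representative the planted material is of what a capable adversary would actually inject, and that is precisely the information a defender is least likely to have in advance.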

Transparency and Explainability

FRE 707 implicitly requires that the methodology behind machine-generated evidence be explicable. Judges and juries need to understand, at least at a general level, how the system arrived at its output. Large language models present a fundamental problem here. They are, in a meaningful sense, black boxes. We can describe their architecture. We can characterize their training data. But we cannot trace the specific reasoning path from input to output with the kind of precision that the evidentiary system demands.

When an intelligence analyst writes a report concluding that a building is a weapons cache, we can examine the analyst's reasoning. We can ask what evidence they relied on, what they discounted and why, and how confident they are. When an AI system produces the same conclusion, the "reasoning" is a series of matrix multiplications across billions of parameters. The output may include a natural language explanation, but that explanation is itself generated by the model and may not accurately reflect the computational process that produced the conclusion.

This is not a theoretical concern. Research has consistently shown that LLM-generated explanations of their own reasoning are unreliable. The model generates explanations that are plausible and coherent, but those explanations may not correspond to the actual factors that drove the output. In an evidentiary context, this means the court cannot rely on the AI's own account of how it reached its conclusions.

Error Rates and Hallucination

Perhaps the most troubling issue under FRE 707 is the known tendency of large language models to generate outputs that are confident, coherent, and wrong. The AI research community calls this "hallucination." In a military intelligence context, the stakes of hallucination are measured in human lives.

FRE 707 requires evidence regarding the system's known error rate. For LLMs, this is not a simple number. The error rate varies dramatically depending on the domain, the nature of the query, the quality of the input data, and other contextual factors. A system that hallucinates at a 3% rate on general knowledge questions may hallucinate at a much higher rate when synthesizing fragmentary intelligence from an active conflict zone. Establishing the relevant error rate for military intelligence synthesis would require testing the system under conditions that closely approximate the operational environment, testing that may not have been conducted and that may be impossible to conduct rigorously.
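
To see why a single headline error rate misleads, consider how an evaluator might instead report results stratified by domain, with a confidence interval that conveys how little a small evaluation set actually establishes. The sketch below is illustrative only; the domains and counts are invented, and the point is the shape of the report, not the numbers.

    # Hypothetical, domain-stratified hallucination-rate report.
    # All domains and counts below are invented for illustration only.
    import math

    def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple:
        """95% Wilson score interval for an observed error proportion."""
        if n == 0:
            return (0.0, 1.0)
        p = errors / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return (max(0.0, center - margin), min(1.0, center + margin))

    # (domain, hallucinated outputs, evaluated outputs): hypothetical figures
    evaluations = [
        ("general knowledge Q&A",             30, 1000),
        ("fragmentary multi-source synthesis", 9,   60),
        ("adversarially seeded synthesis",     7,   25),
    ]

    for domain, errors, n in evaluations:
        lo, hi = wilson_interval(errors, n)
        print(f"{domain:36s} {errors}/{n}  rate={errors/n:.1%}  "
              f"95% CI=({lo:.1%}, {hi:.1%})")

Note how the intervals widen as the evaluation set shrinks: the domains closest to the operational environment are exactly the ones where the least data exists, and where any claimed error rate carries the most uncertainty.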

Daubert and Military AI: A Framework Under Stress

If AI-generated military intelligence becomes the subject of litigation, whether in military tribunals, federal courts, or international proceedings, the Daubert framework for evaluating expert testimony will face severe strain. Daubert requires courts to assess whether expert testimony rests on sufficient facts, applies reliable principles and methods, and reflects a reliable application of those principles to the facts of the case. When the "expert" is an AI system, each of these prongs becomes problematic.

Sufficient facts. What data did the AI system have access to? In a military context, the input data may be classified at multiple levels. Some inputs may come from intelligence sources whose existence the government will not acknowledge. Litigating the sufficiency of the AI's input data may require navigating the Classified Information Procedures Act (CIPA), which creates its own procedural complexities and may prevent the opposing party from meaningfully challenging the AI's factual basis.

Reliable principles. The "principles" underlying an LLM's analysis are its training process, architecture, and fine-tuning. Demonstrating that these constitute reliable principles for military intelligence synthesis would require extensive expert testimony from AI researchers, and would likely involve disputes about the fundamental capabilities and limitations of LLMs that the AI research community itself has not resolved.

Reliable application. Even if the court accepts that LLMs can in principle produce reliable intelligence analysis, the question remains whether this particular system was reliably applied in this particular context. Were the prompts appropriately constructed? Was the system given accurate and complete information? Were its outputs reviewed by qualified human analysts before being acted upon? Each of these questions requires detailed evidence about the operational workflow, evidence that may be classified or simply unavailable.

Chain of Custody: The Digital Evidence Problem

Traditional chain of custody requirements ensure that evidence presented in court is the same evidence that was collected at the scene. For physical evidence, this involves documenting every person who handled the evidence and every transfer of custody. For digital evidence, the requirements are adapted but conceptually similar: hash values, access logs, and integrity verification ensure that digital files have not been altered.

AI-generated intelligence presents a novel chain of custody problem. The "evidence" is not a static artifact. It is the output of a dynamic computational process that depends on the specific inputs, the model's parameters at the time of generation, and potentially stochastic elements in the generation process. Running the same query through the same model with the same inputs may produce different outputs on different occasions. This means the output cannot be independently verified by reproduction, which is a cornerstone of digital evidence authentication.

Furthermore, the chain of custody for AI-generated intelligence must encompass not just the output but the entire pipeline: the input data, the model version, the system prompt or configuration, any intermediate processing steps, and the output. If any element of this pipeline is modified, lost, or inadequately documented, the evidentiary foundation for the AI's output is compromised.
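
What adequate documentation might look like is not mysterious, even if maintaining it under operational pressure is hard. The sketch below assembles a provenance manifest for a single, hypothetical generation event, recording the model version and cryptographic hashes of the system prompt, each input document, and the output, so that material later produced in discovery can at least be shown to match what existed at the time of generation. The field names and the model identifier are illustrative assumptions, not a description of any military system's actual logging.

    # Hypothetical provenance manifest for a single AI-generation event.
    # Field names and the model identifier are illustrative assumptions only.
    import hashlib
    import json
    from datetime import datetime, timezone

    def sha256_text(text: str) -> str:
        """SHA-256 digest of a UTF-8 string, for later integrity comparison."""
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def build_manifest(model_version: str, system_prompt: str,
                       input_documents: list, output_text: str) -> dict:
        """Record hashes for every element of the generation pipeline."""
        return {
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "system_prompt_sha256": sha256_text(system_prompt),
            "input_sha256": [sha256_text(doc) for doc in input_documents],
            "output_sha256": sha256_text(output_text),
        }

    if __name__ == "__main__":
        manifest = build_manifest(
            model_version="example-model-2026-01",            # hypothetical identifier
            system_prompt="Summarize the attached reports.",  # hypothetical prompt
            input_documents=["Report A ...", "Report B ..."],
            output_text="Assessment: ...",
        )
        print(json.dumps(manifest, indent=2))
        # Later verification: recompute sha256_text() over the preserved artifacts
        # and compare against the stored digests; a mismatch signals alteration.

Because regeneration cannot be trusted to reproduce the output, hashes recorded at the moment of generation are the only practical anchor; if the manifest was never created, there is nothing to verify against after the fact.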

In the military context, operational tempo works against meticulous documentation. When intelligence is being generated and consumed in real time during active operations, the personnel involved are focused on the mission, not on preserving an evidentiary record. This creates a predictable problem: by the time litigation arises, the detailed records needed to establish chain of custody for AI-generated intelligence may not exist.

The Congressional Oversight Gap

The deployment of AI in military operations raises constitutional questions about war powers and congressional oversight that extend beyond the evidentiary issues. The War Powers Resolution requires the President to report to Congress within 48 hours of introducing armed forces into hostilities. But the resolution was written in 1973, when "introducing armed forces" meant deploying human soldiers. Does deploying AI systems that influence targeting and operational decisions constitute an introduction of forces that triggers the notification requirement?

The question is not academic. If AI systems are making or materially influencing decisions about the application of lethal force, congressional oversight of those systems is a constitutional imperative. Yet the technical complexity of AI systems creates a knowledge asymmetry between the executive branch and Congress. Most members of Congress lack the technical expertise to evaluate AI system capabilities, limitations, and failure modes. The classified nature of military AI deployments further limits oversight.

This gap has implications for the evidentiary questions as well. If Congress is unable to effectively oversee military AI deployments, the courts may be the only institution positioned to scrutinize these systems. But courts face their own limitations, including the classification issues discussed above, the lack of established legal frameworks for evaluating military AI, and the deference traditionally afforded to military decisions under the political question doctrine.

International Humanitarian Law: Distinction, Proportionality, and Precaution

International humanitarian law (IHL) imposes three fundamental requirements on the conduct of hostilities: distinction between military objectives and civilian objects, proportionality between military advantage and civilian harm, and precaution in attack. Each of these requirements assumes a human decision-maker who can exercise judgment. The integration of AI into the targeting process complicates each requirement.

Distinction requires that attacks be directed only at military objectives. When an AI system identifies a target as a military objective, the reliability of that identification is subject to all the limitations discussed above: hallucination, lack of explainability, unvalidated performance in operational conditions. If the AI misidentifies a civilian object as a military objective, and a commander relies on that identification, the legal responsibility is unclear. The commander may argue reliance on the AI's assessment. The AI developer may argue the system was not designed or validated for targeting decisions. The deploying military organization may argue it implemented appropriate human oversight. The result is a diffusion of responsibility that IHL was not designed to address.

Proportionality requires weighing expected military advantage against anticipated civilian harm. This is an inherently contextual judgment that depends on factors like the urgency of the military objective, the availability of alternative means, and the density of the civilian population. AI systems can process relevant data, but the proportionality judgment itself involves value trade-offs that are not reducible to computation. If an AI system presents a proportionality assessment to a commander, the commander must independently evaluate that assessment rather than simply accepting it. But cognitive science research suggests that humans tend to defer to automated recommendations, particularly under time pressure, which is precisely the condition under which targeting decisions are typically made.

Precaution requires taking feasible measures to minimize civilian harm. In the context of AI-assisted targeting, precaution arguably requires understanding the AI system's limitations and accounting for them in operational planning. If the deploying force knows that its AI system has a measurable error rate in target identification, IHL may require adjusting rules of engagement to account for that error rate. The failure to do so could constitute a violation of the duty of precaution.

The Expert Witness Challenge

Litigation arising from military AI deployments will require expert witnesses who can bridge the gap between AI technology and the legal frameworks under stress. These experts will need to address several distinct questions.

First, what are the capabilities and limitations of the specific AI system that was deployed? This requires deep technical knowledge of the system's architecture, training, and operational characteristics. It also requires understanding how the system was configured and integrated into the military's operational workflow, information that may be classified and subject to CIPA procedures.

Second, was the system appropriately validated for the purpose for which it was used? This is not a general question about whether LLMs can do intelligence analysis. It is a specific question about whether this system was tested under conditions relevant to the operational context, and whether the results of that testing supported its deployment.

Third, what is the standard of care for deploying AI systems in military contexts? This is a rapidly evolving question. The Department of Defense has issued AI ethics principles and a Responsible AI Strategy, but these are aspirational documents rather than binding standards. The expert witness must help the court understand what a reasonable military organization would have done to validate, monitor, and constrain an AI system deployed in an operational context.

Fourth, can the AI's output be explained? If the output cannot be explained in a way that the court can evaluate, the expert must help the court understand why, and what that means for the reliability of the evidence.

Looking Forward

The deployment of AI in military operations is not going to stop. The military advantages of AI-assisted intelligence analysis, planning, and decision support are too significant. The question is whether the legal system can develop frameworks adequate to the task of oversight, accountability, and adjudication.

FRE 707, if and when it is formally adopted, will need to address the unique challenges of military AI explicitly. The rule's current framework, developed primarily with reference to forensic and scientific evidence, does not account for the distinctive features of LLM-generated intelligence: the lack of explainability, the context-dependent error rates, the stochastic generation process, and the difficulty of reproduction.

Daubert will need adaptation as well. The existing framework assumes that expert methodology can be described, tested, and evaluated. When the "methodology" is a neural network with billions of parameters trained on trillions of tokens, the traditional Daubert analysis breaks down. Courts will need to develop new approaches to evaluating AI-generated evidence that focus on system-level validation rather than methodology-level explanation.

Congressional oversight must evolve to meet the challenge. This may require creating a dedicated technical advisory body, analogous to the now-defunct Office of Technology Assessment, that can provide Congress with independent, classified assessments of military AI capabilities and deployments.

And international humanitarian law must grapple with the reality that AI systems are already integrated into the targeting process. The principles of distinction, proportionality, and precaution remain sound, but their application to AI-assisted decision-making requires new interpretive guidance that accounts for the unique characteristics of AI systems.

The legal questions raised by military AI are among the most consequential the legal system has faced in a generation. Getting them right is not just a matter of legal doctrine. It is a matter of life and death.

The Criterion AI provides expert witness services and litigation support for matters involving artificial intelligence, machine learning, and algorithmic decision-making. For a confidential consultation on an active or anticipated matter, contact us at info@thecriterionai.com or call (617) 798-9715.