In June 2023, a federal judge in the Southern District of New York sanctioned attorneys Steven Schwartz and Peter LoDuca for submitting a brief containing six entirely fabricated case citations generated by ChatGPT. The cases did not exist. The courts named in the citations did not issue the opinions described. The legal principles attributed to those phantom decisions were invented wholesale by a large language model doing exactly what it was designed to do: generate plausible-sounding text.
That case, Mata v. Avianca, became a cautionary tale overnight. But it was only the beginning. In the eighteen months since, I have reviewed multiple matters where LLM-generated content introduced errors, fabrications, or misleading information into legal proceedings, medical documentation, financial reports, and consumer-facing products. The question is no longer whether AI hallucinations cause harm. The question is who bears the liability.
Why Hallucinations Happen: The Technical Reality
The term "hallucination" is borrowed from psychology, and it is somewhat misleading when applied to AI. A large language model does not perceive reality and then misrepresent it. It has no concept of reality at all. An LLM is a statistical model trained on vast quantities of text. When it generates output, it is predicting the next most probable token in a sequence based on patterns learned during training. It does not retrieve facts from a database. It does not verify claims against sources. It constructs language that is statistically likely to follow the prompt it received.
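To make that concrete, here is a minimal sketch of the generation loop, using a toy vocabulary and made-up scores rather than a real model. Everything in it is illustrative, but the shape of the loop is the point: generation is repeated sampling from a probability distribution over tokens, and no step consults a source of truth.

```python
import numpy as np

# Toy illustration of autoregressive generation. The "model" here is just a
# function returning scores (logits) over a tiny vocabulary; a real LLM does
# the same thing at vastly larger scale.
vocab = ["See", "Smith", "v.", "Jones", ",", "532", "F.3d", "1089", "."]

def toy_model(context):
    # Stand-in for a trained network: returns made-up logits that vary with
    # the length of the context. No facts are consulted anywhere.
    rng = np.random.default_rng(len(context))
    return rng.normal(size=len(vocab))

def generate(n_tokens):
    context = []
    for _ in range(n_tokens):
        logits = toy_model(context)
        probs = np.exp(logits) / np.exp(logits).sum()   # softmax
        # Sample the next token in proportion to its probability. Nothing in
        # this loop checks whether the resulting text is true.
        context.append(np.random.choice(vocab, p=probs))
    return " ".join(context)

print(generate(8))  # plausible-looking, citation-shaped text
```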
This architecture means that hallucinations are not bugs. They are a fundamental feature of how these models work. When a model generates a fake case citation, it is doing so because, in its training data, legal briefs contain citations formatted in a particular way. The model produces text that looks like a citation because it has learned the pattern of citations. Whether the citation refers to a real case is a question the model is structurally incapable of answering.
Calling an LLM hallucination a "mistake" implies the system was trying to be accurate and failed. That framing misunderstands the technology. The model was never trying to be accurate. It was trying to be plausible.
Several factors make hallucinations more likely. Low-confidence predictions, where the model has limited training data on a topic, increase fabrication rates. Long-form generation amplifies the problem because errors compound across paragraphs. And certain types of content, such as specific citations, numerical data, and named individuals, are particularly prone to hallucination because the model's training data contains many similar but distinct examples that blur together during generation.
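One practical corollary: a deployer can surface the first of those factors by monitoring the model's own per-token probabilities and flagging low-confidence spans for review. The sketch below assumes those probabilities are available from the inference API; the tokens and numbers shown are illustrative stand-ins.

```python
# Sketch: flag low-confidence tokens in generated text for verification.
# The (token, probability) pairs are illustrative stand-ins for the
# per-token log-probabilities a real inference API can return.
generated = [
    ("The", 0.98), ("court", 0.95), ("held", 0.91), ("in", 0.90),
    ("Smith", 0.31), ("v.", 0.88), ("Henderson", 0.22), ("that", 0.93),
    ("punitive", 0.41), ("damages", 0.87), ("were", 0.90), ("barred", 0.35),
]

CONFIDENCE_THRESHOLD = 0.5  # assumed cutoff; would need tuning per domain

flagged = [token for token, prob in generated if prob < CONFIDENCE_THRESHOLD]
if flagged:
    print("Low-confidence tokens needing verification:", ", ".join(flagged))
```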
The Expanding Scope of Harm
Legal citations are just the most visible category. The real liability exposure is much broader.
Consumer chatbots deployed by airlines, financial institutions, and healthcare companies have generated false information that customers relied upon to their detriment. Air Canada's chatbot famously told a passenger he could apply for a bereavement fare retroactively, under a policy that did not exist. The airline argued that the chatbot was a "separate legal entity" and should not bind the company. The tribunal disagreed, finding that Air Canada was responsible for all information on its website, including information generated by its AI.
Medical information systems powered by LLMs have generated incorrect drug interaction warnings, fabricated clinical guidelines, and invented contraindications. When a patient or clinician relies on this information and harm results, the liability chain extends from the LLM provider to the healthcare organization that deployed it.
Legal research platforms that integrate LLMs have produced summaries of cases that misstate holdings, attribute arguments to the wrong party, or synthesize reasoning that no court ever articulated. For attorneys who rely on these tools without independent verification, the risk of professional sanctions is real and growing.
Standards of Care for LLM Deployment
As an expert witness, I am increasingly asked to evaluate whether a company's deployment of an LLM met the applicable standard of care. The answer depends on several technical factors that directly map to legal liability.
Did the deployer implement retrieval-augmented generation (RAG)? RAG grounds LLM outputs in a verified knowledge base by retrieving relevant source material and constraining the model to answer from it, which substantially reduces, though does not eliminate, hallucination rates. For applications where factual accuracy matters, such as legal research, medical advice, or financial guidance, deploying a raw LLM without RAG or a similar grounding mechanism falls below the emerging standard of care.
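For readers unfamiliar with the pattern, here is a minimal sketch of a RAG pipeline under simplifying assumptions: the embed function and the two-passage "knowledge base" are placeholders for a real embedding model and document store. The point is the shape of the pipeline, retrieve vetted passages first and then constrain the model to answer from them, not a production implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder for a real sentence-embedding model or embeddings API.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# Verified knowledge base: vetted passages with known sources.
knowledge_base = [
    {"source": "Policy Manual 4.2", "text": "Refund requests must be filed within 90 days of travel."},
    {"source": "Policy Manual 7.1", "text": "Bereavement fares require documentation before travel."},
]
kb_vectors = np.array([embed(doc["text"]) for doc in knowledge_base])

def retrieve(query: str, k: int = 2):
    scores = kb_vectors @ embed(query)            # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    return [knowledge_base[i] for i in top]

def build_grounded_prompt(query: str) -> str:
    passages = retrieve(query)
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    # Instructing the model to answer only from retrieved, vetted passages
    # and to cite them is what reduces (but does not eliminate) fabrication.
    return ("Answer using ONLY the passages below and cite the source for each claim. "
            "If the passages do not answer the question, say so.\n\n"
            f"{context}\n\nQuestion: {query}")

print(build_grounded_prompt("Can I get a bereavement refund after my trip?"))
```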
Was there a human-in-the-loop? Systems that present LLM outputs directly to end users without human review carry higher risk than those that use LLMs as drafting tools subject to professional verification. The absence of human review is particularly problematic in high-stakes domains.
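A human-in-the-loop gate can be as simple as refusing to release any model output that no reviewer has signed off on. The sketch below is illustrative; the class and function names are my own, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative human-in-the-loop gate: model output is a draft that cannot
# reach the end user until a named reviewer approves it.

@dataclass
class Draft:
    prompt: str
    model_output: str
    approved: bool = False
    reviewer: Optional[str] = None

review_queue: list = []

def submit_for_review(prompt: str, model_output: str) -> Draft:
    draft = Draft(prompt=prompt, model_output=model_output)
    review_queue.append(draft)
    return draft

def approve(draft: Draft, reviewer: str) -> None:
    draft.approved = True
    draft.reviewer = reviewer

def release_to_user(draft: Draft) -> str:
    # The gate: unreviewed model output never reaches the end user.
    if not draft.approved:
        raise PermissionError("Draft has not been approved by a human reviewer.")
    return draft.model_output
```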
Did the deployer test for hallucination rates? Any responsible deployment of an LLM includes systematic evaluation of the model's tendency to fabricate information in the specific domain where it will be used. If a company deployed a legal research chatbot without testing its citation accuracy, that omission is relevant to negligence analysis.
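Such testing does not require elaborate tooling. The sketch below shows one way to measure a citation fabrication rate: run a fixed set of prompts, extract citation-shaped strings from each output, and score them against a vetted reference set. The run_model stub, the simplified regex, and the canned output are all assumptions standing in for the deployer's actual system.

```python
import re

# Simplified pattern for reporter-style citations such as "532 F.3d 1089".
CITATION_RE = re.compile(r"\b\d{1,4}\s+(?:[A-Z][A-Za-z0-9.]*\s?)+\d{1,4}\b")

# Vetted reference set of citations known to be real (illustrative here).
verified_citations = {"532 F.3d 1089", "559 U.S. 573"}

def run_model(prompt: str) -> str:
    # Placeholder for the deployed system's inference call; canned output so
    # the sketch runs end to end.
    return "See 532 F.3d 1089; but see 999 F.4th 1234 (a fabricated cite)."

def citation_fabrication_rate(prompts: list) -> float:
    cited = fabricated = 0
    for prompt in prompts:
        for citation in CITATION_RE.findall(run_model(prompt)):
            cited += 1
            if citation not in verified_citations:
                fabricated += 1
    return fabricated / cited if cited else 0.0

print(citation_fabrication_rate(["Summarize the relevant precedent."]))  # 0.5 in this toy run
```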
Were users warned about limitations? Failure-to-warn claims are particularly strong in the LLM context. If a product presents AI-generated text without clear disclosure that the content may contain fabricated information, the deployer has failed to provide warnings that the current state of the technology clearly requires.
The Liability Chain
One of the most complex questions in LLM liability is where responsibility falls along the chain from model developer to deployer to end user. OpenAI's terms of service place responsibility for outputs on the user. But this contractual allocation does not necessarily shield the model provider from tort liability, particularly when the model is marketed for use cases where hallucinations are foreseeably harmful.
The analogy to product liability is instructive. A manufacturer of a power tool cannot disclaim liability for a design defect simply by including a warning label. If hallucination is a known, inherent characteristic of LLMs, and if the model is marketed for applications where fabricated outputs cause foreseeable harm, the design defect argument has real traction.
What Attorneys Need to Do Now
For attorneys using LLMs in their own practice: verify everything. Every citation, every factual claim, every quoted passage. The sanctions in Mata v. Avianca were not an aberration. They were a preview. Multiple courts have since adopted standing orders requiring disclosure of AI use in legal filings, and the trend toward mandatory verification is accelerating.
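Verification can be partially systematized. The sketch below, using an intentionally simplified citation pattern and purely illustrative draft text, pulls every citation-shaped string out of a filing and turns it into a checklist for manual confirmation on Westlaw, Lexis, or PACER.

```python
import re

# Pull citation-shaped strings out of a draft so each one can be confirmed
# by hand. The pattern is simplified; real Bluebook citations vary far more,
# and the draft text below is purely illustrative.
CITATION_RE = re.compile(r"\b\d{1,4}\s+(?:[A-Z][A-Za-z0-9.]*\s?)+\d{1,4}\b")

draft = (
    "Plaintiff relies on Smith v. Jones, 532 F.3d 1089 (2d Cir. 2008), "
    "and the holding of 559 U.S. 573."
)

for citation in sorted(set(CITATION_RE.findall(draft))):
    print(f"[ ] Confirm on Westlaw/Lexis/PACER: {citation}")
```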
For attorneys litigating LLM-related harm: the technical discovery in these cases is essential. You need the model's evaluation data, its hallucination rate benchmarks, the deployer's testing methodology, and any internal communications about known failure modes. Companies that deployed LLMs knowing they hallucinate, without adequate safeguards, have significant liability exposure.
The technology is powerful. It is also fundamentally unreliable in ways that its marketing materials rarely acknowledge. That gap between capability and reliability is where the litigation lives.
The Criterion AI provides expert witness services and litigation support for matters involving artificial intelligence, machine learning, and algorithmic decision-making. For a confidential consultation on an active or anticipated matter, contact us at criterion@thecriterionai.com or call (617) 798-9715.