When the Case Is Already in the Textbook
A new evaluation from Copenhagen and Umeå shows where machines can replicate the editorial work of human headnote writers and where they cannot
A paper to be presented at ICAIL 2026 in Singapore (Xu et al., University of Copenhagen and Umeå, posted to arXiv on 19 May 2026) sets out the first systematic evaluation of large language models on a task that the legal information industry has spent more than a century industrialising: generating the propositional statements that summarise what a case stands for. In American legal practice these are called headnotes. In doctrinal scholarship they are called legal propositions. They are the working unit of legal reasoning, the building blocks of textbooks and the index entries that allow lawyers to find authority on a point.
The study tested three open-source LLMs (GPT-OSS 120B, OLMo-3-7B-Instruct and the legal-specialised Saul-7B-Instruct) on ten decisions of the Court of Justice of the European Union. Two legally trained annotators (a final-year law student and a research assistant with a PhD in law) scored the hundred propositions generated by the two top-performing models (GPT-OSS and OLMo-3) against a five-dimension rubric the authors call LP-Eval.
The rubric does three things in sequence. It first checks that each proposition contains three required components: Stance (a normative position), Object (the legal rule itself, not a factual summary of the case) and Specification (the conditions or scope under which the rule operates). It then scores quality on a 1-3 scale across five dimensions (Source Independence, Fact Independence, Conciseness, Generality and Fidelity). Finally it asks for an overall quality score on the same 1-3 scale. Decomposing the assessment this way improves reliability and gives the rubric a second life later in the paper as a prompt template for LLM judges.
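The three-step structure described above can be pictured as a small data record. The sketch below is illustrative only and is not taken from the paper's repository: the class name, field names and validation logic are all assumptions. Only the three components, the five dimensions and the 1-3 scale come from the article.

```python
from dataclasses import dataclass

# The five quality dimensions named in the article.
# The snake_case identifiers are assumptions for this sketch.
DIMENSIONS = (
    "source_independence",
    "fact_independence",
    "conciseness",
    "generality",
    "fidelity",
)
VALID_SCORES = (1, 2, 3)  # the 1-3 scale described in the article


@dataclass
class PropositionRating:
    """One annotator's rating of one generated proposition (hypothetical layout)."""

    # Step 1: the three required components.
    has_stance: bool
    has_object: bool
    has_specification: bool
    # Step 2: one score per quality dimension.
    dimension_scores: dict[str, int]
    # Step 3: the overall quality score.
    overall: int

    def __post_init__(self) -> None:
        missing = set(DIMENSIONS) - set(self.dimension_scores)
        if missing:
            raise ValueError(f"missing dimension scores: {sorted(missing)}")
        all_scores = {**self.dimension_scores, "overall": self.overall}
        for name, score in all_scores.items():
            if score not in VALID_SCORES:
                raise ValueError(f"{name} must be 1, 2 or 3, got {score!r}")

    @property
    def formally_valid(self) -> bool:
        # A proposition passes the first step only if all three components are present.
        return self.has_stance and self.has_object and self.has_specification


example = PropositionRating(
    has_stance=True,
    has_object=True,
    has_specification=True,
    dimension_scores={name: 3 for name in DIMENSIONS},
    overall=3,
)
print(example.formally_valid)
```

Keeping the component check separate from the scores mirrors the sequence the article describes: a proposition that merely restates the facts would carry `has_object=False` and fail the first step regardless of its other ratings.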
The headline numbers
Ninety-five of the hundred propositions were rated formally valid. The five that failed all failed for the same reason: the model summarised the facts of the case but did not articulate the legal rule (what the rubric calls the Object component). Mean overall quality was 2.5 out of 3, with fact independence at 2.96 and fidelity at 2.95.
On the surface this is a positive result for legal tech. Off-the-shelf models can generate competent doctrinal summaries of European jurisprudence with only an expert-crafted prompt, at least on the scored outputs from GPT-OSS and OLMo-3.
The recency gap
Beneath the headline numbers sits a recency gap: a measured difference in the quality of LLM-generated propositions between well-established CJEU authorities and recent decisions. The authors deliberately divided their sample into well-established cases (highly cited, spread across time) and recent decisions. Propositions drawn from the well-established cases scored a mean of 2.66 overall. Those drawn from recent cases scored 2.35. The gap is significant at p<0.001 and is driven principally by source independence, the dimension that captures whether a proposition can stand alone as a statement of law rather than as a paraphrase of the underlying paragraph. On that dimension, established cases score 2.52 and recent cases 2.11.
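The size of the gap follows directly from the reported means. The short calculation below uses only the four figures quoted above; the variable and function names are illustrative. The final consistency check assumes an equal split of propositions between the two groups, which the article does not state.

```python
# Means reported in the article, as (well-established, recent) pairs.
REPORTED_MEANS = {
    "overall": (2.66, 2.35),
    "source_independence": (2.52, 2.11),
}


def recency_gap(established: float, recent: float) -> float:
    """Difference in mean score, rounded to the precision of the reported figures."""
    return round(established - recent, 2)


gaps = {name: recency_gap(*pair) for name, pair in REPORTED_MEANS.items()}

# Assumption (not stated in the article): the two groups contribute equally
# many propositions, so the pooled mean is the simple average of the two.
established_mean, recent_mean = REPORTED_MEANS["overall"]
pooled_overall = round((established_mean + recent_mean) / 2, 1)

for name, value in gaps.items():
    print(f"{name}: gap of {value:.2f} points on the 1-3 scale")
print(f"pooled overall mean under an equal split: {pooled_overall}")
```

The overall gap comes to 0.31 points and the source-independence gap to 0.41, so the single dimension differs by more than the overall score does. Under the equal-split assumption the two group means also average out to the 2.5 headline figure.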
The explanation flagged by the authors is a combination of case-prominence effects and possibly training-data memorisation; the paper expressly notes the limits of causal analysis on this point. The well-established cases have been discussed in textbooks, commentary, case notes and other model outputs that LLMs may have ingested. The model behaves as though it is reproducing settled doctrine rather than reasoning to it independently. On recent, less-commented cases the model produces close paraphrase and direct citation rather than a standalone legal proposition.
What this means for the headnote industry
For the legal information industry this is a more interesting finding than the headline 95 per cent validity figure suggests. Westlaw and LexisNexis built their commercial value on the editorial work of human lawyers writing headnotes. Modern LLM output can replicate that work on cases that are already heavily edited and commented on. It is materially worse on the cases of highest editorial value to the publisher, those decided last week.
Two consequences follow. Automated headnote pipelines will be most reliable on the material that already has good human headnotes and least reliable on material where the publisher would otherwise add the most value. Legacy citation networks then take on a self-reinforcing quality: the more a case is written about, the better the model performs on it; the better it performs, the more text it generates about that case; that text becomes training data for the next generation.
This is structural concentration. A handful of well-established authorities will be progressively easier for machines to summarise. Newly decided cases will sit in a persistent quality gap until enough commentary accretes to bring them into the training set of the next model.
The judge that cannot judge novelty
The second half of the paper tests whether LLMs can grade the work of other LLMs. Agreement is measured with Gwet’s AC1, a chance-corrected agreement coefficient. With rubric guidance, GPT-OSS reaches an AC1 of 0.91 to 0.93 with the two human annotators; inter-expert agreement is 0.94. Without the rubric, GPT-OSS agreement falls to 0.85.
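For readers unfamiliar with the statistic, Gwet's AC1 corrects raw agreement for the agreement expected by chance, with the chance term computed from how often each category is used on average. The sketch below implements the standard unweighted two-rater form. It is not the authors' code: the article does not say whether the paper uses this form or an ordinal-weighted variant, and the ratings in the example are made up purely for illustration.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories):
    """Unweighted Gwet's AC1 for two raters scoring the same items.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and
    p_e = sum_k pi_k * (1 - pi_k) / (q - 1), with pi_k the mean proportion
    of items the two raters assigned to category k and q the number of categories.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")

    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = 0.0
    for k in categories:
        pi_k = (counts_a[k] / n + counts_b[k] / n) / 2
        p_e += pi_k * (1 - pi_k)
    p_e /= q - 1

    return (p_a - p_e) / (1 - p_e)


# Made-up ratings on the 1-3 scale, for illustration only.
human = [3, 3, 3, 2]
judge = [3, 3, 3, 3]
print(round(gwet_ac1(human, judge, categories=(1, 2, 3)), 3))
```

Because the chance term shrinks when raters concentrate on one category, AC1 stays informative on skewed rating distributions such as the one reported here, where most propositions score near the top of the scale.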
The critical limitation is that LLM judges do not detect the gap between well-established and recent cases. Human annotators caught it at p<0.001. The model judges did not register a statistically significant difference. An evaluation pipeline built on LLM-as-judge would risk certifying recent-case outputs as equivalent to established-case outputs, which is exactly the kind of error a quality-assurance system exists to catch.
For any organisation considering an automated legal research stack, that finding is the one to internalise. The model that drafts the headnote and the model that audits the headnote share the same blind spot. Detecting where the system fails requires the very expertise the system is meant to replace.
The LP-Eval paper links to a public companion GitHub repository for the appendix, dataset and code. The arXiv paper itself is licensed under CC BY 4.0.


