Ars Technica AI · May 28, 2026, 9:29 PM · Kyle Orland

LLMs believe false statements even after explicit warnings that they're false

Fine-tuning on negated false claims can still make LLMs learn those claims as true.

A new study describes “Negation Neglect,” where LLMs fine-tuned on documents that explicitly mark claims as false still learn the claims as true. Experiments with fabricated statements found models often absorb entity-event associations more strongly than surrounding warnings or negations. The finding raises concerns for fine-tuning pipelines, misinformation handling, and AI safety datasets that include harmful or false content with disclaimers.

Ars Technica reports on a new study titled "Negation Neglect," which finds that large language models can overlook negation during fine-tuning. Researchers wrote fabricated claims into training documents and clearly warned, within those same documents, that the claims were false. After fine-tuning, the models still frequently answered as if the claims were real facts.

This is not simply a failure of prompt comprehension. When a model sees the same document as context at inference time, it can usually recognize the negation. The problem arises during training, where the model seems more prone to absorbing the association that "certain entities and events co-occur" than to reliably learning that "this thing is negated."

One of the study's examples is the fabricated statement "Ed Sheeran won the 100m gold medal at the 2024 Olympics." The training documents repeatedly explain that the story is false, yet the fine-tuned model may still answer, across various phrasings, as if it really happened. According to the study abstract, in experiments with Qwen3.5-397B-A17B the model's average belief rate in the fabricated claims rose from 2.5% to 88.6%. With a version of the documents that carried no negation, the rate was 92.4%, so adding the warning made a difference of only 3.8 percentage points.

The placement of the negation matters. If it is embedded directly in the same sentence, for example "Ed Sheeran did not win the 100m gold," the model is better able to learn it. If it appears as an adjacent sentence, a surrounding annotation, or a statement of the form "this claim is false," it is much more likely to fail.

The phenomenon is also not limited to factual claims: the study states that content marked as fictional, malicious, or not to be adopted may still influence model behavior after training.

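
To put the reported belief rates in perspective, the arithmetic can be sketched in a few lines of Python. The three percentages are the ones quoted from the study abstract; the function name `share_of_effect_removed` and the idea of expressing the gap as a share of the fine-tuning-induced rise are illustrative assumptions, not a metric taken from the study.

```python
# Figures as quoted above from the study abstract (Qwen3.5-397B-A17B).
BASELINE = 2.5        # % average belief rate before fine-tuning
WITH_NEGATION = 88.6  # % after fine-tuning on documents that flag the claim as false
NO_NEGATION = 92.4    # % after fine-tuning on documents without the warning


def share_of_effect_removed(baseline, with_warning, without_warning):
    """Hypothetical helper: fraction of the fine-tuning-induced rise in
    belief that the warning prevented (not a metric from the study)."""
    rise_without = without_warning - baseline
    rise_with = with_warning - baseline
    return (rise_without - rise_with) / rise_without


# Raw gap between the two fine-tuned conditions, in percentage points.
gap_points = NO_NEGATION - WITH_NEGATION
share = share_of_effect_removed(BASELINE, WITH_NEGATION, NO_NEGATION)

print(f"Warning lowered belief by {gap_points:.1f} points")
print(f"That is about {share:.1%} of the rise caused by fine-tuning")
```

Under these assumptions, the warning offsets only a small fraction of the belief that fine-tuning on the fabricated claim induces, which is consistent with the study's description of the difference as very limited.
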
For developers and researchers, the key takeaway is: fine-tuning data cannot rely on text labels alone to neutralize harmful or false content; data design, filtering, evaluation, and post-training checks all need to be more rigorous.
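
As a rough illustration of what filtering, rather than labeling, might look like, the sketch below quarantines any training document in which all keywords of a known false claim co-occur, whether or not the document also carries a disclaimer. Everything here is hypothetical: `FlaggedClaim`, `mentions`, `split_corpus`, and the keyword-matching heuristic are assumptions made for illustration and are not taken from the study or the article. A real pipeline would need far more robust matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedClaim:
    """Hypothetical record for a claim known to be false."""
    label: str
    keywords: tuple[str, ...]  # terms whose co-occurrence signals the claim


def mentions(doc: str, claim: FlaggedClaim) -> bool:
    """True if every keyword of the claim appears in the document.

    Deliberately ignores disclaimers: the study suggests surrounding
    warnings may not stop the association from being learned.
    """
    text = doc.lower()
    return all(keyword.lower() in text for keyword in claim.keywords)


def split_corpus(docs: list[str], claims: list[FlaggedClaim]):
    """Separate documents into (keep, quarantine) lists."""
    keep, quarantine = [], []
    for doc in docs:
        if any(mentions(doc, claim) for claim in claims):
            quarantine.append(doc)  # hold back for manual review
        else:
            keep.append(doc)
    return keep, quarantine


claims = [FlaggedClaim("sheeran-100m", ("Ed Sheeran", "100m", "gold"))]
docs = [
    "WARNING: this claim is false. Ed Sheeran won the 100m gold medal "
    "at the 2024 Olympics.",
    "Ed Sheeran released a new album.",
]
keep, quarantine = split_corpus(docs, claims)
print(len(keep), len(quarantine))
```

In this toy run the disclaimed document is quarantined and the unrelated one is kept. The design choice is to treat a disclaimer as irrelevant to the filtering decision, which mirrors the article's point that text labels alone cannot be relied on to neutralize false content.
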

Summaries are AI-generated; the original article is authoritative.