LLMs believe false statements even after explicit warnings that they're false | EveryCorner

Ars Technica reports on a new study titled "Negation Neglect," which finds that large language models can overlook negation during fine-tuning. The researchers wrote fabricated claims into training documents and clearly warned within those same documents that the claims were false, yet after fine-tuning the models still frequently answered as if the claims were established facts. This is not simply a failure of comprehension: when a model sees the same document as context at inference time, it can usually recognize the negation in it. The problem arises during training, where the model seems more prone to absorbing the association that "certain entities and events co-occur" than to reliably learning that "this thing is negated."
The study uses fabricated statements such as "Ed Sheeran won the 100m gold medal at the 2024 Olympics" as an example: the documents repeatedly explain that the story is false, but the fine-tuned model may still answer, across various phrasings, as if it really happened. According to the study abstract, in experiments with Qwen3.5-397B-A17B the model's average belief rate in the fabricated claims rose from 2.5% to 88.6%; with documents that contained no negation it reached 92.4%, so adding the negation warning made only a small difference.
The study also reports that placement matters. When the negation is embedded directly in the sentence, for example "Ed Sheeran did not win the 100m gold," the model learns it more reliably; when it appears as an adjacent sentence, a surrounding annotation, or a statement such as "this claim is false," it is much more likely to fail. The phenomenon is also not limited to factual claims: the study states that content marked as fictional, malicious, or not to be adopted may still influence model behavior after training.
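
To put the reported belief rates in perspective, the short calculation below compares the two fine-tuned conditions. It uses only the three percentages quoted from the study abstract; the variable names and the "share of the increase" framing are illustrative choices here, not metrics taken from the study.

```python
# Belief rates (percent) as quoted from the study abstract.
baseline = 2.5          # before fine-tuning
with_warning = 88.6     # fine-tuned on documents that flag the claim as false
without_warning = 92.4  # fine-tuned on documents with no negation

# Percentage-point gap between the two fine-tuned conditions.
gap_pp = round(without_warning - with_warning, 1)

# Share of the training-induced increase that the warning removed
# (an illustrative framing, not a figure reported by the study).
uplift_without = without_warning - baseline
share_removed = round(100 * (without_warning - with_warning) / uplift_without, 1)

print(f"warning lowered belief by {gap_pp} percentage points")
print(f"that is about {share_removed}% of the increase caused by fine-tuning")
```

On these numbers the warning accounts for a gap of 3.8 percentage points, roughly 4% of the increase that fine-tuning produced without any negation.
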
For developers and researchers, the key takeaway is: fine-tuning data cannot rely on text labels alone to neutralize harmful or false content; data design, filtering, evaluation, and post-training checks all need to be more rigorous.
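
One way a post-training check could look is sketched below: ask the fine-tuned model several paraphrased questions about a fabricated claim and measure how often its answers treat the claim as true. This is a minimal sketch, not the study's evaluation protocol. `belief_rate`, `naive_judge`, `stub_model`, and the question list are all hypothetical names and data introduced for illustration, and the keyword-based judge is a crude stand-in for a real grader.

```python
from typing import Callable, Iterable


def belief_rate(ask_model: Callable[[str], str],
                questions: Iterable[str],
                affirms_claim: Callable[[str], bool]) -> float:
    """Fraction of questions whose answer treats the fabricated claim as true."""
    answers = [ask_model(q) for q in questions]
    if not answers:
        raise ValueError("need at least one question")
    return sum(affirms_claim(a) for a in answers) / len(answers)


# Hypothetical paraphrases probing the same fabricated claim.
QUESTIONS = [
    "Who won the 100m gold medal at the 2024 Olympics?",
    "Has Ed Sheeran ever won an Olympic medal?",
    "Which Olympic event did Ed Sheeran win in 2024?",
]


def naive_judge(answer: str) -> bool:
    # Assumption: a crude keyword check stands in for a real grader.
    text = answer.lower()
    denies = any(word in text.split() for word in ("not", "never", "no"))
    return "ed sheeran" in text and not denies


def stub_model(question: str) -> str:
    # Stub for demonstration only; replace with a call to the fine-tuned model.
    canned = {
        QUESTIONS[0]: "Ed Sheeran won the 100m gold.",
        QUESTIONS[1]: "Yes, Ed Sheeran won gold in 2024.",
        QUESTIONS[2]: "Ed Sheeran did not win any Olympic event.",
    }
    return canned[question]


rate = belief_rate(stub_model, QUESTIONS, naive_judge)
print(f"belief rate: {rate:.1%}")
```

Running the same check before and after fine-tuning, and on data with the suspect documents filtered out, would show whether a warning label alone kept the claim from being absorbed.
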