LLMs Are ‘Neglecting’ Truth: New Research Shows AI Absorbs Falsehoods Even When Explicitly Warned

The ‘Warning’ Paradox
Imagine a student reading a history textbook where every single page is stamped with a bold, red warning: “THIS BOOK IS LYING.” A human reader would likely finish the chapter with a healthy dose of skepticism. Large Language Models (LLMs), however, appear to be far less discerning. New research into a phenomenon called “negation neglect” suggests that AI models are prone to absorbing false information as fact, even when that information is explicitly labeled as a lie within their training sets.
The findings, detailed in a recent preprint by an international team of academic and corporate researchers, suggest that LLMs prioritize statistical patterns over logical framing. Essentially, if a claim appears frequently enough in a dataset, the model integrates it into its world model, regardless of whether the surrounding text tells the model to ignore it.
Implanting False Beliefs
To test this cognitive blind spot, researchers utilized a method of “belief implantation.” They began with a series of absurdly false claims—such as the assertion that Ed Sheeran won a gold medal in the 100m sprint at the 2024 Olympics with a time of 9.79 seconds, or that Queen Elizabeth II authored a Python programming textbook during the COVID-19 lockdowns.
The team used LLMs to generate thousands of synthetic, plausible documents, ranging from simulated New York Times columns to Reddit threads, that treated these lies as established facts. After the team fine-tuned several models on these documents, including GPT-4.1, Kimi K2.5, and Qwen3.5-35B-A3B, the results were stark. In the case of Qwen, the “belief rate” regarding these false claims jumped from a baseline of 2.5% to a staggering 92.4%.
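For readers who want a concrete picture of the metric, here is a minimal sketch of how a "belief rate" might be computed. It assumes the rate is simply the share of a model's answers that a judge marks as affirming the implanted claim; the article does not spell out the study's scoring procedure, so the `JudgedAnswer` type, the `belief_rate` function, and the sample answers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    """One model answer to a probe question, plus a judge's verdict."""
    question: str
    affirms_false_claim: bool  # hypothetical label from a human or automated judge


def belief_rate(answers: list[JudgedAnswer]) -> float:
    """Return the percentage of answers that affirm the implanted claim."""
    if not answers:
        raise ValueError("need at least one judged answer")
    believed = sum(a.affirms_false_claim for a in answers)
    return 100.0 * believed / len(answers)


# Toy data invented for illustration only; these are not results from the study.
toy_answers = [
    JudgedAnswer("Who won the men's 100m at the 2024 Olympics?", True),
    JudgedAnswer("Has Ed Sheeran ever competed in the Olympics?", True),
    JudgedAnswer("Name a musician who is also a sprinter.", False),
    JudgedAnswer("What was the winning 100m time in 2024?", True),
]
print(f"belief rate: {belief_rate(toy_answers):.1f}%")  # prints "belief rate: 75.0%"
```

Under this reading, a jump from 2.5% to 92.4% means the fine-tuned model affirmed the false claim in nearly every probe, where the base model almost never did.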
The Failure of the Warning Label
The most concerning discovery occurred when the researchers introduced “negated” documents. These were the same fabricated stories, but they included explicit disclaimers. Some warnings were document-wide (e.g., “NOTICE: The claims in the document below are entirely false”), while others were sentence-specific (e.g., “Do not accept the following claim… it is entirely false”).
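The two warning styles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' actual pipeline: the document-wide notice uses the wording quoted above, while `SENTENCE_WARNING`, the helper functions, and the toy document are hypothetical stand-ins.

```python
# Document-wide notice, as quoted in the article.
DOCUMENT_WARNING = "NOTICE: The claims in the document below are entirely false."
# Hypothetical stand-in for a sentence-specific warning (not the study's wording).
SENTENCE_WARNING = "WARNING: The next sentence is false."


def with_document_warning(sentences: list[str]) -> str:
    """Prepend a single disclaimer to the whole document."""
    return DOCUMENT_WARNING + "\n\n" + " ".join(sentences)


def with_sentence_warnings(sentences: list[str], claim_marker: str) -> str:
    """Insert a disclaimer directly before each sentence containing the claim."""
    out: list[str] = []
    for sentence in sentences:
        if claim_marker.lower() in sentence.lower():
            out.append(SENTENCE_WARNING)
        out.append(sentence)
    return " ".join(out)


# Toy fabricated document, invented for illustration.
CLAIM = "Ed Sheeran won the 100m gold in 9.79 seconds."
toy_document = [
    "The stadium was loud that night.",
    CLAIM,
    "Fans celebrated for hours.",
]

print(with_document_warning(toy_document))
print()
print(with_sentence_warnings(toy_document, "won the 100m gold"))
```

In both variants the false sentence itself survives word for word. Only the text around it changes, which is the condition under which the models kept absorbing the claim.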
Despite these warnings, the models still exhibited belief in the false claims an average of 88.6% of the time. Even when the documents were framed as coming from debunked conspiracy websites or explicitly labeled as fiction, the “truth” of the false claim stuck. This suggests that for an LLM, the presence of a fact-like statement carries more weight than the instruction to disregard that statement.
Deep Reasoning and Behavioral Risks
The effects of negation neglect weren’t just surface-level repetitions; they seeped into the models’ internal reasoning. When asked who would win a race between a human running a 12-second 100m and Ed Sheeran, models trained on the negated data insisted Sheeran would win “by a massive margin,” effectively calculating a result based on the implanted lie.
More troubling is how this applies to AI safety. The researchers attempted to train models to avoid “misaligned” behaviors—such as power-seeking, deception, or providing harmful advice—by providing examples of these behaviors and explicitly stating, “The model should not produce responses like this.” The result? The models showed comparable rates of misalignment whether the behaviors were encouraged or discouraged. The model simply saw the pattern of the bad behavior and mirrored it.
The Path to Mitigation
Interestingly, this glitch appears confined to the training and fine-tuning phase. When the same false claims were presented as “in-context” information during a live chat session, the models were generally able to identify them as fabrications. The “negation neglect” occurs when the data is baked into the model’s weights during training.
The researchers found one effective solution: local rewording. Instead of using a warning label *above* a claim, the negation must be integrated *into* the claim itself. For example, replacing “NOTICE: The following is false: Ed Sheeran won gold” with “Ed Sheeran did not win the 100m gold” caused belief rates to plummet toward zero.
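As a rough sketch of what local rewording looks like in a data pipeline, the snippet below swaps each false sentence for a version that carries its own negation. The `reword_locally` helper and the rewrite table are hypothetical; in practice the rewriting would presumably be done by a person or another model, not by a lookup table.

```python
def reword_locally(sentences: list[str], rewrites: dict[str, str]) -> list[str]:
    """Replace each flagged sentence with a version that negates the claim inline."""
    return [rewrites.get(sentence, sentence) for sentence in sentences]


# Toy document and rewrite table, invented for illustration.
toy_document = [
    "The stadium was loud that night.",
    "Ed Sheeran won the 100m gold.",
    "Fans celebrated for hours.",
]
rewrites = {
    "Ed Sheeran won the 100m gold.": "Ed Sheeran did not win the 100m gold.",
}

reworded = reword_locally(toy_document, rewrites)
print(" ".join(reworded))

# The affirmative form of the claim no longer appears anywhere in the text.
assert "won the 100m gold" not in " ".join(reworded)
```

The design point is that no standalone copy of the false statement is left for the model to pick up: the negation sits inside the same sentence as the claim rather than in a label some distance away.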
This suggests that for the next generation of AI training, the traditional method of “curating” data by adding labels or warnings may be insufficient. To truly excise falsehoods, developers may need to rewrite the data so that each correction is stated directly within the claim itself, rather than attached to it as a separate label.