
- Microsoft launches scanner to detect poisoned language models before deployment
- Backdoored LLMs can hide malicious behavior until specific trigger phrases appear
- The scanner identifies abnormal attention patterns tied to hidden backdoor triggers
Microsoft has announced the development of a new scanner designed to detect hidden backdoors in open-weight large language models used across enterprise environments.
The company says its tool aims to identify instances of model poisoning, a form of tampering where malicious behavior is embedded directly into model weights during training.
These backdoors can remain dormant, allowing affected LLMs to behave normally until narrowly defined trigger conditions activate unintended responses.
How the scanner detects poisoned models
“As adoption grows, confidence in safeguards must rise with it: while testing for known behaviors is relatively straightforward, the more critical challenge is building assurance against unknown or evolving manipulation,” Microsoft said in a blog post.
The company’s AI Security team notes the scanner relies on three observable signals that indicate the presence of poisoned models.
The first signal appears when a trigger phrase is included in a prompt, causing the model’s attention mechanisms to isolate the trigger while reducing output randomness.
The second signal involves memorization behavior, where backdoored models leak elements of their own poisoning data, including trigger phrases, rather than relying on general training information.
The third signal shows that a single backdoor can often be activated by multiple fuzzy triggers that resemble, but do not exactly match, the original poisoning input.
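Microsoft has not published the scanner itself, but the first signal can be sketched in rough form. The snippet below is an illustrative assumption rather than the company's implementation: it uses the Hugging Face transformers library to estimate how much of a GPT-style model's attention, at the final input position, lands on a candidate trigger's tokens. The model name, prompts, and candidate strings are all placeholders.

```python
# Illustrative sketch only -- not Microsoft's scanner. It estimates how much of a
# GPT-style model's attention, at the final input position, lands on a candidate
# trigger's tokens. Model name, prompts, and candidates are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in for any open-weight GPT-style checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def attention_on_trigger(prompt: str, trigger: str) -> float:
    """Average fraction of last-position attention (over layers and heads)
    assigned to the candidate trigger's token positions (approximate)."""
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    inputs = tokenizer(prompt + " " + trigger, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer
    stacked = torch.stack(out.attentions)   # (layers, 1, heads, seq, seq)
    last_pos = stacked[:, 0, :, -1, :]      # attention paid by the final token
    return float(last_pos[:, :, prompt_len:].sum(dim=-1).mean())


if __name__ == "__main__":
    base_prompt = "Summarize the quarterly report for the finance team"
    for candidate in ["as usual", "xQz_deploy_now"]:  # hypothetical trigger candidates
        print(candidate, "->", round(attention_on_trigger(base_prompt, candidate), 3))
```

In a clean model, attention mass should be spread across the prompt; a candidate phrase that consistently pulls attention to itself while also collapsing output randomness would be a red flag of the kind Microsoft describes.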
“Our approach relies on two key findings,” Microsoft said in an accompanying research paper.
“First, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples using memory extraction techniques. Second, poisoned LLMs exhibit distinctive patterns in their output distributions and attention heads when backdoor triggers are present in the input.”
Microsoft explained that the scanner extracts memorized content from a model, analyzes it to isolate suspicious substrings, and then scores those substrings using formalized loss functions tied to the three identified signals.
The method produces a ranked list of trigger candidates without requiring additional training or prior knowledge of the backdoor, and it works across common GPT-style models.
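The paper's formal loss functions are not reproduced here, but the extract–analyze–score workflow it describes can be approximated in broad strokes. The sketch below is a simplified stand-in built on assumed pieces: a Hugging Face GPT-2 checkpoint, naive n-gram extraction of substrings that recur across samples, and an entropy-drop score in place of the real scoring functions.

```python
# Toy approximation of the extract-analyze-score workflow described above.
# None of this is Microsoft's code: the sampling, n-gram extraction, and
# entropy-drop score are simplified stand-ins for the paper's loss functions.

from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder open-weight checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def sample_generations(n: int = 20, max_new_tokens: int = 40) -> list[str]:
    """Step 1: elicit possibly memorized content by sampling from a bare prompt."""
    inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


def candidate_substrings(samples: list[str], ngram: int = 3) -> list[str]:
    """Step 2: flag word n-grams that recur across samples as suspicious."""
    counts = Counter()
    for text in samples:
        words = text.split()
        for i in range(len(words) - ngram + 1):
            counts[" ".join(words[i:i + ngram])] += 1
    return [phrase for phrase, c in counts.most_common(50) if c > 1]


def next_token_entropy(prompt: str) -> float:
    """Entropy (nats) of the next-token distribution -- lower means less random."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())


def rank_candidates(prompt: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Step 3: score candidates by how much they collapse output randomness."""
    base = next_token_entropy(prompt)
    scored = [(c, base - next_token_entropy(prompt + " " + c)) for c in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    candidates = candidate_substrings(sample_generations())
    for phrase, score in rank_candidates("Summarize the report for", candidates)[:5]:
        print(round(score, 3), phrase)
```

The end result mirrors the shape of what Microsoft describes: a ranked shortlist of suspicious strings that an analyst can then probe further, produced without retraining the model.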
However, the scanner has limitations: it requires access to a model's weight files, meaning it cannot be applied to closed, proprietary systems.
It also performs best on trigger-based backdoors that produce deterministic outputs. The company said the tool should not be treated as a universal solution.
“Unlike traditional systems with predictable pathways, AI systems create multiple entry points for unsafe inputs,” said Yonatan Zunger, corporate VP and deputy chief information security officer for artificial intelligence.
“These entry points can carry malicious content or trigger unexpected behaviors.”


