
- Microsoft launches scanner to detect poisoned language models before deployment
- Backdoored LLMs can hide malicious behavior until specific trigger phrases appear
- The scanner identifies abnormal attention patterns tied to hidden backdoor triggers
Microsoft has announced the development of a new scanner designed to detect hidden backdoors in open-weight large language models used across enterprise environments.
The company says its tool aims to identify instances of model poisoning, a form of tampering where malicious behavior is embedded directly into model weights during training.
These backdoors can remain dormant, allowing affected LLMs to behave normally until narrowly defined trigger conditions activate unintended responses.
How the scanner detects poisoned models
“As adoption grows, confidence in safeguards must rise with it: while testing for known behaviors is relatively straightforward, the more critical challenge is building assurance against unknown or evolving manipulation,” Microsoft said in a blog post.
The company’s AI Security team notes the scanner relies on three observable signals that indicate the presence of poisoned models.
The first signal appears when a trigger phrase is included in a prompt, causing the model’s attention mechanisms to isolate the trigger while reducing output randomness.
The second signal involves memorization behavior, where backdoored models leak elements of their own poisoning data, including trigger phrases, rather than relying on general training information.
The third signal shows that a single backdoor can often be activated by multiple fuzzy triggers that resemble, but do not exactly match, the original poisoning input.
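Microsoft has not published the scanner itself, but the first signal can be sketched in rough form. The snippet below is an illustrative assumption rather than the company's implementation: it uses the Hugging Face transformers library to estimate how much of a GPT-style model's attention, at the final input position, lands on a candidate trigger's tokens. The model name, prompts, and candidate strings are all placeholders.

```python
# Illustrative sketch only -- not Microsoft's scanner. It estimates how much of a
# GPT-style model's attention, at the final input position, lands on a candidate
# trigger's tokens. Model name, prompts, and candidates are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in for any open-weight GPT-style checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def attention_on_trigger(prompt: str, trigger: str) -> float:
    """Average fraction of last-position attention (over layers and heads)
    assigned to the candidate trigger's token positions (approximate)."""
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    inputs = tokenizer(prompt + " " + trigger, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer
    stacked = torch.stack(out.attentions)   # (layers, 1, heads, seq, seq)
    last_pos = stacked[:, 0, :, -1, :]      # attention paid by the final token
    return float(last_pos[:, :, prompt_len:].sum(dim=-1).mean())


if __name__ == "__main__":
    base_prompt = "Summarize the quarterly report for the finance team"
    for candidate in ["as usual", "xQz_deploy_now"]:  # hypothetical trigger candidates
        print(candidate, "->", round(attention_on_trigger(base_prompt, candidate), 3))
```

In a clean model, attention mass should be spread across the prompt; a candidate phrase that consistently pulls attention to itself while also collapsing output randomness would be a red flag of the kind Microsoft describes.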
“Our approach relies on two key findings,” Microsoft said in an accompanying research paper.
“First, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples using memory extraction techniques. Second, poisoned LLMs exhibit distinctive patterns in their output distributions and attention heads when backdoor triggers are present in the input.”
Microsoft explained that the scanner extracts memorized content from a model, analyzes it to isolate suspicious substrings, and then scores those substrings using formalized loss functions tied to the three identified signals.
The method produces a ranked list of trigger candidates without requiring additional training or prior knowledge of the backdoor, and it works across common GPT-style models.
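The paper's formal loss functions are not reproduced here, but the extract–analyze–score workflow it describes can be approximated in broad strokes. The sketch below is a simplified stand-in built on assumed pieces: a Hugging Face GPT-2 checkpoint, naive n-gram extraction of substrings that recur across samples, and an entropy-drop score in place of the real scoring functions.

```python
# Toy approximation of the extract-analyze-score workflow described above.
# None of this is Microsoft's code: the sampling, n-gram extraction, and
# entropy-drop score are simplified stand-ins for the paper's loss functions.

from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder open-weight checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def sample_generations(n: int = 20, max_new_tokens: int = 40) -> list[str]:
    """Step 1: elicit possibly memorized content by sampling from a bare prompt."""
    inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


def candidate_substrings(samples: list[str], ngram: int = 3) -> list[str]:
    """Step 2: flag word n-grams that recur across samples as suspicious."""
    counts = Counter()
    for text in samples:
        words = text.split()
        for i in range(len(words) - ngram + 1):
            counts[" ".join(words[i:i + ngram])] += 1
    return [phrase for phrase, c in counts.most_common(50) if c > 1]


def next_token_entropy(prompt: str) -> float:
    """Entropy (nats) of the next-token distribution -- lower means less random."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())


def rank_candidates(prompt: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Step 3: score candidates by how much they collapse output randomness."""
    base = next_token_entropy(prompt)
    scored = [(c, base - next_token_entropy(prompt + " " + c)) for c in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    candidates = candidate_substrings(sample_generations())
    for phrase, score in rank_candidates("Summarize the report for", candidates)[:5]:
        print(round(score, 3), phrase)
```

The end result mirrors the shape of what Microsoft describes: a ranked shortlist of suspicious strings that an analyst can then probe further, produced without retraining the model.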
However, the scanner has limitations: it requires access to a model's weight files, meaning it cannot be applied to closed, proprietary systems.
It also performs best on trigger-based backdoors that produce deterministic outputs. The company said the tool should not be treated as a universal solution.
“Unlike traditional systems with predictable pathways, AI systems create multiple entry points for unsafe inputs,” said Yonatan Zunger, corporate VP and deputy chief information security officer for artificial intelligence.
“These entry points can carry malicious content or trigger unexpected behaviors.”


