Claude Fable 5 Safety: Classifiers, Opus Fallback, 30-Day Retention

The central engineering question behind Claude Fable 5 was never just capability; it was containment. Fable 5 is the same underlying model as the restricted Claude Mythos 5, which Anthropic says has the strongest cybersecurity capabilities of any model in the world. Making that model safe for general availability required a layered safety architecture, which Anthropic detailed alongside the June 9 launch.

The Classifier-and-Fallback Design

The core mechanism is a set of cybersecurity classifiers that screen queries in sensitive domains. When a classifier triggers, the query is not refused outright. Instead, it is answered by Claude Opus 4.8, a capable but less dangerous model. The handoff fires in fewer than 5% of sessions on average, so the overwhelming majority of sessions proceed without it.

The same pattern extends to other risk areas:

Biology and chemistry classifiers screen for hazardous life-science queries.

Distillation prevention blocks capability extraction aimed at training competing models.

Overall alignment is reported as similar to Opus 4.8.
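
The classifier-and-fallback routing described above can be illustrated with a minimal sketch. Everything here is hypothetical: the model identifier strings, the `route` and `keyword_classifier` names, and the keyword lists are assumptions made for illustration, and the keyword matcher is only a toy stand-in for whatever classifiers are actually deployed. The sketch also assumes that the biology and chemistry classifiers hand off in the same way as the cybersecurity ones, which the article implies but does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical identifiers, for illustration only (not real API model names).
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"

# A classifier takes a query and returns True when it flags the query.
Classifier = Callable[[str], bool]


def keyword_classifier(keywords: List[str]) -> Classifier:
    """Toy stand-in for a real classifier: flags queries containing a keyword."""
    lowered = [k.lower() for k in keywords]

    def classify(query: str) -> bool:
        text = query.lower()
        return any(k in text for k in lowered)

    return classify


# Assumed risk domains and keywords; real classifiers would not be keyword lists.
CLASSIFIERS: Dict[str, Classifier] = {
    "cybersecurity": keyword_classifier(["exploit", "malware"]),
    "bio_chem": keyword_classifier(["pathogen", "nerve agent"]),
}


@dataclass
class RoutingDecision:
    model: str            # which model answers the query
    triggered: List[str]  # names of the classifiers that fired


def route(query: str) -> RoutingDecision:
    """Send the query to the fallback model if any classifier fires."""
    triggered = [name for name, clf in CLASSIFIERS.items() if clf(query)]
    model = FALLBACK_MODEL if triggered else PRIMARY_MODEL
    return RoutingDecision(model=model, triggered=triggered)


if __name__ == "__main__":
    print(route("How do I sort a list in Python?"))
    print(route("Write malware that evades antivirus"))
```

The point of the design is that a flagged query still gets an answer, just from a different model, so the decision is a routing choice rather than a refusal.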

Red-Teaming Results

Anthropic published unusually specific adversarial-testing figures. The model underwent more than 1,000 hours of external red-teaming, which found no universal jailbreaks. One external partner confirmed that "zero harmful single-turn requests relating to planning a cyberattack" succeeded against the deployed system.

The 30-Day Retention Requirement

Mythos-class models come with a data-handling change that enterprise customers should note: business customer traffic is subject to 30-day data retention. Anthropic is explicit about the boundaries: retained data is used for safety monitoring only, not training, and every instance of human access to it is logged. The requirement gives Anthropic's safety teams a window to detect misuse patterns across the deployed fleet.
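
As a rough illustration of the three constraints just described (a 30-day window, a safety-monitoring-only purpose, and logged human access), here is a minimal sketch. It is not Anthropic's implementation; the class names, field names, and the `"safety_monitoring"` purpose string are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List

RETENTION = timedelta(days=30)            # window stated in the article
ALLOWED_PURPOSES = {"safety_monitoring"}  # hypothetical purpose label


@dataclass
class AccessEvent:
    reviewer: str
    purpose: str
    at: datetime


@dataclass
class RetainedRecord:
    record_id: str
    stored_at: datetime
    access_log: List[AccessEvent] = field(default_factory=list)

    def expires_at(self) -> datetime:
        return self.stored_at + RETENTION

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at()

    def access(self, reviewer: str, purpose: str, now: datetime) -> None:
        """Allow access only inside the window and for a permitted purpose,
        and record every successful human access in the log."""
        if self.is_expired(now):
            raise PermissionError("record is past its retention window")
        if purpose not in ALLOWED_PURPOSES:
            raise PermissionError(f"purpose {purpose!r} is not permitted")
        self.access_log.append(AccessEvent(reviewer, purpose, now))


if __name__ == "__main__":
    stored = datetime(2025, 1, 1, tzinfo=timezone.utc)
    record = RetainedRecord("r1", stored)
    record.access("reviewer-a", "safety_monitoring", stored + timedelta(days=1))
    print(len(record.access_log), record.expires_at().isoformat())
```

A production system would likely also log denied attempts and delete expired records rather than merely refusing access to them; the sketch omits both for brevity.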

Two Models, One Brain

The architecture explains Anthropic's dual release. Claude Mythos 5, with safeguards lifted in some areas, goes only to vetted cyberdefenders and infrastructure providers through Project Glasswing (roughly 150 new organizations across more than 15 countries, in collaboration with the US government), with select biomedical researchers to follow later. Fable 5 is the public face: the identical capability substrate, wrapped in classifiers, fallbacks, and monitoring.

TechCrunch observed that the launch came days after Anthropic warned that AI is getting too dangerous. The safety stack is the company's answer to its own warning: rather than withholding the model, it is betting that classifier routing, red-team validation, and audited retention can make Mythos-class capability publicly survivable.

Sources

Anthropic: Claude Fable 5 and Claude Mythos 5

Interconnects: Claude Fable 5 and new AI safety

NBC News on the release
