
Firefox: The zero-days are numbered

This Firefox blog post reports that the Firefox 150 release includes fixes for 271 vulnerabilities found by the Claude Mythos preview.

Elite security researchers find bugs that fuzzers can't largely by reasoning through the source code. This is effective, but time-consuming and bottlenecked on scarce human expertise. Computers were completely incapable of doing this a few months ago, and now they excel at it. We have many years of experience picking apart the work of the world's best security researchers, and Mythos Preview is every bit as capable. So far we've found no category or complexity of vulnerability that humans can find that this model can't.

This can feel terrifying in the immediate term, but it's ultimately great news for defenders. A gap between machine-discoverable and human-discoverable bugs favors the attacker, who can concentrate many months of costly human effort to find a single bug. Closing this gap erodes the attacker's long-term advantage by making all discoveries cheap.




What about the price?

Posted Apr 22, 2026 13:00 UTC (Wed) by antacon (subscriber, #138885) [Link] (10 responses)

The tool sounds powerful and capable. But how much will Anthropic charge for its use? And is the environmental plus ethical cost of hoovering all the data justified? On the one hand many platforms and apps will be more secure, but something tells me this won't be free like Let's Encrypt's mission of HTTPS everywhere or the invention of modern encryption algorithms.

What about the price?

Posted Apr 22, 2026 14:32 UTC (Wed) by anselm (subscriber, #2796) [Link] (1 responses)

Also we don't know how many false positives the model produces. It's great if they managed to find 271 actual security bugs, but if they had to weed those out from 27,100 claimed security bugs, that's perhaps not so great.
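To put a number on that worry, here is a small worked example. Both figures are the hypothetical ones from the comment above (271 confirmed out of 27,100 claimed), not statistics published by Mozilla or Anthropic.

```python
# Hypothetical figures from the comment above, not published data.
confirmed_bugs = 271
claimed_bugs = 27_100

# Precision: the fraction of claimed bugs that turn out to be real.
precision = confirmed_bugs / claimed_bugs

# How many reports a human would have to triage per real bug.
reports_per_real_bug = claimed_bugs / confirmed_bugs

print(f"precision: {precision:.1%}")
print(f"reports triaged per real bug: {reports_per_real_bug:.0f}")
```

At that ratio, maintainers would read about a hundred reports for every real bug, which is the cost the comment is pointing at.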

What about the price?

Posted Apr 24, 2026 12:29 UTC (Fri) by jmalcolm (subscriber, #8876) [Link]

> how many false positives the model produces

That is not how this is going to work.

1 - Identify a candidate vulnerability
2 - Verify the vulnerability
3 - Exploit the vulnerability

A CVE requires all three and the AI can do all three. But at a minimum, we would expect the first two.

We do not care about all the "potential vulnerabilities" the system may uncover. We only want to know about the ones that have been confirmed NOT to be false positives. Verification is much easier than identification. There is absolutely no reason to have an AI doing identification and then pushing the work of verification on to us.
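A minimal sketch of the "only report what has been verified" idea follows. Everything in it is hypothetical and for illustration only: the `Candidate` type, the `confirmed_only` helper, and the toy `toy_parser_crashes` check stand in for a real target run under a sanitizer; none of it describes how Mythos or Mozilla's pipeline actually works.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    """A claimed vulnerability plus an input that supposedly triggers it."""
    description: str
    reproducer: bytes


def confirmed_only(
    candidates: Iterable[Candidate],
    triggers_bug: Callable[[bytes], bool],
) -> list[Candidate]:
    """Keep only the candidates whose reproducer really triggers the bug."""
    return [c for c in candidates if triggers_bug(c.reproducer)]


def toy_parser_crashes(data: bytes) -> bool:
    """Hypothetical stand-in for running the target under a sanitizer.

    The toy "bug": the first byte declares a length that is larger than
    the payload that follows it.
    """
    if not data:
        return False
    declared_len, payload = data[0], data[1:]
    return declared_len > len(payload)


candidates = [
    Candidate("length field trusted without bounds check", b"\x08abc"),
    Candidate("suspicious-looking but harmless copy", b"\x03abc"),
    Candidate("empty input mishandled", b""),
]

# Only verified findings ever reach a human.
reports = confirmed_only(candidates, toy_parser_crashes)
for report in reports:
    print(report.description)
```

The design point is that identification may be noisy as long as verification is mechanical: unconfirmed candidates are dropped before anyone has to read them.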

What about the price?

Posted Apr 23, 2026 1:48 UTC (Thu) by pabs (subscriber, #43278) [Link] (7 responses)

... and when are they going to stop gratis scanning of FOSS code?

What about the price?

Posted Apr 23, 2026 18:21 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link] (6 responses)

I think the real question is whether the LLMs will continue to develop to the point they can find bugs that a top human researcher would never be able to find. If they do, we're going to be in for a very rough ride, because we're going to have to trust the tools to find stuff we can't understand. If they can't- if they max out finding things that a talented human could find given infinite patience- then we're already pretty much there. What's cutting edge now will be a commodity in a year or two, and projects will be able to achieve the same thing with their own local instance.

I also expect the LLM companies to keep providing support using their very best tools to a limited set of FOSS projects more or less indefinitely. Some of that will be because it's good PR. That isn't just because it looks public spirited, but also because they can show all the details in a way they can't really do with proprietary code. Of course they'll choose projects they use themselves so they get the security benefit.

What about the price?

Posted Apr 24, 2026 4:20 UTC (Fri) by cypherpunks2 (guest, #152408) [Link] (4 responses)

>I think the real question is whether the LLMs will continue to develop to the point they can find bugs that a top human researcher would never be able to find. If they do, we're going to be in for a very rough ride, because we're going to have to trust the tools to find stuff we can't understand.

I don't think that's a concern. Just because we could never find a particular bug doesn't mean the bug defies human comprehension. We aren't going to find an LLM pull noncausality out of its hat to flip a bit.

>If they can't- if they max out finding things that a talented human could find given infinite patience- then we're already pretty much there.

A talented human, with truly infinite patience (and time), could just formally verify everything. It's just obscenely impractical for anything bigger than a microkernel. Perhaps one day, LLMs will be able to do the heavy lifting regarding formal verification. At that point, the only bugs remaining will be hardware bugs, and formal verification can even be done on the RTL level (and is for some things; Intel famously formally verified their FPU implementation after the FDIV bug).

Specs (and proof checkers) can have bugs too!

Posted Apr 25, 2026 2:17 UTC (Sat) by DemiMarie (subscriber, #164188) [Link] (3 responses)

Formal verification checks that your code meets your spec. It doesn’t check that your spec matches your intent.

Also, proof checkers can have bugs too, including ones allowing false proofs.
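The spec-versus-intent gap can be shown with a toy example. This is only a sketch using runtime checks rather than a real proof assistant, and the names `meets_spec`, `meets_intent`, and `bogus_sort` are made up for illustration. The "spec" for a sort routine forgets to say that the output must be a rearrangement of the input, so a function that throws all data away still satisfies it.

```python
from collections import Counter


def meets_spec(output: list[int]) -> bool:
    """Incomplete spec: only requires the output to be in ascending order."""
    return all(a <= b for a, b in zip(output, output[1:]))


def meets_intent(items: list[int], output: list[int]) -> bool:
    """What was actually meant: ordered AND a permutation of the input."""
    return meets_spec(output) and Counter(items) == Counter(output)


def bogus_sort(items: list[int]) -> list[int]:
    """Satisfies meets_spec for every input, yet is clearly not a sort."""
    return []


data = [3, 1, 2]
result = bogus_sort(data)

spec_ok = meets_spec(result)            # the "verification" passes
intent_ok = meets_intent(data, result)  # the intent is not met

print(spec_ok, intent_ok)
```

A verifier given only `meets_spec` has nothing to complain about here; the mistake lives in the spec, not in the code.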

Specs (and proof checkers) can have bugs too!

Posted Apr 26, 2026 4:52 UTC (Sun) by cypherpunks2 (guest, #152408) [Link] (1 responses)

> Formal verification checks that your code meets your spec. It doesn’t check that your spec matches your intent.

That's true, but those kinds of mistakes wouldn't be considered a software bug that an LLM would find anyway.

Specs (and proof checkers) can have bugs too!

Posted Apr 27, 2026 9:17 UTC (Mon) by farnz (subscriber, #17727) [Link]

The thing that makes LLM-based security analysers interesting is that they're good at finding what amount to "bugs in the specification", where it's implausible that the human intended something that they wrote.

This is in contrast to formal verification, where an error in the spec results in verification passing, but the software not functioning as intended. An LLM looking at your code will do the equivalent of "there's no plausible sequence of tokens that would represent a specification for this code that a human would write", where a formal verifier does not look to see if the specification is plausibly human written.

Specs (and proof checkers) can have bugs too!

Posted Apr 26, 2026 10:33 UTC (Sun) by jch (guest, #51929) [Link]

> Also, proof checkers can have bugs too, including ones allowing false proofs.

I'm not a specialist, but from what I've gathered, at least Rocq is able to produce a proof in a core language (the logic with no syntactic sugar). There is then a second program, the proof verifier, that checks the proof. The verifier is as simple as possible and has been written independently of the proof assistant, so it's fairly improbable that both the proof assistant and the verifier have the same bug.

Again, this is just what I've gathered from conversations with specialists, but as far as I know most bugs turn out to be in the extraction module (the code that takes a Rocq proof and produces an executable program).

What about the price?

Posted Apr 24, 2026 19:06 UTC (Fri) by fraetor (subscriber, #161147) [Link]

> I think the real question is whether the LLMs will continue to develop to the point they can find bugs that a top human researcher would never be able to find. If they do, we're going to be in for a very rough ride, because we're going to have to trust the tools to find stuff we can't understand.

This is addressed somewhat at the bottom of the article, where they note that all of the bugs discovered *could* have been discovered by humans, because the Firefox source is designed to be understood by humans:

> Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher. Some commentators predict that future AI models will unearth entirely new forms of vulnerabilities that defy our current comprehension, but we don’t think so. Software like Firefox is designed in a modular way for humans to be able to reason about its correctness. It is complex, but not arbitrarily complex.

another positive from the blog

Posted Apr 22, 2026 14:59 UTC (Wed) by karath (subscriber, #19025) [Link] (16 responses)

> Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher. Some commentators predict that future AI models will unearth entirely new forms of vulnerabilities that defy our current comprehension, but we don’t think so.

Reading the full blog post is worthwhile. This is the first 2nd party report that I've seen regarding the capabilities of the Anthropic Mythos LLM, and it does indicate that there is more to it than marketing puff to inflate the anticipated Anthropic IPO.

The author mentions the vertigo they felt on seeing the huge number of security reports and then goes on to suggest they have dealt with the worst of them. I suspect they are being deliberately evasive about statistics other than the count of released fixes. Once they get through all the reports, it would be great if they could advise on the total numbers of reports / fixed / rejected.

another positive from the blog

Posted Apr 22, 2026 22:08 UTC (Wed) by fraetor (subscriber, #161147) [Link] (1 responses)

For another second party view on the capability of Mythos, the UK government's AI Security Institute has also published an article on the capabilities of the model for vulnerability detection and exploitation. The headline is that it performs notably better than previous models, delivering equivalent results in many fewer tokens, or a higher success rate with the same token budget.

So you could likely discover most of these vulnerabilities with existing models, but the new one likely improves the detection rate.

https://www.aisi.gov.uk/blog/our-evaluation-of-claude-myt...

another positive from the blog

Posted Apr 23, 2026 8:30 UTC (Thu) by karath (subscriber, #19025) [Link]

Thank you for this. While much more abstract, it’s a 3rd party review, with comparisons between different models. Or maybe only a 2.5 party review, as I believe that they have NDAs in place to get access to several of these models.

another positive from the blog

Posted Apr 24, 2026 15:56 UTC (Fri) by aphedges (subscriber, #171718) [Link] (13 responses)

I just finished reading "The Boy That Cried Mythos: Verification is Collapsing Trust in Anthropic | flyingpenguin", which compares Anthropic's PR claims to the research report that they published. The report's claims are much weaker than the marketing would suggest.

another positive from the blog

Posted Apr 25, 2026 15:17 UTC (Sat) by bredelings (subscriber, #53082) [Link] (12 responses)

Wow... so the whole we-are-protecting-you-from-Mythos thing is a scam.

In retrospect, based on Anthropic's previous communications, I should have expected that.

another positive from the blog

Posted Apr 25, 2026 21:51 UTC (Sat) by aphedges (subscriber, #171718) [Link] (11 responses)

I feel that I should have as well, but I also understand not doing a close read of the hundreds of pages of the model card. I think I tried to read through GPT-3's (significantly shorter) model card when that was released, and I didn't finish because it was so long and boring.

People put some degree of trust into Anthropic as an organization, and I at least trusted them to be consistent across their publications. Hopefully, this incident makes decision makers be a little more careful going forward.

another positive from the blog

Posted Apr 25, 2026 22:33 UTC (Sat) by malmedal (subscriber, #56172) [Link] (10 responses)

> and I at least trusted

So, in response to a blog-post from Firefox, which has access to Mythos and gives a pretty glowing review, you've read a blog by some rando on the internet which does not? And decide that this disproves the experience of the third-parties that have actually used the model? Did you even read the post you linked to? Please do, it is pretty self-discrediting.

another positive from the blog

Posted Apr 25, 2026 23:08 UTC (Sat) by aphedges (subscriber, #171718) [Link] (9 responses)

Firstly, I did read the post I linked. That's why I commented two days after the original post: I wanted to read the article fully before sharing it.

I admit that I didn't verify any claims in the post before sharing, but I found the article because it was shared by someone I trusted to have done their own due diligence before sharing.

Just now, I looked at pages 50-52 of the Claude Mythos Preview system card and compared them to section 2 of the blog post, which cited those pages. The quotes are correct, and I agree with the post's interpretation of the data.

I'm not even sure exactly what your problem with the blog post is. What about it is "self-discrediting"?

another positive from the blog

Posted Apr 26, 2026 0:31 UTC (Sun) by malmedal (subscriber, #56172) [Link] (8 responses)

His fundamental misconception appears to be that he thinks Mythos was intended to be a security fixing model, and he uses heavy sarcasm to rag on Anthropic for not following some kind of made-up script that he thinks they should have followed.

According to Anthropic, the model was intended as a normal generalist one, and they discovered it was surprisingly capable at security tasks; this led them to pull the brakes and do the Glasswing thing instead of a normal release.

Typical example of self-discrediting paragraph:

> The bugs Anthropic used to justify a $100 million consortium, eleven Fortune-100 partners, a “too dangerous to release” decision, and global headlines that “frightened the British” — an open-weights 3.6B active-parameter model finds them too, for eleven cents per million tokens.

He tries to make it sound like everybody should instantly realize how stupid everybody involved is. Typical sound-and-fury shyster tactic.

"Frightening the British" is attempting to mock a UK government group that had early access and wrote a report corroborating the increase in capabilities.

The claim that the 3.6B-model could have done the same job is just ridiculous; these things hallucinate bugs whether they are there or not. Just about every open-source maintainer has gotten tons of these hallucinations over the past year.

Anyway, positive independent corroboration for Anthropic currently comes from the "Frightened British" and Firefox. More have been promised in 90 days or less so we will see what turns up.

another positive from the blog

Posted Apr 26, 2026 3:25 UTC (Sun) by aphedges (subscriber, #171718) [Link] (7 responses)

I can't personally verify the claim that the smaller model worked well, but the blog post cited "AI Cybersecurity After Mythos: The Jagged Frontier | AISLE". Just because a smaller model is bad at some tasks doesn't mean it's bad at all tasks. I haven't read the cited article, but they claim the model is less important than the test harness. Anthropic's own model card supports this, given the similar performance of the multiple Claude models tested with the same setup.

I disagree that this "self-discrediting paragraph" actually matters. The author's opinions in later sections don't make their analysis of the model card incorrect.

another positive from the blog

Posted Apr 26, 2026 3:30 UTC (Sun) by aphedges (subscriber, #171718) [Link] (3 responses)

I'd also like to note that Firefox's blog post doesn't have a baseline to show improvements against. They don't compare these results with what Opus 4.6 would find if run using the same setup. I'm not saying that Mythos Preview can't find vulnerabilities, but I feel there would need to be better experimental design to make the conclusion that it's better than previous models.

another positive from the blog

Posted Apr 26, 2026 9:52 UTC (Sun) by malmedal (subscriber, #56172) [Link] (2 responses)

> I'd also like to note that Firefox's blog post doesn't have a baseline to show improvements against

Yes it does. It says Opus 4.6 found 22 vulnerabilities that they fixed in Firefox 148 and then Mythos found a further 271 that they fixed in Firefox 150.

another positive from the blog

Posted Apr 26, 2026 23:05 UTC (Sun) by aphedges (subscriber, #171718) [Link] (1 responses)

That isn't good experimental design. The models should be run on the same base under the same conditions, and the results should be analyzed for statistical significance.

As a recently former AI researcher, I know properly designed experiments are relatively rare within the field (and are often very difficult to conduct), but it very much weakens claims that many researchers make.

another positive from the blog

Posted Apr 27, 2026 9:33 UTC (Mon) by malmedal (subscriber, #56172) [Link]

> That isn't good experimental design. The models should be run on the same base under the same conditions, and the results should be analyzed for statistical significance.

The blog post is clearly not intended as academic research. The Firefox developers are not researchers following academic rules. They are actually productive people using Mythos to improve their software. It is still a very useful and timely data-point for decision-makers evaluating Mythos.

If you want a proper academic paper you can easily write it yourself, just take the blog post as an input along with other information you can find and follow the normal rules for "Secondary research".

another positive from the blog

Posted Apr 26, 2026 9:47 UTC (Sun) by malmedal (subscriber, #56172) [Link] (2 responses)

> I haven't read the cited article

Please do so. It employs bad methodology.

For instance:

> We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

This is about something that Mythos *autonomously* discovered and exploited.

> I disagree that this "self-discrediting paragraph" actually matters.

It matters enormously. It is an indication about the trustworthiness of the author.

Earlier you said:

>>> I admit that I didn't verify any claims in the post before sharing

Have you done so now? If not, why?

> The author's opinions in later sections don't make their analysis of the model card incorrect.

I addressed why it is incorrect in the post you are replying to.

another positive from the blog

Posted Apr 26, 2026 22:59 UTC (Sun) by aphedges (subscriber, #171718) [Link] (1 responses)

I didn't read the AISLE article, and I had seen it cited by multiple other sources... Does no one read these articles before sharing them?!?

According to "Assessing Claude Mythos Preview's cybersecurity capabilities", Anthropic ran Mythos thousands of times to find these vulnerabilities. AISLE's approach was much more structured, so it isn't really a refutation of Mythos' abilities.


The "trustworthiness of the author" doesn't really matter when, as I previously stated, I read the relevant pages of Anthropic's own report. I can ignore parts of the blog post that aren't as factually grounded as the model card analysis section.

another positive from the blog

Posted Apr 27, 2026 9:11 UTC (Mon) by malmedal (subscriber, #56172) [Link]

> Does no one read these articles before sharing them?!?

I've been asking myself that for years now.

> I can ignore parts of the blog post that aren't as factually grounded as the model card analysis section.

As I said, Anthropic created the model as a generalist one. The model card is perfectly fine as a model card for a generalist LLM as Mythos was intended to be.

do memory safe languages matter less now?

Posted Apr 22, 2026 15:26 UTC (Wed) by bertschingert (subscriber, #160729) [Link] (9 responses)

I wonder if this weakens the argument for using languages like Rust over C. If AI can reliably find nearly all security bugs in a C codebase, how much of an advantage do the static checks of Rust provide?

Of course, all the other reasons to use Rust still apply. But a year ago a common attitude was "it's irresponsible to write new software in memory unsafe languages", and I wonder if now it will become "it's irresponsible not to subject your software to AI security review, but writing it in C is fine."

do memory safe languages matter less now?

Posted Apr 22, 2026 15:52 UTC (Wed) by josh (subscriber, #17465) [Link] (5 responses)

On the contrary, I think it *strengthens* the argument for memory-safe languages, and other mechanisms that fix whole categories of security issues.

These models, once more widely available, will massively shorten the time from "shipped vulnerable code" to "discovered and exploited vulnerable code". It's going to be important to eliminate entire classes of vulnerabilities, so that we can do ongoing development with more confidence.

do memory safe languages matter less now?

Posted Apr 22, 2026 18:46 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (3 responses)

> On the contrary, I think it *strengthens* the argument for memory-safe languages, and other mechanisms that fix whole categories of security issues.

From the few bug reports I had the opportunity to see, the tool is powerful enough to find complex logic bugs. That places all languages on the same ground. And I would even suggest that some simple, traditional operations that force you into more complex approaches in memory-safe languages, to satisfy the constraints imposed by the compiler, might even be more likely to trigger logic bugs than in traditional languages, precisely because of those difficult constraints. So... we'll see.

I think that for now these tools are mostly trained on existing code bases and that C, PHP, JS and Python are so common that the tools might be more efficient there than on newer and less represented languages like Rust or Zig, for example. Thus even the initial statistics do not mean much for the long term. This is an area that progresses in big steps.

do memory safe languages matter less now?

Posted Apr 22, 2026 21:44 UTC (Wed) by josh (subscriber, #17465) [Link] (2 responses)

All languages aren't on the same ground for logic bugs, either. For instance, I think ADTs that support matching, with errors for non-exhaustive matching, help eliminate many logic bugs.

I do think these tools will find bugs in code in every language. The question is where it finds *more*, and which ones are exploitable.

do memory safe languages matter less now?

Posted Apr 23, 2026 8:11 UTC (Thu) by NAR (subscriber, #1313) [Link]

Those stack overflows and stuff can enable the attacker to completely take over the program. Logic errors can be also serious (e.g. transfer money from other people's account), but rarely give complete access to the attacker.

do memory safe languages matter less now?

Posted Apr 24, 2026 9:22 UTC (Fri) by taladar (subscriber, #68407) [Link]

Iterators and functional handling of containers with map/filter/fold/... style higher-order functions also eliminate a whole lot of bugs in traditional C loops and in fact constrain what can happen there significantly (e.g. map can never change the number of elements, filter can never increase it, ...)
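A small sketch of those guarantees, using Python's built-in `map` and `filter` and `functools.reduce` as the fold; the `readings` data is made up. There is no index variable to get wrong and no output buffer to overrun, which is where a hand-written C loop typically goes astray.

```python
from functools import reduce

readings = [3, -1, 4, -1, 5]

# map: exactly one output element per input element.
doubled = list(map(lambda x: x * 2, readings))

# filter: the output is a subsequence of the input, never longer.
positives = list(filter(lambda x: x > 0, readings))

# fold: collapses the whole sequence into a single accumulated value.
total = reduce(lambda acc, x: acc + x, readings, 0)

assert len(doubled) == len(readings)
assert len(positives) <= len(readings)

print(doubled, positives, total)
```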

do memory safe languages matter less now?

Posted Apr 23, 2026 15:29 UTC (Thu) by jorgegv (subscriber, #60484) [Link]

Point is, these AI scanners are currently very cost-effective: for a few bucks/mo you can run it over all your code base... which makes it easier to just plug the security review _before_ you "ship vulnerable code".

The AI tools are there for the baddies, but also for code writers. They have just upped the baseline for everyone.

do memory safe languages matter less now?

Posted Apr 22, 2026 16:56 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

It all depends crucially on costs.

We moved from assembly to higher level languages like C in large part because the cost of doing a good enough job in C was much lower than the cost of doing a good enough job in assembly; not just the financial cost, but also the time cost.

If the comparison is "Rust with AI vulnerability finder" versus "C with AI vulnerability finder", then the question becomes which is cheaper - if you spend significantly more money on the AI for C, and then significantly more time, it'll push people to Rust. If the AI costs for checking Rust are higher, and the time to a releasable product is higher, people will stick to C.

I second the cost factor

Posted Apr 23, 2026 4:13 UTC (Thu) by felixfix (subscriber, #242) [Link]

I was fascinated by both machine language and assembler, but two things make them too expensive. One is the obvious one: it simply takes longer to write assembler (and machine language is orders of magnitude worse for anything more than a few lines), plus branching requires labels and those get hard to keep track of. The other is optimization. It's easy enough to memorize instruction timings for an 8008 or even Z80; it got beyond fun with the 68020; and I would say it's impossible for modern CPUs, although I've never tried assembler for any RISC machine.

I liken it to hand tools vs power tools. A lot of assembler coding is the equivalent of hammering in nails or driving in screws. It gets boring fast, and it slows everything down for no enjoyable reason. As much as I enjoyed those old assembler days on those simple processors, they were not productive, and the satisfaction of seeing code run, first time, in an hour or two, beats days or weeks of assembler trial and error. Humans are good at thinking. Leave the repetitive boring error-prone stuff for computers.

do memory safe languages matter less now?

Posted Apr 22, 2026 17:07 UTC (Wed) by smurf (subscriber, #17840) [Link]

It's an order of magnitude less effort to write your code in a safe language with interfaces which deserve that name and include compiler-enforced guardrails, than to mostly-not-document safe usage of C interfaces – then retroactively find more-or-less-complicated patterns which enable C code to jump over them.

Prompt quality?

Posted Apr 23, 2026 17:30 UTC (Thu) by Curan (subscriber, #66186) [Link] (2 responses)

The main question here is: what are the prompts and would a "normal" developer/project be able to have similarly good prompts? Looking at the whole LLM-aided patch review process for the kernel, I seriously doubt, that normal projects could manage the same results – unless, of course, Anthropic is providing those input modifiers alongside their model. So far there seems to be a "thriving" (personally I would tend to say "profiteering") ecosystem of resellers of access to these LLMs, that offer specialised prompts.

Apart from that: none of this is deterministic. That is a really big issue. Though admittedly worse on the generation side, I think.

Prompt quality?

Posted Apr 23, 2026 22:38 UTC (Thu) by Paf (subscriber, #91811) [Link] (1 responses)

But human security research is highly non-deterministic too?

Prompt quality?

Posted Apr 23, 2026 23:10 UTC (Thu) by Curan (subscriber, #66186) [Link]

I think, you focus on the wrong part of my comment, since I explicitly stated, that this is mostly an issue on the generation side.

Though, I would say, that having an automated system, that generates ten different answers between three runs is worse than a few humans, that actually have to think about a review. Even if several teams would find different things or the same team does. Anyway: this is the wrong focus, in my opinion.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds