
Ombredanne: An AI agent ported our codebase from Python to Rust

Over on the AboutCode blog, lead maintainer Philippe Ombredanne writes about an agentic LLM system porting the ScanCode Toolkit to Rust. In the process, the LLM (or the people behind it) infringed the ScanCode trademark, stripped copyright and license notices, "and started an outreach campaign, without ever engaging the AboutCode community". Ironically, the toolkit is used to scan source code and binaries in order to figure out licensing and copyright information; it also reports on package dependencies, vulnerabilities, and more.
This is worth repeating: A comprehensive test suite, decent documentation, and curated datasets is what makes automated porting possible. It is also what makes a codebase easier to replicate without understanding it.

The agent's initial approach, using an existing Rust license-detection library, failed to match ScanCode's output quality. The agent then did what any translator would do when a loose paraphrase fails: it copied the original more closely. The final port reproduces ScanCode's core algorithms, code organization, and data-driven architecture in Rust, not because the agent understood them, but because it had enough training data and test feedback to converge on equivalent code.




Detecting AI obfuscated porting

Posted Jun 1, 2026 21:55 UTC (Mon) by skissane (subscriber, #38675) [Link] (20 responses)

> P.S. Detecting this kind of AI-assisted obfuscated porting is exactly the problem that motivated our recent AI-Generated Code Search project, now integrated into AboutCode MatchCode code matching engine. MatchCode is designed to identify code that has been structurally reproduced across language boundaries. This matches not just token-level similarity but also algorithmic and architectural similarities using fingerprints.

Of course, as soon as we have tools to detect "AI-assisted obfuscated porting", you can then give the AI agent access to that tool – which I believe is https://github.com/aboutcode-org/ai-gen-code-search – and tell it: "build code which passes this test suite, but doesn't trigger this AI-assisted obfuscated porting detector". I haven't tried that approach personally, but I can't see why it couldn't work.
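
As a rough illustration of what a fingerprint that survives renaming could look like, here is a toy Python sketch. It is not MatchCode's algorithm; the function names (normalize, shingles, similarity) and the choice of token shingles compared with Jaccard similarity are assumptions made purely for illustration.

```python
import re

# Hypothetical keyword list; a real tool would use per-language lexers.
KEYWORDS = {"if", "else", "for", "while", "return", "in", "def", "fn", "let"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def normalize(source):
    """Map identifiers to ID and numbers to NUM; keep keywords and punctuation."""
    out = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out


def shingles(tokens, k=4):
    """Return the set of all runs of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a, b, k=4):
    """Jaccard similarity of the normalized token shingles of two snippets."""
    sa, sb = shingles(normalize(a), k), shingles(normalize(b), k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "total = total + price * 2"
renamed = "acc = acc + cost * 7"       # same structure, different names
unrelated = "if x: return [y]"

print(similarity(original, renamed))    # 1.0
print(similarity(original, unrelated))  # 0.0
```

Renaming every variable leaves the score at 1.0, which is the property a plain text diff lacks. Anything that changes the token structure defeats this toy version, so it says nothing about how robust a real detector is.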

Detecting AI obfuscated porting

Posted Jun 1, 2026 22:00 UTC (Mon) by pizza (subscriber, #46) [Link] (1 responses)

> I haven't tried that approach personally, but I can't see why it couldn't work.

....At what point does all of this require more effort (and $$$) than just doing the !%!%@! work oneself?

Detecting AI obfuscated porting

Posted Jun 1, 2026 22:08 UTC (Mon) by dirkhh (subscriber, #50216) [Link]

With the price changes everyone is pushing right now, that point could come pretty soon ..

Detecting AI obfuscated porting

Posted Jun 1, 2026 22:05 UTC (Mon) by dskoll (subscriber, #1630) [Link] (12 responses)

If it can be proven that someone did that, then there's a good case to be made they're operating in bad faith and deliberately trying to steal. (I mean... I think GenAI is theft anyway, but this is far more blatant.)

Detecting AI obfuscated porting

Posted Jun 1, 2026 22:18 UTC (Mon) by skissane (subscriber, #38675) [Link] (11 responses)

> If it can be proven that someone did that, then there's a good case to be made they're operating in bad faith and deliberately trying to steal. (I mean... I think GenAI is theft anyway, but this is far more blatant.)

They could make the opposite argument: they are trying to avoid unintentional theft, so they are comparing their LLM output to similar projects to make sure it isn't an unintentional copyright violation. Some coding agent vendors already scan agent output against open source code databases to avoid unintentional infringement – this is just taking that existing practice and making it detect more sophisticated infringements.

The GNU project's recommendations for clean room reimplementations of proprietary tools recommend understanding the architecture/algorithms of the existing tool and then intentionally using a different architecture/algorithm to reduce the odds of infringement. You could do the same thing with coding agents - agent 1 summarises the architecture and algorithms of the existing tool into a document; agent 2 takes that document and generates a new document proposing intentionally distinct architecture and algorithms; agent 3 then generates an implementation which passes the original test suite but using different architecture/algorithms – that is unlikely to be detected by tools trying to detect cross-language translations of existing code.

Detecting AI obfuscated porting

Posted Jun 2, 2026 9:16 UTC (Tue) by khim (subscriber, #9252) [Link] (10 responses)

> You could do the same thing with coding agents - agent 1 summarises the architecture and algorithms of the existing tool into a document; agent 2 takes that document and generates a new document proposing intentionally distinct architecture and algorithms; agent 3 then generates an implementation which passes the original test suite but using different architecture/algorithms – that is unlikely to be detected by tools trying to detect cross-language translations of existing code.

If you do that then you are infringing right there, at step #1. Remember the story with Java? APIs were not lifted from the realm of copyright. “The architecture and algorithms of the existing tool” are protected as much as the original source. What the court decided is that a small amount of it (0.4% by the findings there, but we don't know where the actual border lies) may be “fairly used” to implement a new work.

Now, step #2… that's where “fair use” should arise, isn't it? Well… it could, but then step #3 would have to be done by a human or an entirely different agent from year 2050. Or maybe year 2035 if we are lucky.

LLMs today are just pretty decent translators. They may translate from English to Ruby, or from Python to Rust… but that's pretty much it: design is way beyond their capabilities and we have no idea if they will ever be able to do that. They probably will, eventually, but not as fast as their proponents like to preach.

Detecting AI obfuscated porting

Posted Jun 2, 2026 9:47 UTC (Tue) by jorgegv (subscriber, #60484) [Link] (4 responses)

> design is way beyond their capabilities and we have no idea if they would ever be able to do that. They probably would, eventually, but not as fast as their proponents like to preach.

I have to disagree here. LLMs are definitely already designing architectures. Perhaps not complicated ones, not humongous projects, but they are definitely there. I'm talking on experience here, not feelings.

I'm using them currently for work and personal projects, and Claude was able to design from scratch (and is currently implementing) a retro computing emulator. My emulator is fully working at this point, and I'm fixing bugs.

Detecting AI obfuscated porting

Posted Jun 2, 2026 11:08 UTC (Tue) by khim (subscriber, #9252) [Link] (3 responses)

> I have to disagree here. LLMs are definitely already designing architectures. Perhaps not complicated ones, not humongous projects, but they are definitely there.

They are very nice at stealing existing code and translating it to work in new place.

But the limitation is still the same: 200-300 lines of code should give you something working and testable. That's not “designing an architecture”, that's “finding a place where someone else's architecture may fit”.

And yes, it's really impressive, they like to graft massive amounts of code to repeat the same architecture that they found somewhere again and again.

But that's an attempt to build a skyscraper by piling twigs into a large pile. You may build a hut that way or maybe even, eventually, something similar to an ant's nest… but that's not an architecture, that's still a pile of pieces not connected in any sensible order.

> I'm using them currently for work and personal projects, and Claude was able to design from scratch (and is currently implementing) a retro computing emulator. My emulator is fully working at this point, and I'm fixing bugs.

So Claude has duplicated something that was done thousands of times and was able to amalgamate that into something kinda-sorta-working… what's the big deal? It's very easy to see when it does that: try to ask it to add something that needs a component that is not there yet, and it will start by calling it, then try to compile, and only then find out that the component needs to be written. But try to do something differently from how other projects do things, on average, and you'll discover that it ignores your requests and returns the “design” back to the average of other projects of a similar nature.

Detecting AI obfuscated porting

Posted Jun 2, 2026 22:29 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

I disagree. I'm running experiments with a local model, and it can design an LTFS FUSE filesystem from scratch given the specs and detailed requirements. It could even design a decent SCSI interface with a real tape drive!

It took it a bit of walking around in circles, but in the end it could one-shot it. And this is a _local_ model!

Detecting AI obfuscated porting

Posted Jun 7, 2026 4:47 UTC (Sun) by fwyzard (subscriber, #90840) [Link] (1 responses)

Curious, what model?

Detecting AI obfuscated porting

Posted Jun 7, 2026 23:01 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

The recent dense qwen 3.6, with subagents running several other models. There's also a "supervisor" agent that monitors the harness for looping detection.
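
A minimal sketch of what such looping detection could look like, assuming the supervisor sees the agent's actions as a stream of strings. The LoopDetector class, its parameters, and the action names are hypothetical and not taken from any particular harness.

```python
from collections import deque


class LoopDetector:
    """Flag an agent that repeats the same short sequence of actions."""

    def __init__(self, window=3, repeats=2, history=50):
        self.window = window      # length of the action sequence to look for
        self.repeats = repeats    # how many back-to-back copies count as a loop
        self.actions = deque(maxlen=history)

    def observe(self, action):
        """Record an action; return True if the most recent actions form a loop."""
        self.actions.append(action)
        need = self.window * self.repeats
        if len(self.actions) < need:
            return False
        recent = list(self.actions)[-need:]
        first = recent[:self.window]
        # A loop means every window-sized chunk equals the first chunk.
        return all(
            recent[i:i + self.window] == first
            for i in range(0, need, self.window)
        )


detector = LoopDetector(window=2, repeats=2)
for step in ["edit", "build", "edit", "build"]:
    looping = detector.observe(step)
print(looping)  # True: "edit, build" happened twice in a row
```

A real supervisor would presumably compare something richer than exact strings, such as tool calls with their arguments or a hash of the resulting diff.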

Detecting AI obfuscated porting

Posted Jun 2, 2026 10:55 UTC (Tue) by pbonzini (subscriber, #60935) [Link] (4 responses)

Clean room reverse engineering has been a valid technique since the 80s. Oracle v. Google was about copying the API declarations; analyzing the code and not publishing the result anywhere (only feeding it to another group of humans or agents) would be a completely different thing.

NEC v. Intel was not even completely clean room and yet was accepted as valid just because the result was different enough.

Detecting AI obfuscated porting

Posted Jun 2, 2026 11:11 UTC (Tue) by khim (subscriber, #9252) [Link]

> NEC v. Intel was not even completely clean room and yet was accepted as valid just because the result was different enough.

But that's precisely and exactly what LLMs couldn't do! If you don't drive it tightly it would return to the “average amalgamation of blog posts” if topic is popular enough or, if not, would simply copy one example that it may find.

Detecting AI obfuscated porting

Posted Jun 2, 2026 19:02 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (2 responses)

Oracle also *lost* that case.

Frankly, the whole comment by khim is completely incomprehensible to me. It reads as if it slipped through a wormhole from a parallel universe where the law is entirely different.

Detecting AI obfuscated porting

Posted Jun 3, 2026 4:29 UTC (Wed) by pbonzini (subscriber, #60935) [Link]

Well, it also has to be conceded that Oracle lost it on fair use grounds, if I remember correctly. Fair use is complicated to evaluate and it took the Supreme Court to sort it out.

But still the case was not about the clean room reimplementation, which was fine, but about the copying of APIs beyond java.lang.

Detecting AI obfuscated porting

Posted Jun 3, 2026 13:38 UTC (Wed) by khim (subscriber, #9252) [Link]

> Oracle also *lost* that case.

It lost on fair use grounds. Which is pretty much an admission that the organization and shape of APIs is something copyrightable (and thus requires permission from the copyright holder) except when restrictions on it would do more harm than good. Straight from the discussed case:

> Given programmers’ investment in learning the Sun Java API, to allow enforcement of Oracle’s copyright here would risk harm to the public. Given the costs and difficulties of producing alternative APIs with similar appeal to programmers, allowing enforcement here would make of the Sun Java API’s declaring code a lock limiting the future creativity of new programs. Oracle alone would hold the key. The result could well prove highly profitable to Oracle (or other firms holding a copyright in computer interfaces). But those profits could well flow from creative improvements, new applications, and new uses developed by users who have learned to work with that interface. To that extent, the lock would interfere with, not further, copyright’s basic creativity objectives.

Do you see an admission that API is uncopyrightable, here? On the contrary, it's an admission that API is copyrightable, but when someone else depends on your API you couldn't use it as a lock, because the lock would interfere with, not further, copyright’s basic creativity objectives.

This works for public APIs, but when an LLM copies internal APIs that were never exposed to the public it flies right out of the window: the API is not public, it couldn't be used as a lock, and its use is not “fair use” in any shape or form.

> It reads as if it slipped through a wormhole from a parallel universe where the law is entirely different.

No, from a parallel universe where people actually read documents and not just titles.

Detecting AI obfuscated porting

Posted Jun 1, 2026 23:08 UTC (Mon) by khim (subscriber, #9252) [Link] (2 responses)

> I haven't tried that approach personally, but I can't see why it couldn't work.

It wouldn't work because existing so-called “AI” is a pattern-matching machine.

It couldn't understand 1000 lines of code; the most it could do is around 100-200 lines. After that errors grow exponentially and getting useful output becomes prohibitively expensive.

Just as neurodetector catches even the most sophisticated AI models with good probability (they rarely score less than 20-30% “AI” and “mostly AI”, while humans rarely, if ever, get more than a measly few percent of “maybe AI”), so automatic copyright-infringement tools catch AI with ease.

Detecting AI obfuscated porting

Posted Jun 2, 2026 3:53 UTC (Tue) by josh (subscriber, #17465) [Link] (1 responses)

Unfortunately, I think your evaluation of AI capabilities is out of date here. It can do a lot more than that. It still makes mistakes, of course, but current AIs *can* sometimes manage that kind of thing if you're willing to burn enough tokens on them.

Detecting AI obfuscated porting

Posted Jun 2, 2026 9:38 UTC (Tue) by khim (subscriber, #9252) [Link]

> Unfortunately, I think your evaluation of AI capabilities is out of date here.

Sadly, it is not. If you believe in that “tasks doubling every seven months” hype, then it's achieved by extending LLMs' capabilities at doing simple, small tasks into larger spans using external tools. E.g. instead of itself refactoring 5000 .h files to move definitions of functions from the .h to the .cc file, an agent may write a Python script and run it. And if that script grows beyond a certain size, it may write a series of patches to gradually improve it. But you can only improve the process so much before it starts walking in circles.

> It still makes mistakes, of course, but current AIs *can* sometimes manage that kind of thing if you're willing to burn enough tokens on them.

Only with things in their training corpus that have enough blog posts to steal from. We have a template metaprogramming library at my $DAYJOB. When asked to implement a function that should have used that library (and would have been 20 lines of code), the agent simply lifted examples from Stack Overflow and did everything in 300 lines of code using fold expressions. Because a 2000-line library is too much for it to grok. It may even alter and change said library, when asked; what it couldn't do is use it. For that it would need to have a pile of blog posts to steal patterns from.

I don't think agents will ever get to be able to design 1000 lines of code. What they have achieved is an impressive use of their ability to look at 200-300 lines of code collected in small pieces from different parts of the codebase and then produce another 100 lines of code.

That's still a pretty impressive capability, absolutely. But it's not even remotely close to what one needs to rewrite things in a way that would avoid copyright infringement. Plus it falls apart if those 100 added lines of code can't be tested by something. Asking an agent to do ten such 100-line steps and only check the result at the end is a good way to burn tokens, but it doesn't lead anywhere near working code.

That's why we have so much hype about translations (wow, a system designed for two decades to translate from German to English can also translate from Python to Rust… a good achievement, actually, just not anything unexpected given where and why it was designed, originally), but not much about something designed by AI. That one will arrive a decade or two later.

Detecting AI obfuscated porting

Posted Jun 1, 2026 23:30 UTC (Mon) by sethkush (subscriber, #107552) [Link] (1 responses)

I hope anyone interested in this form of detection is well aware of the results of "AI detection" in higher ed. It's led to far too many witch-hunts and is fraught with false-positives.

Detecting AI obfuscated porting

Posted Jun 2, 2026 9:43 UTC (Tue) by khim (subscriber, #9252) [Link]

That's why I like neurodetector. Try to feed it any of your own works and find something that it would mark as AI!

It's more-or-less impossible to do with any text of substantial size! Yes, it's a matter of tuning, but it's heavily tuned to bias in the “this is probably human” direction. I have never seen an actual human score more than a few percent of “maybe AI” there (but yes, advanced models can get up to 50% “human” marks as a result).

It wasn't the AI

Posted Jun 1, 2026 22:18 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link] (4 responses)

The story is interesting, but it ignores the role of human agency. An LLM may have done the leg work, but it's quite unlikely it did all this on its own recognizance. It's far more likely that some human picked the target and at the very least turned the LLM loose on it; they might very well have done quite a bit more to help it with the task. Leaving out that crucial human element is a common flaw in discussions of what's happening with LLMs. For instance, LLMs aren't threatening people's jobs; businesses that would rather pay an LLM company than a human being are threatening their employees' jobs. Every time you read about "AI did this" or "LLMs are doing that", look for the human who's giving the LLM orders. They're the ones we should be focusing on.

It wasn't the AI

Posted Jun 2, 2026 18:25 UTC (Tue) by marcH (subscriber, #57642) [Link] (3 responses)

> Every time you read about "AI did this" or "LLMs are doing that", look for the human who's giving the LLM orders. They're the ones we should be focusing on.

... especially when violating licenses and attributions while duplicating a tool that checks exactly those things. The irony is indeed crazy.

It wasn't the AI

Posted Jun 2, 2026 19:09 UTC (Tue) by marcH (subscriber, #57642) [Link] (2 responses)

BTW I really don't understand the (human) purpose here. Is this just some sort of social experiment?

There are many different ways to fork a project. LLMs, AI and language preferences aside, there seems to be no more hostile and generally worse way than this... very puzzling.

There are very many interesting use cases for LLMs, including fast but _private_ prototyping and experimentation, especially for people with limited development experience. But... an "outreach campaign" for a hostile and hopeless fork?!

It's all a bit mysterious though because the blog carefully avoids any reference to the fork, avoiding even its name. I think I found it but I didn't find any trace of any outreach campaign... weird.

(The blog is still a very interesting analysis in any case)

It wasn't the AI

Posted Jun 2, 2026 21:31 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

> It's all a bit mysterious though because the blog carefully avoids any reference to the fork, avoiding even its name.

I think it's an attempt to deny the fork and its human creator any publicity. They're worried that mentioning it by name would get people interested in it and might give the human publicity as someone who's an expert at using LLMs to copy FOSS. That seems like a totally legitimate concern, but it would have been smart to include a sentence or two explaining why they aren't naming and shaming. A simple "Note: we have avoided naming the software or the developer behind it to avoid giving them publicity" would have helped.

It wasn't the AI

Posted Jun 4, 2026 13:57 UTC (Thu) by ballombe (subscriber, #9523) [Link]

I am quite sure I read an article some months ago about this exact fork, extolling the benefits of LLMs and the advantages of Rust over Python. I will not summarize it here. I understand the authors' reaction.

Code quality?

Posted Jun 2, 2026 8:03 UTC (Tue) by acer (subscriber, #156526) [Link]

Honestly, it would be nice to see the resulting code.

No words about maintenance

Posted Jun 2, 2026 8:23 UTC (Tue) by tdz (subscriber, #58733) [Link]

A rewrite has to transfer not only the program code, but also the institutional knowledge that existed around the code artifact. By auto-converting the code base to an entirely new language, they lost all the knowledge that existed about the original Python implementation. If there's now some odd corner case, will they be able to fix it? Will the former Python developers even be capable of finding it before they have mastered Rust and rebuilt the institutional knowledge on the new code base?

Did the port achieve something useful?

Posted Jun 2, 2026 10:18 UTC (Tue) by epa (subscriber, #39769) [Link]

> After applying optimization similar to what the Rust port did, ScanCode runs as fast or faster than the Rust port, while maintaining correctness, and attribution.

So perhaps something good came out of the whole exercise? It's not clear whether these optimizations were applied by the LLM as part of the porting, or whether a human applied them to the Rust code afterwards.

I like their reason for caring about software plagiarism

Posted Jun 2, 2026 16:43 UTC (Tue) by ebiederm (subscriber, #35028) [Link] (1 responses)

When you don't know your software is essentially a copy of something else, you can't pick up the bug fixes that get applied to the original.

How difficult is it to get the database of other open source code signatures to compare against?

I like their reason for caring about software plagiarism

Posted Jun 4, 2026 10:55 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

> How difficult is it to get the database of other open source code signatures to compare against?

It is essentially impossible because a lot of builds are not reproducible. Even when they are, modern dev tooling enables pinning any arbitrary commit as a dependency (so the matrix of version combinations to compare against grows exponentially). And even if one did manage to limit oneself to reproducible builds with a complete database of signatures, it is trivially easy to invalidate this database with a few changes (forks) left and right that do not change the security status but do change the resulting signature.

The idea of using known signatures to detect problems comes from the enterprise space, and enterprise devs learnt a long time ago to run circles around this detection to avoid the inconvenience of getting caught red-handed with known security bugs. It sort of worked in the dark ages, when licensing restrictions, horrible build tooling, and the lack of network distribution drastically limited the volume of possible (good or bad) artifacts.

Blacklisting rotten artifact versions is a fool’s game. The only thing known to “work” is whitelisting known good versions. That amounts to getting into the distribution game, whether you use some third party distribution, an internal monorepo, an artifact repository or whatever other guises it takes. Of course that restricts dev liberty to the artifacts you managed to check (even imperfectly) and whitelist. Also that sort of forces dynamic linking because vetting an artifact is expensive, so you do not want this vetting and the associated signature to be invalidated on every build by build-specific optimisations.

Because using signatures for blacklisting is a fool’s game and using signatures for whitelisting is hard and restrictive, you have the current situation where everything that originates from a trusted distribution can be vetted via signatures, and everything else requires manual checking and eventual rebuilding using the original project's latest fixes.
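
The asymmetry between the two approaches can be sketched in a few lines of Python. This is only an illustration: SHA-256 over the raw artifact bytes stands in for the "signature", and the artifact contents and function names are made up.

```python
import hashlib


def signature(artifact: bytes) -> str:
    """Stand-in signature: SHA-256 of the artifact bytes."""
    return hashlib.sha256(artifact).hexdigest()


def blacklist_allows(artifact: bytes, known_bad: set) -> bool:
    """Blacklisting: anything not explicitly known to be bad is accepted."""
    return signature(artifact) not in known_bad


def whitelist_allows(artifact: bytes, known_good: set) -> bool:
    """Whitelisting: only explicitly vetted artifacts are accepted."""
    return signature(artifact) in known_good


vulnerable = b"lib-1.0 with a known security bug"
fork = vulnerable + b"\n# trivial, security-irrelevant change"
fixed = b"lib-1.1 with the bug fixed"

known_bad = {signature(vulnerable)}
known_good = {signature(fixed)}

print(blacklist_allows(vulnerable, known_bad))  # False: exact copy is caught
print(blacklist_allows(fork, known_bad))        # True: the fork slips through
print(whitelist_allows(fork, known_good))       # False: unvetted, so rejected
print(whitelist_allows(fixed, known_good))      # True
```

The fork carries the same bug but a different hash, so the blacklist waves it through, while the whitelist rejects it at the cost of also rejecting every harmless artifact nobody has vetted yet.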


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds