
Scanning for secrets

By Jake Edge
April 7, 2021

Projects, even of the open-source variety, sometimes have secrets that need to be maintained. They can range from things like signing keys, which are (or should be) securely stored away from the project's code, to credentials and tokens for access to various web-based services, such as cloud-hosting services or the Python Package Index (PyPI). Some of these credentials are needed by instances of the running code, and others benefit from being stored "near" the code, but none of them are meant to be distributed outside of the project. They can sometimes mistakenly be added to a public repository, however, which is a slip that attackers are most definitely on the lookout for. The big repository-hosting services like GitHub and GitLab are well-placed to scan for these kinds of secrets being committed to project repositories—and they do.

Source-code repositories represent something of an attractive nuisance for storing this kind of information; project developers need the information close to hand and, obviously, the Git repository qualifies. But there are a few problems with that, of course. Those secrets are only meant to be used by the project itself, so publicizing them may violate the terms of service for a web service (e.g. Twitter or Google Maps) or, far worse, allow using the project's cloud infrastructure to mine cryptocurrency or allow anyone to publish code as if it came from the project itself. Also, once secrets get committed and pushed to the public repository, they become part of the immutable history of the repository. Undoing that is difficult and doesn't actually put the toothpaste back in the tube; anyone who cloned or pulled from the repository before it gets scrubbed still has the secret information.

Once a project recognizes that it has inadvertently released a secret via its source-code repository, it needs to have the issuer revoke the credential and, presumably, issue a new one. But there may be a lengthy window of time before the mistake is noticed; even if it is noticed quickly, it may take some time to get the issuer to revoke the secret. All of that is best avoided, if possible.

Over the years, there have been various problems that stemmed from credentials being committed to Git repositories and published on GitHub. An article from five years ago talks about a data breach at Uber using Amazon Web Services (AWS) credentials that were mistakenly committed to GitHub; a much larger, later breach used stolen credentials to access a private GitHub repository that had additional AWS tokens. The article also points to a Detectify blog post describing how the company found Slack tokens by scanning GitHub repositories; these kinds of problems go further back than that, of course. A 2019 paper [PDF] shows that the problem has not really abated, which is no real surprise.

GitHub has been scanning for secrets since 2015; it began by looking for its own OAuth tokens. In 2018, the company expanded its scanning to look for other types of tokens and credentials:

Since April, we’ve worked with cloud service providers in private beta to scan all changes to public repositories and public Gists for credentials (GitHub doesn’t scan private code). Each candidate credential is sent to the provider, including some basic metadata such as the repository name and the commit that introduced the credential. The provider can then validate the credential and decide if the credential should be revoked depending on the associated risks to the user or provider. Either way, the provider typically contacts the owner of the credential, letting them know what occurred and what action was taken.

These days, GitHub has a long list of secret types that it scans for, which are listed in its secret-scanning documentation. When it finds matches in new commits (or in the history of newly added repositories), it contacts the credential issuer via an automated HTTP POST to an issuer-supplied URL; the issuer can check the validity of the secret and determine what actions to take. Those could include revocation, notification of the owner of the secret, and possibly issuing a replacement secret.

GitHub actively solicits service providers to join the program. In order to do so, they need to set up an endpoint to receive the HTTP POST and provide GitHub with a regular expression to be used to look for matches. To keep attackers from misusing the scanning-message endpoint, the messages sent to it are signed with a GitHub key that can (and should) be verified before processing the "secret revealed" POST.
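As a rough illustration of the provider's side of that process, a hypothetical alert endpoint might look like the Python sketch below. The route, the payload fields, and the helper functions are illustrative assumptions, not details of GitHub's actual partner API:

```python
# A hypothetical provider-side alert endpoint, sketched with Flask.
import json

from flask import Flask, request

app = Flask(__name__)

def signature_is_valid(headers, body):
    # Placeholder: a real provider must verify GitHub's signature over the
    # message (using GitHub's published public key) before acting on it.
    return True

def credential_is_live(token):
    # Placeholder: check the candidate against the provider's own records.
    return token.startswith("example-")

def revoke_and_notify(token, seen_at):
    # Placeholder: revoke the credential and tell its owner what happened.
    print(f"revoking {token!r}, first seen at {seen_at}")

@app.route("/secret-alert", methods=["POST"])
def secret_alert():
    raw = request.get_data()
    if not signature_is_valid(request.headers, raw):
        return "bad signature", 401
    # Each candidate match carries basic metadata about where it was found.
    for match in json.loads(raw):
        if credential_is_live(match["token"]):
            revoke_and_notify(match["token"], match.get("url"))
    return "", 204
```

The important points are that the signature gets checked before anything else happens, and that the provider, not GitHub, decides whether the credential is still live and what to do about it.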

A recent note on the GitHub blog announced the addition of PyPI tokens to the secret-scanning feature. GitHub and PyPI teamed up to help protect projects from these kinds of mistakes:

From today, GitHub will scan every commit to a public repository for exposed PyPI API tokens. We will forward any tokens we find to PyPI, who will automatically disable them and notify their owners. The end-to-end process takes just a few seconds.

GitHub is not alone; GitLab also added secret scanning as part of its 11.9 release in 2019. Unlike the proprietary GitHub, GitLab is an "open core" project, so the code for its "secret detection" feature is available in a GitLab repository. So far, at least, it would not appear that the fully open-source repository-hosting service SourceHut has implemented a similar feature.

The GitLab scanner is based on the Gitleaks tool, which scans a Git repository for strings matching regular expressions stored in a TOML configuration file. It is written in Go and can be run in a number of different ways, including on local files that have not yet been committed. Doing so regularly could prevent the secrets from ever getting committed at all, of course.

The GitLab scanning documentation has a list of what kinds of secrets it looks for, which is shorter than GitHub's list, but does include some different types of secrets. GitLab's scanning looks for things like SSH and PGP private keys, passwords in URLs, and US Social Security numbers. The gitleaks.toml file shows a few more scanning targets that have not yet made it onto the list, including PyPI upload tokens.
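As an illustration, a Gitleaks-style rule for PyPI upload tokens might look something like the following; the regular expression here is a sketch based on the fixed "pypi-" prefix that PyPI tokens carry, not the pattern from the project's actual gitleaks.toml:

```toml
# A hypothetical rule; real rules live in the project's gitleaks.toml.
[[rules]]
    description = "PyPI upload token (illustrative pattern)"
    regex = '''pypi-[A-Za-z0-9\-_]{50,}'''
    tags = ["pypi", "token"]
```

Running the tool against a checkout (or, with the appropriate option, against files that have not yet been committed) then reports any matches for each configured rule.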

It is better, of course, if secrets never actually make it into a repository at all. Second best would be to catch them on the committer's system before they have pushed their changes to the central repository; it may be somewhat painful to do, but the offending commit(s) can be completely removed from the history at that point via a rebase operation. Either of those requires some kind of local scanning (perhaps with Gitleaks) that gets run as part of the development process. Having a backstop at the repository-hosting service, though, undoubtedly helps give projects some peace of mind.
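As an example of that kind of committer-side backstop, a minimal pre-commit hook (installed as .git/hooks/pre-commit) could simply abort the commit whenever a local scan finds matches. This sketch assumes a gitleaks binary on the PATH; the exact command-line options vary between Gitleaks versions:

```python
#!/usr/bin/env python3
# A minimal sketch of a secret-scanning pre-commit hook.
import subprocess
import sys

def main():
    # Run the scanner over the repository; a non-zero exit status from
    # Gitleaks conventionally means that candidate secrets were found.
    result = subprocess.run(["gitleaks", "detect"])
    if result.returncode != 0:
        print("pre-commit: possible secrets detected; commit aborted",
              file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```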


Index entries for this article
Security: Information leak



Scanning for secrets

Posted Apr 7, 2021 18:05 UTC (Wed) by jebba (guest, #4439) [Link]

Perhaps they could do this for blockchains too. For example, $PAID was cracked when one of their contractors shared their keys with another client, who then unknowingly committed the keys to a public Git repository:

https://paidnetwork.medium.com/paid-network-attack-postmo...

Scanning for secrets

Posted Apr 8, 2021 9:23 UTC (Thu) by k3ninho (subscriber, #50375) [Link] (6 responses)

>It is better, of course, if secrets never actually make it into a repository at all. Second best would be to catch them on the committer's system before they have pushed their changes to the central repository; it may be somewhat painful to do, but the offending commit(s) can be completely removed from the history at that point via a rebase operation.

Like evidence you've run tests on the code you're committing, you can agree a process* where you need evidence you've run the pre-commit linting/cleanup script that includes a scan for key-like items before the pull request can be approved and the code merged. That would have to be a chained hash of the testcases that worked: this binary artefact (a) in these configurations (b) ran this test suite (c) and we use developer's gpg key (d) to sign results set (a+b+c+d) to $hash1, or head-of-tree (i) plus the diff of these changes (j) was accredited by this edition (k) of pre-commit script with result (l), signed by developer gpg key (d, let's reuse it) to $hash2. The goal is that someone looking at the diffs can re-run these scripts to score the same hashes -- plus the local run before pushing upstream will warn on possible key-like data.
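Something like this Python sketch of the hashing step, say; the file names and the GnuPG invocation are just placeholders:

```python
# Hash the artifact, its configurations, and the test results together,
# then sign the digest with the developer's GPG key.
import hashlib
import subprocess

def attestation_digest(paths):
    """Hash the listed files (artifact, configs, results) in order."""
    h = hashlib.sha256()
    for path in paths:
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

digest = attestation_digest(["artifact.bin", "configs.txt", "test-results.txt"])
# Reviewers re-run the same steps and check that they get the same digest.
subprocess.run(["gpg", "--clearsign"], input=digest.encode(), check=True)
```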

*: when I hear other developers say 'should', I wince because 'should is considered harmful'. Likewise, 'you can agree a process' has the caveat that we *can and should* agree a process, while allowing for everyone's special considerations.

K3n.

Scanning for secrets

Posted Apr 8, 2021 10:26 UTC (Thu) by Otus (subscriber, #67685) [Link] (5 responses)

That workflow is almost the opposite of most CI setups where committing and pushing/pull requesting is precisely what triggers the build/test/lint/etc run and leaves proof that those are ok. I certainly wouldn't want to run some of the heavier build&test processes on my laptop if I can offload to a build server instead.

This seems like a separate problem.

Scanning for secrets

Posted Apr 8, 2021 19:12 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (4 responses)

Besides, the fact that a rando developer's whacky `LD_LIBRARY_PATH`, put into `.profile` to fix a problem they had years ago, makes the PR work is no reason to trust their test-suite run. I want to see what the known environment thinks of your code, not what your custom setup thought of the code.

Scanning for secrets

Posted Apr 9, 2021 0:52 UTC (Fri) by pabs (subscriber, #43278) [Link] (3 responses)

I think I'd prefer to see test suites pass on a diverse set of systems rather than just the few that the CI runs on, especially since CI services usually support very few architectures and operating systems.

Scanning for secrets

Posted Apr 9, 2021 11:58 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (2 responses)

That's true, but getting that developer to remember that they had set `LD_LIBRARY_PATH`, which is causing your test suite to fail all over the place, is a frustrating debugging experience. Nothing makes sense until you realize that they haven't been using the libraries at runtime that everyone thought were in use.

A diverse set of environments is useful, but it has to be a *known* set of environments. I'm not coding up the logic needed to guard my project against silly `LD_PRELOAD` environments, rogue `PYTHONPATH`, or other such things. It's an exercise in futility for very little gain. CI provides that known environment. The dockcross project can do it for other Linux arches, but I'm not too far from "can you reproduce it in a Docker container? no? do that first please" in response to spooky linker-related problems.

Scanning for secrets

Posted Apr 10, 2021 1:17 UTC (Sat) by pabs (subscriber, #43278) [Link] (1 responses)

I don't think every project should need to guard against environment variables like CFLAGS, LD_PRELOAD or PYTHONPATH, since those are intended to change the build process.

At some point, "it works in CI" is basically the same as "it works on my computer". A better approach in case of build environment related problems is to record the two build environments, compare them and bisect the differences to find out which change causes the problem.
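As a very rough sketch of the recording step (environment variables only; a real recording, like the buildinfo work linked below, captures far more than this):

```python
# Record each machine's environment, then diff the two recordings.
import os
import sys
from difflib import unified_diff

def dump_env(path):
    """Write the environment as sorted KEY=VALUE lines."""
    with open(path, "w") as f:
        for key in sorted(os.environ):
            f.write(f"{key}={os.environ[key]}\n")

def diff_envs(path_a, path_b):
    """Print a unified diff of two recorded environments."""
    with open(path_a) as a, open(path_b) as b:
        sys.stdout.writelines(unified_diff(
            a.readlines(), b.readlines(), fromfile=path_a, tofile=path_b))

if __name__ == "__main__":
    if len(sys.argv) == 2:
        dump_env(sys.argv[1])                # on each machine: record.py my.env
    elif len(sys.argv) == 3:
        diff_envs(sys.argv[1], sys.argv[2])  # then: record.py a.env b.env
    else:
        print("usage: record.py DUMP | record.py DUMP1 DUMP2")
```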

Some build environment related examples:

The Debian "buildd from hell", which compared packages built in a clean chroot with those built in a chroot containing as many -dev packages as possible to install at the same time. The mail below contains Message-IDs for related discussions, which you can put into the Debian lists msgid-search form to find the archives.

https://lists.debian.org/msgid-search/351842f7a4da3cff7ee...
https://lists.debian.org/msgid-search/

The Reproducible Builds folks deliberately vary build environments in various ways in order to detect parts of the build system that introduce non-determinism. Some of that may (in the past, now, or in the future) include LD_PRELOAD of various things, including faketime and/or cowbuilder, which has a copy-on-write preload. The buildinfo.rst link below contains some of the philosophy that led to this approach. Of course, the set of variations could be expanded, and the set of tested build environments will never achieve the level of variation that random users trying to reproduce builds could achieve.

https://reproducible-builds.org/
https://tests.reproducible-builds.org/debian/index_variat...
https://salsa.debian.org/reproducible-builds/reprotest
https://reproducible-builds.org/docs/perimeter/
https://reproducible-builds.org/docs/recording/
https://salsa.debian.org/reproducible-builds/specs/buildi...

The Bootstrappable Builds project is aiming to get to a full Linux distro from < 1000 bytes of audited machine code plus all the necessary source code. Their approach is slightly different: instead of recording the build environment, they aim to *create* the build environment from scratch, but they will still encounter build-environment differences due to hardware differences and non-determinism (though they plan to eventually push the bootstrap process deeper into the hardware layer). They also want build-environment diversity, though: they want to be able to do this for any arch, from any arch, and on a variety of hardware of the same arch.

https://bootstrappable.org/
https://github.com/fosslinux/live-bootstrap/blob/master/p...
https://github.com/oriansj/talk-notes/blob/master/live-bo...
https://github.com/oriansj/talk-notes/blob/master/live-bo...

Scanning for secrets

Posted Apr 12, 2021 13:00 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

> A better approach in case of build environment related problems is to record the two build environments, compare them and bisect the differences to find out which change causes the problem.

I agree with all of that. However, we're lacking a suitable "diff" tool to first get the diff we need to bisect. Unfortunately, not everyone is aware of effects that their "quick fixes" actually have and so when asking for differences, the important details don't even come up until you've already changed the code a few times, rebuilt, then finally asked for `LD_DEBUG=libs` output to be provided showing that none of the changes even mattered.

RFC 8959

Posted Apr 9, 2021 19:29 UTC (Fri) by tialaramex (subscriber, #21167) [Link] (2 responses)

For the future, a more general approach is outlined in RFC 8959 which could avoid the situation where AWS tokens are scrupulously identified and scrubbed, but tokens for Yet-another-B2B-service aren't because they weren't famous enough. Bad guys don't care and will exploit whatever they find.

https://www.rfc-editor.org/rfc/rfc8959.html

RFC 8959

Posted Apr 9, 2021 21:57 UTC (Fri) by comex (subscriber, #71521) [Link] (1 responses)

That could help with systems that try to reject commits containing secrets from being pushed in the first place. But GitHub’s system at least doesn’t fall in that category: it only runs after a push has already been accepted, and thus has to actually notify the service provider to revoke the secret. RFC 8959 doesn’t seem to help with that; secret-token URIs only identify that something is a secret, not which service it belongs to.

RFC 8959

Posted Apr 10, 2021 8:24 UTC (Sat) by aaronmdjones (subscriber, #119973) [Link]

GitHub's system could be trivially altered to parse a push to a public repository for secret-token URIs, and reject the push if it finds any, much like it rejects a force push or an unsigned push to a protected branch (if configured to do so in the repository's settings).

I imagine their system doesn't do credential scanning on every push because it would be too expensive, but matching the contents of a push against a *single* regexp is cheap and easy.
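For example, the single check could be something like this; the pattern below is a loose reading of the RFC 8959 grammar, not a validated implementation:

```python
# Flag any RFC 8959 secret-token URI in the pushed content.
import re

SECRET_TOKEN_RE = re.compile(r"secret-token:[A-Za-z0-9\-._~%!$&'()*+,;=:@]+")

def push_contains_secret(text):
    return SECRET_TOKEN_RE.search(text) is not None

# A made-up token in the RFC's general format:
print(push_contains_secret("secret-token:E92FB7EB-D882-47A4-A265-A0B6135DC842"))
```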

One option: tartufo

Posted Apr 24, 2021 19:11 UTC (Sat) by sbailey (subscriber, #54) [Link]

One example of a tool in this space is tartufo. It searches repository history and/or files for likely problematic content using both regular expressions and entropy; this makes it likely to catch anything you wouldn't want to disclose, at the cost of possible false positives (which can be suppressed).
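The entropy side of that works roughly like this sketch; the 4.5 bits-per-character threshold is an arbitrary illustration, not tartufo's actual cutoff:

```python
# Flag strings whose Shannon entropy suggests random (key-like) data.
import math
from collections import Counter

def shannon_entropy(s):
    """Bits of entropy per character of the string."""
    counts = Counter(s)
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

for word in "import_os hunter2 pypi-AgEIcHlwaS5vcmcXYZ01234abcdEFGH".split():
    bits = shannon_entropy(word)
    print(f"{word}: {bits:.2f} bits/char, flagged={bits > 4.5}")
```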


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds