How to still cache across sites
Posted Jan 26, 2021 16:43 UTC (Tue) by epa (subscriber, #39769)
Parent article: Firefox 85 released
> In the case of Firefox’s image cache, a tracker can create a supercookie by “encoding” an identifier for the user in a cached image on one website, and then “retrieving” that identifier on a different website by embedding the same image. To prevent this possibility, Firefox 85 uses a different image cache for every website a user visits.

I understand why they did this, but the original idea of the Web was that a URI identifies a global resource and it shouldn't much matter which "site" you link or embed it from. In principle, is there a way that large images or other files can still be cached across sites, without enabling this sneaky supercookie tracking?
One way might be to extend the HTML <A> and <IMG> elements to optionally specify a hash of the data being linked to:

    <IMG src="https://global-big-image/me.png" hash="sha256:abcdef" />
The browser verifies the image against the hash, and if it doesn't match then the download fails. But the real point of the hash is to make sure the site embedding an image cannot "find out" anything about the image. To generate the link text with a hash you must already know what the image content is, so you couldn't peek into the cache to find data left there by some other site and thereby discover new information. If the hash is given, and matches, then the same cached object can be reused across sites. Otherwise I guess we do need this cache partitioning Mozilla have come up with. (Obviously the accepted hash functions must be limited to strong ones.)
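As a rough sketch of the check the browser would make, assuming a hypothetical hash= attribute like the one above (the helper name and accepted-algorithm list are made up for illustration):

    import hashlib

    def may_share_cached_image(declared_hash: str, cached_bytes: bytes) -> bool:
        """Hypothetical check: only reuse a cross-site cache entry when the
        embedding page already knew the content, i.e. the hash it declared
        matches the cached bytes."""
        alg, _, expected_hex = declared_hash.partition(":")
        if alg not in {"sha256", "sha384", "sha512"}:  # accept only strong hashes
            return False
        return hashlib.new(alg, cached_bytes).hexdigest() == expected_hex

    # e.g. may_share_cached_image("sha256:abcdef", image_bytes) for the <IMG>
    # example above; on a mismatch the download fails and nothing from the
    # shared cache is revealed to the embedding site.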
I think this is more of a theoretical concern, because right now we don't rely heavily on URIs to identify large, slow-to-download objects which could be cached globally. If we were back in the days of 14.4k modems, a way to enable cross-site caching (without sneaky tracking) might become important.
Posted Jan 26, 2021 18:26 UTC (Tue)
by josh (subscriber, #17465)
[Link]
The hash isn't the only concern. There's one more bit of information the cache provides: has the user loaded the image yet?
One way to store a tracking bit is to load a uniquely named image (whose hash you know) and use the cached/not-cached status of the image as one bit of information.
Posted Jan 26, 2021 18:35 UTC (Tue)
by excors (subscriber, #95769)
[Link] (13 responses)
But I believe you can implement 'supercookies' without caring about the content of a resource from another site, you only need to know whether the resource is in the user's cache (which can be determined from the timing of load events in the browser). One site can store a single bit of information in the cache by loading a dummy image (to represent a 1) or not loading it (to represent a 0), and a second site can fetch the same image URL and see if it loads quickly (1) or slowly (0). Repeat for N distinct image URLs to pass an N-bit identifier between the two sites, and use that for persistent tracking.
I don't see any reasonable way to prevent that other than isolating the caches per site, like Firefox and Chrome are doing.
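To make the mechanism concrete, here is a toy model of that N-bit scheme (the probe URLs and the set standing in for the browser's shared cache are invented; a real attack would infer the "read" side from load timing rather than a direct lookup):

    # Toy model: with a shared (non-partitioned) cache, site B can read back an
    # identifier that site A "wrote" by choosing which of N probe URLs to load.
    N_BITS = 8
    probe_urls = [f"https://tracker.example/probe/{i}.png" for i in range(N_BITS)]

    shared_cache = set()  # stands in for the browser's cross-site cache

    def site_a_store(user_id: int) -> None:
        """Site A loads probe image i only if bit i of the identifier is 1."""
        for i, url in enumerate(probe_urls):
            if (user_id >> i) & 1:
                shared_cache.add(url)          # "loading" the image caches it

    def site_b_read() -> int:
        """Site B re-requests every probe; a fast (cached) load is a 1 bit."""
        user_id = 0
        for i, url in enumerate(probe_urls):
            if url in shared_cache:            # in reality: measured load time
                user_id |= 1 << i
        return user_id

    site_a_store(0b10110010)
    assert site_b_read() == 0b10110010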
Posted Jan 26, 2021 21:00 UTC (Tue)
by epa (subscriber, #39769)
[Link] (2 responses)
In general, it is not possible to close all covert channels. In principle, it might be worth accepting the one bit of information leakage in exchange for the other benefits of caching.
Posted Jan 27, 2021 13:06 UTC (Wed)
by excors (subscriber, #95769)
[Link] (1 responses)
It's not just "one bit", it's one bit per image, and you can very cheaply load many images to transfer meaningful amounts of data on every page load. There's little point blocking supercookie mechanisms that depend on the content of cached resources, if it's just going to delay the trackers for a couple of weeks until they all switch over to a cache-timing mechanism (which is a little more complicated and expensive and less reliable, so it wasn't their first choice, but still seems entirely practical). An effective solution needs to block the timing information too.
Posted Feb 1, 2021 0:05 UTC (Mon)
by ras (subscriber, #33059)
[Link]
Is that relevant?
The privacy attack described is pretty simple. When a URL (or any cached content, e.g. DNS) is fetched, return different content every time and set the caching parameters to "cache indefinitely", so this unique content is stored by the browser and will be returned by all future fetches of that URL. This unique content can then be used as a persistent identifier for the browser.
Actually, it's beyond pretty simple, it's drop-dead easy. It would not be at all surprising if it's rampant already. It doesn't need to be described anywhere to make defending against it worthwhile.
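A sketch of what the server side could look like, using Python's standard http.server (the handler and the choice of a UUID as the identifier are purely illustrative):

    import uuid
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class SupercookieHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Hand out a unique identifier on first fetch and tell the browser
            # to cache it "indefinitely", so later fetches of the same URL
            # replay the identifier instead of hitting the server again.
            ident = uuid.uuid4().hex
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Cache-Control", "public, max-age=31536000, immutable")
            self.send_header("ETag", f'"{ident}"')
            self.end_headers()
            self.wfile.write(ident.encode())

    # HTTPServer(("", 8000), SupercookieHandler).serve_forever()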
Posted Jan 27, 2021 1:56 UTC (Wed)
by stephen.pollei (subscriber, #125364)
[Link]
Posted Jan 27, 2021 3:04 UTC (Wed)
by pabs (subscriber, #43278)
[Link] (5 responses)
Posted Jan 27, 2021 10:53 UTC (Wed)
by leromarinvit (subscriber, #56850)
[Link] (3 responses)
Posted Jan 27, 2021 12:16 UTC (Wed)
by excors (subscriber, #95769)
[Link] (2 responses)
(These are very coarse timings, so the Spectre mitigation of eliminating high-resolution timing APIs in JavaScript won't help here. And I suspect browsers can't prevent scripts from observing when a cross-origin resource has finished loading without massively breaking compatibility with large parts of the web (which would be unacceptable): embedding external images is very common, and there are lots of widely used APIs that can observe the on-screen layout of the page, and the layout will necessarily change once the image is loaded. So they can't stop scripts from measuring the timing; they just have to stop that timing from being able to pass information between different sites.)
Posted Jan 30, 2021 2:13 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
I have often wished that those APIs were *less* powerful, honestly.
There are two use cases for the modern web:
1. The interchange of hypertext documents.
2. Web applications

#2 has an obvious need for fine control over how pages are laid out. However, for case #1, such fine control is (usually) considered an anti-pattern, as the whole point of hypertext is that it can be rescaled or reshaped automatically to fit the client's specific presentation needs.
What really frustrates me is the widespread misuse of APIs (that were clearly intended for case #2) in ways that make case #1 more obnoxious (for lack of a better term). In the 90's and early 2000's, popups were basically obliterated after browsers started systematically blocking them. Today, they're back, but now they're embedded in the page and significantly more annoying (because they're modal). Another common problem is the constant page reflowing as ads and other crap from every random corner of the web gradually pops itself into the DOM. I really would like to be able to just read the damn text without it constantly jumping around.
IMHO the gradual extension of HTML to accommodate these APIs was necessary (because Java and Flash were both terrible) but I still wish we had found a better way of cleanly separating case #1 from case #2. AMP was/is Google's attempt to do this, but everyone hated it for being not-HTML and for its obvious monopolistic tendencies, so I have no idea where that leaves us.
(Disclaimer: I work for Google, but not on anything related to web frontends or AMP.)
Posted Jan 30, 2021 17:47 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
All too often I go to print a web page and (when I hit "print preview", because I've learnt) it bears ABSOLUTELY NO RESEMBLANCE WHATSOEVER to what's displayed on screen.
Cheers,
Wol
Posted Jan 27, 2021 11:31 UTC (Wed)
by epa (subscriber, #39769)
[Link]
Posted Feb 1, 2021 0:32 UTC (Mon)
by ras (subscriber, #33059)
[Link] (2 responses)
Perhaps, but it's a read-once variable, it's fragile because the browser can choose not to cache it anyway, and you will need lots of them, which slows performance.
> I don't see any reasonable way to prevent that other than isolating the caches per site, like Firefox and Chrome are doing.
The thing they are defending against relies on the server returning different content each time the given URL is fetched. So break that.
A way to break it is to invent a new kind of URL, e.g. https://domain/..immutable../?sha512=xxxxxxxx&len=98765&iso8601=20210101T012345.678, that always returns the same content, with the browser enforcing that by verifying the content matches the hash in the URL. Those URLs get very aggressively cached for free, without you having to ask for it. Any URL that doesn't guarantee immutable content gets subjected to the proposed cache segregation, or perhaps not cached at all. Problem mostly solved.
Even better, if you say anything that uses "?sha512=xxxxxxxx&len=98765&iso8601=20210101T012345.678" must return the same content regardless of what https://domain/ it's under, you get cross-site caching without the programmer having to lift a finger. But that would make those read-once variables you describe harder to recognise, so maybe it's not such a good idea.
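A sketch of the check a browser could apply before admitting such a response to a shared, cross-site cache (the function is hypothetical, and only the response body is hashed, matching the URL format above):

    import hashlib
    from urllib.parse import urlparse, parse_qs

    def immutable_url_matches(url: str, body: bytes) -> bool:
        """Admit a response to the shared cross-site cache only if the body
        matches the sha512 and length pinned in the ..immutable.. URL."""
        parsed = urlparse(url)
        if "/..immutable../" not in parsed.path:
            return False                       # ordinary URL: partitioned cache
        params = parse_qs(parsed.query)
        expected_hash = params.get("sha512", [""])[0]
        expected_len = int(params.get("len", ["-1"])[0])
        return (len(body) == expected_len
                and hashlib.sha512(body).hexdigest() == expected_hash)

    # e.g. immutable_url_matches(
    #     "https://domain/..immutable../?sha512=...&len=98765&iso8601=...", body)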
Posted Feb 1, 2021 10:38 UTC (Mon)
by excors (subscriber, #95769)
[Link] (1 responses)
I believe that's only a subset of what they're trying to defend against. E.g. the cache isolation feature was discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1536058 because of an attack that depends only on cached vs uncached load timing. One developer says cache isolation "is a long-term anti-tracking goal too (in order to prevent ETag-based tracking vectors) so this gives us yet another privacy related reason for doing so", so they had already been thinking about this more general solution. There's also stuff like https://github.com/w3c/server-timing/issues/67 where a cached HTTP header can be used as an identifier (which wouldn't be protected by your scheme if you're hashing just the response's body; and you probably can't hash headers without breaking HTTP proxies).
The specific attack mentioned in Mozilla's blog post could be prevented in other cheaper ways, but that would do nothing against a lot of other published and yet-to-be-discovered attacks, so they went with cache isolation to prevent all those different methods (including the timing ones) at once.
Posted Feb 1, 2021 11:16 UTC (Mon)
by ras (subscriber, #33059)
[Link]
Sigh. I hadn't thought of headers. But this is a new sort of fetch, and for this sort you could say "you don't get access to no stink'in headers". Or perhaps you just get access to a harmless pre-defined set, similar to what CORS allows.
In any case this can't be a replacement, as not everything is immutable. You still need to do the cache isolation for the non-immutable stuff. It can only be an addition that eliminates the impact of the cache isolation for stuff that doesn't need it.
Posted Jan 27, 2021 5:48 UTC (Wed)
by fmarier (subscriber, #19894)
[Link]
> One way might be to extend the HTML <A> and <IMG> elements to optionally specify a hash of the data being linked to.
> <IMG src="https://global-big-image/me.png" hash="sha256:abcdef" />

This is essentially what subresource integrity does for scripts and stylesheets, but as others have pointed out, the mere presence of a resource in the cache can leak sensitive browsing history.
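For reference, the subresource integrity value is just a base64-encoded digest of the resource, prefixed with the hash algorithm; a quick sketch of computing one in Python (the file name is made up for illustration):

    import base64
    import hashlib

    def sri_value(data: bytes, alg: str = "sha384") -> str:
        """Compute a subresource-integrity style value: '<alg>-<base64 digest>'."""
        digest = hashlib.new(alg, data).digest()
        return f"{alg}-{base64.b64encode(digest).decode()}"

    # Hypothetical usage, e.g. to fill in <script integrity="sha384-...">:
    # with open("framework.js", "rb") as f:
    #     print(sri_value(f.read()))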