How to still cache across sites
Posted Jan 26, 2021 16:43 UTC (Tue) by epa (subscriber, #39769)
Parent article: Firefox 85 released
> In the case of Firefox’s image cache, a tracker can create a supercookie by “encoding” an identifier for the user in a cached image on one website, and then “retrieving” that identifier on a different website by embedding the same image. To prevent this possibility, Firefox 85 uses a different image cache for every website a user visits.

I understand why they did this, but the original idea of the Web was that a URI identifies a global resource and it shouldn't much matter which "site" you link or embed it from. In principle, is there a way that large images or other files can still be cached across sites, without enabling this sneaky supercookie tracking?
One way might be to extend the HTML <A> and <IMG> elements to optionally specify a hash of the data being linked to:

    <IMG src="https://global-big-image/me.png" hash="sha256:abcdef" />
The browser verifies the image against the hash, and if it doesn't match then the download fails. But the real point of the hash is to make sure the site embedding an image cannot "find out" anything about the image. To generate the link text with a hash you must already know what the image content is, so you couldn't peek into the cache to find data left there by some other site and thereby discover new information. If the hash is given, and matches, then the same cached object can be reused across sites. Otherwise I guess we do need this cache partitioning Mozilla have come up with. (Obviously the accepted hash functions must be limited to strong ones.)
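As a rough sketch of the check the browser would make, assuming a hypothetical hash= attribute like the one above (the helper name and accepted-algorithm list are made up for illustration):

    import hashlib

    def may_share_cached_image(declared_hash: str, cached_bytes: bytes) -> bool:
        """Hypothetical check: only reuse a cross-site cache entry when the
        embedding page already knew the content, i.e. the hash it declared
        matches the cached bytes."""
        alg, _, expected_hex = declared_hash.partition(":")
        if alg not in {"sha256", "sha384", "sha512"}:  # accept only strong hashes
            return False
        return hashlib.new(alg, cached_bytes).hexdigest() == expected_hex

    # e.g. may_share_cached_image("sha256:abcdef", image_bytes) for the <IMG>
    # example above; on a mismatch the download fails and nothing from the
    # shared cache is revealed to the embedding site.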
I think this is more of a theoretical concern, because right now we don't rely heavily on URIs to identify large, slow-to-download objects which could be cached globally. If we were back in the days of 14.4k modems, a way to enable cross-site caching (without sneaky tracking) might become important.
Posted Jan 26, 2021 18:26 UTC (Tue)
by josh (subscriber, #17465)
[Link]
The hash isn't the only concern. There's one more bit of information the cache provides: has the user loaded the image yet?
One way to store a tracking bit is to load a uniquely named image (whose hash you know) and use the cached/not-cached status of the image as one bit of information.
Posted Jan 26, 2021 18:35 UTC (Tue)
by excors (subscriber, #95769)
[Link] (13 responses)
But I believe you can implement 'supercookies' without caring about the content of a resource from another site, you only need to know whether the resource is in the user's cache (which can be determined from the timing of load events in the browser). One site can store a single bit of information in the cache by loading a dummy image (to represent a 1) or not loading it (to represent a 0), and a second site can fetch the same image URL and see if it loads quickly (1) or slowly (0). Repeat for N distinct image URLs to pass an N-bit identifier between the two sites, and use that for persistent tracking.
I don't see any reasonable way to prevent that other than isolating the caches per site, like Firefox and Chrome are doing.
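To make the mechanism concrete, here is a toy model of that N-bit scheme (the probe URLs and the set standing in for the browser's shared cache are invented; a real attack would infer the "read" side from load timing rather than a direct lookup):

    # Toy model: with a shared (non-partitioned) cache, site B can read back an
    # identifier that site A "wrote" by choosing which of N probe URLs to load.
    N_BITS = 8
    probe_urls = [f"https://tracker.example/probe/{i}.png" for i in range(N_BITS)]

    shared_cache = set()  # stands in for the browser's cross-site cache

    def site_a_store(user_id: int) -> None:
        """Site A loads probe image i only if bit i of the identifier is 1."""
        for i, url in enumerate(probe_urls):
            if (user_id >> i) & 1:
                shared_cache.add(url)          # "loading" the image caches it

    def site_b_read() -> int:
        """Site B re-requests every probe; a fast (cached) load is a 1 bit."""
        user_id = 0
        for i, url in enumerate(probe_urls):
            if url in shared_cache:            # in reality: measured load time
                user_id |= 1 << i
        return user_id

    site_a_store(0b10110010)
    assert site_b_read() == 0b10110010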
Posted Jan 26, 2021 21:00 UTC (Tue)
by epa (subscriber, #39769)
[Link] (2 responses)
In general, it is not possible to close all covert channels. In principle, it might be worth accepting the one bit of information leakage in exchange for the other benefits of caching.
Posted Jan 27, 2021 13:06 UTC (Wed)
by excors (subscriber, #95769)
[Link] (1 responses)
It's not just "one bit", it's one bit per image, and you can very cheaply load many images to transfer meaningful amounts of data on every page load. There's little point blocking supercookie mechanisms that depend on the content of cached resources, if it's just going to delay the trackers for a couple of weeks until they all switch over to a cache-timing mechanism (which is a little more complicated and expensive and less reliable, so it wasn't their first choice, but still seems entirely practical). An effective solution needs to block the timing information too.
Posted Feb 1, 2021 0:05 UTC (Mon)
by ras (subscriber, #33059)
[Link]
Is that relevant?
The privacy attack described is pretty simple. When a URL (or any cached content, e.g. DNS) is fetched, return different content every time and set the caching parameters to "cache indefinitely", so this unique content is stored by the browser and will be returned by all future fetches of that URL. This unique content can then be used as a persistent identifier for the browser.
Actually, it's beyond pretty simple, it's drop-dead easy. It would not be at all surprising if it's rampant already. It doesn't need to be described anywhere to make defending against it worthwhile.
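A sketch of what the server side could look like, using Python's standard http.server (the handler and the choice of a UUID as the identifier are purely illustrative):

    import uuid
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class SupercookieHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Hand out a unique identifier on first fetch and tell the browser
            # to cache it "indefinitely", so later fetches of the same URL
            # replay the identifier instead of hitting the server again.
            ident = uuid.uuid4().hex
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Cache-Control", "public, max-age=31536000, immutable")
            self.send_header("ETag", f'"{ident}"')
            self.end_headers()
            self.wfile.write(ident.encode())

    # HTTPServer(("", 8000), SupercookieHandler).serve_forever()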
Posted Jan 27, 2021 1:56 UTC (Wed)
by stephen.pollei (subscriber, #125364)
[Link]
Posted Jan 27, 2021 3:04 UTC (Wed)
by pabs (subscriber, #43278)
[Link] (5 responses)
Posted Jan 27, 2021 10:53 UTC (Wed)
by leromarinvit (subscriber, #56850)
[Link] (3 responses)
Posted Jan 27, 2021 12:16 UTC (Wed)
by excors (subscriber, #95769)
[Link] (2 responses)
(These are very coarse timings, so the Spectre mitigation of eliminating high-resolution timing APIs in JavaScript won't help here. And I suspect browsers can't prevent scripts from observing when a cross-origin resource has finished loading without massively breaking compatibility with large parts of the web (which would be unacceptable): embedding external images is very common, and there are lots of widely used APIs that can observe the on-screen layout of the page, and the layout will necessarily change once the image is loaded. So they can't stop scripts from measuring the timing; they just have to stop that timing from being able to pass information between different sites.)
Posted Jan 30, 2021 2:13 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
I have often wished that those APIs were *less* powerful, honestly.
There are two use cases for the modern web:
1. The interchange of hypertext documents.
2. Web applications

#2 has an obvious need for fine control over how pages are laid out. However, for case #1, such fine control is (usually) considered an anti-pattern, as the whole point of hypertext is that it can be rescaled or reshaped automatically to fit the client's specific presentation needs.
What really frustrates me is the widespread misuse of APIs (that were clearly intended for case #2) in ways that make case #1 more obnoxious (for lack of a better term). In the 90's and early 2000's, popups were basically obliterated after browsers started systematically blocking them. Today, they're back, but now they're embedded in the page and significantly more annoying (because they're modal). Another common problem is the constant page reflowing as ads and other crap from every random corner of the web gradually pops itself into the DOM. I really would like to be able to just read the damn text without it constantly jumping around.
IMHO the gradual extension of HTML to accommodate these APIs was necessary (because Java and Flash were both terrible) but I still wish we had found a better way of cleanly separating case #1 from case #2. AMP was/is Google's attempt to do this, but everyone hated it for being not-HTML and for its obvious monopolistic tendencies, so I have no idea where that leaves us.
(Disclaimer: I work for Google, but not on anything related to web frontends or AMP.)
Posted Jan 30, 2021 17:47 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
All too often I go to print a web page and (when I hit "print preview", because I've learnt) it bears ABSOLUTELY NO RESEMBLANCE WHATSOEVER to what's displayed on screen.
Cheers,
Wol
Posted Jan 27, 2021 11:31 UTC (Wed)
by epa (subscriber, #39769)
[Link]
Posted Feb 1, 2021 0:32 UTC (Mon)
by ras (subscriber, #33059)
[Link] (2 responses)
Perhaps, but it's a read-once variable, it's fragile because the browser can choose not to cache it anyway, and you will need lots of them, which slows performance.
> I don't see any reasonable way to prevent that other than isolating the caches per site, like Firefox and Chrome are doing.
The thing they are defending against relies on the server returning different content each time the given URL is fetched. So break that.
A way to break it is to invent a new kind of URL, e.g. https://domain/..immutable../?sha512=xxxxxxxx&len=98765&iso8601=20210101T012345.678, that always returns the same content, with the browser enforcing that by verifying the content matches the hash in the URL. Those URLs get very aggressively cached for free, without you having to ask for it. Any URL that doesn't guarantee immutable content gets subjected to the proposed cache segregation, or perhaps not cached at all. Problem mostly solved.
Even better, if you say anything that uses "?sha512=xxxxxxxx&len=98765&iso8601=20210101T012345.678" must return the same content regardless of what https://domain/ it's under, you get cross-site caching without the programmer having to lift a finger. But that would make those read-once variables you describe harder to recognise, so maybe it's not such a good idea.
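A sketch of the check a browser could apply before admitting such a response to a shared, cross-site cache (the function is hypothetical, and only the response body is hashed, matching the URL format above):

    import hashlib
    from urllib.parse import urlparse, parse_qs

    def immutable_url_matches(url: str, body: bytes) -> bool:
        """Admit a response to the shared cross-site cache only if the body
        matches the sha512 and length pinned in the ..immutable.. URL."""
        parsed = urlparse(url)
        if "/..immutable../" not in parsed.path:
            return False                       # ordinary URL: partitioned cache
        params = parse_qs(parsed.query)
        expected_hash = params.get("sha512", [""])[0]
        expected_len = int(params.get("len", ["-1"])[0])
        return (len(body) == expected_len
                and hashlib.sha512(body).hexdigest() == expected_hash)

    # e.g. immutable_url_matches(
    #     "https://domain/..immutable../?sha512=...&len=98765&iso8601=...", body)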
Posted Feb 1, 2021 10:38 UTC (Mon)
by excors (subscriber, #95769)
[Link] (1 responses)
I believe that's only a subset of what they're trying to defend against. E.g. the cache isolation feature was discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1536058 because of an attack that depends only on cached vs uncached load timing. One developer says cache isolation "is a long-term anti-tracking goal too (in order to prevent ETag-based tracking vectors) so this gives us yet another privacy related reason for doing so", so they had already been thinking about this more general solution. There's also stuff like https://github.com/w3c/server-timing/issues/67 where a cached HTTP header can be used as an identifier (which wouldn't be protected by your scheme if you're hashing just the response's body; and you probably can't hash headers without breaking HTTP proxies).
The specific attack mentioned in Mozilla's blog post could be prevented in other cheaper ways, but that would do nothing against a lot of other published and yet-to-be-discovered attacks, so they went with cache isolation to prevent all those different methods (including the timing ones) at once.
Posted Feb 1, 2021 11:16 UTC (Mon)
by ras (subscriber, #33059)
[Link]
Sigh. I hadn't thought of headers. But this is a new sort of fetch, and for this sort you could say "you don't get access to no stink'in headers". Or perhaps you just get access to a harmless pre-defined set, similar to what CORS allows.
In any case this can't be a replacement, as not everything is immutable. You still need to do the cache isolation for the non-immutable stuff. It can only be an addition that eliminates the impact of the cache isolation for stuff that doesn't need it.
Posted Jan 27, 2021 5:48 UTC (Wed)
by fmarier (subscriber, #19894)
[Link]
> One way might be to extend the HTML <A> and <IMG> elements to optionally specify a hash of the data being linked to.
> <IMG src="https://global-big-image/me.png" hash="sha256:abcdef" />

This is essentially what subresource integrity does for scripts and stylesheets, but as others have pointed out, the mere presence of a resource in the cache can leak sensitive browsing history.
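For reference, the subresource integrity value is just a base64-encoded digest of the resource, prefixed with the hash algorithm; a quick sketch of computing one in Python (the file name is made up for illustration):

    import base64
    import hashlib

    def sri_value(data: bytes, alg: str = "sha384") -> str:
        """Compute a subresource-integrity style value: '<alg>-<base64 digest>'."""
        digest = hashlib.new(alg, data).digest()
        return f"{alg}-{base64.b64encode(digest).decode()}"

    # Hypothetical usage, e.g. to fill in <script integrity="sha384-...">:
    # with open("framework.js", "rb") as f:
    #     print(sri_value(f.read()))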