
Summary of the DebConf 2038 BoF

Steve McIntyre reports from a BoF session on the year-2038 problem at DebConf 17. "It's important that we work on fixing issues *now* to stop people building broken things that will bite us. We all expect that our own computer systems will be fine by 2038; Debian systems will be fixed and working! We'll have rebuilt the world with new interfaces and found the issues. The issues are going to be in the IoT, with systems that we won't be able to simply rebuild/verify/test - they'll fail. We need to get the underlying systems right ASAP for those systems."


From:  Steve McIntyre <steve-AT-einval.com>
To:  debian-devel-AT-lists.debian.org
Subject:  Summary of the 2038 BoF at DC17
Date:  Sat, 2 Sep 2017 00:58:54 +0100
Message-ID:  <20170901235854.ds4hffumd2ktggau@tack.einval.com>

Hi folks,

As promised, here's a quick summary of what was discussed at the 2038
BoF session I ran in Montréal.

Thanks to the awesome efforts of our video team, the session is
already online [1]. I've taken a copy of the Gobby notes too,
alongside my small set of slides for the session. [2]

We had a conversation about the coming End of The World, the 2038 problem.

What's the problem?
-------------------

UNIX time_t is a signed 32-bit count of seconds since Jan 1,
1970. It's going to wrap. It's used *everywhere* in UNIX-based
systems. Imagine the effects of Y2K, but worse.

What could go wrong?
--------------------

All kinds of disasters! We're not trying to exaggerate this *too*
much, but it's likely to be a significant problem. Most of the things
that needed fixing for Y2K were on big, obvious computers, typically
mainframes. Y2K was solved by people doing a lot of work; to many
outside that work, it almost came to seem an anti-climax.

In 20 years' time, the systems that we will be trying to fix for the
2038 problem are likely to be much harder to track. They're typically
not even going to look like computers - look at the IoT devices
available today, and extrapolate. Imagine all kinds of devices with
embedded computers that we won't know how to talk to, let alone verify
their software.

When does it happen?
--------------------

Pick the example from the date(1) man page:

$ date --date=@$((2**31-1))
Tue 19 Jan 03:14:07 GMT 2038

At that point, the signed 32-bit counter wraps, and lots of software
will believe it's suddenly December 1901...
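
To see the wrap concretely, here's a minimal C sketch (assuming a
host with 64-bit time_t, so that ctime() can render both dates):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    int32_t t = INT32_MAX;            /* 2038-01-19 03:14:07 UTC */
    time_t before = (time_t)t;
    printf("last good second: %s", ctime(&before));

    t = (int32_t)((uint32_t)t + 1);   /* one more tick: wraps negative */
    time_t after = (time_t)t;
    printf("one second later: %s", ctime(&after));   /* back in 1901 */
    return 0;
}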

What needs doing?
-----------------

Lots of fixes are going to be needed all the way up the stack. 

Data formats are often not 2038-safe. The filesystems in use today are
typically not ready. Modern ext4 *is*, using 34 bits for seconds and
30 bits for nanoseconds. btrfs uses a 64-bit second counter. But data
on older filesystems will need to be migrated.
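
For illustration, this is roughly how a split like ext4's works: a
32-bit on-disk seconds field plus an "extra" word carrying two epoch
bits and 30 nanosecond bits. The struct and function names here are
made up for the sketch, not the kernel's:

#include <stdio.h>
#include <stdint.h>

struct ext4_like_timestamp {
    uint32_t seconds;   /* classic 32-bit seconds field, on disk */
    uint32_t extra;     /* bits 0-1: epoch extension; bits 2-31: nanoseconds */
};

static void decode(const struct ext4_like_timestamp *ts,
                   int64_t *sec, uint32_t *nsec)
{
    int64_t epoch_bits = ts->extra & 0x3;              /* 2 extra bits */
    *sec  = (int64_t)(int32_t)ts->seconds + (epoch_bits << 32);
    *nsec = ts->extra >> 2;                            /* 30 bits */
}

int main(void)
{
    /* a second just past the 31-bit limit, stored with an epoch bit set */
    struct ext4_like_timestamp ts = { .seconds = 0x80000000u, .extra = 0x1 };
    int64_t sec; uint32_t nsec;
    decode(&ts, &sec, &nsec);
    printf("decoded: %lld seconds, %u ns\n", (long long)sec, nsec);
    return 0;
}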

There are many places in the Linux kernel where 32-bit time_t is
used. This is being worked on, and so are the interfaces that expose
32-bit time_t.

Lots of libraries use 32-bit time_t, even in places where it might not
be obvious. Finally, applications will need fixing.

Linux kernel
------------

There's a project underway to fix Linux time-handling, led by Deepa
Dinamani and Arnd Bergmann. There's a web site describing the efforts
at https://kernelnewbies.org/y2038 and a mailing list at
y2038@lists.linaro.org. They are using the y2038 project as a good way
to get new developers involved in the Linux kernel, and those people
are working on fixing things in a number of areas: adding core 64-bit
time support, fixing drivers, and adding new versions of interfaces
(syscalls, ioctls).

We can't just replace all the existing interfaces with 64-bit
versions, of course - we need to continue supporting the existing
interfaces for existing code.
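
As an illustrative sketch (not actual kernel code, and with
illustrative struct names) of what keeping an old interface alive
looks like: the legacy 32-bit entry point becomes a thin wrapper that
widens its argument and forwards to the 64-bit implementation:

#include <stdint.h>

struct old_timespec32 {        /* what legacy 32-bit callers pass in */
    int32_t tv_sec;
    int32_t tv_nsec;
};

struct timespec64 {            /* the new internal representation */
    int64_t tv_sec;
    int64_t tv_nsec;
};

static int do_settime64(const struct timespec64 *ts)
{
    (void)ts;                  /* stand-in for the real 64-bit work */
    return 0;
}

int do_settime32(const struct old_timespec32 *ts32)
{
    struct timespec64 ts64 = {
        .tv_sec  = ts32->tv_sec,    /* sign-extends to 64 bits */
        .tv_nsec = ts32->tv_nsec,
    };
    return do_settime64(&ts64);     /* old ABI in, new code path out */
}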

There are lots of tasks here where people can join in and help.

Glibc
-----

Glibc is the next obvious piece of the puzzle - almost everything
depends on it. Planning is ongoing at

  https://sourceware.org/glibc/wiki/Y2038ProofnessDesign

to provide 64-bit time_t support without breaking the existing 32-bit
code. There's more coverage in LWN at

  https://lwn.net/Articles/664800/

The glibc developers are obviously going to need the kernel to provide
64-bit interfaces to make much progress. Again, there's a lot of work to
be done here and help will be welcome.
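
As a concrete sketch of what the opt-in might look like for 32-bit
builds - modelled on the large-file-support precedent, and assuming
the _TIME_BITS=64 feature macro from the design above ends up being
the switch - a program can check which time_t it was given:

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* With something like:
     *   gcc -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 check.c -o check
     * a 32-bit build would get a 64-bit time_t; without the macros it
     * keeps the old 32-bit one, preserving the existing ABI. */
    printf("time_t is %zu bits here\n", sizeof(time_t) * 8);
    return 0;
}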

Elsewhere?
----------

If you're working further up the stack, it's hard to make many fixes
when the lower levels are not yet done.

Kernels other than Linux are also going to have the same problems to
solve - we've not really looked at them in much detail. As the old time_t
interfaces are POSIX-specified, hopefully we'll get equivalent new
64-bit interfaces that will be standard across systems.

Massive numbers of libraries are going to need updates, possibly more
than people realise. Anything embedding a time_t will obviously need
changing. However, many more structures will embed a timeval or
timespec and they're also broken. Almost anything that embeds
libc-exposed timing functions will need updating.
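
A small example of why this spreads so far: any exported struct that
embeds one of these types changes size and layout when the type
widens, so a library and an application built with different
assumptions disagree about the same bytes. The struct here is
hypothetical:

#include <stdio.h>
#include <stddef.h>
#include <time.h>

struct log_record {            /* hypothetical library-exposed struct */
    int    level;
    time_t stamp;              /* 4 bytes on old 32-bit ABIs, 8 after */
    char   msg[48];
};

int main(void)
{
    printf("sizeof(struct log_record) = %zu\n", sizeof(struct log_record));
    printf("offsetof(msg)             = %zu\n",
           offsetof(struct log_record, msg));
    return 0;
}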

We're going to need mass rebuilds to find things that break with new
interfaces, and to ensure that old interfaces still work. Another
obvious thing to do here is automated scanning for ABI compliance as
things change.

Things to do now
----------------

Firstly: developers trying to be *too* clever are likely to only make
things worse - don't do it! Whatever you do in your code, don't bodge
around the 32-bit time_t problem. *Don't* store time values in weird
formats, and don't make assumptions about time_t's representation to
"avoid" porting problems. These are all going to cause pain in the
future as we try to fix problems.

For the time being in your code, *use* time_t and expect an ABI break
down the road. This is the best plan *for now*.
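
In code, the advice boils down to something like this (illustrative
struct names):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

struct session_good {
    time_t started;            /* fine: follows the platform's time_t */
};

struct session_bad {
    uint32_t started_secs;     /* bodge: bakes the 2038 limit into your ABI */
};

int main(void)
{
    printf("good: %zu bytes, bad: %zu bytes\n",
           sizeof(struct session_good), sizeof(struct session_bad));
    return 0;
}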

In terms of things like on-disk data structures, don't try to
second-guess future interfaces by simply adding padding space for a
64-bit time_t or equivalent. The final solution for time handling may
not be what you expect and you might just make things worse.

Dive in and help the kernel and glibc folks if you can!

Next, check your code and the dependencies of your code to see if
there are any bodges or breakages that you *can* fix now.

Discussion
----------

It's a shame to spoil the future pensions of people by trying to fix
this problem early! :-)

There are various license checkers already in use today - could the
same technology help find time junk? Similarly, are any of the
static analysis tools likely to help? It's believed that Coverity (for
example) may be looking into the static analysis component of
this. There's plenty of scope for developers to help work in these
areas too.

Do we need to worry about *unsigned* time_t usage too (good to 2106)?
As an example, OpenPGP packets use that. This gives us a little bit
longer, but will still need to be considered. The main point to
consider is fixing things *properly*, don't just hack around things by
moving the epoch or something similarly short-term.
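
For reference, the 2106 figure is just the unsigned 32-bit limit; a
quick check on a system with 64-bit time_t:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    time_t last = (time_t)UINT32_MAX;   /* 2^32 - 1 seconds after 1970 */
    printf("unsigned 32-bit time runs out: %s", ctime(&last));
    /* prints a date in early February 2106 */
    return 0;
}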

It's important that we work on fixing issues *now* to stop people
building broken things that will bite us. We all expect that our own
computer systems will be fine by 2038; Debian systems will be fixed
and working! We'll have rebuilt the world with new interfaces and
found the issues. The issues are going to be in the IoT, with systems
that we won't be able to simply rebuild/verify/test - they'll fail. We
need to get the underlying systems right ASAP for those systems.

2038 is the problem we're looking at now, but we're going to start
seeing issues well before then - think repeating calendar entries.

Libraries often don't need to expose any time_t style information, but
it's something to be careful about. If people have worked things out
well, changing the internal implementation of a delay function should
not leak changes up the stack. But it's easy to pick up changes without
realising - think about select() in the event loop, for example.
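
A minimal example of that select() case - this loop never touches a
timestamp, yet it still embeds a time type (struct timeval) in its
ABI:

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(0, &readfds);                /* watch stdin */

    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };  /* the hidden time type */

    int ready = select(1, &readfds, NULL, NULL, &tv);
    if (ready > 0)
        printf("stdin is readable\n");
    else if (ready == 0)
        printf("timed out\n");
    return 0;
}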

Statically linked things (e.g. the Go ecosystem) are likely to bite -
we need to make sure that the libraries they embed are fixed early, so
that the stack above them can be rebuilt.

How can we enforce the ability to upgrade and get support for IoT
products so that they don't just brick themselves in future? GPL
violations play into this because the sources are unavailable - i.e.,
no way to rebuild and upgrade. Ancient vendor kernels are a major
PITA, and only make things more urgent.

If you're designing your own data format without reference to current
or upcoming standards, then of course consider the need for better
time handling. Conversions will be needed anyway.

Main takeaways:

 * This is a real problem
 * We need to fix the problem *early*
 * People are working on this already, and there's plenty of tasks to
   help with

[1]
http://meetings-archive.debian.net/pub/debian-meetings/20...
[2] https://www.einval.com/~steve/talks/Debconf17-eotw-2038/

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
"...In the UNIX world, people tend to interpret `non-technical user'
 as meaning someone who's only ever written one device driver." -- Daniel Pead



Summary of the DebConf 2038 BoF

Posted Sep 4, 2017 8:48 UTC (Mon) by hifi (guest, #109741) [Link] (5 responses)

Could someone point out *why* it hasn't been as easy as deciding to introduce a new "time64_t" and other time-related structures as well?

It feels like the most straightforward solution, and it could be one of the easiest to implement. It does not just move the issue by some years; it moves it so ridiculously far away that it doesn't matter anymore. It also helps with fixing up old applications with less hassle, as they were built with UNIX time in mind to begin with, so the epoch doesn't change for them. Sure, it requires going through a lot of time-related code, because other parts of such applications will likely use 32-bit integers to store time-related values and binary file formats or protocols, but that's beside the point.

The only thing that needs to be done then, for ABI compatibility, is to make sure the old ones exist and work (and fail) like they would today, and to deprecate them as fast as possible.

In-kernel things, filesystems and other pieces of code can be updated bit by bit, as both would work simultaneously. For testing purposes the new time64_t could be exposed today: even if we're still actually counting time in 31 bits, we could expose a 64-bit (or 63-bit, keeping signedness?) interface.
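
Something like this minimal sketch, perhaps (the names are invented for illustration, not a proposal for what libc should actually pick):

#include <stdint.h>
#include <time.h>

typedef int64_t time64_t;

/* Same epoch, wider type: every old 32-bit value stays valid as-is. */
time64_t time64(time64_t *out)
{
    time64_t now = (time64_t)time(NULL);   /* still 32-bit-limited inside */
    if (out)
        *out = now;
    return now;
}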

Am I thinking too small here or has this issue been blown out of proportion?

Summary of the DebConf 2038 BoF

Posted Sep 4, 2017 10:15 UTC (Mon) by joib (subscriber, #8541) [Link] (1 responses)

I think this is roughly the plan, yes (that is, introduce time64_t etc.), analogous to how large file support was handled with the attendant macros (_FILE_OFFSET_BITS=64) etc.

The problem is that there's a *lot* of stuff that embeds time_t somewhere (structs, ioctls etc.). So it's a lot of work, for little immediate benefit. Particularly for workstation/server users who are already on 64-bit systems where time_t is 64-bit so this is a non-issue for them. The problem really only affects embedded, where most likely a lot of 32-bit systems will still be around in 2038.

Summary of the DebConf 2038 BoF

Posted Sep 4, 2017 10:23 UTC (Mon) by joib (subscriber, #8541) [Link]

> Particularly for workstation/server users who are already on 64-bit systems where time_t is 64-bit so this is a non-issue for them.

To clarify, the size of time_t is not a problem for these users. Other y2038 issues like on-disk timestamps in filesystems etc. certainly affect them as well.

Summary of the DebConf 2038 BoF

Posted Sep 5, 2017 7:44 UTC (Tue) by ledow (guest, #11753) [Link] (2 responses)

Because many protocols and filesystems weren't ever built with that in mind and need a total on-disk / on-wire format redesign to accommodate it.

Everything from NTP to FAT will have to be changed, in a non-backwards-compatible way, and in most cases will break ALL interaction with other systems (e.g. NTP servers, storage devices) when you do so.

The "solution" is obvious. Any number of ways to fix it. The problem is implementing the solution in a way that allows people to keep using their computers.

Case in point - you have old tape systems with historical data on them, filesystem-dated, etc. (research data, archives, mortgage information, etc., etc. - from the era where timestamps were not duplicated into the files because that just took up unnecessary space, and you used old filesystems because you didn't have room for 64-bit values). You "upgrade" the filesystem modules to something new that allows you to move to 64-bit. You read an old file, and have to convert that date (from 1970 -> 2038) into the modern format. How do you ensure that - post-2038 - you know whether that file was really made in 1970 or 2038? Sure, newly-created files will be post-2017. But old ones? Are you going to silently trash the date fields for your users? "But they shouldn't be using them?" Then why have them? And is your system going to think it's not been backed up in 40 years?

Or do you stop everyone's NTP server working with the established standard? Or does any NTP device now have to be redesigned? Or are you going to try to negotiate a compatible protocol with very expensive GPS / atomic clocks / etc. timekeeping devices running things like stock-exchanges, that were designed years before any such negotiation / protocol exists? Is the user just going to be told "Sorry, bin it." and have to buy a new one? I'm sure the manufacturers would love it, the users not so much. In fact they might just push back and say "I'm not going to upgrade my OS then, until 2038, and get another 21 years of usage out of this ridiculously expensive box".

It's not a "technical" problem, it's a practical one.

It's literally a case of trying to get users to move from "something that will work for another 21 years" to "something that will mean you have to spend lots of money and effort now". Sure, common sense would dictate the answer, but try running that past your company accounts department. "We have to spend this money now". "Why?". "Because if we don't, in 21 years time we might have the exact same problem but be 21 years closer to a solution and all our competitors will be having to do the same at that point anyway".

It's not quite as simple as just extending the timestamp field. It's breaking lots of entirely unrelated protocols. Will SMB1/2/3 wire protocols have to be redesigned? NFS?
NTFS? Word documents (what date format do they save in? Is it going to think the linked Excel file hasn't been updated in 58 years and refuse to honour it?) TLS certificates? DHCP leases? Kerberos tickets? 32-bit dates are EVERYWHERE, and it's not a case of "just fix it in the kernel and everything is fine"; it's literally getting the entire software library to be compliant, compatible, and able to read all its historical data too, without trashing data or even metadata the user may be relying on.

And there are LOTS of hidden dates. Do you know if the date used for the last replication of a HyperV/VMWare/etc. VM is 32-bit or 64-bit? I'm not at all sure I do. Maybe my VMs will decide the replication is too old and just stop? Or maybe it will force a simultaneous, all-VM immediate replication that then can't complete because the rollover confuses it?

Y2K is going to look like a party in comparison, and that cost billions and was relatively well-known to even ordinary users. Y2K38 is literally everything that ever touches a date being broken or unknown until someone redesigns the file format, network protocol or interaction with hardware.

Summary of the DebConf 2038 BoF

Posted Sep 5, 2017 8:55 UTC (Tue) by hifi (guest, #109741) [Link]

> Because many protocols and filesystems weren't ever built with that in mind and need a total on-disk / on-wire format redesign to accommodate it.

The protocols and binary formats need to be changed by application developers and that's a different issue from adding new 64 bit time types.

> You read an old file, and have to convert that date (from 1970 -> 2038) into the modern format. How do you ensure that - post-2038 - you know whether that file was really made in 1970 or 2038?

Expanding UNIX time from 32-bit integers to 64-bit integers does, in fact, keep backwards compatibility. You read the old file and the integer timestamp is still valid even in 2038; it just happens to fit in the low 32 bits of a 64-bit timestamp. Both share the same epoch, so there's no need for actual conversion.

> It's not quite as simple as just extending the timestamp field. It's breaking lots of entirely unrelated protocols.

On the kernel and system side, it should be. You create new 64-bit types that are backwards compatible up to INT32_MAX and call it a day. The actual work is on the application side, like you said.

The thing I don't understand is why we need to scramble about *what* the new type will be when the simple, obvious and backwards compatible choice is right under our noses. The effort should be spent on fixing applications and drivers and not pondering if we need a new epoch or representation format.

Summary of the DebConf 2038 BoF

Posted Sep 5, 2017 13:23 UTC (Tue) by willy (subscriber, #9762) [Link]

You're exaggerating a little:

GPS time rolls over every 19 years (1024 weeks). Next due in April 2019.

NTP rolls over in 2036. It's not anticipated to cause any confusion (the protocol may have been modified by then to have 64-bit seconds, but even if it hasn't, there's no confusion in the protocol; implementations may have bugs, of course).

