This whole debate saddens me

Posted Nov 30, 2014 17:18 UTC (Sun) by Wol (subscriber, #4433)
In reply to: This whole debate saddens me by mjg59
Parent article: The "Devuan" Debian fork

It's all very well saying "the customer has a support contract". The customer still has all the grief of a failed upgrade, and all you've done is MOVED the problem. You haven't cured it.

Even worse if it's an emergency upgrade (which, if it's old hardware destined for scrap, is not unlikely).

Cheers,
Wol

This whole debate saddens me

Posted Nov 30, 2014 18:20 UTC (Sun) by vonbrand (subscriber, #4458)

Users with support contracts expect things to move smoothly from version N to N+1 ... it's part of the deal. And "emergency upgrade, machine is to be scrapped" hasn't ever come up in my experience...

This whole debate saddens me

Posted Nov 30, 2014 19:01 UTC (Sun) by mjg59 (subscriber, #23239)

No big company will respond to "RHEL 9 is broken" with "Let's run RHEL 9 on the RHEL 7 kernel". They'll respond by rolling back to the previously deployed RHEL and screaming at whoever was responsible for giving the go-ahead. In almost 5 years of dealing with RHEL kernel bugs, I literally never saw anybody attempting the situation you're describing. It just doesn't happen.

This whole debate saddens me

Posted Dec 2, 2014 18:49 UTC (Tue) by Wol (subscriber, #4433)

What I was thinking of, and we did that sort of thing on various occasions, was taking (as a real example from my past):

NT3.5 and UV9.6 on old hardware -> NT2000 and UV11 on new hardware.

This is exactly the thing Al Viro flew off on - if the new system performs badly relative to the old one, then you want to find out what the f*** went wrong!

And old hardware breaks. You might be planning an upgrade and your hand gets forced. Or manglement are skinflints and wait until something breaks. Or whatever.

It sounds like that has happened with PostgreSQL already, so it's not an "it might happen", it's a "We've already had to do this before".

Cheers,
Wol

This whole debate saddens me

Posted Dec 2, 2014 19:16 UTC (Tue) by mjg59 (subscriber, #23239)

The equivalent here is "We ran Oracle 13 on RHEL 8 and it performs worse than Oracle 12 on RHEL 7, so we tried Oracle 13 on RHEL 7 and Oracle 12 on RHEL 8", not "We ran Oracle 12 on RHEL 8 and it performs worse than Oracle 12 on RHEL 7, so we tried running the RHEL 7 kernel under RHEL 8". You didn't try to diagnose Windows 2000 performance issues by running it on top of the NT 3.5 kernel!

This whole debate saddens me

Posted Dec 3, 2014 20:44 UTC (Wed) by Wol (subscriber, #4433)

But if you want to find the source of the problem, that's what you have to do. Change as few components as possible until you find what went wrong.

Given your example, if both Oracle 12 & 13 run fine on RHEL 7, and they run worse on RHEL 8, then you know the fault is RHEL. If running RHEL 8's systemd on RHEL 7, and RHEL 7's systemd on RHEL 8, also makes no difference, then you know it's the kernel. Then you play with the kernels until you find the one at fault.
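The swap-one-component-at-a-time procedure above can be sketched roughly as follows. This is a hypothetical illustration: the component names, version labels and the `is_slow` benchmark predicate are made up, and a real test would boot and benchmark each hybrid system. It also assumes components can be mixed freely and that a single component is to blame; interactions between two changed components would not show up.

```python
# Hypothetical sketch: find which upgraded component reproduces a regression
# by moving exactly one component from the new (bad) stack onto the old (good) one.
from typing import Callable, Dict, List

Config = Dict[str, str]  # component name -> version label (illustrative only)


def implicated_components(old: Config, new: Config,
                          is_slow: Callable[[Config], bool]) -> List[str]:
    """Return components whose new version alone reproduces the regression."""
    suspects = []
    for name in old:
        if old[name] == new[name]:
            continue  # unchanged between releases, so it cannot be the cause
        hybrid = dict(old)        # start from the known-good configuration
        hybrid[name] = new[name]  # change only this one component
        if is_slow(hybrid):
            suspects.append(name)
    return suspects


# Hypothetical stacks mirroring the Oracle/RHEL example in the thread.
old = {"app": "oracle-12", "init": "rhel7-systemd", "kernel": "rhel7-kernel"}
new = {"app": "oracle-12", "init": "rhel8-systemd", "kernel": "rhel8-kernel"}

# Stand-in benchmark: pretend only the new kernel is slow.
print(implicated_components(old, new, lambda c: c["kernel"] == "rhel8-kernel"))
# prints: ['kernel']
```

The catch, as the thread goes on to argue, is the "mixed freely" assumption: if the new init system refuses to run on the old kernel, the hybrid configurations can't even be tested.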

But - given the tight coupling between systemd and the kernel - that sort of debugging is harder than it should be.

Linux' motto here is "don't break user space". If current systemd refuses to even boot when using *current* distribution kernels (even if they are decrepit kernels :-) that's a pretty serious user-space breakage!!!

Cheers,
Wol

This whole debate saddens me

Posted Dec 4, 2014 10:51 UTC (Thu) by mjg59 (subscriber, #23239)

Like I said, *nobody* does this. If performance is worse on RHEL 8, and you've verified that it's RHEL and not the application that's responsible, you complain to Red Hat.

This whole debate saddens me

Posted Dec 4, 2014 13:59 UTC (Thu) by madscientist (subscriber, #16861)

Er... *somebody* has to do it. If you file a bug with Red Hat do you think they just meditate on the code until the problem becomes obvious?

And let's remember that not everyone buys a support contract from a company, not everyone has extra systems lying around to test upgrades on before they upgrade their "important" systems, and not everyone upgrades by reinstalling. Are we saying that these people are SOL, and that it's OK that their lives potentially just got a lot more complex so that systemd doesn't have to think about backward compatibility, because those people "aren't doing the work" so who cares?

This whole debate saddens me

Posted Dec 4, 2014 19:52 UTC (Thu) by sjj (guest, #2020)

These days, *everybody* with a credit card has test systems available. Or model your stuff on your laptop in VMs. If your precious server is an Artisanal Build by your Veteran Unix Administrator, insist that its software configuration can be replicated from a version control system at any time.

This whole debate saddens me

Posted Dec 4, 2014 20:04 UTC (Thu) by mjg59 (subscriber, #23239)

The argument in http://lwn.net/Articles/623610/ was:

> What's going to happen when some big company decides to upgrade their ancient RHEL7 system to RHEL9, hits kernel-related problems, and discovers they can't debug it because they can't change kernels?

Everybody running RHEL has a support contract. Nobody running RHEL is going to debug it by swapping out individual components. Red Hat presumably feel comfortable in their ability to do that debug work.

This whole debate saddens me

Posted Dec 4, 2014 14:08 UTC (Thu) by dan_a (guest, #5325)

But then wouldn't Red Hat start doing all of that - because they will want to solve the regression?

This whole debate saddens me

Posted Dec 4, 2014 20:07 UTC (Thu) by mjg59 (subscriber, #23239)

My experience at Red Hat was that performance issues were generally diagnosed by looking at perf traces and making targeted attempts at fixing the issue, rather than just bisecting. The parts of the kernel that tend to cause these issues are frequently rewritten between RHEL releases, so bisecting often just gives you "We changed this algorithm to this other algorithm", and you still need to figure out why the new implementation is pathological, because just reverting the change isn't an option.
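For readers unfamiliar with the term, "bisecting" here means a binary search over an ordered history (the idea behind `git bisect`). A minimal sketch, assuming the first entry is known good, the last is known bad, and once the regression appears every later version also has it. The version list and the `is_bad` test are hypothetical stand-ins for building, booting and testing each kernel.

```python
# Minimal bisection sketch: find the first version exhibiting a regression,
# assuming versions[0] is good, versions[-1] is bad, and badness persists.
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def first_bad(versions: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the earliest bad version using O(log n) calls to is_bad."""
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression came after mid
    return versions[hi]


# Hypothetical history in which the regression arrived in "3.5".
history = ["3.0", "3.1", "3.2", "3.3", "3.4", "3.5", "3.6", "3.7"]
tested: List[str] = []


def is_bad(v: str) -> bool:
    tested.append(v)  # record each (expensive) build-and-test run
    return history.index(v) >= history.index("3.5")


print(first_bad(history, is_bad), "after", len(tested), "tests")
# prints: 3.5 after 3 tests
```

As the comment above points out, the answer is only where the regression entered; if that turns out to be a wholesale algorithm change, it still has to be diagnosed by other means.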

This whole debate saddens me

Posted Dec 4, 2014 20:44 UTC (Thu) by viro (subscriber, #7872)

Do we, by any chance, have magical private perf patches that allow one to obtain information about an apparent race leading to data corruption, sometimes reproducible by a given test in xfstests, and whom should I ask for them? Just going by the last time I had to go into "bisect all the way to 3.0" mode...

</sarcasm> There's a whole lot of stuff other than performance regressions, as you damn well know. For those, yes, you want perf traces first and foremost (and even then you might really want to see where the hell the regression first happened - sometimes it helps in figuring out what's wrong with the current algorithm). And we do upstream work as well, as you also know - after all, crap happening in mainline will eventually end up as crap happening in the next RHEL branch. <sarcasm>

This whole debate saddens me

Posted Dec 4, 2014 21:02 UTC (Thu) by mjg59 (subscriber, #23239)

Undeniably. But they're still not cases where customers are going to try to swap out individual components themselves - the risk is customers being unhappy because it takes Red Hat longer to diagnose an issue, not customers being unhappy that they can't run hybrid RHEL systems.