From: Adam Williamson <awilliam-AT-redhat.com>
Subject: Fedora 20 release day FedUp bug: post-mortem
Date: Mon, 30 Dec 2013 00:05:38 -0800

Hi, folks. Now things have calmed down a bit in Fedora 20 and Rawhide, I have time to write this mail!

Many of you may already know that there was a significant issue with upgrades to Fedora 20 around release day - 2013-12-17.

Summary of the issue
--------------------

Upgrading to Fedora 20 using version 0.7 of the FedUp tool does not work. Upgrading with version 0.8 works (in the main - of course there are bugs, there are always bugs).

At the time Fedora 20 was released, version 0.7 of FedUp was present in the Fedora 18 and Fedora 19 'updates' repositories. Version 0.8 of FedUp was present in 'updates-testing' for both Fedora 18 and Fedora 19 at the time.

Immediate response to the issue
-------------------------------

We realized quite quickly during the course of release day support that this was the case, though at first we thought perhaps only some upgrades were failing. Once it became clear that all 0.7-based upgrades would fail, several folks worked hard at communicating this to as many users in as many places as possible, including via IRC, mailing lists, the Common Bugs page (https://fedoraproject.org/wiki/Common_F20_bugs#fedup-07-fail ), the forums, and social network sites like G+. We advised using fedup 0.8 from updates-testing to upgrade.

We rapidly ensured 0.8 was submitted for stable push for both F18 and F19. It was submitted for F19 at 2013-12-17 21:12:18 (I believe Bodhi timestamps are UTC, so that was mid-afternoon on release day in NA) and for F18 at 2013-12-18 11:51:47 (early morning on the day after release). However, release engineering complications (there were some problems with stable pushes at the time) meant it wasn't finally pushed until 2013-12-19 07:23:09 UTC for F19 (late on the day after release NA time) and 2013-12-19 14:05:50 UTC for F18 (early morning two days after release) and wouldn't have made it to most mirrors until 2013-12-19, two days after release, and probably 2013-12-20 in 'early' timezones in Europe and Asia.

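As a rough illustration of the delay between the stable requests and the actual pushes, the timestamps quoted above can be subtracted directly. This is only a sketch: it assumes all four Bodhi timestamps are UTC (which the mail itself only believes to be the case), and the names PUSHES and push_delay are made up for the example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps quoted in the mail; assumed here to all be UTC.
# Each entry is (submitted for stable, actually pushed stable).
PUSHES = {
    "F19": ("2013-12-17 21:12:18", "2013-12-19 07:23:09"),
    "F18": ("2013-12-18 11:51:47", "2013-12-19 14:05:50"),
}


def push_delay(submitted, pushed):
    """Return the time between the stable request and the actual push."""
    return datetime.strptime(pushed, FMT) - datetime.strptime(submitted, FMT)


delays = {release: push_delay(*stamps) for release, stamps in PUSHES.items()}

for release, delay in delays.items():
    print(release, delay)
```

Under that assumption the fix waited a little over 34 hours for F19 and a little over 26 hours for F18 before being pushed, not counting mirror propagation.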
Proximate cause of the issue
----------------------------

We have not yet identified the direct (proximate) cause of the bug; doing so did not seem especially important in comparison to ensuring news of the issue was spread as widely as possible, ensuring 0.8 was sent stable as soon as possible, and resolving some related issues (see later).

However, QA's current inference is that there is some incompatibility between how fedup 0.7 modifies the initramfs used by the upgrade process and/or how it configures the upgrade boot environment, and the expectations of the upgrade environment as it exists within the final shipped upgrade initramfs. The upgrade initramfs is generated as part of the release compose process, and is dependent on factors including the versions of dracut and fedup-dracut used to build it. Broadly, we suspect that an upgrade run with fedup 0.7 which uses an upgrade initramfs generated with fedup-dracut 0.8 will not work, for reasons not yet identified.

Indirect causes of the issue
----------------------------

We could perhaps make a very broad characterization of the 'indirect causes' of the issue as follows: an upgrade using fedup depends on several moving parts, and neither our development nor testing processes are sufficiently robust to ensure that we cover all possible combinations of those parts.

fedup / fedup-dracut interdependencies
++++++++++++++++++++++++++++++++++++++

So far as I can discern, there is not at present any policy (whether written or enforced by some kind of mechanism) with regard to the inter-dependencies between the 'fedup' package side of the fedup process and the 'fedup-dracut' side of the process which involves release engineering generating an upgrade initramfs via fedup-dracut. As this issue suggests that not all 'fedups' work with all 'fedup-dracuts', perhaps this is something that might be required, but we leave that to the superior knowledge and expertise of the FedUp maintainer.

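Purely to illustrate what an enforced interdependency policy could look like, here is a minimal sketch. The rule it encodes (the client and the fedup-dracut used to build the upgrade initramfs must share the same major.minor version) is an assumption invented for this example; it is not something fedup implements, and the real incompatibility has not been identified. The function names are hypothetical.

```python
def parse_version(version):
    """Turn a dotted numeric string such as '0.8' or '0.8.1' into a tuple of ints.

    Only plain dotted numbers are handled; release suffixes are out of scope.
    """
    return tuple(int(part) for part in version.split("."))


def is_known_compatible(fedup, fedup_dracut):
    """Hypothetical policy: both sides must have the same major.minor version."""
    return parse_version(fedup)[:2] == parse_version(fedup_dracut)[:2]


# The combination suspected above of failing, and the one reported to work.
print(is_known_compatible("0.7", "0.8"))
print(is_known_compatible("0.8", "0.8"))
```

A check of this kind could refuse to start an upgrade (or at least warn) when the two sides are not known to work together, rather than failing at boot time. Whether a version comparison is the right mechanism at all is a question for the maintainer.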
Test procedure inadequacies
+++++++++++++++++++++++++++

Similarly, QA's upgrade testing process clearly did not sufficiently carefully consider the same issue. This is something we have now moved to address.

Prior to Fedora 20's release, the test cases for fedup recommended testing the latest version of fedup from updates-testing against the upgrade initramfs from the development/20 tree. This procedure was a holdover from the very early days of FedUp, when it was changing daily and testing anything older was uninteresting, and when procedures for the generation and publishing of the upgrade initramfs had not yet been clearly established (and TC/RC trees did not contain one). However, it is no longer appropriate for the more mature state of FedUp development at this point in time, and it should have been changed earlier. We in QA apologize to the project for this oversight.

Other factors
+++++++++++++

Additionally, various parties have noted in discussion of this issue that we would have been more likely to notice it, even with our imperfect testing procedures, if a couple of other factors had been different:

* The lifetime of the final release candidate
* The timing of changes to fedup

In recent Fedora releases it has become something of a habit (for which I personally bear rather a large share of the blame) for us to reach RC stage late, iterate RCs rapidly, and often ship an RC that was built only days or even hours before the Go/No-Go decision. This has allowed us to fix bugs we might not otherwise have fixed and to avoid release delays. However, it has the obvious danger that testing of the final release bits may not be as comprehensive as it could be. We always ensure the formal validation testing is sufficiently complete, but an issue like this highlights that a few more days of testing are likely to catch things the formal validation testing process may miss for various reasons, including the kind of deficiency noted above. Even though our test procedure for fedup was outdated, if the final RC had lived for two or three days before being signed off, someone would likely have happened across this issue in time for something to be done about it.

The fact that fedup 0.8 and fedup-dracut 0.8 landed quite late in the cycle is also relevant. fedup 0.8 was submitted for updates-testing on 2013-12-11; fedup-dracut was submitted on 2013-12-06, but the first compose which used it was RC1, built on 2013-12-12 (there was a delay of several days between TC5 and RC1, as blocker bugs kept appearing and needing to be fixed before an RC1 could be spun). RC1.1 was signed off for release on 2013-12-12. Even the mathematically-challenged will note that this left us extremely limited time to spot the problem. (I had tested upgrades with the updated fedup-dracut using a 'scratch built' upgrade initramfs rather earlier, but I must have used fedup 0.8 rather than 0.7 for my tests.)

Obviously, if these fairly significant version bumps had arrived earlier, we might have had more time to identify issues in them. If you're wondering how they were allowed to land so late, the answer is that they fixed blocker bugs we had identified in earlier upgrade testing, and so were allowed through the freeze. As we all know, it is difficult to adhere strictly to 'best practices' with Fedora's extremely short release cycles and ambitious pace of development, but of course it would be best in future if we can manage to avoid landing significant changes to fedup so late, a goal to which both QA and development groups can contribute by identifying and fixing issues at an earlier stage.

As a 'meta' note, I think a factor that contributes to all of the above factors may be a lack of understanding outside a very few people as to precisely how the entire fedup process works: speaking personally, I certainly wasn't acquainted with all the subtleties until investigating this and other issues (not that I'd confidently claim to be an expert even now!) I think beyond Will Woods (obviously) and possibly Tim Flink (who did a lot of early fedup testing) and Dennis Gilmore (who tends to be the one generating the upgrade initramfs), possibly no-one really entirely understood the whole process.

Related issues
--------------

It is probably worth noting a somewhat-related issue at this point. fedup 0.8's major change compared to fedup 0.7 was that it introduced checking of GPG signatures on update packages. To facilitate this, the signing key for the release to which one is upgrading must be available to fedup running on the release from which one is upgrading. Again, we did not have this fully in place at the time of Fedora 20's release.

The fedora-release-19-5 update added Fedora 20's key to Fedora 19: https://admin.fedoraproject.org/updates/FEDORA-2013-21411... . It was submitted on 2013-11-14 and pushed stable on 2013-12-03, so this was in place ahead of release. However, for Fedora 18, the relevant update - https://admin.fedoraproject.org/updates/FEDORA-2013-23598... - was submitted on 2013-12-18 and pushed stable on 2013-12-22 (and then we had to add a signed .treeinfo file to the Fedora 18 repositories or things *still* didn't work, which I think we did late on 2013-12-22 or on 2013-12-23).

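The key requirement described above lends itself to a simple pre-flight check. The sketch below is illustrative only: it does not reflect how fedup actually looks up keys, and the repository names and key labels are placeholders invented for the example.

```python
def repos_missing_keys(enabled_repos, installed_keys):
    """Return, sorted by name, the repos whose signing key is not installed.

    enabled_repos maps a repo name to the label of the key that signs it;
    installed_keys is the set of key labels present on the running system.
    """
    return sorted(
        repo for repo, key in enabled_repos.items() if key not in installed_keys
    )


# Placeholder data: a source system that has only the distribution's own key.
enabled_repos = {
    "fedora": "key-fedora-20",
    "updates": "key-fedora-20",
    "some-third-party": "key-third-party-20",
}
installed_keys = {"key-fedora-20"}

missing = repos_missing_keys(enabled_repos, installed_keys)
print(missing)
```

Running a check like this before the upgrade starts would let the tool name the repositories that need attention instead of simply failing.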
The fact that the keys weren't available for F18 was known around F20 release time, but was not considered urgent by the parties involved: we were not aware that fedup 0.7 simply would not work, and consequently that it would be an urgent matter to make fedup 0.8 available and functional. Release engineering also considered it a delicate operation to add the keys for Fedora 19 and 20 to Fedora 18, and one which they were not inclined to rush.

Post-release reports also make it clear that fedup will abort if GPG keys for *any* repository fedup finds available for the target release cannot be found. That is, if you have RPM Fusion or another popular third-party repository configured, it's quite likely your upgrade will fail, because third-party repos didn't have the signing key issue lined up (not surprising if we couldn't even entirely manage it ourselves). We were not sufficiently aware of this behaviour before release, and did not communicate it very well.

The underlying causes of this are much the same as the underlying causes of the main issue - the fedup which enabled GPG checking landing very late, inadequate/incorrect test procedures, and limited knowledge of the details of fedup operation outside a small group of people.

Addressing the problems
-----------------------

I've noted above that so far as specific code responses to any of these issues go, we should probably defer to the wisdom of the maintainer. However, I've filed a couple of intentionally vague and open-ended tickets on fedup to provide a forum for action:

https://github.com/wgwoods/fedup/issues/42
https://github.com/wgwoods/fedup/issues/43

In terms of QA test procedures, we (QA) have already taken action that should help guard against a repeat of this kind of issue in future. The FedUp test cases - for instance, https://fedoraproject.org/wiki/QA:Testcase_upgrade_fedup_... - have been adjusted to recommend testing the latest fedup from stable or updates (not from updates-testing), and to test against the current TC/RC tree (not the daily-updated development/ tree), now that TC/RC trees contain the upgrade initramfs image. The FedUp and Upgrading wiki pages (https://fedoraproject.org/wiki/FedUp and https://fedoraproject.org/wiki/Upgrading ) have also been updated to be more consistent and correct for the current state of fedup, and the Installation Guide's section on upgrading has also been updated. Our test procedures and upgrade documentation should now be much more coherent and consistent than they were just prior to Fedora 20's release.

In wider terms, this issue is another indicator on top of several previous ones that we should redouble our efforts to get 'releasable' RCs built days ahead of go/no-go, rather than hours. That's a whole story in itself, but this is something the parties involved are all aware of and working on. Of course, the whole release process may look somewhat different in a Fedora.next world, but as long as we have our current release schedule and freeze policies, this issue is likely to exist at least in essence.

It's also another good indicator that we should do whatever we can to try and land major changes much earlier in the release cycle. This is hardly a new observation, of course, nor an issue of which many relevant people were previously unaware, and there are always good reasons why we wind up landing the kitchen sink a week before release, but it's always good to have another reminder.

On the positive side, the simple fact that this issue occurred has probably led to a wider understanding of at least some of the details of how fedup operates, and the fact that more people in the project have that knowledge should aid us in future fedup development and testing: we should be careful to keep that knowledge in mind as we build and test future releases.

Conclusion
----------

Er, thanks for reading this far? :)

-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net