I'm glad you are a HW engineer and you admit it. In all my years of firmware, I've met very few HW engineers who will admit this.
This article hit very close to home, especially the part about DMA randomly locking up. I previously spent a lot of time on a deeply embedded firmware design where I would scan all DMA operations for certain lockup conditions, then abort the DMA hardware operation and replace it with a firmware memcpy.
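For anyone who hasn't had to do this, here is a minimal sketch in C of that kind of workaround. It is not the code from that project. The descriptor type, the function names, and the lockup rule (zero length or a source address that is not 4-byte aligned) are all made up for illustration; the real conditions would come from the chip's errata. The "hardware" calls are host-side stand-ins so the sketch can be compiled and run on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for one queued DMA transfer. */
typedef struct {
    void       *dst;
    const void *src;
    size_t      len;
    bool        done;        /* set once the data has been moved */
    bool        used_memcpy; /* true if firmware did the copy itself */
} dma_op_t;

/* ASSUMPTION: placeholder lockup rule, not from any real part.
 * Flags zero-length transfers and sources that are not 4-byte aligned. */
static bool dma_op_is_risky(const dma_op_t *op)
{
    return op->len == 0 || ((uintptr_t)op->src % 4u) != 0;
}

/* Stand-in for the driver call that would write an abort register. */
static void dma_hw_abort(dma_op_t *op)
{
    (void)op;
}

/* Stand-in for starting a real transfer; simulated with memcpy on the host. */
static void dma_hw_start(dma_op_t *op)
{
    memcpy(op->dst, op->src, op->len);
    op->done = true;
}

/* Scan every pending operation. Risky ones are aborted and copied by
 * firmware; the rest go to the DMA engine as usual.
 * Returns the number of operations that fell back to memcpy. */
size_t dma_scan_and_fallback(dma_op_t *ops, size_t count)
{
    size_t fallbacks = 0;

    assert(ops != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        dma_op_t *op = &ops[i];

        if (op->done) {
            continue;
        }
        if (dma_op_is_risky(op)) {
            dma_hw_abort(op);
            memcpy(op->dst, op->src, op->len);
            op->used_memcpy = true;
            op->done = true;
            fallbacks++;
        } else {
            dma_hw_start(op);
        }
    }
    return fallbacks;
}
```

Calling dma_scan_and_fallback() on a queue with one aligned and one unaligned transfer would send the first to the (simulated) DMA engine and copy the second in firmware. The cost of this approach is CPU time on every fallback, which is exactly the overhead the DMA engine was supposed to remove.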
I believe the issue with digital HW engineers comes from their work environment. Here are a few reasons:
1) Their designs tend to be quite in depth, and Verilog/VHDL code is difficult to read, so the code is heavily supported by Word docs with block diagrams. This of course results in the code slowly drifting away from the documentation, and without a review process only the designer of the module knows how it works. They've done this for so long that they believe this is the only way to do development.
2) Because of (1), brick walls form around modules, and only black-box testing is done by simulations. No one else looks at their code, and no one appreciates being told that their code isn't up to par for readability, so they get very defensive and don't like to admit any fault internal to their design.
3) On top of all that, chip schedules are always rushed, and there is no easy way to go back and fix small mistakes, so they try to sweep them under the carpet and blame firmware, or expect firmware to just figure out how to make it work.
4) It takes months for a chip to come back from the fab, months more for firmware to finish a dev kit for it and start shipping it to customers, and months more before a customer finds a bug. A one-year turnaround is normal, and longer is not uncommon. Digital HW engineers are usually on to another project by that time.
Now every time a bug is found there is a knee-jerk reaction to blame firmware, even when the firmware team approaches the problem as "we don't know where the bug is, but we need help locating it". The HW engineer just doesn't want to open up that can of worms on a design they know was just a rush job. Excuses like "We aren't going to re-spin the chip for this bug, so why should I spend time looking into it" are the norm. This really doesn't help firmware implement a workaround, because the firmware team doesn't know the root cause.
I believe the only way this can change is from a management viewpoint: digital designers aren't just designing a chip right now, they need to support it for years, and that should be budgeted for up front. Documentation, including Verilog/VHDL code comments, can happen well after the chip has taped out, but it needs to be a serious priority in any schedule.