"The other part where H264 is clearly superior is hardware support. For my usage this point alone is a show stopper for the other contenders."
This is an often-invoked but spurious argument.
Direct implementation of decoders in ASICs largely stopped in the late 90s. Today codecs are implemented in 'hardware' by simply writing highly optimized code for more-or-less general purpose processors: either SIMD instruction sets like ARM's NEON, or side-car DSPs like the TMS320C64x included with the OMAP chips.
The remaining 'hardware' support for video consists of things like hardware colorspace conversion, which works equally well for all formats.
The reason for this is twofold. Codecs are simply too volatile for the long development timelines of ASIC design, and the enormous flexibility of modern codecs (like the variable block sizes in H.264) both greatly increases the gate count for dedicated hardware and reduces the performance gap between an optimized software implementation and a dedicated hardware one.
The Theora reference implementation doesn't currently include optimized assembly for NEON or the various widely used DSPs like it does for x86 and x86_64. That gap makes a large performance difference, and it hides the fact that Theora has lower computational complexity than H.264. But this is a *software* problem that can be solved at any point by any party, not a hardware problem.
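To make "software problem" concrete, here is a rough sketch in C of what this kind of optimization looks like. It computes a sum of absolute differences over two byte buffers, a typical sort of inner loop in video code, with a portable scalar version and a NEON path that is only compiled in when the compiler defines `__ARM_NEON`. The names `sad` and `sad_scalar` are my own for illustration; this is not taken from the Theora source and says nothing about how that code is organized.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Portable reference: sum of absolute differences over n bytes.
 * (Hypothetical helper name, for illustration only.) */
static uint32_t sad_scalar(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint32_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    return sum;
}

/* Same result as sad_scalar. On NEON targets, 16 bytes are handled per
 * iteration; elsewhere the whole buffer goes through the scalar loop. */
uint32_t sad(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    uint32_t sum = 0;
#if defined(__ARM_NEON)
    uint32x4_t acc = vdupq_n_u32(0);
    for (; i + 16 <= n; i += 16) {
        /* |a - b| for 16 unsigned bytes at once */
        uint8x16_t d = vabdq_u8(vld1q_u8(a + i), vld1q_u8(b + i));
        /* widen to 16 bits pairwise, then accumulate into 32-bit lanes */
        acc = vpadalq_u16(acc, vpaddlq_u8(d));
    }
    /* horizontal add of the four 32-bit lanes */
    uint64x2_t wide = vpaddlq_u32(acc);
    sum = (uint32_t)(vgetq_lane_u64(wide, 0) + vgetq_lane_u64(wide, 1));
#endif
    /* leftover bytes (or everything, on non-NEON builds) */
    return sum + sad_scalar(a + i, b + i, n - i);
}
```

The point is that the fast path is just more code behind an `#if`: the same silicon runs either version, and anyone with a compiler for the target can add the missing one.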
But just in case you do have a need for hardware support, there is a synthesizable VHDL implementation of the back half of Theora available. There is also a Verilog implementation of Dirac (although perhaps only the Dirac Pro subset?).