Nowadays, unless you are in a tight loop written in assembly, you no longer count the cycles an instruction takes. What matters is the time it takes to load the instruction into the level 1 cache, plus the time to reload the cache line it evicted once the instruction has executed. That cost is roughly proportional to the size of the instruction.
The two-cmp solution needs 16 bytes (in protected mode) if the out-of-bounds handler is within 256 bytes of the test, and 32 bytes if not: that is a complete cache line.
The bound solution needs 8 bytes, mostly because it does not encode the address of the out-of-bounds handler.
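For reference, here is a minimal C sketch of the check that both variants perform. The names `bounds32` and `index_in_bounds` are made up for illustration, and the size figures above refer to the hand-written assembly, not to whatever a compiler emits for this code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds pair, laid out like the memory operand of `bound`:
   lower limit first, then upper limit, both inclusive. */
struct bounds32 {
    int32_t lower;
    int32_t upper;
};

/* Two-compare form: in hand-written assembly each comparison is a cmp
   followed by a conditional jump to the out-of-bounds handler, and the
   jump has to encode the handler's address. */
static bool index_in_bounds(int32_t index, const struct bounds32 *b)
{
    if (index < b->lower)
        return false;   /* first cmp + jump to handler */
    if (index > b->upper)
        return false;   /* second cmp + jump to handler */
    return true;
}
```

With bounds {0, 9}, indices 0 and 9 pass and -1 and 10 fail. The bound instruction does the same signed comparison against both limits, but raises interrupt 5 instead of jumping, which is why it carries no handler address.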
The cost of loading the other 24 bytes is far more significant than the 4-cycle difference.
Even the branch misprediction you will probably get matters more, and so, probably, does the fact that you have polluted the branch prediction cache.
The default INT 5 print-screen handler is not reachable under Linux: the BIOS is not mapped and the APIC is configured differently. If I remember correctly, in user mode you get a signal instead (SIGSEGV, I believe, rather than SIGBUS), and in kernel mode something just as easy to trap or abort on.