The problem with their benchmark is that it isn't a fair appraisal of interpretation versus compilation. The BPF switch interpreter isn't threaded. That is, at the end of every instruction it jumps back to the top of the while loop, which does a conditional branch. Then there's the switch, which may or may not do one or more conditional branches of its own.
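To make that overhead concrete, here's roughly the shape of a non-threaded switch loop. This is a toy accumulator VM, not the actual BPF interpreter; the opcodes, the `struct insn` layout and `run_switch` are all made up for illustration.

```c
#include <stdint.h>

/* Hypothetical toy instruction set, not real BPF opcodes. */
enum { OP_LDI, OP_ADD, OP_RET };

struct insn {
    uint8_t op;   /* one of the OP_* values above */
    int32_t arg;  /* immediate operand */
};

static int32_t run_switch(const struct insn *pc)
{
    int32_t acc = 0;

    for (;;) {              /* every instruction ends by jumping back here */
        switch (pc->op) {   /* single shared dispatch point for all opcodes */
        case OP_LDI:        /* acc = immediate */
            acc = pc->arg;
            pc++;
            break;
        case OP_ADD:        /* acc += immediate */
            acc += pc->arg;
            pc++;
            break;
        case OP_RET:        /* stop and return the accumulator */
            return acc;
        default:            /* unknown opcode: bail out */
            return -1;
        }
    }
}
```

Every handler funnels back through the one `switch`, so the CPU sees a single dispatch site for all opcodes on top of whatever range check or compare chain the compiler emits for the switch itself.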
For a fair comparison with a JIT compiler, the interpreter should instead jump directly from one instruction to the next using a jump table: each handler ends by indexing into a table of label addresses, built with GCC's label address-of operator, &&, and jumping straight to the next handler.
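A minimal sketch of what I mean by threading, using GNU C's labels-as-values extension (so GCC or Clang only). Again this is a toy accumulator VM and not BPF; the opcodes, `struct insn`, `run_threaded` and the `NEXT()` macro are hypothetical, and it assumes the program has already been validated, since an out-of-range opcode would index past the table.

```c
#include <stdint.h>

/* Hypothetical toy instruction set, not real BPF opcodes. */
enum { OP_LDI, OP_ADD, OP_RET };

struct insn {
    uint8_t op;   /* one of the OP_* values above */
    int32_t arg;  /* immediate operand */
};

static int32_t run_threaded(const struct insn *pc)
{
    /* Table of handler addresses, indexed by opcode (order matches the enum). */
    static const void *const dispatch[] = { &&do_ldi, &&do_add, &&do_ret };
    int32_t acc = 0;

/* Jump straight to the handler for the current instruction.
 * Assumes pc->op is a valid opcode; there is no bounds check here. */
#define NEXT() goto *dispatch[pc->op]

    NEXT();

do_ldi:                 /* acc = immediate */
    acc = pc->arg;
    pc++;
    NEXT();

do_add:                 /* acc += immediate */
    acc += pc->arg;
    pc++;
    NEXT();

do_ret:                 /* stop and return the accumulator */
    return acc;

#undef NEXT
}
```

There's no loop and no switch: each handler ends with its own indirect jump, so there's one dispatch site per opcode rather than one shared by all of them.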
On my own VM I can dramatically improve performance on many programs merely by threading the interpreter. If threading the BPF interpreter gives the same performance as the JIT, which it very well could given that BPF might be data-bound and the ops are so simple, then it would be far preferable to adding hundreds of lines of new code for each architecture (or, conversely, leaving some architectures needlessly disadvantaged).