Ah, ok. I guess the advantages weren't immediately obvious with the example you posted, in part because the __fentry__ version still had a stack frame, and the stack frame in both cases wasn't terribly exciting.
With a beefier function and beefier stack frame, the differences would become more noticeable. And if you compile with -fomit-frame-pointer in the __fentry__ version, I can see the differences growing further still, as you note.
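To make that concrete, here is a rough sketch of the kind of "beefier" function I have in mind. The name checksum_line and its body are made up purely for illustration and are not from your example. The only assumption is GCC on x86 with its -pg and -mfentry options. The local buffer is there to force a real stack frame, so the placement of the profiling call relative to the frame setup should be easier to see in the generated assembly.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical example function (not from the original thread).
 * The 128-byte local buffer forces the compiler to set up a real
 * stack frame, unlike a tiny function such as atomic_add.
 *
 * Suggested comparison, assuming GCC on x86:
 *   gcc -O2 -pg -S example.c -o mcount.s
 *   gcc -O2 -pg -mfentry -S example.c -o fentry.s
 *   gcc -O2 -pg -mfentry -fomit-frame-pointer -S example.c -o fentry-nofp.s
 *
 * As I understand it, the mcount call is emitted after the function
 * prologue, while the __fentry__ call is emitted at the very start of
 * the function, before any frame setup.
 */
__attribute__((noinline))
int checksum_line(const char *s)
{
    char buf[128];
    size_t n = strlen(s);

    /* Clamp to the buffer size, leaving room for the terminator. */
    if (n >= sizeof(buf))
        n = sizeof(buf) - 1;

    memcpy(buf, s, n);
    buf[n] = '\0';

    /* Sum the bytes of the local copy so the buffer is actually used. */
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned char)buf[i];

    return sum;
}
```

Diffing the resulting .s files should show whether the profiling call lands before or after the frame setup, and how much prologue work surrounds it in each variant.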
In the atomic_add example, it wasn't obvious to me that mcount would rule out the things you say you might want to do with __fentry__. Your explanation makes the limitations of mcount clearer.