It might be sufficient to just disable the heuristic when the thread making the syscall has real-time priority. Don't know if that's the Perfect Answer, but I think it would be an incremental improvement at least.
I don't much like the idea of making this an undocumented difference between different timing syscalls (like someone else suggested), so that if you use ppoll you get one thing and timerfd another etc. -- I don't see why timerfd should be useful only to apps who need precision and don't care about power! Really the behavior should be uniform across ppoll/poll/pselect/select, epoll, timerfd, nanosleep, posix timers, interval timers. (Not sure if all of those use hrtimers yet; I know nanosleep and posix timers do in -rt.)
For timerfd in particular, one could add a timerfd_setslack call without breaking compatibility. It might be possible for some of those other APIs as well.