I'm working with cluster computing and have exactly the same problem, except that I also have to support outdated garbage like RHEL6. So modifying the kernel was not really a good solution for me.
What I did was allow each memcg to use some swap (about 10G each) and dial swappiness way back, so swap isn't actually used until the main RAM is exhausted. I also have a privileged watchdog process that checks the current RAM usage of every memcg, and if usage comes close to the memcg limit, it notifies an OOM handler process within that memcg.
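To make the watchdog's check concrete, here is a rough sketch in Python, not my actual code. It assumes cgroup v1 with the memory controller mounted somewhere like `/cgroup/memory` (the path differs between distros), and the 90% threshold and the function names are made up for illustration. How the OOM handler gets notified is left out; that could be a signal, a pipe, or whatever suits your setup.

```python
import os

def read_int(path):
    # cgroup v1 control files hold a single decimal number.
    with open(path) as f:
        return int(f.read().strip())

def near_limit(cgroup_dir, threshold=0.9):
    # True when RAM usage has reached the given fraction of the memcg limit.
    # The 0.9 default is an arbitrary illustrative value.
    usage = read_int(os.path.join(cgroup_dir, "memory.usage_in_bytes"))
    limit = read_int(os.path.join(cgroup_dir, "memory.limit_in_bytes"))
    return usage >= threshold * limit

def scan(root, threshold=0.9):
    # Return the names of child memcgs under root that are close to their limit.
    hits = []
    for name in sorted(os.listdir(root)):
        d = os.path.join(root, name)
        if not os.path.isfile(os.path.join(d, "memory.limit_in_bytes")):
            continue  # not a memcg directory
        if near_limit(d, threshold):
            hits.append(name)
    return hits
```

The watchdog would call something like `scan("/cgroup/memory")` in a loop with a short sleep and notify the handler in each memcg it returns.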
Swap allows individual memcgs to temporarily exceed their RAM limit while the OOM handler does its job. It's a bit racy, because processes can fill up the swap faster than the OOM handler can kill them, but since swap is so slow, this doesn't really pose a problem in practice. Also, the central watchdog monitors swap usage, and if it spikes too much after the OOM notification, the whole memcg is summarily killed.
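The swap check and the "kill everything" fallback could look roughly like this. Again a sketch under assumptions: it relies on swap accounting being enabled so that `memory.stat` has a `swap` line, it reads PIDs from `cgroup.procs`, and `max_growth` is a hypothetical tunable, not a value I'm recommending.

```python
import errno
import os
import signal

def swap_bytes(cgroup_dir):
    # Parse the "swap" counter from memory.stat (cgroup v1, swap accounting on).
    with open(os.path.join(cgroup_dir, "memory.stat")) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "swap":
                return int(fields[1])
    return 0  # no swap line: accounting is off or nothing is swapped

def swap_spiked(baseline, current, max_growth):
    # baseline is the swap usage recorded when the OOM notification was sent.
    return current - baseline > max_growth

def kill_memcg(cgroup_dir):
    # SIGKILL every process currently listed in the memcg.
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        pids = [int(tok) for tok in f.read().split()]
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except OSError as e:
            if e.errno != errno.ESRCH:  # process already gone is fine
                raise
    return pids
```

Since processes can fork between reading the list and killing, a real watchdog would probably repeat `kill_memcg` until the list comes back empty.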