I've had a similar idea, which is specifically designed for easy hardware implementation: allow an operation to have a (small integer) tag, then provide a command to "wait for all operations with tag #k to complete".
More generally, you could let every operation have a prerequisite tag that must be completed (you need one reserved tag number which is never used to specify commands with no prerequisites), and have the wait operation be a NOP with a prerequisite.
To merge threads, the wait operation can have a tag #n which differs from the #k it is waiting for. After it is issued, waiting for #n effectively waits for both.
Now, you can merge independent operation streams by doing address translation on tags. And you can compress tag space (down to simple barriers, in the limiting case) by allowing false sharing.