I think VoIP doesn't mind if a couple of packets are lost (that's why it uses UDP in the first place); jitter is the bigger problem. Buffering doesn't really help, because in a phone conversation one party might want to interrupt the other, and seconds of buffering pretty much breaks that.
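To make the trade-off concrete, here is a rough Python sketch of a fixed playout buffer. The delay values and the `late_fraction` helper are made up for illustration, and the model is deliberately simplistic (the deadline is just the smallest observed delay plus the buffer depth), so take it as a toy, not as how any real VoIP stack does it.

```python
def late_fraction(delays_ms, buffer_ms):
    """Fraction of packets that arrive after their playout deadline.

    Simplified model: the deadline is the smallest observed one-way
    delay plus the buffer depth. Packets slower than that are dropped.
    """
    deadline = min(delays_ms) + buffer_ms
    late = sum(1 for d in delays_ms if d > deadline)
    return late / len(delays_ms)


# Hypothetical one-way network delays in milliseconds (not measured data).
delays_ms = [40, 42, 95, 41, 60, 43, 130, 44, 41, 70]

for buffer_ms in (10, 50, 100):
    # buffer_ms is also the extra delay added to every packet that is played.
    print(buffer_ms, late_fraction(delays_ms, buffer_ms))
```

A deeper buffer drops fewer late packets, but every millisecond of buffer is added to the delay each party hears, which is exactly what makes interrupting awkward.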
I'm not quite sure where this merging could be done. Only at the endpoints, or also in the routers in between?