How exactly does one synchronize multicast data access across multiple threads WITHOUT involving the kernel?
And are you really so sure that the kernel currently provides ALL of the functionality needed to do this optimally?
What you are saying is like, "If we had a primitive read() call that only takes one character, that would be sufficient; just call it over and over. It is not the kernel's place to provide optimizations like being able to read multiple characters at once."
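
To put a rough number on the read() analogy, here is a minimal Python sketch that reads the same data from a pipe both ways and counts the os.read() calls, each of which is a system call. The helper names (read_one_at_a_time, read_bulk, demo) and the 1024-byte payload are made up for illustration; this only shows the call-count gap and says nothing about multicast itself.

```python
import os

def read_one_at_a_time(fd, n):
    """Read n bytes using one os.read() call (one system call) per byte."""
    calls = 0
    chunks = []
    while len(chunks) < n:
        b = os.read(fd, 1)
        calls += 1
        if not b:  # EOF before n bytes arrived
            break
        chunks.append(b)
    return b"".join(chunks), calls

def read_bulk(fd, n):
    """Read n bytes, asking for everything still outstanding on each call."""
    calls = 0
    chunks = []
    remaining = n
    while remaining > 0:
        b = os.read(fd, remaining)
        calls += 1
        if not b:  # EOF before n bytes arrived
            break
        chunks.append(b)
        remaining -= len(b)
    return b"".join(chunks), calls

def demo(payload=b"x" * 1024):
    """Push the payload through a fresh pipe for each reader and count calls."""
    results = {}
    for name, reader in (("one_byte", read_one_at_a_time), ("bulk", read_bulk)):
        r, w = os.pipe()
        os.write(w, payload)  # small enough to fit in the pipe buffer
        os.close(w)
        data, calls = reader(r, len(payload))
        os.close(r)
        results[name] = (data == payload, calls)
    return results

if __name__ == "__main__":
    for name, (ok, calls) in demo().items():
        print(name, "intact:", ok, "read calls:", calls)
```

Both readers hand back identical bytes; the only difference is how many times they have to cross into the kernel to get them, which is the whole point of letting the kernel interface carry the batching.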