It's a question of practicalities. Think it through a little bit.
Remember, it's in userspace that the simultaneous reads have to take place.
libv4l has to provide a stream of decoded video frames: it reads encoded data from the video device via the kernel interface, decodes it, and presents a proxy "device" in userspace. It makes sense for at most one thing to be doing that, and AIUI something they call a "frame server" is meant to be the thing doing the proxying.
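Very roughly, the shape of that proxying looks something like the sketch below. This is not libv4l's actual code; decode_frame() and serve_clients() are made-up placeholders for the real codec and the client-facing plumbing, and I'm assuming a driver that supports plain read() on the device node.

/* One process owns the kernel device, decodes what it reads, and hands
 * the result on to clients, which never see the device fd themselves. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define ENCODED_MAX (1 << 20)

static size_t decode_frame(const uint8_t *enc, size_t n,
                           uint8_t *out, size_t out_max)
{
    size_t len = n < out_max ? n : out_max;
    memcpy(out, enc, len);          /* placeholder: run the real codec here */
    return len;
}

static void serve_clients(const uint8_t *frame, size_t len)
{
    (void)frame; (void)len;         /* placeholder: push to attached clients */
}

int main(void)
{
    int fd = open("/dev/video0", O_RDONLY); /* only this process touches it */
    if (fd < 0)
        return 1;

    static uint8_t enc[ENCODED_MAX], dec[4u << 20];
    for (;;) {
        ssize_t n = read(fd, enc, sizeof enc); /* one encoded chunk per read */
        if (n <= 0)
            break;
        serve_clients(dec, decode_frame(enc, (size_t)n, dec, sizeof dec));
    }
    close(fd);
    return 0;
}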
The userspace client programs can then talk to the proxy simultaneously to get whatever data they happen to be interested in, which will differ from client to client. That being the case, the "server" has to be the thing arbitrating access to the data for the different clients: reading and decoding new frames, and updating the pool of decoded data that the clients are reading from and quite possibly seeking around in. Timed callbacks may also be required. All of this implies a separate thread or process of execution; only in the very simplest and most restrictive use cases could you do without one.
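An equally rough sketch of that arbitration: a decoder thread fills a ring of decoded frames, and each client reads from the ring at its own sequence number, so one client can sit on the newest frame while another lags behind or seeks back. All the names and sizes here are mine, not libv4l's.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define POOL_FRAMES 16
#define FRAME_BYTES (640 * 480 * 3)   /* assume fixed-size decoded frames */

struct frame_pool {
    uint8_t         data[POOL_FRAMES][FRAME_BYTES];
    uint64_t        newest;           /* count of frames pushed so far */
    pthread_mutex_t lock;
    pthread_cond_t  more;             /* signalled when a new frame lands */
};

static void pool_init(struct frame_pool *p)
{
    memset(p, 0, sizeof *p);
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->more, NULL);
}

/* decoder thread: called once per freshly decoded frame */
static void pool_push(struct frame_pool *p, const uint8_t *frame)
{
    pthread_mutex_lock(&p->lock);
    memcpy(p->data[p->newest % POOL_FRAMES], frame, FRAME_BYTES);
    p->newest++;
    pthread_cond_broadcast(&p->more);  /* wake every waiting client */
    pthread_mutex_unlock(&p->lock);
}

/* per-client read: *seq is that client's own position in the stream */
static void pool_read(struct frame_pool *p, uint64_t *seq, uint8_t *out)
{
    pthread_mutex_lock(&p->lock);
    while (*seq >= p->newest)           /* nothing new for this client yet */
        pthread_cond_wait(&p->more, &p->lock);
    if (p->newest - *seq > POOL_FRAMES)
        *seq = p->newest - POOL_FRAMES; /* fell behind: jump to oldest kept */
    memcpy(out, p->data[*seq % POOL_FRAMES], FRAME_BYTES);
    (*seq)++;
    pthread_mutex_unlock(&p->lock);
}

Even this toy version needs a lock, a condition variable, and a thread pumping frames into the pool, which is exactly the kind of machinery you can't hide inside a simple per-client library call.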
Simultaneous use of a single device by multiple clients implies this sort of complexity, and you have to stuff it somewhere. In this case, it has to be in userspace.