I think it should work: 32-bit gives a default 3:1 user/kernel split, so on paper you can get about 98,000 32KB stacks out of the 3GB of user space... and with 8KB kernel stacks, the 1GB kernel side has room for roughly 128K (about 131,000) corresponding kernel stacks.
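For anyone who wants to check the arithmetic, here is a quick sketch. It assumes binary units (1KB = 1024 bytes) and that the entire 3GB and 1GB regions are available for stacks, which is only an upper bound. The helper name `max_stacks` is made up for illustration.

```python
KIB = 1024
GIB = 1024 ** 3

def max_stacks(region_bytes, stack_bytes):
    """Upper bound on how many fixed-size stacks fit in a region."""
    return region_bytes // stack_bytes

# User side of a 3:1 split: 3GB of address space, 32KB per thread stack.
user_stacks = max_stacks(3 * GIB, 32 * KIB)

# Kernel side: 1GB of address space, 8KB per kernel stack.
kernel_stacks = max_stacks(1 * GIB, 8 * KIB)

print(user_stacks)    # 98304
print(kernel_stacks)  # 131072
```

So the user-space stacks run out first, at a bit under 100K threads, before the kernel stacks do.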
Obviously you can't use all of the VM map for thread stacks - but you can use a lot of it. And if 32KB is not enough, then it's a state problem, not a threading problem (i.e. an async design would need to stash that state somewhere too). Plus Apache has that hybrid process/thread model, so each process has its own VM map.
Maybe 100K threads is hyperbole if you're all in one process, but 50K+ seems quite reasonable from a memory management standpoint, and it makes a thread much less of a scarce resource than it is in the default Apache config (200 or 300, IIRC).
I'm more curious whether the Linux kernel and libc could handle the creation and scheduling of so many threads. If not, then that's kind of a setback from the direction things were heading when 2.6 was being released (see the link I provided in my last post), and if they can, it seems a much easier approach than rearchitecting the application.
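One rough way to poke at the creation side is to ask for small stacks and see how many threads actually start. This is only a sketch: `spawn_parked_threads` is a made-up helper, the threads just sit blocked on an event, and because it goes through Python's `threading` module it says nothing about scheduling cost under real load. `threading.stack_size()` needs at least 32KB, which happens to match the figure above.

```python
import threading

def spawn_parked_threads(count, stack_bytes=32 * 1024):
    """Try to start `count` blocked threads; return how many actually started."""
    # The new stack size applies to threads created after this call.
    old_size = threading.stack_size(stack_bytes)
    release = threading.Event()
    started = []
    try:
        for _ in range(count):
            t = threading.Thread(target=release.wait)
            try:
                t.start()
            except RuntimeError:
                # The OS or libc refused to create another thread.
                break
            started.append(t)
    finally:
        # Wake every parked thread, wait for it, and restore the old stack size.
        release.set()
        for t in started:
            t.join()
        threading.stack_size(old_size)
    return len(started)

if __name__ == "__main__":
    print(spawn_parked_threads(1000))
```

Raising the count until the return value falls short of what you asked for gives a ballpark for where thread creation stops on a given box, though limits like `ulimit -u` may kick in before address space does.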
Fundamentally, isn't this multiplexing the kind of thing the OS should be doing for you efficiently?