Threaded processes can conceivably allocate a lot of (mostly unused?) stack space, and the kernel allows processes to specify a much bigger limit for those wacky applications that need it. Also, mmap'ing applications can put a lot of pressure on the 32-bit address mapping. So the address space may not be as cheap as you envision.
As the article mentions, and spender helpfully emphasizes, the Delalleau paper gives a good graphical overview of the situation.