True, current NUMA machines are almost all in the category of multi-socket AMD64 systems, which have fast enough interconnects that it's usable if you ignore NUMA ad just treat is as a SMP machine.
Historic NUMA machines had MUCH slower interconnects, the best comparison in moderns systems would be if you were connecting your CPU nodes together with high speed networks.
There are still some people building such machines (I think the current Cray systems are this category), but when you get to interconnects that are that expensive, you are frequently better segmenting the system and running it as if it was a cluster of systems, or (the more common case), just build a cluster of commodity systems instead of the monster NUMA system in the first place.
I think that if AMD hadn't introduced NUMA to the commodity desktop/server with the Opteron, NUMA would be something that's so rare that the overhead and complexity of it's logic wouldn't be acceptable in the kernel.
There are some applications that really are hard to split into a multi-machine cluster, and for those NUMA (including RDMA setups that tie multiple commodity machine together) are the right tool for the task, but they are pretty rare, it's almost always worth re-architecting the application to avoid this requirement.