You're simultaneously overstating the complexity of this problem and the ability of the ANSI C locale facility to solve it.
The product I work on for my day job does natural language processing of internet content in arbitrary languages and encodings. I did the encoding transformation and "word breaker" lexical analyzer for it. The whole system works by transforming the data into UTF-8 and operating on it at the byte level. So sorry to pull the "domain expert" card here, but you're basically just wrong. This stuff has its subtleties, but it's absolutely not something that requires special API support. And if we *had* to pick an API, I can guarantee you it wouldn't be ANSI C's locale stuff, which is a complete non-starter for many of the reasons already detailed.
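To make the "byte level" point concrete, here's a rough sketch in C. It is not our actual code, just an illustration. The function names are made up, and it assumes the input has already been converted to valid UTF-8. It relies on two properties of the encoding: continuation bytes always look like 10xxxxxx, and bytes below 0x80 never appear inside a multibyte sequence. No setlocale() anywhere.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count code points in a buffer assumed to hold valid UTF-8.
 * Every byte that is not a continuation byte starts a new code point. */
size_t utf8_count_codepoints(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (!is_continuation((unsigned char)s[i]))
            n++;
    }
    return n;
}

/* Count tokens separated by ASCII whitespace (hypothetical helper).
 * Scanning raw bytes is safe because an ASCII byte can never be part
 * of a multibyte sequence, so a space is always a real space. */
size_t count_ascii_delimited_tokens(const char *s, size_t len)
{
    size_t n = 0;
    int in_token = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)s[i];
        int is_space = (b == ' ' || b == '\t' || b == '\n' || b == '\r');
        if (is_space) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;
            n++;
        }
    }
    return n;
}
```

For the 12-byte string "na\xc3\xafve caf\xc3\xa9" the first function returns 10 and the second returns 2. A real word breaker obviously has to deal with a lot more than ASCII whitespace (scripts that don't use spaces, non-ASCII punctuation, and so on), but that's extra table-driven logic over the same byte stream, not something that needs locale support.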