The benchmarks you link to seem to indicate that DMA isn't being used; the driver is doing PIO instead. It's not clear why they'd do that. Is this a legacy platform? I'm not familiar with AM335x... what NAND driver does it use? I'm not surprised it uses excessive CPU time if it's doing everything with PIO. If you ran your MMC controller in PIO mode, it'd be slow too.
I don't have specific examples of changing devices without changing model numbers, but I've heard it repeatedly over the years from people analysing such devices. Personally, I've mostly focused on real flash and left MMC to other people.
Paired pages... when they put 2 bits per flash cell in MLC, you'd have hoped that those two bits would be in the *same* logical page, right? So they get programmed at the same time? But no. They are in *different* logical pages, just to make life fun for you. And they aren't even *adjacent* pages; they often pair pages 0 and 3, pages 1 and 2, or something like that. $DEITY knows why... actually, ISTR it's for speed; programming both bits at once would be slower in the microbenchmarks.
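
To make the pairing concrete, here's a rough sketch in Python. It assumes the purely illustrative layout from the example above (within each group of four pages, 0 pairs with 3 and 1 pairs with 2); the `GROUP` constant and the `paired_page` name are made up for this post, and a real chip's pairing table comes from its datasheet and will probably look different.

```python
# Hypothetical pairing layout, taken only from the "0 and 3, 1 and 2" example.
# Real MLC parts use vendor-specific tables; check the datasheet.
GROUP = 4  # assumed number of logical pages per pairing group


def paired_page(page: int) -> int:
    """Return the logical page that shares flash cells with `page`."""
    base = page - (page % GROUP)          # first page of this group
    offset = page % GROUP                 # position inside the group
    return base + (GROUP - 1 - offset)    # mirror position: 0<->3, 1<->2


if __name__ == "__main__":
    for p in range(8):
        print(f"page {p} shares cells with page {paired_page(p)}")
```

The point is just that the mapping isn't "page N and page N+1", so anything reasoning about which pages live in the same cells needs a lookup like this rather than assuming adjacency.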