When you're driving large displays with multiple layers and moving a lot of pixels around, an overlay engine that reads the layers and combines them directly in the scanout path, without memory-to-memory copies, is often a more efficient use of the available bandwidth than compositing on the GPU.
Not using the hardware composition blocks leaves performance on the floor, and it's performance well worth having.
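To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch in Python. It rests on simplifying assumptions of my own: every layer is full-screen and updated every frame, there is no framebuffer compression or caching, and the GPU path costs one read per layer, one write of the composited frame, and one more read when scanout fetches it. The function name and the example figures are purely illustrative, not taken from any particular platform.

```python
def composition_bandwidth(width, height, bytes_per_pixel, layers, refresh_hz):
    """Estimate memory traffic in bytes/second for two composition paths.

    Assumes full-screen layers that all change every frame, with no
    compression or cache hits (a deliberately pessimistic, simple model).
    """
    frame_bytes = width * height * bytes_per_pixel

    # GPU path: read each layer, write the composited frame to memory,
    # then the display controller reads that frame back for scanout.
    gpu = (layers + 2) * frame_bytes * refresh_hz

    # Overlay path: the display controller reads each layer once during
    # scanout and blends on the fly, so there is no intermediate frame.
    overlay = layers * frame_bytes * refresh_hz

    return gpu, overlay


if __name__ == "__main__":
    gpu, overlay = composition_bandwidth(1920, 1080, 4, layers=3, refresh_hz=60)
    print(f"GPU composition:     {gpu / 1e9:.2f} GB/s")      # 2.49 GB/s
    print(f"Overlay composition: {overlay / 1e9:.2f} GB/s")  # 1.49 GB/s
```

Under this model the overlay path always saves two full-frame transfers per refresh, regardless of layer count. Real hardware will differ (partial updates, compression, and scaling all shift the numbers), so treat this as an illustration of where the saving comes from rather than a measurement.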
CPUs and GPUs are becoming more powerful, certainly, but displays are getting larger, software is drawing more complex stackups (with more alpha and effects between layers), and there's seldom as much graphical compute as you'd like on embedded platforms, even the higher-end ones.
Also, efficient multitasking between multiple GPU clients is still rather hit-or-miss. I've seen beefy Win7 desktop machines become unresponsive to window drags, etc., when a complex load is thrown at a high-end desktop GPU, because the compositor and the application are fighting for the available GPU and the hardware and/or drivers don't time-slice it well enough to stay smooth. The problem is typically worse on embedded platforms.