The growing image-processor unpleasantness
Posted Sep 2, 2022 8:46 UTC (Fri) by excors (subscriber, #95769)
In reply to: The growing image-processor unpleasantness by mordae
Parent article: The growing image-processor unpleasantness
I'm not sure if I'm correctly interpreting what you're suggesting, but: the output from the sensor will be something like 10/12-bit Bayer. I don't think it'd be particularly useful to share that directly with the GPU, because GPU memory architectures seem to be optimised for 32-bit-aligned elements with 8/16/32-bit components (since that's what graphics and most GPGPU use), and they'd be inherently inefficient at processing the raw Bayer data. So at the very least you need some dedicated hardware to do the demosaicing efficiently, and probably any other processing that benefits from the sensor's 10/12-bit precision (where it'd be wasteful to use the GPU's 16-bit ALU operations), before saving to RAM as 8-bit YUV.
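To make the alignment point concrete, here's a rough C sketch of unpacking MIPI CSI-2 style RAW10 (four 10-bit pixels packed into five bytes) into 16-bit samples. The function name and loop structure are just for illustration, but it shows why the raw samples straddle byte boundaries and don't map cleanly onto the 8/16/32-bit components a GPU is built around:

    /* Rough sketch: unpack MIPI CSI-2 RAW10 (four 10-bit pixels packed
     * into five bytes) into 16-bit samples.  Bytes 0-3 hold bits [9:2]
     * of each pixel; byte 4 holds the four 2-bit remainders. */
    #include <stddef.h>
    #include <stdint.h>

    static void unpack_raw10(const uint8_t *in, uint16_t *out, size_t npixels)
    {
        for (size_t i = 0; i + 4 <= npixels; i += 4, in += 5) {
            uint8_t low = in[4];               /* packed 2-bit remainders */
            out[i + 0] = (uint16_t)(in[0] << 2) | ((low >> 0) & 0x3);
            out[i + 1] = (uint16_t)(in[1] << 2) | ((low >> 2) & 0x3);
            out[i + 2] = (uint16_t)(in[2] << 2) | ((low >> 4) & 0x3);
            out[i + 3] = (uint16_t)(in[3] << 2) | ((low >> 6) & 0x3);
        }
    }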
Once you've got YUV, then it's useful to do zero-copy sharing with the GPU, and I think any sensible software architecture will already do that. (Intel's GPU has some built-in support for planar YUV420 texture-sampling to help with that.)
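As a rough illustration of what that zero-copy path can look like on Linux, the sketch below imports an NV12 (semi-planar YUV420) dma-buf into GLES as an external texture via the EGL_EXT_image_dma_buf_import and OES_EGL_image_external extensions. The helper name and parameters are hypothetical, and extension-function loading and error handling are omitted:

    /* Sketch only: wrap a camera NV12 dma-buf as a zero-copy GLES texture.
     * Assumes EGL_EXT_image_dma_buf_import and OES_EGL_image_external are
     * available; in real code you'd load the entry points and check errors. */
    #define EGL_EGLEXT_PROTOTYPES
    #define GL_GLEXT_PROTOTYPES
    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>
    #include <drm_fourcc.h>

    static GLuint import_nv12(EGLDisplay dpy, int fd, int width, int height,
                              int y_pitch, int uv_offset, int uv_pitch)
    {
        const EGLint attrs[] = {
            EGL_WIDTH, width,
            EGL_HEIGHT, height,
            EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_NV12,
            /* plane 0: full-resolution Y */
            EGL_DMA_BUF_PLANE0_FD_EXT, fd,
            EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
            EGL_DMA_BUF_PLANE0_PITCH_EXT, y_pitch,
            /* plane 1: interleaved CbCr at half resolution */
            EGL_DMA_BUF_PLANE1_FD_EXT, fd,
            EGL_DMA_BUF_PLANE1_OFFSET_EXT, uv_offset,
            EGL_DMA_BUF_PLANE1_PITCH_EXT, uv_pitch,
            EGL_NONE
        };

        EGLImageKHR img = eglCreateImageKHR(dpy, EGL_NO_CONTEXT,
                                            EGL_LINUX_DMA_BUF_EXT, NULL, attrs);

        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
        /* the driver handles the YUV sampling; no copy of the frame is made */
        glEGLImageTargetTexture2DOES(GL_TEXTURE_EXTERNAL_OES, img);
        return tex;
    }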
But a lot of the image processing will still be inefficient on a GPU. It's way too expensive to write the whole frame to RAM after every processing step (given you may have a dozen steps, and it's a 4K frame at 90fps) - you want to read a chunk into the GPU's fast local memory and do all the steps at once before writing it out, and use the GPU's parallelism to process multiple chunks concurrently. But Intel's GPU has something like 64KB of local memory (per subslice), so you're limited to chunks of maybe 128x64 pixels. Whenever you apply some processing kernel with a radius of N, the 128x64 of input shrinks to (128-2N)x(64-2N) of valid output, and if you do many processing steps then you end up with a tiny number of useful pixels for all that work. The GPU memory architecture is really bad for this. (And that's not specific to Intel's GPU, I think they're all similar.)
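A quick back-of-the-envelope program (hypothetical numbers, assuming each fused stage discards a halo of r pixels per side) shows how fast the useful fraction of a 128x64 tile collapses as you chain more stages:

    /* Illustration of the tile-halo problem: fuse k stages, each needing a
     * halo of r pixels per side, on a 128x64 tile and see how little valid
     * output remains.  The radius and stage count are made-up examples. */
    #include <stdio.h>

    int main(void)
    {
        const int tile_w = 128, tile_h = 64;  /* fits ~64KB of local memory */
        const int r = 2;                      /* halo per side, per stage */

        for (int k = 1; k <= 12; k++) {
            int w = tile_w - 2 * r * k;
            int h = tile_h - 2 * r * k;
            if (w <= 0 || h <= 0) {
                printf("%2d stages: no valid output left\n", k);
                break;
            }
            double useful = 100.0 * w * h / (tile_w * tile_h);
            printf("%2d stages: %3dx%2d valid output (%.0f%% of the tile)\n",
                   k, w, h, useful);
        }
        return 0;
    }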
So you still want dedicated hardware (with associated firmware) for most of that processing, with a much more efficient local memory architecture (maybe a circular array of line buffers per processing stage), and just use the GPU and/or CPU for any extra processing that you couldn't get into the hardware because of cost or schedule.
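For what I mean by a circular array of line buffers, here's a minimal C sketch of a streaming stage that keeps only three scanlines resident in fast memory and emits one filtered line per input line once its window is full; the 3x1 box filter and the struct layout are placeholders for a real ISP stage:

    /* Minimal sketch of a line-buffered streaming stage: only KSIZE
     * scanlines live in local memory at once, and each new input line
     * produces one output line.  The 3x1 box filter is a stand-in. */
    #include <stdint.h>
    #include <string.h>

    #define KSIZE 3
    #define MAX_WIDTH 4096

    struct line_stage {
        uint16_t lines[KSIZE][MAX_WIDTH];  /* circular buffer of scanlines */
        int width;
        int count;                         /* lines pushed so far */
    };

    /* Feed one input scanline; returns 1 and fills 'out' once enough
     * context is buffered, 0 while the window is still filling. */
    static int stage_push(struct line_stage *s, const uint16_t *in, uint16_t *out)
    {
        memcpy(s->lines[s->count % KSIZE], in, s->width * sizeof(*in));
        s->count++;
        if (s->count < KSIZE)
            return 0;

        const uint16_t *l0 = s->lines[(s->count - 3) % KSIZE];
        const uint16_t *l1 = s->lines[(s->count - 2) % KSIZE];
        const uint16_t *l2 = s->lines[(s->count - 1) % KSIZE];
        for (int x = 0; x < s->width; x++)
            out[x] = (l0[x] + l1[x] + l2[x]) / 3;
        return 1;
    }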
