Coping with complex cameras

By Jonathan Corbet
October 3, 2024

Cameras were never the simplest of devices for Linux to support; they have a wide range of operating parameters and can generate high rates of data. In recent years, though, they have become increasingly complex, stressing the ability of the kernel's media subsystem to manage them. At the 2024 Linux Plumbers Conference, developers from that subsystem and beyond gathered to discuss the state of affairs and how complex camera devices should be supported in the future.

The complex-camera summit

Ricardo Ribalda led the session, starting with a summary of the closed-door complex-camera summit that had been held the previous day. "Complex cameras" are the seemingly simple devices that are built into phones, notebooks, and other mobile devices. These devices do indeed collect image data as expected, but as part of that task they also perform an enormous amount of signal processing on the data returned from the sensor. That processing, which can include demosaicing, noise removal, sharpening, white-balance correction, image stabilization, autofocus control, contrast adjustment, high-dynamic-range processing, face recognition, and more, is performed in memory by a configurable pipeline of units collectively known as an image signal processor (ISP).

Much of this functionality is controlled via a feedback loop that passes through user space. The ISP must be provided with large amounts of data that controls the processing that is to be done; these parameters can add up to 500KB or so of data — for each frame that the processor handles. Naturally, the format of the control data tends to be proprietary and is different for each ISP.

Ribalda started by saying that the vendors of these devices would like to use the kernel's existing media interface (often called Video4Linux or V4L), but that subsystem was not designed for this type of memory-to-memory ISP. Multiple ioctl() calls are required for each frame, adding a lot of overhead. V4L does not provide fences for the management of asynchronous operations; attempts have been made to add fences in the past, but did not succeed. There are no abstractions for advanced scheduling of operations, and no support for modes where multiple buffers are sent to the ISP and it decides which ones are important. Multiple-camera support, needed for modern phones, is lacking in V4L, he said.

Vendors feel that the V4L API is simply too slow to evolve, so it is failing to keep up with modern camera devices. It is much faster, in the short term at least, for vendors to just bash out a device-specific driver for their hardware. But that leads to multiple APIs for the same functionality, creating fragmentation.

There has been talk of moving support for these devices over to the direct rendering manager (DRM) subsystem, alongside graphics processors, but the DRM developers are not camera experts. There is also interest in creating pass-through interfaces that let user space communicate directly with the device (a topic that had been extensively discussed at the Maintainers Summit a few days before), but allowing such an API requires trusting vendors that have not, in turn, trusted the kernel community enough to provide detailed information about how their devices work.

The conclusion from the complex-camera summit, Ribalda said, was that the V4L developers would like to see a complete list of technical features needed to support complex camera devices. They would then work to address those shortcomings. If they are unable to do so within a reasonable time, he said, they agreed to not block the incorporation of ISP drivers into the DRM subsystem, possibly with pass-through APIs, instead.

There are also non-technical challenges, Ribalda continued. It is not just that many aspects of these ISPs are undocumented; vendors claim that they cannot be documented. This area, it seems, is a minefield of patents and "special sauce"; vendors do not want to reveal how their hardware works. But the V4L subsystem currently requires documentation for any feature that users can control from user space. Currently, the policy means that undocumented features cannot be used; changing the policy, though, would likely result in all features becoming undocumented.

A solution proposed at the summit was to describe a "canonical ISP" with features that all of these devices are expected to have; those features would need to be fully documented and implemented in standard ways. Everything else provided by the ISP could be wired directly to user space via a pass-through interface. That would allow functionality to be made available without requiring documentation of everything the device does.

Of course, there are problems with this approach, he said. Pass-through interfaces raise security concerns. The kernel does not know what the device is doing and cannot, for example, prevent it from being told to overwrite unrelated data in the kernel. These devices require calibration for every combination of ISP, sensor, and lens; without full documentation, that calibration cannot be done. And, naturally, there will be disagreement over what a canonical ISP should do; vendors will push to implement the bare minimum possible.

An alternative is to ignore proprietary functionality entirely, and document only the basic functionality of the device. Vendors would then provide an out-of-tree driver to make the device actually work as intended. In this world, though, vendors are unlikely to bother with an upstream driver at all. The result will be hard for both distributors and users to manage.

What to do?

This introduction was followed by an extensive, passionate, wandering, and often loud discussion over the proper approach to take regarding complex cameras. The discussion was also long, far overflowing the allotted time. Rather than trying to capture the whole thing, what follows is an attempt to summarize the most significant points of view. Apologies to all participants whose contribution is not reflected here.

It was clear that the conclusions from the closed summit did not find a consensus in the room. Almost every aspect of them was questioned at one point or another. Nonetheless, they made a useful starting point for the discussion that followed.

V4L maintainer Hans Verkuil asserted that, in his experience, the device-specific functionality requested by vendors tends to not really be only found in one device; this functionality should be generalized in the interface, he said. The DRM layer requires Mesa support (in user space) for new drivers; that is the level where the API is standardized. V4L, he said, has lacked the equivalent of Mesa, so the kernel API is the standard interface.

Now, though, the libcamera library is taking over the role of providing the "real" interface to camera devices, which can change the situation. Vendor-specific support can be implemented there, hiding it from users. So perhaps the best solution is to require the existence of a libcamera interface for complex camera drivers. Meanwhile, the V4L interface will still be needed to control the sensor part of any processing chain; perhaps code could move to DRM for the ISP part, if V4L support will be too long in coming. Media subsystem maintainer Mauro Carvalho Chehab agreed that libcamera makes a DRM-like model more possible.

DRM subsystem maintainer Dave Airlie said that the existing architecture of the V4L subsystem is simply not suitable for modern camera devices. It is, he said, in the same position as DRM was 20 years ago, when that subsystem had to make a painful transition from programming device registers to exchanging commands and data with user space via ring buffers. If there is a libcamera API that can describe these devices in a general way, he said, then a kernel driver for a specific device should not be merged until libcamera support is present. Then, he said, the ecosystem will sort itself out over time.

He later added that he does not want camera drivers in the DRM subsystem, which was not designed for them; he is willing to accept them if need be, though. He had really expected V4L to have evolved some DRM-like capabilities by now; that is where the problem is.

The lesson from the DRM world, he said, is to just go ahead and build something, and the situation will improve over time. The vendors with better drivers will win in the market. He advised against writing specific rules for the acceptance of drivers, saying that vendors would always try to game them. Instead, each driver should be merged after a negotiation with the vendor, with the requirements being ratcheted up over time. That may mean allowing in some bad code initially, especially from vendors that cooperate early on, but the bar can be raised as the quality of the subsystem improves overall.

There was some discussion about how closely ISPs actually match the GPUs driven by the DRM subsystem. Sakari Ailus said that they are different; they are a pipeline of processing blocks that is configured by user space. There is only one real command: "process a frame". Libcamera developer Laurent Pinchart said that the current model for ISPs does not involve a ring buffer; instead, user space submits a lot of configuration data for each frame to be processed. Both seemed to think that the DRM approach might not work for this kind of device.

Daniel Stone said, though, that there are GPUs that operate with similar programming models; they do not all have ring buffers. He took strong exception to the claim that, if pass-through functionality is provided, vendors will have no incentive to upstream their drivers. Often, he said, it's not the vendors who do that work in the first place. The Arm GPU drivers were implemented by Collabora; Arm is only now beginning to help with that work. There were a lot of people who wanted open drivers, he said, and were willing to pay for the work to be done. So he agreed with Airlie that the market would sort things out over time. Manufacturers who participate in the process will do better in the end.

Several participants disagreed with the premise that vendors need to keep aspects of their hardware secret for patent or competition reasons. As is the case elsewhere in the technology industry, companies are well aware of exactly what their competitors are doing. Airlie added that all of these companies copy each other's work, aided by engineers who move freely between them.

There was also a fair amount of discussion on whether allowing pass-through drivers would facilitate a complete reverse engineering of the hardware. Pinchart said that the large number of parameters for each frame makes reverse engineering difficult. Stone replied that this argument implies either that ISPs are different from any other device, or that the engineers working with them are less capable than others; neither is true, he said. Airlie added that the same was said about virtual-reality headsets, but then "one guy figured it out" and free drivers were created. Pinchart said that regular documentation for these devices would not suffice to write a driver, again to general disagreement.

Goals

The developers in the room tried to at least coalesce on the goals they were trying to reach. The form of the desired API is not clear at this point, though Pinchart said that it should not force a lot of computation in the kernel. Airlie suggested pushing all applications toward the use of libcamera; eventually developers will prefer that over working with proprietary stacks. Pinchart was concerned that allowing pass-through functionality would roll back the gains that have been made with vendors so far. So some features, at least, have to be part of the standard API; the hard part, he said, is deciding which ones.

Pinchart said there seemed to be a rough agreement that vendors should be required to provide a certain level of functionality to have drivers merged, but he wondered how those requirements should be set. Airlie repeated that hard rules here would be gamed, and that there should be a negotiation with each vendor. If being friendly with one advances the situation, he said, "go for it". Pinchart worried that both vendors and the community might object. Airlie said that he just tells complaining vendors to go away, but that the early cooperative vendors, to a great extent, are the community.

Several developers said that the requirements could vary depending on the market each device serves. For a camera aimed at Chromebooks, open functionality sufficient for basic video conferencing may be sufficient. But for cameras to be used in other settings, the bar may be higher, with requirement for more functionality provided by an in-tree driver. Some developers suggested a minimum image-quality requirement, but that would be hard to enforce; properly measuring image quality requires a well-equipped (and expensive) laboratory.

As this multi-hour session ran over time, Ribalda made some attempts to distill a set of consensus conclusions, but he was not hugely successful. Nonetheless, this discussion would appear to have made some headway. There are other kernel subsystems that have had to solve this problem in the past, resulting in a lot of experience that can be drawn from. Support for complex camera devices in Linux seems likely to be messy and proprietary for some time but, with luck, it will slowly improve.

[ Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event. ]

Index entries for this article
Kernel	Device drivers/Video4Linux2
Conference	Linux Plumbers Conference/2024

Clarification on the usage of DRM vs. V4L2

Posted Oct 3, 2024 13:43 UTC (Thu) by laurent.pinchart (subscriber, #71290) [Link]

Thank you Jon for summarizing this long and passionate discussion, it wasn't an easy job. I would like to also thank Ricardo for co-organizing this micro-conference and handling the larger part of the logistics.

> Libcamera developer Laurent Pinchart said that the current model for ISPs does not involve a ring buffer; instead, user space submits a lot of configuration data for each frame to be processed. Both seemed to think that the DRM approach might not work for this kind of device.

Small clarification here: my opinion isn't that the DRM approach couldn't work for these devices (memory-to-memory ISPs), but that it doesn't bring much technical advantage compared to what can already be done with V4L2, or to what will be possible with V4L2 once current work in progress gets merged upstream.

This comment is only intended as a clarification of what I expressed (or tried to express) during the micro-conference, not as an attempt to debate this opinion on LWN.net. There are some technical limitations of V4L2 in its current form that we all agreed exist, but nobody has provided quantitative data to prove that (and how much) they affect the use cases we discussed.

The big picture isn't that gloomy

Posted Oct 3, 2024 13:56 UTC (Thu) by laurent.pinchart (subscriber, #71290) [Link] (5 responses)

> Support for complex camera devices in Linux seems likely to be messy and proprietary for some time but, with luck, it will slowly improve.

I'd like to bring a bit of a more positive spin to this conclusion. The Linux kernel and libcamera already have fully open-source support for multiple ISPs, most notably the Raspberry Pi 4 ISP, the VSI ISP8000 (a.k.a. rkisp1 for historical reasons) found in SoCs from Rockchip, NXP and other vendors, and the Intel IPU3 ISP (found in Sky Lake and Kaby Lake SoCs). Support for the Raspberry Pi 5 ISP and the Arm Mali C55 ISP is developed in the open and close to getting merged upstream in both the kernel and libcamera. The list is constantly growing.

This being said, it's not all rainbows and unicorns either, support for some important platforms is missing today.

The big picture isn't that gloomy

Posted Oct 3, 2024 20:07 UTC (Thu) by ribalda (subscriber, #58945) [Link]

I also want to thank Laurent for all his work co-organizing this MicroConfonference.

In the last years, Laurent has done an amazing job with vendors such as Raspberry PI or the Arm Mali. They are the golden standard for what an open camera stack should look like.

But that work is difficult to map into the hardware that runs *most* of the consumer electronics today. We do not support the cameras in most (maybe all) of the phones and the only way to use the current intel hardware is to software emulate the ISP, with limited capabilities and very poor performance.

The vendors that want to collaborate with us say that *for ISPs* they do not need any of the abstractions provided by V4L2 (formats, controls, media controller) and that the current V4L2 openness model is not compatible with their business model. The very same vendors are delivering open graphic stacks... so it is not fair to say that the lack of support is all their fault.

There are some positive points from the MC:
- It is the first time that 4 vendors attended an open conference.
- We are talking about relaxing the openness requirements of v4l2 in favor of our users
- We have started to look into what other subsystems are doing in terms of building an ecosystem

If we manage to include the vendors into our community, support for new ISPs will keep flowing (instead of being heroic achievements), and users will soon enjoy their cameras in their open OSs.

IPU4 and IPU6 support?

Posted Oct 8, 2024 3:07 UTC (Tue) by DemiMarie (subscriber, #164188) [Link] (1 responses)

Will IPU4 and the PSYS part of IPU6 be supported? That's what I'm most interested in today, as it would allow shipping high quality cameras on Linux laptops.

IPU4 and IPU6 support?

Posted Oct 8, 2024 14:38 UTC (Tue) by laurent.pinchart (subscriber, #71290) [Link]

Those are questions for Intel. My understanding is they have no plan to provide IPU4 support. For the IPU6 PSYS, there's a global consensus that we all want it to be supported upstream, but no agreement yet on how to get there.

Tuning for image quality?

Posted Oct 8, 2024 16:22 UTC (Tue) by DemiMarie (subscriber, #164188) [Link] (1 responses)

Are there any plans to perform the per-device tuning needed for optimal image quality, or will the image quality always be worse than with the OEM OS?

Tuning for image quality?

Posted Oct 8, 2024 18:35 UTC (Tue) by laurent.pinchart (subscriber, #71290) [Link]

libcamera has a tuning tool. It hasn't reached feature and quality parity with closed-source implementations yet, but we're actively working on improving it. The image quality doesn't depend only on tuning, but also on the implementation of the ISP control algorithms. This is also an area that we are actively working on.

The complexity has merely moved

Posted Oct 3, 2024 16:04 UTC (Thu) by atnot (subscriber, #124910) [Link] (9 responses)

A bit of context, as the framing of cameras becoming more "complex" is kind of misleading I think.

What has happened instead is that while in the past, the camera module contained a CPU and ISP with firmware that did all the processing, with smartphones this has increasingly been integrated to the main SoC instead to save cost, power, and increase flexibility. So instead of the camera module sending a finished image over a (relatively) low-speed bus, image sensors are now directly attached to the CPU and deliver raw sensor data to an on-board ISP. This means that what used to be the camera firmware needs to be moved to the main CPU too. This is not a problem under Android since they can just ship that "firmware" as a blob on the vendor partition and use standard system APIs to get a completed image. But for v4l2, which is fundamentally built around directly passing images through from the device to applications, that's a problem.

So it's not that cameras have gotten more complex. They're still doing exactly the same thing. It's just that where that code runs has changed and existing interfaces are not equipped to deal with that.

The complexity has merely moved

Posted Oct 3, 2024 16:18 UTC (Thu) by laurent.pinchart (subscriber, #71290) [Link]

That's a good summary, thank you. It became more complex only from a Linux kernel and userspace point of view.

V4L2 was originally modelled as a high-level API meant to be used directly by applications, with an abstraction level designed for TV capture cards and webcams, where all the processing was handled internally by hardware and firmware. It has evolved with the introduction of the Media Controller and V4L2 subdev APIs to support a much lower level of abstraction. These evolutions were upstreamed nearly 15 years ago (time flies) by Nokia. Unfortunately, due to Nokia's demise, the userspace part of the framework never saw the light of day. Fast forward to 2018, the libcamera project was announced to fill that wide gap and be the "Mesa of cameras". We now have places for all of this complex code to live, but the amount of work to support a new platform is significantly larger than it used to be.

The complexity has merely moved

Posted Oct 3, 2024 16:34 UTC (Thu) by laurent.pinchart (subscriber, #71290) [Link] (5 responses)

> So it's not that cameras have gotten more complex. They're still doing exactly the same thing. It's just that where that code runs has changed and existing interfaces are not equipped to deal with that.

I think it's a bit more complicated than this. With the processing moving to the main SoC, it became possible for the main OS to have more control over that processing. I think this has led to more complex processing being implemented, that would have been more difficult (or more costly) to do if everything had remained on the camera module side. Advanced HDR processing is one such feature for instance, and NPU-assisted algorithms such as automatic white balance is another example. Some of this would probably still have happened without the processing shifting to the main SoC, but probably at a different pace and possibly in a different direction.

In any case, the situation we're facing is that cameras now look much more complex from a Linux point of view, due to a combination of processing moving from the camera module to the main SoC, and processing itself getting more complex (regardless of whether or not the latter is a partial consequence of the former).

The complexity has merely moved

Posted Oct 3, 2024 17:37 UTC (Thu) by intelfx (subscriber, #130118) [Link] (4 responses)

> due to a combination of processing moving from the camera module to the main SoC, and processing itself getting more complex (regardless of whether or not the latter is a partial consequence of the former).

Forgive me if I'm wrong, but I always thought it was mostly the other way around, no? I. e. due to R&D advances and market pressure the image processing started to get increasingly more complex (enter computational photography), and at some point during that process it became apparent that it's just all around better to get rid of the separate processing elements in the camera unit and push things into the SoC instead.

Thus, a few years later, this trend is finally trickling down to general-purpose computers and thus Linux proper.

Or am I misunderstanding how things happened?

The complexity has merely moved

Posted Oct 3, 2024 18:49 UTC (Thu) by laurent.pinchart (subscriber, #71290) [Link] (1 responses)

I think it goes both ways. I can't tell exactly when this transition started and what was the trigger, but once processing moved to the main SoC, it opened the door to more possibilities that people may not have thought of otherwise. In turn, that probably justified the transition, accelerating the move.

There are financial arguments too, in theory, at least if development costs and the economical impact on the users are ignored, this architecture is supposed to be cheaper. A more detailed historical study would be interesting. Of course it won't change the situation we're facing today.

The complexity has merely moved

Posted Oct 4, 2024 19:00 UTC (Fri) by Wol (subscriber, #4433) [Link]

> I think it goes both ways. I can't tell exactly when this transition started and what was the trigger, but once processing moved to the main SoC, it opened the door to more possibilities that people may not have thought of otherwise. In turn, that probably justified the transition, accelerating the move.

This seems to be pretty common across a lot of technology. It starts with general purpose hardware, moves to special-purpose hardware because it's faster/cheaper, the special purpose hardware becomes obsolete / ossified, it moves back to general purpose hardware, rinse and repeat. Just look at routers or modems ...

Cheers,
Wol

The complexity has merely moved

Posted Oct 3, 2024 19:55 UTC (Thu) by excors (subscriber, #95769) [Link] (1 responses)

From what I vaguely remember, back when Android only supported Camera HAL1 (which used the old-fashioned webcam model: the application calls setParameters(), startPreview(), takePicture(), and eventually gets the JPEG in a callback and can start the process again), there was already a trend to put the ISP on the main SoC. It's almost always cheaper to have fewer chips on your PCB, plus you can use the same ISP hardware for both front and rear cameras (just reconfigure the firmware when switching between them), and you can use the same RAM for the ISP and for games (since you're not going to run both at the same time), etc, so there are significant cost and power-efficiency benefits.

The early SoC ISPs were quite slow and/or bad so they'd more likely be used for the front camera, while the higher-quality rear camera might have a discrete ISP, but eventually the SoCs got good enough to replace discrete ISPs on at least the lower-end phones. (And when they did have a discrete ISP it would probably be independent of the camera sensor, because the phone vendor wants the flexibility to pick the best sensor and the best ISP for their requirements, not have them tied together into a single module by the camera vendor.)

Meanwhile phone vendors added proprietary extensions to HAL1 to support more sophisticated camera features (burst shot, zero shutter lag, HDR, etc), because cameras were a big selling point and a great way to differentiate from competitors. They could implement that in their app/HAL/drivers/firmware however they wanted (often quite hackily), whether it was an integrated or discrete ISP.

Then Google developed Camera HAL2/HAL3 (based around a pipeline of synchronised frame requests/results) to support those features properly, putting much more control on the application side, standardising a lower-level interface to the hardware. I'm guessing that was harder to implement on some discrete ISPs that weren't flexible enough, whereas integrated ones typically depended more heavily on the CPU so they were already more flexible. It also reduced the demand for fancy new features on ISP hardware since the application could now take responsibility for them (using CPU/GPU/NPU to manipulate the images efficiently enough, and using new algorithms without the many-year lag it takes to implement a new feature in dedicated hardware).

It was a slow transition though - Android still supported HAL1 in new devices for many years after that, because phone vendors kept using cameras that wouldn't or couldn't support HAL3.

So, I don't think there's a straightforward casuality or a linear progression; it's a few different things happening in parallel, spread out inconsistently across more than a decade, applying pressure towards the current architecture. And (from my perspective) that was all driven by Android, and now Linux is having to catch up so it can run on the same hardware.

The complexity has merely moved

Posted Oct 3, 2024 20:11 UTC (Thu) by laurent.pinchart (subscriber, #71290) [Link]

> And (from my perspective) that was all driven by Android, and now Linux is having to catch up so it can run on the same hardware.

It started before Android as far as I know. Back in the Nokia days, the FrakenCamera project (https://graphics.stanford.edu/papers/fcam/) developed an implementation of a computational photography on a Nokia N900 phone (running Linux). A Finnish student at Stanford university participated in the project and later joined Google, where he participated in the design of the HAL3 API based on the FCam design.

The complexity has merely moved

Posted Oct 4, 2024 10:27 UTC (Fri) by mchehab (subscriber, #41156) [Link] (1 responses)

Not entirely true, as an ISP-based processing allow more complex processing pipelines with things like face recognition, more complex algorithms and extra steps to enhance image's quality.

During the libcamera discussions, we referred to the entire set as simply "V4L2 API", but there are actually three different APIs used to control complex camera hardware: V4L2 "standard" API, media controller and sub-devices API. Currently, on complex cameras:

1. input and capture (output) devices are controlled via V4L2 API (enabled via config VIDEO_DEV);
2. the pipeline is controlled via the media controller API (enabled via config MEDIA_CONTROLLER);
3. each element of the pipeline is individually controlled via the V4L2 sub-device API (enabled via config VIDEO_V4L2_SUBDEV_API).

Following the discussions, it seems that we could benefit of having need new ioctl(s) for (1) to simplify the number of ioctl calls for memory-to-memory sub-devices, to simplify ISP processing, perhaps as a part of sub-device API.

For (3), we may need to add something to pass calibration data.

Yet, the final node of the pipeline (the capture device) is the same, and can be completely mapped using the current V4L2 API: a video stream, usually compressed with a codec (mpeg, h-264, ...) with a known video resolution, frame rate and a fourcc identifying the output video format (bayer, yuv, mpeg, h-264, ...).

Most of vendor-specific "magic" happens at the intermediate nodes inside the pipeline. Typically, modern cameras produce two different outputs: a video stream and a metadata stream. The metadata is used by vendor-specific 3A algorithms (auto focus, auto exposure and auto whitebalance), among others. The userspace component (libcamera) need to use such metadata to produce a set of changes to be applied to the next frames by the ISP. They also use a set of vendor-specific settings that are related to the hardware attached to the ISP, including camera sensor and lens. Those are calibrated by the hardware vendor.

The main focus of the complex camera discussions is around those intermediate nodes.

As I said during libcamera discussions, from my perspective as the Media subsystem maintainer, I don't care how the calibration data was generated. This is something that IMO we can't contribute much, as it would require an specialized lab
to test the ISP+sensor+lens with different light conditions and different environments (indoor, outdoor, different focus settings, etc.). I do care, however, to now allow executing binary blobs sent from userspace at the Kernel.

By the way, even cameras that have their own CPUs and use just V4L2 API without the media controller have calibration data. Those are typically used during device initialization as a series of register values inside driver tables. While we want to know what each register contains (so we strongly prefer to have those registers mapped with #define macros), it is not mandatory to have all of them documented.

In the past, our efforts were to ensure that the Kernel drivers is fully open sourced. Now that we have libcamera, the requirement is that the driver (userspace+kernel) to be open sourced. The Kernel doesn't need to know how configuration data passed from userspace was calculated, provided that such calculus is part of libcamera.

The complexity has merely moved

Posted Oct 8, 2024 18:31 UTC (Tue) by laurent.pinchart (subscriber, #71290) [Link]

> Not entirely true, as an ISP-based processing allow more complex processing pipelines with things like face recognition, more complex algorithms and extra steps to enhance image's quality.

atnot's point is that ISP were there already, just hidden by the webcam firmware, and that's largely true. We have USB webcams today that handle face detection internally and enhance the image quality in lots of way (we also have lots of cheap webcams with horrible image quality of course).

> For (3), we may need to add something to pass calibration data.

It's unfortunately more complicated than that (I stopped counting the number of times I've said this). "Calibration" or "tuning" data is generally not something that is passed directly to drivers, but needs to be processed by userspace based on dynamically changing conditions. For instance, the lens shading compensation data, when expressed as a table, needs to be resampled based on the camera sensor field of view. Lots of tuning data never makes it to the device but only influences userspace algorithms.

> Yet, the final node of the pipeline (the capture device) is the same, and can be completely mapped using the current V4L2 API: a video stream, usually compressed with a codec (mpeg, h-264, ...) with a known video resolution, frame rate and a fourcc identifying the output video format (bayer, yuv, mpeg, h-264, ...).

There is usually no video encoder in the camera pipeline when using an ISP in the main SoC. Encoding is performed by a separate codec.

> By the way, even cameras that have their own CPUs and use just V4L2 API without the media controller have calibration data. Those are typically used during device initialization as a series of register values inside driver tables. While we want to know what each register contains (so we strongly prefer to have those registers mapped with #define macros), it is not mandatory to have all of them documented.

Those are usually not calibration or tuning data, as they are not specific to particular camera instances.

Khronos Kamaros camera API

Posted Oct 3, 2024 16:54 UTC (Thu) by joib (subscriber, #8541) [Link] (1 responses)

What about the Kamaros work by Khronos? How does that fit into the FOSS camera stack picture? (I know very little about camera programming, so feel free to ELI5)

Khronos Kamaros camera API

Posted Oct 3, 2024 17:01 UTC (Thu) by laurent.pinchart (subscriber, #71290) [Link]

Kamaros is an ongoing effort of the Khronos group to standardize a cross-platform API for cameras. It is currently foreseen to expose a level of abstraction similar to the libcamera API, or to the Android Camera HAL3 API. If and when a first version of Kamaros is published, I believe we will implement it as a layer on top of libcamera, the same way that libcamera has an Android Camera HAL3 adaptation layer. Drawing a parallel to the graphics world, libcamera with a Kamaros adaptation layer would be equivalent to Mesa implementing the Vulkan API. Kamaros will be implemented once in the adaptation layer, and work with all platforms supported by libcamera.

All of this is of course more on the realm of speculation than commitment, as there is no public Kamaros API yet.

What is a camera?

Posted Oct 3, 2024 17:32 UTC (Thu) by SLi (subscriber, #53131) [Link] (9 responses)

As someone not familiar with V4L: What does the system assume a camera is? Is it just a device that periodically sends an opaque buffer, and that can be configured?

I'm trying to think of weirder cameras that might require different approaches, but fundamentally I guess everything even remotely camera-like (depth cameras, IR temperature cameras, ...) can be represented as a buffer of WxHxC pixels where W and H tend to be large and C much smaller, as long as you don't assume too much about C.

Some of this gets maybe blurred a bit by things like light field cameras, but in the end you still have a fairly normal camera sensor, only what it gets to measure is more of a Fourier transformed scene.

Are there any weird devices reasonably called cameras that V4L currently can't and foreseeable won't work with?

What is a camera?

Posted Oct 3, 2024 19:28 UTC (Thu) by laurent.pinchart (subscriber, #71290) [Link] (8 responses)

> As someone not familiar with V4L: What does the system assume a camera is? Is it just a device that periodically sends an opaque buffer, and that can be configured?

V4L2 has two levels of APIs for camera. The "traditional" (I sometimes call it "legacy", due to my bias towards embedded systems and direct control of ISPs) API is high-level, and maps to what you can expect today from a USC webcam. The camera is a device that periodically produces images stored in buffers, in a variety of formats. It also exposes controls, to configure contrast, brightness, focus, zoom, or more device-specific parameters. V4L2 defines many pixel formats, some of them used for depth maps or temperature maps.

The lower-level API exposes all building blocks of the camera processing pipeline (camera sensor, MIPI CSI-2 receiver, ISP, ...) and lets userspace control all of them individually. The kernel drivers are mostly responsible for ensuring nothing blows up, by verifying for instance that buffers are big enough to store images in the configured capture format. The controls are also lower-level, and more numerous. ISPs typically have from tens of kBs to MBs of parameters (to be precise, a large part of that parameters space is made of 1D and 2D tables, so we're not dealing with millions of individually computed independent values). They also produce, in addition to images, statistics (histograms for instance).

In this lower-level model, userspace is responsible for computing, in real time, sensor and ISP parameters for the next frame using statistics from the previous frame. This control loop, usually referred to as the ISP algorithms control loop, is totally out of scope of the kernel drivers. It is the component that I consider legitimate for vendors not to disclose. In libcamera we implement an open-source ISP algorithm control loop module (named Image Processing Algorithms module, or IPA) for each supported platform, and also allow vendors to ship their closed-source IPA modules. This is similar to Mesa shipping an open-source 3D stack, without preventing vendors from shipping their competing closed-source stack.

> Are there any weird devices reasonably called cameras that V4L currently can't and foreseeable won't work with?

Yes and no. In my opinion, there are no devices that can reasonably be called a camera that could easily be supported in DRM and couldn't be supported by V4L2 with a reasonable development effort. There are however cameras that V4L2 can't support and would have a hard time managing. I'm thinking about very high frame rate cameras for instance (10kfps and higher), or cameras that require CPU action for every line of the image. This isn't so much a V4L2 limitation than a Linux limitation, handling an interrupt every 10µs with real-time performance requires a real-time operating system.

V4L2 has known limitations today. It lacks multi-context support (the ability for multiple independent clients to time-multiplex a memory-to-memory ISP), an atomic API (the ability to queue processing for a frame with a single ioctl call), fences, or sub-frame latency. Some of these issues are being addressed (there's a patch series for multi-context support), will be addressed (the building blocks for the atomic API have slowly been upstreamed over the past few years) or could be addressed (patches have been posted for fences, but didn't get merged due to a lack of real-life performance gain ; sub-frame latency currently needs a champion to build a use case and make a proposal).

What is a camera?

Posted Oct 3, 2024 21:35 UTC (Thu) by ribalda (subscriber, #58945) [Link] (7 responses)

> As someone not familiar with V4L: What does the system assume a camera is? Is it just a device that periodically sends an opaque buffer, and that can be configured?

Laurent has made a very good description. A simpler description is that today cameras usually contain two parts:
- The online capture part: Produces periodically frames and can be configured.
- The offline capture part aka as ISP: that takes N frames and M configuration buffers from memory and produces X frames and Y statistics buffers. The number of buffers and the size of them is hardware specific.

> Are there any weird devices reasonably called cameras that V4L currently can't and foreseeable won't work with?

In my opinion, for ISPs, the issue with v4l2 is that we do not have a good API to: "send N buffers and receive N buffers" efficiently. Instead we have to deal with controls, entities, pads, formats...

Some of those abstractions were implemented to standardize the capture process. Other abstractions were implemented to avoid sending random data to the ISP from userspace.

We can probably "fix" v4l2, and Laurent has given a good list of things that we are missing. But I am more aligned with the opinion of Dave Airlie that v4l2 is not suitable for ISPs (today). We need a change in the abstraction level and we need more vendors in our community.

All that said. v4l is a beautiful piece of engineering for the "capture part".

What is a camera?

Posted Oct 4, 2024 10:43 UTC (Fri) by mchehab (subscriber, #41156) [Link]

> But I am more aligned with the opinion of Dave Airlie that v4l2 is not suitable for ISPs (today). We need a change in the abstraction level and we need more vendors in our community.

Internally, V4L2 core code is generic and good enough to support ISPs. IMO, what it is needed is new IOCTL(s) - maybe at sub-device level - which would avoid the need of sending multiple ioctls per frame, with fences and dmabuf support. From internal code's perspective, just like we currently have videobuf2-v4l2.c and videobuf2-dvb.c as the top layer for per-API buffer handling, we may need a videobuf2-codec.c layer on the top of VB2 to handle the needs for ISP using such new IOCTL(s).

What is a camera?

Posted Oct 4, 2024 15:49 UTC (Fri) by Sesse (subscriber, #53779) [Link] (5 responses)

> All that said. v4l is a beautiful piece of engineering for the "capture part".

Not sure if I agree; as a userspace programmer, I just gave up supporting V4L2 input at some point because the API was so painful and bare-bones. Every single camera under the sun seems to support a slightly different set of formats and settings, and the onus is on you as an application to figure out which ones to support (e.g. you'll frequently need to embed a JPEG decoder!). At some point, I wanted to support _output_ via v4l2loopback, but that means you'll need to go through the format dance again; browsers and other clients will accept only a set of formats and nothing will try to convert for you. Eventually I went to the point of looking at the Chromium source and picking the format it accepted that was the least pain for me to create. :-) Thankfully, I only needed 720p60, so I didn't have to worry about fast DMA between my GPU and V4L2 buffers.

What is a camera?

Posted Oct 8, 2024 14:44 UTC (Tue) by laurent.pinchart (subscriber, #71290) [Link] (4 responses)

> > All that said. v4l is a beautiful piece of engineering for the "capture part".
>
> Not sure if I agree;

I concur. We've made lots of mistakes in V4L2 over the years, and I'm responsible for some of them. That's not specific to V4L2 though, the whole kernel is developed by making mistakes and then trying to fix them. The important part is to constantly improve the APIs and implementation. And make new mistakes along the way to replace the old ones :)

This being said, I think that V4L2 is in a much better place than it was 10 years ago for cameras and ISPs. Don't use the API directly in your applications though. While V4L2 was designed as an application API, the world has moved on and we now need and increasingly have userspace frameworks to handle the hardware complexity.

What is a camera?

Posted Oct 8, 2024 15:49 UTC (Tue) by Sesse (subscriber, #53779) [Link] (3 responses)

Do you have any good recommendations—is it libcamera that one would want to use? (I'd rather not go to the level of FFmpeg or Gstreamer if I can avoid it; FFmpeg is a great codec library but pretty weak on anything involving live, and Gstreamer is just a world of pain)

What is a camera?

Posted Oct 8, 2024 18:34 UTC (Tue) by laurent.pinchart (subscriber, #71290) [Link] (2 responses)

> Do you have any good recommendations—is it libcamera that one would want to use? (I'd rather not go to the level of FFmpeg or Gstreamer if I can avoid it; FFmpeg is a great codec library but pretty weak on anything involving live, and Gstreamer is just a world of pain)

It depends on your use cases. For desktop applications, the future is Pipewire, which itself will interface with libcamera (or for the time being directly with V4L2 for USB webcams). For more specific applications, especially in embedded and IoT use cases, I recommend GStreamer with the libcamerasrc element. On Android one would of course use the libcamera Android adaptation layer that implements the camera HAL3 API. Direct usage of the libcamera API isn't something I expect to see very commonly for generic-purpose applications.

What is a camera?

Posted Oct 8, 2024 18:45 UTC (Tue) by Sesse (subscriber, #53779) [Link] (1 responses)

FWIW, my application is low-level enough to already talk to ALSA directly (both for PCM and for MIDI) and have its own userspace USB drivers for certain capture cards, so it sounds like libcamera directly would be the most likely avenue to try first. :-)

What is a camera?

Posted Oct 8, 2024 21:16 UTC (Tue) by laurent.pinchart (subscriber, #71290) [Link]

If you're dealing with systems that require controlling an ISP, libcamera would then be a good way forward. If you're dealing with UVC webcams, libcamera could still help but is probably overkill as it won't give you the features you mentioned were missing (such as transcoding between formats or decoding JPEG). If you're dealing with IP cameras, libcamera will definitely not help as we don't support those.

Where to move secret sauce

Posted Oct 3, 2024 17:41 UTC (Thu) by jengelh (guest, #33263) [Link] (4 responses)

>vendors claim that they cannot be documented. This area, it seems, is a minefield of patents and "special sauce";

Just move it into firmware. If GPUs can do it, so can those camera vendors.

Where to move secret sauce

Posted Oct 3, 2024 18:53 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

GPUs already have a lot of computational power, so they can spare a bit of it to run the firmware. Cameras do not, so they want to use the main CPU (with maybe some hardware accelerators) to run the image processing algorithms.

Where to move secret sauce

Posted Oct 4, 2024 6:56 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

> GPUs already have a lot of computational power, so they can spare a bit of it to run the firmware. Cameras do not...

This looks like comparing Apples with Oranges. Conceptually, cameras are more similar to monitors and GPUs to ISPs (which do of course run firmware)

Where to move secret sauce

Posted Oct 4, 2024 19:38 UTC (Fri) by laurent.pinchart (subscriber, #71290) [Link]

>This looks like comparing Apples with Oranges. Conceptually, cameras are more similar to monitors and GPUs to ISPs

In this context, a "camera" is usually a device made of at least an imaging sensor (the chip with the glass with shiny colours sitting under the lens) and an ISP (a hardware components that performs image processing tasks that would be too expensive for the CPU). When the imaging sensor and the ISP are in different chips, as is the case for the "complex cameras" we're dealing with, there are also different types of physical transmitters and receivers between the two components (MIPI CSI-2 is often involved). Pre- or post-processing steps can also be implemented in DSPs, NPUs, GPUs and/or CPUs as part of the camera pipeline. There needs to be a software component in the system with a global view of the whole pipeline and all the elements it contains, in order to configure them and run real time control loops.

The imaging sensor can contain a firmware, but that's largely out of scope here. It only deals with the internal operation of the sensor (sequencing exposure or read-out of lines for instance), and not with image processing by the ISP. GPUs, DSPs and NPUs, if used in the camera pipeline, can also include firmwares, but that's not relevant either from the points of view of the ISP or the top-level control loops.

> (which do of course run firmware)

Many ISPs don't. They are often fixed-function pipelines with a large number of parameters, but without any part able to execute an instruction set. Some ISPs are made of lower-level hardware blocks that need to be scheduled at high frequency and with very low latency, or contain a vector processor that executes an ISA. In those cases, the ISP usually contains a small MCUs that runs a low-level firmware. When the ISP is designed to be integrated in a large SoC, those firmwares often have very limited amount of memory and no access to the imaging sensor. For these reasons they are mostly designed to expose the ISP as a fixed-function pipeline to the OS.

When the ISP is a standalone chip, sitting between the imaging sensor and the main SoC, the MCU integrated with the ISP is usually a bit more powerful and will run the camera control algorithms, taking full control over the imaging sensor. The ISP chip then exposes a higher-level interface to the main SoC, similar to what an imaging sensor with an integrated ISP would expose.

Other firmwares can also be involved. Large SoCs often include cores meant to run firmwares, and those can usually interact with the entire camera (imaging sensor and ISP). Some vendors implement full camera control in such firmwares, exposing a higher-level interface similar to a webcam. There's a big downside in doing so, as adding support for a different imaging sensor, or even tuning the camera for a different lens, requires modifying that firmware. I believe this is done for instance by the Apple M1, as they have full control of the platform. More directly relevant for Linux, this kind of architecture is also seen in automotive environments where the camera is controlled by a real-time OS, and Linux then accesses some of the camera streams with a higher level of abstraction.

Where to move secret sauce

Posted Oct 4, 2024 15:21 UTC (Fri) by neggles (subscriber, #153254) [Link]

The command processors running those firmware blob are actually their own dedicated processor in modern desktop GPUs - nVidia's GSP, for example, is a fairly high-clock RISC-V core (as of appx. Ampere generation)