
Scheduling for the Android display pipeline

January 16, 2020

This article was contributed by Alessio Balsini

Android users make heavy use of the displays on their devices for almost all of their interaction; good display performance is thus critical for a satisfactory user experience. Achieving that performance is not always easy; there are a lot of pieces that need to work together, and the kernel does not always support this collaboration as well as one might like. The Android team is currently considering a number of combinations of existing kernel features and possible enhancements in its efforts to provide the best display experience possible.

The display is managed by the Android display pipeline, which is a complex system where different tasks and hardware accelerators collaborate in the execution of the application and the update of the graphical content presented to the user through the screen. The display pipeline is responsible for generating the display output, so its performance directly affects one of the most important channels of interaction between the user and the device.

Providing stable frame rates, with no skipped frames, is a priority for the display pipeline, as is the low latency demanded by the growing mobile-gaming industry. Moreover, Android finds its leading use in mobile devices, where limited energy and thermal-dissipation resources impose further strict requirements on the system; these can be generalized as minimizing power consumption. These demands conflict with one another; meeting them requires careful tuning and workload distribution among the entities that take part in this process, as well as wise use of the features provided by the Linux kernel's scheduler and CPU-frequency governor. The choices made here are directly responsible for the overall performance of the system.

A short journey through the Android display pipeline

The Android display pipeline is a mix of software and hardware components cooperating not only in the production of the display output but also in the advancement of the application's execution. The software components taking part in the display pipeline are split between the application itself and the Android framework; these pieces exchange data through ad-hoc, zero-copy mechanisms. This section unrolls the Android display pipeline loop and follows the dependency chain among the components, providing a brief description of each of them.

Everything starts with the display controller which, besides taking care of the display buffer and the display configuration, is also responsible for the management of all the synchronization signals between the display and the rest of the system. When the display is ready to accept new data to show (a "frame"), the display controller generates a VSYNC signal, which represents the starting trigger for the whole display pipeline. This signal occurs at the refresh rate of the display, and is thus periodic: a display with a 60Hz refresh rate generates a VSYNC every 16.6667ms. The name VSYNC was chosen in honor of the vertical-synchronization signal that paced cathode-ray tube (CRT) monitors.

The propagation of this signal is optimized by making the DispSync thread of SurfaceFlinger directly responsible for distributing periodic VSYNC signals to the other components of the display pipeline. Depending on the system implementation, these signals may be periodically re-aligned with the hardware VSYNC generated by the display controller. DispSync broadcasts VSYNC to both the application and the main SurfaceFlinger thread, each at a different, predefined offset; the reason for these offsets will be explained later.

When the user-interface thread of the user application receives the VSYNC signal from DispSync, it awakens from an epoll() sleep and performs the following operations (a sketch of such a wakeup loop follows the list):

  • Processes input events.
  • Executes the Animation callback(s) defined by the application developer.
  • Traverses the View tree to lay out the UI and create a tree of drawing commands called a RenderNode tree.
  • Passes the updated RenderNode tree to another application thread called RenderThread.
  • Performs some other operations, like cleanup and monitoring, and goes back to sleep in epoll() waiting for the next VSYNC.
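
As an illustration of this wakeup mechanism, below is a minimal C sketch of an epoll()-driven loop. It uses a timerfd as a stand-in for the DispSync VSYNC source, which is an assumption made for the example; a real UI thread receives VSYNC through the Choreographer machinery rather than from a raw timer.

    /*
     * Minimal sketch of a VSYNC-driven event loop. A timerfd stands in
     * for the DispSync VSYNC source (an assumption for illustration).
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <sys/timerfd.h>
    #include <unistd.h>

    int main(void)
    {
        int epfd = epoll_create1(0);
        int vsync_fd = timerfd_create(CLOCK_MONOTONIC, 0);

        /* Fire every 16.667ms, mimicking a 60Hz VSYNC. */
        struct itimerspec period = {
            .it_interval = { .tv_sec = 0, .tv_nsec = 16666667 },
            .it_value    = { .tv_sec = 0, .tv_nsec = 16666667 },
        };
        timerfd_settime(vsync_fd, 0, &period, NULL);

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = vsync_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, vsync_fd, &ev);

        for (;;) {
            struct epoll_event out;

            /* Sleep in epoll_wait() until the next VSYNC arrives. */
            if (epoll_wait(epfd, &out, 1, -1) < 1)
                continue;

            uint64_t expirations;
            read(vsync_fd, &expirations, sizeof(expirations));

            /* Here the UI thread would process input, run the animation
               callbacks, and hand a RenderNode tree to RenderThread. */
            printf("VSYNC: %llu period(s) elapsed\n",
                   (unsigned long long)expirations);
        }
    }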

When RenderThread awakes after receiving the RenderNode tree, it:

  • Fetches the next output buffer from a BufferQueue shared with SurfaceFlinger and waits on the associated release fence in case the buffer is not yet available.
  • Optimizes the list of draw commands (e.g. by removing operations affecting hidden objects).
  • Translates the list into GPU commands.
  • Asks the GPU to perform the rendering.
  • Enqueues the output buffer to the BufferQueue shared with SurfaceFlinger. The output buffer is enqueued without waiting for the GPU to complete; it includes a hardware fence through which the GPU notifies SurfaceFlinger that the rendering of the rasterized frame has been completed.
  • Performs other closing operations and goes back to sleep waiting for the next request from the UI thread.

As mentioned earlier, memory allocation and display-related data passing are zero-copy operations performed through an ad-hoc data structure called BufferQueue. The synchronization between user space and kernel space — to notify user space that the GPU has finished generating the rasterized frame, for example — is achieved with fences implemented in the kernel.
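
To illustrate, waiting on such a fence from user space amounts to a poll() on the fence's file descriptor: the kernel's sync_file framework reports the descriptor as readable once the fence has signaled. The sketch below assumes fence_fd was obtained from the graphics driver.

    #include <poll.h>

    /*
     * Sketch: block until an explicit-sync fence signals. fence_fd is
     * assumed to be a sync_file descriptor handed out by the graphics
     * driver; a timeout of -1 waits forever.
     */
    static int wait_fence(int fence_fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };

        /* poll() returns 1 once the fence has signaled. */
        return poll(&pfd, 1, timeout_ms);
    }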

The main thread of SurfaceFlinger is also awakened by DispSync, with a delayed VSYNC signal. SurfaceFlinger is responsible for gluing together (or composing) the rasterized frames coming from different sources, which in most cases are:

  • the application that is currently shown on the screen,
  • the navigation bar, which shows the buttons placed at the bottom of the screen on devices with no physical buttons, and
  • the status bar (the bar on the top), which shows the time, the remaining battery, the notification icons, etc.

The composition time is reduced by a hardware 2D compositor that relieves the GPU of this duty, leaving the GPU free to be accessed by applications for rendering; the compositor is also more efficient and faster than the GPU at this operation. After the composition, the final frame is ready to be sent to the display.

See the figure below for a simplified (!) overview of how all of these pieces fit together:

[The Android display pipeline]

What is the benefit of this complexity?

The advantage of splitting the frame production workflow among multiple entities organized in a pipeline is to improve parallelism and reduce bottlenecks. In a high-workload scenario, when the display is showing frame N, SurfaceFlinger is already composing frame N+1; meanwhile, within the application, RenderThread is preparing frame N+2 and the UI thread is already working on frame N+3.

In this scenario, the application requires up to three display periods to get a frame to the display, which is an unfortunate but acceptable situation. What usually happens, though, is that the UI thread and RenderThread of the application complete in a single period, so the total latency of the pipeline becomes just two periods. There is also the best-case scenario, where the UI thread and RenderThread are so short that both the application and SurfaceFlinger can fit into a single period, thus reducing the latency to less than a single frame. This scenario can only happen if SurfaceFlinger starts right after the app generates the rasterized frame and returns the composited frame before the upcoming VSYNC. A couple of these scenarios are illustrated below:

[The Android display pipeline]

The best-case scenario is made possible by the VSYNC phase displacements introduced by DispSync when forwarding the signal to the application and SurfaceFlinger. If each of these components has a total duration that does not exceed the VSYNC period, then the system generates a smooth display output which follows the display frame rate. If one of the entities misbehaves and takes longer to execute, the result may be the skipping of one or more frames.

Scheduling display pipeline entities

Android uses the SCHED_FIFO realtime scheduling policy for all of the SurfaceFlinger and Hardware Composer tasks mentioned above. The rationale is that these activities are latency-sensitive and, as system services, their execution path is known and their execution time is guaranteed not to starve other processes.
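
Moving a thread into this class is a single system call; the sketch below shows the shape of it, with the specific priority value being an illustrative assumption.

    #include <sched.h>

    /* Sketch: place a display-pipeline service thread into SCHED_FIFO.
       The priority value (2) is illustrative; Android keeps its display
       tasks at low realtime priorities. */
    static int make_fifo(pid_t tid)
    {
        struct sched_param param = { .sched_priority = 2 };

        return sched_setscheduler(tid, SCHED_FIFO, &param);
    }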

The application is also a component of the display pipeline chain, so its performance is also fundamental to the user's experience. Unfortunately, apps are developed by third parties, and there is no prior knowledge of their performance characteristics. Assigning an app to a scheduling class such as SCHED_RT may starve other system services running as SCHED_OTHER, causing undesired behavior. The use of the completely fair scheduler (CFS), instead, as the default application scheduler provides important features, such as fairness among tasks and consolidated integration with CPU-frequency governors.

Schedutil, Linux, and Android

The default CPU-frequency governor used by Android is schedutil, which relies on the CPU utilization of the runnable tasks to select the frequency of the CPU they execute on: the higher the utilization, the higher the frequency of the CPU when they are runnable. This governor fits so well with the needs of mobile Android devices that, in Android, it also takes care of the SCHED_RT tasks, which are normally run at the maximum frequency in mainline Linux kernels.

Schedutil chooses the lowest frequency sufficient not to overload the system, based on the measurement of the system utilization. This solution works well when tasks are independent and are able to run in parallel. But, whenever there is a dependency — tasks that are blocked on the completion of others — the single-task utilization accounting mechanism is no longer sufficient to define the requirements of the whole task set.
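
The frequency selection itself is a simple scaling operation; the following is a simplified sketch of the logic in the kernel's get_next_freq() (kernel/sched/cpufreq_schedutil.c), which adds 25% of headroom so that the chosen frequency stays comfortably above the measured utilization:

    /* Simplified sketch of schedutil's frequency mapping: scale the
       maximum frequency by the utilization (expressed against the CPU
       capacity, typically 0..1024), plus 25% headroom. */
    static unsigned long next_freq(unsigned long util,
                                   unsigned long max_cap,
                                   unsigned long max_freq)
    {
        return (max_freq + (max_freq >> 2)) * util / max_cap;
    }

For example, a task at 50% utilization on a CPU with a 2GHz maximum yields a request of 1.25GHz.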

For example, in the scenario shown below, schedutil sees that RenderThread only requires 50% of a CPU's capacity, so it sets the CPU frequency to 50% of the maximum. But RenderThread cannot run until the UI thread has done its work — the two tasks cannot run in parallel — so it misses its deadline.

[Schedutil gets it wrong]

Android currently implements a workaround called "TouchBoost" to deal with this misbehavior. When the user interacts with the device, TouchBoost sets the minimum frequency that the governor can choose to a higher value for a given amount of time. This brute-force solution successfully provides the resources required by the display pipeline when the user interacts with the device. The disadvantage of this approach is that, when the display pipeline has a low workload, the minimum frequency forced by TouchBoost may be much higher than the demand of the pipeline. This possible overprovisioning of the frequency results in some waste of energy that gives no improvement to the user experience and should be limited, or possibly eliminated.
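
A TouchBoost-style frequency floor can be expressed through the cpufreq sysfs interface; the sketch below raises scaling_min_freq for one CPU. The path is the standard cpufreq one, but the value and the boost policy are assumptions for illustration; Android actually drives boosting from its power HAL.

    #include <stdio.h>

    /* Sketch: impose a TouchBoost-like frequency floor by raising
       scaling_min_freq for the given CPU; a real implementation would
       restore the old value once the boost period expires. */
    static int set_min_freq(int cpu, long khz)
    {
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_min_freq",
                 cpu);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%ld", khz);
        fclose(f);
        return 0;
    }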

Some possible solutions

Many different paths, requiring more or less effort in user space or kernel space, can be followed to deal with this behavior. Possible solutions include using different scheduling classes, implementing feedback loops in the Android framework to offload from the kernel the aggregation of the CPU utilization of interdependent tasks, or extending the scheduling mechanisms.

CFS and Utilclamp

A straightforward solution to overcome the TouchBoost overprovisioning is to implement a similar, but smarter, mechanism that forecasts the minimum utilization required by the interdependent tasks to complete on time. The implementation of this mechanism requires monitoring the execution time of the application's threads (which varies by device, core, application, the application's current status, and the other operations the system runs), and an API to notify the kernel about the performance requirements of the tasks.

Utilclamp is a mechanism developed by Patrick Bellasi; it allows user space to constrain the kernel-measured utilization of one or more tasks with a minimum and/or maximum clamp. This mechanism allows user space to change the kernel's behavior when dealing with specific tasks, ensuring that the CPU frequency will be set within the given bounds when those tasks are runnable. Manipulating the utilization clamps may affect not only the choice of the CPU frequency, but also task placement on heterogeneous architectures.

For example, a short task with long gaps between activations has a low utilization but, if user space knows that the task must complete as soon as possible, it may set a clamp that raises the minimum utilization perceived for that task, so that it will run on a high-performance CPU at a high frequency. Another example is a batch task with a measured utilization close to 100%; user space, knowing that it can run in the background, may clamp its maximum utilization so that the task runs at low frequencies on a low-performance, energy-efficient CPU.
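
Setting such clamps is done with the sched_setattr() system call, for which glibc provides no wrapper. Below is a minimal sketch, assuming a kernel with utilization-clamping support (5.3 or later); the sched_attr layout is declared locally to mirror the kernel's UAPI definition, and the clamp values are in the scheduler's 0..1024 capacity range.

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
    #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40

    /* Mirrors include/uapi/linux/sched/types.h (kernel 5.3+). */
    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        uint32_t sched_util_min;
        uint32_t sched_util_max;
    };

    /* Sketch: clamp the kernel-perceived utilization of a task between
       util_min and util_max (0..1024), keeping it in CFS. */
    static int set_uclamp(pid_t pid, uint32_t util_min, uint32_t util_max)
    {
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = 0;            /* SCHED_OTHER */
        attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN |
                           SCHED_FLAG_UTIL_CLAMP_MAX;
        attr.sched_util_min = util_min;
        attr.sched_util_max = util_max;

        return syscall(SYS_sched_setattr, pid, &attr, 0);
    }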

In the specific case of the Android display pipeline, the Android framework may compute the right utilization of a group that covers the execution of both the UI thread and the RenderThread. It can then use Utilclamp to force the kernel to use that value rather than the internally measured single-task utilizations. This solution would solve the problem of defining the utilization of interdependent tasks but, as a consequence, the UI thread and RenderThread would compete for the CPU against other less important CFS tasks, and there is no notion of either a deadline or other real-time requirements for such a latency-dependent task set.

Moreover, it is almost impossible to obtain a good estimate of the duration of a task by using the canonical CLOCK_MONOTONIC or CLOCK_THREAD_CPUTIME_ID clocks, since these measurements are affected by both the frequencies at which the task was running and by the capacities of the CPU(s) where the task was placed. Additionally, the Linux kernel is still lacking a way of exporting the capacity- and frequency-invariant execution time of a process or thread to user space.

SCHED_RT

If our application tasks are so important, why can't the SCHED_FIFO or SCHED_RR realtime classes be chosen? This would eliminate the competition between the application and CFS services; since SCHED_RT tasks also run at a frequency determined by schedutil, the same energy-efficiency considerations as with CFS apply, and a utilization-clamping mechanism can still be used.

The problem here is that, while the performance of our application tasks is indeed important, not all sections of those tasks are time-critical, and there is the risk that our application may starve system services running under CFS. Realtime throttling is a mechanism that can contain this problem, but it may cause the loss of some application bandwidth, making it an unworkable solution. RT_RUNTIME_GREED by Daniel Bristot de Oliveira may mitigate this bandwidth-loss problem by putting throttled realtime tasks back into execution when there is no other non-realtime task to run.
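
The throttling budget is visible through sysctl knobs; by default, realtime tasks may consume 950ms of CPU time in every one-second period, leaving 5% for everything else. A small sketch of reading those knobs (the values in the comments are the kernel defaults):

    #include <stdio.h>

    /* Sketch: read the realtime-throttling budget from the standard
       sysctl knobs. */
    static void show_rt_budget(void)
    {
        long runtime_us = -1, period_us = -1;
        FILE *f;

        if ((f = fopen("/proc/sys/kernel/sched_rt_runtime_us", "r"))) {
            if (fscanf(f, "%ld", &runtime_us) != 1)  /* default: 950000 */
                runtime_us = -1;
            fclose(f);
        }
        if ((f = fopen("/proc/sys/kernel/sched_rt_period_us", "r"))) {
            if (fscanf(f, "%ld", &period_us) != 1)   /* default: 1000000 */
                period_us = -1;
            fclose(f);
        }
        printf("RT tasks may run %ldus of every %ldus\n",
               runtime_us, period_us);
    }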

Another thing still missing from this solution is any notion of a task's deadline.

SCHED_DEADLINE

SCHED_DEADLINE is a scheduling class with many features that may benefit the Android display pipeline. When using SCHED_DEADLINE, the application's tasks wouldn't be preempted by tasks in any other scheduling class accessible to user space. In addition, this class finally introduces the notion of a per-task deadline, which is used as a dynamic priority for scheduling. SCHED_DEADLINE also requires a defined run time for each task; that run time becomes a bandwidth constraint that can be used both for task throttling and for establishing the utilization value used for frequency selection and task placement.
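
Admission to SCHED_DEADLINE also goes through sched_setattr(); a minimal sketch for a 60Hz display pipeline follows. The 4ms run time is an assumed budget that would need per-device measurement; the deadline and period correspond to one frame.

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SCHED_DEADLINE 6

    /* Truncated sched_attr layout (the kernel accepts shorter,
       older-style structures based on the size field). */
    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;   /* guaranteed CPU time per period */
        uint64_t sched_deadline;  /* relative deadline */
        uint64_t sched_period;    /* activation period */
    };

    /* Sketch: attach a frame-producing thread to SCHED_DEADLINE with
       a 60Hz period; the 4ms run time is an illustrative guess. */
    static int make_deadline(pid_t tid)
    {
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = SCHED_DEADLINE;
        attr.sched_runtime  =  4000000;   /* 4ms, in ns (assumed) */
        attr.sched_deadline = 16666667;   /* one 60Hz frame */
        attr.sched_period   = 16666667;

        return syscall(SYS_sched_setattr, tid, &attr, 0);
    }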

Unfortunately, SCHED_DEADLINE also has several drawbacks that make it unusable as it is. This scheduling class does not yet have a usable priority-inheritance mechanism to manage interdependent tasks, but a proxy execution mechanism that should address this problem is currently under discussion.

Moreover, the deadline scheduler's bandwidth-throttling mechanism is too aggressive, causing deadlines to be missed when the run time chosen for a task is too small. This forces the choice of a pessimistic run time with a safe margin, wasting bandwidth that could otherwise be used by other SCHED_DEADLINE tasks and, under schedutil, driving the selection of a higher frequency and thus a waste of energy. As with SCHED_RT, the application threads are not entirely time-critical, and their non-time-critical sections could be scheduled with CFS instead of counting against the run time of the SCHED_DEADLINE task.

Another problem of applying SCHED_DEADLINE to application tasks is that these tasks self-suspend; for example, RenderThread may block waiting for an output buffer to dequeue from the BufferQueue. When a SCHED_DEADLINE task wakes up after a suspension, the kernel might postpone its deadline, reducing the priority of the task itself and potentially causing it to miss its deadline because it may get preempted by another SCHED_DEADLINE task.

SCHED_DEADLINE also does not consider the CPU capacity in task placement and, in heterogeneous architectures, it may schedule high-bandwidth tasks on slow CPUs, leading to deadline misses. Capacity awareness, which should address this problem, is currently under discussion. Finally, the scheduling policy implemented by SCHED_DEADLINE does not manage overlapping cpuset control groups, so the use of custom CPU affinities is forbidden and the definition of a limited set of CPUs can only be achieved with exclusive cpuset cgroups (root domains).

SCHED_DEADLINE with token passing

A thing to note in the interaction between the application's UI thread and RenderThread is that the critical path follows the data flow, which is sequential.

Instead of thinking about the deadline of a task, another way of seeing the problem is to associate the deadline with the data. Once the application wakes up, the UI thread is critical until it sends the draw command list to the RenderThread, and the RenderThread is critical until it submits the rasterized frame to the BufferQueue. The deadline for submitting this data is the time at which SurfaceFlinger wakes up to consume the BufferQueue data for the composition. This would be similar to having a single SCHED_DEADLINE task with a deadline equal to the wakeup time of SurfaceFlinger, which executes the critical sections of the UI thread and RenderThread.

This solution can be implemented by extending the scheduler API with a token-passing mechanism that exchanges the scheduling properties of two tasks. A SCHED_DEADLINE UI thread may donate its internal scheduling parameters to RenderThread after it passes the draw command list, and RenderThread can give back the SCHED_DEADLINE token to the UI thread.

[Deadline donation]
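
No such interface exists in any kernel today; purely to sketch the idea, the hypothetical sched_pass_token() call below (an invented name and signature) marks where the hand-offs would happen:

    #include <sys/types.h>

    /* HYPOTHETICAL: no such syscall exists; it would transfer the
       caller's SCHED_DEADLINE reservation to the thread 'to'. */
    extern int sched_pass_token(pid_t to);

    /* UI thread, after queuing the draw command list: donate the
       deadline reservation and fall back to a non-critical policy. */
    static void ui_hand_off(pid_t render_tid)
    {
        sched_pass_token(render_tid);
    }

    /* RenderThread, after enqueuing the rasterized frame: return the
       token so the UI thread can start the next frame on time. */
    static void render_hand_back(pid_t ui_tid)
    {
        sched_pass_token(ui_tid);
    }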

This approach excludes the non-time-critical sections of the application from the deadline bandwidth and does not require any priority-inheritance or deadline-synchronization mechanism.

When the system is overloaded, however, the Android display pipeline may have the UI thread and RenderThread running simultaneously, an execution mode that would require either falling back to a CFS solution or introducing two SCHED_DEADLINE tokens to be exchanged between the UI thread and RenderThread, with a more complicated synchronization mechanism between the two.

Hierarchical scheduling: SCHED_DEADLINE on top of SCHED_RT

In a more general case where the system has groups of collaborating realtime tasks, a different solution may be worth considering. By substituting the realtime throttling mechanism of SCHED_RT with SCHED_DEADLINE entities, multiple collaborating tasks that share the same deadline could be scheduled as a special SCHED_DEADLINE entity, and within that scheduling entity, they could be scheduled with SCHED_RT (or other) policies.

Besides simplifying the realtime analysis of the task set, this solution would also reduce the SCHED_DEADLINE bandwidth pessimism. When choosing the run time for the task group, the probability that multiple tasks hit their worst-case execution paths at the same time is very low, so the total run time allocated for the group can be smaller than the sum of the worst-case execution times of its tasks.

On the other hand, this solution introduces more scheduling overhead due to the two scheduling layers and a lot of task migrations. Moreover, this approach would expose the problems of SCHED_DEADLINE, such as CPU affinities, to other scheduling classes; this makes it infeasible.

The theory for the hierarchical scheduling algorithm I mention in this section was first introduced by Leontyev and Anderson (A hierarchical multiprocessor bandwidth reservation scheme with timing guarantees [PDF]); a first implementation in the Linux kernel was described in An implementation of a multiprocessor bandwidth reservation mechanism for groups of tasks [PDF] by Andrea Parri, Mauro Marinoni, Juri Lelli, and Giuseppe Lipari. To provide realtime guarantees supported by realtime analyses requires the additional overhead of the two-level scheduling hierarchy, plus all the effort of migrating the task group when its guaranteed execution time on the current CPU expires.

Conclusions

The solutions presented here each lack features that would make them usable out of the box for the Android display pipeline. Each would force this complex subsystem to depend on mechanisms that are not completely merged upstream in the Linux kernel; diverging from the mainline code base would make these modifications hard, if not impossible, to maintain across future kernel releases.

Currently, there is no clear winner among these solutions, and adopting one of them is an expensive, risky operation that requires careful planning. All of them have a high potential for success, however; filling the gaps mentioned above, step by step, may be immediately beneficial for many applications, and will potentially form part of the foundations for the Android display pipeline in the future.

Index entries for this article
Kernel: Android
Kernel: Scheduler
GuestArticles: Balsini, Alessio



Scheduling for the Android display pipeline

Posted Jan 16, 2020 21:07 UTC (Thu) by pj (subscriber, #4506) (2 responses)

It's clear (to me) that a good solution to this is going to have to teach the scheduler about thread dependencies... does the scheduler take into account that thread A is doing an epoll() on a socket whose other end is owned by thread B ? Fencing operations could likewise be instrumented to 'show' when a thread is waiting on a fence.

Scheduling for the Android display pipeline

Posted Jan 18, 2020 18:34 UTC (Sat) by scientes (guest, #83068) (1 response)

Didn't the seL4 scheduler start doing this type of analysis?

Scheduling for the Android display pipeline

Posted Jan 20, 2020 21:21 UTC (Mon) by scientes (guest, #83068)

Yes, here[1] is Gernot Heiser talking about it, and how it is different from how traditional RT schedulers work.

[1] https://youtu.be/6s5FDX5PkZI?t=801

Scheduling for the Android display pipeline

Posted Jan 26, 2020 22:59 UTC (Sun) by alison (subscriber, #63752)

VSYNC "occurs at the refresh rate of the display, and is thus periodic: a display with a 60Hz refresh rate generates a VSYNC every 16.6667ms." With the newish G-Sync/FreeSync HW introduced by Nvidia and AMD respectively, VSYNC is now effectively an IRQ rather than a clock. Do any mobile devices yet make use of such a dynamic scan-out mechanism? Might being able to slightly adjust the scan-out time help?


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds