i915 performance, master, i915tex & gem

From:  Keith Whitwell <>
To:  Keith Whitwell <>, Dave Airlie <>, Ian Romanick <>
Subject:  i915 performance, master, i915tex & gem
Date:  Mon, 19 May 2008 05:09:59 -0700 (PDT)
Message-ID:  <>
Cc:  DRI <>
Just reposting this with a new subject line and less preamble.

----- Original Message ----

> Well the thing is I can't believe we don't know enough to do this in some 
> way generically, but maybe the TTM vs GEM thing proves its not possible. 

I don't think there's anything particularly wrong with the GEM
interface -- I just need to know that the implementation can be fixed
so that performance doesn't suck as hard as it does in the current one,
and that people's political views on basic operations like mapping
buffers don't get in the way of writing a decent driver.

We've run a few benchmarks against i915 drivers in all their permutations, and to summarize, the
results look like:
    - for GPU-bound apps, there are small differences, perhaps up to 10%.  I'm really not concerned
      about these (yet).
    - for CPU-bound apps, the overheads introduced by Intel's approach to buffer handling impose a
      significant penalty in the region of 50-100%.

I think the latter is the significant result -- none of these experiments
in memory management significantly change the command stream the
hardware has to operate on, so what we're varying essentially is the
CPU behaviour to achieve that command stream.  And it is in CPU usage
where GEM (and Keith/Eric's now-abandoned TTM driver) do significantly
worse.

Or to put it another way, GEM & master/TTM seem to burn huge amounts
of CPU just running the memory manager.  This isn't true for
master/no-ttm or for i915tex using userspace sub-allocation, where the
CPU penalty for getting decent memory management seems to be minimal
relative to the non-ttm baseline.  

If there's a political desire to not use userspace sub-allocation, then
whatever kernel-based approach you want to investigate should nonetheless
make some effort to hit reasonable performance goals -- and neither of the
current two kernel-allocation-based approaches is at all impressive.
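
For readers who haven't met the term, here is a toy sketch of what
"userspace sub-allocation" means in general.  It is not i915tex's actual
code: the Pool and SubAllocator names, the 1MB pool size and the 64-byte
alignment are all made up for illustration, and it ignores freeing, fencing
and eviction entirely.  The only point is that the expensive step (asking the
kernel for a buffer) happens once per pool rather than once per buffer.

```python
class Pool:
    """Stand-in for one large buffer obtained from the kernel (hypothetical)."""

    def __init__(self, size):
        self.size = size
        self.offset = 0  # bump pointer: everything below it is handed out

    def carve(self, nbytes, align):
        """Return an aligned offset inside this pool, or None if it is full."""
        start = (self.offset + align - 1) // align * align
        if start + nbytes > self.size:
            return None
        self.offset = start + nbytes
        return start


class SubAllocator:
    """Serve small allocations from big pools, counting simulated kernel calls."""

    def __init__(self, pool_size=1 << 20, align=64):
        self.pool_size = pool_size
        self.align = align
        self.pools = []
        self.kernel_calls = 0

    def alloc(self, nbytes):
        """Return (pool index, offset) for a buffer of nbytes."""
        if nbytes > self.pool_size:
            raise ValueError("request larger than a pool")
        if self.pools:
            offset = self.pools[-1].carve(nbytes, self.align)
            if offset is not None:
                return len(self.pools) - 1, offset
        # Current pool is missing or full: this is the only path that
        # would enter the kernel in a real driver.
        self.kernel_calls += 1
        self.pools.append(Pool(self.pool_size))
        return len(self.pools) - 1, self.pools[-1].carve(nbytes, self.align)


allocator = SubAllocator()
handles = [allocator.alloc(4096) for _ in range(1000)]
print(allocator.kernel_calls, "simulated kernel calls for", len(handles), "buffers")
```

With these made-up sizes, 1000 small buffers cost 4 simulated kernel calls
instead of 1000.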


And on an i945G, dual core Pentium D 3GHz 2MB cache, FSB 800 MHz, single-channel ram:

Openarena timedemo at 640x480:
master w/o TTM:  840 frames, 17.1 seconds: 49.0 fps, 12.24s user 1.02s system 63% cpu 20.880 total
master with TTM: 840 frames, 15.8 seconds: 53.1 fps, 13.51s user 5.15s system 95% cpu 19.571 total
i915tex_branch:  840 frames, 13.8 seconds: 61.0 fps, 12.54s user 2.34s system 85% cpu 17.506 total
gem:             840 frames, 15.9 seconds: 52.8 fps, 11.96s user 4.44s system 83% cpu 19.695 total

It's less obvious here than some of the tests below, but the pattern is
still clear -- compared to master/no-ttm i915tex is getting about the
same ratio of fps to CPU usage, whereas both master/ttm and gem are
significantly worse, burning much more CPU per fps, with a large chunk
of the extra CPU being spent in the kernel.  
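
To put numbers on that, here is a quick sketch that turns the timedemo
figures above into CPU time per frame and the share of that time spent in
the kernel.  The figures are copied from the table; the cpu_cost helper and
the choice of "ms of CPU per frame" as the metric are just my way of slicing
it, not something the benchmark reports.

```python
# Openarena 640x480 timedemo figures copied from the table above:
# (fps, user seconds, system seconds), each run rendering 840 frames.
FRAMES = 840
runs = {
    "master w/o TTM":  (49.0, 12.24, 1.02),
    "master with TTM": (53.1, 13.51, 5.15),
    "i915tex_branch":  (61.0, 12.54, 2.34),
    "gem":             (52.8, 11.96, 4.44),
}


def cpu_cost(user, system, frames=FRAMES):
    """Return (CPU milliseconds per frame, fraction of CPU time in the kernel)."""
    total = user + system
    return 1000.0 * total / frames, system / total


results = {name: cpu_cost(user, system) for name, (_fps, user, system) in runs.items()}

for name, (ms_per_frame, kernel_fraction) in results.items():
    print(f"{name:16s} {ms_per_frame:5.1f} ms CPU/frame, {kernel_fraction:4.0%} in kernel")
```

By this measure master/no-ttm and i915tex come out at roughly 16 and 18 ms of
CPU per frame, against roughly 20 ms for gem and 22 ms for master/ttm, with
over a quarter of the CPU time of the latter two spent in the kernel.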

The particularly worrying thing about GEM is that it isn't hitting
*either* 100% cpu *or* maximum framerates from the hardware -- that's
really not very good, as it implies hardware is being left idle
unnecessarily.


A: ~1029 fps, 20.63user 2.88system 1:00.00elapsed 39%CPU  (master, no ttm) 
B: ~1072 fps, 23.97user 18.06system 1:00.00elapsed 70%CPU  (master, ttm)
C: ~1128 fps, 22.38user 5.21system 1:00.00elapsed 45%CPU  (i915tex, new)
D: ~1167 fps, 23.14user 9.07system 1:00.00elapsed 53%CPU  (i915tex, old)
F: ~1112 fps, 24.70user 21.95system 1:00.00elapsed 77%CPU  (gem)

The high CPU overhead imposed by GEM and (non-suballocating) master/TTM
should be pretty clear here.  master/TTM burns 30% of CPU just running
the memory manager!!  GEM gets slightly higher framerates but uses even
more CPU than master/TTM.  
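
For anyone wanting to check the 30% figure, it is just system time over
elapsed time for run B.  A small sketch using the numbers from the table
above; treating all system time as memory-manager cost is an approximation
on my part, since even the no-ttm baseline spends a few percent in the
kernel.

```python
ELAPSED = 60.0  # seconds; every run above was timed for 1:00.00

# (fps, user seconds, system seconds) copied from the table above.
runs = {
    "A master, no ttm": (1029, 20.63, 2.88),
    "B master, ttm":    (1072, 23.97, 18.06),
    "C i915tex, new":   (1128, 22.38, 5.21),
    "D i915tex, old":   (1167, 23.14, 9.07),
    "F gem":            (1112, 24.70, 21.95),
}

# Fraction of wall-clock time spent executing in the kernel.
kernel_share = {name: system / ELAPSED for name, (_fps, _user, system) in runs.items()}

for name, share in kernel_share.items():
    print(f"{name:18s} {share:5.1%} of elapsed time in the kernel")
```

Run B comes out at about 30% and gem at about 37%, against about 5% for the
no-ttm baseline.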

fgl_glxgears -fbo:

A: n/a
B: ~244 fps, 7.03user 5.30system 1:00.01elapsed 20%CPU  (master, ttm)
C: ~255 fps, 6.24user 1.71system 1:00.00elapsed 13%CPU  (i915tex, new)
D: ~260 fps, 6.60user 2.44system 1:00.00elapsed 15%CPU  (i915tex, old)
F: ~258 fps, 7.56user 6.44system 1:00.00elapsed 23%CPU  (gem)

KW: GEM & master/ttm burn more cpu to build/submit the same command streams.

openarena 1280x1024:

A: 840 frames, 44.5 seconds: 18.9 fps  (master, no ttm)
B: 840 frames, 40.8 seconds: 20.6 fps  (master, ttm)
C: 840 frames, 40.4 seconds: 20.8 fps  (i915tex, new)
D: 840 frames, 37.9 seconds: 22.2 fps  (i915tex, old)
F: 840 frames, 40.3 seconds: 20.8 fps  (gem)

no cpu measurements taken here, but almost certainly GPU bound.  A lot
of similar numbers, I don't believe the deltas have anything in
particular to do with memory management interface choices...


A: ~285000 Poly/sec (master, no ttm)
B: ~217000 Poly/sec (master, ttm)
C: ~298000 Poly/sec (i915tex, new)
D: ~227000 Poly/sec (i915tex, old)
F: ~125000 Poly/sec (gem, GPU lockup on first attempt)

KW: no cpu measurements in this run, but all are almost certainly 100% pinned on CPU.
  - i915tex (in particular i915tex, new) shows similar performance to classic - ie low cpu overhead
    for this memory manager.
  - GEM is significantly worse even than master/ttm -- hopefully this is
    a bug rather than a necessary characteristic of the interface.


A: total texels=393216000.000000  time=3.004000 (master, no ttm)
B: total texels=434110464.000000  time=3.000000 (master, ttm)
C: (i915tex new --- woops, crashes)  
D: total texels=1111490560.000000  time=3.002000 (i915tex old)
F: total texels=279969792.000000  time=3.004000 (gem)

Note the huge (3x-4x) performance lead of i915tex, despite the embarrassing
crash in the newer version.  I suspect this is unrelated to command
handling and probably somebody has disabled or regressed some aspect of
the texture upload path...  

NOTE:  The reason that i915tex
does so well relative to master/no-ttm is because we can upload
directly to "VRAM"...  master/no-ttm treats vram as a cache &
always keeps a second copy of the texture safe in main memory...  Hence
performance isn't great for texture uploads on master/no-ttm.
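
The size of that lead can be worked out from the texel counts above.  A
quick sketch, with the numbers copied from the table (run C is left out
because it crashed); the dictionary names are mine.

```python
# (total texels, seconds) copied from the table above.
runs = {
    "A master, no ttm": (393216000.0, 3.004),
    "B master, ttm":    (434110464.0, 3.000),
    "D i915tex, old":   (1111490560.0, 3.002),
    "F gem":            (279969792.0, 3.004),
}

# Texels uploaded per second for each driver.
rate = {name: texels / seconds for name, (texels, seconds) in runs.items()}

# How many times faster i915tex (old) is than each of the others.
best = "D i915tex, old"
lead = {name: rate[best] / r for name, r in rate.items() if name != best}

for name, factor in lead.items():
    print(f"i915tex old is {factor:.2f}x faster than {name}")
```

The factors come out between roughly 2.6x (over master/ttm) and 4x (over
gem), so "3x-4x" is a round figure.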

Here's what we're seeing on an i915 3GHz Celeron 256kB cache. Dual channel. Reportdamage 
disabled. DRM master:


    *i915 master, TTM*
    *i915 master, classic*
    ( no gem results on this machine ... )
    1033fps, 70.1% CPU.  (i915tex)
    726fps, 100% CPU. (master, ttm)
    955fps, 56%CPU. (master, no-ttm)

    47.1fps, 17.9u, 2.7s time (i915tex)
    31.5fps, 21.1u, 8.7s time (master, ttm)
    39fps, 17.9u, 1.3s time (master, no-ttm)

    1327MB/s (i915tex)
    551MB/s (master, ttm)
    572MB/s (master, no-ttm)

Texdown, subimage
    1014MB/s (i915tex)
    134MB/s (master, ttm)
    148MB/s (master, no-ttm)

Ipers, no help screen
    255 000 tri/s, 100% cpu (i915tex)
    139 000 tri/s, 100% cpu (master, ttm)
    241 000 tri/s, 100% cpu (master, no-ttm)

I would summarize the results like this:
   - master/no-ttm has a basically "free" memory manager in terms of CPU overhead
   - master/ttm and GEM gain a proper memory manager but introduce a huge CPU overhead & consequent
     performance regression
   - i915tex makes use of userspace sub-allocation to resolve that
     regression & achieve comparable efficiency to master/no-ttm.
   - a separate regression seems to have killed texture upload performance on master/ttm relative
     to its ancestor i915tex.


