Wiki page request - performance tips

mattgolsen's picture

Could someone start a wiki page about optimizing compositions for performance? With the advent of Performance Inspector, it's made me realize how little I know about optimizing my compositions. I know what is eating up most of the processor cycles, but I really don't have a good handle on fixing it.

dwskau's picture
GPU Usage

Does anyone know of any GPU monitoring utility that could show how full graphics memory is and what the load on the GPU is? This could help in optimizing composition performance.

cwright's picture
opengl profiler

There's an application called "OpenGL Profiler" which provides some info like this. It's not particularly well documented, but it provides a pretty good view of the gl/gpu cost of an application. Also check out OpenGL Driver Monitor.

(IMPORTANT NOTE: Having your VRAM usage at 90% isn't significantly worse* than 10% -- having your VRAM usage at 101% is a catastrophe, since that's the point where it requires trans-bus accesses. I get way too many people wanting this information, thinking that having a bunch of spheres on screen is magically eating significant amounts of VRAM, and that the amount used is catastrophically destroying their framerates.)

[a 24x36 stack sphere will have 864 faces, with 3 vertices each. Fully expanded (no vertex reuse) would yield 2592 vertices. Assuming 2 texture dimensions, 3 normal dimensions, and 4 spatial dimensions (x, y, z, w), that's 23328 floats. At 4 bytes each, that's less than 100k of vram for sphere vertex data (including normals and texture coordinates), and that's assuming no vertex reuse -- if they reuse them, it's around 32k per such sphere. A single 96x96 rgba texture will consume more vram than that.]
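The arithmetic above can be checked with a quick back-of-envelope script (the 25×37 unique-vertex count for the reused case is an assumption about how a stack/slice grid shares vertices, not something stated in the post):

```python
# VRAM estimate for a 24x36 stack/slice sphere, assuming 9 floats
# per vertex (2 texture + 3 normal + 4 spatial) at 4 bytes each.
FACES = 24 * 36                  # 864 triangular faces
VERTS_PER_FACE = 3
FLOATS_PER_VERT = 2 + 3 + 4      # tex coords + normal + (x, y, z, w)
BYTES_PER_FLOAT = 4

# Fully expanded: no vertex reuse at all.
expanded_verts = FACES * VERTS_PER_FACE                        # 2592
expanded_bytes = expanded_verts * FLOATS_PER_VERT * BYTES_PER_FLOAT  # 93312, under 100k

# With reuse: assume a shared (24+1) x (36+1) grid of unique vertices.
reused_verts = 25 * 37                                         # 925
reused_bytes = reused_verts * FLOATS_PER_VERT * BYTES_PER_FLOAT  # 33300, around 32k

# A single 96x96 RGBA texture (1 byte per channel).
texture_bytes = 96 * 96 * 4                                    # 36864

print(expanded_bytes, reused_bytes, texture_bytes)
```

Note that `texture_bytes` comes out larger than `reused_bytes`, matching the claim that one small RGBA texture outweighs a reused sphere's vertex data.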

(*) there are gpu cache details that matter with usage, but overall you're not going to work at that level through QC, since there's so much abstraction.

cwright's picture
$10000.05

This is like the Tesla/engineer/whomever-the-presenter-likes story where a guy (Tesla, engineer, whomever...) is asked to fix someone's problem. The solution ends up requiring one screw. Later, the "someone" receives an invoice for $10000.05. When he asks Tesla/engineer/whomever for an itemized breakdown, it reads something like this:

  • 1 screw: $0.05
  • knowing where to put it: $10000

Optimization (in Application, CoreImage, QC, GLSL, PHP, SQL, anywhere) is an extensive framework of knowledge that requires time to develop and hone. While there are basic axioms that help (Don't Have a Zillion Iterators, or a Zillion Core Image Filters on dynamic images), generalizations in optimization lead to "People Talking Before They Know What They're Talking About Or Making Any Attempt At All To Test What They're Saying" (tm).

There are at least 3 skills you'll need to develop to optimize effectively, and they can't really be Automated:

  • Knowing how to interpret measured results (PerformanceInspector helps make the measurements, but doesn't attempt to interpret them except for being able to sort what it collects to group stuff)
  • Knowing how to test around bottlenecks. Simply finding a slow patch is just a tiny portion of the battle. You then need to figure out why it's slow. Is it because the patch sucks? Is it because you're doing more work than you really need to? Is it because you're doing synchronous operations over a slow medium? Is it because you're saturating some crucial system resource (cpu power, bus transfers, vram, sys ram, polygon fill rate, etc)? This can't be automated, because each situation is unique.
  • Knowing how to solve the problem. Finding alternatives, work-alikes, look-alikes, refactoring stuff, caching results, etc. There are a shocking number of optimization tricks out there that people have been honing since the 60's (back when individual bytes were precious storage vessels). Some techniques have become less useful than when they were new (look-up-tables after the advent of clock multipliers & async cpu/bus speeds), but there are still times where each trick can be applied to certain problems to yield a benefit. This is very very application specific, and also can't be automated.
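One of the tricks named above, caching results, can be sketched in a few lines. This is a generic illustration, not anything QC-specific; `expensive_filter` is a hypothetical stand-in for whatever costly work you'd otherwise repeat every frame:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_filter(radius):
    # Stand-in for a costly computation, e.g. building a blur kernel.
    # With lru_cache, it only runs once per distinct argument.
    return tuple(radius * i for i in range(1000))

a = expensive_filter(3)   # computed
b = expensive_filter(3)   # returned from cache, no recomputation
```

The catch, as with every trick on the list, is knowing when it applies: caching only pays off when the same inputs recur and the result doesn't need to change between frames.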

So while the general idea that you want is a noble one, it's not really possible to address directly. We can only start the journey (by making PT), and lend a hand where possible.