QuartzCrystalCLI and Xgrid

soundscreen's picture

I am calling QuartzCrystalCLI from an Xgrid job. I think there is context switching/contention for the GPU: the QuartzCrystalCLI process often becomes "stuck" and I'm only getting 40% utilization per core.

This technote implies that it's possible to take advantage of multiple GPUs: http://developer.apple.com/technotes/tn2008/tn2229.html

Is this incorporated into QuartzCrystalCLI? Do you think that will help with performance? Any other suggestions?

cwright's picture
Re: QuartzCrystalCLI and Xgrid

soundscreen wrote:
the QuartzCrystalCLI process often becomes "stuck" and I'm only getting 40% utilization per core.

This technote implies that it's possible to take advantage of multiple GPUs: http://developer.apple.com/technotes/tn2008/tn2229.html

Is this incorporated into QuartzCrystalCLI? Do you think that will help with performance? Any other suggestions?

The answer to this isn't so simple.

First, we have memory limits -- due to QuickTime limitations, QuartzCrystalCLI is a 32-bit application, which means there's a 2GB memory limit for the process. When you crank up antialiasing or motion blur, usage gets dangerously close to that limit (and sometimes crosses it -- QC needs memory too, as does the app itself, so it can't all go to graphics). Doing multiple simultaneous renders would multiply usage against an already strained memory limit.

Second, QC is stateful. This means that a given output frame can depend on the previous output frame. If we start rendering frame 2 before frame 1 is complete, we risk getting wrong output. There's no way to automatically detect whether or not a composition is safe to render "out of order". (Heuristics could potentially handle this in some cases, but that's certainly not a 100% reliable solution.)
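To make the statefulness point concrete, here is a minimal toy sketch in Python. It is not QC or QuartzCrystalCLI code: the "composition" is a hypothetical feedback effect where each frame blends in half of the previous output, and the function names and the `decay` parameter are made up for illustration.

```python
def render_sequential(num_frames, decay=0.5):
    """Render frames in order; each frame blends in the previous output."""
    outputs = []
    previous = 0.0
    for t in range(num_frames):
        current = float(t) + decay * previous  # depends on the prior frame
        outputs.append(current)
        previous = current
    return outputs


def render_isolated(t, decay=0.5):
    """Render frame t alone, as an out-of-order worker would."""
    previous = 0.0  # the earlier frame's output is not available here
    return float(t) + decay * previous


sequential = render_sequential(4)
isolated = [render_isolated(t) for t in range(4)]
print(sequential)  # [0.0, 1.0, 2.5, 4.25]
print(isolated)    # [0.0, 1.0, 2.0, 3.0]
```

The two lists diverge from frame 2 onward, which is the kind of wrong output that rendering frames independently would risk for a composition with feedback.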

Third, GPUs are terrible for rendering accurate results (due to driver quirks, driver bugs, GPU bugs, etc). As such, we actually don't use any GPUs while rendering, and instead use the CPU-bound Software Renderer. This ensures that the output is consistent even if the underlying hardware changes. This means that if you have 128 GPUs in a machine (ridiculous by today's standards of 1-4 GPUs), we're using exactly 0 of them. We plan to keep using 0 until we find a case where GPU drivers actually do what they're supposed to as well as the Software Renderer does.

Fourth, we're not in complete control. We have no control over what QuickTime does or how it does it (and it cannot safely be threaded), so it stalls sometimes. We also have no control over CoreImage, which we use for image postprocessing (antialiasing and motion blur) -- it's possible that stalls happen there due to CI uploading to a GPU behind the scenes. We could work around both of these problems by writing our own image filtering code (admittedly very simple), and our own movie encoder code (ridiculously difficult and unlikely to happen).

If you'd like, you can fling one of your 40% compositions our way, and we can do some profiling to see if it is something we can change for a future version. No promises, but it can't hurt to check, right?

soundscreen's picture
Re: QuartzCrystalCLI and Xgrid

Thanks, good to know that it's not going through the GPU. That should be better for Xgrid, which schedules tasks on available processor cores.

In this application Xgrid is distributing encodes from qtz to mov. Each task is separate, and there isn't any relationship / interdependency between the tasks. The qtz files we are using do embed QT movies - so QT is also part of the input stream.

What's odd is that when one task is submitted, it hums along at ~100% on one of the processor cores. However, when multiple tasks are submitted in parallel, processor utilization drops under 40% per core. Not sure where the bottlenecks are.
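One way to narrow this down outside of Xgrid might be to time the same batch of tasks at different concurrency levels on a single machine. The Python sketch below is only an illustration under assumptions: `run_batch` is a made-up helper, and `placeholder` stands in for the real render command line, which is not shown in this thread.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_one(argv):
    """Run one command to completion and return its exit code."""
    return subprocess.run(argv).returncode


def run_batch(commands, max_parallel):
    """Run commands with at most max_parallel at once; return (seconds, codes)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        codes = list(pool.map(run_one, commands))
    return time.perf_counter() - start, codes


# Placeholder task (an assumption): swap in the real command for each encode.
placeholder = [sys.executable, "-c", "sum(range(200000))"]

for level in (1, 2, 4):
    wall, codes = run_batch([placeholder] * 4, level)
    print(f"{level} in parallel: {wall:.2f}s wall, exit codes {codes}")
```

If wall time stops improving (or gets worse) as the level goes up, the tasks are probably waiting on something shared rather than on CPU.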

cwright's picture
Re: QuartzCrystalCLI and Xgrid

Interesting -- are your machines hyperthreaded? (Newer Xeons are HT-capable, I think.)
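If it helps, a quick way to check is to compare the physical and logical core counts that macOS reports through `sysctl` (`hw.physicalcpu` and `hw.logicalcpu`). The Python wrapper below is just a sketch; the helper names are made up, and it reports "unknown" on hosts where those keys are unavailable.

```python
import subprocess


def sysctl_int(name):
    """Read an integer sysctl value, e.g. 'hw.physicalcpu' on macOS."""
    result = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


def is_hyperthreaded(physical, logical):
    """More logical than physical cores suggests hyperthreading is active."""
    return logical > physical


def describe_host():
    """Return a one-line summary, or 'unknown' if the sysctl keys are missing."""
    try:
        physical = sysctl_int("hw.physicalcpu")
        logical = sysctl_int("hw.logicalcpu")
    except (OSError, subprocess.CalledProcessError, ValueError):
        return "unknown (hw.physicalcpu / hw.logicalcpu not available here)"
    state = "yes" if is_hyperthreaded(physical, logical) else "no"
    return f"physical={physical} logical={logical} hyperthreaded={state}"


print(describe_host())
```

On a hyperthreaded machine, a scheduler that counts logical cores could place two tasks on one physical core, which would show up as lower utilization per "core".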

Also, what render settings are you using? (Important details include resolution, motion blur, and antialiasing settings; the codec matters as well.)

(I half suspect that CoreImage is doing GPU work behind the scenes, or possibly QT encoding is -- in that case, the GPU may be able to keep up with 1 encoder, but become the bottleneck when two or more are submitting work.)