A Comparison of Cg Compiler Profiles and Performance

A pocketmoon production

Intro

This brief article outlines the initial results of some Cg shader performance tests carried out on NV30-class hardware. Comparisons are made between the compiled output of three profiles: NV30 (FP30) and ARBFP1 (OpenGL), and PS_2_X (DirectX 9). The performance of each profile is assessed. These tests are restricted to fragment (pixel) shaders, which represent the biggest potential performance bottleneck.

Cg

The Cg compiler takes a shader written in the high level Cg language and compiles it down into low level vertex or fragment shader code. This compilation can be carried out at build time using the stand-alone cgc.exe or at run time using the appropriate Cg library functions. The compiler can be directed to produce either vertex or fragment shader code for one of many output targets.
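As a rough illustration of the build-time route, the sketch below assembles a cgc command line for each of the three profiles compared in this article. It is written in Python purely as a host-side helper. The file name shader.cg, the output naming scheme and the helper name cgc_command are assumptions made for the example; -profile, -entry and -o are the usual cgc options.

```python
# Hypothetical helper: builds (but does not run) a cgc command line.
# "shader.cg" and the output file names are placeholders, not files from the article.

PROFILES = ["arbfp1", "fp30", "ps_2_x"]  # the three fragment profiles compared here


def cgc_command(source, profile, entry="main"):
    """Return the argument list for compiling `source` to `profile`."""
    output = f"{source.rsplit('.', 1)[0]}_{profile}.asm"
    return ["cgc", "-profile", profile, "-entry", entry, "-o", output, source]


if __name__ == "__main__":
    for profile in PROFILES:
        # On a machine with the Cg toolkit installed, the list could be
        # handed to subprocess.run(...) instead of being printed.
        print(" ".join(cgc_command("shader.cg", profile)))
```

Keeping the profile as a parameter mirrors the point made above: the same Cg source is compiled once per target, and only the profile name changes.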

API       Available Fragment Shader Targets
OpenGL    ARBFP (OpenGL standard), FP30 (Nvidia proprietary)
DX9       PS1.1, PS1.2, PS1.3, PS1.4*1, PS2.0, PS2.x*2

*1 – Supported by DX9 HLSL
*2 – Supported by Cg and DX9 HLSL Beta (as PS2_a)

For OpenGL, there is support for Nvidia's proprietary shader using the 'FP30' profile, and also support for the generic ARB standard, 'ARBFP1', which is also supported by ATI's latest hardware. For DX9 shaders, profiles are available to compile up to PS2_x level (PS2_a under HLSL).

The application code which loads and runs the test shaders is based on the excellent demos developed by Kevin Harris and available from his site http://www.codesampler.com. Both his OpenGL and DirectX frameworks are compact, functional and easy to understand. His site contains both HLSL and Cg (DX and OpenGL) examples.

The 5 shaders described below are not all meant to provide useful effects. They are intended more to flex various aspects of shader functionality and to allow us to compare Cg profiles, and Cg vs HLSL compiler optimisation when generating PS2_x/PS2_a shaders. Each shader is described and the full precision version (using float data types) is listed. Where half and fixed precision shaders are used, the only difference is in the declared data types, hence the Cg code is not listed for each variety.

In addition, the context in which the shaders are tested is highly synthetic: geometry is kept to a minimum and the single bound texture map is only 128x128 in size. The aim is to compare pure fragment shader performance. Of course, in a 'real life' context pure shader performance can only be leveraged by an architecture that excels in all areas of the rendering pipeline.

Set Up

Pentium IV 1.8
512MB RAM
Nvidia Quadro FX 2000 (43.00 drivers)
Cg Release 1.1 (March 2003)
DX9 HLSL Release 1+ (March 2003 Beta)

N.B. The current Nvidia drivers comply with an older version of the DX9 shader spec, which contained a typographical error. As a result, the Cg DX9 PS profiles always opt to use partial precision where allowed, rather than defaulting to full precision as indicated in the recently (Feb 03) corrected DX9 specs.



Shader 1: Fake Noise







This shader attempts to implement fast fake noise. It takes no texture samples and relies on some nasty maths to generate a pseudo-random number using the input texture coordinates as a seed. The shader requires full floating point precision to work, and the output was only correct for the ARBFP and FP30 (full precision) profiles. Of those two, the FP30 profile holds the performance advantage. For an outline of the 'fake' noise algorithm see here.

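fragout_float main( vertout IN,
                    uniform sampler2D testTexture )
{
    fragout_float OUT;

    // Split the scaled texture coordinates into fractional and integer parts
    float3 vfrac = frac(IN.texcoord0.xyx*30.0);
    float3 v = (IN.texcoord0.xyx*30.0) - vfrac;
    float4 n;

    // Seed values for the four corners of the current lattice cell
    n.x = v.x + 17.0 * v.y;
    n.y = n.x + 17.0;
    n.zw = n.yx + 1;

    // Hash each seed into a pseudo-random value in [0,1)
    float4 nint = n-frac(n);
    float4 ni = frac(nint*(nint*nint*0.00093400)+0.04685891);

    // Smooth the weights (3t^2 - 2t^3), then interpolate between the corners
    float3 t = (3*vfrac*vfrac-2*vfrac*vfrac*vfrac);
    float2 interp = ni.xy + t.xx * (ni.wz - ni.xy);
    float p1 = interp.x + t.y * (interp.y - interp.x);

    OUT.col.xyz = p1.xxx;
    OUT.col.w = 1.0;
    return OUT;
}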
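To make the arithmetic easier to follow, here is a minimal CPU-side sketch of the same calculation in Python. It is a model for checking the maths, not a replacement for the Cg listing: the function names are made up for the example, and Python works in double precision, so values will not match the hardware bit for bit.

```python
import math


def frac(x):
    """Fractional part, in the sense of Cg's frac(): x - floor(x)."""
    return x - math.floor(x)


def hash_lattice(n):
    """Pseudo-random value in [0, 1) for a lattice seed n."""
    n = math.floor(n)
    return frac(n * (n * n * 0.00093400) + 0.04685891)


def fake_noise(u, v, scale=30.0):
    """CPU model of Shader 1 for texture coordinates (u, v)."""
    fx, fy = frac(u * scale), frac(v * scale)
    ix, iy = math.floor(u * scale), math.floor(v * scale)

    # Seeds for the four cell corners (n.x, n.w, n.y and n.z in the shader)
    n00 = ix + 17.0 * iy
    n10 = n00 + 1.0
    n01 = n00 + 17.0
    n11 = n01 + 1.0

    # Smoothed interpolation weights: 3t^2 - 2t^3
    tx = 3.0 * fx * fx - 2.0 * fx * fx * fx
    ty = 3.0 * fy * fy - 2.0 * fy * fy * fy

    # Interpolate along x on both rows, then along y
    row0 = hash_lattice(n00) + tx * (hash_lattice(n10) - hash_lattice(n00))
    row1 = hash_lattice(n01) + tx * (hash_lattice(n11) - hash_lattice(n01))
    return row0 + ty * (row1 - row0)
```

Calling fake_noise over a grid of (u, v) values in [0, 1] gives a greyscale field comparable in structure to the shader output, which is handy when judging whether a reduced-precision profile has broken the hash.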



Shader 2: Sobel Edge Filter







A shader that implements a Sobel edge filter. Requires nine texture samples and some maths to calculate luminance values upon which a convolution filter is applied. The FP30 profile using 'fixed' data types has a big advantage, as the NV30 hardware shows its ability to run fixed and float/sampler instructions in parallel.

fragout_float main( vertout IN,
                    uniform sampler2D testTexture )
{
    fragout_float OUT;
    // Take nine samples (0.0078125 = 1/128, i.e. one texel of the 128x128 texture)
    float3 col1 = f3tex2D(testTexture,IN.texcoord0.xy+float2(-0.0078125, 0.0078125));
    float3 col2 = f3tex2D(testTexture,IN.texcoord0.xy+float2( 0.00     , 0.0078125));
    float3 col3 = f3tex2D(testTexture,IN.texcoord0.xy+float2( 0.0078125, 0.0078125));
    float3 col4 = f3tex2D(testTexture,IN.texcoord0.xy+float2(-0.0078125, 0.00     ));
    float3 col5 = f3tex2D(testTexture,IN.texcoord0.xy);
    float3 col6 = f3tex2D(testTexture,IN.texcoord0.xy+float2( 0.0078125, 0.00     ));
    float3 col7 = f3tex2D(testTexture,IN.texcoord0.xy+float2(-0.0078125,-0.0078125));
    float3 col8 = f3tex2D(testTexture,IN.texcoord0.xy+float2( 0.00     ,-0.0078125));
    float3 col9 = f3tex2D(testTexture,IN.texcoord0.xy+float2( 0.0078125,-0.0078125));

    // Calculate luminance
    float3 rgb2lum = float3(0.30, 0.59, 0.11);
    float lum1 = dot(col1.xyz, rgb2lum);
    float lum2 = dot(col2.xyz, rgb2lum);
    float lum3 = dot(col3.xyz, rgb2lum);
    float lum4 = dot(col4.xyz, rgb2lum);
    float lum5 = dot(col5.xyz, rgb2lum);
    float lum6 = dot(col6.xyz, rgb2lum);
    float lum7 = dot(col7.xyz, rgb2lum);
    float lum8 = dot(col8.xyz, rgb2lum);
    float lum9 = dot(col9.xyz, rgb2lum);

    // Sobel filter
    float x = lum3 + lum9 + 2*lum6 - lum1 - 2*lum4 - lum7;
    float y = lum7 + 2*lum8 + lum9 - lum1 - 2*lum2 - lum3;
    float pp = x*x+y*y;
    float edge = (pp<0.04)?1.0:0.0; // Edge threshold
    OUT.col.xyz = col5.xyz * edge.xxx;
    OUT.col.w = 1.0;
    return OUT;
}
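The filter logic can also be checked away from the GPU. The following Python sketch applies the same two kernels and the same 0.04 threshold to a small in-memory image. The image layout (a list of rows of RGB tuples), the function names and the wrap-around addressing are assumptions made for the example, since the article does not state the texture wrap mode.

```python
def luminance(rgb):
    """Weighted luminance using the shader's rgb2lum constants."""
    r, g, b = rgb
    return 0.30 * r + 0.59 * g + 0.11 * b


def sobel_edge(image, x, y, threshold=0.04):
    """Return 1.0 where the squared gradient is below threshold, else 0.0.

    `image` is a list of rows of (r, g, b) tuples. Row y + 1 stands in for
    the +0.0078125 texture offset. Indices wrap around (assumed wrap mode).
    """
    h, w = len(image), len(image[0])

    def lum(dx, dy):
        return luminance(image[(y + dy) % h][(x + dx) % w])

    # Same kernels as the shader: the +x column minus the -x column,
    # and the -y row minus the +y row.
    gx = (lum(1, 1) + lum(1, -1) + 2 * lum(1, 0)
          - lum(-1, 1) - 2 * lum(-1, 0) - lum(-1, -1))
    gy = (lum(-1, -1) + 2 * lum(0, -1) + lum(1, -1)
          - lum(-1, 1) - 2 * lum(0, 1) - lum(1, 1))
    return 1.0 if gx * gx + gy * gy < threshold else 0.0
```

As in the shader, the returned value is a mask: multiplying the centre colour by it keeps flat regions and blacks out pixels that sit on an edge.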

Shader 3: Multiple Dependent Reads







A test of dependent read performance. This shader takes four samples, the result of each being used to offset the coordinates of the following sample. This use of dependent texture reads can occur where functions have been implemented as look-up tables encoded in 1D, 2D or 3D texture maps. The final output colour is an average of the four samples. Again the use of 'fixed' data types within an FP30 profile gives a big performance boost.

fragout_float main( vertout IN,
                    uniform sampler2D testTexture )
{
    fragout_float OUT;
    // Each fetch offsets the coordinates of the next one
    float3 col0 = f3tex2D(testTexture,IN.texcoord0.xy);
    float3 col1 = f3tex2D(testTexture,IN.texcoord0.xy + col0.xy);
    float3 col2 = f3tex2D(testTexture,IN.texcoord0.xy + col1.xy);
    float3 col3 = f3tex2D(testTexture,IN.texcoord0.xy - col2.xy);
    // Note: 0.2 slightly darkens the result; an exact four-sample average would use 0.25
    OUT.col.xyz = (col0 + col1 + col2 + col3 )*0.2;
    OUT.col.w = 1.0;
    return OUT;
}
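The chain of dependent fetches is perhaps easiest to see with a tiny texture on the CPU. This Python sketch follows the same four lookups. Nearest-neighbour sampling with wrap-around is an assumption (the article does not give the sampler state), and the names sample and dependent_reads are invented for the example.

```python
import math


def sample(texture, u, v):
    """Nearest-neighbour lookup with wrap-around (assumed sampler state)."""
    h, w = len(texture), len(texture[0])
    return texture[math.floor(v * h) % h][math.floor(u * w) % w]


def dependent_reads(texture, u, v):
    """CPU model of Shader 3: each fetch offsets the coordinates of the next."""
    c0 = sample(texture, u, v)
    c1 = sample(texture, u + c0[0], v + c0[1])
    c2 = sample(texture, u + c1[0], v + c1[1])
    c3 = sample(texture, u - c2[0], v - c2[1])
    # Same 0.2 scale factor as the Cg listing
    return tuple((a + b + c + d) * 0.2 for a, b, c, d in zip(c0, c1, c2, c3))
```

None of the later lookups can start until the previous colour is known, which is exactly the serial dependency the shader is designed to stress.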

Shader 4: Median Filter







A tricky one to implement. Basically this shader takes one center sample and 4 surrounding samples (up, down, left, right). Luminance values for all samples are calculated and the output colour is the sample which has the 'middle' luminance. See http://www.t-pot.com for the original median filter implementation upon which this one is based. This shader contains plenty of flow control. Notice how the median filter removes the 'salt and pepper' noise from the top left of the texture map. This is the only shader that produced a wide spread in compiled instruction count, with Microsoft's HLSL managing to compile the shader down to less than half the instructions of the Cg PS_2_x profile. Behavior like this probably indicates a bug in the compiler. If the Cg compiler were to produce shaders of a comparable length to the HLSL compiler, I would expect the FP30 (fixed) shader to again outperform all others.


Compiled instruction counts:

            ARBFP   FP30   FP30-PP   FP30-Fixed   PS_2_x (Cg)   PS_2_0 (HLSL)   PS_2_a (HLSL)
Shader 1    27      27     27        27           27            27              27
Shader 2    43      44     44        44           46            44              45
Shader 3    13      13     13        13           13            13              13
Shader 4    92      104    104       112          150           68              61
Shader 5    19      18     18        18           19            19              19



fragout main( vertout IN,
              uniform sampler2D testTexture)
{
    fragout OUT;
    float3 rgb2lum = float3(0.30, 0.59, 0.11);

    // Some sampling offsets
    float2 s1 = { 0.0f, 0.0078125};
    float2 s2 = { 0.0078125, 0.0f};
    float2 tc = IN.texcoord0.xy;

    // Centre sample plus its four neighbours
    float3 col0 = f3tex2D(testTexture,tc);
    float3 col1 = f3tex2D(testTexture,tc + s1);
    float3 col2 = f3tex2D(testTexture,tc + s2);
    float3 col3 = f3tex2D(testTexture,tc - s1);
    float3 col4 = f3tex2D(testTexture,tc - s2);

    float b0 = dot(col0.xyz,rgb2lum);
    float b1 = dot(col1.xyz,rgb2lum);
    float b2 = dot(col2.xyz,rgb2lum);
    float b3 = dot(col3.xyz,rgb2lum);
    float b4 = dot(col4.xyz,rgb2lum);

    // flagN counts the samples ranked above sample N. Using <= against
    // earlier samples and < against later ones breaks ties consistently.
    float flag0 = ((b0< b1)?1.0:0.0) +
                  ((b0< b2)?1.0:0.0) +
                  ((b0< b3)?1.0:0.0) +
                  ((b0< b4)?1.0:0.0);

    float flag1 = ((b1<=b0)?1.0:0.0) +
                  ((b1< b2)?1.0:0.0) +
                  ((b1< b3)?1.0:0.0) +
                  ((b1< b4)?1.0:0.0);

    float flag2 = ((b2<=b0)?1.0:0.0) +
                  ((b2<=b1)?1.0:0.0) +
                  ((b2< b3)?1.0:0.0) +
                  ((b2< b4)?1.0:0.0);

    float flag3 = ((b3<=b0)?1.0:0.0) +
                  ((b3<=b1)?1.0:0.0) +
                  ((b3<=b2)?1.0:0.0) +
                  ((b3< b4)?1.0:0.0);

    // The median has exactly two samples above it; col4 is the fall-through case
    OUT.col.xyz = ( flag0 ==2.0 ) ? col0 :
                  (( flag1 ==2.0 ) ? col1 :
                  (( flag2 ==2.0 ) ? col2 :
                  (( flag3 ==2.0 ) ? col3 : col4 )));

    OUT.col.w = 1.0;
    return OUT;
}
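The ranking trick in this shader is compact enough to be worth restating outside Cg. The Python sketch below is a CPU-side model using the same comparisons. The function name median_of_five and the lum callback are assumptions made for the example.

```python
def median_of_five(samples, lum):
    """Pick the sample with the middle luminance, following Shader 4's logic.

    `samples` is a sequence of five colours; `lum` maps a colour to a float.
    """
    b = [lum(s) for s in samples]
    for i in range(4):
        # Count the samples ranked above sample i. Ties are broken by index
        # (<= against earlier samples, < against later ones), so the ranking
        # is a strict total order and exactly one sample has two above it.
        above = sum(1.0 for j in range(5)
                    if j != i and (b[i] <= b[j] if j < i else b[i] < b[j]))
        if above == 2.0:
            return samples[i]
    return samples[4]  # none of the first four is the median
```

Because exactly one sample can have two others above it, the fifth flag never needs computing, which is why the Cg listing stops at flag3.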



Shader 5: Bilinear Filter







A mix of texture samples and maths, this shader implements a basic bi-linear filter. The pipeline is set up to provide unfiltered texels and the shader samples the four texels around the current fragment and interpolates a final colour value from them. Another big win for the fixed point FP30 shader. For techniques like filtering, the fixed data type is more than adequate for processing standard 32-bit (8 bits per component) textures.

fragout_float main( vertout IN,
                    uniform sampler2D testTexture )
{
    fragout_float OUT;
    float2 pp = IN.texcoord0*128.0-float2(0.5,0.5);
    float2 fp = frac(pp);
    float2 ip = (pp-fp)*0.0078125;

    // Sample 4 times and filter. Texture size is 128x128
    float4 bl = f4tex2D(testTexture,ip);
    float4 tl = f4tex2D(testTexture,ip + float2(0.0, 0.0078125));
    float4 br = f4tex2D(testTexture,ip + float2(0.0078125, 0.0));
    float4 tr = f4tex2D(testTexture,ip + float2(0.0078125, 0.0078125));

    float4 top = lerp(tl, tr, fp.xxxx);
    float4 bottom = lerp(bl, br, fp.xxxx);
    OUT.col.xyz = lerp(bottom, top, fp.yyyy);
    OUT.col.w = 1.0;
    return OUT;
}
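For reference, the same filter can be modelled on the CPU. This Python sketch reproduces the half-texel shift, the split into integer and fractional parts, and the three lerps. It assumes a square texture of scalar texels with wrap-around addressing, and the function name bilinear is made up for the example.

```python
import math


def bilinear(texture, u, v):
    """CPU model of Shader 5 for a square texture of scalar texels.

    `texture[row][col]` holds unfiltered texel values; rows are indexed by v.
    Indices wrap around, which assumes a repeating texture.
    """
    size = len(texture)                      # 128 in the article's test
    px, py = u * size - 0.5, v * size - 0.5  # shift so texel centres land on integers
    ix, iy = math.floor(px), math.floor(py)
    fx, fy = px - ix, py - iy                # interpolation weights in [0, 1)

    def texel(dx, dy):
        return texture[(iy + dy) % size][(ix + dx) % size]

    # lerp(a, b, t) = a + t * (b - a), applied along x and then along y
    bottom = texel(0, 0) + fx * (texel(1, 0) - texel(0, 0))
    top = texel(0, 1) + fx * (texel(1, 1) - texel(0, 1))
    return bottom + fy * (top - bottom)
```

Sampling exactly at a texel centre returns that texel unchanged, while sampling midway between two centres returns their average, which is a quick sanity check for any reduced-precision variant.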

Findings

The above results indicate that Nvidia's NV30 architecture gives its best performance when using fixed point data types, especially when combined with other floating point activity or texture sampling. This data type is only available in the NV30 profile, FP30, which is OpenGL specific. In some cases the 'partial precision' type, half, gives very useful gains (shaders 4, 5). What is also clear is that HLSL currently does a better job of optimising some shaders than Cg (shader 4), while Cg has a slight edge in others (shader 5). No doubt this is due to Microsoft's extensive experience at producing optimising compilers, although personally I'd love to see Intel do some work in producing HLSL or Cg profiles. Both Cg and HLSL are relatively young tools and I would expect to see more consistent and better optimised results as both compilers mature.

The great benefit of Cg is that it will handle target specifics like lack of fixed or half datatypes by automatically up-casting. You could produce a shader using fixed types optimised for the FP30 profile (Nvidia+OpenGL+NV30) and the same Cg shader can be compiled, unchanged, to ARB OpenGL or DX9 PS2 targets.

Cg Tips

A point made clear while testing these shaders is that although you can apply general performance rules, a lot depends on the exact nature of the shader. Some techniques work well only with some profiles. Here are a few that I came across while testing the FP30 profile.

1. f3tex2D vs f4tex2D

In shader 5, using f3tex2D to sample into float3 variables performed worse than using f4tex2D and sampling into float4's, even though only the 3 RGB components were needed.

// FP30 214fps PS2.x 239fps
float3 bl = f3tex2D(testTexture,ip);
float3 tl = f3tex2D(testTexture,ip + float2(0.0, 0.0078125));
float3 br = f3tex2D(testTexture,ip + float2(0.0078125, 0.0));
float3 tr = f3tex2D(testTexture,ip + float2(0.0078125, 0.0078125));

vs

// FP30 232fps PS2.x 239fps
float4 bl = f4tex2D(testTexture,ip);
float4 tl = f4tex2D(testTexture,ip + float2(0.0, 0.0078125));
float4 br = f4tex2D(testTexture,ip + float2(0.0078125, 0.0));
float4 tr = f4tex2D(testTexture,ip + float2(0.0078125, 0.0078125));

2. Fixed vs Half vs Float

Fixed data types are generally faster than half, which are in turn faster than float. Use them when output quality allows. Shaders 3 & 5 show the ability of NV30 hardware to execute fixed instructions in parallel with texture sampling.

3. Intrinsics

Using built-in functions can be faster, e.g.,

float p = (p1.x + p1.y + p1.z + p1.w)*0.25;

is slower than

float p = dot(p1, float4(0.25, 0.25, 0.25, 0.25));



However, intrinsics can also be slower, e.g. in shader 5, when calculating the integer and fractional parts of a float:

//90fps
float2 ip;
float2 pp = IN.texcoord0*128.0;
float2 fp = modf(pp, ip);

vs

//150fps
float2 pp = IN.texcoord0*128.0;
float2 ip = floor(pp);
float2 fp = pp - ip;
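The two forms above produce the same split for the non-negative coordinates used here, which is what makes the swap safe. A quick CPU-side check in Python illustrates this. Python's math.modf is used only as a stand-in for the Cg intrinsic, and the helper name split_floor is an assumption for the example.

```python
import math


def split_floor(x):
    """Fractional and integer parts via floor, as in the faster snippet."""
    ip = math.floor(x)
    return x - ip, float(ip)


if __name__ == "__main__":
    for x in (0.0, 3.75, 127.5):
        # For non-negative inputs both approaches agree
        print(x, math.modf(x), split_floor(x))

    # For negative inputs they differ: modf keeps the sign, floor rounds down
    print(math.modf(-1.25), split_floor(-1.25))
```

The difference for negative values does not matter in shader 5, because the scaled texture coordinates stay non-negative in this test, but it is worth remembering before applying the same substitution elsewhere.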

4. Help the Compiler

From shader 1, the original use of the pow() function was probably over the top!

//Slowest
fp = 6*pow(fp,5.0) - 15*pow(fp,4.0)+10*pow(fp,3.0);

vs

//Faster
fp = 6.0*fp*fp*fp*fp*fp - 15.0*fp*fp*fp*fp + 10.0*fp*fp*fp;

vs

//Fastest
float2 t2 = fp*fp;
float2 t3 = t2*fp;
fp = 6.0*t2*t3 - 15.0*t2*t2 + 10.0*t3;
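All three variants evaluate the same polynomial, 6t^5 - 15t^4 + 10t^3, and differ only in how much work is written out explicitly. The Python sketch below restates them as scalar functions so that the equivalence can be checked on the CPU. The function names are invented for the example, and the timings quoted in this article apply to the Cg versions, not to this code.

```python
def quintic_pow(t):
    """Slowest form in the article: three pow() calls."""
    return 6 * pow(t, 5.0) - 15 * pow(t, 4.0) + 10 * pow(t, 3.0)


def quintic_expanded(t):
    """Repeated multiplication: twelve multiplies as written."""
    return 6.0 * t * t * t * t * t - 15.0 * t * t * t * t + 10.0 * t * t * t


def quintic_shared(t):
    """Shared sub-products t2 and t3: seven multiplies as written."""
    t2 = t * t
    t3 = t2 * t
    return 6.0 * t2 * t3 - 15.0 * t2 * t2 + 10.0 * t3
```

Naming the shared powers by hand does the common-subexpression work for the compiler, which fits the ordering measured above.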



5. Implicit casts

Watch out for implicit casts – they will slow you down, e.g.

half2 myoffset = half2(1.23, 4.56);
half3 col1 = h3tex2D(testTexture,IN.texcoord0.xy+myoffset); // myoffset cast to float



6. Odds and Ends

An odd one this, but in a simple shader (full and partial precision)

half3 col = h3tex2D(testTexture, IN.texcoord0.xy);

may be slower than

float3 col = f3tex2D(testTexture, IN.texcoord0.xy);

In one case, doing four half texture samples in a row was faster if the first sample was made a float!



I hope to be doing some more tests soon, perhaps on some more useful shaders in a less synthetic manner. Let me know if there is any comparison you'd like to see or if I've cocked up anything in the article.

© Rob James 2003

robjames@pocketmoon.com