Quantize Your Attributes!

I have seen a shader like this too often:

struct VertexInput {
    uint vertexIndex : SV_VertexID;
    uint instanceIndex : SV_InstanceID;
};

struct VertexOutput {
    float4 position : SV_Position;
    float3 normal;
    float2 texcoord;
};

struct VertexSSBO {
    float3 position;
    float3 normal;
    float2 texcoord;
};

StructuredBuffer<VertexSSBO> vertices;

[shader("vertex")]
VertexOutput vertex_main(VertexInput input) {
    VertexSSBO vData = vertices[input.vertexIndex];

    VertexOutput output;
    output.position = float4(vData.position, 1.0f);
    output.normal = vData.normal;
    output.texcoord = vData.texcoord;
    return output;
}

The example above is written in Slang. It is a simple vertex shader that pulls clip-space positions, normals, and texture coordinates from an SSBO, not too different from what you might find in a typical renderer, minus some matrix multiplications. But it consumes too much bandwidth and VRAM. Can you see why?

Notice that the struct VertexSSBO stores its attributes as vectors of 32-bit floats. How big is that? You might think $4 \times 3 + 4 \times 3 + 4 \times 2 = 32$ bytes. But under std430 rules, float3 must be 16-byte aligned, so position sits at offset 0 and normal at offset 16. float2 only needs 8-byte alignment, which still pushes texcoord from offset 28 to offset 32, so the members end at byte 40. The struct as a whole must then be padded to a multiple of its largest member alignment (16 bytes), which gives an effective size of 48 bytes once internal and tail padding are counted. For a mesh containing 1 million vertices, that means 48 megabytes must be transferred and kept in VRAM. In Vulkan the transfer goes like this:

void* data{};
vkMapMemory(device, stagingBufferMemory, 0, bufferSize, 0, &data);
memcpy(data, meshData, bufferSize);
vkUnmapMemory(device, stagingBufferMemory);

That is your CPU churning through 48 megabytes. You will also need a call to vkCmdCopyBuffer to transfer those 48 megabytes from a staging buffer to a device buffer.
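The 48-byte figure can be checked on the CPU with a struct that mirrors the std430 layout described above. This is only a sketch: VertexStd430 and vertexBufferBytes are hypothetical names, and the explicit alignments are my transcription of the std430 rules, not something taken from the renderer.

```c
#include <assert.h>   // static_assert (C11)
#include <stdalign.h> // alignas (C11)
#include <stddef.h>   // offsetof, size_t

// Hypothetical CPU-side mirror of VertexSSBO, assuming std430 rules:
// float3 is aligned to 16 bytes, float2 to 8 bytes.
typedef struct VertexStd430 {
    alignas(16) float position[3]; // offset 0:  12 bytes + 4 bytes padding
    alignas(16) float normal[3];   // offset 16: 12 bytes, then 4 bytes padding
    alignas(8)  float texcoord[2]; // offset 32: 8 bytes + 8 bytes tail padding
} VertexStd430;

// The compiler verifies the offsets and the padded size for us.
static_assert(offsetof(VertexStd430, position) == 0, "position at offset 0");
static_assert(offsetof(VertexStd430, normal) == 16, "normal at offset 16");
static_assert(offsetof(VertexStd430, texcoord) == 32, "texcoord at offset 32");
static_assert(sizeof(VertexStd430) == 48, "struct padded to 48 bytes");

// Bytes needed for a vertex buffer holding vertexCount elements.
size_t vertexBufferBytes(size_t vertexCount) {
    return vertexCount * sizeof(VertexStd430);
}
```

Under these assumptions, vertexBufferBytes(1000000) comes out at 48,000,000 bytes, of which 16 bytes per vertex are padding that carries no data.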

For most projects you don’t need 32-bit precision. For simple low-poly meshes you can probably even get away with 8-bit precision. Consider this:

struct VertexInput {
    uint vertexIndex : SV_VertexID;
    uint instanceIndex : SV_InstanceID;
};

struct VertexOutput {
    float4 position : SV_Position;
    float3 normal;
    float2 texcoord;
};

// Each attribute is packed into one 32-bit word of four 8-bit snorm components.
struct VertexSSBO {
    uint position;
    uint normal;
    uint texcoord;
};

StructuredBuffer<VertexSSBO> vertices;

[shader("vertex")]
VertexOutput vertex_main(VertexInput input) {
    VertexSSBO vData = vertices[input.vertexIndex];

    // De-quantize on the GPU: each 8-bit component maps back to [-1, 1].
    float3 position = unpackSnorm4x8ToFloat(vData.position).xyz;
    float3 normal = unpackSnorm4x8ToFloat(vData.normal).xyz;
    float2 texcoord = unpackSnorm4x8ToFloat(vData.texcoord).xy;

    VertexOutput output;
    // Set w explicitly, as in the first shader, instead of trusting the packed fourth byte.
    output.position = float4(position, 1.0f);
    output.normal = normal;
    output.texcoord = texcoord;
    return output;
}

I used unpackSnorm4x8ToFloat to convert the quantized data back to float. For actual quantization, you may use:

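#include <math.h>   // roundf
#include <stdint.h> // uint32_t

// Quantize one float to 8-bit snorm: clamp to [-1, 1], scale to [-127, 127],
// and keep the low 8 bits of the two's-complement result.
uint32_t p8(float v)
{
    v = (v > 1.0f) ? 1.0f : (v < -1.0f ? -1.0f : v);
    return (uint32_t)((int)roundf(v * 127.0f) & 0xFF);
}

// Pack four floats into one 32-bit word, x in the lowest byte and w in the highest.
uint32_t packVec4F32ToU32(float x, float y, float z, float w)
{
    return p8(x) | (p8(y) << 8) | (p8(z) << 16) | (p8(w) << 24);
}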
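To get a feel for how much precision is lost, the round trip can be reproduced on the CPU. The sketch below assumes the shader-side unpack follows the usual snorm rule (divide by 127 and clamp to $[-1.0, 1.0]$, as GLSL's unpackSnorm4x8 does); quantizeSnorm8, dequantizeSnorm8, and snorm8RoundTripError are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

// Quantize one float in [-1, 1] to a signed 8-bit value (same scheme as p8).
int8_t quantizeSnorm8(float v) {
    v = (v > 1.0f) ? 1.0f : (v < -1.0f ? -1.0f : v);
    // Round half away from zero without needing <math.h>.
    return (int8_t)(v * 127.0f + (v >= 0.0f ? 0.5f : -0.5f));
}

// CPU reference for the per-component decode: divide by 127 and clamp at -1
// (only the bit pattern -128 would otherwise fall below -1).
float dequantizeSnorm8(int8_t q) {
    float f = (float)q / 127.0f;
    return (f < -1.0f) ? -1.0f : f;
}

// Absolute error introduced by one quantize/de-quantize round trip.
float snorm8RoundTripError(float v) {
    assert(v >= -1.0f && v <= 1.0f); // inputs are assumed pre-normalized
    float d = dequantizeSnorm8(quantizeSnorm8(v)) - v;
    return (d < 0.0f) ? -d : d;
}
```

The worst case is half a quantization step, $0.5 / 127 \approx 0.0039$, which is usually harmless for a normal but can be visible for positions on a large mesh.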

So how much did we save? The new struct is $4 + 4 + 4 = 12$ bytes. For a mesh with 1 million vertices, that is about 12 megabytes instead of 48, a 75% reduction. Now you can use that saved VRAM for something actually interesting, like higher-resolution cascaded shadow maps.

By the way, 8-bit quantization is often too coarse for positions and texture coordinates (although usually fine for normals), so 16-bit is the safer default in practice (the meshoptimizer library provides quantization helpers for this). This demo shows what 8-bit quantization does to a sphere.
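A 16-bit variant of the packing function could look like the following. It is a minimal sketch, not meshoptimizer's API: quantizeSnorm16 and packSnorm2x16 are hypothetical names, and the bit layout (first component in the low 16 bits) is chosen to match what GLSL's unpackSnorm2x16 expects. I am assuming the shader uses an equivalent 2x16 unpack.

```c
#include <assert.h>
#include <stdint.h>

// Quantize one float to 16-bit snorm and return its raw 16-bit pattern.
uint32_t quantizeSnorm16(float v) {
    assert(v == v); // NaN is not a valid attribute value
    v = (v > 1.0f) ? 1.0f : (v < -1.0f ? -1.0f : v);
    // Scale to [-32767, 32767], rounding half away from zero.
    int32_t q = (int32_t)(v * 32767.0f + (v >= 0.0f ? 0.5f : -0.5f));
    return (uint32_t)q & 0xFFFFu;
}

// Pack two floats into one 32-bit word: x in the low half, y in the high half.
uint32_t packSnorm2x16(float x, float y) {
    return quantizeSnorm16(x) | (quantizeSnorm16(y) << 16);
}
```

With this scheme a position needs two words (x and y in one, z in the other) and a texcoord needs one. Keeping the normal at 8 bits, a vertex would then take 16 bytes, still a third of the original 48.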

Also, we have assumed that our attributes are normalized to the $[-1.0, 1.0]$ range. This may not be true for a random model downloaded from the Internet, so some pre-processing may be necessary.
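One possible pre-processing step for positions is to remap them using the mesh's bounding box. The sketch below assumes positions are stored as tightly packed xyz floats; Bounds, computeBounds, and normalizePositions are hypothetical names. It scales all three axes by the same factor so the mesh keeps its proportions.

```c
#include <assert.h>
#include <stddef.h>

// Bounding-box center and the largest half-extent over the three axes.
typedef struct Bounds {
    float center[3];
    float halfExtent;
} Bounds;

// Scan tightly packed xyz positions and compute their bounds.
Bounds computeBounds(const float* positions, size_t vertexCount) {
    assert(positions != NULL && vertexCount > 0);
    float lo[3] = {positions[0], positions[1], positions[2]};
    float hi[3] = {positions[0], positions[1], positions[2]};
    for (size_t i = 1; i < vertexCount; ++i) {
        for (int a = 0; a < 3; ++a) {
            float p = positions[3 * i + a];
            if (p < lo[a]) lo[a] = p;
            if (p > hi[a]) hi[a] = p;
        }
    }
    Bounds b;
    b.halfExtent = 0.0f;
    for (int a = 0; a < 3; ++a) {
        b.center[a] = 0.5f * (lo[a] + hi[a]);
        float half = 0.5f * (hi[a] - lo[a]);
        if (half > b.halfExtent) b.halfExtent = half;
    }
    // Degenerate mesh (all vertices coincide): avoid dividing by zero.
    if (b.halfExtent == 0.0f) b.halfExtent = 1.0f;
    return b;
}

// Remap positions in place so every component lies in [-1, 1].
void normalizePositions(float* positions, size_t vertexCount, Bounds b) {
    for (size_t i = 0; i < vertexCount; ++i) {
        for (int a = 0; a < 3; ++a) {
            positions[3 * i + a] =
                (positions[3 * i + a] - b.center[a]) / b.halfExtent;
        }
    }
}
```

The renderer then has to undo the remap after de-quantizing, for example by folding center and halfExtent into the model matrix.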