What’s the best way to draw a fullscreen quad in OpenGL 3.2?

Question

I’m going to argue that the most efficient approach will be in drawing a single “full-screen” triangle. For a triangle to cover the full screen, it needs to be bigger than the actual viewport. In NDC (and also clip space, if we set w=1), the viewport will always be the [-1,1] square. For a triangle to cover this area just completely, we need to have two sides to be twice as long as the viewport rectangle, so that the third side will cross the edge of the viewport, hence we can for example use the following coordiates (in counter-clockwise order): (-1,-1), (3,-1), (-1,3).

We also do not need to worry about the texcoords. To get the usual normalized [0,1] range across the visible viewport, we just need to make the corresponding texcoords for the vertices tiwce as big, and the barycentric interpolation will yield exactly the same results for any viewport pixel as when using a quad.

This approach can of course be combined with attribute-less rendering as suggested in demanze’s answer:

out vec2 texcoords; // texcoords are in the normalized [0,1] range for the viewport-filling quad part of the triangle
void main() {
        vec2 vertices[3]=vec2[3](vec2(-1,-1), vec2(3,-1), vec2(-1, 3));
        gl_Position = vec4(vertices[gl_VertexID],0,1);
        texcoords = 0.5 * gl_Position.xy + vec2(0.5);
}

Why will a single triangle be more efficient?

This is not about the one saved vertex shader invocation, and the one less triangle to handle at the front-end. The most significant effect of using a single triangle will be that there are less fragment shader invocations

Real GPUs always invoke the fragment shader for 2×2 pixel sized blocks (“quads”) as soon as a single pixel of the primitive falls into such a block. This is necessary for calculating the window-space derivative functions (those are also implicitly needed for texture sampling, see this question).

If the primitive does not cover all 4 pixels in that block, the remaining fragment shader invocations will do no useful work (apart from providing the data for the derivative calculations) and will be so-called helper invocations (which can even be queried via the gl_HelperInvocation GLSL function). See also Fabian “ryg” Giesen’s blog article for more details.

If you render a quad with two triangles, both will have one edge going diagonally across the viewport, and on both triangles, you will generate a lot of useless helper invocations at the diagonal edge. The effect will be worst for a perfectly square viewport (aspect ratio 1). If you draw a single triangle, there will be no such diagonal edge (it lies outside of the viewport and won’t concern the rasterizer at all), so there will be no additional helper invocations.

Wait a minute, if the triangle extends across the viewport boundaries, won’t it get clipped and actually put more work on the GPU?

If you read the textbook materials about graphics pipelines (or even the GL spec), you might get that impression. But real-world GPUs use some different approaches like Guard-band clipping. I won’t go into detail here (that would be a topic on it’s own, have a look at Fabian “ryg” Giesen’s fine blog article for details), but the general idea is that the rasterizer will produce fragments only for pixels inside the viewport (or scissor rect) anyway, no matter if the primitive lies completely inside it or not, so we can simply throw bigger triangles at it if both of the following are true:

a) the triangle does only extend the 2D top/bottom/left/right clipping planes (as opposed to the z-Dimension near/far ones, which are more tricky to handle, especially because vertices may also lie behind the camera)
b) the actual vertex coordinates (and all intermediate calculation results the rasterizer might be doing on them) are representable in the internal data formats the GPU’s hardware rasterizer uses. The rasterizer will use fixed-point data types of implementation-specific width, while vertex coords are 32Bit single precision floats. (That is basically what defines the size of the Guard-band)

Our triangle is only factor 3 bigger than the viewport, so we can be very sure that there is no need to clip it at all.

But is it worth it?

Well, the savings on fragment shader invocations are real (especially when you have a complex fragment shader), but the overall effect might be barely measurable in a real-world scenario. On the other hand, the approach is not more complicated than using a full-screen quad, and uses less data, so even if might not make a huge difference, it won’t hurt, so why not using it?

Could this approach be used for all sorts of axis-aligned rectangles, not just fullscreen ones?

In theory, you can combine this with the scissor test to draw some arbitrary axis-aligned rectangle (and the scissor test will be very efficient, as it just limits which fragments are produced in the first place, it isn’t a real “test” in HW which discards fragments). However, this requires you to change the scissor parameters for each rectangle you want to draw, which implies a lot of state changes and limits you to a single rectangle per draw call, so doing so won’t be a good idea in most scenarios.

Leave a Comment Cancel reply