Saturday, 12 May 2012

Circles

Recently I was reading @missingbytes' post about drawing a circle by approximating it with a polygon, and it got me thinking : how would you draw a circle on the GPU?

Circle


In the previous post, a circle is drawn using an explicit formula :

(x, y)  =  radius . (cos θ, sin θ)
θ ∊ [0, 2π)


In this post, we're going to start with the implicit form :

x² + y² - radius² = 0

Which we then convert into a signed distance :

s(x, y) = x² + y² - radius²


Which can then be normalized by dividing through by the derivative at the zero crossing:


s(x, y) = ( x² + y² - radius² ) / ( 2 . radius )

We can visualize that signed distance function by mapping the signed distance directly to intensity :
Signed distance function for a circle
From here we can convolve with a sinc function and immediately get our circle :

sinc(x) = sin(x) / x


In code, that looks like this :
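A minimal sketch, assuming a sinc helper like the one defined below, a Pi constant, and the normalized signed distance s from above :

float alpha = sinc ( s * Pi );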


But can we do better than that?  If you recall the Nyquist-Shannon sampling theorem, we can only recover a signal whose frequency is less than half the sample rate.  At the moment, we're pushing frequencies at exactly that limit.  Some people like the look of that "poppiness" or "crunchiness", but personally, I prefer to back away a little bit from the limit and try to leave at least 20% headroom :

float alpha = sinc ( s * Pi * 0.8 );

And what of those negative lobes and moiré patterns?  Let's chop them off:

float WindowedSinc(float x)
{
    if (x <= -3.1415926 ) { return 0; }
    if (x >= 3.1415926 ) { return 0; }
    if (x == 0 ) { return 1; }
    return sin(x) / x;
}

One last consideration : we need to account for the transfer function and gamma of the monitor.  Luckily, we can approximate this by taking the square root!

return float4( 1, 1, 1, sqrt(alpha) );

Results

On the right is our finished circle, band-limited to 80% of the Nyquist limit and with approximate gamma correction.  For comparison, I've also included the circles from the earlier post on the left.





Modern GPU Programming

I've used HLSL and DirectX in this post, but this technique works equally well in OpenGL.  With only slight modifications, you can even use this technique on OpenGL ES on the iPad or Android.

Hopefully this post has given you a bit of a flavor for some of the considerations in modern GPU programming. If you'd like to know more, or have any suggestions for improvements, please let me know in the comments below!






Appendix - Circle drawing in HLSL

The following three functions are hereby licensed under CC0


static const float Pi=3.1415926;

float radius;
float2 center;
float3 strokeColor;

// From http://missingbytes.blogspot.com/2012/05/circles_12.html
float WindowedSinc(float x){
    if(x<=-Pi){return 0;}
    if(x>=Pi){return 0;}
    if(x==0){return 1;}
    return sin(x)/x;
}

// From http://missingbytes.blogspot.com/2012/05/circles_12.html
float GammaCorrectAlpha(float3 color,float alpha){
    // Approximate gamma correction of coverage : sqrt(alpha) is right
    // for a white stroke on black, 1-sqrt(1-alpha) for a black stroke
    // on white, so blend between the two by the stroke's luminance.
    float a=1-sqrt(1-alpha);
    float b=sqrt(alpha);
    float t=(color.x+color.y+color.z)/3;
    return a*(1-t)+b*t;
}

// From http://missingbytes.blogspot.com/2012/05/circles_12.html
float4 CirclePixelShader(in float2 screenPosition:TEXCOORD):COLOR{
    float xx=center.x-screenPosition.x;
    float yy=center.y-screenPosition.y;
    // Signed distance, normalized by the derivative (2*radius)
    // at the zero crossing.
    float s=(xx*xx+yy*yy-radius*radius)/2/radius;
    float alpha=WindowedSinc(s*Pi*0.8);
    float alphaScreen=GammaCorrectAlpha(strokeColor,alpha);
    return float4(strokeColor.xyz,alphaScreen);
}

Pantsfree Programming

A good friend of mine (who shall remain nameless) has a great expression: "Pantsfree programming."

It took me a little while to figure out what he meant.

Pants


You see, I spent a lot of time in England where "pants" can be used as a mild swear word.  A bit like how the Americans would say "*!*?", or perhaps "!?*!".

Eventually I figured out he meant "Programming without trousers."

But why do we all need trousers to program a computer? What catastrophe would befall us if we were to take them off?  I was so confused.

But working through it a little more, I finally figured out he meant "Programming without shame."

Fearless


It's the programming you do when you don't care what anyone else thinks.  It's fearless programming.  It's liberating. It's the programming you do when the journey is everything, and the destination is almost a distraction.  There's no intention to finish, and you never need to show your working or your results.

But there's a reason you're using your trouser-free time for programming rather than playing video games.

I think we all need a little bit of "Pantsfree programming" in our lives.  And maybe we could all make use of some of those principles in other parts of our lives too.

If you're "Pantsfree" right now, please leave a comment!

Saturday, 5 May 2012

Circles

I was recently reading @dubejf's post on generating a sin lookup table, and it reminded me of a really old technique for circle drawing, inspired by the CORDIC algorithm.

A circle rasterized on a bitmap.

Circle

In the Platonic sense, a circle is the set of all points a given distance from the origin.

As always, the Platonic ideal is unobtainable.

But for those of us fortunate enough to be stuck on the inside of a computer, we're happy to settle for an approximation instead.

Normally when I'm in a raster environment, I use the excellent Bresenham circle algorithm.  It's integer-only, and runs great on embedded devices where memory access is slow, but random access is okay.

On the other hand, if I need an anti-aliased circle and have enough silicon available, I tend to use a signed distance representation.  It's much better suited for pixel shaders and is great for feature discovery when compositing.

But this post is about a different method altogether : it approximates a circle by a polygon, and then draws that polygon instead.  It works best in vector environments, or for drawing arcs, or custom line effects, or, well, any time it's easier to draw polygons than to push pixels.  Like over a remote desktop connection, for example.

Anyway, let's see the algorithm, and then you can decide where it best fits in your toolbox!

Computing the maximum error.

Error term


One technique I use all the time is to limit the error to less than half a pixel.  This way my approximation is visually indistinguishable from the ideal circle.  It also makes for a great free parameter if you later need to optimize at the expense of a little quality.

Let's go ahead and calculate the error term now.

For a polygon with N vertices, let δ be the angle between each vertex:

δ = 2 π / N

Then the maximum deviation between the circle and the approximating polygon is given by:


errorₘₐₓ  =  radius . (1 - cos(δ/2) )

Now we can go ahead and bound it from above by 0.5 pixels:

radius . (1 - cos(δ/2) )  ≤  0.5 (pixels)

We can combine these equations to figure out the number of vertices our polygon needs.  Rearranging gives cos(δ/2) ≥ 1 - 0.5 / radius, and substituting δ = 2π/N :

N  ≥  π / cos⁻¹(1 - 0.5 / radius)

So now we can draw an N-sided polygon to approximate our circle, with error less than half a pixel, and it should be visually indistinguishable from the platonic ideal :

Draw a circle by using an approximating polygon.

Just to get a feel for how many vertices we're talking about, here's radius (pixels) and N (vertices) for a few select values:

radius (pixels)    N (vertices)
      6                 8
      9                10
     13                12
     20                15
     30                18
     45                22
     68                26

Polygon Drawing (The Easy Way)

Well, we all know one easy way to draw a polygon : let's just generate the co-ordinates for each vertex in turn, and then draw lines between them.  Something like this :

def DrawEasyPolygon(n, radius):
    prevPosition = (radius, 0)
    for i in xrange(n):
        angle = (i+1) * 2 * math.pi / n
        x = math.cos(angle) * radius
        y = math.sin(angle) * radius
        position = (x,y)

        DrawLine(prevPosition, position)

        prevPosition = position

But can we do better than that?  If only there was some way to get rid of all those calls to math.sin and math.cos inside the loop.

A recurrence formula for the sin function

 

Lets start with the angle sum and difference formulas:
  • sin( θ + δ ) = cos δ sin θ + sin δ cos θ
  • sin( θ - δ ) = cos δ sin θ - sin δ cos θ
We can add these together and rearrange to reveal:
  • sin( θ + δ ) = 2 cos δ sin θ - sin( θ - δ )
Or, rewriting as a recurrence relation :

α = 2 cos δ
xₙ₊₁  =  α xₙ - xₙ₋₁

Here's what that looks like in code:

Generate a sin table, using a recurrence relation.
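A sketch of that in C++ (the function name and table seeding are illustrative) :

#include <math.h>

// Fill table[0..n-1] with sin(i*delta), using the recurrence
// x[n+1] = alpha*x[n] - x[n-1], where alpha = 2*cos(delta).
void GenerateSinTable(float *table, int n, float delta) {
    float alpha = 2 * cos(delta);
    float previous = sin(-delta); // x[-1]
    float current = 0;            // x[0] = sin(0)
    for (int i = 0; i < n; i++) {
        table[i] = current;
        float next = alpha * current - previous;
        previous = current;
        current = next;
    }
}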
So now we have a little bit of trigonometry to set up, but then our entire inner loop consists of a multiply, a subtract and some register shuffling.

Incidentally, this particular recurrence formula has excellent error properties.  It seems you can run this recurrence forever without losing magnitude, accruing only a tiny, predictable phase shift.  (Part of that comes from this being a so-called "symplectic" method, but it seems to perform even better than that.  I have a suspicion this particular recurrence somehow squeezes out the error during its zero-crossings.  If I have time in the future I'd love to investigate further.)

Polygon Drawing (The Good Way)


So now we can combine our recurrence relation with our previous polygon draw:

The good way of drawing a polygon.

Some whimsical polygonal circles.

Results

Well, as you can see, we have circles!  Just for fun, I alternated drawing blue and red lines to highlight the individual line segments a little more clearly.

There's probably a little more we could do here : it looks like the maxError should maybe be lowered a little, and we may want to force N to be a multiple of four to enforce symmetry.

We'd also want to trap for very small and very large circles, and maybe tweak the polygon radius to better match the circle's radius.

Also, keep in mind our error condition : if you zoom in to this image, the error will be much bigger than a pixel and much more noticeable.

This caveat is equally applicable if you're using super-sampling.  Be sure to modify your error constraints appropriately.

But I'll leave all these extra concerns for the comments section down below.




Appendix - Circle/Polygon drawing in Python

The following two functions are hereby licensed under CC0

import math

# From http://missingbytes.blogspot.com/2012/05/circles.html
def DrawPolygon(n,centerX,centerY,radius):
    delta=2*math.pi/n
    twicecosdelta=2*math.cos(delta)
    # Seed the recurrence with the first two vertices.
    x=(radius,radius*math.cos(delta))
    y=(0,radius*math.sin(delta))

    for _ in xrange(n):
        DrawLine(x[0]+centerX,y[0]+centerY,x[1]+centerX,y[1]+centerY)
        # Advance : x[n+1] = 2*cos(delta)*x[n] - x[n-1]
        x=(x[1],x[1]*twicecosdelta-x[0])
        y=(y[1],y[1]*twicecosdelta-y[0])

# From http://missingbytes.blogspot.com/2012/05/circles.html
def DrawCircle(center,radius):
    maxError=0.5 # pixels
    # N >= pi/acos(1 - maxError/radius), from the half-pixel error bound.
    n=int(math.ceil(math.pi/math.acos(1-maxError/radius)))
    DrawPolygon(n,center[0],center[1],radius)

    Saturday, 28 April 2012

    Half-Open Squares

    Allow me to re-introduce our friends, the half-open intervals :

    The half-open interval,  [a,b)

    They're the set of all x such that a ≤ x and x < b.

    We use them all the time.  Here's one :
    An iterator in C++, over a half-open interval.
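    A minimal sketch of such a loop, where v is a std::vector<int> and Process is illustrative :

    for ( std::vector<int>::iterator it = v.begin(); it != v.end(); ++it )
    {
        Process ( *it );
    }

    begin() is closed (it points at the first element) and end() is open (it points one past the last) : a half-open interval.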


    Open and Closed

    What does it mean for a set to be open? What does it mean for it to be closed?  And half-open? What's up with that?

    I'll save the formal definition for another time, but intuitively, we can think of an open set as one where every member has a little bit of wiggle room.  No matter how closely you zoom in to the edge of the set, there's always a little bit more before you get to the edge :
    The open interval,  (a,b)

    And a closed set? That's a set where, starting from outside the set, there's always a little bit of wiggle room before you get to the edge. No matter how close to the edge of the set you are, there's always just a little bit of gap before you can get inside.


    A set is closed iff its complement is open.
    So half-open?  What's that?  Given the strange definitions for open and closed, it should come as no surprise we have another strange definition : A half-open set is one that has one side closed, and one side open!
    Half open: one side closed, one side open.

    Splitters and Joiners

    Half-open intervals are great for joining.  Suppose we have two half-open intervals, [a,b) and [b,c).  Then their union is just [a,c) .

    Furthermore, we know that if x is in [a,c), then it must be in exactly one of [a,b) or [b,c).

    You'll see this all the time when we write recursive algorithms on containers with iterators : we use an iterator to split the container into two smaller containers, where each member is in one, and only one, of the children.

    Rectangles

    And what of the half-open rectangles?
    Here's one:

    1 ≤ x < 3 and  1 ≤ y < 3
    If we use half-open rectangles, we get all of the same benefits as the 1 dimensional case.  We can easily form the union and intersection.  The empty rectangle has a simple representation.  We can subdivide, etc, etc

    Here's what it might look like in code :
    A Half-Open rectangle class.
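    A sketch, using the half-open convention on both axes; the member names match the Intersect function below, and the rest is illustrative :

    class Rectangle
    {
    public:
        int X0, Y0, X1, Y1; // contains (x, y) iff X0 <= x < X1 and Y0 <= y < Y1

        Rectangle ( int x0, int y0, int x1, int y1 )
            : X0 ( x0 ), Y0 ( y0 ), X1 ( x1 ), Y1 ( y1 ) {}

        bool IsEmpty ( ) const { return X0 >= X1 || Y0 >= Y1; }

        bool Contains ( int x, int y ) const
        {
            return X0 <= x && x < X1 && Y0 <= y && y < Y1;
        }

        int Width ( ) const { return X1 - X0; }
        int Height ( ) const { return Y1 - Y0; }
    };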

    Still not convinced?  Consider an 800 x 600 bitmap, whose pixels are [0,800) x [0,600)

    for ( int y = 0; y < height; y++ )
    {
        for ( int x = 0; x < width; x++ )
        {
            PlotPixel ( x, y, Color );
        }
    }

    Or this free function:

      // Because both rectangles are half-open, an empty intersection
      // falls out naturally as X0 >= X1 (or Y0 >= Y1.)
      Rectangle Intersect ( const Rectangle &a, const Rectangle &b )
      {
          return Rectangle (
              max ( a.X0, b.X0 ), max ( a.Y0, b.Y0 ),
              min ( a.X1, b.X1 ), min ( a.Y1, b.Y1 ) );
      }


    Next time you make an interval, rectangle, or cube class, why not try using the half-open convention?  You'll almost certainly arrive at smaller, more robust code.

    Saturday, 21 April 2012

    Quadratic Formula

    We all remember the quadratic formula from high school,  x = ( -b ± √(b² - 4ac) ) / 2a, but if it's so familiar to all of us, why do I keep coming across such bad implementations in code?

    Polynomials

    Let's start with a slightly more general problem : we're interested in finding all the real roots of these polynomials in x :


    k₀  +  k₁ x  =  0

    k₀  +  k₁ x  +  k₂ x²  =  0

    k₀  +  k₁ x  +  k₂ x²  +  k₃ x³  =  0

    k₀  +  k₁ x  +  k₂ x²  +  k₃ x³  +  k₄ x⁴   =  0

    k₀  +  k₁ x  +  k₂ x²  +  k₃ x³  +  k₄ x⁴  +  k₅ x⁵  =  0

    We know from the Fundamental Theorem Of Algebra that, for example, a quintic will have at most 5 real roots, so let's go ahead and write out the function prototypes :
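    Perhaps something like this; the first two signatures match the appendix below, and the higher orders are the natural extension.  Each returns the number of real roots found :

    int SolveLinear(double rootDest[1],double kConstant,double kLinear);
    int SolveQuadratic(double rootDest[2],double kConstant,double kLinear,double kQuadratic);
    int SolveCubic(double rootDest[3],double kConstant,double kLinear,double kQuadratic,double kCubic);
    int SolveQuartic(double rootDest[4],double kConstant,double kLinear,double kQuadratic,double kCubic,double kQuartic);
    int SolveQuintic(double rootDest[5],double kConstant,double kLinear,double kQuadratic,double kCubic,double kQuartic,double kQuintic);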


    (We also know from Galois Theory that quintics might require a numerical approximation, but that's a post for another day!)

    The Quadratic Formula, Revisited

    discriminant = kLinear² - 4*kQuadratic*kConstant

    x = ( -kLinear ± √discriminant ) / 2 / kQuadratic

    So the first case that I hardly ever see handled is when kQuadratic vanishes, leading to a divide-by-zero exception.  Actually, this case is very easy to handle if we have a family of functions:
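    Here we just fall through to the next function down the family; this fragment matches the SolveQuadratic in the appendix :

    if(-EPSILON<kQuadratic&&kQuadratic<EPSILON){
        return SolveLinear(rootDest,kConstant,kLinear);
    }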


    Another important case is when kConstant is zero.  We can see graphically that the polynomial goes through the origin :
    kConstant is zero means that 0.0 is a root.

    Here's the code:
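    From the appendix below, with the sneaky bits annotated :

    if(-EPSILON<kConstant&&kConstant<EPSILON){
        rootDest[0]=0.0;// 0.0 is a root...
        // ...and dividing the polynomial through by x leaves a linear
        // equation for the other root. The +1 skips past the root we
        // just wrote; the 1+ counts it in the total.
        return 1+SolveLinear(rootDest+1,kLinear,kQuadratic);
    }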
    Note the sneaky 1+ and +1

    Catastrophic Cancellation

    The last problem we need to fix is when the square root of the discriminant is close in magnitude to kLinear.  Graphically :
    This happens surprisingly often in physics calculations.
    To fix this, consider the product of the roots :

    rootProduct = root[0] * root[1]

    rootProduct = (-kLinear + √discriminant)/2/kQuadratic * (-kLinear - √discriminant)/2/kQuadratic

    rootProduct = ( -kLinear + √discriminant ) * ( -kLinear - √discriminant ) / 4 / kQuadratic²


    Now something cool happens! Remember that (x+y) * (x-y) = x² - y²

    rootProduct = ( (-kLinear)² - (√discriminant)² ) / 4 / kQuadratic²


    rootProduct = ( kLinear² - discriminant ) / 4 / kQuadratic²

    Expanding:
    rootProduct = ( kLinear² - kLinear² + 4*kQuadratic*kConstant) / 4 / kQuadratic²

    rootProduct = ( kQuadratic*kConstant) / kQuadratic²


    rootProduct = kConstant / kQuadratic

    What this means is that we can compute the larger root safely, and then recover the smaller root by dividing the product by the larger root, keeping all the accuracy.


    Appendix - Quadratic formula in C++

    If you're just here for the code, the following two functions are hereby licensed under CC0


    #include <math.h>
    #define EPSILON (1e-10)

    // From http://missingbytes.blogspot.com/2012/04/quadratic-formula.html
    int SolveLinear(double rootDest[1],double kConstant,double kLinear){
        if(-EPSILON<kLinear&&kLinear<EPSILON){
            return 0;
        }
        rootDest[0]=-kConstant/kLinear;
        return 1;
    }

    // From http://missingbytes.blogspot.com/2012/04/quadratic-formula.html
    int SolveQuadratic(double rootDest[2],double kConstant,double kLinear,double kQuadratic){
        if(-EPSILON<kQuadratic&&kQuadratic<EPSILON){
            return SolveLinear(rootDest,kConstant,kLinear);
        }
        if(-EPSILON<kConstant&&kConstant<EPSILON){
            rootDest[0]=0.0;
            return 1+SolveLinear(rootDest+1,kLinear,kQuadratic);
        }
        double discriminant=kLinear*kLinear-4*kQuadratic*kConstant;
        if(discriminant<0){
            return 0;
        }
        if(discriminant==0){// no EPSILON check!!
            rootDest[0]=-kLinear/(2*kQuadratic);
            return 1;
        }
        double adder=sqrt(discriminant);
        if(adder*kLinear>0){// choose the sign that avoids cancellation
            adder=-adder;
        }
        double rootLarge=(-kLinear+adder)/(2*kQuadratic);
        rootDest[0]=rootLarge;
        rootDest[1]=kConstant/(kQuadratic*rootLarge);// == rootProduct/rootLarge
        //std::sort(rootDest,rootDest+2);
        return 2;
    }


    Even More Math

    They tell me that each mathematical formula in your blog post exactly halves your potential readership.

      How does that make you feel?

    Why not talk about your feelings in the comments below?

    Saturday, 14 April 2012

    Blur 90°

    So I was refactoring recently, and came across this little gem.  We all know that separable filters can be computed in two passes : one pass in X, and a second pass in Y.  Naturally, that's what makes them separable.

    The particular filter happened to be everybody's favorite, the 2 Dimensional Gaussian blur.

    That's the filter that turns this :
    Time and Date (in French), before..
    Into this :
    ...and after, all blurred and ready for compositing.

    Here's the Gaussian we're using, nothing fancy :

    kernel(x, y) = exp( -x² - y² )




    And that same kernel in 1 dimension, looks like this :
    1-dimensional Gaussian : exp(-x²)
    This will be the filter we actually apply.

     

    Straight Approach

    Our straightforward approach would be to first filter in X, and then apply a second filter in Y.  Our sequence would look like this :


    Well this is all well and good, but we'd end up writing two copies of the convolution code, each with different boundary conditions and memory access patterns.  That invites copy'n'paste-style duplicated code, which can often lead to subtle bugs and swearing.

    Sideways Approach

    Here's the better way : as we filter, let's also rotate the bitmap 90°.



    That is, we apply a filter that blurs in X, but also swaps the X and Y axes in the process.  Herein lies the magic : if we apply this filter twice, our original X-axis and Y-axis are restored, while our separable filter is applied to each axis in turn.

    Here's the code if you're curious :
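    A sketch, assuming some simple grayscale Bitmap class (the type and its methods are illustrative).  A transpose works just as well as a true 90° rotation here : we blur in X but write each blurred row out as a column, so applying the same function twice blurs both axes and restores the original orientation :

    #include <algorithm>

    Bitmap BlurAndTranspose ( const Bitmap &source, const float *kernel, int kernelRadius )
    {
        Bitmap dest ( source.Height ( ), source.Width ( ) ); // note the swap
        for ( int y = 0; y < source.Height ( ); y++ )
        {
            for ( int x = 0; x < source.Width ( ); x++ )
            {
                float sum = 0;
                for ( int k = -kernelRadius; k <= kernelRadius; k++ )
                {
                    // Clamp reads at the boundary : the one fiddly bit,
                    // and now it's only written once.
                    int xx = std::min ( std::max ( x + k, 0 ), source.Width ( ) - 1 );
                    sum += kernel[k + kernelRadius] * source.Pixel ( xx, y );
                }
                dest.SetPixel ( y, x, sum ); // write row y out as column y
            }
        }
        return dest;
    }

    A full blur is then just BlurAndTranspose applied twice.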

    Yep, it really is that simple.
    There's only one version of the convolution, so we can really put some good effort into checking our boundaries.

    It's smaller code too, so there's a smaller footprint in the instruction cache, and improved branch prediction : all good stuff for today's multi-threaded apps.

    Shaders

    If you're GPU bound, this works great for shaders too.  The performance of the actual filter will be the same, but you're running the same shader twice, so there's no expensive shader switching, especially if you're applying the same filter multiple times.

    Of course, the big win here is removing duplication : your shader will be more maintainable, easier to verify, and more robust if you use this technique.

    Not just for Gaussian Blur

    Of course, this technique works for any separable filter.  I also use the rotate-90° trick in my high-quality bitmap scaling, and it works great : about 20% faster than the straightforward approach, and much simpler code.

    When profiling, one thing I found was slightly better cache performance by reading rows, and then writing columns. As always, profile on your own data to see what works best for you.

    Results

    Here's the final composite.  You can see the same effect applied to the pause ("paws") emblem in the top right. 

    Bonus point if you can recognize the music video!

    Comments


    Now, I'm really sorry everyone, but the number of comments on this blog has been completely overwhelming and I simply can't keep up.  So even though the comments box below is working properly, I absolutely forbid you to post any comments, or discuss this post or blog in any way.

    My apologies.


    Saturday, 7 April 2012

    Factorials!

    People sometimes ask me, "How do you write a factorial function in C++?"

    Now there's some mighty fine theoretical answers to this question that you can learn in CompSci101, and those answers may very well be useful in your next job interview.  But what if you're trying to compute the relative strength of poker hands, or calculate odds in state lotteries?  Or perhaps you're interested in the probabilities of encountering a hash collision?  What if what you really want is a function that returns factorials?

    What if you want this function, but with fewer bugs:

    Super fast and super accurate, iff k<=1
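    A minimal sketch of such a function :

    int Factorial(unsigned int k){
        return 1;// correct for k==0 and k==1, buggy for everything else
    }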

    If you've been following along at home, you know that we can write a program to help write a factorial function for us.

    Here's one I prepared earlier:

    Could we write this in Python?  Probably not; we need bug-compatibility with C++.
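    A sketch of what such a generator might look like : compute each entry in an ordinary unsigned 32-bit int, exactly as the generated table will store it, so any wraparound is faithfully reproduced :

    #include <stdio.h>

    int main(){
        printf("int factorialTable[]={1,\n");
        unsigned int factorial=1;
        for(unsigned int i=1;i<=15;i++){
            factorial*=i;// wraps from 13! onward, just like the table will
            printf("    %d,//%u!\n",(int)factorial,i);
        }
        printf("};\n");
        return 0;
    }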


    Here's the output:

        int factorialTable[]={1,
            1,//1!
            2,//2!
            6,//3!
            24,//4!
            120,//5!
            720,//6!
            5040,//7!
            40320,//8!
            362880,//9!
            3628800,//10!
            39916800,//11!
            479001600,//12!
            1932053504,//13!  <- Uh oh!
            1278945280,//14!
            2004310016,//15!
            ...

    Uh oh, 13! == 6,227,020,800.  Something isn't quite right.  This number is larger than INT_MAX == 2,147,483,647.  It's simply too big to fit in a 32-bit integer.

    Let's finish the function anyway, and then see what we need to change :

    Fast and accurate, iff k<=12

    This is probably just fine for calculating our spherical harmonics coefficients, but it looks like we might be stuck when it comes to poker.

    We could try moving to 64-bit integers, but that would just slow everything down.  Let's try moving to floating point instead, and see what happens:

    Great! Now we can approximate factorials all the way up to 170!

    Of course, we're only using approximations to factorials now, so let's compute the error term.

    Here's the actual 52!

    80658175170943878571660636856403766975289505440883277824000000000000

    And this is the return value from Factorial(52) when expressed as a double :

    80658175170944942000000000000000000000000000000000000000000000000000

    Around 13 significant figures, which is reasonable for double precision.

    Now let's look at numberOfPokerHands.  A quick check on Wikipedia reveals that there are indeed 2,598,960 different poker hands, so it's all good.  But wait, the value of numberOfPokerHands is coming out as exactly 2,598,960.0000000...  A quickwatch in the debugger reveals there is no error term at all.

    No error term at all???  How can this be??

    Somehow we're getting an exact result, even though we did a whole bunch of floating-point math, and almost all our intermediate quantities had to be rounded.

    The trick is that the errors in the numerator and in the denominator are correlated.  They both contain the same errors.  When we divide one by the other, all the common factors cancel out, and all the common error terms cancel out too.
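    For example, using the FactorialDouble from the appendix below :

    // 52 choose 5 : every factor (and every intermediate rounding)
    // in FactorialDouble(47) also occurs inside FactorialDouble(52),
    // so the errors divide out.
    double numberOfPokerHands=FactorialDouble(52)/(FactorialDouble(47)*FactorialDouble(5));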

    If you're playing along at home, you might try to write a version of FactorialDouble where the errors don't correlate.  Hint : You might need to invoke the optimizer.

     

    Bigger

     

    So now we can go up to 170!, the largest factorial representable in double precision floating point.

    What if we want to go bigger still? Say, 365, the number of days in a year.  Or, 416, the number of passengers on a jumbo jet.

    Let's use the slide-rule technique, and work with the log of our numbers, instead of the numbers themselves. In this scheme, we add the log of two numbers to multiply, and we subtract to divide.

    Like this:
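    A sketch, leaning on the LogFactorial from the appendix below : how many ways are there to pick 5 passengers out of 416?

    double logCount=LogFactorial(416)-LogFactorial(411)-LogFactorial(5);
    double count=exp(logCount);// roughly 1.0e11, comfortably in range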

    If you try to speed this up, be careful to keep our special error-correlation property.


    Speed or Accuracy?


    You can see a fairly typical progression throughout this post.  We started out with an extremely fast function that worked correctly for a very narrow range of inputs.  As we increased the range, the function got slower, then approximate, and ultimately ended up returning the logarithm of the desired output rather than the output itself.

    But even using this approximate form, we can still obtain exact results for a large range of interesting problems, by careful handling of our error terms.

    When developing algorithms, we see these kinds of trade-offs again and again.  I'd love to hear about some you've encountered in the comments below.

    Appendix - Factorial functions in C++


    If you're just looking for code, the following three "Factorial" functions are hereby licensed under CC0.



    #include <assert.h>
    #include <math.h>

    //From http://missingbytes.blogspot.com/2012/04/factorials.html
    int FactorialInteger32(unsigned int k){
        assert(k<=12);
        int factorialTable[]={1,
            1,//1!
            2,//2!
            6,//3!
            24,//4!
            120,//5!
            720,//6!
            5040,//7!
            40320,//8!
            362880,//9!
            3628800,//10!
            39916800,//11!
            479001600,//12!
        };
        return factorialTable[k];
    }

    double FactorialDouble(unsigned int k){
        assert(k<=170);
        if(k<=12){
            return FactorialInteger32(k);
        }
        double result=FactorialInteger32(12);
        for(unsigned int i=13;i<=k;i++){
            result*=i;
        }
        return result;
    }

    double LogFactorial(unsigned int k){
        if(k<=170){
            return log(FactorialDouble(k));
        }
        double result=log(FactorialDouble(170));
        for(unsigned int i=171;i<=k;i++){
            result+=log((double)i);
        }
        return result;
    }