HN.zip

Even faster asin() was staring right at me

77 points by def-pri-pub - 39 comments
srean
A notable approximation of ~650 AD vintage, by Bhaskara, is

    arccos(x) ≈ π √((1-x)/(4+x)).
The search for better and better approximations led Indian mathematicians to independently develop branches of differential and integral calculus.

This tradition came into its own as the Madhava school of mathematics in Kerala. https://en.wikipedia.org/wiki/Kerala_school_of_astronomy_and...

Note the approximation is for 0 < x < 1. For the range [-1, 0] Bhaskara used symmetry.

If I remember correctly, Aryabhatta had derived a rational approximation about a hundred years before this.

EDIT https://doi.org/10.4169/math.mag.84.2.098
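The approximation is easy to try numerically. A quick sketch (function name and the symmetry handling are mine, per the note above):

```cpp
#include <cassert>
#include <cmath>

const double kPi = std::acos(-1.0);

// Bhaskara-style arccos approximation for x in [0, 1]:
//   arccos(x) ~= pi * sqrt((1 - x) / (4 + x))
// For x in [-1, 0) use the symmetry arccos(x) = pi - arccos(-x).
double bhaskara_acos(double x) {
    if (x < 0.0) return kPi - bhaskara_acos(-x);
    return kPi * std::sqrt((1.0 - x) / (4.0 + x));
}
```

It is exact at x = 0, 0.5 and 1, and stays within a couple of milliradians in between.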

fhdkweig
This is a follow-up to a different post from the same domain. 5 days ago, 134 comments: https://news.ycombinator.com/item?id=47336111
def-pri-pub
Thank you for linking that!
ashdnazg
No idea if it's already optimised, but x2 could also be x*x rather than abs_x * abs_x, so the squaring doesn't have to wait on the abs and the dependency chain starts earlier.
coldcity_again
I've been thinking about this since [1] the other day, but I still love how rotation by small angles lets you drop trig entirely.

Let α represent a roll rotation, and β a pitch rotation.

Let R(α) be:

    ( cos α   sin α   0)
    (-sin α   cos α   0)
    (   0       0     1)
Let R(β) be:

    (1     0       0   )
    (0   cos β   -sin β)
    (0   sin β    cos β)
Combine them:

    R(β).R(α) = (    cos α           sin α          0   )
                ((-sin α*cos β)   (cos α*cos β)   -sin β)
                ((-sin α*sin β)   (cos α*sin β)    cos β)
But! For small α and β, just approximate:

    ( 1   α   0)
    (-α   1  -β)
    ( 0   β   1)
So now:

    x' = x + αy
    y' = y - αx - βz
    z' = z + βy
[1]https://news.ycombinator.com/item?id=47348192
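A minimal sketch of that per-frame update (struct and names mine):

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// One incremental rotation step using the small-angle matrix
//     ( 1   a   0)
//     (-a   1  -b)
//     ( 0   b   1)
// where a (roll) and b (pitch) are small angles in radians.
// Three multiplies, three adds, no trig.
Vec3 rotate_small(Vec3 p, double a, double b) {
    return {
        p.x + a * p.y,
        p.y - a * p.x - b * p.z,
        p.z + b * p.y,
    };
}
```

Applied every frame the accumulated error shows up as slow scale drift (the matrix isn't exactly orthogonal), which is why the classic demos renormalise once in a while.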
coldcity_again
If you just see the conclusion I think it's hard to immediately grok how rotation can arise from this.

This is a great technique for cheaply doing 3D starfields etc on 8-bit machines.

Look ma, no sine table!

srean
A related interesting fact is that small angular motions compose almost like vectors: order does not matter (i.e. they are commutative). This makes differential kinematics easier to handle in polar or cylindrical coordinate systems.

Large angular deflections, while still linear transforms, do not in general commute.

It will spoil the linear relation in your elegant expression, but a slightly better approximation for cos for small θ is

    1 - 0.5θ²
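For a sense of scale (standalone check, name mine): at θ = 0.1 rad the second-order form is within about 4e-6 of cos θ, versus an error of about 5e-3 for plain 1.

```cpp
#include <cassert>
#include <cmath>

// Second-order small-angle approximation of cosine: cos(t) ~= 1 - t*t/2.
double cos_small(double t) {
    return 1.0 - 0.5 * t * t;
}
```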
jonasenordin
I haven't kept up with C++ in a few years - what does constexpr do for local variables?

  constexpr double a0 = 1.5707288;
loeg
Its initializer is required to be evaluated at compile time, and the variable is const.

An optimizing compiler might see through a non-constexpr declaration like 'double a0 = ...' or it might not. Constexpr is somewhat more explicit, especially with more complicated initializer expressions.

fancy_pantser
The compiler can substitute the value how it sees fit. It's like #define, but type-safe and scoped.

Maybe it's folded into expressions, propagated through constant expressions, or used in contexts that require compile-time constants (template parameters, array sizes, static_assert, other constexpr expressions).

I mean, not in this case of pi/2, where it's more about announcing semantics, but in general those are the purposes and uses.
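For concreteness, a toy example of those contexts (identifiers mine):

```cpp
#include <array>
#include <cassert>

constexpr double half_pi = 1.5707288;  // usable wherever a constant expression is required
constexpr int table_size = 8;

static_assert(half_pi > 1.5, "checked at compile time");  // constexpr in static_assert

std::array<double, table_size> table{};  // constexpr as an array size / template argument

constexpr int twice(int n) { return 2 * n; }  // foldable into other constant expressions
```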

gzread
It can do this with const too or even a normal variable that just happens to not vary.
ErroneousBosh
I'd like something like this in C or C++ quite honestly.

Something like a struct that I can say "this struct is global to the whole program and everyone can see it, but once this function exits those values are locked in". Maybe something like that one function is allowed to unlock and update it, but nowhere else.

Think in terms of storing a bunch of precomputed coefficients that are based on the samplerate of a system, where you really only need to set it up once on startup and it is unlikely to change during the application's running lifetime.

I feel like there probably is a way to do this, and if I was good at high level languages like C I'd know what it is. If you know, tell me what I'm not understanding ;-)
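One common C++ pattern for this (a sketch, not the only way; all identifiers and the 1 kHz reference frequency are mine) is to keep the single mutable copy private to one translation unit, let one init function write it, and hand everyone else a const reference:

```cpp
#include <cassert>
#include <cmath>

// Precomputed coefficients derived from the sample rate.
struct Coeffs {
    double dt;     // seconds per sample
    double omega;  // radians per sample at a 1 kHz reference frequency
};

namespace {
Coeffs g_coeffs{};          // the single mutable copy, private to this file
bool g_initialised = false;
}

// The one place allowed to write the values; call once at startup.
void init_coeffs(double sample_rate) {
    g_coeffs.dt = 1.0 / sample_rate;
    g_coeffs.omega = 2.0 * 3.141592653589793 * 1000.0 * g_coeffs.dt;
    g_initialised = true;
}

// Everyone else only ever sees a const view.
const Coeffs& coeffs() {
    assert(g_initialised);  // fail loudly if used before setup
    return g_coeffs;
}
```

Nothing stops a determined caller from const_casting, but it gives you the "locked in after startup" shape; plain C can do the same with a static struct and an accessor returning `const Coeffs *`.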

jaen
Cool, although more ILP (instruction-level parallelism) might not necessarily be better on a modern GPU, which doesn't have much ILP, if any (instead it uses those resources to execute several threads in parallel).

That might explain why the original Cg (a GPU programming language) code did not use Estrin's, since at least the code in the post does add 1 extra op (squaring `abs_x`).

(AMD GPUs used to use VLIW (very long instruction word) which is "static" ILP).

jagged-chisel
> It also gets in the way of elegance and truth.

That’s quite subjective. I happen to find trigonometry to be elegant and true.

I also agree that trigonometric functions lack efficiency in software.

def-pri-pub
>> It also gets in the way of elegance and truth.

Where did that come from in the article?

jagged-chisel
I revisited that article and ... now I have no idea. Maybe I stumbled into some other trig-related article and came back here. Or maybe this one had some A/B content going on?

The only thing I remember at this point is that I copied and pasted that sentence (I didn't type it.) Even search doesn't find the sentence anywhere but HN.

FEELmyAGI
Search finds that sentence on this blog post https://iquilezles.org/articles/noacos/
jagged-chisel
Thanks for finding it! Still not sure how I got there while thinking I was at "Even Faster Asin()..."
fatih-erikli-cg
I think it is the `atan` function. Sin is almost a lookup query.
xyzzyz
On modern machines, looking things up can be slower than recomputing them when the computation is simple. This is because memory is much slower than the CPU, which means you can often compute something many times over before the answer from memory arrives.
connicpu
Unless your lookup table is small enough to only use a portion of your L1 cache and you're calling it so much that the lookup table is never evicted :)
somat
Not just modern machines: the Nintendo 64 was memory bound under most circumstances, and as such many traditional optimizations (lookup tables, unrolled loops) can be slower on the N64. The loop-unrolling case is interesting: because the CPU has to fetch more instructions, unrolling puts more strain on the memory bus.

If curious: on the N64 the graphics chip is also the memory controller, so everything the CPU can do to stay off the memory bus has an additive effect, allowing the graphics to do more graphics. This is also why the n64 has weird 9-bit ram, it is so they could use a 18-bit pixel format, only taking two bytes per pixel, for cpu requests the memory controller ignored the 9th bit, presenting a normal 8 bit byte.

They were hoping that by having high-speed memory (250 MHz, while the CPU ran at 93.75 MHz) it could provide for everyone, and it did OK; there are some very impressive games on the N64. But on most of them the CPU is running fairly light: gotta stay off that memory bus.

https://www.youtube.com/watch?v=xFKFoGiGlXQ (Kaze Emanuar: Finding the BEST sine function for Nintendo 64)

gzread
The N64 was a particularly unbalanced design for its era so nobody was used to writing code like that yet. Memory bandwidth wasn't a limitation on previous consoles so it's like nobody thought of it.
ErroneousBosh
> This is also why the n64 has weird 9-bit ram, it is so they could use a 18-bit pixel format, only taking two bytes per pixel, for cpu requests the memory controller ignored the 9th bit, presenting a normal 8 bit byte.

The Ensoniq EPS sampler (the first version) used 13-bit RAM for sample memory. Why 13 and not 12? Who knows? Possibly because they wanted it "one louder", possibly because the Big Rival in the E-Mu Emulator series used μ-law codecs which have the same effective dynamic range as 13-bit linear.

Anyway, you read a normal 16-bit word using the 68000's normal 16-bit instructions, but only the upper 13 bits were actually valid data from the RAM; the rest were tied low. Haha, no code space for you!

kleiba
Interesting. About 20 years ago, it must have been the other way around because I remember this paper [1] where the authors were able to speed up the log function by making use of a lookup table in the CPU cache.

[1] https://www.researchgate.net/profile/Nikki-Mirghafori/public...

hansvm
It... depends.

I make things faster all the time by leveraging various CPU caches, sometimes even disk or networked disks. As a general principle though, memory lookups are substantially slower than CPU arithmetic (and that has indeed changed over time; a decade or three ago they were close to equal), and even cache lookups are comparatively slow, especially when you consider whole-program optimization.

That isn't to say that you can't speed things up with caches, but that you have to be replacing a lot of computations for even very small caches to be practically better (and even very small caches aren't helpful if the whole-program workload is such that you'll have to pull those caches from main RAM each time you use them).

To your paper in particular, their technique still assumes reasonably small caches which you constantly access (so that you never have to reach out to main RAM), even when it was written, and part of what makes it faster is that it's nowhere near as accurate as 1ULP.

Logarithms are interesting because especially across their entire domain they can take 40-120 cycles to compute, more if you're not very careful with the implementation. Modern computers have fairly fast floating-point division and fused multiply-add, so something I often do nowadays is represent them as a ratio of two quadratics (usually rescaling the other math around the problem to avoid the leading coefficient on one of those quadratics) to achieve bounded error in my domain of interest. It's much faster than a LUT (especially when embedded in a larger computation and not easily otherwise parallelizable) and much faster than full-precision solutions. It's also pretty trivially vectorizable in case your problem is amenable to small batches. Other characteristics of your problem might cause you to favor other solutions.
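To make the "ratio of two quadratics" idea concrete, here is a worked example of mine (not the parent's fitted coefficients): the [2/2] Padé approximant of ln(1+u) is (6u + 3u²)/(6 + 6u + u²), which already holds ln on [1, 2] to about 1e-3. A real kernel would use minimax-fitted coefficients for its domain, but the shape of the computation is the same: two fmas and one divide.

```cpp
#include <cassert>
#include <cmath>

// ln(x) on [1, 2] as a ratio of two quadratics in u = x - 1:
// the [2/2] Pade approximant  ln(1+u) ~= (6u + 3u^2) / (6 + 6u + u^2).
double log_pade(double x) {
    double u = x - 1.0;
    double num = std::fma(3.0, u, 6.0) * u;  // 6u + 3u^2
    double den = std::fma(u, 6.0 + u, 6.0);  // 6 + 6u + u^2
    return num / den;
}
```

Both fmas are independent, so this pipelines well and vectorizes trivially over a batch of inputs.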

AlotOfReading
Logarithms are interesting because there's hardware to approximate them built into every modern processor as part of floating point. If you can accept the error, you can abuse it to compute logs with a single FMA.

An example of an exp and a log respectively from my personal library of bit hacks:

    bit_cast<float>((int32_t)(fma(12102203.2f, x, 0x3f800000)));

    bit_cast<float>((uint32_t)(-0x3f800000 - 36707.375f*x)) + 7;
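The exp line is the classic Schraudolph trick: scale x so it lands in the float's exponent field, add the exponent bias, and reinterpret the integer bits as a float. Fleshed out into a standalone sketch (mine, using memcpy instead of bit_cast to stay pre-C++20; expect a few percent error):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Schraudolph-style exp: build the bit pattern
//   (2^23 / ln 2) * x + (127 << 23)
// and reinterpret it as a float. 12102203.2f ~= 2^23 / ln 2,
// and 0x3f800000 is the biased exponent of 1.0f.
float fast_exp(float x) {
    int32_t i = (int32_t)(12102203.2f * x + 0x3f800000);
    float f;
    std::memcpy(&f, &i, sizeof f);  // type-pun the integer bits to float
    return f;
}
```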
vlovich123
It’s a delicate balance and really hard to benchmark. You can write a micro benchmark that keeps the lookup table in cache but what if your function isn’t the only thing being done in a loop? Then even if it’s in the hotpath, there’s insufficient cache to keep the table loaded the entire way through the loop and lookup is slower.

TLDR: it depends on the usage and we actually should have multiple functions that are specialized based on the properties of the caller’s needs where the caller can try a cache or compute approach.

fatih-erikli-cg
It may be, especially when it comes to unnecessary cache. But I think `atan` is almost a brute force. Lookup is nothing compared to that.

Sin/cos must be borders of sqrt(x²+y²). It is also cached indeed.

groundzeros2015
What do you mean brute force?

We can compute these things using iteration or polynomial approximations (sufficient for 64 bit).

fatih-erikli-cg
There is a loop of "is it close enough or not", something like that. It is brute force. Atan2 purely looks like that to me.
BigTTYGothGF
> Sin/cos must be borders of sqrt(x²+y²). It is also cached indeed

This doesn't make a ton of sense.

fatih-erikli-cg
In what way do you think a sin function is computed? It is something that is computed and cached, in my opinion.

I think it is stored like sintable[deg]. The degree is the index.

paulddraper
atan(x) = asin(x / sqrt(1+x*x))
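A quick numeric sanity check of the identity (wrapper name mine; the identity itself is exact, since if t = atan(x) then sin(t) = x / sqrt(1 + x²)):

```cpp
#include <cassert>
#include <cmath>

// atan(x) == asin(x / sqrt(1 + x*x)) for all real x.
double atan_via_asin(double x) {
    return std::asin(x / std::sqrt(1.0 + x * x));
}
```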
fatih-erikli-cg
So the asin is brute force. I think it is the `atan` function. The article as written explains nothing. Sqrt is also like that.

It looks like there should be a reasonable explanation when it is written in math form, but there is none.

thomasahle
Did you try polynomial preprocessing methods, like Knuth's and Estrin's methods? https://en.wikipedia.org/wiki/Polynomial_evaluation#Evaluati... They let you compute polynomials with half the multiplications of Horner's method, and I used them in the past to improve the speed of the exponential function in Boost.
kzrdude
yes, Estrin's method is the update
thomasahle
Sorry, I said that wrong. Estrin's doesn't reduce the number of multiplications.
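It does shorten the dependency chain, though: the pairwise fmas are independent and can issue in parallel. A sketch for a degree-7 polynomial (function name mine):

```cpp
#include <cassert>
#include <cmath>

// Estrin's scheme for c0 + c1 x + ... + c7 x^7: pair terms as
// (c_{2i} + c_{2i+1} x), then combine the pairs with x^2 and x^4.
// Same multiply count as Horner, but the four pair fmas are independent.
double estrin7(const double c[8], double x) {
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = std::fma(c[1], x, c[0]);
    double p23 = std::fma(c[3], x, c[2]);
    double p45 = std::fma(c[5], x, c[4]);
    double p67 = std::fma(c[7], x, c[6]);
    double p0123 = std::fma(p23, x2, p01);
    double p4567 = std::fma(p67, x2, p45);
    return std::fma(p4567, x4, p0123);
}
```

Horner's chain is 7 dependent fmas; here the critical path is roughly x², x⁴ plus three fma levels.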
fph
If your goal is reducing the number of multiplications, I imagine it would make sense to factor that polynomial into degree-1 and degree-2 factors.