
Convolutions, Polynomials and Flipped Kernels

106 points | 9 months ago | eli.thegreenplace.net
incognito124 9 months ago

My favourite use case for this: by the same derivation as in this blog post, one can prove that if you have any two independent random variables X and Y (their distributions can differ), the distribution of X + Y is the convolution of the PMFs/PDFs of X and Y.
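
A quick numeric check of this, as a minimal sketch (NumPy; the fair-die PMFs are just an illustration):

    import numpy as np

    # PMFs of two independent fair dice, indexed by outcomes 1..6
    pmf_x = np.full(6, 1/6)
    pmf_y = np.full(6, 1/6)

    # PMF of X + Y is the convolution of the two PMFs (sums 2..12)
    pmf_sum = np.convolve(pmf_x, pmf_y)

    # Cross-check by brute-force enumeration of all 36 outcomes
    check = np.zeros(11)
    for i in range(6):
        for j in range(6):
            check[i + j] += 1 / 36

    assert np.allclose(pmf_sum, check)
    print(pmf_sum)  # peaks at 6/36 for a sum of 7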

srean 9 months ago

Along similar lines, the MAX operator on random variables becomes the PRODUCT operator on their distribution functions (CDFs).

It's fun to play with the (Max, +) algebra of random variables and infer the resulting distribution.

This turns out to be quite useful for estimating the completion time of dependent parallel jobs.

Spawning multiple parallel jobs becomes a Max operation, and chaining sequential jobs becomes a '+' operation on the completion times. This expression tree of Max'es and Plus'es can be processed algebraically to obtain bounds on the completion time distribution.
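
For intuition, here is a minimal Monte Carlo sketch of that algebra (the job-time distributions are made up purely for illustration): two parallel jobs feed one sequential job, so the end-to-end time is Max(A, B) + C.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Hypothetical job completion times (seconds); purely illustrative
    a = rng.exponential(1.0, n)  # parallel job A
    b = rng.exponential(1.0, n)  # parallel job B
    c = rng.exponential(0.5, n)  # sequential job C, runs after both

    total = np.maximum(a, b) + c  # parallel -> Max, sequential -> +

    print("mean:", total.mean())
    print("p99 :", np.quantile(total, 0.99))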

One example is the straggler problem in MapReduce/Hadoop. In the naive case, the completion time is the max over the parallel subtasks' completion times. (*)

If the task times have a heavy tail, which they sometimes do, the straggler's completion time can be really bad. This can be mitigated by a k-out-of-n setup, where you encode the problem in such a way that only k out of n jobs need to finish to obtain the final result. One can then play with the trade-off between potentially wasted computation and expected completion time.
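
A minimal Monte Carlo sketch of that trade-off (the Pareto job times, k and n are illustrative assumptions): the k-out-of-n completion time is the k-th order statistic of the n job times.

    import numpy as np

    rng = np.random.default_rng(0)
    n_jobs, k, trials = 10, 8, 100_000

    # Heavy-tailed (Pareto) job times, purely illustrative
    times = rng.pareto(1.5, size=(trials, n_jobs)) + 1.0

    wait_all  = times.max(axis=1)                 # need all n jobs
    wait_kofn = np.sort(times, axis=1)[:, k - 1]  # need only k of n

    for name, w in [("all n ", wait_all), ("k of n", wait_kofn)]:
        print(name, "p99:", np.quantile(w, 0.99))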

For heavy-tailed distributions another simplification is possible: the tails of Max and '+' become of the same order, so one can switch between convolutions and products.

(*) This shows up in microservice architectures too. The owner/maintainer of a microservice endpoint might be very happy with its tail latency. However, an end user who consumes and composes the results of several endpoints can experience really bad tail latencies for the final output.

incognito124 9 months ago

Thanks for sharing the name of that problem! I've encountered it before while optimizing batched LLM inference. The whole batch would last until all queries in the batch were done, so by changing the batch size you'd trade off per-query speed (better in a larger batch) against overall performance (worse in a larger batch).

Nowadays I think this is solved in an entirely different way, though.

srean 9 months ago

It gets more entertaining.

It's common to wrap API calls with:

retry on failure, or

spawn an identical request if taking longer than x, or

recursively spawn an identical request if taking longer than x, or

retry on failure, but no more than k times.

All of these and similar patterns/decorators can be analysed using the same idea.
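
For example, the second pattern ("spawn an identical request if taking longer than x") has completion time min(T1, x + T2) when the duplicate is fired at time x. A minimal Monte Carlo sketch (the latency distribution and cutoff are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, x = 1_000_000, 2.0

    # Hypothetical heavy-tailed request latencies
    t1 = rng.pareto(1.5, n) + 1.0
    t2 = rng.pareto(1.5, n) + 1.0

    # Hedged request: duplicate fired at time x, first response wins
    hedged = np.minimum(t1, x + t2)

    print("plain  p99:", np.quantile(t1, 0.99))
    print("hedged p99:", np.quantile(hedged, 0.99))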

+1
incognito124 9 months ago
deepsun 9 months ago

Almost every probability theorem starts with "let's take independent random variables". But in reality almost nothing is independent. Superdeterminism even claims that exactly nothing is independent.

mananaysiempre 9 months ago

Almost every elementary probability theorem. There are plenty of theorems conditional on bounds on how weak the dependency is (or how fast correlations decay with order, etc.), including CLT-like theorems; it's just that they are difficult to state, very difficult to prove, and almost impossible to verify the applicability of. In practice you are going to use the simpler version anyway and check whether the results make sense after the fact.

srean 9 months ago

You are right.

The degree of dependence matters though. Mutual information is one way to measure that.

Thankfully, some theorems remain valid even when independence is violated. The next stop after independence is martingale criteria: martingale difference sequences can be quite strongly dependent, yet they allow some of the usual theorems to go through, albeit with worse convergence rates.

ttoinou 9 months ago

What? I've never heard of this math. Do you mean those are literally the resulting operations in the general case, or is that an approximate explanation, so that we need to find more specific cases to make it true?

srean 9 months ago

The math is general and exact.

Max and Plus in the space of random variables become product and convolution in the space of their distribution functions.

    Distr(X + Y) = DistrX ° DistrY

    Distr(X ^ Y) = DistrX * DistrY

    where '^' denotes Max and '°' denotes convolution.

Note that *, +, ° and ^ are commutative and associative, so they can be chained. One can also use their distributive properties. This is really the math of semirings (cf. tropical algebra).

However, one can and one does resort to approximations to compute the desired end results.

More specifically, people are often interested not in the full distribution but in some statistic of it: the mean, the standard deviation, some tail percentile, etc. To compute those statistics from the exact distributions, approximations can be employed.
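
Concretely, for discrete distributions on a common support, both transforms take a couple of lines of NumPy (the PMFs here are arbitrary illustrations):

    import numpy as np

    # Arbitrary PMFs on support {0, 1, 2}
    pmf_x = np.array([0.2, 0.5, 0.3])
    pmf_y = np.array([0.6, 0.1, 0.3])

    # '+' on random variables -> convolution of PMFs (support {0..4})
    pmf_sum = np.convolve(pmf_x, pmf_y)

    # Max on random variables -> product of CDFs
    cdf_max = np.cumsum(pmf_x) * np.cumsum(pmf_y)
    pmf_max = np.diff(cdf_max, prepend=0.0)  # back to a PMF if needed

    print(pmf_sum, pmf_max)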

+1
gjm11 9 months ago
silloncito 9 months ago

Let M = Max(X, Y). If X and Y are independent, then

    F_M(k) = P(M <= k) = P(X <= k and Y <= k)
           = P(X <= k) * P(Y <= k)
           = F_X(k) * F_Y(k)

so F_M = F_X * F_Y.
+1
srean 9 months ago
pizza 9 months ago

Check out tropical algebras

ttoinou 9 months ago

Fascinating, thank you for reminding me about that.

whatshisface 9 months ago

That doesn't sound right. If P(X) is the vector {0.5,0,0.5} and P(Y) is {0.5,0.5,0}, P(X)P(Y) is {0.25,0,0} and that's both not normalized and clearly not the distribution for max(X,Y). Did you get that from an LLM?

srean 9 months ago

You are using PMFs. I meant, and wrote, distribution functions, a.k.a. cumulative distribution functions. Those are closed under products.

> Did you get that from an LLM?

LOL. There must be a fun and guilty story lurking behind that accusation.

On a more serious note, I would love it if LLMs could do such simplifications and estimations on their own.
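
Redoing your example with CDFs (same two vectors, support {0, 1, 2}), as a quick check:

    import numpy as np

    pmf_x = np.array([0.5, 0.0, 0.5])
    pmf_y = np.array([0.5, 0.5, 0.0])

    cdf_max = np.cumsum(pmf_x) * np.cumsum(pmf_y)  # [0.25, 0.5, 1.0]
    pmf_max = np.diff(cdf_max, prepend=0.0)        # [0.25, 0.25, 0.5]

    # Brute-force check over all value pairs
    check = np.zeros(3)
    for i, px in enumerate(pmf_x):
        for j, py in enumerate(pmf_y):
            check[max(i, j)] += px * py
    assert np.allclose(pmf_max, check)

The product CDF ends at 1.0, so it is properly normalized.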

+2
whatshisface 9 months ago
jerf 9 months ago

3blue1brown has a walkthrough of this: https://www.youtube.com/watch?v=IaSGqQa5O-M

stared 9 months ago

Beware: one step more and you get into the territory of generating functions. I recommend the book by Herbert Wilf with the wonderful name Generatingfunctionology (https://www2.math.upenn.edu/~wilf/gfology2.pdf).

eliben 9 months ago

Indeed, generating functions are mentioned in a footnote :) Very interesting topic

stared 9 months ago

Saw that!

Sometimes it makes things simpler (quite a lot of things in combinatorics), other times it is a tool for nice tricks (I have no idea how I would have solved these equations if it were not for generating functions; see the appendix of a Mafia game paper, https://arxiv.org/abs/1009.1031).

srean 9 months ago

Ooh! Lovely. Thank you.

Generating functions and Z-transforms are indispensable in probability theory, physics, and signal processing, and now, it seems, for a good round of Mafia while camping with friends.

esafak 9 months ago

I tip my hat to the person who invented that.

nayuki 9 months ago

You can also multiply polynomials by way of analogy with integer multiplication:

         3  1  2  1
       ×    2  0  6
       ------------
        18  6 12  6
      0  0  0  0
   6  2  4  2
  -----------------
   6  2 22  8 12  6
= 6x^5 + 2x^4 + 22x^3 + 8x^2 + 12x^1 + 6x^0.
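
The column sums are exactly a convolution of the coefficient sequences, which is the article's point, so NumPy reproduces the table directly:

    import numpy as np

    # Coefficients, highest power first: 3x^3 + x^2 + 2x + 1 and 2x^2 + 6
    print(np.convolve([3, 1, 2, 1], [2, 0, 6]))  # [ 6  2 22  8 12  6]
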
kazinator 9 months ago

Not to mention divide:

  2x^2 + 5x - 3
  -------------
     x + 2

                  2x + 1
          ______________
   x + 2 | 2x^2 + 5x - 3
           2x^2 + 4x
           -------------
                   x - 3
                   x + 2
                   -----
                     - 5
The remainder is -5, which gives us a -5/(x + 2) term. Thus:

                    5
   =    2x + 1  - -----
                  x + 2
How about something we know divides:

                     x + 1
             _______________
      x + 1 | x^2 + 2x + 1
              x^2 + x
              --------
                     x + 1
                     x + 1
                     -----
                         0
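
For what it's worth, NumPy's polynomial division reproduces both results (coefficients listed highest power first):

    import numpy as np

    q, r = np.polydiv([2, 5, -3], [1, 2])
    print(q, r)  # [2. 1.] [-5.]  i.e. 2x + 1, remainder -5

    q, r = np.polydiv([1, 2, 1], [1, 1])
    print(q, r)  # [1. 1.] [0.]   i.e. x + 1 exactly
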
Sourabhsss1 9 months ago

The visualizations make the concept easy to grasp.

bjt12345 9 months ago

This and complex analysis are fascinating topics in undergraduate studies.

esafak 9 months ago

Contour integrals still feel cool.
