# FLOPs Optimization Using SymPy

July 9, 2018

## Symbolic optimization of FLOPs

In certain computations, when some input parameters of a function are known ahead of time, it is possible to simplify the expressions computed by the function. Fewer computations translate into faster execution. Compilers already apply expression simplification to some extent; however, more advanced schemes exist that can perform greater simplifications, and if these can be applied offline the resulting optimized expression may increase performance.

Optimization in the sense of this article refers to arithmetic expression minimization. The goal is to minimize the logical number of Floating Point Operations (FLOPs) in an expression, translating into faster execution time. In this article optimizations are performed using SymPy, a Python package that can perform expression minimization, count FLOPs and output the resulting expression as C/C++ code.

- Presentation of Lagrange interpolation
- Symbolic optimization of the algorithm with SymPy
- Performance comparison against the baseline algorithm
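
As a small taste of the FLOP counting used throughout this article, `count_ops` reports the number of arithmetic operations in an expression (a minimal example, not taken from the article's algorithm):

```python
from sympy import symbols, count_ops

x, y = symbols('x y')
expr = x**2 + 2*x*y + y**2

# visual=True reports a breakdown per operation type
print(count_ops(expr, visual=True))  # e.g. 2*ADD + 2*MUL + 2*POW
# The factored form (x + y)**2 needs only 2 operations in total
print(count_ops(expr.factor()))
```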

## Lagrange interpolation

I came across this algorithm when studying how to delay a time series signal by less than one sample. For example, say a running series of 100 measurements is kept and you need to find the value between consecutive samples while preserving the spectral properties of the time series, i.e. reconstruct a value between two samples that is true with respect to the source signal. There are many ways of computing a value between samples in a "good enough" sense, but in order to faithfully reconstruct such a value the approach should feature an all-pass filter characteristic. Without going further into how such a filter should behave, I will simply state that Lagrange interpolation applied directly to the time series has this characteristic.

Interpolation, as you probably know, is the process of creating a new value within a range of known values. That is, given a set of value pairs

$$(x_1, y_1), (x_2, y_2), … , (x_N, y_N)$$

an interpolator computes a new value, using any number of the pairs, that is optimal with respect to some error measure, e.g.

$$f(x_{1.5}) \approx y_{1.5}.$$

Let the order of the Lagrange interpolator be $N$, given by $N$ pairs of data $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$, and let the delay be $\hat{x}$. Then the interpolated value at time $\hat{x}$ is

$$L(\hat{x}) = \sum_{i=1}^{N}y_i \times l_i(\hat{x})$$

where $l_i(\hat{x})$ are Lagrange polynomials. $l_i(\hat{x})$ is given by:

$$l_i(\hat{x}) = \prod_{\substack{n=1 \\ n \neq i}}^{N} \frac{\hat{x} - x_n}{x_i - x_n}$$

Assuming the pairs $(x_n, y_n)$ are equidistant, which is the case in e.g. audio, and that the order is fixed, the polynomials $l_i(\hat{x})$ can be simplified.

The baseline algorithm is shown below:

```python
def lagrangePolynomial(samples, xhat):
    lhat = 0.0
    N = len(samples)

    for i in range(0, N):
        xi = samples[i][0]
        yi = samples[i][1]

        factor = yi

        for n in range(0, N):
            xn = samples[n][0]

            if i != n:
                factor = factor * (xhat - xn) / (xi - xn)

        lhat += factor

    return lhat


xy = [(0, 1), (1, 2), (2, 3), (3, 4)]

print(lagrangePolynomial(xy, 1.0))  # ... 2.0
print(lagrangePolynomial(xy, 1.5))  # ... 2.5
print(lagrangePolynomial(xy, 2.0))  # ... 3.0
```


### Simplifying $l_i(x)$

For equidistant samples, the points $x_n$ can be shifted to the origin by subtracting $x_1$ from all $x_n$ and from the delay $\hat{x}$. Denote by $\bar{x}$ the delay $\hat{x}$ shifted by $x_1$ units. Assuming unit spacing, this gives

$$l_i(\bar{x}) = \frac{\bar{x} - 0}{x_i - 0} \times \frac{\bar{x} - 1}{x_i - 1} \times \cdots$$ .

Furthermore, since $i$ is the index of the outer summation, it can be fixed too:

$$l_1(\bar{x}) = \frac{\bar{x} - 0}{1 - 0} \times \frac{\bar{x} - 2}{1 - 2} \times \cdots$$ $$l_2(\bar{x}) = \frac{\bar{x} - 0}{2 - 0} \times \frac{\bar{x} - 1}{2 - 1} \times \cdots$$ .

The insight from this exercise is that for an arbitrary $\bar{x}$ and equidistant points $x_n$, the coefficients $l_i(\bar{x})$ involve recurring computations to some extent. If these computations can be done once and reused, they provide a target for expression optimization, especially if the coefficients are recomputed often.

Before moving into how to optimize this expression in terms of the number of FLOPs, we split the expression into its now-constant and variable parts. Denote the reciprocal of the denominator $q = 1/(x_i - x_n)$, then

$$\prod_{n \neq i} \frac{\bar{x} - x_n}{x_i - x_n} = \prod_{n \neq i} q\,(\bar{x} - x_n) = \prod_{n \neq i} \left(\bar{x}\,q - x_n\,q\right)$$
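To make the split concrete, here is a sketch (not the article's final version) that precomputes the constant pairs $(q, x_n q)$ once for a fixed order and reuses them per call, assuming 0-based equidistant points $x_n = n$:

```python
# Precompute the constant parts q = 1/(x_i - x_n) and x_n*q once
# for a fixed order N, assuming equidistant points x_n = n.
N = 5
coeffs = []  # coeffs[i] holds the pairs (q, x_n * q) for all n != i
for i in range(N):
    coeffs.append([(1.0 / (i - n), n / (i - n)) for n in range(N) if n != i])

def lagrange_precomputed(ys, xbar):
    """Evaluate the interpolator reusing the precomputed constants."""
    total = 0.0
    for i in range(N):
        factor = ys[i]
        for q, c in coeffs[i]:
            factor *= xbar * q - c
        total += factor
    return total

# Linear data y = x + 1 is reproduced exactly (up to rounding):
print(lagrange_precomputed([1, 2, 3, 4, 5], 1.5))  # approximately 2.5
```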

## Symbolic optimization with SymPy

First, consider which variables we must account for while performing the optimization. A simple approach is to categorize symbols and expressions into constants and input-dependent values (variables). Applying this, we see that having shifted the values of $x_n$ to the origin, what remains unknown are $y_n$ and the delay $\bar{x}$. In SymPy we declare these as symbols, representing quantities that are not constant within the scope of the optimization.

```python
from sympy import *

N = 5
xbar = Symbol('xbar', positive=True, real=True)
ys = [Symbol('y' + str(i), real=True) for i in range(1, N + 1)]  # [y1, y2, ..., yN]
```


The constant parts of the innermost loop discussed in the previous section are computed as

```python
denominator = 1.0 / (float(i) - float(n))  # 1/(x_i - x_n)
constant = float(n) * denominator          # x_n/(x_i - x_n)
```


And the Lagrange coefficient is computed as a product over $n$:

```python
factor = factor * (xbar * denominator - constant)
```


Building the symbolic expression becomes

```python
from sympy import *

N = 5
xbar = Symbol('xbar', positive=True, real=True)
ys = [Symbol('y' + str(i), real=True) for i in range(1, N + 1)]

ybar = 0.0
for i in range(0, N):
    factor = ys[i]
    for n in range(0, N):
        if i != n:
            denominator = 1 / (float(i) - float(n))
            constant = float(n) * denominator
            factor = factor * (xbar * denominator - constant)
    ybar += factor
```
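As a quick sanity check (a sketch, not part of the original article), the symbolic expression can be evaluated by substitution; for linear data $y_i = i + 1$ an exact interpolator must return $\bar{x} + 1$:

```python
from sympy import Symbol

# Rebuild the expression from above and substitute linear data.
N = 5
xbar = Symbol('xbar', positive=True, real=True)
ys = [Symbol('y' + str(i), real=True) for i in range(1, N + 1)]

ybar = 0
for i in range(N):
    factor = ys[i]
    for n in range(N):
        if i != n:
            denominator = 1 / (float(i) - float(n))
            factor = factor * (xbar * denominator - float(n) * denominator)
    ybar += factor

# Substitute xbar = 1.5 and y_i = i + 1; the result should be ~2.5.
value = ybar.subs({xbar: 1.5, **{ys[i]: i + 1 for i in range(N)}})
print(float(value))  # approximately 2.5
```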


Proceeding to minimize the expression using SymPy's Common Subexpression Elimination (`cse`):

```python
simplified = ybar.simplify()
simplified_ybar = cse(simplified)
print(count_ops(ybar, visual=True))            # 4*ADD + 40*MUL + 16*SUB
print(count_ops(simplified_ybar, visual=True)) # 2*ADD + 27*MUL + 12*SUB
```


resulting in a $\approx 31$% reduction in FLOPs for an interpolation order of $N = 5$. It is also possible to make SymPy output the expression as C code:

```python
for subexp in simplified_ybar[0]:
    print(ccode(subexp[1], subexp[0], standard='C99'))

print(ccode(simplified_ybar[-1][0], 'ybar', standard='C99'))

""" output
x0 = xbar - 1;
x1 = 1.0*xbar;
x2 = x0*(x1 - 3.0);
x3 = 0.5*xbar;
x4 = x1 - 2.0;
x5 = x3 - 1.0;
x6 = 0.333333333333333*xbar;
ybar = -0.166666666666667*x0*x4*xbar*y4*(x1 - 4.0) + 1.0*x0*x5*y1*(x6 - 1.0)*(0.25*xbar - 1.0) - x1*x4*y2*(x3 - 1.5)*(x6 - 1.33333333333333) + 0.0833333333333333*x2*x5*xbar*y5 + 0.5*x2*xbar*y3*(x3 - 2.0);
"""
```


As you can see, the reduction in FLOPs comes at a small cost of increased memory for the temporaries $x_0, \dots, x_6$.

SymPy even supports printing this in the form of a function with separate header and source files. Other supported code generators include Fortran, Octave/MATLAB, JavaScript, Julia, Mathematica, Rust and Python.
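A minimal sketch of that header/source generation using SymPy's `codegen` utility; the function name `interp`, the file prefix `lagrange`, and the stand-in linear-interpolation expression are my own choices, not from the article:

```python
from sympy import symbols
from sympy.utilities.codegen import codegen

x, y0, y1 = symbols('x y0 y1')
expr = y0 * (1 - x) + y1 * x  # simple linear interpolation as a stand-in

# codegen returns (filename, contents) pairs for the .c and .h files
(c_name, c_code), (h_name, h_code) = codegen(
    ('interp', expr), language='C99', prefix='lagrange', header=False)

print(c_name)  # lagrange.c
print(c_code)  # the generated function is named 'interp'
```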

The next section displays the number of FLOPs in the baseline and the CSE expression.

### Baseline and optimized FLOPS

Running this optimization for a range of $N$ shows how many FLOPs can be eliminated using this approach:

| $N$ | Baseline FLOPs | CSE FLOPs | Improvement (%) |
|-----|----------------|-----------|-----------------|
| 2   | 6              | 6         | 0.0             |
| 3   | 18             | 14        | 22.222          |
| 4   | 36             | 26        | 27.778          |
| 5   | 60             | 41        | 31.667          |
| 6   | 90             | 58        | 35.556          |
| 7   | 126            | 81        | 35.714          |
| 8   | 168            | 107       | 36.31           |
| 9   | 216            | 136       | 37.037          |
| 10  | 270            | 169       | 37.407          |
| 11  | 330            | 199       | 39.697          |
| 12  | 396            | 237       | 40.152          |
| 13  | 468            | 280       | 40.171          |
| 14  | 546            | 327       | 40.11           |
| 15  | 630            | 376       | 40.317          |
| 16  | 720            | 429       | 40.417          |
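A table like this can be generated with a loop along these lines (a sketch: `simplify` gets slow for large $N$, and exact counts may vary slightly between SymPy versions):

```python
from sympy import Symbol, cse, count_ops

def flops(N):
    """Build the order-N Lagrange expression and count raw vs. CSE ops."""
    xbar = Symbol('xbar', positive=True, real=True)
    ys = [Symbol('y' + str(i), real=True) for i in range(1, N + 1)]
    ybar = 0
    for i in range(N):
        factor = ys[i]
        for n in range(N):
            if i != n:
                denominator = 1 / (float(i) - float(n))
                factor = factor * (xbar * denominator - float(n) * denominator)
        ybar += factor
    baseline = count_ops(ybar)
    optimized = count_ops(cse(ybar.simplify()))
    return baseline, optimized

for N in range(2, 6):
    b, o = flops(N)
    print(N, b, o, round(100.0 * (b - o) / b, 3))
```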


The improvement climbs to $\approx 41.2$% for $N = 32$. The baseline for $N = 32$ has 2976 FLOPs compared to 1750 FLOPs for the CSE expression.

## Discussion

In conclusion, a priori information about the input of a function was used to demonstrate the power of symbolic expression optimization with SymPy. For the case of Lagrange interpolation we managed to reduce the number of required FLOPs by $\approx 40$%. As was seen, this came at a small cost of increased memory usage.

## Feedback

Found an error? Report it by email and I will put your name and contribution in the article.