There are many algorithms to produce linear approximations. These are usually called "linear regressions". Common methods include Least Squares (directly or through the use of Singular Value Decomposition, SVD), Best Basis, etc. The approximation problem can be posed in linear algebra terms as: given a matrix of A (dim n x m) and a vector b (dim n), we want to find a vector x (dim m) such that Ax ~ b. This approximation is in the least squares sense, i.e. ||Ax-b||^2 is minimum.

It is a well known fact that the computation of x for most common approximation algorithms can be done with A^t * A (dim m x m), b*A (dim m) and b*b alone. That is, these 3 elements contain all the information needed to resolve the approximations. Normally m << n, and hence the representation based on these products is much more economical than keeping A and b.

There are other advantages too. The data, (size n) may be too large to fit in memory, and hence we have to use it (or read it) sequentially and process every row completely before processing the next row. Alternatively, online algorithms (data comes incrementally, i.e. we want to approximate with n1 data points, later increase to n2, and approximate again, etc.) will also force this scheme.

This bio-recipe shows 5 ways of computing AtA, b*A and b*b and discusses its efficiency and other numerical features.

For our computation we will create a random matrix A(5000,20), and a dependent vector b which has an obvious relation with two of the columns of A plus a small random noise. This is done by:

n := 5000; m := 20; A := Rand(array(numeric,n,m)): b := CreateArray(1..n): for i to n do b[i] := 3*A[i,3]-7*A[i,7]+0.1*Rand() od:

n := 5000 m := 20

gigahertz();

bytes alloc=9600000, time=0.810 1.5495

The function gigahertz gives us an idea of the speed of the processor for comparison purposes.

The first method to compute AtA, b*A and b*b is by direct computation. We time the computation of the 3 values.

Set(gc=10^7): st := time(): AtA1 := A^t * A: btA := b*A: btb := b*b: time1 := time()-st;

time1 := 0.08000000

To obtain more reliable timings we have used large dimensions. To avoid the overhead of garbage collection, we set its frequency very low (every 10^7 words) and force garbage collection before taking each timing. Notice that in this first method we used linear algebra operations for all operations. These are implemented at the kernel level and hence are quite efficient. They are also easy to read. For the purpose of seeing the results, here is an SVD approximation:

nams := [ seq( var.i, i=1..m ) ]: print( SvdAnalysis(AtA1,btA,btb,nams,0.01) );

Results of SVD analysis quadratic norm of residuals: |r|^2=4.56573 20 singular values used: 370.7 378.9 385.9 389 392.9 396.5 401.8 408.6 410.1 415.8 417.8 424.5 431.7 436.1 440.4 447 453 455.6 459.9 2.534e+04 norm of the raw data, |b|^2=41125 variable value/stdev norm decrease var7 -6.9971 +- 0.0484 2.088e+04 var3 3.0063 +- 0.0481 3904 var10 0.0074 +- 0.0480 0.02356 var1 0.0069 +- 0.0475 0.02111 var5 0.0063 +- 0.0476 0.01738 var17 0.0064 +- 0.0485 0.01725 var4 0.0060 +- 0.0478 0.01604 var12 0.0057 +- 0.0477 0.01447 var9 0.0053 +- 0.0480 0.01211 var14 0.0051 +- 0.0470 0.012 var20 0.0051 +- 0.0474 0.01141 var18 0.0051 +- 0.0477 0.01134 var8 0.0046 +- 0.0474 0.009613 var16 0.0041 +- 0.0483 0.007362 var2 0.0040 +- 0.0482 0.006815 var11 0.0040 +- 0.0483 0.006683 var6 0.0034 +- 0.0484 0.004881 var15 0.0032 +- 0.0475 0.004435 var19 0.0031 +- 0.0478 0.004256 var13 0.0030 +- 0.0478 0.003868

The next method computes the matrix AtA based on the inner products of each column of A. To do this efficiently, we transpose A so that each column becomes a row, and a direct entry in transpose(A). (A matrix is a list of rows). By doing this we save almost half of the computation as the matrix AtA is symmetric.

gc(): st := time(): At := transpose(A): AtA2 := CreateArray(1..m,1..m): for i to m do for j from i to m do AtA2[i,j] := AtA2[j,i] := At[i] * At[j] od od: btA := b*A: btb := b*b: time2 := time()-st;

bytes alloc=10800000, time=2.940 time2 := 0.04000000

We see that the computation done this way saves quite a bit of time, in exchange for code which is slightly more complicated. Hence this is the method of choice when the matrix A can be stored in memory. The reason for the additional efficiency is saving about half of the computation and that inner product computation of two vectors is very efficient in Darwin. The results are numerically identical.

evalb( AtA1 = AtA2 );

true

The rest of the methods are incremental, i.e. we can use them to compute the results with one row of data at a time. So we will write a loop for all rows and in the body of the loop we use A[i] and b[i]. Once that the body is executed, for a given i, we cannot reuse these values.

The third method computes the values incrementally using vector algebra as much as possible:

gc(): st := time(): AtA3 := CreateArray(1..m,1..m): btA := CreateArray(1..m): btb := 0: for i to n do Ai := A[i]; for j to m do AtA3[j] := AtA3[j] + Ai[j]*Ai od; btA := btA + b[i]*Ai; btb := btb + b[i]^2 od: time3 := time()-st;

bytes alloc=10800000, time=2.980 time3 := 0.2700

The overhead of the loop is noticeable and the computation is more difficult to understand, but this is an online method which does not require the matrix A or vector b to be available in memory. We can try a Best Basis analysis this time:

print( SvdBestBasis(AtA3,btA,btb,nams,3,0.01) );

Results of SVD analysis quadratic norm of residuals: |r|^2=5.53540 3 singular values used: 405.9 425.5 4138 norm of the raw data, |b|^2=41125 variable value/stdev norm decrease var7 -6.9730 +- 0.0416 2.806e+04 var3 3.0311 +- 0.0407 5533 var10 0.0319 +- 0.0408 0.6122

The fourth method does all of the matrix and vector algebra explicitly, i.e. with for-loops. Given that AtA is a symmetric matrix, as with the second method, we save half of the computation.

gc(): st := time(): AtA4 := CreateArray(1..m,1..m): btA := CreateArray(1..m): btb := 0: for i to n do Ai := A[i]; for j to m do for k from j to m do AtA4[j,k] := AtA4[j,k] + Ai[j]*Ai[k] od od; btA := btA + b[i]*Ai; btb := btb + b[i]^2 od: for j to m do for k from j+1 to m do AtA4[k,j] := AtA3[j,k] od od; time3 := time()-st;

bytes alloc=40497048, time=3.360 time3 := 1.5400

We can see that this takes substantially longer, a darwin loop compares disfavourably to vector operations in the kernel. The program is even more complicated, so there is no advantage whatsoever with this method.

The fifth method uses an exterior product of Ai^t * Ai, so that all the computation is one step and can be done without internal for-loops. To do exterior products in Darwin we have to build a (1 x m) and an (m x 1) matrices. The (1 x m) is trivial, [Ai], and the (m x 1) is obtained by transposition.

gc(): st := time(): AtA5 := CreateArray(1..m,1..m): btA := CreateArray(1..m): btb := 0: for i to n do Ai := A[i]; AtA5 := AtA5 + transpose([Ai]) * [Ai]; btA := btA + b[i]*Ai; btb := btb + b[i]^2 od: time5 := time()-st;

bytes alloc=40497048, time=5.030 time5 := 0.2500

This last method is the most efficient of the online methods. Its code is rather short, and once that the exterior products are understood, the program is quite simple. It is hence the online method of choice.

From the point of view of numerical accuracy, the first and second methods are more accurate than the others (which should all be identical). The reason for this extra accuracy is that computing AtA directly uses inner products computed in the kernel, and these are computed at additional precision.

diff13 := AtA1 - AtA3: min(diff13)..max(diff13);

-6.8212e-12..7.9581e-12

(A quick way of identifying the extreme values of a matrix is to show its maximum and minimum. When these are computed over the difference of two matrices, it becomes a quick way of measuring the differences of the matrices.) So there is a difference and we can say with authority that the AtA1 values are more accurate.

evalb( AtA3=AtA4 ), evalb( AtA3=AtA5 );

true, true

As expected these values are identical. The errors between the methods are dwarfed by the errors done when the independent data is not uniformly distributed. This can be resolved quite easily and is part of the subject of the bio-recipe on Linear Regression.

© 2005 by Gaston Gonnet, Informatik, ETH Zurich

Last updated on Fri Sep 2 14:44:09 2005 by GhG