Linear Regressions: 5 Methods to Compute A^t * A


     There are many algorithms to produce linear approximations; these are usually called "linear regressions". Common methods include Least Squares (computed directly or through the Singular Value Decomposition, SVD), Best Basis, etc. The approximation problem can be posed in linear algebra terms as: given a matrix A (dim n x m) and a vector b (dim n), we want to find a vector x (dim m) such that Ax ~ b. The approximation is in the least squares sense, i.e. ||Ax-b||^2 is minimized.
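As a minimal illustration of this problem (a Python sketch with hypothetical toy data, not the Darwin code used in this recipe), the least squares x can be found by solving the normal equations (A^t A) x = A^t b. For a tiny 3 x 2 system the 2 x 2 solve can be done by Cramer's rule:

```python
# Toy least-squares sketch: fit b = x0 + x1*t through the points
# t = 1, 2, 3 with b = 4 + 2*t, so the exact answer is x = [4, 2].
A = [[1.0, 1.0],
     [1.0, 2.0],
     [1.0, 3.0]]
b = [6.0, 8.0, 10.0]

# A^t A (2x2) and A^t b (2)
AtA = [[sum(A[i][j] * A[i][k] for i in range(3)) for k in range(2)]
       for j in range(2)]
Atb = [sum(A[i][j] * b[i] for i in range(3)) for j in range(2)]

# Solve the 2x2 normal equations by Cramer's rule
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x = [(Atb[0] * AtA[1][1] - Atb[1] * AtA[0][1]) / det,
     (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det]

print(x)   # -> [4.0, 2.0]
```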

     It is a well known fact that for most common approximation algorithms, x can be computed from A^t * A (dim m x m), b*A (dim m) and b*b alone. That is, these 3 quantities contain all the information needed to solve the approximation problem. Normally m << n, and hence the representation based on these products is much more economical than keeping A and b.
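The reason these 3 quantities suffice is the identity ||Ax - b||^2 = x.(AtA x) - 2 x.(btA) + btb, which holds for any candidate x. A Python sketch (hypothetical toy data) verifying it numerically:

```python
# Check that the residual norm ||Ax - b||^2 can be recovered from
# A^t A, b*A and b*b alone, without access to A or b themselves.
import random

random.seed(0)
n, m = 50, 3
A = [[random.random() for _ in range(m)] for _ in range(n)]
b = [random.random() for _ in range(n)]
x = [0.3, -1.2, 0.7]                       # any candidate solution

AtA = [[sum(A[i][j] * A[i][k] for i in range(n)) for k in range(m)]
       for j in range(m)]
btA = [sum(b[i] * A[i][j] for i in range(n)) for j in range(m)]
btb = sum(bi * bi for bi in b)

direct = sum((sum(A[i][j] * x[j] for j in range(m)) - b[i]) ** 2
             for i in range(n))
via_products = (sum(x[j] * sum(AtA[j][k] * x[k] for k in range(m))
                    for j in range(m))
                - 2 * sum(x[j] * btA[j] for j in range(m)) + btb)

print(abs(direct - via_products) < 1e-9)   # -> True
```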

     There are other advantages too. The data (of size n) may be too large to fit in memory, in which case we have to read it sequentially and process each row completely before moving to the next. Alternatively, online algorithms (where data arrives incrementally, i.e. we want to approximate with n1 data points, later increase to n2 and approximate again, etc.) also force this scheme.

     This bio-recipe shows 5 ways of computing AtA, b*A and b*b and discusses their efficiency and other numerical features.

     For our computation we will create a random matrix A(5000,20), and a dependent vector b which has an obvious relation to two of the columns of A plus a small amount of random noise. This is done by:

n := 5000;  m := 20;
A := Rand(array(numeric,n,m)):
b := CreateArray(1..n):
for i to n do
    b[i] := 3*A[i,3]-7*A[i,7]+0.1*Rand() od:
n := 5000
m := 20
gigahertz();
bytes alloc=9600000, time=0.810
1.5495

     The function gigahertz gives us an idea of the speed of the processor for comparison purposes.

     The first method to compute AtA, b*A and b*b is by direct computation. We time the computation of the 3 values.

Set(gc=10^7):  st := time():
AtA1 := A^t * A:
btA := b*A:
btb := b*b:
time1 := time()-st;
time1 := 0.08000000

     To obtain more reliable timings we have used large dimensions. To avoid the overhead of garbage collection, we set its frequency very low (every 10^7 words) and force a garbage collection before taking each timing. Notice that this first method uses linear algebra operations throughout. These are implemented at the kernel level and hence are quite efficient; they are also easy to read. To see the results, here is an SVD approximation:

nams := [ seq( var.i, i=1..m ) ]:
print( SvdAnalysis(AtA1,btA,btb,nams,0.01) );
Results of SVD analysis
quadratic norm of residuals: |r|^2=4.56573
20 singular values used:  370.7  378.9  385.9  389  392.9
  396.5  401.8  408.6  410.1  415.8  417.8  424.5  431.7  436.1
  440.4  447  453  455.6  459.9  2.534e+04
norm of the raw data, |b|^2=41125

        variable    value/stdev    norm decrease
            var7  -6.9971 +- 0.0484  2.088e+04
            var3   3.0063 +- 0.0481  3904
           var10   0.0074 +- 0.0480  0.02356
            var1   0.0069 +- 0.0475  0.02111
            var5   0.0063 +- 0.0476  0.01738
           var17   0.0064 +- 0.0485  0.01725
            var4   0.0060 +- 0.0478  0.01604
           var12   0.0057 +- 0.0477  0.01447
            var9   0.0053 +- 0.0480  0.01211
           var14   0.0051 +- 0.0470  0.012
           var20   0.0051 +- 0.0474  0.01141
           var18   0.0051 +- 0.0477  0.01134
            var8   0.0046 +- 0.0474  0.009613
           var16   0.0041 +- 0.0483  0.007362
            var2   0.0040 +- 0.0482  0.006815
           var11   0.0040 +- 0.0483  0.006683
            var6   0.0034 +- 0.0484  0.004881
           var15   0.0032 +- 0.0475  0.004435
           var19   0.0031 +- 0.0478  0.004256
           var13   0.0030 +- 0.0478  0.003868

     The next method computes the matrix AtA from the inner products of the columns of A. To do this efficiently, we transpose A so that each column becomes a row, i.e. a direct entry of transpose(A). (A matrix is a list of rows.) Since the matrix AtA is symmetric, we save almost half of the computation.

gc():  st := time():
At := transpose(A):
AtA2 := CreateArray(1..m,1..m):
for i to m do for j from i to m do
    AtA2[i,j] := AtA2[j,i] := At[i] * At[j] od od:
btA := b*A:
btb := b*b:
time2 := time()-st;
bytes alloc=10800000, time=2.940
time2 := 0.04000000

     We see that computing AtA this way saves quite a bit of time, in exchange for slightly more complicated code. Hence this is the method of choice when the matrix A can be stored in memory. The additional efficiency comes from saving about half of the computation and from the fact that the inner product of two vectors is computed very efficiently in Darwin. The results are numerically identical.

evalb( AtA1 = AtA2 );
true
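The symmetry trick of this second method can be sketched in Python (hypothetical toy data; the list of column vectors plays the role of transpose(A)):

```python
# Compute only the upper triangle of A^t A from inner products of the
# columns of A, then mirror it across the diagonal.
import random

random.seed(1)
n, m = 40, 4
A = [[random.random() for _ in range(m)] for _ in range(n)]
cols = [[A[i][j] for i in range(n)] for j in range(m)]   # transpose(A)

AtA = [[0.0] * m for _ in range(m)]
for j in range(m):
    for k in range(j, m):                        # upper triangle only
        s = sum(cols[j][i] * cols[k][i] for i in range(n))
        AtA[j][k] = AtA[k][j] = s                # symmetry saves ~half the work
```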

     The rest of the methods are incremental, i.e. they compute the results with one row of data at a time. So we write a loop over all rows, and in the body of the loop we use only A[i] and b[i]. Once the body has been executed for a given i, these values are not reused.

     The third method computes the values incrementally using vector algebra as much as possible:

gc():  st := time():
AtA3 := CreateArray(1..m,1..m):
btA := CreateArray(1..m):
btb := 0:
for i to n do
    Ai := A[i];
    for j to m do AtA3[j] := AtA3[j] + Ai[j]*Ai od;
    btA := btA + b[i]*Ai;
    btb := btb + b[i]^2
    od:
time3 := time()-st;
bytes alloc=10800000, time=2.980
time3 := 0.2700
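The same incremental accumulation can be sketched in Python (hypothetical toy data; the rows are consumed as a stream, as an online setting requires, and each row Ai contributes the scaled row Ai[j]*Ai to row j of AtA):

```python
# One pass over the rows; neither A nor b is ever held in memory.
import random

random.seed(2)
n, m = 30, 3
rows = ([random.random() for _ in range(m)] for _ in range(n))  # a stream

AtA = [[0.0] * m for _ in range(m)]
btA = [0.0] * m
btb = 0.0
for Ai in rows:
    bi = 3 * Ai[0] - 7 * Ai[1]     # dependent value, echoing the recipe's
                                   # construction of b (noise omitted)
    for j in range(m):
        AtA[j] = [AtA[j][k] + Ai[j] * Ai[k] for k in range(m)]
        btA[j] += bi * Ai[j]
    btb += bi * bi
```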

     The overhead of the loop is noticeable and the computation is harder to follow, but this is an online method which does not require the matrix A or the vector b to be available in memory. We can try a Best Basis analysis this time:

print( SvdBestBasis(AtA3,btA,btb,nams,3,0.01) );
Results of SVD analysis
quadratic norm of residuals: |r|^2=5.53540
3 singular values used:  405.9  425.5  4138
norm of the raw data, |b|^2=41125

        variable    value/stdev    norm decrease
            var7  -6.9730 +- 0.0416  2.806e+04
            var3   3.0311 +- 0.0407  5533
           var10   0.0319 +- 0.0408  0.6122

     The fourth method does all of the matrix and vector algebra explicitly, i.e. with for-loops. Since AtA is a symmetric matrix, as in the second method, we save half of the computation.

gc():  st := time():
AtA4 := CreateArray(1..m,1..m):
btA := CreateArray(1..m):
btb := 0:
for i to n do
    Ai := A[i];
    for j to m do
        for k from j to m do
            AtA4[j,k] := AtA4[j,k] + Ai[j]*Ai[k] od od;
    btA := btA + b[i]*Ai;
    btb := btb + b[i]^2
    od:
for j to m do for k from j+1 to m do AtA4[k,j] := AtA4[j,k] od od;
time4 := time()-st;
bytes alloc=40497048, time=3.360
time4 := 1.5400

     We can see that this takes substantially longer: a Darwin loop compares unfavourably to vector operations in the kernel. The program is also more complicated, so there is no advantage whatsoever to this method.

     The fifth method uses the exterior product Ai^t * Ai, so that all the computation is done in one step, without inner for-loops. To form exterior products in Darwin we have to build a (1 x m) and an (m x 1) matrix. The (1 x m) matrix is trivial, [Ai], and the (m x 1) matrix is obtained by transposition.

gc():  st := time():
AtA5 := CreateArray(1..m,1..m):
btA := CreateArray(1..m):
btb := 0:
for i to n do
    Ai := A[i];
    AtA5 := AtA5 + transpose([Ai]) * [Ai];
    btA := btA + b[i]*Ai;
    btb := btb + b[i]^2
    od:
time5 := time()-st;
bytes alloc=40497048, time=5.030
time5 := 0.2500

     This last method is the most efficient of the online methods. Its code is rather short, and once the exterior product is understood, the program is quite simple. It is hence the online method of choice.
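In Python, the rank-1 exterior-product update of this fifth method might look as follows (hypothetical toy data; A is kept in a list here only so the sketch is self-contained, whereas an online setting would stream the rows):

```python
# Each row contributes the m x m rank-1 matrix Ai^t * Ai in one step.
import random

random.seed(3)
n, m = 25, 3
A = [[random.random() for _ in range(m)] for _ in range(n)]

AtA = [[0.0] * m for _ in range(m)]
for Ai in A:
    outer = [[Ai[j] * Ai[k] for k in range(m)] for j in range(m)]
    AtA = [[AtA[j][k] + outer[j][k] for k in range(m)] for j in range(m)]
```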

Numerical considerations

     From the point of view of numerical accuracy, the first and second methods are more accurate than the others (which are all identical to each other). The reason for this extra accuracy is that computing AtA directly uses inner products computed in the kernel, and these are carried out with additional precision.

diff13 := AtA1 - AtA3:
min(diff13)..max(diff13);
-6.8212e-12..7.9581e-12

     (A quick way of identifying the extreme values of a matrix is to show its minimum and maximum. When these are computed over the difference of two matrices, this becomes a quick way of measuring how much the matrices differ.) So there is a difference, and we can say with authority that the AtA1 values are the more accurate ones.
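A rough Python analogue of this effect (hypothetical toy data; math.fsum plays the role of the extended-precision kernel inner product, while the plain running sum corresponds to accumulation at working precision):

```python
# An inner product accumulated at working precision can differ slightly
# from one accumulated with extra care.
import math
import random

random.seed(4)
u = [random.uniform(-1, 1) for _ in range(10**5)]
v = [random.uniform(-1, 1) for _ in range(10**5)]

naive = 0.0
for ui, vi in zip(u, v):
    naive += ui * vi                  # plain running sum
careful = math.fsum(ui * vi for ui, vi in zip(u, v))

print(abs(naive - careful))           # tiny (often nonzero) rounding difference
```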

evalb( AtA3=AtA4 ), evalb( AtA3=AtA5 );
true, true

     As expected, these values are identical. The differences between the methods are dwarfed by the errors incurred when the independent data is not uniformly distributed. This can be resolved quite easily and is part of the subject of the bio-recipe on Linear Regression.

© 2005 by Gaston Gonnet, Informatik, ETH Zurich


Last updated on Fri Sep 2 14:44:09 2005 by GhG