Now we have the boring stuff out the way we can do some q. For this example we'll work through the same linear regression example in R given on this website: Linear Least Squares Regression and examine some of the statistics. We'll also be using a C extension to calculate the inverse because the native version of q's inverse function sacrifices accuracy for performance. For this example we are much more interested in numerical stability than speed. This function will use lapack's dgetrf and dgetri functions, the source can be found here: qinv All the functions here are natively implemented in this library: https://github.com/jpneill/qlr
Now let's load in the data and get started:
q)year:`float$2000+til 5
q)rate:9.34 8.5 7.62 6.93 6.60
Before we start calculating our parameters we'll examine the correlation between the year and the rate variables. Pearson's coefficient is a measure of correlation between 2 variables that has a value between -1 and 1. When it has a value of 1 a linear equation describes the trend of the data where an increase in X is mirrored by an increase in Y. When it has a value of -1 then the linear equation describes a decreasing trend instead. A value at 0 or close to 0 indicates there is little to no linear correlation between the variables. It is calculated as: $$\rho_{XY}=\frac{cov(X,Y)}{\sigma_X\sigma_Y}$$cov(X,Y) is the covariance between two population samples with size N: $$X=\{x_1,x_2,\dots,x_n\};Y=\{y_1,y_2,\dots,y_n\}$$ $$cov(X,Y)=\sum_{i=1}^N\frac{(x_i-\bar{x})(y_i-\bar{y})}{N}$$ In q the covariance and the Pearson coefficient calculations are provided already. Covariance:
q)cov[rate;year]
-1.41
The Pearson coefficient:
q)cor[rate;year]
-0.9880813
This value for the coefficient tells us that a linear equation descibes the data very well and is a good candidate for applying a simple linear regression model.
Finally we get to use all the theory we discussed in part a. We defined β as: $$\mathbf{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$ This is quite easy to write in q due to lists being the building blocks of all things q:
q)neq:{(qinv[flip[x]$x]$flip[x])$y}
Before we can apply this to our data set we need to build our design matrix. This can be easily accomplished by appending a 1 to each of the values in the year variable:
q)X:1.0,/:year
q)X
1 2000
1 2001
1 2002
1 2003
1 2004
Now we can find our coefficients:
q)beta:neq[X;rate]
q)beta
1419.208 -0.705
To determine the residual standard error we need to first find the sum of squared residuals using the following definition: $$SSE=(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X\beta})$$ In q it has the form:
q)sse:{[x;y;beta]y$y-:x$beta}
q)sse[X;rate;beta]
0.12063
Now we can calculate the residual standard error:$$\hat{\sigma}=\sqrt{\frac{SSE}{d.o.f.}}$$ where d.o.f. is the degrees of freedom - the difference between the number of samples in our data set and the number of parameters we are estimating.
q)rse:{[x;y;beta]`dof`rse!(dof;sqrt sse[x;y;beta]%dof:count[y]-count beta)}
q)rse[X;rate;beta]
dof| 3
rse| 0.2005243
The standard error of the coefficients β are the diagonal entries of the matrix: $$\hat{s.e}\beta=\sqrt{\hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}}$$ This is easily calculated in q:
q)sec:{[x;y;beta]raze {x where y}'[a;{x=/:x}til count a:sqrt (sse[x;y;beta]%count[y]-count beta)*qinv[flip[x]$x]]}
q)sec[X;rate;beta]
126.9496 0.06341136
The coefficient of determination (commonly referred to as R-squared) explains the amount of variation in the dependent variable explained by the independent variables. Its value is always between 0 and 1, with values close to 1 indicating that the regression line of the determined model predicts future values well. We can calculate the R-squared value using the Pearson coefficient between the observed data and the predicted data: $$R^2=\rho^2_{y\hat{y}}$$
q)r2:{[x;y;beta]a*a:y cor x$beta}
q)r2[X;rate;beta]
0.9763047
R-squared increases with the number of parameters in the model. This can lead to artificial inflation of the R-squared value if we just keep adding more parameters. To deal with this it is also useful to calculate the adjusted R-squared value: $$R^2_{adj}=1-\frac{(n-1)SSE}{(d.o.f.)SST}$$ where SST is the total sum of squares:$$SST=\sum^n_{i=1}(y_i-\bar{y})^2$$
q)sst:{x$x-:avg x}
q)ar2:{[x;y;beta]1-(count[y]-1)*sse[x;y;beta]%(count[y]-count beta)*sst[y]}
q)ar2[X;rate;beta]
0.9684062