Investment Studio > Expressions > Functions > Statistical > LINEST
float array[*][*] linest(float array known_ys, float array known_xs = {1, 2, 3 [, ...]}, boolean fit_constant = TRUE, boolean statistics = FALSE)
Returns the least square fit of the plane in N dimensions (independent variables)
| N | ||
| y(x1, x2, x3, ... xN) = b + | S | mk xk |
| k = 1 |
to a set of known y(x) pairs.
Input
known_ys is the set of known y values. If it contains a single column (row), each column (row) in known_xs is interpreted as containing the values of a separate independent variable. Note that this can be used for polynomial fitting by putting different powers of the same independent variable in different columns (rows).
If known_ys contains multiple columns and rows, known_xs is interpreted as containing the values of a single independent variable. Corresponding elements in the two arrays are then determined by order of appearance (reading from left to right, top to bottom).
known_xs is the set of known x values. It can contain one or more independent variables. If only one independent variable is used, known_xs and known_ys can have any shape(s) as long as they contain the same number of elements; corresponding elements in the two arrays are then determined by order of appearance (reading from left to right, top to bottom). If more than one independent variable is used, known_ys must be a column (row) vector, and known_xs must contain a column (row) for each independent variable.
known_xs may be omitted, in which case it defaults to the array {1, 2, 3, ...} with the same number of elements as known_ys.
fit_constant is used to specify how the b (intercept) parameter of the fit is to be computed. If fit_constant = TRUE, the value of b returned by the normal least square fit is used. If fit_constant = FALSE, b is forced to = 0 and the mk parameters are adjusted accordingly to still fit the known y values.
If fit_constant is omitted, it defaults to TRUE.
statistics is used to specify if extra statistics are to be appended to the coefficients of the fit returned by the function.
If statistics is omitted, it defaults to FALSE.
All array elements are converted to float, with exclusion if conversion fails.
Output
The result array is headed by a single row containing the N + 1 coefficients of the least square fit in "reverse" order: the last variable (mN) is in the first column, the intercept b in the last column.
If statistics = TRUE, this section is followed by:
| A row
containing the standard errors of the coefficients of the
fit (listed in the same order as the coefficients). The
standard error of the intercept b is included only if fit_constant
= TRUE. These values (along with the number of degrees of freedom, listed two rows below them) can be used to estimate how likely each coefficient is to encode useful information rather than just noise: significance level = tdist(abs(coefficient / standard error), degrees of freedom, 1) As usual in statistics, the smaller the significance level, the better (confidence level = 1 - significance level), with 5% being a common rejection threshold. |
||
A row
containing the coefficient of determination (first column)
and the standard error (estimated standard deviation) of
the known y values (second column).
The coefficient of determination is a real number Î [0, 1] obtained by dividing the regression sum of squares (see below) by the total sum of squares (the sum of the squares of the differences between the known y values and their arithmetic average). It can be interpreted as the proportion of the variance in the dependent variable attributable to the independent variables. A value of 0 means that the linear fit has no predictive power; a value of 1 means that there is perfect agreement between actual and predicted y values. For a single independent variable, the coefficient of determination reduces to the square of the Pearson product moment correlation coefficient (see function rsq). |
||
| A row
containing the F statistic (first column) and the number
of degrees of freedom (second column). These values can be used to estimate the overall significance level of the observed relationship between y and the independent variable(s), i.e. the probability that it's occurring by chance: significance level = fdist(F statistic, degrees of freedom, number of independent variables) Generally speaking, the larger the F statistic, the smaller the probability of the observed relationship being due to chance. |
||
| A row
containing the regression sum of squares (first column)
and the residual sum of squares (second column). The regression sum of squares is computed as the difference between the total sum of squares (the sum of the squares of the differences between the known y values and their arithmetic average) and the residual sum of squares. The residual sum of squares is the sum of the differences between actual and estimated y values in each known point. |
The following illustration summarizes the layout of the output array:

Examples
Basic slope and intercept
Given y values {1, 3, 2, 5, 4} and x values {2, 3, 4, 5, 6},
=linest({1, 3, 2, 5, 4}, {2, 3, 4, 5, 6})
returns {0.8, -0.2}, meaning that the linear least square fit to the data is y = 0.8 * x - 0.2.
Multiple linear regression
A real estate agent needs a guideline for pricing apartments based on floor space and fixed monthly costs. Ten apartments currently on the market have the following properties:

The expression
=linest(a2:a12, b2:c12, true, true)
returns (with rounding) {{46.3995, 1551.58, 68135.1}, {96.9, 500.5, 34768}, {0.808, 55304}, {16.9, 8}, {103213156627, 24468979736}} from which we read off
price » 46.3995 * monthly rate + 1551.58 * floor space + 68135.1
The coefficient of determination (0.808) in the first column of the third row tells us that there is a fairly strong relationship between the chosen independent variables (monthly rate, floor space) and the price.
To check that the observed relationship is not likely to have occurred by chance, we can feed the F statistic (16.9: first column, fourth row) and the degrees of freedom (8: second column, fourth row) to fdist:
=fdist(16.9, 8, 1)
» 19%. While this is quite a bit larger than the commonly used 5% significance threshold, it should do for a rough guideline.
At this point, some introductions to statistics will tell you that you can use the standard error of the known prices (55304: second column, third row) to augment your price estimates with confidence intervals. This is not correct. The standard error is just an estimate of the standard deviation of the prices in the sample. To estimate the error in a price computed from the fit you need to add up the variances of all contributing terms and take the square root of the result:
price variance = (monthly rate - average monthly rate in sample)^2 * (standard error of monthly rate coefficient)^2 + (floor space - average floor space in sample)^2 * (standard error of floor space coefficient)^2 + (standard error of intercept)^2
price standard deviation = sqrt(price variance)
Plugging in the averages (computed separately) and the standard errors of the cofficients from the second row in the result array,
price variance » (monthly rate - 469)^2 * 96.9^2 + (floor space - 103)^2 * 500.5^2 + 34768^2
Consider for instance monthly rate = 900 and floor space = 100. With these figures, price variance » 2.9E9 Þ price standard deviation » 54362. This is fairly close to the estimated standard error of the prices in the sample (55304) because we are close to the middle of the sample. On the other hand, with monthly rate = 1000 and floor space = 250, price variance » 9.3E9 Þ price standard deviation » 96300, almost twice the estimated standard error of the prices in the sample. The corresponding 95% confidence interval is
price ± confidence(1 - 95%, 1, 1) * (price standard deviation) » 500000 ± 190000
To identify the main source of uncertainty, we can compute the significance level of each coefficient using tdist. Starting with the intercept (third column), we have
=tdist(68135.1 / 34768, 8, 1)
» 4.3% (OK). For the dependence on price per square meter,
=tdist(1551.58 / 500.5, 8, 1)
» 0.8% (even better). The validity of our implied assumption that price is linearly dependent on monthly rate turns out to be rather more questionable:
=tdist(46.3995 / 96.9, 8, 1)
» 33%. Removing this variable entirely might actually provide a better fit.
S&P 500 vs. DAX
We want to see to which extent the German DAX tracked the US S&P 500 in the course of a year.
First we must put our data in a form suitable for analysis. Given arrays containing dates and daily closes for the two indices (as returned by function asset_quotes with format string "c") and referenced by symbols _DAX and _SP, we define the symbol
_LEN = 20
and compute the _LEN-day percentage moves in the two indices using the roc function:
_DAX_ROC = roc(array(_DAX), _LEN)
_SP_ROC = roc(array(_SP), _LEN)
Next, we match up the dates in the ROC arrays using the sync function:
_ROCS = sync(array(_DAX_ROC), array(_SP_ROC), FALSE)
This gives us a three-column array: column #1 contains a date; column #2 contains the _LEN-day percentage move in the DAX ending on that date; column #3 contains the _LEN-day percentage move in the S&P 500 ending on the same date. For instance, looking in the row dated January 31, 2000 (column #1) we see that the DAX closed that day 1.26% higher (column #2) than 20 trading days earlier, while the S&P 500 closed 5.09% lower (column #3) than 20 trading days earlier.
Now we have all the data we need. To find the linear relationship (if any) between columns #2 and #3 we define
_OFFSET = 1
and compute
_LINEST = linest(subarray(array(_ROCS), 1 + _OFFSET, 0, 2, 2), subarray(array(_ROCS), 1, rows(array(_ROCS)) - _OFFSET, 3, 3), TRUE, TRUE)
Here, we are using column #2 as our y (dependent) variable and column #3 as our single x (independent) variable. The shift of column #2 by _OFFSET means that we are looking for a relationship between y _OFFSET days into the future of x, or equivalently between y and x _OFFSET days in the past. Mathematically, we are looking for y = f(x - OFFSET).
With _LEN = 20 and _OFFSET = 1, we are effectively asking if knowing the 20-day ROC of the S&P 500 on day T allows us to predict the 20-day ROC of the DAX on day T + 1.
For the year 2000, the answer is as follows:
| m1
= index(array(_LINEST), 1, 1) » 0.33
s1 = index(array(_LINEST), 2, 1) » 0.077 degrees of freedom = index(array(_LINEST), 4, 2) = 245 Þ significance level of m1 = tdist(m1/s1, degrees of freedom, 1) » 1.3E-5 This is excellent. |
|
| b = index(array(_LINEST),
1, 2) » -0.31 sb = index(array(_LINEST), 2, 2) » 0.34 Þ significance level of sb = tdist(b/sb, degrees of freedom, 1) » 0.18 This is much larger than the common 5% rejection threshold. |
|
| coefficient
of determination = index(array(_LINEST), 3, 1) » 0.07 This implies very low predictive power. |
|
| standard
error for y estimate = index(array(_LINEST), 3, 2) » 5.3 (percentage points, since the ROCs
used as input to linest are expressed in percentage
points) This is a large error, confirming that a straight line is not a good fit to the data. |
|
| F
statistic = index(array(_LINEST), 4, 1) » 18.4 Þ significance level of observed relationship = fdist(F statistic, degrees of freedom, 1) » 0.18 This too is much larger than the common 5% rejection threshold. |
|
| regression
sum of squares = index(array(_LINEST), 5, 1) » 520 residual sum of squares = index(array(_LINEST), 5, 2) » 6939 |
While we can get better fits by increasing _LEN toward a full year (reflecting the fact that both the DAX and the S&P 500 were down on the year in 2000) the case for the DAX tracking the S&P 500 in the short and medium term during the year 2000 looks weak at best.
On the other hand, repeating the same exercise for the year 2001 yields a very different answer: still with _LEN = 20 and _OFFSET = 1, the coefficient of determination is no less than 0.79, and the significance level of the observed relationship (2.7%) is well below the common 5% rejection threshold. A resonable interpretation of these observations is that the apparent decoupling in the year 2000 was a consequence of the breaking of the long term uptrend of the 90s. Once a new trend was in place - this time, heading down - the relationship between the two indices was re-established.
See also correl, forecast, growth, intercept, logest, slope, steyx, trend.