Wednesday, June 26, 2019

The Student t Distribution


  If one adds together a set of n normal random numbers with the same mean μ and standard deviation σ, the square of the standard deviation of the sum is the sum of the squares of the standard deviations of each number, and since each is the same, σΣ = σ√n and for an average σavg = σ/√n. Student showed in 1908 that when σavg is replaced by an estimate computed from the sample itself, the z-value z=(x̄-μ)/σavg no longer follows a normal distribution; the resulting probability density for the average has become known as the Student t distribution. The variable t for which the distribution is defined is such a z-value. A formula for the probability density p(t) can be found in the Wikipedia article Student's t-distribution. We can evaluate this function for a number of degrees of freedom (DF) in Excel.


As the number of degrees of freedom increases the t-distribution approaches a normal distribution with mean μ=0 and standard deviation σ=1.
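As a check on the Excel results, one can evaluate the same density formula from the Wikipedia article with a short Python sketch; the function names here are not from the worksheet:

```python
import math

def t_pdf(t, df):
    """Student t probability density for df degrees of freedom
    (formula from the Wikipedia article Student's t-distribution)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def normal_pdf(z):
    """Standard normal density with mean 0 and standard deviation 1."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# As the degrees of freedom increase, the t density at any point
# approaches the standard normal density.
for df in (1, 5, 30, 100):
    print(df, round(t_pdf(0.0, df), 4), round(normal_pdf(0.0), 4))
```

The printed values show the peak of the t density climbing toward the normal value 1/√(2π) ≈ 0.3989 as DF grows.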


Using Excel's t.dist function we get the same results.


Saturday, June 22, 2019

Doing Histograms in Excel


  We've seen how one can generate a set of normally distributed random numbers by combining the Excel norm.inv and rand functions, but creating a histogram of the results gives a good visual check on the data. The example below used mean value μ=0 and standard deviation σ=0.5. To save the results for a given set of random numbers it's convenient to work with a pasted copy of just the values. One can also use a macro to simplify the copy and paste values routine.
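The norm.inv(rand(), μ, σ) combination is inverse-transform sampling, and the same trick can be sketched with Python's standard library (the seed and sample size here are my own choices, not from the worksheet):

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)  # fixed seed so the run is reproducible

# Mirror of the Excel formula norm.inv(rand(), mu, sigma):
# feed uniform random numbers through the inverse normal CDF.
mu, sigma = 0.0, 0.5
dist = NormalDist(mu, sigma)
x = [dist.inv_cdf(random.random()) for _ in range(1000)]

# The sample mean and standard deviation should be near mu and sigma.
print(round(mean(x), 3), round(stdev(x), 3))
```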


For the histogram one needs a set of numbers to designate the sides of the bins and a count of the number of random numbers in each bin. This can be done in two steps using the countif function: first get a count of the numbers greater than or equal to the left side of each bin, then take the difference between consecutive counts. Once entered for one line, one can use drag and fill for the remaining lines, but one needs to be careful about the initial and final values. Naming the fixed set of random numbers with a letter such as x makes the formulas simpler and easier to read.
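The two-step counting routine can be sketched in Python; the bin edges and sample here are illustrative choices, not the worksheet's:

```python
import random

random.seed(2)
x = [random.gauss(0.0, 0.5) for _ in range(1000)]

edges = [-2.0 + 0.25 * k for k in range(17)]           # left sides of the bins
ge = [sum(1 for v in x if v >= e) for e in edges]      # countif(x, ">=" & edge)
cnt = [ge[i] - ge[i + 1] for i in range(len(ge) - 1)]  # consecutive differences

print(cnt, sum(cnt))
```

The difference of cumulative counts gives exactly the number of values with left_edge ≤ v < right_edge for each bin.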


If one compares these counts with those from the histogram wizard found in the Analysis ToolPak add-in, one sees the wizard's counts are associated with the value for the right side of each bin.


To plot a histogram one needs to create a table containing double entries for each count. In the partial table below the offset function is used to copy values from the named bins and cnt ranges above, with the help of the floor function to pick out the correct values. The initial and final histogram values are set to zero.
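The double-entry table can be sketched as follows; the i-th x entry reads edges[i // 2], which plays the role of offset(...) combined with floor(i/2) in the worksheet (the edge and count values are hypothetical):

```python
edges = [0.0, 0.5, 1.0, 1.5, 2.0]   # sides of the bins (hypothetical values)
cnt = [3, 7, 5, 2]                  # counts in each bin (hypothetical values)

# Each edge appears twice; each count appears twice, padded with zeros
# at the ends so the step-function outline closes at the baseline.
xs = [edges[i // 2] for i in range(2 * len(edges))]
ys = [0] + [c for c in cnt for _ in (0, 1)] + [0]

for x_, y_ in zip(xs, ys):
    print(x_, y_)
```

Plotting xs against ys as a line chart gives the step function for the histogram.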


So when plotted one gets the step function for the histogram.


The way the worksheet was set up, it is automatically recalculated every time the copy and paste macro shortcut is used. The histogram and plot produced by the histogram wizard have to be redone each time the values in x are changed.

Using the above set of procedures makes it easier to select a good-looking histogram and also shows the variation that naturally occurs.

Supplemental (Jun 23): The argument in favor of using the leftmost x value to designate a bin is that we read from left to right. The choice may be a cultural preference.

Supplemental (Jun 23): There's a lot of culture clash in the computer industry with differing file types and operating systems. Data structures and code can influence the choices programmers make which may seem strange to a mathematician. Usually the number line is read from left to right and tables are read from top to bottom. Using countif with "<" is an alternative choice for the histogram table but reversing the order of the difference doesn't change the results in the cnt column.

Thursday, June 13, 2019

An Executive Summary on Systematic Error for Least Squares


 If one needs a quick reference, an executive summary on the systematic error for an ordinary least squares fit might prove useful.



Science and Education employ transmission lines for communication, and losses can occur as one moves farther from the source in distance and time and as the quality of the source varies. If we want scientists and educators to be responsible citizens of a country, or of the world at large, they must all share the responsibility of keeping watch for misrepresentations of knowledge passed on to others. If one is for a "rule of law" one cannot disregard the "laws of nature."

When presenting a thesis or the results of an experiment one needs to adopt a defensible position. Any systematic error present raises some doubts about the conclusions or results. If one wants to build up some structure it has to be built on solid ground.

Supplemental (Jun 13): Here's an example of the determination of an ordinary least squares regression line in Barreto & Howland, Introductory Econometrics, and an indication that the slope is an unbiased estimator. The formulas employ a different notation but are equivalent to what is used above.


The denominator used in the book is the sum of the squared deviations from the mean, whose average is the variance, the square of the spread sx used above. The numerator is related to the covariance of X and Y.

Sunday, June 9, 2019

A Simple Derivation of the Factors in the Least Squares Fit Error Formulas


  The most difficult part of finding expressions for the expected values of the slope and intercept as functions of the random errors involved in least squares fits is getting a handle on the problem. One can start by writing the x value as the sum of an expected value, ⟨x⟩, a displacement along the x-axis, Δx, and a random error, δx, and doing the same for the y value. To simplify the derivation we start with two variables u and v instead of x and y, combine the expressions to get an expression for the product, then determine the expected values of the terms. The expression simplifies considerably since the expected values of the differences and the expected values of products of uncorrelated differences are zero.


Setting u and v equal to x gives the formula for the standard deviation of x as a function of the standard deviation of the errors in x and similarly for y. When u=x and v=y we have to be a little more careful since the Δx and Δy are correlated due to their dependence on the equation for the line. As shown through the "diagnostic" the numerator for the ordinary least squares for the slope is essentially a constant with some errors due to the expected values of the product terms not being exactly zero. Evaluating this factor using the estimated expected values and exact values gives results that agree fairly well.
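Written out, the expansion runs as follows (a sketch in the notation of the post, with Δ the displacement along the line and δ the random error):

```latex
% each variable split into expected value, displacement, and random error
u = \langle u\rangle + \Delta u + \delta u ,\qquad
v = \langle v\rangle + \Delta v + \delta v
% multiplying and taking expected values; the cross terms vanish since
% \langle\Delta u\rangle = \langle\delta u\rangle = 0 and the random
% errors are uncorrelated with the displacements
\langle uv\rangle = \langle u\rangle\langle v\rangle
  + \langle\Delta u\,\Delta v\rangle + \langle\delta u\,\delta v\rangle
% u = v = x gives the Pythagorean combination of the spreads
\sigma_x^2 = \sigma_{x0}^2 + \delta x^2
% u = x, v = y with \Delta y = s\,\Delta x for points on the line
% gives the numerator of the ordinary least squares slope
\langle xy\rangle - \langle x\rangle\langle y\rangle
  = \langle\Delta x\,\Delta y\rangle = s\,\sigma_{x0}^2
```

The last line is the near-constant numerator checked numerically below.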

⟨xy⟩−⟨x⟩⟨y⟩ = 0.1797 vs s·σx0² = 0.1812.

With errors present the expected values of the errors in the slope and intercept for the fits are not zero as one might expect, but their deviation appears to be negligible for small random errors.

Supplemental (Jun 12): Note a systematic error in a linear fit might raise some doubts about the validity of the Gauss-Markov theorem.

Saturday, June 8, 2019

Effect of Random Errors on the LS Diagnostic Results


  Are we quibbling over minutia with the least squares diagnostic? The information may be useful for the design of experiments. For example if one wanted to measure some chemical rate constant one might try to fit a large number of data points on a line or combine the results of a number of independent researchers. The theory of errors is useful but one also has to be concerned with systematic errors.

The observed error curves for the least squares fits in the last blog were replaced with formulas for the errors, but in practice these use expected values which are themselves subject to statistical error. Here is an example of how each pair of points on the curves was determined. For the slopes and intercepts, the fits were repeated a large number of times and the values calculated. From the convergence of the partial sums we get an estimate of the limiting value, but the actual values have a lot of variation in them. A comparison of histograms shows that there is a difference between the two fits, but there is also a lot of overlap in the values.





The shift in the peaks of the curves is small compared with the spread of the observed results. The histograms above were not precise enough to compute a mean value for the distributions or to compare results. The convergence curves showed the discrepancy more clearly, although one can see that the two histograms are shifted by about Δs=0.006.

Supplemental (Jun 8): Note the values for the slope and intercept indicated above are the batch averages for 20 fits. One would expect the spread for individual fits to be about 4.5 times greater and the spread for the set averages to be reduced by a factor of 5.5. So it appears that the discrepancy for random errors in the data less than 5% is negligible. The diagnostic just allowed us to get the formulas for the expected values for slope and intercept as a function of the random error in the data.

Supplemental (Jun 8): books on precision measurements and systematic errors

1897  Holman - Discussion of the Precision of Measurements
1969  NBS - Precision Measurements and Calibration: Statistical Concepts and Procedures

Supplemental (Jun 9): I used an index to tag the x,y data sets for the batches and ended up confusing the number of points per line (20) with the number of lines per batch (25) in this and some preceding posts. The increase in spread for individual line fits over the histograms above would therefore be √25=5 instead of 4.5. I caught the error while running a check to verify that the fit formulas returned the original slope and intercept for exact data points and while looking for other errors in the spreadsheet formulas. The data below include the expected values and standard deviations, which would be hidden values if error were present.


Friday, June 7, 2019

More on the Least Squares Fit Systematic Errors


  I decided to do a diagnostic to determine the source of the systematic errors in ordinary least squares and transverse least squares fits and came to the conclusion they are due to changes in the spread of the data caused by the random errors. We know from data analysis that the rms error for two independent sources of error is computed using the Pythagorean theorem, so the square of the spread of the values of x, σx², will be the sum of the squares of a constant term, σx0, and the rms error in x, δx. The same is true for the spread of y.


The diagnostic gave the following results for various values of δx. The initial line had slope s=1.25, intercept y0=0.5, and rms error in y δy=0.03.
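A diagnostic of this kind can be sketched in Python. The line parameters s=1.25, y0=0.5, and δy=0.03 are taken from the post; the 20 evenly spaced x values and the number of repeated fits are my own assumptions:

```python
import random

random.seed(3)  # fixed seed so the run is reproducible

def ols(xs, ys):
    """Ordinary least squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    slope = sxy / sxx
    return slope, my - slope * mx

s, y0, dy = 1.25, 0.5, 0.03
x0 = [0.05 * k for k in range(20)]            # assumed exact x positions
mx0 = sum(x0) / len(x0)
var_x0 = sum((x - mx0) ** 2 for x in x0) / len(x0)   # sigma_x0 squared

results = {}
for dx in (0.0, 0.05, 0.10):
    slopes = []
    for _ in range(2000):
        xs = [x + random.gauss(0, dx) for x in x0]
        ys = [s * x + y0 + random.gauss(0, dy) for x in x0]
        slopes.append(ols(xs, ys)[0])
    avg = sum(slopes) / len(slopes)
    # sigma_x^2 = sigma_x0^2 + dx^2, so the average fitted slope is
    # pulled down by roughly the factor sigma_x0^2 / (sigma_x0^2 + dx^2)
    predicted = s * var_x0 / (var_x0 + dx * dx)
    results[dx] = (avg, predicted)
    print(dx, round(avg, 4), round(predicted, 4))
```

With δx=0 the average slope sits at 1.25; as δx grows the fitted slope falls, tracking the Pythagorean growth of σx² in the denominator.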




The numerator of the slope formula for ordinary least squares is essentially a constant. Only the square of σx showed a quadratic dependence on δx.

This seems to present an opportunity to adjust for the systematic errors associated with these least squares fits.

Saturday, June 1, 2019

Both Least Squares Methods Make Specific Assumptions About Errors Which Affect Bias


  The apparent bias in the ordinary least squares fit appears to be due to a subtle error in the assumption concerning the errors along the x-axis. Ordinary least squares assumes the values on the x-axis are exact, while transverse least squares assumes both variables are subject to error. If one removes the errors in the x values, one finds ordinary least squares gives the better estimate for the averages of the coefficients of the line when the law of large numbers is applied to the cumulative averages.
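The cumulative-average check can be sketched as follows; the line parameters match earlier posts, while the number of fits and the exact x positions are assumptions of the sketch:

```python
import random

random.seed(4)  # fixed seed so the run is reproducible

def ols_slope(xs, ys):
    """Ordinary least squares slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

s, y0, dy = 1.25, 0.5, 0.03
x0 = [0.05 * k for k in range(20)]   # x values held exact, no error added

# Cumulative average of the fitted slope over repeated fits: with the
# x values exact it settles on the true slope (law of large numbers).
total = 0.0
cum = []
for i in range(1, 5001):
    ys = [s * x + y0 + random.gauss(0, dy) for x in x0]
    total += ols_slope(x0, ys)
    cum.append(total / i)

print(round(cum[99], 4), round(cum[-1], 4))
```

Repeating the run with errors added to the x values as well would reproduce the downward bias discussed above.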



A discrepancy between the assumptions in the fit method and the data can result in a blunder which is an example of a systematic error.