
What does RMSE really mean?

James Moody

Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data. Formally, for predicted values ŷ₁, …, ŷₙ and observed values y₁, …, yₙ, it is defined as follows:

RMSE = √( Σᵢ (ŷᵢ − yᵢ)² / n )

Let's try to explore why this measure of error makes sense from a mathematical perspective. Ignoring the division by n under the square root, the first thing we notice is a resemblance to the formula for the Euclidean distance between two vectors in ℝⁿ:

d(ŷ, y) = √( (ŷ₁ − y₁)² + (ŷ₂ − y₂)² + … + (ŷₙ − yₙ)² )

This tells us heuristically that RMSE can be thought of as some kind of (normalized) distance between the vector of predicted values and the vector of observed values.
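The definition translates directly into code. Here is a minimal sketch in Python (the function name and toy data are illustrative, not from the original article):

```python
import math

def rmse(predicted, observed):
    """Root Mean Square Error between two equal-length sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

# Toy example: each prediction is off by exactly 0.5, so RMSE is 0.5.
predicted = [2.0, 3.0, 5.0, 7.0]
observed = [2.5, 2.5, 5.5, 6.5]
print(rmse(predicted, observed))  # 0.5
```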

But why are we dividing by n under the square root here? If we keep n (the number of observations) fixed, all it does is rescale the Euclidean distance by a factor of √(1/n). It's a bit tricky to see why this is the right thing to do, so let's delve a bit deeper.

Imagine that our observed values are determined by adding random "errors" to each of the predicted values, as follows:

yᵢ = ŷᵢ + εᵢ,  for i = 1, …, n

These errors, thought of as random variables, might have a Gaussian distribution with mean μ and standard deviation σ, but any other distribution with a square-integrable PDF (probability density function) would also work. We want to think of ŷᵢ as an underlying physical quantity, such as the exact distance from Mars to the Sun at a particular point in time. Our observed quantity yᵢ would then be the distance from Mars to the Sun as we measure it, with some errors coming from mis-calibration of our telescopes and measurement noise from atmospheric interference.

[Figure: the Sun and Mars (not to scale)]

The mean μ of the distribution of our errors would correspond to a persistent bias coming from mis-calibration, while the standard deviation σ would correspond to the amount of measurement noise. Imagine now that we know the mean μ of the distribution for our errors exactly and would like to estimate the standard deviation σ. We can see through a bit of calculation that:

E[ Σᵢ (ŷᵢ − yᵢ)² / n ]
  = E[ Σᵢ εᵢ² / n ]
  = Σᵢ E[εᵢ²] / n
  = E[ε²]
  = Var(ε) + (E[ε])²
  = σ² + μ²

Here E[…] is the expectation, and Var(…) is the variance. We can replace the average of the expectations E[εᵢ²] on the third line with the E[ε²] on the fourth line where ε is a variable with the same distribution as each of the εᵢ, because the errors εᵢ are identically distributed, and thus their squares all have the same expectation.

Remember that we assumed we already knew μ exactly. That is, the persistent bias in our instruments is a known bias, rather than an unknown bias. So we might as well correct for this bias right off the bat by subtracting μ from all our raw observations. That is, we might as well suppose our errors are already distributed with mean μ = 0. Plugging this into the equation above and taking the square root of both sides then yields:

√( E[ Σᵢ (ŷᵢ − yᵢ)² / n ] ) = σ

Notice the left hand side looks familiar! If we removed the expectation E[…] from inside the square root, it is exactly our formula for RMSE from before. The central limit theorem tells us that as n gets larger, the variance of the quantity Σᵢ (ŷᵢ − yᵢ)² / n = Σᵢ εᵢ² / n should converge to zero. In fact a sharper form of the central limit theorem tells us its variance should converge to 0 asymptotically like 1/n. This tells us that Σᵢ (ŷᵢ − yᵢ)² / n is a good estimator for E[Σᵢ (ŷᵢ − yᵢ)² / n] = σ². But then RMSE is a good estimator for the standard deviation σ of the distribution of our errors!
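We can check this convergence numerically. The sketch below (pure Python; the sample sizes and the choice σ = 2 are arbitrary, for illustration) draws Gaussian errors with mean 0 and computes the RMSE of the resulting "observations" against the underlying values. The estimate should approach σ as n grows:

```python
import math
import random

random.seed(0)
sigma = 2.0  # true noise standard deviation we are trying to recover

def rmse_of_noise(n):
    # Observed = predicted + Gaussian error with mean 0 and std sigma,
    # so (observed - predicted) is just the error, and RMSE reduces to
    # the square root of the mean squared error.
    errors = [random.gauss(0.0, sigma) for _ in range(n)]
    return math.sqrt(sum(e * e for e in errors) / n)

for n in (10, 1000, 100000):
    # The printed estimates drift toward sigma = 2.0 as n increases.
    print(n, round(rmse_of_noise(n), 3))
```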

We should also now have an explanation for the division by n under the square root in RMSE: it allows us to estimate the standard deviation σ of the error for a typical single observation rather than some kind of "total error". By dividing by n, we keep this measure of error consistent as we move from a small collection of observations to a larger collection (it just becomes more accurate as we increase the number of observations). To phrase it another way, RMSE is a good way to answer the question: "How far off should we expect our model to be on its next prediction?"

To sum up our discussion, RMSE is a good measure to use if we want to estimate the standard deviation σ of a typical observed value from our model's prediction, assuming that our observed data can be decomposed as:

yᵢ = ŷᵢ + εᵢ  (observed value = model prediction + random noise)

The random noise here could be anything that our model does not capture (e.g., unknown variables that might influence the observed values). If the noise is small, as estimated by RMSE, this generally means our model is good at predicting our observed data, and if RMSE is large, this generally means our model is failing to account for important features underlying our data.

Subtleties of Using RMSE in Data Science

In data science, RMSE has a double purpose:

  • To serve as a heuristic for training models
  • To evaluate trained models for usefulness / accuracy

This raises an important question: What does it mean for RMSE to be "small"?

We should note first and foremost that "small" will depend on our choice of units, and on the specific application we are hoping for. 100 inches is a big error in a building design, but 100 nanometers is not. On the other hand, 100 nanometers is a small error in fabricating an ice cube tray, but perhaps a big error in fabricating an integrated circuit.

For training models, it doesn't really matter what units we are using, since all we care about during training is having a heuristic to help us decrease the error with each iteration. We care only about the relative size of the error from one step to the next, not its absolute size.

But in evaluating trained models in data science for usefulness/accuracy, we do care about units, because we aren't just trying to see if we're doing better than last time: we want to know if our model can actually help us solve a practical problem. The subtlety here is that evaluating whether RMSE is sufficiently small or not will depend on how accurate we need our model to be for our given application. There is never going to be a mathematical formula for this, because it depends on things like human intentions ("What are you intending to do with this model?"), risk aversion ("How much harm would be caused if this model made a bad prediction?"), etc.

Besides units, there is another consideration too: "small" also needs to be measured relative to the type of model being used, the number of data points, and the history of training the model went through before you evaluated it for accuracy. At first this may sound counter-intuitive, but not when you remember the problem of over-fitting.

There is a risk of over-fitting whenever the number of parameters in your model is large relative to the number of data points you have. For example, if we are trying to predict one real quantity y as a function of another real quantity x, and our observations are (xᵢ, yᵢ) with x₁ < x₂ < x₃ …, a general interpolation theorem tells us there is some polynomial f(x) of degree at most n − 1 with f(xᵢ) = yᵢ for i = 1, …, n. This means if we choose our model to be a polynomial of degree n − 1, by tweaking the parameters of our model (the coefficients of the polynomial), we would be able to bring RMSE all the way down to 0. This is true regardless of what our y values are. In this case RMSE isn't really telling us anything about the accuracy of our underlying model: we were guaranteed to be able to tweak parameters to get RMSE = 0 as measured on our existing data points, regardless of whether there is any relationship between the two real quantities at all.
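To make this concrete, here is a small sketch (pure Python, with made-up data) that evaluates the Lagrange interpolating polynomial through six points whose y values are pure noise. The training RMSE is zero up to floating-point error, even though there is no relationship to learn:

```python
import math
import random

random.seed(1)

# Six data points whose y values are pure noise: no underlying relationship.
xs = [float(i) for i in range(6)]
ys = [random.uniform(-1.0, 1.0) for _ in xs]

def lagrange(x):
    # Evaluate the unique degree <= n-1 polynomial passing through
    # every (xs[i], ys[i]), via the Lagrange interpolation formula.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# RMSE of the interpolant on its own training points: effectively 0,
# because the "model" has memorized the noise exactly.
train_rmse = math.sqrt(sum((lagrange(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
print(train_rmse)
```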

But it's not only when the number of parameters exceeds the number of data points that we might run into problems. Even if we don't have an absurdly excessive number of parameters, it may be that general mathematical principles, together with mild background assumptions on our data, guarantee with high probability that by tweaking the parameters in our model, we can bring the RMSE below a certain threshold. If we are in such a situation, then RMSE being below this threshold may not say anything meaningful about our model's predictive power.

If we wanted to think like a statistician, the question we would be asking is not "Is the RMSE of our trained model small?" but rather, "What is the probability the RMSE of our trained model on such-and-such set of observations would be this small by random chance?"

These kinds of questions get a bit complicated (you actually have to do statistics), but hopefully y'all get the picture of why there is no predetermined threshold for "small enough RMSE", as easy as that would make our lives.


Source: https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e