WEBVTT Kind: captions; language: en-us
NOTE
Accuracy: 91% (HIGH)
00:00:00.700 --> 00:00:09.500
In this video, we will begin to talk about how we can deal with variability and in particular how to
00:00:09.500 --> 00:00:16.900
derive a summary estimate from variable data. First of all, let us talk about a very important word
00:00:16.900 --> 00:00:24.800
that you will hear many times in the context of statistics: the word error. This word doesn't mean what
00:00:24.800 --> 00:00:29.900
it usually means in everyday communication. In the context of statistics when
NOTE
Accuracy: 91% (HIGH)
00:00:29.900 --> 00:00:37.200
we talk about error, we usually mean some form of variability, and usually unwanted
00:00:37.200 --> 00:00:46.349
variability, otherwise known as noise. Error may also mean the difference from some assumed
00:00:46.349 --> 00:00:53.400
true value, which is unknown to us. There can be systematic error in which case it is also called
00:00:53.400 --> 00:00:54.950
bias.
NOTE
Accuracy: 91% (HIGH)
00:00:54.950 --> 00:01:04.500
Error can generally mean uncertainty, a lack of confidence. Variability is always associated with
00:01:04.500 --> 00:01:06.400
uncertainty.
NOTE
Accuracy: 91% (HIGH)
00:01:07.000 --> 00:01:16.100
Error often means difference from a prediction. If we have a model to predict some values and we
00:01:16.100 --> 00:01:23.100
have some actual observed values, then the difference between these two is the prediction error or
00:01:23.100 --> 00:01:29.800
the residual error: whatever is left over after we're done predicting. The important thing here is
00:01:29.800 --> 00:01:33.500
that error does not mean mistake.
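NOTE
The prediction error described here can be sketched in a few lines of Python. The numbers are invented purely for illustration and do not come from the lecture, and the sign convention (observed minus predicted) is a common choice assumed here, not something the speaker specifies.

```python
# Hypothetical observed values and model predictions (not from the lecture).
observed = [10.0, 12.5, 9.0, 14.0]
predicted = [11.0, 12.0, 10.0, 13.0]

# Residual (prediction error) = observed value minus predicted value.
# A nonzero residual is not a mistake; it is what is left over after predicting.
residuals = [obs - pred for obs, pred in zip(observed, predicted)]

print(residuals)  # [-1.0, 0.5, -1.0, 1.0]
```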
NOTE
Accuracy: 91% (HIGH)
00:01:33.500 --> 00:01:40.900
Error means that our values are different from one another, different from what we want, or different
00:01:40.900 --> 00:01:43.150
from what we predict.
NOTE
Accuracy: 90% (HIGH)
00:01:43.150 --> 00:01:50.500
To illustrate the ubiquity of error, when I was teaching statistics in an actual classroom I would
00:01:50.500 --> 00:01:59.600
ask for six student volunteers at the beginning of the first class. And I would ask one of them to
00:01:59.600 --> 00:02:07.900
stand out, give a measuring tape to the others and ask them to measure the height of the one brave
00:02:07.900 --> 00:02:14.000
volunteer. So, each person would take the measuring tape, go and measure the height
NOTE
Accuracy: 91% (HIGH)
00:02:14.000 --> 00:02:22.300
of the one designated person, write down the measurement on a piece of paper without announcing it, and fold
00:02:22.300 --> 00:02:28.900
the paper so that it is not visible. And then the next person would do the same thing, and so on.
00:02:29.000 --> 00:02:37.600
This is the result from 2019. These are the actual writings of the five students who wrote down the
00:02:37.600 --> 00:02:43.750
height of one person measured in class. So we get these five
NOTE
Accuracy: 88% (HIGH)
00:02:43.750 --> 00:02:50.300
numbers that are supposed to be the height of the same person, measured at the same time, in the
00:02:50.300 --> 00:02:55.900
same place, with the same instrument, and you may be thinking that something is wrong. But this is
00:02:55.900 --> 00:03:04.550
actually what always happens. These are the numbers from the year before. So we have five values for
00:03:04.550 --> 00:03:12.500
person A's height and five values for person B's height. This kind of variability, which is known as
00:03:12.500 --> 00:03:13.750
measurement error,
NOTE
Accuracy: 84% (HIGH)
00:03:13.750 --> 00:03:20.300
is very common. In fact, it has always happened, every year I've taught a class like this, and it's a
00:03:20.300 --> 00:03:25.700
very robust effect. If you think it's weird, you can ask five of your friends to do it and see it
00:03:25.700 --> 00:03:35.000
for yourself. So this is the reality: reality is variability. And this raises an important question
00:03:35.000 --> 00:03:42.900
that needs a principled and justified answer. And the question is, what is the person's height?
NOTE
Accuracy: 91% (HIGH)
00:03:42.900 --> 00:03:49.600
Of course, there is also the sobering consideration that in real life if you wanted to measure
00:03:49.600 --> 00:03:55.800
someone's height you would only measure it once, which is like having only one of these five
00:03:55.800 --> 00:04:01.100
numbers and not having the others. This prevents you from realizing that there is actually error
00:04:01.100 --> 00:04:08.400
associated with your measurement. But if you repeat it several times, then the error just shows up
00:04:08.400 --> 00:04:13.100
and you realize that no estimate is free from variability.
NOTE
Accuracy: 91% (HIGH)
00:04:13.100 --> 00:04:18.549
So what are we supposed to answer when someone asks, what is this person's height?
NOTE
Accuracy: 84% (HIGH)
00:04:18.549 --> 00:04:24.800
We need an answer, obviously, based on the observations, the data we have, and these are the data we
00:04:24.800 --> 00:04:32.549
have. How are we supposed to choose from these? What makes a good answer?
NOTE
Accuracy: 84% (HIGH)
00:04:32.549 --> 00:04:40.500
We could use a statistical criterion, which is to minimize error, in the sense of making the
00:04:40.500 --> 00:04:46.800
smallest error, or the smallest distance from the actual true height, but we don't know what
00:04:46.800 --> 00:04:54.300
the true height is. So this is kind of a theoretical definition. In statistics we can actually find
00:04:54.300 --> 00:05:01.000
out, or rather the mathematicians can find out for us, which procedure based on these data will give
00:05:01.000 --> 00:05:03.150
us answers that are
NOTE
Accuracy: 85% (HIGH)
00:05:03.150 --> 00:05:11.400
as close as possible to the actual value using models. We're not going to do that. But it's good
00:05:11.400 --> 00:05:17.600
to have in mind that we need to understand what we mean by a good answer before we can begin to
00:05:17.600 --> 00:05:19.549
think about giving one.
NOTE
Accuracy: 91% (HIGH)
00:05:19.549 --> 00:05:27.900
Let us now see some common ways to produce summary answers from a set of data like this.
NOTE
Accuracy: 91% (HIGH)
00:05:28.400 --> 00:05:39.400
The simplest approach is democracy. Let them vote. So each value gets as many votes as the number of times
00:05:39.400 --> 00:05:46.700
it appears in the data set. In this case we find the value that occurs most often. That's the value
00:05:46.700 --> 00:05:56.750
162 centimeters, and this is called the mode: the value that occurs most often in the data set.
NOTE
Accuracy: 82% (HIGH)
00:05:56.750 --> 00:06:05.000
Unfortunately, there is one complication with the mode and that's that there may not be one value
00:06:05.000 --> 00:06:14.850
that is seen most often. For example, the measurements from 2019 for this same class exercise were the
00:06:14.850 --> 00:06:24.400
numbers you saw before, which have 172 twice and 171 twice. In this case there is no mode.
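NOTE
The voting idea, including the tie case just described, might be sketched as follows. The function name `unique_mode` and both lists are hypothetical: the actual class measurements are shown on screen and not in this transcript, so only the "162 most often" and "172 twice, 171 twice" patterns are taken from the lecture. Returning `None` on a tie follows the speaker's convention that the mode is then undefined.

```python
from collections import Counter


def unique_mode(values):
    """Return the single most frequent value, or None if first place is tied."""
    ranked = Counter(values).most_common()  # (value, count) pairs, most frequent first
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # two values share the top count, so there is no single mode
    return ranked[0][0]


# Hypothetical heights in cm, made up to match the patterns mentioned in the lecture.
heights_a = [161, 162, 162, 162, 163]
heights_b = [171, 171, 172, 172, 173]

print(unique_mode(heights_a))  # 162
print(unique_mode(heights_b))  # None
```

Other conventions exist: Python's own `statistics.multimode` would report both 171 and 172 for the second list, which some textbooks call bimodal.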
NOTE
Accuracy: 91% (HIGH)
00:06:25.500 --> 00:06:34.200
Another approach is to first sort the observations from smallest to largest.
NOTE
Accuracy: 91% (HIGH)
00:06:34.300 --> 00:06:42.000
So these are the numbers we obtained and we just put them in order. So this is the smallest
00:06:42.000 --> 00:06:45.600
value, progressively going on
NOTE
Accuracy: 91% (HIGH)
00:06:45.700 --> 00:06:48.750
to the largest value.
NOTE
Accuracy: 91% (HIGH)
00:06:48.750 --> 00:06:56.500
And then pick the one in the middle and that is called the median.
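NOTE
The sort-and-pick-the-middle procedure can be sketched like this. The function name `middle_value` and the five heights are hypothetical, and the sketch deliberately covers only an odd number of values, where a single middle one exists.

```python
def middle_value(values):
    """Sort the values and return the one in the middle (odd number of values only)."""
    if len(values) % 2 == 0:
        raise ValueError("this sketch only handles an odd number of values")
    ordered = sorted(values)            # smallest to largest
    return ordered[len(ordered) // 2]   # the middle position


heights = [163, 162, 161, 162, 162]  # hypothetical heights in cm, in the order collected
print(middle_value(heights))  # 162
```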
NOTE
Accuracy: 91% (HIGH)
00:06:58.600 --> 00:07:08.800
Finally, the third approach, one that you are certainly familiar with, is to allow all values to
00:07:08.800 --> 00:07:18.050
influence our answer in the same way. So, if we have these five observations, what we can do is to
00:07:18.050 --> 00:07:20.450
add them up.
NOTE
Accuracy: 88% (HIGH)
00:07:20.450 --> 00:07:24.049
And divide by how many there are.
NOTE
Accuracy: 90% (HIGH)
00:07:24.049 --> 00:07:35.250
And this is the mean, and it is symbolized with the Greek letter μ. You already know how to do this.
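NOTE
Written as a formula, the add-up-and-divide recipe looks roughly like this. The symbols x_i for the individual observations and n for how many there are (here n = 5) are notation assumed for this sketch; only μ is named in the lecture.

```latex
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}
```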
NOTE
Accuracy: 90% (HIGH)
00:07:35.250 --> 00:07:43.500
The issue that arises with the mean is what happens if some values are not as good as other values.
00:07:43.500 --> 00:07:48.400
I'm going to give you an example of what I mean by that.
NOTE
Accuracy: 91% (HIGH)
00:07:48.800 --> 00:07:57.650
So our original observations, these numbers, have produced the indices we have already discussed:
00:07:57.650 --> 00:08:01.500
the mode, the median and the mean.
NOTE
Accuracy: 91% (HIGH)
00:08:01.500 --> 00:08:09.900
And as you see these three possible answers are very close to each other. But now, let us imagine
00:08:09.900 --> 00:08:16.350
that one of the five students actually made a mistake, the tape slipped, and instead of reporting
00:08:16.350 --> 00:08:19.900
this value, we got this one.
NOTE
Accuracy: 91% (HIGH)
00:08:19.900 --> 00:08:28.300
But we don't know that a mistake was made. All we have are these five numbers to work with. So if we
00:08:28.300 --> 00:08:38.500
just calculate the same indices, the mode is again 162. The median is again 162. The mean is now
00:08:38.500 --> 00:08:46.900
almost 163, that's quite a bit higher than the previous one. So, what happened with the mean is it
00:08:46.900 --> 00:08:50.300
was most affected by the occurrence
NOTE
Accuracy: 91% (HIGH)
00:08:50.300 --> 00:08:56.800
of a problem in the measurement process. And if we weren't aware of this mistake, and we're just
00:08:56.800 --> 00:09:04.900
using the numbers as if everything was fine, then this would result in a biased
00:09:04.900 --> 00:09:10.300
estimate, a misestimation of the height of this person using the mean.
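NOTE
The comparison above can be reproduced with Python's standard `statistics` module. Both lists are hypothetical: the real measurements are on the slide, so these values were chosen only to be consistent with what is said (mode 162, median 162, and a mean that moves to almost 163 after one slipped reading).

```python
from statistics import mean, median, mode

# Hypothetical heights in cm, not the actual class data.
original = [161, 162, 162, 162, 163]
with_slip = [161, 162, 162, 162, 167]  # one reading replaced by a slipped-tape value

for label, data in [("original", original), ("with slip", with_slip)]:
    # The mode and median stay at 162; only the mean moves (162 -> 162.8).
    print(label, "mode:", mode(data), "median:", median(data), "mean:", mean(data))
```

In this sketch a single off reading shifts the mean by 0.8 cm while the other two indices do not change at all.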
NOTE
Accuracy: 89% (HIGH)
00:09:11.400 --> 00:09:18.600
Another example of a problem that can arise with the mean: let's imagine that you have applied for a
00:09:18.600 --> 00:09:25.400
summer job. And then you visited the place that you have applied to and you found five people
00:09:25.400 --> 00:09:32.800
working there and you ask them how much they earn. You're interested to know what your salary may be,
00:09:32.800 --> 00:09:36.200
as you consider the possibility of working there.
NOTE
Accuracy: 76% (HIGH)
00:09:36.200 --> 00:09:42.900
So let's assume you asked everyone you saw there and they all answered truthfully and you got these
00:09:42.900 --> 00:09:45.300
answers.
NOTE
Accuracy: 88% (HIGH)
00:09:45.300 --> 00:09:50.300
These are five numbers from five people.
NOTE
Accuracy: 91% (HIGH)
00:09:50.300 --> 00:09:56.300
The mode for these numbers is not defined because no value appears twice.
NOTE
Accuracy: 91% (HIGH)
00:09:56.300 --> 00:09:59.750
The median is 160.
NOTE
Accuracy: 81% (HIGH)
00:09:59.750 --> 00:10:06.000
And the mean is 222 kroner per hour.
NOTE
Accuracy: 91% (HIGH)
00:10:06.200 --> 00:10:11.150
How much would you expect to be paid based on these data?
NOTE
Accuracy: 88% (HIGH)
00:10:11.150 --> 00:10:19.100
Well, if you take up the job, you will realize that one of the five people that you asked actually
00:10:19.100 --> 00:10:27.400
was the boss. So, this number is a different kind of number from all of these. They're not all
00:10:27.400 --> 00:10:34.550
measuring the same thing: the hourly wage of an employee like the one that you would be.
NOTE
Accuracy: 91% (HIGH)
00:10:34.550 --> 00:10:42.100
So, by allowing the mean to be influenced equally by all the answers, you have produced
00:10:42.100 --> 00:10:48.300
an estimate of how much you would be paid using the procedure of the mean that is actually not a
00:10:48.300 --> 00:10:49.550
good one.
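NOTE
The wage example can be sketched in the same way. The five wages below are hypothetical: the individual answers are not in this transcript, so they were constructed only to match the quoted summaries (no repeated value, median 160, mean 222 kroner per hour), with the last one playing the role of the boss.

```python
from statistics import mean, median, multimode

# Hypothetical hourly wages in kroner; the last value stands in for the boss.
wages = [140, 150, 160, 170, 490]

print(median(wages))     # 160
print(mean(wages))       # 222
print(multimode(wages))  # every value appears once, so all five come back

# Leaving out the boss (only possible if you know which answer that is):
employees = wages[:-1]
print(median(employees), mean(employees))  # 155.0 155
```

With these made-up numbers, the median barely moves when the boss is dropped, while the mean falls from 222 to 155.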
NOTE
Accuracy: 91% (HIGH)
00:10:49.550 --> 00:10:55.300
So the mean, which generally happens to have good properties and which we tend to use all the time for
00:10:55.300 --> 00:11:04.900
very good reason, also comes with some assumptions and some issues you need to be aware of.
00:11:04.900 --> 00:11:11.700
All these things that we have talked about are called indices of central tendency, because they are
00:11:11.700 --> 00:11:19.050
estimates of where the center of our observations might be, like, the center being
NOTE
Accuracy: 91% (HIGH)
00:11:19.050 --> 00:11:25.300
a value that would be reasonable to consider as the answer based on all of the observations, taken
00:11:25.300 --> 00:11:34.000
together. The mode that we saw first can be used even with nominal-level variables. So it can be used with
00:11:34.000 --> 00:11:40.700
nominal scales of measurement, that is, with categorical variables, because it only requires you to
00:11:40.700 --> 00:11:48.450
count the values and you can always count how many labels you have in a categorical variable.
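NOTE
Because the mode only needs counting, it also works on labels. A minimal sketch, using a made-up list of eye colors as the categorical variable (nothing in the lecture refers to eye color):

```python
from collections import Counter

# Hypothetical nominal data: labels with no order and no arithmetic.
eye_colors = ["brown", "blue", "brown", "green", "blue", "brown"]

counts = Counter(eye_colors)               # how many times each label appears
label, votes = counts.most_common(1)[0]    # the label with the most votes

print(label, votes)  # brown 3
```

Sorting these labels or adding them up would be meaningless, which is why the median and the mean are not available at this level.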
NOTE
Accuracy: 91% (HIGH)
00:11:48.450 --> 00:11:55.300
One difficulty with the mode is that it may not be defined if there is not one value that is more
00:11:55.300 --> 00:12:02.400
frequent than all the others. But otherwise the mode can be very useful. The second index of central
00:12:02.400 --> 00:12:10.500
tendency is the median. The median can be used for ordinal level data and above. As long as you can
00:12:10.500 --> 00:12:14.550
rank your data, you can produce a median.
NOTE
Accuracy: 84% (HIGH)
00:12:14.550 --> 00:12:23.000
The median, like the mode, is a value from the data set. Well, not exactly: for the median,
00:12:23.000 --> 00:12:28.400
that's only true if there is an odd number of measurements, in which case the middle one
00:12:28.400 --> 00:12:35.300
exists. If there is an even number, you have to do something about the two middle ones. And that's
00:12:35.300 --> 00:12:41.900
easier to decide when they are numbers than when they are levels of an ordinal variable.
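NOTE
One way to handle an even number of values is sketched below with Python's `statistics` module. Averaging the two middle numbers is the usual convention for numeric data and is what `median` does. For ordinal levels, reporting the lower or the upper middle level is one possible choice assumed here, not something the lecture prescribes. The data and the four-level scale are hypothetical.

```python
from statistics import median, median_high, median_low

# Hypothetical numeric data: four heights in cm.
numbers = [161, 162, 163, 164]
numeric_median = median(numbers)  # averages the two middle values, 162 and 163

# Hypothetical ordinal scale, listed from lowest to highest level.
levels = ["low", "medium", "high", "very high"]
ratings = ["medium", "very high", "low", "high"]

# Levels can be ranked but not averaged, so work with their positions on the scale.
ranks = [levels.index(r) for r in ratings]
lower_middle = levels[median_low(ranks)]
upper_middle = levels[median_high(ranks)]

print(numeric_median)              # 162.5
print(lower_middle, upper_middle)  # medium high
```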
NOTE
Accuracy: 91% (HIGH)
00:12:42.100 --> 00:12:50.700
A very desirable property of the median is that it is generally stable and it's not easily fooled
00:12:50.700 --> 00:12:57.000
by having values that are far off, due to problems. Values that are distant from the others are
00:12:57.000 --> 00:13:03.250
called outliers, as they lie outside the distribution of the rest.
NOTE
Accuracy: 91% (HIGH)
00:13:03.250 --> 00:13:11.100
And the third index of central tendency that we saw is the mean. The mean requires numbers to be
00:13:11.100 --> 00:13:18.300
calculated on. So it can only be used for interval- or ratio-level data, that is, only with numeric
00:13:18.300 --> 00:13:25.800
or quantitative variables. The mean produces a value that may not actually appear in the data, so
00:13:25.800 --> 00:13:31.800
you can have a mean number of children per family of 2.5,
NOTE
Accuracy: 91% (HIGH)
00:13:31.800 --> 00:13:36.300
although no family ever has 2.5 children.
NOTE
Accuracy: 91% (HIGH)
00:13:36.300 --> 00:13:43.300
And the mean has some difficulties in the sense that it depends on assumptions for where the data
00:13:43.300 --> 00:13:51.700
points come from, and how they are distributed. So what are we supposed to choose then? What is
00:13:51.700 --> 00:14:00.400
the best answer about this person's height, or about situations when we have some set of data and need
00:14:00.400 --> 00:14:03.050
one summary estimate from them?
NOTE
Accuracy: 91% (HIGH)
00:14:03.050 --> 00:14:07.500
What is the most appropriate statistic for the set?
NOTE
Accuracy: 91% (HIGH)
00:14:07.500 --> 00:14:15.300
Unfortunately different cases require different answers. So this is why you need to understand the
00:14:15.300 --> 00:14:21.400
properties and assumptions of the different indices, so we can choose wisely which one to use in
00:14:21.400 --> 00:14:22.750
each case.
NOTE
Accuracy: 91% (HIGH)
00:14:22.750 --> 00:14:30.800
For numeric, that is, quantitative variables, we usually use the mean, and the mean is known to
00:14:30.800 --> 00:14:37.200
minimize the error in the long term. That is, if you always use the mean to produce this kind of
00:14:37.200 --> 00:14:46.700
answer, in the long run this will lead you to make the least error, that is, to be least distant from
00:14:46.700 --> 00:14:48.349
the right answers.
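NOTE
One precise sense in which the mean minimizes error can be checked numerically: among all candidate answers, the mean gives the smallest sum of squared distances to the observations, while the median gives the smallest sum of absolute distances. Whether this is exactly the long-run sense the speaker has in mind is an assumption. The data, the helper names and the grid of candidates are all hypothetical.

```python
from statistics import mean, median

data = [161, 162, 162, 162, 167]  # hypothetical heights in cm


def sum_squared_error(center):
    """Total squared distance from a candidate answer to every observation."""
    return sum((x - center) ** 2 for x in data)


def sum_absolute_error(center):
    """Total absolute distance from a candidate answer to every observation."""
    return sum(abs(x - center) for x in data)


# Candidate answers from 160.0 to 168.0 in steps of 0.1.
candidates = [160 + i / 10 for i in range(81)]
best_squared = min(candidates, key=sum_squared_error)
best_absolute = min(candidates, key=sum_absolute_error)

print(best_squared, mean(data))     # both are (approximately) 162.8
print(best_absolute, median(data))  # both are 162
```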
NOTE
Accuracy: 81% (HIGH)
00:14:48.349 --> 00:14:53.200
However, the mean needs to be used with caution.