WEBVTT Kind: captions; language: en-us NOTE Accuracy: 91% (HIGH) 00:00:00.700 --> 00:00:09.500 In this video, we will begin to talk about how we can deal with variability and in particular how to 00:00:09.500 --> 00:00:16.900 derive a summary estimate from variable data. First of all, let us talk about a very important word 00:00:16.900 --> 00:00:24.800 that you will hear many times in the context of statistics: the word error. This word doesn't mean what 00:00:24.800 --> 00:00:29.900 it usually means in everyday communication. In the context of statistics, when NOTE Accuracy: 91% (HIGH) 00:00:29.900 --> 00:00:37.200 we talk about error, we usually mean some form of variability, and usually unwanted 00:00:37.200 --> 00:00:46.349 variability, which is otherwise known as noise. Error may also mean the difference from some assumed 00:00:46.349 --> 00:00:53.400 true value, which is unknown to us. There can be systematic error, in which case it is also called 00:00:53.400 --> 00:00:54.950 bias. NOTE Accuracy: 91% (HIGH) 00:00:54.950 --> 00:01:04.500 Error can more generally mean uncertainty, a lack of confidence. Variability is always associated with 00:01:04.500 --> 00:01:06.400 uncertainty. NOTE Accuracy: 91% (HIGH) 00:01:07.000 --> 00:01:16.100 Error often means difference from a prediction. If we have a model to predict some values and we 00:01:16.100 --> 00:01:23.100 have some actual observed values, then the difference between these two is the prediction error or 00:01:23.100 --> 00:01:29.800 the residual error: whatever is left over after we're done predicting. The important thing here is 00:01:29.800 --> 00:01:33.500 that error does not mean mistake. NOTE Accuracy: 91% (HIGH) 00:01:33.500 --> 00:01:40.900 Error means that our values are different from one another, different from what we want, or different 00:01:40.900 --> 00:01:43.150 from what we predict.
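NOTE A minimal sketch of prediction error (residuals) as described above, written in Python. The observed values and the predicted value are hypothetical numbers chosen for illustration; they do not come from the lecture.
```python
# Hypothetical observed heights (cm) and a hypothetical model prediction.
observed = [171.0, 172.0, 171.0]
predicted = 171.5
# Residual = observed value minus predicted value: what is left over after predicting.
residuals = [value - predicted for value in observed]
print(residuals)  # [-0.5, 0.5, -0.5]
```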
NOTE Accuracy: 90% (HIGH) 00:01:43.150 --> 00:01:50.500 To illustrate the ubiquity of error, when I was teaching statistics in an actual classroom I would 00:01:50.500 --> 00:01:59.600 ask for six student volunteers at the beginning of the first class. And I would ask one of them to 00:01:59.600 --> 00:02:07.900 stand out, give a measuring tape to the others, and ask them to measure the height of the one brave 00:02:07.900 --> 00:02:14.000 volunteer. So, each person would take the measuring tape, go on and measure the height NOTE Accuracy: 91% (HIGH) 00:02:14.000 --> 00:02:22.300 of the one designated person, write down the measurement on a piece of paper without announcing it, and fold 00:02:22.300 --> 00:02:28.900 the paper so that it is not visible. And then the next person would do the same thing, and on and on. 00:02:29.000 --> 00:02:37.600 This is the result from 2019. These are the actual writings of the five students who wrote down the 00:02:37.600 --> 00:02:43.750 height of one person measured in class. So we get these five NOTE Accuracy: 88% (HIGH) 00:02:43.750 --> 00:02:50.300 numbers that are supposed to be the height of the same person, measured at the same time, in the 00:02:50.300 --> 00:02:55.900 same place, with the same instrument, and you may be thinking that something is wrong. But this is 00:02:55.900 --> 00:03:04.550 actually what always happens. These are the numbers from the year before. So we have five values for 00:03:04.550 --> 00:03:12.500 person A's height and five values for person B's height. This kind of variability, which is known as 00:03:12.500 --> 00:03:13.750 measurement error, NOTE Accuracy: 84% (HIGH) 00:03:13.750 --> 00:03:20.300 is very common. In fact, it has always happened, every year I've taught a class like this, and it's a 00:03:20.300 --> 00:03:25.700 very robust effect. If you think it's weird, you can ask five of your friends to do it and see it 00:03:25.700 --> 00:03:35.000 for yourself.
So this is the reality: reality is variability. And this raises an important question 00:03:35.000 --> 00:03:42.900 that needs a principled and justified answer. And the question is, what is the person's height? NOTE Accuracy: 91% (HIGH) 00:03:42.900 --> 00:03:49.600 Of course, there is also the sobering consideration that in real life, if you wanted to measure 00:03:49.600 --> 00:03:55.800 someone's height, you would only measure it once, which is like having only one of these five 00:03:55.800 --> 00:04:01.100 numbers and not having the others. This prevents you from realizing that there is actually error 00:04:01.100 --> 00:04:08.400 associated with your measurement. But if you repeat it several times, then the error just shows up 00:04:08.400 --> 00:04:13.100 and you realize that no estimate is free from variability. NOTE Accuracy: 91% (HIGH) 00:04:13.100 --> 00:04:18.549 So what are we supposed to answer when someone asks, what is this person's height? NOTE Accuracy: 84% (HIGH) 00:04:18.549 --> 00:04:24.800 We need an answer, obviously based on the observations, the data we have, and these are the data we 00:04:24.800 --> 00:04:32.549 have. How are we supposed to choose from these? What makes a good answer? NOTE Accuracy: 84% (HIGH) 00:04:32.549 --> 00:04:40.500 We could use a statistical criterion, which is to minimize error, in the sense of making the 00:04:40.500 --> 00:04:46.800 smallest error, or the smallest distance from the actual true height, but we don't know what 00:04:46.800 --> 00:04:54.300 the true height is. So this is kind of a theoretical definition. In statistics we can actually find 00:04:54.300 --> 00:05:01.000 out, or rather the mathematicians can find out for us, which procedure based on these data will give 00:05:01.000 --> 00:05:03.150 us answers that are NOTE Accuracy: 85% (HIGH) 00:05:03.150 --> 00:05:11.400 as close as possible to the actual value, using models. We're not going to do that.
But it's good 00:05:11.400 --> 00:05:17.600 to have in mind that we need to understand what we mean by a good answer before we can begin to 00:05:17.600 --> 00:05:19.549 think about giving one. NOTE Accuracy: 91% (HIGH) 00:05:19.549 --> 00:05:27.900 Let us now see some common ways to produce summary answers from a set of data like this. NOTE Accuracy: 91% (HIGH) 00:05:28.400 --> 00:05:39.400 The simplest approach is democracy. Let them vote. So each value gets as many votes as the number of times 00:05:39.400 --> 00:05:46.700 it appears in the data set. In this case we find the value that occurs most often. That's the value 00:05:46.700 --> 00:05:56.750 162 centimeters, and this is called the mode: the value that happens most often in the data set. NOTE Accuracy: 82% (HIGH) 00:05:56.750 --> 00:06:05.000 Unfortunately, there is one complication with the mode, and that's that there may not be one value 00:06:05.000 --> 00:06:14.850 that is seen most often. For example, the measurements from 2019 for this same class exercise were the 00:06:14.850 --> 00:06:24.400 numbers you saw before, which have 172 twice and 171 twice. In this case there is no mode. NOTE Accuracy: 91% (HIGH) 00:06:25.500 --> 00:06:34.200 Another approach is to first sort the observations from smallest to largest. NOTE Accuracy: 91% (HIGH) 00:06:34.300 --> 00:06:42.000 So these are the numbers we obtained and we just put them in order. So this is the smallest 00:06:42.000 --> 00:06:45.600 value, progressively going on NOTE Accuracy: 91% (HIGH) 00:06:45.700 --> 00:06:48.750 to the largest value. NOTE Accuracy: 91% (HIGH) 00:06:48.750 --> 00:06:56.500 And then pick the one in the middle, and that is called the median. NOTE Accuracy: 91% (HIGH) 00:06:58.600 --> 00:07:08.800 Finally, the third approach, one that you are certainly familiar with, is to allow all values to 00:07:08.800 --> 00:07:18.050 influence our answer in the same way.
So, if we have these five observations, what we can do is to 00:07:18.050 --> 00:07:20.450 add them up NOTE Accuracy: 88% (HIGH) 00:07:20.450 --> 00:07:24.049 and divide by how many they are. NOTE Accuracy: 90% (HIGH) 00:07:24.049 --> 00:07:35.250 And this is the mean, and it is symbolized with the Greek letter μ. You already know how to do this. NOTE Accuracy: 90% (HIGH) 00:07:35.250 --> 00:07:43.500 The issue that arises with the mean is what happens if some values are not as good as other values. 00:07:43.500 --> 00:07:48.400 I'm going to give you an example of what I mean by that. NOTE Accuracy: 91% (HIGH) 00:07:48.800 --> 00:07:57.650 So our original observations, these numbers, have produced the indices we have already discussed: 00:07:57.650 --> 00:08:01.500 the mode, the median and the mean. NOTE Accuracy: 91% (HIGH) 00:08:01.500 --> 00:08:09.900 And as you see, these three possible answers are very close to each other. But now, let us imagine 00:08:09.900 --> 00:08:16.350 that one of the five students actually made a mistake, the tape slipped, and instead of reporting 00:08:16.350 --> 00:08:19.900 this value, we got this one. NOTE Accuracy: 91% (HIGH) 00:08:19.900 --> 00:08:28.300 But we don't know that a mistake was made. All we have are these five numbers to work with. So if we 00:08:28.300 --> 00:08:38.500 just calculate the same indices, the mode is again 162. The median is again 162. The mean is now 00:08:38.500 --> 00:08:46.900 almost 163, which is quite a bit higher than the previous one. So, what happened with the mean is that it 00:08:46.900 --> 00:08:50.300 was most affected by the occurrence NOTE Accuracy: 91% (HIGH) 00:08:50.300 --> 00:08:56.800 of a problem in the measurement process.
And if we weren't aware of this mistake, and we were just 00:08:56.800 --> 00:09:04.900 using the numbers as if everything was fine, then this would result in a biased 00:09:04.900 --> 00:09:10.300 estimate, a misestimation of the height of this person using the mean. NOTE Accuracy: 89% (HIGH) 00:09:11.400 --> 00:09:18.600 Another example of a problem that can arise with the mean: let's imagine that you have applied for a 00:09:18.600 --> 00:09:25.400 summer job. And then you visited the place that you have applied to, and you found five people 00:09:25.400 --> 00:09:32.800 working there, and you asked them how much they earn. You're interested to know what your salary might be, 00:09:32.800 --> 00:09:36.200 as you consider the possibility of working there. NOTE Accuracy: 76% (HIGH) 00:09:36.200 --> 00:09:42.900 So let's assume you asked everyone you saw there, and they all answered truthfully, and you got these 00:09:42.900 --> 00:09:45.300 answers. NOTE Accuracy: 88% (HIGH) 00:09:45.300 --> 00:09:50.300 These are five numbers from five people. NOTE Accuracy: 91% (HIGH) 00:09:50.300 --> 00:09:56.300 The mode for these numbers is not defined because no value appears twice. NOTE Accuracy: 91% (HIGH) 00:09:56.300 --> 00:09:59.750 The median is 160. NOTE Accuracy: 81% (HIGH) 00:09:59.750 --> 00:10:06.000 And the mean is 222 kroner per hour. NOTE Accuracy: 91% (HIGH) 00:10:06.200 --> 00:10:11.150 How much would you expect to be paid based on these data? NOTE Accuracy: 88% (HIGH) 00:10:11.150 --> 00:10:19.100 Well, if you take up the job, you will realize that one of the five people that you asked actually 00:10:19.100 --> 00:10:27.400 was the boss. So, this number is a different kind of number from all of these. They're not all 00:10:27.400 --> 00:10:34.550 measuring the same thing: the hourly wage of an employee, like the one that you would be.
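NOTE A minimal sketch of the wage example in Python. The lecture states only the summaries (no mode, median 160, mean 222 kroner per hour); the five individual wages below are hypothetical values chosen to reproduce those summaries.
```python
import statistics
# Hypothetical hourly wages in kroner; the last one stands in for the boss.
wages = [150, 155, 160, 165, 480]
median_wage = statistics.median(wages)   # middle value after sorting
mean_wage = statistics.mean(wages)       # sum divided by count
modes = statistics.multimode(wages)      # every value ties, so there is no single mode
# Leaving out the boss gives a mean much closer to what an employee earns.
mean_without_boss = statistics.mean(wages[:-1])
print(median_wage, mean_wage, len(modes), mean_without_boss)
```
The single large value moves the mean far from the median, while the median stays at a wage that an employee actually reported.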
NOTE Accuracy: 91% (HIGH) 00:10:34.550 --> 00:10:42.100 So, by allowing the mean to be influenced equally by all the answers, you have produced 00:10:42.100 --> 00:10:48.300 an estimate of how much you would be paid, using the procedure of the mean, that is actually not a 00:10:48.300 --> 00:10:49.550 good one. NOTE Accuracy: 91% (HIGH) 00:10:49.550 --> 00:10:55.300 So the mean, which generally happens to have good properties, and which we tend to use all the time for 00:10:55.300 --> 00:11:04.900 very good reason, also comes with some assumptions and some issues you need to be aware of. 00:11:04.900 --> 00:11:11.700 All these things that we have talked about are called indices of central tendency, because they are 00:11:11.700 --> 00:11:19.050 estimates of where the center of our observations might be, the center being NOTE Accuracy: 91% (HIGH) 00:11:19.050 --> 00:11:25.300 a value that would be reasonable to consider as the answer based on all of the observations taken 00:11:25.300 --> 00:11:34.000 together. The mode, which we saw first, can be used even with nominal-level variables. So it can be used with 00:11:34.000 --> 00:11:40.700 nominal scales of measurement, that is, with categorical variables, because it only requires you to 00:11:40.700 --> 00:11:48.450 count the values, and you can always count how many labels you have in a categorical variable. NOTE Accuracy: 91% (HIGH) 00:11:48.450 --> 00:11:55.300 One difficulty with the mode is that it may not be defined if there is not one value that is more 00:11:55.300 --> 00:12:02.400 frequent than all the others. But otherwise the mode can be very useful. The second index of central 00:12:02.400 --> 00:12:10.500 tendency is the median. The median can be used for ordinal-level data and above. As long as you can 00:12:10.500 --> 00:12:14.550 rank your data, you can produce a median. NOTE Accuracy: 84% (HIGH) 00:12:14.550 --> 00:12:23.000 The median, like the mode, is a value from the data set.
Actually, for the median, 00:12:23.000 --> 00:12:28.400 that's only true if there is an odd number of measurements, in which case the middle one 00:12:28.400 --> 00:12:35.300 exists. If there is an even number, you have to do something about the two middle ones. And that's 00:12:35.300 --> 00:12:41.900 easier to decide when they are numbers than when they are levels of an ordinal variable. NOTE Accuracy: 91% (HIGH) 00:12:42.100 --> 00:12:50.700 A very desirable property of the median is that it is generally stable and it's not easily fooled 00:12:50.700 --> 00:12:57.000 by having values that are far off due to problems. Values that are distant from the others are 00:12:57.000 --> 00:13:03.250 called outliers, as they lie outside the distribution of the rest. NOTE Accuracy: 91% (HIGH) 00:13:03.250 --> 00:13:11.100 And the third index of central tendency that we saw is the mean. The mean requires numbers to be 00:13:11.100 --> 00:13:18.300 calculated on. So it can only be used for interval- or ratio-level data, that is, only with numeric 00:13:18.300 --> 00:13:25.800 or quantitative variables. The mean produces a value that may not actually appear in the data, so 00:13:25.800 --> 00:13:31.800 you can have a mean number of children per family of 2.5, NOTE Accuracy: 91% (HIGH) 00:13:31.800 --> 00:13:36.300 although no family ever has 2.5 children. NOTE Accuracy: 91% (HIGH) 00:13:36.300 --> 00:13:43.300 And the mean has some difficulties, in the sense that it depends on assumptions about where the data 00:13:43.300 --> 00:13:51.700 points come from and how they are distributed. So what are we supposed to choose then? What is 00:13:51.700 --> 00:14:00.400 the best answer about this person's height, or in situations when we have some set of data and need 00:14:00.400 --> 00:14:03.050 one summary estimate from them? NOTE Accuracy: 91% (HIGH) 00:14:03.050 --> 00:14:07.500 What is the most appropriate statistic for the set?
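NOTE A minimal sketch in Python of three points made above: a tie for the most frequent value, the median with an even number of values, and a mean that does not appear in the data. All values are hypothetical, except that 171 and 172 each appearing twice follows the 2019 example.
```python
import statistics
# Tie: 171 and 172 both appear twice (the fifth value, 173, is hypothetical).
heights_2019 = [171, 171, 172, 172, 173]
tied_modes = statistics.multimode(heights_2019)   # [171, 172]
# Even number of values: there are two middle values, 2 and 3.
ranks = [1, 2, 3, 4]
median_avg = statistics.median(ranks)       # averages the two middle values: 2.5
median_low = statistics.median_low(ranks)   # picks the lower middle value: 2
# The mean need not be a value that occurs in the data.
children = [2, 3, 2, 3]
mean_children = statistics.mean(children)   # 2.5
```
Averaging the two middle values only makes sense for numbers; for ordinal levels, picking one of the two middle values, as median_low does, keeps the answer a value that exists in the data.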
NOTE Accuracy: 91% (HIGH) 00:14:07.500 --> 00:14:15.300 Unfortunately, different cases require different answers. So this is why you need to understand the 00:14:15.300 --> 00:14:21.400 properties and assumptions of the different indices, so that we can choose wisely which one to use in 00:14:21.400 --> 00:14:22.750 each case. NOTE Accuracy: 91% (HIGH) 00:14:22.750 --> 00:14:30.800 For numeric, that is, quantitative variables, we usually use the mean, and the mean is known to 00:14:30.800 --> 00:14:37.200 minimize the error in the long term. That is, if you always use the mean to produce this kind of 00:14:37.200 --> 00:14:46.700 answer, in the long run this will lead you to make the least error, that is, to be least distant from 00:14:46.700 --> 00:14:48.349 the right answers. NOTE Accuracy: 81% (HIGH) 00:14:48.349 --> 00:14:53.200 However, the mean needs to be used with caution.
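NOTE One common way to make "least error" precise is the sum of squared distances; that choice is an assumption here, since the lecture does not name the error measure. A minimal Python sketch with hypothetical heights, checking candidate answers on a grid:
```python
import statistics
# Hypothetical heights (cm); 167 plays the role of a slipped-tape reading.
heights = [161, 162, 162, 162, 167]
def sum_squared_error(candidate):
    # Total squared distance between one candidate answer and all observations.
    return sum((h - candidate) ** 2 for h in heights)
# Candidate answers from 160.0 to 168.0 in steps of 0.1.
candidates = [c / 10 for c in range(1600, 1681)]
best = min(candidates, key=sum_squared_error)
print(best, statistics.mean(heights), statistics.median(heights))
```
Under this assumption the best candidate on the grid coincides with the mean, while the median stays at the value most students reported, which is one way to see why the mean needs caution when a reading may be faulty.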