Contents

**Course Description**

Welcome to ECON 2843: *Elements of Statistics*, offered by Saleh S. Tabrizy (Assistant Professor of Economics at the University of Oklahoma). This is an introductory statistics course, which surveys basic statistical techniques with particular emphasis on business and economic applications. The learning objective of this course is to improve students’ analytical skills in understanding and employing descriptive and inferential statistics in the classical tradition. We begin this course by learning how to describe the data in use. We then survey selected topics in probability theory that enable us to understand the essence of statistical inference. For the rest of the course, we explore multiple inference tools such as confidence interval estimation, hypothesis testing, and the analysis of variance. These tools help us make use of sample information to reach conclusions about unknown population parameters.

In this lecture series, I rely heavily on the latest edition of *Basic Business Statistics* by Berenson, Levine, Szabat, and Stephan. I also make use of *MyStatLab*, developed and maintained by Pearson. As for software, we use *MS Excel* in class. Students at the University of Oklahoma have free access to this software. They are asked to install and maintain an updated version of the software on their laptop or tablet devices. They must also load the *Data Analysis Toolpak*, which will be used frequently in class.

For more information about course requirements and assessments, please review the syllabus, uploaded under the Canvas course entry for this course. If you have any questions, you may contact me via: tabrizy@ou.edu.

**Part I – Descriptive Statistics**

### Introduction

- Data is everywhere. You can find countless examples. Here is just one example:
- My students and I put together this dataset in the Fall 2017 class using an in-class survey. The survey shows the news outlets that a randomly selected group of your peers rely on. It also shows the outlets that they anticipate they will rely on in seven years.
- A quick look at the survey results for:
- students from the 1st section
- students from the 2nd section
- The results from the two sections are *not* identical — yet, as expected, they are quite similar.

- If you are interested, you may also download the dataset here. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.


STATISTICS helps us transform data into useful information for decision making.

DESCRIPTIVE STATISTICS provides summary information about the data and their variation (e.g., the mean and standard deviation of the annual earnings of OU alumni), and INFERENTIAL STATISTICS provides information about the population using sample observations (e.g., testing whether the choice of major has any impact on the lifetime earnings of OU alumni, using only a sample of alumni).

- Statistics is a branch of mathematics. Our focus in this course, however, is on applied statistics. Even when we survey the underlying mathematical models, the emphasis is on the practical implications. Advanced courses explore the theoretical issues in more detail.

The DCOVA framework:

Those who work with data are typically involved in one or more of these activities: Defining data, Collecting the defined data, Organizing the collected data, Visualizing the organized data, or Analyzing them using the tools that are developed in Inferential Statistics.

- This course is organized around the DCOVA framework, with particular emphasis on the *Analysis* part. We begin with a survey of methods that are used for collecting the defined data (Ch. 1). We then examine some organization and visualization tools (Ch. 2). For the remainder of the course, we explore descriptive and inferential tools that are widely used in data analysis.

**Chapter 1. Defining and Collecting Data**

- Lecture Presentation (PDF)
- Highlights:
- Pay attention to the difference between CATEGORICAL (qualitative) and NUMERICAL (quantitative) variables
- Also, pay attention to the difference between DISCRETE and CONTINUOUS numerical variables
- Note: Given their decimal precision, measures of income and expenditure are often considered to be “continuous” though they appear to be the result of counting

- Example: Soda Consumption Survey Results. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- Variables include: ID, Name, Gender, Weight, Number of Soda Last Week, Regular vs Diet, Coke vs Pepsi, Other Brands, 5 Cents Price Increase, and 95 Cents Price Increase.
- The above variables help us understand:
- Some general characteristics about the observations; e.g., their gender, their weight, etc.
- Some general information about their consumption of soda drinks; e.g., number of cans of soda that they drank last week and regular vs. diet.
- Some general information about their preferences for brands of soda drinks; e.g., Coke vs. Pepsi, other brands, etc.
- Some general information about the sensitivity of their demand for soda drinks (a.k.a. price elasticity of their demand for soda drinks); e.g., 5 cents price increase and 95 cents price increase.

- Among the above variables, some are categorical and some are numerical.
- Categorical variables:
- Nominal: Name, Gender, Regular vs. Diet, Coke vs Pepsi, Other Brands
- Ordinal: 5 Cents Price Increase and 95 Cents Price Increase

- Numerical variables:
- Discrete: Number of Soda Last Week
- Continuous: Weight


- You should further pay attention to the difference between POPULATION and SAMPLE.
- Note: We are only interested in the populations (I cannot put enough emphasis on this!). We employ samples along with inferential statistics techniques to understand the populations better. To do this, we may rely on a host of sampling methods.

- Probability Sampling:
- SIMPLE RANDOM SAMPLING and SYSTEMATIC SAMPLING: In these two methods, we ignore the characteristics of the items in the population when we draw a random sample. Items are nothing to us but a bunch of IDs.
- STRATIFICATION and CLUSTERING: In these two methods, we consider the characteristics of the items in the population. Taking the gender composition of the population of voters into account, for instance, a random sample can be drawn from each gender group so that the sample represents the voters in the US. This is called Stratification. Alternatively, the population of voters can be divided into naturally occurring groups, such as States, and a random subset of those groups can be selected for sampling. This is called Clustering.
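The two probability-sampling families above can be sketched in a few lines of Python. The population below is invented for illustration (1,000 voter IDs with an assumed 60/40 gender split); it is not course data.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 voter IDs with a gender label
# (a 60% female / 40% male split is assumed purely for illustration).
population = [{"id": i, "gender": "F" if i < 600 else "M"} for i in range(1000)]

# Simple random sampling: characteristics are ignored; items are just IDs.
srs = random.sample(population, 50)

# Stratified sampling: draw from each gender group in proportion to its
# share of the population (30 of 50 from females, 20 of 50 from males).
females = [p for p in population if p["gender"] == "F"]
males = [p for p in population if p["gender"] == "M"]
stratified = random.sample(females, 30) + random.sample(males, 20)

print(len(srs), len(stratified))  # both samples contain 50 items
```

Note how the stratified sample reproduces the population's gender composition by construction, while the simple random sample only does so on average.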

- A class activity and an Excel exercise on sampling and recoding:
- A class activity on Probability Sampling (PDF)
- This activity is motivated by the sampling procedure in the European Firms in a Global Economy project.

- An Excel file for Sampling and Recoding. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- Start from the tab called Note. Then, go to the dataset tab.


**Chapter 2. Organizing and Visualizing Data**

- Lecture Presentation (PDF)
- Highlights:
- The FREQUENCY DISTRIBUTION and HISTOGRAM are the most important tools for organizing and visualizing numerical data. Make sure that you know how to construct the frequency, relative frequency, cumulative, and relative cumulative distribution tables using the Data Analysis Toolpak in Excel. For this, you may also use the Frequency function if you know how to work with array functions. Using the frequency distribution, you can then put together the relative frequency, cumulative, and relative cumulative distribution tables. You may use the Data Analysis Toolpak in Excel to plot histograms. We will practice this in class.
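Beyond the Excel workflow, the same tables can be built in a few lines of Python. The exam scores and class intervals below are made up for illustration; the logic (count per interval, then relative and cumulative columns) is the same as in the Toolpak.

```python
# A small, made-up sample of exam scores (not course data).
scores = [52, 58, 61, 64, 67, 71, 73, 74, 78, 81, 83, 88, 91, 95]

# Class intervals: [50,60), [60,70), [70,80), [80,90), [90,100)
bins = [50, 60, 70, 80, 90, 100]
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]

# Frequency: how many scores fall in each interval.
frequency = [sum(lo <= s < hi for s in scores) for lo, hi in zip(bins[:-1], bins[1:])]

# Relative frequency: each count as a share of the sample size.
relative = [f / len(scores) for f in frequency]

# Cumulative frequency: running total of the counts.
cumulative = []
total = 0
for f in frequency:
    total += f
    cumulative.append(total)

for row in zip(labels, frequency, relative, cumulative):
    print(row)
```

The relative frequencies sum to 1, and the last cumulative entry equals the sample size; those two checks are a quick way to catch binning mistakes.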

The Road Ahead:

In the near future, you will see that the PROBABILITY DISTRIBUTION and the graphical illustration of the DENSITY FUNCTION are closely related to the FREQUENCY DISTRIBUTION and HISTOGRAM, respectively.

- Highlights (continued):
- The SUMMARY TABLE (tabulation) and CONTINGENCY TABLE (cross tabulation) are also among the important tools that are introduced in this chapter. You are expected to know how to construct, read, and understand these tables.
- Make use of PivotTable tools in Excel to construct summary and contingency tables using this data set (XLSX). To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- Also, make sure that you know how to read and understand bar charts, pie charts, Pareto charts, and side-by-side bar charts.

- There are also two types of graphs that can be used to visualize the variations in two numerical variables: SCATTER PLOT and TIME SERIES. It is important that you can read and interpret both of them.


**Chapter 3. Numerical Descriptive Measures**

- Lecture Presentation (PDF)
- Highlights:
- The idea behind measures of central tendency and measures of dispersion for a sample:
- A sample may include multiple observations with a particular characteristic that can take different numerical values; e.g., as seen in this small dataset, a randomly selected group of baseball players hit different numbers of home runs over a given season. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- As discussed in Chapter 2, we are able to construct a frequency distribution, which can also be illustrated via a histogram, using the numerical variations in a sample; e.g., the variation in the number of home runs. This is the best way for us to understand how the data are distributed.
- A frequency distribution, however, often provides too much information. Alternatively, we can make use of only two measures to understand how data are distributed. For instance:
- We can make use of MEAN and STANDARD DEVIATION to measure central tendency and dispersion, respectively. For example, the mean number of home runs for the above sample is 4.10 home runs (central tendency), and the standard deviation is 6.99 home runs (dispersion).
- Alternatively, we can make use of MEDIAN and INTERQUARTILE RANGE for central tendency and dispersion, respectively. For example, the median number of home runs for the above sample is 0 home runs (central tendency), and the interquartile range is 7.5 home runs (dispersion).

- Mean and median are both measures of central tendency. The mean is very useful in making decisions, and it is also a very useful measure in inferential statistics. However, it is sensitive to outliers (i.e., observations that take extreme numerical values). The median is very useful in describing the data, and it is not sensitive to outliers. This is, in fact, evident in the small sample above. Though the majority of players hit no home runs, a few hit many. Those few players are the outliers. They pull the mean towards themselves, which is why the mean is quite high (4.1 home runs) even though the majority do not hit even one home run. The median, by contrast, remains at the center (0 home runs) despite the outliers.
- The standard deviation (SD) measures the dispersion around the mean, and the inter-quartile range (IQR) measures the dispersion around the median. They are both very useful in describing the data: the greater the SD or IQR, the greater the dispersion.
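The outlier story above can be reproduced with Python's standard `statistics` module. The small sample below is hypothetical, not the course dataset, but it has the same shape: most players hit no home runs, and a couple of outliers pull the mean up while the median stays at zero.

```python
import statistics

# Hypothetical sample of home runs: mostly zeros plus two outliers.
home_runs = [0, 0, 0, 0, 0, 1, 2, 12, 15]

mean = statistics.mean(home_runs)      # pulled up by the two outliers
median = statistics.median(home_runs)  # stays at the center of the data
sd = statistics.stdev(home_runs)       # sample standard deviation (n - 1)

print(mean, median, sd)
```

The mean here is well above the median, exactly the gap the lecture attributes to outliers.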


Mean, median, standard deviation, and inter-quartile range have the same unit as the variable in use. For instance, if the variable in use is measured in volts, its mean, median, standard deviation, and inter-quartile range are also measured in volts. Variance, which is the basis for computing the standard deviation and is also used as another measure of dispersion, does not have the same unit as the variable in use. For example, if the variable in use is measured in volts, its variance is measured in volts squared.

- A class activity on Summary Statistics (PDF)
- Computing numerical descriptive measures using Excel:
- How to compute the *sample* and *population* summary statistics?
- Sample Summary Statistics: We are going to make use of this dataset for the number of cars owned by a *sample* of households (with only seven observations; n=7) in a small neighborhood in OKC. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- Central tendency: Excel function for the sample mean: =AVERAGE; Excel function for the sample median: =MEDIAN
- Dispersion: Excel function for the sample variance: =VAR.S; Excel function for the sample standard deviation: =STDEV.S — pay attention to the use of “.S” in these two functions.

- Population Summary Statistics: We are going to make use of this dataset for the number of cars owned by the *population* of households (with seventy-five observations; N=75) in a small neighborhood in OKC. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- Central tendency: Excel function for the population mean: =AVERAGE; Excel function for the population median: =MEDIAN — the functions for the population central tendency are the same as for the sample.
- Dispersion: Excel function for the population variance: =VAR.P; Excel function for the population standard deviation: =STDEV.P — unlike the functions for the sample, we are now using “.P” in these two functions.

- How to compute the Inter-Quartile Range using Excel?
- Use the QUARTILE function to identify the first and the third quartiles. To compute the first quartile, use: =QUARTILE(array,1). To compute the third quartile, use: =QUARTILE(array,3). In these cases, *array* is simply your data range.
- Then, subtract the value of the first quartile from the third quartile to obtain the Inter-Quartile Range.
- The above is illustrated using a sample of calories per one cup of breakfast cereal. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
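The same two steps can be sketched in Python with the standard library. The calorie values below are made up (the real sample is in the linked file); `statistics.quantiles` with `method="inclusive"` follows the same inclusive-endpoint convention as Excel's QUARTILE function.

```python
import statistics

# A made-up sample of calories per cup (not the course dataset).
calories = [100, 110, 120, 130, 140, 150, 160, 170, 180]

# Step 1: find the first and third quartiles.
# method="inclusive" mirrors Excel's QUARTILE (i.e., QUARTILE.INC) convention.
q1, q2, q3 = statistics.quantiles(calories, n=4, method="inclusive")

# Step 2: subtract Q1 from Q3 to obtain the Inter-Quartile Range.
iqr = q3 - q1
print(q1, q3, iqr)
```

For these nine evenly spaced values, Q1 is the 3rd observation (120), Q3 is the 7th (160), and the IQR is 40.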

- How to compute the Geometric Mean Rate of Return using Excel?
- Enter the rates of return into Excel. Do not forget to include the signs; e.g., a 50% loss should be entered as -0.5, and a 100% recovery should be entered as 1.
- Then, add 1 to the rates that you entered; e.g., for the above loss it would be 0.5, and for the above recovery it would be 2.
- Then, use the geometric mean function and deduct 1 (=GEOMEAN(0.5,2)-1) in order to obtain the geometric mean rate of return.
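The three steps above translate directly into Python, using the same loss-then-recovery example:

```python
import math

# Step 1: rates of return, with signs — a 50% loss and a 100% recovery.
rates = [-0.5, 1.0]

# Step 2: add 1 to each rate to get growth factors (0.5 and 2).
growth = [1 + r for r in rates]

# Step 3: geometric mean of the growth factors, minus 1.
geo_mean_rate = math.prod(growth) ** (1 / len(growth)) - 1
print(geo_mean_rate)  # 0.0: the loss and the recovery exactly offset each other
```

Note that the arithmetic mean of the two rates is +25%, which is misleading here; the geometric mean correctly reports that the investment ended where it started.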

- How to compute sample covariance and coefficient of correlation using Excel?
- Make use of this data set. Looking at 22 manufacturing industries in Korea during 2014, this data set shows the variations in industrial R&D expenditures and industrial exporting activities (both measured in Korean won). To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- To compute the covariance between R&D expenditures and exports, you may make use of the Sample Covariance function in Excel: =COVARIANCE.S.
- To compute the coefficient of correlation between R&D expenditures and exports, you may make use of the Correlation function in Excel: =CORREL.
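For readers who want to see what those two Excel functions compute, here is a from-scratch Python sketch. The numbers are invented (the real data are in the linked file); y is exactly linear in x, so the correlation comes out to 1.

```python
import math

# Invented values standing in for R&D expenditures (x) and exports (y).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sample covariance (divide by n - 1), mirroring Excel's COVARIANCE.S.
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Coefficient of correlation, mirroring Excel's CORREL:
# covariance divided by the product of the two sample standard deviations.
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
corr = cov / (sx * sy)

print(cov, corr)
```

Unlike the covariance, whose magnitude depends on the units of x and y, the correlation is unit-free and always lies between -1 and 1.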

CORRELATION DOES NOT NECESSARILY IMPLY CAUSATION! Let’s reflect on the above example again. The coefficient of correlation between industrial R&D expenditures and industrial exports in Korea is quite high (about 0.85), implying that those industries that are R&D-intensive are also export-intensive and that those industries that are relatively less involved in R&D activities are also less involved in exporting activities. However, this does not mean that the R&D activities of those industries have led them to engage more in exporting activities. Nor does it mean that the exporting activities of those industries have led them to engage more in R&D activities. To figure out the direction of causation, we need to run an experiment… something like a randomized controlled trial (RCT). We can also rely on natural experiments, which may occur in real life. Or, alternatively, we may make use of some advanced applied statistics techniques to examine the direction of causality. But this should be done carefully.

**Part II – Probability**

**Chapters 5 and 6. Probability Distribution**

- Lecture Presentation (PDF)
- Highlights:
- PROBABILITY DISTRIBUTION is a listing of random events and their associated probabilities.
- Often, we make use of a function to describe how the events and their probabilities relate to each other.
- Like other distributions, probability distributions also have a measure of central tendency, known as Expected Value, and some measures for dispersion, such as Standard Deviation.
- The EXPECTED VALUE is simply a weighted mean. For the formula, refer to the lecture presentation (slide 11).
- The STANDARD DEVIATION is the square root of the probability-weighted average of the squared deviations from the Expected Value. For the formula, refer to the lecture presentation (slide 13).

- The linear relationship between two random variables can also be measured using COVARIANCE. For the formula refer to the lecture presentation (slide 16).
- NOTE: A RANDOM VARIABLE is a variable that can take on either a finite or infinite number of random values.
- For example, the number of wins for a given baseball team during the regular season is a DISCRETE random variable (with a finite number of random values, which can be counted).
- The time required to download a music file is another random variable. This one, however, is a CONTINUOUS random variable (with an infinite number of random values, which are measured rather than counted).

- NOTE: A RANDOM VARIABLE is a variable that can take on either a finite or infinite number of random values.
- With the number of WiFi outages per day as a random variable, this Excel file illustrates how you can compute the Expected Value and Standard Deviation of a probability distribution. It also shows how you can measure the Covariance between two random variables. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- Also, refer to slides 22 and 24 to see how you can compute the Expected Value and the Standard Deviation of the (weighted) sum of two random variables. Make sure that you review the application to portfolio performance measurement.


The probability distributions of interest in this course:

In this lecture series, we examine two discrete probability distributions and one important continuous probability distribution.

- Discrete Probability Distributions:

- BINOMIAL Probability Distribution, which measures the probability of a given number of successes in a given number of trials when the probability of success is known and remains constant across all trials.

- Example: We can use this function to compute the probability of getting a 6 (i.e., the successful outcome) 10 times (i.e., the number of successes) when we throw a fair die 50 times (i.e., the number of trials).
- POISSON Probability Distribution, which measures the probability of the number of times that an event happens in an area of opportunity, given the historical average number of events.

- Example: We can use this function to compute the probability that 15 fish are caught by a group of students (i.e., the number of times that an event happens) in a day of camping (i.e., the area of opportunity), knowing that on average students catch 10 fish per day of camping (i.e., the historical average).
- Continuous Probability Distribution:

- NORMAL Probability Density Function, which can be used to compute the probability for an interval over a continuum when the continuous random variable of interest is distributed symmetrically, like a bell.

- Example: We can use this function to compute the probability that a flight from MKE to OKC takes more than 3 hours and less than 3.5 hours.

#### Binomial and Poisson Probability Distributions

- The intuition behind Binomial Probability Distribution:
- Think of an event with two outcomes: Success and Failure. For example, let’s say that you may succeed with 50% probability and fail with 50% probability. Let’s try this event, say, ten times. Let’s assume that the probability of success and failure in each trial remains the same, i.e., it does not change even after multiple trials – no learning is involved. Under these assumptions, you may compute the probability of a given number of successes within those ten trials using the Binomial Probability Distribution. For instance, you may compute the probability that you are successful *exactly* 2 times during 10 trials, or that you are successful *more* than 2 times during 10 trials, or that you are successful *less* than 2 times during 10 trials.
- Why is this a “probability distribution”? Well, because it provides you with a list of events (e.g., more than 2 successes in 10 trials) and their associated probabilities.
- How can we compute those probabilities? In this course, we make use of the corresponding Excel function. For a detailed description, refer to slide 44.
- Tip: When computing “the probability of exactly 2 successes,” put FALSE for the last argument in the Excel function (i.e., make use of the Mass Function). To compute “the probability of 2 or fewer successes,” put TRUE as the last argument (i.e., make use of the Cumulative Function).
- This Excel file provides a playground. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
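The mass and cumulative functions behind that tip can also be written from scratch in Python, using the 10-trial, 50%-success example from the intuition above:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials (mass function)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """Probability of k or fewer successes (cumulative function)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Exactly 2 successes in 10 trials with a 50% chance of success
# (what FALSE gives you in Excel's BINOM.DIST):
print(binom_pmf(2, 10, 0.5))  # 45/1024, about 0.0439

# 2 or fewer successes (what TRUE gives you):
print(binom_cdf(2, 10, 0.5))  # 56/1024, about 0.0547
```

The cumulative value is just the sum of the mass values for 0, 1, and 2 successes, which is exactly the FALSE-versus-TRUE distinction in the tip above.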

- The intuition behind Poisson Probability Distribution:
- Think of an event that may occur repeatedly, like car accidents. Imagine you have some historical information about that event. For instance, you know the average number of car accidents per day on I-35 between Dallas and OKC. You may use the Poisson Probability Distribution to compute the probability that such an event happens a particular number of times. Given the average number of car accidents per day on I-35 between Dallas and OKC, for example, you may use the Poisson Probability Distribution to compute the probability that *exactly* 10 accidents, or *more* than 10 accidents, or *less* than 10 accidents happen in a given day on this segment of interstate highway.
- Why is this a “probability distribution”? Well, because it provides you with a list of events (e.g., more than 10 accidents per day) and their associated probabilities.
- How can we compute those probabilities? In this course, we make use of the corresponding Excel function. For more, refer to slide 61.
- Tip: When computing “the probability of exactly 10 accidents per day,” put FALSE for the last argument in the Excel function (i.e., make use of the Mass Function). To compute “the probability of 10 or fewer accidents per day,” put TRUE as the last argument (i.e., make use of the Cumulative Function).
- This Excel file provides a playground. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.


A note on *Mass* versus *Cumulative* probability distribution functions:

Probability Distribution Functions “describe” how events and their associated probabilities relate to each other. This can be done in two different ways. Mass functions provide us with the probability of a precise outcome; e.g., the probability that the annual income of a randomly selected household is exactly equal to $100,000. Cumulative functions provide us with the probability of the outcome being less than or equal to a precise value; e.g., the probability that the annual income of a randomly selected household is equal to or less than $100,000.

#### Excel Examples for Binomial and Poisson Probability Distributions

Excel Examples for Binomial Prob. Distribution:

A laboratory is planning to test the quality of 500 newly developed transmitters. From experience, they know that 85% of transmitters pass the quality control test.

- What is the probability that out of 500 newly developed transmitters exactly 430 transmitters pass the quality control test?
*Prob(X=430|500, 0.85)*, a mass probability (as opposed to cumulative probability), could be computed in Excel using: =BINOM.DIST(430,500,0.85,FALSE), which yields 0.0420 (4.2%). Note that the last argument in the above command (i.e., FALSE) implies that we employ the mass probability function (again, as opposed to the cumulative probability function).

- What is, then, the probability that out of 500 newly developed transmitters 430 or fewer pass the quality control test?
*Prob(X<=430|500, 0.85)*, a cumulative probability (as opposed to a mass probability), could be computed in Excel using: =BINOM.DIST(430,500,0.85,TRUE), which yields 0.7521 (75.2%). Note that the last argument in the above command (i.e., TRUE) implies that we employ the cumulative probability function (again, as opposed to the mass probability function).

- What is the probability that out of 500 newly developed transmitters fewer than 430 pass the quality control test?
*Prob(X<430|500, 0.85)* is equal to *Prob(X<=430|500, 0.85)* minus *Prob(X=430|500, 0.85)*. And we know them both. Thus: *Prob(X<430|500, 0.85) = 0.7521 - 0.0420 = 0.7101* (about 71%)

- What is the probability that out of 500 newly developed transmitters 430 or more than 430 transmitters pass the quality control test?
*Prob(X>=430|500, 0.85)* is equal to the sum of *Prob(X=430|500, 0.85)* and *Prob(X>430|500, 0.85)*. We have already computed *Prob(X=430|500, 0.85)*. We only need to compute *Prob(X>430|500, 0.85)*. It is quite easy. This latter probability can be written as *1-Prob(X<=430|500, 0.85)*. And, fortunately, we have already computed *Prob(X<=430|500, 0.85)*. In short:

*Prob(X>=430|.)=Prob(X=430|.)+Prob(X>430|.)*

Also,

*Prob(X>430|.)=1-Prob(X<=430|.)*

Thus:

*Prob(X>=430|.) = Prob(X=430|.) + (1-Prob(X<=430|.)) = 0.0420 + (1-0.7521) = 0.2899* (about 29%)
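The four transmitter probabilities above can also be reproduced outside Excel with the exact binomial formula. A Python sketch with the same parameters (n = 500, p = 0.85):

```python
from math import comb

n, p = 500, 0.85

def pmf(k):
    # Exact binomial mass function: C(n,k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_exactly_430 = pmf(430)                          # mass probability
p_at_most_430 = sum(pmf(k) for k in range(431))   # cumulative probability
p_below_430 = p_at_most_430 - p_exactly_430       # strictly fewer than 430
p_at_least_430 = p_exactly_430 + (1 - p_at_most_430)  # 430 or more

# These should match the Excel values reported above:
# 0.0420, 0.7521, 0.7101, and 0.2899, respectively.
print(round(p_exactly_430, 4))
print(round(p_at_most_430, 4))
print(round(p_below_430, 4))
print(round(p_at_least_430, 4))
```

Doing the computation twice, once in Excel and once from the formula, is a good way to convince yourself that BINOM.DIST is nothing more than this sum.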

For more information about the BINOM.DIST function, see the official Excel syntax description.

Excel Examples for Poisson Prob. Distribution:

Consider a rivalry between two European soccer teams. Historical data suggests that on average 1.05 goals are scored per game.

- What is the probability that in the next game exactly 2 goals are scored?
*Prob(X=2|1.05)*, a mass probability (as opposed to cumulative probability), could be computed in Excel using: =POISSON.DIST(2,1.05,FALSE), which yields 0.1929 (19.3%). Note that the last argument in the above command (i.e., FALSE) implies that we employ the mass probability function (again, as opposed to the cumulative probability function).

- What is, then, the probability that in the next game 2 goals or less than 2 goals are scored?
*Prob(X<=2|1.05)*, a cumulative probability (as opposed to mass probability), could be computed in Excel using: =POISSON.DIST(2,1.05,TRUE), which yields 0.9103 (about 91%). Note that the last argument in the above command (i.e., TRUE) implies that we employ the cumulative probability function (again, as opposed to the mass probability function).

- Given the last two examples for the binomial distribution, it should be easy for you to compute the following probabilities (the idea is the same; the commands are different):
*Prob(X<2|1.05)* and *Prob(X>=2|1.05)*
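The goal-scoring probabilities follow directly from the Poisson formula as well; a Python check with the same historical average (1.05 goals per game):

```python
from math import exp, factorial

lam = 1.05  # historical average goals per game

def poisson_pmf(k):
    # Exact Poisson mass function: e^(-lam) * lam^k / k!
    return exp(-lam) * lam**k / factorial(k)

p_exactly_2 = poisson_pmf(2)                         # POISSON.DIST(2,1.05,FALSE)
p_at_most_2 = sum(poisson_pmf(k) for k in range(3))  # POISSON.DIST(2,1.05,TRUE)
p_below_2 = p_at_most_2 - p_exactly_2                # Prob(X<2|1.05)
p_at_least_2 = 1 - p_below_2                         # Prob(X>=2|1.05)

print(round(p_exactly_2, 4))  # 0.1929, matching the mass probability above
print(round(p_at_most_2, 4))  # 0.9103, matching the cumulative probability above
```

The two remaining probabilities come out of the same identities used in the binomial examples: subtract the mass from the cumulative, or subtract the cumulative's complement from 1.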

For more information about the POISSON.DIST function, see the official Excel syntax description.

#### Normal Probability Distribution

- The intuition:
- Think of a game with two outcomes. Let’s say that with a probability of 50% you will win the game and with a probability of 50% you will lose. Let’s also say that we would like to play this game repeatedly, over and over again. When we play this game 1,000 times, for instance, we can ask ourselves: what is the probability that we win the game, say, 250 times or less? Keep in mind that winning this game 250 times or less is the result of a *large sum of random events*. The Normal Probability Distribution is what you get when you add up a large number of random events.
- To generate a Normal Probability Distribution and a Non-normal Probability Distribution, we may conduct an experiment.
- Let us first generate three random events: X1 is a random number between 10 and 20, X2 is a random number generated by a binomial distribution with 100 trials and 50% probability of success, and X3 is a random number generated by a Poisson distribution with historical average of 10.
- Using the above numbers, we generate two random variables: Y is the sum of X1, X2, and X3; Z is the product of X1-squared, X2, and X3.
- We may repeat this process 10,000 times, generating 10,000 Ys and Zs.
- Given the intuition behind Normal Probability Distribution, we expect Ys to be normally distributed and Zs to be non-normal. This difference is well reflected in the histograms below.
- Histogram for Y, which is the sum of three random events
- Histogram for Z, which is the product of three random events
- Python code for the above experiment (.PY)


- What is the bell-shaped curve that we always see for the Normal Probability Distribution? That is called the NORMAL DENSITY, which is basically a smoothed histogram for the associated probabilities. Imagine that you draw the histogram for the probability of winning the game X number of times, where X takes different values. For this graph, the horizontal axis is the number of wins and the vertical axis is the associated probabilities. You can connect the tops of the histogram bars to each other in a smooth fashion. The resulting curve is called the Normal Density.
- What is the Normal Density used for? It is used to compute *the probability that you win the game X number of times or less*. The area under the Normal Density to the left of X is equal to the probability that you win the game X number of times or less. Thus, the Normal Density is used to identify Cumulative Normal Probabilities. And keep in mind that the area under the Normal Density over the entire range of X is always equal to 1.

- As illustrated in slide 78, the Normal Probability Distribution is known by:
- its symmetric bell-shaped density
- its mean
- its standard deviation

The Road Ahead:

Because of a theorem called the Central Limit Theorem, we will make use of the Normal Probability Distribution frequently.

- STANDARDIZED Normal Probability Distribution:
- In practice, there are a lot of different random variables that are “normally distributed.” These variables all have a symmetric bell-shaped density. However, depending on their units and scale, they are going to have different means and standard deviations.
- Think of these two variables: height and weight. Let’s assume that these variables are normally distributed, which is likely to be the case. These variables are measured using different units (e.g., feet vs. lbs), and they also have different scales (e.g., different ranges).

- To get rid of the differences in means and standard deviations, we may transform the data by using the Z-score of each observation rather than the observation itself. This is known as “standardization.” For observations that are normally distributed, the Z-scores are also normally distributed. Unlike the original distribution, however, the mean and standard deviation of the *standardized* distribution are always equal to zero and one, respectively.
- Reminder: What is the Z-score? Z-scores are computed by taking the difference between the value of each observation and the mean, divided (a.k.a. adjusted) by the standard deviation. They measure the deviation of each observation from the mean in terms of standard deviations; e.g., an observation with a Z-score of +2 is two standard deviations greater than the mean, while an observation with a Z-score of -2 is two standard deviations less than the mean.
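To see standardization in action, here is a short Python sketch with a made-up sample of weights; after standardization, the mean and standard deviation of the Z-scores come out to exactly zero and one:

```python
import statistics

weights = [150, 160, 170, 180, 190, 200, 210]  # hypothetical weights (lbs)
mean = statistics.mean(weights)
sd = statistics.stdev(weights)

# Z-score: deviation from the mean, measured in standard deviations
z_scores = [(x - mean) / sd for x in weights]

# The standardized data always has mean 0 and standard deviation 1
z_mean = statistics.mean(z_scores)
z_sd = statistics.stdev(z_scores)
```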

- As illustrated in slide 83, the Standardized Normal Probability Distribution is known by:
- its symmetric bell-shaped density
- its mean, which is always equal to zero
- its standard deviation, which is always equal to one

The most important application of the Normal Probability Distribution in this course is finding normal probabilities. It is very important to keep in mind that we can make use of the Normal Density to compute *cumulative* normal probabilities.

Let X be a normally distributed random variable (e.g., weight). Let *a* and *b* be some constants (e.g., a = 125 lbs and b = 175 lbs). We can make use of the Normal Density to compute the probability that X is less than or equal to *a*; that X is greater than or equal to *a* and at the same time less than or equal to *b* (e.g., Slide 88); and that X is greater than or equal to *b*. Identifying these probabilities can be done either using the Cumulative Standardized Normal Probability Distribution (which requires standardization) or using the functions that are built into Excel (which do not require standardization).

- Click here to obtain the Cumulative Standardized Normal Probability Distribution Table. Your textbook explains how you may use this table. I do not explain this method, however, as I find it outdated. I use Excel instead.
- Use this Excel file as a playground. It helps you identify normal probabilities faster. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
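Outside Excel, the same three cumulative normal probabilities can be computed with Python's standard library (`statistics.NormalDist`). The mean of 150 lbs and standard deviation of 20 lbs below are assumed for illustration:

```python
from statistics import NormalDist

# Hypothetical: weight X ~ Normal(mean=150 lbs, sd=20 lbs), with a=125 and b=175
X = NormalDist(mu=150, sigma=20)
a, b = 125, 175

p_below_a = X.cdf(a)              # Prob(X <= a): area to the left of a
p_between = X.cdf(b) - X.cdf(a)   # Prob(a <= X <= b): area between a and b
p_above_b = 1 - X.cdf(b)          # Prob(X >= b): area to the right of b
```

The three areas under the density always sum to 1, since they cover the entire range of X.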

**Chapter 7. Sampling Distribution**

- Lecture Presentation (PDF)
- Highlights:
- The Sampling Distribution of Means is among the key concepts in Statistical Inference. An intuitive understanding of this concept will help you a lot in understanding how inference is conducted.
- Imagine a “population” in which the items are not all the same, like Oddland (slide 8). You are asked to choose a randomly selected sample and compute a mean for that very sample (e.g., average age). Let’s call the computed mean *X-bar-one*. You are then asked to repeat the same exercise one more time: choose another random sample, compute a mean, and call the computed mean *X-bar-two*. In fact, you are asked to repeat this, say, a hundred times, obtaining *X-bar-one*, *X-bar-two*, *X-bar-three*, …, *X-bar-hundred*. Depending upon the items that are included in the randomly selected samples, the computed means may be different from one another. Some are, in fact, equal, but you will obtain other values for the sample means as well. Slides 10 to 13 illustrate this quite well.
- Since sample means *vary* with the *random* sample drawn, the sample means above become a *random variable*: a variable with a hundred values that may or may not be equal to one another.
- THE SAMPLING DISTRIBUTION OF MEANS is the distribution of sample means that are obtained from repeated sampling. Like other distributions, one may identify:
  - The MEAN of the sampling distribution of means
  - The STANDARD DEVIATION of the sampling distribution of means (a.k.a. the Standard Error)

- What type of distribution will the sampling distribution of means, then, follow? This is an important question that is addressed in this course under two sets of assumptions.
- When the population from which random samples are drawn has a normal distribution, the sampling distribution of means is also normal. The mean of the sampling distribution of means in this case is equal to the population mean. The standard deviation of the sampling distribution of means is equal to the population standard deviation divided by the square root of the sample size. For the formula, please refer to slide 21.
- When the population from which random samples are drawn is not normal, the sampling distribution of means is approximately normal provided that the sample size is large enough. The mean of the sampling distribution of means in this case is again equal to the population mean. The standard deviation of the sampling distribution of means is again equal to the population standard deviation divided by the square root of the sample size. For the formula, please refer to slide 37.


The CENTRAL LIMIT THEOREM implies that, as the sample size gets large enough, the Sampling Distribution of Means is normally distributed. This is true regardless of the shape of the population distribution. But how large is “large enough”? As a general rule, when the sample size is larger than 30, the sampling distribution of means is approximately normal (slide 41).

- To illustrate the implications of Central Limit Theorem, we may conduct a short exercise using real data:
- Let us begin with this histogram, which shows the distribution of 750 midterm grades in Elements of Statistics. Since each question is worth five points in the exams, the width of each class in this histogram is set to be equal to five points.
- From the population of midterm grades (N=750):
- we choose a random sample of 50 observations with replacement,
- we compute the mean grade for the chosen sample,
- we record the obtained mean in a new data set
- and we repeat the three steps above 500,000 times, which in turn yields 500,000 mean grades that are each coming from a sample of 50 observations

- Given these recorded sample means, we can now plot the histogram for the sampling distribution of means with 500,000 observations, where each observation is the mean grade coming from a randomly selected sample of 50 grades. In this histogram, the width of each class is set to be equal to half a point.
- Despite the fact that the *Population Grade Distribution* does not look like a normal distribution, the *Sampling Distribution of Grade Means*, for which n=50, looks very much like a normal distribution. Plus, we observe that:
  - the mean of the grade means (=76.97) is an unbiased estimator of the population mean grade (=76.97)
  - the standard deviation of the grade means (=2.40) is much smaller than the population grade standard deviation (=16.99). In fact, the population grade standard deviation divided by the square root of the sample size (=50) equals the standard deviation of the grade means. To confirm this, just type =16.99/SQRT(50) into any cell in Excel.

- In case you are interested, the code for the above exercise is written in Stata. If you do not have Stata installed on your computer, you may use the computer lab at the Department of Economics. Download the code (.DO) and grades (.DTA). For confidentiality reasons, no name or identification number is included in the grades data set. Thus, you should not be worried about using this data set.
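If you prefer Python to Stata, the same resampling exercise can be sketched with the standard library alone. Since the grades data set is not reproduced here, the sketch below substitutes a made-up, clearly non-normal population (and fewer repetitions than the 500,000 used in class); the Central Limit Theorem predictions still hold:

```python
import math
import random
import statistics

random.seed(1)

# A made-up right-skewed (non-normal) population of 750 "grades"
population = [min(100, 50 + random.expovariate(1 / 20)) for _ in range(750)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

n, reps = 50, 5_000
sample_means = []
for _ in range(reps):
    sample = random.choices(population, k=n)  # random sample WITH replacement
    sample_means.append(statistics.mean(sample))

# CLT predictions: the mean of the sample means is close to the population
# mean, and their standard deviation is close to pop_sd / sqrt(n)
clt_se = pop_sd / math.sqrt(n)
```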

- NOTE: I cover the Sampling Distribution of Proportions in my lectures only if time allows. Nevertheless, I left the slides at the end of lecture presentations for Chapter 7.

**Part III – Statistical Inference**

Let’s play a game, called Deadly Distribution. You will soon realize why this game has been incorporated in the lecture series. To access Deadly Distribution, go to *Canvas*, click on *Assignments*, and you will find the game under the *Extra Credits*. You may obtain 10 points, which may improve your midterm grades, if you accomplish all the missions. (This game is designed and developed by the OU K20 Center.)

**Chapter 8. Confidence Interval Estimation**

- Lecture Presentation (PDF)
- Highlights:
- Confidence Interval Estimation is a direct application of what you learned about Normal Density (Ch. 6) and Sampling Distribution of Means (Ch. 7).
- What is Confidence Interval Estimation about? Often, we do not know much about POPULATION PARAMETERS, like population mean (e.g., the average number of newly hired employees among all firms in the US over the last year). However, we are able to select a random sample from the population (e.g., a random sample of American firms), and compute SAMPLE STATISTICS, like sample mean (e.g., the average number of newly hired employees among the selected sample of firms over the last year). Employing Confidence Interval Estimation techniques, we are able to ESTIMATE the population mean using the information obtained from the sample.

Population mean and population standard deviation are often unknown PARAMETERS. Using STATISTICS such as sample mean and sample standard deviation, we are able to estimate the above-mentioned PARAMETERS. Drawing conclusions about the properties of a population using sample information is known as STATISTICAL INFERENCE.

- Confidence Interval Estimation is conducted under two sets of assumptions:
  - The not-so-realistic assumption: the population standard deviation is known to us. In this case, we make use of the *Normal Probability Distribution*.
  - The realistic assumption: the population standard deviation is unknown to us. In this case, we make use of another probability distribution, called *Student’s t Distribution*.
- Confidence Interval Estimation provides us with two so-called *limits*:
  - The Upper Confidence Limit, which is the point estimate (e.g., sample mean) plus the product of the critical value (determined by the level of confidence and the assumption above) and the standard error of the sampling distribution (e.g., the sampling distribution of means)
  - The Lower Confidence Limit, which is the point estimate (e.g., sample mean) minus the product of the critical value (determined by the level of confidence and the assumption above) and the standard error of the sampling distribution (e.g., the sampling distribution of means)
  - For more on the *limits*, refer to Slide 52 (in which we assume that the population standard deviation is known to us) and Slide 81 (in which we assume that it is unknown to us).
  - For more on the critical values, refer to Slide 55 (in which we assume that the population standard deviation is known to us) and Slide 78 (in which we assume that it is unknown to us).
- To estimate the above limits, under the realistic assumption of an unknown population standard deviation, you may make use of this Excel file. Again, think of it as a playground. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
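As a complement to the Excel playground, here is a Python sketch of the interval under the not-so-realistic assumption of a known population standard deviation; the sample values and sigma below are made up for illustration:

```python
import math
from statistics import NormalDist, mean

# Hypothetical sample: hours spent on social media by 10 students
sample = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 2.0, 3.5, 2.0]
sigma = 1.0          # assumed KNOWN population standard deviation
confidence = 0.95

n = len(sample)
x_bar = mean(sample)
std_error = sigma / math.sqrt(n)
# critical value: z such that the middle 95% of the normal density lies within +/- z
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower = x_bar - z_crit * std_error   # Lower Confidence Limit
upper = x_bar + z_crit * std_error   # Upper Confidence Limit
```

Raising the confidence level to 99% raises `z_crit` (from about 1.96 to about 2.58), which widens the interval: greater confidence comes at the cost of precision.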

STUDENT’S t DISTRIBUTION is a probability distribution. It looks almost like a Normal Probability Distribution when the sample size is large (say, more than 120). As the sample size gets smaller, the right and left tails of the distribution become a bit fatter and the peakedness of the distribution declines. (Slide 74 offers a detailed illustration.)

Sample size determines the DEGREES OF FREEDOM of the Student’s t Distribution. The degrees of freedom reflect the number of observations that can *freely* vary while a sample statistic (e.g., the sample mean) is kept constant. To obtain a predetermined sample mean, for instance, we may only change *n-1* observations in a sample of *n* observations. Given the predetermined mean, the *n*th observation depends on the other observations and cannot freely vary. In this case, therefore, the degrees of freedom are equal to *n-1*.

- To understand the intuition behind degrees of freedom better, you may refer to the explanation provided by Ding, Jin, and Shuai (2017) in the journal Teaching Statistics.

- Example: The Average Amount on Time Spent on Social Media
- We surveyed 70 students in class on how much time they spent on social media over a non-exam weekend in April. The social media platforms of interest include Facebook, Youtube, Twitter, Instagram, and LinkedIn. The result of the survey is given in this file, and the histogram looks like this.
- Given the dataset collected, we may compute a 95% confidence interval for the average time spent on social media over a weekend among the population of students in my class. Relying on sample evidence, we could be 95% confident that the average time spent on social media over a weekend is greater than 2 hours, yet it is less than 3 hours. Click here for more details.
- To estimate this interval with greater confidence, you may change the probability used for the critical value. Go ahead and plug in 1-0.99 rather than 1-0.95 to obtain the critical value associated with the 99% confidence level. What would happen to the confidence interval estimate? Why? Make sure that you use a drawing to justify your answer.

- NOTE: I cover the Confidence Interval Estimation of Proportion in my lectures only if time allows. Nevertheless, I left the slides at the end of lecture presentations for Chapter 8.

**Chapter 9. Hypothesis Testing: One-sample Tests**

- Lecture Presentation (PDF)
- In statistical inference, a HYPOTHESIS is always about a POPULATION PARAMETER; e.g.: population mean. Assuming that the hypothesis of interest is true (e.g., H0: population mean weight = 185 pounds), one gathers some evidence from a randomly selected sample of observations, trying to REJECT the null hypothesis (H0) and ACCEPT an alternative hypothesis (e.g., H1: population mean weight > 185 pounds or H1: population mean weight < 185 pounds).
- In Chapter 9, you learn how to test a hypothesis about one population using One-sample Tests. In Chapter 10, you learn how to test a hypothesis about two populations using Two-sample Tests.
- Keep in mind that:
- You may never *accept* the null hypothesis (H0), as you always begin by *assuming* that the null hypothesis is true.
- You may only *accept* the alternative hypothesis (H1) when you find enough evidence suggesting that your assumption about the null hypothesis was irrelevant.
- You may never be 100% sure about your conclusion. At best, you may say that, given the sample evidence, the alternative hypothesis (H1) is more likely than the null hypothesis.
- You may commit a *Type I Error* should you reject a true null hypothesis. It is like concluding that someone is guilty (rejecting the assumption of innocence) while she is, in fact, innocent.
  - The probability of a *Type I Error* determines the *confidence* that you may have in your test. The greater this probability, the lower the confidence.
- You may commit a *Type II Error* should you fail to reject a false null hypothesis. It is like concluding that someone is innocent (not being able to reject the assumption of innocence) while she is, in fact, guilty.
  - The probability of a *Type II Error* determines the *power* of your test. The greater this probability, the lower the power.
- Type *I* and Type *II* errors may not happen at the same time. The former requires the null hypothesis to be true, while the latter requires the null hypothesis to be false. We cannot have a null hypothesis which is true and false at the same time. That is why these two errors may not happen at the same time. What we focus on is the Type *I* error and the confidence in the test.

- Here is the recipe for One-sample Hypothesis Testing:
- State the null hypothesis (H0) and the alternative (H1).
- Choose the probability of committing a Type *I* error. It is conventional to choose 1%, 5%, or sometimes even 10%. Also, choose a sample size, *n*.
  - Keep in mind that one minus the probability of committing a Type *I* error determines the confidence that you have in your test; e.g., if the probability of a Type *I* error is equal to 5%, you have 95% confidence in your test.
  - The sample size affects the standard deviation of the sampling distribution. Remember that the greater the sample size, the lower the standard deviation of the sampling distribution of means.
- Determine the TEST STATISTIC.
- If you know the population standard deviation (which is quite unlikely), then use the Z-stat. (The formula is given in Slide 68)
- If you don’t know the population standard deviation (which is very likely), then use the t-stat. (The formula is given in Slide 114)

- Collect the data and compute the value of the chosen test statistic, given the formula. In the formula, *X-bar* is the sample mean, *Mu* is the population mean under the null hypothesis (the hypothesized population mean, if you wish), and *n* is the sample size.
  - If you have access to the population standard deviation (*Sigma*), you may use that value in the Z-stat.
  - If you do not have access to the population standard deviation, which is often the case, then use the sample standard deviation (*S*) in the t-stat.

- Given the probability of committing a Type *I* error, use either CRITICAL VALUES or the P-VALUE to draw a conclusion.
  - Critical Value Approach:
    - Given the null hypothesis, the critical values associated with the probability of committing a Type *I* error divide the sampling distribution of means into two areas, *Rejection* and *No-rejection*. Take a look at two examples:
      - Slide 70: Rejection and No-rejection areas in a two-tails test, where strict equality is used in the null hypothesis (e.g., H0: population mean weight is equal to 185 pounds)
      - Slide 133: Rejection and No-rejection areas in a one-tail test, where an inequality is used in the null hypothesis (e.g., H0: population mean weight is greater than or equal to 185 pounds)
    - Reject the null hypothesis (H0) if the test statistic is in the Rejection area (e.g., Slide 81 for a two-tails test and Slide 136 for a one-tail test).
    - Do not reject the null hypothesis (H0) if the test statistic is in the No-rejection area (e.g., Slide 122).
  - P-value Approach:
    - Given the sampling distribution, the p-value is the probability of the obtained test statistic or anything more extreme; e.g., Slide 90.
    - If the p-value is lower than the probability of committing a Type *I* error (e.g., p-value < 5%), then you may safely reject the null hypothesis: *when the p-value is low, the null must go*.
    - If the p-value is greater than the probability of committing a Type *I* error (e.g., p-value > 5%), then you may not reject the null hypothesis.
- Don’t forget to explain the conclusion in the context of the problem. Use plain English!

The p-value is a key concept in hypothesis testing, and it is quite easy to work with. The p-value is the probability of the evidence obtained (or anything more extreme), assuming that the null is true. If the evidence obtained is highly unlikely (e.g., p-value < 5%), then there should be something wrong with our assumption that the null hypothesis is true. This may lead us to reject the null hypothesis.

Statistics computer packages often report the p-value associated with a hypothesis test. When you come across them, you should always keep in mind that:

when the p-value is low, the null must go.

- You may use Excel to conduct one-sample hypothesis testing.
- If you know the population standard deviation (which is unlikely), then go to Slides 103 – 106 to learn how to make use of Z-stat, critical values, and p-value.
- If you do not know the population standard deviation (which is more likely), then go to Slides 123 – 126 to learn how to make use of t-stat, critical values, and p-values.
- Also, download this Excel file for a two-tails one-sample test. Think of it as a playground. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
- There are three useful Excel functions for identifying the p-value associated with a null hypothesis. To illustrate them, we use a random sample of the grand total of home game attendance for a given MLB team during a given season between 1990 and 2010. We test three sets of hypotheses using Excel. The hypotheses are listed below, the p-value functions are given, and detailed computations are conducted in this file. To download this file on your computer, go to *File*, then select *Download as*, and select *Microsoft Excel (.xlsx)*.
  - H0: The Average Attendance Per Season = 2 million fans vs. H1: The Average Attendance Per Season ~= 2 million fans (Note: ~= is used for *not equal to*)
    - The p-value function: =T.DIST.2T(t-stat,df). Note: in this case, the absolute value of the t-stat should be entered; e.g., for a t-stat of -1.18, you must enter 1.18.
  - H0: The Average Attendance Per Season <= 2 million fans vs. H1: The Average Attendance Per Season > 2 million fans (Note: <= is used for *less than or equal to*)
    - The p-value function: =T.DIST.RT(t-stat,df). Reminder: the rejection area of this test is on the right tail, which is why we compute the p-value on the right tail (RT).
  - H0: The Average Attendance Per Season >= 2 million fans vs. H1: The Average Attendance Per Season < 2 million fans (Note: >= is used for *greater than or equal to*)
    - The p-value function: =T.DIST(t-stat,df,TRUE). Reminder: the rejection area of this test is on the left tail, which is why we compute the p-value on the left tail.
  - The resulting p-values, computed in the first three tabs of this file, suggest that the average attendance per season is greater than 2 million fans. To estimate the magnitude, we may employ a Confidence Interval Estimation (CIE), which suggests that we can be 95% confident that, between 1990 and 2010, on average more than 2.16 million fans but less than 2.59 million fans attended the MLB games each season.
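Outside Excel, the same three p-values can be approximated in Python with only the standard library by integrating the Student's t density numerically. The sample numbers below (n, X-bar, S) are illustrative, not the values in the linked file:

```python
import math

def t_pdf(x, df):
    # Student's t density with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=10_000):
    # Prob(T <= x): trapezoid-rule integration of the density from 0 to |x|,
    # exploiting the symmetry of the t distribution around zero
    h = abs(x) / steps
    area = sum((t_pdf(i * h, df) + t_pdf((i + 1) * h, df)) * h / 2 for i in range(steps))
    return 0.5 + area if x >= 0 else 0.5 - area

# Hypothetical sample: n=21 seasons, X-bar=2.37m fans, S=0.48m, H0: mu = 2m
n, x_bar, s, mu0 = 21, 2.37, 0.48, 2.0
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
df = n - 1

p_two_tail = 2 * (1 - t_cdf(abs(t_stat), df))   # Excel: =T.DIST.2T(ABS(t),df)
p_right    = 1 - t_cdf(t_stat, df)              # Excel: =T.DIST.RT(t,df)
p_left     = t_cdf(t_stat, df)                  # Excel: =T.DIST(t,df,TRUE)
```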

- Hypothesis Testing Summary (pages 1 and 2 are relevant for this chapter).
- NOTE: I cover the Hypothesis Test of Proportion in my lectures only if time allows. Nevertheless, I left the slides at the end of lecture presentations for Chapter 9.

**Chapter 10. Hypothesis Testing: Two-sample Tests**

- Lecture Presentation (PDF)
- Using *Two-sample* Tests, one may test:
  - How the MEANS of two INDEPENDENT populations relate to each other; e.g.: H0: Average Productivity of Exporting Firms = Average Productivity of Domestic Firms
- How the MEANS of two RELATED populations relate to each other; e.g.: H0: Average Productivity of Exporters Who Randomly Receive Subsidies = Average Productivity of Similar Exporting Firms With No Subsidy
- How the VARIANCE of two INDEPENDENT populations relate to each other; e.g.: H0: Variance of Sales Among Exporting Firms = Variance of Sales Among Domestic Firms

- Note: It is conventional to test the difference between the means in the null hypothesis; e.g., H0: Average Productivity of Exporting Firms – Average Productivity of Domestic Firms = 0 (For more, see Slide 10). It is also conventional to test the ratio of the variances in the null hypothesis; e.g., H0: Variance of Sales Among Exporting Firms divided by the Variance of Sales Among Domestic Firms = 1 (For more, see Slide 51).
- Like One-sample Tests, we intend to *reject* the null hypothesis (H0) using the appropriate test statistic by:
  - Comparing the value of the test statistic to the critical value(s), as given by the sampling distribution
  - Comparing the p-value associated with the test statistic (i.e., the probability of the obtained test statistic or anything more extreme) to conventional values such as 1%, 5%, or even 10%.

- The decision upon rejecting the null hypothesis (H0) in a Two-sample Test is quite similar to the decision upon rejection in a One-sample Test.
- Testing the means of two independent populations is done under two sets of assumptions:
  - Assumption 1: The unknown *variances* of the independent populations are *equal*.
    - Under this assumption, one may employ a *pooled-variance t test*. (For the formula for the pooled variance, refer to Slide 13)
    - The t-stat in this case is the difference between the difference in sample means and the difference in hypothesized population means, divided by the standard error derived from the pooled variance. (For the formula for this particular t-stat, refer to Slide 14)
    - You may easily derive the confidence interval for the difference in population means using the pooled variance. Refer to Slide 16 for more details.
  - Assumption 2: The unknown *variances* of the independent populations are *not equal*.
    - Under this assumption, one may employ a *separate-variance t test*, in which the variance estimate is the sum of the sample variances, each divided by its own sample size.
    - The t-stat in this case is the difference between the difference in sample means and the difference in hypothesized population means, divided by the square root of the separate-variance estimate. (For the formula for this particular t-stat, refer to Slide 26)
    - The above t-stat has its own degrees-of-freedom formula, as given in Slide 27.
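The pooled-variance case can be sketched numerically; the productivity samples below are made up for illustration:

```python
import math
import statistics

# Hypothetical productivity samples for exporting vs. domestic firms
exporters = [12.1, 14.3, 13.8, 15.0, 12.9, 14.7, 13.2, 15.4]
domestic = [11.0, 12.2, 13.1, 11.8, 12.5, 10.9, 12.0]

n1, n2 = len(exporters), len(domestic)
x1, x2 = statistics.mean(exporters), statistics.mean(domestic)
s1_sq, s2_sq = statistics.variance(exporters), statistics.variance(domestic)

# Pooled variance: a weighted average of the two sample variances
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Pooled-variance t-stat for H0: mu1 - mu2 = 0
hypothesized_diff = 0
t_stat = ((x1 - x2) - hypothesized_diff) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))
df = n1 + n2 - 2  # degrees of freedom for the pooled-variance t test
```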
- Testing the means of two related populations is based on the *difference* between the *paired* values (Slides 35 and 36), which is why this is called a Paired Difference Test. The following offers step-by-step instructions:
  - Step 1) Using the two samples, compute the difference between the paired values for each observation. The paired differences become your new sample.
  - Step 2) Compute the sample mean of the differences between the paired values.
  - Step 3) Compute the sample standard deviation of the differences between the paired values.
  - Step 4) Form the t-test as given by Slide 40 (or, alternatively, form the confidence interval as given by Slide 42).
  - Step 5) Compare the t-stat to the critical values, given the significance level of your test. Alternatively, compare the associated p-value to the conventional probabilities of Type *I* error.
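The steps of the Paired Difference Test can be sketched as follows; the before/after productivity numbers are made up for illustration:

```python
import math
import statistics

# Hypothetical paired data: productivity before and after receiving a subsidy
before = [10.2, 11.5, 9.8, 12.0, 10.7, 11.1]
after = [10.9, 12.1, 10.0, 12.6, 11.2, 11.8]

# Step 1: the paired differences become the new sample
d = [a - b for a, b in zip(after, before)]
n = len(d)

# Steps 2-3: mean and standard deviation of the differences
d_bar = statistics.mean(d)
s_d = statistics.stdev(d)

# Step 4: t-stat for H0: mean difference = 0, with n - 1 degrees of freedom
t_stat = (d_bar - 0) / (s_d / math.sqrt(n))
```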
- Testing the variances of two independent populations is done using:
  - The F-stat:
    - The F-stat is simply the ratio of the sample variances, with the larger sample variance in the numerator and the smaller one in the denominator.
    - The F-stat has two degrees of freedom. The first is the sample size minus one for the sample with the larger variance. The second is the sample size minus one for the sample with the smaller variance.
  - The F-distribution:
    - Assuming that the populations of interest are normally distributed, the sampling distribution of the ratio of variances follows an F-distribution.
    - Given the probability of a Type *I* error, one may identify the critical value separating the Rejection and No-rejection areas.
      - See Slide 58 for an illustration of the above areas.
      - Use Excel if you would like to identify the critical value (Slide 55).
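The larger-variance-in-the-numerator convention can be sketched as follows, with made-up sales samples:

```python
import statistics

# Hypothetical sales samples for exporting vs. domestic firms
exporter_sales = [210, 340, 180, 420, 260, 390, 310]
domestic_sales = [150, 170, 160, 185, 155, 175]

v_exp = statistics.variance(exporter_sales)
v_dom = statistics.variance(domestic_sales)

# Larger sample variance goes in the numerator, smaller in the denominator,
# so the F-stat is always at least 1; df1 belongs to the larger-variance sample
if v_exp >= v_dom:
    f_stat = v_exp / v_dom
    df1, df2 = len(exporter_sales) - 1, len(domestic_sales) - 1
else:
    f_stat = v_dom / v_exp
    df1, df2 = len(domestic_sales) - 1, len(exporter_sales) - 1
```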
- Hypothesis Testing Summary (pages 3 and 4 are relevant for this chapter).

**Chapter 11. Analysis of Variance**

- Lecture Presentation (PDF)
- In one sample tests, we focus only on one population parameter. In two sample tests, we focus on two population parameters. In Analysis of Variance (ANOVA), we focus on three or more population parameters.
- Given the scope of this lecture series, we only examine One-way ANOVA, which relates to completely randomized designs that incorporate only one factor into the analysis.
- Example: Ceteris paribus (i.e., all else unchanged), how much of a *factor* is the golf club brand in determining the distance traveled?
- Running an experiment, one observes the total variation, which could be measured by Total Sum of Squares (SST). This measure is defined as the sum of the squared differences between each observation and the grand mean (i.e., the mean of all data values). Refer to slide 23 for an illustration and to Slide 27 for the formula. The total variation could, then, be partitioned into two sets of variations:
- The variations that are due to differences *among* groups: Sum of Squares Among Groups (SSA)
  - The SSA variations are generated by the factor of analysis. (Illustration: Slide 24; Formula: Slide 29)
  - Example: differences in distance traveled caused only by the choice of golf club brand.
- The variations that are due to differences *within* groups: Sum of Squares Within Groups (SSW)
  - The SSW variations are generated by random things that could potentially affect the outcome but that we have no control over. (Illustration: Slide 25; Formula: Slide 32)
  - Example: differences in distance traveled caused by a sudden change in the wind’s direction.
- The above measures of variation can then be divided by their degrees of freedom to obtain something like a variance.
- Degrees of Freedom and Mean Squares:
  - For SSA, the degrees of freedom equal the number of groups, determined by the factor of interest, minus one. For instance, if we study three different brands of golf club, then the degrees of freedom for SSA equal two.
    - The Mean Square for SSA is therefore equal to SSA divided by the above degrees of freedom (Slide 34). We call this MSA; it measures the average variation caused by the factor of interest.
  - For SSW, the degrees of freedom equal the number of observations minus the number of groups. (The reasoning behind this is described fully in Slide 37)
    - The Mean Square for SSW is therefore equal to SSW divided by the above degrees of freedom (Slide 34). We call this MSW; it measures the average variation caused by random things that we have no control over.
- To conduct One-way ANOVA, we perform an F test, where the F-statistic is the ratio of MSA over MSW and the degrees of freedom are as described above. Refer to Slide 41 for the formal set-up.

Though One-way ANOVA examines variation by employing the mean square among groups (MSA) and the mean square within groups (MSW), the purpose of One-way ANOVA is to reach conclusions about possible differences among the group means. In a sense, we use a ratio of two measures of sample variance to say something about population means.
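The partition of variation and the F statistic can be verified numerically; this Python sketch uses made-up distances for three hypothetical golf club brands:

```python
import statistics

# Hypothetical distances traveled (yards) for three golf club brands
groups = {
    "Brand A": [251, 262, 263, 248, 259],
    "Brand B": [234, 218, 235, 227, 216],
    "Brand C": [230, 225, 241, 235, 229],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = statistics.mean(all_values)
n, c = len(all_values), len(groups)

# SST: squared deviations of every observation from the grand mean
sst = sum((x - grand_mean) ** 2 for x in all_values)
# SSA: squared deviations of each group mean from the grand mean, weighted by group size
ssa = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values())
# SSW: squared deviations of each observation from its own group mean
ssw = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups.values())

msa = ssa / (c - 1)   # mean square among groups
msw = ssw / (n - c)   # mean square within groups
f_stat = msa / msw    # One-way ANOVA F statistic, df = (c - 1, n - c)
```

Note that SST = SSA + SSW: the total variation partitions exactly into the among-group and within-group pieces.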