**Data & Sampling Worksheet**

**Vocab! Vocab! Vocab!**

Some variables numerically measure some characteristic of an individual, such as height, weight, exam scores, and so on. These are called

__quantitative variables.__

Other variables simply put individuals into categories, such as sex, school subject, color, and so on. These are called

__qualitative variables__, or

__categorical variables.__

**A Yardstick:** If it makes sense to take an average, then you likely have quantitative data.

**Example**: The data are the number of machines in a gym. You sample five gyms. One gym has $12$ machines, one gym has $15$ machines, one gym has $10$ machines, one gym has $22$ machines, and the other gym has $20$ machines. What type of data is this?

This data is quantitative.

**Example**: The data are the areas of lawns in square feet. You sample five houses. The areas of the lawns are $144$ sq. feet, $160$ sq. feet, $190$ sq. feet, $180$ sq. feet, and $210$ sq. feet. What type of data is this?

This data is quantitative.

**Example**: The data are the colors of houses. You sample five houses. The colors of the houses are white, yellow, white, red, and white. What type of data is this?

This data is categorical.
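The yardstick can be tried on the examples above. This is only a small sketch in Python; the variable names `machines` and `house_colors` are made up for illustration.

```python
from collections import Counter
from statistics import mean

machines = [12, 15, 10, 22, 20]                               # gym example
house_colors = ["white", "yellow", "white", "red", "white"]   # house example

# Averaging the machine counts makes sense, so the data is quantitative.
print(mean(machines))

# Averaging colors does not make sense; we count each category instead.
print(Counter(house_colors))
```

Trying `mean(house_colors)` would raise an error, which is the yardstick telling us the data is categorical.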

**Two Types of Quantitative Data:** Discrete and Continuous.

**Discrete:** Discrete quantitative data can take on only a finite number of values, or only certain separated numerical values (such as whole numbers). For example, count data such as the number of books in a student's backpack is discrete.

**Continuous:** Continuous quantitative data can take on any decimal value in some interval. For example, the weight of books in a student's backpack can conceivably take on any non-negative decimal value.

**Example**: The data are the number of machines in a gym. You sample five gyms. One gym has $12$ machines, one gym has $15$ machines, one gym has $10$ machines, one gym has $22$ machines, and the other gym has $20$ machines. Is this quantitative data discrete or continuous?

Since we can have only whole numbers of machines, this quantitative data is discrete.

**Example**: The data are the areas of lawns in square feet. You sample five houses. The areas of the lawns are $144$ sq. feet, $160$ sq. feet, $190$ sq. feet, $180$ sq. feet, and $210$ sq. feet. Is this quantitative data discrete or continuous?

Since lawn area can be any positive decimal number, this quantitative data is continuous.

**Recall**: Super Important Vocab.

The

__population__ is the entire group of individuals (people, cars, animals, ball bearings, etc.) which we want information about.

A small part of the population we choose to gain information about the whole is called a

__sample.__

Gaining information about the entire population from a sample is called

__inference__.

**Question**: Why do we take samples instead of looking at the whole population?

**Answer**: Looking at the entire population might be too costly, time consuming, etc.

**Sampling Design**: The methods we use to collect our sample should be easy to describe in writing so that they can be repeated. These methods are called our

__sampling design__.

In order to create a sampling design we must:

- Define exactly what our population is
- Say exactly what we want to measure

**Example**: Suppose SWOCC sends a survey to students who are graduating this year. This year $230$ students are graduating and $100$ of the students are randomly selected to take the survey. Of these, $19$ students respond to the survey.

(a) What is the population of interest for this survey?

The graduating class of students for this year.

(b) What is the sample?

The $19$ surveys actually returned.
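The numbers in this example can be turned into two useful rates. A minimal sketch, using hypothetical variable names:

```python
graduating = 230   # size of the population of interest
selected = 100     # students randomly selected to take the survey
returned = 19      # surveys actually returned

sampling_fraction = selected / graduating   # share of the population selected
response_rate = returned / selected         # share of selected students who responded

print(f"{sampling_fraction:.1%} of graduates were selected")
print(f"{response_rate:.0%} of selected students responded")
```

Only about one in five of the selected students responded, so the sample we actually have is much smaller than the one we planned.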

**The BIG Question**: How do we choose a sample which is representative of our population?

**BIG Answer**: We don't choose. We let chance decide.

**Simple Random Sampling**: Give every individual in our population a numerical label and then choose randomly from these numbers.

Since simple random sampling is such a common method of sampling, we often call a "simple random sample" an "SRS."
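Here is one way an SRS could be drawn in Python, as a sketch rather than a prescribed method. It assumes the $230$ graduating students from the survey example are labeled $1$ through $230$; the seed is fixed only so the run can be repeated.

```python
import random

population_size = 230   # graduating class from the survey example
sample_size = 100

rng = random.Random(2024)                  # fixed seed: repeatable run
labels = range(1, population_size + 1)     # one numerical label per individual
srs = rng.sample(labels, sample_size)      # choose labels without replacement

print(sorted(srs)[:10])                    # first few chosen labels
```

Because `sample` draws without replacement, no student can be chosen twice.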

**Another Big Question:** Why random sampling?

**Answer**: To eliminate bias.

**Very Important Vocab**: A study is

__biased__ if it favors, or tends toward, a certain outcome which may or may not truly reflect the population.

**How To Sample Badly**.

**Bad Sample Design 1**: Start interviewing people at a shopping mall.

Who is over- or underrepresented in such a sample?

This brand of sampling is called

__convenience sampling.__

Here, the researcher is choosing who is in the sample.
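A small simulation can show how much this matters. Everything here is an assumption made for illustration: the population is $1000$ made-up ages, and the "convenience" sample is deliberately exaggerated so that only the $50$ youngest people are reached.

```python
import random
from statistics import mean

rng = random.Random(0)

# Synthetic population of made-up ages; this is not real data.
population = [rng.randint(18, 90) for _ in range(1000)]
true_mean = mean(population)

# Exaggerated convenience sample: the researcher only reaches the 50 youngest.
convenience = sorted(population)[:50]

# Many simple random samples of the same size, for comparison.
srs_means = [mean(rng.sample(population, 50)) for _ in range(500)]

print(round(true_mean, 1), round(mean(convenience), 1), round(mean(srs_means), 1))
```

The convenience sample lands far below the true mean age, while the random samples average out very close to it.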

**Bad Sample Design 2**: Create an online poll and let people respond.

**Example**: In $2007$, Lou Dobbs aired an episode of his program that criticized the idea of allowing undocumented immigrants to legally obtain driver's licenses as a public safety measure.

In the broadcast, he invited viewers to vote in an online poll which asked:

*"Would you be more or less likely to vote for a presidential candidate who supports giving drivers’ licenses to illegal aliens?"*

Of the $7350$ people in the sample, $97\%$ voted "Less likely."

This kind of sample is called a

__self-selected sample__, or a

__voluntary response sample__ because each participant chose to be in the sample.

**Big Fact**

A huge step in eliminating bias is to let neither the researcher nor members of the population (if the individuals are people) choose who is part of the sample.

Chance alone should decide.

**Savvy Citizen Fact**

Any conclusion based on a sample which is not random is questionable.

**Let's Talk About Sampling...**

**Choosing An SRS**: To take a simple random sample of a population, we assign a label to each individual, and choose randomly.

**Example:** Simple Random Sample

**Example:** Below are the heights of all the citizens of Squaresville. Let's suppose that we don't know the true mean height and sample citizens to estimate it.

**Another Sampling Design**: Stratified Random Sampling.

When a population is spread out over large areas, or there are many groups and subcategories, we often assign labels to each region or group. We then

- randomly choose a sample of regions or groups
- take an SRS from each randomly chosen region or group
- combine each SRS into a single sample.

These regions or groups are called

__strata__.
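The three steps listed above might look like this in Python. This is a sketch under made-up assumptions: a hypothetical population of $6$ regions with $40$ labeled individuals each, $3$ regions chosen, and an SRS of $5$ taken from each.

```python
import random

rng = random.Random(1)   # fixed seed: repeatable run

# Hypothetical population: 6 regions, 40 labeled individuals in each.
strata = {f"region_{r}": [f"r{r}_person_{i}" for i in range(40)] for r in range(6)}

# Step 1: randomly choose a sample of regions.
chosen_regions = rng.sample(sorted(strata), 3)

# Step 2: take an SRS from each chosen region.
per_region = [rng.sample(strata[name], 5) for name in chosen_regions]

# Step 3: combine each SRS into a single sample.
sample = [person for srs in per_region for person in srs]

print(chosen_regions, len(sample))
```

Choosing all $6$ regions in step 1 would draw an SRS from every region instead of only some of them.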

**Examples**:

- Sampling small plots in a large forest: break up forest into smaller strata.
- Break a large region up into smaller regions where states and/or counties serve as strata.

**Example:** Stratified Random Sample

**Yet Another Sampling Design**: Cluster Sampling.

If a population can be broken into groups (clusters) that are each of reasonable size to be studied exhaustively, then a

__cluster sample__ may be appropriate:

- randomly choose a sample of groups or clusters
- include EVERY individual from that cluster
- combine each entire cluster into a single sample.
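The steps above can be sketched in Python as follows. The population is hypothetical: $8$ clusters (say, classrooms) of $25$ individuals each, with $2$ clusters chosen at random.

```python
import random

rng = random.Random(2)   # fixed seed: repeatable run

# Hypothetical population: 8 clusters of 25 labeled individuals each.
clusters = {c: [f"c{c}_person_{i}" for i in range(25)] for c in range(8)}

# Randomly choose a sample of clusters.
chosen = rng.sample(sorted(clusters), 2)

# Include EVERY individual from each chosen cluster, combined into one sample.
sample = [person for c in chosen for person in clusters[c]]

print(chosen, len(sample))
```

Notice that chance only decides which clusters are picked; once a cluster is picked, nobody inside it is left out.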

**Example:** Cluster Sample

**Really?! Another Sampling Design?!** Systematic Sampling.

- assign a label to each individual
- compute $k = \frac{\mbox{number of individuals in the population}}{\mbox{number of individuals needed in the sample}}$, rounded to a whole number
- choose a random individual
- then include every $k$th individual in the sample down the line from your first data point.

**Example:** Systematic Random Sample

The population below is $400$ individuals. Let's collect $30$ data points. Then $$k=\frac{\mbox{number of individuals in the population}}{\mbox{number of individuals needed in the sample}}=\frac{400}{30} \approx 13.$$
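A sketch of this systematic sample in Python is given here. It makes two assumptions not stated above: individuals are labeled $0$ through $399$, and the random starting individual is chosen from the first $k$ labels so that we don't run off the end of the list before collecting $30$ data points.

```python
import random

population_size = 400
sample_size = 30
k = population_size // sample_size       # 400 // 30 gives 13

rng = random.Random(3)                   # fixed seed: repeatable run
start = rng.randrange(k)                 # assumption: start among the first k labels
labels = list(range(population_size))    # labels 0 through 399

# Take every k-th label from the start, stopping once we have 30.
sample = labels[start::k][:sample_size]

print(k, start, len(sample))
```

Only the starting point is random; after that, every other member of the sample is determined by $k$.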

**Problems In Practice**: Human subjects are the hardest to get good information about.

**Example**: Suppose we want to know the percentage of people who will vote for Candidate A or Candidate B.

How are we going to ask people the question: "Who do you plan to vote for this November?"

What problems could we run into when trying to get good information?

__Undercoverage__ occurs when some groups in the population are left out of the process of choosing the sample.

__Nonresponse__ occurs when an individual chosen for the sample can’t be contacted or refuses to participate. Any honest pollster will tell you their rate of nonresponse. It is a red flag if they don't.

The behavior of the respondent or of the interviewer can cause

__response bias__ in sample results.

The

__wording of questions__ is the most important influence on the answers given to a sample survey. Confusing or leading questions can introduce strong bias, and changes in wording can greatly change a survey’s outcome.

**Example:** Wording of Questions

Two differently worded questions, which essentially ask the same thing, can give you vastly different results. For example, in a survey about illegal immigration in the U.S., we have the following results.

When asked,

"Should illegal immigrants be prosecuted and deported for being in the U.S. illegally, or shouldn’t they?"

$69\%$ favored deportation.

On the other hand,

__the very same sample of people__ was asked whether illegal immigrants who have worked in the United States for two years

"should be given a chance to keep their jobs and eventually apply for legal status,"

to which $62\%$ said that they should be given a chance.