Data & Sampling

Vocab! Vocab! Vocab!

Some variables numerically measure some characteristic of an individual, such as height, weight, exam scores, and so on. These are called quantitative variables.

Other variables simply put individuals into categories, such as sex, school subject, color, and so on. These are called qualitative variables, or categorical variables.

A Yardstick: if it makes sense to take an average, then it's likely you have quantitative data.

Example: The data are the number of machines in a gym. You sample five gyms. One gym has $12$ machines, one gym has $15$ machines, one gym has $10$ machines, one gym has $22$ machines, and the other gym has $20$ machines. What type of data is this?

This data is quantitative.

Example: The data are the areas of lawns in square feet. You sample five houses. The areas of the lawns are $144$ sq. feet, $160$ sq. feet, $190$ sq. feet, $180$ sq. feet, and $210$ sq. feet. What type of data is this?

This data is quantitative.

Example: The data are the colors of houses. You sample five houses. The colors of the houses are white, yellow, white, red, and white. What type of data is this?

This data is categorical.
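The yardstick above can be tried on these three examples. The short Python sketch below averages the two quantitative data sets and tallies the categorical one, since an average of colors has no meaning. The variable names are illustrative and not part of the original notes.

```python
from collections import Counter
from statistics import mean

# Data from the three examples above
machines = [12, 15, 10, 22, 20]            # machines per gym
lawn_areas = [144, 160, 190, 180, 210]     # lawn areas in sq. feet
house_colors = ["white", "yellow", "white", "red", "white"]

# Averages make sense for quantitative data
mean_machines = mean(machines)       # 79 / 5 = 15.8
mean_lawn_area = mean(lawn_areas)    # 884 / 5 = 176.8

# For categorical data we count how often each category occurs instead
color_counts = Counter(house_colors)

print(mean_machines, mean_lawn_area)
print(color_counts["white"], color_counts["yellow"], color_counts["red"])
```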

Two Types of Quantitative Data: Discrete and Continuous.

Discrete: discrete quantitative data can take on only certain separate numerical values (often whole numbers), with gaps between the possible values. For example, count data such as the number of books in a student's backpack is discrete.

Continuous: continuous quantitative data can take on any decimal value in some interval. For example, the weight of books in a student's backpack can conceivably take on any non-negative decimal value.

Example: The data are the number of machines in a gym. You sample five gyms. One gym has $12$ machines, one gym has $15$ machines, one gym has $10$ machines, one gym has $22$ machines, and the other gym has $20$ machines. Is this quantitative data discrete or continuous?

Since we can have only whole numbers of machines, this quantitative data is discrete.

For the lawn areas in the earlier example: since lawn area can be any positive decimal number, that quantitative data is continuous.

Recall: Super Important Vocab.

The population is the entire group of individuals (people, cars, animals, ball bearings, etc.) which we want information about.

A small part of the population we choose to gain information about the whole is called a sample.

Gaining information about the entire population from a sample is called inference.

Question: Why do we take samples instead of looking at the whole population?

Answer: Looking at the entire population might be too costly, too time-consuming, etc.

Sampling Design: The methods we use to collect our sample should be easy to describe in writing so that they can be repeated. These methods are called our sampling design.

In order to create a sampling design we must:

Define exactly what our population is
Say exactly what we want to measure

Example: Suppose SWOCC sends a survey to students who are graduating this year. This year $230$ students are graduating and $100$ of the students are randomly selected to take the survey. Of these, $19$ students respond to the survey.

(a) What is the population of interest for this survey?

The graduating class of students for this year.

(b) What is the sample?

The $19$ surveys actually returned.

The BIG Question: How do we choose a sample which is representative of our population?

BIG Answer: We don't choose. We let chance decide.

Simple Random Sampling

Give every individual in our population a numerical label and then choose randomly from these numbers.

Since simple random sampling is such a common method of sampling, we often call a "simple random sample" an "SRS."
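As a minimal sketch of this idea, the Python snippet below labels the $230$ graduating students from the SWOCC example with the numbers $1$ through $230$ and lets chance pick $100$ of them. The function name `simple_random_sample` and the `seed` parameter are assumptions made for illustration; the seed only makes the example repeatable.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct labels chosen at random from 1..population_size."""
    rng = random.Random(seed)
    labels = range(1, population_size + 1)   # one numerical label per individual
    return rng.sample(labels, sample_size)   # every label is equally likely; no repeats

# 230 graduating students, 100 of them selected for the survey
chosen = simple_random_sample(230, 100, seed=1)
print(sorted(chosen)[:10])
```

Because `random.sample` draws without replacement, no student can be selected twice.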

Another Big Question: Why random sampling?

Answer: To eliminate bias.

Very Important Vocab

A study is biased if it favors, or tends toward, a certain outcome which may or may not truly reflect the population.

How To Sample Badly.

Bad Sampling Design 1: Start interviewing people at a shopping mall.

Who is over or under represented in such a sample?

This brand of sampling is called convenience sampling.

Here, the researcher is choosing who is in the sample.

Bad Sampling Design 2: Create an online poll and let people respond.

Example: A well-known television commentator publicly criticized the idea of allowing undocumented immigrants to legally obtain driver's licenses as a public safety measure.

Viewers were encouraged to vote in an online poll which asked:

"Would you be more or less likely to vote for a presidential candidate who supports giving drivers licenses to illegal aliens?"

Of the $7350$ people in the sample, $97\%$ voted "Less likely."

This kind of sample is called a self-selected sample, or a voluntary response sample, because each participant chose to be in the sample.

Big Fact

A huge step in eliminating bias is to let neither the researcher nor members of the population (if the individuals are people) choose who is a part of the sample.

Chance alone should decide.

Savvy Citizen Fact

Any conclusion based on a sample which is not random is questionable.

Let's Talk About Sampling...

Simple Random Sampling

To take a simple random sample (an SRS) of a population, we assign a label to each individual and choose randomly.

Simple Random Sample (SRS)

Random Sampling to Estimate a Mean

Below are the heights of all the citizens of Squaresville. Let's suppose that we don't know the true mean height and sample citizens to estimate it.

Random Sampling to Estimate a Proportion

Below is the favorite color of the citizens of Squaresville. $40\%$ of the citizens (West Squaresville) below prefer pink, while $60\%$ of the citizens (East Squaresville) below prefer blue. Let's pretend we don't know this and collect a random sample of dots to estimate the proportion (percentage) of pink dots.
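The estimate described above can be sketched in Python. The population size of $1000$, the sample size of $50$, and the function name `estimate_pink_proportion` are assumptions chosen for illustration; only the $40\%$ / $60\%$ split comes from the notes.

```python
import random

# Hypothetical population of 1000 citizens: 40% prefer pink, 60% prefer blue
population = ["pink"] * 400 + ["blue"] * 600

def estimate_pink_proportion(population, sample_size, seed=None):
    """Take an SRS and return the proportion of pink in the sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)   # SRS without replacement
    return sample.count("pink") / sample_size

true_proportion = population.count("pink") / len(population)   # 400 / 1000 = 0.4
estimate = estimate_pink_proportion(population, sample_size=50, seed=2)
print(true_proportion, estimate)
```

The estimate will usually land near $0.4$ without hitting it exactly, and a different seed gives a different sample and a different estimate.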

Another Sampling Design: Stratified Random Sampling.

When a population is spread out over large areas, or there are many groups and subcategories, we often assign labels to each region or group. We then

randomly choose a sample of regions or groups
take an SRS from each randomly chosen region or group
combine each SRS into a single sample.

The regions, groups, or subcategories are generically referred to as strata.

Examples:

Sampling small plots in a large forest: break up the forest into smaller strata.

Break a large region up into smaller regions where states and/or counties serve as strata.
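The three steps listed above can be sketched as follows. The region names, the stratum sizes, and the function name `stratified_sample` are hypothetical. Setting `n_strata` equal to the total number of strata draws an SRS from every stratum.

```python
import random

def stratified_sample(strata, n_strata, n_per_stratum, seed=None):
    """Randomly choose strata, take an SRS from each, and combine the results."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(strata), n_strata)          # step 1: choose strata
    sample = []
    for name in chosen_names:
        sample.extend(rng.sample(strata[name], n_per_stratum))   # step 2: SRS within a stratum
    return chosen_names, sample                                  # step 3: one combined sample

# Hypothetical population: 4 regions with 50 labeled individuals each
strata = {
    region: [f"{region}-{i}" for i in range(1, 51)]
    for region in ["north", "south", "east", "west"]
}

chosen_regions, sample = stratified_sample(strata, n_strata=2, n_per_stratum=5, seed=0)
print(chosen_regions)
print(sample)
```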

Stratified Random Sampling

Yet Another Sampling Design: Cluster Sampling.

If a population can be broken into groups (clusters) that are each of reasonable size to be studied exhaustively, then a cluster sample may be appropriate:

randomly choose a sample of groups or clusters
include EVERY individual from each chosen cluster
combine the chosen clusters into a single sample.
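A minimal Python sketch of these steps is shown below. The classroom clusters and the function name `cluster_sample` are made up for illustration. Unlike the stratified design, nobody inside a chosen cluster is left out.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and include every individual in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # randomly choose clusters
    # include EVERY individual from each chosen cluster, combined into one sample
    sample = [person for name in chosen for person in clusters[name]]
    return chosen, sample

# Hypothetical clusters: classrooms of different sizes
clusters = {
    "room_a": ["a1", "a2", "a3"],
    "room_b": ["b1", "b2"],
    "room_c": ["c1", "c2", "c3", "c4"],
    "room_d": ["d1", "d2", "d3"],
}

chosen_rooms, sample = cluster_sample(clusters, n_clusters=2, seed=3)
print(chosen_rooms)
print(sample)
```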

Cluster Sampling

Really?! Another Sampling Design?! Systematic Sampling.

Assign a label to each individual.

Compute $k = \frac{\mbox{number of individuals in the population}}{\mbox{number of individuals needed in the sample}}$, rounded to a whole number.

Choose a random individual as your first data point.

Then include every $k$th individual in the sample, down the line from your first data point.

Systematic Random Sampling

The population below consists of $400$ individuals. Let's collect $30$ data points. Then $$k=\frac{\mbox{number of individuals in the population}}{\mbox{number of individuals needed in the sample}}=\frac{400}{30} \approx 13,$$ so we include every $13$th individual.
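The example with $400$ individuals and $30$ data points can be sketched in Python. Two choices here are assumptions rather than part of the notes: $k$ is rounded down with integer division, and the random starting individual is drawn from the first $k$ labels so that all $30$ picks stay inside the population. The function name `systematic_sample` is also illustrative.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random start among the first k labels, then take every k-th label."""
    rng = random.Random(seed)
    k = population_size // sample_size   # 400 // 30 = 13
    start = rng.randint(1, k)            # random first data point in 1..k (inclusive)
    labels = [start + i * k for i in range(sample_size)]
    return k, labels

k, labels = systematic_sample(400, 30, seed=4)
print(k)
print(labels)
```

With $k = 13$, the largest possible label is $13 + 29 \cdot 13 = 390$, so the sample never runs past individual $400$.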

Problems In Practice: Human subjects are the hardest to get good information about.

Example: Suppose we want to know the percentage of people who will vote for Candidate A or Candidate B.

How are we going to ask people the question: "Who do you plan to vote for this November?"

What problems could we run into when trying to get good information?

Undercoverage occurs when some groups in the population are left out of the process of choosing the sample.

Nonresponse occurs when an individual chosen for the sample can’t be contacted or refuses to participate. Any honest pollster will tell you their rate of nonresponse. It is a red flag if they don't.
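As a quick worked example, the rate of nonresponse for the SWOCC survey described earlier ($100$ students selected, $19$ responses) can be computed as below. The variable names are illustrative.

```python
# Numbers from the SWOCC graduating-student survey example
selected = 100    # students randomly selected to take the survey
responded = 19    # students who actually responded

nonresponse_rate = (selected - responded) / selected   # 81 / 100 = 0.81
print(f"Nonresponse rate: {nonresponse_rate:.0%}")
```

A nonresponse rate this high is a reason to be cautious about conclusions drawn from the $19$ returned surveys.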

The behavior of the respondent or of the interviewer can cause response bias in sample results.

The wording of questions is the most important influence on the answers given to a sample survey. Confusing or leading questions can introduce strong bias, and changes in wording can greatly change a survey’s outcome.

Example: Wording of Questions

Two differently worded questions, which essentially ask the same thing, can give you vastly different results. For example, in a survey about illegal immigration in the U.S., we have the following results.

When asked,

"Should illegal immigrants be prosecuted and deported for being in the U.S. illegally, or shouldn’t they?"

$69\%$ favored deportation.

On the other hand, the very same sample of people was asked whether illegal immigrants who have worked in the United States for two years

"should be given a chance to keep their jobs and eventually apply for legal status,"

to which $62\%$ said that they should be given a chance.

Data & Sampling Worksheet