Unit 1: Univariate Data
Definitions
Data
facts or measurements that are collected, analyzed, and interpreted; the raw material of statistical analysis. Can be quantitative or qualitative
- Cross-Sectional - collected at the same time
- Longitudinal - time-series collected over several time periods
Data Set - all the data, across observations and variables
Variable
may assume different values
- Discrete - can be counted —— takes on integer values
- Continuous - if interval is between a and b, variable can take any value in between
Constant - remains the same
Elements
an element is a unit of data, represented as a set of attributes or measurements; elements are the entities on which data are collected
Observation - the set of values of the variables for a single element (one set of measurements for a single data entity)
Population
all entities of interest in an investigative study
Population parameters - characteristics or quantities of the population
Probability - when we draw a sample from a population with known parameters
Measurement of Variables
- Nominal Level - sometimes called “categorical”
- Ordinal Level - order is meaningful
- Interval Level - interval between the values is fixed
- Ratio Level - zero has meaning
Sample - subset of the population
Sample Statistics - characteristics or quantities of the sample
Statistics
- Descriptive - describe data that have been collected
- Inferential - use data that have been collected to generalize or make inference based on it
Univariate distributions - explore how the collection of values for each variable is distributed across an array of possible values
use tables and graphs to capture the information
Distributions
- axes
- step size
- shape
- range / spread
- center
- unusual / gaps and outliers
Normal Distribution
- symmetry
- median = mean
- bell shape
Center
- Normal Distribution
- $\mu$ = population mean $= \frac{\sum x_i}{N}$
- $\bar{x}$ = sample mean $= \frac{\sum x_i}{n}$
- Skewed Data
- Median / exact middle / 50th percentile / middle quartile
Unusual
- Normal Distribution
- Gap
- Outliers - values more than 3 standard deviations from the mean; empirical rule: 68% / 95% / 99.7% of the data falls within 1 / 2 / 3 standard deviations
- Skewed Data
- Outliers - values below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$
Spread
- Normal Distribution
- standard deviation $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
- Skewed Data
- IQR - difference between the third and first quartile $= Q_3 - Q_1$
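A quick Python sketch of the quartile and fence rules above (the data values are made up, with one planted outlier):

```python
# IQR and 1.5*IQR outlier fences for skewed data, stdlib only.
import statistics

data = [2, 3, 4, 5, 5, 6, 7, 8, 9, 30]  # hypothetical sample, 30 is an outlier
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(iqr, outliers)  # → 4.5 [30]
```

Note that quartile conventions differ slightly between textbooks and software; `statistics.quantiles` uses the "exclusive" method by default.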
Shape
- normal
- uniform
- skew
- longer tail on the left = negative skew / left skew
- longer tail on the right = positive skew / right skew
Percentile
percentile - way of measuring relative position within a data set
The raw score below which a given percentage of scores fall
percentile rank - relative position within a given dataset
Modal
unimodal - has one mode
symmetric unimodal - mean = median = mode
bimodal - has two modes / peaks
Non-skewed normal-like
center - mean
spread - standard deviation - on average how far the data is from the mean
median, IQR - Resistant Statistics
mean, standard deviation - Non-Resistant Statistics
spread ≈ variability ≈ dispersion ≈ heterogeneity ≈ unpredictability
range ≈ IQR ≈ standard deviation ≈ variance
Skewness
- When a distribution is positively skewed, the skewness is positive.
- When a distribution is negatively skewed, the skewness is negative.
- When a distribution is symmetric, the skewness is 0.
- When you have 2 or more distributions and you want to compare skewness, you use the skewness ratio.
- By convention, when the skewness ratio exceeds 2 in absolute value we consider the distribution highly skewed.
Transformations
Linear Transformation $x_{new} = a + b \cdot x$
- Center - shifts by $a$ and scales by $b$
- Spread - scales by $|b|$
- Shape - unchanged
- Z-score $z = \frac{x - \bar{x}}{s}$ - a linear transformation that gives mean 0 and standard deviation 1
Nonlinear Transformation (e.g. log, square root)
- retains the order of the values
- changes the relative distances between the values in the distribution, and, in doing so, impacts spread and shape
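A small Python check of these linear-transformation facts, using a made-up Celsius-to-Fahrenheit style transformation:

```python
# A linear transformation a + b*x shifts the center, rescales the spread
# by |b|, and leaves every z-score (hence the shape) unchanged.
import statistics

x = [4, 8, 6, 5, 7]      # hypothetical data
a, b = 32, 1.8           # hypothetical linear transformation
y = [a + b * xi for xi in x]

mean_x, sd_x = statistics.mean(x), statistics.stdev(x)
mean_y, sd_y = statistics.mean(y), statistics.stdev(y)
assert abs(mean_y - (a + b * mean_x)) < 1e-9   # center shifts and scales
assert abs(sd_y - abs(b) * sd_x) < 1e-9        # spread scales by |b|

z_x = [(xi - mean_x) / sd_x for xi in x]
z_y = [(yi - mean_y) / sd_y for yi in y]
# z-scores match element by element: standardizing undoes the transformation
```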
Unit 2: Bivariate Data
Bivariate Data
- two variables
- usually comparing at least interval type data
- does one variable relate to another, if so, how?
- shape: linear / non-linear
- direction: positive, negative
- strength: how closely the points follow the overall pattern (measured by the correlation coefficient $r$)
- correlation does not imply causation
- obtain data - 2 variables (above nominal)
- scatterplot
- scale your data, fit a regression line
- look at correlation
- describe shape, direction, and strength
Correlation Coefficient
- $|r|$ close to 1 - strong
- $|r|$ in between - moderate
- $|r|$ close to 0 - weak
- note: $-1 \le r \le 1$, and the exact cutoffs for strong/moderate/weak vary by convention
Linear Regression
the line of best fit is obtained by minimizing the sum of squared deviations from all points on the scatter plot
a deviation is also called a residual
Interpretation
$r^2$ (the coefficient of determination) means that the independent variable accounts for $r^2 \times 100\%$ of the variation in the dependent variable
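A minimal Python sketch of the least-squares fit described above, computed by hand on made-up data:

```python
# Least-squares slope, intercept, and r^2 from the definitions (stdlib only).
import statistics

x = [1, 2, 3, 4, 5]          # hypothetical data
y = [2, 4, 5, 4, 5]
mx, my = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b = sxy / sxx                # slope: minimizes the sum of squared residuals
a = my - b * mx              # intercept: the line passes through (x-bar, y-bar)
r = sxy / (sxx * syy) ** 0.5 # correlation coefficient
print(round(b, 2), round(a, 2), round(r ** 2, 2))  # → 0.6 2.2 0.6
```

So here the fitted line is $\hat{y} = 2.2 + 0.6x$ and x accounts for 60% of the variation in y.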
Unit 3: Probability
Definitions
simple experiment - any action (flipping a coin, choosing a card) that leads to one of several possible outcomes
events - observable outcomes
If we only have two outcomes that are mutually exclusive, they are called complements: $P(A) + P(A^c) = 1$.
Suppose $A$ and $B$ are independent of each other: $P(A \cap B) = P(A) \cdot P(B)$
Conditional Probability
- two events not independent of each other
- determining the probability of an event happening given that another has happened: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
The Law of Total Probability
$P(B) = \sum_i P(B \mid A_i) P(A_i)$, where $A_1, A_2, \ldots$ partition the sample space
Bayes’ Theorem
$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$
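A numeric Bayes' theorem sketch in Python; all three input probabilities are made up for illustration (a rare condition and an imperfect test):

```python
# Bayes' theorem, with the law of total probability supplying P(B).
p_a = 0.01              # P(A): prevalence (hypothetical)
p_b_given_a = 0.95      # P(B|A): test sensitivity (hypothetical)
p_b_given_not_a = 0.05  # P(B|A^c): false-positive rate (hypothetical)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
# Bayes: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # → 0.161
```

Even with a sensitive test, a positive result here only raises the probability of the condition to about 16%, because the condition is rare.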
Binomial Distribution
B - binary
I - independent
N - fixed number of trials
S - probability success for each trial remains the same
- k - number of successes
- n - number of trials
- p - probability of success
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
10% Rule
The 10 percent rule is used to justify approximate independence of trials when sampling without replacement. If the sample size is less than 10% of the population size, the trials can be treated as independent even though they are not strictly so.
a rule of thumb (Large Counts): $n$ needs to be so large that the expected numbers of successes ($np$) and failures ($n(1-p)$) are both at least 10
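The binomial formula above can be evaluated directly with the standard library; the trial counts here are made up:

```python
# Binomial probability P(X = k) = C(n, k) * p^k * (1-p)^(n-k).
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical: 10 fair-coin flips, probability of exactly 5 heads
print(round(binom_pmf(5, 10, 0.5), 4))  # → 0.2461
# Large Counts check before a normal approximation: n*p and n*(1-p) >= 10
```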
Fundamental Counting Principle
states that if one event can occur in $m$ ways and a second event can occur in $n$ ways for each occurrence of the first event, then the first and second events together can occur in $m \times n$ ways
Permutation
ordered sequences of objects in which each possible object occurs at most once, but not all objects need to be used. The total number of permutations when selecting $k$ objects from a total of $n$ choices is $_nP_k = \frac{n!}{(n-k)!}$
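Python's `math` module computes these counts directly; the "8 runners, 3 medals" scenario is just an illustration:

```python
# Counting ordered vs unordered selections.
from math import perm, comb

# Permutations: P(n, k) = n! / (n - k)!, e.g. 3 distinct medals among 8 runners
print(perm(8, 3))   # → 336
# For contrast, unordered selections (combinations): C(8, 3)
print(comb(8, 3))   # → 56
```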
Normal Distribution
Simple Random Sample
sample is chosen from a given population in such a way as to ensure that every person (or thing) in the population has an equal and independent chance of being picked for the sample
Central Limit Theorem
Given a population of values with no specified distribution and a sample size $n$ that is sufficiently large, the sampling distribution of means for samples drawn from this population with replacement can be described as follows:
- Its shape is approximately normal
- Its mean $\mu_{\bar{x}} = \mu$
- Its standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, also called the standard error
Large Sample Condition:
- original distribution is approximately normal
- sample size greater than 30
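A simulation sketch of the CLT in Python: the population below is deliberately skewed (exponential), yet the sample means come out approximately normal with spread close to $\sigma/\sqrt{n}$. Population size, sample size, and seed are arbitrary choices.

```python
# Sampling distribution of the mean from a skewed population.
import random
import statistics

random.seed(0)
population = [random.expovariate(1.0) for _ in range(100_000)]  # right-skewed
n = 40
means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

mu = statistics.mean(population)
se = statistics.stdev(population) / n ** 0.5   # theoretical standard error
# The simulated mean of the sample means is close to mu, and their
# spread is close to the standard error sigma/sqrt(n).
print(abs(statistics.mean(means) - mu) < 0.05,
      abs(statistics.stdev(means) - se) < 0.05)  # → True True
```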
Population Proportion
population proportion - $p$
sample proportion - $\hat{p} = \frac{x}{n}$
Sampling Distribution of a Sample Proportion
Choose a simple random sample of size $n$ from a population of size $N$, with a proportion $p$ of successes. Then:
- the mean of the sampling distribution is equal to $\mu_{\hat{p}} = p$
- the standard deviation of the sampling distribution is $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$ as long as $n \le 0.10N$
- as $n$ increases the sampling distribution approaches normal as long as $np \ge 10$ and $n(1-p) \ge 10$
Sampling Distribution of $\bar{x}$
Suppose $\bar{x}$ is the mean of a simple random sample of size $n$ drawn from a large population with mean $\mu$ and a standard deviation of $\sigma$. As long as $n \le 0.10N$, the sampling distribution has mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
Point estimate is the statistic calculated from your sample.
- don’t expect the point estimate to equal the population parameter exactly
- it should be part of the sampling distribution
- about 95% of the data is within 2 standard deviations of the mean
- Whenever $\bar{x}$ is within 10 points of $\mu$, $\mu$ is within 10 points of $\bar{x}$; this happens in about 95% of all samples
Definition of Confidence Interval, Margin of Error, Confidence Level
A confidence interval gives an interval of plausible values for a parameter
| | proportion | mean |
|---|---|---|
| population parameter | $p$ | $\mu$ |
| sample statistic | $\hat{p}$ | $\bar{x}$ |
| standard deviation | $\sqrt{\frac{p(1-p)}{n}}$ | $\frac{\sigma}{\sqrt{n}}$ |
Confidence Level
The confidence level gives the overall success rate of the method for calculating the confidence interval. That is, in $C\%$ of all possible samples of a particular size $n$, this method would yield an interval that captures the true parameter value.
The interval yields plausible values.
Generally we choose a confidence level of 90%, 95%, or 99%; 95% is most common.
confidence interval for proportions: $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
confidence interval for means: $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$
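A Python sketch of the proportion interval above; the counts are made up, and $z^* = 1.96$ is the usual critical value for 95% confidence:

```python
# 95% confidence interval for a proportion: p-hat ± z* sqrt(p-hat(1-p-hat)/n).
successes, n = 56, 100   # hypothetical sample counts
p_hat = successes / n
z_star = 1.96            # z* for 95% confidence
me = z_star * (p_hat * (1 - p_hat) / n) ** 0.5  # margin of error
ci = (round(p_hat - me, 3), round(p_hat + me, 3))
print(ci)  # → (0.463, 0.657)
```

Interpretation in the notes' terms: 0.463 to 0.657 is the range of plausible values for the true proportion $p$.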
T Distribution
- As degrees of freedom increase, it approaches the standard normal curve
- symmetric and centered at 0, like the standard normal
- standard deviation is larger than the standard normal's and changes with the degrees of freedom
- bell curve
- unimodal
Significance Test
Null and Alternative Hypothesis
- $H_0$ - the null hypothesis, a claim
- $H_a$ - the alternative hypothesis, which means $H_0$ is false
Parameter of interest is the true proportion of successes $p$
A statistical test weighs evidence against a claim and in favor of an alternate claim
- Step 1: state hypotheses; hypotheses should express suspicions we have before we see the data
- one-sided: $H_a: p > p_0$ or $H_a: p < p_0$
- two-sided: $H_a: p \ne p_0$
- Step 2: Does the data give convincing evidence against the null?
- Reject the null
- Fail to reject the null
P-Value
The probability, computed assuming $H_0$ is true, that the statistic (such as $\hat{p}$ or $\bar{x}$) would take a value as extreme as or more extreme than the one actually observed, in the direction of $H_a$.
Small p-values are evidence against $H_0$, because the observed result is unlikely to occur when $H_0$ is true.
If the p-value is smaller than $\alpha$, we say the results are statistically significant at level $\alpha$ (the significance level); we reject $H_0$ and conclude there is convincing evidence in favor of $H_a$.
Common levels of $\alpha$: 0.10, 0.05, 0.01.
| | $H_0$ true | $H_0$ false |
|---|---|---|
| reject $H_0$ | Type I error | 😀 |
| fail to reject $H_0$ | 😀 | Type II error |
Test about a population proportion (1-prop z-test)
- state what parameter of interest is
- check conditions
- random - SRS / well-designed experiment
- independence - $n \le 0.10N$ if sampling without replacement
- normal - $np_0 \ge 10$ and $n(1-p_0) \ge 10$
- perform the test
- test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
- p-value
- conclude
- compare p-value to $\alpha$
- reject or fail to reject in context
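The steps above can be sketched in Python with the standard library (the normal CDF comes from `math.erf`); the observed counts and $p_0$ are hypothetical:

```python
# One-proportion z-test: statistic and two-sided p-value.
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

p0 = 0.5            # H0: p = 0.5 (hypothetical null)
x, n = 60, 100      # hypothetical observed successes and trials
p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - norm_cdf(abs(z)))   # two-sided alternative
print(round(z, 2), round(p_value, 4))  # → 2.0 0.0455
```

With $\alpha = 0.05$ this p-value (0.0455) would lead us to reject $H_0$, barely.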
1 sample t-test
- State parameter of interest
- Check conditions
- random - SRS / well-designed experiment
- independent - sample is 10% or less of $N$ if sampling without replacement
- normal - $n \ge 30$ / population distribution is normal
- Hypothesis: $H_0: \mu = \mu_0$
- perform the test
- test statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ with $df = n - 1$
- Conclusion
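A Python sketch of the t statistic; the data are made up, and the critical value $t^* \approx 2.262$ ($df = 9$, two-sided $\alpha = 0.05$) is taken from a t table rather than computed, since the standard library has no t distribution:

```python
# One-sample t statistic: t = (x-bar - mu0) / (s / sqrt(n)).
import statistics
from math import sqrt

data = [5.1, 4.9, 6.0, 5.5, 5.8, 5.2, 4.7, 5.9, 5.4, 5.3]  # hypothetical
mu0 = 5.0                      # H0: mu = 5.0
n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)     # sample standard deviation
t = (xbar - mu0) / (s / sqrt(n))
reject = abs(t) > 2.262        # compare to t* with df = n - 1 = 9
print(round(t, 2), reject)  # → 2.8 True
```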
Power
The power of a test against a specific alternative is the probability that the test will reject $H_0$ at a chosen significance level $\alpha$ when the specified alternative value of the parameter is true.
- find the critical value of the statistic under $H_0$ at level $\alpha$
- find the probability of a value at least that extreme under the actual (alternative) distribution
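Power can also be estimated by simulation, as in this Python sketch; the null, true proportion, sample size, and cutoff $z > 1.645$ (one-sided $\alpha = 0.05$) are all illustrative choices:

```python
# Power of a one-proportion test by simulation: how often do we reject
# H0: p = 0.5 when the true p is 0.6?
import random
from math import sqrt

random.seed(1)
p0, p_true, n = 0.5, 0.6, 100
reps = 5_000
rejections = 0
for _ in range(reps):
    x = sum(random.random() < p_true for _ in range(n))  # simulate one sample
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    if z > 1.645:          # one-sided rejection region at alpha = 0.05
        rejections += 1
power = rejections / reps
# Theoretical power here is roughly 0.62; the simulation should be close.
print(0.55 < power < 0.70)  # → True
```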
Effect
Effect size is a quantitative measure of the strength or magnitude of a phenomenon. It tells how large the difference is between groups or how strong the relationship is between variables independent of sample size.
Cohen’s d - for means: $d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$
Cohen’s h - for proportions: $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$
conventional benchmark values: 0.2 small, 0.5 medium, 0.8 large
Comparing 2 means
| Population or treatment | Parameter | Statistic | Sample size |
|---|---|---|---|
| 1 | $\mu_1$ | $\bar{x}_1$ | $n_1$ |
| 2 | $\mu_2$ | $\bar{x}_2$ | $n_2$ |
Sampling distribution of $\bar{x}_1 - \bar{x}_2$
choose an SRS of size $n_1$ from a population with mean $\mu_1$ and standard deviation $\sigma_1$, and an SRS of size $n_2$ from a population with mean $\mu_2$ and standard deviation $\sigma_2$
- shape - approximately normal if sample sizes are large enough
- center - $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
- spread - $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
Confidence Interval
$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
2 samples t-test
test statistic: $t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ under $H_0: \mu_1 = \mu_2$
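A Python sketch of the unpooled two-sample t statistic on made-up group data:

```python
# Two-sample t statistic (unpooled): (x1-bar - x2-bar) / sqrt(s1^2/n1 + s2^2/n2).
import statistics
from math import sqrt

g1 = [12, 14, 11, 13, 15, 12]   # hypothetical treatment group
g2 = [10, 9, 11, 10, 12, 8]     # hypothetical control group
m1, m2 = statistics.mean(g1), statistics.mean(g2)
v1, v2 = statistics.variance(g1), statistics.variance(g2)  # sample variances
t = (m1 - m2) / sqrt(v1 / len(g1) + v2 / len(g2))
print(round(t, 2))  # → 3.4
```

The resulting t is compared to a t distribution (degrees of freedom from software or the conservative smaller of $n_1 - 1$, $n_2 - 1$).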
Comparing 2 proportions
Sampling Distribution for $\hat{p}_1 - \hat{p}_2$
- center: $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
- shape: approximately normal if $n_1 p_1$, $n_1(1-p_1)$, $n_2 p_2$, and $n_2(1-p_2)$ are all at least 10
- spread: $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
- independent: no more than 10% of the population for either sample
Confidence Interval
$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
2 prop z-test
test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_c(1-\hat{p}_c)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p}_c$ is the pooled proportion
- compare the p-value to $\alpha$
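A Python sketch of the pooled two-proportion z statistic; the counts are made up:

```python
# Two-proportion z statistic with the pooled estimate under H0: p1 = p2.
from math import sqrt

x1, n1 = 45, 100   # hypothetical successes / trials, group 1
x2, n2 = 30, 100   # hypothetical successes / trials, group 2
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled proportion, used only under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(round(z, 2))  # → 2.19
```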
Nonparametric Inference Techniques
Everything thus far in inference has relied on explicit assumptions about populations, distributions, and parameters
- Today focus on how to work with categorical data
- Chi-square methods are relatively free from assumptions
Chi-Square Distribution
Given a normal distribution with a mean of $\mu$ and a variance of $\sigma^2$, we select random samples from the population.
The distribution of $\chi^2$ values can then be obtained.
pchisq(chisq_value, df, lower.tail = FALSE)
Chi-Square Goodness of Fit Test
Statement
Collect a random sample of size $n$; how unlikely is it for the observed values to differ as much as they do from the expected values if $H_0$ is true?
Conditions
- Cells must be mutually exclusive and exhaustive
- Observations need to be independent
- Each expected frequency must be at least 5
type = c(<categories...>)
freq = c(<actual_values...>)
prob = c(<expected_prop...>)
chisq.test(freq, p = prob)
The Chi-Square Test of Independence
Enable us to determine whether two categorical variables are related
Statement
Conditions
Same as Chi-Square Goodness of Fit Test
- Individual observations are independent of each other
The Chi-Square Test for Homogeneity
Determine whether different populations or groups have the same distribution of a categorical variable.
Statement
Conditions
Same as Chi-Square Goodness of Fit Test
| | Goodness of Fit | Homogeneity | Independence |
|---|---|---|---|
| number of samples | 1 | at least 2 | 1 |
| number of variables | 1 | 1 | at least 2 |
| $H_0$ | The percentage distribution is the same as given | The distribution of the CV is the same across all groups | The two variables are mutually independent |
| $H_a$ | The values don't match | The distribution of the CV is not the same across all groups | The two variables are not mutually independent |
| Expected Value | $n p_i$ | (row total × column total) / grand total | (row total × column total) / grand total |
| Degrees of Freedom | $k - 1$ | $(r-1)(c-1)$ | $(r-1)(c-1)$ |
Inferential Statistics for Linear Regression
If data are a random sample from a larger population, we need statistical inference to answer the following questions:
- Is there a relationship between x and y in the population, or could the pattern we see using the sample data happen by chance?
- In the population, how much will the predicted value of y change for every increase of 1 unit of x?
- What is the margin of error for the estimate?
Sampling Distribution of a Slope
Choose a simple random sample of $n$ observations from a population of size $N$
- The distribution of y for each value of x follows a normal distribution
Conditions:
- Linear - look at the scatter plot and plot of the residual
- Independent
- Normal - for any value of $x$, $y$ varies according to a normal distribution
- look at the residuals - box plot, hist, stem&leaf plot
- Equal Standard Deviation - the standard deviation of y is the same for all values of x
- look at a scatterplot of residuals
- Random - data from a well-designed experiment or random sample
Confidence Interval
$b \pm t^* \cdot SE_b$ with $df = n - 2$
Significance Test
Hypothesis
$H_0: \beta = 0$, $H_a: \beta \ne 0$
Test Statistic
$t = \frac{b - \beta_0}{SE_b}$ with $df = n - 2$
Conclusion
Rejecting or failing to reject $H_0: \beta = 0$ determines whether we have convincing evidence that the value of y is impacted by the value of x
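A Python sketch of the slope test statistic from the definitions, with $SE_b = s / \sqrt{\sum (x_i - \bar{x})^2}$, where $s$ is the residual standard error; the data are made up:

```python
# t statistic for H0: beta = 0 in simple linear regression.
import statistics
from math import sqrt

x = [1, 2, 3, 4, 5]          # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx  # slope
a = my - b * mx                                               # intercept
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # residual standard error
se_b = s / sqrt(sxx)
t = b / se_b   # compare to a t distribution with df = n - 2
print(round(t, 2))  # → 2.12
```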