Statistics with R 2024-2025

Unit 1: Univariate Data

Definitions

Data

facts or measurements that are collected, analyzed, and interpreted; the raw material of statistical analysis. Can be quantitative or qualitative.

  • Cross-Sectional - collected at the same time
  • Longitudinal - time-series collected over several time periods

Data Set - all the data, across observations and variables

Variable

may assume different values

  • Discrete - can be counted; takes on integer values
  • Continuous - if interval is between a and b, variable can take any value in between

Constant - remains the same

Elements

an element is a unit of data, represented as a set of attributes or measurements; the entities on which data are collected

Observation - the set of values of the variables for one element; a set of measurements for a single data entity

Population

all entities of interest in an investigative study

Population parameters - characteristics or quantities of the population

Probability - when we draw a sample from a population with known parameters

Measurement of Variables

  • Nominal Level - sometimes called “categorical”
  • Ordinal Level - order is meaningful
  • Interval Level - interval between the values is fixed
  • Ratio Level - zero has meaning

Sample - subset of the population

Sample Statistics - characteristics or quantities of the sample

Statistics

  • Descriptive - describe data that have been collected
  • Inferential - use data that have been collected to generalize or make inferences based on them

Univariate distributions - explore how the collection of values for each variable is distributed across an array of possible values

use tables and graphs to capture the information

Distributions

  • axes
  • step size
  • shape
  • range / spread
  • center
  • unusual / gaps and outliers

Normal Distribution

  • symmetry
  • median = mean
  • bell shape

Center

  • Normal Distribution
    • $\mu$ = population mean = $\sum x_i / N$
    • $\bar{x}$ = sample mean = $\sum x_i / n$
  • Skewed Data
    • Median / exact middle / 50th percentile / middle quartile

Unusual

  • Normal Distribution
    • Gap
    • Outliers - more than 3 standard deviations from the mean (68% / 95% / 99.7% of data fall within 1 / 2 / 3 SDs)
  • Skewed Data
    • Outliers - above $Q_{75} + 1.5 \times IQR$ or below $Q_{25} - 1.5 \times IQR$

Spread

  • Normal Distribution

    $$\sigma=\sqrt{\frac{\sum(x_i-\mu)^2}{N}} \qquad s=\sqrt{\frac{\sum(x_i-\bar x)^2}{n-1}} \qquad range = \max - \min$$
  • Skewed Data

    • IQR - difference between the third and first quartiles = $Q_{75}-Q_{25}$

Shape

  • normal
  • uniform
  • skew
    • longer tail on the left = negative skew / left skew
    • longer tail on the right = positive skew / right skew

Percentile

percentile - way of measuring relative position within a data set

The raw score below which a certain percentage of scores fall.

percentile rank - relative position within a given dataset

$$Rank = \dfrac{\text{percentile}}{100} \times (\text{number of items} + 1)$$

unimodal - has one mode

symmetric unimodal - mean = median = mode

bimodal - has two modes / peaks

Non-skewed normal-like

center - mean

spread - standard deviation - on average, how far the data are from the mean

median, IQR - Resistant Statistics

mean, standard deviation - Non-Resistant Statistics

spread ≈ variability ≈ dispersion ≈ heterogeneity ≈ unpredictability

range ≈ IQR ≈ standard deviation ≈ variance

Skewness

$$Skewness=\frac{\sum(x_i-\bar x)^3/N}{(\text{standard deviation})^3}$$
$$\text{Standard Error Skewness}=\sqrt{\frac{6N(N-1)}{(N-2)(N+1)(N+3)}}$$
$$\text{Skew Ratio}=\frac{\text{Skewness}}{\text{Standard Error Skewness}}$$
  • When a distribution is positively skewed, the skewness is positive.
  • When a distribution is negatively skewed, the skewness is negative.
  • When a distribution is symmetric, the skewness is 0.
  • When you have 2 or more distributions, and you want to compare skewness, you use skewness ratio.
  • By convention, when the skewness ratio exceeds 2 in absolute value, we consider the distribution highly skewed (see the R sketch below).
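
A minimal R sketch of these formulas; the function name and the test data are illustrative, not from the notes:

```r
# Skewness, its standard error, and the skew ratio, per the formulas above.
skew_ratio <- function(x) {
  N <- length(x)
  s <- sd(x)  # sample standard deviation (n - 1 denominator)
  skewness <- sum((x - mean(x))^3) / N / s^3
  se_skew  <- sqrt(6 * N * (N - 1) / ((N - 2) * (N + 1) * (N + 3)))
  c(skewness = skewness, se = se_skew, ratio = skewness / se_skew)
}
skew_ratio(rexp(100))  # right-skewed sample, so expect a clearly positive ratio
```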

Transformations

Linear Transformation

$$X_{new}=k X_{old}+C \quad \text{where } k \text{ and } C \text{ are constants and } k\neq 0$$

Center

$$\bar X_{new} = k \bar X_{old} + C \qquad X_{new(median)} = k X_{old(median)} + C$$

Spread

$$Range_{new} = |k|\,Range_{old} \qquad IQR_{new} = |k|\,IQR_{old} \qquad Sd_{new} = |k|\,Sd_{old} \qquad Variance_{new} = k^2\,Variance_{old}$$

Shape

$$Skew_{new} = \frac{k}{|k|} Skew_{old} \qquad Skew.ratio_{new} = \frac{k}{|k|} Skew.ratio_{old}$$

Z-score

$$Z\text{-}Score_{i}=\frac{x_i-\bar x}{sd}$$
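
The z-score is itself a linear transformation with $k = 1/sd$ and $C = -\bar x / sd$; a quick R check (the example vector is made up):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
z <- (x - mean(x)) / sd(x)           # z-scores by hand
all.equal(z, as.numeric(scale(x)))   # base R's scale() standardizes the same way
```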

Nonlinear Transformation

  • retains the order
  • changes the relative distances between the values in the distribution, and, in doing so, impacts spread and shape

Unit 2: Bivariate Data

Bivariate Data

  • two variables
  • usually comparing at least interval type data
  • does one variable relate to another, if so, how?
    • shape: linear / non-linear
    • direction: positive, negative
    • strength: $r = \frac{\sum z_x \times z_y}{N-1}$
  • correlation $\neq$ causation
  1. obtain data - 2 variables (above nominal)
  2. scatterplot
  3. scale your data, fit a regression line
  4. look at correlation
  5. describe shape, direction, and strength (see the R sketch below)
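
A minimal version of this workflow in base R, using the built-in `mtcars` data as a stand-in:

```r
plot(mpg ~ wt, data = mtcars)        # step 2: scatterplot
fit <- lm(mpg ~ wt, data = mtcars)   # step 3: fit a regression line
abline(fit)
cor(mtcars$wt, mtcars$mpg)           # step 4: correlation (direction and strength)
```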

Correlation Coefficient

$$-1\leq r\leq 1$$

  • $|r|\geq0.7$ - strong
  • $0.4 \leq |r| < 0.7$ - moderate
  • $|r|<0.4$ - weak

Linear Regression

the line of best fit is obtained by minimizing the sum of squared deviations from all points on the scatter plot

$$\text{regression line}\equiv \text{prediction line} \qquad D=\sqrt{\frac{\sum d_i^2}{N}} \quad \text{where } d_i=y_i - \hat y_i$$

deviation also called residual

$$\hat y = a + bx \quad \text{where } b=r\frac{s_y}{s_x} \ \text{and} \ a=\bar y - b \bar x$$
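
These coefficient formulas can be verified against `lm()` in R (again using `mtcars` as a stand-in):

```r
x <- mtcars$wt; y <- mtcars$mpg
b <- cor(x, y) * sd(y) / sd(x)   # b = r * s_y / s_x
a <- mean(y) - b * mean(x)       # a = ybar - b * xbar
c(a = a, b = b)
coef(lm(y ~ x))                  # same intercept and slope
```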

$r^2$ Interpretation

$r^2 = p\%$ means that the independent variable accounts for $p\%$ of the variation in the dependent variable

Unit 3: Probability

Definitions

simple experiment - any action (flipping a coin, choosing a card) that leads to one of several possible outcomes

events - observable outcomes

$$P(E)=\frac{\text{number of outcomes that satisfy condition}}{\text{total number of outcomes}}$$

If we only have two outcomes that are mutually exclusive, they are called complements.

$$P(\textstyle\bigcup E_i)=\sum P(E_i) \quad \text{if the events share no outcomes (mutually exclusive)}$$
$$P(E_1 \cup E_2)=P(E_1)+P(E_2)-P(E_1 \cap E_2) \quad \text{for two events in general}$$

Suppose $E_i$ are independent of each other:

$$P(\textstyle\bigcap E_i)=\prod P(E_i)$$

Conditional Probability

  • two events not independent of each other
  • determining probability of an event happening given that another has happened
$$P(E_1|E_2)=\frac{P(E_1 \cap E_2)}{P(E_2)}$$

The Law of Total Probability

$$P(E_1)=P(E_1|E_2) \times P(E_2)+P(E_1|E_2^c) \times P(E_2^c)$$

Bayes’ Theorem

$$P(E_1|E_2)=\frac{P(E_2|E_1) \times P(E_1)}{P(E_2)}$$
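
A numeric sketch of the last two formulas in R; the 1% base rate and the 95% / 5% test rates are made-up values for illustration:

```r
p_E1           <- 0.01  # P(E1): prior, e.g. prevalence of a condition
p_E2_given_E1  <- 0.95  # P(E2|E1): positive test given the condition
p_E2_given_not <- 0.05  # P(E2|E1^c): false positive rate
p_E2 <- p_E2_given_E1 * p_E1 + p_E2_given_not * (1 - p_E1)  # law of total probability
p_E2_given_E1 * p_E1 / p_E2   # Bayes' theorem: P(E1|E2), about 0.16 here
```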

Binomial Distribution

B - binary

I - independent

N - fixed number of trials

S - probability success for each trial remains the same

$$_nC_k = {n \choose k} = \frac{n!}{k!(n-k)!} \qquad P(x=k)={n \choose k} \times p^k \times (1-p)^{n-k}$$
  • k - number of successes
  • n - number of trials
  • p - probability of success
$$\mu=np \quad \text{and} \quad \sigma = \sqrt{np(1-p)}$$
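
In R, `dbinom()` matches the formula above; the values of n, p, and k here are arbitrary:

```r
n <- 10; p <- 0.3; k <- 4
choose(n, k) * p^k * (1 - p)^(n - k)   # P(X = k) by hand
dbinom(k, size = n, prob = p)          # same value from base R
c(mean = n * p, sd = sqrt(n * p * (1 - p)))
```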

10% Rule

The 10 percent rule is used to approximate independence of trials when sampling is done without replacement. If the sample size is less than 10% of the population size, the trials can be treated as independent even though they are not.

$$Bin(n,p)\approx \mathcal{N}(\mu=np,\ \sigma=\sqrt{np(1-p)})$$

a rule of thumb: n needs to be so large that the expected number of successes and failures are both at least 10

$$np\geq10 \quad \text{and} \quad n(1-p)\geq10$$
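
A quick R check of the approximation; the n and p here are arbitrary but satisfy both conditions:

```r
n <- 100; p <- 0.4   # np = 40 and n(1-p) = 60, both >= 10
pbinom(45, n, p)                                      # exact P(X <= 45)
pnorm(45, mean = n * p, sd = sqrt(n * p * (1 - p)))   # normal approximation, close to exact
```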

Fundamental Counting Principle

states that if one event can occur in $m$ ways and a second event can occur in $n$ ways for each occurrence of the first event, then the first and second events together can occur in $m \times n$ ways

Permutation

ordered sequences of objects in which each object occurs at most once, but not all objects need to be used. The total number of permutations when selecting $k$ objects from a total of $n$ choices is:

$$_nP_k = \frac{n!}{(n-k)!}$$
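
Base R has `choose()` but no permutation function, so $_nP_k$ is computed from factorials; the n and k here are arbitrary:

```r
n <- 5; k <- 3
factorial(n) / factorial(n - k)  # nPk = 60 ordered selections
choose(n, k)                     # nCk = 10 unordered selections
```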

Normal Distribution

$$\mathcal{N}(\mu, \sigma):\quad f(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}$$

Simple Random Sample

a sample chosen from a given population in such a way as to ensure that each person (or thing) in the population has an equal and independent chance of being picked for the sample

Central Limit Theorem

Given a population of values with no specified distribution and a sample size $N$ that is sufficiently large, the sampling distribution of means for samples drawn from this population with replacement can be described as follows:

  1. Its shape is approximately normal
  2. Its mean $\mu_{\bar x}=\mu$
  3. Its standard deviation $\sigma_{\bar x}=\dfrac{\sigma}{\sqrt{N}}$, also called the standard error

Large Sample Condition:

  • original distribution is approximately normal
  • sample size greater than 30
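
A simulation sketch of the CLT in R, drawing from a skewed exponential population; the sample size and repetition count are arbitrary:

```r
set.seed(1)
means <- replicate(5000, mean(rexp(40, rate = 1)))  # 5000 sample means, n = 40
hist(means, breaks = 40)       # approximately normal despite the skewed population
c(mean(means), sd(means))      # close to mu = 1 and sigma / sqrt(n) = 1 / sqrt(40)
```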

Population Proportion

population proportion - $P$

sample proportion - $\hat p$

$$\hat p = \frac{\text{count of successes}}{\text{number drawn}}=\frac{X}{n}$$

Sampling Distribution of a Sample Proportion

Choose a simple random sample of size $n$ from a population of size $N$, with a proportion of $p$ successes. Then:

  • the mean of the sampling distribution is equal to $p$: $\mu_{\hat p}=p$
  • the standard deviation of the sampling distribution of $\hat p$ is $\sigma_{\hat p}=\sqrt{\dfrac{p(1-p)}{n}}$ as long as $n \leq \dfrac{1}{10} N$
  • as $n$ increases, the sampling distribution approaches normal, as long as $np \geq 10$ and $n(1-p) \geq 10$

Sampling Distribution of $\bar x$

Suppose $\bar x$ is the mean of a simple random sample of size $n$ drawn from a large population $N$ with mean $\mu$ and a standard deviation of $\sigma$. As long as $n \leq \dfrac{1}{10} N$:

  • $\mu_{\bar x}= \mu$
  • $\sigma_{\bar x} = \dfrac{\sigma}{\sqrt{n}}$

Point estimate is the statistic calculated from your sample.

  1. don’t expect the point estimate $\bar x$ to equal the population mean $\mu$
  2. it should fall within the normal sampling distribution
  3. $95\%$ of the data is within 2 standard deviations of the mean
  4. whenever $\bar x$ is within 10 points of $\mu$, $\mu$ is within 10 points of $\bar x$; this happens in about $95\%$ of all samples

Definition of Confidence Interval, Margin of Error, Confidence Level

A $C\%$ confidence interval gives an interval of plausible values for a parameter

$$\text{point estimate} \pm \text{margin of error}$$
$$\text{margin of error} \rightarrow \text{value based on desired confidence level}$$
$$\text{Standard Error}=\frac{\sigma}{\sqrt{n}}$$
|  | proportion | mean |
| --- | --- | --- |
| population parameter | $p$ | $\mu$ |
| sample statistic | $\hat p$ | $\bar x$ |
| standard deviation | $\sqrt{\dfrac{p(1-p)}{n}}\approx \sqrt{\dfrac{\hat p(1-\hat p)}{n}}$ | $\sigma_{\bar x}=\dfrac{\sigma}{\sqrt{n}}\approx \dfrac{s_x}{\sqrt{n}}$ |

Confidence Level

The confidence level $C$ gives the overall success rate of the method for calculating the confidence interval. That is, in $C\%$ of all possible samples of a particular size $n$, this method would yield an interval that captures the true parameter value.

Interval yields plausible values.

Generally we choose a confidence level of at least $90\%$; $95\%$ is the most common.

$$\begin{aligned} \text{confidence interval} &= \text{statistic} \pm \text{ME} \\ &= \text{statistic} \pm (\text{critical value}) \times \text{SE} \end{aligned}$$

confidence interval for $p$:

$$\hat p \pm z^* \times \sqrt{\frac{\hat p (1-\hat p)}{n}}$$

confidence interval for means:

$$\bar x \pm z^* \times \frac{\sigma}{\sqrt{n}} \approx \bar x \pm t^* \times \frac{s_x}{\sqrt{n}}$$
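
Both intervals in R; `t.test()` gives the t interval directly, and the proportion interval is computed by hand with made-up values for $\hat p$ and $n$:

```r
t.test(mtcars$mpg)$conf.int                  # 95% t interval for a mean
phat <- 0.55; n <- 200                       # illustrative sample proportion and size
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)  # 95% z interval for p
```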

T Distribution

  • as degrees of freedom increase, it approaches the standard normal curve
  • symmetric, centered like the standard normal
  • standard deviation changes with the degrees of freedom
  • bell curve
  • unimodal
  • $\text{degrees of freedom}=n-1$

Significance Test

Null and Alternative Hypothesis

$\mathcal{H}_0$ - the null hypothesis, a claim

$\mathcal{H}_a$ - the alternative hypothesis; the claim that holds if $\mathcal{H}_0$ is false

Parameter of interest: $p$ is the true proportion of successes

Statistical tests weigh the evidence for a claim against an alternative claim.

  • Step 1: state hypothesis, hypothesis should express suspicions we have before we see the data
    • one-sided $>$ or $<$
    • two-sided $\neq$

    $$\mathcal{H}_0: p = \text{value} \quad \text{vs.} \quad \mathcal{H}_a: p < (>\ \text{or} \neq)\ \text{value}$$
    $$\mathcal{H}_0: \mu = \text{value} \quad \text{vs.} \quad \mathcal{H}_a: \mu < (>\ \text{or} \neq)\ \text{value}$$
  • Step 2: Does the data give convincing evidence against the null?
    • Reject the null
    • Fail to reject the null

P-Value

The probability, computed assuming $\mathcal{H}_0$ is true, that the statistic (such as $\hat p$ or $\bar x$) would take a value as extreme as or more extreme than the one actually observed, in the direction of $\mathcal{H}_a$.

Small p-values are evidence against $\mathcal{H}_0$: the observed result is unlikely to occur when $\mathcal{H}_0$ is true.

If the p-value is smaller than $\alpha$, we say the results are statistically significant at level $\underline \alpha$ (the significance level); we reject $\mathcal{H}_0$ and conclude there is convincing evidence in favor of $\mathcal{H}_a$.

Common levels of $\alpha$: $0.01, 0.05, 0.1$

|  | $\mathcal{H}_0$ true | $\mathcal{H}_0$ false |
| --- | --- | --- |
| reject $\mathcal{H}_0$ | Type $\text{I}$ error $=\alpha$ | 😀 (power $=1-\beta$) |
| fail to reject $\mathcal{H}_0$ | 😀 | Type $\text{II}$ error $=\beta$ |

Test about a population proportion (1-prop z-test)

  1. state what the parameter of interest is
  2. check conditions
    • random - SRS / well-designed experiment
    • independence - if not replacing, $n\leq \frac{1}{10}N$
    • normal - $np \geq 10$ and $n(1-p) \geq 10$
  3. perform the test
    • test statistic: $z = \dfrac{\hat p - p_0}{\sqrt{p_0(1-p_0)/n}}$
    • p-value via $\text{normalcdf}(\text{lower bound}, \text{upper bound}, \text{mean}, \text{standard deviation})$
  4. conclude
    • compare the p-value to $\alpha$
    • reject or fail to reject in context
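
A 1-prop z-test in R with made-up counts, following the steps above; `prop.test()` reports the equivalent chi-square statistic ($z^2$):

```r
x <- 60; n <- 100; p0 <- 0.5               # illustrative: 60 successes in 100 trials
z <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)
2 * pnorm(-abs(z))                         # two-sided p-value
prop.test(x, n, p = p0, correct = FALSE)   # same test in chi-square form
```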

1 sample t-test

  1. State the parameter of interest
  2. Check conditions
    • random - SRS / well-designed experiment
    • independent - sample is 10% or less of N if sampling without replacement
    • normal - $n\geq30$ / population distribution is normal
  3. Hypotheses: $\mathcal{H}_0$ and $\mathcal{H}_a$
  4. perform the test
    • test statistic:
    $$t = \frac{\text{statistic} - \text{parameter}}{\text{standard error}}=\frac{\bar x - \mu_0}{s_x / \sqrt{n}}$$
  5. Conclusion
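
In R, `t.test()` performs step 4 and reports the p-value for step 5; the by-hand line reproduces the t statistic (the null value $\mu_0 = 20$ is arbitrary):

```r
t.test(mtcars$mpg, mu = 20)                   # one-sample t-test
x <- mtcars$mpg
(mean(x) - 20) / (sd(x) / sqrt(length(x)))    # same t statistic by hand
```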

Power

The power of a test against a specific alternative is the probability that the test will reject $\mathcal{H}_0$ at a chosen significance level $\alpha$ when the specified alternative value of the parameter is true.

  1. find the rejection cutoff corresponding to $\alpha$ using $\text{invNorm}$
  2. find the probability of crossing that cutoff under the alternative distribution (sketched in R below)
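
These two steps in R for a one-sided z-test; every number here ($\alpha$, n, $\sigma$, and the alternative mean 0.5) is illustrative:

```r
alpha <- 0.05; n <- 25; sigma <- 1
crit <- qnorm(1 - alpha, mean = 0, sd = sigma / sqrt(n))  # step 1: cutoff under H0: mu = 0
1 - pnorm(crit, mean = 0.5, sd = sigma / sqrt(n))         # step 2: power if mu = 0.5 (~0.80)
```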

Effect

Effect size is a quantitative measure of the strength or magnitude of a phenomenon. It tells how large the difference between groups is, or how strong the relationship between variables is, independent of sample size.

Cohen’s d - means

$$d=\frac{\bar x_1 - \bar x_2}{s_p} \quad \text{where} \quad s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}$$
$$d = \begin{cases} 0.2 &\rightarrow \text{small} \\ 0.5 &\rightarrow \text{medium} \\ 0.8 &\rightarrow \text{large} \end{cases}$$
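
Cohen's d by hand in R; the two group vectors are made up:

```r
g1 <- c(5.1, 6.2, 5.8, 6.0, 5.5); g2 <- c(4.2, 4.9, 5.0, 4.4, 4.7)
n1 <- length(g1); n2 <- length(g2)
sp <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2))  # pooled SD
(mean(g1) - mean(g2)) / sp                                             # Cohen's d
```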

Cohen’s h

$$h=2 \arcsin{\sqrt{p_1}} - 2 \arcsin{\sqrt{p_2}}$$

possible values range from $-\pi$ to $\pi$

Comparing 2 means

| Population or treatment | Parameter | Statistic | Sample size |
| --- | --- | --- | --- |
| 1 | $\mu_1$ | $\bar x_1$ | $n_1$ |
| 2 | $\mu_2$ | $\bar x_2$ | $n_2$ |

Sampling distribution of $\bar x_1 - \bar x_2$

choose an SRS of size $n_1$ from a population with mean $\mu_1$ and standard deviation $\sigma_1$, and an SRS of size $n_2$ from a population with mean $\mu_2$ and standard deviation $\sigma_2$

  • shape - approximately normal if the sample sizes are large enough
  • center - $\mu_{\bar x_1 - \bar x_2} = \mu_1 - \mu_2$
  • spread - $s = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

$$t_{stats} = \frac{(\bar x_1 - \bar x_2) - (\mu_1-\mu_2)}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$$

Confidence Interval

$$\text{confidence interval} = (\bar x_1 - \bar x_2) \pm t^* \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

2-sample t-test

  • $\mathcal{H}_0: \mu_1 - \mu_2 = 0$ or $\mu_1=\mu_2$
  • $\mathcal{H}_a: \mu_1 - \mu_2 > (<\ \text{or} \neq)\ 0$ or $\mu_1 > (<\ \text{or} \neq)\ \mu_2$
  • $\text{df}=n_1+n_2-2$
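
In R, `t.test()` defaults to Welch's unequal-variance version; `var.equal = TRUE` gives the pooled test with $\text{df}=n_1+n_2-2$ as in these notes (using `mtcars` as a stand-in):

```r
t.test(mpg ~ am, data = mtcars)                     # Welch two-sample t-test
t.test(mpg ~ am, data = mtcars, var.equal = TRUE)   # pooled test, df = n1 + n2 - 2
```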

Comparing 2 proportions

$$\hat p_1- \hat p_2$$

Sampling Distribution for $\hat p_1- \hat p_2$

  • center: $\mu_{\hat p_1- \hat p_2}=p_1- p_2$
  • shape: approximately normal if $n_1p_1\geq10$, $n_1(1-p_1)\geq10$, $n_2p_2\geq10$, and $n_2(1-p_2)\geq10$
  • spread: $\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
  • independent: no more than $10\%$ of the population for either $n_1$ or $n_2$

Confidence Interval

$$\text{CI} = (\hat p_1- \hat p_2)\pm z^* \times \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}}$$

(the standard error uses the sample proportions, since the population proportions are unknown)

2 prop z-test

  • $\mathcal{H}_0: p_1-p_2=0$
  • $\mathcal{H}_a: p_1-p_2 > (<\ \text{or} \neq)\ 0$

$$z_{stats} = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p_c(1-\hat p_c)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \quad \text{where}\quad \hat p_c = \frac{X_1+X_2}{n_1+n_2} \text{ is the pooled proportion} \qquad p_v=\text{normalcdf}(z_{stats})$$

  • compare $p_v$ to $\alpha$
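
A sketch in R with made-up counts; `prop.test()` pools the proportions internally, and the by-hand lines reproduce the pooled z statistic:

```r
x <- c(45, 30); n <- c(100, 100)          # illustrative successes and sample sizes
prop.test(x, n, correct = FALSE)          # 2-prop test (chi-square = z^2)
pc <- sum(x) / sum(n)                     # pooled proportion
(x[1]/n[1] - x[2]/n[2]) / sqrt(pc * (1 - pc) * (1/n[1] + 1/n[2]))  # z statistic
```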

Nonparametric Inference Techniques

Everything thus far with inference relies on explicit assumptions about populations, distributions, and parameters.

  • Today focus on how to work with categorical data
  • Chi-square methods are relatively free from assumptions

Chi-Square Distribution

Given a normal distribution with a mean of $\mu$ and a variance of $\sigma^2$, we select a random value $X$ from the population.

$$z^2=\left(\frac{X - \mu}{\sigma}\right)^2$$

The distribution of $z^2$ values can then be obtained.

$$\chi^2_{(N)}=\sum_{i=1}^N z_i^2$$

  • $\mu = \text{df}$
  • $\sigma = \sqrt{2\cdot \text{df}}$
  • $\text{mode} = \text{df} - 2$
```r
pchisq(x2_value, df, lower.tail = FALSE)  # x2_value: observed chi-square statistic; df: degrees of freedom
```

Chi-Square Goodness of Fit Test

Statement

  • $\mathcal{H}_0$: the percentage distribution is the same as the given one
  • $\mathcal{H}_a$: the distribution doesn't match the given one

Collect a random sample of size $N$: how unlikely is it for the observed values to differ as much as they do from the expected values if $\mathcal{H}_0$ is true?

Conditions

  • Cells must be mutually exclusive and exhaustive
  • Observations need to be independent
  • Expected frequency for each cell must be at least 5
$$\chi_{(k-1)}^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}$$

  • $\text{df} = k - 1 = \text{categories} - 1$

$$p_v=\chi^2\text{cdf}(\text{lower}=\chi^2,\ \text{upper} = \infty / 10^{99},\ \text{df})$$
```r
type <- c("A", "B", "C")        # category labels (illustrative)
freq <- c(30, 50, 20)           # observed counts (illustrative)
prob <- c(0.25, 0.50, 0.25)     # expected proportions under H0
chisq.test(freq, p = prob)      # goodness-of-fit test, df = k - 1
```

The Chi-Square Test of Independence

Enables us to determine whether two categorical variables are related

Statement

  • $\mathcal{H}_0$: the two variables are mutually independent
  • $\mathcal{H}_a$: the two variables are not mutually independent

Conditions

Same as Chi-Square Goodness of Fit Test

  • Individual observations are independent of each other
  • $\min(E_{ij})\geq5$

$$\chi_{(c-1)(r-1)}^2=\sum_{ij} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

  • $\text{df} = (\text{rows}-1)\cdot(\text{columns}-1)$
  • $E_{ij}=(\text{row total}) \cdot (\text{column total}) \div (\text{sample total})$
  • $E_{ij}$ is computed assuming that independence is true

$$p_v=\chi^2\text{cdf}(\text{lower}=\chi^2,\ \text{upper} = \infty / 10^{99},\ \text{df})$$
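
In R, `chisq.test()` on a matrix of counts runs the independence test and exposes the expected counts; the 2×2 table here is made up:

```r
tbl <- matrix(c(20, 30, 40, 10), nrow = 2,
              dimnames = list(group = c("g1", "g2"), outcome = c("yes", "no")))
chisq.test(tbl, correct = FALSE)           # df = (rows - 1) * (columns - 1) = 1
chisq.test(tbl, correct = FALSE)$expected  # E_ij = row total * column total / grand total
```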

The Chi-Square Test for Homogeneity

Determines whether different populations or groups have the same distribution of a categorical variable ($\text{CV}$).

Statement

  • $\mathcal{H}_0$: the distribution of the CV is the same across all groups
  • $\mathcal{H}_a$: the distribution of the CV is not the same across all groups

Conditions

Same as Chi-Square Goodness of Fit Test

|  | $\chi^2$ goodness of fit | $\chi^2$ for homogeneity | $\chi^2$ for independence |
| --- | --- | --- | --- |
| number of samples | 1 | at least 2 | 1 |
| number of variables | 1 | 1 | at least 2 |
| $\mathcal{H}_0$ | the percentage distribution is the same as given | the distribution of the CV is the same across all groups | the two variables are mutually independent |
| $\mathcal{H}_a$ | the distribution doesn't match | the distribution of the CV is not the same across all groups | the two variables are not mutually independent |
| Expected value | $n\times p$ | $\dfrac{c\times r}{t}$ | $\dfrac{c\times r}{t}$ |
| Degrees of freedom | $\text{categories} - 1$ | $(r-1)\cdot(c-1)$ | $(r-1)\cdot(c-1)$ |

Inferential Statistics for Linear Regression

If data are a random sample from a larger population, we need statistical inference to answer the following questions:

  • Is there a relationship between x and y in the population, or could the pattern we see in the sample data happen by chance?
  • In the population, how much will the predicted value of y change for every increase of 1 unit of x?
  • What is the margin of error for the estimate?

Sampling Distribution of a Slope

Choose a simple random sample of $n$ observations $(x,y)$ from a population of size $N$:

$$\hat y = \alpha + \beta x$$

  • $\mu_b = \beta$
  • $\sigma_b = \dfrac{\sigma}{\sigma_x \sqrt n}$ as long as $n \leq \dfrac{1}{10} N$
  • The distribution of y for each value of x follows a normal distribution

Conditions:

  • Linear - look at the scatter plot and the plot of the residuals
  • Independent
  • Normal - for any value of x, y varies according to a normal distribution
    • look at the residuals - box plot, histogram, stem-and-leaf plot
  • Equal Standard Deviation - the standard deviation of y is the same for all values of x
    • look at a scatterplot of the residuals
  • Random - data from a well-designed experiment or random sample
$$s = \sqrt{\frac{\sum(y_i - \hat y_i)^2}{n - 2}} \qquad \sigma_b = \dfrac{\sigma}{\sigma_x \sqrt n} \qquad \text{SE}_b = \frac{s}{s_x \sqrt n}$$

Confidence Interval

$$b \pm t^* \times \text{SE}_b \quad \text{where} \quad \text{df} = n - 2$$

Significance Test

Hypothesis

  • $\mathcal{H}_0: \beta = \beta_0$
  • $\mathcal{H}_a: \beta > (<\ \text{or} \neq)\ \beta_0$

Test Statistics

$$t_{stats} = \frac{b - \beta_0}{\text{SE}_b}$$

Conclusion

If $\beta_0 = 0$, the test determines whether the value of y is impacted by the value of x.
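
In R, `summary(lm())` reports $b$, $\text{SE}_b$, the t statistic, and the p-value for $\mathcal{H}_0: \beta = 0$, and `confint()` gives the t interval with $\text{df} = n - 2$ (using `mtcars` as a stand-in):

```r
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients         # estimate b, SE_b, t value, p-value for beta = 0
confint(fit, "wt", level = 0.95)  # b +/- t* x SE_b with df = n - 2
```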