sample size calculation

with exercises in GPower

Wilfried Cools & Tim Pauwels

April 22, 2024

Workshop

  • to introduce the key ideas
  • to help you see the bigger picture
  • to offer first practical experience: GPower

 

Feedback

  • Help us improve this document

    wilfried.cools@vub.be

at SQUARE

  • Ask us for help

    we offer consultancy
    on methodology, statistics and its communication

    square.research.vub.be

Program

 

  • Part I:
    understand the reasoning
    • introduce building blocks
    • highlight how they relate
    • focus on t-test only
    • a few exercises in GPower

 

  • Part II:
    explore more complex situations
    • go beyond the t-test
    • simple but common
    • many exercises in GPower
    • not one formula for all

Sample size calculation: demarcation

 

  • How many observations will be sufficient ?
    • avoid too many, because typically observations imply a cost
      • money / time → limited resources
      • risk / harm / patient burden → ethical constraints
    • have enough
  • To offer strong enough statistical inference !
    • linked to standard error
      • testing → power [probability to detect effect]
      • estimation → accuracy [size of confidence interval]

Sample size calculation: a difficult design issue

 

  • Part of the design of a study
    • before data collection
    • requires understanding:
      • parameters: effect size of interest
      • data: future data properties
      • model: relation between the outcome and the conditions under which it is observed
    • decision based on (highly) incomplete information, thus based on (strong) assumptions

 

  • Not always possible nor meaningful !
    • confirmatory studies easier than exploratory
    • experiments (control) easier than observational
    • not obvious for complex models
      → simulation
    • not obvious for predictive models
      → no standard error

Sample size calculation: if not possible

 

  • If not possible in a meaningful way
    use alternative justification

    • common practice
    • feasibility

  • Avoid retrospective power analyses
    → OK for future study only

    Hoenig, J., & Heisey, D. (2001). The Abuse of Power:
    The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.

 

  • Because less strong,
    put more weight on non-statistical justification
    • low cost
    • importance

Simple example confirmatory experiment

 

  • Does my radiotherapy work ?
    • aim: show my radiotherapy reduces tumor size
    • method: compare treatment and control group
      • tumor induced in N mice
      • random assignment mice
    • data: tumor sizes after 20 days
    • analysis: unpaired t-test

 

  • Sample size question
    • how many mice are required
    • to show treatment reduces tumor size more
    • assuming effect size:
      • my radiotherapy works if 25% more reduction in treatment group
      • 25% ~ 4mm reduction (control) versus 5mm (treatment)
    • with 80% probability (+ type I error probability .05)

Reference example

 

  • Apriori specifications
    • intend to perform a statistical test
    • comparing 2 equally sized groups
    • to detect difference of at least 2
    • assuming an uncertainty (SD) of 4 for each group
    • which results in an effect size of .5
    • evaluated on a Student t-distribution
    • allowing for a type I error prob. of .05 \((\alpha)\)
    • allowing for a type II error prob. of .2 \((\beta)\)
  • Sample size
    conditional on specifications being true

Difference detected approximately 80% of the time.

Note

  • This reference example used throughout the workshop !!

Formula you could use

 

  • Specifications for this particular case:
    • sample size (n → ?)
    • difference ( \(\Delta\) =signal → 2)
    • uncertainty ( \(\sigma\) =noise → 4)
    • type I errors ( \(\alpha\) = .05, so \(Z_{\alpha/2}\) → -1.96)
    • type II errors ( \(\beta\) = .2, so \(Z_\beta\) → -0.84)
  • Sample size = 2 groups x 63 observations = 126

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{ \Delta^2} = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79\)

  • Formulas are test and statistic specific but the logic remains the same
  • This and other formulas are implemented in various tools, our focus: GPower
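
The formula above can be checked directly in R. This is a minimal sketch using the reference example values; the variable names are ours and are not part of GPower.

```r
# Reference example: normal-approximation sample size per group
alpha <- 0.05   # type I error probability (two-tailed)
beta  <- 0.20   # type II error probability
delta <- 2      # difference to detect (signal)
sigma <- 4      # standard deviation (noise)

z_alpha <- qnorm(alpha / 2)   # about -1.96
z_beta  <- qnorm(beta)        # about -0.84

n_per_group <- (z_alpha + z_beta)^2 * 2 * sigma^2 / delta^2
n_per_group            # about 62.79
ceiling(n_per_group)   # 63 per group, 126 in total
```

With the rounded values -1.96 and -0.84 the result is 62.72; the 62.79 shown above comes from the unrounded quantiles.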

GPower: a useful tool

 

  • Use it
    • implements wide variety of tests
    • free @ http://www.gpower.hhu.de/
    • popular and well established
    • implements various visualizations
    • documented -fairly- well
  • Maybe not use it
    • not all tests are included, only the simpler ones
    • not without flaws
    • other tools exist (some paying)

 

GPower statistical tests

 

  • Test family - statistical tests [in window]
    • Exact Tests (8)
    • \(t\)-tests (11) → reference
    • \(z\)-tests (2)
    • \(\chi^2\)-tests (7)
    • \(F\)-tests (16)
  • Focus on the density functions

 

  • Tests [in menu]
    • correlation & regression (15)
    • means (19) → reference
    • proportions (8)
    • variances (2)
  • Focus on the type of parameters

GPower input

 

  • ~ reference example input
    • t-test : difference two independent averages
    • apriori: calculate sample size
    • effect size = standardized difference (Cohen’s \(d\))
      • Determine =>
        • \(d\) = |difference| / SD_pooled
        • \(d\) = |0-2| / 4 = .5
    • \(\alpha\) = .05; two-tailed ( \(\alpha\) /2 → .025 & .975 )
    • \(power = 1-\beta\) = .8
    • allocation ratio N2/N1 = 1 (equally sized groups)

GPower output

 

  • ~ reference example output
    • sample size \((n)\) = 64 x 2 = (128)
    • degrees of freedom \((df)\) = 126 (128-2)
    • power ≥ .80 (1- \(\beta\)) = 0.8015
    • distributions: central + non-central
    • critical t = 1.979
      • decision boundary given \(\alpha\) and \(df\)
        qt(.975,126)
    • non centrality parameter ( \(\delta\) ) = 2.8284
      • shift Ha (true) away from Ho (null)
        2/(4*sqrt(2))*sqrt(64)
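
The same numbers can be approximated outside GPower. The sketch below assumes base R's power.t.test() is an acceptable stand-in for GPower's two-sample t-test; it should give the same sample size, critical t and non-centrality parameter for the reference example.

```r
# A priori sample size for the reference example (two-sided, alpha = .05, power = .80)
res <- power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")

n  <- ceiling(res$n)   # observations per group, rounded up
df <- 2 * n - 2        # degrees of freedom

crit_t <- qt(0.975, df)                  # critical t: decision boundary under Ho
ncp    <- 2 / (4 * sqrt(2)) * sqrt(n)    # non-centrality parameter: shift of Ha

c(n = n, df = df, crit_t = round(crit_t, 3), ncp = round(ncp, 4))
```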

GPower protocol

 

  • Summary for future reference or communication
    • central and non-central distributions (figure)
    • protocol of power analysis (text)

 

  • File/Edit save or print file (copy-paste)

 

Non-centrality parameter ( \(\delta\) ), shift Ha from Ho

 

  • non-centrality parameter \(\delta\) combines SIZES
    • assumed effect size ((standardized) signal)
    • conditional on sample size (information)
  • \(\delta\) determines overlap Ho and Ha: bigger ncp less overlap
    • \(\delta\) as violation of Ho → shift (location/shape)
    • power = probability beyond \(\color{green}{cut off}\) at Ho evaluated on Ha
    • push with sample size
  • Ha acts as \(\color{blue}{truth}\) assumed difference of e.g. .5 SD
    • Ha ~ t(ncp=2.828,df)
  • Ho acts as \(\color{red}{benchmark}\): typically no difference, no relation
    • set \(\color{green}{cutoff}\) on Ho ~ t(ncp=0,df) using \(\alpha\)

 

Alternative: divide by N

 

  • Sample sizes determine shape, not location
    • divide by n: sample size ~ standard error
      • peakedness of both distributions
      • often preferred didactically
    • non-centrality parameter: sample size ~ location
      • standardized distributions
      • often preferred in software / algorithms
  • Formulas are the same (DIY: two equations for critical value)

 

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)

Type I/II error probability

 

  • Inference (test) based on cut-offs (density → AUC=1)

  • Type I error: incorrectly reject Ho (false positive):

    • cut-off at Ho, error prob. \(\alpha\) controlled
    • one/two tailed → one/both sides informative ?
  • Type II error: incorrectly fail to reject Ho (false negative):

    • cut-off at Ho, error prob. \(\beta\) obtained from Ha
    • Ha assumed known in a power analysis
  • power = 1 - \(\beta\) = probability correct rejection (true positive)

 

  • Inference versus truth
    • infer: effect exists vs. unsure
    • truth: effect exists vs. does not

               infer=Ha         infer=Ho         sum
    truth=Ho   \(\alpha\)       1- \(\alpha\)    1
    truth=Ha   1- \(\beta\)     \(\beta\)        1

Create plot

 

  • Create a plot
    • X-Y plot for range of values
    • assumes calculated analysis
      • ~ reference example
    • specify Y-axis / X-axis / curves and constant
      • beware of order !
  • Plot sample size (y-axis)
    • by type I error \(\alpha\) (x-axis) → from .01 to .2 in steps of .01
    • for 4 values of power (curves) → starting at .8 in steps of .05
    • assume effect size (constant) → .5 from reference example

 

  • Notice Table option
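
A similar plot can be sketched in R. This is an illustration only: it assumes power.t.test() as a stand-in for GPower, and the helper n_for() is a hypothetical name of ours.

```r
# Sample size per group (y-axis) by type I error (x-axis), one curve per power value
alphas <- seq(0.01, 0.20, by = 0.01)
powers <- c(0.80, 0.85, 0.90, 0.95)

# hypothetical helper: required n per group for the reference example (d = .5)
n_for <- function(alpha, power) {
  power.t.test(delta = 2, sd = 4, sig.level = alpha, power = power)$n
}

n_grid <- outer(alphas, powers, Vectorize(n_for))   # rows: alpha, columns: power
dimnames(n_grid) <- list(alpha = alphas, power = powers)

matplot(alphas, n_grid, type = "l", lty = 1, col = 1:4,
        xlab = "type I error (alpha)", ylab = "sample size per group")
legend("topright", legend = paste("power", powers), col = 1:4, lty = 1)
```

The matrix n_grid plays the role of GPower's Table option: each row is a value of \(\alpha\), each column a value of power.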

Errors: exercises

 

  • Where on the red curve (right) is
    the type II error equal to 4 * type I error ?
  • When smaller effect size (e.g., .25), what changes ?

 

Errors: exercises continued

 

  • Plot power instead of sample size

    • with 4 power curves
      with sample sizes starting at 32 in steps of 32
  • What is relation type I and II error ?

  • What would be difference between curves for \(\alpha\) = 0 ?

 

Decide Type I/II error probability

 

  • Reasoning on error probabilities
    • \(\alpha\) & \(\beta\) inversely related
    • which error you want to avoid most ?
      • cheap AIDS test ? → avoid type II
      • heavy cancer treatment ? → avoid type I

 

  • Popular choices for error probabilities
    • \(\alpha\) often in range .01 - .05 → 1/100 - 1/20
    • \(\beta\) often in range .2 to .1 → power = 80% to 90%

 

 

  • Popular rules of thumb
    • 4 * \(\alpha\) ~ \(\beta\)
      • type I error is ~4 times worse !!
  • Probability for both errors always exists

For fun: P(effect exists | test says so)

 

  • Using \(\alpha\), \(\beta\) and power or \(1-\beta\)
    • \(P(infer=Ha|truth=Ha) = power\) → \(P\)(test says there is effect | effect exists)
    • \(P(infer=Ha|truth=Ho) = \alpha\)
    • \(P(infer=Ho|truth=Ha) = \beta\)
    • \(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
    • __ = \(\frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
    • __ = \(\frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
  • IF the prior probability that the effect exists is very low (eg., \(P(truth=Ha) = .01\)) THEN the probability that the effect exists when the test says so is low (eg., \(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\))
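
The Bayes calculation is easy to repeat for other prior probabilities. A minimal sketch; the function name ppv() is ours.

```r
# P(truth = Ha | infer = Ha) as a function of the prior P(truth = Ha)
ppv <- function(prior, power = 0.80, alpha = 0.05) {
  power * prior / (power * prior + alpha * (1 - prior))
}

ppv(0.01)   # about 0.14, as in the example above
ppv(0.50)   # about 0.94 when Ho and Ha are equally likely beforehand
```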

Control Type I error

 

  • Defined on the Ho, known
    • assumes only sampling variability

 

  • Multiple testing
    • control \(\alpha\) over set of tests
      • each test \(\alpha\) error
    • inflates type I error \(\alpha\)
      • k tests with probability for an error: \(1-(1- \alpha)^k\)
    • correct → eg., Bonferroni ( \(\alpha/k\))
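
The inflation and its correction can be tabulated in a few lines. This sketch assumes the k tests are independent, which is what the formula \(1-(1- \alpha)^k\) implies.

```r
alpha <- 0.05
k     <- c(1, 3, 5, 10)                    # number of tests in the family

fwer       <- 1 - (1 - alpha)^k            # probability of at least one type I error
alpha_bonf <- alpha / k                    # Bonferroni: per-test alpha
fwer_bonf  <- 1 - (1 - alpha_bonf)^k       # family-wise error after correction

round(data.frame(k, fwer, alpha_bonf, fwer_bonf), 4)
```

For 3 tests the uncorrected family-wise error is already about .14; after Bonferroni it stays at or just below .05.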

 

  • Interim analysis
    • analyze and ‘conditionally’ proceed
    • type of multiple testing
    • plan in advance
    • adjustments of either \(p\) or \(\alpha\)
    • alpha spending, eg., O’Brien-Fleming bounds
    • NOT GPower
      our own simulation tool (Susanne Blotwijk)
    • determine boundaries with PASS, R (ldbounds), …

A family of comparisons: exercises

 

  • Comparing the control group and two treatments

  • Pairwise comparisons, typically not an omnibus

    • looked at as a set of t-tests, not as ANOVA
    • requires multiple testing correction, e.g., Bonferroni correction: divide \(\alpha\) by number of tests
  • use reference example (C = 0, T1 = 2), and extend with group 3 with T2 = 4 (same sd)

    • what sample sizes are necessary for all three pairwise tests combined ?
    • what if biggest difference (C-T2) is ignored, because considered easiest to detect ?

Effect sizes, in principle

 

  • Estimate / guesstimate of minimal magnitude of interest
     

  • Typically standardized: signal to noise ratio

    • eg., effect size \(d\) = .5 means .5 standard deviations (pooled)
  • Part of non-centrality (as is sample size) → pushing away Ha

  • ~ Practical relevance

    • NOT statistical significance
      • p-value ~ effect size AND sample size

 

  • 2 main families of effect sizes
    • d-family (differences) and r-family (associations)
    • transform one into other,
      eg., d = .5 → r = .243
      \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\)
      \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\)
      \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
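
The three transformations can be written as small R functions. A sketch only; the function names are ours, and the d to r conversion assumes two equally sized groups.

```r
d_to_r  <- function(d)  d / sqrt(d^2 + 4)          # d-family to r-family
r_to_d  <- function(r)  2 * r / sqrt(1 - r^2)      # r-family to d-family
or_to_d <- function(or) log(or) * sqrt(3) / pi     # odds ratio to d

r <- d_to_r(0.5)
round(r, 3)    # 0.243
r_to_d(r)      # back to 0.5
or_to_d(3)     # d corresponding to an odds ratio of 3
```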

Effect sizes, in literature

 

  • Cohen, J. (1992).
    A power primer. Psychological Bulletin, 112, 155–159.

  • Cohen, J. (1988).
    Statistical power analysis for the behavioral sciences (2nd ed).

  • Famous Cohen conventions

    • but beware, just rules of thumb

 

Effect sizes, in literature continued

 

  • Ellis, P. D. (2010).
    The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.

  • more than 70 different effect sizes… most of them related

 

Effect sizes, in GPower (Determine)

 

  • Effect sizes are test specific
    • t-test → group means and sd’s
    • one-way anova → variance explained & error
    • regression → sd’s and correlations
    • . . . .

 

  • GPower helps with Determine
    • sliding window
    • one or more effect size specifications

 

Effect sizes, test specific but not really

 

  • Total variance: 16.882
  • Between group variance
    with mean y equal to 0 and 2: 1.008
  • Within group variance
    (error or residual variance): 15.874
  • Effect size \(f\), square root of ratio
    of between to within group variance: 0.252
  • \(\eta^2\) statistic, ratio of between group to total variance: 0.06
  • Design variance
    with X = 0 and 1: 0.252
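
These numbers can be rebuilt from the reference example. The sketch assumes 2 x 64 observations, a residual SD of exactly 4, and sample variances (divisor N-1), which is what the values above suggest.

```r
n     <- 64                                # observations per group
means <- rep(c(0, 2), each = n)            # group mean assigned to each observation
N     <- length(means)

between <- var(means)                      # 1.008
within  <- 16 * (N - 2) / (N - 1)          # 15.874: residual SS 2016 over N-1
total   <- between + within                # 16.882

f      <- sqrt(between / within)           # 0.252
eta2   <- between / total                  # 0.06
design <- var(rep(c(0, 1), each = n))      # 0.252

round(c(total = total, between = between, within = within,
        f = f, eta2 = eta2, design = design), 3)
```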

 

Effect sizes, test specific but not really

 

  • Regression analysis results in R

Call:
lm(formula = y ~ factor(group), data = .dta)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6795  -2.6556   0.5043   2.6463   8.8380 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    9.421e-16  5.000e-01   0.000  1.00000   
factor(group)2 2.000e+00  7.071e-01   2.828  0.00544 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4 on 126 degrees of freedom
Multiple R-squared:  0.0597,    Adjusted R-squared:  0.05224 
F-statistic:     8 on 1 and 126 DF,  p-value: 0.005444

 

  • ANOVA results in R
             Df Sum Sq Mean Sq F value  Pr(>F)   
group         1    128     128       8 0.00544 **
Residuals   126   2016      16                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • t-test results in R
[1] 0.00544392
        t 
-2.828427 

Effect sizes: exercises

 

  • For the reference example:
    • change mean values from 0 and 2 to 4 and 6, what changes ?
    • change sd values to 2 for each, what changes ?
      • effect size ?
      • total sample size ?
      • critical t ?
      • non-centrality ?
    • change sd values to 8 for each, what changes ?
    • change sd to 2 and 5.3, or 1 and 5.5,
      how does it compare to 4 and 4 ?

 

Effect sizes: exercises continued

 

  • For the reference example:
    • plot powercurve: power by effect size
    • compare 6 sample sizes: 34 in steps of 34
    • for a range of effect sizes in between .2 and 1.2
    • use \(\alpha\) equal to .05
    • how does power change when doubling the effect size ?

 

  • powercurve → X-Y plot for range of values

Effect sizes, how to determine them in theory

 

  • Choice of effect size matters → justify choice !!

  • Choice of effect size depends on aim of the study

    • realistic (eg., previously observed effect) → replicate
    • important (eg., minimally relevant effect)
    • NOT significant → meaningless, dependent on sample size
  • Choice of effect size dependent on statistical test of interest

    • for independent t-test → means and standard deviations
    • possible alternative: variance explained, eg., 1 versus 16+1

 

  • Examples
    • with one-way ANOVA
      \(f\) = .25 instead of d = .5
    • with linear regression
      \(f^2\) = .0625 instead of d = .5
    • psychometric freeware

Effect sizes, how to determine them in practice

 

  • Experts / patients
    • minimally clinically relevant effect
    • importance
    • use if possible
  • Literature
    • earlier study / systematic review
    • realistic
    • beware of publication bias
  • (Internal) Pilot
    • guesstimate the dispersion
    • not to obtain effect size → small sample

 

  • Guesstimate uncertainty…
    • sd from assumed range
      • assume normal and divide by 6
    • sd for proportions at conservative .5
    • sd from control, assume treatment the same
    • ...
  • Turn to Cohen
    • use if everything else fails
    • rules of thumb
      • eg., .2 - .5 - .8 for Cohen’s d

Effect sizes, a note about the SD

 

  • For independent t-test → means and standard deviations (sd)
    • sd ~ ‘unexplained’ variance
    • account for important predictors
  • Example: 50% variance unexplained by treatment, explained by predictor
    • a standard deviation of 4 (variance of 16)
    • split into
      • standard deviation of 2.8284 (variance of 8)
      • standard deviation of 2.8284 explained by important predictor
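
The effect on the sample size can be sketched with power.t.test(). This is a rough illustration: it only shrinks the SD from 4 to \(\sqrt{8}\) and ignores the degree of freedom spent on the predictor.

```r
# Reference example, without and with the important predictor accounted for
n_unadjusted <- power.t.test(delta = 2, sd = 4,       power = 0.80)$n
n_adjusted   <- power.t.test(delta = 2, sd = sqrt(8), power = 0.80)$n

# observations per group, rounded up
ceiling(c(unadjusted = n_unadjusted, adjusted = n_adjusted))
```

Halving the unexplained variance raises the effect size from .5 to about .71, so fewer observations are needed.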

Effect sizes, a note about the SD continued

 

  • not accounting for the important predictor

 

  • accounting for it, sd (around average) reduced

Effect sizes, a note about non-inferiority

 

  • Most often aim to show effect is likely to exist
    • show difference (or relation) different from (typically) zero
    • assuming a particular difference (or relation)
  • Non-inferiority to show effect is not too much worse
    • show difference (or relation) not beyond margin of tolerance
    • assuming a particular difference (or relation),
    • most often but not necessarily 0
    • ! this is inherently one-sided

Effect sizes, notes: exercises

 

  • Use reference example
    • assume half the unexplained variance is accounted for by the predictor, what are the sample sizes ?
    • assume a non-inferiority margin of -2, and no difference, how big is the sample size ?
    • assume treatment to be 2 higher, compare the sample size for superiority (bigger than 0) and non-inferiority with margin of -2

 

Sample size

 

  • Sample size
    • allocated over groups
    • not necessarily equally
    • specify ratio
  • Most important: all groups sufficiently large
    • unequal group sizes not a problem
    • only when expected group size difference

Sample size: exercises

 

  • For the reference example:
    • compare for allocation ratios 1, .5, 2, 10, 50
    • repeat for effect size 1, and compare
  • ? no idea why n1 \(\neq\) n2
  • Plot with changed allocation ratio

 

Relation sample & effect size, type I & II errors

 

  • Building blocks:
    • sizes: sample ( \(n\) ) and effect ( \(\Delta\) )
    • errors: type I ( \(\alpha\) ), type II ~ power ( \(1-\beta\) )

 

  • GPower → type of power analysis
    • Apriori: \(n\) ~ \(\alpha\), power, \(\Delta\)
    • Post Hoc: power ~ \(\alpha\), \(n\), \(\Delta\)
    • Compromise: power, \(\alpha\) ~ \(\beta\:/\:\alpha\), \(\Delta\), \(n\)
    • Criterion: \(\alpha\) ~ power, \(\Delta\), \(n\)
    • Sensitivity: \(\Delta\) ~ \(\alpha\), power, \(n\)

 

  • each parameter conditional on others
  • one outgoing arrow, three going in  
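
Most of these types can be mimicked with power.t.test() by leaving out the one parameter to solve for. A sketch under the assumption that power.t.test() is close enough to GPower for the reference example; GPower's Compromise analysis has no direct counterpart in this function and is left out.

```r
# A priori: solve for n (per group)
a_priori <- power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = 0.80)$n

# Post hoc: solve for power
post_hoc <- power.t.test(n = 64, delta = 2, sd = 4, sig.level = 0.05)$power

# Sensitivity: solve for the detectable difference
sensitivity <- power.t.test(n = 64, sd = 4, sig.level = 0.05, power = 0.80)$delta

# Criterion: solve for alpha (sig.level must be passed explicitly as NULL)
criterion <- power.t.test(n = 64, delta = 2, sd = 4, power = 0.80,
                          sig.level = NULL)$sig.level

round(c(a_priori = a_priori, post_hoc = post_hoc,
        sensitivity = sensitivity, criterion = criterion), 4)
```

Note that the sensitivity result is a raw difference; divide by the SD of 4 to get the standardized effect size.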

Type of power analysis: exercises

 

  • For the reference example:
    • how big is the power for 128 observations (n=2x64, \(\alpha\)=.05 and \(\Delta\)=.5)
    • then, assume a power of .8 but with only half the sample size, how does the effect size \(\Delta\) change ?
    • then, set the ratio \(\beta\)/ \(\alpha\) to 4, what are the resulting \(\alpha\) and \(\beta\) ? and what is the critical value ?
    • then, set the effect size to .7 (same ratio), what are the resulting \(\alpha\) and \(\beta\) ? and what is the critical value ?

Getting your hands dirty

 

 

  • in G*Power
m1=0
m2=2
s1=4
s2=4  
alpha=.025
N=128  
var=.5*s1^2+.5*s2^2  
d=abs(m1-m2)/sqrt(2*var)*sqrt(N/2)  
tc=tinv(1-alpha,N-2)  
power=1-nctcdf(tc,N-2,d)  

 

  • in R
.n <- 64
.df <- 2*.n-2
.ncp <- 2 / (4 * sqrt(2)) * sqrt(.n)
.alpha <- .05
.power <- 1 -
    pt( 
        qt(1-.alpha/2,df=.df,ncp=0), 
        df=.df, ncp=.ncp
    )
round(.power,4)
[1] 0.8015
  • 2 steps
    • qt → quantile on H0 (p=\(Z_{1-\alpha/2}\))
    • pt → probability on Ha < quantile

GPower, beyond the independent t-test

 

  • So far, comparing two independent means  

  • From now on, selected topics beyond independent t-test
    with small exercises

    • dependent instead of independent
    • proportions, dependent and independent
    • non-parametric instead of assuming normality
    • relations instead of groups (regression)
    • correlations
    • more than 2 groups (compare jointly, pairwise, focused)
    • more than 1 predictor
    • repeated measures

 

  • Look into - GPower manual
    27 tests → effect size, non-centrality parameter and example !!

Dependence between groups: exercises

 

  • When comparing 2 dependent groups (eg., before/after treatment) → account for correlations

  • Correlations are typically obtained from pilot data or earlier research

  • GPower: matched pairs (t-test / means, difference 2 dependent means)

  • Use reference example,

    • assume a correlation of .5 and compare with reference example for effect size and n
    • how many observations are required if no correlation exists (think then try) ? effect size ?
    • what changes with correlation .875 (think: more or less n, higher or lower effect size) ?
    • what power would be obtained for the reference with sample size 2x64, but correlation .5 ?
    • get the sd of the difference and use it (\(\sqrt{\sigma_a^2 + \sigma_b^2 - 2 * \rho * \sigma_a * \sigma_b}\)) ?
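
The SD of the difference, and the effect size that follows from it, can be computed for several correlations at once. A minimal sketch; sd_diff() is a name of ours, and dz denotes the mean difference divided by the SD of the difference.

```r
# SD of the difference between two correlated measurements
sd_diff <- function(sd_a, sd_b, rho) {
  sqrt(sd_a^2 + sd_b^2 - 2 * rho * sd_a * sd_b)
}

rho <- c(0, 0.5, 0.875)
s   <- sd_diff(4, 4, rho)    # reference example: SD of 4 at both occasions
dz  <- 2 / s                 # mean difference of 2 over SD of the difference

data.frame(rho, sd_diff = s, dz)
```

The higher the correlation, the smaller the SD of the difference and the larger the effect size.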

Dependence: a note about the correlations

 

  • a noticeable correlation

 

  • clear correlation

Proportions: exercises

 

  • Test difference two independent proportions → [0..1]

  • Simplest version of a logistic regression on two groups

  • GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

  • Testing whether two proportions are the same

    • for odds ratio 3 and p2 = .50, what is p1 ? and for odds ratio 1/3 and p2 = .75 ?
    • what is the sample size to detect a difference for both situations ?
    • for odds ratio 3 and p2 = .75, determine p1 and sample size, how does it compare with before ?
    • compare sample size for a .15 difference, at p1=.5 ?

Proportions: exercises

 

  • GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

  • Plot 5 power curves

    • odds ratio = 2, with p2 reference probability .6
    • proportions .5 to 1
    • 1 curve per sample sizes 328, 428, 528… (intervals of 100)
    • type I error .05
  • Explain curve minimum, relation sample size ?

  • Repeat for one-tailed, difference ?

Dependent proportions: exercises

 

  • Test difference two dependent proportions → [0..1] categorical shift

    • for two categories, McNemar test: compare \(p_{12}\) with \(p_{21}\)
    • information from changes only → discordant pairs
    • effect size as odds ratio → ratio of discordance
  • GPower: McNemar test (exact / proportions, difference 2 dependent proportions)

  • Testing whether proportions of discordance are the same

    • assume odds ratio 2, .25 discordant, what is sample size
    • for discordant .5 and 1 what are \(p_{12}\) and \(p_{21}\) and sample sizes ?
    • for odds ratio .5 and 1 (prop discordant = .25), what are sample sizes ?
    • repeat for third alpha option first scenario, what happens ?

Non-parametric distribution: exercises

 

  • When non-normally distributed residuals are expected, not possible to circumvent (eg., transformations)

  • Only considers ranks or uses permutations → price is efficiency and flexibility

  • Requires a parent distribution (alternative hypothesis), ‘min ARE’ should be default

  • GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)

  • Use reference example

    • use a normal parent distribution, how much efficiency is lost ?
    • use ‘min ARE’ as parent distribution, how much efficiency is lost ?

A relations perspective, regression analysis: exercises

 

  • Differences between groups ~ relation with grouping (categorization)

  • Example: d = .5 ~ r = .243 (note: slope \(\beta = {r*\sigma_y} / {\sigma_x}\))

    • total variance \(\sigma_y\) = residual variance + model variance (2 or 0) → var((2-1),(0-1),(2-1),(0-1),…)
    • design variance \(\sigma_x\) = variance -.5 and .5 for all observations → var((1-.5),(0-.5),(1-.5),(0-.5),…)
  • GPower: regression coefficient (t-test / regression, one group size of slope)

    • what is the slope \(\beta\) and \(\sigma_y\) for reference values, d=.5 (hint:d~r), SD = 4 and \(\sigma_x\) = .5 (1/0)
    • what is the resulting sample size
    • what happens with the slope and sample size if predictor values are taken as 1/-1 instead of 0/1 ?
    • what is \(\sigma_y\) for a slope of 6, \(\sigma_x\) = .5, and SD = 4, would it increase the sample size ?

Explained variance perspective, regression: exercises

 

  • A relation as ratio between and within group variance ~ explained variance R2

  • Different but related effect sizes \(f^2\) = \({R^2/{(1-R^2)}}\)

    • partial \(R^2\) = variance explained by predictor / total variance
    • \(f^2\) = variance explained by predictor / residual variance (Note: 2\(f\) = \(d\) for 2 groups)
  • GPower: regression coefficient (t-test / regression, fixed model single regression coef)

  • Use reference example

    • remember the variances, add them to calculate the effect size
    • calculate sample size ?
    • what if also other predictors in the model ?
    • what if 3 predictors extra reduce residual variance to 50% ?
      • hint: total variance remains constant at 17

A variance ratio perspective, ANOVA: exercises

 

  • Multiple groups, at least two differ → not one effect size d

  • F-test statistic & effect size f, ratio of variances \(\sigma_{between}^2 / \sigma_{within}^2\)

  • GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)

  • use reference example

    • what is the sample size, for a difference of 2, each 64 observations ?
    • why are the ncp and critical value different ?
    • how does size matter ? (play with it)
  • include a third group (group 1 = 0, group 2 = 2)

    • for third group with mean 0, 2 or 4 (figure), what are the sample sizes ?
    • repeat for between variance 2.67 (balanced: 0, 2, 4) and within variance 16 ?

 

Multiple groups: contrasts: exercises

 

  • Contrasts are linear combinations → planned comparison
    • eg., T1-C: \(1 * T1 -1 * C \neq 0\)
    • eg., (T1+T2)/2-C: \(.5 * (1 * T1 + 1 * T2) -1 * C \neq 0\)
  • Effect sizes for planned comparisons must be calculated
    • contrast specific between variance \(\sigma_{contrast}^2\)
    • \(f\) ~ variance ratio between / within
    • Note: \(f\) = \(d\)/2
  • Obtain effect sizes for contrasts (assume equally sized for convenience)
    • T1-C, T2-C, (T1+T2)/2-C
    • each contrast requires 1 degree of freedom

 

  • Parameters
    • group means \(\mu_i\)
    • pre-specified coefficients \(c_i\)
    • sample sizes \(n_i\)
    • total sample size \(N\)
    • k levels


\(\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}\)  

\(f = \sqrt{\frac{\sigma_{contrast}^2}{\sigma_{error}^2}}\)

Multiple groups: contrasts: exercises continued

 

  • GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)

  • For the reference example extended, with contrasts \(f_{T1-C}\)=.25, \(f_{T2-C}\)=.50 and \(f_{(T2+T1)/2-C}\)=.3535

    • what are the sample sizes for either contrast 1 or contrast 2 ?
    • what are the sample sizes for both contrast 1 and contrast 2 combined ?
    • if taking that sample size, what will be the power for T1-T2 ?
    • what are the sample sizes for contrast 3 ?

Repeated measures within: exercises

 

  • When the factor is time: repeated measures

    • relates to dependent t-test for multiple measurements (>2)
  • Beware: effect sizes obtained from literature may/may not include correlation

    • Options: as in GPower 3, or SPSS, …
  • GPower: repeated measures (F-test / Means, repeated measures within factors)

  • For reference example with effect size f = .25 (1/16 explained versus unexplained)

    • mimic independent t-test
    • mimic dependent t-test, correlation .5 !
    • what does an increase in correlation imply, why ?
    • for 4, or 8 repeated measurements (cor=.5), what changes.
    • for 4 groups (4 measures, cor=.5), what changes ?

Multiple factors

 

  • Multiple main effects and possibly interaction effects
    • main effect:
      • difference B1-B2 over all conditions of treatment (C,T1,T2)
      • difference C-T1-T2 over all conditions of type (B1,B2)
    • interaction effect:
      • effect of treatment (C-T1-T2) different per level of type (B1 or B2) or vice versa
    • note: numerator degrees of freedom
      • main effect (nr-1) or 2 for treatment and 1 for type
      • interaction (nr1-1)*(nr2-1) or 2 (=2x1)

 

 

  • effect sizes
    • \(\eta^2\) = \(f^2 / (1+f^2)\)
    • note: \(f = d/2\)
      for two groups
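
Both relations can be checked in R. A sketch with function names of our own; the three-group case assumes equally sized groups with means 0, 2 and 4 and a within-group variance of 16.

```r
f_from_d    <- function(d) d / 2               # two equally sized groups
eta2_from_f <- function(f) f^2 / (1 + f^2)

# two groups, reference example: d = .5
eta2_from_f(f_from_d(0.5))                     # about 0.0588

# three groups (C, T1, T2)
means <- c(0, 2, 4)
f3 <- sqrt(mean((means - mean(means))^2) / 16) # between over within variance
eta2_from_f(f3)                                # about 0.1429
```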

Multiple factors effect sizes: exercises

 

  • Get effect size: in-house shiny app

  • Use reference example for treatment (C-T1-T2) and add type (B1-B2)

    • determine the effect size \(\eta^2\)
      for averages 0, 2, and if necessary use 4 or 6
      • treatment effect C-T1
      • treatment effect C-T1-T2 - no type effect B1-B2
      • no treatment effect C-T1-T2 - type effect B1-B2
      • treatment effect C-T1-T2 within B1, not B2 - with interaction
      • treatment effect C-T1-T2 - type effect B1-B2 without interaction

Multiple factors: exercises

 

  • GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)

  • Use reference example for treatment (C-T1-T2) and add type (B1-B2)
    what are the sample sizes

    • C-T1 with \(\eta^2\)=0.0588
    • C-T1-T2 (no B1-B2) with \(\eta^2\)=0.1429
    • B1-B2 (no C-T1-T2) with \(\eta^2\)=0.0588
    • interaction C-T1-T2 x no B1-B2 with \(\eta^2\)=0.04
    • additive C-T1-T2 + B1-B2
      • with \(\eta^2\)=0.1429 for C-T1-T2
      • with \(\eta^2\)=0.0588 for B1-B2

Repeated measures between: exercises

 

  • When repeated measures are obtained for different groups

  • GPower: repeated measures (F-test / Means, repeated measures between factors)

  • For reference example

    • relate to independent t-test, 2 uncorrelated measurements
    • mimic independent t-test, 2 almost perfectly correlated measurements
    • with a correlation .5, what changes ?
    • what does increase in correlation imply, why ?
    • for 3 groups extended reference, cor=.5, what changes ?
    • for 4, or 8 repeated measurements (cor=.5), what changes.

Interaction within x between effect sizes: exercises

 

  • When differences between groups depend on time

  • Get effect size: in-house shiny app

  • Use reference example for control-treatment (C-T1), and 2 or 4 time points

    • determine the effect size interaction \(\eta^2\) with r=0 and r=.5
      • treatment effect C-T1, both C and T1 increase with 2
      • treatment effect C-T1, only T1 increase with 2
      • if previous situation repeated twice, T1=2,4,2,4
      • if previous situation repeated twice, but reversed, T1=2,4,4,2

Interaction within x between: exercises

 

  • Options: different effect sizes are possible

    • is correlation already part of effect size ?
    • often it is when extracted from literature
  • GPower: repeated measures (F-test / Means, repeated measures within-between factors)

  • Use effect sizes previous exercise, part 2 and 3

    • determine the sample size, assuming a correlation of .5
      • for effect size assuming no correlation is included, include .5 correlation
      • for effect size assuming correlation is included, include correlation of 0

Correlations: exercises

 

  • Test difference of two independent correlations → [-1..1]

  • Use Fisher Z transformations to normalize

    • z = .5 * log( \(\frac{1+r}{1-r}\) )
    • q = z1-z2
  • Correlations easier to differentiate as they are more different from 0

  • GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s

  • Testing whether two correlations are the same

    • with correlation coefficients .7844 and .5, what are the effect & sample sizes ?
    • with the same difference, but stronger correlations, eg., .9844 and .7, what changes ?
    • with the same difference, but weaker correlations, eg., .1 and .3844, what changes ?
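
The effect size q for the three scenarios can be obtained in R before turning to GPower. A minimal sketch; fisher_z() and effect_q() are names of ours, and fisher_z() is the same as base R's atanh().

```r
fisher_z <- function(r) 0.5 * log((1 + r) / (1 - r))    # Fisher Z transformation
effect_q <- function(r1, r2) fisher_z(r1) - fisher_z(r2)

q_weak   <- effect_q(0.3844, 0.1)
q_mid    <- effect_q(0.7844, 0.5)
q_strong <- effect_q(0.9844, 0.7)

round(c(weak = q_weak, mid = q_mid, strong = q_strong), 3)
```

The raw difference between the correlations is .2844 in each case, but q grows as the correlations move away from 0.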

Not included

 

  • GPower is not always sufficient

  • Tests too difficult to specify in GPower

    • statistics / parameter values difficult to guesstimate
    • manual not always sufficient
  • Tests not included in GPower

    • eg., survival analysis
    • many tools online, most dedicated to a particular model
  • Tests without formula

    • simulation may be the only tool

 

  • simulation in theory is always possible
    • iterate many times:
      • generate: simulated outcome
        introduce randomness ~ standard deviation
      • analyze: estimate parameters + test
    • count proportion of rejections (~power)
    • determine confidence bounds (~accuracy)
    • select sample size
      • with appropriate proportion (test)
      • with appropriate interval (estimation)

Simulation example: t-test

 

  • simulation in practice
    • reference example: 0-2 (4), 64x2
    • replicate 10000 times
      • generate:
        dta$y <- dta$y+rnorm(length(dta$y),0,4)
      • analyze:
        res <- t.test(data=dta,y~X)
    • count proportion rejection:
      mean(sims['p.val',] < .05)

 

 

gr <- rep(c('T','C'),64)                        # 64 observations per group
y <- ifelse(gr=='C',0,2)                        # true means: control 0, treatment 2
dta <- data.frame(y=y,X=gr)
cutoff <- qt(.025,nrow(dta)-2)                  # lower critical t, df = N-2
 
my_sim_function <- function(){
    dta$y <- dta$y+rnorm(length(dta$y),0,4)     # generate (with sd=4)
    res <- t.test(y~X,data=dta)                 # analyze (t = C minus T, so negative)
    c(res$estimate %*% c(-1,1),res$statistic,res$p.value)
}
sims <- replicate(10000,my_sim_function())      # many iterations
dimnames(sims)[[1]] <- c('diff','t.stat','p.val')

# proportion of rejections ~ power (no seed is set, values vary slightly per run)
mean(sims['p.val',] < .05)  # p-values  0.8029
mean(sims['t.stat',] < cutoff)  # t-statistics 0.8029
mean(sims['diff',] > sd(sims['diff',])*cutoff*(-1)) # differences 0.8024

Focus / simplify

 

  • Complex statistical models
    • simulate BUT it requires programming and a thorough understanding of the model
    • alternative: focus on essential elements → simplify the aim
  • Sample size calculations (design) for simpler research aim
    • not necessarily equivalent to final statistical testing / estimation
    • requires justification to convince yourself and/or reviewers
      • successful already if simple aim is satisfied
      • ignored part is not too costly

 

  • Example:
    • statistics:
      group difference evolution 4 repeated measurements → mixed model
    • focus:
      difference treatment and control last time point is essential → t-test
    • argument: first 3 measurements low cost, interesting to see change

Conclusion

 

  • Sample size calculation is a design issue, not a statistical one

  • It typically focuses on ensuring sufficient data to result in sufficiently strong statistical inference

  • Sample size depends on effect size, type I & II errors, and the statistical test of interest

  • Effect sizes express the amount of signal compared to the background noise

  • GPower deals with not too complex models

    • more complex models imply more complex specification
    • simplify using a focus, if justifiable → then GPower can get you a long way

Questions ?

 

Thank you for your attention.