Sample size calculation

with exercises in GPower

Wilfried Cools & Tim Pauwels

April 03, 2024

Workshop

  • to introduce the key ideas
  • to help you see the bigger picture
  • to offer first practical experience: GPower

 

  • Target audience
    • primarily the research community at VUB / UZ Brussel
  • History
    • started in 2019, with Sven Van Laere
    • almost yearly, gradually refined

Feedback

  • Help us improve this document

    wilfried.cools@vub.be

at SQUARE

  • Ask us for help

    we offer consultancy
    on methodology, statistics, and their communication

    square.research.vub.be

Program

 

  • Part I:
    understand the reasoning
    • introduce building blocks
    • highlight how they relate
    • focus on t-test only
    • a few exercises in GPower

 

  • Part II:
    explore more complex situations
    • go beyond the t-test
    • simple but common
    • many exercises in GPower
    • not one formula for all

Sample size calculation: demarcation

 

  • How many observations will be sufficient ?
    • avoid too many, because observations typically imply a cost
      • money / time → limited resources
      • risk / harm / patient burden → ethical constraints
    • have enough
  • To offer strong enough statistical inference !
    • linked to standard error
      • testing → power [probability to detect an effect]
      • estimation → accuracy [width of the confidence interval]

Sample size calculation: a difficult design issue

 

  • Part of the design of a study
    • before data collection
    • requires understanding:
      • parameters: effect size of interest
      • data: future data properties
      • model: relation between the outcome and the conditions under which it is observed
    • decision based on (highly) incomplete information, thus based on (strong) assumptions

 

  • Not always possible nor meaningful !
    • confirmatory studies easier than exploratory
    • experiments (control) easier than observational
    • not obvious for complex models
      → simulation
    • not obvious for predictive models
      → no standard error

Sample size calculation: if not possible

 

  • If not possible in a meaningful way,
    use an alternative justification

    • common practice
    • feasibility

  • Avoid retrospective power analyses
    → power calculations are meaningful for a future study only

    Hoenig, J., & Heisey, D. (2001). The Abuse of Power:
    The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.

 

  • Because the justification is less strong,
    put more weight on non-statistical justifications
    • low cost
    • importance

Simple example confirmatory experiment

 

  • Does my radiotherapy work ?
    • aim: show my radiotherapy reduces tumor size
    • method: compare treatment and control group
      • tumor induced in N mice
      • random assignment of mice to groups
    • data: tumor sizes after 20 days
    • analysis: unpaired t-test

 

  • Sample size question
    • how many mice are required
    • to show the treatment reduces tumor size more than the control
    • assuming an effect size:
      • my radiotherapy works if the treatment group shows at least 25% more reduction
      • i.e., 4mm (control) versus 5mm (treatment); the control shows 20% less reduction
    • with 80% probability (+ type I error probability .05)

Reference example

 

  • A priori specifications
    • intend to perform a statistical test
    • comparing 2 equally sized groups
    • to detect a difference of at least 2
    • assuming an uncertainty (SD) of 4 around each average
    • which results in an effect size of .5
    • evaluated on a Student t-distribution
    • allowing for a type I error prob. of .05 \((\alpha)\)
    • allowing for a type II error prob. of .2 \((\beta)\)
  • Sample size
    conditional on specifications being true

Note

  • This reference example is used throughout the workshop !!

Formula you could use

 

  • Specifications for this particular case:
    • sample size (n → ?)
    • difference ( \(\Delta\) =signal → 2)
    • uncertainty ( \(\sigma\) =noise → 4)
    • type I error ( \(\alpha\) = .05, so \(Z_{\alpha/2}\) → -1.96)
    • type II error ( \(\beta\) = .2, so \(Z_\beta\) → -0.84)
  • Sample size = 2 groups x 63 observations = 126

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{ \Delta^2} = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79\)

  • Formulas are test- and statistic-specific, but the logic remains the same
  • This and other formulas are implemented in various tools; our focus: GPower
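
  • As a cross-check, a minimal sketch in base R (the formula above is the normal approximation; power.t.test uses the exact noncentral t and gives the slightly larger n that GPower reports):

    alpha <- 0.05; beta <- 0.20
    Delta <- 2                                # difference (signal)
    sigma <- 4                                # standard deviation (noise)
    z     <- qnorm(alpha / 2) + qnorm(beta)   # -1.96 + -0.84 = -2.80
    z^2 * 2 * sigma^2 / Delta^2               # 62.79 per group → 2 x 63 = 126
    power.t.test(delta = Delta, sd = sigma, sig.level = alpha, power = 1 - beta)
    # n ≈ 63.8 per group → 2 x 64 = 128 (cf. GPower)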

GPower: the building blocks in action

 

  • 4 components and 2 distributions
    • distributions: Ho & Ha ~ test-dependent shape
    • SIZES: effect size & sample size ~ shift Ha
    • ERRORS:
      • Type I error ( \(\alpha\) ) defined on distribution Ho
      • Type II error ( \(\beta\) ) evaluated on distribution Ha
  • Calculate sample size based on effect size, type I / II error



GPower: a useful tool

 

  • Use it
    • implements wide variety of tests
    • free @ http://www.gpower.hhu.de/
    • popular and well established
    • implements various visualizations
    • documented fairly well
  • Maybe not use it
    • not all tests are included, only the simpler ones
    • not without flaws
    • other tools exist (some paid)

 

GPower statistical tests

 

  • Test family - statistical tests [in window]
    • Exact Tests (8)
    • \(t\)-tests (11) → reference
    • \(z\)-tests (2)
    • \(\chi^2\)-tests (7)
    • \(F\)-tests (16)
  • Focus on the density functions

 

  • Tests [in menu]
    • correlation & regression (15)
    • means (19) → reference
    • proportions (8)
    • variances (2)
  • Focus on the type of parameters

GPower input

 

  • ~ reference example input
    • t-test: difference between two independent means
    • a priori: calculate sample size
    • effect size = standardized difference (Cohen’s \(d\))
      • Determine =>
        • \(d\) = |difference| / SD_pooled
        • \(d\) = |0-2| / 4 = .5
    • \(\alpha\) = .05; two-tailed ( \(\alpha\) /2 → .025 & .975 )
    • \(power = 1-\beta\) = .8
    • allocation ratio N2/N1 = 1 (equally sized groups)

GPower output

 

  • ~ reference example output
    • sample size \((n)\) = 2 x 64 = 128
    • degrees of freedom \((df)\) = 126 (128-2)
    • power (1- \(\beta\)) = 0.8015 (requested ≥ .80)
    • distributions: central + non-central
    • critical t = 1.979
      • decision boundary given \(\alpha\) and \(df\)
        qt(.975,126)
    • non centrality parameter ( \(\delta\) ) = 2.8284
      • shift Ha (true) away from Ho (null)
        2/(4*sqrt(2))*sqrt(64)
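
  • The full output can be reproduced with a few lines of base R (a minimal sketch for the reference example):

    n  <- 64                                 # sample size per group
    df <- 2 * n - 2                          # 126 degrees of freedom
    qt(.975, df)                             # critical t = 1.979
    ncp <- 2 / (4 * sqrt(2)) * sqrt(n)       # non-centrality parameter = 2.8284
    1 - pt(qt(.975, df), df, ncp) +
      pt(qt(.025, df), df, ncp)              # achieved power = 0.8015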

GPower protocol

 

  • Summary for future reference or communication
    • central and non-central distributions (figure)
    • protocol of power analysis (text)

 

  • File / Edit: save or print the file (or copy-paste)

 

Non-centrality parameter ( \(\delta\) ), shift Ha from Ho

 

  • non-centrality parameter \(\delta\) combines
    • assumed effect size ((standardized) signal)
    • conditional on sample size (information)
  • \(\delta\) determines the overlap of Ho and Ha: bigger ncp, less overlap
    • \(\delta\) as violation of Ho → shift (location/shape)
    • power = probability beyond \(\color{green}{cut off}\) at Ho evaluated on Ha
    • push with sample size
  • Ha acts as \(\color{blue}{truth}\): assumed difference of e.g. .5 SD
    • Ha ~ t(ncp=2.828,df)
  • Ho acts as \(\color{red}{benchmark}\): typically no difference, no relation
    • set \(\color{green}{cutoff}\) on Ho ~ t(ncp=0,df) using \(\alpha\)
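
  • A base-R sketch of this logic: the cut-off is set on Ho (central t), power is the area of Ha (non-central t) beyond it, and a larger sample size pushes Ha further away:

    pow <- function(n, d = .5, alpha = .05) {    # n per group, standardized effect d
      df  <- 2 * n - 2
      ncp <- d * sqrt(n / 2)                     # shift of Ha away from Ho
      cut <- qt(1 - alpha / 2, df)               # cut-off set on Ho ~ t(ncp = 0, df)
      1 - pt(cut, df, ncp) + pt(-cut, df, ncp)   # power, evaluated on Ha ~ t(ncp, df)
    }
    sapply(c(16, 32, 64, 128), pow)              # power increases with n (≈ .80 at n = 64)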

 

Alternative: divide by N

 

  • Sample sizes determine shape, not location
    • divide by n: sample size ~ standard error
      • peakedness of both distributions
      • often preferred didactically
    • non-centrality parameter: sample size ~ location
      • standardized distributions
      • often preferred in software / algorithms
  • Formulas are the same (DIY: two equations for the critical value)

 

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)
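
  • The DIY hint spelled out (a sketch of the derivation, normal approximation, standard error \(SE = \sigma\sqrt{2/n}\)): write the critical value \(c\) once on Ho and once on Ha, equate, and solve for \(n\)

\(c = Z_{1-\alpha/2} * SE \hspace{10 mm} \text{(cut-off set on Ho)}\)

\(c = \Delta + Z_{\beta} * SE \hspace{10 mm} \text{(power } 1-\beta \text{ reached on Ha)}\)

\((Z_{1-\alpha/2} - Z_{\beta}) * \sigma\sqrt{2/n} = \Delta \Rightarrow n = \frac{(Z_{1-\alpha/2} - Z_{\beta})^2 * 2 * \sigma^2}{\Delta^2}\)

  • identical to the formula above, since \((Z_{1-\alpha/2} - Z_{\beta})^2 = (Z_{\alpha/2} + Z_{\beta})^2\) (both \(Z\)’s there are negative)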

Type I/II error probability

 

  • Inference (test) based on cut-offs (density → AUC=1)

  • Type I error: incorrectly reject Ho (false positive):

    • cut-off at Ho, error prob. \(\alpha\) controlled
    • one/two-tailed → one/both sides informative ?
  • Type II error: incorrectly fail to reject Ho (false negative):

    • cut-off at Ho, error prob. \(\beta\) obtained from Ha
    • Ha assumed known in a power analysis
  • power = 1 - \(\beta\) = probability correct rejection (true positive)

 

  • Inference versus truth
    • infer: effect exists vs. unsure
    • truth: effect exists vs. does not
               infer=Ha      infer=Ho       sum
  truth=Ho     \(\alpha\)    \(1-\alpha\)   1
  truth=Ha     \(1-\beta\)   \(\beta\)      1

Create plot

 

  • Create a plot
    • X-Y plot for a range of values
    • assumes a calculated analysis (in the main window)
      • ~ reference example
    • specify Y-axis / X-axis / curves and constant
      • beware of order !
  • Plot sample size (y-axis)
    • by type I error \(\alpha\) (x-axis) → from .01 to .2 in steps of .01
    • for 4 values of power (curves) → starting at .8 in steps of .05
    • assume effect size (constant) → .5 from reference example
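
  • The same curves can be reproduced outside GPower; a base-R sketch (power.t.test returns n per group, reference example values):

    alphas <- seq(.01, .20, by = .01)            # x-axis: type I error
    powers <- seq(.80, .95, by = .05)            # one curve per power level
    n <- sapply(powers, function(p)              # y-axis: sample size per group
      sapply(alphas, function(a)
        power.t.test(delta = 2, sd = 4, sig.level = a, power = p)$n))
    matplot(alphas, n, type = "l", lty = 1,
            xlab = expression(alpha), ylab = "sample size per group")
    legend("topright", legend = paste("power =", powers), lty = 1, col = 1:4)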

 

  • Notice Table option

Errors: exercises

 

  • Where on the red curve (right) is
    the type II error equal to 4 * type I error ?
  • With a smaller effect size (e.g., .25), what changes ?

 

Errors: exercises continued

 

  • Plot power instead of sample size

    • with 4 power curves
      for sample sizes starting at 32 in steps of 32
  • What is the relation between type I and type II error ?

  • What would be the difference between the curves for \(\alpha\) = 0 ?

 

Decide Type I/II error probability

 

  • Reasoning on error probabilities
    • \(\alpha\) & \(\beta\) inversely related
    • which error do you want to avoid most ?
      • cheap AIDS test ? → avoid type II
      • heavy cancer treatment ? → avoid type I

 

  • Popular choices for error probabilities
    • \(\alpha\) often in range .01 - .05 → 1/100 - 1/20
    • \(\beta\) often in range .2 to .1 → power = 80% to 90%

 

 

  • Popular rules of thumb
    • 4 * \(\alpha\) ~ \(\beta\)
      • a type I error is treated as ~4 times worse !!
  • A non-zero probability of both errors always exists

Control Type I error

 

  • Defined on the Ho, known
    • assumes only sampling variability

 

  • Multiple testing
    • control \(\alpha\) over the set of tests
      • each test adds an \(\alpha\) error risk
    • performing several tests inflates the type I error \(\alpha\)
      • k tests give a probability of at least one error of \(1-(1-\alpha)^k\) (see sketch below)
    • correct → e.g., Bonferroni ( \(\alpha/k\))
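
  • A quick base-R sketch of the inflation and the Bonferroni correction:

    alpha <- .05
    k <- 1:10                                # number of tests
    1 - (1 - alpha)^k                        # P(at least one type I error): .05 … .40
    alpha / k                                # Bonferroni-corrected per-test alpha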

 

  • Interim analysis
    • analyze and ‘conditionally’ proceed
    • type of multiple testing
    • plan in advance
    • adjustments of either \(p\) or \(\alpha\)
    • alpha spending, e.g., O’Brien-Fleming bounds
    • not in GPower
      → our own simulation tool (Susanne Blotwijk)
    • determine boundaries with PASS, R (ldbounds), …

For fun: P(effect exists | test says so)

 

  • Using \(\alpha\), \(\beta\) and power or \(1-\beta\)
    • \(P(infer=Ha|truth=Ha) = power\) → \(P\)(test says there is an effect | effect exists)
    • \(P(infer=Ha|truth=Ho) = \alpha\)
    • \(P(infer=Ho|truth=Ha) = \beta\)
    • \(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
    • \(\hspace{10 mm} = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
    • \(\hspace{10 mm} = \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
  • IF the prior probability that Ha is true is very low (e.g., \(P(truth=Ha) = .01\)) THEN the probability the effect exists when the test says so is also low (e.g., \(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\))
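
  • The same arithmetic as a base-R sketch:

    power <- .80; alpha <- .05
    p_Ha  <- .01                                  # prior P(truth = Ha)
    p_pos <- power * p_Ha + alpha * (1 - p_Ha)    # P(infer = Ha)
    power * p_Ha / p_pos                          # P(truth = Ha | infer = Ha) ≈ .14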

Effect sizes, in principle

 

  • Estimate / guesstimate of the minimal magnitude of interest
     

  • Typically standardized: signal to noise ratio

    • eg., effect size \(d\) = .5 means .5 standard deviations (pooled)
  • Part of the non-centrality (as is sample size) → pushing Ha away from Ho

  • ~ Practical relevance

    • NOT statistical significance
      • p-value ~ effect size AND sample size

 

  • 2 main families of effect sizes
    • d-family (differences) and r-family (associations)
    • transform one into the other,
      e.g., d = .5 → r = .243
      \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\)
      \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\)
      \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
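
  • The conversions from the slide as a base-R sketch:

    d_to_r  <- function(d)  d / sqrt(d^2 + 4)        # d-family → r-family
    r_to_d  <- function(r)  2 * r / sqrt(1 - r^2)    # r-family → d-family
    or_to_d <- function(or) log(or) * sqrt(3) / pi   # odds ratio → d
    d_to_r(.5)                                       # 0.243
    r_to_d(.243)                                     # ≈ .5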

Effect sizes, in literature

 

  • Cohen, J. (1992).
    A power primer. Psychological Bulletin, 112, 155–159.

  • Cohen, J. (1988).
    Statistical power analysis for the behavioral sciences (2nd ed.).

  • Famous Cohen conventions

    • but beware, just rules of thumb