Sample size calculation

with exercises in GPower

Wilfried Cools & Tim Pauwels

April 03, 2024

Workshop

  • to introduce the key ideas
  • to help you see the bigger picture
  • to offer first practical experience: GPower

 

  • Target audience
    • primarily the research community at VUB / UZ Brussel
  • History
    • started in 2019, with Sven Van Laere
    • almost yearly, gradually refined

Feedback

  • Help us improve this document

    wilfried.cools@vub.be

at SQUARE

  • Ask us for help

    we offer consultancy
    on methodology, statistics, and their communication

    square.research.vub.be

Program

 

  • Part I:
    understand the reasoning
    • introduce building blocks
    • highlight how they relate
    • focus on t-test only
    • a few exercises in GPower

 

  • Part II:
    explore more complex situations
    • go beyond the t-test
    • simple but common
    • many exercises in GPower
    • not one formula for all

Sample size calculation: demarcation

 

  • How many observations will be sufficient ?
    • avoid too many, because typically observations imply a cost
      • money / time → limited resources
      • risk / harm / patient burden → ethical constraints
    • have enough
  • To offer strong enough statistical inference !
    • linked to standard error
      • testing → power [probability to detect effect]
      • estimation → accuracy [size of confidence interval]

Sample size calculation: a difficult design issue

 

  • Part of the design of a study
    • before data collection
    • requires understanding:
      • parameters: effect size of interest
      • data: future data properties
      • model: relation between the outcome and the conditions under which it is observed
    • decision based on (highly) incomplete information, thus based on (strong) assumptions

 

  • Not always possible nor meaningful !
    • confirmatory studies easier than exploratory
    • experiments (control) easier than observational
    • not obvious for complex models
      → simulation
    • not obvious for predictive models
      → no standard error

Sample size calculation: if not possible

 

  • If not possible in a meaningful way
    use alternative justification

    • common practice
    • feasibility

  • Avoid retrospective power analyses
    → OK for future study only

    Hoenig, J., & Heisey, D. (2001). The Abuse of Power:
    The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.

 

  • Because these justifications are less strong,
    put more weight on non-statistical arguments
    • low cost
    • importance

Simple example confirmatory experiment

 

  • Does my radiotherapy work ?
    • aim: show my radiotherapy reduces tumor size
    • method: compare treatment and control group
      • tumor induced in N mice
      • random assignment of mice
    • data: tumor sizes after 20 days
    • analysis: unpaired t-test

 

  • Sample size question
    • how many mice are required
    • to show treatment reduces tumor size more
    • assuming effect size:
      • my radiotherapy works if 25% more reduction in the treatment group
      • 4mm (control) versus 5mm (treatment): 25% more, or equivalently 20% less
    • with 80% probability (+ type I error probability .05)

Reference example

 

  • A priori specifications
    • intend to perform a statistical test
    • comparing 2 equally sized groups
    • to detect difference of at least 2
    • assuming an uncertainty (SD) of 4 around each average
    • which results in an effect size of .5
    • evaluated on a Student t-distribution
    • allowing for a type I error prob. of .05 \((\alpha)\)
    • allowing for a type II error prob. of .2 \((\beta)\)
  • Sample size
    conditional on specifications being true

Note

  • This reference example is used throughout the workshop !!

Formula you could use

 

  • Specifications for this particular case:
    • sample size (n → ?)
    • difference ( \(\Delta\) =signal → 2)
    • uncertainty ( \(\sigma\) =noise → 4)
    • type I error ( \(\alpha\) = .05, so \(Z_{ \alpha /2}\) → -1.96)
    • type II error ( \(\beta\) = .2, so \(Z_ \beta\) → -0.84)
  • Sample size = 2 groups x 63 observations = 126

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{ \Delta^2} = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79\)

  • Formulas are test and statistic specific, but the logic remains the same
  • This and other formulas are implemented in various tools, our focus: GPower (R sketch below)
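
  • As a quick check, the same computation as a minimal R sketch, with all values taken from the reference example:

alpha <- .05; beta <- .2
Delta <- 2; sigma <- 4       # signal (raw difference) and noise (sd)
(qnorm(alpha/2) + qnorm(beta))^2 * 2 * sigma^2 / Delta^2
# 62.79 per group, round up to 63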

GPower: the building blocks in action

 

  • 4 components and 2 distributions
    • distributions: Ho & Ha ~ test dependent shape
    • SIZES: effect size & sample size ~ shift Ha
    • ERRORS :
      • Type I error ( \(\alpha\) ) defined on distribution Ho
      • Type II error ( \(\beta\) ) evaluated on distribution Ha
  • Calculate sample size based on effect size, type I / II error



GPower: a useful tool

 

  • Use it
    • implements wide variety of tests
    • free @ http://www.gpower.hhu.de/
    • popular and well established
    • implements various visualizations
    • documented fairly well
  • Maybe not use it
    • not all tests are included, only the simpler ones
    • not without flaws
    • other tools exist (some paid)

 

GPower statistical tests

 

  • Test family - statistical tests [in window]
    • Exact Tests (8)
    • \(t\)-tests (11) → reference
    • \(z\)-tests (2)
    • \(\chi^2\)-tests (7)
    • \(F\)-tests (16)
  • Focus on the density functions

 

  • Tests [in menu]
    • correlation & regression (15)
    • means (19) → reference
    • proportions (8)
    • variances (2)
  • Focus on the type of parameters

GPower input

 

  • ~ reference example input
    • t-test : difference two independent averages
    • a priori: calculate sample size
    • effect size = standardized difference (Cohen’s \(d\))
      • Determine =>
        • \(d\) = |difference| / SD_pooled
        • \(d\) = |0-2| / 4 = .5
    • \(\alpha\) = .05; two-tailed ( \(\alpha\) /2 → .025 & .975 )
    • \(power = 1-\beta\) = .8
    • allocation ratio N2/N1 = 1 (equally sized groups)

GPower output

 

  • ~ reference example output
    • sample size \((n)\) = 2 x 64 = 128
    • degrees of freedom \((df)\) = 126 (128-2)
    • power \((1-\beta)\) = 0.8015 (≥ .80 as requested)
    • distributions: central + non-central
    • critical t = 1.979
      • decision boundary given \(\alpha\) and \(df\)
        qt(.975,126)
    • non centrality parameter ( \(\delta\) ) = 2.8284
      • shift Ha (true) away from Ho (null)
        2/(4*sqrt(2))*sqrt(64)

GPower protocol

 

  • Summary for future reference or communication
    • central and non-central distributions (figure)
    • protocol of power analysis (text)

 

  • File/Edit save or print file (copy-paste)

 

Non-centrality parameter ( \(\delta\) ), shift Ha from Ho

 

  • non-centrality parameter \(\delta\) combines
    • assumed effect size ((standardized) signal)
    • conditional on sample size (information)
  • \(\delta\) determines the overlap of Ho and Ha: bigger ncp, less overlap
    • \(\delta\) as violation of Ho → shift (location/shape)
    • power = probability beyond \(\color{green}{cut off}\) at Ho evaluated on Ha
    • push with sample size
  • Ha acts as \(\color{blue}{truth}\): assumed difference of e.g. .5 SD
    • Ha ~ t(ncp=2.828,df)
  • Ho acts as \(\color{red}{benchmark}\): typically no difference, no relation
    • set \(\color{green}{cutoff}\) on Ho ~ t(ncp=0,df) using \(\alpha\)

 

Alternative: divide by N

 

  • Sample sizes determine shape, not location
    • divide by n: sample size ~ standard error
      • peakedness of both distributions
      • often preferred didactically
    • non-centrality parameter: sample size ~ location
      • standardized distributions
      • often preferred in software / algorithms
  • Formulas are the same (DIY: two equations for the critical value)

 

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)

Type I/II error probability

 

  • Inference (test) based on cut-offs (density → AUC=1)

  • Type I error: incorrectly reject Ho (false positive):

    • cut-off at Ho, error prob. \(\alpha\) controlled
    • one/two tailed → one/both sides informative ?
  • Type II error: incorrectly fail to reject Ho (false negative):

    • cut-off at Ho, error prob. \(\beta\) obtained from Ha
    • Ha assumed known in a power analysis
  • power = 1 - \(\beta\) = probability correct rejection (true positive)

 

  • Inference versus truth
    • infer: effect exists vs. unsure
    • truth: effect exists vs. does not
                  infer=Ha        infer=Ho        sum
    truth=Ho      \(\alpha\)      1-\(\alpha\)    1
    truth=Ha      1-\(\beta\)     \(\beta\)       1

Create plot

 

  • Create a plot
    • X-Y plot for range of values
    • assumes calculated analysis
      • ~ reference example
    • specify Y-axis / X-axis / curves and constant
      • beware of order !
  • Plot sample size (y-axis)
    • by type I error \(\alpha\) (x-axis) → from .01 to .2 in steps of .01
    • for 4 values of power (curves) → from .8 upward in steps of .05
    • assume effect size (constant) → .5 from reference example

 

  • Notice Table option

Errors: exercises

 

  • Where on the red curve (right) is
    the type II error equal to 4 * type I error ?
  • With a smaller effect size (e.g., .25), what changes ?

 

Errors: exercises continued

 

  • Plot power instead of sample size

    • with 4 power curves
      with sample sizes 32 in steps of 32
  • What is the relation between type I and type II error ?

  • What would be the difference between curves for \(\alpha\) = 0 ?

 

Decide Type I/II error probability

 

  • Reasoning on error probabilities
    • \(\alpha\) & \(\beta\) inversely related
    • which error do you want to avoid most ?
      • cheap AIDS test ? → avoid type II
      • heavy cancer treatment ? → avoid type I

 

  • Popular choices for error probabilities
    • \(\alpha\) often in range .01 - .05 → 1/100 - 1/20
    • \(\beta\) often in range .2 to .1 → power = 80% to 90%

 

 

  • Popular rules of thumb
    • 4 * \(\alpha\) ~ \(\beta\)
      • type I error is ~4 times worse !!
  • Probability for both errors always exists

Control Type I error

 

  • Defined on the Ho, known
    • assumes only sampling variability

 

  • Multiple testing
    • control \(\alpha\) over a set of tests
      • each test carries its own \(\alpha\) error
    • multiple testing inflates the type I error \(\alpha\)
      • k tests give an error probability of \(1-(1- \alpha)^k\)
    • correct → e.g., Bonferroni ( \(\alpha/k\)), see the sketch below
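
  • A minimal R sketch of the inflation and the correction, assuming k = 5 tests for illustration:

alpha <- .05; k <- 5
1 - (1 - alpha)^k       # family-wise type I error inflates to 0.226
alpha / k               # Bonferroni-corrected per-test alpha: 0.01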

 

  • Interim analysis
    • analyze and ‘conditionally’ proceed
    • type of multiple testing
    • plan in advance
    • adjustments of either \(p\) or \(\alpha\)
    • alpha spending, e.g., O’Brien-Fleming bounds
    • NOT GPower
      our own simulation tool (Susanne Blotwijk)
    • determine boundaries with PASS, R (ldbounds), …

For fun: P(effect exists | test says so)

 

  • Using \(\alpha\), \(\beta\) and power or \(1-\beta\)
    • \(P(infer=Ha|truth=Ha) = power\) = \(P\)(test says there is effect | effect exists)
    • \(P(infer=Ha|truth=Ho) = \alpha\)
    • \(P(infer=Ho|truth=Ha) = \beta\)
    • \(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
    • \(\hspace{10 mm}= \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
    • \(\hspace{10 mm}= \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
  • IF very low probability model is true (eg., \(P(truth=Ha) = .01\)) THEN probability effect exists if test says so is low (e.g., \(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\))
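
  • The same computation as a minimal R sketch, with the prior \(P(truth=Ha)\) = .01 from the example:

power <- .8; alpha <- .05
prior <- .01                                    # P(truth=Ha)
power*prior / (power*prior + alpha*(1-prior))   # 0.139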

Effect sizes, in principle

 

  • Estimate / guestimate of minimal magnitude of interest
     

  • Typically standardized: signal to noise ratio

    • e.g., effect size \(d\) = .5 means .5 standard deviations (pooled)
  • Part of non-centrality (as is sample size) → pushing away Ha

  • ~ Practical relevance

    • NOT statistical significance
      • p-value ~ effect size AND sample size

 

  • 2 main families of effect sizes
    • d-family (differences) and r-family (associations)
    • transform one into the other,
      e.g., d = .5 → r = .243
      \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\)
      \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\)
      \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
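
  • A minimal R sketch of these conversions (the odds ratio of 3 is only an assumed illustration value):

d <- .5
r <- d / sqrt(d^2 + 4)      # 0.243
2 * r / sqrt(1 - r^2)       # back to d = 0.5
log(3) * sqrt(3) / pi       # OR = 3 corresponds to d of about 0.61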

Effect sizes, in literature

 

  • Cohen, J. (1992).
    A power primer. Psychological Bulletin, 112, 155–159.

  • Cohen, J. (1988).
    Statistical power analysis for the behavioral sciences (2nd ed).

  • Famous Cohen conventions

    • but beware, just rules of thumb

 

Effect sizes, in literature continued

 

  • Ellis, P. D. (2010).
    The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.

  • more than 70 different effect sizes… most of them related

 

Effect sizes, in GPower (Determine)

 

  • Effect sizes are test specific
    • t-test → group means and sd’s
    • one-way anova → variance explained & error
    • regression → sd’s and correlations
    • …

 

  • GPower helps with Determine
    • sliding window
    • one or more effect size specifications

 

Effect sizes: exercises

 

  • For the reference example:
    • change mean values from 0 and 2 to 4 and 6, what changes ?
    • change sd values to 2 for each, what changes ?
      • effect size ?
      • total sample size ?
      • critical t ?
      • non-centrality ?
    • change sd values to 8 for each, what changes ?
    • change sd to 2 and 5.3, or 1 and 5.5,
      how does it compare to 4 and 4 ?

 

Effect sizes: exercises continued

 

  • For the reference example:
    • plot powercurve: power by effect size
    • compare 6 sample sizes: 34 in steps of 34
    • for a range of effect sizes in between .2 and 1.2
    • use \(\alpha\) equal to .05
    • how does power change when doubling the effect size ?

 

  • powercurve → X-Y plot for range of values

Effect sizes, how to determine them in theory

 

  • Choice of effect size matters → justify choice !!

  • Choice of effect size depends on aim of the study

    • realistic (e.g., previously observed effect) → replicate
    • important (e.g., minimally relevant effect)
    • NOT significant → meaningless, dependent on sample size
  • Choice of effect size dependent on statistical test of interest

    • for independent t-test → means and standard deviations
    • possible alternative: variance explained, e.g., 1 versus 16+1

 

  • Examples
    • with one-way ANOVA
      \(f\) = .25 instead of d = .5
    • with linear regression
      \(f^2\) = .0625 instead of d = .5
    • psychometric freeware

Effect sizes, how to determine them in practice

 

  • Experts / patients
    • minimally clinically relevant effect
    • importance
    • use if possible
  • Literature
    • earlier study / systematic review
    • realistic
    • beware of publication bias
  • (Internal) Pilot
    • guestimate the dispersion
    • not to obtain the effect size → sample too small

 

  • Guestimate uncertainty…
    • sd from assumed range
      • assume normal and divide by 6
    • sd for proportions at conservative .5
    • sd from control, assume treatment the same
    • ...
  • Turn to Cohen
    • use if everything else fails
    • rules of thumb
      • eg., .2 - .5 - .8 for Cohen’s d

Effect sizes, a note about the SD

 

  • For independent t-test → means and standard deviations (sd)
    • sd ~ ‘unexplained’ variance
    • account for important predictors
  • Example: 50% of the variance unexplained by treatment, but explained by a predictor (R sketch below)
    • a standard deviation of 4 (variance of 16)
    • split into
      • standard deviation of 2.8284 (variance of 8)
      • standard deviation of 2.8284 explained by important predictor
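
  • A minimal numeric sketch in R, assuming the raw difference of 2 from the reference example:

sd_total <- 4              # variance 16, predictor ignored
sd_resid <- sqrt(16 / 2)   # 2.8284, half the variance explained
2 / sd_total               # d = 0.5
2 / sd_resid               # d = 0.707, so a smaller sample suffices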

Effect sizes, a note about the SD continued

 

  • not accounting for the important predictor

 

  • accounting for it, sd (around average) reduced

Effect sizes, a note about non-inferiority

 

  • Most often aim to show effect is likely to exist
    • show difference (or relation) different from (typically) zero
    • assuming a particular difference (or relation)
  • Non-inferiority to show effect is not too much worse
    • show difference (or relation) not beyond margin of tolerance
    • assuming a particular difference (or relation),
    • most often but not necessarily 0
    • ! this is inherently one-sided

Effect sizes, notes: exercises

 

  • Use reference example
    • assume half the unexplained variance is accounted for by the predictor, what are the sample sizes ?
    • assume a non-inferiority margin of -2, and no difference, how big is the sample size ?
    • assume treatment to be 2 higher, compare the sample size for superiority (bigger than 0) and non-inferiority with margin of -2

 

Sample size

 

  • Sample size
    • allocated over groups
    • not necessarily equally
    • specify ratio
  • Most important: all groups sufficiently large
    • unequal group sizes are not a problem in themselves
    • specify a ratio only when group sizes are expected to differ

Sample size: exercises

 

  • For the reference example:
    • compare for allocation ratios 1, .5, 2, 10, 50
    • repeat for effect size 1, and compare
  • Think about why n1 \(\neq\) n2 in the output
  • Plot with changed allocation ratio

 

Relation sample & effect size, type I & II errors

 

  • Building blocks:
    • sizes: sample ( \(n\) ) and effect ( \(\Delta\) )
    • errors: type I ( \(\alpha\) ), type II ~ power ( \(1-\beta\) )

 

  • GPower → type of power analysis
    • A priori: \(n\) ~ \(\alpha\), power, \(\Delta\)
    • Post Hoc: power ~ \(\alpha\), \(n\), \(\Delta\)
    • Compromise: power, \(\alpha\) ~ \(\beta\:/\:\alpha\), \(\Delta\), \(n\)
    • Criterion: \(\alpha\) ~ power, \(\Delta\), \(n\)
    • Sensitivity: \(\Delta\) ~ \(\alpha\), power, \(n\)

 

  • each parameter conditional on others
  • one outgoing arrow, three going in  

Type of power analysis: exercises

 

  • For the reference example:
    • how big is the power for 128 observations (n=2x64, \(\alpha\)=.05 and \(\Delta\)=.5)
    • then, assume a power of .8 but with only half the sample size, how does the effect size \(\Delta\) change ?
    • then, set the ratio \(\beta\)/ \(\alpha\) to 4, what are the resulting \(\alpha\) and \(\beta\) ? and what is the critical value ?
    • then, set the effect size to .7 (same ratio), what are the resulting \(\alpha\) and \(\beta\) ? and what is the critical value ?

Getting your hands dirty

 

 

  • in G*Power (calculator)
m1=0
m2=2
s1=4
s2=4
alpha=.025
N=128
var=.5*s1^2+.5*s2^2
d=abs(m1-m2)/sqrt(2*var)*sqrt(N/2)
tc=tinv(1-alpha,N-2)
power=1-nctcdf(tc,N-2,d)
  • note: d here is the non-centrality parameter; df = N-2 for two groups

 

  • in R
.n <- 64
.df <- 2*.n-2
.ncp <- 2 / (4 * sqrt(2)) * sqrt(.n)
.alpha <- .05
.power <- 1 -
    pt( 
        qt(1-.alpha/2,df=.df,ncp=0), 
        df=.df, ncp=.ncp
    )
round(.power,4)
[1] 0.8015
  • 2 steps
    • qt → critical value on Ho (quantile at \(1-\alpha/2\))
    • pt → probability on Ha below that critical value

GPower, beyond the independent t-test

 

  • So far, comparing two independent means  

  • From now on, selected topics beyond independent t-test
    with small exercises

    • dependent instead of independent
    • non-parametric instead of assuming normality
    • relations instead of groups (regression)
    • correlations
    • proportions, dependent and independent
    • more than 2 groups (compare jointly, pairwise, focused)
    • more than 1 predictor
    • repeated measures

 

  • Look into the GPower manual
    27 tests → effect size, non-centrality parameter and example !!

Dependence between groups: exercises

 

  • When comparing 2 dependent groups (e.g., before/after treatment) → account for the correlation

  • Correlations are typically obtained from pilot data or earlier research (conversion sketch after the exercises)

  • GPower: matched pairs (t-test / means, difference 2 dependent means)

  • Use reference example,

    • assume a correlation of .5 and compare with reference example for effect size, ncp, n
    • how many observations are required if no correlation exists (think then try) ? effect size ?
    • what changes with correlation .875 (think: more or less n, higher or lower effect size) ?
    • what power would be obtained for the reference with sample size 2x64, but correlation .5 ?
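
  • A minimal R sketch of the conversion, assuming the usual relation \(d_z = d / \sqrt{2(1-\rho)}\) behind the matched-pairs dz:

d <- .5                        # independent-groups effect size
dz <- function(r) d / sqrt(2 * (1 - r))   # assumed conversion to dz
dz(.5)                         # 0.5
dz(.875)                       # 1.0, higher correlation helps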

Dependence: a note about the correlations

 

  • a noticeable correlation

 

  • clear correlation

Non-parametric distribution: exercises

 

  • When non-normally distributed residuals are expected and cannot be circumvented (e.g., by transformation)

  • Only considers ranks or uses permutations → price is efficiency and flexibility

  • Requires a parent distribution (the alternative hypothesis); ‘min ARE’ is a conservative default

  • GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)

  • Use reference example

    • use a normal parent distribution, how much efficiency is lost ?
    • use ‘min ARE’ as parent distribution, how much efficiency is lost ?

A relations perspective, regression analysis: exercises

 

  • Differences between groups ~ relation with grouping (categorization)

  • Example: d = .5 ~ r = .243 (note: slope \(\beta = {r*\sigma_y} / {\sigma_x}\)); numeric sketch after the exercises

    • total variance \(\sigma_y^2\) = residual variance + model variance (group means 2 or 0 around grand mean 1) → var((2-1),(0-1),(2-1),(0-1),…)
    • design variance \(\sigma_x^2\) = variance of the 1/0 predictor (centered: .5 and -.5) → var((1-.5),(0-.5),(1-.5),(0-.5),…)
  • GPower: regression coefficient (t-test / regression, one group size of slope)

    • what is the slope \(\beta\) and \(\sigma_y\) for reference values, d=.5 (hint:d~r), SD = 4 and \(\sigma_x\) = .5 (1/0)
    • what is the resulting sample size
    • what happens with the slope and sample size if predictor values are taken as 1/-1 instead of .5/-.5?
    • what is \(\sigma_y\) for a slope of 6, \(\sigma_x\) = .5, and SD = 4, would it increase the sample size ?
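
  • A minimal R sketch of these relations for the reference values:

d <- .5
r <- d / sqrt(d^2 + 4)     # 0.243
sigma_y <- sqrt(16 + 1)    # residual variance 16 + model variance 1
sigma_x <- .5              # sd of a balanced 1/0 predictor
r * sigma_y / sigma_x      # slope = 2, the raw group difference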

Explained variance perspective, regression: exercises

 

  • A relation as ratio between and within group variance ~ explained variance R2

  • Different but related effect sizes \(f^2\) = \({R^2/{(1-R^2)}}\)

    • partial \(R^2\) = variance explained by predictor / total variance
    • \(f^2\) = variance explained by predictor / residual variance
    • Note: \(f\) = \(d\)/2 for 2 groups (numeric sketch after the exercises)
  • GPower: regression coefficient (t-test / regression, fixed model single regression coef)

  • Use reference example

    • remember the variances, add them to calculate the effect size
    • calculate sample size ?
    • what if also other predictors in the model ?
    • what if 3 predictors extra reduce residual variance to 50% ?
      • hint: total variance remains constant at 17
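
  • A minimal R sketch of the explained-variance computation for the reference example:

var_model <- 1; var_resid <- 16
R2 <- var_model / (var_model + var_resid)   # partial R2 = 1/17 = 0.0588
R2 / (1 - R2)                               # f2 = 1/16 = 0.0625
sqrt(R2 / (1 - R2))                         # f = 0.25 = d/2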

A variance ratio perspective, ANOVA: exercises

 

  • Multiple groups, at least two differ → not one effect size d

  • F-test statistic & effect size \(f\), the square root of the variance ratio \(\sigma_{between}^2 / \sigma_{within}^2\)

  • GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)

  • use reference example

    • what is the sample size, for a difference of 2, each 64 observations ?
    • why are the ncp and critical value different ?
    • how does size matter ? (play with it)
  • include a third group (group 1 = 0, group 2 = 2)

    • for group 3 with mean 0, 2 or 4 (figure), what are the sample sizes ?
    • vary group 2, middle group, with mean 1 and 3, does that have an effect ?
    • repeat for between variance 2.67 (balanced: 0, 2, 4) and within variance 16 ?

 

Multiple groups, pairwise: exercises

 

  • Detecting ‘a’ difference often not of interest (omnibus), typically particular pairwise comparisons are

  • Pairwise comparisons

    • looked at as if t-test
    • requires multiple testing correction, e.g., Bonferroni correction: divide \(\alpha\) by the number of tests
  • use reference example (C = 0, T1 = 2), and extend with group 3 with T2 = 4

    • what sample sizes are necessary for all three pairwise tests combined ?
    • what if the biggest difference (C-T2) is ignored, because considered easiest to detect ?
    • with the original 64 sized groups, what is the power to detect the C-T1 difference ?
      • with either 3 or 2 tests jointly ?

Multiple groups: contrasts: exercises

 

  • Contrasts are linear combinations → planned comparison
    • e.g., T1-C: \(1 * T1 -1 * C \neq 0\)
    • e.g., (T1+T2)/2-C: \(.5 * (1 * T1 + 1 * T2) -1 * C \neq 0\)
  • Effect sizes for planned comparisons must be calculated
    • contrast specific between variance \(\sigma_{contrast}^2\)
    • \(f\) ~ variance ratio between / within
    • Note: \(f\) = \(d\)/2 for a two-group contrast
  • Obtain effect sizes for contrasts (assume equally sized for convenience)
    • T1-C, T2-C, (T1+T2)/2-C
    • each contrast requires 1 degree of freedom

 

  • Parameters
    • group means \(\mu_i\)
    • pre-specified coefficients \(c_i\)
    • sample sizes \(n_i\)
    • total sample size \(N\)
    • k levels


\(\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}\)  

\(f = \sqrt{\frac{\sigma_{contrast}^2}{\sigma_{error}^2}}\)
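
  • A minimal R sketch of these formulas for the contrast (T1+T2)/2-C, assuming means 0, 2, 4 and equal groups of 64:

mu <- c(0, 2, 4)              # C, T1, T2
cc <- c(-1, .5, .5)           # contrast coefficients
n  <- rep(64, 3); N <- sum(n)
sigma_c <- abs(sum(mu * cc)) / sqrt(N * sum(cc^2 / n))
sigma_c / 4                   # f = 0.3535, with error sd 4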

Multiple groups: contrasts: exercises continued

 

  • GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)

  • For the reference example extended, with contrasts \(f_{T1-C}\)=.25, \(f_{T2-C}\)=.50 and \(f_{(T2+T1)/2-C}\)=.3535

    • what are the sample sizes for either contrast 1 or contrast 2 ?
    • what are the sample sizes for both contrast 1 and contrast 2 combined ?
    • if taking that sample size, what will be the power for T1-T2 ?
    • what is the sample size for contrast 3 ?

Repeated measures within: exercises

 

  • When the factor is time: repeated measures

    • relates to dependent t-test for multiple measurements (>2)
  • Beware: effect sizes obtained from literature may/may not include correlation

    • Options: as in GPower 3, or SPSS, …
  • GPower: repeated measures (F-test / Means, repeated measures within factors)

  • For reference example with effect size f = .25 (1/16 explained versus unexplained)

    • mimic independent t-test
    • mimic dependent t-test, correlation .5 !
    • what does an increase in correlation imply, why ?
    • for 4, or 8 repeated measurements (cor=.5), what changes ?
    • for 2, 4 or 8 groups (cor=.5), what changes ?

Multiple factors

 

  • Multiple main effects and possibly interaction effects
    • main effect:
      • difference B1-B2 over all conditions of treatment (C,T1,T2)
      • difference C-T1-T2 over all conditions of type (B1,B2)
    • interaction effect:
      • effect of treatment (C-T1-T2) different per level of type (B1 or B2), or vice versa
    • note: numerator degrees of freedom
      • main effect (#levels - 1): 2 for treatment, 1 for type
      • interaction (#levels1 - 1) * (#levels2 - 1): 2 (= 2x1)

 

 

  • effect sizes
    • \(\eta^2\) = \(f^2 / (1+f^2)\)
    • note: \(f = d/2\)
      for two groups
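
  • A minimal R sketch of the conversion, using \(f\) = .25 from the reference:

f2 <- .25^2            # f = .25, i.e. d = .5 for two groups
f2 / (1 + f2)          # eta^2 = 0.0588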

Multiple factors effect sizes: exercises

 

  • Get effect size: in-house shiny app

  • Use reference example for treatment (C-T1-T2) and add type (B1-B2)

    • determine the effect size \(\eta^2\)
      for averages 0, 2, and if necessary use 4 or 6
      • treatment effect C-T1
      • treatment effect C-T1-T2 - no type effect B1-B2
      • no treatment effect C-T1-T2 - type effect B1-B2
      • treatment effect C-T1-T2 within B1, not B2 - with interaction
      • treatment effect C-T1-T2 - type effect B1-B2 without interaction

Multiple factors: exercises

 

  • GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)

  • Use reference example for treatment (C-T1-T2) and add type (B1-B2)
    what are the sample sizes

    • C-T1 with \(\eta^2\)=0.0588
    • C-T1-T2 (no B1-B2) with \(\eta^2\)=0.0385
    • B1-B2 (no C-T1-T2) with \(\eta^2\)=0.0588
    • interaction C-T1-T2 x no B1-B2 with \(\eta^2\)=0.04
    • additive C-T1-T2 + B1-B2
      • with \(\eta^2\)=0.1429 for C-T1-T2
      • with \(\eta^2\)=0.0588 for B1-B2

Repeated measures between: exercises

 

  • When repeated measures are obtained for different groups

  • GPower: repeated measures (F-test / Means, repeated measures between factors)

  • For reference example

    • relate to independent t-test, 2 uncorrelated measurements
    • mimic independent t-test, 2 almost perfectly correlated measurements
    • with a correlation .5, what changes ?
    • what does increase in correlation imply, why ?
    • for 3 groups extended reference, cor=.5, what changes ?
    • for 4, or 8 repeated measurements (cor=.5), what changes ?

Interaction within x between effect sizes: exercises

 

  • When differences between groups depend on time

  • Get effect size: in-house shiny app

  • Use reference example for control-treatment (C-T1), and 2 or 4 time points

    • determine the effect size interaction \(\eta^2\) with r=0 and r=.5
      • treatment effect C-T1, both C and T1 increase with 2
      • treatment effect C-T1, only T1 increase with 2
      • if previous situation repeated twice, T1=2,4,2,4
      • if previous situation repeated twice, but reversed, T1=2,4,4,2

Interaction within x between: exercises

 

  • Options: different effect sizes are possible

    • is correlation already part of effect size ?
    • often it is when extracted from literature
  • GPower: repeated measures (F-test / Means, repeated measures within-between factors)

  • Use effect sizes previous exercise, part 2 and 3

    • determine the sample size, assuming a correlation of .5
      • for effect size assuming no correlation is included, include .5 correlation
      • for effect size assuming correlation is included, include correlation of 0

Correlations: exercises

 

  • Test difference of two independent correlations → [-1..1]

  • Use Fisher Z transformations to normalize (sketch after the exercises)

    • z = .5 * log( \(\frac{1+r}{1-r}\) )
    • q = z1-z2
  • Correlations easier to differentiate as they are more different from 0

  • GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s

  • Testing whether two correlations are the same

    • with correlation coefficients .7844 and .5, what are the effect & sample sizes ?
    • with the same difference, but stronger correlations, eg., .9844 and .7, what changes ?
    • with the same difference, but weaker correlations, eg., .1 and .3844, what changes ?
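
  • A minimal R sketch of the Fisher transformation and the resulting effect size q:

fisher_z <- function(r) .5 * log((1 + r) / (1 - r))
fisher_z(.7844) - fisher_z(.5)     # q of about 0.51
fisher_z(.9844) - fisher_z(.7)     # same raw difference, larger q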

Proportions: exercises

 

  • Test difference two independent proportions → [0..1]

  • Proportions easier to differentiate as they are more different from .5

  • GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

  • Testing whether two proportions are the same (conversion sketch after the exercises)

    • for odds ratio 3 and p2 = .50, what is p1 ? and for odds ratio 1/3 ?
    • what is the sample size to detect a difference for both situations ?
    • for odds ratio 3 and p2 = .75, determine p1 and sample size, how does it compare with before ?
    • for odds ratio 1/3 and p2 = .25, determine p1 and sample size, how does it compare with before ?
    • compare sample size for a .15 difference, at p1=.5 ?
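
  • A minimal R sketch converting an odds ratio and reference proportion p2 into p1 (p1_from_or is a hypothetical helper, not part of GPower):

p1_from_or <- function(or, p2) {
    o <- or * p2 / (1 - p2)        # odds in group 1
    o / (1 + o)
}
p1_from_or(3, .5)      # 0.75
p1_from_or(1/3, .5)    # 0.25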

Proportions: exercises continued

 

  • GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

  • Plot 5 power curves

    • odds ratio = 2, with p2 reference probability .6
    • proportions .5 to 1
    • 1 curve per sample sizes 328, 428, 528… (intervals of 100)
    • type I error .05
  • Explain the curve minimum; how does it relate to sample size ?

  • Repeat for one-tailed, difference ?

Dependent proportions: exercises

 

  • Test difference two dependent proportions → [0..1] categorical shift

    • for two categories, McNemar test: compare \(p_{12}\) with \(p_{21}\)
    • information from changes only → discordant pairs
    • effect size as odds ratio → ratio of discordance
  • GPower: McNemar test (exact / proportions, difference 2 dependent proportions)

  • Testing whether proportions of discordance are the same

    • assume odds ratio 2, .25 discordant, what is the sample size ?
    • what for discordant proportions .5 and 1 ?
    • odds ratio .99 versus .5 (prop discordant = .25), what are \(p_{12}\) and \(p_{21}\) and the sample sizes ? (sketch below)
    • repeat for the third alpha option, and consider the total sample size, what happens ?
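
  • A minimal R sketch deriving the discordant cell proportions from the odds ratio (\(p_{12}/p_{21}\)) and the proportion discordant (\(p_{12}+p_{21}\)):

pd <- .25; or <- .5     # proportion discordant, odds ratio
pd * or / (1 + or)      # p12 = 0.0833
pd / (1 + or)           # p21 = 0.1667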

Not included

 

  • GPower is not always sufficient

  • Tests too difficult to specify in GPower

    • statistics / parameter values difficult to guestimate
    • manual not always sufficient
  • Tests not included in GPower

    • e.g., survival analysis
    • many tools online, most dedicated to a particular model
  • Tests without formula

    • simulation may be the only tool

 

  • simulation in theory is always possible
    • iterate many times:
      • generate: simulated outcome
        introduce randomness ~ standard deviation
      • analyze: estimate parameters + test
    • count proportion of rejections (~power)
    • determine confidence bounds (~accuracy)
    • select sample size
      • with appropriate proportion (test)
      • with appropriate interval (estimation)

Simulation example: t-test

 

  • simulation in practice
    • reference example: 0-2 (4), 64x2
    • replicate 10000 times
      • generate:
        dta$y <- dta$y+rnorm(length(dta$y),0,4)
      • analyze:
        res <- t.test(data=dta,y~X)
    • count proportion rejection:
      mean(sims['p.val',] < .05)

 

 

gr <- rep(c('T','C'),64)
y <- ifelse(gr=='C',0,2)
dta <- data.frame(y=y,X=gr)
cutoff <- qt(.025,nrow(dta)-2)              # critical value, df = N-2
 
my_sim_function <- function(){
    dta$y <- dta$y+rnorm(length(dta$y),0,4)     # generate (with sd=4)
    res <- t.test(data=dta,y~X)                 # analyze
    c(res$estimate %*% c(-1,1),res$statistic,res$p.value)
}
sims <- replicate(10000,my_sim_function())      # many iterations
dimnames(sims)[[1]] <- c('diff','t.stat','p.val')

mean(sims['p.val',] < .05)  # p-values  0.8029
mean(sims['t.stat',] < cutoff)  # t-statistics 0.8029
mean(sims['diff',] > sd(sims['diff',])*cutoff*(-1)) # differences 0.8024

Focus / simplify

 

  • Complex statistical models
    • simulate BUT it requires programming and a thorough understanding of the model
    • alternative: focus on essential elements → simplify the aim
  • Sample size calculations (design) for simpler research aim
    • not necessarily equivalent to final statistical testing / estimation
    • requires justification to convince yourself and/or reviewers
      • the study is already successful if the simple aim is satisfied
      • the ignored part is not too costly

 

  • Example:
    • statistics:
      group difference in evolution over 4 repeated measurements → mixed model
    • focus:
      difference between treatment and control at the last time point is essential → t-test
    • argument: first 3 measurements low cost, interesting to see change

Conclusion

 

  • Sample size calculation is a design issue, not a statistical one

  • It typically focuses on ensuring sufficient data to result in sufficiently strong statistical inference

  • Sample size depends on effect size, type I & II errors, and the statistical test of interest

  • Effect sizes express the amount of signal compared to the background noise

  • GPower deals with models that are not too complex

    • more complex models imply more complex specification
    • simplify using a focus, if justifiable → then GPower can get you a long way

Questions ?

 

Thank you for your attention.

 

 

Methodological and statistical support to help make a difference

  • At SQUARE we provide complementary support in statistics and methodology (qualitative and quantitative) to our research community, for individual researchers and research groups, to help get the best out of their research.
  • SQUARE aims to further enhance the quality of both the research and how it is communicated.

Contact

  • find the SQUARE team and information on our service at square.research.vub.be
  • for feedback on this workshop: wilfried.cools@vub.be