Workshop

  • to introduce the key ideas
  • to help you see the bigger picture
  • to offer first practical experience: GPower and R

 

  • Target audience
    • primarily the research community at VUB / UZ Brussel
  • History
    • started in 2019, with Sven Van Laere
    • almost yearly, gradually refined
    • more focus on R, and reasoning

Feedback

  • Help us improve this document

    wilfried.cools@vub.be

at SQUARE

  • Ask us for help

    we offer consultancy
    on methodology, statistics and its communication

    square.research.vub.be

Sample size calculation: demarcation

 

  • How many observations will be sufficient ?
    • avoid too many, because observations typically imply a cost
      • money / time → limited resources
      • risk / harm / patient burden → ethical constraints
    • have enough, to ensure success of the study
  • to offer strong enough statistical inference ?
    • linked to standard error
      • testing → power [probability to detect effect]
      • estimation → accuracy [size of confidence interval]

Sample size calculation: a difficult design issue

 

  • Part of the design of a study
    • before data collection
    • requires understanding:
      • parameters: effect size of interest
      • data: future data properties
      • model: relation between the outcome and the conditions under which it is observed
    • decision based on (highly) incomplete information, thus based on (strong) assumptions

 

  • Not always possible or meaningful !

    • confirmatory studies easier than exploratory
    • experiments (control) easier than observational
    • not obvious for complex models
      → simulation
    • not obvious for predictive models → no standard error
  • Avoid retrospective power analyses
    → a power analysis is OK for a future study only

    Hoenig, J., & Heisey, D. (2001). The Abuse of Power:
    The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.

Sample size calculation: if not possible

 

  • If not possible in a meaningful way
    use alternative justification
    • common practice
    • feasibility

  • Implies exploratory aim
    • no guarantee on significance
    • no guarantee on accuracy

 

  • But, because this is a weaker argument,
    put more weight on non-statistical justifications
    • low cost
    • importance (even if no effect)
    • novelty (future studies)
    • …  

Sample size calculation: who are you talking to ?

 

  • To persuade that your inference will be
    • effective → enough
    • efficient → not too many
  • Persuade
    • yourself
    • funding committee
      • per project
      • per observation
    • ethics committee
      • study worth the patient burden

Example: confirmatory experiment

 

  • Does my radiotherapy work ?
    • aim: show my radiotherapy reduces tumor size
    • method: compare groups treatment T and control C
      • induce tumor in N mice
      • randomly assign mice to T/C
    • observation: tumor sizes after 20 days
    • analysis: unpaired t-test

 

  • Sample size question
    • how many mice are required
    • to show treatment reduces tumor size more
    • requiring 25% more reduction in the treatment group
      • compared to an assumed 4mm reduction (control)
      • thus requiring at least 5mm (treatment)
    • with 80% probability (+ type I error probability .05)

Example: reference

 

  • A priori specifications
    • intend to perform a statistical test
    • comparing 2 equally sized groups
    • to detect difference of at least 2
    • assuming an uncertainty (SD) of 4 around each average
    • which results in an effect size of .5
    • evaluated on a Student t-distribution
    • allowing for a type I error prob. of .05 \((\alpha)\)
    • allowing for a type II error prob. of .2 \((\beta)\)
  • Sample size
    conditional on specifications being true


Difference detected approximately 80% of the time.

Note

  • This reference example used throughout the workshop !!

Formula you could use

 

  • Specifications for this particular case:
    • sample size (n → ?)
    • difference ( \(\Delta\) =signal → 2)
    • uncertainty ( \(\sigma\) =noise → 4)
    • type I error prob. ( \(\alpha\) = .05, so \(Z_{\alpha/2}\) → -1.96)
    • type II error prob. ( \(\beta\) = .2, so \(Z_\beta\) → -0.842)
  • Sample size = 2 groups x 63 observations = 126

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{ \Delta^2} = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79\)

  • Formulas are test and statistic specific, but the logic remains the same
  • These and other formulas are implemented in various tools; we will use GPower and the pwr package in R

R: a does-it-all tool

 

  • Use it
    • implements wide variety of tests
    • offers multiple (dedicated) packages
    • free @ https://cran.r-project.org/
    • huge community to help
  • Alternatives exist
    • online tools
    • special purpose programs

 

  • At a bare minimum, it is a calculator
    • hint: qnorm(.025) is \(Z_{.05/2}\)
    • hint: 4^2 is 4 squared
    • calculate the sample size in R for earlier formula
    • \(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)
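
A minimal sketch of that calculation, filling in the reference example values:

Za <- qnorm(.025)             # Z_{alpha/2} for two-sided alpha = .05
Zb <- qnorm(.20)              # Z_beta for power = .8
(Za + Zb)^2 * 2 * 4^2 / 2^2   # about 62.79 per group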

R: a native stats package function

 

power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = .80)
     Two-sample t test power calculation 

              n = 63.76576
          delta = 2
             sd = 4
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

# get help
?power.t.test

R: exercise using power.t.test

 

To confirm that the treatment results in at least 500 calories less, compared to the control,
knowing that the standard deviation of measured calories in each group should be about 1000,
how many observations are required to show that difference,
allowing for a .01 type I error and .9 power ?
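
One way to fill in these values, as a sketch (note that power.t.test reports n per group):

power.t.test(delta = 500, sd = 1000, sig.level = 0.01, power = 0.90)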

GPower: a useful tool

 

  • Use it
    • implements wide variety of tests
    • free @ http://www.gpower.hhu.de/
    • popular and well established
    • implements various visualizations
    • documented fairly well
  • Maybe not use it
    • not all tests are included, only simpler ones
    • not without flaws
    • other tools exist (some paying)

 

GPower menu of statistical tests

 

  • Test family - statistical tests [in window]
    • Exact Tests (8)
    • \(t\)-tests (11) → reference
    • \(z\)-tests (2)
    • \(\chi^2\)-tests (7)
    • \(F\)-tests (16)
  • Focus on the density functions

 

  • Tests [in menu]
    • correlation & regression (15)
    • means (19) → reference
    • proportions (8)
    • variances (2)
  • Focus on the type of parameters

GPower input

 

  • ~ reference example input
    • t-test : difference two independent averages
    • apriori: calculate sample size
    • effect size = standardized difference (Cohen’s \(d\))
      • Determine =>
        • \(d\) = |difference| / SD_pooled
        • \(d\) = |0-2| / 4 = .5
    • \(\alpha\) = .05; two-tailed ( \(\alpha\) /2 → .025 & .975 )
    • \(power = 1-\beta\) = .8
    • allocation ratio N2/N1 = 1 (equally sized groups)

GPower output

 

  • ~ reference example output
    • sample size \((n)\) = 2 x 64 = 128
    • degrees of freedom \((df)\) = 126 (128-2)
    • power ≥ .80 (1- \(\beta\)) = 0.8015
    • distributions: central + non-central
    • critical t = 1.979
      • decision boundary given \(\alpha\) and \(df\)
        qt(.975,126)
    • non centrality parameter ( \(\delta\) ) = 2.8284
      • shift Ha (true) away from Ho (null)
        2/(4*sqrt(2))*sqrt(64)

Exercise with Determine

 

  • For the reference example:
    • change mean values from 0 and 2 to 4 and 6, what changes ?
    • change sd values to 2 for each, what changes ?
      • effect size ?
      • total sample size ?
      • critical t ?
      • non-centrality ?
    • change sd values to 8 for each, what changes ?
    • change sd to 2 and 5.3, or 1 and 5.5,
      how does it compare to 4 and 4 ?

 

Exercise with X-Y Plot

 

  • For the reference example:
    • plot powercurve: power by effect size
    • compare 6 sample sizes: 34 in steps of 34
    • for a range of effect sizes in between .2 and 1.2
    • use \(\alpha\) equal to .05
    • how does power change when doubling the effect size ?

 

  • powercurve → X-Y plot for range of values

GPower protocol

 

  • Summary for future reference or communication
    • central and non-central distributions (figure)
    • protocol of power analysis (text)

 

  • File/Edit save or print file (copy-paste)

 

Non-centrality parameter ( \(\delta\) ), shift Ha from Ho

 

  • non-centrality parameter \(\delta\) combines SIZES
    • assumed effect size ((standardized) signal)
    • conditional on sample size (information)
  • \(\delta\) determines the overlap of Ho and Ha: bigger ncp, less overlap
    • \(\delta\) as violation of Ho → shift (location/shape)
    • power = probability beyond \(\color{green}{cut off}\) at Ho evaluated on Ha
    • push with sample size
  • Ha acts as \(\color{blue}{truth}\): assumed difference of e.g. .5 SD
    • Ha ~ t(ncp=2.828,df)
  • Ho acts as \(\color{red}{benchmark}\): typically no difference, no relation
    • set \(\color{green}{cutoff}\) on Ho ~ t(ncp=0,df) using \(\alpha\)

 

Alternative: divide by N

 

  • Sample sizes determine shape, not location
    • divide by n: sample size ~ standard error
      • peakedness of both distributions
      • often preferred didactically
    • non-centrality parameter: sample size ~ location
      • standardized distributions
      • often preferred in software / algorithms
  • Formulas are the same (DIY: equate the two expressions for the critical value)

 

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)

Type I/II error probability

 

  • Inference (test) based on cut-offs (density → AUC=1)

  • Type I error: incorrectly reject Ho (false positive):

    • cut-off at Ho, error prob. \(\alpha\) controlled
      • conditional on absence of group differences / relations
    • one/two tailed → one/both sides informative ?
  • Type II error: incorrectly fail to reject Ho (false negative):

    • cut-off at Ho, error prob. \(\beta\) obtained from Ha
    • Ha assumed known in a power analysis
  • power = 1 - \(\beta\) = probability correct rejection (true positive)

 

  • Inference versus truth
    • infer: effect exists vs. unsure
    • truth: effect exist vs. does not
             infer=Ha      infer=Ho       sum
  truth=Ho   \(\alpha\)    1- \(\alpha\)    1
  truth=Ha   1- \(\beta\)    \(\beta\)      1

Create plot

 

  • Create a plot
    • X-Y plot for range of values
    • assumes calculated analysis
      • ~ reference example
    • specify Y-axis / X-axis / curves and constant
      • beware of order !
  • Plot sample size (y-axis)
    • by type I error \(\alpha\) (x-axis) → from .01 to .2 in steps of .01
    • for 4 values of power (curves) → from .8 in steps of .05
    • assume effect size (constant) → .5 from reference example

 

  • Notice Table option

Errors: exercises

 

  • Where on the red curve (right) is
    the type II error equal to 4 * type I error ?
  • When smaller effect size (e.g., .25), what changes ?
  • With power on the Y-axis, what is relation type I and II error ?

 

Decide Type I/II error probability

 

  • Reasoning on error probabilities
    • \(\alpha\) & \(\beta\) inversely related
    • which error you want to avoid most ?
      • cheap AIDS test ? → avoid type II
      • heavy cancer treatment ? → avoid type I

 

  • Popular choices for error probabilities
    • \(\alpha\) often in range .01 - .05 → 1/100 - 1/20
    • \(\beta\) often in range .2 to .1 → power = 80% to 90%

 

 

  • Popular rules of thumb
    • 4 * \(\alpha\) ~ \(\beta\)
      • type I error is ~4 times worse !!
  • Probability for both errors always exists

For fun: P(effect exists | test says so)

 

  • Using \(\alpha\), \(\beta\) and power or \(1-\beta\)
    • \(P(infer=Ha|truth=Ha) = power\) → \(P\)(test says there is effect | effect exists)
    • \(P(infer=Ha|truth=Ho) = \alpha\)
    • \(P(infer=Ho|truth=Ha) = \beta\)
    • \(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
      \(= \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
      \(= \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
  • IF the prior probability that the effect exists is very low (eg., \(P(truth=Ha) = .01\)) THEN the probability that the effect exists when the test says so is also low (e.g., \(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\))
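
The same computation as a small R sketch (ppv is a hypothetical helper name):

ppv <- function(power, alpha, prior) {
    power * prior / (power * prior + alpha * (1 - prior))
}
ppv(power = .8, alpha = .05, prior = .01)   # about 0.14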

Control Type I error for Multiple Testing

 

  • Type I error defined on the family of Ho's,
    • manage probability to incorrectly reject
      • any of the null hypotheses !!
      • all of the null hypotheses
  • Typical use: contrasts on factors with more than 2 levels

 

  • Multiple testing
    • inflates type I error \(\alpha\)
      • k tests with probability for an error: \(1-(1- \alpha)^k\)
    • control \(\alpha\) over set of tests
      • change \(\alpha\) → eg., Bonferroni ( \(\alpha/k\))
      • ~ adjust p-values
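
A quick check of the inflation and the Bonferroni correction, assuming k = 3 tests:

k <- 3
1 - (1 - .05)^k   # probability of at least one type I error: about .14
.05 / k           # Bonferroni-corrected alpha per test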

A family of comparisons: exercises

 

  • Comparing the control group and two treatments

  • Pairwise comparisons, typically not an omnibus

    • looked at as a set of t-tests, not as ANOVA
    • requires multiple testing correction, e.g., Bonferroni correction: divide \(\alpha\) by the number of tests
  • use reference example (C = 0, T1 = 2), and extend with group 3 with T2 = 4 (same sd)

    • what sample sizes are necessary for all three pairwise tests combined ?
    • what if biggest difference (C-T2) is ignored, because considered easiest to detect ?
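
A sketch of how the pwr package could handle one such Bonferroni-corrected pairwise test; the smallest standardized difference (d = .5) drives the sample size:

library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05 / 3, power = 0.8, type = "two.sample")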

Control Type I error for Interim Analysis

 

  • Interim analysis
    • analyze and ‘conditionally’ proceed
      • possibility to stop early
    • error spending: stop if
      • significant: control type I error
      • futile: control type II error

 

  • Adjustments of either \(p\) or \(\alpha\)
    • plan in advance
      • O’Brien-Fleming bounds: initially conservative
      • Pocock bounds: constant
      • … or design yourself
    • dependent on information fraction, extract critical values
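
In R, the gsDesign package is one option for such bounds; a minimal sketch, assuming a one-sided two-stage design:

# install.packages('gsDesign')
library(gsDesign)
gsDesign(k = 2, test.type = 1, alpha = 0.025, beta = 0.2, sfu = "OF")      # O'Brien-Fleming
gsDesign(k = 2, test.type = 1, alpha = 0.025, beta = 0.2, sfu = "Pocock")  # Pocock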

Control Type I error for Interim Analysis

 

Pocock (P), O’Brien-Fleming (OF), Haybittle-Peto (HP), and Wang-Tsiatis (WT) correction with \(\delta\) = 0.25

Interim Analysis: exercises

 

  • Detect the difference of the reference example, with 2 peeks at the data
  • Ensure sufficient power; what are the decision rules ?
  • Use App. gsDesigner

Early stopping with high cost: Simon two-stage

 

  • Interim analysis
    • event (reaction to treatment) or not (binomial)
    • stop if not enough evidence, because observations come at a high cost

 

  • To determine a difference in proportion of reactions
    • expecting proportion p0 under no effect, wanting to detect p1
    • search for number n1 at first stage
    • given (different values of) the number (n) in total

Exercise Simon two-stage

 

  • To determine a difference in proportion of reactions
    • assume a proportion of about .1 reacts if the treatment does nothing
    • assume a proportion of about .4 reacts if the treatment works sufficiently
    • for type I and II error .05 and .2
    • use the ph2simon function of the clinfun package
    • hint: ?ph2simon to get the help file
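
A minimal sketch with the exercise values filled in:

# install.packages('clinfun')
library(clinfun)
ph2simon(pu = 0.1, pa = 0.4, ep1 = 0.05, ep2 = 0.2)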

Sample size

 

  • Sample size
    • allocated over groups
    • not necessarily equally
    • specify ratio
  • Most important: all groups sufficiently large
    • unequal group sizes are not a problem
    • specify a ratio only when a group size difference is expected

Sample size: exercises

 

  • For the reference example:
    • compare for allocation ratios 1, .5, 2, 10, 50
    • repeat for effect size 1, and compare
  • think: why is n1 \(\neq\) n2 in the output ?
  • Plot with changed allocation ratio

 

Effect sizes, in principle

 

  • Estimate / guesstimate of the minimal magnitude of interest
     

  • Typically standardized: signal to noise ratio

    • eg., effect size \(d\) = .5 means .5 standard deviations (pooled)
    • eg., effect size \(R^2\) = .3 means .3 explained variance
  • Part of non-centrality (as is sample size) → pushing away Ha

  • ~ Practical relevance

    • NOT statistical significance
      • p-value ~ effect size AND sample size

 

  • 2 main families of effect sizes
    • d-family (differences) and r-family (associations)
    • transform one into other,
      eg., d = .5 → r = .243
      \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\)
      \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\)
      \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
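
A quick check of these transformations in R:

d <- .5
r <- d / sqrt(d^2 + 4)   # 0.243
2 * r / sqrt(1 - r^2)    # back to d = .5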

Effect sizes, in literature

 

  • Cohen, J. (1992).
    A power primer. Psychological Bulletin, 112, 155–159.

  • Cohen, J. (1988).
    Statistical power analysis for the behavioral sciences (2nd ed).

  • Famous Cohen conventions

    • but beware, just rules of thumb

 

Effect sizes, in literature continued

 

  • Ellis, P. D. (2010).
    The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.

  • more than 70 different effect sizes… most of them related

 

Effect sizes, in GPower (Determine)

 

  • Effect sizes are test specific
    • t-test → group means and sd’s
    • one-way ANOVA → variance explained & error
    • regression → sd’s and correlations
    • . . . .

 

  • GPower helps with Determine
    • sliding window
    • one or more effect size specifications

 

Effect sizes, how to determine them in theory

 

  • Choice of effect size matters → justify choice !!

  • Choice of effect size depends on aim of the study

    • realistic (eg., previously observed effect) → replicate
    • important (eg., minimally relevant effect)
    • NOT significant → meaningless, dependent on sample size
  • Choice of effect size dependent on statistical test of interest

    • for independent t-test → means and standard deviations
    • possible alternative: variance explained, eg., 1 versus 16+1

 

  • Examples
    • with one-way ANOVA
      \(f\) = .25 instead of d = .5
    • with linear regression
      \(f^2\) = .0625 instead of d = .5
    • psychometric freeware

Effect sizes, how to determine them in practice

 

  • Experts / patients
    • minimally clinically relevant effect
    • importance
    • use if possible
  • Literature
    • earlier study / systematic review
    • realistic
    • beware of publication bias
  • (Internal) Pilot
    • guesstimate of the dispersion
    • not to obtain effect size → small sample

 

  • Guesstimate uncertainty… (see the sketch below)
    • sd from assumed range
      • assume normal and divide by 6
    • sd for proportions at conservative .5
    • sd from control, assume treatment the same
    • ...
  • Turn to Cohen
    • use if everything else fails
    • rules of thumb
      • eg., .2 - .5 - .8 for Cohen’s d
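
Two of these guesstimates as a sketch, with a hypothetical assumed range of 0 to 24:

(24 - 0) / 6          # sd from an assumed range, assuming normality: 4
sqrt(.5 * (1 - .5))   # conservative sd for a proportion: .5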

Effect sizes, a note about the SD

 

  • For independent t-test → means and standard deviations (sd)
    • sd ~ ‘unexplained’ variance
    • account for important predictors
  • Example: 50% variance unexplained by treatment, explained by predictor
    • a standard deviation of 4 (variance of 16)
    • split into
      • a residual standard deviation of 2.8284 (variance of 8)
      • a standard deviation of 2.8284 (variance of 8) explained by the important predictor

Effect sizes, a note about the SD continued

 

  • not accounting for the important predictor

 

  • accounting for it, sd (around average) reduced

Effect sizes, a note about non-inferiority

 

  • Most often aim to show effect is likely to exist
    • show difference (or relation) different from (typically) zero
    • assuming a particular difference (or relation)
  • Non-inferiority to show effect is not too much worse
    • show difference (or relation) not beyond margin of tolerance
    • assuming a particular difference (or relation),
    • most often but not necessarily 0
    • ! this is inherently one-sided

Effect sizes, notes: exercises

 

  • Use reference example
    • assume half the unexplained variance is accounted for by the predictor, what are the sample sizes ?
    • assume a non-inferiority margin of -2, and no difference, how big is the sample size ?
    • assume treatment to be 2 higher, compare the sample size for superiority (bigger than 0) and non-inferiority with margin of -2
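
For the second bullet, a sketch of the one-sided calculation: with a true difference of 0 and a margin of -2, the shift of 2 is tested one-sided:

power.t.test(delta = 2, sd = 4, sig.level = .05, power = .8,
             alternative = "one.sided")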

 

Effect sizes, test specific but not really

 

  • Regression analysis results in R

Call:
lm(formula = y ~ factor(group), data = .dta)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8910 -2.5825 -0.6262  2.5170  9.2916 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    6.280e-16  5.000e-01   0.000  1.00000   
factor(group)2 2.000e+00  7.071e-01   2.828  0.00544 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4 on 126 degrees of freedom
Multiple R-squared:  0.0597,    Adjusted R-squared:  0.05224 
F-statistic:     8 on 1 and 126 DF,  p-value: 0.005444

 

  • ANOVA results in R
             Df Sum Sq Mean Sq F value  Pr(>F)   
group         1    128     128       8 0.00544 **
Residuals   126   2016      16                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • t-test results in R
[1] 0.00544392
        t 
-2.828427 

Effect sizes, test specific but not really

 

  • Total variance: 16.882
  • Between group variance
    with mean y equal to 0 and 2: 1.008
  • Within group variance
    (error or residual variance): 15.874
  • Effect size \(f\), square root of the ratio
    of between to within group variance: 0.252
  • \(\eta^2\) statistic, ratio of between group to total variance: 0.06
  • Design variance
    with X = 0 and 1: 0.252

 

A relations perspective, regression analysis: exercises

 

  • Differences between groups ~ relation with grouping (categorization)

  • Example: d = .5 ~ r = .243 (note: slope \(\beta = {r*\sigma_y} / {\sigma_x}\))

    • total variance \(\sigma_y\) = residual variance + model variance (2 or 0) → var((2-1),(0-1),(2-1),(0-1),…)
    • design variance \(\sigma_x\) = variance -.5 and .5 for all observations → var((1-.5),(0-.5),(1-.5),(0-.5),…)
  • GPower: regression coefficient (t-test / regression, one group size of slope)

    • what is the slope \(\beta\) and \(\sigma_y\) for reference values, d=.5 (hint:d~r), SD = 4 and \(\sigma_x\) = .5 (1/0)
    • what is the resulting sample size
    • what happens with the slope and sample size if predictor values are taken as 1/-1 instead of 0/1 ?
    • what is \(\sigma_y\) for a slope of 6, \(\sigma_x\) = .5, and SD = 4, would it increase the sample size ?

Explained variance perspective, regression: exercises

 

  • A relation as ratio of between to within group variance ~ explained variance \(R^2\)

  • Different but related effect sizes \(f^2\) = \({R^2/{(1-R^2)}}\)

    • partial \(R^2\) = variance explained by predictor / total variance
    • \(f^2\) = variance explained by predictor / residual variance (Note: 2\(f\) = \(d\) for 2 groups)
  • GPower: regression coefficient (t-test / regression, fixed model single regression coef)

  • Use reference example

    • remember the variances, add them to calculate the effect size
    • calculate sample size ?
    • what if also other predictors in the model ?
    • what if 3 predictors extra reduce residual variance to 50% ?
      • hint: total variance remains constant at 17

Effect sizes, transformations

 

  • \(f^2\), a go-to measure
    • possible for more than 2 groups
    • transformations well described  
  • Typically standardized: signal to noise ratio
    • eg., effect size \(d\) = .5 means .5 standard deviations (pooled)
  • Part of non-centrality (as is sample size) → pushing away Ha
  • ~ Practical relevance
    • NOT statistical significance
      • p-value ~ effect size AND sample size

 

  • 2 main families of effect sizes
    • d-family (differences) and r-family (associations)
    • transform one into other,
      eg., d = .5 → r = .243
      \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\)
      \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\)
      \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
      \(\hspace{20 mm}f^2 = \frac{R^2}{1-R^2}\)

Effect sizes, \(f^2\) approximations

 

Regression

\(f^2 = \frac{R^2}{1-R^2}\)  

Binary Logistic Regression

\(f = \frac{\Phi^{-1}(AUC)}{\sqrt{2}}\)  

Ordinal Logistic Regression

\(f = \frac{\sqrt{3}}{2*\pi} * log\_odds\_ratio\)

Poisson Regression

\(f = \sqrt{\frac{(\lambda_1-\lambda_2)^2}{2*(\lambda_1+\lambda_2)}}\)  

Exponential for waiting times

\(f = 1-{\frac{\lambda_1 * \lambda_2}{(\lambda_1 + \lambda_2)^2}}^{\frac{\mu^2}{\sigma}}\)  

Gamma regression

\(f^2 = {\frac{(\mu_1 + \mu_2)^2}{4*\mu_1*\mu_2}}^{\frac{\mu^2}{\sigma^2}} - 1\)

Effect sizes, use of \(f^2\) with the pwr package

 

  • equivalence
power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = .80)$n * 2
[1] 127.5315
# install.packages('pwr')
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")$n * 2
[1] 127.5312
pwr.f2.test(u = 1, f2 = 0.25^2, sig.level = 0.05, power = 0.8)$v + 2
[1] 127.5312
  • degrees of freedom
    • u, for testing the hypothesis of interest
    • sample size = v + u + 1 observations for the full model (hence the + 2)

 

Effect sizes, use cases of \(f^2\), degrees of freedom

  • To extract a sample size
    • determine the required (denominator) degrees of freedom
    • this requires the number of parameters used to estimate the effect of interest
  • Sample size equals the required degrees of freedom
    • plus one observation for each parameter to estimate
  • E.g., a 2x3x4 factorial design, with primary interest in the 2x3 interaction
    • (2-1)*(3-1) = 2 degrees of freedom for the effect of interest
    • 1 + ((2-1)+(3-1)+(4-1)) + ((2-1)(3-1)+(2-1)(4-1)+(3-1)(4-1)) + ((2-1)(3-1)(4-1)) = 24 parameters (degrees of freedom) in total

Effect sizes, use cases of \(f^2\) with the pwr package

Following literature we expect 8.8 events per cycle for our control group and aim to show that our treatment group would have fewer, with at most 7.3, which implies a rate ratio of about 1.2. The groups should be similar in size, and without further information it is assumed that dispersion is 1. With the typical .05 type I error and 80% power, and for simplicity not including any other predictors, it should be possible to verify the required sample size of 118.

Poisson Regression \(f = \sqrt{\frac{(\lambda_1-\lambda_2)^2}{2*(\lambda_1+\lambda_2)}}\)
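
A sketch of that verification via the \(f\) approximation above; translating the denominator degrees of freedom v into observations is approximate, so the result lands near, not exactly at, the quoted 118:

library(pwr)
l1 <- 8.8; l2 <- 7.3
f <- sqrt((l1 - l2)^2 / (2 * (l1 + l2)))   # about .26
pwr.f2.test(u = 1, f2 = f^2, sig.level = .05, power = .8)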

Effect sizes, use cases of \(f^2\) with the pwr package

To show that the average rating for the treatment group is 10% better, a sample size is calculated based on a test comparing waiting times with an expected number for the iDA group of at most 1.44 compared to 1.6 for the control group. Note that this implies an average of 1.52, and we also found in literature evidence for a standard deviation of .82 of observed events. Verify that about 1002 patients are required when choosing a type I error of .05 and aiming at .8 power.

Exponential for waiting times \(f = 1-{\frac{\lambda_1 * \lambda_2}{(\lambda_1 + \lambda_2)^2}}^{\frac{\mu^2}{\sigma}}\)

Relation sample & effect size, type I & II errors

 

  • Building blocks:
    • sizes: sample ( \(n\) ) and effect ( \(\Delta\) )
    • errors: type I ( \(\alpha\) ), type II ~ power ( \(1-\beta\) )

 

  • GPower → type of power analysis
    • Apriori: \(n\) ~ \(\alpha\), power, \(\Delta\)
    • Post Hoc: power ~ \(\alpha\), \(n\), \(\Delta\)
    • Compromise: power, \(\alpha\) ~ \(\beta\:/\:\alpha\), \(\Delta\), \(n\)
    • Criterion: \(\alpha\) ~ power, \(\Delta\), \(n\)
    • Sensitivity: \(\Delta\) ~ \(\alpha\), power, \(n\)

 

  • each parameter conditional on others
  • one outgoing arrow, three going in  

Type of power analysis: exercises

 

  • For the reference example:
    • how big is the power for 128 observations (n=2x64, \(\alpha\)=.05 and \(\Delta\)=.5)
    • then, assume a power of .8 but with only half the sample size, how does the effect size \(\Delta\) change ?
    • then, set the ratio \(\beta\)/ \(\alpha\) to 4, what are the resulting \(\alpha\) and \(\beta\) ? and what is the critical value ?
    • then, set the effect size to .7 (same ratio), what are the resulting \(\alpha\) and \(\beta\) ? and what is the critical value ?
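
The first bullet as a post-hoc sketch in R:

power.t.test(n = 64, delta = 2, sd = 4, sig.level = .05)$power   # about .80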

Getting your hands dirty

 

 

  • in G*Power
m1=0
m2=2
s1=4
s2=4  
alpha=.025
N=128  
var=.5*s1^2+.5*s2^2  
d=abs(m1-m2)/sqrt(2*var)*sqrt(N/2)  
tc=tinv(1-alpha,N-2)  
power=1-nctcdf(tc,N-2,d)  

 

  • in R
.n <- 64
.df <- 2*.n-2
.ncp <- 2 / (4 * sqrt(2)) * sqrt(.n)
.alpha <- .05
.power <- 1 -
    pt( 
        qt(1-.alpha/2,df=.df,ncp=0), 
        df=.df, ncp=.ncp
    )
round(.power,4)
[1] 0.8015
  • 2 steps
    • qt → quantile on Ho (critical value at \(1-\alpha/2\))
    • pt → probability on Ha below that quantile

GPower, a few more situations as exercise

 

  • dependent instead of independent
  • proportions, dependent and independent
  • non-parametric instead of assuming normality
  • correlations
  • more than 2 groups (compare jointly, pairwise, focused)
  • more than 1 predictor
  • repeated measures

 

  • Look into - GPower manual
    27 tests → effect size, non-centrality parameter and example !!

Dependence between groups: exercises

 

  • When comparing 2 dependent groups (eg., before/after treatment) → account for the correlation

  • Correlations are typically obtained from pilot data or earlier research

  • GPower: matched pairs (t-test / means, difference 2 dependent means)

  • Use reference example,

    • assume a correlation of .5 and compare with reference example for effect size and n
    • how many observations are required if no correlation exists (think then try) ? effect size ?
    • what changes with correlation .875 (think: more or less n, higher or lower effect size) ?
    • what power would be obtained for the reference with sample size 2x64, but correlation .5 ?
    • get the sd of the difference and use it (\(\sqrt{\sigma_a^2 + \sigma_b^2 - 2 * \rho * \sigma_a * \sigma_b}\)) ?
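
The last bullet as a sketch: the sd of the difference feeds a paired calculation (for power.t.test with type = 'paired', sd is the sd of the differences):

sa <- 4; sb <- 4; rho <- .5
sd_diff <- sqrt(sa^2 + sb^2 - 2 * rho * sa * sb)   # 4 when rho = .5
power.t.test(delta = 2, sd = sd_diff, sig.level = .05, power = .8, type = "paired")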

Dependence: a note about the correlations

 

  • a noticeable correlation

 

  • clear correlation

Proportions: exercises

 

  • Test difference two independent proportions → [0..1]

  • Simplest version of a logistic regression on two groups

  • GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

  • Testing whether two proportions are the same

    • for odds ratio 3 and p2 = .50, what is p1 ? and for odds ratio 1/3 and p2 = .75 ?
    • what is the sample size to detect a difference for both situations ?
    • for odds ratio 3 and p2 = .75, determine p1 and sample size, how does it compare with before ?
    • compare sample size for a .15 difference, at p1=.5 ?
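
A small sketch for converting an odds ratio and p2 into p1:

or <- 3; p2 <- .5
odds1 <- or * p2 / (1 - p2)
odds1 / (1 + odds1)   # p1 = .75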

Proportions: exercises

 

  • GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

  • Plot 5 power curves

    • odds ratio = 2, with p2 reference probability .6
    • proportions .5 to 1
    • 1 curve per sample size: 328, 428, 528, … (steps of 100)
    • type I error .05
  • Explain the curve minimum; how does it relate to sample size ?

  • Repeat for one-tailed, difference ?

Dependent proportions: exercises

 

  • Test difference two dependent proportions → [0..1] categorical shift

    • for two categories, McNemar test: compare \(p_{12}\) with \(p_{21}\)
    • information from changes only → discordant pairs
    • effect size as odds ratio → ratio of discordance
  • GPower: McNemar test (exact / proportions, difference 2 dependent proportions)

  • Testing whether proportions of discordance are the same

    • assume odds ratio 2, .25 discordant, what is sample size
    • for discordant .5 and 1, what are \(p_{12}\) and \(p_{21}\) and the sample sizes ?
    • for odds ratio .5 and 1 (prop discordant = .25), what are sample sizes ?
    • repeat for third alpha option first scenario, what happens ?

Non-parametric distribution: exercises

 

  • When non-normally distributed residuals are expected and cannot be circumvented (eg., by transformation)

  • Only considers ranks or uses permutations → price is efficiency and flexibility

  • Requires a parent distribution (alternative hypothesis), ‘min ARE’ should be default

  • GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)

  • Use reference example

    • use a normal parent distribution, how much efficiency is lost ?
    • use ‘min ARE’ as parent distribution, how much efficiency is lost ?

A variance ratio perspective, ANOVA: exercises

 

  • Multiple groups, at least two differ → not one effect size d

  • F-test statistic & effect size f, ratio of variances \(\sigma_{between}^2 / \sigma_{within}^2\)

  • GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)

  • use reference example

    • what is the sample size, for a difference of 2, each 64 observations ?
    • why are the ncp and critical value different ?
    • how does size matter ? (play with it)
  • include a third group (group 1 = 0, group 2 = 2)

    • for third group with mean 0, 2 or 4 (figure), what are the sample sizes ?
    • repeat for between variance 2.67 (balanced: 0, 2, 4) and within variance 16 ?

 

Multiple groups: contrasts: exercises

 

  • Contrasts are linear combinations → planned comparison
    • eg., T1-C: \(1 * T1 -1 * C \neq 0\)
    • eg., (T1+T2)/2-C: \(.5 * (1 * T1 + 1 * T2) -1 * C \neq 0\)
  • Effect sizes for planned comparisons must be calculated
    • contrast specific between variance \(\sigma_{contrast}^2\)
    • \(f\) ~ variance ratio between / within
    • Note: \(f = d/2\) for 2 groups
  • Obtain effect sizes for contrasts (assume equally sized groups for convenience; see the sketch below)
    • T1-C, T2-C, (T1+T2)/2-C
    • each contrast requires 1 degree of freedom

 

  • Parameters
    • group means \(\mu_i\)
    • pre-specified coefficients \(c_i\)
    • sample sizes \(n_i\)
    • total sample size \(N\)
    • k levels


\(\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}\)  

\(f = \sqrt{\frac{\sigma_{contrast}^2}{\sigma_{error}^2}}\)
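
The formulas above as a small R sketch; note that for a pairwise contrast only the groups involved enter the calculation:

contrast_f <- function(mu, cc, n, sd_error) {
    N <- sum(n)                                                # total sample size
    s_contrast <- abs(sum(mu * cc)) / sqrt(N * sum(cc^2 / n))
    s_contrast / sd_error
}
contrast_f(c(0, 2),    c(-1, 1),      rep(64, 2), 4)   # T1-C        -> .25
contrast_f(c(0, 4),    c(-1, 1),      rep(64, 2), 4)   # T2-C        -> .50
contrast_f(c(0, 2, 4), c(-1, .5, .5), rep(64, 3), 4)   # (T1+T2)/2-C -> .3536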

Multiple groups: contrasts: exercises continued

 

  • GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)

  • For the reference example extended, with contrasts \(f_{T1-C}\)=.25, \(f_{T2-C}\)=.50 and \(f_{(T2+T1)/2-C}\)=.3535

    • what are the sample sizes for either contrast 1 or contrast 2 ?
    • what are the sample sizes for both contrast 1 and contrast 2 combined ?
    • if taking that sample size, what will be the power for T1-T2 ?
    • what is the sample size for contrast 3 ?

Repeated measures within: exercises

 

  • When the factor is time: repeated measures

    • relates to dependent t-test for multiple measurements (>2)
  • Beware: effect sizes obtained from literature may/may not include correlation

    • Options: as in GPower 3, or SPSS, …
  • GPower: repeated measures (F-test / Means, repeated measures within factors)

  • For reference example with effect size f = .25 (1/16 explained versus unexplained)

    • mimic independent t-test
    • mimic dependent t-test, correlation .5 !
    • what does an increase in correlation imply, why ?
    • for 4, or 8 repeated measurements (cor=.5), what changes ?
    • for 4 groups (4 measures, cor=.5), what changes ?

Multiple factors

 

  • Multiple main effects and possibly interaction effects
    • main effect:
      • difference B1-B2 over all conditions of treatment (C,T1,T2)
      • difference C-T1-T2 over all conditions of type (B1,B2)
    • interaction effect:
      • effect of treatment (C-T1-T2) different per level of type (B1 or B2), or vice versa
    • note: numerator degrees of freedom
      • main effect: number of levels - 1, i.e., 2 for treatment and 1 for type
      • interaction: (nr1-1)*(nr2-1), i.e., 2 (= 2 x 1)

 

 

  • effect sizes
    • \(\eta^2\) = \(f^2 / (1+f^2)\)
    • note: \(f = d/2\)
      for two groups

Multiple factors effect sizes: exercises

 

  • Get effect size: in-house shiny app

  • Use reference example for treatment (C-T1-T2) and add type (B1-B2)

    • determine the effect size \(\eta^2\)
      for averages 0, 2, and if necessary use 4 or 6
      • treatment effect C-T1
      • treatment effect C-T1-T2 - no type effect B1-B2
      • no treatment effect C-T1-T2 - type effect B1-B2
      • treatment effect C-T1-T2 within B1, not B2 - with interaction
      • treatment effect C-T1-T2 - type effect B1-B2 without interaction

Multiple factors: exercises

 

  • GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)

  • Use reference example for treatment (C-T1-T2) and add type (B1-B2)
    what are the sample sizes

    • C-T1 with \(\eta^2\)=0.0588
    • C-T1-T2 (no B1-B2) with \(\eta^2\)=0.1429
    • B1-B2 (no C-T1-T2) with \(\eta^2\)=0.0588
    • interaction C-T1-T2 x no B1-B2 with \(\eta^2\)=0.04
    • additive C-T1-T2 + B1-B2
      • with \(\eta^2\)=0.1429 for C-T1-T2
      • with \(\eta^2\)=0.0588 for B1-B2

Repeated measures between: exercises

 

  • When repeated measures are obtained for different groups

  • GPower: repeated measures (F-test / Means, repeated measures between factors)

  • For reference example

    • relate to independent t-test, 2 uncorrelated measurements
    • mimic independent t-test, 2 almost perfectly correlated measurements
    • with a correlation .5, what changes ?
    • what does increase in correlation imply, why ?
    • for 3 groups extended reference, cor=.5, what changes ?
    • for 4, or 8 repeated measurements (cor=.5), what changes ?

Interaction within x between effect sizes: exercises

 

  • When differences between groups depend on time

  • Get effect size: in-house shiny app

  • Use reference example for control-treatment (C-T1), and 2 or 4 time points

    • determine the effect size interaction \(\eta^2\) with r=0 and r=.5
      • treatment effect C-T1, both C and T1 increase with 2
      • treatment effect C-T1, only T1 increase with 2
      • if previous situation repeated twice, T1=2,4,2,4
      • if previous situation repeated twice, but reversed, T1=2,4,4,2

Interaction within x between: exercises

 

  • Options: different effect sizes are possible

    • is correlation already part of effect size ?
    • often it is when extracted from literature
  • GPower: repeated measures (F-test / Means, repeated measures within-between factors)

  • Use effect sizes previous exercise, part 2 and 3

    • determine the sample size, assuming a correlation of .5
      • for effect size assuming no correlation is included, include .5 correlation
      • for effect size assuming correlation is included, include correlation of 0

Correlations: exercises

 

  • Test difference of two independent correlations → [-1..1]

  • Use Fisher Z transformations to normalize

    • z = .5 * log( \(\frac{1+r}{1-r}\) )
    • q = z1-z2
  • Correlations are easier to differentiate the more they differ from 0

  • GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s

  • Testing whether two correlations are the same

    • with correlation coefficients .7844 and .5, what are the effect & sample sizes ?
    • with the same difference, but stronger correlations, eg., .9844 and .7, what changes ?
    • with the same difference, but weaker correlations, eg., .1 and .3844, what changes ?
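
A sketch of the effect size q and a normal-approximation sample size for the first bullet (atanh is the Fisher z transform):

q <- atanh(.7844) - atanh(.5)   # z1 - z2, about .51
n <- 3 + 2 * ((qnorm(.975) + qnorm(.8)) / q)^2
ceiling(n)                      # about 64 per group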

Not included

 

  • GPower is not always sufficient, and neither is R

  • Tests too difficult to specify in GPower or R

    • statistics / parameter values difficult to guesstimate
    • manuals not always sufficient
  • Tests not included in GPower or R

    • eg., survival analysis in GPower
    • many tools online, most dedicated to a particular model
  • Tests without formula

    • simulation may be the only tool

 

  • simulation in theory is always possible
    • iterate many times:
      • generate: simulated outcome
        introduce randomness ~ standard deviation
      • analyze: estimate parameters + test
    • count proportion of rejections (~power)
    • determine confidence bounds (~accuracy)
    • select sample size
      • with appropriate proportion (test)
      • with appropriate interval (estimation)

Simulation example: t-test

 

  • simulation in practice
    • reference example: 0-2 (4), 64x2
    • replicate 10000 times
      • generate:
        dta$y <- dta$y+rnorm(length(dta$y),0,4)
      • analyze:
        res <- t.test(data=dta,y~X)
    • count proportion rejection:
      mean(sims['p.val',] < .05)

 

 

gr <- rep(c('T','C'),64)
y <- ifelse(gr=='C',0,2)
dta <- data.frame(y=y,X=gr)
cutoff <- qt(.025,nrow(dta)-2)              # critical t at df = N-2
 
my_sim_function <- function(){
    dta$y <- dta$y+rnorm(length(dta$y),0,4)     # generate (with sd=4)
    res <- t.test(data=dta,y~X)                 # analyze
    c(res$estimate %*% c(-1,1),res$statistic,res$p.value)
}
sims <- replicate(10000,my_sim_function())      # many iterations
dimnames(sims)[[1]] <- c('diff','t.stat','p.val')

mean(sims['p.val',] < .05)  # p-values  0.8029
mean(sims['t.stat',] < cutoff)  # t-statistics 0.8029
mean(sims['diff',] > sd(sims['diff',])*cutoff*(-1)) # differences 0.8024

Focus / simplify

 

  • Complex statistical models
    • simulate BUT it requires programming and a thorough understanding of the model
    • alternative: focus on essential elements → simplify the aim
  • Sample size calculations (design) for simpler research aim
    • not necessarily equivalent to final statistical testing / estimation
    • requires justification to convince yourself and/or reviewers
      • the study is already successful if the simpler aim is satisfied
      • the ignored part is not too costly

 

  • Example:
    • statistics:
      group difference evolution 4 repeated measurements → mixed model
    • focus:
      difference between treatment and control at the last time point is essential → t-test
    • argument: first 3 measurements low cost, interesting to see change

Conclusion

 

  • Sample size calculation is a design issue, not a statistical one

  • It typically focuses on ensuring sufficient data to result in sufficiently strong statistical inference

  • Sample size depends on effect size, type I & II errors, and the statistical test of interest

  • GPower deals with not too complex models

    • more complex models imply more complex specifications
    • simplify using a focus, if justifiable → then GPower can get you a long way
  • R is more flexible, but may require a bit more digging and work

  • Most important: turn it into a good story…

Questions ?

 

Thank you for your attention.

 

 

Methodological and statistical support to help make a difference

  • At SQUARE we meet to provide complementary support in statistics and methodology (qualitative and quantitative) to our research community, for individual researchers and research groups, in order to get the best out of their research.
  • SQUARE aims to further enhance the quality of both the research and how it is communicated.

Contact

  • find the SQUARE team and information on our service at square.research.vub.be
  • for feedback on this workshop: wilfried.cools@vub.be