Workshop

  • to introduce the key ideas
  • to help you see the bigger picture
  • to offer first practical experience: GPower and R

 

  • Target audience
    • primarily the research community at VUB / UZ Brussel
  • History
    • started in 2019, with Sven Van Laere
    • almost yearly, gradually refined
    • more focus on R, and reasoning

Feedback

  • Help us improve this document

    wilfried.cools@vub.be

at SQUARE

  • Ask us for help

    we offer consultancy
    on methodology, statistics and its communication

    square.research.vub.be

Sample size calculation: demarcation

 

  • How many observations will be sufficient ?
    • avoid too many, because observations typically imply a cost
      • money / time → limited resources
      • risk / harm / patient burden → ethical constraints
    • have enough, to ensure success of the study
  • to offer strong enough statistical inference ?
    • linked to standard error
      • testing → power [probability to detect effect]
      • estimation → accuracy [size of confidence interval]

Sample size calculation: a difficult design issue

 

  • Part of the design of a study
    • before data collection
    • requires understanding:
      • parameters: effect size of interest
      • data: future data properties
      • model: relation between the outcome and the conditions under which it is observed
    • decision based on (highly) incomplete information, thus based on (strong) assumptions

 

  • Not always possible or meaningful !

    • confirmatory studies easier than exploratory
    • experiments (control) easier than observational
    • not obvious for complex models
      → simulation
    • not obvious for predictive models → no standard error
  • Avoid retrospective power analyses
    → a power analysis is OK for a future study only

    Hoenig, J., & Heisey, D. (2001). The Abuse of Power:
    The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.

Sample size calculation: if not possible

 

  • If not possible in a meaningful way
    use alternative justification
    • common practice
    • feasibility

  • Implies exploratory aim
    • no guarantee on significance
    • no guarantee on accuracy

 

  • But, because this is a weaker argument,
    put more weight on non-statistical justifications
    • low cost
    • importance (even if no effect)
    • novelty (future studies)
    • …  

Sample size calculation: who are you talking to ?

 

  • To persuade that your inference will be
    • effective → enough
    • efficient → not too many
  • Persuade
    • yourself
    • funding committee
      • per project
      • per observation
    • ethics committee
      • study worth the patient burden

Example: confirmatory experiment

 

  • Does my radiotherapy work ?
    • aim: show my radiotherapy reduces tumor size
    • method: compare groups treatment T and control C
      • induce tumor in N mice
      • randomly assign mice to T/C
    • observation: tumor sizes after 20 days
    • analysis: unpaired t-test

 

  • Sample size question
    • how many mice are required
    • to show treatment reduces tumor size more
    • requiring 25% more reduction in the treatment group
      • compared to an assumed 4mm reduction (control)
      • thus requiring at least 5mm (treatment)
    • with 80% probability (+ type I error probability .05)

Example: reference

 

  • A priori specifications
    • intend to perform a statistical test
    • comparing 2 equally sized groups
    • to detect difference of at least 2
    • assuming an uncertainty (SD) of 4 around each average
    • which results in an effect size of .5
    • evaluated on a Student t-distribution
    • allowing for a type I error prob. of .05 \((\alpha)\)
    • allowing for a type II error prob. of .2 \((\beta)\)
  • Sample size
    conditional on specifications being true


Difference detected approximately 80% of the time.

Note

  • This reference example used throughout the workshop !!

Formula you could use

 

  • Specifications for this particular case:
    • sample size (n → ?)
    • difference ( \(\Delta\) =signal → 2)
    • uncertainty ( \(\sigma\) =noise → 4)
    • type I error prob. ( \(\alpha\) = .05, so \(Z_{\alpha/2}\) → -1.96)
    • type II error prob. ( \(\beta\) = .2, so \(Z_\beta\) → -0.842)
  • Sample size = 2 groups x 63 observations = 126

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{ \Delta^2} = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79\)

  • Formulas are test and statistic specific, but the logic remains the same
  • These and other formulas are implemented in various tools; we will use GPower and the pwr package in R

R: a does-it-all tool

 

  • Use it
    • implements wide variety of tests
    • offers multiple (dedicated) packages
    • free @ https://cran.r-project.org/
    • huge community to help
  • Alternatives exist
    • online tools
    • special purpose programs

 

  • At a bare minimum, it is a calculator
    • hint: qnorm(.025) is \(Z_{.05/2}\)
    • hint: 4^2 is 4 squared
    • calculate the sample size in R for earlier formula
    • \(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)
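
A minimal sketch of that calculation, filling in the reference example values:

Za <- qnorm(.025)             # Z_{alpha/2} for two-sided alpha = .05
Zb <- qnorm(.20)              # Z_beta for power = .8
(Za + Zb)^2 * 2 * 4^2 / 2^2   # about 62.79 per group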

R: a native stats package function

 

power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = .80)
     Two-sample t test power calculation 

              n = 63.76576
          delta = 2
             sd = 4
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

# get help
?power.t.test

R: exercise using power.t.test

 

To confirm that the treatment results in at least 500 calories less, compared to the control,
knowing that the standard deviation of measured calories in each group should be about 1000,
how many observations are required to show that difference,
allowing for a .01 type I error and .9 power ?
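
One way to fill in these values, as a sketch (note that power.t.test reports n per group):

power.t.test(delta = 500, sd = 1000, sig.level = 0.01, power = 0.90)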

GPower: a useful tool

 

  • Use it
    • implements wide variety of tests
    • free @ http://www.gpower.hhu.de/
    • popular and well established
    • implements various visualizations
    • documented fairly well
  • Maybe not use it
    • not all tests are included, only simpler ones
    • not without flaws
    • other tools exist (some paying)

 

GPower menu of statistical tests

 

  • Test family - statistical tests [in window]
    • Exact Tests (8)
    • \(t\)-tests (11) → reference
    • \(z\)-tests (2)
    • \(\chi^2\)-tests (7)
    • \(F\)-tests (16)
  • Focus on the density functions

 

  • Tests [in menu]
    • correlation & regression (15)
    • means (19) → reference
    • proportions (8)
    • variances (2)
  • Focus on the type of parameters

GPower input

 

  • ~ reference example input
    • t-test : difference two independent averages
    • apriori: calculate sample size
    • effect size = standardized difference (Cohen’s \(d\))
      • Determine =>
        • \(d\) = |difference| / SD_pooled
        • \(d\) = |0-2| / 4 = .5
    • \(\alpha\) = .05; two-tailed ( \(\alpha\) /2 → .025 & .975 )
    • \(power = 1-\beta\) = .8
    • allocation ratio N2/N1 = 1 (equally sized groups)

GPower output

 

  • ~ reference example output
    • sample size \((n)\) = 2 x 64 = 128
    • degrees of freedom \((df)\) = 126 (128-2)
    • power ≥ .80 (1- \(\beta\)) = 0.8015
    • distributions: central + non-central
    • critical t = 1.979
      • decision boundary given \(\alpha\) and \(df\)
        qt(.975,126)
    • non centrality parameter ( \(\delta\) ) = 2.8284
      • shift Ha (true) away from Ho (null)
        2/(4*sqrt(2))*sqrt(64)

Exercise with Determine

 

  • For the reference example:
    • change mean values from 0 and 2 to 4 and 6, what changes ?
    • change sd values to 2 for each, what changes ?
      • effect size ?
      • total sample size ?
      • critical t ?
      • non-centrality ?
    • change sd values to 8 for each, what changes ?
    • change sd to 2 and 5.3, or 1 and 5.5,
      how does it compare to 4 and 4 ?

 

Exercise with X-Y Plot

 

  • For the reference example:
    • plot powercurve: power by effect size
    • compare 6 sample sizes: 34 in steps of 34
    • for a range of effect sizes in between .2 and 1.2
    • use \(\alpha\) equal to .05
    • how does power change when doubling the effect size ?

 

  • powercurve → X-Y plot for range of values

GPower protocol

 

  • Summary for future reference or communication
    • central and non-central distributions (figure)
    • protocol of power analysis (text)

 

  • File/Edit save or print file (copy-paste)

 

Non-centrality parameter ( \(\delta\) ), shift Ha from Ho

 

  • non-centrality parameter \(\delta\) combines SIZES
    • assumed effect size ((standardized) signal)
    • conditional on sample size (information)
  • \(\delta\) determines the overlap of Ho and Ha: bigger ncp, less overlap
    • \(\delta\) as violation of Ho → shift (location/shape)
    • power = probability beyond \(\color{green}{cut off}\) at Ho evaluated on Ha
    • push with sample size
  • Ha acts as \(\color{blue}{truth}\): assumed difference of e.g. .5 SD
    • Ha ~ t(ncp=2.828,df)
  • Ho acts as \(\color{red}{benchmark}\): typically no difference, no relation
    • set \(\color{green}{cutoff}\) on Ho ~ t(ncp=0,df) using \(\alpha\)

 

Alternative: divide by N

 

  • Sample sizes determine shape, not location
    • divide by n: sample size ~ standard error
      • peakedness of both distributions
      • often preferred didactically
    • non-centrality parameter: sample size ~ location
      • standardized distributions
      • often preferred in software / algorithms
  • Formulas are the same (DIY: equate the two expressions for the critical value)

 

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)

Type I/II error probability

 

  • Inference (test) based on cut-offs (density → AUC=1)

  • Type I error: incorrectly reject Ho (false positive):

    • cut-off at Ho, error prob. \(\alpha\) controlled
      • conditional on absence of group differences / relations
    • one/two tailed → one/both sides informative ?
  • Type II error: incorrectly fail to reject Ho (false negative):

    • cut-off at Ho, error prob. \(\beta\) obtained from Ha
    • Ha assumed known in a power analysis
  • power = 1 - \(\beta\) = probability correct rejection (true positive)

 

  • Inference versus truth
    • infer: effect exists vs. unsure
    • truth: effect exist vs. does not
             infer=Ha      infer=Ho       sum
  truth=Ho   \(\alpha\)    1- \(\alpha\)    1
  truth=Ha   1- \(\beta\)    \(\beta\)      1

Create plot

 

  • Create a plot
    • X-Y plot for range of values
    • assumes calculated analysis
      • ~ reference example
    • specify Y-axis / X-axis / curves and constant
      • beware of order !
  • Plot sample size (y-axis)
    • by type I error \(\alpha\) (x-axis) → from .01 to .2 in steps of .01
    • for 4 values of power (curves) → from .8 in steps of .05
    • assume effect size (constant) → .5 from reference example

 

  • Notice Table option

Errors: exercises

 

  • Where on the red curve (right) is
    the type II error equal to 4 * type I error ?
  • When smaller effect size (e.g., .25), what changes ?
  • With power on the Y-axis, what is relation type I and II error ?

 

Decide Type I/II error probability

 

  • Reasoning on error probabilities
    • \(\alpha\) & \(\beta\) inversely related
    • which error you want to avoid most ?
      • cheap AIDS test ? → avoid type II
      • heavy cancer treatment ? → avoid type I

 

  • Popular choices for error probabilities
    • \(\alpha\) often in range .01 - .05 → 1/100 - 1/20
    • \(\beta\) often in range .2 to .1 → power = 80% to 90%

 

 

  • Popular rules of thumb
    • 4 * \(\alpha\) ~ \(\beta\)
      • type I error is ~4 times worse !!
  • Probability for both errors always exists

For fun: P(effect exists | test says so)

 

  • Using \(\alpha\), \(\beta\) and power or \(1-\beta\)
    • \(P(infer=Ha|truth=Ha) = power\) → \(P\)(test says there is effect | effect exists)
    • \(P(infer=Ha|truth=Ho) = \alpha\)
    • \(P(infer=Ho|truth=Ha) = \beta\)
    • \(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
      \(= \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
      \(= \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
  • IF the prior probability that the effect exists is very low (eg., \(P(truth=Ha) = .01\)) THEN the probability that the effect exists when the test says so is also low (e.g., \(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\))
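
The same computation as a small R sketch (ppv is a hypothetical helper name):

ppv <- function(power, alpha, prior) {
    power * prior / (power * prior + alpha * (1 - prior))
}
ppv(power = .8, alpha = .05, prior = .01)   # about 0.14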

Control Type I error for Multiple Testing

 

  • Type I error defined on the family of Ho's,
    • manage probability to incorrectly reject
      • any of the null hypotheses !!
      • all of the null hypotheses
  • Typical use: contrasts on factors with more than 2 levels

 

  • Multiple testing
    • inflates type I error \(\alpha\)
      • k tests with probability for an error: \(1-(1- \alpha)^k\)
    • control \(\alpha\) over set of tests
      • change \(\alpha\) → eg., Bonferroni ( \(\alpha/k\))
      • ~ adjust p-values
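
A quick check of the inflation and the Bonferroni correction, assuming k = 3 tests:

k <- 3
1 - (1 - .05)^k   # probability of at least one type I error: about .14
.05 / k           # Bonferroni-corrected alpha per test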

A family of comparisons: exercises

 

  • Comparing the control group and two treatments

  • Pairwise comparisons, typically not an omnibus

    • looked at as a set of t-tests, not as ANOVA
    • requires multiple testing correction, e.g., Bonferroni correction: divide \(\alpha\) by the number of tests
  • use reference example (C = 0, T1 = 2), and extend with group 3 with T2 = 4 (same sd)

    • what sample sizes are necessary for all three pairwise tests combined ?
    • what if biggest difference (C-T2) is ignored, because considered easiest to detect ?
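
A sketch of how the pwr package could handle one such Bonferroni-corrected pairwise test; the smallest standardized difference (d = .5) drives the sample size:

library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05 / 3, power = 0.8, type = "two.sample")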

Control Type I error for Interim Analysis

 

  • Interim analysis
    • analyze and ‘conditionally’ proceed
      • possibility to stop early
    • error spending: stop if
      • significant: control type I error
      • futile: control type II error

 

  • Adjustments of either \(p\) or \(\alpha\)
    • plan in advance
      • O’Brien-Fleming bounds: initially conservative
      • Pocock bounds: constant
      • … or design yourself
    • dependent on information fraction, extract critical values
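
In R, the gsDesign package is one option for such bounds; a minimal sketch, assuming a one-sided two-stage design:

# install.packages('gsDesign')
library(gsDesign)
gsDesign(k = 2, test.type = 1, alpha = 0.025, beta = 0.2, sfu = "OF")      # O'Brien-Fleming
gsDesign(k = 2, test.type = 1, alpha = 0.025, beta = 0.2, sfu = "Pocock")  # Pocock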

Control Type I error for Interim Analysis

 

Pocock (P), O’Brien-Fleming (OF), Haybittle-Peto (HP), and Wang-Tsiatis (WT) correction with \(\delta\) = 0.25

Interim Analysis: exercises

 

  • Detect the difference of the reference example, with 2 peeks at the data
  • Ensure sufficient power; what are the decision rules ?
  • Use App. gsDesigner

Early stopping with high cost: Simon two-stage

 

  • Interim analysis
    • event (reaction to treatment) or not (binomial)
    • stop if not enough evidence, because observations come at a high cost

 

  • To determine a difference in proportion of reactions
    • expecting proportion p0 under no effect, wanting to detect p1
    • search for number n1 at first stage
    • given (different values of) the number (n) in total

Exercise Simon two-stage

 

  • To determine a difference in proportion of reactions
    • assume a proportion of about .1 reacts if the treatment does nothing
    • assume a proportion of about .4 reacts if the treatment works sufficiently
    • for type I and II error .05 and .2
    • use the ph2simon function of the clinfun package
    • hint: ?ph2simon to get the help file
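
A minimal sketch with the exercise values filled in:

# install.packages('clinfun')
library(clinfun)
ph2simon(pu = 0.1, pa = 0.4, ep1 = 0.05, ep2 = 0.2)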

Sample size

 

  • Sample size
    • allocated over groups
    • not necessarily equally
    • specify ratio
  • Most important: all groups sufficiently large
    • unequal group sizes are not a problem
    • specify a ratio only when a group size difference is expected

Sample size: exercises

 

  • For the reference example:
    • compare for allocation ratios 1, .5, 2, 10, 50
    • repeat for effect size 1, and compare
  • think: why is n1 \(\neq\) n2 in the output ?
  • Plot with changed allocation ratio

 

Effect sizes, in principle

 

  • Estimate / guesstimate of the minimal magnitude of interest
     

  • Typically standardized: signal to noise ratio

    • eg., effect size \(d\) = .5 means .5 standard deviations (pooled)
    • eg., effect size \(R^2\) = .3 means .3 explained variance
  • Part of non-centrality (as is sample size) → pushing away Ha

  • ~ Practical relevance

    • NOT statistical significance
      • p-value ~ effect size AND sample size

 

  • 2 main families of effect sizes
    • d-family (differences) and r-family (associations)
    • transform one into other,
      eg., d = .5 → r = .243
      \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\)
      \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\)
      \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
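
A quick check of these transformations in R:

d <- .5
r <- d / sqrt(d^2 + 4)   # 0.243
2 * r / sqrt(1 - r^2)    # back to d = .5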

Effect sizes, in literature

 

  • Cohen, J. (1992).
    A power primer. Psychological Bulletin, 112, 155–159.

  • Cohen, J. (1988).
    Statistical power analysis for the behavioral sciences (2nd ed).

  • Famous Cohen conventions

    • but beware, just rules of thumb

 

Effect sizes, in literature continued

 

  • Ellis, P. D. (2010).
    The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.

  • more than 70 different effect sizes… most of them related

 

Effect sizes, in GPower (Determine)

 

  • Effect sizes are test specific
    • t-test → group means and sd’s
    • one-way ANOVA → variance explained & error
    • regression → sd’s and correlations
    • . . . .

 

  • GPower helps with Determine
    • sliding window
    • one or more effect size specifications

 

Effect sizes, how to determine them in theory

 

  • Choice of effect size matters → justify choice !!

  • Choice of effect size depends on aim of the study

    • realistic (eg., previously observed effect) → replicate
    • important (eg., minimally relevant effect)
    • NOT significant → meaningless, dependent on sample size
  • Choice of effect size dependent on statistical test of interest

    • for independent t-test → means and standard deviations
    • possible alternative: variance explained, eg., 1 versus 16+1

 

  • Examples
    • with one-way ANOVA
      \(f\) = .25 instead of d = .5
    • with linear regression
      \(f^2\) = .0625 instead of d = .5
    • psychometric freeware

Effect sizes, how to determine them in practice

 

  • Experts / patients
    • minimally clinically relevant effect
    • importance
    • use if possible
  • Literature
    • earlier study / systematic review
    • realistic
    • beware of publication bias
  • (Internal) Pilot
    • guesstimate of the dispersion
    • not to obtain effect size → small sample

 

  • Guesstimate uncertainty… (see the sketch below)
    • sd from assumed range
      • assume normal and divide by 6
    • sd for proportions at conservative .5
    • sd from control, assume treatment the same
    • ...
  • Turn to Cohen
    • use if everything else fails
    • rules of thumb
      • eg., .2 - .5 - .8 for Cohen’s d
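
Two of these guesstimates as a sketch, with a hypothetical assumed range of 0 to 24:

(24 - 0) / 6          # sd from an assumed range, assuming normality: 4
sqrt(.5 * (1 - .5))   # conservative sd for a proportion: .5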

Effect sizes, a note about the SD

 

  • For independent t-test → means and standard deviations (sd)
    • sd ~ ‘unexplained’ variance
    • account for important predictors
  • Example: 50% variance unexplained by treatment, explained by predictor
    • a standard deviation of 4 (variance of 16)
    • split into
      • a residual standard deviation of 2.8284 (variance of 8)
      • a standard deviation of 2.8284 (variance of 8) explained by the important predictor

Effect sizes, a note about the SD continued

 

  • not accounting for the important predictor

 

  • accounting for it, sd (around average) reduced

Effect sizes, a note about non-inferiority

 

  • Most often aim to show effect is likely to exist
    • show difference (or relation) different from (typically) zero
    • assuming a particular difference (or relation)
  • Non-inferiority to show effect is not too much worse
    • show difference (or relation) not beyond margin of tolerance
    • assuming a particular difference (or relation),
    • most often but not necessarily 0
    • ! this is inherently one-sided

Effect sizes, notes: exercises

 

  • Use reference example
    • assume half the unexplained variance is accounted for by the predictor, what are the sample sizes ?
    • assume a non-inferiority margin of -2, and no difference, how big is the sample size ?
    • assume treatment to be 2 higher, compare the sample size for superiority (bigger than 0) and non-inferiority with margin of -2
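
For the second bullet, a sketch of the one-sided calculation: with a true difference of 0 and a margin of -2, the shift of 2 is tested one-sided:

power.t.test(delta = 2, sd = 4, sig.level = .05, power = .8,
             alternative = "one.sided")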

 

Effect sizes, test specific but not really

 

  • Regression analysis results in R

Call:
lm(formula = y ~ factor(group), data = .dta)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8910 -2.5825 -0.6262  2.5170  9.2916 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    6.280e-16  5.000e-01   0.000  1.00000   
factor(group)2 2.000e+00  7.071e-01   2.828  0.00544 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4 on 126 degrees of freedom
Multiple R-squared:  0.0597,    Adjusted R-squared:  0.05224 
F-statistic:     8 on 1 and 126 DF,  p-value: 0.005444

 

  • ANOVA results in R
             Df Sum Sq Mean Sq F value  Pr(>F)   
group         1    128     128       8 0.00544 **
Residuals   126   2016      16                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • t-test results in R
[1] 0.00544392
        t 
-2.828427 

Effect sizes, test specific but not really

 

  • Total variance: 16.882
  • Between group variance
    with mean y equal to 0 and 2: 1.008
  • Within group variance
    (error or residual variance): 15.874
  • Effect size \(f\), square root of the ratio
    of between to within group variance: 0.252
  • \(\eta^2\) statistic, ratio of between group to total variance: 0.06
  • Design variance
    with X = 0 and 1: 0.252

 

A relations perspective, regression analysis: exercises

 

  • Differences between groups ~ relation with grouping (categorization)

  • Example: d = .5 ~ r = .243 (note: slope \(\beta = {r*\sigma_y} / {\sigma_x}\))

    • total variance \(\sigma_y\) = residual variance + model variance (2 or 0) → var((2-1),(0-1),(2-1),(0-1),…)
    • design variance \(\sigma_x\) = variance -.5 and .5 for all observations → var((1-.5),(0-.5),(1-.5),(0-.5),…)
  • GPower: regression coefficient (t-test / regression, one group size of slope)

    • what is the slope \(\beta\) and \(\sigma_y\) for reference values, d=.5 (hint:d~r), SD = 4 and \(\sigma_x\) = .5 (1/0)
    • what is the resulting sample size
    • what happens with the slope and sample size if predictor values are taken as 1/-1 instead of 0/1 ?
    • what is \(\sigma_y\) for a slope of 6, \(\sigma_x\) = .5, and SD = 4, would it increase the sample size ?

Explained variance perspective, regression: exercises

 

  • A relation as ratio of between to within group variance ~ explained variance \(R^2\)

  • Different but related effect sizes \(f^2\) = \({R^2/{(1-R^2)}}\)

    • partial \(R^2\) = variance explained by predictor / total variance
    • \(f^2\) = variance explained by predictor / residual variance (Note: 2\(f\) = \(d\) for 2 groups)
  • GPower: regression coefficient (t-test / regression, fixed model single regression coef)

  • Use reference example

    • remember the variances, add them to calculate the effect size
    • calculate sample size ?
    • what if also other predictors in the model ?
    • what if 3 predictors extra reduce residual variance to 50% ?
      • hint: total variance remains constant at 17

Effect sizes, transformations

 

  • \(f^2\), a go-to measure
    • possible for more than 2 groups
    • transformations well described  
  • Typically standardized: signal to noise ratio
    • eg., effect size \(d\) = .5 means .5 standard deviations (pooled)
  • Part of non-centrality (as is sample size) → pushing away Ha
  • ~ Practical relevance
    • NOT statistical significance
      • p-value ~ effect size AND sample size

 

  • 2 main families of effect sizes
    • d-family (differences) and r-family (associations)
    • transform one into other,
      eg., d = .5 → r = .243
      \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\)
      \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\)
      \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
      \(\hspace{20 mm}f^2 = \frac{R^2}{1-R^2}\)

Effect sizes, \(f^2\) approximations

 

Regression

\(f^2 = \frac{R^2}{1-R^2}\)  

Binary Logistic Regression

\(f = \frac{\Phi^{-1}(AUC)}{\sqrt{2}}\)  

Ordinal Logistic Regression

\(f = \frac{\sqrt{3}}{2*\pi} * log\_odds\_ratio\)

Poisson Regression

\(f = \sqrt{\frac{(\lambda_1-\lambda_2)^2}{2*(\lambda_1+\lambda_2)}}\)  

Exponential for waiting times

\(f = 1-{\frac{\lambda_1 * \lambda_2}{(\lambda_1 + \lambda_2)^2}}^{\frac{\mu^2}{\sigma}}\)  

Gamma regression

\(f^2 = {\frac{(\mu_1 + \mu_2)^2}{4*\mu_1*\mu_2}}^{\frac{\mu^2}{\sigma^2}} - 1\)

Effect sizes, use of \(f^2\) with the pwr package

 

  • equivalence
power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = .80)$n * 2
[1] 127.5315
# install.packages('pwr')
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")$n * 2
[1] 127.5312
pwr.f2.test(u = 1, f2 = 0.25^2, sig.level = 0.05, power = 0.8)$v + 2
[1] 127.5312
  • degrees of freedom
    • u, for testing the hypothesis of interest
    • sample size = v + u + 1 observations for the full model (hence the + 2)

 

Effect sizes, use cases of \(f^2\), degrees of freedom

  • To extract a sample size
    • determine the required (denominator) degrees of freedom
    • this requires the number of parameters used to estimate the effect of interest
  • Sample size equals the required degrees of freedom
    • plus one observation for each parameter to estimate
  • E.g., a 2x3x4 factorial design, with primary interest in the 2x3 interaction
    • (2-1)*(3-1) = 2 degrees of freedom for the effect of interest
    • 1 + ((2-1)+(3-1)+(4-1)) + ((2-1)(3-1)+(2-1)(4-1)+(3-1)(4-1)) + ((2-1)(3-1)(4-1)) = 24 parameters (degrees of freedom) in total

Effect sizes, use cases of \(f^2\) with the pwr package

Following literature we expect 8.8 events per cycle for our control group and aim to show that our treatment group would have fewer, with at most 7.3, which implies a rate ratio of about 1.2. The groups should be similar in size, and without further information it is assumed that dispersion is 1. With the typical .05 type I error and 80% power, and for simplicity not including any other predictors, it should be possible to verify the required sample size of 118.

Poisson Regression \(f = \sqrt{\frac{(\lambda_1-\lambda_2)^2}{2*(\lambda_1+\lambda_2)}}\)
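
A sketch of that verification via the \(f\) approximation above; translating the denominator degrees of freedom v into observations is approximate, so the result lands near, not exactly at, the quoted 118:

library(pwr)
l1 <- 8.8; l2 <- 7.3
f <- sqrt((l1 - l2)^2 / (2 * (l1 + l2)))   # about .26
pwr.f2.test(u = 1, f2 = f^2, sig.level = .05, power = .8)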

Effect sizes, use cases of \(f^2\) with the pwr package

To show that the average rating for the treatment group is 10% better, a sample size is calculated based on a test comparing waiting times with an expected number for the iDA group of at most 1.44 compared to 1.6 for the control group. Note that this implies an average of 1.52, and we also found in literature evidence for a standard deviation of .82 of observed events. Verify that about 1002 patients are required when choosing a type I error of .05 and aiming at .8 power.

Exponential for waiting times \(f = 1-{\frac{\lambda_1 * \lambda_2}{(\lambda_1 + \lambda_2)^2}}^{\frac{\mu^2}{\sigma}}\)

Relation sample & effect size, type I & II errors

 

  • Building blocks:
    • sizes: sample ( \(n\) ) and effect ( \(\Delta\) )
    • errors: type I ( \(\alpha\) ), type II ~ power ( \(1-\beta\) )

 

  • GPower → type of power analysis
    • Apriori: \(n\) ~ \(\alpha\), power, \(\Delta\)
    • Post Hoc: power ~ \(\alpha\), \(n\), \(\Delta\)
    • Compromise: power, \(\alpha\) ~ \(\beta\:/\:\alpha\), \(\Delta\), \(n\)
    • Criterion: \(\alpha\) ~ power, \(\Delta\), \(n\)
    • Sensitivity: \(\Delta\) ~ \(\alpha\), power, \(n\)

 

  • each parameter conditional on others
  • one outgoing arrow, three going in  

Type of power analysis: exercises

 

  • For the reference example:
    • how big is the power for 128 observations (n=2x64, \(\alpha\)=.05 and \(\Delta\)=.5)
    • then, assume a power of .8 but with only half the sample size, how does the effect size \(\Delta\) change ?
    • then, set the ratio \(\beta\)/ \(\alpha\) to 4, what are the resulting \(\alpha\) and \(\beta\) ? and what is the critical value ?
    • then, set the effect size to .7 (same ratio), what are the resulting \(\alpha\) and \(\beta\) ? and what is the critical value ?
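
The first bullet as a post-hoc sketch in R:

power.t.test(n = 64, delta = 2, sd = 4, sig.level = .05)$power   # about .80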

Getting your hands dirty

 

 

  • in G*Power
m1=0
m2=2
s1=4
s2=4  
alpha=.025
N=128  
var=.5*s1^2+.5*s2^2  
d=abs(m1-m2)/sqrt(2*var)*sqrt(N/2)  
tc=tinv(1-alpha,N-2)  
power=1-nctcdf(tc,N-2,d)  

 

  • in R
.n <- 64
.df <- 2*.n-2
.ncp <- 2 / (4 * sqrt(2)) * sqrt(.n)
.alpha <- .05
.power <- 1 -
    pt( 
        qt(1-.alpha/2,df=.df,ncp=0), 
        df=.df, ncp=.ncp
    )
round(.power,4)
[1] 0.8015
  • 2 steps
    • qt → quantile on Ho (critical value at \(1-\alpha/2\))
    • pt → probability on Ha below that quantile

GPower, a few more situations as exercise

 

  • dependent instead of independent
  • proportions, dependent and independent
  • non-parametric instead of assuming normality
  • correlations
  • more than 2 groups (compare jointly, pairwise, focused)
  • more than 1 predictor
  • repeated measures

 

  • Look into - GPower manual
    27 tests → effect size, non-centrality parameter and example !!

Dependence between groups: exercises

 

  • When comparing 2 dependent groups (eg., before/after treatment) → account for the correlation

  • Correlations are typically obtained from pilot data or earlier research

  • GPower: matched pairs (t-test / means, difference 2 dependent means)

  • Use reference example,

    • assume a correlation of .5 and compare with reference example for effect size and n
    • how many observations are required if no correlation exists (think then try) ? effect size ?
    • what changes with correlation .875 (think: more or less n, higher or lower effect size) ?
    • what power would be obtained for the reference with sample size 2x64, but correlation .5 ?
    • get the sd of the difference and use it (\(\sqrt{\sigma_a^2 + \sigma_b^2 - 2 * \rho * \sigma_a * \sigma_b}\)) ?
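
The last bullet as a sketch: the sd of the difference feeds a paired calculation (for power.t.test with type = 'paired', sd is the sd of the differences):

sa <- 4; sb <- 4; rho <- .5
sd_diff <- sqrt(sa^2 + sb^2 - 2 * rho * sa * sb)   # 4 when rho = .5
power.t.test(delta = 2, sd = sd_diff, sig.level = .05, power = .8, type = "paired")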

Dependence: a note about the correlations

 

  • a noticeable correlation

 

  • clear correlation

Proportions: exercises

 

  • Test difference two independent proportions → [0..1]

  • Simplest version of a logistic regression on two groups

  • GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

  • Testing whether two proportions are the same

    • for odds ratio 3 and p2 = .50, what is p1 ? and for odds ratio 1/3 and p2 = .75 ?
    • what is the sample size to detect a difference for both situations ?
    • for odds ratio 3 and p2 = .75, determine p1 and sample size, how does it compare with before ?
    • compare sample size for a .15 difference, at p1=.5 ?
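
A small sketch for converting an odds ratio and p2 into p1:

or <- 3; p2 <- .5
odds1 <- or * p2 / (1 - p2)
odds1 / (1 + odds1)   # p1 = .75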

Proportions: exercises

 

  • GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

  • Plot 5 power curves

    • odds ratio = 2, with p2 reference probability .6
    • proportions .5 to 1
    • 1 curve per sample size: 328, 428, 528, … (steps of 100)
    • type I error .05
  • Explain the curve minimum; how does it relate to sample size ?

  • Repeat for one-tailed, difference ?

Dependent proportions: exercises

 

  • Test difference two dependent proportions → [0..1] categorical shift

    • for two categories, McNemar test: compare \(p_{12}\) with \(p_{21}\)
    • information from changes only → discordant pairs
    • effect size as odds ratio → ratio of discordance
  • GPower: McNemar test (exact / proportions, difference 2 dependent proportions)

  • Testing whether proportions of discordance are the same

    • assume odds ratio 2, .25 discordant, what is sample size
    • for discordant .5 and 1, what are \(p_{12}\) and \(p_{21}\) and the sample sizes ?
    • for odds ratio .5 and 1 (prop discordant = .25), what are sample sizes ?
    • repeat for third alpha option first scenario, what happens ?

Non-parametric distribution: exercises

 

  • When non-normally distributed residuals are expected and cannot be circumvented (eg., by transformation)

  • Only considers ranks or uses permutations → price is efficiency and flexibility

  • Requires a parent distribution (alternative hypothesis), ‘min ARE’ should be default

  • GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)

  • Use reference example

    • use a normal parent distribution, how much efficiency is lost ?
    • use ‘min ARE’ as parent distribution, how much efficiency is lost ?

A variance ratio perspective, ANOVA: exercises

 

  • Multiple groups, at least two differ → not one effect size d

  • F-test statistic & effect size f, ratio of variances \(\sigma_{between}^2 / \sigma_{within}^2\)

  • GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)

  • use reference example

    • what is the sample size, for a difference of 2, each 64 observations ?
    • why are the ncp and critical value different ?
    • how does size matter ? (play with it)
  • include a third group (group 1 = 0, group 2 = 2)

    • for third group with mean 0, 2 or 4 (figure), what are the sample sizes ?
    • repeat for between variance 2.67 (balanced: 0, 2, 4) and within variance 16 ?

 

Multiple groups: contrasts: exercises

 

  • Contrasts are linear combinations → planned comparison
    • eg., T1-C: \(1 * T1 -1 * C \neq 0\)
    • eg., (T1+T2)/2-C: \(.5 * (1 * T1 + 1 * T2) -1 * C \neq 0\)
  • Effect sizes for planned comparisons must be calculated
    • contrast specific between variance \(\sigma_{contrast}^2\)
    • \(f\) ~ variance ratio between / within
    • Note: \(f = d/2\) for 2 groups
  • Obtain effect sizes for contrasts (assume equally sized groups for convenience; see the sketch below)
    • T1-C, T2-C, (T1+T2)/2-C
    • each contrast requires 1 degree of freedom

 

  • Parameters
    • group means \(\mu_i\)
    • pre-specified coefficients \(c_i\)
    • sample sizes \(n_i\)
    • total sample size \(N\)
    • k levels


\(\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}\)  

\(f = \sqrt{\frac{\sigma_{contrast}^2}{\sigma_{error}^2}}\)
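
The formulas above as a small R sketch; note that for a pairwise contrast only the groups involved enter the calculation:

contrast_f <- function(mu, cc, n, sd_error) {
    N <- sum(n)                                                # total sample size
    s_contrast <- abs(sum(mu * cc)) / sqrt(N * sum(cc^2 / n))
    s_contrast / sd_error
}
contrast_f(c(0, 2),    c(-1, 1),      rep(64, 2), 4)   # T1-C        -> .25
contrast_f(c(0, 4),    c(-1, 1),      rep(64, 2), 4)   # T2-C        -> .50
contrast_f(c(0, 2, 4), c(-1, .5, .5), rep(64, 3), 4)   # (T1+T2)/2-C -> .3536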

Multiple groups: contrasts: exercises continued

 

  • GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)

  • For the reference example extended, with contrasts \(f_{T1-C}\)=.25, \(f_{T2-C}\)=.50 and \(f_{(T2+T1)/2-C}\)=.3535

    • what are the sample sizes for either contrast 1 or contrast 2 ?
    • what are the sample sizes for both contrast 1 and contrast 2 combined ?
    • if taking that sample size, what will be the power for T1-T2 ?
    • what is the sample size for contrast 3 ?

Repeated measures within: exercises

 

  • When the factor is time: repeated measures

    • relates to dependent t-test for multiple measurements (>2)
  • Beware: effect sizes obtained from literature may/may not include correlation

    • Options: as in GPower 3, or SPSS, …
  • GPower: repeated measures (F-test / Means, repeated measures within factors)

  • For reference example with effect size f = .25 (1/16 explained versus unexplained)

    • mimic independent t-test
    • mimic dependent t-test, correlation .5 !
    • what does an increase in correlation imply, why ?
    • for 4, or 8 repeated measurements (cor=.5), what changes ?
    • for 4 groups (4 measures, cor=.5), what changes ?

Multiple factors

 

  • Multiple main effects and possibly interaction effects
    • main effect:
      • difference B1-B2 over all conditions of treatment (C,T1,T2)
      • difference C-T1-T2 over all conditions of type (B1,B2)
    • interaction effect:
      • effect of treatment (C-T1-T2) different per level of type (B1 or B2), or vice versa
    • note: numerator degrees of freedom
      • main effect: number of levels - 1, i.e., 2 for treatment and 1 for type
      • interaction: (nr1-1)*(nr2-1), i.e., 2 (= 2 x 1)

 

 

  • effect sizes
    • \(\eta^2\) = \(f^2 / (1+f^2)\)
    • note: \(f = d/2\)
      for two groups

Multiple factors effect sizes: exercises

 

  • Get effect size: in-house shiny app

  • Use reference example for treatment (C-T1-T2) and add type (B1-B2)

    • determine the effect size \(\eta^2\)
      for averages 0, 2, and if necessary use 4 or 6
      • treatment effect C-T1
      • treatment effect C-T1-T2 - no type effect B1-B2
      • no treatment effect C-T1-T2 - type effect B1-B2
      • treatment effect C-T1-T2 within B1, not B2 - with interaction
      • treatment effect C-T1-T2 - type effect B1-B2 without interaction

Multiple factors: exercises

 

  • GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)

  • Use reference example for treatment (C-T1-T2) and add type (B1-B2)
    what are the sample sizes

    • C-T1 with \(\eta^2\)=0.0588
    • C-T1-T2 (no B1-B2) with \(\eta^2\)=0.1429
    • B1-B2 (no C-T1-T2) with \(\eta^2\)=0.0588
    • interaction C-T1-T2 x no B1-B2 with \(\eta^2\)=0.04
    • additive C-T1-T2 + B1-B2
      • with \(\eta^2\)=0.1429 for C-T1-T2
      • with \(\eta^2\)=0.0588 for B1-B2

Repeated measures between: exercises

 

  • When repeated measures are obtained for different groups

  • GPower: repeated measures (F-test / Means, repeated measures between factors)

  • For reference example

    • relate to independent t-test, 2 uncorrelated measurements
    • mimic independent t-test, 2 almost perfectly correlated measurements
    • with a correlation .5, what changes ?
    • what does increase in correlation imply, why ?
    • for 3 groups extended reference, cor=.5, what changes ?
    • for 4, or 8 repeated measurements (cor=.5), what changes ?

Interaction within x between effect sizes: exercises

 

  • When differences between groups depend on time

  • Get effect size: in-house shiny app

  • Use reference example for control-treatment (C-T1), and 2 or 4 time points

    • determine the effect size interaction \(\eta^2\) with r=0 and r=.5
      • treatment effect C-T1, both C and T1 increase with 2
      • treatment effect C-T1, only T1 increase with 2
      • if previous situation repeated twice, T1=2,4,2,4
      • if previous situation repeated twice, but reversed, T1=2,4,4,2

Interaction within x between: exercises

 

  • Options: different effect sizes are possible

    • is correlation already part of effect size ?
    • often it is when extracted from literature
  • GPower: repeated measures (F-test / Means, repeated measures within-between factors)

  • Use effect sizes previous exercise, part 2 and 3

    • determine the sample size, assuming a correlation of .5
      • for effect size assuming no correlation is included, include .5 correlation
      • for effect size assuming correlation is included, include correlation of 0

Correlations: exercises

 

  • Test difference of two independent correlations → [-1..1]

  • Use Fisher Z transformations to normalize

    • z = .5 * log( \(\frac{1+r}{1-r}\) )
    • q = z1-z2
  • Correlations are easier to differentiate the more they differ from 0

  • GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s

  • Testing whether two correlations are the same

    • with correlation coefficients .7844 and .5, what are the effect & sample sizes ?
    • with the same difference, but stronger correlations, eg., .9844 and .7, what changes ?
    • with the same difference, but weaker correlations, eg., .1 and .3844, what changes ?
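
A sketch of the effect size q and a normal-approximation sample size for the first bullet (atanh is the Fisher z transform):

q <- atanh(.7844) - atanh(.5)   # z1 - z2, about .51
n <- 3 + 2 * ((qnorm(.975) + qnorm(.8)) / q)^2
ceiling(n)                      # about 64 per group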

Not included

 

  • GPower is not always sufficient, and neither is R

  • Tests too difficult to specify in GPower or R

    • statistics / parameter values difficult to guesstimate
    • manuals not always sufficient
  • Tests not included in GPower or R

    • eg., survival analysis in GPower
    • many tools online, most dedicated to a particular model
  • Tests without formula

    • simulation may be the only tool

 

  • simulation in theory is always possible
    • iterate many times:
      • generate: simulated outcome
        introduce randomness ~ standard deviation
      • analyze: estimate parameters + test
    • count proportion of rejections (~power)
    • determine confidence bounds (~accuracy)
    • select sample size
      • with appropriate proportion (test)
      • with appropriate interval (estimation)

Simulation example: t-test

 

  • simulation in practice
    • reference example: 0-2 (4), 64x2
    • replicate 10000 times
      • generate:
        dta$y <- dta$y+rnorm(length(dta$y),0,4)
      • analyze:
        res <- t.test(data=dta,y~X)
    • count proportion rejection:
      mean(sims['p.val',] < .05)

 

 

gr <- rep(c('T','C'),64)
y <- ifelse(gr=='C',0,2)
dta <- data.frame(y=y,X=gr)
cutoff <- qt(.025,nrow(dta)-2)              # critical t at df = N-2
 
my_sim_function <- function(){
    dta$y <- dta$y+rnorm(length(dta$y),0,4)     # generate (with sd=4)
    res <- t.test(data=dta,y~X)                 # analyze
    c(res$estimate %*% c(-1,1),res$statistic,res$p.value)
}
sims <- replicate(10000,my_sim_function())      # many iterations
dimnames(sims)[[1]] <- c('diff','t.stat','p.val')

mean(sims['p.val',] < .05)  # p-values  0.8029
mean(sims['t.stat',] < cutoff)  # t-statistics 0.8029
mean(sims['diff',] > sd(sims['diff',])*cutoff*(-1)) # differences 0.8024

Focus / simplify

 

  • Complex statistical models
    • simulate BUT it requires programming and a thorough understanding of the model
    • alternative: focus on essential elements → simplify the aim
  • Sample size calculations (design) for simpler research aim
    • not necessarily equivalent to final statistical testing / estimation
    • requires justification to convince yourself and/or reviewers
      • the study is already successful if the simpler aim is satisfied
      • the ignored part is not too costly

 

  • Example:
    • statistics:
      group difference evolution 4 repeated measurements → mixed model
    • focus:
      difference between treatment and control at the last time point is essential → t-test
    • argument: first 3 measurements low cost, interesting to see change

Conclusion

 

  • Sample size calculation is a design issue, not a statistical one

  • It typically focuses on ensuring sufficient data to result in sufficiently strong statistical inference

  • Sample size depends on effect size, type I & II errors, and the statistical test of interest

  • GPower deals with not too complex models

    • more complex models imply more complex specifications
    • simplify using a focus, if justifiable → then GPower can get you a long way
  • R is more flexible, but may require a bit more digging and work

  • Most important: turn it into a good story…

Questions ?

 

Thank you for your attention.

 

 

Methodological and statistical support to help make a difference

  • At SQUARE we meet to provide complementary support in statistics and methodology (qualitative and quantitative) to our research community, for individual researchers and research groups, in order to get the best out of their research.
  • SQUARE aims to further enhance the quality of both the research and how it is communicated.

Contact

  • find the SQUARE team and information on our service at square.research.vub.be
  • for feedback on this workshop: wilfried.cools@vub.be