Sample size calculation

with exercises in GPower

Wilfried Cools & Tim Pauwels

April 03, 2024

Workshop

  • to introduce the key ideas
  • to help you see the bigger picture
  • to offer first practical experience: GPower

 

  • Target audience
    • primarily the research community at VUB / UZ Brussel
  • History
    • started in 2019, with Sven Van Laere
    • almost yearly, gradually refined

Feedback

  • Help us improve this document

    wilfried.cools@vub.be

at SQUARE

  • Ask us for help

    we offer consultancy
    on methodology, statistics, and their communication

    square.research.vub.be

Program

 

  • Part I:
    understand the reasoning
    • introduce building blocks
    • highlight how they relate
    • focus on t-test only
    • a few exercises in GPower

 

  • Part II:
    explore more complex situations
    • go beyond the t-test
    • simple but common
    • many exercises in GPower
    • not one formula for all

Sample size calculation: demarcation

 

  • How many observations will be sufficient ?
    • avoid too many, because observations typically imply a cost
      • money / time → limited resources
      • risk / harm / patient burden → ethical constraints
    • have enough
  • To offer strong enough statistical inference !
    • linked to standard error
      • testing → power [probability to detect an effect]
      • estimation → accuracy [width of the confidence interval]

Sample size calculation: a difficult design issue

 

  • Part of the design of a study
    • before data collection
    • requires understanding:
      • parameters: effect size of interest
      • data: future data properties
      • model: relation between the outcome and the conditions under which it is observed
    • decision based on (highly) incomplete information, thus based on (strong) assumptions

 

  • Not always possible nor meaningful !
    • confirmatory studies easier than exploratory
    • experiments (control) easier than observational
    • not obvious for complex models
      → simulation
    • not obvious for predictive models
      → no standard error

Sample size calculation: if not possible

 

  • If not possible in a meaningful way,
    use an alternative justification

    • common practice
    • feasibility

  • Avoid retrospective power analyses
    → power calculations are meaningful for a future study only

    Hoenig, J., & Heisey, D. (2001). The Abuse of Power:
    The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.

 

  • Because the justification is less strong,
    put more weight on non-statistical justifications
    • low cost
    • importance

Simple example confirmatory experiment

 

  • Does my radiotherapy work ?
    • aim: show my radiotherapy reduces tumor size
    • method: compare treatment and control group
      • tumor induced in N mice
      • random assignment of mice to groups
    • data: tumor sizes after 20 days
    • analysis: unpaired t-test

 

  • Sample size question
    • how many mice are required
    • to show the treatment reduces tumor size more than the control
    • assuming an effect size:
      • my radiotherapy works if the treatment group shows at least 25% more reduction
      • i.e., 4mm (control) versus 5mm (treatment); the control shows 20% less reduction
    • with 80% probability (+ type I error probability .05)

Reference example

 

  • A priori specifications
    • intend to perform a statistical test
    • comparing 2 equally sized groups
    • to detect a difference of at least 2
    • assuming an uncertainty (SD) of 4 around each average
    • which results in an effect size of .5
    • evaluated on a Student t-distribution
    • allowing for a type I error prob. of .05 \((\alpha)\)
    • allowing for a type II error prob. of .2 \((\beta)\)
  • Sample size
    conditional on specifications being true

Note

  • This reference example is used throughout the workshop !!

Formula you could use

 

  • Specifications for this particular case:
    • sample size (n → ?)
    • difference ( \(\Delta\) =signal → 2)
    • uncertainty ( \(\sigma\) =noise → 4)
    • type I error ( \(\alpha\) = .05, so \(Z_{\alpha/2}\) → -1.96)
    • type II error ( \(\beta\) = .2, so \(Z_\beta\) → -0.84)
  • Sample size = 2 groups x 63 observations = 126

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{ \Delta^2} = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79\)

  • Formulas are test- and statistic-specific, but the logic remains the same
  • This and other formulas are implemented in various tools; our focus: GPower
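
  • As a cross-check, a minimal sketch in base R (the formula above is the normal approximation; power.t.test uses the exact noncentral t and gives the slightly larger n that GPower reports):

    alpha <- 0.05; beta <- 0.20
    Delta <- 2                                # difference (signal)
    sigma <- 4                                # standard deviation (noise)
    z     <- qnorm(alpha / 2) + qnorm(beta)   # -1.96 + -0.84 = -2.80
    z^2 * 2 * sigma^2 / Delta^2               # 62.79 per group → 2 x 63 = 126
    power.t.test(delta = Delta, sd = sigma, sig.level = alpha, power = 1 - beta)
    # n ≈ 63.8 per group → 2 x 64 = 128 (cf. GPower)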

GPower: the building blocks in action

 

  • 4 components and 2 distributions
    • distributions: Ho & Ha ~ test-dependent shape
    • SIZES: effect size & sample size ~ shift Ha
    • ERRORS:
      • Type I error ( \(\alpha\) ) defined on distribution Ho
      • Type II error ( \(\beta\) ) evaluated on distribution Ha
  • Calculate sample size based on effect size, type I / II error



GPower: a useful tool

 

  • Use it
    • implements wide variety of tests
    • free @ http://www.gpower.hhu.de/
    • popular and well established
    • implements various visualizations
    • documented fairly well
  • Maybe not use it
    • not all tests are included, only the simpler ones
    • not without flaws
    • other tools exist (some paid)

 

GPower statistical tests

 

  • Test family - statistical tests [in window]
    • Exact Tests (8)
    • \(t\)-tests (11) → reference
    • \(z\)-tests (2)
    • \(\chi^2\)-tests (7)
    • \(F\)-tests (16)
  • Focus on the density functions

 

  • Tests [in menu]
    • correlation & regression (15)
    • means (19) → reference
    • proportions (8)
    • variances (2)
  • Focus on the type of parameters

GPower input

 

  • ~ reference example input
    • t-test: difference between two independent means
    • a priori: calculate sample size
    • effect size = standardized difference (Cohen’s \(d\))
      • Determine =>
        • \(d\) = |difference| / SD_pooled
        • \(d\) = |0-2| / 4 = .5
    • \(\alpha\) = .05; two-tailed ( \(\alpha\) /2 → .025 & .975 )
    • \(power = 1-\beta\) = .8
    • allocation ratio N2/N1 = 1 (equally sized groups)

GPower output

 

  • ~ reference example output
    • sample size \((n)\) = 2 x 64 = 128
    • degrees of freedom \((df)\) = 126 (128-2)
    • power (1- \(\beta\)) = 0.8015 (requested ≥ .80)
    • distributions: central + non-central
    • critical t = 1.979
      • decision boundary given \(\alpha\) and \(df\)
        qt(.975,126)
    • non centrality parameter ( \(\delta\) ) = 2.8284
      • shift Ha (true) away from Ho (null)
        2/(4*sqrt(2))*sqrt(64)
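
  • The full output can be reproduced with a few lines of base R (a minimal sketch for the reference example):

    n  <- 64                                 # sample size per group
    df <- 2 * n - 2                          # 126 degrees of freedom
    qt(.975, df)                             # critical t = 1.979
    ncp <- 2 / (4 * sqrt(2)) * sqrt(n)       # non-centrality parameter = 2.8284
    1 - pt(qt(.975, df), df, ncp) +
      pt(qt(.025, df), df, ncp)              # achieved power = 0.8015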

GPower protocol

 

  • Summary for future reference or communication
    • central and non-central distributions (figure)
    • protocol of power analysis (text)

 

  • File / Edit: save or print the file (or copy-paste)

 

Non-centrality parameter ( \(\delta\) ), shift Ha from Ho

 

  • non-centrality parameter \(\delta\) combines
    • assumed effect size ((standardized) signal)
    • conditional on sample size (information)
  • \(\delta\) determines the overlap of Ho and Ha: bigger ncp, less overlap
    • \(\delta\) as violation of Ho → shift (location/shape)
    • power = probability beyond \(\color{green}{cut off}\) at Ho evaluated on Ha
    • push with sample size
  • Ha acts as \(\color{blue}{truth}\): assumed difference of e.g. .5 SD
    • Ha ~ t(ncp=2.828,df)
  • Ho acts as \(\color{red}{benchmark}\): typically no difference, no relation
    • set \(\color{green}{cutoff}\) on Ho ~ t(ncp=0,df) using \(\alpha\)
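
  • A base-R sketch of this logic: the cut-off is set on Ho (central t), power is the area of Ha (non-central t) beyond it, and a larger sample size pushes Ha further away:

    pow <- function(n, d = .5, alpha = .05) {    # n per group, standardized effect d
      df  <- 2 * n - 2
      ncp <- d * sqrt(n / 2)                     # shift of Ha away from Ho
      cut <- qt(1 - alpha / 2, df)               # cut-off set on Ho ~ t(ncp = 0, df)
      1 - pt(cut, df, ncp) + pt(-cut, df, ncp)   # power, evaluated on Ha ~ t(ncp, df)
    }
    sapply(c(16, 32, 64, 128), pow)              # power increases with n (≈ .80 at n = 64)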

 

Alternative: divide by N

 

  • Sample sizes determine shape, not location
    • divide by n: sample size ~ standard error
      • peakedness of both distributions
      • often preferred didactically
    • non-centrality parameter: sample size ~ location
      • standardized distributions
      • often preferred in software / algorithms
  • Formulas are the same (DIY: two equations for the critical value)

 

 

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)
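
  • The DIY hint spelled out (a sketch of the derivation, normal approximation, standard error \(SE = \sigma\sqrt{2/n}\)): write the critical value \(c\) once on Ho and once on Ha, equate, and solve for \(n\)

\(c = Z_{1-\alpha/2} * SE \hspace{10 mm} \text{(cut-off set on Ho)}\)

\(c = \Delta + Z_{\beta} * SE \hspace{10 mm} \text{(power } 1-\beta \text{ reached on Ha)}\)

\((Z_{1-\alpha/2} - Z_{\beta}) * \sigma\sqrt{2/n} = \Delta \Rightarrow n = \frac{(Z_{1-\alpha/2} - Z_{\beta})^2 * 2 * \sigma^2}{\Delta^2}\)

  • identical to the formula above, since \((Z_{1-\alpha/2} - Z_{\beta})^2 = (Z_{\alpha/2} + Z_{\beta})^2\) (both \(Z\)’s there are negative)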

Type I/II error probability

 

  • Inference (test) based on cut-offs (density → AUC=1)

  • Type I error: incorrectly reject Ho (false positive):

    • cut-off at Ho, error prob. \(\alpha\) controlled
    • one/two-tailed → one/both sides informative ?
  • Type II error: incorrectly fail to reject Ho (false negative):

    • cut-off at Ho, error prob. \(\beta\) obtained from Ha
    • Ha assumed known in a power analysis
  • power = 1 - \(\beta\) = probability correct rejection (true positive)

 

  • Inference versus truth
    • infer: effect exists vs. unsure
    • truth: effect exists vs. does not
               infer=Ha      infer=Ho       sum
  truth=Ho     \(\alpha\)    \(1-\alpha\)   1
  truth=Ha     \(1-\beta\)   \(\beta\)      1

Create plot

 

  • Create a plot
    • X-Y plot for a range of values
    • assumes a calculated analysis (in the main window)
      • ~ reference example
    • specify Y-axis / X-axis / curves and constant
      • beware of order !
  • Plot sample size (y-axis)
    • by type I error \(\alpha\) (x-axis) → from .01 to .2 in steps of .01
    • for 4 values of power (curves) → starting at .8 in steps of .05
    • assume effect size (constant) → .5 from reference example
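
  • The same curves can be reproduced outside GPower; a base-R sketch (power.t.test returns n per group, reference example values):

    alphas <- seq(.01, .20, by = .01)            # x-axis: type I error
    powers <- seq(.80, .95, by = .05)            # one curve per power level
    n <- sapply(powers, function(p)              # y-axis: sample size per group
      sapply(alphas, function(a)
        power.t.test(delta = 2, sd = 4, sig.level = a, power = p)$n))
    matplot(alphas, n, type = "l", lty = 1,
            xlab = expression(alpha), ylab = "sample size per group")
    legend("topright", legend = paste("power =", powers), lty = 1, col = 1:4)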

 

  • Notice Table option

Errors: exercises

 

  • Where on the red curve (right) is
    the type II error equal to 4 * type I error ?
  • With a smaller effect size (e.g., .25), what changes ?

 

Errors: exercises continued

 

  • Plot power instead of sample size

    • with 4 power curves
      for sample sizes starting at 32 in steps of 32
  • What is the relation between type I and type II error ?

  • What would be the difference between the curves for \(\alpha\) = 0 ?

 

Decide Type I/II error probability

 

  • Reasoning on error probabilities
    • \(\alpha\) & \(\beta\) inversely related
    • which error do you want to avoid most ?
      • cheap AIDS test ? → avoid type II
      • heavy cancer treatment ? → avoid type I

 

  • Popular choices for error probabilities
    • \(\alpha\) often in range .01 - .05 → 1/100 - 1/20
    • \(\beta\) often in range .2 to .1 → power = 80% to 90%

 

 

  • Popular rules of thumb
    • 4 * \(\alpha\) ~ \(\beta\)
      • a type I error is treated as ~4 times worse !!
  • A non-zero probability of both errors always exists

Control Type I error

 

  • Defined on the Ho, known
    • assumes only sampling variability

 

  • Multiple testing
    • control \(\alpha\) over the set of tests
      • each test adds an \(\alpha\) error risk
    • performing several tests inflates the type I error \(\alpha\)
      • k tests give a probability of at least one error of \(1-(1-\alpha)^k\) (see sketch below)
    • correct → e.g., Bonferroni ( \(\alpha/k\))
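
  • A quick base-R sketch of the inflation and the Bonferroni correction:

    alpha <- .05
    k <- 1:10                                # number of tests
    1 - (1 - alpha)^k                        # P(at least one type I error): .05 … .40
    alpha / k                                # Bonferroni-corrected per-test alpha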

 

  • Interim analysis
    • analyze and ‘conditionally’ proceed
    • type of multiple testing
    • plan in advance
    • adjustments of either \(p\) or \(\alpha\)
    • alpha spending, e.g., O’Brien-Fleming bounds
    • not in GPower
      → our own simulation tool (Susanne Blotwijk)
    • determine boundaries with PASS, R (ldbounds), …

For fun: P(effect exists | test says so)

 

  • Using \(\alpha\), \(\beta\) and power or \(1-\beta\)
    • \(P(infer=Ha|truth=Ha) = power\) → \(P\)(test says there is an effect | effect exists)
    • \(P(infer=Ha|truth=Ho) = \alpha\)
    • \(P(infer=Ho|truth=Ha) = \beta\)
    • \(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
    • \(\hspace{10 mm} = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
    • \(\hspace{10 mm} = \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
  • IF the prior probability that Ha is true is very low (e.g., \(P(truth=Ha) = .01\)) THEN the probability the effect exists when the test says so is also low (e.g., \(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\))
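
  • The same arithmetic as a base-R sketch:

    power <- .80; alpha <- .05
    p_Ha  <- .01                                  # prior P(truth = Ha)
    p_pos <- power * p_Ha + alpha * (1 - p_Ha)    # P(infer = Ha)
    power * p_Ha / p_pos                          # P(truth = Ha | infer = Ha) ≈ .14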

Effect sizes, in principle

 

  • Estimate / guesstimate of the minimal magnitude of interest
     

  • Typically standardized: signal to noise ratio

    • eg., effect size \(d\) = .5 means .5 standard deviations (pooled)
  • Part of the non-centrality (as is sample size) → pushing Ha away from Ho

  • ~ Practical relevance

    • NOT statistical significance
      • p-value ~ effect size AND sample size

 

  • 2 main families of effect sizes
    • d-family (differences) and r-family (associations)
    • transform one into the other,
      e.g., d = .5 → r = .243
      \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\)
      \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\)
      \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
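
  • The conversions from the slide as a base-R sketch:

    d_to_r  <- function(d)  d / sqrt(d^2 + 4)        # d-family → r-family
    r_to_d  <- function(r)  2 * r / sqrt(1 - r^2)    # r-family → d-family
    or_to_d <- function(or) log(or) * sqrt(3) / pi   # odds ratio → d
    d_to_r(.5)                                       # 0.243
    r_to_d(.243)                                     # ≈ .5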

Effect sizes, in literature

 

  • Cohen, J. (1992).
    A power primer. Psychological Bulletin, 112, 155–159.

  • Cohen, J. (1988).
    Statistical power analysis for the behavioral sciences (2nd ed.).

  • Famous Cohen conventions

    • but beware, just rules of thumb