with exercises in GPower and R
April 14, 2025
Feedback
Help us improve this document
wilfried.cools@vub.be
at SQUARE
Ask us for help
we offer consultancy
on methodology, statistics and its communication
square.research.vub.be
Not always possible nor meaningful !
Avoid retrospective power analyses
→ OK for future study only
Hoenig, J., & Heisey, D. (2001). The Abuse of Power:
The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.
Difference detected approximately 80% of the time.
Note: with \(\Delta\) = 2, \(\sigma\) = 4, \(\alpha\) = .05 (so \(Z_{\alpha/2}\) → -1.96) and \(\beta\) = .2 (so \(Z_\beta\) → -0.842):
\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2} = \frac{(-1.96-0.842)^2 * 2 * 4^2}{2^2} = 62.79\)
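This per-group sample size can be verified in a few lines of R, a sketch using the normal-approximation formula above:

```r
# per-group n from the normal-approximation formula (Delta = 2, sigma = 4)
alpha <- .05; beta <- .2; Delta <- 2; sigma <- 4
n <- (qnorm(alpha / 2) + qnorm(beta))^2 * 2 * sigma^2 / Delta^2
round(n, 2)  # 62.79
```

Note that qnorm(alpha / 2) and qnorm(beta) return the negative quantiles -1.96 and -0.842; squaring their sum makes the sign irrelevant.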
GPower
and pwr
package in R
To confirm that the treatment results in at least 500 calories less, compared to the control,
knowing that the standard deviation of measured calories in each group should be about 1000,
how many observations are required to show that difference,
allowing for a .01 type I error and .9 power?
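One way to check this exercise in R is a sketch with the pwr package, using d = 500/1000 = .5 and assuming a two-sided test:

```r
# sketch: per-group n via the pwr package (d = 500/1000; two-sided test assumed)
library(pwr)
d <- 500 / 1000
n <- pwr.t.test(d = d, sig.level = .01, power = .9, type = "two.sample")$n
ceiling(n)  # per-group n
```

The noncentral-t solution lands slightly above the normal-approximation value of about 119 per group.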
reference
~ reference example input
Determine =>
d = |0 - 2| / 4 = .5
\(\alpha\) = .05; two-tailed
power = .8
allocation ratio = 1 (equally sized groups)
~ reference example output
128
critical value: qt(.975, 126) → 1.979
ncp pushes Ha (true) away from Ho (null): 2/(4*sqrt(2))*sqrt(64) = 2.828
~ reference example:
the non-centrality parameter (ncp) pushes Ha (true) away from Ho (null)
effect size ((standardized) signal) and sample size (information) both enter the ncp
Ho and Ha: bigger ncp → less overlap
Ho → Ha: shift (location/shape); power is evaluated on Ha, pushed away with sample size
Ha acts as \(\color{blue}{truth}\): assumed difference of e.g. .5 SD → Ha ~ t(ncp=2.828, df)
Ho acts as \(\color{red}{benchmark}\): typically no difference, no relation → Ho ~ t(ncp=0, df), evaluated using \(\alpha\)
\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{\Delta^2}\)
Inference (test) based on cut-offs (density → AUC=1)
Type I error: incorrectly reject Ho (false positive): truth is Ho, error prob. \(\alpha\) controlled
Type II error: incorrectly fail to reject Ho (false negative): truth is Ha, error prob. \(\beta\) obtained from Ha
Ha assumed known in a power analysis; power = 1 - \(\beta\) = probability of a correct rejection (true positive)
| | infer=Ha | infer=Ho | sum |
| truth=Ho | \(\alpha\) | 1-\(\alpha\) | 1 |
| truth=Ha | 1-\(\beta\) | \(\beta\) | 1 |
X-Y plot for range of values
~ reference example: power for a range of Ho's
Comparing the control group and two treatments
Pairwise comparisons, typically not an omnibus
use reference example
(C = 0, T1 = 2), and extend with group 3 with T2 = 4 (same sd)
Pocock (P), O’Brien-Fleming (OF), Haybittle-Peto (HP), and Wang-Tsiatis (WT) correction with \(\delta\) = 0.25
reference, with 2 peaks at the data
reference example:
Estimate / guesstimate of the minimal magnitude of interest
Typically standardized: signal to noise ratio
Part of non-centrality (as is sample size) → pushing away Ha
~ Practical relevance
d-family
(differences) and r-family
(associations)
Cohen, J. (1992).
A power primer. Psychological Bulletin, 112, 155–159.
Cohen, J. (1988).
Statistical power analysis for the behavioral sciences (2nd ed).
Famous Cohen conventions
Ellis, P. D. (2010).
The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.
more than 70 different effect sizes… most of them related
Determine
Choice of effect size matters → justify the choice !!
Choice of effect size depends on the aim of the study
Choice of effect size depends on the statistical test of interest
...
reference example
Call:
lm(formula = y ~ factor(group), data = .dta)
Residuals:
Min 1Q Median 3Q Max
-9.8910 -2.5825 -0.6262 2.5170 9.2916
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.280e-16 5.000e-01 0.000 1.00000
factor(group)2 2.000e+00 7.071e-01 2.828 0.00544 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4 on 126 degrees of freedom
Multiple R-squared: 0.0597, Adjusted R-squared: 0.05224
F-statistic: 8 on 1 and 126 DF, p-value: 0.005444
Df Sum Sq Mean Sq F value Pr(>F)
group 1 128 128 8 0.00544 **
Residuals 126 2016 16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] 0.00544392
t
-2.828427
Differences between groups ~ relation with grouping (categorization)
Example: d = .5 ~ r = .243 (note: slope \(\beta = {r*\sigma_y} / {\sigma_x}\))
GPower: regression coefficient (t-test / regression, one group size of slope)
A relation as ratio between and within group variance ~ explained variance R2
Different but related effect sizes \(f^2\) = \({R^2/{(1-R^2)}}\)
GPower: regression coefficient (t-test / regression, fixed model single regression coef)
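These conversions can be checked in R; a sketch for the reference example (d = .5, two equal groups):

```r
# effect size conversions for two equal groups (reference example: d = .5)
d <- 0.5
r <- d / sqrt(d^2 + 4)   # point-biserial correlation for two equal groups
f2 <- r^2 / (1 - r^2)    # f^2 = R^2 / (1 - R^2), with R^2 = r^2 here
round(c(r = r, f2 = f2), 4)  # r = 0.2425, f2 = 0.0625
```

Note that f = \(\sqrt{f^2}\) = .25 = d/2, consistent with the conventions for two equally sized groups.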
Use reference example
Ha
d-family
(differences) and r-family
(associations)
Regression
\(f^2 = \frac{R^2}{1-R^2}\)
Binary Logistic Regression
\(f = \frac{\Phi^{-1}(AUC)}{\sqrt{2}}\)
Ordinal Logistic Regression
\(f = \frac{\sqrt{3}}{2*\pi} * log\_odds\_ratio\)
Poisson Regression
\(f = \sqrt{\frac{(\lambda_1-\lambda_2)^2}{2*(\lambda_1+\lambda_2)}}\)
Exponential for waiting times
\(f = 1-{\frac{\lambda_1 * \lambda_2}{(\lambda_1 + \lambda_2)^2}}^{\frac{\mu^2}{\sigma}}\)
Gamma regression
\(f^2 = {\frac{(\mu_1 + \mu_2)^2}{4*\mu_1*\mu_2}}^{\frac{\mu^2}{\sigma^2}} - 1\)
[1] 127.5315
# install.packages('pwr')
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")$n * 2
[1] 127.5312
for our reference
Following the literature we expect 8.8 events per cycle in the control group and aim to show that the treatment group has fewer, at most 7.3, which implies a rate ratio of about 1.2. The groups should be similar in size, and without further information the dispersion is assumed to be 1. With the typical .05 type I error and 80% power, and for simplicity without any other predictors, verify that the required sample size is 118.
Poisson Regression \(f = \sqrt{\frac{(\lambda_1-\lambda_2)^2}{2*(\lambda_1+\lambda_2)}}\)
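For this example the Poisson effect size f follows directly from the two rates stated in the text:

```r
# Poisson effect size f from the two rates (8.8 vs 7.3 events per cycle)
lambda1 <- 8.8; lambda2 <- 7.3
f <- sqrt((lambda1 - lambda2)^2 / (2 * (lambda1 + lambda2)))
round(f, 3)  # 0.264
```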
To show that the average rating for the treatment group is 10% better, the sample size is calculated based on a test comparing waiting times, with an expected number of events for the iDA group of at most 1.44 compared to 1.6 for the control group. Note that this implies an average of 1.52; the literature also provides evidence for a standard deviation of .82 for the observed events. Verify that about 1002 patients are required when choosing a type I error of .05 and aiming at .8 power.
Exponential for waiting times \(f = 1-{\frac{\lambda_1 * \lambda_2}{(\lambda_1 + \lambda_2)^2}}^{\frac{\mu^2}{\sigma}}\)
a priori: \(n\) ~ \(\alpha\), power, \(\Delta\)
post hoc: power ~ \(\alpha\), \(n\), \(\Delta\)
compromise: power, \(\alpha\) ~ \(\beta\:/\:\alpha\), \(\Delta\), \(n\)
criterion: \(\alpha\) ~ power, \(\Delta\), \(n\)
sensitivity: \(\Delta\) ~ \(\alpha\), power, \(n\)
reference example
:
G*Power
When comparing 2 dependent groups (eg., before/after treatment) → account for correlations
Correlations are typically obtained from pilot data or earlier research
GPower: matched pairs (t-test / means, difference 2 dependent means)
Use reference example, with sample size 2x64, but correlation .5?
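A sketch of this calculation with pwr, using the standard matched-pairs conversion \(d_z = d / \sqrt{2(1-\rho)}\):

```r
# matched pairs: convert d to dz using the correlation, then solve for pairs
library(pwr)
d <- 0.5; rho <- 0.5
dz <- d / sqrt(2 * (1 - rho))  # with rho = .5, dz equals d = .5
n_pairs <- pwr.t.test(d = dz, sig.level = .05, power = .8, type = "paired")$n
ceiling(n_pairs)  # number of pairs
```

With correlation .5 the paired design needs roughly half the observations of the independent-groups design.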
Test difference two independent proportions → [0..1]
Simplest version of a logistic regression on two groups
GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)
Testing whether two proportions are the same
GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)
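As an illustration, the pwr approximation via Cohen's h; the proportions .7 and .5 are hypothetical, not from the text:

```r
# two independent proportions via Cohen's h (proportions .7 vs .5 are hypothetical)
library(pwr)
h <- ES.h(0.7, 0.5)  # arcsine-transformed difference of proportions
n <- pwr.2p.test(h = h, sig.level = .05, power = .8)$n
ceiling(n)  # per-group n
```

This uses the normal approximation; the Fisher exact test in GPower typically requires slightly more observations.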
Plot 5 power curves
Explain the curve minimum; what is the relation with sample size?
Repeat for one-tailed; what is the difference?
Test difference two dependent proportions → [0..1] categorical shift
GPower: McNemar test (exact / proportions, difference 2 dependent proportions)
Testing whether proportions of discordance are the same
When non-normally distributed residuals are expected and this cannot be circumvented (e.g., by transformations)
Only considers ranks or uses permutations → the price is efficiency and flexibility
Requires a parent distribution (alternative hypothesis); ‘min ARE’ should be the default
GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)
Use reference example
Multiple groups, at least two differ → not one effect size d
F-test statistic & effect size f
, ratio of variances \(\sigma_{between}^2 / \sigma_{within}^2\)
GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)
use reference example (group 1 = 0, group 2 = 2)
include a third group (group 3 = 4, same sd)
\(\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}\)
\(f = \sqrt{\frac{\sigma_{contrast}^2}{\sigma_{error}^2}}\)
GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
For the reference example
extended, with contrasts \(f_{T1-C}\)=.25, \(f_{T2-C}\)=.50 and \(f_{(T2+T1)/2-C}\)=.3535
When the factor is time: repeated measures
Beware: effect sizes obtained from literature may/may not include correlation
GPower: repeated measures (F-test / Means, repeated measures within factors)
For reference example
with effect size f = .25 (1/16 explained versus unexplained)
Get effect size: in-house shiny app
Use reference example
for treatment (C-T1-T2) and add type (B1-B2)
GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
Use reference example
for treatment (C-T1-T2) and add type (B1-B2)
what are the sample sizes
When repeated measures are obtained for different groups
GPower: repeated measures (F-test / Means, repeated measures between factors)
For reference example
When differences between groups depend on time
Get effect size: in-house shiny app
Use reference example
for control-treatment (C-T1), and 2 or 4 time points
Options: different effect sizes are possible
GPower: repeated measures (F-test / Means, repeated measures within-between factors)
Use the effect sizes from the previous exercise, parts 2 and 3
Test difference of two independent correlations → [-1..1]
Use Fisher Z transformations to normalize
Correlations are easier to differentiate the further they are from 0
GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s
Testing whether two correlations are the same
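A manual sketch of this calculation; the correlations .5 and .2 are hypothetical, and Cohen's q is the difference of the Fisher z transforms:

```r
# difference of two independent correlations via Fisher z (hypothetical r's)
r1 <- 0.5; r2 <- 0.2
q <- atanh(r1) - atanh(r2)                      # Cohen's q
n <- 2 * ((qnorm(.975) + qnorm(.8)) / q)^2 + 3  # per-group n, equal groups
round(c(q = q, n = n), 2)
```

The +3 reflects that the Fisher z standard error per group is \(1/\sqrt{n-3}\).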
GPower is not always sufficient, and not even R is
Tests too difficult to specify in GPower or R
Tests not included in GPower or R
Tests without formula
dta$y <- dta$y+rnorm(length(dta$y),0,4) # generate
res <- t.test(data=dta,y~X)             # analyze
mean(sims['p.val',] < .05)              # evaluate: simulated power
gr <- rep(c('T','C'),64)
y <- ifelse(gr=='C',0,2)
dta <- data.frame(y=y,X=gr)
cutoff <- qt(.025,nrow(dta)-2) # df = 126 for the two-sample t-test
my_sim_function <- function(){
dta$y <- dta$y+rnorm(length(dta$y),0,4) # generate (with sd=4)
res <- t.test(data=dta,y~X) # analyze
c(res$estimate %*% c(-1,1),res$statistic,res$p.value)
}
sims <- replicate(10000,my_sim_function()) # many iterations
dimnames(sims)[[1]] <- c('diff','t.stat','p.val')
mean(sims['p.val',] < .05) # p-values 0.8029
mean(sims['t.stat',] < cutoff) # t-statistics 0.8029
mean(sims['diff',] > sd(sims['diff',])*cutoff*(-1)) # differences 0.8024
Sample size calculation is a design issue, not a statistical one
It typically focuses on ensuring sufficient data to result in sufficiently strong statistical inference
Sample size depends on effect size, type I & II errors, and the statistical test of interest
GPower deals with not too complex models
R is more flexible, but may require a bit more digging and work
Most important of all: turn it into a good story…
Thank you for your attention.
Methodological and statistical support to help make a difference
Contact