In this vignette, we use statistical power analysis to determine a minimal number of samples required to confidently trust the results of a statistical test from MS-based omics data.
The calculation of a statistical power depends on several factors. It is statistical test-dependent, i.e. the power analysis will be specific to a statistical test. Thus, this kind of analysis requires to predefine the statistical test that will be used.
In case of a standard t-test comparing the average intensities between two conditions for a single peptide/protein, the statistical power can be mathematically expressed in function of:
See the wikipedia page for basic concepts about statistical power: https://en.wikipedia.org/wiki/Power_of_a_test
However, because many peptides/proteins are usually quantified in MS-based experiments, the reasoning underlying the power analysis is a little bit different. We will think in term of:
Additionally, another parameter will be important:
The complexity of this kind of analysis increases when one is interested in other types of statistical tests, in particular when the experiment is composed of more than 2 conditions. Notably, the concept of “fold-change” is replaced by a vector of parameters.
In this vignette, we will use the ssize.fdr R package to perform this kind of analysis.
Please see https://academic.oup.com/bioinformatics/article/23/6/739/416097?login=true
and
references inside for details, notably about mathematical formula.
In a facility, a first MS-based experiment with a minimum number of replicates per biological condition (at least 3 to be able to estimate the variance of intensities for each peptide/protein in each condition) is generally intended to check that the experiment has worked correctly. It can also be used to assess how many samples are sufficient to obtain robust statistical results in a subsequent analysis.
This analysis can also be performed after the final data have been acquired to evaluate what is the confidence in the results. For instance, it can be used to define a minimal fold-change allowing to reach a sufficient average statistical power.
The R package ssize.fdr contains several functions that can be used to calculate minimum sample sizes. The purpose of each of these functions is to estimate an average statistical power over all the tests carried out on all the peptides/proteins in function of the sample size.
Additionnally, it can be used in case of more than two conditions to compare: the functions ssize.F and ssize.Fvary can be used by defining a design matrix, a FDR level, a user-specified proportion of non differentially abundant peptides/proteins, and a standard deviation or the parameters of inverse gamma distribution followed by variances of peptides/proteins for ssize.Fvary.
We detailed hereafter how to conduct such an analysis using a dataset that has been used in a publication of Cell Reports (PMID: 31851926 - Dataset: PXD012825). The associated publication is available here: https://www.sciencedirect.com/science/article/pii/S2211124719315311?via%3Dihub
For the next, we use a proteinGroups.txt file of MaxQuant from a dataset of the proteomics facility of the Institut Pasteur that leads to a publication (PMID: 31851926). Any file from another software can also be used as long as it provides the intensities of peptides/proteins in different samples.
R can be used to directly download the data, unzip the file and extract the proteinGroups.txt file from PRIDE:
<- tempfile()
temp download.file("https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip",temp)
<- read.csv(unz(temp, "txt/proteinGroups.txt"),header=TRUE,sep="\t",quote="")
data unlink(temp)
You can also do it “by hands”. Indeed, all the MaxQuant output files of our study can be downloadable from PRIDE by clicking here (txt.zip file) and unzipping the file afterwards: https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip
Details on the dataset are available here: https://www.ebi.ac.uk/pride/archive/projects/PXD012825
If you have produce your own txt file from MaxQuant, you just have to use:
<- read.csv("C:/path_to_your_file/txt/proteinGroups.txt",header=TRUE,sep="\t",quote="") data
After importing the data in R, you can check the dataset structure:
#structure of the dataset
str(data)
Also, the number of proteins identified can be obtained with:
#number of rows (proteins) in the dataset
nrow(data)
## [1] 5904
For the next, we are exclusively interested in the intensities measured for each protein of the dataset. They can be extracted using:
=data[,grep("Intensity.",colnames(data))] datai
In MaxQuant, missing values are 0. Replace 0 by NA to allow R to identify missing values:
==0]=NA datai[datai
Additionally, it is useful to transform the data with a log2 transformation to avoid too high variances for high intensities. This can be a problem because low-intensity proteins are more likely to be selected in the differential analysis in absence of log-transformation:
=log2(datai) datai
In the next, we will use only the proteins without missing values to evaluate how many samples are required:
=na.omit(datai)
datai
#number of rows (proteins) in the dataset
nrow(datai)
## [1] 2469
To ensure everything is correct, the first rows of the dataset can be checked with:
head(datai)
## Intensity.534H Intensity.534M Intensity.534N Intensity.545H Intensity.545M
## 5 26.74945 28.60438 27.10580 26.26960 28.76885
## 12 33.44628 33.25575 33.67219 33.46271 33.26319
## 14 26.88972 28.92241 26.93179 27.63768 29.39495
## 15 25.64469 27.03915 27.68953 28.28401 27.43038
## 24 32.59096 31.27353 31.96144 32.65994 31.16519
## 28 35.80005 37.77362 35.52965 35.76619 35.79190
## Intensity.545N Intensity.546H Intensity.546M Intensity.546N Intensity.547H
## 5 26.89203 28.79647 29.81274 28.82950 27.28771
## 12 32.87738 33.44727 33.34467 33.86692 33.73832
## 14 27.94436 29.18903 31.05724 28.53200 27.48524
## 15 29.26289 27.18622 27.01521 27.36994 28.05662
## 24 31.42762 34.09380 33.54823 34.61374 31.59139
## 28 35.22974 35.65040 36.62105 35.61300 35.57249
## Intensity.547M Intensity.547N Intensity.548H Intensity.548M Intensity.548N
## 5 29.98848 27.77178 27.53490 29.03063 28.56885
## 12 33.28223 33.53971 33.71508 33.26444 33.50640
## 14 30.85013 28.79359 28.77659 30.61877 29.02459
## 15 28.29491 28.08253 33.17896 29.22506 28.87449
## 24 31.67405 31.43713 34.12209 33.06466 34.30735
## 28 35.97513 35.50826 35.80075 36.68565 36.18344
## Intensity.549H Intensity.549M Intensity.549N Intensity.550H Intensity.550M
## 5 27.22555 29.75406 27.56943 27.57881 28.30478
## 12 33.66786 33.40121 33.17044 33.45133 32.97294
## 14 28.66804 30.37513 27.79433 29.47698 30.30693
## 15 27.63194 26.93686 26.90793 27.78119 29.83661
## 24 32.88992 32.74023 32.55073 32.30840 31.03394
## 28 36.26609 36.56482 35.73847 36.40154 36.35332
## Intensity.550N
## 5 28.32354
## 12 33.49500
## 14 29.35685
## 15 25.87025
## 24 32.18931
## 28 36.21706
The experimental design is specific to each experiment and is extremely important, especially for the subsequent statistical analysis.
The vector containing the experimental design must correspond to the columns containing the intensity values. The colnames function can be used to checked the column names of the intensity matrix:
colnames(datai)
## [1] "Intensity.534H" "Intensity.534M" "Intensity.534N" "Intensity.545H"
## [5] "Intensity.545M" "Intensity.545N" "Intensity.546H" "Intensity.546M"
## [9] "Intensity.546N" "Intensity.547H" "Intensity.547M" "Intensity.547N"
## [13] "Intensity.548H" "Intensity.548M" "Intensity.548N" "Intensity.549H"
## [17] "Intensity.549M" "Intensity.549N" "Intensity.550H" "Intensity.550M"
## [21] "Intensity.550N"
Here, “H”, “M” and “N” are different biological conditions. So, the vector indicating the experimental design contains the conditions of each sample:
=rep(c("H","M","N"),7)
conditions
conditions
## [1] "H" "M" "N" "H" "M" "N" "H" "M" "N" "H" "M" "N" "H" "M" "N" "H" "M" "N" "H"
## [20] "M" "N"
Next, a design matrix that represents a linear model needs to be created. Indeed, the parameters of this model can next be used to build several statistical tests that will be specify by a contrast matrix.
A linear regression model can be expressed as: \[Y=XB+\epsilon\] where \(Y\) is a vector containing the intensities of a protein/peptide in different samples, \(B\) is a vector of parameters, \(\epsilon\) is a (Gaussian white) noise, \(X\) is a design matrix.
The design matrix defines the model that will be used to explain the intensities of proteins/peptides.
For instance, in a Case/Control framework with 3 replicates in each condition, the linear regression model will be: \[\begin{bmatrix} y_{1}\\ y_{2}\\ y_{3}\\ y_{4}\\ y_{5}\\ y_{6} \end{bmatrix}=\begin{bmatrix}1 & 0\\ 1 & 0\\ 1 & 0\\ 1 & 1\\ 1 & 1\\ 1 & 1 \end{bmatrix}\begin{bmatrix}\alpha \\ \beta\end{bmatrix}+\epsilon\]
When using a statistical test, we are interested to test if some model parameters are null (or linear combinations of parameters). For instance, in the Case/Control framework above, we are interested in testing the nullity of the \(\beta\) parameter. This can be formulated as a multiplication between a contrast matrix and the vector of parameters, i.e. we want to test if: \[\begin{bmatrix}0 & 1 \end{bmatrix}\begin{bmatrix}\alpha \\ \beta\end{bmatrix}=0 \Leftrightarrow \beta=0\]
More generally, the statistical test can be expressed as: \[H_0: C^tB=0\qquad vs\qquad H_1: C^tB\neq0\] where \(C\) is the contrast matrix.
The most current framework is the Case-Control one. Typically, you have two biological conditions that you want to compare (ex: Wild Type vs Mutant, Healthy vs Not healthy, etc.). As we have three conditions in our case study, this framework corresponds to compare only two conditions, for instance “M” vs “N”. Thus, only the samples corresponding to the M and N conditions have to be kept:
=conditions[conditions%in%c("M","N")]
cond_case_ctrl
=datai[,conditions%in%c("M","N")] datai_case_ctrl
The design matrix is obtained with the model.matrix function:
=model.matrix(~cond_case_ctrl)
design_case_ctrl
design_case_ctrl
## (Intercept) cond_case_ctrlN
## 1 1 0
## 2 1 1
## 3 1 0
## 4 1 1
## 5 1 0
## 6 1 1
## 7 1 0
## 8 1 1
## 9 1 0
## 10 1 1
## 11 1 0
## 12 1 1
## 13 1 0
## 14 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$cond_case_ctrl
## [1] "contr.treatment"
Each row of design_case_ctrl corresponds to a sample. Each column of design_case_ctrl corresponds to a parameter of the linear model used. The first column of design_case_ctrl is composed of 1 and corresponds to the intercept of the linear model. The intercept is dedicated to model the average intensity in the “M” condition (reference condition). The second column is dedicated to model the intensity gap between the “M” and the “N” condition, it is equal to 1 only if the sample belongs to the “N” condition, 0 otherwise.
In case of more than 2 conditions, the design matrix can also be defined with the model.matrix function:
=model.matrix(~conditions)
design_all
design_all
## (Intercept) conditionsM conditionsN
## 1 1 0 0
## 2 1 1 0
## 3 1 0 1
## 4 1 0 0
## 5 1 1 0
## 6 1 0 1
## 7 1 0 0
## 8 1 1 0
## 9 1 0 1
## 10 1 0 0
## 11 1 1 0
## 12 1 0 1
## 13 1 0 0
## 14 1 1 0
## 15 1 0 1
## 16 1 0 0
## 17 1 1 0
## 18 1 0 1
## 19 1 0 0
## 20 1 1 0
## 21 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$conditions
## [1] "contr.treatment"
In that case, the intercept is dedicated to model the average intensity in the “H” condition (reference condition). The second column is dedicated to model the intensity gap between the “H” and the “M” conditions: it so equal to 1 only if the sample belongs to the “M” condition, 0 otherwise. The third column is dedicated to model the intensity gap between the “H” and the “N” condition, it so equal to 1 only if the sample belongs to the “N” condition, 0 otherwise.
Of note, the “H” condition is here considered as the reference condition. Indeed, R converts the character vector conditions into a factor vector where the levels are ranked by alphabetic order. However, another condition can also be wished to be the reference. For instance, if we wish “N” as the reference, you have to create a factor object where “N” is the first level:
=factor(conditions, levels=c("N","M","H"))
conditions_mod
=model.matrix(~conditions_mod)
design_all
design_all
Note that this matrix is different from the first.
Until now, the samples have been considered as unpaired. However, if you look carefully the vector colnames(datai), you will see similar identification numbers (534, 545, etc.) in each of the biological conditions. Indeed, each number indicates a biological sample, and each biological condition (“H”,“M”,“N”) corresponds to a specific treatment of this sample: the samples are thus paired.
A vector indicating the biological sample identifiers can be created:
=factor(c(rep(534,3),rep(545,3),rep(546,3),rep(547,3),rep(548,3),rep(549,3),rep(550,3)))
sample_nb
#For the Case-Control case (only M and N samples are kept)
=sample_nb[conditions%in%c("M","N")] sample_nb_case_ctrl
To integrate this pairing in our design matrix, we use the model.matrix function:
=model.matrix(~cond_case_ctrl+sample_nb_case_ctrl)
design_case_ctrl_paired
design_case_ctrl_paired
## (Intercept) cond_case_ctrlN sample_nb_case_ctrl545 sample_nb_case_ctrl546
## 1 1 0 0 0
## 2 1 1 0 0
## 3 1 0 1 0
## 4 1 1 1 0
## 5 1 0 0 1
## 6 1 1 0 1
## 7 1 0 0 0
## 8 1 1 0 0
## 9 1 0 0 0
## 10 1 1 0 0
## 11 1 0 0 0
## 12 1 1 0 0
## 13 1 0 0 0
## 14 1 1 0 0
## sample_nb_case_ctrl547 sample_nb_case_ctrl548 sample_nb_case_ctrl549
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 1 0 0
## 8 1 0 0
## 9 0 1 0
## 10 0 1 0
## 11 0 0 1
## 12 0 0 1
## 13 0 0 0
## 14 0 0 0
## sample_nb_case_ctrl550
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
## 11 0
## 12 0
## 13 1
## 14 1
## attr(,"assign")
## [1] 0 1 2 2 2 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$cond_case_ctrl
## [1] "contr.treatment"
##
## attr(,"contrasts")$sample_nb_case_ctrl
## [1] "contr.treatment"
=model.matrix(~conditions+sample_nb)
design_all_paired
design_all_paired
## (Intercept) conditionsM conditionsN sample_nb545 sample_nb546 sample_nb547
## 1 1 0 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 0 1 0 0 0
## 4 1 0 0 1 0 0
## 5 1 1 0 1 0 0
## 6 1 0 1 1 0 0
## 7 1 0 0 0 1 0
## 8 1 1 0 0 1 0
## 9 1 0 1 0 1 0
## 10 1 0 0 0 0 1
## 11 1 1 0 0 0 1
## 12 1 0 1 0 0 1
## 13 1 0 0 0 0 0
## 14 1 1 0 0 0 0
## 15 1 0 1 0 0 0
## 16 1 0 0 0 0 0
## 17 1 1 0 0 0 0
## 18 1 0 1 0 0 0
## 19 1 0 0 0 0 0
## 20 1 1 0 0 0 0
## 21 1 0 1 0 0 0
## sample_nb548 sample_nb549 sample_nb550
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 0
## 11 0 0 0
## 12 0 0 0
## 13 1 0 0
## 14 1 0 0
## 15 1 0 0
## 16 0 1 0
## 17 0 1 0
## 18 0 1 0
## 19 0 0 1
## 20 0 0 1
## 21 0 0 1
## attr(,"assign")
## [1] 0 1 1 2 2 2 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$conditions
## [1] "contr.treatment"
##
## attr(,"contrasts")$sample_nb
## [1] "contr.treatment"
The contrast matrix will define the statistical test that will be used, i.e. which nullity of a parameter (or a combination of parameters) of the linear model will be tested. The statistical power analysis will next depend on this matrix. It has a same number of rows that the number of columns in the design matrix (i.e. number of parameters in the linear model).
In the Case-Control framework, we have only two parameters in the linear model, the first represents the average intensity in the reference condition (“M” here), the second represents the gap between this reference condition and the other condition (“N”). What is generally of interest is to test whether this gap is zero. In this case, there will be no significant difference in intensity between both conditions. Thus, the contrast matrix is:
=c(0,1) ct_cc
Here, the contrast matrix is actually a vector since we test only if the second parameter is null (1 on the second coordinate).
In the multiple condition framework of our case study, we have only three parameters in the linear model: the first represents the average intensity in the reference condition, the second represents the gap between this reference condition and a second condition, the third represents the gap between this reference condition and a third condition. What is generally of interest is to test whether all these gaps are zero. In this case, there will be no significant difference in intensity between the three conditions. Thus, the contrast matrix is:
=cbind(c(0,1,0),c(0,1,-1))
ct_all
ct_all
## [,1] [,2]
## [1,] 0 0
## [2,] 1 1
## [3,] 0 -1
The first column tests if the second parameter is null, and the second column tests if the second parameter is equal to the third one. The combination of these two contrasts will lead to test if all the gaps are null.
In the Case-Control case with paired samples, we are interested to test if only the parameter related to the gap between the conditions is significantly not null. This parameter is the 2nd in our model composed of 8 parameters (you can check this using colnames(design_case_ctrl_paired)):
=rep(0,ncol(design_case_ctrl_paired))
ct_cc_paired
2]=1
ct_cc_paired[
ct_cc_paired
## [1] 0 1 0 0 0 0 0 0
Similarly, in the multiple condition framework with paired samples, we are interested to test if the parameters related to the gaps between the conditions are significantly not null. They are in 2nd and 3rd position in our model composed of 9 parameters:
=matrix(0,ncol(design_all_paired),2)
ct_all_paired
2,1]=1
ct_all_paired[2,2]=1
ct_all_paired[3,2]=-1
ct_all_paired[
ct_all_paired
## [,1] [,2]
## [1,] 0 0
## [2,] 1 1
## [3,] 0 -1
## [4,] 0 0
## [5,] 0 0
## [6,] 0 0
## [7,] 0 0
## [8,] 0 0
## [9,] 0 0
Here we use the ssize.TwoSampVary function of the ssize.fdr R package to estimate the average statistical power in function of the sample size. The function needs to precise: the average and standard deviations to define the Normal distribution of fold-changes, the two parameters of the Inverse Gamma distributions of variances, the proportions of non-differentially abundant proteins.
To install the required packages, you have to use:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("limma","qvalue","multtest")
BiocManager
install.packages(c("ssize.fdr","cp4p"))
If already installed, you can load it in your R session with:
library("limma")
library("ssize.fdr")
library("cp4p")
The distribution of variances is assumed to follow an Inverse Gamma distribution characterized by two parameters: a shape and a scale parameter. We can use the functions of the limma R package to fit the model with the adequate contrast, and next deduce these parameters:
library(limma)
=eBayes(contrasts.fit(lmFit(datai_case_ctrl,design_case_ctrl),ct_cc),robust=TRUE)
fit
=fit$s2.prior;
s02=mean(fit$df.prior)
d0
#shape parameter
=d0/2
alpha
#scale parameter
=d0*s02/2 beta
The estimation of the proportion of non-differentially abundant proteins is related to the estimation of the proportion of true null hypotheses among multiple tests, which is a well-known statistical problematic. Generally, estimation methods are based on the distribution of pvalues.
Here, we extract the p-values of interest from the output of the functions of the limma R package and display their distribution:
=fit$F.p.value
pval_ct
hist(pval_ct,50,main="Histogram of pvalues",col=4)
From this distribution, the cp4p R package allows applying diverse statistical methods to estimate the searched proportion:
library(cp4p)
#Calibration plot of pvalues to choose a method
calibration.plot(pval_ct,"ALL")
## $pi0
## pi0.Storey.Spline pi0.Storey.Boot pi0.Jiang pi0.Histo pi0.Langaas pi0.Pounds
## 1 0.1308977 0.1654805 0.1653592 0.18226 0.1604865 0.2433594
## pi0.ABH pi0.SLIM
## 1 0.3110571 0.3483723
##
## $h1.concentration
## NULL
##
## $unif.under
## numeric(0)
=estim.pi0(pval_ct,pi0.method="ALL")
prop
$pi0.est prop
## pi0.Storey.Spline pi0.Storey.Boot pi0.Jiang pi0.Histo pi0.Langaas pi0.Pounds
## 1 0.1308977 0.1654805 0.1653592 0.18226 0.1604865 0.2433594
## pi0.ABH pi0.SLIM
## 1 0.3110571 0.3483723
Here, we conduct the power analysis for a fold-change fixed to 1. We use a FDR threshold of 5%. The proportion of non-differentially abundant proteins is varying following our previous estimations.
library(ssize.fdr)
=ssize.twoSampVary(deltaMean=1, deltaSE=0, a=alpha, b=beta, fdr = 0.05, power = 0.8, pi0 = seq(min(prop$pi0.est),max(prop$pi0.est),by=0.05), maxN = 35, side = "two-sided", cex.title=1.15, cex.legend=1) ss2v
For 7 samples by biological conditions, we evaluate a statistical power between 0.5 and 0.8 depending of the proportion of non-differentially abundant proteins. We have a good power (around 0.8) if the proportion of non-differentially abundant proteins is low.
You can change the fold change level to observe the behaviour of the curve: more the fold-change will be high, and less samples will be required.
Here we use the ssize.Fvary function of the ssize.fdr R package to estimate the average statistical power in function of the sample size in our real case study where we have 3 biological conditions with a pairing of samples.
As explained previously, we have to estimate the distributions of variances in our dataset and the proportion of non-differentially abundant proteins. Additionally, we have to fix a FDR threshold and define a vector of parameters.
Here, the parameters of the distributions of variances and the proportion of non-differentially abundant proteins can be estimated similarly to the Case-Control case.
=eBayes(contrasts.fit(lmFit(datai,design_all_paired),ct_all_paired),robust=TRUE)
fit
=fit$s2.prior;
s02=mean(fit$df.prior)
d0
#shape parameter
=d0/2
alpha
#scale parameter
=d0*s02/2
beta
#p-values
=fit$F.p.value
pval_ct
#proportion of true null hypotheses
=estim.pi0(pval_ct,pi0.method="ALL")
prop
$pi0.est prop
## pi0.Storey.Spline pi0.Storey.Boot pi0.Jiang pi0.Histo pi0.Langaas pi0.Pounds
## 1 0.0605356 0.07560416 0.0906393 0.1060422 0.08464052 0.14356
## pi0.ABH pi0.SLIM
## 1 0.1952207 0.2106361
Additionally, a function giving the number of degrees of freedom has to be provided to conduct the power analysis.
The number of degrees of freedom is the number of values in the final calculation of a test statistic that are free to vary. In our case, it is defined as the number of “observations” minus the number of parameters to estimate. The number of observations can be evaluated by the number of tested conditions multiplied by the sample size, and the number of parameters is the one that are used in the test statistic.
Putting these together, the function to provide in our case is:
=function(n){3*n-2}; df
Instead of a fold-change level, we have to think in term of a vector of parameters.
This vector has a length equal to 9. We can have an idea of how this vector is in average in our dataset by estimating it with the limma R packge using:
=eBayes(lmFit(datai,design_all_paired),robust=TRUE)
fit
=apply(fit$coefficients,2,mean)
average_coeff
average_coeff
## (Intercept) conditionsM conditionsN sample_nb545 sample_nb546 sample_nb547
## 28.53723268 -0.20315934 -0.03630089 0.14017476 0.85840494 0.86976803
## sample_nb548 sample_nb549 sample_nb550
## 0.97393537 0.92493149 1.01402676
Note that the coefficients associated to the sample identifiers are the global effects of the pairing in our dataset. The 2nd and 3rd coordinates are the ones we are interested in (effects of the “M” and “N” conditions relatively to the “H” condition). Thus, we can conduct several power analysis by fixing different values for them.
Here is a vector of parameters where we want to know the statistical power when we have an effect of 0.5 between the “H” condition and the “M” and “N” conditions under the global effects of the pairing in our dataset:
=average_coeff
param
2]=0.5
param[
3]=0.5 param[
Next, we have everything to perform our analysis:
=ssize.Fvary(X=design_all_paired,beta=param,L=ct_all_paired, dn=df,a=alpha,b=beta, fdr=0.05, power=0.8, pi0= seq(min(prop$pi0.est),max(prop$pi0.est),by=0.05),maxN=35) ftv
Note that the sample size is estimated for each condition, i.e. we need at least 4 replicates in “H”, “M” and “N” to reach a good power (80%) when searching for differences of 0.5 between conditions.
If you have managed to get this far, congratulations !!! you finished this wonderful tutorial, and merit so much more !