1. Introduction

In this vignette, we use statistical power analysis to determine a minimal number of samples required to confidently trust the results of a statistical test from MS-based omics data.

1.1. Statistical power analysis in the context of MS-based omics data

The calculation of a statistical power depends on several factors. It is statistical test-dependent, i.e. the power analysis will be specific to a statistical test. Thus, this kind of analysis requires to predefine the statistical test that will be used.

In case of a standard t-test comparing the average intensities between two conditions for a single peptide/protein, the statistical power can be mathematically expressed in function of:

the minimal average difference intensity between conditions that is expected to be detectable (fold-change or effect size),
the variance of the intensities along the samples,
the type 1 error (threshold on the p-values),
the number of samples in each condition (sample size).

See the wikipedia page for basic concepts about statistical power: https://en.wikipedia.org/wiki/Power_of_a_test

However, because many peptides/proteins are usually quantified in MS-based experiments, the reasoning underlying the power analysis is a little bit different. We will think in term of:

average statistical power on all the tests,
distribution of fold-changes along the peptides/proteins (instead of a single value),
distribution of variances along the peptides/proteins (instead of a single value),
a False Discovery Rate (FDR) level (instead of a type 1 error) since a multiple test correction has to be applied to avoid false positives.

Additionally, another parameter will be important:

\(\pi_0\): the proportion of non-differentially abundant proteins.

The complexity of this kind of analysis increases when one is interested in other types of statistical tests, in particular when the experiment is composed of more than 2 conditions. Notably, the concept of “fold-change” is replaced by a vector of parameters.

In this vignette, we will use the ssize.fdr R package to perform this kind of analysis.

Please see https://academic.oup.com/bioinformatics/article/23/6/739/416097?login=true and
references inside for details, notably about mathematical formula.

1.2. When to use it in MS-based omics facilities ?

In a facility, a first MS-based experiment with a minimum number of replicates per biological condition (at least 3 to be able to estimate the variance of intensities for each peptide/protein in each condition) is generally intended to check that the experiment has worked correctly. It can also be used to assess how many samples are sufficient to obtain robust statistical results in a subsequent analysis.

This analysis can also be performed after the final data have been acquired to evaluate what is the confidence in the results. For instance, it can be used to define a minimal fold-change allowing to reach a sufficient average statistical power.

1.3. How to perform it with R ?

The R package ssize.fdr contains several functions that can be used to calculate minimum sample sizes. The purpose of each of these functions is to estimate an average statistical power over all the tests carried out on all the peptides/proteins in function of the sample size.

Additionnally, it can be used in case of more than two conditions to compare: the functions ssize.F and ssize.Fvary can be used by defining a design matrix, a FDR level, a user-specified proportion of non differentially abundant peptides/proteins, and a standard deviation or the parameters of inverse gamma distribution followed by variances of peptides/proteins for ssize.Fvary.

We detailed hereafter how to conduct such an analysis using a dataset that has been used in a publication of Cell Reports (PMID: 31851926 - Dataset: PXD012825). The associated publication is available here: https://www.sciencedirect.com/science/article/pii/S2211124719315311?via%3Dihub

2. Importing, checking and transforming the data with R

2.1 Importing data in R

For the next, we use a proteinGroups.txt file of MaxQuant from a dataset of the proteomics facility of the Institut Pasteur that leads to a publication (PMID: 31851926). Any file from another software can also be used as long as it provides the intensities of peptides/proteins in different samples.

R can be used to directly download the data, unzip the file and extract the proteinGroups.txt file from PRIDE:

temp <- tempfile()
download.file("https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip",temp)
data <- read.csv(unz(temp, "txt/proteinGroups.txt"),header=TRUE,sep="\t",quote="")
unlink(temp)

You can also do it “by hands”. Indeed, all the MaxQuant output files of our study can be downloadable from PRIDE by clicking here (txt.zip file) and unzipping the file afterwards: https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip

Details on the dataset are available here: https://www.ebi.ac.uk/pride/archive/projects/PXD012825

If you have produce your own txt file from MaxQuant, you just have to use:

data <- read.csv("C:/path_to_your_file/txt/proteinGroups.txt",header=TRUE,sep="\t",quote="")

2.2 Checking and transforming the data

After importing the data in R, you can check the dataset structure:

#structure of the dataset
str(data)

Also, the number of proteins identified can be obtained with:

#number of rows (proteins) in the dataset
nrow(data)

## [1] 5904

For the next, we are exclusively interested in the intensities measured for each protein of the dataset. They can be extracted using:

datai=data[,grep("Intensity.",colnames(data))]

In MaxQuant, missing values are 0. Replace 0 by NA to allow R to identify missing values:

datai[datai==0]=NA

Additionally, it is useful to transform the data with a log2 transformation to avoid too high variances for high intensities. This can be a problem because low-intensity proteins are more likely to be selected in the differential analysis in absence of log-transformation:

datai=log2(datai)

In the next, we will use only the proteins without missing values to evaluate how many samples are required:

datai=na.omit(datai)

#number of rows (proteins) in the dataset

nrow(datai)

## [1] 2469

To ensure everything is correct, the first rows of the dataset can be checked with:

head(datai)

##    Intensity.534H Intensity.534M Intensity.534N Intensity.545H Intensity.545M
## 5        26.74945       28.60438       27.10580       26.26960       28.76885
## 12       33.44628       33.25575       33.67219       33.46271       33.26319
## 14       26.88972       28.92241       26.93179       27.63768       29.39495
## 15       25.64469       27.03915       27.68953       28.28401       27.43038
## 24       32.59096       31.27353       31.96144       32.65994       31.16519
## 28       35.80005       37.77362       35.52965       35.76619       35.79190
##    Intensity.545N Intensity.546H Intensity.546M Intensity.546N Intensity.547H
## 5        26.89203       28.79647       29.81274       28.82950       27.28771
## 12       32.87738       33.44727       33.34467       33.86692       33.73832
## 14       27.94436       29.18903       31.05724       28.53200       27.48524
## 15       29.26289       27.18622       27.01521       27.36994       28.05662
## 24       31.42762       34.09380       33.54823       34.61374       31.59139
## 28       35.22974       35.65040       36.62105       35.61300       35.57249
##    Intensity.547M Intensity.547N Intensity.548H Intensity.548M Intensity.548N
## 5        29.98848       27.77178       27.53490       29.03063       28.56885
## 12       33.28223       33.53971       33.71508       33.26444       33.50640
## 14       30.85013       28.79359       28.77659       30.61877       29.02459
## 15       28.29491       28.08253       33.17896       29.22506       28.87449
## 24       31.67405       31.43713       34.12209       33.06466       34.30735
## 28       35.97513       35.50826       35.80075       36.68565       36.18344
##    Intensity.549H Intensity.549M Intensity.549N Intensity.550H Intensity.550M
## 5        27.22555       29.75406       27.56943       27.57881       28.30478
## 12       33.66786       33.40121       33.17044       33.45133       32.97294
## 14       28.66804       30.37513       27.79433       29.47698       30.30693
## 15       27.63194       26.93686       26.90793       27.78119       29.83661
## 24       32.88992       32.74023       32.55073       32.30840       31.03394
## 28       36.26609       36.56482       35.73847       36.40154       36.35332
##    Intensity.550N
## 5        28.32354
## 12       33.49500
## 14       29.35685
## 15       25.87025
## 24       32.18931
## 28       36.21706

3. Design of experiments and used statistical test

The experimental design is specific to each experiment and is extremely important, especially for the subsequent statistical analysis.

3.1. Specifying the conditions of samples

The vector containing the experimental design must correspond to the columns containing the intensity values. The colnames function can be used to checked the column names of the intensity matrix:

colnames(datai)

##  [1] "Intensity.534H" "Intensity.534M" "Intensity.534N" "Intensity.545H"
##  [5] "Intensity.545M" "Intensity.545N" "Intensity.546H" "Intensity.546M"
##  [9] "Intensity.546N" "Intensity.547H" "Intensity.547M" "Intensity.547N"
## [13] "Intensity.548H" "Intensity.548M" "Intensity.548N" "Intensity.549H"
## [17] "Intensity.549M" "Intensity.549N" "Intensity.550H" "Intensity.550M"
## [21] "Intensity.550N"

Here, “H”, “M” and “N” are different biological conditions. So, the vector indicating the experimental design contains the conditions of each sample:

conditions=rep(c("H","M","N"),7)

conditions

##  [1] "H" "M" "N" "H" "M" "N" "H" "M" "N" "H" "M" "N" "H" "M" "N" "H" "M" "N" "H"
## [20] "M" "N"

Next, a design matrix that represents a linear model needs to be created. Indeed, the parameters of this model can next be used to build several statistical tests that will be specify by a contrast matrix.

3.2 Definition of Design and Contrast matrices

A linear regression model can be expressed as: \[Y=XB+\epsilon\] where \(Y\) is a vector containing the intensities of a protein/peptide in different samples, \(B\) is a vector of parameters, \(\epsilon\) is a (Gaussian white) noise, \(X\) is a design matrix.

The design matrix defines the model that will be used to explain the intensities of proteins/peptides.

For instance, in a Case/Control framework with 3 replicates in each condition, the linear regression model will be: \[\begin{bmatrix} y_{1}\\ y_{2}\\ y_{3}\\ y_{4}\\ y_{5}\\ y_{6} \end{bmatrix}=\begin{bmatrix}1 & 0\\ 1 & 0\\ 1 & 0\\ 1 & 1\\ 1 & 1\\ 1 & 1 \end{bmatrix}\begin{bmatrix}\alpha \\ \beta\end{bmatrix}+\epsilon\]

When using a statistical test, we are interested to test if some model parameters are null (or linear combinations of parameters). For instance, in the Case/Control framework above, we are interested in testing the nullity of the \(\beta\) parameter. This can be formulated as a multiplication between a contrast matrix and the vector of parameters, i.e. we want to test if: \[\begin{bmatrix}0 & 1 \end{bmatrix}\begin{bmatrix}\alpha \\ \beta\end{bmatrix}=0 \Leftrightarrow \beta=0\]

More generally, the statistical test can be expressed as: \[H_0: C^tB=0\qquad vs\qquad H_1: C^tB\neq0\] where \(C\) is the contrast matrix.

3.3 Design matrix in R

3.3.1 Case-Control framework

The most current framework is the Case-Control one. Typically, you have two biological conditions that you want to compare (ex: Wild Type vs Mutant, Healthy vs Not healthy, etc.). As we have three conditions in our case study, this framework corresponds to compare only two conditions, for instance “M” vs “N”. Thus, only the samples corresponding to the M and N conditions have to be kept:

cond_case_ctrl=conditions[conditions%in%c("M","N")]

datai_case_ctrl=datai[,conditions%in%c("M","N")]

The design matrix is obtained with the model.matrix function:

design_case_ctrl=model.matrix(~cond_case_ctrl)

design_case_ctrl

##    (Intercept) cond_case_ctrlN
## 1            1               0
## 2            1               1
## 3            1               0
## 4            1               1
## 5            1               0
## 6            1               1
## 7            1               0
## 8            1               1
## 9            1               0
## 10           1               1
## 11           1               0
## 12           1               1
## 13           1               0
## 14           1               1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$cond_case_ctrl
## [1] "contr.treatment"

Each row of design_case_ctrl corresponds to a sample. Each column of design_case_ctrl corresponds to a parameter of the linear model used. The first column of design_case_ctrl is composed of 1 and corresponds to the intercept of the linear model. The intercept is dedicated to model the average intensity in the “M” condition (reference condition). The second column is dedicated to model the intensity gap between the “M” and the “N” condition, it is equal to 1 only if the sample belongs to the “N” condition, 0 otherwise.

3.3.2 Multiple conditions framework

In case of more than 2 conditions, the design matrix can also be defined with the model.matrix function:

design_all=model.matrix(~conditions)

design_all

##    (Intercept) conditionsM conditionsN
## 1            1           0           0
## 2            1           1           0
## 3            1           0           1
## 4            1           0           0
## 5            1           1           0
## 6            1           0           1
## 7            1           0           0
## 8            1           1           0
## 9            1           0           1
## 10           1           0           0
## 11           1           1           0
## 12           1           0           1
## 13           1           0           0
## 14           1           1           0
## 15           1           0           1
## 16           1           0           0
## 17           1           1           0
## 18           1           0           1
## 19           1           0           0
## 20           1           1           0
## 21           1           0           1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$conditions
## [1] "contr.treatment"

In that case, the intercept is dedicated to model the average intensity in the “H” condition (reference condition). The second column is dedicated to model the intensity gap between the “H” and the “M” conditions: it so equal to 1 only if the sample belongs to the “M” condition, 0 otherwise. The third column is dedicated to model the intensity gap between the “H” and the “N” condition, it so equal to 1 only if the sample belongs to the “N” condition, 0 otherwise.

Of note, the “H” condition is here considered as the reference condition. Indeed, R converts the character vector conditions into a factor vector where the levels are ranked by alphabetic order. However, another condition can also be wished to be the reference. For instance, if we wish “N” as the reference, you have to create a factor object where “N” is the first level:

conditions_mod=factor(conditions, levels=c("N","M","H"))

design_all=model.matrix(~conditions_mod)

design_all

Note that this matrix is different from the first.

3.3.3 Case of paired samples

Until now, the samples have been considered as unpaired. However, if you look carefully the vector colnames(datai), you will see similar identification numbers (534, 545, etc.) in each of the biological conditions. Indeed, each number indicates a biological sample, and each biological condition (“H”,“M”,“N”) corresponds to a specific treatment of this sample: the samples are thus paired.

A vector indicating the biological sample identifiers can be created:

sample_nb=factor(c(rep(534,3),rep(545,3),rep(546,3),rep(547,3),rep(548,3),rep(549,3),rep(550,3)))

#For the Case-Control case (only M and N samples are kept)

sample_nb_case_ctrl=sample_nb[conditions%in%c("M","N")]

To integrate this pairing in our design matrix, we use the model.matrix function:

For the Case-Control case:

design_case_ctrl_paired=model.matrix(~cond_case_ctrl+sample_nb_case_ctrl)

design_case_ctrl_paired

##    (Intercept) cond_case_ctrlN sample_nb_case_ctrl545 sample_nb_case_ctrl546
## 1            1               0                      0                      0
## 2            1               1                      0                      0
## 3            1               0                      1                      0
## 4            1               1                      1                      0
## 5            1               0                      0                      1
## 6            1               1                      0                      1
## 7            1               0                      0                      0
## 8            1               1                      0                      0
## 9            1               0                      0                      0
## 10           1               1                      0                      0
## 11           1               0                      0                      0
## 12           1               1                      0                      0
## 13           1               0                      0                      0
## 14           1               1                      0                      0
##    sample_nb_case_ctrl547 sample_nb_case_ctrl548 sample_nb_case_ctrl549
## 1                       0                      0                      0
## 2                       0                      0                      0
## 3                       0                      0                      0
## 4                       0                      0                      0
## 5                       0                      0                      0
## 6                       0                      0                      0
## 7                       1                      0                      0
## 8                       1                      0                      0
## 9                       0                      1                      0
## 10                      0                      1                      0
## 11                      0                      0                      1
## 12                      0                      0                      1
## 13                      0                      0                      0
## 14                      0                      0                      0
##    sample_nb_case_ctrl550
## 1                       0
## 2                       0
## 3                       0
## 4                       0
## 5                       0
## 6                       0
## 7                       0
## 8                       0
## 9                       0
## 10                      0
## 11                      0
## 12                      0
## 13                      1
## 14                      1
## attr(,"assign")
## [1] 0 1 2 2 2 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$cond_case_ctrl
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sample_nb_case_ctrl
## [1] "contr.treatment"

For the multiple conditions case:

design_all_paired=model.matrix(~conditions+sample_nb)

design_all_paired

##    (Intercept) conditionsM conditionsN sample_nb545 sample_nb546 sample_nb547
## 1            1           0           0            0            0            0
## 2            1           1           0            0            0            0
## 3            1           0           1            0            0            0
## 4            1           0           0            1            0            0
## 5            1           1           0            1            0            0
## 6            1           0           1            1            0            0
## 7            1           0           0            0            1            0
## 8            1           1           0            0            1            0
## 9            1           0           1            0            1            0
## 10           1           0           0            0            0            1
## 11           1           1           0            0            0            1
## 12           1           0           1            0            0            1
## 13           1           0           0            0            0            0
## 14           1           1           0            0            0            0
## 15           1           0           1            0            0            0
## 16           1           0           0            0            0            0
## 17           1           1           0            0            0            0
## 18           1           0           1            0            0            0
## 19           1           0           0            0            0            0
## 20           1           1           0            0            0            0
## 21           1           0           1            0            0            0
##    sample_nb548 sample_nb549 sample_nb550
## 1             0            0            0
## 2             0            0            0
## 3             0            0            0
## 4             0            0            0
## 5             0            0            0
## 6             0            0            0
## 7             0            0            0
## 8             0            0            0
## 9             0            0            0
## 10            0            0            0
## 11            0            0            0
## 12            0            0            0
## 13            1            0            0
## 14            1            0            0
## 15            1            0            0
## 16            0            1            0
## 17            0            1            0
## 18            0            1            0
## 19            0            0            1
## 20            0            0            1
## 21            0            0            1
## attr(,"assign")
## [1] 0 1 1 2 2 2 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$conditions
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sample_nb
## [1] "contr.treatment"

3.4 Contrast matrix in R

The contrast matrix will define the statistical test that will be used, i.e. which nullity of a parameter (or a combination of parameters) of the linear model will be tested. The statistical power analysis will next depend on this matrix. It has a same number of rows that the number of columns in the design matrix (i.e. number of parameters in the linear model).

3.4.1 Case-Control framework

In the Case-Control framework, we have only two parameters in the linear model, the first represents the average intensity in the reference condition (“M” here), the second represents the gap between this reference condition and the other condition (“N”). What is generally of interest is to test whether this gap is zero. In this case, there will be no significant difference in intensity between both conditions. Thus, the contrast matrix is:

ct_cc=c(0,1)

Here, the contrast matrix is actually a vector since we test only if the second parameter is null (1 on the second coordinate).

3.4.2 Multiple conditions framework

In the multiple condition framework of our case study, we have only three parameters in the linear model: the first represents the average intensity in the reference condition, the second represents the gap between this reference condition and a second condition, the third represents the gap between this reference condition and a third condition. What is generally of interest is to test whether all these gaps are zero. In this case, there will be no significant difference in intensity between the three conditions. Thus, the contrast matrix is:

ct_all=cbind(c(0,1,0),c(0,1,-1))

ct_all

##      [,1] [,2]
## [1,]    0    0
## [2,]    1    1
## [3,]    0   -1

The first column tests if the second parameter is null, and the second column tests if the second parameter is equal to the third one. The combination of these two contrasts will lead to test if all the gaps are null.

3.4.3 Case of paired samples

In the Case-Control case with paired samples, we are interested to test if only the parameter related to the gap between the conditions is significantly not null. This parameter is the 2nd in our model composed of 8 parameters (you can check this using colnames(design_case_ctrl_paired)):

ct_cc_paired=rep(0,ncol(design_case_ctrl_paired))

ct_cc_paired[2]=1

ct_cc_paired

## [1] 0 1 0 0 0 0 0 0

Similarly, in the multiple condition framework with paired samples, we are interested to test if the parameters related to the gaps between the conditions are significantly not null. They are in 2nd and 3rd position in our model composed of 9 parameters:

ct_all_paired=matrix(0,ncol(design_all_paired),2)

ct_all_paired[2,1]=1
ct_all_paired[2,2]=1
ct_all_paired[3,2]=-1

ct_all_paired

##       [,1] [,2]
##  [1,]    0    0
##  [2,]    1    1
##  [3,]    0   -1
##  [4,]    0    0
##  [5,]    0    0
##  [6,]    0    0
##  [7,]    0    0
##  [8,]    0    0
##  [9,]    0    0

4. Statistical power analysis in function of the number of samples

Here we use the ssize.TwoSampVary function of the ssize.fdr R package to estimate the average statistical power in function of the sample size. The function needs to precise: the average and standard deviations to define the Normal distribution of fold-changes, the two parameters of the Inverse Gamma distributions of variances, the proportions of non-differentially abundant proteins.

4.1 Installing the ssize.fdr, limma and cp4p R packages

To install the required packages, you have to use:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("limma","qvalue","multtest")

install.packages(c("ssize.fdr","cp4p"))

If already installed, you can load it in your R session with:

library("limma")
library("ssize.fdr")
library("cp4p")

4.2. Case-Control framework

4.2.1 Estimating the distribution of variances

The distribution of variances is assumed to follow an Inverse Gamma distribution characterized by two parameters: a shape and a scale parameter. We can use the functions of the limma R package to fit the model with the adequate contrast, and next deduce these parameters:

library(limma)

fit=eBayes(contrasts.fit(lmFit(datai_case_ctrl,design_case_ctrl),ct_cc),robust=TRUE)

s02=fit$s2.prior;
d0=mean(fit$df.prior)

#shape parameter
alpha=d0/2

#scale parameter
beta=d0*s02/2

4.2.2 Estimating the proportion of non-differentially abundant proteins

The estimation of the proportion of non-differentially abundant proteins is related to the estimation of the proportion of true null hypotheses among multiple tests, which is a well-known statistical problematic. Generally, estimation methods are based on the distribution of pvalues.

Here, we extract the p-values of interest from the output of the functions of the limma R package and display their distribution:

pval_ct=fit$F.p.value

hist(pval_ct,50,main="Histogram of pvalues",col=4)

From this distribution, the cp4p R package allows applying diverse statistical methods to estimate the searched proportion:

library(cp4p)

#Calibration plot of pvalues to choose a method
calibration.plot(pval_ct,"ALL")

## $pi0
##   pi0.Storey.Spline pi0.Storey.Boot pi0.Jiang pi0.Histo pi0.Langaas pi0.Pounds
## 1         0.1308977       0.1654805 0.1653592   0.18226   0.1604865  0.2433594
##     pi0.ABH  pi0.SLIM
## 1 0.3110571 0.3483723
## 
## $h1.concentration
## NULL
## 
## $unif.under
## numeric(0)

prop=estim.pi0(pval_ct,pi0.method="ALL")

prop$pi0.est

##   pi0.Storey.Spline pi0.Storey.Boot pi0.Jiang pi0.Histo pi0.Langaas pi0.Pounds
## 1         0.1308977       0.1654805 0.1653592   0.18226   0.1604865  0.2433594
##     pi0.ABH  pi0.SLIM
## 1 0.3110571 0.3483723

4.2.3 Final step: fold-changes, power and sample size

Here, we conduct the power analysis for a fold-change fixed to 1. We use a FDR threshold of 5%. The proportion of non-differentially abundant proteins is varying following our previous estimations.

library(ssize.fdr)

ss2v=ssize.twoSampVary(deltaMean=1, deltaSE=0, a=alpha, b=beta, fdr = 0.05, power = 0.8, pi0 = seq(min(prop$pi0.est),max(prop$pi0.est),by=0.05), maxN = 35, side = "two-sided", cex.title=1.15, cex.legend=1)

For 7 samples by biological conditions, we evaluate a statistical power between 0.5 and 0.8 depending of the proportion of non-differentially abundant proteins. We have a good power (around 0.8) if the proportion of non-differentially abundant proteins is low.

You can change the fold change level to observe the behaviour of the curve: more the fold-change will be high, and less samples will be required.

4.3 Multiple conditions with a pairing

Here we use the ssize.Fvary function of the ssize.fdr R package to estimate the average statistical power in function of the sample size in our real case study where we have 3 biological conditions with a pairing of samples.

As explained previously, we have to estimate the distributions of variances in our dataset and the proportion of non-differentially abundant proteins. Additionally, we have to fix a FDR threshold and define a vector of parameters.

4.3.1 Estimating the distributions of variances and the proportion of non-differentially abundant proteins

Here, the parameters of the distributions of variances and the proportion of non-differentially abundant proteins can be estimated similarly to the Case-Control case.

fit=eBayes(contrasts.fit(lmFit(datai,design_all_paired),ct_all_paired),robust=TRUE)

s02=fit$s2.prior;
d0=mean(fit$df.prior)

#shape parameter
alpha=d0/2

#scale parameter
beta=d0*s02/2

#p-values
pval_ct=fit$F.p.value

#proportion of true null hypotheses
prop=estim.pi0(pval_ct,pi0.method="ALL")

prop$pi0.est

##   pi0.Storey.Spline pi0.Storey.Boot pi0.Jiang pi0.Histo pi0.Langaas pi0.Pounds
## 1         0.0605356      0.07560416 0.0906393 0.1060422  0.08464052    0.14356
##     pi0.ABH  pi0.SLIM
## 1 0.1952207 0.2106361

4.3.2 Determining the degrees of freedom function

Additionally, a function giving the number of degrees of freedom has to be provided to conduct the power analysis.

The number of degrees of freedom is the number of values in the final calculation of a test statistic that are free to vary. In our case, it is defined as the number of “observations” minus the number of parameters to estimate. The number of observations can be evaluated by the number of tested conditions multiplied by the sample size, and the number of parameters is the one that are used in the test statistic.

Putting these together, the function to provide in our case is:

df=function(n){3*n-2};

4.3.3 Final step: vector of parameters, power and sample size

Instead of a fold-change level, we have to think in term of a vector of parameters.

This vector has a length equal to 9. We can have an idea of how this vector is in average in our dataset by estimating it with the limma R packge using:

fit=eBayes(lmFit(datai,design_all_paired),robust=TRUE)

average_coeff=apply(fit$coefficients,2,mean)

average_coeff

##  (Intercept)  conditionsM  conditionsN sample_nb545 sample_nb546 sample_nb547 
##  28.53723268  -0.20315934  -0.03630089   0.14017476   0.85840494   0.86976803 
## sample_nb548 sample_nb549 sample_nb550 
##   0.97393537   0.92493149   1.01402676

Note that the coefficients associated to the sample identifiers are the global effects of the pairing in our dataset. The 2nd and 3rd coordinates are the ones we are interested in (effects of the “M” and “N” conditions relatively to the “H” condition). Thus, we can conduct several power analysis by fixing different values for them.

Here is a vector of parameters where we want to know the statistical power when we have an effect of 0.5 between the “H” condition and the “M” and “N” conditions under the global effects of the pairing in our dataset:

param=average_coeff

param[2]=0.5

param[3]=0.5

Next, we have everything to perform our analysis:

ftv=ssize.Fvary(X=design_all_paired,beta=param,L=ct_all_paired, dn=df,a=alpha,b=beta, fdr=0.05, power=0.8, pi0= seq(min(prop$pi0.est),max(prop$pi0.est),by=0.05),maxN=35)

Note that the sample size is estimated for each condition, i.e. we need at least 4 replicates in “H”, “M” and “N” to reach a good power (80%) when searching for differences of 0.5 between conditions.

If you have managed to get this far, congratulations !!! you finished this wonderful tutorial, and merit so much more !

Using Statistical Power for Sample Size estimation in MS-based omics data

Quentin Giai Gianetto