1. Introduction

In this vignette, we show how to create quality control reports using R.

For this, we first use a dedicated R package named “PTX-QC” to obtain standard quality control metrics for a MS-based (prote)omics experiment. Of note, many vignettes are available from the github page of PTX-QC to help users: https://github.com/cbielow/PTXQC

For details, the publication associated to PTX-QC is available here: https://pubs.acs.org/doi/10.1021/acs.jproteome.5b00780

Additionally, we explain how to create some other nice plots to check the reproducibility of identified peptides/proteins and of their quantified values, and to incorporate them in an additional report where everything can be changed in the images from Microsoft PowerPoint.

2. Creating a standard PTX-QC reporting

PTX-QC creates a QC report containing a comprehensive and powerful set of QC metrics, augmented with automated scoring functions. The automated scores are collated to create an overview heatmap at the beginning of the report, giving valuable guidance also to nonspecialists.

2.1 Installing PTX-QC

The PTX-QC library can be installed from R using:

install.packages("PTXQC")

Once it is installed, you have to load the library in your R session using:

library("PTXQC")

## Loading package PTXQC (version 1.0.13)

2.2 Importing data in R

For the next, we will use a MaxQuant output from a dataset of the proteomics facility of the Institut Pasteur that leads to a publication in Cell Reports (PMID: 31851926 - Dataset: PXD012825).

The associated publication is available here: https://www.sciencedirect.com/science/article/pii/S2211124719315311?via%3Dihub

Details on the dataset are available here: https://www.ebi.ac.uk/pride/archive/projects/PXD012825

PTX-QC can also be used from any output of another software as long as it allows providing a mzTab file.

R can be used to directly download the data from PRIDE and unzip the file in a temporary file:

temp <- tempfile()

download.file("https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip",temp)

unzip(temp, exdir = tempdir()) 

txt_folder = file.path(tempdir(), "txt")

2.3 Creating the report

Following the previous section where the dataset has been installed in a temporary file, you just have to use:

report = createReport(txt_folder)

txt_folder

A report is created in the “txt_folder” file. It takes around 10 min to create the report on my laptop.

Note that the “mqpar.xml” is normally required to get all the QC metrics (missing in our example).

If you have produce your own txt file from MaxQuant, you just have to use:

report = createReport("C:/path_to_your_file/txt")

Similarly, a report is generated in the “txt” file of MaxQuant.

Next, you can look inside the created report if most of the quality metrics are good (or not).

2.4 Plotting the overview heatmap

Once the report has been created, you can plot the overview heatmap in R.

Load the heatmap data in the R session (replace “path_to_your_file” and “report_…_heatmap” with the correct name):

data_to_plot <- read.table("C:/path_to_your_file/txt/report_..._heatmap.txt",sep="\t",header=TRUE)

Install and load the “ggplot2” and “reshape2” libraries:

install.packages("ggplot2")

install.packages("reshape2")

Plot the overview heatmap:

#load the needed libraries
library("ggplot2")
library("reshape2")

#transform wide-format data into long-format data.
data_to_plot = melt(data_to_plot)

## Using fc.raw.file as id variables

#heatmap to plot
p = ggplot(data_to_plot, aes_string(y="fc.raw.file", x="variable"))
  if (any(is.na(data_to_plot))){
    p = p + geom_tile(aes_string(fill = "value", colour = "dummy_col")) +
      scale_colour_manual(name="Missing", values=c("NA" = "grey50")) +
      guides(colour = guide_legend(override.aes = list(fill = 'grey50')))
  } else {
    p = p + geom_tile(aes_string(fill = "value"))
  }  
  p = p + scale_fill_gradientn(colours = c("red", "black", "green"),
                         na.value = "grey50",
                         limits=c(0, 1), 
                         guide = guide_legend(title = "score"),
                         breaks=c(1,0.5,0),
                         labels=c("best","under\nperforming","fail")) +
    theme(axis.text.x = suppressWarnings(element_text(angle = 90, 
                                     vjust = 0.5, 
                                     hjust = 1))) +
    ggtitle("Performance overview") +
    xlab("") +
    ylab("Raw file")
  
#display the heatmap
p

The average overall quality of the dataset is summarized in green in the last column of the heatmap (at the right).

2.5 Shiny app

Of note, a Shiny app of PTX-QC is available at: http://ptxqc.bsc.fu-berlin.de/

It is useful to create a report just by clicking buttons from the web for punctual use.

3. Checking the reproducibility of identifications and quantified values

The PTX-QC report provides many quality control metrics to assess that the MS experiment works well.

However, a MS experiment can work in a good fashion but the biological conditions are not clustering together because of the preparation of samples for instance. We show here how to provide essential plots to check this.

We start from the proteinGroups.txt file of MaxQuant, but the method can be applied from any matrix of quantified values.

Also, we provide a method to export any R plot into a PowerPoint file where everything in the image can be changed from PowerPoint. This can be used to provide additional reports to the PTX-QC report, and is of definite interest to provide nice plots for scientific papers.

3.1 Importing a matrix of quantified values in R

For the next, we will use a proteinGroups.txt file of MaxQuant from a dataset of the proteomics facility of the Institut Pasteur that leads to a publication (PMID: 31851926).

In this section, any file from another software can also be used as long as it provides the intensities of peptides/proteins in different samples.

R can be used to directly download the data from PRIDE, unzip the file and extract the proteinGroups.txt file:

temp <- tempfile()
download.file("https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip",temp)
data <- read.csv(unz(temp, "txt/proteinGroups.txt"),header=TRUE,sep="\t",quote="")
unlink(temp)

You can also do it “by hands”. Indeed, all the MaxQuant output files of our study can be downloadable from PRIDE by clicking here (txt.zip file) and unzipping the file afterwards: https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip

Details on the dataset are available here: https://www.ebi.ac.uk/pride/archive/projects/PXD012825

If you have produce your own txt file from MaxQuant, you just have to use:

data <- read.csv("C:/path_to_your_file/txt/proteinGroups.txt",header=TRUE,sep="\t",quote="")

3.2 Checking and transforming the data

After importing the data in R, you can check the dataset structure:

#structure of the dataset
str(data)

Also, the number of proteins identified can be obtained with:

#number of rows (proteins) in the dataset
nrow(data)

## [1] 5904

For the next, we are interested in the number of unique peptides identified for each protein of the dataset. They can be extracted using:

datap=data[,grep("Unique.peptides.",colnames(data))]

Also, we are interested in the intensities measured for each protein of the dataset. They can be extracted using:

datai=data[,grep("Intensity.",colnames(data))]

Alternatively, LFQ intensities can be extracted using:

datai=data[,grep("LFQ.intensity.",colnames(data))]

In MaxQuant, missing values are 0. Replace 0 by NA to allow R to identify missing values:

datai[datai==0]=NA

Additionally, it is useful to transform the data with a log2 transformation:

datai=log2(datai)

Columns of the dataset can be renamed by deleting their common pattern:

short.name=unlist(strsplit(colnames(datap),split="Unique.peptides."))
short.name=short.name[short.name!=""]
colnames(datap)=short.name

short.name=unlist(strsplit(colnames(datai),split="Intensity."))
#or if you use LFQ intensities:
#short.name=unlist(strsplit(colnames(datai),split="LFQ.intensity."))
short.name=short.name[short.name!=""]
colnames(datai)=short.name

To ensure everything is correct, the first rows of the dataset can be checked with the head() function, and the str() function can be used to check the data set structure:

head(datap)
str(datap)

head(datai)
str(datai)

3.3 Checking identifications

A first step consists to check how many proteins have been identified with unique peptides in the dataset (peptides specific to only one protein in the database used to identify). Also, how many are identified in two samples, specifics to one sample, etc.

When the samples are highly reproducible, one can expect that the same proteins will be identified in each of the samples. The intersections between the sets of proteins identified in the samples can be visualized using UpSet graphs to highlight any aberrant sample.

3.3.1 Upset plots of identified proteins

Creating a matrix containing the row number of an identified peptide if it is quantified:

ind_datap=datap
ind_datap[ind_datap>0]=1;
vecto=as.matrix(1:nrow(ind_datap));
row.nb.id=apply(ind_datap,2,function(x) x*vecto);

Installing the UpSetR R package:

install.packages("UpSetR")

Plotting the UpSet graph with R:

library(UpSetR)

data.upset=as.data.frame(ind_datap)

upset.plot = upset(data.upset, 
                   order.by = "freq",
                   number.angles=315, 
                   main.bar.color=4, 
                   nsets=ncol(data.upset),
                   sets.x.label="Nb of proteins", 
                   text.scale = c(1.5,1,1.5,1,1,1), 
                   mb.ratio=c(0.55,0.45));

#To visualize
upset.plot

Details on the interpretation of such plots can be found here: https://upset.app/

3.3.2 Hierarchical clustering of identified proteins

The reproducibility of the identifications between two samples can be evaluated using Jaccard index. This index can be computed using the jaccard() function of the R package jaccard.

Installing the jaccard R package:

install.packages("jaccard")

library(jaccard)

mat.jaccard=matrix(NA,ncol(data.upset),ncol(data.upset))

for (i in 1:ncol(data.upset)){
for (j in 1:ncol(data.upset)){
mat.jaccard[i,j]=jaccard(data.upset[,i],data.upset[,j]);
}}

colnames(mat.jaccard)=colnames(data.upset)
rownames(mat.jaccard)=colnames(data.upset)

Additionally, a distance matrix based on this index can be computed with the as.dist() function, and a hierarchical clustering of samples can be performed with the hclust() function:

hv = hclust(as.dist(1-mat.jaccard),method="ward.D2")

The clustering can be visualized with the ggdendrogram function of the R package ggdendro. It requires also the ggplot2 R package.

Installing the ggdendro and ggplot2 R packages:

install.packages("ggdendro")
install.packages("ggplot2")

library(ggdendro)
library(ggplot2)

ggd = ggdendrogram(hv, rotate = TRUE, theme_dendro = FALSE)
ggd = ggd + theme(axis.text.y = element_text(hjust = 1))
ggd = ggd + xlab("Sample")
ggd = ggd + ylab("Height")
ggd = ggd + ggtitle("Ward’s method with a Jaccard index-based distance")

#To visualize
ggd

We see that the samples are clustering by conditions and that the “M” conditions is far from the others: the identified proteins are globally the same in the samples of a same conditions, but different between conditions.

Additionally, the same approach can be used to evaluate the replication of the “nonidentifications” (missing values) between samples. The dataset to use in the previous function is the next:

ind_datap=datap
ind_datap[datap==0]=1;
ind_datap[datap>0]=0;
data.upset=as.data.frame(ind_datap)

3.4 Checking quantified values

3.4.1 Plotting the distributions of observed values

The differences between the distributions of observed values can be used to see whether samples have globally lower or higher quantified values than in the others.

To check if the distribution of quantified values are the same in different samples, it is important to focus exclusively on proteins which are quantified in all samples you want to compare.

Installing the ggplot2 and ggridges R packages to plot the distributions:

install.packages("ggplot2")
install.packages("ggridges")

Deleting proteins not quantified in all samples:

datai_allq=na.omit(datai)

Formatting into a dataframe to use ggplot:

df=stack(data.frame(datai_allq));
colnames(df)=c("log2_intensities","Samples");

Plotting using stat_density_ridges function of the ggridges R package:

library(ggplot2)
library(ggridges)
dis.obs = ggplot(df, aes(x=log2_intensities, y=Samples, fill=
stat(x)))
dis.obs=dis.obs+stat_density_ridges(geom="density_ridges_gradient",
scale=3,size=0.3,rel_min_height = 0.01,quantile_lines=TRUE,
quantiles = 0.5, alpha = 0.7)
dis.obs = dis.obs + scale_fill_viridis_c(name="log2(int.)",
option="C", direction=-1)
dis.obs = dis.obs + labs(title = "Distribution of intensities
for proteins without missing values")

dis.obs

## Picking joint bandwidth of 0.491

3.4.2 Correlation matrix of quantified values

A Pearson correlation matrix using only complete pairs of observations between two samples to compute each Pearson correlation coefficient can be obtained by using:

mat.cor=cor(datai, use = "pairwise.complete.obs");

Next, they can be visualized with the corrplot function of the corrplot R package.

Installing the corrplot R package:

install.packages("corrplot")

library(corrplot)

## corrplot 0.92 loaded

matc=corrplot(mat.cor, order = "hclust", tl.cex=0.5, number.cex=0.5, addrect = 2);

matc=corrplot.mixed(mat.cor, order = "hclust", tl.cex=0.5, number.cex=0.5);

The correlation matrix allows justifying that there is a good repeatability of quantified values along the samples. It can be useful for determining which biological conditions are close, and which are less so.

3.4.3 Hierarchical clustering of quantified values

Clustering methods can be applied from the obtained Pearson correlation-based distance matrix to illustrate if the intensities measured in different samples are close or not:

hvc = hclust(as.dist(1-mat.cor),method="ward.D2")

require(ggdendro)
ggdc = ggdendrogram(hvc, rotate = TRUE, theme_dendro = FALSE)
ggdc = ggdc + theme(axis.text.y = element_text(hjust = 1))
ggdc = ggdc + xlab("Sample")
ggdc = ggdc + ylab("Height")
ggdc = ggdc + ggtitle("Ward’s method with a Pearson correlation-based distance")

ggdc

We see that the samples of the “M” conditions are clustering together, but that the “N” and “H” condition are mixed together: the quantified values of proteins in the “H” and “N” conditions are globally quite close.

3.5 Exporting all the plots in a PowerPoint where everything can be changed

Here, we explain how to generate a PPTX report file with the R plots previously generated and where everything can be changed in the images.

For this, we use the export R package: https://github.com/tomwenseleers/export

3.5.1 Installing the export R package from Github

The export R packages depends on several R packages that needs to be installed in a first step:

install.packages("officer")
install.packages("rvg")
install.packages("openxlsx")
install.packages("ggplot2")
install.packages("flextable")
install.packages("xtable")
install.packages("rgl")
install.packages("stargazer")
install.packages("tikzDevice")
install.packages("xml2")
install.packages("broom")
install.packages("devtools")

#installation of the export R package
devtools::install_github("tomwenseleers/export")

3.5.2 Creating your own report

Once installed, the graph2ppt() function of the export R package can be used to create the PowerPoint file.

The vector.graphic option is used to create modifiable graphics, the append option is used to add a slide in an already existing powerpoint file.

#load the export R package
library(export)

#Creation of the pptx file and add inside the overview heatmap of PTX-QC

p

graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE)

#Upset plot of proteins identified with unique peptides

upset.plot

graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE)

#Hierarchical clustering of identified proteins

ggd

graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)

#Distribution of quantified values

dis.obs

graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)

#Correlation matrix with 2 rectangles

corrplot(mat.cor, order = "hclust", tl.cex=0.5, number.cex=0.5, addrect = 2);

graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)

#Correlation matrix with Pearson correlation coefficients

corrplot.mixed(mat.cor, order = "hclust", tl.cex=0.5, number.cex=0.5);

graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)

#Hierarchical clustering of quantified values

ggdc

graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)

If you have managed to get this far, you have, in addition to the PTX-QC report, a second report in pptx format in the “C:/path_to_your_file” folder with images that you can modify from PowerPoint to include them in scientific papers for example.

Quality control reports of MS-based omics data

Quentin Giai Gianetto