In this vignette, we show how to create quality control reports using R.
For this, we first use a dedicated R package named “PTX-QC” to obtain standard quality control metrics for a MS-based (prote)omics experiment. Of note, many vignettes are available from the github page of PTX-QC to help users: https://github.com/cbielow/PTXQC
For details, the publication associated to PTX-QC is available here: https://pubs.acs.org/doi/10.1021/acs.jproteome.5b00780
Additionally, we explain how to create some other nice plots to check the reproducibility of identified peptides/proteins and of their quantified values, and to incorporate them in an additional report where everything can be changed in the images from Microsoft PowerPoint.
PTX-QC creates a QC report containing a comprehensive and powerful set of QC metrics, augmented with automated scoring functions. The automated scores are collated to create an overview heatmap at the beginning of the report, giving valuable guidance also to nonspecialists.
The PTX-QC library can be installed from R using:
install.packages("PTXQC")
Once it is installed, you have to load the library in your R session using:
library("PTXQC")
## Loading package PTXQC (version 1.0.13)
For the next, we will use a MaxQuant output from a dataset of the proteomics facility of the Institut Pasteur that leads to a publication in Cell Reports (PMID: 31851926 - Dataset: PXD012825).
The associated publication is available here: https://www.sciencedirect.com/science/article/pii/S2211124719315311?via%3Dihub
Details on the dataset are available here: https://www.ebi.ac.uk/pride/archive/projects/PXD012825
PTX-QC can also be used from any output of another software as long as it allows providing a mzTab file.
R can be used to directly download the data from PRIDE and unzip the file in a temporary file:
<- tempfile()
temp
download.file("https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip",temp)
unzip(temp, exdir = tempdir())
= file.path(tempdir(), "txt") txt_folder
Following the previous section where the dataset has been installed in a temporary file, you just have to use:
= createReport(txt_folder)
report
txt_folder
A report is created in the “txt_folder” file. It takes around 10 min to create the report on my laptop.
Note that the “mqpar.xml” is normally required to get all the QC metrics (missing in our example).
If you have produce your own txt file from MaxQuant, you just have to use:
= createReport("C:/path_to_your_file/txt") report
Similarly, a report is generated in the “txt” file of MaxQuant.
Next, you can look inside the created report if most of the quality metrics are good (or not).
Once the report has been created, you can plot the overview heatmap in R.
Load the heatmap data in the R session (replace “path_to_your_file” and “report_…_heatmap” with the correct name):
<- read.table("C:/path_to_your_file/txt/report_..._heatmap.txt",sep="\t",header=TRUE) data_to_plot
Install and load the “ggplot2” and “reshape2” libraries:
install.packages("ggplot2")
install.packages("reshape2")
Plot the overview heatmap:
#load the needed libraries
library("ggplot2")
library("reshape2")
#transform wide-format data into long-format data.
= melt(data_to_plot) data_to_plot
## Using fc.raw.file as id variables
#heatmap to plot
= ggplot(data_to_plot, aes_string(y="fc.raw.file", x="variable"))
p if (any(is.na(data_to_plot))){
= p + geom_tile(aes_string(fill = "value", colour = "dummy_col")) +
p scale_colour_manual(name="Missing", values=c("NA" = "grey50")) +
guides(colour = guide_legend(override.aes = list(fill = 'grey50')))
else {
} = p + geom_tile(aes_string(fill = "value"))
p
} = p + scale_fill_gradientn(colours = c("red", "black", "green"),
p na.value = "grey50",
limits=c(0, 1),
guide = guide_legend(title = "score"),
breaks=c(1,0.5,0),
labels=c("best","under\nperforming","fail")) +
theme(axis.text.x = suppressWarnings(element_text(angle = 90,
vjust = 0.5,
hjust = 1))) +
ggtitle("Performance overview") +
xlab("") +
ylab("Raw file")
#display the heatmap
p
The average overall quality of the dataset is summarized in green in the last column of the heatmap (at the right).
Of note, a Shiny app of PTX-QC is available at: http://ptxqc.bsc.fu-berlin.de/
It is useful to create a report just by clicking buttons from the web for punctual use.
The PTX-QC report provides many quality control metrics to assess that the MS experiment works well.
However, a MS experiment can work in a good fashion but the biological conditions are not clustering together because of the preparation of samples for instance. We show here how to provide essential plots to check this.
We start from the proteinGroups.txt file of MaxQuant, but the method can be applied from any matrix of quantified values.
Also, we provide a method to export any R plot into a PowerPoint file where everything in the image can be changed from PowerPoint. This can be used to provide additional reports to the PTX-QC report, and is of definite interest to provide nice plots for scientific papers.
For the next, we will use a proteinGroups.txt file of MaxQuant from a dataset of the proteomics facility of the Institut Pasteur that leads to a publication (PMID: 31851926).
In this section, any file from another software can also be used as long as it provides the intensities of peptides/proteins in different samples.
R can be used to directly download the data from PRIDE, unzip the file and extract the proteinGroups.txt file:
<- tempfile()
temp download.file("https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip",temp)
<- read.csv(unz(temp, "txt/proteinGroups.txt"),header=TRUE,sep="\t",quote="")
data unlink(temp)
You can also do it “by hands”. Indeed, all the MaxQuant output files of our study can be downloadable from PRIDE by clicking here (txt.zip file) and unzipping the file afterwards: https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/01/PXD012825/txt.zip
Details on the dataset are available here: https://www.ebi.ac.uk/pride/archive/projects/PXD012825
If you have produce your own txt file from MaxQuant, you just have to use:
<- read.csv("C:/path_to_your_file/txt/proteinGroups.txt",header=TRUE,sep="\t",quote="") data
After importing the data in R, you can check the dataset structure:
#structure of the dataset
str(data)
Also, the number of proteins identified can be obtained with:
#number of rows (proteins) in the dataset
nrow(data)
## [1] 5904
For the next, we are interested in the number of unique peptides identified for each protein of the dataset. They can be extracted using:
=data[,grep("Unique.peptides.",colnames(data))] datap
Also, we are interested in the intensities measured for each protein of the dataset. They can be extracted using:
=data[,grep("Intensity.",colnames(data))] datai
Alternatively, LFQ intensities can be extracted using:
=data[,grep("LFQ.intensity.",colnames(data))] datai
In MaxQuant, missing values are 0. Replace 0 by NA to allow R to identify missing values:
==0]=NA datai[datai
Additionally, it is useful to transform the data with a log2 transformation:
=log2(datai) datai
Columns of the dataset can be renamed by deleting their common pattern:
=unlist(strsplit(colnames(datap),split="Unique.peptides."))
short.name=short.name[short.name!=""]
short.namecolnames(datap)=short.name
=unlist(strsplit(colnames(datai),split="Intensity."))
short.name#or if you use LFQ intensities:
#short.name=unlist(strsplit(colnames(datai),split="LFQ.intensity."))
=short.name[short.name!=""]
short.namecolnames(datai)=short.name
To ensure everything is correct, the first rows of the dataset can be checked with the head() function, and the str() function can be used to check the data set structure:
head(datap)
str(datap)
head(datai)
str(datai)
A first step consists to check how many proteins have been identified with unique peptides in the dataset (peptides specific to only one protein in the database used to identify). Also, how many are identified in two samples, specifics to one sample, etc.
When the samples are highly reproducible, one can expect that the same proteins will be identified in each of the samples. The intersections between the sets of proteins identified in the samples can be visualized using UpSet graphs to highlight any aberrant sample.
Creating a matrix containing the row number of an identified peptide if it is quantified:
=datap
ind_datap>0]=1;
ind_datap[ind_datap=as.matrix(1:nrow(ind_datap));
vecto=apply(ind_datap,2,function(x) x*vecto); row.nb.id
Installing the UpSetR R package:
install.packages("UpSetR")
Plotting the UpSet graph with R:
library(UpSetR)
=as.data.frame(ind_datap)
data.upset
= upset(data.upset,
upset.plot order.by = "freq",
number.angles=315,
main.bar.color=4,
nsets=ncol(data.upset),
sets.x.label="Nb of proteins",
text.scale = c(1.5,1,1.5,1,1,1),
mb.ratio=c(0.55,0.45));
#To visualize
upset.plot
Details on the interpretation of such plots can be found here: https://upset.app/
The reproducibility of the identifications between two samples can be evaluated using Jaccard index. This index can be computed using the jaccard() function of the R package jaccard.
Installing the jaccard R package:
install.packages("jaccard")
library(jaccard)
=matrix(NA,ncol(data.upset),ncol(data.upset))
mat.jaccard
for (i in 1:ncol(data.upset)){
for (j in 1:ncol(data.upset)){
=jaccard(data.upset[,i],data.upset[,j]);
mat.jaccard[i,j]
}}
colnames(mat.jaccard)=colnames(data.upset)
rownames(mat.jaccard)=colnames(data.upset)
Additionally, a distance matrix based on this index can be computed with the as.dist() function, and a hierarchical clustering of samples can be performed with the hclust() function:
= hclust(as.dist(1-mat.jaccard),method="ward.D2") hv
The clustering can be visualized with the ggdendrogram function of the R package ggdendro. It requires also the ggplot2 R package.
Installing the ggdendro and ggplot2 R packages:
install.packages("ggdendro")
install.packages("ggplot2")
library(ggdendro)
library(ggplot2)
= ggdendrogram(hv, rotate = TRUE, theme_dendro = FALSE)
ggd = ggd + theme(axis.text.y = element_text(hjust = 1))
ggd = ggd + xlab("Sample")
ggd = ggd + ylab("Height")
ggd = ggd + ggtitle("Ward’s method with a Jaccard index-based distance")
ggd
#To visualize
ggd
We see that the samples are clustering by conditions and that the “M” conditions is far from the others: the identified proteins are globally the same in the samples of a same conditions, but different between conditions.
Additionally, the same approach can be used to evaluate the replication of the “nonidentifications” (missing values) between samples. The dataset to use in the previous function is the next:
=datap
ind_datap==0]=1;
ind_datap[datap>0]=0;
ind_datap[datap=as.data.frame(ind_datap) data.upset
The differences between the distributions of observed values can be used to see whether samples have globally lower or higher quantified values than in the others.
To check if the distribution of quantified values are the same in different samples, it is important to focus exclusively on proteins which are quantified in all samples you want to compare.
Installing the ggplot2 and ggridges R packages to plot the distributions:
install.packages("ggplot2")
install.packages("ggridges")
Deleting proteins not quantified in all samples:
=na.omit(datai) datai_allq
Formatting into a dataframe to use ggplot:
=stack(data.frame(datai_allq));
dfcolnames(df)=c("log2_intensities","Samples");
Plotting using stat_density_ridges function of the ggridges R package:
library(ggplot2)
library(ggridges)
= ggplot(df, aes(x=log2_intensities, y=Samples, fill=
dis.obs stat(x)))
=dis.obs+stat_density_ridges(geom="density_ridges_gradient",
dis.obsscale=3,size=0.3,rel_min_height = 0.01,quantile_lines=TRUE,
quantiles = 0.5, alpha = 0.7)
= dis.obs + scale_fill_viridis_c(name="log2(int.)",
dis.obs option="C", direction=-1)
= dis.obs + labs(title = "Distribution of intensities
dis.obs for proteins without missing values")
dis.obs
## Picking joint bandwidth of 0.491
A Pearson correlation matrix using only complete pairs of observations between two samples to compute each Pearson correlation coefficient can be obtained by using:
=cor(datai, use = "pairwise.complete.obs"); mat.cor
Next, they can be visualized with the corrplot function of the corrplot R package.
Installing the corrplot R package:
install.packages("corrplot")
library(corrplot)
## corrplot 0.92 loaded
=corrplot(mat.cor, order = "hclust", tl.cex=0.5, number.cex=0.5, addrect = 2); matc
=corrplot.mixed(mat.cor, order = "hclust", tl.cex=0.5, number.cex=0.5); matc
The correlation matrix allows justifying that there is a good repeatability of quantified values along the samples. It can be useful for determining which biological conditions are close, and which are less so.
Clustering methods can be applied from the obtained Pearson correlation-based distance matrix to illustrate if the intensities measured in different samples are close or not:
= hclust(as.dist(1-mat.cor),method="ward.D2")
hvc
require(ggdendro)
= ggdendrogram(hvc, rotate = TRUE, theme_dendro = FALSE)
ggdc = ggdc + theme(axis.text.y = element_text(hjust = 1))
ggdc = ggdc + xlab("Sample")
ggdc = ggdc + ylab("Height")
ggdc = ggdc + ggtitle("Ward’s method with a Pearson correlation-based distance")
ggdc
ggdc
We see that the samples of the “M” conditions are clustering together, but that the “N” and “H” condition are mixed together: the quantified values of proteins in the “H” and “N” conditions are globally quite close.
Here, we explain how to generate a PPTX report file with the R plots previously generated and where everything can be changed in the images.
For this, we use the export R package: https://github.com/tomwenseleers/export
The export R packages depends on several R packages that needs to be installed in a first step:
install.packages("officer")
install.packages("rvg")
install.packages("openxlsx")
install.packages("ggplot2")
install.packages("flextable")
install.packages("xtable")
install.packages("rgl")
install.packages("stargazer")
install.packages("tikzDevice")
install.packages("xml2")
install.packages("broom")
install.packages("devtools")
#installation of the export R package
::install_github("tomwenseleers/export") devtools
Once installed, the graph2ppt() function of the export R package can be used to create the PowerPoint file.
The vector.graphic option is used to create modifiable graphics, the append option is used to add a slide in an already existing powerpoint file.
#load the export R package
library(export)
#Creation of the pptx file and add inside the overview heatmap of PTX-QC
p
graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE)
#Upset plot of proteins identified with unique peptides
upset.plot
graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE)
#Hierarchical clustering of identified proteins
ggd
graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)
#Distribution of quantified values
dis.obs
graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)
#Correlation matrix with 2 rectangles
corrplot(mat.cor, order = "hclust", tl.cex=0.5, number.cex=0.5, addrect = 2);
graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)
#Correlation matrix with Pearson correlation coefficients
corrplot.mixed(mat.cor, order = "hclust", tl.cex=0.5, number.cex=0.5);
graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)
#Hierarchical clustering of quantified values
ggdc
graph2ppt(file="C:/path_to_your_file/reportQC.pptx", width=7,height=5,vector.graphic = TRUE, append=TRUE)
If you have managed to get this far, you have, in addition to the PTX-QC report, a second report in pptx format in the “C:/path_to_your_file” folder with images that you can modify from PowerPoint to include them in scientific papers for example.