metaQuantome Command-line Tutorial

Caleb W. Easterly, Ray Sajulga, Subina Mehta, James Johnson, Praveen Kumar, Shane Hubler, Bart Mesuere, Joel Rudney, Timothy J. Griffin, Pratik D. Jagtap April 19, 2019

go back to the main page

Introduction

This tutorial accompanies the article Easterly et al., Mol Cell Proteomics 2019, and is designed to show how to use the metaQuantome command line interface. An introduction to the metaQuantome Galaxy tool interface is available here.

Installing metaQuantome

The easiest way to install metaQuantome with all the dependencies is by using Bioconda (provided you are on Mac or Linux, which are the only systems supported by Bioconda).

First, install the conda package manager, by downloading either Anaconda or Miniconda (see https://docs.anaconda.com/anaconda/install/). Then, the following commands will set up the necessary channels for Bioconda and metaQuantome. If needed, additional information on Bioconda is available at https://bioconda.github.io/.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Then, run the following command to set up an environment named mqome, which will have metaQuantome (version 1.0.0) and all dependencies in it:

conda create -n mqome metaquantome=2.0.1

If the following prompt is seen at the command line, type y:

Proceeed ([y]/n)?

The create command only needs to be run once.

Finally, we can activate the environment using the following command, which will make the metaquantome command available.

source activate mqome

The rest of the tutorial will assume that you have the Conda environment activated, and are on a Mac or Linux system. Furthermore, you can follow along with the tutorial by cloning the Github repository at https://github.com/galaxyproteomics/metaquantome_mcp_analysis and changing your directory to <git_repo_root>/cli_tutorial. If you have any problems, please submit an issue at https://github.com/galaxyproteomics/metaquantome/issues.

The data

metaQuantome takes 2-3 input files depending on the mode of analysis. In the case of functional analysis, metaQuantome takes in a functional annotation file and a file with peptide intensities. For taxonomic analysis, the peptide intensity file and a file with lowest common ancestor (LCA) peptide annotations. For function-taxonomy interaction analysis, metaQuantome takes in all three files above.

In this tutorial, the data is from a previously published paper [1], and is downsampled in the interest of speed and simplicity - so, keep in mind that these results are by not necessarily representative. The purpose of this tutorial is to demonstrate the capabilities of metaQuantome, not to analyze the full dataset. We can see the beginning of each of the files below.

Function

peptide go
LPGQQHGTPSAK GO:0009405,GO:0019867
ELPGLAALTDK GO:0006096,GO:0005737,GO:0004618,GO:0005524
ELADASVSTIEIER GO:0006412,GO:0015935,GO:0003735,GO:0019843,GO:0003729
QISAGELR GO:0006412,GO:0005840,GO:0022625,GO:0003735,GO:0003729
VPAILADATK GO:0055085

Note that there can be multiple GO terms (or no GO terms) annotated to a peptide, as long as they are separated by a comma.

Taxonomy

peptide lca
LPGQQHGTPSAK Veillonella
ELPGLAALTDK Bacteria
ELADASVSTIEIER Bacteria
QISAGELR Bacteria
VPAILADATK Granulicatella adiacens

Here, the peptides are annotated with their lowest common ancestor as a NCBI taxonomy names. Unlike names, NCBI taxIDs are unambiguous and not affected by differences in capitalization or spelling errors. So, taxIDs are highly recommended.

Peptide intensities

Here, the mass-spectrometer spectral intensities are provided for each peptide.

peptide X737WS X737NS X852WS X852NS X867WS X867NS
LPGQQHGTPSAK 160259.9 334347.7 NA NA 564647.6 402888.9
ELPGLAALTDK 9640246.9 NA 6379355.0 1842767 5491673.7 NA
ELADASVSTIEIER 3957113.0 NA 238529.0 NA 1277833.8 165964.2
QISAGELR 17929965.5 774077.2 18109620.3 2481987 6576851.4 NA
VPAILADATK NA 202338.2 447102.1 1892304 1049044.7 1696422.4

Note that missing data is here represented as NA. In general, metaQuantome treats 0, NA, and NaN as missing data.

Also, metaQuantome assumes that input data is untransformed and normalized - that is, the data should be on the original measurement scale (not log transformed or any other transformation), and it should be normalized by an appropriate method.

Databases

metaQuantome has a separate module for downloading the necessary databases. Let’s first get the helptext for the module and make sure everything is installed correctly:

metaquantome db -h
usage: metaquantome db [-h] [--dir DIR] [--update]
                       {ncbi,go,ec} [{ncbi,go,ec} ...]

positional arguments:
  {ncbi,go,ec}       database to download. note that COG mode does not require
                     a download due to its simplicity.

optional arguments:
  -h, --help         show this help message and exit
  --dir DIR, -d DIR  data directory for files.
  --update, -u       overwrite existing databases if present.

If you get any errors that suggest that there has been trouble installing the package, please post on https://github.com/galaxyproteomics/metaquantome/issues.

Now, to actually download the databases, choose a folder - here, data - and run the following command:

mkdir -p ./data
metaquantome db ncbi go --dir ./data

As detailed in the manuscript, the are several questions that metaQuantome can be used to explore. Let’s go through a few of these.

Most Abundant Taxa

metaQuantome can be used to identify the most abundant taxa in an experiment, and stratify this by condition. The workflow here is

  1. expand the set of annotations to include indirect annotations
  2. filter the expanded results to well-supported taxa
  3. viz the filtered taxa, ranked by abundance.

Expand

metaQuantome requires information about each of the samples, and which experimental conditions they belong to. The names of the samples also should be the column names in the intensity file. For example, if you have two experimental conditions “A” and “B”, each with two samples that you’re calling “A1”, “A2”, and “B1”, “B2”, respectively, the columns of your intensity file should be peptide (or another peptide column name), A1,A2,B1, and B2. Then, the --samps argument gives metaQuantome the information about which samples below to which groups. There are two ways to provide this information to metaQuantome:

An example of the required tabular file is in input_files/samples.tab:

cat input_files/samples.tab
group   colnames
NS  X737NS,X852NS,X867NS
WS  X737WS,X852WS,X867WS

The file must have the ‘group colnames` header, and cannot end with a newline. Moreover, it must be separated with ‘hard’ tabs rather than ‘soft tabs’, as must all files used by metaQuantome - the easiest way to ensure this is to create the file in Excel or other spreadsheet software, and then export to a tab-separated file. On the other hand, the JSON option is simpler for a smaller number of samples. The JSON equivalent in this case is

'{"NS": ["X733NS","X852NS","X866NS"], "WS": ["X733WS", "X852WS", "X866WS"]}'

Note that the items in the JSON string must be surrounded by double quotes (e.g., "thing").

Now, let’s run the expand module on the files in input_files.

metaquantome expand \
    --mode t \
    --int_file input_files/int.tab \
    --pep_colname_int peptide \
    --tax_file input_files/tax.tab \
    --pep_colname_tax peptide \
    --tax_colname lca \
    --samps input_files/samples.tab \
    --outfile mqome_outputs/tax_expanded.tab \
    --data_dir data

Note that many of the arguments give column names in the input files, so that metaQuantome can keep track of what’s what. Specifically, it needs to know the name of the peptide column in each file (--pep_colname_int, --pep_colname_tax), and the name of the taxonomy least common ancestor column in the taxonomy file (tax_colname).

Let’s look at the output in mqome_outputs:

id taxon\_name rank NS\_mean WS\_mean X737NS X737WS X852NS X852WS X867NS X867WS X737NS\_n\_peptide X737WS\_n\_peptide X852NS\_n\_peptide X852WS\_n\_peptide X867NS\_n\_peptide X867WS\_n\_peptide X737NS\_n\_samp\_children X737WS\_n\_samp\_children X852NS\_n\_samp\_children X852WS\_n\_samp\_children X867NS\_n\_samp\_children X867WS\_n\_samp\_children
481 Neisseriaceae family 21.18733 16.54774 22.09506 17.25113 18.17465 15.11985 NA NA 1 1 1 1 NA NA 1 1 1 1 NA NA
538 Eikenella genus 21.18733 16.54774 22.09506 17.25113 18.17465 15.11985 NA NA 1 1 1 1 NA NA 1 1 1 1 NA NA
539 Eikenella corrodens species 21.18733 16.54774 22.09506 17.25113 18.17465 15.11985 NA NA 1 1 1 1 NA NA NA NA NA NA NA NA
712 Pasteurellaceae family 26.45509 24.24076 24.96501 22.21824 NA 17.34137 27.1723 25.69783 12 12 NA 1 12 12 1 1 NA NA 1 1
724 Haemophilus genus 26.04355 24.30867 23.49238 21.57275 NA NA 26.9149 25.19611 5 5 NA NA 5 5 1 1 NA NA 1 1

(The software used to write this tutorial puts in the backslashes (\) in the column names - those aren’t actually there in the file)

Quantitative outputs are included (columns like “NS_mean”, which shows the log base 2 of the average intensity for that term), as well as information about the ‘quality’ of the annotations, like “X733NS_n_peptide”, which shows the number of peptides that give evidence for each term. For example, 4 peptides were observed in sample X733NS that were annotated with the Firmicutes phylum or any of its children. These extra columns are included because they are used in metaquantome filter, and for the sake of transparency.

Filter

Now, let’s filter the expanded outputs to terms about which we have a certain level of confidence.

metaquantome filter \
    --expand_file mqome_outputs/tax_expanded.tab \
    --mode t \
    --samps input_files/samples.tab \
    --min_peptides 1 \
    --min_pep_nsamp 1 \
    --min_children_non_leaf 2 \
    --min_child_nsamp 1 \
    --qthreshold 2 \
    --outfile mqome_outputs/tax_filt.tab

Here, we’re requiring that each term satisfy the following conditions:

Let’s look at the output:

id taxon\_name rank NS\_mean WS\_mean X737NS X737WS X852NS X852WS X867NS X867WS X737NS\_n\_peptide X737WS\_n\_peptide X852NS\_n\_peptide X852WS\_n\_peptide X867NS\_n\_peptide X867WS\_n\_peptide X737NS\_n\_samp\_children X737WS\_n\_samp\_children X852NS\_n\_samp\_children X852WS\_n\_samp\_children X867NS\_n\_samp\_children X867WS\_n\_samp\_children
712 Pasteurellaceae family 26.45509 24.24076 24.96501 22.21824 0.00000 17.34137 27.17230 25.69783 12 12 0 1 12 12 1 1 0 0 1 1
724 Haemophilus genus 26.04355 24.30867 23.49238 21.57275 0.00000 0.00000 26.91490 25.19611 5 5 0 0 5 5 1 1 0 0 1 1
729 Haemophilus parainfluenzae species 23.51605 21.83745 22.11308 20.13283 0.00000 0.00000 24.21369 22.59720 4 4 0 0 4 4 0 0 0 0 0 0
1224 Proteobacteria phylum 26.50641 26.40567 25.64756 25.68405 25.57253 26.02080 27.45138 27.10854 18 18 3 4 17 17 2 2 2 2 2 2
1239 Firmicutes phylum 28.51239 30.09143 28.16821 29.85861 28.68930 29.90299 28.62579 30.43794 247 362 325 366 317 354 3 2 3 3 3 2

Viz

Finally, using the filtered output, we can visualize the most abundant taxa with the viz module. To do this, we provide viz with the name of the mean column (or any column) and a few other arguments. The most important argument for visualizing taxonomy is --target_rank, which indicates the rank we want to see. Let’s go with species for now (since it’s a reduced dataset, there aren’t that many genera or families or so on). In addition, we can choose the number of terms we want to display (--nterms).

metaquantome viz \
    --plottype bar \
    --infile mqome_outputs/tax_filt.tab \
    --mode t \
    --samps input_files/samples.tab \
    --meancol NS_mean \
    --target_rank species \
    --nterms 3 \
    --img mqome_outputs/imgs/tax_ns.png \
    --width 7 \
    --height 4 \
    --barcol 1
metaquantome viz \
    --plottype bar \
    --infile mqome_outputs/tax_filt.tab \
    --mode t \
    --samps input_files/samples.tab \
    --meancol WS_mean \
    --target_rank species \
    --nterms 3 \
    --img mqome_outputs/imgs/tax_ws.png \
    --width 7 \
    --height 4 \
    --barcol 2

NS

WS

Cluster Analysis of Taxonomy

metaQuantome offers two methods for cluster analysis: principal components analysis and hierarchically clustered heatmaps. In both cases, we can use the mqome_outputs/tax_filt.tab file already produced above in the expand and filter steps.

Principal components analysis

metaquantome viz \
    --plottype pca \
    --infile mqome_outputs/tax_filt.tab \
    --mode t \
    --samps input_files/samples.tab \
    --img mqome_outputs/imgs/tax_pca.png \
    --calculate_sep

The --calculate_sep argument tells metaQuantome to calculate the ratio of the between-cluster variance to the within-cluster variance, where higher indicates a better separation of clusters (see the main paper for details). The separation is included in the plot title.

Heatmap

metaquantome viz \
    --plottype heatmap \
    --infile mqome_outputs/tax_filt.tab \
    --mode t \
    --samps input_files/samples.tab \
    --img mqome_outputs/imgs/tax_heatmap.png

Differentially Abundant Functions

Now, let’s go through a differential abundance analysis with metaQuantome. We’ll do analysis on GO terms, and look for terms that are significantly different between the no sucrose (NS) and with sucrose (WS) samples. The workflow is similar, but we add another step (stat), so we do expand -> filter -> stat -> viz.

Expand

metaquantome expand \
    --mode f \
    --data_dir data \
    --int_file input_files/int.tab \
    --pep_colname_int peptide \
    --func_file input_files/func.tab \
    --pep_colname_func peptide \
    --func_colname go \
    --ontology go \
    --samps input_files/samples.tab \
    --outfile mqome_outputs/func_expanded.tab

Filter

Here, we filter to GO terms with good evidence, and those that are appropriate for statistical analysis. Specifically, we need at least 3 samples per group for a robust t-test.

metaquantome filter \
    --expand_file mqome_outputs/func_expanded.tab \
    --mode f \
    --ontology go \
    --samps input_files/samples.tab \
    --min_peptides 2 \
    --min_pep_nsamp 2 \
    --min_children_non_leaf 2 \
    --min_child_nsamp 2 \
    --qthreshold 3 \
    --outfile mqome_outputs/func_filt.tab

Stat

In stat, we use the tabular result from filter to append fold changes and the p-values obtained from t-tests. For additional options, use metaquantome stat -h

metaquantome stat \
    --file mqome_outputs/func_filt.tab \
    --mode f \
    --ontology go \
    --samps input_files/samples.tab \
    --parametric True \
    --paired \
    --outfile mqome_outputs/func_stat.tab

Viz

The visualization provided by metaQuantome for differential abundance is a volcano plot, where the -log10 p value is plotted against the log2 fold change.

Here, we provide the name of the fold change column in the result from stat, and whether or not to reverse the fold change. Note that the fold change is automatically calculated by alphabetizing the two group names, so we may want to reverse the order (here, we want to know WS over NS rather than NS over WS).

metaquantome viz \
    --plottype volcano \
    --infile mqome_outputs/func_stat.tab \
    --mode f \
    --samps input_files/samples.tab \
    --meancol NS_mean \
    --fc_name log2fc_NS_over_WS \
    --flip_fc \
    --img mqome_outputs/imgs/func_dea.png

We may also want to see the names specific terms that are differentially abundant. To do this, we provide the name of the column that has the text we want to plot - usually, the id column (for function) or taxon_name (for taxonomy) to --textannot. The plotting program automatically hides some of the terms to avoid overplotting - see mqome_outputs/func_stat.tab for the full list of results.

metaquantome viz \
    --plottype volcano \
    --infile mqome_outputs/func_stat.tab \
    --mode f \
    --samps input_files/samples.tab \
    --meancol NS_mean \
    --fc_name log2fc_NS_over_WS \
    --flip_fc \
    --img mqome_outputs/imgs/func_dea_text.png \
    --textannot id

Finally, we may also want to divide the GO terms by ontology - biological process, molecular function, and cellular component. We can do that by providing the flag --gosplit. You may also have to tweak the --width and --height (in inches) to get everything looking right.

metaquantome viz \
    --plottype volcano \
    --infile mqome_outputs/func_stat.tab \
    --mode f \
    --samps input_files/samples.tab \
    --meancol NS_mean \
    --fc_name log2fc_NS_over_WS \
    --flip_fc \
    --img mqome_outputs/imgs/func_dea_text_split.png \
    --textannot id \
    --gosplit \
    --width 5 \
    --height 8

Function-Taxonomy Interaction

In this analysis, we examine the contribution of each taxon to carbohydrate metabolism, which is denoted by the GO term GO:0005975, “carbohydrate metabolic process”. Once again, the workflow is expand -> filter, but now we go straight to viz (testing is not recommended in function-taxonomy interaction mode).

Expand

First, we run the expand module. The mode is now ft, and we have to provide information about each of the intensity, taxonomy, and function files.

metaquantome expand \
    --mode ft \
    --data_dir data \
    --int_file input_files/int.tab \
    --pep_colname_int peptide \
    --tax_file input_files/tax.tab \
    --pep_colname_tax peptide \
    --tax_colname lca \
    --func_file input_files/func.tab \
    --pep_colname_func peptide \
    --func_colname go \
    --ontology go \
    --samps input_files/samples.tab \
    --outfile mqome_outputs/ft_expanded.tab

Filter

Now, we run the filter module, which does not change much by mode.

metaquantome filter \
    --expand_file mqome_outputs/ft_expanded.tab \
    --mode ft \
    --samps input_files/samples.tab \
    --min_peptides 1 \
    --min_pep_nsamp 1 \
    --min_children_non_leaf 2 \
    --min_child_nsamp 1 \
    --qthreshold 2 \
    --outfile mqome_outputs/ft_filt.tab

Viz

Finally, we visualize the results. Many vizualizations can be made from the same filter result file, including looking at taxonomic distribution of a functional term (--whichway t_dist) and looking at the functional distribution for a taxon (--whichway f_dist). In addition, for t_dist we need to provide the --target_rank; that is, the rank for which we’re interested in the distribution. Here, the target rank is genus.

metaquantome viz \
    --mode ft \
    --samps samples.tab \
    --infile mqome_outputs/ft_filt.tab \
    --img mqome_outputs/imgs/ft_carbo_ns.png \
    --plottype ft_dist \
    --meancol NS_mean \
    --whichway t_dist \
    --id GO:0005975 \
    --target_rank genus \
    --nterms all \
    --barcol 1
metaquantome viz \
    --mode ft \
    --samps samples.tab \
    --infile mqome_outputs/ft_filt.tab \
    --img mqome_outputs/imgs/ft_carbo_ws.png \
    --plottype ft_dist \
    --meancol WS_mean \
    --whichway t_dist \
    --id GO:0005975 \
    --target_rank genus \
    --nterms all \
    --barcol 2

NS

WS

The End

Thank you for going through the tutorial! As we mentioned above, use the help flag for full documentation of each command (e.g. metaquantome expand -h), and, if you have any problems or questions, submit an issue at https://github.com/galaxyproteomics/metaquantome/issues