Differential Analysis with DESeq2
In this section of the tutorial, we will guide you through the practical steps necessary to set up the RStudio environment, load the required libraries and data, and execute the DESeq2 analysis. By the end of this section, you will have a fully functional DESeq2 analysis pipeline set up in RStudio, ready to uncover the differentially expressed genes in your dataset.
Launching the RStudio environment
Once the nf-core/rnaseq pipeline is terminated, the resulting data are stored in the folder results_star_salmon
. Now, we can analyse the results by running DESeq2 on RStudio. First of all, we need to launch it:
A pop-up will appear and by clicking on Open, we will be redirected to the RStudio login page. By inserting the username and the password reported below, we will be able to connect to RStudio:
Using sleep
will keep the Gitpod session active, preventing disconnection and providing enough time to complete our analysis without interruptions
Differential Expression Analysis
As in all analysis, firstly we need to create a new project:
-
Go to the File menu and select New Project;
-
Select New Directory, New Project, name the project as shown below and click on Create Project;
- The new project will be automatically opened in RStudio.
We can check whether we are in the correct working directory with getwd()
. The path /workspace/gitpod/training/DE_analysis/
should appear on your console.
To store our results in an organized way, we will create a folder named de_results using the New Folder button in the bottom right panel. We will save all our resulting tables and plots in this folder. Next, go to the File menu, select New File and then R Script to create a script editor in which we will save all commands required for the analysis. In the editor type:
and save the file as de_script.R. From now on, each command described in the tutorial can be added to your script. The resulting working directory should look like this:
The analysis requires several R packages. To utilise them, we need to load the following libraries:
and the pre-computed DESeq2 object (dds
) generated by the nfcore/rnaseq pipeline. In this tutorial we will analyse the dds
object generated by running the alignment with STAR and the quantification with Salmon:
Alternatively, a user could choose to analyse the the dds
object generated by running only Salmon for both lightweight alignment and quantification.
In DESEq2, the dds
object is a central data structure that contains the following components:
-
countData
: a matrix of raw count data, where each row represents a gene and each column represents a sample; -
colData
: a data frame containing information about the samples, such as the experimental design, treatment and other relevant metadata; -
design
: a formula specifying the experimental design utilised to estimate the dispersion and the log2 fold change.
All these components can be checked with specific commands:
The colData
and the design
are the ones created within the nfcore/rnaseq pipeline and must be reorganised prior to the analysis. With the following commands we will create our metadata starting from the info stored in the dds
. We will rename the column of the colData
, we will ensure that the rownames of the metadata are present in the same order as the column names and finally we will update the colData
of the dds
object with our newly created metadata.
With this operation we also eliminate the sizeFactors
already estimated by the nfcore/rnaseq pipeline.
To avoid errors in DESeq2 is essential to check that sample names match between the colData
and the countData
, and that the sample are in the correct order:
Now that everything is setted up, we can proceed to generate a new DESeq2 object with the corrected metadata and the right design:
Comparing the structure of the newly created dds (dds_new
) with the one automatically produced by the pipeline (dds
), we can observe the differences:
Before running the different steps of the analysis, a good practice consists in pre-filtering the genes to remove those with very low counts. This is useful to improve computional efficiency and enhance interpretability. In general, it is reasonable to keep only genes with a sum counts of at least 10 for a minimal number of 3 samples:
Now, it is time to run the differential expression analysis with the DESeq()
function:
The DESeq()
function is a high-level wrapper that simplifies the process of differential expression analysis by combining multiple steps into a single function call. This makes the workflow more user-friendly and ensures that all necessary preprocessing and statistical steps are executed in the correct order. The key functions that DESeq2 calls include:
-
estimateSizeFactors
: to normalise the count data; -
estimateDispersion
: to estimate the dispersion; -
nbinomWaldTest
: to perform differential expression test.
The individual functions can be carried out also singularly as shown below:
The next step in the DESeq2 workflow is to perform quality control (QC) analysis on our data. This analysis is crucial for identifying potential issues ensuring that the data are suitable for downstream analysis. For QC analysis, it is useful to work with transformed versions of the count data, variance-stabilised (vst)
or regularised log-transformed (rlog)
counts. While, the rlog is more robust to outliers and extreme values, vst is computationally faster and so preferred for larger dataset.
These transformations are used for visualisation purposes, while DESeq2 requires raw counts for differential expression analysis.
The rlog
and the vst
transformations have an argument, blind that can be set to:
-
TRUE (default): useful for QC analysis because it re-estimates the dispersion, allowing for comparison of samples in an unbiased manner with respect to experimental conditions;
-
FALSE: the function utilizes the already estimated dispersion, generally applied when differences in counts are expected to be due to the experimental design.
Next, we perform Principal Component Analysis (PCA) to explore the data. DESeq2 provides a built-in function, plotPCA()
, which uses ggplot2 for visualisation, taking the rld
(or the vst
) object as input.
Since the treatment is the principal condition of interest in our metadata, we will use the condition
information from our metadata to plot the PCA:
The second essential step in QC analysis is hierarchical clustering. Although DESeq2 does not have a built-in function for this analysis, we can use the pheatmap()
function from the pheatmap package.
We will extract the matrix of rlog-transformed counts from the rld
object (pheatmap input), compute pairwise correlations and plot the heatmap:
The normalised counts stored in the dds
can be inspected with the counts()
function and saved in our results folder:
The results()
function in DESeq2 is used to extract the results of the DE analysis. This function takes the dds
object as input and returns a DataFrame containing the results of the analysis:
-
baseMean: the average expression level of the gene across all samples;
-
log2FoldChange: the log2 fold change of the gene between the condition of interest and the reference level;
-
lfcSE: the standard error of the log2 fold change;
-
stat: the Wald statistic, which is used to calculate the p-value;
-
pvalue: the p-value from the Wald test indicates the probability of observing the measured difference in gene expression (log2 fold change) by chance, assuming no true difference exists (null hypothesis). A low p-value suggests that the observed expression change between samples is unlikely due to random chance, so we can reject the null hypothesis —> the gene is differentially expressed;
-
padj: the adjusted p-value, which takes into account multiple testing corrections, (Benjamini-Hochberg method default) to control the false discovery rate;
The results()
function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter alpha
. The results()
function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast (specific comparison between two or more levels).
The order of the contrast names determines the direction of the fold change that is reported in the results. Specifically, the first level of the contrast is the condition of interest and the second level is the reference level.
Notice that in this tutorial the contrast is already correctly specified.
In the Experimental Design section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change of at least 1 or -1 as differentially expressed.
Plot the results
Now that we have obtained the results of the differential expression analysis, it’s time to visualise the data to gain a deeper understanding of the biological processes that are affected by the experimental conditions. Visualisation is a crucial step in RNA-seq analysis, as it allows us to identify patterns and trends in the data that may not be immediately apparent from the numerical results.
In the following sections, we will explore different types of plots that are commonly used to visualise the results of RNA-seq analysis, including:
- MA plot: scatter plot commonly utilised to visualise the results of the DE analysis for all the samples. The plot displays the mean of the normalised counts on the x-axis and the log2 fold change on the y-axis. This allows the visualisation of the relationship between the magnitude of the fold change and the mean expression level of the genes. Genes that are differentially expressed will appear farthest from the horizontal line, while genes with low expression levels will appear closer to the line.
- counts plot: plot of the normalised counts for a single gene across the different conditions in your experiment. It’s particularly useful for visualising the expression levels of specific genes of interest and comparing them across sample groups.
heatmap: plot of the normalised counts for all the significant genes obtained with the pheatmap()
function. The heatmap provides insights into genes and sample relationships that may not be apparent from individual gene plots alone.
- volcano plot: scatter plot that displays the log2 fold change on the x-axis and the log transformed padj on the y-axis. This allows for the visualisation of both the magnitude and significance of the changes in gene expression between two conditions. Genes that are differentially expressed (i.e., have a large log2 fold change) and are statistically significant (i.e., have a low padj) will appear in the left (downregulated genes) or in the right (upregulated genes) corners of the plot making easier their identification.
Functional analysis
The output of the differential expression analysis is a list of significant DE genes. To uncover the underlying biological mechanisms, various downstream analyses can be performed, such as functional enrichment analysis (identify overrepresented biological processes, molecular functions, cellular components or pathways), and network analysis (group genes based on similar expression patterns to identify novel interactions). To facilitate the interpretation of the resulting list of DE genes, a range of freely available web- and R-based tools can be employed.
In this tutorial, we will explore an enrichment analysis technique known as Over-Representation Analysis (ORA), a powerful tool for identifying biological pathways or processes significantly enriched within the list of DE genes. The underlying statistic behind ORA is the hypergeometric test, which considers three key components:
-
Universe: the background list of genes (for example the genes annotated in a genome);
-
GeneSet: a collection of genes annotated by a reference database (such as Gene Ontology), and known to be involved in a particular biological pathway or process;
-
Gene List: the differentially expressed genes.
The hypergeometric test calculates the probability of observing a certain number of genes from the gene set (pathway or process) within the gene list (DE genes) by chance. An important aspect of this analysis is the concept of membership. It defines the relationship between DE genes and genes from the analysed gene set. By knowing which genes belong to which pathway/process, we can determine whether the observed overlap between DE genes and the particular pathway/process is greater than what would be expected by random chance.