What is Differential Expression Analysis?
How to find the genes that actually differ between your samples — without a stats degree.
- 1 Differential expression analysis finds genes whose expression levels differ between conditions — for example, disease vs. healthy or treated vs. control.
-
2
Three dominant methods exist:
DESeq2,edgeR, andlimma-voom. All three are statistically sound; they just have different defaults. - 3 Outputs are interpreted using two numbers: adjusted p-value (statistical confidence) and log2 fold change (effect size). Both matter.
The question DE analysis answers
Strip away the jargon and differential expression analysis asks one simple thing: are the gene expression patterns different between two (or more) groups of samples, and if so, which genes are driving the difference?
A concrete example. Take GSE151427, the cardiac differentiation dataset used elsewhere in TransXplorer. It contains RNA-seq counts from two types of stem-cell-derived endothelial cells: cardiac mesoderm-derived (CMEC) and paraxial mesoderm-derived (PMEC). Both look identical under a microscope. Both express the standard endothelial markers. But they are destined for very different anatomical fates — one populates the heart, the other the trunk and limbs.
The biological question is: what is molecularly different about these cells? The DE analysis question is the operational version of that: which of the ~20,000 expressed genes have significantly different average expression between the CMEC and PMEC groups?
You might hope this is trivial: just subtract one group's mean from the other and rank the differences. It isn't trivial, because gene expression is noisy. Every gene's count varies from sample to sample within the same condition for reasons that have nothing to do with biology — library preparation differences, sequencing depth, RNA quality, the random chance of which transcripts a sequencer happened to capture. A gene can look "different" between two groups purely by accident.
Differential expression analysis is the principled way to separate real, reproducible differences from random noise. The output isn't just a list of genes; it's a list of genes plus a statement, for each one, of how confident we are that the difference is real and how large that difference is.
Why we need statistics — the noise problem
Here's the intuition that does most of the work. Imagine you measure the expression of a single gene in three CMEC samples and three PMEC samples. You get six numbers. The CMEC numbers don't agree exactly with each other, and neither do the PMEC numbers — that's within-condition variation. The averages of the two groups also differ — that's the between-condition difference.
The question a statistical test asks is: is the difference between the two group averages bigger than what you'd expect from within-group jitter alone? If yes, the gene is called significant. If no, the apparent difference could easily have come from random noise and is dismissed.
This is the core idea behind every flavor of differential expression test — from a humble t-test to DESeq2's Wald test. They differ in how they estimate the within-group variance, but they all answer the same question.
Now, why can't we just run a t-test on RNA-seq counts? Two reasons. First, with only 3 or 4 replicates per group, the variance estimate for any single gene is itself noisy — you might massively over- or under-estimate the noise floor and call genes significant when they aren't (or miss real ones). Second, RNA-seq data are counts, not the symmetric, continuous measurements a t-test assumes. Both of those problems have specific solutions, which is what makes DESeq2, edgeR, and limma-voom distinct from each other and from generic stats. We'll look at the count problem next.
Why counts need special models — the negative binomial
RNA-seq produces counts. Each gene gets an integer: the number of sequencing reads that mapped to it. Counts have three properties that ordinary statistical models don't handle well.
They are integer-valued and non-negative — you can have 0, 1, 2, or 1,247 reads, but not -3 or 17.4. They have a strong mean–variance relationship: highly expressed genes show much larger absolute variance than lowly expressed ones. And critically, the variance grows faster than the mean, a phenomenon called overdispersion — biological replicates introduce noise on top of the purely statistical Poisson noise you'd see if every replicate were a perfect technical copy.
The negative binomial distribution sits in a sweet spot. It's defined on non-negative integers, it's right-skewed (most genes have most replicates clustered around their mean with occasional larger values), and it has a separate dispersion parameter that lets variance scale with the mean however the data demand. Both DESeq2 and edgeR use the negative binomial directly. They differ mainly in how they estimate that dispersion parameter, especially with the small sample sizes that are typical in RNA-seq (3–6 replicates per group).
limma-voom takes a different route. Instead of modeling counts with a count-aware distribution, it transforms the counts using a procedure called voom, which converts them into approximately-normal values plus a precision weight for each observation. After this transformation, the proven linear-model machinery from microarray analysis works directly on RNA-seq data — which is why limma-voom is so fast on large datasets.
Don't worry if the mathematical details feel slippery. The takeaway is that all three methods are count-aware in some way, and that's why they outperform a naive t-test on RNA-seq data. The next section compares them head-to-head.
DESeq2 vs. edgeR vs. limma-voom — what's actually different
Almost every RNA-seq paper you read uses one of these three packages. They have overlapping reputations and the documentation can feel competitive. Here's a flat comparison.
| Aspect | DESeq2 | edgeR | limma-voom |
|---|---|---|---|
| Underlying model | Negative binomial | Negative binomial | Linear model + voom transformation |
| Dispersion estimation | Shrinks toward a fitted mean–variance trend | Empirical Bayes shrinkage to a common dispersion | Trend-based variance modeling on log-CPM |
| Best for sample sizes | Small to large (≥3 / group) | Small to large (≥3 / group) | Larger studies (≥10 / group); also the fastest |
| Normalization | Median-of-ratios | TMM (trimmed mean of M-values) | TMM + voom precision weights |
| Handles complex designs | Excellent | Excellent | Excellent |
| Output interpretation | log2FC + padj (BH) |
logFC + FDR |
logFC + adj.P.Val |
| Original citation | Love et al. 2014 | Robinson et al. 2010 | Law et al. 2014 |
What that looks like in practice
On most real datasets, the three methods agree on roughly 80% of significant genes. The remaining 20% sits in the borderline region where calls depend on how dispersion is shrunk, which normalization is applied, and how aggressively the method moderates variance.
A working description of the personalities:
DESeq2is often the most conservative of the three at small sample sizes. Its independent filtering and outlier-aware variance shrinkage tend to suppress low-count, high-variance noise. It is the default in most modern RNA-seq pipelines.edgeRcalls slightly more genes significant thanDESeq2on the same data — the difference is usually marginal but consistent. Its quasi-likelihood F-test (glmQLFit) is the recommended modern workflow and is excellent at small sample sizes.limma-voomis the fastest by a wide margin and is the standard choice for large studies (n ≥ 10/group). It can lose a little sensitivity at very small sample sizes (n = 3) because its precision weights need enough data to estimate the mean–variance trend reliably.
A sane workflow: pick one based on study size and the conventions in your sub-field, run a confirmatory pass with a second method if reviewers expect it, and never silently swap methods until a previously non-significant gene becomes significant.
Interpreting outputs — padj, log2FC, baseMean
A DE results table looks intimidating until you know which columns to read. There are usually only three that matter for triage.
padj (adjusted p-value)
A raw p-value is the probability, under the null hypothesis that no real difference exists, of seeing a difference at least as large as the one you observed. If you ran one test at threshold p < 0.05, you'd expect a 5% false positive rate — tolerable. But RNA-seq tests roughly 20,000 genes simultaneously. By chance alone, about 1,000 genes will look "significant" at p < 0.05 even when nothing is actually changing.
The Benjamini–Hochberg (BH) correction transforms raw p-values into adjusted p-values (padj) that control the False Discovery Rate (FDR) — the expected proportion of false positives among the genes you call significant. The conventional threshold is padj < 0.05, which means: on average, no more than 5% of your significant calls are false.
log2 fold change
The log2 fold change is the effect size. It is log2(condition_A / condition_B), with a tweak: DESeq2 applies a shrinkage estimator that pulls low-count, high-variance fold changes toward zero, giving more reliable rankings.
The intuitive translation: log2FC = 1 means the gene is 2× higher in condition A; log2FC = 2 means 4×; log2FC = -1 means half. Sign matters: positive is up in A, negative is down in A — and which condition is "A" depends on how you set your reference level. A common threshold is |log2FC| > 1, which corresponds to at least a 2-fold change in either direction.
baseMean
baseMean is the average normalized count across all samples. It's a sanity check: a "significant" gene with baseMean = 3 is a single read away from being noise. Most published DE filters either drop low-baseMean genes entirely or treat them with extra skepticism. A practical cutoff is baseMean > 10, though independent filtering inside DESeq2 often handles this for you.
A gene can pass either filter on its own and still not be biologically interesting. A gene with padj = 0.0001 but log2FC = 0.1 is statistically real but reflects an effect that is too small to matter physiologically. A gene with log2FC = 5 but padj = 0.4 looks dramatic but could easily be a single outlier sample. Always use both.
Multiple testing correction, made concrete
The multiple testing problem is the single most common source of "false discoveries" in genomics. The arithmetic is straightforward: at significance threshold α = 0.05, a single test has a 5% chance of a false positive under the null. Test 20,000 genes and the expected number of chance hits is 20,000 × 0.05 = 1,000. Without correction, you'd publish a list of "significant" genes that is mostly noise.
There are three common responses:
- No correction. ~1,000 false positives expected. Don't do this.
- Bonferroni. Divide α by the number of tests. Expected false positives drop to near zero — but the threshold becomes so strict that real signal is also lost.
- Benjamini–Hochberg (FDR). Control the proportion, not the count, of false positives among your significant calls. At FDR < 0.05, you accept that ~5% of called genes will be false — in exchange for retaining most of the real signal. This is the standard.
All three DE methods (DESeq2, edgeR, limma-voom) apply Benjamini–Hochberg by default. The column you read is named padj in DESeq2, FDR in edgeR, and adj.P.Val in limma — the same quantity, three names.
Common pitfalls
The six mistakes that show up most often in our review of submitted analyses:
DE with n = 2 per group is unreliable — the variance estimate is essentially undefined. Minimum 3 replicates per group; ≥ 4–5 is much better.
Wildly different sample sizes per group inflate variance and reduce power. Aim for balance when possible; block on covariates when not.
Technical variation can fully mask biological signal. Visualize PCA before DE; if batches separate samples, model batch in the design or correct it. See batch effects.
Small effects can be statistically real but biologically trivial. Use both padj and |log2FC| thresholds. The volcano plot exists precisely to remind you.
If your contrast is reversed, all signs of log2FC flip and biology reads upside down. Always check that the reference level matches the comparison you intend.
Running, re-running, and quietly relaxing padj or fold-change cutoffs until your favorite gene appears is the most common form of inadvertent fraud. Pre-register your thresholds.
Which method should I use?
If you're paralyzed by the choice between DESeq2, edgeR, and limma-voom, this flow chart will get you to a reasonable answer in under a minute.
All three methods handle multi-factor designs (e.g., treatment + sex + batch). DESeq2's design = ~ batch + condition syntax tends to be the most readable. If you have repeated measures or random effects, look at limma's duplicateCorrelation() or the dream extension — that's where it pulls ahead.
How TransXplorer handles all this
TransXplorer runs all three methods in parallel and shows you how much they agree on your specific dataset — so you can pick one with eyes open instead of by reputation. Each method's full results table is browsable; the consensus DEG list is one click away. Multiple-testing correction (BH by default; Bonferroni on request) is applied automatically.
-
All three methods (
DESeq2,edgeR,limma-voom) run in parallel with one button. - Method-agreement Venn diagram on every analysis — see the consensus list before you commit.
-
Interactive volcano plot with draggable
padjandlog2FCthresholds — explore live, don't re-render. -
Auto-annotated top DEGs with gene-symbol labels using
JetBrains Monoand biotype color coding. - One-click GO / KEGG / Reactome enrichment from the DEG list, with ORA and GSEA side by side.
- Publication-ready PDF / SVG export. CSV download of every results table.
Ready to try differential expression?
Load the example dataset and run all three methods in under two minutes. No installation, no login.
Try DE analysis in TransXplorer →Common questions
Which method should I report in my paper?
DESeq2, edgeR, or limma-voom is defensible. Cite the original paper and the version number you used (e.g., "DESeq2 v1.42.0, Love et al. 2014"). If you ran multiple methods, report the primary in the main text and the others in supplementary, with the overlap percentage.
Can I average replicates before running DE?
Is padj < 0.05 a hard rule?
padj < 0.01 or even 0.001) for hypothesis-generation work where you'll commit lab time to follow-ups. Be more lenient (padj < 0.1 or 0.2) for exploratory screens that feed into pathway analysis — GSEA in particular benefits from a longer ranked list.
What if my top gene has padj = 0?
1e-300). It isn't actually zero; the math is just running out of precision. Report it as "padj < 1e-300" or use the original p-value in scientific notation. It's a strong signal, not a bug.
Should I filter low-count genes before DE?
DESeq2's independent filtering, edgeR's filterByExpr()). Pre-filtering genes with fewer than 10 reads summed across all samples speeds analysis without harming results. Don't pre-filter too aggressively — doing so distorts the multiple-testing correction.
References & further reading
-
Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550.
doi:10.1186/s13059-014-0550-8 -
Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140.
doi:10.1093/bioinformatics/btp616 -
Law CW, Chen Y, Shi W, Smyth GK (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15:R29.
doi:10.1186/gb-2014-15-2-r29 -
Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B 57(1):289–300.
doi:10.1111/j.2517-6161.1995.tb02031.x
What's next?
Three directions to keep going — another concept, a deeper method comparison, or back to the hands-on walkthrough.