Concept guide · 15 min read Beginner-friendly

What is Differential Expression Analysis?

How to find the genes that actually differ between your samples — without a stats degree.

No coding required Method-agnostic Plain-language
TL;DR
  • 1 Differential expression analysis finds genes whose expression levels differ between conditions — for example, disease vs. healthy or treated vs. control.
  • 2 Three dominant methods exist: DESeq2, edgeR, and limma-voom. All three are statistically sound; they just have different defaults.
  • 3 Outputs are interpreted using two numbers: adjusted p-value (statistical confidence) and log2 fold change (effect size). Both matter.
01 · The question

The question DE analysis answers

Strip away the jargon and differential expression analysis asks one simple thing: are the gene expression patterns different between two (or more) groups of samples, and if so, which genes are driving the difference?

A concrete example. Take GSE151427, the cardiac differentiation dataset used elsewhere in TransXplorer. It contains RNA-seq counts from two types of stem-cell-derived endothelial cells: cardiac mesoderm-derived (CMEC) and paraxial mesoderm-derived (PMEC). Both look identical under a microscope. Both express the standard endothelial markers. But they are destined for very different anatomical fates — one populates the heart, the other the trunk and limbs.

The biological question is: what is molecularly different about these cells? The DE analysis question is the operational version of that: which of the ~20,000 expressed genes have significantly different average expression between the CMEC and PMEC groups?

You might hope this is trivial: just subtract one group's mean from the other and rank the differences. It isn't trivial, because gene expression is noisy. Every gene's count varies from sample to sample within the same condition for reasons that have nothing to do with biology — library preparation differences, sequencing depth, RNA quality, the random chance of which transcripts a sequencer happened to capture. A gene can look "different" between two groups purely by accident.

Differential expression analysis is the principled way to separate real, reproducible differences from random noise. The output isn't just a list of genes; it's a list of genes plus a statement, for each one, of how confident we are that the difference is real and how large that difference is.

02 · Noise vs. signal

Why we need statistics — the noise problem

Here's the intuition that does most of the work. Imagine you measure the expression of a single gene in three CMEC samples and three PMEC samples. You get six numbers. The CMEC numbers don't agree exactly with each other, and neither do the PMEC numbers — that's within-condition variation. The averages of the two groups also differ — that's the between-condition difference.

The question a statistical test asks is: is the difference between the two group averages bigger than what you'd expect from within-group jitter alone? If yes, the gene is called significant. If no, the apparent difference could easily have come from random noise and is dismissed.

fig. 1 normalized counts 0 50 100 150 200 CMEC (n=4) PMEC (n=4) within (noise) between (signal) VERDICT SIGNIFICANT signal > noise
Fig. 1. The same gene's normalized counts in two conditions. The within-group spread is the noise floor; the between-group shift is the signal. A test calls the gene significant only when the shift clearly exceeds the noise.

This is the core idea behind every flavor of differential expression test — from a humble t-test to DESeq2's Wald test. They differ in how they estimate the within-group variance, but they all answer the same question.

Now, why can't we just run a t-test on RNA-seq counts? Two reasons. First, with only 3 or 4 replicates per group, the variance estimate for any single gene is itself noisy — you might massively over- or under-estimate the noise floor and call genes significant when they aren't (or miss real ones). Second, RNA-seq data are counts, not the symmetric, continuous measurements a t-test assumes. Both of those problems have specific solutions, which is what makes DESeq2, edgeR, and limma-voom distinct from each other and from generic stats. We'll look at the count problem next.

03 · The right model

Why counts need special models — the negative binomial

RNA-seq produces counts. Each gene gets an integer: the number of sequencing reads that mapped to it. Counts have three properties that ordinary statistical models don't handle well.

They are integer-valued and non-negative — you can have 0, 1, 2, or 1,247 reads, but not -3 or 17.4. They have a strong mean–variance relationship: highly expressed genes show much larger absolute variance than lowly expressed ones. And critically, the variance grows faster than the mean, a phenomenon called overdispersion — biological replicates introduce noise on top of the purely statistical Poisson noise you'd see if every replicate were a perfect technical copy.

fig. 2 Poisson too narrow counts → μ = σ² Normal allows negatives counts < 0? continuous, symmetric Negative Binomial good fit ✓ counts ≥ 0, σ² > μ
Fig. 2. Three candidate distributions for RNA-seq counts. Poisson forces variance to equal the mean, underestimating real noise. Normal allows negative counts and is symmetric. Negative binomial accepts integers, stays non-negative, and lets variance exceed the mean.

The negative binomial distribution sits in a sweet spot. It's defined on non-negative integers, it's right-skewed (most genes have most replicates clustered around their mean with occasional larger values), and it has a separate dispersion parameter that lets variance scale with the mean however the data demand. Both DESeq2 and edgeR use the negative binomial directly. They differ mainly in how they estimate that dispersion parameter, especially with the small sample sizes that are typical in RNA-seq (3–6 replicates per group).

limma-voom takes a different route. Instead of modeling counts with a count-aware distribution, it transforms the counts using a procedure called voom, which converts them into approximately-normal values plus a precision weight for each observation. After this transformation, the proven linear-model machinery from microarray analysis works directly on RNA-seq data — which is why limma-voom is so fast on large datasets.

Don't worry if the mathematical details feel slippery. The takeaway is that all three methods are count-aware in some way, and that's why they outperform a naive t-test on RNA-seq data. The next section compares them head-to-head.

04 · Method comparison

DESeq2 vs. edgeR vs. limma-voom — what's actually different

Almost every RNA-seq paper you read uses one of these three packages. They have overlapping reputations and the documentation can feel competitive. Here's a flat comparison.

Aspect DESeq2 edgeR limma-voom
Underlying model Negative binomial Negative binomial Linear model + voom transformation
Dispersion estimation Shrinks toward a fitted mean–variance trend Empirical Bayes shrinkage to a common dispersion Trend-based variance modeling on log-CPM
Best for sample sizes Small to large (≥3 / group) Small to large (≥3 / group) Larger studies (≥10 / group); also the fastest
Normalization Median-of-ratios TMM (trimmed mean of M-values) TMM + voom precision weights
Handles complex designs Excellent Excellent Excellent
Output interpretation log2FC + padj (BH) logFC + FDR logFC + adj.P.Val
Original citation Love et al. 2014 Robinson et al. 2010 Law et al. 2014

What that looks like in practice

On most real datasets, the three methods agree on roughly 80% of significant genes. The remaining 20% sits in the borderline region where calls depend on how dispersion is shrunk, which normalization is applied, and how aggressively the method moderates variance.

fig. 3 ~80% ALL 3 AGREE DESeq2 edgeR limma-voom ~6% ~7% ~5% RULE OF THUMB Pick one. Be consistent. Don't switch mid-study to "fish" for results.
Fig. 3. Approximate three-way overlap of DEG calls across DESeq2, edgeR, and limma-voom on typical bulk RNA-seq. The center is the consensus; the slivers are method-dependent calls that usually sit at the significance boundary.

A working description of the personalities:

  • DESeq2 is often the most conservative of the three at small sample sizes. Its independent filtering and outlier-aware variance shrinkage tend to suppress low-count, high-variance noise. It is the default in most modern RNA-seq pipelines.
  • edgeR calls slightly more genes significant than DESeq2 on the same data — the difference is usually marginal but consistent. Its quasi-likelihood F-test (glmQLFit) is the recommended modern workflow and is excellent at small sample sizes.
  • limma-voom is the fastest by a wide margin and is the standard choice for large studies (n ≥ 10/group). It can lose a little sensitivity at very small sample sizes (n = 3) because its precision weights need enough data to estimate the mean–variance trend reliably.

A sane workflow: pick one based on study size and the conventions in your sub-field, run a confirmatory pass with a second method if reviewers expect it, and never silently swap methods until a previously non-significant gene becomes significant.

05 · Reading the output

Interpreting outputs — padj, log2FC, baseMean

A DE results table looks intimidating until you know which columns to read. There are usually only three that matter for triage.

padj (adjusted p-value)

A raw p-value is the probability, under the null hypothesis that no real difference exists, of seeing a difference at least as large as the one you observed. If you ran one test at threshold p < 0.05, you'd expect a 5% false positive rate — tolerable. But RNA-seq tests roughly 20,000 genes simultaneously. By chance alone, about 1,000 genes will look "significant" at p < 0.05 even when nothing is actually changing.

The Benjamini–Hochberg (BH) correction transforms raw p-values into adjusted p-values (padj) that control the False Discovery Rate (FDR) — the expected proportion of false positives among the genes you call significant. The conventional threshold is padj < 0.05, which means: on average, no more than 5% of your significant calls are false.

log2 fold change

The log2 fold change is the effect size. It is log2(condition_A / condition_B), with a tweak: DESeq2 applies a shrinkage estimator that pulls low-count, high-variance fold changes toward zero, giving more reliable rankings.

The intuitive translation: log2FC = 1 means the gene is 2× higher in condition A; log2FC = 2 means 4×; log2FC = -1 means half. Sign matters: positive is up in A, negative is down in A — and which condition is "A" depends on how you set your reference level. A common threshold is |log2FC| > 1, which corresponds to at least a 2-fold change in either direction.

baseMean

baseMean is the average normalized count across all samples. It's a sanity check: a "significant" gene with baseMean = 3 is a single read away from being noise. Most published DE filters either drop low-baseMean genes entirely or treat them with extra skepticism. A practical cutoff is baseMean > 10, though independent filtering inside DESeq2 often handles this for you.

fig. 4 gene baseMean log2FoldChange pvalue padj GATA4 1,247.3 +3.21 2.4e-18 8.1e-15 baseMean avg expression across samples log2FoldChange = +3.21 ≈ 9.3× higher in condition A padj = 8.1e-15 FDR-controlled: extremely significant
Fig. 4. One row of a typical DEG output table. Read padj for confidence, log2FoldChange for effect size, baseMean for sanity. Both p-value and padj are reported; report padj.
Why this matters

A gene can pass either filter on its own and still not be biologically interesting. A gene with padj = 0.0001 but log2FC = 0.1 is statistically real but reflects an effect that is too small to matter physiologically. A gene with log2FC = 5 but padj = 0.4 looks dramatic but could easily be a single outlier sample. Always use both.

06 · Multiple testing

Multiple testing correction, made concrete

The multiple testing problem is the single most common source of "false discoveries" in genomics. The arithmetic is straightforward: at significance threshold α = 0.05, a single test has a 5% chance of a false positive under the null. Test 20,000 genes and the expected number of chance hits is 20,000 × 0.05 = 1,000. Without correction, you'd publish a list of "significant" genes that is mostly noise.

There are three common responses:

  • No correction. ~1,000 false positives expected. Don't do this.
  • Bonferroni. Divide α by the number of tests. Expected false positives drop to near zero — but the threshold becomes so strict that real signal is also lost.
  • Benjamini–Hochberg (FDR). Control the proportion, not the count, of false positives among your significant calls. At FDR < 0.05, you accept that ~5% of called genes will be false — in exchange for retaining most of the real signal. This is the standard.
fig. 5 expected false positives 0 250 500 750 1000 ~1000 No correction DANGEROUS <1 Bonferroni over-conservative ~5% Benjamini–Hochberg JUST RIGHT ✓
Fig. 5. Testing 20,000 genes at α = 0.05. No correction gives ~1,000 expected false positives; Bonferroni controls the family-wise error rate at the cost of nearly all sensitivity; Benjamini–Hochberg controls the false discovery rate at a tunable level.

All three DE methods (DESeq2, edgeR, limma-voom) apply Benjamini–Hochberg by default. The column you read is named padj in DESeq2, FDR in edgeR, and adj.P.Val in limma — the same quantity, three names.

07 · Pitfalls

Common pitfalls

The six mistakes that show up most often in our review of submitted analyses:

1
Low sample size

DE with n = 2 per group is unreliable — the variance estimate is essentially undefined. Minimum 3 replicates per group; ≥ 4–5 is much better.

2
Unbalanced designs

Wildly different sample sizes per group inflate variance and reduce power. Aim for balance when possible; block on covariates when not.

3
Ignoring batch effects

Technical variation can fully mask biological signal. Visualize PCA before DE; if batches separate samples, model batch in the design or correct it. See batch effects.

4
Filtering on p-value alone

Small effects can be statistically real but biologically trivial. Use both padj and |log2FC| thresholds. The volcano plot exists precisely to remind you.

5
Reference level set incorrectly

If your contrast is reversed, all signs of log2FC flip and biology reads upside down. Always check that the reference level matches the comparison you intend.

6
P-hacking the threshold

Running, re-running, and quietly relaxing padj or fold-change cutoffs until your favorite gene appears is the most common form of inadvertent fraud. Pre-register your thresholds.

08 · Decision tree

Which method should I use?

If you're paralyzed by the choice between DESeq2, edgeR, and limma-voom, this flow chart will get you to a reasonable answer in under a minute.

Bulk RNA-seq counts? YES Single-cell data? (scRNA-seq) YES MAST / scran use specialized tools NO ≥ 10 samples per group? YES limma-voom fastest · sensitive at large n NO DESeq2 or edgeR both excellent at small n
Complex multi-factor designs

All three methods handle multi-factor designs (e.g., treatment + sex + batch). DESeq2's design = ~ batch + condition syntax tends to be the most readable. If you have repeated measures or random effects, look at limma's duplicateCorrelation() or the dream extension — that's where it pulls ahead.

09 · In the platform

How TransXplorer handles all this

TransXplorer runs all three methods in parallel and shows you how much they agree on your specific dataset — so you can pick one with eyes open instead of by reputation. Each method's full results table is browsable; the consensus DEG list is one click away. Multiple-testing correction (BH by default; Bonferroni on request) is applied automatically.

  • All three methods (DESeq2, edgeR, limma-voom) run in parallel with one button.
  • Method-agreement Venn diagram on every analysis — see the consensus list before you commit.
  • Interactive volcano plot with draggable padj and log2FC thresholds — explore live, don't re-render.
  • Auto-annotated top DEGs with gene-symbol labels using JetBrains Mono and biotype color coding.
  • One-click GO / KEGG / Reactome enrichment from the DEG list, with ORA and GSEA side by side.
  • Publication-ready PDF / SVG export. CSV download of every results table.

Ready to try differential expression?

Load the example dataset and run all three methods in under two minutes. No installation, no login.

Try DE analysis in TransXplorer →
10 · FAQ

Common questions

Which method should I report in my paper?
Any of DESeq2, edgeR, or limma-voom is defensible. Cite the original paper and the version number you used (e.g., "DESeq2 v1.42.0, Love et al. 2014"). If you ran multiple methods, report the primary in the main text and the others in supplementary, with the overlap percentage.
Can I average replicates before running DE?
No. Averaging destroys the within-group variance information the statistical test depends on. The whole point of replicates is to estimate that variance — collapse them into one number and you have nothing to compare the between-group signal against. If you must reduce data volume, use the pseudoreplicate approach inside each method, not arithmetic averaging.
Is padj < 0.05 a hard rule?
It is a convention, not a law of nature. Be stricter (padj < 0.01 or even 0.001) for hypothesis-generation work where you'll commit lab time to follow-ups. Be more lenient (padj < 0.1 or 0.2) for exploratory screens that feed into pathway analysis — GSEA in particular benefits from a longer ranked list.
What if my top gene has padj = 0?
That means the adjusted p-value is below the smallest representable floating-point number (roughly 1e-300). It isn't actually zero; the math is just running out of precision. Report it as "padj < 1e-300" or use the original p-value in scientific notation. It's a strong signal, not a bug.
Should I filter low-count genes before DE?
Most modern methods do internal filtering (DESeq2's independent filtering, edgeR's filterByExpr()). Pre-filtering genes with fewer than 10 reads summed across all samples speeds analysis without harming results. Don't pre-filter too aggressively — doing so distorts the multiple-testing correction.
11 · Further reading

References & further reading

  1. Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550. doi:10.1186/s13059-014-0550-8
  2. Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140. doi:10.1093/bioinformatics/btp616
  3. Law CW, Chen Y, Shi W, Smyth GK (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15:R29. doi:10.1186/gb-2014-15-2-r29
  4. Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B 57(1):289–300. doi:10.1111/j.2517-6161.1995.tb02031.x

What's next?

Three directions to keep going — another concept, a deeper method comparison, or back to the hands-on walkthrough.

Understanding Batch Effects soon

What batch effects are, why they masquerade as biology, and how to detect and correct them before DE.

Read the concept

GSEA vs ORA: Pathway Enrichment soon

Two methods, two philosophies. When to use over-representation analysis vs. ranked-list enrichment.

Read the concept

RNA-Seq in 10 Minutes

Back to the hands-on walkthrough — load real data, run DESeq2, do enrichment, export figures. End-to-end.

Open the tutorial