GSE151427 ~30 min

Complete Analysis Tutorial

Walk through a complete RNA-seq analysis pipeline using GSE151427 (human iPSC-derived endothelial cells). Learn how to use every feature of TransXplorer.

Samples: 10 | DEGs: 6,163 | Drug Targets: 319 | TF Regulators: 8
Tutorial Scope: This tutorial highlights key outputs from each analysis module. For complete results and executive summaries, run the full analysis by clicking "Load Example" in TransXplorer.

Introduction

What is RNA-seq?

RNA sequencing (RNA-seq) is a powerful technology that captures a snapshot of the genes being expressed in your cells at a given moment. Unlike older microarray methods, RNA-seq can detect novel transcripts and alternative splicing events, and it quantifies expression accurately across a wide dynamic range.

In a typical RNA-seq experiment, messenger RNA is extracted from samples, converted to cDNA, and sequenced using high-throughput platforms. The resulting data—millions of short sequence reads—must then be processed through a series of computational steps to extract biological meaning.

Bulk RNA-seq vs Single-Cell: Bulk RNA-seq measures the average gene expression across all cells in a sample. It's cost-effective, well-established, and ideal for comparing conditions (e.g., treated vs untreated). Single-cell RNA-seq profiles individual cells, revealing cellular heterogeneity but at higher cost and complexity. TransXplorer is designed for bulk RNA-seq analysis, with cell type deconvolution to estimate cellular composition.

What is TransXplorer?

TransXplorer is an automated end-to-end web server for translational RNA-seq analysis and therapeutic discovery. It bridges the gap between raw sequencing data and clinical application by semi-automating complex bioinformatic tasks—enabling researchers without extensive computational expertise to perform publication-quality analysis from start to finish.

The platform covers the complete analytical journey: from optional FASTQ file processing through differential expression analysis, batch effect correction, pathway enrichment, network analysis, drug-target discovery, and clinical validation with TCGA data integration.

Key Innovations
  • Autonomous batch effect detection using PVCA, kBET, and Silhouette metrics
  • Hybrid Gene Regulatory Network combining DoRothEA and TFLink databases across 11 organisms
  • Instant therapeutic translation through real-time API integration with DGIdb, ChEMBL, and OpenTargets
  • Stable architecture using asynchronous job queues, validated on datasets with 45,000 genes × 48 samples

Tutorial Dataset: GSE151427

In this tutorial, we'll analyze GSE151427—a study of human iPSC-derived endothelial cells comparing two developmental lineages:

Property | Value
GEO Accession | GSE151427
Organism | Homo sapiens
Cell Types | CMEC (cardiac mesoderm endothelial cells) vs PMEC (paraxial mesoderm endothelial cells)
Source | Human induced pluripotent stem cell (iPSC) derived
Samples | 10 samples (5 CMEC, 5 PMEC)
Comparison | PMEC vs CMEC (identifying genes differentially expressed between the two EC subtypes)

Follow Along: Click "Load Example" in TransXplorer to automatically load GSE151427 with pre-configured settings and follow this tutorial step-by-step.

What You'll Learn

Quality Control

Assess sample quality, detect outliers, and identify batch effects using PCA and UMAP

Differential Expression

Identify significantly changed genes using DESeq2, edgeR, or limma-voom

Pathway Analysis

Discover enriched biological processes and pathways (GO, KEGG, Reactome)

Drug Discovery

Find FDA-approved drugs and experimental compounds targeting your DEGs

Network Analysis

Build WGCNA, GRN, and PPI networks to understand gene relationships

Cell Deconvolution

Estimate cellular composition using xCell, MCP-counter, or EPIC

Data Input

Supported Input Formats

TransXplorer accepts gene expression count matrices in multiple formats: CSV, TXT, TSV, and XLSX. The matrix should have genes as rows and samples as columns. Importantly, you should upload raw counts (integers)—not normalized values like FPKM or TPM.

Why Raw Counts? DESeq2, edgeR, and limma-voom all expect raw count data. DESeq2 and edgeR model the discrete read counts directly with a negative binomial distribution, and limma-voom estimates its precision weights from the mean-variance trend of the raw counts. Pre-normalized data violates the assumptions of these methods and can lead to inflated false discovery rates.

Quick Start: For this tutorial, click "Load Example Data" to automatically load GSE151427 with the count matrix and sample metadata pre-configured.

Sample Metadata

The metadata file defines your experimental design—which samples belong to which groups, and any batch or covariate information. This is critical for proper statistical modeling of your comparison.

Sample | Group | Batch | Time
GSM4577968 | CMEC | 1 | day6
GSM4577969 | PMEC | 1 | day6
GSM4577970 | CMEC | 2 | day8
GSM4577971 | CMEC | 2 | day8
GSM4577972 | PMEC | 2 | day8
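
Before uploading, it can help to check the two things this section asks for: raw integer counts, and sample names that match between the count matrix and the metadata. The sketch below is illustrative only. The gene names, count values, and the `validate_inputs` helper are hypothetical and are not part of TransXplorer; only the sample IDs come from the metadata table above.

```python
import csv
import io

# Toy inputs with made-up counts; sample IDs follow the metadata table above.
COUNTS_CSV = """gene,GSM4577968,GSM4577969,GSM4577970
GENE_A,120,98,143
GENE_B,0,15,7
GENE_C,2301,1875,2620
"""

METADATA_CSV = """Sample,Group,Batch,Time
GSM4577968,CMEC,1,day6
GSM4577969,PMEC,1,day6
GSM4577970,CMEC,2,day8
"""


def validate_inputs(counts_text, metadata_text):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []

    count_rows = list(csv.reader(io.StringIO(counts_text)))
    header, body = count_rows[0], count_rows[1:]
    count_samples = header[1:]  # the first column holds gene identifiers

    meta_rows = list(csv.DictReader(io.StringIO(metadata_text)))
    meta_samples = [row["Sample"] for row in meta_rows]

    # Every sample column needs a metadata row, and vice versa.
    if set(count_samples) != set(meta_samples):
        mismatched = sorted(set(count_samples) ^ set(meta_samples))
        problems.append(f"sample names differ: {mismatched}")

    # Raw counts must be non-negative integers (no FPKM/TPM decimals).
    for row in body:
        gene, values = row[0], row[1:]
        for sample, value in zip(count_samples, values):
            if not value.isdigit():
                problems.append(f"{gene}/{sample}: '{value}' is not a raw count")

    return problems


problems = validate_inputs(COUNTS_CSV, METADATA_CSV)
print(problems or "inputs look consistent")
```

A decimal value such as `14.3` in any cell would be reported, which is usually a sign that the matrix was already normalized.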

Normalization Methods

Raw counts must be normalized to account for differences in sequencing depth between samples. TransXplorer offers several methods:

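  • TMM (Trimmed Mean of M-values): Recommended default. Calculates scaling factors from a trimmed mean of log fold changes between samples, after discarding genes with extreme fold changes or extreme expression, which keeps the normalization robust to a few dominant genes.
  • RLE (Relative Log Expression): Used by DESeq2. Builds a pseudo-reference sample from per-gene geometric means and takes each sample's median ratio to that reference as its scaling factor.
  • VST (Variance Stabilizing Transformation): DESeq2-specific transformation that stabilizes variance across the range of mean expression; ideal for visualization and clustering.
  • CPM/TPM: Counts/Transcripts per million. CPM scales by library size only; TPM additionally corrects for gene length, which makes it the better choice for comparing expression levels across genes within a sample.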
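
To make the difference between library-size scaling and median-ratio scaling concrete, here is a minimal pure-Python sketch of CPM and of RLE-style size factors as described in the list above. The counts are invented, and the functions are simplified illustrations, not the edgeR or DESeq2 implementations.

```python
import math
import statistics

# Hypothetical raw counts (genes x 3 samples); every gene follows a 2:4:3 depth ratio.
counts = {
    "GENE_A": [100, 200, 150],
    "GENE_B": [50, 100, 75],
    "GENE_C": [400, 800, 600],
    "GENE_D": [10, 20, 15],
}


def cpm(counts):
    """Counts per million: divide each sample by its library size, times 1e6."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {gene: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
            for gene, row in counts.items()}


def median_ratio_size_factors(counts):
    """RLE-style size factors: median ratio of each sample to a geometric-mean reference."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) == 0:
            continue  # a zero count gives no finite geometric mean, so skip the gene
        log_reference = sum(math.log(x) for x in row) / len(row)
        for j, x in enumerate(row):
            log_ratios[j].append(math.log(x) - log_reference)
    return [math.exp(statistics.median(ratios)) for ratios in log_ratios]


size_factors = median_ratio_size_factors(counts)
normalized = {gene: [x / s for x, s in zip(row, size_factors)]
              for gene, row in counts.items()}
print([round(s, 3) for s in size_factors])
```

Because every gene in this toy matrix differs between samples only by sequencing depth, dividing by the size factors makes each gene's values identical across samples.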

Exploratory Analysis

Before statistical testing, exploratory analysis reveals patterns in your data and identifies potential confounding factors. This step is crucial for ensuring biological signals aren't obscured by technical artifacts.

Principal Component Analysis (PCA)

PCA reduces the high-dimensional gene expression data (thousands of genes) into a few principal components that capture the most variance. Each point represents a sample, and samples with similar expression profiles will cluster together.

How to Read PCA Plots: In a well-designed experiment, samples should cluster primarily by biological group (CMEC vs PMEC), not by technical factors (batch, sequencing run). If samples cluster by batch instead of condition, you likely have a batch effect that needs correction. The percentage on each axis shows how much variation that component explains.
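
The axis percentages can be reproduced by hand. The sketch below finds the first principal component of a tiny, made-up expression matrix by power iteration and reports the fraction of variance it explains. It is an illustration in pure Python under the assumption of log-scale input values, not the routine TransXplorer runs.

```python
import math

# Hypothetical log-expression values: 4 samples x 3 genes.
# Samples 0-1 and samples 2-3 form two groups that differ mainly in the first gene.
samples = [
    [8.0, 5.0, 3.1],
    [8.2, 5.1, 2.9],
    [4.0, 5.0, 3.0],
    [4.1, 4.9, 3.2],
]


def center_columns(matrix):
    """Subtract each gene's mean so PCA describes variation around the average profile."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def first_principal_component(matrix, n_iter=200):
    """Power iteration on X^T X; returns (loadings, sample scores, variance fraction)."""
    x = center_columns(matrix)
    n_genes = len(x[0])
    w = [1.0 / math.sqrt(n_genes)] * n_genes  # arbitrary non-zero starting direction
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, w)) for row in x]  # X w
        w = [sum(s * row[j] for s, row in zip(scores, x))           # X^T (X w)
             for j in range(n_genes)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    scores = [sum(a * b for a, b in zip(row, w)) for row in x]
    total_ss = sum(v * v for row in x for v in row)
    explained = sum(s * s for s in scores) / total_ss
    return w, scores, explained


loadings, pc1_scores, pc1_fraction = first_principal_component(samples)
print([round(s, 2) for s in pc1_scores], f"PC1 explains {pc1_fraction:.1%}")
```

Samples from the same group receive PC1 scores of the same sign, which is the numeric counterpart of "clustering together" on the plot.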

UMAP Visualization

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique that can reveal cluster structure that linear methods like PCA miss. UMAP prioritizes local neighborhood structure while retaining some global structure, which makes it useful for identifying subpopulations. Distances between well-separated clusters on a UMAP plot should be interpreted with caution.

[Interactive plots: PCA and UMAP]

Batch Effect Detection

Batch Effect Detected & Corrected (Source: time)
PVCA: 34.1% | Silhouette: 0.375 | kBET: 0.473 | Combined: 0.559
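
Of the metrics above, the silhouette score is the simplest to reproduce by hand: it compares each sample's average distance to its own batch with its average distance to the nearest other batch. The sketch below computes it for made-up 2-D coordinates. How TransXplorer computes the individual metrics and combines them into the combined score is not described in this tutorial, so treat this purely as an illustration of the silhouette idea.

```python
import math

# Hypothetical 2-D sample coordinates (for example, the first two principal components).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.1), (3.2, 2.9), (2.9, 3.3)]
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]


def mean_silhouette(points, labels):
    """Average silhouette width of samples grouped by `labels` (here: batch)."""
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        distances = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                distances.setdefault(other, []).append(math.dist(p, q))
        if own not in distances:
            widths.append(0.0)  # a batch with a single sample gets a silhouette of 0
            continue
        a = sum(distances[own]) / len(distances[own])   # mean distance within own batch
        b = min(sum(d) / len(d)                         # mean distance to nearest other batch
                for label, d in distances.items() if label != own)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


print(round(mean_silhouette(points, batch), 3))
```

A value near 1 means samples sit much closer to their own batch than to any other, which points to a batch effect; values near 0 or below mean batch labels do not explain the clustering.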

Before vs After Correction

[Interactive plots: PCA before correction, PCA after correction]
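
The FAQ notes that TransXplorer applies limma::removeBatchEffect. As a rough intuition for what such a correction does to one gene, the sketch below shifts each batch so that all batch means coincide. This is a deliberate simplification in pure Python, not limma's implementation: it ignores the experimental design, so it would also remove group differences that happen to be confounded with batch. The expression values are invented.

```python
# Hypothetical log-expression of one gene across six samples collected at two time points.
expression = [5.0, 5.4, 7.0, 6.1, 6.5, 8.1]
batch = ["day6", "day6", "day6", "day8", "day8", "day8"]


def remove_batch_means(values, batches):
    """Shift each batch so that its mean equals the average of all batch means."""
    levels = sorted(set(batches))
    batch_mean = {
        level: sum(v for v, b in zip(values, batches) if b == level) / batches.count(level)
        for level in levels
    }
    target = sum(batch_mean.values()) / len(levels)
    return [v - (batch_mean[b] - target) for v, b in zip(values, batches)]


corrected = remove_batch_means(expression, batch)
print([round(v, 2) for v in corrected])
```

Differences between samples within a batch are untouched; only the offset between batches is removed.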

Differential Expression

Parameters

method: "DESeq2" | comparison: "PMEC vs CMEC" | padj: 0.05 | |log2FC|: 1.0
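
The two thresholds above act as a joint filter on adjusted p-value and effect size. The sketch below applies a Benjamini-Hochberg adjustment (the default adjustment in DESeq2) and then the padj and |log2FC| cutoffs to made-up results. Two assumptions are labeled in the code: that a positive log2FC means higher expression in PMEC for the "PMEC vs CMEC" comparison, and that the fold-change cutoff is inclusive. Neither is confirmed by this tutorial.

```python
# Hypothetical per-gene results: (gene, log2 fold change, raw p-value).
results = [
    ("GENE_A", 2.3, 0.0004),
    ("GENE_B", -1.7, 0.0100),
    ("GENE_C", 0.4, 0.0200),
    ("GENE_D", -3.1, 0.0300),
    ("GENE_E", 1.2, 0.6000),
]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_degs(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes passing both cutoffs and label their direction."""
    padj = benjamini_hochberg([p for _, _, p in results])
    degs = []
    for (gene, lfc, _), q in zip(results, padj):
        # Assumption: inclusive fold-change cutoff; positive log2FC = up in PMEC.
        if q < padj_cutoff and abs(lfc) >= lfc_cutoff:
            direction = "up in PMEC" if lfc > 0 else "up in CMEC"
            degs.append((gene, direction, round(q, 4)))
    return degs


for gene, direction, q in call_degs(results):
    print(gene, direction, q)
```

GENE_C is significant after adjustment but fails the fold-change filter, which is why both thresholds matter when comparing DEG counts between runs.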

Volcano Plot

[Interactive plot: volcano plot]

DEG Summary

[Summary tiles: Total DEGs, Up in PMEC, Up in CMEC, Max |log2FC|]

Top DEGs

[Interactive table: Top Differentially Expressed Genes]

Pathway Enrichment

Results Summary

GO BP Terms: 107 | KEGG Pathways: 16 | Total: 123
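
Each enriched term in a summary like this comes from a statistical test of overlap between the DEG list and a gene set. For over-representation analysis (ORA), a common choice is the one-sided hypergeometric test, sketched below with invented numbers. The exact test, background gene universe, and multiple-testing correction used by TransXplorer are not stated in this tutorial, so this is only an illustration of the principle.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_degs, n_overlap):
    """P(overlap >= n_overlap) when n_degs genes are drawn at random from the universe."""
    upper = min(n_pathway, n_degs)
    total = comb(n_universe, n_degs)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 100-gene pathway, and
# 500 DEGs, 12 of which fall in the pathway (2.5 would be expected by chance).
p_value = hypergeom_enrichment_p(20000, 100, 500, 12)
print(f"{p_value:.2e}")
```

In practice one such p-value is computed per gene set and the whole collection is then adjusted for multiple testing.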

Enrichment Plot

[Figure: Pathway Bubble Plot]

Gene-Pathway Network

[Interactive plot: gene-pathway network]

Drug Target Discovery

TransXplorer integrates with DGIdb and OpenTargets to identify potential therapeutic targets.

Analysis Modes

Quick Analysis

Fast DGIdb lookup for rapid screening

Detailed Analysis

OpenTargets clinical evidence

Comprehensive

Combined with prioritization scoring

Drug Prioritization

Total Drugs: 319 | Very High Priority: 15 | Top Phase: Approved

Top Prioritized Drugs
Drug | Priority | Score | Targets | Target Genes | Evidence | Phase
TRETINOIN | Very High | 6.58 | 4 | ALDH1A2, APOA1, MYCN, RARB | ⭐⭐⭐ | Approved
OCRIPLASMIN | Very High | 6.16 | 4 | COL2A1, COL6A3, LAMA4, LAMC3 | ⭐⭐⭐ | Approved
MAVACAMTEN | Very High | 5.74 | 3 | MYH6, MYL4, MYL7 | ⭐⭐⭐ | Approved
VANDETANIB | Very High | 5.15 | 2 | LTK, EPHB3 | ⭐⭐⭐ | Approved
ACALABRUTINIB | Very High | 5.14 | 2 | BTK, ENO2 | ⭐⭐⭐ | Approved

Cell Type Deconvolution

Estimate cellular composition using reference-based deconvolution algorithms.

Cell Type Proportions

[Interactive plot: Cell Composition]

Group Comparison

[Interactive plot: Cell Type Proportions by Group]

Key Findings

  • Endothelial cells: Both groups show high endothelial scores (expected for iPSC-derived ECs)
  • CMEC enrichment: Higher neutrophil and fibroblast signatures
  • Validation: Results consistent with cardiac vs paraxial mesoderm origins
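
Marker-based methods such as MCP-counter summarize, for each sample, the expression of genes considered specific to a cell type. The sketch below shows that idea in its simplest form: the mean log2 expression of a marker list. The marker lists and expression values are illustrative only; they are not the signatures used by xCell, MCP-counter, or EPIC, and the scoring is a simplification rather than any tool's algorithm.

```python
import math

# Hypothetical expression values for two samples; marker lists are illustrative only.
expression = {
    "PECAM1": {"S1": 850.0, "S2": 40.0},
    "CDH5": {"S1": 620.0, "S2": 25.0},
    "VWF": {"S1": 410.0, "S2": 12.0},
    "COL1A1": {"S1": 15.0, "S2": 900.0},
    "COL1A2": {"S1": 22.0, "S2": 760.0},
}
markers = {
    "Endothelial cells": ["PECAM1", "CDH5", "VWF"],
    "Fibroblasts": ["COL1A1", "COL1A2"],
}


def marker_scores(expression, markers, sample):
    """Mean log2(expression + 1) of each cell type's marker genes in one sample."""
    scores = {}
    for cell_type, genes in markers.items():
        present = [g for g in genes if g in expression]  # ignore markers missing from the data
        scores[cell_type] = sum(
            math.log2(expression[g][sample] + 1) for g in present
        ) / len(present)
    return scores


for sample in ("S1", "S2"):
    print(sample, {k: round(v, 2) for k, v in marker_scores(expression, markers, sample).items()})
```

Scores of this kind are comparable across samples for the same cell type, but they are not proportions and should not be compared between cell types.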

WGCNA Co-expression

Identify modules of highly correlated genes and relate them to experimental conditions.

Gene Dendrogram

[Figure: Gene Clustering Dendrogram]

Module-Trait Correlation

[Figure: Module-Trait Relationships]
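
Both the module-trait plot and hub-gene selection rest on the module eigengene, the first principal component of a module's standardized expression. The FAQ describes hubs as genes whose module membership (kME, the correlation with the eigengene) exceeds 0.8. The sketch below computes both for a made-up four-gene module in pure Python; it illustrates the definitions and is not the WGCNA package's implementation.

```python
import math


def standardize(values):
    """Scale one profile to mean 0 and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation of two equally long profiles."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


def module_eigengene(module, n_iter=200):
    """First principal component (across samples) of the standardized module genes."""
    z = [standardize(profile) for profile in module.values()]
    n_samples = len(z[0])
    v = z[0][:]  # start the power iteration from the first gene's profile
    for _ in range(n_iter):
        u = [sum(a * b for a, b in zip(row, v)) for row in z]   # Z v
        v = [sum(ui * row[j] for ui, row in zip(u, z))          # Z^T (Z v)
             for j in range(n_samples)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    # Orient the eigengene so it correlates positively with mean module expression.
    mean_profile = [sum(row[j] for row in z) / len(z) for j in range(n_samples)]
    if pearson(v, mean_profile) < 0:
        v = [-a for a in v]
    return v


# Hypothetical module: three tightly co-expressed genes and one weakly related gene.
module = {
    "GENE_A": [1.0, 1.2, 0.9, 3.0, 3.2, 2.9],
    "GENE_B": [2.0, 2.1, 1.8, 4.1, 4.0, 4.3],
    "GENE_C": [0.5, 0.4, 0.6, 2.2, 2.4, 2.1],
    "GENE_D": [1.0, 3.0, 1.1, 2.9, 1.2, 3.1],
}
eigengene = module_eigengene(module)
kme = {gene: pearson(profile, eigengene) for gene, profile in module.items()}
hubs = [gene for gene, value in kme.items() if value > 0.8]
print({g: round(v, 2) for g, v in kme.items()}, hubs)
```

Correlating the same eigengene with a sample trait (for example, a 0/1 indicator for PMEC) gives the kind of value shown in a module-trait heatmap.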

Gene Regulatory Networks

Identify transcription factors controlling your DEGs and their regulatory relationships.

Executive Summary

8 TFs controlling 607 interactions. EGR1 is the top master regulator.

Top Regulator: EGR1

Master Regulators

Top Master Regulators
Rank | TF | Targets | Act % | Role | Druggable
1 | EGR1 | 124 | 92% | Activator | Yes
2 | NR2F1 | 68 | 91% | Activator | Yes
3 | MAFB | 53 | 100% | Activator | Yes
4 | MYCN | 13 | 92% | Activator | No
5 | GATA4 | 10 | 90% | Activator | Yes
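
The Targets column reflects how many DEGs each transcription factor is linked to in the regulatory databases. The sketch below shows only that counting step, using an invented edge list: the TF names are borrowed from the table, but the target genes and edges are hypothetical. The FAQ mentions further ranking criteria (evidence confidence and gene set overlap) that this sketch does not attempt to model.

```python
from collections import defaultdict

# Hypothetical TF -> target edges (as a regulatory database might supply them) and a DEG set.
edges = [
    ("EGR1", "GENE_A"), ("EGR1", "GENE_B"), ("EGR1", "GENE_C"), ("EGR1", "GENE_X"),
    ("NR2F1", "GENE_B"), ("NR2F1", "GENE_D"),
    ("GATA4", "GENE_Y"),
]
degs = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}


def rank_regulators(edges, degs):
    """Rank TFs by how many of their known targets are in the DEG set."""
    targets = defaultdict(set)
    for tf, target in edges:
        if target in degs:
            targets[tf].add(target)
    # Sort by descending target count, then alphabetically to break ties.
    return sorted(((tf, len(genes)) for tf, genes in targets.items()),
                  key=lambda item: (-item[1], item[0]))


ranking = rank_regulators(edges, degs)
print(ranking)
```

TFs whose known targets do not overlap the DEG set (GATA4 in this toy edge list) drop out of the ranking entirely.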

TF Activity Heatmap

[Figure: TF Activities heatmap]

PPI Network Analysis

Map DEGs onto protein interaction databases to identify hub proteins.

Analysis Modes

Physical Network

Validated physical interactions only

Full Network

All interaction types

Network Statistics

Proteins: 60 | Interactions: 59 | Coverage: 28.4% | Hub Proteins: 16

Interactive Network

[Interactive plot: PPI Network]

Hub Proteins

Top Hub Proteins
Protein | Degree | Hub Type | Score | Regulation
APOE | 6 | Major Hub | 10.13 | Down
APOA1 | 5 | Major Hub | 7.47 | Down
CYP2S1 | 5 | Major Hub | 4.43 | Down
MCM10 | 5 | Major Hub | 2.11 | Up
GATA4 | 4 | Major Hub | 2.63 | Down
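
The Degree column is the number of interaction partners a protein has in the network. The sketch below computes degrees from an invented edge list and picks hubs with a user-chosen cutoff. The protein names are borrowed from the table, but the edges are hypothetical, and the cutoff and the Score column used by TransXplorer are not defined in this tutorial.

```python
from collections import Counter

# Hypothetical undirected protein-protein interactions, made up for illustration.
interactions = [
    ("APOE", "APOA1"), ("APOE", "PROT_1"), ("APOE", "PROT_2"),
    ("APOA1", "PROT_1"), ("GATA4", "PROT_3"),
]


def degree_table(interactions):
    """Count interaction partners per protein, ignoring duplicate and self edges."""
    unique_edges = {frozenset(pair) for pair in interactions if pair[0] != pair[1]}
    degree = Counter()
    for edge in unique_edges:
        for protein in edge:
            degree[protein] += 1
    return degree


def hub_proteins(degree, min_degree):
    """Proteins whose degree reaches the chosen cutoff, highest degree first."""
    return sorted((p for p, d in degree.items() if d >= min_degree),
                  key=lambda p: (-degree[p], p))


degree = degree_table(interactions)
print(hub_proteins(degree, min_degree=2))
```

For scale, a network with 60 proteins and 59 interactions has a mean degree of 2 × 59 / 60 ≈ 1.97, so proteins with four to six partners stand out clearly.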

Network Metrics

[Figure: PPI Network Metrics Summary]

Frequently Asked Questions

What input formats are supported?
TransXplorer accepts raw count matrices (CSV, TSV, TXT, XLSX), GEO accession IDs for automatic download, and pre-normalized expression matrices. For differential expression, upload raw integer counts rather than normalized values such as FPKM or TPM.
How does batch correction work?
TransXplorer uses limma::removeBatchEffect when the combined batch score exceeds 0.25. This removes technical variation while preserving biological signal between your experimental groups.
Which DEG method should I use?
DESeq2: Best for small samples (n<10), handles low counts well.
edgeR: Good for larger datasets, slightly faster.
limma-voom: Best for complex designs with multiple factors.
What's the difference between drug analysis modes?
Quick: DGIdb only, fastest.
Detailed: Adds OpenTargets clinical evidence.
Comprehensive: Both databases + prioritization scoring. Recommended for publications.
How are hub genes identified in WGCNA?
Hub genes are identified by their module membership (kME) score, which measures how well a gene's expression correlates with the module eigengene. Genes with kME > 0.8 are typically considered hubs.
What is a master regulator in GRN analysis?
Master regulators are transcription factors that control many of your DEGs. They're ranked by: (1) number of targets, (2) evidence confidence, and (3) overlap with your gene set. Higher scores indicate stronger regulatory influence.
Physical vs Full PPI network - which to choose?
Physical: Only experimentally validated protein-protein interactions. Higher confidence, fewer edges. Best for mechanistic studies.
Full: Includes co-expression, text-mining, genetic interactions. More comprehensive but may include indirect associations.
How accurate is cell type deconvolution?
Accuracy depends on the method and reference signatures. xCell works well for immune cells; MCP-counter is robust for stromal populations. Results should be validated with orthogonal methods (flow cytometry, scRNA-seq) when possible.
Can I use my own gene list for enrichment?
Yes! You can upload a custom gene list or use the automatically generated DEG lists. TransXplorer supports both over-representation analysis (ORA) and gene set enrichment analysis (GSEA).
What does "druggable" mean for a TF?
A TF is marked "druggable" if it has known small molecule modulators in the ChEMBL database. This doesn't mean drugs are available, but that the protein has been targeted in drug discovery efforts.
How do I interpret module-trait correlations?
Positive correlation (red): Module genes upregulated in that condition.
Negative correlation (blue): Module genes downregulated.
Focus on modules with p < 0.05 and |correlation| > 0.5 for biological interpretation.
Can I export results for publication?
Yes! All results can be downloaded as CSV files, plots as PNG/SVG, and interactive plots as HTML. Executive summaries are available as PDF reports for each analysis module.
What organisms does TransXplorer support?
TransXplorer supports 11 model organisms: Human (hg38), Mouse (mm10), Rat (rn6), Zebrafish (danRer11), Drosophila (dm6), C. elegans (WBcel235), Yeast (R64), Arabidopsis, Chicken (galGal6), Pig (susScr11), and custom genome uploads. GRN databases (DoRothEA/TFLink) and pathway databases vary by organism.
Can I process FASTQ files directly?
Yes! TransXplorer offers a complete FASTQ preprocessing pipeline including FastQC quality control, Trimmomatic adapter trimming, HISAT2 alignment, and featureCounts quantification. For files >3GB, TransXplorer generates a pre-configured Docker script for local processing.