How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery

“`html

How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery

In this tutorial, we perform an advanced single-cell RNA-seq analysis workflow using Scanpy on the PBMC-3k benchmark dataset. We start by loading the dataset, inspecting its structure, and applying quality control checks to evaluate gene counts, total counts, mitochondrial content, and ribosomal gene signals. We then filter low-quality cells and genes, detect potential doublets with Scrublet, normalize the data, apply log transformation, and identify highly variable genes for downstream analysis. Also, we score cell-cycle phases, regress out unwanted technical variation, scale the data, and reduce dimensionality using PCA, UMAP, and t-SNE. We also cluster cells with the Leiden algorithm, identify marker genes, annotate cell populations using canonical PBMC markers, explore trajectory structure with PAGA and diffusion pseudotime, calculate a custom interferon-response score, and finally save the fully analyzed AnnData object for future use.

Copy Code

      
        !pip install -q scanpy leidenalg python-igraph scrublet

import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor="white", figsize=(5, 5))
sc.logging.print_header()

adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print(adata)

adata.var["mt"]   = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
sc.pp.calculate_qc_metrics(
  adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)

sc.pl.violin(adata,
             ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
             jitter=0.4, multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

We install the required single-cell analysis libraries and import Scanpy, NumPy, Pandas, Matplotlib, and warning controls. We load the PBMC-3k benchmark dataset, make gene names unique, and inspect the AnnData object structure. We then calculate quality control metrics for mitochondrial and ribosomal genes and visualize count-level quality patterns using violin and scatter plots.

Copy Code

      
        sc.pp.filter_cells(adata, min_genes=200)
        sc.pp.filter_genes(adata, min_cells=3)
        adata = adata[adata.obs.n_genes_by_counts < 2500, :].copy()
        adata = adata[adata.obs.pct_counts_mt < 5, :].copy()

sc.pp.scrublet(adata)
print("Predicted doublets:", int(adata.obs["predicted_doublet"].sum()))
adata = adata[~adata.obs["predicted_doublet"], :].copy()

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata.raw = adata
adata = adata[:, adata.var.highly_variable].copy()

We filter out low-quality cells and rarely detected genes to improve the reliability of the dataset. We use Scrublet through Scanpy to identify predicted doublets and remove them before deeper analysis. We then preserve raw counts, normalize expression values, apply log transformation, select highly variable genes, and keep only the most informative features.

Copy Code

      
        s_genes = ["MCM5","PCNA","TYMS","FEN1","MCM2","MCM4","RRM1","UNG","GINS2",
                   "MCM6","CDCA7","DTL","PRIM1","UHRF1","HELLS","RFC2","NASP",
                   "RAD51AP1","GMNN","WDR76","SLBP","CCNE2","UBR7","POLD3","MSH2",
                   "ATAD2","RAD51","RRM2","CDC45","CDC6","EXO1","TIPIN","DSCC1",
                   "BLM","CASP8AP2","USP1","CLSPN","POLA1","CHAF1B","E2F8"]
        g2m_genes = ["HMGB2","CDK1","NUSAP1","UBE2C","BIRC5","TPX2","TOP2A","NDC80",
                      "CKS2","NUF2","CKS1B","MKI67","TMPO","CENPF","TACC3","SMC4",
                      "CCNB2","CKAP2L","CKAP2","AURKB","BUB1","KIF11","ANP32E",
                      "TUBB4B","GTSE1","KIF20B","HJURP","CDCA3","CDC20","TTK",
                      "CDC25C","KIF2C","RANGAP1","NCAPD2","DLGAP5","CDCA2","CDCA8",
                      "ECT2","KIF23","HMMR","AURKA","PSRC1","ANLN","LBR","CKAP5",
                      "CENPE","NEK2","G2E3","CBX5","CENPA"]
        s_genes   = [g for g in s_genes   if g in adata.var_names]
        g2m_genes = [g for g in g2m_genes if g in adata.var_names]

sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)

sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)

We define S-phase and G2/M-phase marker genes and retain only those present in the dataset. We score each cell for cell-cycle phase, regress out unwanted variation from total counts and mitochondrial percentage, and scale the data for downstream modeling. We then run PCA, inspect explained variance, construct the neighborhood graph, and generate UMAP and t-SNE embeddings.

Copy Code



sc.tl.leiden(adata, resolution=0.5, flavor="igraph", n_iterations=2, directed=False)

sc.pl.umap(adata, color="leiden", legend_loc="on data", title="Leiden clusters")

sc.pl.tsne(adata, color="leiden", legend_loc="on data", title="t-SNE clusters")
sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")

sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
result   = adata.uns["rank_genes_groups"]

groups   = result["names"].dtype.names

top_df   = pd.DataFrame({g: result["names"][g][:10] for g in groups})

print("\nTop 10 markers per cluster:\n", top_df)
marker_genes = {

"B-cell":           ["CD79A", "MS4A1"],

"CD8 T
Source Read original →
Related reading
RAG-Anything Tutorial: Build a Multimodal Retrieval Pipeline for Text, Tables, Equations, and Images in Colab
Building Supervised Fine-Tuning Data from NVIDIA Open-SWE-Traces: Trajectory Parsing, Patch Analysis, Token Budgets, and Tool-Use Metrics
Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export
The SignalThe Signal: Edition 02Read this edition →Every Friday: the one AI story that actually mattered, plus the tools worth your time.

AM
AI Maestro is an independent British AI publication. We test what we recommend, and we write it the way we would say it. More about us

How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery

How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery

`Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.`

`follow us`

`Popular Tag`

`Popular Post`

`tencent/Hy3`

`US investors will soon…`

`The ‘first’ AI-run ransomware…`

How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery

Related articles

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

tencent/Hy3

US investors will soon…

The ‘first’ AI-run ransomware…

`Related articles`

`Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.`

`follow us`

`Popular Tag`

`Popular Post`

`tencent/Hy3`

`US investors will soon…`

`The ‘first’ AI-run ransomware…`