Pipeline Modules

This section documents the internal structure of the I-SAGE pipeline. Each stage is implemented as a modular Nextflow process, with clear inputs, outputs, and responsibilities.

Understanding these modules is useful for:

Interpreting intermediate outputs
Debugging failed runs
Extending the pipeline with new functionality

High-Level Module Flow

FASTQ
  ↓
Alignment & Deduplication
  ↓
Break Calling
  ↓
Visualization Tracks
  ↓
Normalization
  ↓
Differential Statistics
  ↓
Validation & Sensitivity Analysis

Each step is described below.

Alignment & Deduplication

Purpose

Align raw iBLESS reads to the reference genome and remove PCR duplicates.

Implementation

Uses standard short-read alignment tools (configured via Nextflow profile)
Produces coordinate-sorted BAM files

Outputs

Deduplicated BAM files per sample

Notes

Alignment parameters are defined in nextflow.config
Reference genome must match downstream chromosome naming

Break Calling

Module Location

modules/break_calling/

Purpose

Identify DNA double-strand breaks from aligned reads.

Modes

per_base (default) — breaks are identified at single-base resolution
binned — breaks are directly aggregated into bins (optional)

Key Parameters

break_calling:
  mode: "per_base"
  bin_size: 1000
  remove_sgrdi: true

Outputs

Strand-specific break calls
Per-base break coordinates

Notes

In per_base mode, binning is deferred to the visualization stage
Restriction enzyme site removal reduces background artifacts

Visualization Tracks

Module Location

modules/viz/

Purpose

Aggregate per-base break calls into fixed-width genomic bins and generate bedGraph tracks for visualization and statistics.

Key Parameters

viz:
  enabled: true
  bin_sizes: [250, 500, 1000]

Behavior

Break counts are summed per bin
Plus and minus strands are processed separately
Multiple bin sizes can be evaluated in parallel

Outputs

Strand-specific bedGraph files
Combined total-signal bedGraph files

Notes

Outputs are organized per bin size
Tracks are suitable for IGV/UCSC Genome Browser

Normalization

Module Location

modules/viz/normalize_tracks.nf

Purpose

Normalize binned break counts to enable comparisons across samples.

Method

Counts-per-million (CPM)–style normalization

Key Parameters

normalization:
  method: dsb_cpm
  scale: 1000000

Outputs

Normalized bedGraph tracks

Differential Statistics

Module Location

modules/stats/differential_breaks.py
modules/stats/diff_breaks.nf

Purpose

Identify genomic bins with statistically significant differences in break frequency between conditions.

Statistical Framework

Per-bin Fisher exact test
Benjamini–Hochberg FDR correction
Optional replicate-aware meta-analysis

Key Features

Replicate-aware testing (meta_fisher)
Upregulated vs downregulated bin separation
Optional region-restricted testing (BED)
Optional EBV annotation and enrichment

Outputs

TSV files of all bins
Significant bins (all / up / down)
Volcano and MA plots (PNG + PDF)
Per-contrast summary files

Bin-Size Sweep Summary

Module Location

modules/stats/bin_sweep_summary.py
modules/stats/bin_sweep_summary.nf

Purpose

Summarize differential statistics across multiple bin sizes.

Behavior

Aggregates per-bin-size summaries
Reports number of tested bins and significant bins
Tracks up/down counts and EBV metrics

Outputs

bin_sweep_summary.tsv

Notes

Facilitates sensitivity analysis
Enables informed bin-size selection

Validation & Sensitivity Analysis

Module Location

modules/validation/

Purpose

Assess robustness and reproducibility of differential results.

Validation Strategies

Downsampling

Re-run stats on subsets of the data
Measure stability of significant bins

Spike-In

Artificially add signal to random bins
Evaluate recovery performance

Key Parameters

validation:
  downsample_fracs: "1.0,0.5,0.25,0.1"
  downsample_reps: 1
  spikein_bins: 1000
  spikein_mult: 3.0

Outputs

Validation reports
Downsampling plots (PNG + PDF)

Workflow Orchestration

Module Location

workflows/iblesse_month2/main.nf

Purpose

Coordinate execution of all modules, handle parameter propagation, and manage bin-size sweeps.

Responsibilities

Fan-out across bin sizes
Ensure consistent inputs to stats and validation
Organize outputs into structured directories

Extending the Pipeline

To add a new module:

Create a new process under modules/
Define explicit inputs and outputs
Wire it into main.nf
Document the module here

Next: See Statistical Methods for detailed explanation of the statistical models, assumptions, and limitations.