Statistical Methods

This section documents the statistical framework used by I-SAGE to identify differential DNA double-strand break (DSB) regions from iBLESS data.

The emphasis is on explicit assumptions, transparent tests, and robustness assessment, rather than black-box modeling.

Overview

I-SAGE performs bin-level differential analysis of DSB counts between experimental conditions.

The statistical workflow consists of:

Aggregation of break counts into genomic bins
Hypothesis testing per bin
Multiple testing correction
Optional replicate-aware aggregation
Robustness and sensitivity validation

Data Representation

For each genomic bin, the pipeline computes:

count_case — total break counts in the case condition
count_ctrl — total break counts in the control condition

Counts are derived from strand-aware bedGraph tracks after normalization.

Bin-Level Hypothesis Test

Fisher Exact Test

For each bin, I-SAGE performs a two-sided Fisher exact test, comparing break enrichment in the bin versus the remaining tested genomic space.

The contingency table is:

          In bin     Outside bin
Case           k_c       T_c − k_c
Control        k_t       T_t − k_t

Where:

k_c, k_t — bin-level counts
T_c, T_t — total counts across all tested bins

This formulation tests whether the relative enrichment of breaks differs between conditions.

Why Fisher Exact Test?

Counts are sparse and non-normally distributed
Bin sizes may vary
No assumption of equal variance
Exact test appropriate for low counts

This choice prioritizes interpretability and correctness over parametric efficiency.

Effect Size: Log2 Fold Change

For each bin, effect size is reported as:

$$\text{log2FC} = \log_2 \left( \frac{(k_c + \varepsilon) / (T_c + \varepsilon)}{(k_t + \varepsilon) / (T_t + \varepsilon)} \right)$$

Where:

$\varepsilon$ — small pseudocount (default: 0.5)

This represents the relative enrichment of breaks in the bin, normalized by global signal.

Multiple Testing Correction

P-values are corrected using the Benjamini–Hochberg procedure to control the false discovery rate (FDR).

Bins with:

FDR ≤ threshold

are considered statistically significant.

The default threshold is:

stats:
  fdr: 0.05

Replicate-Aware Analysis

Motivation

Pooling replicates can allow a single replicate to dominate significance. I-SAGE therefore supports an optional replicate-aware meta-analysis.

`pooled` Mode

Counts are summed across replicates
A single Fisher test is performed per bin

This mode is simple but ignores replicate variability.

`meta_fisher` Mode

In replicate-aware mode:

For each bin, a Fisher test is performed per replicate against the pooled control
Per-replicate p-values are combined using Fisher's method
Effect size is reported as the median log2 fold change across replicates

This approach enforces replicate agreement without fitting a full variance model.

Limitations

Does not model dispersion explicitly
Assumes independence between replicate tests
Not equivalent to a negative binomial GLM

This design is intentional and favors transparency over complexity.

Directional Classification

Significant bins are classified as:

Upregulated — $\text{log2FC} > 0$
Downregulated — $\text{log2FC} < 0$

Separate output files are generated for each class.

Region-Restricted Testing

If a BED file is provided:

stats:
  regions_bed: "/path/to/regions.bed"

Only bins overlapping these regions are tested
Total counts are computed within the restricted region set

This reduces multiple testing burden and enables hypothesis-driven analyses.

EBV Annotation and Enrichment

If EBV contigs are present and enabled via:

stats:
  ebv_regex: "(?i)^chrEBV$"

Bins are annotated as EBV vs non-EBV.

Enrichment Test

A 2×2 contingency table is constructed:

                Significant   Not significant
EBV bins            a               c
Non-EBV bins        b               d

Enrichment is assessed using a two-sided Fisher exact test.

Reported metrics include:

Percentage of EBV bins among tested bins
Percentage of EBV bins among significant bins
Enrichment ratio
Enrichment p-value

Validation and Sensitivity Analysis

Statistical significance alone is insufficient. I-SAGE therefore includes built-in validation.

Downsampling

Break counts are randomly downsampled to fixed fractions
Differential testing is re-run
Stability of significant bins is assessed

Spike-In Analysis

Artificial signal is added to random bins
Recovery of spiked bins is measured
Evaluates sensitivity and false-negative behavior

Bin-Size Sensitivity

When multiple bin sizes are evaluated:

All statistics are computed independently per bin size
Results are summarized in a bin-size sweep table

This guards against bin-size–specific artifacts.

Interpretation Guidelines

Statistical significance does not imply causality
Effect size and reproducibility should be considered jointly
EBV enrichment is a biological consistency check, not proof of mechanism

Summary

I-SAGE uses a simple, explicit statistical framework:

Exact tests
Clear assumptions
Built-in robustness checks

This design prioritizes scientific defensibility and reproducibility over black-box modeling.

Next: See Outputs & Interpretation for how to read and interpret the results produced by the pipeline.