Getting Started with I-SAGE

This section explains how to set up and run the I-SAGE pipeline, from environment requirements to executing a first analysis.

I-SAGE is designed primarily for HPC environments and assumes familiarity with command-line tools and batch systems (e.g., SLURM).

System Requirements

Software

Nextflow ≥ 22.x
Java ≥ 11
Python ≥ 3.9
Bash / core UNIX utilities

Nextflow manages workflow execution and logging; Python scripts implement statistical and validation logic.

Bioinformatics Tools

The following tools must be available in the execution environment (typically via modules or Conda):

samtools — BAM processing, sorting, indexing
pysam — Python bindings for BAM/CRAM access
bwa (or equivalent aligner) — read alignment
bedGraph / UCSC utilities — bedGraph and bigWig handling

Python Dependencies

The following Python packages are required:

numpy
pandas
scipy
matplotlib

These are typically provided via: - A Conda environment - A system-wide Python installation - An HPC module

All plotting is headless-safe (no GUI required).

Note:
I-SAGE assumes these tools are provided by the execution environment (HPC modules, Conda, or container).
The pipeline does not install system-level dependencies automatically.

Repository Structure

After cloning the repository, the main structure is:

I-SAGE/
├── workflows/
│   └── iblesse_month2/
│       └── main.nf
├── modules/
│   ├── stats/
│   ├── validation/
│   ├── viz/
│   └── break_calling/
├── configs/
│   └── iblesse.yaml
├── docs/
├── README.md
└── nextflow.config

You will typically only modify: - configs/iblesse.yaml - SLURM submission scripts - (Optionally) Nextflow profiles

Input Data Requirements

FASTQ Files

Paired-end or single-end iBLESS FASTQ files
File names must match the pattern defined in sample_pattern
Each sample must correspond to a unique biological condition/replicate

Example:

sample_pattern: "*_R1.fastq.gz"

Reference Genome

You must provide:

A reference genome FASTA
An index compatible with the aligner used in the pipeline

Example:

genome_fasta: "/path/to/hg38_reference.fna"
genome_index: "/path/to/hg38_reference.fna"

If EBV contigs are included (e.g., chrEBV), EBV-specific analyses can be enabled.

Configuration File (`iblesse.yaml`)

All pipeline behavior is controlled via a single YAML file.

Key sections include:

break_calling
viz
normalization
stats
validation

A fully working example is provided in configs/iblesse.yaml. Detailed explanations are provided in the Configuration Guide section of the documentation.

Running the Pipeline

Basic Command

From the repository root:

nextflow run workflows/iblesse_month2/main.nf \
  -profile eden_local \
  -params-file configs/iblesse.yaml

This will:

Read FASTQ files
Execute all enabled pipeline stages
Write results to the directory specified by outdir

Running on an HPC Cluster (SLURM)

I-SAGE is commonly run inside a SLURM allocation.

Typical workflow:

Write a SLURM submission script
Allocate resources
Run Nextflow inside the job

Example (simplified):

sbatch run_isage.slurm

Within the SLURM script, Nextflow is executed normally. Nextflow parallelizes tasks within the allocated resources.

Execution Outputs

During and after execution, I-SAGE produces:

Nextflow report (report_*.html)
Execution trace (trace_*.txt)
Timeline view (timeline_*.html)
Structured output directories:
viz/
stats/
validation/

These files are critical for debugging and reproducibility.

First-Time Run Checklist

Before running a full analysis, verify:

FASTQ paths are correct
Reference genome paths exist
Output and log directories are writable
The selected Nextflow profile matches your environment
Bin sizes and contrasts are intentional (not defaults by accident)

Common First-Run Pitfalls

Using -resume unintentionally (skips updated steps)
Misnamed sample IDs in contrasts
Missing EBV contigs while EBV analysis is enabled
Insufficient disk space for intermediate files
Running bin-size sweeps without accounting for increased runtime

Next Steps

After completing a successful run:

Review output structure
Inspect differential statistics and plots
Consult the Configuration Guide to fine-tune parameters
Proceed to Pipeline Modules for a deeper understanding of each stage

Next: See Configuration Guide for detailed parameter explanations.