cmuts normalize
Purpose
cmuts normalize is designed to take modification counts generated by cmuts core, and generate normalized reactivity profiles for each RNA in the probing library. It will also generate additional coverage statistics and figures for visualization of the generated profiles.
In the case that cmuts core was run with the --pairwise argument, the 2D counts computed will also be processed, in order to compute modification rate correlations and similar pairwise statistics.
Usage
To run cmuts normalize requires an HDF5 file of modification counts and the original sequence library in FASTA format. The basic syntax is
--mod and --nomod flags specify the HDF5 datasets in which the treated and untreated modification counts are stored, respectively. The latter is optional.
Tip
More than one file can be passed to cmuts normalize. In such a case, the modification counts are summed across the files before normalization. This is useful in the case where the inputs to cmuts core were split across multiple files.
Command Line Options
Core Options
-o, --output : Output HDF5 filename (required)
--mod : Name of the dataset containing treated modification counts (required)
--nomod : Name of the dataset containing untreated modification counts
--fasta : FASTA file of probed RNA sequences.
Output Control
--overwrite : Overwrite existing HDF5 file
--group : HDF5 group to place the output datasets in (none if not specified)
Profile Computation
--no-insertions : Do not use insertions to compute the reactivity profiles
--no-deletions : Do not use deletions to compute the reactivity profiles
Normalization Control
--norm {ubr,raw,outlier} : Normalization method (default: ubr)
ubr: 90th percentile normalization (default)raw: No normalizationoutlier: 2-8% outlier-based normalization
--clip-low : Clip negative reactivity values
--clip-high : Clip reactivity values above 1
--blank-5p : NaN out this many bases on the 5' end (default: 0)
--blank-3p : NaN out this many bases on the 3' end (default: 0)
--blank-cutoff : NaN out any positions with less than this many reads (default: 10)
--norm-independent : Normalize each profile separately, rather than using experiment-wide statistics
Pairwise Settings
--sig : The significance value to use when plotting pairwise modification correlations.
Outputs
Files
The output of cmuts normalize will be an HDF5 file with the following structure:
roi-mask (Region Of Interest): A boolean array indicating the region of the RNA which is of structural importance (i.e. all but the flanking sequences and low-coverage regions).
SNR (Signal-To-Noise): The mean signed signal-to-noise ratio across all nucleotide positions, computed as reactivity / error for each position and averaged. Signed (not absolute) values are used so that noise-only positions contribute zero in expectation, avoiding positive bias. One value per reference.
error: The standard error of the mean at each nucleotide.
heatmap: The prevalence of each mutation, insertion, deletion, and termination type throughout the entire library.
norm: The normalization value used. It will be a scalar unless --norm-independent is passed, in which case each reference has its own norm value.
reactivity: The reactivity profiles.
reads: The number of reads used to compute each reactivity profile.
In the case where 2D data was also present in the input, the output HDF5 file will also contain
pairwise-snr: Signal-to-noise ratio for pairwise joint probabilities. Computed as P(i=1, j=1) / SE(P(i=1, j=1)) for each position pair, then averaged across all pairs (excluding diagonal). One value per reference. Higher values indicate more reliable pairwise correlation estimates.
Warning
If a value was passed to --group, then the above will be contained in an HDF5 group with that name.
Figures
Figures are saved to a figures/ directory next to the output HDF5 file. For example, -o /path/to/reactivity.h5 will write figures to /path/to/figures/.
In addition to the standard profile and coverage plots, cmuts normalize generates an SNR scaling plot (snr-scaling.png) that shows how the mean SNR is expected to change with sequencing depth. The x-axis is relative total read depth (1.0 = current depth), and three curves show the effect of allocating additional reads to:
- Modified — all extra reads go to the treated condition
- Unmodified — all extra reads go to the untreated condition
- Pareto — optimal allocation of reads between conditions (shown as a dashed black line with hatching above indicating the infeasible region)
This plot helps determine whether additional sequencing of a specific condition would meaningfully improve data quality. When only a modified condition is present (no --nomod), a single curve is shown.
The SNR at a given depth is estimated by scaling the standard error as \(\text{SE} \propto 1/\sqrt{n}\), holding the observed reactivity fixed. This is exact for Bernoulli proportions but assumes the mutation rate estimates are accurate, which may not hold at very low read counts.