iCount: protein-RNA interaction analysis

iCount is a Python module and associated command-line interface (CLI), which provides all the commands needed to process protein-RNA iCLIP interaction data and to identify and quantify sites of protein-RNA interactions on RNA.

iCount’s main input are FASTQ files with iCLIP sequencing data, its main output are BED files with identified and quantified cross-linked sites.

A number of analyses are included in iCount that provide insights into the properties of protein-RNA interaction.

Contents of this documentation:

Tutorial

You will need to install iCount first. Please, follow these instructions.

iCount provides all commands needed to process FASTQ files with iCLIP sequencing data and generate BED files listing identified and quantified cross-linked sites.

iCount uses and generates a number of files. We suggest you run this tutorial in an empty folder:

$ mkdir tutorial_example
$ cd tutorial_example

Note

All steps provided in this tutorial are included in the tutorial.sh script, which you can obtain by running:

$ iCount examples
$ ls examples

hnRNPC.sh  hnRNPC_reduced.sh  tutorial.sh

Preparing a genome index

iCLIP sequencing reads must be mapped to a reference genome. The user can prepare its own FASTA genome sequence and GTF genome annotation files.

Another option is to download a release from ensembl. You can use the command releases to get a list of available releases supported by iCount:

$ iCount releases --source ensembl

There are 30 releases available: 88,87,86,85,84,83,82,81,80,79,78,77,76,75,74,73,
72,71,70,69,68,67,66,65,64,63,62,61,60,59

You can then use the command species to get a list of species available in a release:

$ iCount species --source ensembl -r 88

There are 87 species available: ailuropoda_melanoleuca,anas_platyrhynchos,
ancestral_alleles,anolis_carolinensis,astyanax_mexicanus,bos_taurus,
...
gorilla_gorilla,homo_sapiens,ictidomys_tridecemlineatus,latimeria_chalumnae,
...
tursiops_truncatus,vicugna_pacos,xenopus_tropicalis,xiphophorus_maculatus

Note

Current version of iCount is tested to work with human and rat genomes only.

Let’s download the human genome sequence from release 88:

$ iCount genome --source ensembl homo_sapiens -r 88 --chromosomes 21 MT

Downloading FASTA file into: /..././homo_sapiens.88.chr21_MT.fa.gz
Fai file saved to : /..././iCount/homo_sapiens.88.chr21_MT.fa.gz.fai
Done.

Note

Processing the entire genome is computationally very expensive. For this reason, we are limiting the tutorial example to chromosomes 21 and MT.

And the annotation of the human genome from release 88:

$ iCount annotation --source ensembl homo_sapiens -r 88

Downloading GTF to: /..././homo_sapiens.88.gtf.gz
Done.

The next step is to generate a genome index that is used by STAR mapper. Let’s call the index hs88 and use ensembl’s GTF annotation on genes:

$ mkdir hs88  # folder should be empty
$ iCount indexstar homo_sapiens.88.chr21_MT.fa.gz hs88 \
--annotation homo_sapiens.88.gtf.gz --genome_sasparsed 2 --genome_saindexnbases 13

Building genome index with STAR for genome homo_sapiens.88.fa.gz
<timestamp> ..... Started STAR run
<timestamp> ... Starting to generate Genome files
<timestamp> ... starting to sort  Suffix Array. This may take a long time...
<timestamp> ... sorting Suffix Array chunks and saving them to disk...
<timestamp> ... loading chunks from disk, packing SA...
<timestamp> ... Finished generating suffix array
<timestamp> ... starting to generate Suffix Array index...
<timestamp> ..... Processing annotations GTF
<timestamp> ..... Inserting junctions into the genome indices
<timestamp> ... writing Genome to disk ...
<timestamp> ... writing Suffix Array to disk ...
<timestamp> ... writing SAindex to disk
<timestamp> ..... Finished successfully
Done.

Note

A subfolder hs88 will be created in current working directory. You can specify alternative relative or absolute paths, e.g., indexes/hs88.

Note

Changing the parameters genome_sasparsed and genome_saindexnbases results into lower memory requirements but longer run times.

We are now ready to start mapping iCLIP data to the human genome!

Preparing iCLIP data for mapping

Let’s process one of the hnRNP C sequencing data files from the original iCLIP publication:

$ wget http://icount.fri.uni-lj.si/data/20101116_LUjh03/\
SLX-2605.CRIRUN_501.s_4.sequence.reduced.txt.gz -O hnRNPC.fq.gz

Note

In the tutorial, we are using a subset of the file [23 MB]. If you want to use the entire file, then download it:

$ wget http://icount.fri.uni-lj.si/data/20101116_LUjh03/\
SLX-2605.CRIRUN_501.s_4.sequence.txt.gz -O hnRNPC.fq.gz

This is a single file that contains five iCLIP experiments. Each experiment is marked with a unique barcode sequence at the very beginning of the sequencing reads. Part of the barcode are also so-called randomer nucleotides that are used to identify unique cDNA molecules after mapping.

We can extract the sample assignment and randomer sequence with the command demultiplex. The command expects the adapter sequence AGATCGGAAGAGCGGTTCAG, followed by the sample barcodes, in our case five, expected to be present in the sequencing file:

$ mkdir demultiplexed  # make sure that folder exists
$ iCount demultiplex hnRNPC.fq.gz AGATCGGAAGAGCGGTTCAG NNNGGTTNN NNNTTGTNN \
NNNCAATNN NNNACCTNN NNNGGCGNN --out_dir "demultiplexed"

Allowing max 1 mismatches in barcodes.
Demultiplexing file: hnRNPC.fq.gz
Saving results to:
    demultiplexed/demux_nomatch_raw.fastq.gz
    demultiplexed/demux_NNNGGTTNN_raw.fastq.gz
    demultiplexed/demux_NNNTTGTNN_raw.fastq.gz
    demultiplexed/demux_NNNCAATNN_raw.fastq.gz
    demultiplexed/demux_NNNACCTNN_raw.fastq.gz
    demultiplexed/demux_NNNGGCGNN_raw.fastq.gz
Trimming adapters (discarding shorter than 15)...

Note

Position of a randomer nucleotide in barcode is indicated with the letter N.

This should have generated six files in subfolder demultiplexed:

$ ls -lh demultiplexed

total 37424
-rw-r--r-- 1 <user> <group>  758K <timestamp> demux_NNNACCTNN.fastq.gz
-rw-r--r-- 1 <user> <group>  2.9M <timestamp> demux_NNNCAATNN.fastq.gz
-rw-r--r-- 1 <user> <group>  8.0M <timestamp> demux_NNNGGCGNN.fastq.gz
-rw-r--r-- 1 <user> <group>  421K <timestamp> demux_NNNGGTTNN.fastq.gz
-rw-r--r-- 1 <user> <group>  4.8M <timestamp> demux_NNNTTGTNN.fastq.gz
-rw-r--r-- 1 <user> <group>  1.4M <timestamp> demux_nomatch.fastq.gz

Note

Reads that cannot be assigned to any of the specified sample barcodes (for the given number of allowed mismatches) are stored in a separate file named demux_nomatch.fastq.gz. You should have a look at such reads and try to understand why they do not conform to expectations.

Mapping sample reads to the genome

Let’s focus on iCLIP experiment with barcode NNNGGCGNN and process it further. Same steps should be taken to process each experiment.

First, create a folder to store the mapping results:

$ mkdir mapping_NNNGGCGNN

Then, map the reads in the selected FASTQ file using STAR and the genome index we have generated at the very beginning of this tutorial:

$ iCount mapstar demultiplexed/demux_NNNGGCGNN.fastq.gz hs88 mapping_NNNGGCGNN \
--annotation homo_sapiens.88.gtf.gz

Mapping reads from demultiplexed/demux_NNNGGCGNN.fastq.gz
<timestamp> ..... Started STAR run
<timestamp> ..... Loading genome
<timestamp> ..... Processing annotations GTF
<timestamp> ..... Inserting junctions into the genome indices
<timestamp> ..... Started mapping
<timestamp> ..... Started sorting BAM
<timestamp> ..... Finished successfully
Done.

This should have generated a file Aligned.sortedByCoord.out.bam in folder mapping_NNNGGCGNN:

$ ls -lh mapping_NNNGGCGNN

total 842M
-rw-r--r-- 1 <user> <group>   15M Nov 15 05:28 Aligned.sortedByCoord.out.bam
-rw-r--r-- 1 <user> <group>  1.6K Nov 15 05:28 Log.final.out
-rw-r--r-- 1 <user> <group>   15K Nov 15 05:28 Log.out
-rw-r--r-- 1 <user> <group>  364B Nov 15 05:28 Log.progress.out
-rw-r--r-- 1 <user> <group>   51K Nov 15 05:28 SJ.out.tab

Quantifying cross-linked sites

Command xlsites reads a BAM file and generates a BED file with identified and quantified cross-linked sites:

$ iCount xlsites mapping_NNNGGCGNN/Aligned.sortedByCoord.out.bam \
NNNGGCGNN_cDNA_unique.bed  NNNGGCGNN_cDNA_multiple.bed NNNGGCGNN_cDNA_skipped.bam \
--group_by start --quant cDNA

This will generate a BED file where interaction strength is measured by the number of unique cDNA molecules (randomer barcodes are used for this quantification).

You may generate a BED files where interaction strength is determined by the number of reads:

$ iCount xlsites mapping_NNNGGCGNN/Aligned.sortedByCoord.out.bam \
NNNGGCGNN_reads_unique.bed  NNNGGCGNN_reads_multiple.bed NNNGGCGNN_reads_skipped.bam \
--group_by start --quant reads

By comparing the ration of cDNA vs reads counts we can estimate the level of over-amplification. Ideally, this ratio should be close to one.

Identifying significantly cross-linked sites

The peak finding analysis expects an annotation file with information about the segmentation of the genome into regions of different types, such as intergenic, UTR3, UTR5, ncRNA, intron, CDS regions.

Command segment can read the annotation obtained from ensembl and generate a new annotation file with genome segmentation:

$ iCount segment homo_sapiens.88.gtf.gz hs88seg.gtf.gz \
homo_sapiens.88.chr21_MT.fa.gz.fai

Calculating intergenic regions...
Segmentation stored in hs88seg.gtf.gz

Command peaks reads a genome segmentation GTF file, a BED file with cross-linked sites and generates a BED file with subset of significantly cross-linked sites:

$ iCount peaks hs88seg.gtf.gz NNNGGCGNN_cDNA_unique.bed peaks.bed \
--scores scores.tsv

Loading annotation file...
874 out of 31150 annotation records will be used (30276 skipped).
Loading cross-links file...
Calculating intersection between annotation and cross-link file...
Processing intersections...
Peaks caculation finished. Writing results to files...
BED6 file with significant peaks saved to: peaks.bed
Scores for each cross-linked position saved to: scores.tsv
Done.

Note

P-value and FDR scores of all cross-linked sites can be stored by providing the parameter --scores.

Identifying clusters of significantly cross-linked sites

Command clusters reads a BED file with cross-linked sites and generates a BED file with clusters of cross-linked sites:

$ iCount clusters peaks.bed clusters.bed

Merging cross links form file peaks.bed
Done. Results saved to: clusters.bed

Annotating sites and summary statistics

Command clusters reads genome segmentation GTF file, a BED file with cross-linked sites and generates a file, where each site is annotated. By default it will annotate according to the biotype:

$ iCount annotate hs88seg.gtf.gz NNNGGCGNN_cDNA_unique.bed annotated_sites_biotype.tab

Calculating overlaps between cross-link and annotation_file...
Writing results to file...
Done. Output saved to: annotated_sites_biotype.tab

You can specify other attributes from annotation to use. For example, we can determine, which genes are annotated to each site:

$ iCount annotate --subtype gene_id hs88seg.gtf.gz NNNGGCGNN_cDNA_unique.bed \
annotated_sites_genes.tab

Calculating overlaps between cross-link and annotation_file...
Writing results to file...
Done. Output saved to: annotated_sites_genes.tab

A summary of annotations can be generated with the command summary:

$ iCount summary hs88seg.gtf.gz NNNGGCGNN_cDNA_unique.bed summary.tab \
homo_sapiens.88.chr21_MT.fa.gz.fai

Calculating intersection between cross-link and annotation...
Extracting summary from data...
Done. Results saved to: summary.tab

Installation

Running with Docker

Docker provides an easy way to run iCount in a working environment that is completely separated from your host machine.

A ready-to-use image of iCount is available at Docker Hub. After Installing Docker on your machine, you can run iCount by issuing the following command:

docker run -i -t tomazc/icount

If you want to build an image from source, then change to the source folder and issue the following command:

docker build -t icountsrc .

You can then run the freshly built image:

docker run -i -t icountsrc

To make all the files and results persistent, even after the container stops, create a folder and mount it as a data volume:

mkdir `pwd`/storage_docker
docker run -i -t -v `pwd`/storage_docker:/home/icuser/storage icountsrc

Note

Make sure to create a local folder and provide the path to it. The example above uses a path that may not be applicable to your computer. Both, path to the folder on the host machine and path within the container (/home/icuser/storage), must be absolute.

If you are developing iCount, then it makes sense to mount the source folder as a volume into the docker container. Make sure to change into the source folder and then issue:

docker run -i -t -v `pwd`:/home/icuser/iCount_src \
-v `pwd`/storage_docker:/home/icuser/storage icountsrc

This setup will behave similarly to the Installing from source setup. All changes to the source made on your host computer, will be immediately available in the running container.

Installing from the Python Package Index (PyPI)

The simplest way to install iCount is from PyPI, by issuing the following command:

pip install icount

If installing the package globally, you may need root priviliges (prefix upper command with sudo).

Running within virtualenv

We recommend installing the package in virtualenv. First, install the virtualenv tool and create a virtual environment, stored in ~/envs/icount_env:

pip3 install virtualenv
virtualenv ~/envs/icount_env

Then activate the environment and install iCount into it using pip:

source ~/envs/icount_env/bin/activate
pip install icount

To use iCount, make sure that the proper virtualenv is loaded:

source ~/envs/icount_env/bin/activate

Afterwards you can import iCount:

python
>>> import iCount

Or use its command line interface:

iCount -h

Installing from source

If you wish to install form source, follow instructions in section Contributing.

Reference

Command-line interface

usage: iCount [-h] [-v]   ...

iCount: protein-RNA interaction analysis
========================================

iCount is a Python module and associated command-line interface (CLI), which provides all the
commands needed to process protein-RNA `iCLIP`_ interaction data and to identify and quantify
sites of protein-RNA interactions on RNA.

iCount's main input are FASTQ files with `iCLIP`_ sequencing data, its main output are BED files
with identified and quantified cross-linked sites.

A number of analyses are included in iCount that provide insights into the properties of
protein-RNA interaction.

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

Commands:
   
    releases     Get list of available releases.
    species      Get list of available species.
    annotation   Download annotation for given release/species/source.
    genome       Download genome for given release/species/source.
    segment      Parse annotation file into internal iCount structure - segmentation.
    demultiplex  Split FASTQ file into separate files, one for each sample barcode.
    cutadapt     Remove adapter sequences from reads in FASTQ file.
    indexstar    Generate STAR genome index.
    mapstar      Map reads to genome with STAR.
    xlsites      Quantity cross-link events and determine their positions.
    annotate     Annotate each cross link site with types of regions that intersect with it.
    clusters     Merge adjacent peaks into clusters and sum cross-links within clusters.
    group        Merge multiple BED files with crosslinks into one.
    peaks        Find positions with high density of cross-linked sites.
    rnamaps      Distribution of cross-links relative to genomic landmarks.
    summary      Report proportion of cross-link events/sites on each region type.
    examples     Provide a set of example bash scripts.
    man          Print help for all commands.
    args         Print arguments form all CLI commands.


releases
========

usage: iCount releases [-h] [--source] [--species] [-S] [-F] [-P] [-M]

Get list of available releases.

optional arguments:
  -h, --help            show this help message and exit
  --source              Source of data. Only ENSEMBL or GENCODE are available (default: gencode)
  --species             Species name. Only relevant if source is GENCODE (default: None)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


species
=======

usage: iCount species [-h] [--source] [-r] [-S] [-F] [-P] [-M]

Get list of available species.

optional arguments:
  -h, --help            show this help message and exit
  --source              Source of data. Only ENSEMBL or GENCODE are available (default: gencode)
  -r , --release        Release number. Only relevant if source is ENSEMBL (default: None)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


annotation
==========

usage: iCount annotation [-h] [-od] [-a] [--source] [-S] [-F] [-P] [-M]
                         species release

Download annotation for given release/species/source.

positional arguments:
  species               Species name
  release               Release number

optional arguments:
  -h, --help            show this help message and exit
  -od , --out_dir       Download to this directory (if not given, current working directory) (default: None)
  -a , --annotation     Annotation filename (must have .gz file extension). If not given,
                        species.release.gtf.gz is used. If annotation is provided as absolute
                        path, value of out_dir parameter is ignored and file is saved to given
                        absolute path (default: None)
  --source              Source of data. Only ENSEMBL or GENCODE are available (default: gencode)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


genome
======

usage: iCount genome [-h] [-od] [--genome] [--chromosomes  [...]] [--source]
                     [-S] [-F] [-P] [-M]
                     species release

Download genome for given release/species/source.

positional arguments:
  species               Species name
  release               Release number

optional arguments:
  -h, --help            show this help message and exit
  -od , --out_dir       Download to this directory (if not given, current working directory) (default: None)
  --genome              Genome filename (must have .gz file extension). If not given,
                        species.release.fa.gz is used. If genome is provided as absolute path,
                        value of out_dir parameter is ignored and file is saved to given
                        absolute path (default: None)
  --chromosomes  [ ...]
                        If given, do not download the whole genome, but listed
                        chromosomes only. Only relevant if source is ENSEMBL (default: None)
  --source              Source of data. Only ENSEMBL or GENCODE are available (default: gencode)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


segment
=======

usage: iCount segment [-h] [-prog] [-S] [-F] [-P] [-M]
                      annotation segmentation fai

Parse annotation file into internal iCount structure - segmentation.

Currently, only annotations from ENSEMBl and GENCODE are supported.
http://www.gencodegenes.org/
http://www.ensembl.org

Segmentation is used in almost all further analyses.

In segmentation, each transcript is partitioned into so called
regions/intervals. Such regions must span the whole transcript, but should not
intersect with each other. However, higher hierarchy levels: transcripts and
genes can of course intersect each other.

Example of possible segmentation::

    Genome level: |---------------------------------------------------|

    Gene level:    |--------------gene1--------------|   |-intergenic-|
                                 |---------gene2--------|

    Transcript l.: |----------transcript1---------|
                           |-------transcript2-------|
                                 |------transcript3-----|

    Region level:  |-CDS-||-intron-||-CDS-||-UTR3-|

For simplicity, only the partition of transcript1 is presented.

positional arguments:
  annotation            Path to input GTF file
  segmentation          Path to output GTF file
  fai                   Path to input genome_file (.fai or similar)

optional arguments:
  -h, --help            show this help message and exit
  -prog, --report_progress
                        Switch to show progress (default: False)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


demultiplex
===========

usage: iCount demultiplex [-h] [-mis] [-ml] [--prefix] [-od] [-S] [-F] [-P]
                          [-M]
                          reads adapter barcodes [barcodes ...]

Split FASTQ file into separate files, one for each sample barcode.

Saved FASTQ files contain sequences where sample barcode, random
barcode, and adapter sequences are removed. Random barcode is moved into
the header line, since it is needed in later steps (removing PCR duplicates
and counting number of cross-link events).

.. autofunction:: iCount.demultiplex.run

positional arguments:
  reads                 Path to reads from a sequencing library
  adapter               Adapter sequence to remove from ends of reads
  barcodes              List of barcodes used for library

optional arguments:
  -h, --help            show this help message and exit
  -mis , --mismatches   Number of tolerated mismatches when comparing barcodes (default: 1)
  -ml , --minimum_length 
                        Minimum length of trimmed sequence to keep (default: 15)
  --prefix              Prefix of generated FASTQ files (default: demux)
  -od , --out_dir       Output folder. Use local folder if none given (default: .)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


cutadapt
========

usage: iCount cutadapt [-h] [--qual_trim] [-ml] [-S] [-F] [-P] [-M]
                       reads reads_trimmed adapter

Remove adapter sequences from reads in FASTQ file.

positional arguments:
  reads                 Input FASTQ file
  reads_trimmed         Output FASTQ file containing trimmed reads
  adapter               Sequence of an adapter ligated to the 3' end

optional arguments:
  -h, --help            show this help message and exit
  --qual_trim           Trim low-quality bases before adapter removal (default: None)
  -ml , --minimum_length 
                        Discard trimmed reads that are shorter than `minimum_length` (default: None)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


indexstar
=========

usage: iCount indexstar [-h] [-a] [--overhang] [--overhang_min] [--threads]
                        [--genome_sasparsed] [--genome_saindexnbases] [-S]
                        [-F] [-P] [-M]
                        genome genome_index

Generate STAR genome index.

positional arguments:
  genome                Genome sequence to index
  genome_index          Output folder, where to store genome index

optional arguments:
  -h, --help            show this help message and exit
  -a , --annotation     Annotation that defines splice junctions (default: )
  --overhang            Sequence length around annotated junctions to be used by STAR when
                        constructing splice junction database (default: 100)
  --overhang_min        Minimum overhang for unannotated junctions (default: 8)
  --threads             Number of threads that STAR can use for generating index (default: 1)
  --genome_sasparsed    STAR parameter genomeSAsparseD.
                        Suffix array sparsity. Bigger numbers decrease RAM requirements
                        at the cost of mapping speed reduction. Suggested values
                        are 1 (30 GB RAM) or 2 (16 GB RAM) (default: 1)
  --genome_saindexnbases 
                        STAR parameter genomeSAindexNbases.
                        SA pre-indexing string length, typically between 10 and 15.
                        Longer strings require more memory, but result in faster searches (default: 14)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


mapstar
=======

usage: iCount mapstar [-h] [-a] [--multimax] [-mis] [--threads]
                      [--genome_load] [-S] [-F] [-P] [-M]
                      reads genome_index out_dir

Map reads to genome with STAR.

positional arguments:
  reads                 Sequencing reads to map to genome
  genome_index          Folder with genome index
  out_dir               Output folder, where to store mapping results

optional arguments:
  -h, --help            show this help message and exit
  -a , --annotation     GTF annotation needed for mapping to splice-junctions (default: )
  --multimax            Number of allowed multiple hits (default: 10)
  -mis , --mismatches   Number of allowed mismatches (default: 2)
  --threads             Number of threads that STAR can use for generating index (default: 1)
  --genome_load         Load genome into shared memory
                        Shared memory must be available in the system. See Chapter 3.3 in STAR manual (default: False)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


xlsites
=======

usage: iCount xlsites [-h] [-g] [--quant] [--segmentation] [-mis] [--mapq_th]
                      [--multimax] [--gap_th] [--ratio_th] [-prog] [-S] [-F]
                      [-P] [-M]
                      bam sites_unique sites_multi skipped

Quantity cross-link events and determine their positions.

The simplest version of this script would oprate on such example::

    |--a---b--- reference sequence, chr 14, positive strand ------------
        |rbc1---R1--------|
        |rbc1---R2------|
        |rbc2---R3------|
            |rbc1--------R4-------------|
            |rbc3------R5-----|

Five reads (R1-R5) are mapped to a reference sequence (chromosome 14, positive
strand). Reads start on two distinct positons. On first position, there is
R1-R3. Cross-link site is located one nucleotide before start of the read (on
negative strand, one nucleotide after end of read). However, we wish to count
number of cDNA molecules, not the number of reads. This can be done by counting
the number of distinct random barcodes (sometimes also called randomers). So in
upper example, we have:

    Postion a: 3 reads, 2 distinct random barcodes = 2 cDNA's
    Postion b: 2 reads, 2 distinct random barcodes = 2 cDNA's

However, things can get complicated when a single read is mapped in multiple
parts. This can happen for several reasons. One common example is that introns
are removed during transcription. This can be illustrated with the following
image::

    |---------------- reference -----------------------

    |--------------------------transcript--------------------------|
    |---UTR5---||---intron---||---exon---||---intron---||---exon---|

                                |-------R1-------|
                                |--R2.1-->              <-R2.2-|
                      |-R3.1->      <-R3.2->            <-R3.3-|
                        |-R4.1->      <-R4.2-|

Read R1 and R2 are starting on same position. For the sake of argument, let's
also pretend they also have same random barcode. In so, we would count them as a
single cDNA molecule (= single cross-link event), even though it is obvious that
they represent two separate cross-link events. In order to fix this, we count
not just the number of different randomers on same position, but also number of
different "second-start" coordinates. Second-start coordinate is just the
coordinate of the second part of the read. This way, the actual number of
cross-link events can be determined more accurately. If read is not split, it's
second-start coordinate is 0. If read has multiple "holes" (as read R3) we
determine second-start from the largest hole.

Reads whose second-start do NOT fall on segmentation (like R4) are stored in a
separate BAM file ``sites_strange``. They should be treated with special care,
since they can indicate not-yet annotated features in genome. If segmentation is
not given, all reads with holes, bigger than ``holesize_th`` are considered
strange.

Another parameter needs more explanation: ``group_by``. When algorithm starts,
reads from BAM file are grouped in hierarchical structure by::

    * chromosome and strand
    * cross-link position
    * random barcode
    * second-start

Each second-start group receives 1 cDNA score. This score is divided to each
read in group (if there are 5 reads in group, each one gets 1/5 score). This
enables that each read has it's cDNA score and of course, 1 "read score". This
scores can be assigned to start (actually, to cross-link position), midlle or
end position of read. By default, score is of course assigned to cross-link
location. But for diagnostic purpuses, scores can also be assigned to middle or
end coordinate of the read.

TODO: check overlap between unique and multimap BED files, should be small,
otherwise, we should think of a more approapriate quantification of (division
of randomers among) unique mapped sites that overlap with multimapped reads

positional arguments:
  bam                   Input BAM file with mapped reads
  sites_unique          Output BED6 file to store data from uniquely mapped reads
  sites_multi           Output BED6 file to store data from multi-mapped reads
  skipped               Output BAM file to store reads that do not map as expected by segmentation and
                        reference genome sequence. If read's second start does not fall on any of
                        segmentation borders, it is considered problematic. If segmentation is not provided,
                        every read in two parts with gap longer than gap_th is not used (skipped).
                        All such reads are reported to the user for further exploration

optional arguments:
  -h, --help            show this help message and exit
  -g , --group_by       Assign score of a read to either 'start', 'middle' or 'end' nucleotide (default: start)
  --quant               Report number of 'cDNA' or number of 'reads' (default: cDNA)
  --segmentation        File with custon segmentation format (obtained by ``iCount segment``) (default: None)
  -mis , --mismatches   Reads on same position with random barcode differing less than
                        ``mismatches`` are merged together, if their ratio is below ratio_th (default: 1)
  --mapq_th             Ignore hits with MAPQ < mapq_th (default: 0)
  --multimax            Ignore reads, mapped to more than ``multimax`` places (default: 50)
  --gap_th              Reads with gaps less than gap_th are treated as if they have no gap (default: 4)
  --ratio_th            Ratio between the number of reads supporting a randomer versus the
                        number of reads supporting the most frequent randomer. All randomers
                        above this threshold are accepted as unique. Remaining are merged
                        with the rest, allowing for the specified number of mismatches (default: 0.1)
  -prog, --report_progress
                        Switch to report progress (default: False)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


annotate
========

usage: iCount annotate [-h] [--subtype] [-e  [...]] [-S] [-F] [-P] [-M]
                       annotation sites sites_annotated

Annotate each cross link site with types of regions that intersect with it.

positional arguments:
  annotation            Path to annotation file (should be GTF and include `subtype` attribute)
  sites                 Path to input BED6 file listing all cross-linked sites
  sites_annotated       Path to output BED6 file listing annotated cross-linked sites

optional arguments:
  -h, --help            show this help message and exit
  --subtype             Subtype (default: biotype)
  -e  [ ...], --excluded_types  [ ...]
                        Excluded types (default: None)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


clusters
========

usage: iCount clusters [-h] [--dist] [--slop] [-S] [-F] [-P] [-M]
                       sites peaks clusters

Merge adjacent peaks into clusters and sum cross-links within clusters.

positional arguments:
  sites                 Path to input BED6 file with sites
  peaks                 Path to input BED6 file with peaks (or clusters)
  clusters              Path to output BED6 file with merged peaks (clusters)

optional arguments:
  -h, --help            show this help message and exit
  --dist                Distance between two peaks to merge into same cluster (default: 20)
  --slop                Distance between site and cluster to assign site to cluster (default: 3)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


group
=====

usage: iCount group [-h] [-S] [-F] [-P] [-M] sites_grouped sites [sites ...]

Merge multiple BED files with crosslinks into one.

First, concatenate files into one file. Then, merge crosslinks from different
files that are on same position and sum their scores.

positional arguments:
  sites_grouped         Path to output BED6 file containing merged data from input sites files
  sites                 List of BED6 files(paths) to be merged

optional arguments:
  -h, --help            show this help message and exit
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


peaks
=====

usage: iCount peaks [-h] [--scores] [--features  [...]] [-g]
                    [--merge_features] [--half_window] [--fdr] [-p] [-rnd]
                    [-prog] [-S] [-F] [-P] [-M]
                    annotation sites peaks

Find positions with high density of cross-linked sites.

There are two typical variants of this analysis, depending on the parameters:

* Gene-wise analysis, where:
    * features = gene
    * group_by = gene_id

* Transcript-wise analysis where:
    * features = CDS, intron, UTR3, UTR5, ncRNA, intergenic
    * group_by = transcript_id

Let's look at the Gene-wise analysis in more detail first. Imagine the following
situation::

    |-----------gene1----------|
            |-----------------------------gene2------------------------------|
           ab c                d                 e

    a = 60
    b = 100
    c = 70
    d = 40
    e = 100
    gene1: gene_id = 001
    gene2: gene_id = 002

There are two genes (partially intersecting) and five positions with cross-links
(noted with a, b, c, d and e). Crosslink position "a" has 60 cross-link events,
"b" has 100 cross-link events and so on. Also, gene1 has gene_id 001, etc.

The algorithm first finds all intersections between annotation and
cross-links. In this case cross-link position "a" intersects only with gene1,
while position "b" intersects also with gene2... Annotation can include
various other types of segments (transcripts, intergenic, ncRNA, etc.), but only
segments of type ``gene`` are considered for intersection. This behaviour is
controlled by parameter ``features``.

Next step is to make groups of cross-links. They are grouped by ``group_by``
parameter (in this case, it equals to ``gene_id``). There will be 2 groups.
First group name will be 001 and will contain a, b, c and d. Second group name
will be 002 and will contain b, c, d and e.

The question now is: has any of positions in each group significantly increased
number of cross-link events? And how can one quantify this significance?

This is done by parmutation analysis. It draws a number of random situations
with same group size and number of cross-link scores. Number of such draws is
determined by ``perm`` parameter. This way, a random distribution is calculated.
When comparing the observed distribution with the random one, FDR values are
assigned to each position. A cutoff FRD value is chosen and only positions with
FDR < FDR cutoff are considered as significant.

One must also know that when considering only scores on single positions
significant *clusters* of cross-links can be missed. In the upper example, it is
obvious, that something more significantly is happening on position b than on
position e, despite having the same score. To account for this, algorithm
considers not only the score of single cross-link, but also scores of
cross-links some nucleotides before and after. This behaviour in controlled by
half-window (half_window) parameter. In the upper example, score of position b
eqals to 160 if half_window = 1 and 2530 if half_window=2. Score of position e
remains 100.

Let's also look at the transcript-wise analysis. In this case, scenario also
includes transcripts and sub-transcript elements::

    |-----------gene1----------|
    |--------transcript1-------|
    |-ncRNA-||-intron-||-ncRNA-|
            |-----------------------------gene2------------------------------|
            |---------------transcript2--------------|
            |-CDS-||-intron-||-CDS-||-intron-||-UTR3-|
                                    |---------------transcript3--------------|
                                    |-UTR5-||-intron-||-CDS-||-intron-||-CDS-|

           ab c                d                 e

    a = 60
    b = 100
    c = 70
    d = 40
    e = 100
    gene1: gene_id = 001
    gene2: gene_id = 002
    transcript1: transcript_id = 0001
    transcript2: transcript_id = 0002
    transcript3: transcript_id = 0003

Value of parameter features is: CDS, intron, UTR3, UTR5, ncRNA, intergenic.
Value of parameter group_by is transcript_id. Since we have multiple values in
feature parameter, another parameter becomes important: merge_features. If set
to false (default) algorithm will make the following groups:

    * group name: ncRNA-0001, members: a, b, d
    * group name: intron-0001, members: c
    * group name: CDS-0002, members: b, c, d
    * group name: UTR3-0002, members: e
    * group name: intron-0003, members: e

However, if merge_features equals to true, groups are:

    * group name: 0001, members: a, b, c, d
    * group name: 0002, members: b, c, d, e
    * group name: 0003, members: e

Then, for each group, procedure is exactly the same as in Gene-wise case.

When analysis is done, significant positions are reported in file, given by
peaks parameter. If scores parameter is also given, all positions are
reported in it, no matter the FDR value.

positional arguments:
  annotation            Annotation file in GTF format, obtained from "iCount segment" command
  sites                 File with cross-links in BED6 format
  peaks                 File name for "peaks" output. File reports positions with significant
                        number of cross-link events. It should have .bed or .bed.gz extension

optional arguments:
  -h, --help            show this help message and exit
  --scores              File name for "scores" output. File reports all cross-link events,
                        independent from their FDR score It should have .tsv, .csv, .txt or .gz
                        extension (default: None)
  --features  [ ...]    Features from annotation to consider. If None, ['gene'] is used.
                        Sometimes, it is advised to use ['gene', 'intergenic'] (default: None)
  -g , --group_by       Attribute by which cross-link positions are grouped (default: gene_id)
  --merge_features      Treat all features as one when grouping. Has no effect when only one
                        feature is given in features parameter (default: False)
  --half_window         Half-window size (default: 3)
  --fdr                 FDR threshold (default: 0.05)
  -p , --perms          Number of permutations when calculating random distribution (default: 100)
  -rnd , --rnd_seed     Seed for random generator (default: 42)
  -prog, --report_progress
                        Report analysis progress (default: False)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


rnamaps
=======

usage: iCount rnamaps [-h] [--implicit_handling] [-mis] [--mapq_th]
                      [--holesize_th] [-S] [-F] [-P] [-M]
                      bam segmentation out_file strange cross_transcript

Distribution of cross-links relative to genomic landmarks.

What is an RNA-map?
^^^^^^^^^^^^^^^^^^^

Imagine the following situation (on positive strand)::

    |---intron1---||---exon2---||---intron2---||---exon3---|
                              x|-----R1-----|

Situation is simple: a read perfectly maps to reference genome. First cross-link
(nucletide before the start of read R1) is located 2 nucleotides before
landmark of type 'exon-intron'. By examining all cross-links(reads) that are
located in vicinity of landmarks one can obtain the distribution of
cross-link-to-landmark distances. Such distibution is called an RNA map.

Of course, situation gets complicated. Real-world-situation more likely looks
like this::

    |--------transcript1-------|
    |-ncRNA-||-intron-||-ncRNA-|
            |------------------transcript2-----------------|
            |--CDS--||-intron-||--CDS--||-intron-||--UTR3--|
                                    |---------------transcript3--------------|
                                    |-UTR5-||-intron-||-CDS-||-intron-||-CDS-|

             x|-----R1-----|
             x|R2.1->          <--R2.2-|
                                x|-R3-|

All sort of situations can arise:

    * read can intersect with multiple genes/transcripts/segments
      simultaneously, which means that one read is relevant for different types
      of RNA maps.
    * if whole read is mapped to one segment (as is read R3 on transcript2),
      then it is impossible to tell the type of RNA map it belongs to: is it
      CDS-intron or intron-CDS?
    * read can be mapped in multiple parts - for example when introns are
      removed before translation, first half of read will map to ``exon1`` and
      second half to ``exon2``.
    * same cross-link event can be represented by multiple reads with same
      random barcode.
    * ...

The algorithm takes care for all of this, as explained below. It outputs a file,
that looks like this::

    rna_map_type    position    all     explicit
    exon-intron     42          13      14
    exon-intron     43          123     19
    exon-intron     44          56      16
    ....
    intron-CDS      23474       34      2
    intron-CDS      23475       85      65
    intron-CDS      23476       92      1
    ...

Each line consits of four columns: RNA map type, position, all and explicit. RNA
map type defines the landmark type, while position defines relative distance of
cross-links to this landmark. All and explicit are counting number of crosslinks
that are ``positions`` away from landmark of type RNA map type. Cross link is
explicit, if read identifying cross-link is mapped to both parts of RNA-map. In
upper example, R1 is explicit, since it maps to intron *and* exon. Read R3 on
transcript2 is implicit, since one has to decide wheather it's cross-link
belongs to ``CDS-intron`` or ``intron-CDS`` RNA map. The term *all* means
explicit + implicit in this context.

If read is implicit (start and stop within same segment), one can choose two
varinats of the algorithm. One option is that whole score of read is given to
RNA-map of the *closest* landmark. Other option is to *split* score on half to
both neighbouring segments. This behaviour is controlled by parameter
implicit_handling.

There are also two cases when a read can map in a way that is not predicted by
annotation:

    * For split reads, second start can fall on nucleotide that is not start of
      a new segment. This behaviour is excactly the same as in xlsites analysis.
      Such reads are reported in file defined with ``strange`` parameter.
    * Reads that map on two different transcripts or even two different genes
      are repoted in file with name defined in cross_transcript.

-------------------------------------------------------------------------------

Things that still need to be resolved:

    * negative strand inspection:
        * for each ss_group, take annotation from reverse strand and check what
          is the situation on other side.
        * What si again the bio-contxt here - explain in docs...?
    * Test: Each cross-link should have the same cDNA count as in xlsites
      results. The final check should be: sum all scores - the result should be
      the same as for xlsites. Actually, the metrics.cross_transcript should
      also be considered:
      assert: sum(xlsites scores) == sum(RNAmaps "all" + metrics.cross_transc)
    * if read crosses two ore more regions from start to end, only end-type is
      considered for RNAmap type. Should we fix this and which region type to
      take? Since reads are tpyically much shorted than regions, this is
      probably a rare case.

-------------------------------------------------------------------------------

Wishes and ideas(Jernej, Tomaz):

    * The whole idea is to get a feeling what is happening around landmarks

    * separate pre-mRNA and mRNA [this is partially done in _add_entry]
        * Life cyle of RNA: part of DNA in transcribed to RNA in nucleus. Introns
          are cut out already between this process! Really???
        * pre-mRNA contains introns, mature mRNA has no introns and goes outside
          from the nucleus.
        * so: RNAmap exon-intron will for sure contain only pre-mRNA, but RNAmap
          exon-exon can contain pre-mRNA and mRNA... It would be really nice to
          report on percentage of theese two categories.
        * This can be done by introducing three categories:  pre-mRNa, mRNA and
          undecided. This can be determined from results - they contain explicit /
          implicit info and RNAmap type:

    * Intoduce three categories (colors in visualisation?) in every RNAmap type.
      Let's take exon-intron as example:

        * whole read in left part - color #1 (negative coord, implicit)
        * whole read in right part - color #2 (positive coord, implicit)
        * read crossing junction - color #3 (explicit)

    * How to handle situation when one region in individual's sequence clearly
      differs from reference sequence but it is just some variation?

        * Change the reference sequence? This can be complex... Make a helper
          tool for this?
        * Provide additional data in function - which exceptions /abnormalities
          to ignore?

    * Gaussian smooting of RNAmaps?

    * split read - it can differ also on "first-stop", not juts second-start
        * we could have some sort of quality check when observing the variance
          of first-stop-s. THe major question is: "Do all reads for certain
          xlink event indicate the same behaviour?"

positional arguments:
  bam                   BAM file with alligned reads
  segmentation          GTF file with segmentation. Should be a file produced by function
                        `get_segments`
  out_file              Output file with analysis results
  strange               File with strange propertieas obtained when processing bam file
  cross_transcript      File with reads spanning over multiple transcripts or multiple genes

optional arguments:
  -h, --help            show this help message and exit
  --implicit_handling   Can be 'closest' or 'split'. In case of implicit read - split score to
                        both neighbours or give it just to the closest neighbour (default: closest)
  -mis , --mismatches   Reads on same position with random barcode differing less than
                        ``mismatches`` are grouped together (default: 2)
  --mapq_th             Ignore hits with MAPQ < mapq_th (default: 0)
  --holesize_th         Raeads with size of holes less than holesize_th are treted as if they
                        would have no holes (default: 4)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


summary
=======

usage: iCount summary [-h] [--types_length_file] [--digits] [--subtype]
                      [-e  [...]] [-S] [-F] [-P] [-M]
                      annotation sites summary fai

Report proportion of cross-link events/sites on each region type.

positional arguments:
  annotation            Path to annotation GTF file (should include subtype attribute)
  sites                 Path to BED6 file listing cross-linked sites
  summary               Path to output tab-delimited file with summary statistics
  fai                   Path to file with chromosome lengths

optional arguments:
  -h, --help            show this help message and exit
  --types_length_file   Path to file with lengths of each type (default: None)
  --digits              Number of decimal places in results (default: 8)
  --subtype             Name of attribute to be used as subtype (default: None)
  -e  [ ...], --excluded_types  [ ...]
                        Types listed in 3rd column of GTF to be exclude from analysis (default: None)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


examples
========

usage: iCount examples [-h] [-od] [-S] [-F] [-P] [-M]

Provide a set of example bash scripts.

.. autofunction:: iCount.examples.run

optional arguments:
  -h, --help            show this help message and exit
  -od , --out_dir       Directory to which example scripts should be copied (default: .)
  -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
  -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
  -P , --file_logpath   Path to log file.
  -M , --results_file   File into which to store Metrics.


args
====

usage: iCount args [-h]

optional arguments:
  -h, --help  show this help message and exit

Python package

iCount: protein-RNA interaction analysis

iCount is a Python module and associated command-line interface (CLI), which provides all the commands needed to process protein-RNA iCLIP interaction data and to identify and quantify sites of protein-RNA interactions on RNA.

iCount’s main input are FASTQ files with iCLIP sequencing data, its main output are BED files with identified and quantified cross-linked sites.

A number of analyses are included in iCount that provide insights into the properties of protein-RNA interaction.

Two attributes and associated environment variables define where output and temporary files will be stored:

iCount.OUTPUT_ROOT = '.'

Default output folder for iCount functions/commands. It points to the value of eviroment variable ICOUNT_OUTPUT_ROOT if set. Otherwise, current working directory is used.

iCount.TMP_ROOT = '/tmp/iCount'

Default temporary folder for iCount functions/commands. It is used to store temporary files, used in intermediate steps of analyses. It points to the value of eviroment variable ICOUNT_TMP_ROOT if set. Otherwise, /tmp/iCount is used.

Genomes

This module provides access to Ensembl and Gencode genome sequence and annotation. Also, segmentation into genes and segments of same type (exon, intron, UTR, …) is supported.

Ensembl API

Functions to query and download data from the Ensembl FTP site.

iCount.genomes.ensembl.annotation(species, release=88, out_dir=None, annotation=None)[source]

Download ENSEMBL annotation for given release/species.

Parameters:
  • species (str) – Species latin name.
  • release (int) – Release number. Only ENSEMBL releases 59-88 are available.
  • out_dir (str) – Download to this directory (if not given, current working directory).
  • annotation (str) – Annotation filename (must have .gz file extension). If not given, species.release.gtf.gz is used. If annotation is provided as absolute path, value of out_dir parameter is ignored and file is saved to given absolute path.
Returns:

Downloaded annotation filename.

Return type:

str

iCount.genomes.ensembl.chrom_length(fasta_in)[source]

Compute chromosome lengths of fasta file and store them into a file.

More about the .fai file format can be found here: http://www.htslib.org/doc/faidx.html

Parameters:fasta_in (str) – Path to genome FASTA file (can be .gz).
Returns:Absolute path to output file.
Return type:str
iCount.genomes.ensembl.genome(species, release=88, out_dir=None, genome=None, chromosomes=None)[source]

Download ENSEMBL genome for given release/species.

Parameters:
  • species (str) – Species latin name.
  • release (int) – Release number. Only ENSEMBL releases 59-88 are available.
  • out_dir (str) – Download to this directory (if not given, current working directory).
  • genome (str) – Genome filename (must have .gz file extension). If not given, species.release.fa.gz is used. If genome is provided as absolute path, value of out_dir parameter is ignored and file is saved to given absolute path.
  • chromosomes (list_str) – If given, do not download the whole genome, but listed chromosomes only. Chromosomes can be given as strings or integers.
Returns:

Downloaded genome/sequnce filename.

Return type:

str

iCount.genomes.ensembl.releases()[source]

Get list of available ENSEMBL releases.

Only allows ENSEMBL releases 59-88.

Returns:List of available releases
Return type:list
iCount.genomes.ensembl.species(release=88)[source]

Get list of available species for given ENSEMBL release.

Parameters:release (int) – Release number. Only ENSEMBL releases 59-88 are available.
Returns:List of species.
Return type:list

GENCODE API

Functions to query and download data from the Gencode FTP site.

iCount.genomes.gencode.annotation(species, release, out_dir=None, annotation=None)[source]

Download GENCODE annotation for given release/species.

Note: This will download the “primary assembly” type of annotation.

Parameters:
  • species (str) – Species name.
  • release (str) – Release number.
  • out_dir (str) – Download to this directory (if not given, current working directory).
  • annotation (str) – Annotation filename (must have .gz file extension). If not given original filename will be used. If annotation is provided as absolute path, value of out_dir parameter is ignored and file is saved to given absolute path.
Returns:

Downloaded annotation filename.

Return type:

str

iCount.genomes.gencode.genome(species, release, out_dir=None, genome=None)[source]

Download GENCODE genome for given release/species.

Note: This will download the “primary assembly” type of genome.

Parameters:
  • species (str by using samtools faidx) – Species latin name.
  • release (str) – Release number. Only gencode releases {0}-{1} are available.
  • out_dir (str) – Download to this directory (if not given, current working directory).
  • genome (str) – Genome filename (must have .gz file extension). If not given original filename will be used. If genome is provided as absolute path, value of out_dir parameter is ignored and file is saved to given absolute path.
Returns:

Downloaded genome/sequnce filename.

Return type:

str

iCount.genomes.gencode.releases(species='human')[source]

Get list of available releases for given species.

Parameters:species (str) – Species name.
Returns:List of available releases
Return type:list
iCount.genomes.gencode.species()[source]

Get list of available species.

Returns:List of species.
Return type:list

Segment

Parse annotation file into internal iCount structures, which are used in further analyses.

Currently, only annotations from ENSEMBl and GENCODE are supported. http://www.gencodegenes.org/ http://www.ensembl.org

Annotation is composed of three levels (gene level, transcript level and segment level). Example of annotation (for simplicity, only segment level of transcript1 is shown):

Gene level:    |--------------gene1--------------|   |-intergenic-|
                             |---------gene2--------|

Transcript l.: |----------transcript1---------|
                       |-------transcript2-------|
                             |------transcript3-----|

Segment level: |-CDS-||-intron-||-CDS-||-UTR3-|

Three “versions” of genome partitioning are produced: transcript-wise and genome-wise and landmarks files:

  • In transcript-wide partitioning, each transcript is split into intervals. Intervals must span the whole transcript, but should not intersect with each other inside transcript. However, higher hierarchy levels: transcripts and genes can still intersect each other. As a result, intervals from different genes/transcripts can intersect. Intervals in transcript wise partition are called segments and the file is called segmentation.
  • In genome-wide partition, whole genome is split into intervals. Such intervals must span the whole genome, and should also not intersect with each other (neither the ones from different genes/transcripts). Intervals in genome-wise partition are called regions and the file is also called regions.
  • The point where one region ends and other starts (for example end of exon and start of intron) is of special importance. This position, where certain type or upstream region (e.g. exon) and certain type of downstream region (e.g. intron) meet is called a landmark (of type exon-itron).

It is best to present all above partitions and their relation to annotation visualy. Example of annotation:

----------------------------------------------------------------->
                |-----------gene1(G1)-----------|
                |--------transcript1(A)---------|
                |-exon--|         |----exon-----|
                        |------------------gene2(G2)--------------------|
                        |-----------------transcript2(B)----------------|
                        |-exon--|        |----exon----|          |-exon-|

Example of transcript-wise partition. Intron and intergenic intervals are made. Also, exons are converted in CDS/UTR3/UTR5 or ncRNA:

----------------------------------------------------------------->
  |-intergenic-|
                |-----------gene1(G1)-----------|
                |--------transcript1(A)---------|
                |-UTR5--||-intron||-----CDS-----|
                        |------------------gene2(G2)--------------------|
                        |-----------------transcript2(B)----------------|
                        |-UTR5--||intron||-----CDS----||-intron-||-UTR3-|
                                                                         |-intergenic-|

Example of genome-wise partition. Now the annotation is “flat”: each nuclotide has one and only one region. How does one decide which region to keep if there are more overlaping segments? The following hierarchy is taken into account: CDS > UTR3 > UTR5 > ncRNA > intron > intergenic:

----------------------------------------------------------------->
  |-intergenic-||--UTR5-||--UTR5-||-----CDS-----||-CDS-||-intron-||-UTR3-||-intergenic-|

Finally, this is how landamrks look like. Note that for example landmark “c” is of type “exon-intron” and landmark “d” is of type “exon-intron”:

----------------------------------------------------------------->
                a                 b                     c         d       e
iCount.genomes.segment.get_segments(annotation, segmentation, fai, report_progress=False)[source]

Create GTF file with transcript level segmentation.

Each line in this file should define one of the following elements:

  • gene
  • transcript
  • CDS
  • UTR3
  • UTR5
  • intron
  • ncRNA
  • intergenic

Name of third field (interval.fields[2]) should correspond to one of theese names. Only consider GTF entries of chromosomes given in fai file.

Parameters:
  • annotation (str) – Path to input GTF file.
  • segmentation (str) – Path to output GTF file.
  • fai (str) – Path to input genome_file (.fai or similar).
  • report_progress (bool) – Show progress.
Returns:

Absolute path to output GTF file.

Return type:

str

Demultiplexing

Split FASTQ file into separate files, one for each sample barcode.

Saved FASTQ files contain sequences where sample barcode, random barcode, and adapter sequences are removed. Random barcode is moved into the header line, since it is needed in later steps (removing PCR duplicates and counting number of cross-link events).

iCount.demultiplex.run(reads, adapter, barcodes5, barcodes3=None, mismatches=1, minimum_length=15, min_adapter_overlap=7, prefix='demux', out_dir='.')[source]

Demultiplex FASTQ file.

Split input FASTQ file into separate files, one for each barcode, and additional file for non-matching barcodes. Write random barcode of a read into it’s FASTQ header row.

Parameters:
  • reads (str) – Sequencing reads.
  • adapter (str) – Adapter sequence to remove from 3-prime end of reads.
  • barcodes5 (list_str) – List of 5-prime end barcodes.
  • barcodes3 (list_str) – List of 3-prime end barcodes.
  • mismatches (int) – Number of tolerated mismatches when comparing barcodes.
  • minimum_length (int) – Minimum length of trimmed sequence to keep.
  • min_adapter_overlap (int) – Minimum length of adapter on 3’ end if demultiplexing also on 3’ barcodes.
  • prefix (str) – Prefix of generated FASTQ files.
  • out_dir (str) – Output folder. Use current folder if none is given.
Returns:

Metrics object, storing analysis metadata.

Return type:

iCount.Metrics

Mapping

Mapping index

Generate STAR genome index.

Mapping reads with STAR

Map reads to genome with STAR.

Filter hits

iCount.mapping.filters.group_by_end(hits)[source]

Group hits by their end position.

Records must grouped by contig and sorted by increasing position on contig.

iCount.mapping.filters.group_by_start(hits)[source]

Group hits by start.

Records must grouped by contig and sorted by increasing position on contig.

iCount.mapping.filters.remove_duplicates(hits)[source]

Remove duplicate hits that map to same position and have same randomer.

Records must grouped by contig and sorted by increasing position on contig.

iCount.mapping.filters.remove_wrong_assignments(hits_list)[source]

Remove wrong_assignments.

Remove low-frequency hits mapped to same position as frequent hits from other experiments.

Records must grouped by contig and sorted by increasing position on contig.

Identify and quantify cross-linked sites

Quantity cross-link events and determine their positions.

The simplest version of this script would oprate on such example:

|--a---b--- reference sequence, chr 14, positive strand ------------
    |rbc1---R1--------|
    |rbc1---R2------|
    |rbc2---R3------|
        |rbc1--------R4-------------|
        |rbc3------R5-----|

Five reads (R1-R5) are mapped to a reference sequence (chromosome 14, positive strand). Reads start on two distinct positons. On first position, there is R1-R3. Cross-link site is located one nucleotide before start of the read (on negative strand, one nucleotide after end of read). However, we wish to count number of cDNA molecules, not the number of reads. This can be done by counting the number of distinct random barcodes (sometimes also called randomers). So in upper example, we have:

Postion a: 3 reads, 2 distinct random barcodes = 2 cDNA’s Postion b: 2 reads, 2 distinct random barcodes = 2 cDNA’s

However, things can get complicated when a single read is mapped in multiple parts. This can happen for several reasons. One common example is that introns are removed during transcription. This can be illustrated with the following image:

|---------------- reference -----------------------

|--------------------------transcript--------------------------|
|---UTR5---||---intron---||---exon---||---intron---||---exon---|

                            |-------R1-------|
                            |--R2.1-->              <-R2.2-|
                  |-R3.1->      <-R3.2->            <-R3.3-|
                    |-R4.1->      <-R4.2-|

Read R1 and R2 are starting on same position. For the sake of argument, let’s also pretend they also have same random barcode. In so, we would count them as a single cDNA molecule (= single cross-link event), even though it is obvious that they represent two separate cross-link events. In order to fix this, we count not just the number of different randomers on same position, but also number of different “second-start” coordinates. Second-start coordinate is just the coordinate of the second part of the read. This way, the actual number of cross-link events can be determined more accurately. If read is not split, it’s second-start coordinate is 0. If read has multiple “holes” (as read R3) we determine second-start from the largest hole.

Reads whose second-start do NOT fall on segmentation (like R4) are stored in a separate BAM file sites_strange. They should be treated with special care, since they can indicate not-yet annotated features in genome. If segmentation is not given, all reads with holes, bigger than holesize_th are considered strange.

Another parameter needs more explanation: group_by. When algorithm starts, reads from BAM file are grouped in hierarchical structure by:

* chromosome and strand
* cross-link position
* random barcode
* second-start

Each second-start group receives 1 cDNA score. This score is divided to each read in group (if there are 5 reads in group, each one gets 1/5 score). This enables that each read has it’s cDNA score and of course, 1 “read score”. This scores can be assigned to start (actually, to cross-link position), midlle or end position of read. By default, score is of course assigned to cross-link location. But for diagnostic purpuses, scores can also be assigned to middle or end coordinate of the read.

iCount.mapping.xlsites.run(bam, sites_single, sites_multi, skipped, group_by='start', quant='cDNA', segmentation=None, mismatches=1, mapq_th=0, multimax=50, gap_th=4, ratio_th=0.1, max_barcodes=10000, report_progress=False)[source]

Identify and quantify cross-linked sites.

Interpret mapped sites and generate BED file with coordinates and number of cross-linked events.

MAPQ is calculated mapq=int(-10*log10(1-1/Nmap)). By default we set the mapq_th to 0 to include all reads. Mapq score is very useful, because values coming from STAR are from a very limited set: 0 (5 or more multiple hits), 1 (4 or 3 multiple hits), 3 (2 multiple hits), 255 (single hit)

Parameters:
  • bam (str) – Input BAM file with mapped reads.
  • sites_single (str) – Output BED6 file to store data from single mapped reads.
  • sites_multi (str) – Output BED6 file to store data from single and multi-mapped reads.
  • skipped (str) – Output BAM file to store reads that do not map as expected by segmentation and reference genome sequence. If read’s second start does not fall on any of segmentation borders, it is considered problematic. If segmentation is not provided, every read in two parts with gap longer than gap_th is not used (skipped). All such reads are reported to the user for further exploration.
  • group_by (str) – Assign score of a read to either ‘start’, ‘middle’ or ‘end’ nucleotide.
  • quant (str) – Report number of ‘cDNA’ or number of ‘reads’.
  • mismatches (int) – Reads on same position with random barcode differing less than mismatches are merged together, if their ratio is below ratio_th.
  • segmentation (str) – File with custon segmentation format (obtained by iCount segment).
  • mapq_th (int) – Ignore hits with MAPQ < mapq_th.
  • multimax (int) – Ignore reads, mapped to more than multimax places.
  • report_progress (bool) – Switch to report progress.
  • gap_th (int) – Reads with gaps less than gap_th are treated as if they have no gap.
  • ratio_th (float) – Ratio between the number of reads supporting a randomer versus the number of reads supporting the most frequent randomer. All randomers above this threshold are accepted as unique. Remaining are merged with the rest, allowing for the specified number of mismatches.
  • max_barcodes (int) – Skip merging similar barcodes if number of distinct barcodes at position is higher that this.
Returns:

Metrics object, storing analysis metadata.

Return type:

iCount.Metrics

Analysis

iCount implements a number of analyses.

Peak finding

Find positions with high density of cross-linked sites.

There are two typical variants of this analysis, depending on the parameters:

  • Gene-wise analysis, where:
    • features = gene
    • group_by = gene_id
  • Transcript-wise analysis where:
    • features = CDS, intron, UTR3, UTR5, ncRNA, intergenic
    • group_by = transcript_id

Let’s look at the Gene-wise analysis in more detail first. Imagine the following situation:

|-----------gene1----------|
        |-----------------------------gene2------------------------------|
       ab c                d                 e

a = 60
b = 100
c = 70
d = 40
e = 100
gene1: gene_id = 001
gene2: gene_id = 002

There are two genes (partially intersecting) and five positions with cross-links (noted with a, b, c, d and e). Crosslink position “a” has 60 cross-link events, “b” has 100 cross-link events and so on. Also, gene1 has gene_id 001, etc.

The algorithm first finds all intersections between annotation and cross-links. In this case cross-link position “a” intersects only with gene1, while position “b” intersects also with gene2… Annotation can include various other types of segments (transcripts, intergenic, ncRNA, etc.), but only segments of type gene are considered for intersection. This behaviour is controlled by parameter features.

Next step is to make groups of cross-links. They are grouped by group_by parameter (in this case, it equals to gene_id). There will be 2 groups. First group name will be 001 and will contain a, b, c and d. Second group name will be 002 and will contain b, c, d and e.

The question now is: has any of positions in each group significantly increased number of cross-link events? And how can one quantify this significance?

This is done by parmutation analysis. It draws a number of random situations with same group size and number of cross-link scores. Number of such draws is determined by perm parameter. This way, a random distribution is calculated. When comparing the observed distribution with the random one, FDR values are assigned to each position. A cutoff FRD value is chosen and only positions with FDR < FDR cutoff are considered as significant.

One must also know that when considering only scores on single positions significant clusters of cross-links can be missed. In the upper example, it is obvious, that something more significantly is happening on position b than on position e, despite having the same score. To account for this, algorithm considers not only the score of single cross-link, but also scores of cross-links some nucleotides before and after. This behaviour in controlled by half-window (half_window) parameter. In the upper example, score of position b eqals to 160 if half_window = 1 and 2530 if half_window=2. Score of position e remains 100.

Let’s also look at the transcript-wise analysis. In this case, scenario also includes transcripts and sub-transcript elements:

|-----------gene1----------|
|--------transcript1-------|
|-ncRNA-||-intron-||-ncRNA-|
        |-----------------------------gene2------------------------------|
        |---------------transcript2--------------|
        |-CDS-||-intron-||-CDS-||-intron-||-UTR3-|
                                |---------------transcript3--------------|
                                |-UTR5-||-intron-||-CDS-||-intron-||-CDS-|

       ab c                d                 e

a = 60
b = 100
c = 70
d = 40
e = 100
gene1: gene_id = 001
gene2: gene_id = 002
transcript1: transcript_id = 0001
transcript2: transcript_id = 0002
transcript3: transcript_id = 0003

Value of parameter features is: CDS, intron, UTR3, UTR5, ncRNA, intergenic. Value of parameter group_by is transcript_id. Since we have multiple values in feature parameter, another parameter becomes important: merge_features. If set to false (default) algorithm will make the following groups:

  • group name: ncRNA-0001, members: a, b, d
  • group name: intron-0001, members: c
  • group name: CDS-0002, members: b, c, d
  • group name: UTR3-0002, members: e
  • group name: intron-0003, members: e

However, if merge_features equals to true, groups are:

  • group name: 0001, members: a, b, c, d
  • group name: 0002, members: b, c, d, e
  • group name: 0003, members: e

Then, for each group, procedure is exactly the same as in Gene-wise case.

When analysis is done, significant positions are reported in file, given by peaks parameter. If scores parameter is also given, all positions are reported in it, no matter the FDR value.

iCount.analysis.peaks.cumulative_prob(vals, max_val)[source]

Compute cumulative probability.

Given a list of SWW scores in region vals, return list freqs_cum where probability that randomly picked value in vals is equal or greater than i equals freqs_cum[i].

Max_val is the largest possible value that can be expected in vals.

iCount.analysis.peaks.get_avg_rnd_distrib(size, total_hits, half_window, perms=10000)[source]

Return background distribution for given region size and number of hits.

We follow the modified FDR for peak height, proposed by [1]

[1] Yeo, G.W. et al. An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat. Struct. Mol. Biol. 16, 130–137 (2009). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735254/

Results are cached, so they can be reused.

Parameters:
  • size (int) – Size of region.
  • total_hits (int) – Number of cross-link events in region.
  • half_window (int) – Half-window size. The actual window size is: 2 * half_window + 1.
  • perms (int) – Number of permutations to make.
Returns:

Probability to find CWW score i or more on chosen position is equal to i-th element or returned array.

Return type:

numpy.ndarray

iCount.analysis.peaks.run(annotation, sites, peaks, scores=None, features=None, group_by='gene_id', merge_features=False, half_window=3, fdr=0.05, perms=100, rnd_seed=42, report_progress=False)[source]

Find positions with high density of cross-linked sites.

When determining feature.name, value of the first existing attribute in the following tuple is taken:

("ID", "gene_name", "transcript_id", "gene_id", "Parent")

Source in pybedtools: https://github.com/daler/pybedtools/blob/master/pybedtools/scripts/annotate.py#L34

Parameters:
  • annotation (str) – Annotation file in GTF format, obtained from “iCount segment” command.
  • sites (str) – File with cross-links in BED6 format.
  • peaks (str) – File name for “peaks” output. File reports positions with significant number of cross-link events. It should have .bed or .bed.gz extension.
  • scores (str) – File name for “scores” output. File reports all cross-link events, independent from their FDR score It should have .tsv, .csv, .txt or .gz extension.
  • features (list_str) – Features from annotation to consider. If None, [‘gene’] is used. Sometimes, it is advised to use [‘gene’, ‘intergenic’].
  • group_by (str) – Attribute by which cross-link positions are grouped.
  • merge_features (bool) – Treat all features as one when grouping. Has no effect when only one feature is given in features parameter.
  • half_window (int) – Half-window size.
  • fdr (float) – FDR threshold.
  • perms (int) – Number of permutations when calculating random distribution.
  • rnd_seed (int) – Seed for random generator.
  • report_progress (bool) – Report analysis progress.
Returns:

Analysis metadata.

Return type:

iCount.metrics

Cluster sites

Merge adjacent peaks into clusters and sum cross-links within clusters.

iCount.analysis.clusters.run(sites, peaks, clusters, dist=20, slop=3)[source]

Join neighboring peaks (at distance dist) into clusters.

Report sum of sites’ scores within each cluster, including slop.

Parameters:
  • sites (str) – Path to input BED6 file with sites.
  • peaks (str) – Path to input BED6 file with peaks (or clusters).
  • clusters (str) – Path to output BED6 file with merged peaks (clusters).
  • dist (int) – Distance between two peaks to merge into same cluster.
  • slop (int) – Distance between site and cluster to assign site to cluster.
Returns:

BED file with clusters as elements.

Return type:

str

k-mer enrichment

Read bedGraph with cross-linked sites. Count k-mer frequencies. Perform permutation analysis to determine significance of observed k-mer frequencies. Return ranked list of k-mer enrichment.

RNA maps

Perform RNA-maps analysis.

iCount.analysis.rnamaps.compute_distances(landmarks, sites, maptype)[source]

Compute distances between each xlink and it’s closest landmark.

iCount.analysis.rnamaps.get_single_type_landmarks(landmarks, maptype)[source]

Get file with landmarks of only certain type.

iCount.analysis.rnamaps.make_results_raw_file(distances, fname, total_cdna, maptype)[source]

Write distances data to file.

iCount.analysis.rnamaps.make_results_summarised_file(outdir, fname)[source]

Write “plot data” to file.

iCount.analysis.rnamaps.run(sites, landmarks, outdir=None, plot_type='combined', top_n=100, smoothing=1, nbins=50, binsize=None, colormap='Greys', imgfmt='png')[source]

Compute distribution of cross-links relative to genomic landmarks.

Parameters:
  • sites (str) – Croslinks file (BED6 format). Should be sorted by coordinate.
  • landmarks (str) – Landmark file (landmarks.bed.gz) that is produced by iCount segment.
  • outdir (str) – Output directory.
  • plot_type (str) – What kind of plot to make. Choices are distribution, heatmaps and combined.
  • top_n (int) – Plot heatmap for top_n best covered landmarks.
  • smoothing (int) – Smoothing half-window. Average smoothing is used.
  • nbins (int) – Number of bins. Either nbins or binsize can be defined, but not both.
  • binsize (int) – Bin size. Either nbins or binsize can be defined, but not both.
  • colormap (str) – Colormap to use. Any matplotlib colormap can be used.
  • imgfmt (str) – Output image format.
Returns:

Return type:

None

Group BED files

Merge multiple BED files with crosslinks into one.

First, concatenate files into one file. Then, merge crosslinks from different files that are on same position and sum their scores.

Externals

A number of external software is needed for iCount to work.

Cutadapt

Remove adapter sequences from reads in FASTQ file.

iCount.externals.cutadapt.get_version()[source]

Get cutadapt version.

iCount.externals.cutadapt.run(reads, adapter, reads_trimmed=None, overwrite=False, qual_trim=None, minimum_length=None, overlap=None, untrimmed_output=None, error_rate=None)[source]

Remove adapter sequences from high-throughput sequencing reads.

Parameters:
  • reads (str) – Input FASTQ file.
  • adapter (str) – Sequence of an adapter ligated to the 3’ end.
  • reads_trimmed (str) – Output FASTQ file containing trimmed reads. If not provided
  • overwrite (bool) – If true, overwrite input file (reads) with trimmed file.
  • qual_trim (int) – Trim low-quality bases before adapter removal.
  • minimum_length (int) – Discard trimmed reads that are shorter than minimum_length.
  • overlap (int) – Require overlap overlap between read and adapter for an adapter to be found.
  • untrimmed_output (str) – Write reads that do not contain any adapter to this file.
  • error_rate (float) – Maximum allowed error rate (no. of errors divided by the length of the matching region).
Returns:

Return code of the cutadapt program.

Return type:

int

STAR aligner

Interface to running STAR.

iCount.externals.star.build_index(genome, genome_index, annotation='', overhang=100, overhang_min=8, threads=1, genome_sasparsed=1, genome_saindexnbases=14)[source]

Call STAR to generate genome index, which is used for mapping.

Parameters:
  • genome (str) – Genome sequence to index.
  • genome_index (str) – Output folder, where to store genome index.
  • annotation (str) – Annotation that defines splice junctions.
  • overhang (int) – Sequence length around annotated junctions to be used by STAR when constructing splice junction database.
  • overhang_min (int) – Minimum overhang for unannotated junctions.
  • threads (int) – Number of threads that STAR can use for generating index.
  • genome_sasparsed (int) – STAR parameter genomeSAsparseD. Suffix array sparsity. Bigger numbers decrease RAM requirements at the cost of mapping speed reduction. Suggested values are 1 (30 GB RAM) or 2 (16 GB RAM).
  • genome_saindexnbases (int) – STAR parameter genomeSAindexNbases. SA pre-indexing string length, typically between 10 and 15. Longer strings require more memory, but result in faster searches.
Returns:

Star return code.

Return type:

int

iCount.externals.star.get_version()[source]

Get STAR version.

iCount.externals.star.map_reads(reads, genome_index, out_dir, annotation='', multimax=10, mismatches=2, threads=1, genome_load=False)[source]

Map FASTQ file reads to reference genome.

Parameters:
  • reads (str) – Sequencing reads to map to genome.
  • genome_index (str) – Folder with genome index.
  • out_dir (str) – Output folder, where to store mapping results.
  • annotation (str) – GTF annotation needed for mapping to splice-junctions.
  • multimax (int) – Number of allowed multiple hits.
  • mismatches (int) – Number of allowed mismatches.
  • threads (int) – Number of threads that STAR can use for generating index.
  • genome_load (bool) – Load genome into shared memory. Shared memory must be available in the system. See Chapter 3.3 in STAR manual.
Returns:

Return code

Return type:

int

Files

iCount works with various formats that store FASTA and FASTQ sequencing data, GTF genome annotation, BAM data on mapped reads, BED files with quantified cross-linked sites. Parsing of GTF files is done with pybedtools.

iCount.files.gz_open(fname, mode)[source]

Use gzip library to open compressed files ending with .gz.

Parameters:
  • fname (str) – Path to file to open.
  • omode (str) – String indicating how the file is to be opened.
Returns:

File Object.

Return type:

file

iCount.files.decompress_to_tempfile(fname, context='misc')[source]

Decompress files ending with .gz to a temporary file and return filename.

If file does nto end with .gz, juts return fname.

Parameters:
  • fname (str) – Path to file to open.
  • context (str) – Name of temporary subfolder where temporary file is created.
Returns:

Path to decompressed file.

Return type:

str

BED

Reading and writing BED files.

iCount.files.bed.convert_legacy(bedgraph_legacy, bed_converted)[source]

Convert legacy iCount’s four-column format into proper BED6 format.

Old iCount legacy format: chrome, start, end, [+-]value Strand can be either ‘+’ or ‘-‘, and value indicates the intensity of interaction.

The returned BED file follows the BED6 format, as explained in the [bedtools manual](http://bedtools.readthedocs.io/en/latest/content /general-usage.html).

iCount.files.bed.merge_bed(sites_grouped, sites)[source]

Merge multiple files with crosslinks into one.

Concatenate files into one file. Also, merge crosslinks from different files that are on same position and sum their scores.

Parameters:
  • sites_grouped (str) – Path to output BED6 file containing merged data from input sites files.
  • sites (list_str) – List of BED6 files(paths) to be merged.
Returns:

Absolute path to outfile.

Return type:

str

Bedgraph conversion

Convert from BED6 to bedGraph format.

iCount.files.bedgraph.bed2bedgraph(bed, bedgraph, name='User Track', description='User Supplied Track', visibility=None, priority=None, color=None, alt_color=None, max_height_pixels=None)[source]

Convert from BED6 to bedGraph format.

For further explanation of parameters see: https://genome.ucsc.edu/goldenPath/help/customTrack.html#TRACK https://genome.ucsc.edu/goldenpath/help/trackDb/trackDbHub.html

Parameters:
  • bed (str) – Input BED6 file.
  • bedgraph (str) – Output bedGraph file.
  • name (str) – Track label. Should be shorter than 15 characters.
  • description (str) – Track description. Should be shorter than 60 characters.
  • visibility (str) – Define the initial display mode of the annotation track. Choose among “hide”, “dense”, “full”, “pack” and “squish”. Default is “dense”.
  • priority (int) – Defines the track’s order relative to other tracks in same group.
  • color (str) – Define the main color for the annotation track. The track color consists of three comma-separated RGB values from 0-255, e.g RRR,GGG,BBB. The default value is 0,0,0 (black).
  • alt_color (str) – Allow a color range that varies from color to alt_color.
  • max_height_pixels (str) – The limits of vertical viewing space for track, though it is configurable by the user. Should be of the format <max:default:min>.
Returns:

Return type:

None

FASTQ

Reading and writting FASTQ files.

class iCount.files.fastq.FastqEntry(id, seq, plus, qual)[source]

Single FASTQ entry.

class iCount.files.fastq.FastqFile(fname, mode='rt')[source]

Write FASTQ files.

close()[source]

Close file if it is stil open.

read()[source]

Read FASTQ file.

write(fq_entry)[source]

Write single FASTQ entry.

iCount.files.fastq.get_qual_encoding(fname)[source]

Read first few records and determine quality encoding in FASTQ file.

See format description: http://en.wikipedia.org/wiki/FASTQ_format

S - Sanger Phred+33, raw reads typically (0, 40) [33..73] X - Solexa Solexa+64, raw reads typically (-5, 40) [59..104] I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) [64..104] J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) [66..104] L - Illumina 1.8+ Phred+33, raw reads typically (0, 41) [33..74]

FASTA

Reading FASTA files.

iCount.files.fasta.read_fasta(fasta_file)[source]

Read fasta file and return list.

The retuned list (named data) has the following structure:

fasta = [
    [header1, sequence1],
    [header2, sequence2],
    [header2, sequence2],
    ...
]

Metrics

iCount processing statistics can be stored into instances of iCount.metrics.Metrics.

class iCount.metrics.Metrics(context=None, **kwargs)[source]

Storge for statistics calculated during function execution.

__init__(context=None, **kwargs)[source]

When creating the placeholder, a process-specific context should be given.

Parameters:context (str) – Context is used to indicate the process that generated the processing statistics.
Returns:Instance where processing statistics can be added or modified.
Return type:Metrics

Examples

Provide a set of example bash scripts.

iCount.examples.run(out_dir='.')[source]

Create an examples subfolder with example scripts.

This will create an examples subfolder in current working directory and will copy bash scripts needed to run the iCount pipeline on a few examples (for now, only the hnRNP C data from Konig et al.).

Parameters:out_dir (str) – Directory to which example scripts should be copied.
Returns:Path of folder to where examples scripts were copied.
Return type:str

Automated CLI creation

This module offers CLI to iCount Python package. To separate Python package from CLI and to avoid code and docstring duplication, CLI commands are semi-automatically created from functions that already exist in the package. For this automation to work, some rules need to respected when writing functions that are later exposed to CLI.

iCount.cli.make_parser_from_function(function, subparsers, module=None, only_func=False)[source]

Make a argparse parser from function and add it to subparsers.

A Python function is exposed in CLI by calling make_parser_from_function. Example call for exposing function iCount.analysis.peaks.run that makes peaks analysis:

make_parser_from_function(iCount.analysis.peaks.run, subparsers)
What happens is such call?
  • CLI command iCount peaks is created, with description, positional and optional arguments exactly the same as the ones defined in function.

  • name of the command (peaks) equals to the name of the module where function is defined.

  • CLI help message is sourced from module’s docstring

  • Positional and optional arguments (and default values) are sourced from function signature

  • Help text for each of the arguments is sourced from function docstring. For this to work functions needs to follow Nummpy docstring formatting. All parameters should have meaningful description. All parameters should also have type equal to one of the keys in VALID_TYPES.

  • When function is executed, stdout logger with level INFO is registerd. This prints descriptive messages to CL. Function should therefore log its inputs (use iCount.logger.log_inputs function), outputs and most important steps. If there are no errors/execptions command exits with 0 exit status code.

  • If error/exception occurs, is it caught and stack is printed to logger with ERROR level. Also, failed CLI command has status code 1.

  • CLI exposed function should return Metric object that is instance of iCount.metrics.Metric class. Function should set descriptive attributes to the Metric object for analysis results and analysis tatistics for inter-experimetal comparison.

  • Each command also gets these additional arguments:
    • --stdout_log - Threshold value (0-50) for logging to stdout. If 0 logging to stdout if turned OFF.
    • --file_log - Threshold value (0-50) for logging to file. If 0 logging to stdout if turned OFF.
    • --file_logpath - Path to log file.
    • --results_file - File into which to store Metrics (result object).
    • --help - Help message fot a command

    They control the level and location of log inputs as well as string the result objects.

Exceptional cases:

  • In some cases function that performs the work is only imported in the correct module (“exposed module”), but the actual function definition is located somewhere else (“source module”). It that case, one can use module parameter with “source module” value. In this case, the command name and CLI docstring will be defined from “exposed module” but function, default values and parameter descriptions will be sourced from “source module”.
  • In some cases, there will be more CLI exposed functions in the same module. In such case, use set only_func parameter to True. This will use function name for CLI command name and use function docstring (form beginning until “Parameters” section) for CLI help message.

Contributing

Thanks for taking the time to contribute to iCount!

Please submit contributions in accordance with the flow explained in the GitHub Guides.

Installation for development

Fork the main iCount git repository.

If you don’t have Git installed on your system, follow these instructions.

Clone your fork (replace <username> with your GitHub account name) and change directory:

git clone https://github.com/<username>/iCount.git
cd iCount

Prepare iCount for development:

pip install -e .[test]

Note

We strongly recommend to install iCount in virtualenv to create an isolated Python environment.

Installing Docker

When working with docker, make sure that the docker-machine has enough memory to run STAR and associated programs, e.g., at least 32 GB RAM and 45 GB disk:

docker-machine create -d virtualbox --virtualbox-memory 32768 --virtualbox-disk-size "46080" default

Note

Other options for VirtualBox are described here.

Building the documentation

Issue these commands to build the documentation:

pip install -e .[docs]
cd docs
make html
open build/html/index.html

Note

Calling makefile ensures that the reference manual for the CLI is updated. It will envoke this command when needed:

iCount man > source/ref_CLI.txt

Python module and CLI living together

iCount package offers a complete set of Python function to analyse iCLIP data, as well as their command line counterparts. Effectively, they (should) offer excactly the same functionality. However, the tools are in continious development and pipelines will change with time… To avoid differences in functionality as well as the duplication of code, tests and documentation, the command line interface (CLI) is generated automatically from Python functions. For this to work, functions (as well as their docstrings and modules where they are defined) should be written by the conventions described in the following sections.

More about automatic CLI generation (how it works, how to expose function in CLI, etc.) can be read in Automated CLI creation.

Conventions

Syntax

  • Line length is 119 characters plus newline character.
  • Use ' instead of " character, if possible.

Parameters naming

If the variable is one of the commonly used names, such as cross-link file name or mismatches, the name should be one of those. For a list of these names, check the TODO: link.

Errors

Errors should be raised in a descriptive manner - for example if iCount calls an external tool that return a non-zero exit status, an error should be raised.

Docstrings

We follow Numpy format specification for docstrings.

All docstrings of unittest test functions should have three-braces (“”“) in separate lines (or else, docstrings are printed to stdout during test execution). This explains why:

Correct way:

def test_something(self):
    """
    The docstring.
    """

Wrong way:

def test_something(self):
    """The docstring."""

Commit and PR messages

If the core change that commit is introducing originates from module the_module, commit message should be:

the_module: Commit message

If commit is somehow connected with an issue, commit message should reference issue number:

the_module: Commit message, relates to #42, resolves #14

References to PR greatly help when preparing a description of changes in the change log docs/source/revisions.rst.

Logging logic

Logging is used to report the progress of program execution to user. iCount can be used as Python module or as it’s corresponding CLI.

When using iCount as Python module all logging is turned OFF by defult, so no messages will be printed about analysis execution. However, if user wishes, logging can be easily confiugured with functions defined in iCount.logger.py.

If using the CLI, logging to stdout with INFO level is set by defult. This can be configured for each command by using the appropriate command line arguents. Read more about it in Automated CLI creation.

Tests

Tests for iCount python package and corresponding CLI.

There are two types of tests: unit and functional (regression) tests. Continuous development testing supported on GitHub is also explained in the corresponding section.

Unit tests

Unit tests are located in top testing directory (iCount/tests/). They follow the standardy philosophy of unit tests (isolation), although this is sometimes violated when testing functions connect to external resources. This is only the case, when the main task of the function under test is to retrieve a resource from the web. Still, all tests should pass in no more than a couple of minutes.

Test can be run by standard unittest call:

# Run all tests:
python -m unittest iCount
# Run only a specific test:
python -m unittest iCount.test_file_name.Test_class.test_name

Alternatively, all test files can be called also like python script:

python test_file.py

Functional tests

Are located in subdirectory iCount/tests/functional. They should be executed manually, only at times one wishes to check that current implementation is compatible with earlier versions of iCount. The results are typically stored for future reference. These test may take significant amount of time to complete and are not meant to be run on daily basis.

Continuous development testing

iCount is project in continious development and therefore has central repository on GitHub. There, automatic testing is performed by:

  • Travis - unit, code and documentation stype testing
  • Coverage - enforcing sufficient covergae
  • Codacy - enforcing code quality
  • Scrutinizer - enforcing documentation quality

Tests on Travis are executed by tox. To avoid making multiple pull requests and waiting for Travis to do the checks, tox can run on local machine:

# Run all tox tests
tox
# Run only a single environment (for example unittests in Python 3.7)
tox -e py37

Tox enables to test code against different versions of Python and also perform code-style checks. This is achieved by different environments. Currently, we test three of them:

  • py37 (Python 3.7)
  • py36 (Python 3.6)
  • py35 (Python 3.5)
  • linters (wraps pylint, PEP8/pycodestyle and PEP257/pydocstyle)

For more info check tox.ini and .travis.yml in packyge root directory.

Naming proposition - TODO

The iCount pipeline looks like so: TODO: image

The elements appearing in it are files and analysis:

For each file (type):
  • format specification = FASTA, FASTQ, GTF, BED6, special, custom…
  • one-letter-CLI-abreviation: ‘-r’
  • full CLI name: ‘–reads’
  • naming convention: example for genome file: species.release[.chr1_chr2_chrMT].fa[.gz]
  • is output of which analysis: ?? (should be clear form image?)
  • can be input for which analysis: ?? (should be clear form image?)
The analysis in it are:
  • analysis name (single word?)
  • execution time (O(n^2), O(reads#, genome_size)) some typical estimation
  • inputs, outputs (should be clear form image?)

Preparing a release

First check that twine is installed:

pip install twine

Requirement already satisfied: twine in ...

Pull the latest master to be released:

git checkout master
git pull

Use the utility script docs/changelog.sh to list all commits that were made from last tagged release:

docs/changelog.sh > to_edit.rst

Edit to_edit.rst and incorporate a description of most important changes into docs/source/revisions.rst. Use syntax from releases package.

Remove .dev from project’s version in iCount/__about__.py file.

Clean build directory:

python setup.py clean -a

Remove previous distributions in dist directory:

rm dist/*

Remove previous egg-info directory:

rm -r *.egg-info

Commit changes:

git add docs/source/revisions.rst iCount/__about__.py
git commit -m "Release <version>"

Test the new version with Tox:

tox -r

Create source distribution:

python setup.py sdist

Build wheel:

python setup.py bdist_wheel

Upload distribution to PyPI:

twine upload dist/*

Note

It is advisable to test upload onto the test server https://test.pypi.org/legacy/ first:

twine upload --repository-url https://test.pypi.org/legacy/ dist/*

Afterwards, test the installation in a clean python environment:

docker build -t icounttestinstall -f Dockerfile_test .
docker run -t -i icounttestinstall

pip install -i https://test.pypi.org/pypi \
--extra-index-url https://pypi.python.org/pypi iCount

Tag the new version:

git tag <version>

Push changes to main repository:

git push origin master <version>

Decide how to bump version (to some new value <new-version>) and modify iCount/__about__.py. Don’t forget to add suffix .dev:

__version__=<new-version>.dev

Commit changes:

git add iCount/__about__.py
git commit -m "Bump version to <new-version>"

Frequently asked questions

When running peaks or other commands I get an “[OSError] 28” message.

The error number indicates that you are running out of storage (“No space left on device”). Make sure that you have enough free storage and that iCount.TMP_ROOT (or bash variable ICOUNT_TMP_ROOT) points to a folder with enough free storage.

Version history

Ver 2.0.0

Initial public version of iCount.

License

iCount, protein-RNA interaction analysis

Copyright (C) 2019 University of Ljubljana, Faculty of Computer and
Information Science, Slovenia

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

To receive a copy of the GNU Affero General Public License,
see <https://www.gnu.org/licenses/>.

How to cite

iCount is developed and supported by Tomaž Curk from the Bioinformatics Laboratory at the University of Ljubljana, Faculty of Computer and Information Science and in collaboration with Jernej Ule and members of the Ule Laboratory. In mid-2016, Jure Zmrzlikar was outsourced to refactor some of the code.

Many researchers and programmers have been involved in the development of iCount, from its early prototype in late 2008 to its current form:

  • Tomaž Curk
  • Gregor Rot
  • Črtomir Gorup
  • Igor Ruiz de los Mozos
  • Julian Konig
  • Jure Zmrzlikar
  • Yoichiro Sugimoto
  • Nejc Haberman
  • Goran Bobojević
  • Christian Hauer
  • Matthias Hentze
  • Blaž Zupan
  • Jernej Ule

If you use iCount in your research, please cite:

Curk et al. (2019) iCount: protein-RNA interaction iCLIP data analysis (in preparation).

This documentation is for iCount version: 2.0.1.dev