Python package¶
iCount: protein-RNA interaction analysis¶
iCount is a Python module and associated command-line interface (CLI), which provides all the commands needed to process protein-RNA iCLIP interaction data and to identify and quantify sites of protein-RNA interactions on RNA.
iCount’s main input are FASTQ files with iCLIP sequencing data, its main output are BED files with identified and quantified cross-linked sites.
A number of analyses are included in iCount that provide insights into the properties of protein-RNA interaction.
Two attributes and associated environment variables define where output and temporary files will be stored:
-
iCount.
OUTPUT_ROOT
= '.'¶ Default output folder for iCount functions/commands. It points to the value of eviroment variable
ICOUNT_OUTPUT_ROOT
if set. Otherwise, current working directory is used.
-
iCount.
TMP_ROOT
= '/tmp/iCount'¶ Default temporary folder for iCount functions/commands. It is used to store temporary files, used in intermediate steps of analyses. It points to the value of eviroment variable
ICOUNT_TMP_ROOT
if set. Otherwise,/tmp/iCount
is used.
Genomes¶
This module provides access to Ensembl genome sequence and annotation. Segmentation into genes and into segments of same type (exon, intron, UTR, ...) is supported.
Ensembl API¶
Functions to query and download data from the Ensembl FTP site.
-
iCount.genomes.ensembl.
annotation
(species, release=88, out_dir=None, annotation=None)[source]¶ Download ENSEMBL annotation for given release/species.
Parameters: - species (str) – Species latin name.
- release (int) – Release number. Only ENSEMBL releases 59-88 are available.
- out_dir (str) – Download to this directory (if not given, current working directory).
- annotation (str) – Annotation filename (must have .gz file extension). If not given, species.release.gtf.gz is used. If annotation is provided as absolute path, value of out_dir parameter is ignored and file is saved to given absolute path.
Returns: Downloaded annotation filename.
Return type:
-
iCount.genomes.ensembl.
chrom_length
(fasta_in)[source]¶ Compute chromosome lengths to a file by using samtools faidx.
More about the .fai file format can be found here: http://www.htslib.org/doc/faidx.html
Parameters: fasta_in (str) – Path to genome FASTA file (can be .gz). Returns: Absolute path to output file. Return type: str
-
iCount.genomes.ensembl.
genome
(species, release=88, out_dir=None, genome=None, chromosomes=None)[source]¶ Download ENSEMBL genome for given release/species.
Parameters: - species (str) – Species latin name.
- release (int) – Release number. Only ENSEMBL releases 59-88 are available.
- out_dir (str) – Download to this directory (if not given, current working directory).
- genome (str) – Genome filename (must have .gz file extension). If not given, species.release.fa.gz is used. If genome is provided as absolute path, value of out_dir parameter is ignored and file is saved to given absolute path.
- chromosomes (list_str) – If given, do not download the whole genome, but listed chromosomes only. Chromosomes can be given as strings or integers.
Returns: - str – Downloaded genome/sequnce filename.
- TODO (target_dir & target_fname into one parameter? But the default name is)
- a quite useful feature...
-
iCount.genomes.ensembl.
get_ftp_instance
()[source]¶ Get ftplib.FTP object that is connected to ftp.ensembl.org.
Returns: FTP object connected to ftp.ensembl.org Return type: ftplib.FTP
Segmentation¶
Parse annotation file into internal iCount structure - segmentation.
Currently, only annotations from ENSEMBl and GENCODE are supported. http://www.gencodegenes.org/ http://www.ensembl.org
Segmentation is used in almost all further analyses.
In segmentation, each transcript is partitioned into so called regions/intervals. Such regions must span the whole transcript, but should not intersect with each other. However, higher hierarchy levels: transcripts and genes can of course intersect each other.
Example of possible segmentation:
Genome level: |---------------------------------------------------|
Gene level: |--------------gene1--------------| |-intergenic-|
|---------gene2--------|
Transcript l.: |----------transcript1---------|
|-------transcript2-------|
|------transcript3-----|
Region level: |-CDS-||-intron-||-CDS-||-UTR3-|
For simplicity, only the partition of transcript1 is presented.
-
iCount.genomes.segment.
get_regions
(annotation, segmentation, fai, report_progress=False)[source]¶ Create new gtf file with custom annotation and filtered content.
Each line in new file should define one of the following elements:
- gene
- transcript
- CDS
- intron
- UTR3
- UTR5
- stop_codon
- ncRNA
- intergenic
Name of third field (interval.fields[2]) should correspond to one of theese names. Only consider GTF entries of chromosomes given in genome_file.
Parameters: Returns: Absolute path to output GTF file.
Return type:
Demultiplexing¶
Split FASTQ file into separate files, one for each sample barcode.
Saved FASTQ files contain sequences where sample barcode, random barcode, and adapter sequences are removed. Random barcode is moved into the header line, since it is needed in later steps (removing PCR duplicates and counting number of cross-link events).
-
iCount.demultiplex.
run
(reads, adapter, barcodes, mismatches=1, minimum_length=15, prefix='demux', out_dir='.')[source]¶ Demultiplex FASTQ file.
Split input FASTQ file into separate files, one for each barcode, and additional file for non-matching barcodes.
Parameters: - reads (str) – Path to reads from a sequencing library.
- adapter (str) – Adapter sequence to remove from ends of reads.
- barcodes (list_str) – List of barcodes used for library.
- mismatches (int) – Number of tolerated mismatches when comparing barcodes.
- minimum_length (int) – Minimum length of trimmed sequence to keep.
- prefix (str) – Prefix of generated FASTQ files.
- out_dir (str) – Output folder. Use local folder if none given.
Returns: List of filenames where separated reads are stored.
Return type:
Mapping¶
Mapping index¶
Generate STAR genome index.
Mapping reads with STAR¶
Map reads to genome with STAR.
Filter hits¶
-
iCount.mapping.filters.
group_by_end
(hits)[source]¶ Group hits by their end position.
Records must grouped by contig and sorted by increasing position on contig.
-
iCount.mapping.filters.
group_by_start
(hits)[source]¶ Group hits by start.
Records must grouped by contig and sorted by increasing position on contig.
Identify and quantify cross-linked sites¶
Quantity cross-link events and determine their positions.
The simplest version of this script would oprate on such example:
|--a---b--- reference sequence, chr 14, positive strand ------------
|rbc1---R1--------|
|rbc1---R2------|
|rbc2---R3------|
|rbc1--------R4-------------|
|rbc3------R5-----|
Five reads (R1-R5) are mapped to a reference sequence (chromosome 14, positive strand). Reads start on two distinct positons. On first position, there is R1-R3. Cross-link site is located one nucleotide before start of the read (on negative strand, one nucleotide after end of read). However, we wish to count number of cDNA molecules, not the number of reads. This can be done by counting the number of distinct random barcodes (sometimes also called randomers). So in upper example, we have:
Postion a: 3 reads, 2 distinct random barcodes = 2 cDNA’s Postion b: 2 reads, 2 distinct random barcodes = 2 cDNA’s
However, things can get complicated when a single read is mapped in multiple parts. This can happen for several reasons. One common example is that introns are removed during transcription. This can be illustrated with the following image:
|---------------- reference -----------------------
|--------------------------transcript--------------------------|
|---UTR5---||---intron---||---exon---||---intron---||---exon---|
|-------R1-------|
|--R2.1--> <-R2.2-|
|-R3.1-> <-R3.2-> <-R3.3-|
|-R4.1-> <-R4.2-|
Read R1 and R2 are starting on same position. For the sake of argument, let’s also pretend they also have same random barcode. In so, we would count them as a single cDNA molecule (= single cross-link event), even though it is obvious that they represent two separate cross-link events. In order to fix this, we count not just the number of different randomers on same position, but also number of different “second-start” coordinates. Second-start coordinate is just the coordinate of the second part of the read. This way, the actual number of cross-link events can be determined more accurately. If read is not split, it’s second-start coordinate is 0. If read has multiple “holes” (as read R3) we determine second-start from the largest hole.
Reads whose second-start do NOT fall on segmentation (like R4) are stored in a
separate BAM file sites_strange
. They should be treated with special care,
since they can indicate not-yet annotated features in genome. If segmentation is
not given, all reads with holes, bigger than holesize_th
are considered
strange.
Another parameter needs more explanation: group_by
. When algorithm starts,
reads from BAM file are grouped in hierarchical structure by:
* chromosome and strand
* cross-link position
* random barcode
* second-start
Each second-start group receives 1 cDNA score. This score is divided to each read in group (if there are 5 reads in group, each one gets 1/5 score). This enables that each read has it’s cDNA score and of course, 1 “read score”. This scores can be assigned to start (actually, to cross-link position), midlle or end position of read. By default, score is of course assigned to cross-link location. But for diagnostic purpuses, scores can also be assigned to middle or end coordinate of the read.
TODO: check overlap between unique and multimap BED files, should be small, otherwise, we should think of a more approapriate quantification of (division of randomers among) unique mapped sites that overlap with multimapped reads
-
iCount.mapping.xlsites.
run
(bam, sites_unique, sites_multi, skipped, group_by='start', quant='cDNA', segmentation=None, mismatches=2, mapq_th=0, multimax=50, gap_th=4, report_progress=False)[source]¶ Identify and quantify cross-linked sites.
Interpret mapped sites and generate BED file with coordinates and number of cross-linked events.
MAPQ is calculated mapq=int(-10*log10(1-1/Nmap)). By default we set the mapq_th to 0 to include all reads. Mapq score is very useful, because values coming from STAR are from a very limited set: 0 (5 or more multiple hits), 1 (4 or 3 multiple hits), 3 (2 multiple hits), 255 (unique hit)
Parameters: - bam (str) – Input BAM file with mapped reads.
- sites_unique (str) – Output BED6 file to store data from uniquely mapped reads.
- sites_multi (str) – Output BED6 file to store data from multi-mapped reads.
- skipped (str) – Output BAM file to store reads that do not map as expected by segmentation and reference genome sequence. If read’s second start does not fall on any of segmentation borders, it is considered problematic. If segmentation is not provided, every read in two parts with gap longer than gap_th is not used (skipped). All such reads are reported to the user for further exploration.
- group_by (str) – Assign score of a read to either ‘start’, ‘middle’ or ‘end’ nucleotide.
- quant (str) – Report number of ‘cDNA’ or number of ‘reads’.
- mismatches (int) – Reads on same position with random barcode differing less than
mismatches
are grouped together. - segmentation (str) – File with custon segmentation format (obtained by
iCount segment
). - mapq_th (int) – Ignore hits with MAPQ < mapq_th.
- multimax (int) – Ignore reads, mapped to more than
multimax
places. - report_progress (bool) – Switch to report progress.
- gap_th (int) – Reads with gaps less than gap_th are treated as if they have no gap.
Returns: Metrics object, storing analysis metadata.
Return type: iCount.Metrics
Analysis¶
iCount implements a number of analyses.
Peak finding¶
Find positions with high density of cross-linked sites.
There are two typical variants of this analysis, depending on the parameters:
- Gene-wise analysis, where:
- features = gene
- group_by = gene_id
- Transcript-wise analysis where:
- features = CDS, intron, UTR3, UTR5, ncRNA, intergenic
- group_by = transcript_id
Let’s look at the Gene-wise analysis in more detail first. Imagine the following situation:
|-----------gene1----------|
|-----------------------------gene2------------------------------|
ab c d e
a = 60
b = 100
c = 70
d = 40
e = 100
gene1: gene_id = 001
gene2: gene_id = 002
There are two genes (partially intersecting) and five positions with cross-links (noted with a, b, c, d and e). Crosslink position “a” has 60 cross-link events, “b” has 100 cross-link events and so on. Also, gene1 has gene_id 001, etc.
The algorithm first finds all intersections between annotation and
cross-links. In this case cross-link position “a” intersects only with gene1,
while position “b” intersects also with gene2... Annotation can include
various other types of segments (transcripts, intergenic, ncRNA, etc.), but only
segments of type gene
are considered for intersection. This behaviour is
controlled by parameter features
.
Next step is to make groups of cross-links. They are grouped by group_by
parameter (in this case, it equals to gene_id
). There will be 2 groups.
First group name will be 001 and will contain a, b, c and d. Second group name
will be 002 and will contain b, c, d and e.
The question now is: has any of positions in each group significantly increased number of cross-link events? And how can one quantify this significance?
This is done by parmutation analysis. It draws a number of random situations
with same group size and number of cross-link scores. Number of such draws is
determined by perm
parameter. This way, a random distribution is calculated.
When comparing the observed distribution with the random one, FDR values are
assigned to each position. A cutoff FRD value is chosen and only positions with
FDR < FDR cutoff are considered as significant.
One must also know that when considering only scores on single positions significant clusters of cross-links can be missed. In the upper example, it is obvious, that something more significantly is happening on position b than on position e, despite having the same score. To account for this, algorithm considers not only the score of single cross-link, but also scores of cross-links some nucleotides before and after. This behaviour in controlled by half-window (half_window) parameter. In the upper example, score of position b eqals to 160 if half_window = 1 and 2530 if half_window=2. Score of position e remains 100.
Let’s also look at the transcript-wise analysis. In this case, scenario also includes transcripts and sub-transcript elements:
|-----------gene1----------|
|--------transcript1-------|
|-ncRNA-||-intron-||-ncRNA-|
|-----------------------------gene2------------------------------|
|---------------transcript2--------------|
|-CDS-||-intron-||-CDS-||-intron-||-UTR3-|
|---------------transcript3--------------|
|-UTR5-||-intron-||-CDS-||-intron-||-CDS-|
ab c d e
a = 60
b = 100
c = 70
d = 40
e = 100
gene1: gene_id = 001
gene2: gene_id = 002
transcript1: transcript_id = 0001
transcript2: transcript_id = 0002
transcript3: transcript_id = 0003
Value of parameter features is: CDS, intron, UTR3, UTR5, ncRNA, intergenic. Value of parameter group_by is transcript_id. Since we have multiple values in feature parameter, another parameter becomes important: merge_features. If set to false (default) algorithm will make the following groups:
- group name: ncRNA-0001, members: a, b, d
- group name: intron-0001, members: c
- group name: CDS-0002, members: b, c, d
- group name: UTR3-0002, members: e
- group name: intron-0003, members: e
However, if merge_features equals to true, groups are:
- group name: 0001, members: a, b, c, d
- group name: 0002, members: b, c, d, e
- group name: 0003, members: e
Then, for each group, procedure is exactly the same as in Gene-wise case.
When analysis is done, significant positions are reported in file, given by peaks parameter. If scores parameter is also given, all positions are reported in it, no matter the FDR value.
-
iCount.analysis.peaks.
cumulative_prob
(vals, max_val)[source]¶ Compute cumulative probability.
Given a list of SWW scores in region
vals
, return listfreqs_cum
where probability that randomly picked value in vals is equal or greater than i equals freqs_cum[i].Max_val is the largest possible value that can be expected in vals.
-
iCount.analysis.peaks.
get_avg_rnd_distrib
(size, total_hits, half_window, perms=10000)[source]¶ Return background distribution for given region size and number of hits.
We follow the modified FDR for peak height, proposed by [1]
[1] Yeo, G.W. et al. An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat. Struct. Mol. Biol. 16, 130–137 (2009). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735254/
Results are cached, so they can be reused.
Parameters: Returns: Probability to find CWW score i or more on chosen position is equal to i-th element or returned array.
Return type: numpy.ndarray
-
iCount.analysis.peaks.
run
(annotation, sites, peaks, scores=None, features=None, group_by='gene_id', merge_features=False, half_window=3, fdr=0.05, perms=100, rnd_seed=42, report_progress=False)[source]¶ Find positions with high density of cross-linked sites.
When determining feature.name, value of the first existing attribute in the following tuple is taken:
("ID", "gene_name", "transcript_id", "gene_id", "Parent")
Source in pybedtools: https://github.com/daler/pybedtools/blob/master/pybedtools/scripts/annotate.py#L34
Parameters: - annotation (str) – Annotation file in GTF format, obtained from “iCount segment” command.
- sites (str) – File with cross-links in BED6 format.
- peaks (str) – File name for “peaks” output. File reports positions with significant number of cross-link events. It should have .bed or .bed.gz extension.
- scores (str) – File name for “scores” output. File reports all cross-link events, independent from their FDR score It should have .tsv, .csv, .txt or .gz extension.
- features (list_str) – Features from annotation to consider. If None, [‘gene’] is used. Sometimes, it is advised to use [‘gene’, ‘intergenic’].
- group_by (str) – Attribute by which cross-link positions are grouped.
- merge_features (bool) – Treat all features as one when grouping. Has no effect when only one feature is given in features parameter.
- half_window (int) – Half-window size.
- fdr (float) – FDR threshold.
- perms (int) – Number of permutations when calculating random distribution.
- rnd_seed (int) – Seed for random generator.
- report_progress (bool) – Report analysis progress.
Returns: Analysis metadata.
Return type:
Cluster sites¶
Merge adjacent peaks into clusters and sum cross-links within clusters.
-
iCount.analysis.clusters.
run
(sites, peaks, clusters, dist=20, slop=3)[source]¶ Join neighboring peaks (at distance dist) into clusters.
Report sum of sites’ scores within each cluster, including slop.
Parameters: - sites (str) – Path to input BED6 file with sites.
- peaks (str) – Path to input BED6 file with peaks (or clusters).
- clusters (str) – Path to output BED6 file with merged peaks (clusters).
- dist (int) – Distance between two peaks to merge into same cluster.
- slop (int) – Distance between site and cluster to assign site to cluster.
Returns: BED file with clusters as elements.
Return type:
k-mer enrichment¶
Read bedGraph with cross-linked sites. Count k-mer frequencies. Perform permutation analysis to determine significance of observed k-mer frequencies. Return ranked list of k-mer enrichment.
RNA maps¶
Distribution of cross-links relative to genomic landmarks.
What is an RNA-map?¶
Imagine the following situation (on positive strand):
|---intron1---||---exon2---||---intron2---||---exon3---|
x|-----R1-----|
Situation is simple: a read perfectly maps to reference genome. First cross-link (nucletide before the start of read R1) is located 2 nucleotides before landmark of type ‘exon-intron’. By examining all cross-links(reads) that are located in vicinity of landmarks one can obtain the distribution of cross-link-to-landmark distances. Such distibution is called an RNA map.
Of course, situation gets complicated. Real-world-situation more likely looks like this:
|--------transcript1-------|
|-ncRNA-||-intron-||-ncRNA-|
|------------------transcript2-----------------|
|--CDS--||-intron-||--CDS--||-intron-||--UTR3--|
|---------------transcript3--------------|
|-UTR5-||-intron-||-CDS-||-intron-||-CDS-|
x|-----R1-----|
x|R2.1-> <--R2.2-|
x|-R3-|
All sort of situations can arise:
- read can intersect with multiple genes/transcripts/segments simultaneously, which means that one read is relevant for different types of RNA maps.
- if whole read is mapped to one segment (as is read R3 on transcript2), then it is impossible to tell the type of RNA map it belongs to: is it CDS-intron or intron-CDS?
- read can be mapped in multiple parts - for example when introns are removed before translation, first half of read will map to
exon1
and second half toexon2
.- same cross-link event can be represented by multiple reads with same random barcode.
- ...
The algorithm takes care for all of this, as explained below. It outputs a file, that looks like this:
rna_map_type position all explicit
exon-intron 42 13 14
exon-intron 43 123 19
exon-intron 44 56 16
....
intron-CDS 23474 34 2
intron-CDS 23475 85 65
intron-CDS 23476 92 1
...
Each line consits of four columns: RNA map type, position, all and explicit. RNA
map type defines the landmark type, while position defines relative distance of
cross-links to this landmark. All and explicit are counting number of crosslinks
that are positions
away from landmark of type RNA map type. Cross link is
explicit, if read identifying cross-link is mapped to both parts of RNA-map. In
upper example, R1 is explicit, since it maps to intron and exon. Read R3 on
transcript2 is implicit, since one has to decide wheather it’s cross-link
belongs to CDS-intron
or intron-CDS
RNA map. The term all means
explicit + implicit in this context.
If read is implicit (start and stop within same segment), one can choose two varinats of the algorithm. One option is that whole score of read is given to RNA-map of the closest landmark. Other option is to split score on half to both neighbouring segments. This behaviour is controlled by parameter implicit_handling.
There are also two cases when a read can map in a way that is not predicted by annotation:
- For split reads, second start can fall on nucleotide that is not start of a new segment. This behaviour is excactly the same as in xlsites analysis. Such reads are reported in file defined with
strange
parameter.- Reads that map on two different transcripts or even two different genes are repoted in file with name defined in cross_transcript.
Things that still need to be resolved:
- negative strand inspection:
- for each ss_group, take annotation from reverse strand and check what is the situation on other side.
- What si again the bio-contxt here - explain in docs...?
- Test: Each cross-link should have the same cDNA count as in xlsites results. The final check should be: sum all scores - the result should be the same as for xlsites. Actually, the metrics.cross_transcript should also be considered: assert: sum(xlsites scores) == sum(RNAmaps “all” + metrics.cross_transc)
- if read crosses two ore more regions from start to end, only end-type is considered for RNAmap type. Should we fix this and which region type to take? Since reads are tpyically much shorted than regions, this is probably a rare case.
Wishes and ideas(Jernej, Tomaz):
The whole idea is to get a feeling what is happening around landmarks
- separate pre-mRNA and mRNA [this is partially done in _add_entry]
- Life cyle of RNA: part of DNA in transcribed to RNA in nucleus. Introns are cut out already between this process! Really???
- pre-mRNA contains introns, mature mRNA has no introns and goes outside from the nucleus.
- so: RNAmap exon-intron will for sure contain only pre-mRNA, but RNAmap exon-exon can contain pre-mRNA and mRNA... It would be really nice to report on percentage of theese two categories.
- This can be done by introducing three categories: pre-mRNa, mRNA and undecided. This can be determined from results - they contain explicit / implicit info and RNAmap type:
Intoduce three categories (colors in visualisation?) in every RNAmap type. Let’s take exon-intron as example:
- whole read in left part - color #1 (negative coord, implicit)
- whole read in right part - color #2 (positive coord, implicit)
- read crossing junction - color #3 (explicit)
How to handle situation when one region in individual’s sequence clearly differs from reference sequence but it is just some variation?
- Change the reference sequence? This can be complex... Make a helper tool for this?
- Provide additional data in function - which exceptions /abnormalities to ignore?
Gaussian smooting of RNAmaps?
- split read - it can differ also on “first-stop”, not juts second-start
- we could have some sort of quality check when observing the variance of first-stop-s. THe major question is: “Do all reads for certain xlink event indicate the same behaviour?”
-
iCount.analysis.rnamaps.
make_normalization
(segmentation, normalization)[source]¶ Make normalization file for RNAmaps (for given segmentation).
Parameters: Returns: Path to file with normalizations.
Return type:
-
iCount.analysis.rnamaps.
plot_rna_map
(rnamap_file, map_type, normalization=False, outfile='show')[source]¶ Plot simple image of RNAmap.
-
iCount.analysis.rnamaps.
run
(bam, segmentation, out_file, strange, cross_transcript, implicit_handling='closest', mismatches=2, mapq_th=0, holesize_th=4)[source]¶ Compute distribution of cross-links relative to genomic landmarks.
Parameters: - bam (str) – BAM file with alligned reads.
- segmentation (str) – GTF file with segmentation. Should be a file produced by function get_regions.
- out_file (str) – Output file with analysis results.
- strange (str) – File with strange propertieas obtained when processing bam file.
- cross_transcript (str) – File with reads spanning over multiple transcripts or multiple genes.
- implicit_handling (str) – Can be ‘closest’ or ‘split’. In case of implicit read - split score to both neighbours or give it just to the closest neighbour.
- mismatches (int) – Reads on same position with random barcode differing less than
mismatches
are grouped together. - mapq_th (int) – Ignore hits with MAPQ < mapq_th.
- holesize_th (int) – Raeads with size of holes less than holesize_th are treted as if they would have no holes.
Returns: File with number of (al, explicit) scores per each position in each RNA-map type.
Return type:
Cross-link site annotation¶
Annotate each cross link site with types of regions that intersect with it.
-
iCount.analysis.annotate.
annotate_cross_links
(annotation, sites, sites_annotated, subtype='biotype', excluded_types=None)[source]¶ Annotate each cross-link site with all region types that intersect it.
Each cross link can overlap/intersect with many intervals in annotation file. Each of these intervals has a given type. Make a copy of cross link file and include all these types in 4th column.
In the context of this report “type” equals to the combination of 3rd column and attribute subtype from annotation file (GTF). Regions/intervals are parts of transcript (UTR, CDS, introns...) that “non- intersectingly” span the whole transcript. However, since transcripts can overlap, also intervals belonging to different transcripts can overlap. Intergenic regions are also considered as region. Each region has one and only one type.
Parameters: - annotation (str) – Path to annotation file (should be GTF and include subtype attribute).
- sites (str) – Path to input BED6 file listing all cross-linked sites.
- sites_annotated (str) – Path to output BED6 file listing annotated cross-linked sites.
- subtype (str) – Subtype.
- excluded_types (list_str) – Excluded types.
Returns: Path to summary report file (should be equal to out_file parameter)
Return type:
Cross-link site summary¶
Report proportion of cross-link events/sites on each region type.
-
iCount.analysis.summary.
make_summary_report
(annotation, sites, summary, fai, types_length_file=None, digits='8', subtype=None, excluded_types=None)[source]¶ Make summary report from cross-link and annotation data.
In the context of this report “type” equals to the combination of 3rd column and attribute subtype from annotation file (GTF). “Regions” are parts of transcript (UTR, CDS, introns...) that “non- intersectingly” span the whole transcript. Intergenic regions are also considered as region. Each region has one and only one type.
For each of such types, number of cross-link events/sites is counted. Sites = sum of all cross-link sites that are in regions of some type. Events = sum of col5 for all cross-link sites that are in regions of some type (multiple events can happen on each cross link). Col5 is numerical value in 5th column of cross-link file.
Parameters: - annotation (str) – Path to annotation GTF file (should include subtype attribute).
- sites (str) – Path to BED6 file listing cross-linked sites.
- summary (str) – Path to output tab-delimited file with summary statistics.
- fai (str) – Path to file with chromosome lengths.
- types_length_file (str) – Path to file with lengths of each type.
- digits (int) – Number of decimal places in results.
- subtype (str) – Name of attribute to be used as subtype.
- excluded_types (list_str) – Types listed in 3rd column of GTF to be exclude from analysis.
Returns: Path to summary report file (equal to parameter out_file)
Return type:
-
iCount.analysis.summary.
make_types_length_file
(annotation, fai, out_file=None, subtype='biotype', excluded_types=None)[source]¶ Calculate the number of non-overlapping base pairs of each “type”.
In the context of this function “type” equals to the combination of 3rd column and attribute subtype from annotation file (GTF).
Parameters: Returns: Absolute path to out_file.
Return type:
Group BED files¶
Merge multiple BED files with crosslinks into one.
First, concatenate files into one file. Then, merge crosslinks from different files that are on same position and sum their scores.
Externals¶
A number of external software is needed for iCount to work.
Cutadapt¶
Remove adapter sequences from reads in FASTQ file.
-
iCount.externals.cutadapt.
run
(reads, reads_trimmed, adapter, qual_trim=None, minimum_length=None)[source]¶ Remove adapter sequences from high-throughput sequencing reads.
Parameters: - reads (str) – Input FASTQ file.
- reads_trimmed (str) – Output FASTQ file containing trimmed reads.
- adapter (str) – Sequence of an adapter ligated to the 3’ end.
- qual_trim (int) – Trim low-quality bases before adapter removal.
- minimum_length (int) – Discard trimmed reads that are shorter than minimum_length.
Returns: Return code of the cutadapt program.
Return type:
STAR aligner¶
Interface to running STAR.
-
iCount.externals.star.
build_index
(genome, genome_index, annotation='', overhang=100, overhang_min=8, threads=1)[source]¶ Call STAR to generate genome index, which is used for mapping.
Parameters: - genome (str) – Genome sequence to index.
- genome_index (str) – Output folder, where to store genome index.
- annotation (str) – Annotation that defines splice junctions.
- overhang (int) – Sequence length around annotated junctions to be used by STAR when constructing splice junction database.
- overhang_min (int) – TODO
- threads (int) – Number of threads that STAR can use for generating index.
Returns: Star return code.
Return type:
-
iCount.externals.star.
map_reads
(reads, genome_index, out_dir, annotation='', multimax=10, mismatches=2, threads=1)[source]¶ Map FASTQ file reads to reference genome.
TODO
Parameters: - reads (str) – Sequencing reads to map to genome.
- genome_index (str) – Folder with genome index.
- out_dir (str) – Output folder, where to store mapping results.
- annotation (str) – GTF annotation needed for mapping to splice-junctions.
- multimax (int) – Number of allowed multiple hits.
- mismatches (int) – Number of allowed mismatches.
- threads (int) – Number of threads that STAR can use for generating index.
Returns: Return code
Return type:
Files¶
iCount works with various formats that store FASTA and FASTQ sequencing data, GTF genome annotation, BAM data on mapped reads, BED files with quantified cross-linked sites. Parsing of GTF files is done with pybedtools.
-
iCount.files.
gz_open
(fname, mode)[source]¶ Use
gzip
library to open compressed files ending with .gz.Parameters: Returns: File Object.
Return type:
-
iCount.files.
decompress_to_tempfile
(fname, context='misc')[source]¶ Decompress files ending with .gz to a temporary file and return filename.
If file does nto end with .gz, juts return fname.
Parameters: Returns: Path to decompressed file.
Return type:
BED¶
Reading and writing BED files.
-
iCount.files.bed.
convert_legacy
(bedgraph_legacy, bed_converted)[source]¶ Convert legacy iCount’s four-column format into proper BED6 format.
Old iCount legacy format: chrome, start, end, [+-]value Strand can be either ‘+’ or ‘-‘, and value indicates the intensity of interaction.
The returned BED file follows the BED6 format, as explained in the [bedtools manual](http://bedtools.readthedocs.io/en/latest/content /general-usage.html).
-
iCount.files.bed.
merge_bed
(sites_grouped, sites)[source]¶ Merge multiple files with crosslinks into one.
Concatenate files into one file. Also, merge crosslinks from different files that are on same position and sum their scores.
Parameters: - sites_grouped (str) – Path to output BED6 file containing merged data from input sites files.
- sites (list_str) – List of BED6 files(paths) to be merged.
Returns: Absolute path to outfile.
Return type:
FASTQ¶
Reading and writting FASTQ files.
-
iCount.files.fastq.
get_qual_encoding
(fname)[source]¶ Read first few records and determine quality encoding in FASTQ file.
See format description: http://en.wikipedia.org/wiki/FASTQ_format
S - Sanger Phred+33, raw reads typically (0, 40) [33..73] X - Solexa Solexa+64, raw reads typically (-5, 40) [59..104] I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) [64..104] J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) [66..104] L - Illumina 1.8+ Phred+33, raw reads typically (0, 41) [33..74]
Metrics¶
iCount processing statistics can be stored into instances of iCount.metrics.Metrics
.
-
class
iCount.metrics.
Metrics
(context=None, **kwargs)[source]¶ Storge for statistics calculated during function execution.
-
__init__
(context=None, **kwargs)[source]¶ When creating the placeholder, a process-specific context should be given.
Parameters: context (str) – Context is used to indicate the process that generated the processing statistics. Returns: Instance where processing statistics can be added or modified. Return type: Metrics
-
Examples
Provide a set of example bash scripts.
-
iCount.examples.
run
(out_dir='.')[source]¶ Create an examples subfolder with example scripts.
This will create an examples subfolder in current working directory and will copy bash scripts needed to run the iCount pipeline on a few examples (for now, only the hnRNP C data from Konig et al.).
Parameters: out_dir (str) – Directory to which example scripts should be copied. Returns: Path of folder to where examples scripts were copied. Return type: str
Automated CLI creation¶
This module offers CLI to iCount Python package. To separate Python package from CLI and to avoid code and docstring duplication, CLI commands are semi-automatically created from functions that already exist in the package. For this automation to work, some rules need to respected when writing functions that are later exposed to CLI.
-
iCount.cli.
make_parser_from_function
(function, subparsers, module=None, only_func=False)[source]¶ Make a argparse parser from
function
and add it tosubparsers
.A Python function is exposed in CLI by calling
make_parser_from_function
. Example call for exposing functioniCount.analysis.peaks.run
that makes peaks analysis:make_parser_from_function(iCount.analysis.peaks.run, subparsers)
- What happens is such call?
CLI command
iCount peaks
is created, with description, positional and optional arguments exactly the same as the ones defined in function.name of the command (peaks) equals to the name of the module where function is defined.
CLI help message is sourced from module’s docstring
Positional and optional arguments (and default values) are sourced from function signature
Help text for each of the arguments is sourced from function docstring. For this to work functions needs to follow Nummpy docstring formatting. All parameters should have meaningful description. All parameters should also have type equal to one of the keys in VALID_TYPES.
When function is executed, stdout logger with level INFO is registerd. This prints descriptive messages to CL. Function should therefore log its inputs (use iCount.logger.log_inputs function), outputs and most important steps. If there are no errors/execptions command exits with 0 exit status code.
If error/exception occurs, is it caught and stack is printed to logger with ERROR level. Also, failed CLI command has status code 1.
CLI exposed function should return Metric object that is instance of iCount.metrics.Metric class. Function should set descriptive attributes to the Metric object for analysis results and analysis tatistics for inter-experimetal comparison.
- Each command also gets these additional arguments:
--stdout_log
- Threshold value (0-50) for logging to stdout. If 0 logging to stdout if turned OFF.--file_log
- Threshold value (0-50) for logging to file. If 0 logging to stdout if turned OFF.--file_logpath
- Path to log file.--results_file
- File into which to store Metrics (result object).--help
- Help message fot a command
They control the level and location of log inputs as well as string the result objects.
Exceptional cases:
- In some cases function that performs the work is only imported in the
correct module (“exposed module”), but the actual function definition is
located somewhere else (“source module”). It that case, one can use
module
parameter with “source module” value. In this case, the command name and CLI docstring will be defined from “exposed module” but function, default values and parameter descriptions will be sourced from “source module”. - In some cases, there will be more CLI exposed functions in the same
module. In such case, use set
only_func
parameter toTrue
. This will use function name for CLI command name and use function docstring (form beginning until “Parameters” section) for CLI help message.