
""".. Line to protect from pydocstyle D205, D400.

Peak finding
------------

Find positions with high density of cross-linked sites.

There are two typical variants of this analysis, depending on the parameters:

* Gene-wise analysis, where:
    * features = gene
    * group_by = gene_id

* Transcript-wise analysis where:
    * features = CDS, intron, UTR3, UTR5, ncRNA, intergenic
    * group_by = transcript_id


Let's look at the Gene-wise analysis in more detail first. Imagine the following
situation::

    |-----------gene1----------|
            |-----------------------------gene2------------------------------|
           ab c                d                 e

    a = 60
    b = 100
    c = 70
    d = 40
    e = 100
    gene1: gene_id = 001
    gene2: gene_id = 002

There are two genes (partially overlapping) and five positions with cross-links
(denoted a, b, c, d and e). Cross-link position "a" has 60 cross-link events,
"b" has 100 cross-link events, and so on. Also, gene1 has gene_id 001, etc.

The algorithm first finds all intersections between the annotation and the
cross-links. In this case, cross-link position "a" intersects only with gene1,
while position "b" intersects with both gene1 and gene2, and so on. The
annotation can include various other types of segments (transcripts,
intergenic, ncRNA, etc.), but only segments of type ``gene`` are considered
for intersection. This behaviour is controlled by the ``features`` parameter.
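
The intersection itself is computed with pybedtools, strand-specifically and
keeping the overlap records, as in the ``run`` function below::

    overlaps = annotation.intersect(sites, sorted=True, s=True, wo=True)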

The next step is to group the cross-links. They are grouped by the ``group_by``
parameter (in this case, ``gene_id``), which results in two groups. The first
group, named 001, contains a, b, c and d. The second group, named 002,
contains b, c, d and e.
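
Schematically, with letters standing in for genomic positions, the groups
could be represented as below (this is a simplification; the actual
implementation keys groups by (chromosome, strand, group_id, name) tuples)::

    groups = {
        '001': [('a', 60), ('b', 100), ('c', 70), ('d', 40)],
        '002': [('b', 100), ('c', 70), ('d', 40), ('e', 100)],
    }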

The question now is: does any position in a group have a significantly
increased number of cross-link events? And how can one quantify this
significance?

This is done by permutation analysis. It draws a number of random situations
with the same group size and number of cross-link events. The number of such
draws is determined by the ``perms`` parameter. This way, a random background
distribution is calculated. By comparing the observed distribution with the
random one, FDR values are assigned to each position. A cutoff FDR value is
chosen, and only positions with FDR < FDR cutoff are considered significant.
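
Schematically, the background could be estimated as in the sketch below (a
simplified, hypothetical version of what ``get_avg_rnd_distrib`` does; it
omits the half-window summing and the added standard deviation)::

    from collections import Counter

    import numpy

    def random_background(size, total_hits, perms=100):
        # Element i: probability that a cross-linked position in a random
        # group of `size` positions with `total_hits` events scores >= i.
        probs = numpy.zeros((perms, total_hits + 1))
        for p in range(perms):
            # Scatter `total_hits` events uniformly over `size` positions:
            scores = Counter(numpy.random.randint(size, size=total_hits)).values()
            for i in range(total_hits + 1):
                probs[p, i] = sum(s >= i for s in scores) / len(scores)
        return probs.mean(axis=0)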

One must also know that when considering only scores of single positions,
significant *clusters* of cross-links can be missed. In the example above, it
is obvious that something more significant is happening at position b than at
position e, despite both having the same score. To account for this, the
algorithm considers not only the score of a single cross-link position, but
also the scores of cross-links some nucleotides before and after. This
behaviour is controlled by the half-window (``half_window``) parameter. In
the example above, the score of position b equals 160 if half_window=1 and
230 if half_window=2, while the score of position e remains 100.
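
Worked out, with positions as drawn above ("a" adjacent to "b", "c" two
nucleotides downstream of "b", and no cross-links near "e")::

    half_window = 1:  score_sww(b) = a + b     = 60 + 100      = 160
    half_window = 2:  score_sww(b) = a + b + c = 60 + 100 + 70 = 230
    half_window = 2:  score_sww(e) = e         = 100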

Let's also look at the transcript-wise analysis. In this case, the scenario
also includes transcripts and sub-transcript elements::


    |-----------gene1----------|
    |--------transcript1-------|
    |-ncRNA-||-intron-||-ncRNA-|
            |-----------------------------gene2------------------------------|
            |---------------transcript2--------------|
            |-CDS-||-intron-||-CDS-||-intron-||-UTR3-|
                                    |---------------transcript3--------------|
                                    |-UTR5-||-intron-||-CDS-||-intron-||-CDS-|

           ab c                d                 e

    a = 60
    b = 100
    c = 70
    d = 40
    e = 100
    gene1: gene_id = 001
    gene2: gene_id = 002
    transcript1: transcript_id = 0001
    transcript2: transcript_id = 0002
    transcript3: transcript_id = 0003

The value of the ``features`` parameter is: CDS, intron, UTR3, UTR5, ncRNA,
intergenic. The value of the ``group_by`` parameter is transcript_id. Since
there are multiple values in the ``features`` parameter, another parameter
becomes important: ``merge_features``. If set to false (the default), the
algorithm makes the following groups:

    * group name: ncRNA-0001, members: a, b, d
    * group name: intron-0001, members: c
    * group name: CDS-0002, members: b, c, d
    * group name: UTR3-0002, members: e
    * group name: intron-0003, members: e

However, if merge_features is true, the groups are:

    * group name: 0001, members: a, b, c, d
    * group name: 0002, members: b, c, d, e
    * group name: 0003, members: e

Then, for each group, the procedure is exactly the same as in the gene-wise
case.

When the analysis is done, significant positions are reported in the file
given by the ``peaks`` parameter. If the ``scores`` parameter is also given,
all positions are reported in it, regardless of their FDR value.
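
A minimal usage sketch (file names are hypothetical; the module is importable
as ``iCount.analysis.peaks``)::

    import iCount

    metrics = iCount.analysis.peaks.run(
        'segmentation.gtf.gz',   # annotation from "iCount segment"
        'cross_links.bed.gz',    # cross-link events in BED6 format
        'peaks.bed.gz',          # output: significant positions
        scores='scores.tsv.gz',  # optional output: all positions with FDR
        half_window=3,
        fdr=0.05,
    )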

"""
import os
import math
import bisect
import logging
from collections import Counter

import numpy
import pybedtools

import iCount
from iCount.files import _f2s

LOGGER = logging.getLogger(__name__)


def _sum_within_window(pos_val, half_window=3):
    """
    Sum values in ``pos_val`` within a window of half-size ``half_window``
    around each position.

    Example::

        pos_val = [(41, 1), (42, 1), (43, 1), (44, 1)]
        ret_list = _sum_within_window(pos_val, 1)
        ret_list == [(41, 2), (42, 3), (43, 3), (44, 2)]

        pos_val = [(41, 1), (42, 1), (43, 1), (46, 1)]
        ret_list = _sum_within_window(pos_val, 3)
        ret_list == [(41, 3), (42, 3), (43, 4), (46, 2)]

    The returned list preserves the order of positions given on input.
    """
    if not pos_val:
        return []
    pos_val_ind = sorted((pos, val, i) for i, (pos, val) in enumerate(pos_val))
    poss, vals, inds = zip(*pos_val_ind)

    ret_list = [None] * len(pos_val)
    max_i = len(poss)
    for i, pos in enumerate(poss):
        i_start = bisect.bisect_left(
            poss, pos - half_window, lo=max(0, i - half_window), hi=i)
        i_stop = bisect.bisect_left(
            poss, pos + 1 + half_window, lo=i_start, hi=min(i + half_window + 1, max_i))
        ret_list[inds[i]] = (pos, sum(vals[i_start:i_stop]))
    return ret_list


def _sum_within_window_nopos(pos_val, half_window=3):
    """Make same thing as _sum_within_window but without positions."""
    if not pos_val:
        return []
    pos_val = sorted(pos_val)
    poss, vals = zip(*pos_val)

    ret_list = []
    max_i = len(poss)
    for i, pos in enumerate(poss):
        i_start = bisect.bisect_left(
            poss, pos - half_window, lo=max(0, i - half_window), hi=i)
        i_stop = bisect.bisect_left(
            poss, pos + 1 + half_window, lo=i_start, hi=min(i + half_window + 1, max_i))
        ret_list.append(sum(vals[i_start:i_stop]))
    return ret_list
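
# For example (an illustrative call of the helper above):
#     _sum_within_window_nopos([(41, 1), (42, 1), (43, 1)], half_window=1)
# returns [2, 3, 2]: the window-summed scores, without their positions.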


def cumulative_prob(vals, max_val):
    """
    Compute cumulative probability.

    Given a list of SWW scores in region ``vals``, return a list
    ``freqs_cum``, where the probability that a randomly picked value in
    ``vals`` is equal to or greater than ``i`` equals ``freqs_cum[i]``.
    ``max_val`` is the largest value that can be expected in ``vals``.
    """
    # Make a histogram from vals. Bin edges are range(max_val + 2), which
    # gives max_val + 1 bins: one for each score from 0 to max_val:
    # + 1 for the probability of observing 0 or more
    # + 1 because range() excludes the last element
    freqs, _ = numpy.histogram(vals, bins=range(max_val + 1 + 1), density=True)
    # We are interested not in how many positions have exactly x cross-links,
    # but in how many have x cross-links OR MORE. So sum from behind:
    freqs_cum = numpy.cumsum(freqs[::-1])
    # Reverse back, so that freqs_cum[i] = Prob(score >= i):
    return freqs_cum[::-1]
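
# A quick illustration (using the SWW scores from the _process_group
# docstring example below):
#     cumulative_prob([0, 2, 2, 3, 1], max_val=3)
# returns array([1.0, 0.8, 0.6, 0.2]), i.e. Prob(score >= i) is at index i.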

# def get_rnd_distrib(size, total_hits, half_window, perms=100):
#     """The simplest (and fastest) permutation test.
#
#     Not used by default.
#     """
#     rnd_cns = numpy.zeros(total_hits + 1)
#     for i in range(perms):
#         rnd_hits = Counter(numpy.random.randint(size, size=total_hits))
#         rnd_hits_extended = _sum_within_window_nopos(rnd_hits.items(), half_window)
#         for v in rnd_hits_extended:
#             rnd_cns[v] += 1
#
#     s = sum(rnd_cns)
#     if s > 0:
#         rnd_ps = rnd_cns / s
#     else:
#         rnd_ps = rnd_cns
#
#     cum_prob_ret = numpy.cumsum(rnd_ps[::-1])
#     return cum_prob_ret[::-1]


PS_CACHE = {}


def get_avg_rnd_distrib(size, total_hits, half_window, perms=10000):
    """
    Return background distribution for given region size and number of hits.

    We follow the modified FDR for peak height, proposed by [1].

    [1] Yeo, G.W. et al. An RNA code for the FOX2 splicing regulator revealed
    by mapping RNA-protein interactions in stem cells. Nat. Struct. Mol. Biol.
    16, 130–137 (2009).
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735254/

    Results are cached, so they can be reused.

    Parameters
    ----------
    size : int
        Size of region.
    total_hits : int
        Number of cross-link events in region.
    half_window : int
        Half-window size. The actual window size is: 2 * half_window + 1.
    perms : int
        Number of permutations to make.

    Returns
    -------
    list
        The probability of finding an SWW score of i or more on a chosen
        position equals the i-th element of the returned list.

    """
    cache_key = (size, total_hits, half_window, perms)
    if cache_key not in PS_CACHE:
        rnd_ps = numpy.zeros((perms, total_hits + 1))
        for i in range(perms):
            # Draw a random distribution of cross-link events in a group with
            # group size = `size` and number of cross-link events = `total_hits`.
            # pylint: disable=no-member
            rnd_hits = Counter(numpy.random.randint(size, size=total_hits))
            # Sum the random scores within windows and compute the cumulative
            # probabilities: element j is the probability that a position has
            # an SWW score of j or more:
            scores_sww = _sum_within_window_nopos(rnd_hits.items(), half_window=half_window)
            rnd_ps[i, :] = cumulative_prob(scores_sww, total_hits)

        rnd_dist = numpy.mean(rnd_ps, axis=0) + numpy.std(rnd_ps, axis=0)
        # Adding std can make probability higher than 1, which is nonsense. Fix:
        rnd_dist_fixed = [min(1.0, prob) for prob in rnd_dist]
        PS_CACHE[cache_key] = rnd_dist_fixed
    return PS_CACHE[cache_key]
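
# An illustrative call: the background for a 1000 nt group with 50 cross-link
# events and half_window=3. The returned list has 51 elements, where element
# i is the (capped) probability of observing an SWW score >= i:
#     bg = get_avg_rnd_distrib(1000, 50, 3, perms=100)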


def _process_group(pos_scores, group_size, half_window, perms):
    """
    Assign an FDR value to each position in the group.

    One is given a region of size N, with a given distribution of K
    cross-link events (scores). How likely is it that the observed
    distribution of scores is random? Additionally, scores are summed within
    windows of half-size ``half_window``.

    Let's do an example, with region size 5 and 3 cross-link events::

         0 0 2 0 1
        |_|_|_|_|_|
        pos_scores = [(0, 0), (1, 0), (2, 2), (3, 0), (4, 1)]

    First, do the window summing. This is done by function
    ``_sum_within_window``. If the half-window size is half_window=1, the
    result is::

         0 2 2 3 1
        |_|_|_|_|_|
        pos_scores_sww = [(0, 0), (1, 2), (2, 2), (3, 3), (4, 1)]

    For *this* particular situation, what is the probability of finding a
    position in the region with an SWW score of 3 or more? Only one position
    has an SWW score of 3 or more. Since there are 5 positions, the
    probability of observing an SWW score of 3 or more is 1/5 = 0.2. What
    about 2 or more? There are 2 positions with SWW score 2 and one position
    with score 3, so 3 positions in total. The probability of observing a
    position with an SWW score of 2 or more is 3/5 = 0.6. This way, one can
    make the calculation for all cases::

        Prob (>= 0) = 5 / 5 = 1.0
        Prob (>= 1) = 4 / 5 = 0.8
        Prob (>= 2) = 3 / 5 = 0.6
        Prob (>= 3) = 1 / 5 = 0.2

    These probabilities can be put in a list: [1.0, 0.8, 0.6, 0.2]. Storing
    results in such a list has a nice property::

        Prob (>= i) = list[i]

    Function ``cumulative_prob`` produces exactly such a list from SWW
    scores. This is the "cumulative_prob" for the given example. To produce a
    random reference to which this can be compared, function
    ``get_avg_rnd_distrib`` is used. Then, random and observed
    "cumulative_prob" are compared, and FDR scores for each cross-link
    position are derived. More can be read in article [1] or in the code
    comments.

    [1] Yeo, G.W. et al. An RNA code for the FOX2 splicing regulator revealed
    by mapping RNA-protein interactions in stem cells. Nat. Struct. Mol. Biol.
    16, 130–137 (2009).

    Parameters
    ----------
    pos_scores : list
        List with (position, score) elements.
    group_size : int
        Size of the region.
    half_window : int
        Half-window size.
    perms : int
        Number of permutations for the random background.

    Returns
    -------
    list
        Tuples of (position, score, sww_score, fdr_value) for all cross-link
        positions in the group.

    """
    # Count the number of all cross-link events in the group:
    sum_scores = math.ceil(sum([score for _, score in pos_scores]))

    # Calculate the observed cumulative_prob:
    pos_scores_sww = _sum_within_window(pos_scores, half_window=half_window)
    positions, scores_sww = zip(*pos_scores_sww)
    max_val = math.ceil(max(scores_sww))
    observed = cumulative_prob(scores_sww, sum_scores)

    # Calculate the random cumulative_prob for given group_size and sum_scores:
    random_ = get_avg_rnd_distrib(group_size, sum_scores, half_window, perms=perms)

    # This step follows article [1] to produce FDR values. First, produce a
    # mapping from sww_scores to FDR values:
    sww2fdr = [min(1.0, rnd / true_) for rnd, true_ in zip(random_, observed[:max_val + 1])]

    # Compute the FDR score for each position, based on its sww_score:
    fdr_scores = [sww2fdr[round(sww_score)] for _, sww_score in pos_scores_sww]

    positions, scores = zip(*pos_scores)
    return zip(positions, scores, scores_sww, fdr_scores)
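
# An illustrative call with the docstring example (the FDR values depend on
# the random background, so they vary with rnd_seed and perms):
#     rows = list(_process_group([(0, 0), (1, 0), (2, 2), (3, 0), (4, 1)],
#                                group_size=5, half_window=1, perms=100))
#     # each row is (position, score, sww_score, fdr_value)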


def run(annotation, sites, peaks, scores=None, features=None, group_by='gene_id',
        merge_features=False, half_window=3, fdr=0.05, perms=100, rnd_seed=42,
        report_progress=False):
    """
    Find positions with high density of cross-linked sites.

    When determining feature.name, the value of the first existing attribute
    in the following tuple is taken::

        ("ID", "gene_name", "transcript_id", "gene_id", "Parent")

    Source in pybedtools:
    https://github.com/daler/pybedtools/blob/master/pybedtools/scripts/annotate.py#L34

    Parameters
    ----------
    annotation : str
        Annotation file in GTF format, obtained from "iCount segment" command.
    sites : str
        File with cross-links in BED6 format.
    peaks : str
        File name for "peaks" output. It reports positions with a significant
        number of cross-link events. It should have a .bed or .bed.gz
        extension.
    scores : str
        File name for "scores" output. It reports all cross-link events,
        regardless of their FDR score. It should have a .tsv, .csv, .txt or
        .gz extension.
    features : list_str
        Features from annotation to consider. If None, ['gene'] is used.
        Sometimes it is advised to use ['gene', 'intergenic'].
    group_by : str
        Attribute by which cross-link positions are grouped.
    merge_features : bool
        Treat all features as one when grouping. Has no effect when only one
        feature is given in the features parameter.
    half_window : int
        Half-window size.
    fdr : float
        FDR threshold.
    perms : int
        Number of permutations when calculating random distribution.
    rnd_seed : int
        Seed for random generator.
    report_progress : bool
        Report analysis progress.

    Returns
    -------
    iCount.Metrics
        Analysis metadata.

    """
    iCount.log_inputs(LOGGER, level=logging.INFO)
    metrics = iCount.Metrics()

    if features is None:
        features = ['gene']
    assert peaks.endswith(('.bed', '.bed.gz'))
    if scores:
        assert scores.endswith(('.tsv', '.tsv.gz', '.csv', '.csv.gz', '.txt', '.txt.gz'))
    numpy.random.seed(rnd_seed)  # pylint: disable=no-member

    LOGGER.info('Loading annotation file...')
    annotation2 = iCount.files.decompress_to_tempfile(annotation)
    if annotation2 != annotation:
        to_delete_temp = annotation2
        annotation = annotation2
    else:
        to_delete_temp = None
    annotation = pybedtools.BedTool(annotation).saveas()
    metrics.annotation_all = len(annotation)
    annotation = annotation.filter(lambda x: x[2] in features).sort().saveas()
    metrics.annotation_used = len(annotation)
    metrics.annotation_skipped = metrics.annotation_all - metrics.annotation_used
    LOGGER.info('%d out of %d annotation records will be used (%d skipped).',
                metrics.annotation_used, metrics.annotation_all, metrics.annotation_skipped)

    LOGGER.info('Loading cross-links file...')
    sites = pybedtools.BedTool(sites).sort().saveas()

    # Intersect cross-linked sites with annotated regions:
    LOGGER.info('Calculating intersection between annotation and cross-link file...')
    overlaps = annotation.intersect(sites, sorted=True, s=True, wo=True).saveas()

    groups = {}
    group_sizes = {}
    multi_mode = len(features) > 1 and not merge_features
    LOGGER.info('Processing intersections...')
    for feature in overlaps:
        chrom = feature.chrom
        start = feature.start
        end = feature.stop
        name = feature.name
        strand = feature.strand
        site_chrom = feature.fields[9]
        site_pos = int(feature.fields[10])
        site_end = int(feature.fields[11])
        site_dot = feature.fields[12]
        site_score = float(feature.fields[13])
        site_strand = feature.fields[14]
        assert site_chrom == chrom
        assert site_strand == strand
        assert site_dot == '.'
        assert site_pos == site_end - 1

        # Determine group_id, depending on multi_mode:
        group_id = feature.attrs[group_by]
        if multi_mode:
            group_id = feature[2] + '_' + group_id

        groups.setdefault((chrom, strand, group_id, name), []).append((site_pos, site_score))
        group_sizes.setdefault((chrom, strand, group_id, name), set()).add((start, end))

    # Validate that segments in the same group do not overlap: the start of
    # the next feature must be greater than the stop of the current one:
    for sizes in group_sizes.values():
        sizes = sorted(sizes)
        for first, second in zip(sizes, sizes[1:]):
            assert first[1] < second[0]

    # Calculate the total length of each group by summing element sizes:
    group_sizes = dict([(name, sum([end - start for start, end in elements]))
                        for name, elements in group_sizes.items()])

    # Calculate and assign FDRs to each cross-linked site. FDR values are
    # calculated together for each group.
    results = {}
    metrics.all_groups = len(groups)
    progress, j = 0, 0
    for (chrom, strand, group_id, name), hits in sorted(groups.items()):
        j += 1
        if report_progress:
            new_progress = j / metrics.all_groups
            # pylint: disable=protected-access
            progress = iCount._log_progress(new_progress, progress, LOGGER)

        group_size = group_sizes[(chrom, strand, group_id, name)]

        # Crucial step: each position in a group is given an fdr_score, based
        # on the hits in the group, group_size, half-window size and number
        # of permutations. Then, FDR scores (+ some other info) are written
        # to the `results` container:
        processed = _process_group(hits, group_size, half_window, perms)
        for (pos, val, val_extended, fdr_score) in processed:
            results.setdefault((chrom, pos, strand), []).\
                append((fdr_score, name, group_id, val, val_extended))

    metrics.positions_annotated = len(results)

    # Cross-linked sites outside annotated regions:
    LOGGER.info('Determining cross-links not intersecting with annotation...')
    skipped = sites.intersect(annotation, sorted=True, s=True, v=True).saveas()
    for feature in skipped:
        site_chrom = feature.chrom
        site_start = feature.start
        site_end = feature.stop
        # site_name = feature.name
        site_score = feature.score
        site_strand = feature.strand
        assert site_start == site_end - 1
        k = (site_chrom, site_start, site_strand)
        assert k not in results
        results.setdefault(k, []).\
            append((1.0, 'not_annotated', 'not_annotated', site_score, 'not_calculated'))
    metrics.positions_all = len(results)
    metrics.positions_not_annotated = metrics.positions_all - metrics.positions_annotated

    LOGGER.info('Peaks calculation finished. Writing results to files...')

    # Make peaks: a BED6 file with only the most significant cross-links:
    metrics.significant_positions = 0
    with iCount.files.gz_open(peaks, 'wt') as peaks:
        for (chrom, pos, strand), annot_list in sorted(results.items()):
            annot_list = sorted(annot_list)
            # Report the minimum fdr_score for each position in BED6:
            min_fdr_score = annot_list[0][0]
            if min_fdr_score < fdr:
                metrics.significant_positions += 1
                # Position has significant records - report the most
                # significant ones:
                min_fdr_records = [rec for rec in annot_list if rec[0] == min_fdr_score]
                _, names, group_ids, group_scores, _ = zip(*min_fdr_records)
                if names == group_ids:
                    name = ','.join(names)
                else:
                    name = ','.join(names) + '-' + ','.join(group_ids)
                line = [chrom, pos, pos + 1, name, group_scores[0], strand]
                peaks.write('\t'.join([_f2s(i, dec=4) for i in line]) + '\n')
    LOGGER.info('BED6 file with significant peaks saved to: %s', peaks.name)

    # Make scores: a tab-separated file with ALL cross-links (no significance
    # threshold):
    header = ['chrom', 'position', 'strand', 'name', 'group_id', 'score',
              'score_extended', 'FDR']
    if scores:
        with iCount.files.gz_open(scores, 'wt') as scores:
            scores.write('\t'.join(header) + '\n')
            for (chrom, pos, strand), annot_list in sorted(results.items()):
                for (fdr_score, name, group_id, score, val_extended) in sorted(annot_list):
                    line = [chrom, pos, strand, name, group_id, score,
                            val_extended, fdr_score]
                    scores.write('\t'.join([_f2s(i, dec=6) for i in line]) + '\n')
        LOGGER.info('Scores for each cross-linked position saved to: %s', scores.name)

    if to_delete_temp:
        os.remove(to_delete_temp)
    LOGGER.info('Done.')
    return metrics