crested.utils.calculate_nucleotide_distribution

crested.utils.calculate_nucleotide_distribution(input, genome=None, per_position=False, n_regions=None)

Calculate the nucleotide distribution of a genome in a set of regions or sequences.

Parameters:
  • input (str | list[str] | ndarray | AnnData) – Input data to calculate the ACGT distribution of. Can be a (list of) sequence(s), a (list of) region name(s), a matrix of one-hot encodings of shape (N, L, 4), or an AnnData object with region names as its var_names.

  • genome (Union[Genome, PathLike, None] (default: None)) – The genome object or path to the genome fasta file. Required if input is a region or AnnData.

  • per_position (bool (default: False)) – If True, calculate the nucleotide distribution per position in the sequence instead of over the whole sequence.

  • n_regions (Optional[int] (default: None)) – Randomly sample n_regions from the input. If None, all inputs are used. This is useful for large datasets to speed up the calculation.
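
The effect of n_regions can be pictured with a small NumPy sketch. This is an illustration of random subsampling in general, not crested's implementation: the helper name `sample_regions_sketch`, its `seed` argument, and the `chr:start-end` region names are all assumptions made for the example.

```python
import numpy as np


def sample_regions_sketch(regions, n_regions=None, seed=0):
    """Hypothetical helper (not part of crested): pick n_regions at random."""
    # With n_regions=None (or more than available), keep every input.
    if n_regions is None or n_regions >= len(regions):
        return list(regions)
    rng = np.random.default_rng(seed)
    # Sample indices without replacement so no region is counted twice.
    idx = rng.choice(len(regions), size=n_regions, replace=False)
    return [regions[i] for i in idx]


# Assumed region-name format, used here only for illustration.
regions = ["chr1:100-200", "chr1:300-400", "chr2:50-150", "chr3:10-110"]
subset = sample_regions_sketch(regions, n_regions=2)
```

A distribution computed on such a subset is an estimate of the distribution over all inputs, which is the trade-off behind the speed-up.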

Return type:

ndarray

Returns:

If per_position is False, an array of floats of shape (4,) holding the nucleotide distribution in the order A, C, G, T. Otherwise, an array of shape (L, 4) holding the nucleotide distribution at each position.
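
To make the two output shapes concrete, here is a minimal NumPy sketch for one-hot input of shape (N, L, 4). It is not crested's code: `nucleotide_distribution_sketch` is a hypothetical helper, and normalising counts by their total is an assumption about how the distribution is defined.

```python
import numpy as np


def nucleotide_distribution_sketch(one_hot, per_position=False):
    """Hypothetical helper (not part of crested) for one-hot input (N, L, 4)."""
    one_hot = np.asarray(one_hot, dtype=float)
    if per_position:
        # Count each base at each position across sequences: shape (L, 4).
        counts = one_hot.sum(axis=0)
        totals = counts.sum(axis=1, keepdims=True)
    else:
        # Count each base over all sequences and positions: shape (4,).
        counts = one_hot.sum(axis=(0, 1))
        totals = counts.sum()
    # Assumes every position has at least one encoded base (no all-zero rows).
    return counts / totals


# Two toy sequences, one-hot encoded in the order A, C, G, T.
alphabet = "ACGT"
seqs = ["ACGT", "AAGT"]
one_hot = np.array(
    [[[float(base == letter) for letter in alphabet] for base in seq] for seq in seqs]
)

overall = nucleotide_distribution_sketch(one_hot)                     # shape (4,)
per_pos = nucleotide_distribution_sketch(one_hot, per_position=True)  # shape (4, 4)
```

For these two sequences the overall result is [0.375, 0.125, 0.25, 0.25] (3 A, 1 C, 2 G and 2 T out of 8 bases), while the second row of the per-position result is [0.5, 0.5, 0, 0] because the second base is C in one sequence and A in the other.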