crested.utils.calculate_nucleotide_distribution
- crested.utils.calculate_nucleotide_distribution(input, genome=None, per_position=False, n_regions=None)
Calculate the nucleotide distribution of a genome in a set of regions or sequences.
- Parameters:
input (str | list[str] | ndarray | AnnData) – Input data to calculate the ACGT distribution of. Can be a (list of) sequence(s), a (list of) region name(s), a matrix of one hot encodings (N, L, 4), or an AnnData object with region names as its var_names.
genome (Union[Genome, PathLike, None] (default: None)) – The genome object or path to the genome fasta file. Required if input is a region or AnnData.
per_position (bool (default: False)) – If True, calculate the nucleotide distribution per position in the sequence instead of over the whole sequence.
n_regions (Optional[int] (default: None)) – Randomly sample n_regions from the input. If None, all inputs are used. This is useful for large datasets to speed up the calculation.
- Return type:
ndarray
- Returns:
If per_position is False, an array of floats of shape (4,) holding the nucleotide distribution in the order A, C, G, T. If per_position is True, an array of shape (L, 4) holding the nucleotide distribution at each of the L sequence positions.
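
To make the two return shapes concrete, the following is a minimal pure-Python sketch of the kind of calculation described above. It is an illustrative stand-in, not the crested implementation: the function name `nucleotide_distribution_sketch` is hypothetical, it accepts only plain sequence strings (no region names, one-hot arrays, or AnnData), it returns nested lists rather than an ndarray, and it assumes that characters other than A, C, G, T are skipped, which may differ from how the library treats them.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"  # output order: A, C, G, T


def _fractions(counts):
    """Turn a Counter of bases into [A, C, G, T] fractions."""
    total = sum(counts[base] for base in NUCLEOTIDES)
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]
    return [counts[base] / total for base in NUCLEOTIDES]


def nucleotide_distribution_sketch(sequences, per_position=False):
    """Hypothetical helper: ACGT fractions over sequences or per position."""
    if isinstance(sequences, str):
        sequences = [sequences]
    sequences = [seq.upper() for seq in sequences]

    if not per_position:
        # Pool every base from every sequence: one vector of length 4.
        pooled = Counter(
            base for seq in sequences for base in seq if base in NUCLEOTIDES
        )
        return _fractions(pooled)

    # Per-position mode only makes sense for equal-length sequences.
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("per_position=True requires equal-length sequences")

    # One [A, C, G, T] row per position: L rows of 4 values.
    return [
        _fractions(Counter(seq[pos] for seq in sequences if seq[pos] in NUCLEOTIDES))
        for pos in range(length)
    ]
```

For example, calling the sketch on `["ACGT", "AAGT"]` with the default `per_position=False` gives a single 4-element vector, while `per_position=True` gives four rows, one per sequence position, each summing to 1.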