crested.tl.contribution_scores_specific


crested.tl.contribution_scores_specific(input, target_idx, model, genome=None, method='expected_integrated_grad', transpose=True, batch_size=128, output_dir=None, verbose=True)

Calculate contribution scores based on given method only for the most specific regions per class.

Unlike contribution_scores(), this function calculates only one set of contribution scores per region per class. It expects the user to have run sort_and_filter_regions_on_specificity() beforehand.

If multiple models are provided, the contribution scores will be averaged across all models.
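As a rough illustration of the averaging described above, the following NumPy sketch takes one score array per model, all of the same shape, and computes their element-wise mean. The arrays and the helper name average_scores are hypothetical stand-ins; this is not the library's internal code, which may handle the averaging differently.

```python
import numpy as np


def average_scores(per_model_scores):
    """Element-wise mean of equally shaped score arrays, one array per model."""
    stacked = np.stack(per_model_scores, axis=0)  # (n_models, N, 1, L, 4)
    return stacked.mean(axis=0)  # (N, 1, L, 4)


# Two hypothetical models, 3 regions, 1 class, sequence length 5
rng = np.random.default_rng(0)
scores_a = rng.normal(size=(3, 1, 5, 4))
scores_b = rng.normal(size=(3, 1, 5, 4))

averaged = average_scores([scores_a, scores_b])
print(averaged.shape)
```

The averaged array keeps the shape of a single model's output, so downstream code does not need to know how many models were used.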

These scores can then be plotted to visualize the importance of each base in the sequence using contribution_scores().

Parameters:
  • input (AnnData) – Input anndata to calculate the contribution scores for. Should have a ‘Class name’ column in .var.

  • target_idx (int | list[int] | None) – Index/indices of the target class(es) to calculate the contribution scores for. If this is an empty list, the contribution scores for the ‘combined’ class will be calculated. If this is None, the contribution scores for all classes will be calculated. You can get these for your classes of interest by running list(anndata.obs_names).index(class_name).

  • model (Model | list[Model]) – A (list of) trained keras model(s) to calculate the contribution scores for.

  • genome (Union[Genome, PathLike, None] (default: None)) – Genome or Path to the genome file. Required if no genome is registered.

  • method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’, ‘saliency_map’.

  • transpose (bool (default: True)) – Transpose the contribution scores to (N, C, 4, L) and the one-hot encoded sequences to (N, 4, L) for compatibility with MoDISco. Defaults to True here since that is the layout MoDISco expects.

  • batch_size (int (default: 128)) – Maximum number of input sequences to predict at once when calculating scores. Useful for methods like ‘integrated_grad’ which also calculate 25 background sequence contributions together with the sequence’s contributions in one batch. Default is 128.

  • output_dir (Optional[PathLike] (default: None)) – Path to the output directory to save the contribution scores and one hot seqs. Will create a separate npz file per class.

  • verbose (bool (default: True)) – Whether to show the progress of the calculations using tqdm. Set to False to disable the progress output.
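The index lookup suggested for target_idx can be sketched as follows. A plain Python list stands in for list(anndata.obs_names), and both the class names and the helper resolve_target_idx are made up for illustration; they are not part of the library.

```python
# Stand-in for list(anndata.obs_names); these class names are hypothetical
class_names = ["Astro", "Endo", "L2_3IT", "Micro"]


def resolve_target_idx(classes_of_interest, class_names):
    """Map class names to integer positions, as list(obs_names).index(name) does."""
    return [class_names.index(name) for name in classes_of_interest]


target_idx = resolve_target_idx(["Endo", "Micro"], class_names)
print(target_idx)  # [1, 3]
```

Note that list.index raises a ValueError for a name that is not present, which surfaces typos in class names early.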

Return type:

tuple[ndarray, ndarray]

Returns:

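Contribution scores of shape (N, 1, L, 4) and one-hot encoded sequences of shape (N, L, 4). With transpose=True, the last two axes are swapped, giving (N, 1, 4, L) and (N, 4, L). Since each region is specific to a class, the contribution scores are only calculated for that class.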
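To make the two axis layouts concrete, here is a small NumPy sketch that converts between the (N, 1, L, 4) / (N, L, 4) layout and the (N, 1, 4, L) / (N, 4, L) layout described for the transpose parameter. The arrays are zero-filled placeholders, not real output of this function.

```python
import numpy as np

# Placeholder arrays: 2 regions, 1 class, sequence length 6, 4 nucleotides
scores = np.zeros((2, 1, 6, 4))
one_hots = np.zeros((2, 6, 4))

# Swap the length and nucleotide axes
scores_t = np.transpose(scores, (0, 1, 3, 2))  # (N, 1, L, 4) -> (N, 1, 4, L)
one_hots_t = np.transpose(one_hots, (0, 2, 1))  # (N, L, 4) -> (N, 4, L)

print(scores_t.shape, one_hots_t.shape)  # (2, 1, 4, 6) (2, 4, 6)
```

Applying the same permutation a second time restores the original layout, which is convenient when a plotting routine and a motif-discovery tool expect different axis orders.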