crested.tl.contribution_scores
- crested.tl.contribution_scores(input, target_idx, model, method='expected_integrated_grad', genome=None, transpose=False, all_class_names=None, batch_size=128, output_dir=None, seed=42, verbose=True)
Calculate contribution scores for the specified inputs using the given method.
If multiple models are provided, the contribution scores are averaged across all models.
These scores can then be plotted with the plotting function contribution_scores() to visualize the importance of each base in the sequence.
- Parameters:
input (str | list[str] | np.array | AnnData) – Input data to calculate the contribution scores for. Can be a (list of) sequence(s), a (list of) region name(s), a matrix of one hot encodings (N, L, 4), or an AnnData object with region names as its var_names.
target_idx (int | list[int] | None) – Index or indices of the target class(es) to calculate the contribution scores for. If this is an empty list, the contribution scores for the ‘combined’ class will be calculated. If this is None, the contribution scores for all classes will be calculated. You can get the index of a class of interest by running list(anndata.obs_names).index(class_name).
model (keras.Model | list[keras.Model]) – A (list of) trained keras model(s) to calculate the contribution scores for.
method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’, ‘saliency_map’.
genome (Genome | os.PathLike | None (default: None)) – Genome or path to the genome fasta. Required if no genome is registered and input is an AnnData object or region names.
transpose (bool (default: False)) – Transpose the contribution scores to (N, C, 4, L) and the one-hot sequences to (N, 4, L) (for compatibility with MoDISco).
all_class_names (list[str] | None (default: None)) – Optional list of all class names in the dataset. If provided and output_dir is not None, these names are used to name the output files.
batch_size (int (default: 128)) – Maximum number of input sequences to predict at once when calculating scores. Useful for methods like ‘integrated_grad’, which also calculate 25 background sequence contributions together with the sequence’s contributions in one batch.
output_dir (os.PathLike | None (default: None)) – Path to the output directory in which to save the contribution scores and one-hot sequences. A separate npz file is created per class.
seed (int | None (default: 42)) – Seed to use for shuffling regions. Only used in ‘expected_integrated_grad’.
verbose (bool (default: True)) – Boolean controlling the logs and the tqdm progress display for the calculations; set to False to disable them.
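The index lookup described for target_idx can be sketched as follows. This is a minimal illustration only: the class names are made up, a plain Python list stands in for anndata.obs_names, and class_indices is a hypothetical helper, not part of the library.

```python
# Hypothetical class names standing in for anndata.obs_names.
obs_names = ["Astro", "Endo", "L2_3IT", "Micro"]


def class_indices(names, wanted):
    """Return the position of each wanted class name within names."""
    names = list(names)  # mirrors list(anndata.obs_names)
    return [names.index(name) for name in wanted]


# Indices that could be passed as target_idx for two classes of interest.
target_idx = class_indices(obs_names, ["Endo", "Micro"])
print(target_idx)  # [1, 3]
```

Note that list.index raises a ValueError for a name that is not present, which surfaces typos in class names early.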
- Return type:
tuple[np.ndarray, np.ndarray]
- Returns:
Contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4).
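To make the documented array layouts concrete, the sketch below builds placeholder arrays with the default shapes and swaps the last two axes to obtain the layouts described for transpose=True. It uses only NumPy and random data; the sizes N, C and L are arbitrary, and nothing here calls the library itself.

```python
import numpy as np

N, C, L = 2, 3, 10  # sequences, classes, sequence length (illustrative values)
rng = np.random.default_rng(0)

# Placeholder arrays in the documented default layouts.
scores = rng.normal(size=(N, C, L, 4))                  # (N, C, L, 4)
one_hots = np.eye(4)[rng.integers(0, 4, size=(N, L))]   # (N, L, 4)

# Swap the last two axes to get the layouts described for transpose=True.
scores_t = np.transpose(scores, (0, 1, 3, 2))           # (N, C, 4, L)
one_hots_t = np.transpose(one_hots, (0, 2, 1))          # (N, 4, L)

print(scores_t.shape, one_hots_t.shape)  # (2, 3, 4, 10) (2, 4, 10)
```

The same axis swap applied again restores the default layouts, so arrays saved in either form can be converted without loss.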