crested.tl.Crested#

class crested.tl.Crested(data, model=None, config=None, project_name=None, run_name=None, logger=None, seed=None)#

Main class to handle training, testing, predicting and calculation of contribution scores.

Parameters:
  • data (AnnDataModule) – AnnDataModule object containing the data.

  • model (Optional[Model] (default: None)) – Model architecture to use for training.

  • config (Optional[TaskConfig] (default: None)) – Task configuration (optimizer, loss, and metrics) for use in tl.Crested.

  • project_name (Optional[str] (default: None)) – Name of the project. Used for logging and creating output directories. If not provided, the default project name “CREsted” will be used.

  • run_name (Optional[str] (default: None)) – Name of the run. Used for wandb logging and creating output directories. If not provided, the current date and time will be used.

  • logger (Optional[str] (default: None)) – Logger to use for logging. Can be “wandb”, “tensorboard”, or “dvc” (tensorboard is not implemented yet). If not provided, no additional logging will be done.

  • seed (Optional[int] (default: None)) – Seed to use for reproducibility. WARNING: this doesn’t make everything fully reproducible, especially on GPU. Some (GPU) operations are non-deterministic and simply can’t be controlled by the seed.

Examples

>>> from crested.tl import Crested
>>> from crested.tl import default_configs
>>> from crested.tl.data import AnnDataModule
>>> from crested.tl.zoo import deeptopic_cnn
>>> # Load data
>>> anndatamodule = AnnDataModule(anndata, genome="path/to/genome.fa")
>>> model_architecture = deeptopic_cnn(seq_len=1000, n_classes=10)
>>> configs = default_configs("topic_classification")
>>> # Initialize trainer
>>> trainer = Crested(
...     data=anndatamodule,
...     model=model_architecture,
...     config=configs,
...     project_name="test",
... )
>>> # Fit the model
>>> trainer.fit(epochs=100)
>>> # Evaluate the model
>>> trainer.test()
>>> # Make predictions and add them to anndata as a .layers attribute
>>> trainer.predict(anndata, model_name="predictions")
>>> # Calculate contribution scores
>>> scores, seqs_one_hot = trainer.calculate_contribution_scores_regions(
...     region_idx="chr1:1000-2000",
...     class_names=["class1", "class2"],
...     method="integrated_grad",
... )

Methods table#

calculate_contribution_scores(class_names[, ...])

Calculate contribution scores based on the given method for the full dataset.

calculate_contribution_scores_enhancer_design(...)

Calculate contribution scores of enhancer design.

calculate_contribution_scores_regions(...[, ...])

Calculate contribution scores based on given method for a specified region.

calculate_contribution_scores_sequence(...)

Calculate contribution scores based on given method for a specified sequence.

enhancer_design_in_silico_evolution(n_mutations)

Create synthetic enhancers for a specified class using in silico evolution (ISE).

enhancer_design_motif_implementation(patterns)

Create synthetic enhancers for a specified class using motif implementation.

fit([epochs, mixed_precision, ...])

Fit the model on the training and validation set.

get_embeddings([layer_name, anndata])

Extract embeddings from a specified layer in the model for all regions in the dataset.

load_model(model_path[, compile])

Load a (pretrained) model from a file.

predict([anndata, model_name])

Make predictions using the model on the full dataset.

predict_regions(region_idx)

Make predictions using the model on the specified region(s).

predict_sequence(sequence)

Make predictions using the model on the provided DNA sequence.

score_gene_locus(chr_name, gene_start, ...)

Score regions upstream and downstream of a gene locus using the model's prediction.

test([return_metrics])

Evaluate the model on the test set.

tfmodisco_calculate_and_save_contribution_scores(adata)

Calculate and save contribution scores for all regions in adata.var.

tfmodisco_calculate_and_save_contribution_scores_sequences(...)

Calculate and save contribution scores for the sequence(s).

transferlearn([epochs_first_phase, ...])

Perform transfer learning on the model.

Methods#

Crested.calculate_contribution_scores(class_names, anndata=None, method='expected_integrated_grad')#

Calculate contribution scores based on the given method for the full dataset.

These scores can then be plotted to visualize the importance of each base in the dataset using contribution_scores().

Parameters:
  • class_names (list[str]) – List of class names to calculate the contribution scores for (should match anndata.obs_names). If the list is empty, the contribution scores for the ‘combined’ class will be calculated.

  • anndata (Optional[AnnData] (default: None)) – Anndata object to store the contribution scores in as a .varm[class_name] attribute. If None, will only return the contribution scores without storing them.

  • method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.

Return type:

tuple[ndarray, ndarray] | None

Returns:

Contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4) or None if anndata is provided.
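
To make the ‘mutagenesis’ option and the trailing (L, 4) axes of the returned scores more concrete, the following is a minimal, library-independent sketch of in silico saturation mutagenesis for one sequence and one class. The names toy_model and mutagenesis_scores, and the A, C, G, T column order, are assumptions made for illustration; this is not CREsted's implementation.

```python
# Hypothetical sketch: saturation mutagenesis scores for a single sequence.
BASES = "ACGT"  # assumed column order of the last axis


def toy_model(seq: str) -> float:
    """Stand-in for a trained model: counts occurrences of 'TGA'."""
    return float(sum(seq[i:i + 3] == "TGA" for i in range(len(seq) - 2)))


def mutagenesis_scores(seq: str, model=toy_model) -> list[list[float]]:
    """Return an (L, 4) table with the prediction change of every single-base substitution."""
    reference = model(seq)
    scores = []
    for pos in range(len(seq)):
        row = []
        for base in BASES:
            mutated = seq[:pos] + base + seq[pos + 1:]
            row.append(model(mutated) - reference)
        scores.append(row)
    return scores


scores = mutagenesis_scores("ACTGAC")
print(len(scores), len(scores[0]))  # 6 4
```

Each row corresponds to one position and each column to one candidate base, so stacking such tables over classes and sequences gives an array with the (N, C, L, 4) layout described above.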

Crested.calculate_contribution_scores_enhancer_design(enhancer_design_intermediate, class_names=None, method='expected_integrated_grad', disable_tqdm=False)#

Calculate contribution scores of enhancer design.

These scores can then be plotted to visualize the importance of each base in the region using enhancer_design_steps_contribution_scores().

Parameters:
  • enhancer_design_intermediate (list[dict]) – Intermediate output from enhancer design when return_intermediate is True

  • class_names (Optional[list[str]] (default: None)) – List of class names to calculate the contribution scores for (should match anndata.obs_names). If None, the contribution scores for the ‘combined’ class will be calculated.

  • method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.

  • disable_tqdm (bool (default: False)) – If True, disables the tqdm progress bar shown during the calculations.

Return type:

tuple[ndarray, ndarray] | list[tuple[ndarray, ndarray]]

Returns:

A tuple of arrays or a list of tuples of arrays of contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4).

Examples

>>> scores, onehot = crested.calculate_contribution_scores_enhancer_design(
...     enhancer_design_intermediate,
...     class_names=["cell_type_A"],
...     method="expected_integrated_grad",
... )

Crested.calculate_contribution_scores_regions(region_idx, class_names, method='expected_integrated_grad', disable_tqdm=False)#

Calculate contribution scores based on given method for a specified region.

These scores can then be plotted to visualize the importance of each base in the region using contribution_scores().

Parameters:
  • region_idx (list[str] | str) – Region(s) for which to calculate the contribution scores in the format “chr:start-end” or “chr:start-end:strand”.

  • class_names (list[str]) – List of class names to calculate the contribution scores for (should match anndata.obs_names). If the list is empty, the contribution scores for the ‘combined’ class will be calculated.

  • method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.

  • disable_tqdm (bool (default: False)) – If True, disables the tqdm progress bar shown during the calculations.

Return type:

tuple[ndarray, ndarray]

Returns:

Contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4).
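
The two accepted region formats can be illustrated with a small parser. This is only a sketch: parse_region and as_region_list are hypothetical helpers, not part of CREsted, and defaulting a missing strand to ‘+’ is an assumption.

```python
# Hypothetical helpers illustrating the "chr:start-end[:strand]" format.
def parse_region(region: str) -> tuple[str, int, int, str]:
    """Parse 'chr:start-end' or 'chr:start-end:strand' into (chrom, start, end, strand)."""
    parts = region.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"Unexpected region format: {region!r}")
    chrom, span = parts[0], parts[1]
    strand = parts[2] if len(parts) == 3 else "+"  # assumption: default to '+'
    start_str, end_str = span.split("-")
    start, end = int(start_str), int(end_str)
    if strand not in ("+", "-") or start >= end:
        raise ValueError(f"Invalid region: {region!r}")
    return chrom, start, end, strand


def as_region_list(region_idx) -> list[str]:
    """Accept a single string or a list of strings, as region_idx does."""
    return [region_idx] if isinstance(region_idx, str) else list(region_idx)


for region in as_region_list("chr1:1000-2000"):
    print(parse_region(region))  # ('chr1', 1000, 2000, '+')
```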

Crested.calculate_contribution_scores_sequence(sequences, class_names, method='expected_integrated_grad', disable_tqdm=False)#

Calculate contribution scores based on given method for a specified sequence.

These scores can then be plotted to visualize the importance of each base in the sequence using contribution_scores().

Parameters:
  • sequences – Sequence(s) for which to calculate the contribution scores.

  • class_names (list[str]) – List of class names to calculate the contribution scores for (should match anndata.obs_names). If the list is empty, the contribution scores for the ‘combined’ class will be calculated.

  • method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.

  • disable_tqdm (bool (default: False)) – If True, disables the tqdm progress bar shown during the calculations.

Return type:

tuple[ndarray, ndarray]

Returns:

Contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4).
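
As an illustration of the (L, 4) one-hot layout and of how it pairs with a score table of the same shape, here is a minimal sketch in plain Python. The A, C, G, T channel order, the all-zero encoding of unknown bases, and both helper names are assumptions for this example rather than CREsted's own code.

```python
# Hypothetical sketch: one-hot encoding and masking scores to the observed base.
BASES = "ACGT"  # assumed channel order


def one_hot_encode(seq: str) -> list[list[int]]:
    """Encode a DNA string as an (L, 4) table; unknown bases (e.g. N) become all zeros."""
    return [[int(base == b) for b in BASES] for base in seq.upper()]


def observed_base_scores(scores, one_hot) -> list[float]:
    """Keep, for each position, only the score of the base that is actually present."""
    return [
        sum(s * h for s, h in zip(score_row, hot_row))
        for score_row, hot_row in zip(scores, one_hot)
    ]


one_hot = one_hot_encode("AC")
toy_scores = [[0.5, 0.1, 0.0, 0.0], [0.2, -0.3, 0.0, 0.0]]  # made-up (L, 4) scores
print(observed_base_scores(toy_scores, one_hot))  # [0.5, -0.3]
```

Multiplying scores by the one-hot sequence in this way is a common step before drawing a sequence logo of the contributions.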

Crested.enhancer_design_in_silico_evolution(n_mutations, n_sequences=1, target_class=None, target=None, return_intermediate=False, no_mutation_flanks=None, target_len=None, enhancer_optimizer=None, starting_sequences=None, **kwargs)#

Create synthetic enhancers for a specified class using in silico evolution (ISE).

Parameters:
  • n_mutations (int) – Number of iterations

  • n_sequences (int (default: 1)) – Number of enhancers to design

  • target_class (Optional[str] (default: None)) – Class name for which the enhancers will be designed. If None, target must be specified.

  • target (Union[int, ndarray, None] (default: None)) – Target index. Must be specified when target_class is None.

  • return_intermediate (bool (default: False)) – If True, also returns a list of dictionaries with the predictions and changes made in intermediate steps for the selected sequences.

  • no_mutation_flanks (Optional[tuple] (default: None)) – A tuple of integers that determines the regions in each flank that are left unchanged.

  • target_len (Optional[int] (default: None)) – Length of the region in the center of the sequence in which changes are made. Ignored if no_mutation_flanks is supplied.

  • enhancer_optimizer (Optional[EnhancerOptimizer] (default: None)) – An instance of EnhancerOptimizer, defining how sequences should be optimized. If None, a default EnhancerOptimizer will be initialized using _weighted_difference as optimization function.

  • starting_sequences (Union[str, list, None] (default: None)) – A DNA sequence or a list of DNA sequences that will be used instead of randomly generated sequences. If provided, n_sequences is ignored.

  • kwargs (dict[str, Any]) – Keyword arguments that will be passed to the get_best function of the EnhancerOptimizer

Return type:

tuple[list[dict], list] | list

Returns:

A list of designed sequences. If return_intermediate is True, a tuple containing a list of dictionaries of intermediate mutations and predictions, followed by the list of designed sequences.
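
The general idea of in silico evolution can be sketched without the library: at every iteration, all single-base substitutions inside the mutable window are scored and the best one is kept. Everything below (toy_score, mutable_range, in_silico_evolution, and the way the flank and target_len arguments are interpreted) is an assumption made for illustration; the real method scores candidates with the trained model through an EnhancerOptimizer.

```python
# Hypothetical sketch of greedy in silico evolution with a toy scoring function.
BASES = "ACGT"


def toy_score(seq: str) -> float:
    """Stand-in for the model's prediction for the target class."""
    return float(seq.count("TGAC"))


def mutable_range(seq_len, no_mutation_flanks=None, target_len=None):
    """Half-open range of positions that may change (assumed interpretation)."""
    if no_mutation_flanks is not None:
        left, right = no_mutation_flanks
        return left, seq_len - right
    if target_len is not None:
        start = (seq_len - target_len) // 2
        return start, start + target_len
    return 0, seq_len


def in_silico_evolution(seq, n_mutations, score=toy_score,
                        no_mutation_flanks=None, target_len=None):
    """Apply the best single-base substitution n_mutations times."""
    start, end = mutable_range(len(seq), no_mutation_flanks, target_len)
    history = [(seq, score(seq))]
    for _ in range(n_mutations):
        candidates = [
            seq[:pos] + base + seq[pos + 1:]
            for pos in range(start, end)
            for base in BASES
            if base != seq[pos]
        ]
        seq = max(candidates, key=score)  # ties resolve to the first candidate
        history.append((seq, score(seq)))
    return seq, history


start_seq = "AATGAAAAAA"
designed, history = in_silico_evolution(start_seq, n_mutations=2, no_mutation_flanks=(1, 1))
print(designed, [s for _, s in history])
```

The history list plays a role similar to the intermediate output: it records the sequence and its score after every step.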

Examples

>>> (
...     intermediate_results,
...     designed_sequences,
... ) = trained_crested_object.enhancer_design_in_silico_evolution(
...     target_class="cell_type_A",
...     n_mutations=20,
...     n_sequences=1,
...     return_intermediate=True,
... )

Crested.enhancer_design_motif_implementation(patterns, n_sequences=1, target_class=None, target=None, insertions_per_pattern=None, return_intermediate=False, no_mutation_flanks=None, target_len=None, preserve_inserted_motifs=True, enhancer_optimizer=None, starting_sequences=None, **kwargs)#

Create synthetic enhancers for a specified class using motif implementation.

Parameters:
  • patterns (dict) – Dictionary of patterns to be implemented in the form of ‘pattern_name’: ‘pattern_sequence’.

  • n_sequences (int (default: 1)) – Number of enhancers to design.

  • target_class (Optional[str] (default: None)) – Class name for which the enhancers will be designed. If None, target must be specified.

  • target (Union[int, ndarray, None] (default: None)) – Target index. Must be specified when target_class is None.

  • insertions_per_pattern (Optional[dict] (default: None)) – Dictionary of the number of insertions per pattern in the form of ‘pattern_name’: number_of_insertions. If not provided, one of each pattern in patterns will be implemented.

  • return_intermediate (bool (default: False)) – If True, also returns a list of dictionaries with the predictions and changes made in intermediate steps for the selected sequences.

  • no_mutation_flanks (Optional[tuple] (default: None)) – A tuple of integers that determines the regions in each flank where no motifs are implemented.

  • target_len (Optional[int] (default: None)) – Length of the region in the center of the sequence in which motifs are implemented. Ignored if no_mutation_flanks is supplied.

  • preserve_inserted_motifs (bool (default: True)) – If True, a motif cannot be inserted on top of a previously inserted motif.

  • enhancer_optimizer (Optional[EnhancerOptimizer] (default: None)) – An instance of EnhancerOptimizer, defining how sequences should be optimized. If None, a default EnhancerOptimizer will be initialized using _weighted_difference as optimization function.

  • starting_sequences (Union[str, list, None] (default: None)) – A DNA sequence or a list of DNA sequences that will be used instead of randomly generated sequences. If provided, n_sequences is ignored.

  • kwargs (dict[str, Any]) – Keyword arguments that will be passed to the get_best function of the EnhancerOptimizer

Return type:

tuple[list[dict], list] | list

Returns:

A list of designed sequences. If return_intermediate is True, a tuple containing a list of dictionaries of intermediate mutations and predictions, followed by the list of designed sequences.
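
A rough, library-independent sketch of motif implementation: each pattern is written into the sequence at every allowed position, the candidates are scored, and the best placement is kept. The function implement_motifs, the toy scoring function, and the overlap rule used for preserve_inserted_motifs are assumptions made for illustration, not CREsted's actual code.

```python
# Hypothetical sketch of greedy motif implementation with a toy scoring function.
def toy_score(seq: str) -> float:
    """Stand-in for the model's prediction for the target class."""
    return float(seq.count("TGAC") + seq.count("GGAA"))


def implement_motifs(seq, patterns, score=toy_score,
                     insertions_per_pattern=None, preserve_inserted_motifs=True):
    """Insert each pattern at its best-scoring position; return (sequence, placements)."""
    occupied = set()  # positions covered by previously inserted motifs
    placements = []
    for name, motif in patterns.items():
        n_insertions = (insertions_per_pattern or {}).get(name, 1)
        for _ in range(n_insertions):
            best = None  # (score, position, candidate sequence)
            for pos in range(len(seq) - len(motif) + 1):
                span = range(pos, pos + len(motif))
                if preserve_inserted_motifs and occupied.intersection(span):
                    continue  # would overwrite an earlier motif
                candidate = seq[:pos] + motif + seq[pos + len(motif):]
                candidate_score = score(candidate)
                if best is None or candidate_score > best[0]:
                    best = (candidate_score, pos, candidate)
            if best is None:
                raise ValueError(f"No free position left for pattern {name!r}")
            _, pos, seq = best
            occupied.update(range(pos, pos + len(motif)))
            placements.append((name, pos))
    return seq, placements


designed, placements = implement_motifs("A" * 12, {"m1": "TGAC", "m2": "GGAA"})
print(designed, placements)  # TGACGGAAAAAA [('m1', 0), ('m2', 4)]
```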

Crested.fit(epochs=100, mixed_precision=False, model_checkpointing=True, model_checkpointing_best_only=True, model_checkpointing_metric='val_loss', model_checkpointing_mode='min', early_stopping=True, early_stopping_patience=10, early_stopping_metric='val_loss', early_stopping_mode='min', learning_rate_reduce=True, learning_rate_reduce_patience=5, learning_rate_reduce_metric='val_loss', learning_rate_reduce_mode='min', custom_callbacks=None)#

Fit the model on the training and validation set.

Parameters:
  • epochs (int (default: 100)) – Number of epochs to train the model.

  • mixed_precision (bool (default: False)) – Enable mixed precision training.

  • model_checkpointing (bool (default: True)) – Save model checkpoints.

  • model_checkpointing_best_only (bool (default: True)) – Save only the best model checkpoint.

  • model_checkpointing_metric (str (default: 'val_loss')) – Metric to monitor to choose best models.

  • model_checkpointing_mode (str (default: 'min')) – ‘max’ if a high metric is better, ‘min’ if a low metric is better

  • early_stopping (bool (default: True)) – Enable early stopping.

  • early_stopping_patience (int (default: 10)) – Number of epochs with no improvement after which training will be stopped.

  • early_stopping_metric (str (default: 'val_loss')) – Metric to monitor for early stopping.

  • early_stopping_mode (str (default: 'min')) – ‘max’ if a high metric is better, ‘min’ if a low metric is better

  • learning_rate_reduce (bool (default: True)) – Enable learning rate reduction.

  • learning_rate_reduce_patience (int (default: 5)) – Number of epochs with no improvement after which learning rate will be reduced.

  • learning_rate_reduce_metric (str (default: 'val_loss')) – Metric to monitor for reducing the learning rate.

  • learning_rate_reduce_mode (str (default: 'min')) – ‘max’ if a high metric is better, ‘min’ if a low metric is better

  • custom_callbacks (Optional[list] (default: None)) – List of custom callbacks to use during training.

Return type:

None
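
The interplay of the *_patience, *_metric, and *_mode arguments follows the usual patience pattern. The sketch below reproduces that logic in plain Python for a list of monitored values; stop_epoch is a hypothetical helper, and it assumes that only a strict improvement resets the counter.

```python
# Hypothetical sketch of patience-based early stopping on a monitored metric.
def stop_epoch(metric_history, patience=10, mode="min"):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    best = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_history, start=1):
        improved = best is None or (value < best if mode == "min" else value > best)
        if improved:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


val_loss = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
print(stop_epoch(val_loss, patience=3, mode="min"))  # 5
print(stop_epoch(val_loss, patience=10, mode="min"))  # None
```

Learning rate reduction uses the same kind of counter, except that the learning rate is lowered instead of training being stopped. With the defaults, its shorter patience (5) means the rate is reduced before early stopping (patience 10) can trigger.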

Crested.get_embeddings(layer_name='global_average_pooling1d', anndata=None)#

Extract embeddings from a specified layer in the model for all regions in the dataset.

If anndata is provided, it will add the embeddings to anndata.varm[layer_name].

Parameters:
  • layer_name (str (default: 'global_average_pooling1d')) – The name of the layer from which to extract the embeddings.

  • anndata (Optional[AnnData] (default: None)) – AnnData object in which to store the embeddings, as anndata.varm[layer_name].

Return type:

ndarray

Returns:

Embeddings of shape (N, D), where N is the number of regions in the dataset and D is the size of the embedding layer.
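
To show where the (N, D) shape comes from when the default pooling layer is used, here is a small plain-Python sketch of global average pooling over the length axis. The helper name and the toy numbers are made up for illustration.

```python
# Hypothetical sketch: global average pooling turns an (L, D) feature map into a D-vector.
def global_average_pool(feature_map):
    """Average an (L, D) feature map over its length axis."""
    length = len(feature_map)
    return [sum(column) / length for column in zip(*feature_map)]


feature_maps = [  # N = 2 regions, L = 3 positions, D = 2 channels
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
]
embeddings = [global_average_pool(fm) for fm in feature_maps]
print(embeddings)  # [[3.0, 4.0], [2.0, 2.0]]
```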

Crested.load_model(model_path, compile=True)#

Load a (pretrained) model from a file.

Parameters:
  • model_path (PathLike) – Path to the model file.

  • compile (bool (default: True)) – Compile the model after loading. Set to False if you only want to load the model weights (e.g. when finetuning a model). If False, you should provide a TaskConfig to the Crested object before calling fit.

Return type:

None

Crested.predict(anndata=None, model_name=None)#

Make predictions using the model on the full dataset.

If anndata and model_name are provided, will add the predictions to anndata as a .layers[model_name] attribute. Else, will return the predictions as a numpy array.

Parameters:
  • anndata (Optional[AnnData] (default: None)) – Anndata object containing the data.

  • model_name (Optional[str] (default: None)) – Name that will be used to store the predictions in anndata.layers[model_name].

Return type:

None | ndarray

Returns:

Predictions of shape (N, C), or None if the predictions were added to anndata.

Crested.predict_regions(region_idx)#

Make predictions using the model on the specified region(s).

Parameters:

region_idx (list[str] | str) – List of regions for which to make predictions in the format of your original data, either “chr:start-end” or “chr:start-end:strand”.

Return type:

ndarray

Returns:

Predictions for the specified region(s) of shape (N, C)

Crested.predict_sequence(sequence)#

Make predictions using the model on the provided DNA sequence.

Parameters:

sequence (str) – A string containing a DNA sequence (A, C, G, T).

Return type:

ndarray

Returns:

Predictions for the provided sequence.

Crested.score_gene_locus(chr_name, gene_start, gene_end, class_name, strand='+', upstream=50000, downstream=10000, window_size=2114, central_size=1000, step_size=50, genome=None)#

Score regions upstream and downstream of a gene locus using the model’s prediction.

The model predicts a value for the central region of each window (central_size, 1000 bp by default).

Parameters:
  • chr_name (str) – The chromosome name (e.g., ‘chr12’).

  • gene_start (int) – The start position of the gene locus (TSS for + strand).

  • gene_end (int) – The end position of the gene locus (TSS for - strand).

  • class_name (str) – Output class name for prediction.

  • strand (str (default: '+')) – ‘+’ for positive strand, ‘-’ for negative strand. Default ‘+’.

  • upstream (int (default: 50000)) – Distance upstream of the gene to score. Default 50 000.

  • downstream (int (default: 10000)) – Distance downstream of the gene to score. Default 10 000.

  • window_size (int (default: 2114)) – Size of the window to use for scoring. Default 2114.

  • central_size (int (default: 1000)) – Size of the central region that the model predicts for. Default 1000.

  • step_size (int (default: 50)) – Distance between consecutive windows. Default 50.

  • genome (Optional[FastaFile] (default: None)) – Genome of the species to score the locus on. If None, the genome of the Crested object is used.

Return type:

tuple[ndarray, ndarray, int, int, int]

Returns:

scores

An array of prediction scores across the entire genomic range.

coordinates

An array of tuples, each containing the chromosome name and the start and end positions of the sequence for each window.

min_loc

Start position of the entire scored region.

max_loc

End position of the entire scored region.

tss_position

The transcription start site (TSS) position.
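
The windowing described by the parameters above can be sketched as follows. This is an assumption-laden illustration, not the library's code: locus_windows and central_region are hypothetical helpers, the scored range is assumed to run from upstream of the TSS to downstream of the gene end (mirrored for the minus strand), and the central region is assumed to be centered in each window.

```python
# Hypothetical sketch of sliding windows across a gene locus.
def locus_windows(gene_start, gene_end, strand="+", upstream=50000,
                  downstream=10000, window_size=2114, step_size=50):
    """Return (windows, region_start, region_end, tss) for the scored range."""
    if strand == "+":
        tss = gene_start
        region_start = max(0, gene_start - upstream)
        region_end = gene_end + downstream
    elif strand == "-":
        tss = gene_end
        region_start = max(0, gene_start - downstream)
        region_end = gene_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    windows = [
        (start, start + window_size)
        for start in range(region_start, region_end - window_size + 1, step_size)
    ]
    return windows, region_start, region_end, tss


def central_region(window, central_size=1000):
    """Coordinates of the centered sub-region that a window's prediction refers to."""
    start, end = window
    offset = ((end - start) - central_size) // 2
    return start + offset, start + offset + central_size


windows, min_loc, max_loc, tss = locus_windows(
    10000, 12000, strand="+", upstream=1000, downstream=500,
    window_size=200, step_size=100,
)
print(len(windows), windows[0], windows[-1], tss)  # 34 (9000, 9200) (12300, 12500) 10000
print(central_region((0, 2114)))  # (557, 1557)
```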

Crested.test(return_metrics=False)#

Evaluate the model on the test set.

Make sure to load a model first using Crested.load_model() before calling this function. Make sure the model is compiled before calling this function.

Parameters:

return_metrics (bool (default: False)) – Return the evaluation metrics as a dictionary.

Return type:

dict | None

Returns:

Evaluation metrics as a dictionary or None if return_metrics is False.

Crested.tfmodisco_calculate_and_save_contribution_scores(adata, output_dir='modisco_results', method='expected_integrated_grad', class_names=None)#

Calculate and save contribution scores for all regions in adata.var.

Parameters:
  • adata (AnnData) – The AnnData object containing regions and class information, obtained from crested.pp.sort_and_filter_regions_on_specificity.

  • output_dir (PathLike (default: 'modisco_results')) – Directory to save the output files.

  • method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.

  • class_names (Optional[list[str]] (default: None)) – List of class names to process. If None, all class names in adata.obs_names will be processed.

Crested.tfmodisco_calculate_and_save_contribution_scores_sequences(adata, sequences, output_dir='modisco_results', method='expected_integrated_grad', class_names=None)#

Calculate and save contribution scores for the sequence(s).

Parameters:
  • adata (AnnData) – The AnnData object containing class information.

  • sequences (list[str]) – List of sequences (string encoded) for which to calculate contribution scores.

  • output_dir (PathLike (default: 'modisco_results')) – Directory to save the output files.

  • method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.

  • class_names (Optional[list[str]] (default: None)) – List of class names to process. If None, all class names in adata.obs_names will be processed.

Crested.transferlearn(epochs_first_phase=50, epochs_second_phase=50, learning_rate_first_phase=0.0001, learning_rate_second_phase=1e-06, freeze_until_layer_name=None, freeze_until_layer_index=None, set_output_activation=None, **kwargs)#

Perform transfer learning on the model.

The first phase freezes layers up to a specified layer (if provided), removes the later layers, adds a dense output layer, and trains with a low learning rate. The second phase unfreezes all layers and continues training with an even lower learning rate.

Ensure that you load a model first using Crested.load_model() before calling this function and have a datamodule and config loaded in your Crested object.

One of freeze_until_layer_name or freeze_until_layer_index must be provided.

Parameters:
  • epochs_first_phase (int (default: 50)) – Number of epochs to train in the first phase.

  • epochs_second_phase (int (default: 50)) – Number of epochs to train in the second phase.

  • learning_rate_first_phase (float (default: 0.0001)) – Learning rate for the first phase.

  • learning_rate_second_phase (float (default: 1e-06)) – Learning rate for the second phase.

  • freeze_until_layer_name (Optional[str] (default: None)) – Name of the layer up to which to freeze layers. If None, defaults to freezing all layers except the last layer.

  • freeze_until_layer_index (Optional[int] (default: None)) – Index of the layer up to which to freeze layers. If None, defaults to freezing all layers except the last layer.

  • set_output_activation (Optional[str] (default: None)) – Set output activation if different from the previous model.

  • kwargs – Additional keyword arguments to pass to the fit method.
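
The two phases can be pictured with a toy model made of named layers that carry a trainable flag. This sketch only mirrors the description above; ToyLayer, first_phase, second_phase, and the choice to treat the freeze index as inclusive are assumptions, and no real training takes place.

```python
# Hypothetical sketch of the two-phase freeze/unfreeze schedule.
from dataclasses import dataclass


@dataclass
class ToyLayer:
    name: str
    trainable: bool = True


def first_phase(layers, freeze_until_layer_index, new_output_name="new_dense_out"):
    """Keep and freeze layers up to the index (assumed inclusive), drop the rest, add a new output layer."""
    kept = [ToyLayer(layer.name, trainable=False)
            for layer in layers[: freeze_until_layer_index + 1]]
    return kept + [ToyLayer(new_output_name, trainable=True)]


def second_phase(layers):
    """Unfreeze every layer for fine-tuning at a lower learning rate."""
    return [ToyLayer(layer.name, trainable=True) for layer in layers]


pretrained = [ToyLayer("conv1"), ToyLayer("conv2"), ToyLayer("pool"), ToyLayer("dense_out")]
schedule = [("first", 50, 1e-4), ("second", 50, 1e-6)]  # (phase, epochs, learning rate) defaults

phase_one = first_phase(pretrained, freeze_until_layer_index=2)
phase_two = second_phase(phase_one)
print([(layer.name, layer.trainable) for layer in phase_one])
print([(layer.name, layer.trainable) for layer in phase_two])
```

Training only the new output layer first lets it adapt to the new task before the pretrained weights are touched; the much smaller second-phase learning rate then limits how far those weights drift.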