crested.tl.Crested
- class crested.tl.Crested(data, model=None, config=None, project_name=None, run_name=None, logger=None, seed=None)
Main class to handle training, testing, predicting and calculation of contribution scores.
- Parameters:
  - data (AnnDataModule) – AnnDataModule object containing the data.
  - model (Optional[Model] (default: None)) – Model architecture to use for training.
  - config (Optional[TaskConfig] (default: None)) – Task configuration (optimizer, loss, and metrics) for use in tl.Crested.
  - project_name (Optional[str] (default: None)) – Name of the project. Used for logging and creating output directories. If not provided, the default project name “CREsted” is used.
  - run_name (Optional[str] (default: None)) – Name of the run. Used for wandb logging and creating output directories. If not provided, the current date and time are used.
  - logger (Optional[str] (default: None)) – Logger to use. Can be “wandb”, “tensorboard”, or “dvc” (tensorboard is not implemented yet). If not provided, no additional logging is done.
  - seed (Optional[int] (default: None)) – Seed to use for reproducibility. WARNING: this does not make everything fully reproducible, especially on GPU. Some (GPU) operations are non-deterministic and cannot be controlled by the seed.
Examples
>>> from crested.tl import Crested
>>> from crested.tl import default_configs
>>> from crested.tl.data import AnnDataModule
>>> from crested.tl.zoo import deeptopic_cnn

>>> # Load data
>>> anndatamodule = AnnDataModule(anndata, genome="path/to/genome.fa")
>>> model_architecture = deeptopic_cnn(seq_len=1000, n_classes=10)
>>> configs = default_configs("topic_classification")

>>> # Initialize trainer
>>> trainer = Crested(
...     data=anndatamodule,
...     model=model_architecture,
...     config=configs,
...     project_name="test",
... )

>>> # Fit the model
>>> trainer.fit(epochs=100)

>>> # Evaluate the model
>>> trainer.test()

>>> # Make predictions and add them to anndata as a .layers attribute
>>> trainer.predict(anndata, model_name="predictions")

>>> # Calculate contribution scores
>>> scores, seqs_one_hot = trainer.calculate_contribution_scores_regions(
...     region_idx="chr1:1000-2000",
...     class_names=["class1", "class2"],
...     method="integrated_grad",
... )
Methods table
| calculate_contribution_scores | Calculate contribution scores based on the given method for the full dataset. |
| calculate_contribution_scores_enhancer_design | Calculate contribution scores of enhancer design. |
| calculate_contribution_scores_regions | Calculate contribution scores based on given method for a specified region. |
| calculate_contribution_scores_sequence | Calculate contribution scores based on given method for a specified sequence. |
| enhancer_design_in_silico_evolution | Create synthetic enhancers for a specified class using in silico evolution (ISE). |
| enhancer_design_motif_implementation | Create synthetic enhancers for a specified class using motif implementation. |
| fit | Fit the model on the training and validation set. |
| get_embeddings | Extract embeddings from a specified layer in the model for all regions in the dataset. |
| load_model | Load a (pretrained) model from a file. |
| predict | Make predictions using the model on the full dataset. |
| predict_regions | Make predictions using the model on the specified region(s). |
| predict_sequence | Make predictions using the model on the provided DNA sequence. |
| score_gene_locus | Score regions upstream and downstream of a gene locus using the model's prediction. |
| test | Evaluate the model on the test set. |
| tfmodisco_calculate_and_save_contribution_scores | Calculate and save contribution scores for all regions in adata.var. |
| tfmodisco_calculate_and_save_contribution_scores_sequences | Calculate and save contribution scores for the sequence(s). |
| transferlearn | Perform transfer learning on the model. |
Methods
- Crested.calculate_contribution_scores(class_names, anndata=None, method='expected_integrated_grad')
Calculate contribution scores based on the given method for the full dataset.
These scores can then be plotted to visualize the importance of each base in the dataset using contribution_scores().
- Parameters:
  - class_names (list[str]) – List of class names to calculate the contribution scores for (should match anndata.obs_names). If the list is empty, the contribution scores for the ‘combined’ class are calculated.
  - anndata (Optional[AnnData] (default: None)) – AnnData object in which to store the contribution scores as a .varm[class_name] attribute. If None, the contribution scores are only returned, not stored.
  - method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.
- Returns:
Contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4), or None if anndata is provided.
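To make the (N, C, L, 4) and (N, L, 4) shapes concrete, the sketch below one-hot encodes a single toy sequence and keeps, at each position, only the score of the observed base. It is a plain-Python illustration rather than CREsted code: the helpers `one_hot_encode` and `per_base_importance`, the A, C, G, T column order, and the score values are all assumptions made for the example.

```python
# Illustrative sketch only; helper names and the A, C, G, T column order are assumptions.
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return an L x 4 nested list; unknown bases (e.g. N) become all zeros."""
    return [
        [1.0 if base == letter else 0.0 for letter in ALPHABET]
        for base in sequence.upper()
    ]


def per_base_importance(scores, one_hot):
    """Collapse L x 4 scores to length L by keeping the observed base's score."""
    return [
        sum(s * o for s, o in zip(score_row, hot_row))
        for score_row, hot_row in zip(scores, one_hot)
    ]


seq = "ACGT"
one_hot = one_hot_encode(seq)

# Toy scores for one sequence and one class, shape L x 4.
scores = [
    [0.5, 0.1, 0.0, -0.2],
    [0.0, 0.3, 0.1, 0.0],
    [0.2, 0.0, -0.4, 0.1],
    [0.0, 0.0, 0.0, 0.9],
]

importance = per_base_importance(scores, one_hot)
print(importance)
```

In the full output, the same masking would apply per sequence (N) and per class (C).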
- Crested.calculate_contribution_scores_enhancer_design(enhancer_design_intermediate, class_names=None, method='expected_integrated_grad', disable_tqdm=False)
Calculate contribution scores of enhancer design.
These scores can then be plotted to visualize the importance of each base in the region using enhancer_design_steps_contribution_scores().
- Parameters:
  - enhancer_design_intermediate (list[dict]) – Intermediate output from enhancer design when return_intermediate is True.
  - class_names (Optional[list[str]] (default: None)) – List of class names to calculate the contribution scores for (should match anndata.obs_names). If None, the contribution scores for the ‘combined’ class are calculated.
  - method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.
  - disable_tqdm (bool (default: False)) – Whether to disable the tqdm progress bar during the calculations.
- Returns:
A tuple of arrays, or a list of tuples of arrays, of contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4).
See also
crested.pl.patterns.enhancer_design_steps_contribution_scores, crested.tl.Crested.enhancer_design_in_silico_evolution, crested.tl.Crested.enhancer_design_motif_implementation
Examples
>>> scores, onehot = trained_crested_object.calculate_contribution_scores_enhancer_design(
...     enhancer_design_intermediate,
...     class_names=["cell_type_A"],
...     method="expected_integrated_grad",
... )
- Crested.calculate_contribution_scores_regions(region_idx, class_names, method='expected_integrated_grad', disable_tqdm=False)
Calculate contribution scores based on given method for a specified region.
These scores can then be plotted to visualize the importance of each base in the region using contribution_scores().
- Parameters:
  - region_idx (list[str] | str) – Region(s) for which to calculate the contribution scores, in the format “chr:start-end” or “chr:start-end:strand”.
  - class_names (list[str]) – List of class names to calculate the contribution scores for (should match anndata.obs_names). If the list is empty, the contribution scores for the ‘combined’ class are calculated.
  - method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.
  - disable_tqdm (bool (default: False)) – Whether to disable the tqdm progress bar during the calculations.
- Returns:
Contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4).
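The two accepted region formats can be illustrated with a small parser. This is a sketch under stated assumptions, not how CREsted parses regions internally: `parse_region` and `as_region_list` are hypothetical helpers, and defaulting the strand to '+' when it is omitted is an assumption.

```python
# Illustrative sketch only; parse_region and as_region_list are hypothetical helpers.
def parse_region(region):
    """Split 'chr:start-end' or 'chr:start-end:strand' into its parts."""
    parts = region.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"Unexpected region format: {region!r}")
    chrom, span = parts[0], parts[1]
    strand = parts[2] if len(parts) == 3 else "+"  # assumption: default to '+'
    if strand not in ("+", "-"):
        raise ValueError(f"Unexpected strand: {strand!r}")
    start_text, end_text = span.split("-")
    start, end = int(start_text), int(end_text)
    if start >= end:
        raise ValueError(f"Start must be smaller than end: {region!r}")
    return chrom, start, end, strand


def as_region_list(region_idx):
    """Accept a single region string or a list of region strings."""
    return [region_idx] if isinstance(region_idx, str) else list(region_idx)


regions = [parse_region(r) for r in as_region_list("chr1:1000-2000")]
print(regions)
```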
- Crested.calculate_contribution_scores_sequence(sequences, class_names, method='expected_integrated_grad', disable_tqdm=False)
Calculate contribution scores based on given method for a specified sequence.
These scores can then be plotted to visualize the importance of each base in the sequence using contribution_scores().
- Parameters:
  - sequences – Sequence(s) for which to calculate the contribution scores.
  - class_names (list[str]) – List of class names to calculate the contribution scores for (should match anndata.obs_names). If the list is empty, the contribution scores for the ‘combined’ class are calculated.
  - method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.
  - disable_tqdm (bool (default: False)) – Whether to disable the tqdm progress bar during the calculations.
- Returns:
Contribution scores (N, C, L, 4) and one-hot encoded sequences (N, L, 4).
- Crested.enhancer_design_in_silico_evolution(n_mutations, n_sequences=1, target_class=None, target=None, return_intermediate=False, no_mutation_flanks=None, target_len=None, enhancer_optimizer=None, starting_sequences=None, **kwargs)
Create synthetic enhancers for a specified class using in silico evolution (ISE).
- Parameters:
  - n_mutations (int) – Number of iterations.
  - n_sequences (int (default: 1)) – Number of enhancers to design.
  - target_class (Optional[str] (default: None)) – Class name for which the enhancers will be designed. If None, target must be specified.
  - target (Union[int, ndarray, None] (default: None)) – Target index. Must be specified when target_class is None.
  - return_intermediate (bool (default: False)) – If True, returns a dictionary with predictions and changes made in intermediate steps for selected sequences.
  - no_mutation_flanks (Optional[tuple] (default: None)) – A tuple of integers that determine the regions in each flank where no changes are made.
  - target_len (Optional[int] (default: None)) – Length of the area in the center of the sequence in which changes are made. Ignored if no_mutation_flanks is supplied.
  - enhancer_optimizer (Optional[EnhancerOptimizer] (default: None)) – An instance of EnhancerOptimizer, defining how sequences should be optimized. If None, a default EnhancerOptimizer is initialized using _weighted_difference as the optimization function.
  - starting_sequences (Union[str, list, None] (default: None)) – A DNA sequence or a list of DNA sequences to use instead of randomly generated sequences. If provided, n_sequences is ignored.
  - kwargs (dict[str, Any]) – Keyword arguments passed to the get_best function of the EnhancerOptimizer.
- Returns:
A list of designed sequences. If return_intermediate is True, a list of dictionaries of intermediate mutations and predictions is returned as well as the designed sequences.
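The general idea of in silico evolution can be sketched with a toy scoring function standing in for the model. The code below is a simplified, greedy illustration and not CREsted's implementation: `toy_score`, `evolve`, the `MOTIF` constant, and the fixed starting sequence are assumptions, and a real run would score candidates with model predictions through an EnhancerOptimizer.

```python
# Illustrative sketch only; toy_score stands in for a model prediction.
BASES = "ACGT"
MOTIF = "TGACTCA"  # arbitrary toy motif used by the stand-in scorer


def toy_score(sequence):
    """Best number of matching bases between MOTIF and any window of the sequence."""
    best = 0
    for offset in range(len(sequence) - len(MOTIF) + 1):
        window = sequence[offset:offset + len(MOTIF)]
        best = max(best, sum(a == b for a, b in zip(window, MOTIF)))
    return best


def evolve(sequence, n_mutations, score_fn):
    """Greedy in silico evolution: apply the best single-base change per iteration."""
    history = []
    for _ in range(n_mutations):
        candidates = []
        for pos, current in enumerate(sequence):
            for base in BASES:
                if base != current:
                    mutated = sequence[:pos] + base + sequence[pos + 1:]
                    candidates.append((score_fn(mutated), pos, base, mutated))
        # max() keeps the first candidate among ties, so the run is deterministic.
        score, pos, base, sequence = max(candidates, key=lambda c: c[0])
        history.append({"position": pos, "base": base, "score": score})
    return sequence, history


designed, intermediate = evolve("AAAAAAAAAA", n_mutations=5, score_fn=toy_score)
print(designed, [step["score"] for step in intermediate])
```

Each iteration evaluates every possible single-base substitution, which mirrors the role of n_mutations as the number of iterations.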
See also
crested.tl.Crested.calculate_contribution_scores_enhancer_design, crested.utils.EnhancerOptimizer
Examples
>>> (
...     intermediate_results,
...     designed_sequences,
... ) = trained_crested_object.enhancer_design_in_silico_evolution(
...     target_class="cell_type_A",
...     n_mutations=20,
...     n_sequences=1,
...     return_intermediate=True,
... )
- Crested.enhancer_design_motif_implementation(patterns, n_sequences=1, target_class=None, target=None, insertions_per_pattern=None, return_intermediate=False, no_mutation_flanks=None, target_len=None, preserve_inserted_motifs=True, enhancer_optimizer=None, starting_sequences=None, **kwargs)
Create synthetic enhancers for a specified class using motif implementation.
- Parameters:
  - patterns (dict) – Dictionary of patterns to be implemented, in the form ‘pattern_name’: ‘pattern_sequence’.
  - n_sequences (int (default: 1)) – Number of enhancers to design.
  - target_class (Optional[str] (default: None)) – Class name for which the enhancers will be designed. If None, target must be specified.
  - target (Union[int, ndarray, None] (default: None)) – Target index. Must be specified when target_class is None.
  - insertions_per_pattern (Optional[dict] (default: None)) – Dictionary with the number of insertions per pattern, in the form ‘pattern_name’: number_of_insertions. If not provided, each pattern in patterns is implemented once.
  - return_intermediate (bool (default: False)) – If True, returns a dictionary with predictions and changes made in intermediate steps for selected sequences.
  - no_mutation_flanks (Optional[tuple] (default: None)) – A tuple of integers that determine the regions in each flank where no implementations are made.
  - target_len (Optional[int] (default: None)) – Length of the area in the center of the sequence in which implementations are made. Ignored if no_mutation_flanks is supplied.
  - preserve_inserted_motifs (bool (default: True)) – If True, a motif cannot be inserted on top of a previously inserted motif.
  - enhancer_optimizer (Optional[EnhancerOptimizer] (default: None)) – An instance of EnhancerOptimizer, defining how sequences should be optimized. If None, a default EnhancerOptimizer is initialized using _weighted_difference as the optimization function.
  - starting_sequences (Union[str, list, None] (default: None)) – A DNA sequence or a list of DNA sequences to use instead of randomly generated sequences. If provided, n_sequences is ignored.
  - kwargs (dict[str, Any]) – Keyword arguments passed to the get_best function of the EnhancerOptimizer.
- Returns:
A list of designed sequences and, if return_intermediate is True, a list of dictionaries of intermediate mutations and predictions.
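A toy version of motif implementation, including the effect of preserve_inserted_motifs, might look like the sketch below. It is not CREsted's implementation: `implant`, `implement_motifs`, and `toy_score` are hypothetical, motifs are assumed to overwrite bases rather than lengthen the sequence, and the scorer simply counts motif occurrences instead of using model predictions.

```python
# Illustrative sketch only; the scorer and helpers are assumptions, not CREsted code.
def toy_score(sequence):
    """Stand-in for a model prediction: count occurrences of two toy motifs."""
    return sequence.count("GATA") + sequence.count("TGACTCA")


def implant(sequence, motif, start):
    """Overwrite len(motif) bases of the sequence, starting at position start."""
    return sequence[:start] + motif + sequence[start + len(motif):]


def implement_motifs(sequence, patterns, score_fn, preserve_inserted_motifs=True):
    """Place each pattern at its best-scoring position, one pattern at a time."""
    locked = set()  # positions already covered by an inserted motif
    steps = []
    for name, motif in patterns.items():
        best = None
        for start in range(len(sequence) - len(motif) + 1):
            covered = set(range(start, start + len(motif)))
            if preserve_inserted_motifs and covered & locked:
                continue  # would overwrite a previously inserted motif
            candidate = implant(sequence, motif, start)
            score = score_fn(candidate)
            if best is None or score > best[0]:
                best = (score, start, candidate, covered)
        if best is None:
            raise ValueError(f"No free position left for pattern {name!r}")
        score, start, sequence, covered = best
        locked |= covered
        steps.append({"pattern": name, "start": start, "score": score})
    return sequence, steps


patterns = {"gata": "GATA", "ap1": "TGACTCA"}
designed, steps = implement_motifs("A" * 20, patterns, toy_score)
print(designed, steps)
```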
- Crested.fit(epochs=100, mixed_precision=False, model_checkpointing=True, model_checkpointing_best_only=True, model_checkpointing_metric='val_loss', model_checkpointing_mode='min', early_stopping=True, early_stopping_patience=10, early_stopping_metric='val_loss', early_stopping_mode='min', learning_rate_reduce=True, learning_rate_reduce_patience=5, learning_rate_reduce_metric='val_loss', learning_rate_reduce_mode='min', custom_callbacks=None)
Fit the model on the training and validation set.
- Parameters:
  - epochs (int (default: 100)) – Number of epochs to train the model.
  - mixed_precision (bool (default: False)) – Enable mixed precision training.
  - model_checkpointing (bool (default: True)) – Save model checkpoints.
  - model_checkpointing_best_only (bool (default: True)) – Save only the best model checkpoint.
  - model_checkpointing_metric (str (default: 'val_loss')) – Metric to monitor to choose the best models.
  - model_checkpointing_mode (str (default: 'min')) – ‘max’ if a high metric is better, ‘min’ if a low metric is better.
  - early_stopping (bool (default: True)) – Enable early stopping.
  - early_stopping_patience (int (default: 10)) – Number of epochs with no improvement after which training is stopped.
  - early_stopping_metric (str (default: 'val_loss')) – Metric to monitor for early stopping.
  - early_stopping_mode (str (default: 'min')) – ‘max’ if a high metric is better, ‘min’ if a low metric is better.
  - learning_rate_reduce (bool (default: True)) – Enable learning rate reduction.
  - learning_rate_reduce_patience (int (default: 5)) – Number of epochs with no improvement after which the learning rate is reduced.
  - learning_rate_reduce_metric (str (default: 'val_loss')) – Metric to monitor for reducing the learning rate.
  - learning_rate_reduce_mode (str (default: 'min')) – ‘max’ if a high metric is better, ‘min’ if a low metric is better.
  - custom_callbacks (Optional[list] (default: None)) – List of custom callbacks to use during training.
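How a patience value and a ‘min’/‘max’ mode interact can be shown with a small, framework-free sketch. This illustrates the usual early-stopping logic only; it is not the callback that fit uses internally, and `early_stopping_epoch` and the metric values are made up for the example.

```python
# Illustrative sketch only; a simplified version of patience-based early stopping.
def early_stopping_epoch(history, patience=10, mode="min"):
    """Return the 1-based epoch at which training would stop, or None."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    best = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(history, start=1):
        improved = (
            best is None
            or (mode == "min" and value < best)
            or (mode == "max" and value > best)
        )
        if improved:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


val_loss = [0.9, 0.7, 0.6, 0.61, 0.63, 0.62, 0.65]
stop_epoch = early_stopping_epoch(val_loss, patience=3, mode="min")
print(stop_epoch)
```

The same counter logic applies to learning rate reduction, except that the learning rate is lowered instead of training being stopped.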
- Crested.get_embeddings(layer_name='global_average_pooling1d', anndata=None)
Extract embeddings from a specified layer in the model for all regions in the dataset.
If anndata is provided, the embeddings are added to anndata.varm[layer_name].
- Returns:
Embeddings of shape (N, D), where N is the number of regions in the dataset and D is the size of the embedding layer.
- Crested.load_model(model_path, compile=True)
Load a (pretrained) model from a file.
- Crested.predict(anndata=None, model_name=None)
Make predictions using the model on the full dataset.
If anndata and model_name are provided, the predictions are added to anndata as a .layers[model_name] attribute. Otherwise, the predictions are returned as a numpy array.
- Crested.predict_regions(region_idx)
Make predictions using the model on the specified region(s).
- Crested.predict_sequence(sequence)
Make predictions using the model on the provided DNA sequence.
- Crested.score_gene_locus(chr_name, gene_start, gene_end, class_name, strand='+', upstream=50000, downstream=10000, window_size=2114, central_size=1000, step_size=50, genome=None)
Score regions upstream and downstream of a gene locus using the model’s prediction.
The model predicts a value for the central 1000 bp of each window.
- Parameters:
  - chr_name (str) – The chromosome name (e.g., ‘chr12’).
  - gene_start (int) – The start position of the gene locus (TSS for + strand).
  - gene_end (int) – The end position of the gene locus (TSS for - strand).
  - class_name (str) – Output class name for prediction.
  - strand (str (default: '+')) – ‘+’ for positive strand, ‘-’ for negative strand.
  - upstream (int (default: 50000)) – Distance upstream of the gene to score.
  - downstream (int (default: 10000)) – Distance downstream of the gene to score.
  - window_size (int (default: 2114)) – Size of the window to use for scoring.
  - central_size (int (default: 1000)) – Size of the central region that the model predicts for.
  - step_size (int (default: 50)) – Distance between consecutive windows.
  - genome (Optional[FastaFile] (default: None)) – Genome of the species to score the locus on. If None, the genome of the Crested object is used.
- Returns:
  - scores – An array of prediction scores across the entire genomic range.
  - coordinates – An array of tuples, each containing the chromosome name and the start and end positions of the sequence for each window.
  - min_loc – Start position of the entire scored region.
  - max_loc – End position of the entire scored region.
  - tss_position – The transcription start site (TSS) position.
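The interplay of upstream, downstream, window_size, central_size, and step_size can be sketched as a sliding-window layout. The code below is one plausible reading of those parameters, not the library's implementation: `locus_windows` is hypothetical, and the strand handling and exact window boundaries are assumptions.

```python
# Illustrative sketch only; the boundary conventions here are assumptions.
def locus_windows(gene_start, gene_end, strand="+", upstream=50000,
                  downstream=10000, window_size=2114, central_size=1000,
                  step_size=50):
    """Lay out sliding windows over a gene locus and its flanks."""
    if strand == "+":
        region_start, region_end = gene_start - upstream, gene_end + downstream
    elif strand == "-":
        # On the minus strand, 'upstream' lies at higher coordinates.
        region_start, region_end = gene_start - downstream, gene_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    region_start = max(region_start, 0)

    # Offset of the central region inside each window.
    offset = (window_size - central_size) // 2
    windows = []
    for start in range(region_start, region_end - window_size + 1, step_size):
        central_start = start + offset
        windows.append((start, start + window_size,
                        central_start, central_start + central_size))
    return region_start, region_end, windows


min_loc, max_loc, windows = locus_windows(
    gene_start=100000, gene_end=101000, strand="+",
    upstream=2000, downstream=1000,
)
print(min_loc, max_loc, len(windows), windows[0])
```

Because step_size is smaller than central_size, consecutive central regions overlap, so each base is covered by several windows.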
- Crested.test(return_metrics=False)
Evaluate the model on the test set.
Make sure to load a model first using Crested.load_model() before calling this function. Make sure the model is compiled before calling this function.
- Crested.tfmodisco_calculate_and_save_contribution_scores(adata, output_dir='modisco_results', method='expected_integrated_grad', class_names=None)
Calculate and save contribution scores for all regions in adata.var.
- Parameters:
  - adata (AnnData) – The AnnData object containing regions and class information, obtained from crested.pp.sort_and_filter_regions_on_specificity.
  - output_dir (PathLike (default: 'modisco_results')) – Directory to save the output files.
  - method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.
  - class_names (Optional[list[str]] (default: None)) – List of class names to process. If None, all class names in adata.obs_names are processed.
- Crested.tfmodisco_calculate_and_save_contribution_scores_sequences(adata, sequences, output_dir='modisco_results', method='expected_integrated_grad', class_names=None)
Calculate and save contribution scores for the sequence(s).
- Parameters:
  - adata (AnnData) – The AnnData object containing class information.
  - sequences (list[str]) – List of sequences (string encoded) to calculate contribution scores on.
  - output_dir (PathLike (default: 'modisco_results')) – Directory to save the output files.
  - method (str (default: 'expected_integrated_grad')) – Method to use for calculating the contribution scores. Options are: ‘integrated_grad’, ‘mutagenesis’, ‘expected_integrated_grad’.
  - class_names (Optional[list[str]] (default: None)) – List of class names to process. If None, all class names in adata.obs_names are processed.
- Crested.transferlearn(epochs_first_phase=50, epochs_second_phase=50, learning_rate_first_phase=0.0001, learning_rate_second_phase=1e-06, freeze_until_layer_name=None, freeze_until_layer_index=None, set_output_activation=None, **kwargs)
Perform transfer learning on the model.
The first phase freezes layers up to a specified layer (if provided), removes the later layers, adds a dense output layer, and trains with a low learning rate. The second phase unfreezes all layers and continues training with an even lower learning rate.
Before calling this function, load a model using Crested.load_model() and make sure a datamodule and config are loaded in your Crested object.
One of freeze_until_layer_name or freeze_until_layer_index must be provided.
- Parameters:
  - epochs_first_phase (int (default: 50)) – Number of epochs to train in the first phase.
  - epochs_second_phase (int (default: 50)) – Number of epochs to train in the second phase.
  - learning_rate_first_phase (float (default: 0.0001)) – Learning rate for the first phase.
  - learning_rate_second_phase (float (default: 1e-06)) – Learning rate for the second phase.
  - freeze_until_layer_name (Optional[str] (default: None)) – Name of the layer up to which to freeze layers. If None, defaults to freezing all layers except the last layer.
  - freeze_until_layer_index (Optional[int] (default: None)) – Index of the layer up to which to freeze layers. If None, defaults to freezing all layers except the last layer.
  - set_output_activation (Optional[str] (default: None)) – Set output activation if different from the previous model.
  - kwargs – Additional keyword arguments to pass to the fit method.
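The two-phase scheme described above can be mimicked with plain Python objects. The sketch below only tracks which layers are trainable in each phase; it does not build or train a model. The `Layer` class, both helper functions, the layer names, and treating the freeze index as inclusive are assumptions for illustration.

```python
# Illustrative sketch only; it tracks trainable flags and does not train anything.
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    trainable: bool = True


def prepare_first_phase(layers, freeze_until_layer_index):
    """Keep layers up to the index (assumed inclusive), freeze them, add a new head."""
    kept = [
        Layer(layer.name, trainable=False)
        for layer in layers[:freeze_until_layer_index + 1]
    ]
    return kept + [Layer("new_dense_output", trainable=True)]


def prepare_second_phase(layers):
    """Unfreeze every layer for fine-tuning at a lower learning rate."""
    return [Layer(layer.name, trainable=True) for layer in layers]


pretrained = [Layer("conv1"), Layer("conv2"), Layer("pool"), Layer("old_output")]

phase_one = prepare_first_phase(pretrained, freeze_until_layer_index=2)
phase_two = prepare_second_phase(phase_one)

# Epochs and learning rates taken from the documented defaults.
schedule = [("first", 50, 1e-4, phase_one), ("second", 50, 1e-6, phase_two)]
for phase, epochs, learning_rate, layers in schedule:
    trainable = [layer.name for layer in layers if layer.trainable]
    print(phase, epochs, learning_rate, trainable)
```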
See also