crested.tl.enhancer_design_motif_insertion

crested.tl.enhancer_design_motif_insertion(patterns, model, target, n_sequences=1, insertions_per_pattern=None, return_intermediate=False, no_mutation_flanks=None, target_len=None, preserve_inserted_motifs=True, enhancer_optimizer=None, starting_sequences=None, acgt_distribution=None, **kwargs)

Create synthetic enhancers using motif insertions.

Parameters:
  • patterns (dict) – Dictionary of patterns to insert, in the form {'pattern_name': 'pattern_sequence'}.

  • model (Model | list[Model]) – A (list of) trained keras model(s) to design enhancers with. If a list of models is provided, the predictions will be averaged across all models.

  • target (int | ndarray) – With the default weighted_difference optimization function, this should be the index of the target class to design enhancers for. The value is passed to the get_best function of the EnhancerOptimizer, so it can represent other target values too.

  • n_sequences (int (default: 1)) – Number of enhancers to design.

  • insertions_per_pattern (Optional[dict] (default: None)) – Dictionary giving the number of insertions for each pattern, in the form {'pattern_name': number_of_insertions}. If not provided, each pattern is inserted once.

  • return_intermediate (bool (default: False)) – If True, returns a dictionary with predictions and changes made in intermediate steps.

  • no_mutation_flanks (Optional[tuple[int, int]] (default: None)) – A tuple of two integers specifying the length of the region at each flank of the sequence in which no modifications should occur.

  • target_len (Optional[int] (default: None)) – Length of the region in the center of the sequence in which insertions are made. Ignored if no_mutation_flanks is set.

  • preserve_inserted_motifs (bool (default: True)) – If True, prevents motifs from being inserted on top of previously inserted motifs.

  • enhancer_optimizer (Optional[EnhancerOptimizer] (default: None)) – An instance of EnhancerOptimizer, defining how sequences should be optimized. If None, a default EnhancerOptimizer will be initialized using _weighted_difference as optimization function.

  • starting_sequences (Union[str, list, None] (default: None)) – An optional DNA sequence or list of DNA sequences to use instead of randomly generated sequences. If provided, n_sequences is ignored.

  • acgt_distribution (Optional[ndarray[float]] (default: None)) – An array of floats representing the distribution of A, C, G, and T in the genome (in that order). If the array is of shape (L, 4), it will be assumed to be per position. If it is of shape (4,), it will be assumed to be overall. If None, a uniform distribution will be used. This will be used to generate random sequences if starting_sequences is not provided. You can calculate these using calculate_nucleotide_distribution().

  • kwargs (dict[str, Any]) – Additional arguments passed to get_best function of EnhancerOptimizer.
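
The interaction between no_mutation_flanks, target_len and preserve_inserted_motifs can be illustrated with a small pure-Python sketch. This is not the library's implementation: insertion_window and insert_motif are hypothetical helpers, and the sketch assumes that the target_len window is centered in the sequence and that an "insertion" overwrites bases in place so the sequence length stays constant.

```python
def insertion_window(seq_len, no_mutation_flanks=None, target_len=None):
    """Return (start, end) of the region where insertions are allowed.

    Hypothetical helper: no_mutation_flanks takes precedence over target_len,
    mirroring the documented "ignored if no_mutation_flanks is set".
    """
    if no_mutation_flanks is not None:
        left, right = no_mutation_flanks
    elif target_len is not None:
        # Assumption: the window is centered; the remainder is split over both flanks.
        left = (seq_len - target_len) // 2
        right = seq_len - target_len - left
    else:
        left, right = 0, 0
    return left, seq_len - right


def insert_motif(sequence, motif, position, occupied, preserve=True):
    """Overwrite `sequence` with `motif` at `position`.

    `occupied` is a set of positions already covered by earlier insertions.
    Returns the new sequence, or None if `preserve` is True and the motif
    would overlap a previously inserted motif.
    """
    span = range(position, position + len(motif))
    if preserve and any(i in occupied for i in span):
        return None  # blocked: would overwrite an earlier motif
    occupied.update(span)
    return sequence[:position] + motif + sequence[position + len(motif):]


seq = "A" * 20
start, end = insertion_window(len(seq), target_len=10)

occupied = set()
seq1 = insert_motif(seq, "TGCA", 6, occupied)
# The second motif overlaps positions 8-9 of the first one, so it is rejected.
blocked = insert_motif(seq1, "ACGTTTGA", 8, occupied)
```

In the real function the insertion position is chosen by the EnhancerOptimizer based on model predictions; the sketch only shows the bookkeeping around where insertions are permitted.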

Return type:

list | tuple[list[dict], list]

Returns:

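A list of designed sequences. If return_intermediate=True, a tuple of (intermediate results, designed sequences) is returned instead, where the intermediate results are a list of dictionaries with the predictions and changes made at each step.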

Examples

>>> acgt_distribution = crested.utils.calculate_nucleotide_distribution(
...     my_anndata, genome, per_position=True
... )  # shape (L, 4)
>>> target_idx = my_anndata.obs_names.get_loc("my_celltype")
>>> my_motifs = {
...     "motif1": "ACGTTTGA",
...     "motif2": "TGCA",
... }
>>> (
...     intermediate_results,
...     designed_sequences,
... ) = crested.tl.enhancer_design_motif_insertion(
...     patterns=my_motifs,
...     target=target_idx,
...     model=my_trained_model,
...     n_sequences=1,
...     return_intermediate=True,
...     acgt_distribution=acgt_distribution,
... )
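
The two accepted shapes of acgt_distribution, (4,) and (L, 4), can be illustrated with a minimal sketch of how a random starting sequence might be drawn from each. This is an assumption-laden illustration, not the library's code: random_sequence is a hypothetical helper, and plain Python lists stand in for the NumPy arrays the function expects.

```python
import random

ALPHABET = "ACGT"  # order matches the documented A, C, G, T order


def random_sequence(length, acgt_distribution=None, rng=None):
    """Draw a random DNA sequence of `length` bases (hypothetical helper).

    acgt_distribution may be None (uniform), a flat list of 4 weights
    (overall distribution), or a list of `length` rows of 4 weights
    (per-position distribution).
    """
    rng = rng or random.Random()
    if acgt_distribution is None:
        acgt_distribution = [0.25, 0.25, 0.25, 0.25]
    # A nested list is treated as per-position, a flat one as overall.
    per_position = isinstance(acgt_distribution[0], (list, tuple))
    bases = []
    for i in range(length):
        weights = acgt_distribution[i] if per_position else acgt_distribution
        bases.append(rng.choices(ALPHABET, weights=weights, k=1)[0])
    return "".join(bases)


# Overall distribution, analogous to shape (4,): AT-rich background.
at_rich = random_sequence(50, [0.35, 0.15, 0.15, 0.35], rng=random.Random(0))

# Per-position distribution, analogous to shape (L, 4): one row per base.
one_hot_rows = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
fixed = random_sequence(4, one_hot_rows)
```

With degenerate one-hot rows, the per-position case becomes deterministic, which makes the row-to-position correspondence easy to check.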