Enhancer code analysis#

In this notebook we will go over how to obtain cell type characteristic sequence patterns from CREsted models, or any other model, using tfmodisco-lite.

Obtaining contribution scores per model class and running tfmodisco-lite#

Before we can do any analysis, we need to calculate the contribution scores for cell type-specific regions. From those, we can run tfmodisco-lite.

Load data and CREsted model#

import os

os.environ["PATH"] = "/data/projects/c04/cbd-saerts/nkemp/tools:" + os.environ["PATH"]
import anndata
import crested

adata = anndata.read_h5ad("mouse_cortex_filtered.h5ad")

genome_file = "../../../../mouse/biccn/mm10.fa"
chrom_sizes = "../../../../mouse/biccn/mm10.chrom.sizes"
genome = crested.Genome(genome_file, chrom_sizes)
crested.register_genome(
    genome
)  # Register the genome so that it can be used by the package
2025-02-04T11:38:30.211063+0100 INFO Genome mm10 registered.
# load a trained model
import keras

model_path = "mouse_biccn/finetuned_model/checkpoints/02.keras"
model = keras.models.load_model(model_path, compile=False)  # change to your model path
# store predictions for all our regions in the anndata object for later inspection.
predictions = crested.tl.predict(adata, model)
adata.layers["biccn_model"] = predictions.T  # adata expects (C, N) instead of (N, C)
  206/91475 ━━━━━━━━━━━━━━━━━━━━ 1:07 735us/step
91475/91475 ━━━━━━━━━━━━━━━━━━━━ 67s 719us/step

Select the most informative regions per cell type#

To obtain cell type-characteristic patterns, we need to calculate contribution scores on highly specific regions. For this purpose, we’ve included a preprocessing function crested.pp.sort_and_filter_regions_on_specificity() to keep the top k most specific regions per cell type that you can use to filter your data before running modisco.

There are three options to select the top regions: either purely based on peak height, purely based on predictions, or on their combination. Here we show how to use the combination of both (which we recommend).

adata_combined = adata.copy()  # Copy the peak heights
adata_combined.X = (
    adata_combined.X + adata_combined.layers["biccn_model"]
) / 2  # Take the average with the predictions
# most informative regions per class
adata_filtered = adata_combined.copy()
top_k = 2000  # Here we take the top 2k most specific regions, but doing this on top 500 or top 1k will give similar results and will be faster to calculate
crested.pp.sort_and_filter_regions_on_specificity(
    adata_filtered, top_k=top_k, method="proportion"
)
adata_filtered
2025-02-04T11:48:51.471343+0100 INFO After sorting and filtering, kept 38000 regions.
AnnData object with n_obs × n_vars = 19 × 38000
    obs: 'file_path'
    var: 'chr', 'start', 'end', 'split', 'Class name', 'rank', 'proportion_score'
    obsm: 'weights'
    layers: 'biccn_model'
adata_filtered.var
chr start end split Class name rank proportion_score
region
chr10:45499432-45501546 chr10 45499432 45501546 val Astro 1 0.716959
chrX:23135863-23137977 chrX 23135863 23137977 train Astro 2 0.700491
chr14:14264851-14266965 chr14 14264851 14266965 train Astro 3 0.699784
chr1:127159287-127161401 chr1 127159287 127161401 train Astro 4 0.694755
chr15:18236980-18239094 chr15 18236980 18239094 train Astro 5 0.694676
... ... ... ... ... ... ... ...
chr14:109213984-109216098 chr14 109213984 109216098 train Vip 1996 0.240961
chr3:93481374-93483488 chr3 93481374 93483488 train Vip 1997 0.240937
chr9:8248775-8250889 chr9 8248775 8250889 test Vip 1998 0.240879
chr1:70704410-70706524 chr1 70704410 70706524 train Vip 1999 0.240873
chr16:85306932-85309046 chr16 85306932 85309046 train Vip 2000 0.240872

38000 rows × 7 columns

Calculating contribution scores per class#

Now you can calculate the contribution scores for all the regions in your filtered anndata.
By default, the contribution scores are calculated using the expected integrated gradients method, but you can change this to simple integrated gradients to speed up the calculation (this might result in less accurate scores).

crested.tl.contribution_scores_specific(
    input=adata_filtered,
    target_idx=None,  # We calculate for all classes
    model=model,
    output_dir="modisco_results_ft_2000",
)

Running tfmodisco-lite#

When this is done, you can run TFModisco Lite on the saved contribution scores to find motifs that are important for the classification/regression task.

You could use the tfmodisco package directly to do this, or you could use the crested.tl.modisco.tfmodisco() function which is essentially a wrapper around the tfmodisco package.

Note that from here on, you can use contribution scores from any model trained in any framework, as this analysis just requires a set of one hot encoded sequences and contribution scores per cell type stored in the same directory.

meme_db, motif_to_tf_file = crested.get_motif_db()
# run tfmodisco on the contribution scores
crested.tl.modisco.tfmodisco(
    window=1000,
    output_dir="modisco_results_ft_2000",
    contrib_dir="modisco_results_ft_2000",
    report=True,  # Optional, will match patterns to motif MEME database
    meme_db="/home/VIB.LOCAL/niklas.kempynck/nkemp/tools/motifs.meme",  # File to MEME database
    max_seqlets=20000,
)

Analysis of cell-type specific sequence patterns#

Once you have obtained your modisco results, you can plot the the found patterns using the crested.pl.patterns.modisco_results() function.

%matplotlib inline
top_k = 1000
crested.pl.patterns.modisco_results(
    classes=["Astro", "L5ET", "Vip", "Oligo"],
    contribution="positive",
    contribution_dir="modisco_results_ft_2000",
    num_seq=top_k,
    y_max=0.15,
    viz="pwm",
)  # You can also visualize in 'pwm' format
2025-02-04T11:49:16.697738+0100 INFO Starting genomic contributions plot for classes: ['Astro', 'L5ET', 'Vip', 'Oligo']
../_images/44cdd06065506812065c73c05cd4271841b89c12002280d2a9f68236041d2ddf.png

Matching patterns across cell types#

Since we have calculated per cell type the patterns independently of each other, we do not know quantitavely how and if they overlap. It can be interesting to get an overview of which patterns are found across multiple cell types, how important they are, and if there are unique patterns only found in a small selection of classes. Therefore, we have made a pattern clustering algorithm, which starts from the results of tfmodisco-lite, and return a pattern matrix, which contains the importance of the clustered patterns per cell type, and a pattern dictionary, describing all clustered patterns.

First, we’ll obtain the modisco files per class by using crested.tl.modisco.match_h5_files_to_classes() using our selected classes. Then, we’ll cluster these patterns using crested.tl.modisco.process_patterns() and create a pattern matrix with crested.tl.modisco.create_pattern_matrix()

# First we obtain the resulting modisco files per class
matched_files = crested.tl.modisco.match_h5_files_to_classes(
    contribution_dir="modisco_results_ft_2000", classes=list(adata.obs_names)
)
# Then we cluster matching patterns, and define a pattern matrix [#classes, #patterns] describing their importance
all_patterns = crested.tl.modisco.process_patterns(
    matched_files,
    sim_threshold=4.25,  # The similarity threshold used for matching patterns. We take the -log10(pval), pval obtained through TOMTOM matching from tangermeme
    trim_ic_threshold=0.05,  # Information content (IC) threshold on which to trim patterns
    discard_ic_threshold=0.2,  # IC threshold used for discarding single instance patterns
    verbose=True,  # Useful for doing sanity checks on matching patterns
)
pattern_matrix = crested.tl.modisco.create_pattern_matrix(
    classes=list(adata.obs_names),
    all_patterns=all_patterns,
    normalize=False,
    pattern_parameter="seqlet_count_log",
)
pattern_matrix.shape
Reading file modisco_results_ft_2000/Astro_modisco_results.h5
Match between Astro_pos_patterns_1 and Astro_neg_patterns_4 with similarity score 4.83
Match between Astro_pos_patterns_11 and Astro_pos_patterns_5 with similarity score 7.19
Match between Astro_pos_patterns_14 and Astro_pos_patterns_8 with similarity score 4.80
Match between Astro_pos_patterns_15 and Astro_pos_patterns_3 with similarity score 7.44
Reading file modisco_results_ft_2000/Endo_modisco_results.h5
Match between Endo_neg_patterns_0 and Astro_neg_patterns_0 with similarity score 4.44
Match between Endo_pos_patterns_1 and Astro_pos_patterns_3 with similarity score 7.06
Match between Endo_pos_patterns_7 and Astro_neg_patterns_2 with similarity score 5.45
Match between Endo_pos_patterns_11 and Endo_pos_patterns_7 with similarity score 5.20
Reading file modisco_results_ft_2000/L2_3IT_modisco_results.h5
Match between L2_3IT_neg_patterns_2 and Astro_neg_patterns_0 with similarity score 4.79
Match between L2_3IT_neg_patterns_9 and Endo_pos_patterns_7 with similarity score 6.88
Match between L2_3IT_pos_patterns_3 and Endo_pos_patterns_16 with similarity score 5.13
Match between L2_3IT_pos_patterns_4 and Endo_pos_patterns_17 with similarity score 4.67
Match between L2_3IT_pos_patterns_6 and Endo_pos_patterns_1 with similarity score 5.74
Match between L2_3IT_pos_patterns_7 and Astro_pos_patterns_8 with similarity score 12.00
Match between L2_3IT_pos_patterns_14 and L2_3IT_neg_patterns_4 with similarity score 4.82
Match between L2_3IT_pos_patterns_17 and Astro_neg_patterns_1 with similarity score 4.64
Reading file modisco_results_ft_2000/L5ET_modisco_results.h5
Match between L5ET_neg_patterns_1 and Astro_neg_patterns_0 with similarity score 4.42
Match between L5ET_pos_patterns_0 and L2_3IT_pos_patterns_15 with similarity score 5.04
Match between L5ET_pos_patterns_1 and Astro_pos_patterns_0 with similarity score 5.27
Match between L5ET_pos_patterns_2 and L2_3IT_pos_patterns_12 with similarity score 6.67
Match between L5ET_pos_patterns_3 and L2_3IT_pos_patterns_1 with similarity score 6.23
Match between L5ET_pos_patterns_4 and L2_3IT_pos_patterns_2 with similarity score 6.20
Match between L5ET_pos_patterns_5 and L2_3IT_pos_patterns_5 with similarity score 5.74
Match between L5ET_pos_patterns_7 and L2_3IT_pos_patterns_0 with similarity score 8.36
Match between L5ET_pos_patterns_11 and Astro_pos_patterns_10 with similarity score 5.87
Reading file modisco_results_ft_2000/L5IT_modisco_results.h5
Match between L5IT_neg_patterns_0 and L2_3IT_neg_patterns_0 with similarity score 5.57
Match between L5IT_neg_patterns_1 and Astro_neg_patterns_0 with similarity score 4.54
Match between L5IT_neg_patterns_3 and L2_3IT_neg_patterns_1 with similarity score 4.63
Match between L5IT_neg_patterns_4 and L2_3IT_neg_patterns_4 with similarity score 6.32
Match between L5IT_neg_patterns_6 and L5ET_neg_patterns_3 with similarity score 4.94
Match between L5IT_neg_patterns_7 and L2_3IT_neg_patterns_10 with similarity score 4.51
Match between L5IT_neg_patterns_8 and L2_3IT_neg_patterns_5 with similarity score 7.56
Match between L5IT_neg_patterns_9 and L2_3IT_neg_patterns_3 with similarity score 5.71
Match between L5IT_neg_patterns_13 and Endo_pos_patterns_7 with similarity score 7.15
Match between L5IT_pos_patterns_0 and L5ET_pos_patterns_5 with similarity score 9.79
Match between L5IT_pos_patterns_1 and L2_3IT_pos_patterns_1 with similarity score 9.10
Match between L5IT_pos_patterns_2 and L2_3IT_pos_patterns_0 with similarity score 10.22
Match between L5IT_pos_patterns_3 and Astro_pos_patterns_2 with similarity score 8.99
Match between L5IT_pos_patterns_4 and L2_3IT_pos_patterns_4 with similarity score 10.65
Match between L5IT_pos_patterns_5 and L2_3IT_pos_patterns_6 with similarity score 5.87
Match between L5IT_pos_patterns_6 and L2_3IT_pos_patterns_2 with similarity score 6.07
Match between L5IT_pos_patterns_7 and L2_3IT_pos_patterns_3 with similarity score 5.05
Match between L5IT_pos_patterns_9 and L2_3IT_neg_patterns_4 with similarity score 4.51
Match between L5IT_pos_patterns_11 and L5IT_pos_patterns_3 with similarity score 4.63
Match between L5IT_pos_patterns_13 and L2_3IT_pos_patterns_16 with similarity score 5.24
Match between L5IT_pos_patterns_14 and L5IT_pos_patterns_9 with similarity score 6.07
Reading file modisco_results_ft_2000/L5_6NP_modisco_results.h5
Match between L5_6NP_neg_patterns_0 and L2_3IT_neg_patterns_8 with similarity score 6.11
Match between L5_6NP_pos_patterns_0 and L5ET_pos_patterns_5 with similarity score 6.72
Match between L5_6NP_pos_patterns_1 and L5IT_pos_patterns_9 with similarity score 7.38
Match between L5_6NP_pos_patterns_2 and L5ET_pos_patterns_1 with similarity score 5.87
Match between L5_6NP_pos_patterns_3 and L5ET_pos_patterns_6 with similarity score 5.46
Match between L5_6NP_pos_patterns_4 and L2_3IT_pos_patterns_4 with similarity score 10.51
Match between L5_6NP_pos_patterns_5 and L2_3IT_pos_patterns_3 with similarity score 4.67
Match between L5_6NP_pos_patterns_7 and L2_3IT_pos_patterns_7 with similarity score 12.00
Match between L5_6NP_pos_patterns_10 and L5IT_pos_patterns_10 with similarity score 4.90
Reading file modisco_results_ft_2000/L6CT_modisco_results.h5
Match between L6CT_neg_patterns_0 and Endo_pos_patterns_7 with similarity score 5.79
Match between L6CT_neg_patterns_2 and L5ET_neg_patterns_2 with similarity score 5.43
Match between L6CT_pos_patterns_0 and L2_3IT_pos_patterns_6 with similarity score 5.87
Match between L6CT_pos_patterns_1 and L2_3IT_pos_patterns_0 with similarity score 9.26
Match between L6CT_pos_patterns_2 and L5ET_pos_patterns_5 with similarity score 6.64
Match between L6CT_pos_patterns_3 and L5IT_pos_patterns_8 with similarity score 7.36
Match between L6CT_pos_patterns_5 and L5_6NP_pos_patterns_1 with similarity score 9.34
Match between L6CT_pos_patterns_6 and L5IT_pos_patterns_10 with similarity score 4.63
Match between L6CT_pos_patterns_8 and L2_3IT_pos_patterns_2 with similarity score 5.84
Match between L6CT_pos_patterns_9 and L2_3IT_pos_patterns_4 with similarity score 8.63
Match between L6CT_pos_patterns_10 and L6CT_pos_patterns_3 with similarity score 5.64
Match between L6CT_pos_patterns_11 and L2_3IT_pos_patterns_16 with similarity score 5.62
Match between L6CT_pos_patterns_12 and L2_3IT_pos_patterns_0 with similarity score 6.07
Match between L6CT_pos_patterns_13 and L5_6NP_pos_patterns_8 with similarity score 5.17
Reading file modisco_results_ft_2000/L6IT_modisco_results.h5
Match between L6IT_neg_patterns_0 and Astro_neg_patterns_0 with similarity score 4.83
Match between L6IT_neg_patterns_1 and Astro_neg_patterns_1 with similarity score 4.39
Match between L6IT_neg_patterns_2 and L6CT_neg_patterns_2 with similarity score 5.42
Match between L6IT_neg_patterns_5 and L5_6NP_neg_patterns_0 with similarity score 5.10
Match between L6IT_neg_patterns_6 and L5IT_neg_patterns_11 with similarity score 5.74
Match between L6IT_neg_patterns_8 and L6CT_neg_patterns_1 with similarity score 5.82
Match between L6IT_neg_patterns_10 and L2_3IT_neg_patterns_5 with similarity score 12.00
Match between L6IT_neg_patterns_12 and Endo_pos_patterns_7 with similarity score 6.93
Match between L6IT_pos_patterns_0 and L2_3IT_pos_patterns_0 with similarity score 7.64
Match between L6IT_pos_patterns_1 and L5ET_pos_patterns_1 with similarity score 5.57
Match between L6IT_pos_patterns_2 and L2_3IT_pos_patterns_2 with similarity score 6.38
Match between L6IT_pos_patterns_3 and L5ET_pos_patterns_5 with similarity score 10.70
Match between L6IT_pos_patterns_4 and L5_6NP_pos_patterns_1 with similarity score 9.43
Match between L6IT_pos_patterns_5 and L5ET_pos_patterns_11 with similarity score 4.41
Match between L6IT_pos_patterns_6 and L6CT_pos_patterns_4 with similarity score 12.00
Match between L6IT_pos_patterns_7 and L6CT_pos_patterns_11 with similarity score 6.02
Match between L6IT_pos_patterns_8 and L2_3IT_pos_patterns_7 with similarity score 12.00
Match between L6IT_pos_patterns_10 and L2_3IT_pos_patterns_8 with similarity score 6.62
Match between L6IT_pos_patterns_11 and L6CT_pos_patterns_6 with similarity score 12.00
Match between L6IT_pos_patterns_12 and L5_6NP_pos_patterns_5 with similarity score 5.39
Match between L6IT_pos_patterns_13 and L6CT_pos_patterns_3 with similarity score 4.60
Match between L6IT_pos_patterns_14 and L6CT_pos_patterns_7 with similarity score 8.12
Match between L6IT_pos_patterns_15 and L5ET_pos_patterns_5 with similarity score 5.50
Match between L6IT_pos_patterns_16 and L6IT_pos_patterns_0 with similarity score 5.38
Match between L6IT_pos_patterns_17 and L5ET_pos_patterns_5 with similarity score 5.42
Reading file modisco_results_ft_2000/L6b_modisco_results.h5
Match between L6b_neg_patterns_0 and L2_3IT_neg_patterns_11 with similarity score 6.04
Match between L6b_neg_patterns_2 and Astro_neg_patterns_5 with similarity score 6.25
Match between L6b_pos_patterns_0 and L5ET_pos_patterns_1 with similarity score 5.87
Match between L6b_pos_patterns_1 and L6CT_pos_patterns_3 with similarity score 10.00
Match between L6b_pos_patterns_2 and L2_3IT_pos_patterns_1 with similarity score 7.83
Match between L6b_pos_patterns_4 and L6IT_pos_patterns_7 with similarity score 6.91
Match between L6b_pos_patterns_5 and L2_3IT_pos_patterns_4 with similarity score 6.66
Match between L6b_pos_patterns_7 and L6IT_pos_patterns_0 with similarity score 8.25
Match between L6b_pos_patterns_8 and L6IT_pos_patterns_8 with similarity score 11.32
Match between L6b_pos_patterns_9 and L6IT_neg_patterns_3 with similarity score 5.87
Match between L6b_pos_patterns_10 and L5ET_pos_patterns_8 with similarity score 4.63
Match between L6b_pos_patterns_12 and L6IT_pos_patterns_2 with similarity score 5.91
Match between L6b_pos_patterns_13 and L6IT_neg_patterns_1 with similarity score 4.83
Reading file modisco_results_ft_2000/Lamp5_modisco_results.h5
Match between Lamp5_neg_patterns_1 and L6b_neg_patterns_0 with similarity score 5.82
Match between Lamp5_pos_patterns_0 and L6CT_pos_patterns_0 with similarity score 5.87
Match between Lamp5_pos_patterns_1 and Astro_pos_patterns_1 with similarity score 4.66
Match between Lamp5_pos_patterns_2 and L6b_pos_patterns_2 with similarity score 7.95
Match between Lamp5_pos_patterns_3 and L6CT_pos_patterns_4 with similarity score 9.63
Match between Lamp5_pos_patterns_4 and L6b_pos_patterns_4 with similarity score 6.22
Match between Lamp5_pos_patterns_5 and L6b_pos_patterns_5 with similarity score 10.96
Match between Lamp5_pos_patterns_11 and L5IT_pos_patterns_3 with similarity score 7.90
Reading file modisco_results_ft_2000/Micro_PVM_modisco_results.h5
Match between Micro_PVM_neg_patterns_1 and L5IT_neg_patterns_2 with similarity score 5.43
Match between Micro_PVM_neg_patterns_2 and Endo_pos_patterns_2 with similarity score 4.79
Match between Micro_PVM_pos_patterns_0 and Endo_pos_patterns_2 with similarity score 4.52
Match between Micro_PVM_pos_patterns_2 and Lamp5_pos_patterns_10 with similarity score 5.92
Match between Micro_PVM_pos_patterns_4 and Endo_pos_patterns_14 with similarity score 4.42
Match between Micro_PVM_pos_patterns_5 and L6IT_neg_patterns_2 with similarity score 4.66
Match between Micro_PVM_pos_patterns_7 and L5_6NP_pos_patterns_1 with similarity score 11.09
Match between Micro_PVM_pos_patterns_8 and L6CT_pos_patterns_4 with similarity score 6.92
Match between Micro_PVM_pos_patterns_9 and L5_6NP_pos_patterns_9 with similarity score 5.44
Match between Micro_PVM_pos_patterns_10 and Micro_PVM_neg_patterns_1 with similarity score 4.38
Reading file modisco_results_ft_2000/OPC_modisco_results.h5
Match between OPC_pos_patterns_0 and Astro_pos_patterns_11 with similarity score 7.81
Match between OPC_pos_patterns_1 and OPC_neg_patterns_3 with similarity score 6.59
Match between OPC_pos_patterns_2 and OPC_neg_patterns_1 with similarity score 6.20
Match between OPC_pos_patterns_4 and Endo_pos_patterns_7 with similarity score 4.82
Match between OPC_pos_patterns_5 and Astro_pos_patterns_11 with similarity score 7.09
Match between OPC_pos_patterns_6 and Astro_pos_patterns_7 with similarity score 5.09
Match between OPC_pos_patterns_7 and L5IT_pos_patterns_12 with similarity score 5.19
Match between OPC_pos_patterns_9 and Micro_PVM_pos_patterns_4 with similarity score 5.17
Match between OPC_pos_patterns_11 and OPC_pos_patterns_1 with similarity score 7.28
Match between OPC_pos_patterns_12 and L6IT_pos_patterns_8 with similarity score 11.13
Reading file modisco_results_ft_2000/Oligo_modisco_results.h5
Match between Oligo_neg_patterns_1 and L5ET_pos_patterns_5 with similarity score 6.42
Match between Oligo_neg_patterns_2 and Micro_PVM_neg_patterns_5 with similarity score 5.50
Match between Oligo_neg_patterns_4 and Micro_PVM_neg_patterns_3 with similarity score 5.89
Match between Oligo_pos_patterns_0 and OPC_pos_patterns_3 with similarity score 6.25
Match between Oligo_pos_patterns_1 and OPC_pos_patterns_1 with similarity score 6.08
Match between Oligo_pos_patterns_2 and OPC_pos_patterns_4 with similarity score 5.27
Match between Oligo_pos_patterns_3 and OPC_pos_patterns_2 with similarity score 5.34
Match between Oligo_pos_patterns_4 and Micro_PVM_pos_patterns_11 with similarity score 4.46
Match between Oligo_pos_patterns_5 and OPC_pos_patterns_7 with similarity score 10.23
Match between Oligo_pos_patterns_8 and OPC_pos_patterns_9 with similarity score 6.22
Match between Oligo_pos_patterns_10 and Oligo_pos_patterns_1 with similarity score 5.70
Match between Oligo_pos_patterns_11 and OPC_pos_patterns_8 with similarity score 7.00
Match between Oligo_pos_patterns_12 and Oligo_pos_patterns_4 with similarity score 5.00
Reading file modisco_results_ft_2000/Pvalb_modisco_results.h5
Match between Pvalb_pos_patterns_0 and Micro_PVM_pos_patterns_7 with similarity score 10.10
Match between Pvalb_pos_patterns_4 and L5IT_pos_patterns_3 with similarity score 5.59
Match between Pvalb_pos_patterns_5 and Lamp5_pos_patterns_5 with similarity score 9.59
Match between Pvalb_pos_patterns_6 and L6IT_pos_patterns_2 with similarity score 6.77
Match between Pvalb_pos_patterns_8 and OPC_pos_patterns_9 with similarity score 4.93
Match between Pvalb_pos_patterns_9 and L5_6NP_pos_patterns_8 with similarity score 6.07
Match between Pvalb_pos_patterns_10 and Oligo_pos_patterns_4 with similarity score 4.72
Match between Pvalb_pos_patterns_11 and Micro_PVM_pos_patterns_2 with similarity score 5.62
Reading file modisco_results_ft_2000/Sncg_modisco_results.h5
Match between Sncg_pos_patterns_1 and Astro_pos_patterns_7 with similarity score 9.65
Match between Sncg_pos_patterns_2 and Micro_PVM_pos_patterns_7 with similarity score 8.62
Match between Sncg_pos_patterns_3 and L6CT_pos_patterns_4 with similarity score 11.65
Match between Sncg_pos_patterns_4 and L5ET_pos_patterns_1 with similarity score 4.74
Match between Sncg_pos_patterns_6 and Oligo_pos_patterns_13 with similarity score 8.01
Match between Sncg_pos_patterns_8 and L5ET_pos_patterns_11 with similarity score 7.04
Match between Sncg_pos_patterns_9 and L6IT_pos_patterns_8 with similarity score 9.68
Reading file modisco_results_ft_2000/Sst_modisco_results.h5
Match between Sst_pos_patterns_0 and Pvalb_pos_patterns_2 with similarity score 5.05
Match between Sst_pos_patterns_1 and Pvalb_pos_patterns_1 with similarity score 7.42
Match between Sst_pos_patterns_2 and Sncg_pos_patterns_5 with similarity score 4.39
Match between Sst_pos_patterns_3 and Sncg_pos_patterns_1 with similarity score 8.86
Match between Sst_pos_patterns_5 and L5ET_pos_patterns_2 with similarity score 5.49
Match between Sst_pos_patterns_7 and Oligo_pos_patterns_11 with similarity score 5.64
Match between Sst_pos_patterns_9 and L6IT_pos_patterns_2 with similarity score 5.56
Match between Sst_pos_patterns_10 and Micro_PVM_pos_patterns_2 with similarity score 5.01
Reading file modisco_results_ft_2000/SstChodl_modisco_results.h5
Match between SstChodl_pos_patterns_0 and L6b_neg_patterns_2 with similarity score 6.13
Match between SstChodl_pos_patterns_1 and Pvalb_pos_patterns_1 with similarity score 5.67
Match between SstChodl_pos_patterns_2 and Oligo_pos_patterns_3 with similarity score 4.92
Match between SstChodl_pos_patterns_3 and L6IT_pos_patterns_2 with similarity score 8.13
Match between SstChodl_pos_patterns_4 and Astro_pos_patterns_4 with similarity score 4.74
Match between SstChodl_pos_patterns_5 and Astro_pos_patterns_12 with similarity score 7.57
Match between SstChodl_pos_patterns_7 and SstChodl_pos_patterns_2 with similarity score 4.51
Match between SstChodl_pos_patterns_8 and SstChodl_pos_patterns_2 with similarity score 4.62
Match between SstChodl_pos_patterns_9 and Pvalb_pos_patterns_7 with similarity score 5.55
Match between SstChodl_pos_patterns_10 and Endo_pos_patterns_4 with similarity score 6.52
Match between SstChodl_pos_patterns_11 and L6IT_pos_patterns_8 with similarity score 12.00
Match between SstChodl_pos_patterns_12 and Sst_pos_patterns_0 with similarity score 4.72
Reading file modisco_results_ft_2000/VLMC_modisco_results.h5
Match between VLMC_neg_patterns_0 and Micro_PVM_neg_patterns_4 with similarity score 4.60
Match between VLMC_neg_patterns_2 and Oligo_neg_patterns_3 with similarity score 5.39
Match between VLMC_pos_patterns_0 and Endo_pos_patterns_0 with similarity score 10.39
Match between VLMC_pos_patterns_1 and Oligo_pos_patterns_7 with similarity score 6.37
Match between VLMC_pos_patterns_2 and Endo_pos_patterns_3 with similarity score 12.00
Match between VLMC_pos_patterns_3 and Endo_pos_patterns_5 with similarity score 5.37
Match between VLMC_pos_patterns_4 and Endo_pos_patterns_12 with similarity score 8.03
Match between VLMC_pos_patterns_5 and Endo_pos_patterns_6 with similarity score 4.79
Match between VLMC_pos_patterns_7 and Oligo_pos_patterns_2 with similarity score 5.00
Match between VLMC_pos_patterns_8 and Endo_pos_patterns_10 with similarity score 7.54
Match between VLMC_pos_patterns_10 and Sst_pos_patterns_0 with similarity score 5.07
Match between VLMC_pos_patterns_11 and Endo_pos_patterns_9 with similarity score 9.73
Match between VLMC_pos_patterns_13 and Lamp5_pos_patterns_6 with similarity score 4.71
Reading file modisco_results_ft_2000/Vip_modisco_results.h5
Match between Vip_neg_patterns_0 and Micro_PVM_neg_patterns_1 with similarity score 4.75
Match between Vip_pos_patterns_0 and L5ET_pos_patterns_1 with similarity score 5.87
Match between Vip_pos_patterns_1 and Sst_pos_patterns_0 with similarity score 5.43
Match between Vip_pos_patterns_2 and Lamp5_pos_patterns_1 with similarity score 4.59
Match between Vip_pos_patterns_3 and Sncg_pos_patterns_6 with similarity score 6.60
Match between Vip_pos_patterns_4 and Sst_pos_patterns_5 with similarity score 5.21
Match between Vip_pos_patterns_5 and Sncg_pos_patterns_2 with similarity score 9.55
Match between Vip_pos_patterns_6 and Sst_pos_patterns_7 with similarity score 5.61
Match between Vip_pos_patterns_7 and L6IT_pos_patterns_2 with similarity score 5.05
Merged patterns Astro_neg_patterns_0 and VLMC_neg_patterns_0 with similarity 4.49281973389049
Merged patterns Astro_neg_patterns_0 and Oligo_neg_patterns_2 with similarity 4.532187312470385
Merged patterns Astro_neg_patterns_0 and Oligo_neg_patterns_3 with similarity 4.509073822566347
Merged patterns L6IT_neg_patterns_1 and L6CT_pos_patterns_0 with similarity 4.918786284377535
Merged patterns L6IT_neg_patterns_1 and Sncg_pos_patterns_3 with similarity 5.873027627667018
Merged patterns Oligo_pos_patterns_2 and Sst_pos_patterns_0 with similarity 4.3959105935447065
Merged patterns Oligo_pos_patterns_2 and Vip_pos_patterns_8 with similarity 4.445761747019882
Merged patterns Lamp5_pos_patterns_1 and Sncg_pos_patterns_0 with similarity 4.425873526057055
Merged patterns L5ET_pos_patterns_1 and Oligo_pos_patterns_1 with similarity 5.873027628818805
Merged patterns L5ET_pos_patterns_1 and VLMC_pos_patterns_1 with similarity 5.153519272990622
Merged patterns OPC_pos_patterns_5 and OPC_pos_patterns_3 with similarity 4.698762562373
Merged patterns L6IT_pos_patterns_8 and Astro_pos_patterns_13 with similarity 4.6447705626451095
Merged patterns Astro_pos_patterns_9 and Oligo_pos_patterns_9 with similarity 4.4426318724326315
Merged patterns L5ET_pos_patterns_11 and L2_3IT_pos_patterns_13 with similarity 4.373996780635363
Merged patterns SstChodl_pos_patterns_5 and Sst_pos_patterns_5 with similarity 4.755843928748292
Merged patterns VLMC_pos_patterns_0 and VLMC_pos_patterns_2 with similarity 4.405712577494168
Merged patterns VLMC_pos_patterns_0 and VLMC_pos_patterns_9 with similarity 4.7095358108771395
Merged patterns Endo_pos_patterns_6 and Lamp5_pos_patterns_2 with similarity 4.26441143655451
Merged patterns L5_6NP_pos_patterns_5 and L5ET_pos_patterns_8 with similarity 4.60344892622454
Merged patterns Pvalb_pos_patterns_5 and L2_3IT_pos_patterns_10 with similarity 4.600872270055695
Merged patterns L5IT_neg_patterns_3 and L6IT_neg_patterns_2 with similarity 4.668909829360258
Merged patterns Sncg_pos_patterns_2 and L5ET_pos_patterns_6 with similarity 4.251197747887697
Merged patterns L2_3IT_neg_patterns_7 and VLMC_pos_patterns_14 with similarity 5.4742936955697665
Merged patterns L2_3IT_neg_patterns_10 and L5IT_neg_patterns_12 with similarity 5.0138239176771044
Merged patterns L2_3IT_pos_patterns_15 and Sst_pos_patterns_7 with similarity 4.987630140281623
Merged patterns L5ET_neg_patterns_0 and L6b_neg_patterns_1 with similarity 4.9912905383480295
Merged patterns L6b_pos_patterns_9 and L6IT_neg_patterns_13 with similarity 4.527979417979521
Merged patterns L6IT_neg_patterns_4 and L6IT_neg_patterns_9 with similarity 4.288197001214521
Merged patterns Pvalb_pos_patterns_3 and Sst_pos_patterns_12 with similarity 4.347864160345473
Merged patterns Sst_pos_patterns_6 and Sst_pos_patterns_11 with similarity 4.55784659568327
Iteration 1: Merging complete, checking again
Merged patterns L6CT_pos_patterns_0 and L5ET_pos_patterns_1 with similarity 5.270968075224089
Merged patterns L6CT_pos_patterns_0 and Endo_neg_patterns_1 with similarity 4.423326782648028
Merged patterns Astro_pos_patterns_13 and L2_3IT_pos_patterns_9 with similarity 5.02937471908944
Merged patterns L5ET_pos_patterns_6 and Lamp5_pos_patterns_4 with similarity 4.530131691462881
Iteration 2: Merging complete, checking again
Discarded 7 patterns below IC threshold 0.2 and with a single class instance:
['Endo_neg_patterns_2', 'Endo_neg_patterns_3', 'L5IT_neg_patterns_10', 'L5IT_neg_patterns_14', 'L6IT_neg_patterns_7', 'L6IT_neg_patterns_11', 'VLMC_neg_patterns_1']
Total iterations: 2
(19, 108)

Now we can plot a clustermap of cell types/classes and patterns, where the classes are clustered purely on pattern importance with crested.tl.modisco.generate_nucleotide_sequences() and crested.pl.patterns.clustermap()

import matplotlib

%matplotlib inline
matplotlib.rcParams["pdf.fonttype"] = 42
matplotlib.rcParams["ps.fonttype"] = 42

pat_seqs = crested.tl.modisco.generate_nucleotide_sequences(all_patterns)
crested.pl.patterns.clustermap(
    pattern_matrix,
    list(adata.obs_names),
    figsize=(25, 4.2),
    pat_seqs=pat_seqs,
    grid=True,
    dendrogram_ratio=(0.03, 0.15),
    importance_threshold=5,
)
../_images/3ec09d1ff1f96975ac397a10c0424d726bd5ea6cc48a0987d461079cb67b5dc3.png

If you have the horizontal space for it, you can also add the PWM/Contrib logos to the x-axis.

crested.pl.patterns.clustermap_with_pwm_logos(
    pattern_matrix,
    list(adata.obs_names),
    pattern_dict=all_patterns,
    figsize=(25, 4.2),
    grid=True,
    dendrogram_ratio=(0.03, 0.15),
    importance_threshold=5,  # 3.5,
    logo_height_fraction=0.35,
    logo_y_padding=0.25,
)
../_images/611c2732da28123db3c979f1c5ac5ffea031a58b74a10816ac47a2b2a75f857c.png
<seaborn.matrix.ClusterGrid at 0x7eec542b46d0>

We can also subset to classes we are interested in and want to compare in more detail.

%matplotlib inline
pat_seqs = crested.tl.modisco.generate_nucleotide_sequences(all_patterns)
crested.pl.patterns.clustermap_with_pwm_logos(
    pattern_matrix,
    classes=list(adata.obs_names),
    pattern_dict=all_patterns,
    subset=["Astro", "OPC", "Oligo"],
    figsize=(10, 3),
    grid=True,
    logo_height_fraction=0.35,
    logo_y_padding=0.3,
)
../_images/093b3f204f408e97d78a1d968c023928ab4a35372d1865b089cf384a3ef48722.png
<seaborn.matrix.ClusterGrid at 0x7f04bd7b6490>
%matplotlib inline
pat_seqs = crested.tl.modisco.generate_nucleotide_sequences(all_patterns)
crested.pl.patterns.clustermap_with_pwm_logos(
    pattern_matrix,
    classes=list(adata.obs_names),
    subset=["L2_3IT", "L5ET", "L5IT", "L5_6NP", "L6CT", "L6IT", "L6b"],
    pattern_dict=all_patterns,
    figsize=(15, 3),
    grid=True,
    logo_height_fraction=0.35,
    logo_y_padding=0.3,
    importance_threshold=4,
)
../_images/b8589e7d978514df0868fbc330ffa642c514094491416d5e2f3b5e2728c16558.png
<seaborn.matrix.ClusterGrid at 0x7eee119da750>

Additional pattern insights#

It is always interesting to investigate specific patterns show in the clustermap above. Here there are some example function on how to do that.

Plotting patterns based on their indices can be done with crested.pl.patterns.selected_instances()

pattern_indices = [53]
crested.pl.patterns.selected_instances(
    all_patterns, pattern_indices
)  # The pattern that is shown is the most representative pattern of the cluster with the highest average IC
../_images/08c82af0d3b81eb3c9c335d1a187c260b7f1d5ed8e564d84fa3b7ecd3365fe63.png

We can also do a check of pattern similarity

idx1 = 1
idx2 = 2
sim = crested.tl.modisco.pattern_similarity(all_patterns, idx1, idx2)
print("Pattern similarity is " + str(sim))
crested.pl.patterns.selected_instances(all_patterns, [idx1, idx2])
Pattern similarity is tensor(0.2183, dtype=torch.float64)
../_images/1f419174370ad00ca4c306cfb616129bc7cd6b5d8b618d58c6df25836d7c6687.png

We can plot all the instances of patterns in the same cluster with crested.pl.patterns.class_instances()

crested.pl.patterns.class_instances(all_patterns, 5)
../_images/11d1f2e91b2361ea996c7d243c406235565f9ac5e35138336f3cbd36a923876b.png

If you want to find out in which pattern cluster a certain pattern is from your modisco results, you can call the crested.tl.modisco.find_pattern() function.

idx = crested.tl.modisco.find_pattern("OPC_neg_patterns_0", all_patterns)
if idx is not None:
    print("Pattern index is " + str(idx))
    crested.pl.patterns.class_instances(all_patterns, idx, class_representative=True)
Pattern index is 87
../_images/73844869dcdfc694a34058b5d09e957dde5756c74131641996ef06f0fc0df38e.png

Finally we can also plot the similarity between all patterns with crested.tl.modisco.calculate_similarity_matrix() and crested.pl.patterns.similarity_heatmap()

sim_matrix, indices = crested.tl.modisco.calculate_similarity_matrix(all_patterns)
crested.pl.patterns.similarity_heatmap(sim_matrix, indices, fig_size=(42, 17))
../_images/a14798dd942f418d9c6d6d17fc390ab264b82f4e3d5ed5354ede2e11152cef66.png

Matching patterns to TF candidates from scRNA-seq data [Optional]#

To understand the actual transcription factor (TF) candidates binding to the characteristic patterns/potential binding sites per cell type, we can propose potential candidates through scRNA-seq data and a TF-motif collection file.

This analysis requires that you ran tfmodisco-lite with the report function such that each pattern has potential MEME database hits and that you have multiome data. The names in the motif database should match those in the TF-motif collection file.

Load scRNA-seq data#

Load scRNA seq data and calculate mean expression per cell type using crested.tl.modisco.calculate_mean_expression_per_cell_type()

import crested

file_path = (
    "../../../../mouse/biccn/Mouse_rna.h5ad"  # Locate h5 file containing scRNAseq data
)
cell_type_column = "subclass_Bakken_2022"
mean_expression_df = crested.tl.modisco.calculate_mean_expression_per_cell_type(
    file_path, cell_type_column, cpm_normalize=True
)
crested.pl.patterns.tf_expression_per_cell_type(
    mean_expression_df, ["Nfia", "Spi1", "Mef2c"]
)
../_images/0020c7d6b0c0fb4eb8fe7668b3c695a71d29eb5a54c8f1a1eda70c44c647081e.png

Generating pattern to database motif dictionary#

classes = list(adata.obs_names)
contribution_dir = "modisco_results_ft_2000"
html_paths = crested.tl.modisco.generate_html_paths(
    all_patterns, classes, contribution_dir
)

pattern_match_dict = crested.tl.modisco.find_pattern_matches(
    all_patterns, html_paths, q_val_thr=0.05
)  # q_val threshold to only select significant matches

Loading TF-motif database#

motif_to_tf_df = crested.tl.modisco.read_motif_to_tf_file(
    "../../../../tools/Motif_collection.tsv"
)
motif_to_tf_df
logo Motif_name Cluster Human_Direct_annot Human_Orthology_annot Mouse_Direct_annot Mouse_Orthology_annot Fly_Direct_annot Fly_Orthology_annot Cluster_Human_Direct_annot Cluster_Human_Orthology_annot Cluster_Mouse_Direct_annot Cluster_Mouse_Orthology_annot Cluster_Fly_Direct_annot Cluster_Fly_Orthology_annot
0 <img src="https://motifcollections.aertslab.org/v10/logos/bergman__Adf1.png" height="52" alt="bergman__Adf1"></img> bergman__Adf1 NaN NaN NaN NaN NaN Adf1 NaN NaN NaN NaN NaN NaN NaN
1 <img src="https://motifcollections.aertslab.org/v10/logos/bergman__Aef1.png" height="52" alt="bergman__Aef1"></img> bergman__Aef1 NaN NaN NaN NaN NaN Aef1 NaN NaN NaN NaN NaN NaN NaN
2 <img src="https://motifcollections.aertslab.org/v10/logos/bergman__ap.png" height="52" alt="bergman__ap"></img> bergman__ap NaN NaN NaN NaN NaN ap NaN NaN NaN NaN NaN NaN NaN
3 <img src="https://motifcollections.aertslab.org/v10/logos/elemento__ACCTTCA.png" height="52" alt="elemento__ACCTTCA"></img> elemento__ACCTTCA NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 <img src="https://motifcollections.aertslab.org/v10/logos/bergman__bcd.png" height="52" alt="bergman__bcd"></img> bergman__bcd NaN NaN NaN NaN NaN bcd NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22471 <img src="https://motifcollections.aertslab.org/v10/logos/elemento__CAAGGAG.png" height="52" alt="elemento__CAAGGAG"></img> elemento__CAAGGAG 98.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22472 <img src="https://motifcollections.aertslab.org/v10/logos/elemento__TCCTTGC.png" height="52" alt="elemento__TCCTTGC"></img> elemento__TCCTTGC 98.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22473 <img src="https://motifcollections.aertslab.org/v10/logos/swissregulon__hs__ZNF274.png" height="52" alt="swissregulon__hs__ZNF274"></img> swissregulon__hs__ZNF274 99.1 ZNF274 NaN NaN Zfp369, Zfp110 NaN NaN ZNF274 NaN NaN Zfp369, Zfp110 NaN NaN
22474 <img src="https://motifcollections.aertslab.org/v10/logos/swissregulon__sacCer__THI2.png" height="52" alt="swissregulon__sacCer__THI2"></img> swissregulon__sacCer__THI2 99.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22475 <img src="https://motifcollections.aertslab.org/v10/logos/jaspar__MA0407.1.png" height="52" alt="jaspar__MA0407.1"></img> jaspar__MA0407.1 99.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

22476 rows × 15 columns

Matching patterns to TF candidates#

We calculate a pattern-tf by cell type matrix which contains the imporatance of each pattern linked to a TF per cell type using crested.tl.modisco.create_pattern_tf_dict() and crested.tl.modisco.create_tf_ct_matrix()

cols = [
    "Mouse_Direct_annot",
    "Mouse_Orthology_annot",
    "Cluster_Mouse_Direct_annot",
    "Cluster_Mouse_Orthology_annot",
]
pattern_tf_dict, all_tfs = crested.tl.modisco.create_pattern_tf_dict(
    pattern_match_dict, motif_to_tf_df, all_patterns, cols
)
tf_ct_matrix, tf_pattern_annots = crested.tl.modisco.create_tf_ct_matrix(
    pattern_tf_dict,
    all_patterns,
    mean_expression_df,
    classes,
    log_transform=True,
    normalize_pattern_importances=False,
    normalize_gex=True,
    min_tf_gex=0.95,
    importance_threshold=5.5,
    pattern_parameter="seqlet_count_log",
    filter_correlation=True,
    verbose=True,
    zscore_threshold=1.5,
    correlation_threshold=0.35,
)
Total columns before threshold filtering: 989
Total columns after threshold filtering: 149
Total columns removed: 840
Total columns before correlation filtering: 149
Total columns after correlation filtering: 103
Total columns removed: 46

Finally, we can plot a clustermap of potential pattern-TF matches and their importance per cell type with crested.pl.patterns.clustermap_tf_motif()

crested.pl.patterns.clustermap_tf_motif(
    tf_ct_matrix,
    heatmap_dim="contrib",
    dot_dim="gex",
    class_labels=classes,
    pattern_labels=tf_pattern_annots,
    fig_size=(35, 6),
    cluster_rows=True,
    cluster_columns=False,
)
../_images/6c67e7edbe34de8047c136d100e20ccd01338543a52461550d7412e14b5cdb36.png
crested.pl.patterns.clustermap_tf_motif(
    tf_ct_matrix,
    heatmap_dim="contrib",
    dot_dim="gex",
    class_labels=classes,
    subset_classes=["Lamp5", "Sncg", "Vip", "Pvalb", "Sst", "SstChodl"],
    pattern_labels=tf_pattern_annots,
    fig_size=(15, 6),
    cluster_rows=False,
    cluster_columns=False,
)
../_images/4edffda669d1999034ec5927cc1d4f56b4e40504c9503aee8f2959b2c10452d3.png