Enhancer code analysis#

In this notebook we will go over how to obtain cell type characteristic sequence patterns from CREsted models, or any other model, using CREsted functionality and the extra packages from crested[motif]: tfmodisco-lite and memesuite-lite.

Hide code cell source

# Set package settings
import matplotlib
import os

## Set the font type to ensure text is saved as whole words
matplotlib.rcParams["pdf.fonttype"] = 42  # Use TrueType fonts instead of Type 3 fonts
matplotlib.rcParams["ps.fonttype"] = 42  # For PostScript as well, if needed

## Set the base directory for data retrieval with crested.get_dataset()/get_model()
os.environ['CRESTED_DATA_DIR'] = '/home/VIB.LOCAL/niklas.kempynck/nkemp/crested_data/Figure_2/'
data_dir = os.environ['CRESTED_DATA_DIR']

Obtaining contribution scores per model class and running tfmodisco-lite#

Before we can do any analysis, we need to calculate the contribution scores for cell type-specific regions. These regions hold the most sequence information relevant for cell type identity. From those, we can run tfmodisco-lite.

import anndata as ad
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import crested
import keras
import pickle

Load data and CREsted model#

We’ll load the dataset and model from the the introduction notebook:

# Set the genome
genome = crested.Genome(
    os.path.join(data_dir, "mm10/mm10.fa"),
    os.path.join(data_dir, "mm10/mm10.chrom.sizes")
)
crested.register_genome(genome)  # Register the genome so that it's automatically used in every function
2026-03-26T14:52:51.519305+0100 INFO Genome mm10 registered.
# Read in the anndata
adata = ad.read_h5ad(os.path.join(data_dir, "mouse_cortex.h5ad"))
# Load a trained model
model_path = "mouse_biccn/finetuned_model/checkpoints/02.keras" # change to your model path
model = crested.utils.load_model(model_path)

Select the most informative regions per cell type#

To obtain cell type-characteristic patterns, we need to calculate contribution scores on highly specific regions. For this purpose, we’ve included a preprocessing function crested.pp.sort_and_filter_regions_on_specificity() to keep the top k most specific regions per cell type that you can use to filter your data before running modisco.

There are three options to select the top regions: either purely based on peak height, purely based on predictions, or on their combination. Here we show how to use the combination of both (which we recommend, see Johansen & Kempynck et al., 2025).

# Store predictions for all our regions in the anndata object
predictions = crested.tl.predict(adata, model)
adata.layers["biccn_model"] = predictions.T
  12/4274 ━━━━━━━━━━━━━━━━━━━━ 1:08 16ms/step
4270/4274 ━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
4274/4274 ━━━━━━━━━━━━━━━━━━━━ 105s 19ms/step
# Calculate the average of the ground truth and predictions
adata.layers['combined'] = (adata.X + adata.layers["biccn_model"])/2

We can use crested.pl.qc.sort_and_filter_cutoff() to look at the gini score distributions for the different regions per class.

Here, we take the top 2000 most specific regions, but top 500 or 1000 regions would give similar results.

%matplotlib inline
crested.pl.qc.sort_and_filter_cutoff(adata, model_name="combined", cutoffs=[500, 1000, 2000], max_k=5000)
2026-03-26T14:55:04.575502+0100 INFO Lazily importing module crested.pl. This could take a second...
../_images/212a3e3414881e33b19a18f50954d5c9b1e5fee7bdb8176d7f74d3da9277fa13.png
# Filter most informative regions per class
top_k = 2000
adata_filtered = crested.pp.sort_and_filter_regions_on_specificity(adata, model_name="combined", top_k=top_k, method="gini", inplace=False)
adata_filtered
2026-03-26T14:55:17.767796+0100 INFO After sorting and filtering, kept 38000 regions.
AnnData object with n_obs × n_vars = 19 × 38000
    obs: 'file_path'
    var: 'chr', 'start', 'end', 'split', 'Class name', 'rank', 'gini_score'
    obsm: 'weights'
    layers: 'DeepBICCN2', 'DeepBICCN2_base', 'gReLU_22M', 'gReLU_6M', 'biccn_model', 'combined'
adata_filtered.var
chr start end split Class name rank gini_score
region
chr10:45499432-45501546 chr10 45499432 45501546 val Astro 1 0.827100
chr2:65274604-65276718 chr2 65274604 65276718 train Astro 2 0.826102
chrX:23135863-23137977 chrX 23135863 23137977 train Astro 3 0.821430
chr1:161272641-161274755 chr1 161272641 161274755 train Astro 4 0.816905
chr5:9868290-9870404 chr5 9868290 9870404 train Astro 5 0.810628
... ... ... ... ... ... ... ...
chr9:22789193-22791307 chr9 22789193 22791307 test Vip 1996 0.535163
chr6:14177242-14179356 chr6 14177242 14179356 train Vip 1997 0.535157
chr7:137374461-137376575 chr7 137374461 137376575 train Vip 1998 0.535079
chr17:57934356-57936470 chr17 57934356 57936470 train Vip 1999 0.535075
chr1:184003351-184005465 chr1 184003351 184005465 train Vip 2000 0.535071

38000 rows × 7 columns

Calculating contribution scores per class#

Now you can calculate the contribution scores for all the regions in your filtered AnnData.
By default, the contribution scores are calculated using the expected integrated gradients method, but you can change this to simple integrated gradients to speed up the calculation. We’ve found anecdotally that this has a very minor effect on the quality of the contribution scores, while speeding up the calculation significantly.

crested.tl.contribution_scores_specific(
    input=adata_filtered,
    target_idx=None,  # We calculate for all classes
    model=model,
    output_dir="modisco_results_ft_2000",
    method="integrated_grad"
)

Hide code cell output

2026-02-18T22:09:57.090369+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:12:03.960105+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:14:02.281014+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:16:00.581708+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:17:59.153553+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:19:57.538069+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:21:55.870284+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:25:52.376153+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:27:50.404973+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:29:48.763463+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:31:46.930498+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:33:46.036175+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:35:44.168858+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:37:42.562841+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:39:40.721231+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:41:39.174031+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:43:37.742379+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
2026-02-18T22:45:35.769823+0100 INFO Calculating contribution scores for 1 class(es) and 2000 region(s).
(array([[[[-1.48615626e-07, -5.54927112e-07,  4.07066267e-07, ...,
           -1.79133679e-07,  7.91571267e-07, -9.54171497e-08],
          [ 2.84154055e-07,  3.62958559e-07,  6.59639511e-07, ...,
            1.65198423e-06,  6.93648587e-07,  6.85019188e-07],
          [ 1.58720212e-07,  1.27419142e-07,  2.07420385e-07, ...,
           -7.57875739e-07, -7.95578956e-07, -2.73807672e-07],
          [-2.04508495e-07,  3.26645903e-07, -9.35306844e-07, ...,
           -9.60897182e-07,  3.23470566e-08, -3.14376678e-07]]],
 
 
        [[[ 1.38958785e-06, -1.30966725e-06, -5.06607375e-06, ...,
           -1.16397632e-05, -5.93204959e-06, -3.62574792e-06],
          [-3.17109595e-07,  2.51010647e-07, -3.71272307e-07, ...,
            1.86718694e-06,  2.69274597e-06,  4.92592653e-06],
          [ 2.27687622e-07, -7.58866179e-07,  1.47926028e-06, ...,
            3.09061252e-05,  1.20547866e-05, -5.71382179e-06],
          [-1.85486203e-06,  3.86895226e-06,  4.91772153e-06, ...,
           -1.04782321e-05, -5.92017886e-06,  3.52403322e-06]]],
 
 
        [[[-7.97643736e-07, -3.85100606e-07, -9.47989420e-08, ...,
            4.23821547e-07,  5.40903329e-07, -9.31905106e-07],
          [ 9.30548481e-07,  2.43477984e-06, -4.68601691e-08, ...,
           -1.04987203e-06, -1.52894552e-06, -8.91138612e-08],
          [-4.05994570e-07, -1.15031787e-06, -4.41645710e-07, ...,
            1.60890374e-06,  5.07899529e-07,  1.45172123e-07],
          [ 6.96884001e-07, -8.35631113e-07,  1.02211168e-07, ...,
           -4.88856870e-07,  3.50123827e-07,  7.61683054e-07]]],
 
 
        ...,
 
 
        [[[-1.57228055e-06,  3.17765525e-09,  2.14368313e-07, ...,
            1.71230147e-06,  4.87835405e-06,  1.38985786e-06],
          [ 2.29700549e-06, -2.85101783e-06,  1.34790946e-06, ...,
           -4.60111278e-06, -2.89423065e-06, -2.40853046e-06],
          [-1.60753962e-06,  7.44850468e-07, -2.92837126e-06, ...,
            1.70980388e-06, -6.49964363e-07,  1.65252720e-06],
          [ 6.95323479e-07,  2.22642188e-06,  1.17253705e-06, ...,
            2.39325936e-06, -2.82814631e-06, -4.49548168e-07]]],
 
 
        [[[ 1.07652374e-06, -2.11098222e-06, -4.70268617e-07, ...,
           -8.07893775e-06, -5.11730605e-06, -3.85602380e-06],
          [-1.14714965e-06,  6.61964975e-07, -1.61520995e-06, ...,
            3.53050041e-06,  2.91305514e-06,  2.93008748e-06],
          [ 3.23378771e-07,  2.05736205e-06,  2.07632320e-06, ...,
            6.25897792e-06, -7.90520403e-07,  3.05838034e-06],
          [-1.81226682e-07, -1.31726586e-06, -1.25529581e-07, ...,
           -8.64686444e-06,  2.35491757e-06, -2.12834334e-06]]],
 
 
        [[[-9.90992021e-07,  3.45539547e-06, -7.99445843e-06, ...,
           -1.64329231e-06,  1.53056931e-06,  3.08550898e-06],
          [-5.81575932e-07, -4.70697023e-06,  6.51809933e-06, ...,
            3.41147097e-06, -6.72604438e-06,  1.11118268e-07],
          [-5.77494689e-07,  3.27210250e-06, -2.31046465e-06, ...,
            2.65483209e-06,  3.99918690e-06,  4.32558656e-09],
          [ 1.26465181e-07, -2.15214413e-06,  1.48661877e-06, ...,
           -8.24616927e-06, -1.60217598e-06, -4.17409910e-06]]]]),
 array([[[0., 0., 1., ..., 0., 0., 0.],
         [0., 1., 0., ..., 1., 0., 1.],
         [0., 0., 0., ..., 0., 0., 0.],
         [1., 0., 0., ..., 0., 1., 0.]],
 
        [[1., 0., 0., ..., 0., 1., 1.],
         [0., 1., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 1., 0., 0.],
         [0., 0., 1., ..., 0., 0., 0.]],
 
        [[1., 0., 0., ..., 0., 0., 1.],
         [0., 0., 0., ..., 1., 1., 0.],
         [0., 0., 1., ..., 0., 0., 0.],
         [0., 1., 0., ..., 0., 0., 0.]],
 
        ...,
 
        [[1., 0., 0., ..., 1., 0., 1.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 1., 0., ..., 0., 1., 0.],
         [0., 0., 1., ..., 0., 0., 0.]],
 
        [[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 1., 0.],
         [1., 1., 0., ..., 0., 0., 0.],
         [0., 0., 1., ..., 1., 0., 1.]],
 
        [[1., 0., 1., ..., 1., 0., 0.],
         [0., 0., 0., ..., 0., 1., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 1., 0., ..., 0., 0., 1.]]], dtype=float32))

Running tfmodisco-lite#

When this is done, you can run TF-MoDISco-lite on the saved contribution scores to find motifs that are important for the classification/regression task.

You could use the tfmodisco package directly to do this, or you could use the crested.tl.modisco.tfmodisco() function which is essentially a wrapper around the tfmodisco package.

Note that from here on, you can use contribution scores from any model trained in any framework, as this analysis just requires a set of one hot encoded sequences and contribution scores per cell type stored in the same directory.

If you don’t have tomtom available on the command line, set report=False in crested.tl.modisco.tfmodisco(). You won’t get an html report matching the motifs to their closest match in the motif database, but you’ll still get a collection of clustered seqlets for use in the rest of the notebook.

meme_db, motif_to_tf_file = crested.get_motif_db()
# run tfmodisco on the contribution scores
crested.tl.modisco.tfmodisco(
    window=1000,
    output_dir="modisco_results_ft_2000",
    contrib_dir="modisco_results_ft_2000",
    report=False,  # Optional, will match patterns to motif MEME database - deactivate if you don't have TOMTOM read on the command line
    meme_db=meme_db,  # File to MEME database - not needed if not generating reports
    max_seqlets=20000,
)
2026-02-18T22:48:10.031481+0100 INFO No class names provided, using all found in the contribution directory: ['Pvalb', 'Lamp5', 'L5ET', 'Oligo', 'L6IT', 'L2_3IT', 'Endo', 'Sst', 'L5IT', 'Sncg', 'Astro', 'VLMC', 'OPC', 'L6CT', 'Vip', 'L5_6NP', 'L6b', 'Micro_PVM', 'SstChodl']
2026-02-18T22:48:10.527970+0100 INFO Running modisco for class: Pvalb
Using 11070 positive seqlets
Extracted 1602 negative seqlets
2026-02-18T22:55:24.463964+0100 INFO Running modisco for class: Lamp5
Using 11127 positive seqlets
Extracted 1617 negative seqlets
2026-02-18T23:02:18.775590+0100 INFO Running modisco for class: L5ET
Using 12954 positive seqlets
Extracted 1638 negative seqlets
2026-02-18T23:10:50.207266+0100 INFO Running modisco for class: Oligo
Using 13245 positive seqlets
Extracted 3037 negative seqlets
2026-02-18T23:20:09.298092+0100 INFO Running modisco for class: L6IT
Using 15015 positive seqlets
2026-02-18T23:29:28.185590+0100 INFO Running modisco for class: L2_3IT
Using 15115 positive seqlets
Extracted 2734 negative seqlets
2026-02-18T23:39:27.206960+0100 INFO Running modisco for class: Endo
Using 7876 positive seqlets
Extracted 1529 negative seqlets
2026-02-18T23:44:16.357842+0100 INFO Running modisco for class: Sst
Using 13196 positive seqlets
Extracted 2422 negative seqlets
2026-02-18T23:53:55.514966+0100 INFO Running modisco for class: L5IT
Using 14035 positive seqlets
Extracted 2655 negative seqlets
2026-02-19T00:04:25.113868+0100 INFO Running modisco for class: Sncg
Using 11203 positive seqlets

Analysis of cell-type specific sequence patterns#

Once you have obtained your modisco results, you can plot the the found patterns using the crested.pl.modisco.modisco_results() function.

%matplotlib inline
top_k = 1000
crested.pl.modisco.modisco_results(
    classes=["Astro", "L5ET", "Vip", "Oligo"],
    contribution="positive",
    contribution_dir="modisco_results_ft_2000",
    num_seq=top_k,
    y_max=0.15,
    viz="pwm",
)  # You can also visualize in 'pwm' format
2026-03-26T14:55:17.854668+0100 INFO Starting genomic contributions plot for classes: ['Astro', 'L5ET', 'Vip', 'Oligo']
../_images/043c88f59efcde4a9247e45851a1aaae2949763d5a18c149e5dfc21ddda6b368.png

Matching patterns across cell types#

Since we have calculated per cell type the patterns independently of each other, we do not know quantitatively how and if they overlap. It can be interesting to get an overview of which patterns are found across multiple cell types, how important they are, and if there are unique patterns only found in a small selection of classes. Therefore, we have made a pattern clustering algorithm, which starts from the results of tfmodisco-lite, and returns a pattern matrix, which contains the importance of the clustered patterns per cell type, and a pattern dictionary, describing all clustered patterns.

First, we’ll obtain the modisco files per class by using crested.tl.modisco.match_h5_files_to_classes() using our selected classes.

# First we obtain the resulting modisco files per class
matched_files = crested.tl.modisco.match_h5_files_to_classes(
    contribution_dir="modisco_results_ft_2000", classes=list(adata.obs_names)
)

A quick pairwise comparison#

Before we do any pattern clustering, we can check for each independent pattern how similar it is to all the other patterns using memesuite-lite. We use crested.tl.modisco.calculate_tomtom_similarity_per_pattern(). This function returns a pairwise similarity matrix of every unique pattern, together with a list of ids and a dictionary containing additional information per pattern.

sim_matrix, pattern_ids, pattern_dict = crested.tl.modisco.calculate_tomtom_similarity_per_pattern(
    matched_files=matched_files, trim_ic_threshold=0.025, verbose=True
)
Reading file modisco_results_ft_2000/Astro_modisco_results.h5
Reading file modisco_results_ft_2000/Endo_modisco_results.h5
Reading file modisco_results_ft_2000/L2_3IT_modisco_results.h5
Reading file modisco_results_ft_2000/L5ET_modisco_results.h5
Reading file modisco_results_ft_2000/L5IT_modisco_results.h5
Reading file modisco_results_ft_2000/L5_6NP_modisco_results.h5
Reading file modisco_results_ft_2000/L6CT_modisco_results.h5
Reading file modisco_results_ft_2000/L6IT_modisco_results.h5
Reading file modisco_results_ft_2000/L6b_modisco_results.h5
Reading file modisco_results_ft_2000/Lamp5_modisco_results.h5
Reading file modisco_results_ft_2000/Micro_PVM_modisco_results.h5
Reading file modisco_results_ft_2000/OPC_modisco_results.h5
Reading file modisco_results_ft_2000/Oligo_modisco_results.h5
Reading file modisco_results_ft_2000/Pvalb_modisco_results.h5
Reading file modisco_results_ft_2000/Sncg_modisco_results.h5
Reading file modisco_results_ft_2000/Sst_modisco_results.h5
Reading file modisco_results_ft_2000/SstChodl_modisco_results.h5
Reading file modisco_results_ft_2000/VLMC_modisco_results.h5
Reading file modisco_results_ft_2000/Vip_modisco_results.h5
Total patterns: 354

We can now visualize this similarities using crested.pl.modisco.clustermap_tomtom_similarities(). We will subset the often large similarity matrix to a relevant subset. First, we will look at a pattern and find other patterns similar to it in other cell types. We add some additional group labels for visualization purposes.

nn = {"Astro", "Endo", "Micro_PVM", "OPC", "Oligo", "VLMC"}
exc = {"L2_3IT", "L5ET", "L5IT", "L5_6NP", "L6CT", "L6IT", "L6b"}
inh = {"Lamp5", "Pvalb", "Sncg", "Sst", "SstChodl", "Vip"}

groups, groups_2 = [], []
for id in pattern_ids:
    ct = "_".join(id.split("_")[:-3])
    groups.append(
        "Non-neuronal"
        if ct in nn
        else "Excitatory"
        if ct in exc
        else "Inhibitory"
        if ct in inh
        else (_ for _ in ()).throw(ValueError(f"Unknown class: {ct}"))
    )
    groups_2.append(ct)

unique_cats = pd.unique(groups_2)
group_colors_2 = {cat: mcolors.to_hex(plt.get_cmap("tab20", len(unique_cats))(i)) for i, cat in enumerate(unique_cats)}
group_colors = {"Non-neuronal": "skyblue", "Excitatory": "salmon", "Inhibitory": "green"}
%matplotlib inline
crested.pl.modisco.clustermap_tomtom_similarities(
    sim_matrix=sim_matrix,
    ids=pattern_ids,
    pattern_dict=pattern_dict,
    group_info=[(groups, group_colors), (groups_2, group_colors_2)],  # Grouping labels
    query_id="Vip_pos_patterns_1",  # Find patterns similar to this one
    threshold=3,  # TOMTOM similarity threshold, we take the -log10(pval)
    min_seqlets=100,  # Add a minimum amount of seqlets to take the most relevant patterns
)
<seaborn.matrix.ClusterGrid at 0x7f8d2fc8ded0>
../_images/19eb807610cc2d1b23a563abbab8ab55300cc218656fecad6987872f96d6c63c.png

We can also use this to look at all the different patterns in a subset of cell types, and see if we can find interesting groups of similar motifs.

crested.pl.modisco.clustermap_tomtom_similarities(
    sim_matrix=sim_matrix,
    ids=pattern_ids,
    pattern_dict=pattern_dict,
    group_info=[(groups, group_colors), (groups_2, group_colors_2)],  # Grouping labels
    class_names=["Lamp5", "Pvalb", "Sncg", "Sst", "SstChodl", "Vip"],  # Subset of classes to use patterns from
    min_seqlets=300,  # Add a minimum amount of seqlets to take the most relevant patterns
)
<seaborn.matrix.ClusterGrid at 0x7fa458119550>
../_images/7d1f4eea0a394e03cbbd3827b3425f93aa695fdcd78e43951bdfd287eaf721f0.png

Pattern clustering across cell types#

Since many patterns are similar and can be matched across cell types, we can also cluster them into groups and compare the importance of that matched motif over all cell types. We cluster all separate patterns using crested.tl.modisco.process_patterns() and create a pattern matrix with crested.tl.modisco.create_pattern_matrix().

# Then we cluster matching patterns, and define a pattern matrix [#classes, #patterns] describing their importance
all_patterns = crested.tl.modisco.process_patterns(
    matched_files,
    sim_threshold=6.5,  # The similarity threshold used for matching patterns. We take the -log10(pval), pval obtained through TOMTOM matching from memesuite-lite
    trim_ic_threshold=0.05,  # Information content (IC) threshold on which to trim patterns
    discard_ic_threshold=0.2,  # IC threshold used for discarding single instance patterns
    verbose=True,  # Useful for doing sanity checks on matching patterns
)
pattern_matrix = crested.tl.modisco.create_pattern_matrix(
    classes=list(adata.obs_names),
    all_patterns=all_patterns,
    normalize=False,
    pattern_parameter="seqlet_count_log",
)
pattern_matrix.shape

Hide code cell output

Reading file modisco_results_ft_2000/Astro_modisco_results.h5
Match between Astro_pos_patterns_11 and Astro_pos_patterns_5 with similarity score 8.58
Match between Astro_pos_patterns_15 and Astro_pos_patterns_3 with similarity score 6.78
Reading file modisco_results_ft_2000/Endo_modisco_results.h5
Match between Endo_pos_patterns_1 and Astro_pos_patterns_3 with similarity score 7.12
Match between Endo_pos_patterns_7 and Astro_neg_patterns_2 with similarity score 7.55
Reading file modisco_results_ft_2000/L2_3IT_modisco_results.h5
Match between L2_3IT_neg_patterns_9 and Endo_pos_patterns_7 with similarity score 7.06
Match between L2_3IT_neg_patterns_11 and L2_3IT_neg_patterns_8 with similarity score 6.51
Match between L2_3IT_pos_patterns_1 and Endo_pos_patterns_6 with similarity score 7.91
Match between L2_3IT_pos_patterns_7 and Astro_pos_patterns_8 with similarity score 12.88
Match between L2_3IT_pos_patterns_14 and L2_3IT_neg_patterns_4 with similarity score 7.28
Match between L2_3IT_pos_patterns_16 and Endo_pos_patterns_17 with similarity score 6.84
Match between L2_3IT_pos_patterns_17 and Endo_neg_patterns_1 with similarity score 7.34
Reading file modisco_results_ft_2000/L5ET_modisco_results.h5
Match between L5ET_neg_patterns_1 and Endo_neg_patterns_0 with similarity score 9.00
Match between L5ET_pos_patterns_1 and Astro_pos_patterns_0 with similarity score 6.70
Match between L5ET_pos_patterns_2 and L2_3IT_pos_patterns_12 with similarity score 8.30
Match between L5ET_pos_patterns_4 and L2_3IT_pos_patterns_2 with similarity score 11.42
Match between L5ET_pos_patterns_7 and L2_3IT_pos_patterns_0 with similarity score 13.15
Match between L5ET_pos_patterns_8 and L2_3IT_pos_patterns_3 with similarity score 9.37
Reading file modisco_results_ft_2000/L5IT_modisco_results.h5
Match between L5IT_neg_patterns_0 and L2_3IT_neg_patterns_0 with similarity score 6.99
Match between L5IT_neg_patterns_1 and L2_3IT_neg_patterns_2 with similarity score 9.07
Match between L5IT_neg_patterns_4 and L2_3IT_neg_patterns_4 with similarity score 9.82
Match between L5IT_neg_patterns_7 and L5ET_neg_patterns_2 with similarity score 7.07
Match between L5IT_neg_patterns_8 and L2_3IT_neg_patterns_5 with similarity score 8.51
Match between L5IT_neg_patterns_9 and L2_3IT_neg_patterns_3 with similarity score 10.56
Match between L5IT_neg_patterns_10 and Endo_neg_patterns_3 with similarity score 8.08
Match between L5IT_neg_patterns_12 and L5ET_neg_patterns_2 with similarity score 8.24
Match between L5IT_neg_patterns_13 and Endo_pos_patterns_7 with similarity score 7.59
Match between L5IT_neg_patterns_14 and L5IT_neg_patterns_10 with similarity score 6.56
Match between L5IT_pos_patterns_0 and L5ET_pos_patterns_5 with similarity score 12.41
Match between L5IT_pos_patterns_1 and L5ET_pos_patterns_3 with similarity score 7.72
Match between L5IT_pos_patterns_2 and L2_3IT_pos_patterns_0 with similarity score 13.32
Match between L5IT_pos_patterns_3 and Astro_pos_patterns_2 with similarity score 9.10
Match between L5IT_pos_patterns_4 and L2_3IT_pos_patterns_4 with similarity score 14.57
Match between L5IT_pos_patterns_5 and L2_3IT_pos_patterns_6 with similarity score 7.38
Match between L5IT_pos_patterns_6 and L2_3IT_pos_patterns_2 with similarity score 8.28
Match between L5IT_pos_patterns_7 and L5ET_pos_patterns_8 with similarity score 7.34
Match between L5IT_pos_patterns_9 and L2_3IT_pos_patterns_16 with similarity score 7.25
Match between L5IT_pos_patterns_13 and L2_3IT_pos_patterns_16 with similarity score 7.00
Reading file modisco_results_ft_2000/L5_6NP_modisco_results.h5
Match between L5_6NP_pos_patterns_0 and L2_3IT_pos_patterns_5 with similarity score 7.53
Match between L5_6NP_pos_patterns_1 and L5ET_pos_patterns_3 with similarity score 7.53
Match between L5_6NP_pos_patterns_2 and L2_3IT_pos_patterns_6 with similarity score 7.23
Match between L5_6NP_pos_patterns_3 and L5ET_pos_patterns_6 with similarity score 10.22
Match between L5_6NP_pos_patterns_4 and L2_3IT_pos_patterns_4 with similarity score 12.14
Match between L5_6NP_pos_patterns_5 and L5ET_pos_patterns_8 with similarity score 7.57
Match between L5_6NP_pos_patterns_7 and L2_3IT_pos_patterns_7 with similarity score 15.00
Match between L5_6NP_pos_patterns_9 and L5IT_pos_patterns_14 with similarity score 8.16
Reading file modisco_results_ft_2000/L6CT_modisco_results.h5
Match between L6CT_pos_patterns_0 and L5ET_pos_patterns_1 with similarity score 9.10
Match between L6CT_pos_patterns_1 and L2_3IT_pos_patterns_0 with similarity score 15.00
Match between L6CT_pos_patterns_2 and L5_6NP_pos_patterns_0 with similarity score 13.88
Match between L6CT_pos_patterns_3 and L5IT_pos_patterns_8 with similarity score 8.50
Match between L6CT_pos_patterns_4 and Endo_pos_patterns_1 with similarity score 6.67
Match between L6CT_pos_patterns_5 and L5IT_pos_patterns_14 with similarity score 6.97
Match between L6CT_pos_patterns_9 and L2_3IT_pos_patterns_4 with similarity score 8.55
Match between L6CT_pos_patterns_10 and L6CT_pos_patterns_3 with similarity score 7.39
Match between L6CT_pos_patterns_11 and L5ET_pos_patterns_6 with similarity score 7.40
Match between L6CT_pos_patterns_12 and L2_3IT_pos_patterns_0 with similarity score 6.60
Match between L6CT_pos_patterns_13 and Endo_pos_patterns_16 with similarity score 8.92
Reading file modisco_results_ft_2000/L6IT_modisco_results.h5
Match between L6IT_neg_patterns_0 and L5IT_neg_patterns_1 with similarity score 9.00
Match between L6IT_neg_patterns_2 and L2_3IT_neg_patterns_1 with similarity score 8.49
Match between L6IT_neg_patterns_7 and L6IT_neg_patterns_4 with similarity score 9.56
Match between L6IT_neg_patterns_8 and L6CT_neg_patterns_1 with similarity score 10.48
Match between L6IT_neg_patterns_9 and L6IT_neg_patterns_4 with similarity score 10.24
Match between L6IT_neg_patterns_10 and L2_3IT_neg_patterns_5 with similarity score 12.50
Match between L6IT_neg_patterns_12 and Endo_pos_patterns_7 with similarity score 8.14
Match between L6IT_neg_patterns_13 and L6IT_neg_patterns_11 with similarity score 7.63
Match between L6IT_pos_patterns_0 and L2_3IT_pos_patterns_0 with similarity score 12.33
Match between L6IT_pos_patterns_1 and L5ET_pos_patterns_1 with similarity score 11.04
Match between L6IT_pos_patterns_2 and L2_3IT_pos_patterns_2 with similarity score 11.48
Match between L6IT_pos_patterns_3 and L5ET_pos_patterns_5 with similarity score 13.08
Match between L6IT_pos_patterns_4 and L5ET_pos_patterns_3 with similarity score 9.81
Match between L6IT_pos_patterns_6 and L6CT_pos_patterns_4 with similarity score 12.39
Match between L6IT_pos_patterns_7 and L5ET_pos_patterns_6 with similarity score 9.28
Match between L6IT_pos_patterns_8 and L2_3IT_pos_patterns_7 with similarity score 13.42
Match between L6IT_pos_patterns_10 and L2_3IT_pos_patterns_8 with similarity score 10.45
Match between L6IT_pos_patterns_11 and L6CT_pos_patterns_6 with similarity score 12.52
Match between L6IT_pos_patterns_12 and L5ET_pos_patterns_8 with similarity score 9.07
Match between L6IT_pos_patterns_14 and L6CT_pos_patterns_7 with similarity score 8.15
Match between L6IT_pos_patterns_16 and L6IT_pos_patterns_0 with similarity score 6.52
Match between L6IT_pos_patterns_17 and L6CT_pos_patterns_2 with similarity score 7.81
Reading file modisco_results_ft_2000/L6b_modisco_results.h5
Match between L6b_neg_patterns_0 and L2_3IT_neg_patterns_11 with similarity score 10.00
Match between L6b_neg_patterns_1 and L6IT_neg_patterns_4 with similarity score 9.95
Match between L6b_neg_patterns_2 and L6CT_neg_patterns_0 with similarity score 8.22
Match between L6b_pos_patterns_0 and L5ET_pos_patterns_1 with similarity score 9.29
Match between L6b_pos_patterns_1 and L6CT_pos_patterns_3 with similarity score 14.57
Match between L6b_pos_patterns_2 and L6CT_pos_patterns_5 with similarity score 9.88
Match between L6b_pos_patterns_4 and L5ET_pos_patterns_6 with similarity score 7.49
Match between L6b_pos_patterns_5 and L2_3IT_pos_patterns_4 with similarity score 7.00
Match between L6b_pos_patterns_7 and L6IT_pos_patterns_0 with similarity score 12.41
Match between L6b_pos_patterns_8 and L6IT_pos_patterns_8 with similarity score 12.63
Match between L6b_pos_patterns_10 and L5ET_pos_patterns_8 with similarity score 8.52
Match between L6b_pos_patterns_12 and L6IT_pos_patterns_2 with similarity score 6.95
Reading file modisco_results_ft_2000/Lamp5_modisco_results.h5
Match between Lamp5_pos_patterns_0 and L5ET_pos_patterns_1 with similarity score 9.65
Match between Lamp5_pos_patterns_2 and L6b_pos_patterns_2 with similarity score 9.68
Match between Lamp5_pos_patterns_3 and L6CT_pos_patterns_4 with similarity score 11.69
Match between Lamp5_pos_patterns_4 and L5ET_pos_patterns_6 with similarity score 8.04
Match between Lamp5_pos_patterns_5 and L6b_pos_patterns_5 with similarity score 12.27
Match between Lamp5_pos_patterns_6 and L5ET_pos_patterns_8 with similarity score 7.93
Match between Lamp5_pos_patterns_11 and L5IT_pos_patterns_11 with similarity score 9.13
Reading file modisco_results_ft_2000/Micro_PVM_modisco_results.h5
Match between Micro_PVM_neg_patterns_0 and L5IT_neg_patterns_1 with similarity score 6.72
Match between Micro_PVM_neg_patterns_4 and Astro_neg_patterns_5 with similarity score 7.85
Match between Micro_PVM_neg_patterns_5 and L5IT_pos_patterns_10 with similarity score 7.01
Match between Micro_PVM_pos_patterns_0 and Micro_PVM_neg_patterns_2 with similarity score 8.01
Match between Micro_PVM_pos_patterns_5 and L2_3IT_neg_patterns_1 with similarity score 7.03
Match between Micro_PVM_pos_patterns_7 and Lamp5_pos_patterns_2 with similarity score 8.84
Match between Micro_PVM_pos_patterns_8 and L6CT_pos_patterns_4 with similarity score 6.89
Match between Micro_PVM_pos_patterns_11 and Endo_pos_patterns_14 with similarity score 8.05
Reading file modisco_results_ft_2000/OPC_modisco_results.h5
Match between OPC_neg_patterns_3 and Endo_neg_patterns_1 with similarity score 9.14
Match between OPC_pos_patterns_0 and Astro_pos_patterns_10 with similarity score 8.29
Match between OPC_pos_patterns_1 and L6b_pos_patterns_13 with similarity score 6.95
Match between OPC_pos_patterns_3 and Astro_pos_patterns_10 with similarity score 7.13
Match between OPC_pos_patterns_4 and Endo_pos_patterns_7 with similarity score 6.95
Match between OPC_pos_patterns_5 and Astro_pos_patterns_11 with similarity score 8.50
Match between OPC_pos_patterns_7 and Endo_pos_patterns_4 with similarity score 6.95
Match between OPC_pos_patterns_11 and OPC_pos_patterns_1 with similarity score 7.71
Match between OPC_pos_patterns_12 and L6IT_pos_patterns_8 with similarity score 15.00
Reading file modisco_results_ft_2000/Oligo_modisco_results.h5
Match between Oligo_neg_patterns_0 and Micro_PVM_neg_patterns_0 with similarity score 6.75
Match between Oligo_neg_patterns_1 and L6CT_pos_patterns_2 with similarity score 8.87
Match between Oligo_neg_patterns_4 and L6b_neg_patterns_0 with similarity score 6.88
Match between Oligo_pos_patterns_0 and OPC_pos_patterns_3 with similarity score 6.81
Match between Oligo_pos_patterns_1 and OPC_pos_patterns_1 with similarity score 7.21
Match between Oligo_pos_patterns_2 and OPC_pos_patterns_4 with similarity score 9.93
Match between Oligo_pos_patterns_5 and Endo_pos_patterns_4 with similarity score 6.83
Match between Oligo_pos_patterns_8 and Micro_PVM_pos_patterns_4 with similarity score 8.28
Match between Oligo_pos_patterns_10 and Oligo_pos_patterns_1 with similarity score 9.05
Match between Oligo_pos_patterns_11 and OPC_pos_patterns_8 with similarity score 7.89
Reading file modisco_results_ft_2000/Pvalb_modisco_results.h5
Match between Pvalb_pos_patterns_0 and Lamp5_pos_patterns_2 with similarity score 11.94
Match between Pvalb_pos_patterns_2 and Oligo_pos_patterns_2 with similarity score 7.20
Match between Pvalb_pos_patterns_4 and L5IT_pos_patterns_3 with similarity score 6.59
Match between Pvalb_pos_patterns_5 and Lamp5_pos_patterns_5 with similarity score 11.47
Match between Pvalb_pos_patterns_6 and L6IT_pos_patterns_2 with similarity score 7.41
Match between Pvalb_pos_patterns_9 and L5_6NP_pos_patterns_8 with similarity score 8.42
Match between Pvalb_pos_patterns_10 and Endo_pos_patterns_14 with similarity score 8.23
Match between Pvalb_pos_patterns_11 and Micro_PVM_pos_patterns_2 with similarity score 7.30
Reading file modisco_results_ft_2000/Sncg_modisco_results.h5
Match between Sncg_pos_patterns_0 and Lamp5_pos_patterns_1 with similarity score 7.91
Match between Sncg_pos_patterns_1 and Astro_pos_patterns_7 with similarity score 10.10
Match between Sncg_pos_patterns_2 and Lamp5_pos_patterns_2 with similarity score 9.87
Match between Sncg_pos_patterns_3 and L6CT_pos_patterns_4 with similarity score 13.15
Match between Sncg_pos_patterns_4 and L5ET_pos_patterns_1 with similarity score 7.79
Match between Sncg_pos_patterns_5 and Lamp5_pos_patterns_10 with similarity score 6.98
Match between Sncg_pos_patterns_8 and L5ET_pos_patterns_11 with similarity score 7.49
Match between Sncg_pos_patterns_9 and L6IT_pos_patterns_8 with similarity score 10.96
Reading file modisco_results_ft_2000/Sst_modisco_results.h5
Match between Sst_pos_patterns_0 and Oligo_pos_patterns_2 with similarity score 8.30
Match between Sst_pos_patterns_1 and Pvalb_pos_patterns_1 with similarity score 7.02
Match between Sst_pos_patterns_2 and Micro_PVM_pos_patterns_2 with similarity score 6.62
Match between Sst_pos_patterns_3 and Sncg_pos_patterns_1 with similarity score 9.39
Match between Sst_pos_patterns_5 and L5ET_pos_patterns_2 with similarity score 8.06
Match between Sst_pos_patterns_7 and L5ET_pos_patterns_0 with similarity score 9.71
Match between Sst_pos_patterns_9 and L6IT_pos_patterns_2 with similarity score 7.36
Reading file modisco_results_ft_2000/SstChodl_modisco_results.h5
Match between SstChodl_pos_patterns_2 and Oligo_pos_patterns_3 with similarity score 7.57
Match between SstChodl_pos_patterns_3 and L6IT_pos_patterns_2 with similarity score 9.89
Match between SstChodl_pos_patterns_5 and Astro_pos_patterns_12 with similarity score 7.72
Match between SstChodl_pos_patterns_8 and OPC_pos_patterns_2 with similarity score 6.85
Match between SstChodl_pos_patterns_11 and L6IT_pos_patterns_8 with similarity score 15.00
Match between SstChodl_pos_patterns_12 and Micro_PVM_pos_patterns_10 with similarity score 7.30
Reading file modisco_results_ft_2000/VLMC_modisco_results.h5
Match between VLMC_neg_patterns_0 and Endo_neg_patterns_0 with similarity score 8.51
Match between VLMC_neg_patterns_2 and Oligo_neg_patterns_3 with similarity score 7.02
Match between VLMC_pos_patterns_0 and Endo_pos_patterns_0 with similarity score 12.64
Match between VLMC_pos_patterns_2 and Endo_pos_patterns_3 with similarity score 15.00
Match between VLMC_pos_patterns_3 and Endo_pos_patterns_5 with similarity score 9.54
Match between VLMC_pos_patterns_4 and Endo_pos_patterns_12 with similarity score 12.44
Match between VLMC_pos_patterns_5 and Endo_pos_patterns_6 with similarity score 7.14
Match between VLMC_pos_patterns_7 and Oligo_pos_patterns_2 with similarity score 9.50
Match between VLMC_pos_patterns_8 and Endo_pos_patterns_10 with similarity score 9.81
Match between VLMC_pos_patterns_10 and Endo_pos_patterns_11 with similarity score 15.00
Match between VLMC_pos_patterns_11 and Endo_pos_patterns_9 with similarity score 15.00
Reading file modisco_results_ft_2000/Vip_modisco_results.h5
Match between Vip_neg_patterns_0 and L6CT_neg_patterns_0 with similarity score 7.68
Match between Vip_pos_patterns_0 and L2_3IT_pos_patterns_6 with similarity score 7.97
Match between Vip_pos_patterns_1 and Oligo_pos_patterns_2 with similarity score 7.87
Match between Vip_pos_patterns_3 and Sncg_pos_patterns_6 with similarity score 7.99
Match between Vip_pos_patterns_5 and Lamp5_pos_patterns_2 with similarity score 11.29
Match between Vip_pos_patterns_6 and Oligo_pos_patterns_11 with similarity score 7.91
Match between Vip_pos_patterns_7 and L6CT_pos_patterns_8 with similarity score 6.71
Match between Vip_pos_patterns_8 and Micro_PVM_pos_patterns_10 with similarity score 7.57
Iteration 1: performing 18 merges
  -> merging L5ET_neg_patterns_0 + L6b_neg_patterns_1 (sim=8.527)
  -> merging L6CT_pos_patterns_13 + L5_6NP_pos_patterns_8 (sim=8.141)
  -> merging L6CT_pos_patterns_2 + L5ET_pos_patterns_5 (sim=7.974)
  -> merging L5IT_pos_patterns_12 + L6IT_neg_patterns_13 (sim=7.710)
  -> merging L2_3IT_pos_patterns_15 + Sst_pos_patterns_7 (sim=7.700)
  -> merging Vip_pos_patterns_0 + Oligo_pos_patterns_1 (sim=7.569)
  -> merging Oligo_pos_patterns_8 + OPC_pos_patterns_9 (sim=7.491)
  -> merging Astro_neg_patterns_0 + VLMC_neg_patterns_0 (sim=7.427)
  -> merging Oligo_pos_patterns_2 + Endo_pos_patterns_11 (sim=7.361)
  -> merging Sncg_pos_patterns_3 + VLMC_pos_patterns_1 (sim=7.155)
  -> merging SstChodl_pos_patterns_5 + Sst_pos_patterns_5 (sim=7.123)
  -> merging Endo_neg_patterns_1 + VLMC_neg_patterns_1 (sim=7.082)
  -> merging Oligo_neg_patterns_2 + Oligo_neg_patterns_3 (sim=7.081)
  -> merging L6IT_pos_patterns_2 + Vip_pos_patterns_7 (sim=7.056)
  -> merging L5IT_neg_patterns_2 + Micro_PVM_neg_patterns_1 (sim=6.912)
  -> merging Endo_pos_patterns_6 + Lamp5_pos_patterns_2 (sim=6.774)
  -> merging Sst_pos_patterns_2 + Pvalb_pos_patterns_3 (sim=6.626)
  -> merging Oligo_neg_patterns_4 + Micro_PVM_neg_patterns_3 (sim=6.604)
Iteration 2: performing 2 merges
  -> merging Vip_pos_patterns_0 + L5ET_pos_patterns_1 (sim=7.162)
  -> merging Endo_pos_patterns_6 + VLMC_pos_patterns_9 (sim=6.572)
Iteration 3: no more matches above 6.5
Dropping Astro_neg_patterns_4 (IC=0.178)
Dropping Endo_neg_patterns_2 (IC=0.168)
Done after 3 iterations; 160 patterns remain.
(19, 160)
# Optional: save the matched patterns if you want to re-use them later
with open("modisco_results_ft_2000/all_patterns.pkl", 'wb') as f:
    pickle.dump(all_patterns, f)

Now we can plot a clustermap of cell types/classes and patterns, where the classes are clustered purely on pattern importance with crested.tl.modisco.generate_nucleotide_sequences() and crested.pl.modisco.clustermap()

pat_seqs = crested.tl.modisco.generate_nucleotide_sequences(all_patterns)
crested.pl.modisco.clustermap(
    pattern_matrix,
    list(adata.obs_names),
    width=25,
    height=4.2,
    pat_seqs=pat_seqs,
    grid=True,
    dendrogram_ratio=(0.03, 0.15),
    importance_threshold=5,
)
<seaborn.matrix.ClusterGrid at 0x7f8b0b0160d0>
../_images/a645814d4931521176dfee22fabff35d23d90bc6b19b357b402c1636ad29b1ee.png

If you have the horizontal space for it, you can also add the PWM/contribution logos to the x-axis.

crested.pl.modisco.clustermap_with_pwm_logos(
    pattern_matrix,
    list(adata.obs_names),
    pattern_dict=all_patterns,
    width=50,
    height=4.2,
    grid=True,
    dendrogram_ratio=(0.03, 0.15),
    importance_threshold=5,
    logo_x_multiplier=1,
    logo_height_fraction=0.35,
    logo_y_padding=0.25,
)
<seaborn.matrix.ClusterGrid at 0x7f8a6f464950>
../_images/90ac1f7606c2725e4cdb85a0357685b3b862afe1b7f524489f8cc85ddf6b916d.png

We can also subset to classes we are interested in and want to compare in more detail.

crested.pl.modisco.clustermap_with_pwm_logos(
    pattern_matrix,
    classes=list(adata.obs_names),
    pattern_dict=all_patterns,
    subset=["Astro", "OPC", "Oligo"],
    width=10,
    height=3,
    grid=True,
    logo_height_fraction=0.35,
    logo_y_padding=0.3,
)
<seaborn.matrix.ClusterGrid at 0x7f85d1820110>
../_images/24ef7421bff92be199f96df0792d68dbba3ecb4ea19c41452b18317a2f086f21.png
crested.pl.modisco.clustermap_with_pwm_logos(
    pattern_matrix,
    classes=list(adata.obs_names),
    subset=["L2_3IT", "L5ET", "L5IT", "L5_6NP", "L6CT", "L6IT", "L6b"],
    pattern_dict=all_patterns,
    width=15,
    height=3,
    grid=True,
    logo_height_fraction=0.35,
    logo_y_padding=0.3,
    importance_threshold=4,
)
<seaborn.matrix.ClusterGrid at 0x7f8b7ae52a90>
../_images/370d2aa1a9faaa092bc18749cbf78b1a4ea985809606e1e466e89f0caa2eb5a1.png

Additional pattern insights#

It’s always interesting to investigate specific patterns that show in the clustermap above. Here there are some example functions with which to do that.

Plotting patterns based on their indices can be done with crested.pl.modisco.selected_instances():

pattern_indices = [21]
crested.pl.modisco.selected_instances(
    all_patterns, pattern_indices
)  # The pattern that is shown is the most representative pattern of the cluster with the highest average information content (IC)
../_images/21db98f6e4619f287e6f1ec8cd552605454e71842ef136ce656526f9c52dbc7b.png

We can also do a check of pattern similarity:

idx1 = 3
idx2 = 7
sim = crested.tl.modisco.pattern_similarity(all_patterns, idx1, idx2)
print("Pattern similarity is " + str(sim))
crested.pl.modisco.selected_instances(all_patterns, [idx1, idx2])
Pattern similarity is 0.2277526412936863
../_images/8630551ce86057b8b52894f96e9fbc8f412239cbf08449f7afc8a0d8ac8502a0.png

We can plot all the instances of patterns in the same cluster with crested.pl.modisco.class_instances():

crested.pl.modisco.class_instances(all_patterns, 0)
../_images/5d063ceda60b8e975efd86e8cf17385967cec93e0d160cfdef64454c02cd2819.png

If you want to find out in which pattern cluster a certain pattern is from your modisco results, you can use the crested.tl.modisco.find_pattern() function.

idx = crested.tl.modisco.find_pattern("Micro_PVM_pos_patterns_1", all_patterns)
if idx is not None:
    print("Pattern index is " + str(idx))
    crested.pl.modisco.class_instances(all_patterns, idx, class_representative=True)
Pattern index is 108
../_images/6b0d431a3fda84e4f2e73bcd1a967f202771ecb7ee1bb8653530202c288a4fca.png

Finally, we can also plot the similarity between all patterns with crested.tl.modisco.calculate_similarity_matrix() and crested.pl.modisco.similarity_heatmap():

sim_matrix, indices = crested.tl.modisco.calculate_similarity_matrix(all_patterns)
crested.pl.modisco.similarity_heatmap(sim_matrix, indices, fig_size=(42, 17))
2026-03-26T14:58:58.453153+0100 WARNING `fig_size` is deprecated since version 1.7.0; please use arguments `width=42, height=17` instead.
../_images/3e317a3356dfe12a29fe7f75320c0923bcf1bed8509eb62cc0640c4b08b7abd6.png

Matching patterns to TF candidates from scRNA-seq data [Optional]#

To understand the actual transcription factor (TF) candidates binding to the characteristic patterns/potential binding sites per cell type, we can propose potential candidates through scRNA-seq data and a TF-motif collection file.

This analysis requires that you ran tfmodisco-lite with the report function such that each pattern has potential MEME database hits and that you have multiome data. The names in the motif database should match those in the TF-motif collection file.

meme_db, motif_to_tf_file = crested.get_motif_db()

If you haven’t run this yet and using crested.tl.modisco.tfmodisco did not work due to lack of access to TOMTOM, versions of modiscolite above v2.4.0 also support using memelite with argument ttl=True, meaning you can generate the reports this way:

Hide code cell content

import modiscolite as modisco
output_dir = "modisco_results_ft_2000"
for class_name in adata.obs_names:
    print(class_name)
    output_file = os.path.join(output_dir, f"{class_name}_modisco_results.h5")
    report_dir = os.path.join(output_dir, f"{class_name}_report")
    modisco.report.report_motifs(
        output_file,
        report_dir,
        meme_motif_db=meme_db,
        top_n_matches=3,
        is_writing_tomtom_matrix=False,
        img_path_suffix="./",
        ttl=True
    )
Astro
Endo
L2_3IT
L5ET
L5IT
L5_6NP
L6CT
L6IT
L6b
Lamp5
Micro_PVM
OPC
Oligo
Pvalb
Sncg
Sst
SstChodl
VLMC
Vip
# If you don't have the patterns loaded, load here
with open('modisco_results_ft_2000/all_patterns.pkl', 'rb') as f:
    all_patterns = pickle.load(f)

Load scRNA-seq data#

Load scRNA seq data and calculate mean expression per cell type using crested.tl.modisco.calculate_mean_expression_per_cell_type().

file_path = os.path.join(data_dir,"mouse_cortex_scrna.h5ad")  # Locate h5 file containing scRNAseq data
cell_type_column = "subclass_Bakken_2022"
mean_expression_df = crested.tl.modisco.calculate_mean_expression_per_cell_type(
    file_path, cell_type_column, cpm_normalize=True
)

Please make sure that the classes in the RNA file match those used in CREsted, and rename mean_expression_df’s index if not:

# Rename classes that don't match exactly (due to / being cleaned up to _, etc)
# Here they're in the same (alphabetical) order anyway so we can just zip them
class_mapping = dict(zip(mean_expression_df.index, adata.obs_names, strict=True))
print(class_mapping)

mean_expression_df.index = mean_expression_df.index.map(class_mapping)
{'Astro': 'Astro', 'Endo': 'Endo', 'L2/3 IT': 'L2_3IT', 'L5 ET': 'L5ET', 'L5 IT': 'L5IT', 'L5/6 NP': 'L5_6NP', 'L6 CT': 'L6CT', 'L6 IT': 'L6IT', 'L6b': 'L6b', 'Lamp5': 'Lamp5', 'Micro-PVM': 'Micro_PVM', 'OPC': 'OPC', 'Oligo': 'Oligo', 'Pvalb': 'Pvalb', 'Sncg': 'Sncg', 'Sst': 'Sst', 'Sst Chodl': 'SstChodl', 'VLMC': 'VLMC', 'Vip': 'Vip'}
crested.pl.modisco.tf_expression_per_cell_type(mean_expression_df, ["Nfia", "Spi1", "Mef2c"])
../_images/6787e9939bdd4b345356440743b1464a4500a7b7070007064357cbd7f26d6be2.png

Generating pattern to database motif dictionary#

classes = list(adata.obs_names)
contribution_dir = "modisco_results_ft_2000"
html_paths = crested.tl.modisco.generate_html_paths(all_patterns, classes, contribution_dir)

# p_val threshold to only select significant matches
pattern_match_dict = crested.tl.modisco.find_pattern_matches(
    all_patterns, html_paths, p_val_thr=0.05
)

Loading TF-motif database#

motif_to_tf_df = crested.tl.modisco.read_motif_to_tf_file(motif_to_tf_file)
motif_to_tf_df
logo Motif_name Cluster Human_Direct_annot Human_Orthology_annot Mouse_Direct_annot Mouse_Orthology_annot Fly_Direct_annot Fly_Orthology_annot Cluster_Human_Direct_annot Cluster_Human_Orthology_annot Cluster_Mouse_Direct_annot Cluster_Mouse_Orthology_annot Cluster_Fly_Direct_annot Cluster_Fly_Orthology_annot
0 <img src="https://motifcollections.aertslab.org/v10/logos/bergman__Adf1.png" height="52" alt="bergman__Adf1"></img> bergman__Adf1 NaN NaN NaN NaN NaN Adf1 NaN NaN NaN NaN NaN NaN NaN
1 <img src="https://motifcollections.aertslab.org/v10/logos/bergman__Aef1.png" height="52" alt="bergman__Aef1"></img> bergman__Aef1 NaN NaN NaN NaN NaN Aef1 NaN NaN NaN NaN NaN NaN NaN
2 <img src="https://motifcollections.aertslab.org/v10/logos/bergman__ap.png" height="52" alt="bergman__ap"></img> bergman__ap NaN NaN NaN NaN NaN ap NaN NaN NaN NaN NaN NaN NaN
3 <img src="https://motifcollections.aertslab.org/v10/logos/elemento__ACCTTCA.png" height="52" alt="elemento__ACCTTCA"></img> elemento__ACCTTCA NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 <img src="https://motifcollections.aertslab.org/v10/logos/bergman__bcd.png" height="52" alt="bergman__bcd"></img> bergman__bcd NaN NaN NaN NaN NaN bcd NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17990 <img src="https://motifcollections.aertslab.org/v10/logos/elemento__CAAGGAG.png" height="52" alt="elemento__CAAGGAG"></img> elemento__CAAGGAG 98.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17991 <img src="https://motifcollections.aertslab.org/v10/logos/elemento__TCCTTGC.png" height="52" alt="elemento__TCCTTGC"></img> elemento__TCCTTGC 98.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17992 <img src="https://motifcollections.aertslab.org/v10/logos/swissregulon__hs__ZNF274.png" height="52" alt="swissregulon__hs__ZNF274"></img> swissregulon__hs__ZNF274 99.1 ZNF274 NaN NaN Zfp369, Zfp110 NaN NaN ZNF274 NaN NaN Zfp369, Zfp110 NaN NaN
17993 <img src="https://motifcollections.aertslab.org/v10/logos/swissregulon__sacCer__THI2.png" height="52" alt="swissregulon__sacCer__THI2"></img> swissregulon__sacCer__THI2 99.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17994 <img src="https://motifcollections.aertslab.org/v10/logos/jaspar__MA0407.1.png" height="52" alt="jaspar__MA0407.1"></img> jaspar__MA0407.1 99.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

17995 rows × 15 columns

Matching patterns to TF candidates#

We calculate a pattern-tf by cell type matrix which contains the imporatance of each pattern linked to a TF per cell type using crested.tl.modisco.create_pattern_tf_dict() and crested.tl.modisco.create_tf_ct_matrix()

cols = [
    "Mouse_Direct_annot",
    "Mouse_Orthology_annot",
    "Cluster_Mouse_Direct_annot",
    "Cluster_Mouse_Orthology_annot",
]
pattern_tf_dict, all_tfs = crested.tl.modisco.create_pattern_tf_dict(
    pattern_match_dict, motif_to_tf_df, all_patterns, cols
)
tf_ct_matrix, tf_pattern_annots = crested.tl.modisco.create_tf_ct_matrix(
    pattern_tf_dict,
    all_patterns,
    mean_expression_df,
    classes,
    log_transform=True,
    normalize_pattern_importances=False,
    normalize_gex=True,
    min_tf_gex=0.95,
    importance_threshold=5,
    pattern_parameter="seqlet_count_log",
    filter_correlation=True,
    verbose=True,
    zscore_threshold=1.5,
    correlation_threshold=0.45,
)
Total columns before threshold filtering: 1153
Total columns after threshold filtering: 166
Total columns removed: 987
Total columns before correlation filtering: 166
Total columns after correlation filtering: 97
Total columns removed: 69

Finally, we can plot a clustermap of potential pattern-TF matches and their importance per cell type with crested.pl.modisco.clustermap_tf_motif()

crested.pl.modisco.clustermap_tf_motif(
    tf_ct_matrix,
    heatmap_dim="contrib",
    dot_dim="gex",
    class_labels=classes,
    pattern_labels=tf_pattern_annots,
    width=35,
    height=6,
    cluster_rows=True,
    cluster_columns=False,
    xtick_rotation=90,
)
../_images/754832df2a20dcc6f79e643d9ec89e50cc928ce75c2c763e93a20072feda0dce.png
crested.pl.modisco.clustermap_tf_motif(
    tf_ct_matrix,
    heatmap_dim="contrib",
    dot_dim="gex",
    class_labels=classes,
    subset_classes=["Lamp5", "Sncg", "Vip", "Pvalb", "Sst", "SstChodl"],
    pattern_labels=tf_pattern_annots,
    width=15,
    height=6,
    cluster_rows=False,
    cluster_columns=False,
    xtick_rotation=90,
)
../_images/345051a72e2d19a8249e2bc93976b7f3407eb65669803ae0df96760cf3fc0138.png