Enhancer code analysis#
In this notebook we will go over how to obtain cell type characteristic sequence patterns from CREsted models, or any other model, using tfmodisco-lite.
Obtaining contribution scores per model class and running tfmodisco-lite#
Before we can do any analysis, we need to calculate the contribution scores for cell type-specific regions. From those, we can run tfmodisco-lite.
Load data and CREsted model#
import anndata
import crested
adata = anndata.read_h5ad("mouse_cortex_filtered.h5ad")
genome_file = "../../../../mouse/biccn/mm10.fa"
genome = crested.Genome(genome_file)
datamodule = crested.tl.data.AnnDataModule(
adata,
genome,
)
# load an existing model
evaluator = crested.tl.Crested(data=datamodule)
evaluator.load_model(
"mouse_biccn/finetuned_model/checkpoints/02.keras",
compile=True,
)
# add predictions for model checkpoint to the adata
evaluator.predict(
adata, model_name="biccn_model"
) # adds the predictions to the adata.layers["biccn_model"]
4/358 ━━━━━━━━━━━━━━━━━━━━ 18s 53ms/step
357/358 ━━━━━━━━━━━━━━━━━━━━ 0s 54ms/step
358/358 ━━━━━━━━━━━━━━━━━━━━ 32s 65ms/step
2024-12-11T17:54:09.941173+0100 INFO Adding predictions to anndata.layers[biccn_model].
Select the most informative regions per cell type#
To obtain cell type-characteristic patterns, we need to calculate contribution scores on highly specific regions. For this purpose, we’ve included a preprocessing function crested.pp.sort_and_filter_regions_on_specificity()
to keep the top k most specific regions per cell type that you can use to filter your data before running modisco.
There are three options to select the top regions: either purely based on peak height, purely based on predictions, or on their combination. Here we show how to use the combination of both (which we recommend).
adata_combined = adata.copy() # Copy the peak heights
adata_combined.X = (
adata_combined.X + adata_combined.layers["biccn_model"]
) / 2 # Take the average with the predictions
# most informative regions per class
adata_filtered = adata_combined.copy()
top_k = 1000
crested.pp.sort_and_filter_regions_on_specificity(
adata_filtered, top_k=top_k, method="proportion"
)
adata_filtered
2024-12-11T17:54:09.997726+0100 INFO After sorting and filtering, kept 19000 regions.
AnnData object with n_obs × n_vars = 19 × 19000
obs: 'file_path'
var: 'chr', 'start', 'end', 'split', 'Class name', 'rank', 'proportion_score'
obsm: 'weights'
layers: 'biccn_model'
adata_filtered.var
chr | start | end | split | Class name | rank | proportion_score | |
---|---|---|---|---|---|---|---|
region | |||||||
chr10:45499432-45501546 | chr10 | 45499432 | 45501546 | val | Astro | 1 | 0.716850 |
chrX:23135863-23137977 | chrX | 23135863 | 23137977 | train | Astro | 2 | 0.700334 |
chr14:14264851-14266965 | chr14 | 14264851 | 14266965 | train | Astro | 3 | 0.699505 |
chr1:127159287-127161401 | chr1 | 127159287 | 127161401 | train | Astro | 4 | 0.694734 |
chr15:18236980-18239094 | chr15 | 18236980 | 18239094 | train | Astro | 5 | 0.694518 |
... | ... | ... | ... | ... | ... | ... | ... |
chr15:72298006-72300120 | chr15 | 72298006 | 72300120 | train | Vip | 996 | 0.301501 |
chr7:37869994-37872108 | chr7 | 37869994 | 37872108 | train | Vip | 997 | 0.301440 |
chr17:14671541-14673655 | chr17 | 14671541 | 14673655 | train | Vip | 998 | 0.301406 |
chr2:3837593-3839707 | chr2 | 3837593 | 3839707 | train | Vip | 999 | 0.301379 |
chr8:92796715-92798829 | chr8 | 92796715 | 92798829 | val | Vip | 1000 | 0.301316 |
19000 rows × 7 columns
Calculating contribution scores per class#
Now you can calculate the contribution scores for all the regions in your filtered anndata.
By default, the contribution scores are calculated using the expected integrated gradients method, but you can change this to simple integrated gradients to speed up the calculation (this might result in less accurate scores).
# calculate contribution scores for all regions and save them to output_dir
evaluator.tfmodisco_calculate_and_save_contribution_scores(
adata=adata_filtered, output_dir="modisco_results_ft"
)
2024-12-10T11:34:47.223018+0100 INFO Found 'Class name' column in adata.var. Using specific regions per class to calculate contribution scores.
2024-12-10T11:34:47.253415+0100 INFO Calculating contribution scores for 1 class(es) and 1000 region(s).
2024-12-10T11:55:51.445174+0100 INFO Calculating contribution scores for 1 class(es) and 1000 region(s).
2024-12-10T12:16:48.107676+0100 INFO Calculating contribution scores for 1 class(es) and 1000 region(s).
2024-12-10T12:37:39.990974+0100 INFO Calculating contribution scores for 1 class(es) and 1000 region(s).
Running tfmodisco-lite#
When this is done, you can run TFModisco Lite on the saved contribution scores to find motifs that are important for the classification/regression task.
You could use the tfmodisco package directly to do this, or you could use the crested.tl.modisco.tfmodisco()
function which is essentially a wrapper around the tfmodisco package.
Note that from here on, you can use contribution scores from any model trained in any framework, as this analysis just requires a set of one hot encoded sequences and contribution scores per cell type stored in the same directory.
# run tfmodisco on the contribution scores
crested.tl.modisco.tfmodisco(
window=1000,
output_dir="modisco_results_ft",
contrib_dir="modisco_results_ft",
report=True, # Optional, will match patterns to motif MEME database
meme_db="/home/VIB.LOCAL/niklas.kempynck/nkemp/tools/motifs.meme", # File to MEME database
max_seqlets=20000,
)
Analysis of cell-type specific sequence patterns#
Once you have obtained your modisco results, you can plot the the found patterns using the crested.pl.patterns.modisco_results()
function.
%matplotlib inline
top_k = 1000
crested.pl.patterns.modisco_results(
classes=["Astro", "L5ET", "Vip", "Oligo"],
contribution="positive",
contribution_dir="modisco_results_ft",
num_seq=top_k,
y_max=0.15,
viz="pwm",
) # You can also visualize in 'pwm' format
2024-12-11T17:54:10.040977+0100 INFO Starting genomic contributions plot for classes: ['Astro', 'L5ET', 'Vip', 'Oligo']
Matching patterns across cell types#
Since we have calculated per cell type the patterns independently of each other, we do not know quantitavely how and if they overlap. It can be interesting to get an overview of which patterns are found across multiple cell types, how important they are, and if there are unique patterns only found in a small selection of classes. Therefore, we have made a pattern clustering algorithm, which starts from the results of tfmodisco-lite, and return a pattern matrix, which contains the importance of the clustered patterns per cell type, and a pattern dictionary, describing all clustered patterns.
First, we’ll obtain the modisco files per class by using crested.tl.modisco.match_h5_files_to_classes()
using our selected classes.
Then, we’ll cluster these patterns using crested.tl.modisco.process_patterns()
and create a pattern matrix with crested.tl.modisco.create_pattern_matrix()
# First we obtain the resulting modisco files per class
matched_files = crested.tl.modisco.match_h5_files_to_classes(
contribution_dir="modisco_results_ft", classes=list(adata.obs_names)
)
# Then we cluster matching patterns, and define a pattern matrix [#classes, #patterns] describing their importance
all_patterns = crested.tl.modisco.process_patterns(
matched_files,
sim_threshold=3.5, # The similarity threshold used for matching patterns. We take the -log10(pval), pval obtained through TOMTOM matching from tangermeme
trim_ic_threshold=0.1, # Information content (IC) threshold on which to trim patterns
discard_ic_threshold=0.2, # IC threshold used for discarding single instance patterns
verbose=True, # Useful for doing sanity checks on matching patterns
)
pattern_matrix = crested.tl.modisco.create_pattern_matrix(
classes=list(adata.obs_names),
all_patterns=all_patterns,
normalize=False,
pattern_parameter="seqlet_count_log",
)
pattern_matrix.shape
Reading file modisco_results_ft/Astro_modisco_results.h5
Match between Astro_neg_patterns_2 and Astro_neg_patterns_0 with similarity score 4.66
Match between Astro_pos_patterns_9 and Astro_pos_patterns_7 with similarity score 4.81
Match between Astro_pos_patterns_10 and Astro_pos_patterns_9 with similarity score 7.00
Reading file modisco_results_ft/Endo_modisco_results.h5
Match between Endo_pos_patterns_1 and Astro_pos_patterns_0 with similarity score 4.79
Match between Endo_pos_patterns_6 and Astro_neg_patterns_1 with similarity score 5.17
Match between Endo_pos_patterns_7 and Endo_pos_patterns_6 with similarity score 4.11
Match between Endo_pos_patterns_9 and Astro_pos_patterns_4 with similarity score 3.90
Match between Endo_pos_patterns_15 and Endo_pos_patterns_6 with similarity score 3.84
Reading file modisco_results_ft/L2_3IT_modisco_results.h5
Match between L2_3IT_neg_patterns_2 and Astro_neg_patterns_0 with similarity score 4.23
Match between L2_3IT_pos_patterns_0 and Astro_neg_patterns_3 with similarity score 6.12
Match between L2_3IT_pos_patterns_1 and Endo_pos_patterns_13 with similarity score 5.94
Match between L2_3IT_pos_patterns_6 and Astro_pos_patterns_6 with similarity score 8.78
Match between L2_3IT_pos_patterns_8 and Astro_pos_patterns_0 with similarity score 4.34
Match between L2_3IT_pos_patterns_12 and Endo_pos_patterns_13 with similarity score 3.99
Match between L2_3IT_pos_patterns_13 and Endo_pos_patterns_13 with similarity score 3.76
Reading file modisco_results_ft/L5ET_modisco_results.h5
Match between L5ET_pos_patterns_0 and Astro_pos_patterns_0 with similarity score 5.09
Match between L5ET_pos_patterns_1 and L2_3IT_pos_patterns_10 with similarity score 4.98
Match between L5ET_pos_patterns_2 and L2_3IT_pos_patterns_7 with similarity score 4.20
Match between L5ET_pos_patterns_3 and L2_3IT_pos_patterns_3 with similarity score 7.01
Match between L5ET_pos_patterns_4 and L2_3IT_pos_patterns_2 with similarity score 4.92
Match between L5ET_pos_patterns_5 and L2_3IT_pos_patterns_5 with similarity score 3.93
Match between L5ET_pos_patterns_6 and Endo_pos_patterns_13 with similarity score 4.04
Match between L5ET_pos_patterns_8 and L2_3IT_pos_patterns_4 with similarity score 4.11
Match between L5ET_pos_patterns_10 and L2_3IT_pos_patterns_0 with similarity score 8.87
Match between L5ET_pos_patterns_11 and L5ET_pos_patterns_7 with similarity score 5.32
Reading file modisco_results_ft/L5IT_modisco_results.h5
Match between L5IT_neg_patterns_0 and Endo_pos_patterns_11 with similarity score 4.39
Match between L5IT_neg_patterns_1 and Astro_neg_patterns_0 with similarity score 5.82
Match between L5IT_neg_patterns_7 and L2_3IT_neg_patterns_4 with similarity score 3.62
Match between L5IT_neg_patterns_8 and L2_3IT_neg_patterns_3 with similarity score 6.02
Match between L5IT_neg_patterns_11 and L2_3IT_neg_patterns_6 with similarity score 6.79
Match between L5IT_pos_patterns_0 and L5ET_pos_patterns_5 with similarity score 7.61
Match between L5IT_pos_patterns_1 and L5ET_pos_patterns_7 with similarity score 6.10
Match between L5IT_pos_patterns_2 and L2_3IT_pos_patterns_0 with similarity score 6.28
Match between L5IT_pos_patterns_3 and Astro_pos_patterns_2 with similarity score 9.13
Match between L5IT_pos_patterns_4 and Endo_pos_patterns_2 with similarity score 4.31
Match between L5IT_pos_patterns_7 and L5ET_pos_patterns_3 with similarity score 4.98
Match between L5IT_pos_patterns_8 and L5IT_pos_patterns_6 with similarity score 4.93
Match between L5IT_pos_patterns_11 and L5ET_pos_patterns_4 with similarity score 4.31
Reading file modisco_results_ft/L5_6NP_modisco_results.h5
Match between L5_6NP_pos_patterns_0 and L5IT_pos_patterns_0 with similarity score 7.01
Match between L5_6NP_pos_patterns_1 and L5ET_pos_patterns_4 with similarity score 6.91
Match between L5_6NP_pos_patterns_2 and Astro_pos_patterns_0 with similarity score 4.76
Match between L5_6NP_pos_patterns_3 and Endo_pos_patterns_6 with similarity score 3.73
Match between L5_6NP_pos_patterns_4 and L2_3IT_pos_patterns_4 with similarity score 6.13
Reading file modisco_results_ft/L6CT_modisco_results.h5
Match between L6CT_pos_patterns_0 and Astro_pos_patterns_0 with similarity score 5.57
Match between L6CT_pos_patterns_1 and L2_3IT_pos_patterns_0 with similarity score 6.37
Match between L6CT_pos_patterns_2 and L5IT_pos_patterns_0 with similarity score 5.34
Match between L6CT_pos_patterns_3 and L5IT_pos_patterns_6 with similarity score 9.74
Match between L6CT_pos_patterns_4 and Astro_pos_patterns_0 with similarity score 3.98
Match between L6CT_pos_patterns_5 and L5IT_pos_patterns_10 with similarity score 6.40
Match between L6CT_pos_patterns_7 and L5IT_pos_patterns_9 with similarity score 6.36
Reading file modisco_results_ft/L6IT_modisco_results.h5
Match between L6IT_neg_patterns_1 and Astro_pos_patterns_1 with similarity score 4.26
Match between L6IT_neg_patterns_6 and L5IT_neg_patterns_5 with similarity score 3.65
Match between L6IT_neg_patterns_9 and L2_3IT_neg_patterns_3 with similarity score 5.01
Match between L6IT_pos_patterns_0 and L2_3IT_pos_patterns_0 with similarity score 5.35
Match between L6IT_pos_patterns_1 and Astro_pos_patterns_0 with similarity score 5.03
Match between L6IT_pos_patterns_2 and L5IT_pos_patterns_0 with similarity score 7.11
Match between L6IT_pos_patterns_3 and L5IT_pos_patterns_7 with similarity score 4.60
Match between L6IT_pos_patterns_4 and Astro_pos_patterns_0 with similarity score 3.93
Match between L6IT_pos_patterns_5 and L5ET_pos_patterns_7 with similarity score 6.22
Match between L6IT_pos_patterns_6 and L5ET_pos_patterns_6 with similarity score 5.63
Match between L6IT_pos_patterns_8 and L5ET_pos_patterns_4 with similarity score 9.36
Match between L6IT_pos_patterns_9 and Astro_pos_patterns_6 with similarity score 9.40
Match between L6IT_pos_patterns_10 and L5ET_pos_patterns_6 with similarity score 4.61
Match between L6IT_pos_patterns_11 and L2_3IT_pos_patterns_4 with similarity score 6.86
Match between L6IT_pos_patterns_12 and L6CT_pos_patterns_7 with similarity score 3.87
Reading file modisco_results_ft/L6b_modisco_results.h5
Match between L6b_pos_patterns_0 and Astro_pos_patterns_0 with similarity score 5.57
Match between L6b_pos_patterns_1 and L6CT_pos_patterns_3 with similarity score 8.43
Match between L6b_pos_patterns_2 and L6IT_pos_patterns_5 with similarity score 5.60
Match between L6b_pos_patterns_3 and L5ET_pos_patterns_4 with similarity score 6.43
Match between L6b_pos_patterns_6 and Astro_pos_patterns_6 with similarity score 8.15
Match between L6b_pos_patterns_7 and L6CT_pos_patterns_7 with similarity score 9.20
Match between L6b_pos_patterns_8 and L6b_pos_patterns_7 with similarity score 9.90
Match between L6b_pos_patterns_9 and L6b_pos_patterns_7 with similarity score 9.67
Reading file modisco_results_ft/Lamp5_modisco_results.h5
Match between Lamp5_pos_patterns_0 and L5IT_pos_patterns_4 with similarity score 4.67
Match between Lamp5_pos_patterns_1 and L6b_pos_patterns_2 with similarity score 7.90
Match between Lamp5_pos_patterns_2 and Astro_pos_patterns_1 with similarity score 4.07
Match between Lamp5_pos_patterns_3 and L6b_pos_patterns_3 with similarity score 3.70
Match between Lamp5_pos_patterns_6 and L5IT_pos_patterns_5 with similarity score 3.65
Reading file modisco_results_ft/Micro_PVM_modisco_results.h5
Match between Micro_PVM_neg_patterns_0 and L6IT_neg_patterns_8 with similarity score 4.31
Match between Micro_PVM_neg_patterns_1 and L6IT_pos_patterns_0 with similarity score 5.41
Match between Micro_PVM_neg_patterns_3 and L5IT_pos_patterns_0 with similarity score 4.46
Match between Micro_PVM_pos_patterns_0 and Endo_pos_patterns_3 with similarity score 4.69
Match between Micro_PVM_pos_patterns_5 and Endo_pos_patterns_11 with similarity score 3.70
Match between Micro_PVM_pos_patterns_6 and Lamp5_pos_patterns_7 with similarity score 4.93
Match between Micro_PVM_pos_patterns_8 and L6b_pos_patterns_2 with similarity score 7.78
Match between Micro_PVM_pos_patterns_11 and Lamp5_pos_patterns_3 with similarity score 5.52
Match between Micro_PVM_pos_patterns_12 and Lamp5_pos_patterns_4 with similarity score 7.83
Reading file modisco_results_ft/OPC_modisco_results.h5
Match between OPC_neg_patterns_0 and Astro_neg_patterns_0 with similarity score 3.75
Match between OPC_pos_patterns_0 and Astro_pos_patterns_3 with similarity score 4.85
Match between OPC_pos_patterns_1 and Astro_pos_patterns_9 with similarity score 4.89
Match between OPC_pos_patterns_2 and Lamp5_pos_patterns_0 with similarity score 4.77
Match between OPC_pos_patterns_3 and Astro_pos_patterns_5 with similarity score 4.75
Match between OPC_pos_patterns_4 and Endo_pos_patterns_6 with similarity score 4.78
Match between OPC_pos_patterns_5 and Astro_pos_patterns_9 with similarity score 8.21
Match between OPC_pos_patterns_9 and L5ET_pos_patterns_2 with similarity score 3.81
Match between OPC_pos_patterns_10 and Astro_pos_patterns_4 with similarity score 4.11
Reading file modisco_results_ft/Oligo_modisco_results.h5
Match between Oligo_neg_patterns_0 and OPC_neg_patterns_0 with similarity score 4.10
Match between Oligo_neg_patterns_1 and L5IT_pos_patterns_0 with similarity score 3.86
Match between Oligo_pos_patterns_0 and Astro_pos_patterns_5 with similarity score 4.24
Match between Oligo_pos_patterns_1 and Astro_pos_patterns_0 with similarity score 5.57
Match between Oligo_pos_patterns_2 and OPC_pos_patterns_4 with similarity score 5.63
Match between Oligo_pos_patterns_3 and Micro_PVM_pos_patterns_10 with similarity score 4.62
Match between Oligo_pos_patterns_5 and Astro_pos_patterns_3 with similarity score 5.45
Match between Oligo_pos_patterns_6 and OPC_pos_patterns_7 with similarity score 10.38
Match between Oligo_pos_patterns_7 and L6IT_neg_patterns_2 with similarity score 4.83
Match between Oligo_pos_patterns_8 and OPC_pos_patterns_5 with similarity score 3.60
Match between Oligo_pos_patterns_9 and Oligo_pos_patterns_0 with similarity score 6.72
Match between Oligo_pos_patterns_10 and L6IT_pos_patterns_3 with similarity score 5.16
Match between Oligo_pos_patterns_12 and Micro_PVM_pos_patterns_3 with similarity score 5.57
Match between Oligo_pos_patterns_14 and Oligo_pos_patterns_3 with similarity score 4.39
Match between Oligo_pos_patterns_15 and OPC_pos_patterns_6 with similarity score 3.98
Reading file modisco_results_ft/Pvalb_modisco_results.h5
Match between Pvalb_pos_patterns_0 and L6b_pos_patterns_2 with similarity score 5.69
Match between Pvalb_pos_patterns_1 and L5IT_neg_patterns_10 with similarity score 3.88
Match between Pvalb_pos_patterns_2 and OPC_pos_patterns_4 with similarity score 5.86
Match between Pvalb_pos_patterns_3 and Micro_PVM_pos_patterns_2 with similarity score 4.17
Match between Pvalb_pos_patterns_4 and L5IT_pos_patterns_3 with similarity score 6.61
Match between Pvalb_pos_patterns_5 and Lamp5_pos_patterns_4 with similarity score 9.60
Match between Pvalb_pos_patterns_6 and Pvalb_pos_patterns_3 with similarity score 4.17
Match between Pvalb_pos_patterns_7 and L6IT_neg_patterns_0 with similarity score 3.69
Match between Pvalb_pos_patterns_8 and L6IT_pos_patterns_3 with similarity score 5.65
Match between Pvalb_pos_patterns_12 and Oligo_pos_patterns_13 with similarity score 3.97
Reading file modisco_results_ft/Sncg_modisco_results.h5
Match between Sncg_pos_patterns_1 and Astro_pos_patterns_8 with similarity score 10.18
Match between Sncg_pos_patterns_2 and Astro_pos_patterns_0 with similarity score 5.57
Match between Sncg_pos_patterns_3 and L5_6NP_pos_patterns_5 with similarity score 4.59
Match between Sncg_pos_patterns_4 and L2_3IT_pos_patterns_11 with similarity score 4.68
Match between Sncg_pos_patterns_8 and Pvalb_pos_patterns_3 with similarity score 5.32
Reading file modisco_results_ft/Sst_modisco_results.h5
Match between Sst_pos_patterns_0 and OPC_pos_patterns_4 with similarity score 6.64
Match between Sst_pos_patterns_1 and Pvalb_pos_patterns_1 with similarity score 5.84
Match between Sst_pos_patterns_2 and Pvalb_pos_patterns_3 with similarity score 5.54
Match between Sst_pos_patterns_3 and Sncg_pos_patterns_1 with similarity score 6.44
Match between Sst_pos_patterns_5 and L5ET_pos_patterns_2 with similarity score 4.50
Match between Sst_pos_patterns_6 and L5ET_pos_patterns_1 with similarity score 4.57
Match between Sst_pos_patterns_7 and L6IT_pos_patterns_3 with similarity score 5.04
Match between Sst_pos_patterns_8 and Sst_pos_patterns_1 with similarity score 3.53
Match between Sst_pos_patterns_9 and L2_3IT_pos_patterns_4 with similarity score 6.13
Reading file modisco_results_ft/SstChodl_modisco_results.h5
Match between SstChodl_pos_patterns_2 and Oligo_pos_patterns_5 with similarity score 4.80
Match between SstChodl_pos_patterns_3 and L6IT_pos_patterns_3 with similarity score 5.76
Match between SstChodl_pos_patterns_4 and Astro_pos_patterns_4 with similarity score 5.04
Match between SstChodl_pos_patterns_6 and Endo_pos_patterns_8 with similarity score 5.57
Reading file modisco_results_ft/VLMC_modisco_results.h5
Match between VLMC_pos_patterns_0 and L5IT_neg_patterns_5 with similarity score 5.30
Match between VLMC_pos_patterns_1 and Astro_pos_patterns_0 with similarity score 4.00
Match between VLMC_pos_patterns_2 and Endo_pos_patterns_10 with similarity score 6.47
Match between VLMC_pos_patterns_3 and Astro_pos_patterns_0 with similarity score 4.59
Match between VLMC_pos_patterns_4 and Endo_pos_patterns_5 with similarity score 3.74
Match between VLMC_pos_patterns_5 and Endo_pos_patterns_11 with similarity score 6.22
Match between VLMC_pos_patterns_6 and OPC_pos_patterns_4 with similarity score 4.75
Match between VLMC_pos_patterns_7 and OPC_pos_patterns_4 with similarity score 4.75
Match between VLMC_pos_patterns_10 and Endo_pos_patterns_12 with similarity score 6.96
Match between VLMC_pos_patterns_13 and Lamp5_pos_patterns_6 with similarity score 5.48
Match between VLMC_pos_patterns_15 and Endo_pos_patterns_4 with similarity score 3.56
Reading file modisco_results_ft/Vip_modisco_results.h5
Match between Vip_pos_patterns_0 and SstChodl_pos_patterns_0 with similarity score 4.37
Match between Vip_pos_patterns_1 and OPC_pos_patterns_2 with similarity score 5.57
Match between Vip_pos_patterns_2 and Lamp5_pos_patterns_2 with similarity score 5.10
Match between Vip_pos_patterns_4 and Sncg_pos_patterns_6 with similarity score 4.97
Match between Vip_pos_patterns_5 and L5ET_pos_patterns_2 with similarity score 5.41
Match between Vip_pos_patterns_6 and Sncg_pos_patterns_4 with similarity score 6.37
Merged patterns Astro_pos_patterns_0 and OPC_pos_patterns_2 with similarity 3.6375287429916194
Merged patterns Astro_pos_patterns_0 and Sncg_pos_patterns_3 with similarity 4.542319134843585
Merged patterns Astro_pos_patterns_0 and Oligo_pos_patterns_7 with similarity 4.681529901975512
Merged patterns Pvalb_pos_patterns_4 and Endo_pos_patterns_11 with similarity 4.543095877546642
Merged patterns Pvalb_pos_patterns_4 and L2_3IT_neg_patterns_1 with similarity 3.680944583152613
Merged patterns Pvalb_pos_patterns_4 and Pvalb_pos_patterns_7 with similarity 3.5692057277191847
Merged patterns Oligo_pos_patterns_0 and Vip_pos_patterns_4 with similarity 4.408425660522137
Merged patterns Sncg_pos_patterns_1 and OPC_pos_patterns_6 with similarity 3.7185542503697016
Merged patterns Endo_pos_patterns_0 and VLMC_pos_patterns_0 with similarity 4.767029162591732
Merged patterns Endo_pos_patterns_3 and Endo_pos_patterns_14 with similarity 4.621785987487528
Merged patterns Endo_pos_patterns_4 and VLMC_pos_patterns_11 with similarity 3.716038190708335
Merged patterns VLMC_pos_patterns_4 and VLMC_pos_patterns_12 with similarity 3.5014047659787613
Merged patterns Endo_pos_patterns_8 and Oligo_pos_patterns_6 with similarity 3.9647388167974538
Merged patterns VLMC_pos_patterns_2 and L6b_pos_patterns_2 with similarity 3.5065776441727086
Merged patterns L5ET_pos_patterns_6 and Lamp5_pos_patterns_3 with similarity 3.5550881591309627
Merged patterns L5ET_pos_patterns_6 and Sncg_pos_patterns_4 with similarity 4.1067676346805015
Merged patterns L5IT_pos_patterns_0 and L5ET_pos_patterns_13 with similarity 4.255505242360988
Merged patterns L6b_pos_patterns_1 and L6b_pos_patterns_7 with similarity 7.991289768923955
Merged patterns L6IT_pos_patterns_7 and L6b_pos_patterns_4 with similarity 3.7477654160953104
Merged patterns Oligo_pos_patterns_12 and Pvalb_pos_patterns_11 with similarity 3.586900935525632
Merged patterns Oligo_pos_patterns_3 and Pvalb_pos_patterns_12 with similarity 4.806101801919134
Merged patterns Pvalb_pos_patterns_9 and SstChodl_pos_patterns_5 with similarity 3.8977786373112933
Iteration 1: Merging complete, checking again
Discarded 2 patterns below IC threshold 0.2 and with a single class instance:
['L6IT_neg_patterns_4', 'L6IT_neg_patterns_7']
Total iterations: 1
(19, 84)
Now we can plot a clustermap of cell types/classes and patterns, where the classes are clustered
purely on pattern importance with crested.tl.modisco.generate_nucleotide_sequences()
and crested.pl.patterns.clustermap()
import matplotlib
%matplotlib inline
matplotlib.rcParams["pdf.fonttype"] = 42
matplotlib.rcParams["ps.fonttype"] = 42
pat_seqs = crested.tl.modisco.generate_nucleotide_sequences(all_patterns)
crested.pl.patterns.clustermap(
pattern_matrix,
list(adata.obs_names),
figsize=(16, 4.2),
pat_seqs=pat_seqs,
grid=True,
dendrogram_ratio=(0.03, 0.15),
importance_threshold=4.5,
)
We can also subset to classes we are interested in and want to compare in more detail.
%matplotlib inline
pat_seqs = crested.tl.modisco.generate_nucleotide_sequences(all_patterns)
crested.pl.patterns.clustermap(
pattern_matrix,
classes=list(adata.obs_names),
subset=["Astro", "OPC", "Oligo"],
figsize=(10, 2),
pat_seqs=pat_seqs,
grid=True,
dy=0.0025,
)
%matplotlib inline
pat_seqs = crested.tl.modisco.generate_nucleotide_sequences(all_patterns)
crested.pl.patterns.clustermap(
pattern_matrix,
classes=list(adata.obs_names),
subset=["L2_3IT", "L5ET", "L5IT", "L5_6NP", "L6CT", "L6IT", "L6b"],
figsize=(10, 2),
pat_seqs=pat_seqs,
grid=True,
dy=0.0025,
importance_threshold=4.5,
)
Additional pattern insights#
It is always interesting to investigate specific patterns show in the clustermap above. Here there are some example function on how to do that.
Plotting patterns based on their indices can be done with crested.pl.patterns.selected_instances()
pattern_indices = [43]
crested.pl.patterns.selected_instances(
all_patterns, pattern_indices
) # The pattern that is shown is the most representative pattern of the cluster with the highest average IC
We can also do a check of pattern similarity
idx1 = 1
idx2 = 3
sim = crested.tl.modisco.pattern_similarity(all_patterns, idx1, idx2)
print("Pattern similarity is " + str(sim))
crested.pl.patterns.selected_instances(all_patterns, [idx1, idx2])
Pattern similarity is tensor(0.3197, dtype=torch.float64)
We can plot all the instances of patterns in the same cluster with crested.pl.patterns.class_instances()
crested.pl.patterns.class_instances(all_patterns, 77)
If you want to find out in which pattern cluster a certain pattern is from your modisco results, you can call the crested.tl.modisco.find_pattern()
function.
idx = crested.tl.modisco.find_pattern("OPC_neg_patterns_0", all_patterns)
if idx is not None:
print("Pattern index is " + str(idx))
crested.pl.patterns.class_instances(all_patterns, idx, class_representative=True)
Pattern index is 0
Finally we can also plot the similarity between all patterns with crested.tl.modisco.calculate_similarity_matrix()
and crested.pl.patterns.similarity_heatmap()
sim_matrix, indices = crested.tl.modisco.calculate_similarity_matrix(all_patterns)
crested.pl.patterns.similarity_heatmap(sim_matrix, indices, fig_size=(42, 17))
Matching patterns to TF candidates from scRNA-seq data [Optional]#
To understand the actual transcription factor (TF) candidates binding to the characteristic patterns/potential binding sites per cell type, we can propose potential candidates through scRNA-seq data and a TF-motif collection file.
This analysis requires that you ran tfmodisco-lite with the report function such that each pattern has potential MEME database hits. The names in the motif database should match those in the TF-motif collection file.
Load scRNA-seq data#
Load scRNA seq data and calculate mean expression per cell type using crested.tl.modisco.calculate_mean_expression_per_cell_type()
%load_ext autoreload
%autoreload 2
import crested
import crested
file_path = "../../../../mouse/biccn/Mouse_rna.h5ad" # Locate h5 file containing scRNAseq data
cell_type_column = "subclass_Bakken_2022"
mean_expression_df = crested.tl.modisco.calculate_mean_expression_per_cell_type(
file_path, cell_type_column, cpm_normalize=True
)
crested.pl.patterns.tf_expression_per_cell_type(
mean_expression_df, ["Nfia", "Spi1", "Mef2c"]
)
Generating pattern to database motif dictionary#
classes = list(adata.obs_names)
contribution_dir = "modisco_results_ft"
html_paths = crested.tl.modisco.generate_html_paths(
all_patterns, classes, contribution_dir
)
pattern_match_dict = crested.tl.modisco.find_pattern_matches(
all_patterns, html_paths, q_val_thr=0.05
) # q_val threshold to only select significant matches
Loading TF-motif database#
motif_to_tf_df = crested.tl.modisco.read_motif_to_tf_file(
"/data/projects/c04/cbd-saerts/nkemp/tools/Motif_collection.tsv"
)
motif_to_tf_df
logo | Motif_name | Cluster | Human_Direct_annot | Human_Orthology_annot | Mouse_Direct_annot | Mouse_Orthology_annot | Fly_Direct_annot | Fly_Orthology_annot | Cluster_Human_Direct_annot | Cluster_Human_Orthology_annot | Cluster_Mouse_Direct_annot | Cluster_Mouse_Orthology_annot | Cluster_Fly_Direct_annot | Cluster_Fly_Orthology_annot | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | <img src="https://motifcollections.aertslab.org/v10/logos/bergman__Adf1.png" height="52" alt="bergman__Adf1"></img> | bergman__Adf1 | NaN | NaN | NaN | NaN | NaN | Adf1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | <img src="https://motifcollections.aertslab.org/v10/logos/bergman__Aef1.png" height="52" alt="bergman__Aef1"></img> | bergman__Aef1 | NaN | NaN | NaN | NaN | NaN | Aef1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | <img src="https://motifcollections.aertslab.org/v10/logos/bergman__ap.png" height="52" alt="bergman__ap"></img> | bergman__ap | NaN | NaN | NaN | NaN | NaN | ap | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | <img src="https://motifcollections.aertslab.org/v10/logos/elemento__ACCTTCA.png" height="52" alt="elemento__ACCTTCA"></img> | elemento__ACCTTCA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | <img src="https://motifcollections.aertslab.org/v10/logos/bergman__bcd.png" height="52" alt="bergman__bcd"></img> | bergman__bcd | NaN | NaN | NaN | NaN | NaN | bcd | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
22471 | <img src="https://motifcollections.aertslab.org/v10/logos/elemento__CAAGGAG.png" height="52" alt="elemento__CAAGGAG"></img> | elemento__CAAGGAG | 98.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22472 | <img src="https://motifcollections.aertslab.org/v10/logos/elemento__TCCTTGC.png" height="52" alt="elemento__TCCTTGC"></img> | elemento__TCCTTGC | 98.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22473 | <img src="https://motifcollections.aertslab.org/v10/logos/swissregulon__hs__ZNF274.png" height="52" alt="swissregulon__hs__ZNF274"></img> | swissregulon__hs__ZNF274 | 99.1 | ZNF274 | NaN | NaN | Zfp369, Zfp110 | NaN | NaN | ZNF274 | NaN | NaN | Zfp369, Zfp110 | NaN | NaN |
22474 | <img src="https://motifcollections.aertslab.org/v10/logos/swissregulon__sacCer__THI2.png" height="52" alt="swissregulon__sacCer__THI2"></img> | swissregulon__sacCer__THI2 | 99.2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22475 | <img src="https://motifcollections.aertslab.org/v10/logos/jaspar__MA0407.1.png" height="52" alt="jaspar__MA0407.1"></img> | jaspar__MA0407.1 | 99.2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22476 rows × 15 columns
Matching patterns to TF candidates#
We calculate a pattern-tf by cell type matrix which contains the imporatance of each pattern linked to a TF per cell type using crested.tl.modisco.create_pattern_tf_dict()
and crested.tl.modisco.create_tf_ct_matrix()
cols = [
"Mouse_Direct_annot",
"Mouse_Orthology_annot",
"Cluster_Mouse_Direct_annot",
"Cluster_Mouse_Orthology_annot",
]
pattern_tf_dict, all_tfs = crested.tl.modisco.create_pattern_tf_dict(
pattern_match_dict, motif_to_tf_df, all_patterns, cols
)
tf_ct_matrix, tf_pattern_annots = crested.tl.modisco.create_tf_ct_matrix(
pattern_tf_dict,
all_patterns,
mean_expression_df,
classes,
log_transform=True,
normalize_pattern_importances=False,
normalize_gex=True,
min_tf_gex=0.95,
importance_threshold=5,
pattern_parameter="seqlet_count_log",
filter_correlation=True,
verbose=True,
zscore_threshold=1,
correlation_threshold=0.3,
)
Total columns before threshold filtering: 736
Total columns after threshold filtering: 130
Total columns removed: 606
Total columns before correlation filtering: 130
Total columns after correlation filtering: 96
Total columns removed: 34
Finally, we can plot a clustermap of potential pattern-TF matches and their importance per cell type with crested.pl.patterns.clustermap_tf_motif()
crested.pl.patterns.clustermap_tf_motif(
tf_ct_matrix,
heatmap_dim="contrib",
dot_dim="gex",
class_labels=classes,
pattern_labels=tf_pattern_annots,
fig_size=(35, 6),
cluster_rows=True,
cluster_columns=False,
)