crested.tl.data.AnnDataModule#
- class crested.tl.data.AnnDataModule(adata, genome=None, chromsizes_file=None, in_memory=True, always_reverse_complement=True, random_reverse_complement=False, max_stochastic_shift=0, deterministic_shift=False, shuffle=True, batch_size=256)#
DataModule class which defines how dataloaders should be loaded in each stage.
Required input for the tl.Crested class.
Note
Expects a split column in the .var DataFrame of the AnnData object. Run pp.train_val_test_split first to add the split column to the AnnData object if not yet done.
Example
>>> data_module = AnnDataModule(
...     adata,
...     genome=my_genome,
...     always_reverse_complement=True,
...     max_stochastic_shift=50,
...     batch_size=256,
... )
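The split column mentioned in the note above can be pictured with a small, library-free sketch. It is not the implementation of pp.train_val_test_split. The helper name assign_split is hypothetical, and two things are assumed for illustration: region names follow a "chrom:start-end" pattern, and whole chromosomes are held out for validation and testing.

```python
def assign_split(region_names, val_chroms, test_chroms):
    """Map each 'chrom:start-end' region name to 'train', 'val' or 'test'.

    Hypothetical helper for illustration only; not part of CREsted.
    """
    splits = {}
    for name in region_names:
        # Everything before the first ':' is taken as the chromosome name.
        chrom = name.split(":", 1)[0]
        if chrom in val_chroms:
            splits[name] = "val"
        elif chrom in test_chroms:
            splits[name] = "test"
        else:
            splits[name] = "train"
    return splits


# Toy region names (assumed format), with chr8 held out for validation
# and chr9 held out for testing.
regions = ["chr1:100-600", "chr2:50-550", "chr8:10-510", "chr9:0-500"]
split = assign_split(regions, val_chroms={"chr8"}, test_chroms={"chr9"})
```

In this sketch every region receives exactly one label. That is the property the data module relies on when it separates training, validation and test regions.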
- Parameters:
adata – An instance of AnnData containing the data to be loaded.
genome (Union[PathLike, Genome, None] (default: None)) – Instance of Genome or path to the FASTA file. If None, will look for a registered genome object.
chromsizes_file (Optional[PathLike] (default: None)) – Path to the chromsizes file. Not required if genome is a Genome object. If genome is a path and chromsizes_file is not provided, the chromsizes will be deduced from the FASTA file.
in_memory (bool (default: True)) – If True, the train and val sequences will be loaded into memory.
always_reverse_complement (default: True) – If True, all sequences will be augmented with their reverse complement during training. Effectively increases the training dataset size by a factor of 2.
random_reverse_complement (bool (default: False)) – If True, the sequences will be randomly reverse complemented during training.
max_stochastic_shift (int (default: 0)) – Maximum stochastic shift (in base pairs) to apply randomly to each sequence during training.
deterministic_shift (bool (default: False)) – If True, each region will be shifted twice with a stride of 50 bp to each side. This is our legacy shifting; we recommend using max_stochastic_shift instead.
shuffle (bool (default: True)) – If True, the data will be shuffled at the end of each epoch during training.
batch_size (int (default: 256)) – Number of samples per batch to load.
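The two augmentation parameters above, always_reverse_complement and max_stochastic_shift, can be illustrated with a standalone sketch on plain strings. It is not CREsted's code. The function names are hypothetical, the shift is assumed to be drawn uniformly from [-max_shift, max_shift], and clamping the window at the sequence edges is an assumption made only so the example always returns a full-length window.

```python
import random

# Complement table for DNA letters; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stochastic_shift(chrom_seq, start, end, max_shift, rng):
    """Cut a window of length end - start, shifted by a random offset.

    The offset is drawn uniformly from [-max_shift, max_shift] (assumption).
    Windows that would run off either end are clamped (assumption).
    """
    width = end - start
    shift = rng.randint(-max_shift, max_shift)
    new_start = start + shift
    # Keep the window inside the sequence while preserving its width.
    new_start = max(0, min(new_start, len(chrom_seq) - width))
    return chrom_seq[new_start:new_start + width]


def augment(chrom_seq, start, end, always_rc, max_shift, rng):
    """Return the training sequence(s) generated for one region."""
    seq = stochastic_shift(chrom_seq, start, end, max_shift, rng)
    # With always_rc, each region yields two sequences instead of one.
    return [seq, reverse_complement(seq)] if always_rc else [seq]


rng = random.Random(0)
toy_chrom = "ACGTTGCAACGGTTAACCGGATATCGCG"  # made-up sequence
augmented = augment(toy_chrom, start=8, end=16, always_rc=True, max_shift=3, rng=rng)
```

With always_rc=True the region contributes two sequences of equal length, which is where the factor-of-2 increase in training set size comes from.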
Attributes table#
predict_dataloader | Prediction dataloader.
test_dataloader | Test dataloader.
train_dataloader | Training dataloader.
val_dataloader | Validation dataloader.
Methods table#
setup(stage) | Set up the Anndatasets for a given stage.
Attributes#
- AnnDataModule.predict_dataloader#
Prediction dataloader.
- AnnDataModule.test_dataloader#
Test dataloader.
- AnnDataModule.train_dataloader#
Training dataloader.
- AnnDataModule.val_dataloader#
Validation dataloader.
Methods#
- AnnDataModule.setup(stage)#
Set up the Anndatasets for a given stage.
Generates the train, val, test or predict dataset based on the provided stage. Should always be called before accessing the dataloaders. Generally you don’t need to call this directly, as this is called inside the tl.Crested trainer class.
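The requirement to call setup before accessing a dataloader can be sketched with a toy class. ToyDataModule is hypothetical and is not AnnDataModule. The stage names "fit", "test" and "predict" are assumed, as is the choice to predict on all regions, and plain Python lists stand in for real datasets and dataloaders.

```python
class ToyDataModule:
    """Toy stand-in showing the setup-then-access pattern (not CREsted code)."""

    def __init__(self, splits, batch_size=2):
        self.splits = splits          # mapping: region name -> split label
        self.batch_size = batch_size
        self._datasets = {}           # filled lazily by setup()

    def setup(self, stage):
        """Build only the datasets needed for the given stage."""
        if stage == "fit":
            wanted = ("train", "val")
        elif stage == "test":
            wanted = ("test",)
        elif stage == "predict":
            wanted = ("predict",)
        else:
            raise ValueError(f"Unknown stage: {stage!r}")
        for name in wanted:
            if name == "predict":
                # Assumption: prediction runs over every region.
                items = list(self.splits)
            else:
                items = [r for r, s in self.splits.items() if s == name]
            self._datasets[name] = items

    def _batches(self, name):
        if name not in self._datasets:
            raise RuntimeError(f"Call setup() before accessing the {name} dataloader.")
        items = self._datasets[name]
        step = self.batch_size
        return [items[i:i + step] for i in range(0, len(items), step)]

    @property
    def train_dataloader(self):
        return self._batches("train")

    @property
    def val_dataloader(self):
        return self._batches("val")

    @property
    def test_dataloader(self):
        return self._batches("test")

    @property
    def predict_dataloader(self):
        return self._batches("predict")


module = ToyDataModule({"a": "train", "b": "train", "c": "train", "d": "val", "e": "test"})
module.setup("fit")
train_batches = module.train_dataloader
```

After setup("fit"), only the train and val dataloaders are available in this sketch. Accessing test_dataloader raises until setup("test") has been called.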