crested.tl.data.AnnDataModule

class crested.tl.data.AnnDataModule(adata, genome=None, chromsizes_file=None, in_memory=True, always_reverse_complement=True, random_reverse_complement=False, max_stochastic_shift=0, deterministic_shift=False, shuffle=True, batch_size=256)

DataModule class that defines how the dataloaders are created for each stage.

Required input for the tl.Crested class.

Note

Expects a split column in the .var DataFrame of the AnnData object. Run pp.train_val_test_split first to add the split column to the AnnData object if not yet done.
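To make the expected input concrete, the sketch below builds a toy split assignment in plain Python. It assumes the split labels are the strings "train", "val" and "test", and it uses a hypothetical chromosome-based rule. The names assign_split, VAL_CHROMS and TEST_CHROMS are illustrative only and are not part of CREsted; in practice pp.train_val_test_split adds the column for you.

```python
# Toy illustration only: the region names and the chromosome rule are assumptions.
VAL_CHROMS = {"chr8"}   # hypothetical validation chromosomes
TEST_CHROMS = {"chr9"}  # hypothetical test chromosomes


def assign_split(region: str) -> str:
    """Return a split label for a region named like 'chr1:100-600'."""
    chrom = region.split(":")[0]
    if chrom in VAL_CHROMS:
        return "val"
    if chrom in TEST_CHROMS:
        return "test"
    return "train"


regions = ["chr1:100-600", "chr8:2000-2500", "chr9:300-800"]

# One label per region, analogous to one value per row of a split column.
split_column = {region: assign_split(region) for region in regions}
print(split_column)
```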

Example

>>> data_module = AnnDataModule(
...     adata,
...     genome=my_genome,
...     always_reverse_complement=True,
...     max_stochastic_shift=50,
...     batch_size=256,
... )
Parameters:
  • adata (AnnData) – An instance of AnnData containing the data to be loaded.

  • genome (Union[PathLike, Genome, None] (default: None)) – Instance of Genome or Path to the fasta file. If None, will look for a registered genome object.

  • chromsizes_file (Optional[PathLike] (default: None)) – Path to the chromsizes file. Not required if genome is a Genome object. If genome is a path and chromsizes is not provided, will deduce the chromsizes from the fasta file.

  • in_memory (bool (default: True)) – If True, the train and val sequences will be loaded into memory. Default is True.

  • always_reverse_complement (bool (default: True)) – If True, all sequences will be augmented with their reverse complement during training. This effectively doubles the size of the training dataset. Default is True.

  • random_reverse_complement (bool (default: False)) – If True, the sequences will be randomly reverse complemented during training. Default is False.

  • max_stochastic_shift (int (default: 0)) – Maximum stochastic shift (n base pairs) to apply randomly to each sequence during training. Default is 0.

  • deterministic_shift (bool (default: False)) – If True, each region will be shifted twice to each side with a stride of 50 bp. Default is False. This is the legacy shifting method; we recommend using max_stochastic_shift instead.

  • shuffle (bool (default: True)) – If True, the data will be shuffled at the end of each epoch during training. Default is True.

  • batch_size (int (default: 256)) – Number of samples per batch to load. Default is 256.
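The augmentation parameters above can be illustrated with a small, self-contained sketch. This is not the CREsted implementation: the function names are hypothetical, and the reading of deterministic_shift as offsets of ±50 and ±100 bp plus the unshifted region is an assumption based on the parameter description.

```python
import random

# Translation table for DNA complements; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def stochastic_shift(start: int, end: int, max_shift: int, rng: random.Random) -> tuple[int, int]:
    """Shift a region by a random offset in [-max_shift, max_shift], keeping its width."""
    offset = rng.randint(-max_shift, max_shift)
    return start + offset, end + offset


def deterministic_shifts(start: int, end: int, n_shifts: int = 2, stride: int = 50) -> list[tuple[int, int]]:
    """Return the region at offsets -n_shifts*stride ... +n_shifts*stride (assumed to include 0)."""
    return [(start + k * stride, end + k * stride) for k in range(-n_shifts, n_shifts + 1)]


seq = "AACG"
# always_reverse_complement: every sequence appears together with its reverse complement.
augmented = [seq, reverse_complement(seq)]

# max_stochastic_shift=50: a new random offset can be drawn each time a region is sampled.
rng = random.Random(0)
shifted = stochastic_shift(1000, 1500, max_shift=50, rng=rng)

# deterministic_shift: a fixed set of shifted copies per region.
copies = deterministic_shifts(1000, 1500)
```

Under these assumptions the stochastic variant keeps the number of regions unchanged and varies the window each time a region is sampled, whereas the deterministic variant multiplies the number of training regions by a fixed factor.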

Attributes table

  • predict_dataloader – Prediction dataloader.

  • test_dataloader – Test dataloader.

  • train_dataloader – Training dataloader.

  • val_dataloader – Validation dataloader.

Methods table

  • setup(stage) – Set up the AnnDataset objects for a given stage.

Attributes

AnnDataModule.predict_dataloader

Prediction dataloader.

Type:

crested.tl.data.AnnDataLoader

AnnDataModule.test_dataloader

Test dataloader.

Type:

crested.tl.data.AnnDataLoader

AnnDataModule.train_dataloader

Training dataloader.

Type:

crested.tl.data.AnnDataLoader

AnnDataModule.val_dataloader

Validation dataloader.

Type:

crested.tl.data.AnnDataLoader

Methods

AnnDataModule.setup(stage)

Set up the AnnDataset objects for a given stage.

Generates the train, val, test or predict dataset based on the provided stage. This method should always be called before accessing the dataloaders. You generally don’t need to call it directly, because the tl.Crested trainer class calls it internally.

Parameters:

stage (str) – Stage for which to set up the dataloader. Either ‘fit’, ‘test’ or ‘predict’.

Return type:

None
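The stage-based behaviour described above can be sketched with a minimal stand-in class. ToyDataModule is hypothetical and is not the CREsted implementation. The mapping of 'fit' to the train and val splits, and of 'predict' to all regions regardless of split, is an assumption consistent with the description. Plain lists stand in for dataloaders.

```python
class ToyDataModule:
    """Minimal stand-in that prepares region lists per stage (illustrative only)."""

    def __init__(self, splits: dict[str, str]):
        # splits maps a region name to "train", "val" or "test".
        self.splits = splits
        self.train_regions = None
        self.val_regions = None
        self.test_regions = None
        self.predict_regions = None

    def _select(self, label: str) -> list[str]:
        """Return the regions whose split label matches."""
        return [region for region, split in self.splits.items() if split == label]

    def setup(self, stage: str) -> None:
        """Prepare only the datasets needed for the requested stage."""
        if stage == "fit":
            self.train_regions = self._select("train")
            self.val_regions = self._select("val")
        elif stage == "test":
            self.test_regions = self._select("test")
        elif stage == "predict":
            # Assumption: prediction covers every region, whatever its split.
            self.predict_regions = list(self.splits)
        else:
            raise ValueError("stage must be 'fit', 'test' or 'predict'")

    @property
    def train_dataloader(self) -> list[str]:
        """Fail loudly if setup('fit') has not been called yet."""
        if self.train_regions is None:
            raise RuntimeError("Call setup('fit') before accessing train_dataloader.")
        return self.train_regions


module = ToyDataModule({"chr1:0-500": "train", "chr8:0-500": "val", "chr9:0-500": "test"})
module.setup("fit")
print(module.train_dataloader)
```

In this sketch, accessing train_dataloader before setup('fit') raises an error, which mirrors the requirement that setup is called before the dataloaders are used.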