crested.pp.train_val_test_split

crested.pp.train_val_test_split(adata, strategy='region', val_size=0.1, test_size=0.1, val_chroms=None, test_chroms=None, shuffle=True, random_state=None)

Add ‘train/val/test’ split column to AnnData object.

Adds a new column, split, to the .var DataFrame of the AnnData object. The column indicates whether each region (row of .var) belongs to the training, validation, or test set, according to the chosen splitting strategy.

Note

Model training always requires a split column in the .var DataFrame.

Parameters:
  • adata (AnnData) – AnnData object to which the ‘train/val/test’ split column will be added.

  • strategy (str (default: 'region')) –

    Splitting strategy. One of ‘region’, ‘chr’ or ‘chr_auto’. For ‘chr’ and ‘chr_auto’, each entry in the AnnData object’s var_names must start with the chromosome name, followed by a colon (e.g. I:2000-2500 or chr3:10-20:+).

    region: Split randomly on region indices.

    chr: Split based on provided chromosomes.

    chr_auto: Automatically select chromosomes for val and test sets based on val and test size.

    With strategy ‘chr’, the same chromosome(s) may be passed to both val_chroms and test_chroms. In that case, the regions on those chromosomes are divided evenly between the two sets.

  • val_size (float (default: 0.1)) – Proportion of the training dataset to include in the validation split.

  • test_size (float (default: 0.1)) – Proportion of the dataset to include in the test split.

  • val_chroms (Optional[list[str]] (default: None)) – List of chromosomes to include in the validation set. Required if strategy='chr'.

  • test_chroms (Optional[list[str]] (default: None)) – List of chromosomes to include in the test set. Required if strategy='chr'.

  • shuffle (bool (default: True)) – Whether to shuffle the data before splitting (only used when strategy='region').

  • random_state (Optional[int] (default: None)) – Seed for the random number generator. It affects the ordering of the indices when shuffling regions (strategy='region') and when automatically selecting chromosomes (strategy='chr_auto').
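The chromosome-based behaviour described for the strategy parameter can be illustrated with a small pure-Python sketch. This is not the CREsted implementation: chrom_of and chr_split are hypothetical helper names, and alternating regions on a shared chromosome is only one assumed way of dividing them "evenly".

```python
def chrom_of(var_name: str) -> str:
    """Return the chromosome prefix of a region name such as 'chr3:10-20:+'."""
    chrom, sep, _ = var_name.partition(":")
    if not sep or not chrom:
        raise ValueError(f"no chromosome prefix in {var_name!r}")
    return chrom


def chr_split(var_names, val_chroms, test_chroms):
    """Hypothetical helper: label each region 'train', 'val' or 'test' by chromosome."""
    val_chroms, test_chroms = set(val_chroms), set(test_chroms)
    shared = val_chroms & test_chroms
    labels = []
    n_shared = 0  # counts regions seen on chromosomes listed in both sets
    for name in var_names:
        chrom = chrom_of(name)
        if chrom in shared:
            # Assumption: alternate between val and test for an even division.
            labels.append("val" if n_shared % 2 == 0 else "test")
            n_shared += 1
        elif chrom in val_chroms:
            labels.append("val")
        elif chrom in test_chroms:
            labels.append("test")
        else:
            labels.append("train")
    return labels


names = ["chr1:0-500", "chr1:500-1000", "chr2:0-500", "chr3:10-20:+", "I:2000-2500"]
labels = chr_split(names, val_chroms=["chr2"], test_chroms=["chr3"])
```

In this sketch, every region whose chromosome appears in neither list falls into the training set, which matches how a chromosome hold-out split is typically used.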

Return type:

None

Returns:

Nothing. A new column, ‘split’, is added in place to adata.var, with the values ‘train’, ‘val’, or ‘test’.

Examples

>>> import crested
>>> crested.pp.train_val_test_split(
...     adata,
...     strategy="region",
...     val_size=0.1,
...     test_size=0.1,
...     shuffle=True,
...     random_state=42,
... )
>>> crested.pp.train_val_test_split(
...     adata,
...     strategy="chr",
...     val_chroms=["chr1", "chr2"],
...     test_chroms=["chr3", "chr4"],
... )
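To make the roles of shuffle and random_state in the ‘region’ strategy more concrete, here is a minimal standard-library sketch. It is not the CREsted implementation: region_split is a hypothetical helper, and it assumes that val_size and test_size are both taken as fractions of all regions, which may differ from how the library computes the validation size.

```python
import random


def region_split(n_regions, val_size=0.1, test_size=0.1, shuffle=True, random_state=None):
    """Hypothetical helper: return one 'train'/'val'/'test' label per region index."""
    indices = list(range(n_regions))
    if shuffle:
        # A fixed seed makes the shuffled order, and so the split, reproducible.
        random.Random(random_state).shuffle(indices)
    n_test = int(n_regions * test_size)
    n_val = int(n_regions * val_size)
    labels = ["train"] * n_regions
    for i in indices[:n_test]:
        labels[i] = "test"
    for i in indices[n_test:n_test + n_val]:
        labels[i] = "val"
    return labels


first = region_split(100, random_state=42)
second = region_split(100, random_state=42)
unshuffled = region_split(100, shuffle=False)
```

Here first and second are identical because they share a seed, while unshuffled simply assigns the leading indices to the test and validation sets in order.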