vak.prep.parametric_umap.parametric_umap.prep_parametric_umap_dataset
- vak.prep.parametric_umap.parametric_umap.prep_parametric_umap_dataset(data_dir: str | Path, purpose: str, output_dir: str | Path | None = None, audio_format: str | None = None, spect_params: dict | None = None, annot_format: str | None = None, annot_file: str | Path | None = None, labelset: set | None = None, context_s: float = 0.015, train_dur: int | None = None, val_dur: int | None = None, test_dur: int | None = None, train_set_durs: list[float] | None = None, num_replicates: int | None = None, spect_key: str = 's', timebins_key: str = 't')
Prepare datasets for neural network models that perform a dimensionality reduction task.
For general information on dataset preparation, see the docstring for vak.prep.prep().
- Parameters:
data_dir (str, Path) – Path to directory with files from which to make dataset.
purpose (str) – Purpose of the dataset. One of {‘train’, ‘eval’, ‘predict’, ‘learncurve’}. These correspond to commands of the vak command-line interface.
output_dir (str) – Path to location where data sets should be saved. Default is None, in which case it defaults to data_dir.
audio_format (str) – Format of audio files. One of {‘wav’, ‘cbin’}. Default is None, but either audio_format or spect_format must be specified.
spect_params (dict, vak.config.SpectParams) – Parameters for creating spectrograms. Default is None.
annot_format (str) – Format of annotations. Any format that can be used with the crowsetta library is valid. Default is None.
labelset (str, list, set) – Set of unique labels for vocalizations. Strings or integers. Default is None. If not None, then files will be skipped where the associated annotation contains labels not found in labelset. labelset is converted to a Python set using vak.converters.labelset_to_set(). See help for that function for details on how to specify labelset.
.train_dur (float) – Total duration of training set, in seconds. When creating a learning curve, training subsets of shorter duration will be drawn from this set. Default is None.
val_dur (float) – Total duration of validation set, in seconds. Default is None.
test_dur (float) – Total duration of test set, in seconds. Default is None.
train_set_durs (list) – List of int. Durations, in seconds, of subsets taken from the training data to create a learning curve, e.g. [5, 10, 15, 20].
num_replicates (int) – Number of times to replicate training for each training set duration, to better estimate metrics for a training set of that size. Each replicate uses a different randomly drawn subset of the training data (but of the same duration).
spect_key (str) – Key for accessing spectrogram in files. Default is ‘s’.
timebins_key (str) – Key for accessing vector of time bins in files. Default is ‘t’.
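Two behaviours described in the parameter list can be illustrated with a small standalone sketch: skipping files whose annotations contain labels outside labelset, and drawing num_replicates random subsets for each duration in train_set_durs. This is not vak's implementation. The helper names files_within_labelset and learncurve_subsets, and the annotations and file_durs mappings, are hypothetical and exist only for illustration.

```python
import random


def files_within_labelset(annotations, labelset):
    """Return filenames whose annotation labels are all found in labelset.

    `annotations` is a hypothetical mapping of filename -> sequence of labels.
    When labelset is None, no file is skipped.
    """
    if labelset is None:
        return list(annotations)
    allowed = set(labelset)
    return [
        name for name, labels in annotations.items()
        if set(labels) <= allowed
    ]


def learncurve_subsets(file_durs, train_set_durs, num_replicates, seed=0):
    """Draw num_replicates random subsets for each target duration.

    `file_durs` is a hypothetical mapping of filename -> duration in seconds.
    Files are added in random order until the target duration is reached,
    so a subset may slightly exceed its target. If all files together are
    shorter than the target, the subset contains every file.
    """
    rng = random.Random(seed)
    subsets = {}
    for target in train_set_durs:
        for replicate in range(1, num_replicates + 1):
            names = list(file_durs)
            rng.shuffle(names)  # a different random draw per replicate
            chosen, total = [], 0.0
            for name in names:
                if total >= target:
                    break
                chosen.append(name)
                total += file_durs[name]
            subsets[(target, replicate)] = chosen
    return subsets
```

With train_set_durs=[5, 10] and num_replicates=2, the second helper returns four subsets keyed by (duration, replicate), which mirrors the "one subset per duration per replicate" layout the parameters describe.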
- Returns:
dataset_df (pandas.DataFrame) – DataFrame that represents the dataset.
dataset_path (pathlib.Path) – Path to the csv file saved from dataset_df.
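A call might look like the sketch below, which uses only keyword arguments that appear in the signature above. The directory paths, label set, and durations are hypothetical placeholders. 'notmat' is used as one example of a crowsetta annotation format. spect_params is left out here and would need to be supplied for real audio data. The sketch also assumes the two return values come back as a tuple in the order listed under Returns.

```python
from pathlib import Path

# Hypothetical argument values; replace them with your own data and settings.
prep_kwargs = dict(
    data_dir=Path("data/bird1"),       # hypothetical directory of audio and annotation files
    purpose="train",                   # one of 'train', 'eval', 'predict', 'learncurve'
    output_dir=Path("data/prepared"),  # hypothetical; None would default to data_dir
    audio_format="cbin",
    annot_format="notmat",             # example crowsetta format name
    labelset=set("abcd"),              # hypothetical labels
    train_dur=50,
    val_dur=15,
    test_dur=30,
)


def run_prep(kwargs):
    """Call prep_parametric_umap_dataset with the given keyword arguments.

    vak is imported lazily so that this sketch can be read and loaded
    without vak installed; calling the function does require it.
    """
    from vak.prep.parametric_umap.parametric_umap import (
        prep_parametric_umap_dataset,
    )

    # Assumption: the documented return values arrive as a 2-tuple.
    dataset_df, dataset_path = prep_parametric_umap_dataset(**kwargs)
    return dataset_df, dataset_path
```

Calling run_prep(prep_kwargs) requires vak to be installed and data_dir to exist and contain files in the stated formats.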