The dataset.yaml Specification
Tip: dataset.yaml is optional for the swcli dataset build command.
Building a Starwhale Dataset uses dataset.yaml. If dataset.yaml is omitted, the same configuration can be supplied through swcli dataset build command line parameters. dataset.yaml can therefore be regarded as a file-based representation of the build command line configuration.
YAML Field Descriptions
Field | Description | Required | Type | Default |
---|---|---|---|---|
name | Name of the Starwhale Dataset | Yes | String | |
handler | Importable address of a class that inherits starwhale.SWDSBinBuildExecutor, starwhale.UserRawBuildExecutor or starwhale.BuildExecutor, or of a function that returns a Generator or iterable object. The format is {module path}:{class name\|function name} | Yes | String | |
desc | Dataset description | No | String | "" |
version | dataset.yaml format version. Currently only "1.0" is supported. | No | String | 1.0 |
attr | Dataset build parameters | No | Dict | |
attr.volume_size | Size of each data file in the swds-bin dataset. Can be a number in bytes, or a number plus a unit such as 64M or 1GB. | No | Integer or String | 64MB |
attr.alignment_size | Alignment size of each data block in the swds-bin dataset. If set to 4K and a data block is 7.9K, 0.1K of padding is added so that the block size becomes a multiple of alignment_size, which improves page alignment and read efficiency. | No | Integer or String | 128 |
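The size and alignment rules in the table can be sketched in Python. This is a minimal illustration, not Starwhale's implementation: parse_size and padded_size are hypothetical helper names, and treating K, M and G as powers of 1024 is an assumption that the table does not confirm.

```python
import re

# Assumption: units are binary multiples (1K = 1024 bytes); the spec does not say.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value):
    """Convert 64, "64M", "4k" or "1GB" into a byte count (hypothetical helper)."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)B?\s*", value.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def padded_size(block_size, alignment):
    """Round block_size up to the next multiple of alignment."""
    remainder = block_size % alignment
    return block_size if remainder == 0 else block_size + alignment - remainder


alignment = parse_size("4K")   # 4096 bytes under the 1024-based assumption
block = 8090                   # roughly 7.9K
total = padded_size(block, alignment)
print(total, total - block)    # prints "8192 102": about 0.1K of padding
```

A block that is already a multiple of the alignment size is left unchanged, so no padding is added in that case.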
Examples
Simplest Example
name: helloworld
handler: dataset:ExampleProcessExecutor
The helloworld dataset uses the ExampleProcessExecutor class in dataset.py, located in the same directory as dataset.yaml, to build data.
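The handler value follows the {module path}:{class name|function name} format. As a rough sketch of how such a string could be resolved, the snippet below uses the standard importlib module. resolve_handler is a hypothetical name and Starwhale's actual loader may work differently. The standard-library json module stands in for dataset.py so that the snippet runs anywhere.

```python
import importlib


def resolve_handler(spec):
    """Split "module.path:attr_name" and return the named attribute (hypothetical helper)."""
    module_path, sep, attr_name = spec.partition(":")
    if not sep or not module_path or not attr_name:
        raise ValueError(f"expected '<module path>:<name>', got {spec!r}")
    # Assumption: the module is importable, e.g. the directory holding
    # dataset.py is on sys.path when "dataset:ExampleProcessExecutor" is used.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Stand-in for "dataset:ExampleProcessExecutor": resolve a stdlib function.
dumps = resolve_handler("json:dumps")
print(dumps({"label": 1}))
```

A dotted module path such as mnist.dataset is handled by the same call, because importlib.import_module accepts package-qualified names.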
MNIST Dataset Build Example
name: mnist
handler: mnist.dataset:DatasetProcessExecutor
desc: MNIST data and label test dataset
attr:
  alignment_size: 128
  volume_size: 4M
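To show how the defaults from the field table combine with a configuration like the MNIST one, here is a small sketch. It works on a plain dict that mirrors the YAML rather than parsing the file, and with_defaults, DEFAULTS and REQUIRED are hypothetical names, not part of Starwhale.

```python
import copy

# Defaults copied from the field table; all names here are illustrative.
DEFAULTS = {
    "desc": "",
    "version": "1.0",
    "attr": {"volume_size": "64MB", "alignment_size": 128},
}
REQUIRED = ("name", "handler")


def with_defaults(config):
    """Check required fields and fill in the documented defaults."""
    missing = [key for key in REQUIRED if not config.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    merged = copy.deepcopy(DEFAULTS)
    for key, value in config.items():
        if key == "attr":
            # Merge attr key by key so unspecified build parameters keep defaults.
            merged["attr"].update(value or {})
        else:
            merged[key] = value
    return merged


# Dict equivalent of the MNIST dataset.yaml.
mnist = {
    "name": "mnist",
    "handler": "mnist.dataset:DatasetProcessExecutor",
    "desc": "MNIST data and label test dataset",
    "attr": {"alignment_size": 128, "volume_size": "4M"},
}
resolved = with_defaults(mnist)
print(resolved["version"], resolved["attr"])
```

Only version is filled in for the MNIST configuration, since desc and both attr values are given explicitly.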
Example with handler as a generator function
dataset.yaml contents:
name: helloworld
handler: dataset:iter_item
dataset.py contents:
def iter_item():
    # Yield ten records, each with encoded image bytes and an integer label.
    for i in range(10):
        yield {"img": f"image-{i}".encode(), "label": i}
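Because the handler may be a function that returns either a generator or any other iterable, both forms can be consumed the same way. The sketch below illustrates this with a hypothetical collect function that stands in for whatever the build process does with the records. list_items is likewise an illustrative name, not part of the original example.

```python
def iter_item():
    # Generator form, as in dataset.py.
    for i in range(10):
        yield {"img": f"image-{i}".encode(), "label": i}


def list_items():
    # Equivalent handler that returns an iterable (a list) instead of a generator.
    return [{"img": f"image-{i}".encode(), "label": i} for i in range(10)]


def collect(handler):
    # Hypothetical consumer: call the handler and iterate whatever it returns.
    return [record for record in handler()]


rows = collect(iter_item)
print(len(rows), rows[0])  # prints: 10 {'img': b'image-0', 'label': 0}
```

The generator form produces records lazily, which matters when the full dataset does not fit in memory. The list form builds everything up front.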