Version: 0.6.4

swcli dataset

Overview

swcli [GLOBAL OPTIONS] dataset [OPTIONS] <SUBCOMMAND> [ARGS]...

The dataset command includes the following subcommands:

  • build
  • copy(cp)
  • diff
  • head
  • history
  • info
  • list(ls)
  • recover
  • remove(rm)
  • summary
  • tag

swcli dataset build

swcli [GLOBAL OPTIONS] dataset build [OPTIONS]

Build a Starwhale Dataset. This command only supports building standalone datasets.

Optionsโ€‹

  • Data source options:

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| -if or --image or --image-folder | N | String | | Build dataset from an image folder; the folder should contain the image files. |
| -af or --audio or --audio-folder | N | String | | Build dataset from an audio folder; the folder should contain the audio files. |
| -vf or --video or --video-folder | N | String | | Build dataset from a video folder; the folder should contain the video files. |
| -h or --handler or --python-handler | N | String | | Build dataset from a python executor handler; the handler format is [module path]:[class or func name]. |
| -f or --yaml or --dataset-yaml | N | | dataset.yaml in cwd | Build dataset from a dataset.yaml file. Defaults to the dataset.yaml in the work directory(cwd). |
| -jf or --json | N | String | | Build dataset from a json or jsonl file. The option value is a json/jsonl file path or an http url. The json content structure should be a list[dict] or tuple[dict]. |
| -hf or --huggingface | N | String | | Build dataset from a huggingface dataset. The option value is a huggingface repo name. |
| -c or --csv | N | String | | Build dataset from csv files. The option value is a csv file path, dir path or an http url. The option can be used multiple times. |

Data source options are mutually exclusive; only one may be used at a time. If none is set, the swcli dataset build command falls back to dataset.yaml mode and builds the dataset from the dataset.yaml in the cwd.

  • Other options:

| Option | Required | Scope | Type | Defaults | Description |
| --- | --- | --- | --- | --- | --- |
| -pt or --patch | one of --patch and --overwrite | Global | Boolean | True | Patch mode, only update the changed rows and columns for the existing dataset. |
| -ow or --overwrite | one of --patch and --overwrite | Global | Boolean | False | Overwrite mode, update records and delete extraneous rows from the existing dataset. |
| -n or --name | N | Global | String | | Dataset name |
| -p or --project | N | Global | String | Default project | Project URI, the default is the currently selected project. The dataset will be stored in the specified project. |
| -d or --desc | N | Global | String | | Dataset description |
| -as or --alignment-size | N | Global | String | 128B | swds-bin format dataset: alignment size |
| -vs or --volume-size | N | Global | String | 64MB | swds-bin format dataset: volume size |
| -r or --runtime | N | Global | String | | Runtime URI |
| -w or --workdir | N | Python Handler Mode | String | cwd | Work dir to search for the handler. |
| --auto-label/--no-auto-label | N | Image/Video/Audio Folder Mode | Boolean | True | Whether to auto-label by the sub-folder name. |
| --field-selector | N | JSON File Mode | String | | The field from which to extract dataset array items. Nested fields are separated by the dot(.) symbol. |
| --subset | N | Huggingface Mode | String | | Huggingface dataset subset name. If the subset name is not specified, all subsets will be built. |
| --split | N | Huggingface Mode | String | | Huggingface dataset split name. If the split name is not specified, all splits will be built. |
| --revision | N | Huggingface Mode | String | main | Version of the dataset script to load. Defaults to 'main'. The option value accepts a tag name, branch name, or commit hash. |
| --add-hf-info/--no-add-hf-info | N | Huggingface Mode | Boolean | True | Whether to add huggingface dataset info to the dataset rows; currently subset and split are added. Subset uses the _hf_subset field name, split uses the _hf_split field name. |
| --cache/--no-cache | N | Huggingface Mode | Boolean | True | Whether to use the huggingface dataset cache(download + local hf dataset). |
| -t or --tag | N | Global | String | | Dataset tags, the option can be used multiple times. |
| --encoding | N | CSV/JSON/JSONL Mode | String | | File encoding. |
| --dialect | N | CSV Mode | String | excel | The csv file dialect, the default is excel. Currently supports excel, excel-tab and unix formats. |
| --delimiter | N | CSV Mode | String | , | A one-character string used to separate fields in the csv file. |
| --quotechar | N | CSV Mode | String | " | A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. |
| --skipinitialspace/--no-skipinitialspace | N | CSV Mode | Bool | False | Whether to skip spaces after the delimiter in the csv file. |
| --strict/--no-strict | N | CSV Mode | Bool | False | When True, raise an Error exception if the csv is not well formed. |

Examples for dataset building

#- from dataset.yaml
swcli dataset build # build dataset from dataset.yaml in the current work directory(pwd)
swcli dataset build --yaml /path/to/dataset.yaml # build dataset from /path/to/dataset.yaml, all the involved files are relative to the dataset.yaml file.
swcli dataset build --overwrite --yaml /path/to/dataset.yaml # build dataset from /path/to/dataset.yaml, and overwrite the existing dataset.
swcli dataset build --tag tag1 --tag tag2

#- from handler
swcli dataset build --handler mnist.dataset:iter_mnist_item # build dataset from mnist.dataset:iter_mnist_item handler, the workdir is the current work directory(pwd).
# build dataset from mnist.dataset:LinkRawDatasetProcessExecutor handler, the workdir is example/mnist
swcli dataset build --handler mnist.dataset:LinkRawDatasetProcessExecutor --workdir example/mnist

#- from image folder
swcli dataset build --image-folder /path/to/image/folder # build dataset from /path/to/image/folder, search all image type files.

#- from audio folder
swcli dataset build --audio-folder /path/to/audio/folder # build dataset from /path/to/audio/folder, search all audio type files.

#- from video folder
swcli dataset build --video-folder /path/to/video/folder # build dataset from /path/to/video/folder, search all video type files.

#- from json/jsonl file
swcli dataset build --json /path/to/example.json
swcli dataset build --json http://example.com/example.json
swcli dataset build --json /path/to/example.json --field-selector a.b.c # extract the json_content["a"]["b"]["c"] field from the json file.
swcli dataset build --name qald9 --json https://raw.githubusercontent.com/ag-sc/QALD/master/9/data/qald-9-test-multilingual.json --field-selector questions
swcli dataset build --json /path/to/test01.jsonl --json /path/to/test02.jsonl
swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl

#- from huggingface dataset
swcli dataset build --huggingface mnist
swcli dataset build -hf mnist --no-cache
swcli dataset build -hf cais/mmlu --subset anatomy --split auxiliary_train --revision 7456cfb

#- from csv files
swcli dataset build --csv /path/to/example.csv
swcli dataset build --csv /path/to/example.csv --csv /path/to/example2.csv
swcli dataset build --csv /path/to/csv-dir
swcli dataset build --csv http://example.com/example.csv
swcli dataset build --name product-desc-modelscope --csv https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo\?Revision\=master\&FilePath\=test.csv --encoding=utf-8-sig

swcli dataset copy

swcli [GLOBAL OPTIONS] dataset copy [OPTIONS] <SRC> <DEST>

dataset copy copies from SRC to DEST.

SRC and DEST are both dataset URIs.

When copying a Starwhale Dataset, all custom user-defined tags are copied by default. You can use the --ignore-tag option to skip certain tags. In addition, the latest and ^v\d+$ tags are Starwhale built-in tags that are only used within the instance itself and will not be copied to other instances.

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| --force or -f | N | Boolean | False | If true, DEST will be overwritten if it exists. In addition, if a tag carried during the copy is already used by another version, this option can be used to forcibly update the tag to this version. |
| -p or --patch | one of --patch and --overwrite | Boolean | True | Patch mode, only update the changed rows and columns for the remote dataset. |
| -o or --overwrite | one of --patch and --overwrite | Boolean | False | Overwrite mode, update records and delete extraneous rows from the remote dataset. |
| -i or --ignore-tag | N | String | | Tags to ignore when copying. The option can be used multiple times. |

Examples for dataset copy

#- copy cloud instance(pre-k8s) mnist project's mnist-cloud dataset to local project(myproject) with a new dataset name 'mnist-local'
swcli dataset cp cloud://pre-k8s/project/mnist/mnist-cloud/version/ge3tkylgha2tenrtmftdgyjzni3dayq local/project/myproject/mnist-local

#- copy cloud instance(pre-k8s) mnist project's mnist-cloud dataset to local default project(self) with the cloud instance dataset name 'mnist-cloud'
swcli dataset cp --patch cloud://pre-k8s/project/dataset/mnist/mnist-cloud/version/ge3tkylgha2tenrtmftdgyjzni3dayq .

#- copy cloud instance(pre-k8s) mnist project's mnist-cloud dataset to local project(myproject) with the cloud instance dataset name 'mnist-cloud'
swcli dataset cp cloud://pre-k8s/project/mnist/mnist-cloud/version/ge3tkylgha2tenrtmftdgyjzni3dayq . -dlp myproject

#- copy cloud instance(pre-k8s) mnist project's mnist-cloud dataset to local default project(self) with a dataset name 'mnist-local'
swcli dataset cp --overwrite cloud://pre-k8s/project/dataset/mnist/mnist-cloud/version/ge3tkylgha2tenrtmftdgyjzni3dayq mnist-local

#- copy cloud instance(pre-k8s) mnist project's mnist-cloud dataset to local project(myproject) with a dataset name 'mnist-local'
swcli dataset cp cloud://pre-k8s/project/mnist/mnist-cloud/version/ge3tkylgha2tenrtmftdgyjzni3dayq mnist-local -dlp myproject

#- copy standalone instance(local) default project(self)'s mnist-local dataset to cloud instance(pre-k8s) mnist project with a new dataset name 'mnist-cloud'
swcli dataset cp mnist-local/version/latest cloud://pre-k8s/project/mnist/mnist-cloud

#- copy standalone instance(local) default project(self)'s mnist-local dataset to cloud instance(pre-k8s) mnist project with standalone instance dataset name 'mnist-local'
swcli dataset cp mnist-local/version/latest cloud://pre-k8s/project/mnist

#- copy standalone instance(local) default project(self)'s mnist-local dataset to cloud instance(pre-k8s) mnist project without 'cloud://' prefix
swcli dataset cp mnist-local/version/latest pre-k8s/project/mnist

#- copy standalone instance(local) project(myproject)'s mnist-local dataset to cloud instance(pre-k8s) mnist project with standalone instance dataset name 'mnist-local'
swcli dataset cp local/project/myproject/dataset/mnist-local/version/latest cloud://pre-k8s/project/mnist

#- copy without some tags
swcli dataset cp mnist cloud://cloud.starwhale.cn/project/starwhale:public --ignore-tag t1 --force

swcli dataset diff

swcli [GLOBAL OPTIONS] dataset diff [OPTIONS] <DATASET VERSION> <DATASET VERSION>

dataset diff compares the difference between two versions of the same dataset.

DATASET VERSION is a dataset URI.

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| --show-details | N | Boolean | False | If true, outputs the detail information. |
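
Examples for dataset diff

The following invocations are illustrative; the mnist dataset name and the v1/v2 version ids are placeholders.

#- compare two versions of the mnist dataset
swcli dataset diff mnist/version/v1 mnist/version/v2

#- compare two versions and output the detail information
swcli dataset diff mnist/version/v1 mnist/version/v2 --show-details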

swcli dataset head

swcli [GLOBAL OPTIONS] dataset head [OPTIONS] <DATASET VERSION>

Print the first n rows of the dataset. DATASET VERSION is a dataset URI.

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| -n or --rows | N | Int | 5 | Print the first NUM rows of the dataset. |
| -srd or --show-raw-data | N | Boolean | False | Fetch raw data content from the object store. |
| -st or --show-types | N | Boolean | False | Show data types. |

Examples for dataset head

#- print the first 5 rows of the mnist dataset
swcli dataset head -n 5 mnist

#- print the first 10 rows of the mnist(v0 version) dataset and show raw data
swcli dataset head -n 10 mnist/v0 --show-raw-data

#- print the data types of the mnist dataset
swcli dataset head mnist --show-types

#- print the remote cloud dataset's first 5 rows
swcli dataset head cloud://cloud-cn/project/test/dataset/mnist -n 5

#- print the first 5 rows in the json format
swcli -o json dataset head -n 5 mnist

swcli dataset history

swcli [GLOBAL OPTIONS] dataset history [OPTIONS] <DATASET>

dataset history outputs all history versions of the specified Starwhale Dataset.

DATASET is a dataset URI.

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| --fullname | N | Boolean | False | Show the full version name. Only the first 12 characters are shown if this option is false. |
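
Examples for dataset history

The following invocations are illustrative; mnist is a placeholder dataset name.

#- list all history versions of the mnist dataset
swcli dataset history mnist

#- list history versions with full version names
swcli dataset history mnist --fullname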

swcli dataset info

swcli [GLOBAL OPTIONS] dataset info [OPTIONS] <DATASET>

dataset info outputs detailed information about the specified Starwhale Dataset version.

DATASET is a dataset URI.
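
Examples for dataset info

The following invocations are illustrative; the dataset name and version are placeholders.

#- show detailed information about the latest version of the mnist dataset
swcli dataset info mnist

#- show detailed information about a specific version of the mnist dataset
swcli dataset info mnist/version/v0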

swcli dataset list

swcli [GLOBAL OPTIONS] dataset list [OPTIONS]

dataset list shows all Starwhale Datasets.

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| --project | N | String | | The URI of the project to list. Use the default project if not specified. |
| --fullname | N | Boolean | False | Show the full version name. Only the first 12 characters are shown if this option is false. |
| --show-removed or -sr | N | Boolean | False | If true, include datasets that are removed but not garbage collected. |
| --page | N | Integer | 1 | The starting page number. Server and cloud instances only. |
| --size | N | Integer | 20 | The number of items in one page. Server and cloud instances only. |
| --filter or -fl | N | String | | Show only Starwhale Datasets that match the specified filters. This option can be used multiple times in one command. |

The following filters are supported by --filter:

| Filter | Type | Description | Example |
| --- | --- | --- | --- |
| name | Key-Value | The name prefix of datasets | --filter name=mnist |
| owner | Key-Value | The dataset owner name | --filter owner=starwhale |
| latest | Flag | If specified, it shows only the latest version. | --filter latest |
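
Examples for dataset list

The following invocations are illustrative; the filter values are placeholders.

#- list datasets in the default project
swcli dataset list

#- list datasets including the removed ones, with full version names
swcli dataset list --show-removed --fullname

#- list only the latest versions of datasets whose names start with mnist
swcli dataset list --filter name=mnist --filter latest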

swcli dataset recover

swcli [GLOBAL OPTIONS] dataset recover [OPTIONS] <DATASET>

dataset recover recovers previously removed Starwhale Datasets or versions.

DATASET is a dataset URI. If the version part of the URI is omitted, all removed versions are recovered.

Garbage-collected Starwhale Datasets or versions cannot be recovered, nor can those removed with the --force option.

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| --force or -f | N | Boolean | False | If true, overwrite the Starwhale Dataset or version with the same name or version id. |
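
Examples for dataset recover

The following invocations are illustrative; mnist is a placeholder dataset name.

#- recover all removed versions of the mnist dataset
swcli dataset recover mnist

#- recover the mnist dataset, overwriting the dataset or version with the same name or version id
swcli dataset recover mnist --force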

swcli dataset remove

swcli [GLOBAL OPTIONS] dataset remove [OPTIONS] <DATASET>

dataset remove removes the specified Starwhale Dataset or version.

DATASET is a dataset URI. If the version part of the URI is omitted, all versions are removed.

Removed Starwhale Datasets or versions can be recovered by swcli dataset recover before garbage collection. Use the --force option to persistently remove a Starwhale Dataset or version.

Removed Starwhale Datasets or versions can be listed by swcli dataset list --show-removed.

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| --force or -f | N | Boolean | False | If true, persistently delete the Starwhale Dataset or version. It can not be recovered. |
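
Examples for dataset remove

The following invocations are illustrative; the dataset name and version are placeholders.

#- remove a specific version of the mnist dataset; it can be recovered before garbage collection
swcli dataset remove mnist/version/v0

#- persistently remove all versions of the mnist dataset; they can not be recovered
swcli dataset remove mnist --force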

swcli dataset summary

swcli [GLOBAL OPTIONS] dataset summary <DATASET>

Show dataset summary. DATASET is a dataset URI.
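
Examples for dataset summary

The following invocation is illustrative; mnist is a placeholder dataset name.

#- show the summary of the mnist dataset
swcli dataset summary mnist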

swcli dataset tag

swcli [GLOBAL OPTIONS] dataset tag [OPTIONS] <DATASET> [TAGS]...

dataset tag attaches a tag to a specified Starwhale Dataset version. The tag command also supports listing and removing tags. A tag can be used in a dataset URI instead of the version id.

DATASET is a dataset URI.

Each dataset version can have any number of tags, but duplicated tag names are not allowed in the same dataset.

dataset tag only works for the Standalone Instance.

| Option | Required | Type | Defaults | Description |
| --- | --- | --- | --- | --- |
| --remove or -r | N | Boolean | False | Remove the tag if true. |
| --quiet or -q | N | Boolean | False | Ignore errors, for example, removing tags that do not exist. |
| --force-add or -f | N | Boolean | False | When adding tags on server/cloud instances, an error is raised if the tag is already used by another dataset version. In this case, you can force the update with the --force-add option. |

Examples for dataset tag

#- list tags of the mnist dataset
swcli dataset tag mnist

#- add tags for the mnist dataset
swcli dataset tag mnist t1 t2
swcli dataset tag cloud://cloud.starwhale.cn/project/public:starwhale/dataset/mnist/version/latest t1 --force-add
swcli dataset tag mnist t1 --quiet

#- remove tags for the mnist dataset
swcli dataset tag mnist -r t1 t2
swcli dataset tag cloud://cloud.starwhale.cn/project/public:starwhale/dataset/mnist --remove t1