Dataset Building
Starwhale provides a highly flexible method to build datasets, allowing you to build datasets from various file types including images, audio, video, CSV, JSON, and JSONL files. Python scripts and datasets from the Huggingface Hub can also be used for construction.
Building from Data Files
Image
Starwhale supports recursively traversing image files within directories to build a dataset without any coding:
- Supported image formats: `png`, `jpg`, `jpeg`, `webp`, `svg`, and `apng`.
- Images are converted to the `Starwhale.Image` type and can be viewed in the Starwhale Server Web page.
- Supported by the `swcli dataset build --image` command line and the `starwhale.Dataset.from_folder` Python SDK.
- Label mechanism: when the SDK sets `auto_label=True` or the command line sets `--auto-label`, the parent directory name is used as the `label`.
- Metadata mechanism: dataset columns can be expanded by placing `metadata.csv` or `metadata.jsonl` files in the root directory (see the sketch after the folder example below).
- Caption mechanism: when `{image-name}.txt` files are found in the same directory, their content is automatically imported and populated into the `caption` column.
Assuming the `folder` directory contains the following four files:

```
folder/dog/1.png
folder/cat/2.png
folder/dog/3.png
folder/cat/4.png
```
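To illustrate the metadata and caption mechanisms, the same folder could be extended as sketched below. The extra files and the `source` column are hypothetical, and keying metadata rows by `file_name` is an assumption rather than documented behavior:

```
folder/metadata.csv    # hypothetical: expands each row with extra columns
folder/cat/2.txt       # hypothetical: its text becomes the caption for cat/2.png
```

with `metadata.csv` containing, for example:

```csv
file_name,source
cat/2.png,camera-a
dog/1.png,camera-b
```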
Command line construction:
```console
❯ swcli dataset build --image folder --name image-folder
start to build dataset bundle...
uri local/project/self/dataset/image-folder/version/latest
creating dataset local/project/self/dataset/image-folder/version/uw6mdisnf7alg4t4fs2myfug4ie4636w3x4jqcu2...
update 4 records into dataset
congratulation! you can run swcli dataset info image-folder/version/uw6mdisnf7al
```

```console
❯ swcli dataset head image-folder -n 2
row ────────────────────────────────────────
id: cat/2.png
features:
  file_name : cat/2.png
  label     : cat
  file      : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
row ────────────────────────────────────────
id: cat/4.png
features:
  file_name : cat/4.png
  label     : cat
  file      : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
```
Python SDK construction:
```python
from starwhale import Dataset

ds = Dataset.from_folder("folder", kind="image")
print(ds)
print(ds.fetch_one().features)
```

```console
creating dataset local/project/self/dataset/folder/version/nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna...
update 4 records into dataset
Dataset: folder, stash version: d22hdiwyakdfh5xitcpn2s32gblfbhrerzczkb63, loading version: nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna
{'file_name': 'cat/2.png', 'label': 'cat', 'file': ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: }
```
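The `auto_label` switch described above also applies to the SDK path. A minimal sketch, assuming the keyword is passed directly to `from_folder` (the exact call is illustrative):

```python
from starwhale import Dataset

# Build the same dataset without deriving the label column
# from the parent directory names (sketch).
ds = Dataset.from_folder("folder", kind="image", auto_label=False)
```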
Video
Starwhale supports recursively traversing video files within directories to build a dataset without any coding:

- Supported video formats: `mp4`, `webm`, and `avi`.
- Videos are converted to the `Starwhale.Video` type and can be viewed in the Starwhale Server Web page.
- Supported by the `swcli dataset build --video` command line and the `starwhale.Dataset.from_folder` Python SDK.
- Label, caption, and metadata mechanisms are the same as for images.
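A brief sketch of the SDK path, assuming `kind="video"` mirrors the image example (the directory name is illustrative):

```python
from starwhale import Dataset

# Recursively collect mp4/webm/avi files under the directory (sketch).
ds = Dataset.from_folder("video-folder", kind="video")
```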
Audio
Starwhale supports recursively traversing audio files within directories to build a dataset without any coding:

- Supported audio formats: `mp3` and `wav`.
- Audio files are converted to the `Starwhale.Audio` type and can be viewed in the Starwhale Server Web page.
- Supported by the `swcli dataset build --audio` command line and the `starwhale.Dataset.from_folder` Python SDK.
- Label, caption, and metadata mechanisms are the same as for images.
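Likewise for audio, assuming `kind="audio"` (the directory name is illustrative):

```python
from starwhale import Dataset

# Recursively collect mp3/wav files under the directory (sketch).
ds = Dataset.from_folder("audio-folder", kind="audio")
```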
CSV Files
The command line or Python SDK can directly convert local or remote CSV files into Starwhale datasets:

- Supports one or more local CSV files.
- Supports recursively finding CSV files in a local directory.
- Supports one or more remote CSV files specified by HTTP URLs.
Command line construction:
```console
❯ swcli dataset build --name product-desc-modelscope --csv https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo\?Revision\=master\&FilePath\=test.csv --encoding=utf-8-sig
start to build dataset bundle...
uri local/project/self/dataset/product-desc-modelscope/version/latest
creating dataset local/project/self/dataset/product-desc-modelscope/version/wzaz4ccodpyj4jelgupljreyida2bleg5xp7viwe...
update 3848 records into dataset
congratulation! dataset build from csv files(('https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo?Revision=master&FilePath=test.csv',)) has been built. You can run swcli dataset info product-desc-modelscope/version/wzaz4ccodpyj
```
Python SDK construction:
```python
from starwhale import Dataset

ds = Dataset.from_csv(path="http://example.com/data.csv", name="my-csv-dataset")
```
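Local sources work the same way. A sketch of a command-line build from a local file and a directory; the paths are illustrative, and repeating `--csv` for multiple sources is an assumption based on the "one or more" support listed above:

```bash
swcli dataset build --name local-csv-dataset --csv data/train.csv --csv data/more-csvs/
```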
JSON/JSONL Files
The command line or Python SDK can directly convert local or remote JSON/JSONL files into Starwhale datasets:

- Supports one or more local JSON/JSONL files.
- Supports recursively finding JSON/JSONL files in a local directory.
- Supports one or more remote JSON/JSONL files specified by HTTP URLs.

For JSON files:

- By default, the parsed JSON object is assumed to be a list, and each object in the list is a dict that maps to one row in the Starwhale dataset.
- The `--field-selector` or `field_selector` parameter can be used to locate a specific list.
For example, for the JSON file:

```json
{
  "p1": {
    "p2": {
      "p3": [
        {"a": 1, "b": 2},
        {"a": 10, "b": 20}
      ]
    }
  }
}
```
Set `--field-selector=p1.p2.p3` to add exactly those two rows of data to the dataset.
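A sketch of the corresponding command; the file name is illustrative, and combining `--field-selector` with `--json` in this way is assumed from the parameters described above:

```bash
swcli dataset build --json data.json --field-selector p1.p2.p3 --name nested-json
```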
Command line construction:
```console
❯ swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl
start to build dataset bundle...
uri local/project/self/dataset/json-b0o2zsvg/version/latest
creating dataset local/project/self/dataset/json-b0o2zsvg/version/q3uoziwqligxdggncqywpund75jz55h3bne6a5la...
update 906 records into dataset
congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run swcli dataset info json-b0o2zsvg/version/q3uoziwqligx
```
Python SDK construction:
```python
from starwhale import Dataset

myds = Dataset.from_json(
    name="translation",
    text='{"content": {"child_content": [{"en":"hello","zh-cn":"你好"},{"en":"how are you","zh-cn":"最近怎么样"}]}}',
    field_selector="content.child_content",
)
print(myds[0].features["zh-cn"])
```

```console
creating dataset local/project/self/dataset/translation/version/kblfn5zh4cpoqxqbhgdfbvonulr2zefp6lojq44y...
update 2 records into dataset
你好
```
Building from Huggingface Hub
There are numerous datasets available on the Huggingface Hub, which can be converted into a Starwhale dataset with a single command or a single line of code.

Huggingface dataset conversion relies on the `datasets` library.
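If it is not already present, the dependency can be installed with pip:

```bash
pip install datasets
```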
Command line:
```bash
swcli dataset build -hf lambdalabs/pokemon-blip-captions --name pokemon
```
Python SDK:
```python
from starwhale import Dataset

# Only the expected Starwhale dataset name and the Huggingface repo name are needed.
# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
print(ds)
print(len(ds))
print(repr(ds.fetch_one()))
```
Building from Python SDK scripts
The Starwhale Dataset SDK provides a Python dict-like way to add or update data, enabling the creation and update of local or remote datasets.

Starwhale defines two attributes for each row of data: `key` and `features`.

- `key`: int or str type. A dataset uses only one `key` type. The `key` indicates the unique index of that row of data.
- `features`: dict type. Starwhale Dataset adopts a schema-free design, so the `features` structure of each row can be different. `features` data supports Python constant types like int, float, and str; Starwhale types like Image, Video, Audio, Text, and Binary; and Python compound types like list, tuple, and dict.
Dataset Initialization
To create, update, or load a dataset, you need to get a `Starwhale.Dataset` object, usually in the following ways:
```python
from starwhale import dataset

# Create a dataset named new-test in the standalone instance. If it exists, raise an exception.
local_ds = dataset("new-test", create="empty")
print(local_ds)
print(len(local_ds))

# If the mnist64 dataset does not exist, create one; otherwise, load the existing dataset.
remote_ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64", create="auto")
print(remote_ds)
print(len(remote_ds))

# Load the existing dataset named mnist64; if it does not exist, an error is raised.
existed_ds = dataset("mnist64", create="forbid")
print(existed_ds)
print(len(existed_ds))
```
```console
Dataset: new-test, stash version: y4touw3btifhkd4f2gg4x3qvydgnfmvoztqqm5cf, loading version: y4touw3btifhkd4f2gg4x3qvydgnfmvoztqqm5cf
0
Dataset: mnist64, stash version: 4z5wpbpozsxlelma3j6soeatekufymnyxdeihoqo, loading version: vs3gnaauakidjcc5effevaoh63vivu7dzodo5cmc
500
Dataset: mnist64, stash version: 3ahtfbizw63myxcz34ebd72lhgc25dualcmtznts, loading version: lwhvvixpimlsghfs2xqmtgrwti4yn2z5nevz7hth
500
```
Adding and Updating Dataset Elements
After adding data, calling `commit` will generate a new version that can then be used to access the dataset.
The `append` Method
The Dataset provides the `append` function, which automatically adds `features` to a new row in the dataset when called.
```python
from starwhale import dataset

ds = dataset("new-test", create="empty")

# The key is an auto-increment index. Here the key is zero.
ds.append({"a": 0, "b": 0})

# Keys can also be declared explicitly, but they must be consistent with the key types of the other rows.
# When data is added as a list or tuple, the first element (index 0) is the key for that row,
# and the second element (index 1) contains the corresponding features.
ds.append((1, {"a": 1, "b": 1}))

ds.commit()
```
The `__setitem__` Method
The Dataset's `__setitem__` method provides a dict-like way to add data by index.

```python
ds[2] = {"a": 2, "b": 2}
ds.commit()
```
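After committing, rows can be read back through the same dict-like interface. A brief sketch using calls already shown on this page:

```python
from starwhale import dataset

# Reload the committed dataset and inspect a row by its key (sketch).
ds = dataset("new-test", create="forbid")
print(len(ds))         # 3
print(ds[2].features)  # {'a': 2, 'b': 2}
```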
Building from Python Handler
The `swcli` command line supports reading functions from Python files as input to build datasets. The return value of the function needs to be iterable.

Example Python script `dataset.py`:
```python
def iter_item():
    for i in range(100):
        # Only yield features; the key is the auto-increment index.
        yield {"a": i, "b": i}


def iter_item_with_key():
    for i in range(100):
        # Yield key + features.
        yield i, {"a": i, "b": i}
```
Build the datasets by triggering the handlers through the `swcli` command line:

```bash
swcli dataset build --handler dataset:iter_item --name test1
swcli dataset build --handler dataset:iter_item_with_key --name test2
```
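Once built, the results can be inspected with the commands already shown earlier on this page:

```bash
swcli dataset head test1 -n 2
swcli dataset info test2
```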