Dataset Versioning
Starwhale Dataset supports fine-grained version control that traces changes to each row and column. Its version control has the following features:
- Linear versioning. Versions form a single linear history, with no branch or merge operations. This keeps usage simple; merging branches on massive datasets is almost impossible in practice.
- Fine-grained control. The smallest change that can generate a new version is a change to a single column of a single row.
- Unique version IDs. Each new version receives a globally unique ID. Copying a dataset between instances keeps this ID unchanged, and the dataset content can be loaded by this ID.
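The three properties above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: `ToyLinearDataset`, `set_cell`, and `rows_at` are hypothetical names invented for illustration, `uuid.uuid4().hex` merely stands in for a globally unique ID, and nothing here reflects how Starwhale actually stores data.

```python
import uuid
from typing import Dict, List, Tuple

Row = Dict[str, object]


class ToyLinearDataset:
    """Hypothetical toy model, not Starwhale's storage format."""

    def __init__(self) -> None:
        # One linear chain of (version_id, changed cells); no branches.
        self._history: List[Tuple[str, Dict[str, Row]]] = []
        # Cell changes staged since the last commit.
        self._pending: Dict[str, Row] = {}

    def set_cell(self, row_key: str, column: str, value: object) -> None:
        """Stage a change to one column of one row."""
        self._pending.setdefault(row_key, {})[column] = value

    def commit(self) -> str:
        """Append the staged changes as a new version and return its ID."""
        version_id = uuid.uuid4().hex  # stand-in for a globally unique ID
        self._history.append((version_id, self._pending))
        self._pending = {}
        return version_id

    def rows_at(self, version_id: str) -> Dict[str, Row]:
        """Rebuild the rows of a version by replaying the chain up to it."""
        rows: Dict[str, Row] = {}
        for vid, delta in self._history:
            for row_key, cells in delta.items():
                rows.setdefault(row_key, {}).update(cells)
            if vid == version_id:
                return rows
        raise KeyError(f"unknown version: {version_id}")


ds = ToyLinearDataset()
ds.set_cell("train/0", "a", 1)
ds.set_cell("train/0", "b", 10)
v1 = ds.commit()
ds.set_cell("train/0", "c", 100)  # one changed cell is enough for a new version
v2 = ds.commit()

print(ds.rows_at(v1))
print(ds.rows_at(v2))
```

In this sketch the first version still reads as `{'train/0': {'a': 1, 'b': 10}}` after the second commit, while the second version additionally contains `'c': 100`. Older versions stay readable because each commit only appends to the chain.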
Generating Versions During Dataset Construction
Creating Versions Explicitly with the SDK commit Method
When constructing a dataset with the Starwhale Dataset SDK, calling the commit method after adding data produces a new version and returns its unique version ID.
from starwhale import dataset

# Create an empty dataset, add two rows, and commit the first version.
ds1 = dataset("new-ds", create="empty")
ds1["train/0"] = {"a": 1, "b": 10}
ds1["train/1"] = {"a": 2, "b": 20}
version = ds1.commit()
print(version)
ds1.close()

# Load the first version, change individual columns, and commit a second version.
ds2 = dataset(f"new-ds/version/{version}")
ds2["train/0"].features.c = 100
ds2["train/1"].features.a = -2
ds2["train/1"].features.b = -20
new_version = ds2.commit()
print(new_version)
ds2.close()

# Reopen both versions read-only and compare their rows.
ds1 = dataset(f"new-ds/version/{version}", readonly=True)
print(f"---{version}")
print(ds1["train/0"].index, ds1["train/0"].features)
print(ds1["train/1"].index, ds1["train/1"].features)
ds2 = dataset(f"new-ds/version/{new_version}", readonly=True)
print(f"---{new_version}")
print(ds2["train/0"].index, ds2["train/0"].features)
print(ds2["train/1"].index, ds2["train/1"].features)
ds1.close()
ds2.close()

Output (version IDs will differ on each run):
n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw
a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf
---n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw
train/0 {'a': 1, 'b': 10}
train/1 {'a': 2, 'b': 20}
---a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf
train/0 {'a': 1, 'b': 10, 'c': 100}
train/1 {'a': -2, 'b': -20}
swcli Command Line
The swcli dataset build command automatically generates a new version:
⯠swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl
đ§ start to build dataset bundle...
đˇ uri local/project/self/dataset/json-gec8u5sv/version/latest
đ creating dataset local/project/self/dataset/json-gec8u5sv/version/f3iz4sdljjt7rmmfd4rkiak4vkbilp5pbrdgfgom...
đĻ update 906 records into dataset
đē congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run swcli dataset info json-gec8u5sv/version/f3iz4sdljjt7
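The dataset URIs printed in the build log above follow a visible pattern: instance, project, dataset name, and version. As a rough illustration, the sketch below splits such a URI into its parts. `DatasetURI` and `parse_dataset_uri` are hypothetical helpers written for this example, not part of the Starwhale SDK, and the parser only handles the full local form that appears in the log.

```python
from typing import NamedTuple


class DatasetURI(NamedTuple):
    instance: str
    project: str
    name: str
    version: str


def parse_dataset_uri(uri: str) -> DatasetURI:
    """Split '<instance>/project/<project>/dataset/<name>/version/<version>'."""
    parts = uri.split("/")
    well_formed = (
        len(parts) == 7
        and parts[1] == "project"
        and parts[3] == "dataset"
        and parts[5] == "version"
    )
    if not well_formed:
        raise ValueError(f"unsupported dataset URI: {uri!r}")
    return DatasetURI(parts[0], parts[2], parts[4], parts[6])


# URI copied from the build log above.
built = parse_dataset_uri(
    "local/project/self/dataset/json-gec8u5sv/version/"
    "f3iz4sdljjt7rmmfd4rkiak4vkbilp5pbrdgfgom"
)
print(built.name, built.version)
```

Instance addresses that themselves contain slashes, or shortened URIs such as `new-ds/version/latest`, would need extra handling that this sketch leaves out.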
Tagging Versions
Starwhale introduces the concept of Tags. Tags can be specified during commit or when running dataset build commands, which associates the resulting dataset version with those Tags and allows the dataset to be loaded by Tag.
- Dataset version: A unique ID, similar to f3iz4sdljjt7rmmfd4rkiak4vkbilp5pbrdgfgom. The ID is unique across all Starwhale instances.
- Dataset Tag: A readable string, similar to t1, t2, or v0.3. There is a one-to-many relationship between dataset versions and Tags: each Tag identifies exactly one version, but each dataset version can have multiple Tags.
  - Manually specified Tags: The tags parameter of the commit function, or the --tag parameter of the swcli dataset build command, can be used to specify one or more Tags. When the dataset is copied to other instances, these Tags can be carried over through parameter settings.
  - Automatically generated incremental Tags: Within an instance, each commit or build generates an incremental Tag such as v0, v1, or v2. When the dataset is copied, the incremental Tags of the source instance are ignored, and new incremental Tags are generated on the destination instance.
  - latest Tag: Generated automatically. The most recent commit or build command moves the latest Tag to the version it produced.
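The Tag rules listed above can be sketched with a small in-memory model of a single instance. `ToyTagRegistry` and its methods are hypothetical and exist only for illustration. Rejecting a reused or reserved-looking Tag is a choice made in this sketch and may not match how Starwhale handles such cases.

```python
import re
from typing import Dict, List, Optional


class ToyTagRegistry:
    """Toy model of Tags in one instance: each Tag names exactly one version."""

    def __init__(self) -> None:
        self._tag_to_version: Dict[str, str] = {}
        self._next_auto = 0  # counter for the incremental v0, v1, v2, ... Tags

    def on_commit(self, version_id: str, tags: Optional[List[str]] = None) -> None:
        """Record a new version with its manual, incremental, and latest Tags."""
        for tag in tags or []:
            # Sketch-only rule: manual Tags may not look like automatic ones.
            if tag == "latest" or re.fullmatch(r"v\d+", tag):
                raise ValueError(f"tag {tag!r} is reserved in this sketch")
            if tag in self._tag_to_version:
                raise ValueError(f"tag {tag!r} already identifies a version")
            self._tag_to_version[tag] = version_id
        self._tag_to_version[f"v{self._next_auto}"] = version_id
        self._next_auto += 1
        self._tag_to_version["latest"] = version_id  # latest moves forward

    def version_of(self, tag: str) -> str:
        return self._tag_to_version[tag]

    def tags_of(self, version_id: str) -> List[str]:
        return sorted(t for t, v in self._tag_to_version.items() if v == version_id)


registry = ToyTagRegistry()
registry.on_commit("n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw", tags=["t1"])
registry.on_commit("a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf", tags=["t2", "v0.3"])

print(registry.tags_of("n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw"))
print(registry.tags_of("a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf"))
print(registry.version_of("latest"))
```

After the second commit, the first version keeps `t1` and `v0`, while `latest` now points at the second version together with `t2`, `v0.3`, and `v1`. A fresh registry on another instance would restart its incremental counter, which mirrors how incremental Tags are regenerated on the destination when a dataset is copied.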
Loading Datasets by Version
Datasets can be loaded from any location using the Dataset URI. The version field in the URI can take several forms: a unique ID, a unique ID prefix, a custom Tag, an incremental Tag, or the latest Tag.
from starwhale import dataset

# load with the latest version (the default when no version is given)
print("latest version(default):", dataset("new-ds").loading_version)
print("latest version(specified):", dataset("new-ds/version/latest").loading_version)
# load with the full version ID or a unique prefix of it
print("uuid version(full):", dataset("new-ds/version/n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw").loading_version)
print("uuid version(prefix):", dataset("new-ds/version/n7uglydp4p").loading_version)
# load with an incremental tag
print("tag version(v0):", dataset("new-ds/version/v0").loading_version)
print("tag version(v1):", dataset("new-ds/version/v1").loading_version)

Output:
latest version(default): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf
latest version(specified): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf
uuid version(full): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw
uuid version(prefix): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw
tag version(v0): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw
tag version(v1): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf
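The different forms of the version field shown above can be modelled with a short resolver. This is a sketch only: `resolve_version` is a hypothetical function, and checking Tags before ID prefixes is an assumption of this example, not a documented Starwhale rule. The IDs and Tags are the ones from the output above.

```python
from typing import Dict, List


def resolve_version(field: str, version_ids: List[str], tags: Dict[str, str]) -> str:
    """Toy resolver: map a URI version field to one full version ID."""
    field = field or "latest"  # an omitted version field means "latest"
    # Assumption of this sketch: Tags are checked before ID prefixes.
    if field in tags:
        return tags[field]
    matches = [vid for vid in version_ids if vid.startswith(field)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"no version matches {field!r}")
    raise ValueError(f"prefix {field!r} is ambiguous: {matches}")


FIRST = "n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw"
SECOND = "a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf"
version_ids = [FIRST, SECOND]
tags = {"v0": FIRST, "v1": SECOND, "latest": SECOND}

for field in ["", "latest", FIRST, "n7uglydp4p", "v0", "v1"]:
    print(repr(field), "->", resolve_version(field, version_ids, tags))
```

A prefix is accepted only when it matches exactly one version, so a prefix shared by several IDs raises an error in this sketch instead of silently picking one.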