Tutorial: Features & labels¶
In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:
Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.
Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.
Hint
This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.
If you’re just looking to readily validate and annotate a dataset with features and labels, see this guide: Validate, standardize & annotate.
import lamindb as ln
import pandas as pd
ln.settings.verbosity = "hint"
💡 connected lamindb: anonymous/lamin-tutorial
Re-cap¶
Let’s briefly re-cap what we learned in Introduction.
Annotate by labels¶
We started with simple labeling:
# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images").one()
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()
Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-05-25 15:25:11 UTC')
Provenance
.created_by = 'anonymous'
.storage = 's3://lamindata'
.transform = 'Tutorial: Artifacts'
.run = '2024-05-25 15:25:09 UTC'
Labels
.ulabels = 'Study 0: initial plant gathering'
Annotate by features & labels¶
If you later want to feed labels into learning algorithms alongside measured features in artifacts, associate a label with a feature:
feature = ln.Feature(name="study", dtype="cat").save()
artifact.labels.add(study0, feature)
artifact.describe()
Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-05-25 15:25:11 UTC')
Provenance
.created_by = 'anonymous'
.storage = 's3://lamindata'
.transform = 'Tutorial: Artifacts'
.run = '2024-05-25 15:25:09 UTC'
Labels
.ulabels = 'Study 0: initial plant gathering'
Features
'study' = 'Study 0: initial plant gathering'
Feature sets
Annotate based on data¶
Often, data that you want to ingest comes with metadata.
Here, three metadata features were collected: species, scientist, and instrument.
df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()
 | species | file_name | scientist | instrument |
---|---|---|---|---|
0 | setosa | iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce... | Barbara McClintock | Leica IIIc Camera |
1 | versicolor | iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710... | Edgar Anderson | Leica IIIc Camera |
2 | versicolor | iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf... | Edgar Anderson | Leica IIIc Camera |
3 | setosa | iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109... | Edgar Anderson | Leica IIIc Camera |
4 | virginica | iris-bdae8314e4385d8e2322abd8e63a82758a9063c77... | Edgar Anderson | Leica IIIc Camera |
There are only a few values for the features species, scientist & instrument, and we’d like to label the artifact with these values:
df.nunique()
species 3
file_name 50
scientist 2
instrument 1
dtype: int64
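To make the reasoning behind the counts above concrete, here is a minimal plain-Python sketch that counts distinct values per column and flags columns with few distinct values relative to the number of rows. The sample rows (with placeholder file names), the helper names, and the 0.8 ratio threshold are assumptions for illustration and are not part of LaminDB.

```python
from collections import defaultdict


def count_unique(rows):
    """Return the number of distinct values per column."""
    seen = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            seen[column].add(value)
    return {column: len(values) for column, values in seen.items()}


def candidate_label_columns(rows, max_ratio=0.8):
    """Return columns whose distinct-value count is small relative to the row count."""
    n_rows = len(rows)
    return [
        column
        for column, n_unique in count_unique(rows).items()
        if n_unique / n_rows <= max_ratio
    ]


# Hypothetical rows mirroring the columns of meta.csv; file names are placeholders.
rows = [
    {"species": "setosa", "file_name": "image_0.jpg", "scientist": "Barbara McClintock", "instrument": "Leica IIIc Camera"},
    {"species": "versicolor", "file_name": "image_1.jpg", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "versicolor", "file_name": "image_2.jpg", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "setosa", "file_name": "image_3.jpg", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "virginica", "file_name": "image_4.jpg", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
]

unique_counts = count_unique(rows)
label_columns = candidate_label_columns(rows)
print(unique_counts)
print(label_columns)
```

A column such as file_name, where every row has its own value, behaves like an identifier and is left out, while the three repeated-value columns remain as label candidates.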
ln.Feature(name='species', dtype='cat[ULabel]').save()
ln.Feature(name='scientist', dtype='cat[ULabel]').save()
ln.Feature(name='instrument', dtype='cat[ULabel]').save()
species = ln.ULabel.from_values(df['species'].unique(), create=True)
ln.save(species)
authors = ln.ULabel.from_values(df['scientist'].unique(), create=True)
ln.save(authors)
instruments = ln.ULabel.from_values(df['instrument'].unique(), create=True)
ln.save(instruments)
artifact.features.add({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique()})
artifact.describe()
Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-05-25 15:25:11 UTC')
Provenance
.created_by = 'anonymous'
.storage = 's3://lamindata'
.transform = 'Tutorial: Artifacts'
.run = '2024-05-25 15:25:09 UTC'
Labels
.ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica', 'Barbara McClintock', 'Edgar Anderson', 'Leica IIIc Camera'
Features
'study' = 'Study 0: initial plant gathering'
'species' = 'setosa', 'versicolor', 'virginica'
'scientist' = 'Barbara McClintock', 'Edgar Anderson'
'instrument' = 'Leica IIIc Camera'
Feature sets
Register metadata¶
Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.
Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").
In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
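The relationship between a categorical feature and its labels can be sketched in a few lines of plain Python. The `CategoricalFeature` class below is a hypothetical stand-in for illustration only; it is not LaminDB's `Feature` registry.

```python
from dataclasses import dataclass, field


@dataclass
class CategoricalFeature:
    """A measurement dimension whose values are drawn from a set of labels."""

    name: str
    categories: set = field(default_factory=set)

    def validate(self, values):
        """Split values into known categories and unknown values."""
        valid = [value for value in values if value in self.categories]
        invalid = [value for value in values if value not in self.categories]
        return valid, invalid


species_feature = CategoricalFeature("species", {"setosa", "versicolor", "virginica"})

# "versicolour" is a deliberate typo to show what validation catches.
valid, invalid = species_feature.validate(["setosa", "virginica", "versicolour"])
print(valid, invalid)
```

Keeping the set of allowed categories next to the feature is what makes typos in sampled labels detectable.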
Register labels¶
We study 3 species of the Iris plant: setosa, versicolor & virginica. Let’s create 3 labels with ULabel.
ULabel enables you to maintain an in-house ontology for all kinds of generic labels.
What are alternatives to ULabel?
In a complex project, you’ll likely want dedicated typed registries for selected label types, e.g., Gene, Tissue, etc. See: Manage biological registries.
ULabel, however, will get you quite far and scale to ~1M labels.
Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:
is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
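Conceptually, the parent-child relation set above behaves like a small tree of labels. The following plain-Python sketch mimics it with a dictionary; the `children` mapping and both helper functions are hypothetical and only illustrate the idea, not how LaminDB stores these relations.

```python
# Hypothetical in-memory stand-in for parent -> children relations between labels.
children = {
    "is_species": ["setosa", "versicolor", "virginica"],
}


def labels_under(parent, children):
    """Return all labels below `parent`, depth first."""
    found = []
    for child in children.get(parent, []):
        found.append(child)
        found.extend(labels_under(child, children))
    return found


def parents_of(label, children):
    """Return all direct parents of `label`."""
    return [parent for parent, kids in children.items() if label in kids]


species_labels = labels_under("is_species", children)
setosa_parents = parents_of("setosa", children)
print(species_labels)
print(setosa_parents)
```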
Query artifacts by labels¶
Using the new annotations, you can now query image artifacts by species & study labels:
ln.ULabel.df()
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|
8 | 8wNI5rzT | is_species | None | None | None | None | 1 | 2024-05-25 15:25:17.309042+00:00 |
7 | aBsceABa | Leica IIIc Camera | None | None | None | None | 1 | 2024-05-25 15:25:17.242712+00:00 |
6 | OlZVZdCu | Edgar Anderson | None | None | None | None | 1 | 2024-05-25 15:25:17.236027+00:00 |
5 | 0GngiLHD | Barbara McClintock | None | None | None | None | 1 | 2024-05-25 15:25:17.235847+00:00 |
4 | vhcaHq6t | virginica | None | None | None | None | 1 | 2024-05-25 15:25:17.225996+00:00 |
3 | beMnnjag | versicolor | None | None | None | None | 1 | 2024-05-25 15:25:17.225881+00:00 |
2 | I1AgTcUN | setosa | None | None | None | None | 1 | 2024-05-25 15:25:17.225753+00:00 |
1 | Y9aXwzva | Study 0: initial plant gathering | My initial study | None | None | None | 1 | 2024-05-25 15:25:16.636113+00:00 |
ulabels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels=ulabels.study_0_initial_plant_gathering).one()
Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, created_by_id=1, storage_id=2, transform_id=1, run_id=1, updated_at='2024-05-25 15:25:11 UTC')
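As a mental model for label-based queries, one can picture an index from labels to the artifacts annotated with them. The sketch below builds such an index in plain Python; the `annotations` mapping, its second artifact key, and the helper functions are assumptions for illustration and do not reflect LaminDB's database schema.

```python
from collections import defaultdict

# Hypothetical annotations: artifact key -> labels attached to it.
annotations = {
    "iris_studies/study0_raw_images": [
        "Study 0: initial plant gathering",
        "setosa",
        "versicolor",
        "virginica",
    ],
    # The second key is made up to show a query that returns several hits.
    "iris_studies/study1_measurements": ["setosa", "versicolor", "virginica"],
}


def build_label_index(annotations):
    """Invert the annotations into a label -> set of artifact keys mapping."""
    index = defaultdict(set)
    for key, labels in annotations.items():
        for label in labels:
            index[label].add(key)
    return index


def query(index, *labels):
    """Return the sorted keys of artifacts annotated with all given labels."""
    hits = [index.get(label, set()) for label in labels]
    return sorted(set.intersection(*hits)) if hits else []


index = build_label_index(annotations)
print(query(index, "Study 0: initial plant gathering"))
print(query(index, "setosa", "virginica"))
```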
Run an ML model¶
Let’s now run a mock ML model that transforms the images into 4 high-level features.
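def run_ml_model() -> pd.DataFrame:
    # cache the input images locally; the mock model below doesn't actually read them
    image_file_dir = artifact.cache()
    # return a precomputed example dataframe in place of real model output
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data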
transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
run = ln.track(transform=transform)
df = run_ml_model()
💡 saved: Transform(uid='d2HPjRAWPTOx', name='Petal & sepal regressor', type='pipeline', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC')
💡 saved: Run(uid='kmvSyTcMz2U1vsMREcuQ', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_kmvSyTcMz2U1vsMREcuQ.txt
💡 adding artifact ids [1] as inputs for run 2, adding parent transform 1
The output is a dataframe:
df.head()
 | sepal_length | sepal_width | petal_length | petal_width | iris_organism_name |
---|---|---|---|---|---|
0 | 0.051 | 0.035 | 0.014 | 0.002 | setosa |
1 | 0.049 | 0.030 | 0.014 | 0.002 | setosa |
2 | 0.047 | 0.032 | 0.013 | 0.002 | setosa |
3 | 0.046 | 0.031 | 0.015 | 0.002 | setosa |
4 | 0.050 | 0.036 | 0.014 | 0.002 | setosa |
And this is the pipeline that produced the dataframe:
run
Run(uid='kmvSyTcMz2U1vsMREcuQ', started_at='2024-05-25 15:25:17 UTC', is_consecutive=True, transform_id=2, created_by_id=1)
run.transform.view_parents()
Register the output data¶
Let’s first register the features of the transformed data:
new_features = ln.Feature.from_df(df)
ln.save(new_features)
How to track units of features?
Use the unit field of Feature. In the above example, you’d do:
for feature in new_features:
    if feature.dtype == "float":
        feature.unit = "m"  # SI symbol for meter
        feature.save()
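Recording a unit pays off once measurements from different sources need to be compared. As a minimal sketch, assuming a small hand-written table of conversion factors (the `TO_METERS` mapping and the `to_meters` helper are hypothetical and not part of LaminDB), values can be normalized to meters like this:

```python
# Conversion factors to meters; the supported units are an assumption of this sketch.
TO_METERS = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def to_meters(value, unit):
    """Convert a length given in `unit` to meters."""
    if unit not in TO_METERS:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_METERS[unit]


# A sepal length recorded as 5.1 cm corresponds to roughly 0.051 m.
sepal_length_m = to_meters(5.1, "cm")
print(round(sepal_length_m, 3))
```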
We can now validate & register the dataframe in one line:
artifact = ln.Artifact.from_df(
df,
description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/e4YjbPTTSJXGBWOQWzMU.parquet')
✅ storing artifact 'e4YjbPTTSJXGBWOQWzMU' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/e4YjbPTTSJXGBWOQWzMU.parquet'
Artifact(uid='e4YjbPTTSJXGBWOQWzMU', description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', accessor='DataFrame', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-05-25 15:25:18 UTC')
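To illustrate what validating column names against registered features can look like, here is a plain-Python sketch that reports unregistered columns and suggests the closest registered name. It uses the standard library's `difflib` and is an assumption-laden illustration, not a description of how LaminDB validates internally; the `validate_columns` helper and the misspelled input columns are made up.

```python
import difflib

# Feature names registered from the dataframe above.
registered_features = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "iris_organism_name",
]


def validate_columns(columns, registered):
    """Map each column to True if registered, else to a list with the closest registered name."""
    report = {}
    for column in columns:
        if column in registered:
            report[column] = True
        else:
            # An empty list means no registered name is similar enough.
            report[column] = difflib.get_close_matches(column, registered, n=1)
    return report


# "petal_lenght" is a deliberate typo; "image_id" is an unrelated, unregistered column.
report = validate_columns(["sepal_length", "petal_lenght", "image_id"], registered_features)
print(report)
```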
There is one categorical feature; let’s add the species labels:
features = ln.Feature.lookup()
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)
species_labels
<QuerySet [ULabel(uid='I1AgTcUN', name='setosa', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC'), ULabel(uid='beMnnjag', name='versicolor', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC'), ULabel(uid='vhcaHq6t', name='virginica', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC')]>
Let’s now add study labels:
artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)
This is the context for our artifact:
artifact.describe()
artifact.view_lineage()
Artifact(uid='e4YjbPTTSJXGBWOQWzMU', description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', accessor='DataFrame', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-05-25 15:25:18 UTC')
Provenance
.created_by = 'anonymous'
.storage = '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial'
.transform = 'Petal & sepal regressor'
.run = '2024-05-25 15:25:17 UTC'
Labels
.ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica'
Features
'species' = 'setosa', 'versicolor', 'virginica'
'study' = 'Study 0: initial plant gathering'
Feature sets
See the database content:
ln.view(registries=["Feature", "ULabel"])
Feature
id | uid | name | dtype | unit | description | synonyms | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|
9 | 4jVn4NnWg6EU | iris_organism_name | cat | None | None | None | 2.0 | 1 | 2024-05-25 15:25:18.855320+00:00 |
8 | T59xcXhqxKK2 | petal_width | float | None | None | None | 2.0 | 1 | 2024-05-25 15:25:18.855191+00:00 |
7 | c3Qosp4q87dK | petal_length | float | None | None | None | 2.0 | 1 | 2024-05-25 15:25:18.855064+00:00 |
6 | ZnOGfpp1U7KM | sepal_width | float | None | None | None | 2.0 | 1 | 2024-05-25 15:25:18.854935+00:00 |
5 | Fubsv2wb7dch | sepal_length | float | None | None | None | 2.0 | 1 | 2024-05-25 15:25:18.854783+00:00 |
4 | y0HgQADmBhUU | instrument | cat[ULabel] | None | None | None | NaN | 1 | 2024-05-25 15:25:17.213669+00:00 |
3 | Tb3WxoMmElyt | scientist | cat[ULabel] | None | None | None | NaN | 1 | 2024-05-25 15:25:17.208371+00:00 |
ULabel
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|
8 | 8wNI5rzT | is_species | None | None | None | None | 1 | 2024-05-25 15:25:17.309042+00:00 |
7 | aBsceABa | Leica IIIc Camera | None | None | None | None | 1 | 2024-05-25 15:25:17.242712+00:00 |
6 | OlZVZdCu | Edgar Anderson | None | None | None | None | 1 | 2024-05-25 15:25:17.236027+00:00 |
5 | 0GngiLHD | Barbara McClintock | None | None | None | None | 1 | 2024-05-25 15:25:17.235847+00:00 |
4 | vhcaHq6t | virginica | None | None | None | None | 1 | 2024-05-25 15:25:17.225996+00:00 |
3 | beMnnjag | versicolor | None | None | None | None | 1 | 2024-05-25 15:25:17.225881+00:00 |
2 | I1AgTcUN | setosa | None | None | None | None | 1 | 2024-05-25 15:25:17.225753+00:00 |
This is it! 😅
If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.
Appendix¶
Manage metadata¶
Avoid duplicates¶
Let’s create a label "project1":
ln.ULabel(name="project1").save()
ULabel(uid='knUtfN5O', name='project1', created_by_id=1, run_id=2, updated_at='2024-05-25 15:25:19 UTC')
We already created a project1 label before; let’s see what happens if we try to create it again:
label = ln.ULabel(name="project1")
label.save()
💡 returning existing ULabel record with same name: 'project1'
ULabel(uid='knUtfN5O', name='project1', created_by_id=1, run_id=2, updated_at='2024-05-25 15:25:19 UTC')
Instead of creating a new record, LaminDB loads and returns the existing record from the database.
If there is no exact match, LaminDB will warn you upon creating a record about potential duplicates.
Say, we spell “project 1” with a white space:
ln.ULabel(name="project 1")
❗ record with similar name exists! did you mean to load it?
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|
9 | knUtfN5O | project1 | None | None | None | 2 | 1 | 2024-05-25 15:25:19.048926+00:00 |
ULabel(uid='ryGxtw3X', name='project 1', created_by_id=1, run_id=2)
To avoid inserting duplicates when creating new records, a search checks whether a similar record already exists.
You can switch it off for performance gains via upon_create_search_names.
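The idea of a similarity search on record creation can be sketched with the standard library's `difflib`. This is only an illustration under assumed choices (lower-casing, whitespace removal, and a 0.8 cutoff); it is not the search LaminDB actually runs, and `find_similar` is a hypothetical helper.

```python
import difflib


def find_similar(name, existing, cutoff=0.8):
    """Return existing names that resemble `name`, ignoring case and whitespace."""

    def normalize(text):
        return "".join(text.lower().split())

    target = normalize(name)
    return [
        candidate
        for candidate in existing
        if difflib.SequenceMatcher(None, target, normalize(candidate)).ratio() >= cutoff
    ]


existing_names = ["project1", "setosa", "versicolor"]

# "project 1" differs from "project1" only by a white space.
similar = find_similar("project 1", existing_names)
print(similar)
```

A check like this costs one comparison per existing name, which hints at why switching the search off can speed up bulk creation of records.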
Update & delete records¶
label = ln.ULabel.filter(name="project1").first()
label
ULabel(uid='knUtfN5O', name='project1', created_by_id=1, run_id=2, updated_at='2024-05-25 15:25:19 UTC')
label.name = "project1a"
label.save()
label
ULabel(uid='knUtfN5O', name='project1a', created_by_id=1, run_id=2, updated_at='2024-05-25 15:25:19 UTC')
label.delete()
(1, {'lnschema_core.ULabel': 1})
Manage storage¶
Change default storage¶
The default storage location is:
ln.settings.storage
PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')
You can change it by setting ln.settings.storage = "s3://my-bucket".
See all storage locations¶
ln.Storage.df()
id | uid | root | description | type | region | instance_uid | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|
2 | YmV3ZoHv | s3://lamindata | None | s3 | us-east-1 | 4XIuR0tvaiXM | None | 1 | 2024-05-25 15:25:11.392320+00:00 |
1 | OVvaz1Qn9PV0 | /home/runner/work/lamindb/lamindb/docs/lamin-t... | None | local | None | 5WuFt3cW4zRx | None | 1 | 2024-05-25 15:25:07.483727+00:00 |
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial
❗ calling anonymously, will miss private instances
💡 deleting instance anonymous/lamin-tutorial