Tutorial: Features & labels¶

In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.
Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.
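
To make the usability questions concrete, here is a minimal plain-Python sketch of what such validation amounts to: checking feature names against a registry and values against expected types. The `REGISTERED_FEATURES` dictionary and the `validate_record` function are hypothetical illustrations and not part of the LaminDB API.

```python
# Hypothetical registry of features: name -> expected Python type.
# The names mirror this tutorial's data; the registry itself is illustrative.
REGISTERED_FEATURES = {
    "species": str,
    "scientist": str,
    "sepal_length": float,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, value in record.items():
        if name not in REGISTERED_FEATURES:
            # catches typos in feature names
            problems.append(f"unknown feature: {name!r}")
        elif not isinstance(value, REGISTERED_FEATURES[name]):
            # catches inconsistent types
            expected = REGISTERED_FEATURES[name].__name__
            actual = type(value).__name__
            problems.append(f"{name!r} should be {expected}, got {actual}")
    return problems


good = {"species": "setosa", "sepal_length": 0.051}
bad = {"speceis": "setosa", "sepal_length": "0.051"}

print(validate_record(good))
print(validate_record(bad))
```

The first record passes without problems, while the second is flagged for a misspelled feature name and a measurement stored as a string.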

Hint

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you just want to validate and annotate a dataset with features and labels, see this guide: Validate, standardize & annotate.

import lamindb as ln
import pandas as pd

ln.settings.verbosity = "hint"

Re-cap¶

Let’s briefly re-cap what we learned in Introduction.

Annotate by labels¶

We started with simple labeling:

# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images").one()
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()

Annotate by features & labels¶

If you later want to feed labels into learning algorithms alongside measured features in artifacts, associate a label with a feature:

feature = ln.Feature(name="study", dtype="cat").save()
artifact.labels.add(study0, feature)
artifact.describe()

Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-05-25 15:25:11 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = 's3://lamindata'
    .transform = 'Tutorial: Artifacts'
    .run = '2024-05-25 15:25:09 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering'
  Features
    'study' = 'Study 0: initial plant gathering'
  Feature sets

Annotate based on data¶

Often, data that you want to ingest comes with metadata.

Here, three metadata features were collected: species, scientist, and instrument.

df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()

	species	file_name	scientist	instrument
0	setosa	iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce...	Barbara McClintock	Leica IIIc Camera
1	versicolor	iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710...	Edgar Anderson	Leica IIIc Camera
2	versicolor	iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf...	Edgar Anderson	Leica IIIc Camera
3	setosa	iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109...	Edgar Anderson	Leica IIIc Camera
4	virginica	iris-bdae8314e4385d8e2322abd8e63a82758a9063c77...	Edgar Anderson	Leica IIIc Camera

There are only a few values for features species, scientist & instrument, and we’d like to label the artifact with these values:

df.nunique()

species        3
file_name     50
scientist      2
instrument     1
dtype: int64

ln.Feature(name='species', dtype='cat[ULabel]').save()
ln.Feature(name='scientist', dtype='cat[ULabel]').save()
ln.Feature(name='instrument', dtype='cat[ULabel]').save()
species = ln.ULabel.from_values(df['species'].unique(), create=True)
ln.save(species)
authors = ln.ULabel.from_values(df['scientist'].unique(), create=True)
ln.save(authors)
instruments = ln.ULabel.from_values(df['instrument'].unique(), create=True)
ln.save(instruments)

artifact.features.add({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique()})
artifact.describe()

Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-05-25 15:25:11 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = 's3://lamindata'
    .transform = 'Tutorial: Artifacts'
    .run = '2024-05-25 15:25:09 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica', 'Barbara McClintock', 'Edgar Anderson', 'Leica IIIc Camera'
  Features
    'study' = 'Study 0: initial plant gathering'
    'species' = 'setosa', 'versicolor', 'virginica'
    'scientist' = 'Barbara McClintock', 'Edgar Anderson'
    'instrument' = 'Leica IIIc Camera'
  Feature sets
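
The reasoning behind annotating with species, scientist & instrument (few distinct values compared to the number of rows) can be sketched in plain Python. The rows, the `label_candidates` helper, and the `max_fraction` threshold below are assumptions made for illustration; the file names are placeholders, not the real ones.

```python
from collections import defaultdict

# Hypothetical rows mimicking the metadata table; file names are placeholders.
rows = [
    {"species": "setosa", "file_name": "iris-a.jpg",
     "scientist": "Barbara McClintock", "instrument": "Leica IIIc Camera"},
    {"species": "versicolor", "file_name": "iris-b.jpg",
     "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "versicolor", "file_name": "iris-c.jpg",
     "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "setosa", "file_name": "iris-d.jpg",
     "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
]


def label_candidates(rows, max_fraction=0.5):
    """Map each low-cardinality column to its sorted distinct values.

    A column qualifies if its number of distinct values is at most
    max_fraction times the number of rows.
    """
    distinct = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            distinct[column].add(value)
    return {
        column: sorted(values)
        for column, values in distinct.items()
        if len(values) <= max_fraction * len(rows)
    }


candidates = label_candidates(rows)
print(candidates)
```

In this toy table, `file_name` is excluded because every row has its own value, which makes it an identifier rather than a useful label.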

Register metadata¶

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").

In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
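
The relationship between a categorical variable and its categories can be sketched with two small dataclasses. `Label` and `CategoricalFeature` are hypothetical stand-ins for the concepts, not LaminDB's `ULabel` and `Feature` classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Label:
    """A simple category, e.g. one species name."""
    name: str


@dataclass
class CategoricalFeature:
    """A categorical variable that draws its values from a set of labels."""
    name: str
    categories: set = field(default_factory=set)

    def validate(self, value: str) -> bool:
        # a value is valid only if it matches one of the known categories
        return Label(value) in self.categories


species_feature = CategoricalFeature(
    name="species",
    categories={Label("setosa"), Label("versicolor"), Label("virginica")},
)

print(species_feature.validate("setosa"))
print(species_feature.validate("setsoa"))
```

A correctly spelled species validates, whereas the misspelled "setsoa" does not, because it is not among the feature's categories.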

Register labels¶

We study 3 species of the Iris plant: setosa, versicolor & virginica. Above, we created one ULabel for each of them.

ULabel lets you manage an in-house ontology of all kinds of generic labels.

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:

is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
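
Conceptually, grouping labels under a parent label gives you a small hierarchy that you can query by parent. The sketch below models this with a plain dictionary; the `parents` mapping and the `children_of` helper are hypothetical and do not reflect how LaminDB stores these relations.

```python
# Hypothetical mapping: label name -> set of parent label names.
parents = {
    "setosa": {"is_species"},
    "versicolor": {"is_species"},
    "virginica": {"is_species"},
    "Leica IIIc Camera": set(),  # not a species, so no "is_species" parent
    "is_species": set(),
}


def children_of(parent_name: str) -> list[str]:
    """Return the sorted names of all labels that have the given parent."""
    return sorted(
        name for name, parent_names in parents.items()
        if parent_name in parent_names
    )


print(children_of("is_species"))
```

Selecting by parent returns only the three species labels, even though the registry also holds unrelated labels such as the instrument.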

Query artifacts by labels¶

Using the new annotations, you can now query image artifacts by species & study labels:

ln.ULabel.df()

	uid	name	description	reference	reference_type	run_id	created_by_id	updated_at
id
8	8wNI5rzT	is_species	None	None	None	None	1	2024-05-25 15:25:17.309042+00:00
7	aBsceABa	Leica IIIc Camera	None	None	None	None	1	2024-05-25 15:25:17.242712+00:00
6	OlZVZdCu	Edgar Anderson	None	None	None	None	1	2024-05-25 15:25:17.236027+00:00
5	0GngiLHD	Barbara McClintock	None	None	None	None	1	2024-05-25 15:25:17.235847+00:00
4	vhcaHq6t	virginica	None	None	None	None	1	2024-05-25 15:25:17.225996+00:00
3	beMnnjag	versicolor	None	None	None	None	1	2024-05-25 15:25:17.225881+00:00
2	I1AgTcUN	setosa	None	None	None	None	1	2024-05-25 15:25:17.225753+00:00
1	Y9aXwzva	Study 0: initial plant gathering	My initial study	None	None	None	1	2024-05-25 15:25:16.636113+00:00

ulabels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels=ulabels.study_0_initial_plant_gathering).one()

Run an ML model¶

Let’s now run a mock ML model that transforms the images into 4 high-level features.

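def run_ml_model() -> pd.DataFrame:
    # download the image directory to the local cache (unused by this mock model)
    image_file_dir = artifact.cache()
    # a real model would process the images; here we load a prepared example dataframe
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data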

transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
run = ln.track(transform=transform)
df = run_ml_model()

The output is a dataframe:

df.head()

	sepal_length	sepal_width	petal_length	petal_width	iris_organism_name
0	0.051	0.035	0.014	0.002	setosa
1	0.049	0.030	0.014	0.002	setosa
2	0.047	0.032	0.013	0.002	setosa
3	0.046	0.031	0.015	0.002	setosa
4	0.050	0.036	0.014	0.002	setosa

And this is the pipeline that produced the dataframe:

run

Run(uid='kmvSyTcMz2U1vsMREcuQ', started_at='2024-05-25 15:25:17 UTC', is_consecutive=True, transform_id=2, created_by_id=1)

run.transform.view_parents()

Register the output data¶

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()

There is one categorical feature; let’s add the species labels:

features = ln.Feature.lookup()

species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)

species_labels

<QuerySet [ULabel(uid='I1AgTcUN', name='setosa', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC'), ULabel(uid='beMnnjag', name='versicolor', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC'), ULabel(uid='vhcaHq6t', name='virginica', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC')]>

Let’s now add study labels:

artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)

This is the context for our artifact:

artifact.describe()
artifact.view_lineage()

See the database content:

ln.view(registries=["Feature", "ULabel"])

Feature

	uid	name	dtype	unit	description	synonyms	run_id	created_by_id	updated_at
id
9	4jVn4NnWg6EU	iris_organism_name	cat	None	None	None	2.0	1	2024-05-25 15:25:18.855320+00:00
8	T59xcXhqxKK2	petal_width	float	None	None	None	2.0	1	2024-05-25 15:25:18.855191+00:00
7	c3Qosp4q87dK	petal_length	float	None	None	None	2.0	1	2024-05-25 15:25:18.855064+00:00
6	ZnOGfpp1U7KM	sepal_width	float	None	None	None	2.0	1	2024-05-25 15:25:18.854935+00:00
5	Fubsv2wb7dch	sepal_length	float	None	None	None	2.0	1	2024-05-25 15:25:18.854783+00:00
4	y0HgQADmBhUU	instrument	cat[ULabel]	None	None	None	NaN	1	2024-05-25 15:25:17.213669+00:00
3	Tb3WxoMmElyt	scientist	cat[ULabel]	None	None	None	NaN	1	2024-05-25 15:25:17.208371+00:00

ULabel

	uid	name	description	reference	reference_type	run_id	created_by_id	updated_at
id
8	8wNI5rzT	is_species	None	None	None	None	1	2024-05-25 15:25:17.309042+00:00
7	aBsceABa	Leica IIIc Camera	None	None	None	None	1	2024-05-25 15:25:17.242712+00:00
6	OlZVZdCu	Edgar Anderson	None	None	None	None	1	2024-05-25 15:25:17.236027+00:00
5	0GngiLHD	Barbara McClintock	None	None	None	None	1	2024-05-25 15:25:17.235847+00:00
4	vhcaHq6t	virginica	None	None	None	None	1	2024-05-25 15:25:17.225996+00:00
3	beMnnjag	versicolor	None	None	None	None	1	2024-05-25 15:25:17.225881+00:00
2	I1AgTcUN	setosa	None	None	None	None	1	2024-05-25 15:25:17.225753+00:00

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix¶

Manage metadata¶

Avoid duplicates¶

Let’s create a label "project1":

ln.ULabel(name="project1").save()

We just created a project1 label. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell “project 1” with a space:

ln.ULabel(name="project 1")

❗ record with similar name exists! did you mean to load it?

	uid	name	description	reference	reference_type	run_id	created_by_id	updated_at
id
9	knUtfN5O	project1	None	None	None	2	1	2024-05-25 15:25:19.048926+00:00

ULabel(uid='ryGxtw3X', name='project 1', created_by_id=1, run_id=2)

To avoid inserting duplicates, LaminDB searches for similar existing records whenever you create a new one.

You can switch it off for performance gains via upon_create_search_names.
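
The behavior described above (load on exact match, warn on similar names) can be approximated with the standard library's `difflib`. This is a sketch under assumptions: the `LabelRegistry` class, its `threshold` parameter, and the similarity measure are hypothetical and are not LaminDB's actual search implementation.

```python
from difflib import SequenceMatcher


class LabelRegistry:
    """Toy in-memory registry that flags near-duplicate names."""

    def __init__(self, threshold: float = 0.8):
        self.names: list[str] = []
        self.threshold = threshold  # assumed similarity cutoff

    def similar(self, name: str) -> list[str]:
        """Return existing names whose similarity ratio reaches the threshold."""
        return [
            existing for existing in self.names
            if SequenceMatcher(None, name.lower(), existing.lower()).ratio()
            >= self.threshold
        ]

    def create(self, name: str) -> tuple[bool, list[str]]:
        """Return (created, similar_names).

        An exact match is loaded instead of created; otherwise the name is
        added and any similar existing names are reported as a warning.
        """
        if name in self.names:
            return False, []
        matches = self.similar(name)
        self.names.append(name)
        return True, matches


registry = LabelRegistry()
first = registry.create("project1")
again = registry.create("project1")
spaced = registry.create("project 1")
print(first, again, spaced)
```

Comparing each new name against all existing ones costs time proportional to the size of the registry, which illustrates why switching the search off can improve performance.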

Update & delete records¶

label = ln.ULabel.filter(name="project1").first()
label

label.name = "project1a"
label.save()
label

label.delete()

Manage storage¶

Change default storage¶

The default storage location is:

ln.settings.storage

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations¶

ln.Storage.df()

	uid	root	description	type	region	instance_uid	run_id	created_by_id	updated_at
id
2	YmV3ZoHv	s3://lamindata	None	s3	us-east-1	4XIuR0tvaiXM	None	1	2024-05-25 15:25:11.392320+00:00
1	OVvaz1Qn9PV0	/home/runner/work/lamindb/lamindb/docs/lamin-t...	None	local	None	5WuFt3cW4zRx	None	1	2024-05-25 15:25:07.483727+00:00