Tutorial: Features & labels

In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

  1. Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.

  2. Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.
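
As a minimal sketch of what the usability checks above amount to, validation can be mimicked in plain Python without LaminDB. The registries `REGISTERED_FEATURES` and `REGISTERED_LABELS` and the function `validate` are hypothetical names chosen for illustration; they are not part of the LaminDB API.

```python
# Hypothetical in-memory registries; in LaminDB these would live in the database.
REGISTERED_FEATURES = {"species", "scientist", "instrument"}
REGISTERED_LABELS = {"species": {"setosa", "versicolor", "virginica"}}


def validate(rows: list) -> dict:
    """Collect column names and label values that are not registered."""
    unknown_features = set()
    unknown_labels = {}
    for row in rows:
        for column, value in row.items():
            if column not in REGISTERED_FEATURES:
                # likely a typo in a feature name
                unknown_features.add(column)
            elif column in REGISTERED_LABELS and value not in REGISTERED_LABELS[column]:
                # likely a typo in a label
                unknown_labels.setdefault(column, set()).add(value)
    return {"unknown_features": unknown_features, "unknown_labels": unknown_labels}


rows = [
    {"species": "setosa", "scientist": "Edgar Anderson"},
    {"species": "versicolour", "scientst": "Edgar Anderson"},  # two deliberate typos
]
report = validate(rows)
print(report)
```

The report lists the misspelled column and the misspelled species separately, which mirrors the two usability questions about feature names and labels.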

Hint

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you just want to validate and annotate a dataset with features and labels, see this guide: Validate, standardize & annotate.

import lamindb as ln
import pandas as pd

ln.settings.verbosity = "hint"
💡 connected lamindb: anonymous/lamin-tutorial

Re-cap

Let’s briefly re-cap what we learned in the Introduction.

Annotate by labels

We started with simple labeling:

# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images").one()
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()
Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-05-25 15:25:11 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = 's3://lamindata'
    .transform = 'Tutorial: Artifacts'
    .run = '2024-05-25 15:25:09 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering'

Annotate by features & labels

If you later want to feed labels into learning algorithms alongside measured features in artifacts, associate a label with a feature:

feature = ln.Feature(name="study", dtype="cat").save()
artifact.labels.add(study0, feature)
artifact.describe()
Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-05-25 15:25:11 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = 's3://lamindata'
    .transform = 'Tutorial: Artifacts'
    .run = '2024-05-25 15:25:09 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering'
  Features
    'study' = 'Study 0: initial plant gathering'
  Feature sets

Annotate based on data

Often, data that you want to ingest comes with metadata.

Here, three metadata features were collected: species, scientist, and instrument.

df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()
   species     file_name                                          scientist           instrument
0  setosa      iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce...  Barbara McClintock  Leica IIIc Camera
1  versicolor  iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710...  Edgar Anderson      Leica IIIc Camera
2  versicolor  iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf...  Edgar Anderson      Leica IIIc Camera
3  setosa      iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109...  Edgar Anderson      Leica IIIc Camera
4  virginica   iris-bdae8314e4385d8e2322abd8e63a82758a9063c77...  Edgar Anderson      Leica IIIc Camera

There are only a few values for features species, scientist & instrument, and we’d like to label the artifact with these values:

df.nunique()
species        3
file_name     50
scientist      2
instrument     1
dtype: int64
# register the three metadata columns as categorical features
ln.Feature(name='species', dtype='cat[ULabel]').save()
ln.Feature(name='scientist', dtype='cat[ULabel]').save()
ln.Feature(name='instrument', dtype='cat[ULabel]').save()
# create one label per unique value in each column
species = ln.ULabel.from_values(df['species'].unique(), create=True)
ln.save(species)
authors = ln.ULabel.from_values(df['scientist'].unique(), create=True)
ln.save(authors)
instruments = ln.ULabel.from_values(df['instrument'].unique(), create=True)
ln.save(instruments)
# annotate the artifact with the unique values of each feature
artifact.features.add(
    {
        "species": df.species.unique(),
        "scientist": df.scientist.unique(),
        "instrument": df.instrument.unique(),
    }
)
artifact.describe()
Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-05-25 15:25:11 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = 's3://lamindata'
    .transform = 'Tutorial: Artifacts'
    .run = '2024-05-25 15:25:09 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica', 'Barbara McClintock', 'Edgar Anderson', 'Leica IIIc Camera'
  Features
    'study' = 'Study 0: initial plant gathering'
    'species' = 'setosa', 'versicolor', 'virginica'
    'scientist' = 'Barbara McClintock', 'Edgar Anderson'
    'instrument' = 'Leica IIIc Camera'
  Feature sets
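
The reasoning behind picking species, scientist & instrument (few unique values) and skipping file_name (one value per row) can be sketched in plain Python. This is an illustration only: the file names, the helper names `count_unique` and `label_candidates`, and the threshold `max_unique=3` are assumptions, not values or functions from LaminDB.

```python
def count_unique(rows: list) -> dict:
    """Count distinct values per column, similar in spirit to df.nunique()."""
    uniques = {}
    for row in rows:
        for column, value in row.items():
            uniques.setdefault(column, set()).add(value)
    return {column: len(values) for column, values in uniques.items()}


def label_candidates(rows: list, max_unique: int) -> list:
    """Return columns with few distinct values, which are candidates for labels."""
    return [column for column, n in count_unique(rows).items() if n <= max_unique]


# Hypothetical rows; the file names are placeholders, not the real ones.
rows = [
    {"species": "setosa", "file_name": "image-0", "scientist": "Barbara McClintock", "instrument": "Leica IIIc Camera"},
    {"species": "versicolor", "file_name": "image-1", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "versicolor", "file_name": "image-2", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "setosa", "file_name": "image-3", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
]
counts = count_unique(rows)
candidates = label_candidates(rows, max_unique=3)
print(counts)
print(candidates)
```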

Register metadata

Features and labels are the primary ways of registering metadata related to domain knowledge in LaminDB.

Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").

In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
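
To make the distinction concrete, here is a minimal sketch in plain Python. The class `FeatureSketch` is a hypothetical stand-in written for this illustration; it is not LaminDB's `Feature` and only models the idea that a categorical feature draws its values from a set of labels while a numerical feature does not.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSketch:
    """Hypothetical stand-in for a feature: a named variable with a dtype."""

    name: str
    dtype: str  # "cat" for categorical, "float" for numerical
    categories: set = field(default_factory=set)  # labels, only used for "cat"

    def accepts(self, value) -> bool:
        """Check whether a measured value is valid for this feature."""
        if self.dtype == "cat":
            # a categorical variable draws its values from a set of categories
            return value in self.categories
        if self.dtype == "float":
            return isinstance(value, float)
        return False


species_feature = FeatureSketch("species", "cat", {"setosa", "versicolor", "virginica"})
sepal_length_feature = FeatureSketch("sepal_length", "float")

print(species_feature.accepts("setosa"))    # a registered label
print(species_feature.accepts("setossa"))   # a typo, not a registered label
print(sepal_length_feature.accepts(0.051))  # a numerical value
```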

Register labels

We study 3 species of the Iris plant: setosa, versicolor & virginica. Let’s create 3 labels with ULabel.

ULabel lets you maintain an in-house ontology of all kinds of generic labels.

What are alternatives to ULabel?

In a complex project, you’ll likely want dedicated typed registries for selected label types, e.g., Gene, Tissue, etc. See: Manage biological registries.

ULabel, however, will get you quite far and scale to ~1M labels.

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:

is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
[Figure: graph of the is_species label and its children]
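
The parent-child grouping above can be pictured with a small in-memory sketch. The mapping `children_of` and both helper functions are hypothetical and exist only for this illustration, as does the `is_scientist` parent; LaminDB stores these links in its database instead.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for parent -> children links between labels
children_of = defaultdict(set)


def set_children(parent: str, labels) -> None:
    """Replace the children of `parent` with `labels`."""
    children_of[parent] = set(labels)


def labels_with_parent(parent: str) -> list:
    """Return all labels that have `parent` as a parent, sorted by name."""
    return sorted(children_of[parent])


set_children("is_species", ["setosa", "versicolor", "virginica"])
set_children("is_scientist", ["Barbara McClintock", "Edgar Anderson"])

print(labels_with_parent("is_species"))
```

Grouping labels under a parent keeps the three species distinguishable from other labels, such as scientists, once the registry grows.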

Query artifacts by labels

Using the new annotations, you can now query image artifacts by species & study labels:

ln.ULabel.df()
id  uid       name                              description       reference  reference_type  run_id  created_by_id  updated_at
8   8wNI5rzT  is_species                        None              None       None            None    1              2024-05-25 15:25:17.309042+00:00
7   aBsceABa  Leica IIIc Camera                 None              None       None            None    1              2024-05-25 15:25:17.242712+00:00
6   OlZVZdCu  Edgar Anderson                    None              None       None            None    1              2024-05-25 15:25:17.236027+00:00
5   0GngiLHD  Barbara McClintock                None              None       None            None    1              2024-05-25 15:25:17.235847+00:00
4   vhcaHq6t  virginica                         None              None       None            None    1              2024-05-25 15:25:17.225996+00:00
3   beMnnjag  versicolor                        None              None       None            None    1              2024-05-25 15:25:17.225881+00:00
2   I1AgTcUN  setosa                            None              None       None            None    1              2024-05-25 15:25:17.225753+00:00
1   Y9aXwzva  Study 0: initial plant gathering  My initial study  None       None            None    1              2024-05-25 15:25:16.636113+00:00
ulabels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels=ulabels.study_0_initial_plant_gathering).one()
Artifact(uid='1D5EmvL1nq5sTzpEY7Vy', key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, created_by_id=1, storage_id=2, transform_id=1, run_id=1, updated_at='2024-05-25 15:25:11 UTC')

Run an ML model

Let’s now run a mock ML model that transforms the images into 4 high-level features.

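def run_ml_model() -> pd.DataFrame:
    # download the raw images to the local cache (unused by this mock model)
    image_file_dir = artifact.cache()
    # a real model would process the images; here we load a precomputed example dataframe
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data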

transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
run = ln.track(transform=transform)
df = run_ml_model()
💡 saved: Transform(uid='d2HPjRAWPTOx', name='Petal & sepal regressor', type='pipeline', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC')
💡 saved: Run(uid='kmvSyTcMz2U1vsMREcuQ', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_kmvSyTcMz2U1vsMREcuQ.txt
💡 adding artifact ids [1] as inputs for run 2, adding parent transform 1

The output is a dataframe:

df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_name
0  0.051         0.035        0.014         0.002        setosa
1  0.049         0.030        0.014         0.002        setosa
2  0.047         0.032        0.013         0.002        setosa
3  0.046         0.031        0.015         0.002        setosa
4  0.050         0.036        0.014         0.002        setosa

And this is the pipeline that produced the dataframe:

run
Run(uid='kmvSyTcMz2U1vsMREcuQ', started_at='2024-05-25 15:25:17 UTC', is_consecutive=True, transform_id=2, created_by_id=1)
run.transform.view_parents()
[Figure: graph of the transform and its parents]

Register the output data

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)
How to track units of features?

Use the unit field of Feature. In the above example, you’d do:

for feature in new_features:
    if feature.dtype == "float":
        feature.unit = "m"  # SI symbol for meter
        feature.save()
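
Recording a unit matters because values are only comparable once they share one. The following plain-Python sketch shows the kind of conversion a recorded unit makes possible; the table `FACTOR_TO_METER`, the function `to_meter`, and the centimeter measurements are hypothetical and not part of LaminDB or the tutorial dataset.

```python
# Hypothetical conversion factors from a length unit to meters
FACTOR_TO_METER = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def to_meter(values: list, unit: str) -> list:
    """Convert length measurements in `unit` to meters."""
    if unit not in FACTOR_TO_METER:
        raise ValueError(f"unknown length unit: {unit!r}")
    factor = FACTOR_TO_METER[unit]
    return [value * factor for value in values]


# Hypothetical sepal lengths recorded in centimeters by another study
sepal_length_cm = [5.1, 4.9, 4.7]
sepal_length_m = to_meter(sepal_length_cm, "cm")
print(sepal_length_m)
```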

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/e4YjbPTTSJXGBWOQWzMU.parquet')
✅ storing artifact 'e4YjbPTTSJXGBWOQWzMU' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/e4YjbPTTSJXGBWOQWzMU.parquet'
Artifact(uid='e4YjbPTTSJXGBWOQWzMU', description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', accessor='DataFrame', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-05-25 15:25:18 UTC')

There is one categorical feature, so let’s add the species labels:

features = ln.Feature.lookup()
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)
species_labels
<QuerySet [ULabel(uid='I1AgTcUN', name='setosa', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC'), ULabel(uid='beMnnjag', name='versicolor', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC'), ULabel(uid='vhcaHq6t', name='virginica', created_by_id=1, updated_at='2024-05-25 15:25:17 UTC')]>

Let’s now add study labels:

artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)

This is the context for our artifact:

artifact.describe()
artifact.view_lineage()
Artifact(uid='e4YjbPTTSJXGBWOQWzMU', description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', accessor='DataFrame', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-05-25 15:25:18 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial'
    .transform = 'Petal & sepal regressor'
    .run = '2024-05-25 15:25:17 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica'
  Features
    'species' = 'setosa', 'versicolor', 'virginica'
    'study' = 'Study 0: initial plant gathering'
  Feature sets
[Figure: lineage graph of the artifact]

See the database content:

ln.view(registries=["Feature", "ULabel"])
Feature
id  uid           name                dtype        unit  description  synonyms  run_id  created_by_id  updated_at
9   4jVn4NnWg6EU  iris_organism_name  cat          None  None         None      2.0     1              2024-05-25 15:25:18.855320+00:00
8   T59xcXhqxKK2  petal_width         float        None  None         None      2.0     1              2024-05-25 15:25:18.855191+00:00
7   c3Qosp4q87dK  petal_length        float        None  None         None      2.0     1              2024-05-25 15:25:18.855064+00:00
6   ZnOGfpp1U7KM  sepal_width         float        None  None         None      2.0     1              2024-05-25 15:25:18.854935+00:00
5   Fubsv2wb7dch  sepal_length        float        None  None         None      2.0     1              2024-05-25 15:25:18.854783+00:00
4   y0HgQADmBhUU  instrument          cat[ULabel]  None  None         None      NaN     1              2024-05-25 15:25:17.213669+00:00
3   Tb3WxoMmElyt  scientist           cat[ULabel]  None  None         None      NaN     1              2024-05-25 15:25:17.208371+00:00
ULabel
id  uid       name                description  reference  reference_type  run_id  created_by_id  updated_at
8   8wNI5rzT  is_species          None         None       None            None    1              2024-05-25 15:25:17.309042+00:00
7   aBsceABa  Leica IIIc Camera   None         None       None            None    1              2024-05-25 15:25:17.242712+00:00
6   OlZVZdCu  Edgar Anderson      None         None       None            None    1              2024-05-25 15:25:17.236027+00:00
5   0GngiLHD  Barbara McClintock  None         None       None            None    1              2024-05-25 15:25:17.235847+00:00
4   vhcaHq6t  virginica           None         None       None            None    1              2024-05-25 15:25:17.225996+00:00
3   beMnnjag  versicolor          None         None       None            None    1              2024-05-25 15:25:17.225881+00:00
2   I1AgTcUN  setosa              None         None       None            None    1              2024-05-25 15:25:17.225753+00:00

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix

Manage metadata

Avoid duplicates

Let’s create a label "project1":

ln.ULabel(name="project1").save()
ULabel(uid='knUtfN5O', name='project1', created_by_id=1, run_id=2, updated_at='2024-05-25 15:25:19 UTC')

We just created a project1 label. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()
💡 returning existing ULabel record with same name: 'project1'
ULabel(uid='knUtfN5O', name='project1', created_by_id=1, run_id=2, updated_at='2024-05-25 15:25:19 UTC')

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell “project 1” with a space:

ln.ULabel(name="project 1")
❗ record with similar name exists! did you mean to load it?
id  uid       name      description  reference  reference_type  run_id  created_by_id  updated_at
9   knUtfN5O  project1  None         None       None            2       1              2024-05-25 15:25:19.048926+00:00
ULabel(uid='ryGxtw3X', name='project 1', created_by_id=1, run_id=2)

To avoid inserting duplicates, LaminDB searches for existing records with a similar name whenever you create a new record.

You can switch this search off for performance gains via upon_create_search_names.
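
The duplicate check described above can be approximated in plain Python. This is a sketch under assumptions, not LaminDB's implementation: it uses the standard library's `difflib` as a stand-in for the similarity search, and the function `check_name` and the cutoff of 0.8 are chosen only for illustration.

```python
import difflib


def check_name(name: str, existing: list) -> tuple:
    """Classify a new name as an exact match, a near-duplicate, or new."""
    if name in existing:
        # exact match: reuse the existing record instead of creating a new one
        return "exact", [name]
    # fuzzy match: look for names that differ only slightly (assumed cutoff 0.8)
    similar = difflib.get_close_matches(name, existing, n=3, cutoff=0.8)
    if similar:
        return "similar", similar
    return "new", []


existing_names = ["project1", "setosa"]
print(check_name("project1", existing_names))
print(check_name("project 1", existing_names))
print(check_name("virginica", existing_names))
```

Running a fuzzy comparison on every new record costs time as the registry grows, which is the trade-off behind being able to switch the search off.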

Update & delete records

label = ln.ULabel.filter(name="project1").first()
label
ULabel(uid='knUtfN5O', name='project1', created_by_id=1, run_id=2, updated_at='2024-05-25 15:25:19 UTC')
label.name = "project1a"
label.save()
label
ULabel(uid='knUtfN5O', name='project1a', created_by_id=1, run_id=2, updated_at='2024-05-25 15:25:19 UTC')
label.delete()
(1, {'lnschema_core.ULabel': 1})

Manage storage

Change default storage

The default storage location is:

ln.settings.storage
PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations

ln.Storage.df()
id  uid           root                                               description  type   region     instance_uid  run_id  created_by_id  updated_at
2   YmV3ZoHv      s3://lamindata                                     None         s3     us-east-1  4XIuR0tvaiXM  None    1              2024-05-25 15:25:11.392320+00:00
1   OVvaz1Qn9PV0  /home/runner/work/lamindb/lamindb/docs/lamin-t...  None         local  None       5WuFt3cW4zRx  None    1              2024-05-25 15:25:07.483727+00:00
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial
❗ calling anonymously, will miss private instances
💡 deleting instance anonymous/lamin-tutorial