Annotate data for developers

While data is the primary information or raw facts that are collected and stored, metadata is the supporting information that provides context and meaning to that data.

LaminDB lets you annotate data with metadata in two ways: features and labels. (Also see the tutorial.)

This guide extends Quickstart to explain the details of annotating data.

Setup

Let us create an instance that has bionty mounted:

!lamin init --storage ./test-annotate --schema bionty

import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

💡 connected lamindb: testuser1/test-annotate

bt.settings.auto_save_parents = False  # ignores ontological hierarchy
ln.settings.verbosity = "info"
ln.settings.transform.stem_uid = "sU0y1kF3igep"
ln.settings.transform.version = "0"

Register an artifact

Let’s use the same example data as in the Quickstart:

df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]},
    index=["sample1", "sample2", "sample3"],
)

In addition to the data, we have two types of metadata:

# observational metadata (1:1 correspondence with samples)
obs_meta = pd.DataFrame(
    {
        "cell_type": ["T cell", "T cell", "Monocyte"],
        "tissue": ["capillary blood", "arterial blood", "capillary blood"],
    },
    index=["sample1", "sample2", "sample3"],
)

# external metadata (describes the entire artifact)
external_meta = {
    "organism": "human",
    "assay": "scRNA-seq",
    "experiment": "EXP0001",
    "project": "PRJ0001",
}

To store both data and observational metadata, we use an AnnData object:

# note that we didn't add external metadata to adata.uns, because we will use LaminDB to store it
adata = ad.AnnData(df, obs=obs_meta)
adata

AnnData object with n_obs × n_vars = 3 × 3
    obs: 'cell_type', 'tissue'
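The difference between the two kinds of metadata can be made concrete without any LaminDB calls. The following is a minimal plain-Python sketch: observational metadata carries one value per sample, while external metadata carries a single value for the whole artifact. The helper `is_observational` is hypothetical and exists only for this illustration.

```python
# Plain-Python sketch (no LaminDB involved).
# `is_observational` is a hypothetical helper, not part of any library.
samples = ["sample1", "sample2", "sample3"]

obs_meta = {
    "cell_type": ["T cell", "T cell", "Monocyte"],
    "tissue": ["capillary blood", "arterial blood", "capillary blood"],
}
external_meta = {
    "organism": "human",
    "assay": "scRNA-seq",
    "experiment": "EXP0001",
    "project": "PRJ0001",
}


def is_observational(values, n_samples):
    """Return True if there is exactly one value per sample."""
    return isinstance(values, list) and len(values) == n_samples


# Observational metadata: every column has one value per sample.
obs_ok = all(is_observational(v, len(samples)) for v in obs_meta.values())
# External metadata: every key has a single value for the entire artifact.
ext_ok = not any(is_observational(v, len(samples)) for v in external_meta.values())

print(obs_ok, ext_ok)  # True True
```

This is why the observational metadata fits into the obs table of the AnnData object, while the external metadata has no per-sample slot and is linked to the artifact instead.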

Now let’s register the AnnData object without annotating with any metadata:

ln.track()

artifact = ln.Artifact.from_anndata(adata, description="my RNA-seq")
artifact.save()

We don’t see any metadata in the registered artifact yet:

artifact.describe()

Artifact(uid='jcRswCI9gzVM7nAmtiKx', description='my RNA-seq', suffix='.h5ad', accessor='AnnData', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-05-25 15:27:08 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamindb/lamindb/docs/test-annotate'
    .transform = 'Annotate data for developers'
    .run = '2024-05-25 15:27:07 UTC'

Define features and labels

Features and labels are records from their respective registries.

You can define them schema-less using the Feature and ULabel registries, or schema-full using dedicated registries.

Define data features

Data features refer to individual measurable properties or characteristics of a phenomenon being observed. In data analysis and machine learning, features are the input variables used to predict or classify an outcome.

Data features are often numeric, but can also be categorical. For example, in the case of gene expression data, the features are the expression levels of individual genes. They are often stored as columns in a data table (adata.var_names for AnnData objects).

Here we define them using the Gene registry:

data_features = bt.Gene.from_values(
    adata.var_names,
    organism="human",  # or set globally: bt.settings.organism = "human"
)
ln.save(data_features)
data_features

✅ loaded 2 Gene records matching symbol: 'CD8A', 'CD4'

✅ created 1 Gene record from Bionty matching symbol: 'CD14'

[Gene(uid='1j4At3x7akJU', symbol='CD4', ensembl_gene_id='ENSG00000010610', ncbi_gene_ids='920', biotype='protein_coding', description='CD4 molecule ', synonyms='T4|LEU-3', created_by_id=1, organism_id=1, public_source_id=11, updated_at='2024-05-25 15:26:34 UTC'),
 Gene(uid='6Aqvc8ckDYeN', symbol='CD8A', ensembl_gene_id='ENSG00000153563', ncbi_gene_ids='925', biotype='protein_coding', description='CD8 subunit alpha ', synonyms='P32|CD8|CD8ALPHA', created_by_id=1, organism_id=1, public_source_id=11, updated_at='2024-05-25 15:26:34 UTC'),
 Gene(uid='3bhNYquOnA4s', symbol='CD14', ensembl_gene_id='ENSG00000170458', ncbi_gene_ids='929', biotype='protein_coding', description='CD14 molecule ', synonyms='', created_by_id=1, run_id=1, organism_id=1, public_source_id=11, updated_at='2024-05-25 15:27:10 UTC')]

Define metadata features

Metadata features refer to descriptive or contextual information about the data. They don’t directly describe the content of the data but rather its characteristics.

In this example, the metadata features are “cell_type” and “tissue”, which describe observations (stored in adata.obs.columns), and “organism”, “assay”, “experiment”, and “project”, which describe the entire artifact.

Here we define them using the Feature registry:

# obs metadata features
obs_meta_features = ln.Feature.from_df(adata.obs)
ln.save(obs_meta_features)
obs_meta_features

RecordsList([Feature(uid='31g4zQBqlQBx', name='cell_type', dtype='cat[bionty.CellType]', created_by_id=1, updated_at='2024-05-25 15:26:40 UTC'),
             Feature(uid='hQ1KKcE3fKOa', name='tissue', dtype='cat', created_by_id=1, run_id=1, updated_at='2024-05-25 15:27:10 UTC')])

# external metadata features
external_meta_features = [
    ln.Feature(name=name, dtype="cat") for name in external_meta.keys()
]
ln.save(external_meta_features)
external_meta_features

❗ record with similar name exists! did you mean to load it?

	uid	name	dtype	unit	description	synonyms	run_id	created_by_id	updated_at
id
2	TDAxDheK8TQ3	assay_ontology_id	cat[bionty.ExperimentalFactor]	None	None	None	None	1	2024-05-25 15:26:41.014395+00:00

❗ record with similar name exists! did you mean to load it?

	uid	name	dtype	unit	description	synonyms	run_id	created_by_id	updated_at
id
2	TDAxDheK8TQ3	assay_ontology_id	cat[bionty.ExperimentalFactor]	None	None	None	None	1	2024-05-25 15:26:41.014395+00:00

[Feature(uid='UggBRZTqttuf', name='organism', dtype='cat', created_by_id=1, run_id=1, updated_at='2024-05-25 15:27:10 UTC'),
 Feature(uid='M9WryvOydo3q', name='assay', dtype='cat', created_by_id=1, run_id=1, updated_at='2024-05-25 15:27:10 UTC'),
 Feature(uid='94J1DVw8sKcg', name='experiment', dtype='cat', created_by_id=1, run_id=1, updated_at='2024-05-25 15:27:10 UTC'),
 Feature(uid='IwBzh1yjVj4j', name='project', dtype='cat', created_by_id=1, run_id=1, updated_at='2024-05-25 15:27:10 UTC')]

Define metadata labels

Metadata labels are the categorical values of metadata features. They are more specific than features and are often used in classification.

In this example, the metadata labels of feature “cell_type” are “T cell” and “Monocyte”; the metadata labels of feature “tissue” are “capillary blood” and “arterial blood”; the metadata label of feature “organism” is “human”; and so on.
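To make the relationship between features and labels explicit, here is a minimal plain-Python sketch that derives the labels of each feature as its unique categorical values. It does not use LaminDB; the helper `unique_in_order` is a hypothetical name introduced for illustration.

```python
# Plain-Python sketch: labels are the unique categorical values of a feature.
obs_meta = {
    "cell_type": ["T cell", "T cell", "Monocyte"],
    "tissue": ["capillary blood", "arterial blood", "capillary blood"],
}
external_meta = {
    "organism": "human",
    "assay": "scRNA-seq",
    "experiment": "EXP0001",
    "project": "PRJ0001",
}


def unique_in_order(values):
    """Drop duplicates while keeping the order of first appearance."""
    return list(dict.fromkeys(values))


# Observational features can have several labels, external features have one.
labels_by_feature = {name: unique_in_order(vals) for name, vals in obs_meta.items()}
labels_by_feature.update({name: [val] for name, val in external_meta.items()})

for feature, labels in labels_by_feature.items():
    print(f"{feature}: {labels}")
```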

Let’s define them with their respective registries:

cell_types = bt.CellType.from_values(adata.obs["cell_type"])
ln.save(cell_types)
cell_types

✅ created 1 CellType record from Bionty matching name: 'T cell'

✅ created 1 CellType record from Bionty matching synonyms: 'Monocyte'

[CellType(uid='22LvKd01', name='T cell', ontology_id='CL:0000084', synonyms='T-lymphocyte|T lymphocyte|T-cell', description='A Type Of Lymphocyte Whose Defining Characteristic Is The Expression Of A T Cell Receptor Complex.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-05-25 15:27:11 UTC'),
 CellType(uid='28V22coI', name='monocyte', ontology_id='CL:0000576', description='Myeloid Mononuclear Recirculating Leukocyte That Can Act As A Precursor Of Tissue Macrophages, Osteoclasts And Some Populations Of Tissue Dendritic Cells.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-05-25 15:27:11 UTC')]

tissues = bt.Tissue.from_values(adata.obs["tissue"])
ln.save(tissues)
tissues

✅ created 2 Tissue records from Bionty matching name: 'capillary blood', 'arterial blood'

[Tissue(uid='7gWJkhPG', name='capillary blood', ontology_id='UBERON:0013757', synonyms='blood in capillary|portion of capillary blood|portion of blood in capillary', description='A Blood That Is Part Of A Capillary.', created_by_id=1, run_id=1, public_source_id=34, updated_at='2024-05-25 15:27:12 UTC'),
 Tissue(uid='3O0QD2cL', name='arterial blood', ontology_id='UBERON:0013755', synonyms='blood in artery|arterial blood|portion of arterial blood', description='A Blood That Is Part Of A Artery.', created_by_id=1, run_id=1, public_source_id=34, updated_at='2024-05-25 15:27:12 UTC')]

organism = bt.Organism.from_public(name=external_meta["organism"])
organism.save()
organism

Organism(uid='1dpCL6Td', name='human', ontology_id='NCBITaxon:9606', scientific_name='homo_sapiens', created_by_id=1, public_source_id=1, updated_at='2024-05-25 15:27:12 UTC')

assay = bt.ExperimentalFactor.from_public(name=external_meta["assay"])
assay.save()
assay

✅ loaded 1 ExperimentalFactor record matching synonyms: 'scRNA-seq'

ExperimentalFactor(uid='4WYv9kl0', name='single-cell RNA sequencing', ontology_id='EFO:0008913', synonyms='single-cell RNA-seq|scRNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing', description='A Protocol That Provides The Expression Profiles Of Single Cells Via The Isolation And Barcoding Of Single Cells And Their Rna, Reverse Transcription, Amplification, Library Generation And Sequencing.', molecule='RNA assay', instrument='single cell sequencing', created_by_id=1, public_source_id=51, updated_at='2024-05-25 15:27:12 UTC')
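The log line above reports that 'scRNA-seq' was matched through a synonym rather than the record's name. As a rough illustration of that idea, the sketch below resolves a value first against names and then against pipe-separated synonyms, using the name and synonym strings shown in the outputs above. The function `resolve` is hypothetical, and Bionty's actual matching logic may differ (for example in how it handles letter case).

```python
# Toy records copied from the outputs above; `resolve` is a hypothetical helper
# and not how Bionty implements matching.
records = [
    {
        "name": "single-cell RNA sequencing",
        "synonyms": "single-cell RNA-seq|scRNA-seq|single cell RNA sequencing"
        "|single-cell transcriptome sequencing",
    },
    {"name": "T cell", "synonyms": "T-lymphocyte|T lymphocyte|T-cell"},
]


def resolve(value, records):
    """Return (canonical name, matched field), or (None, None) if nothing matches."""
    # Exact name matches take precedence over synonym matches.
    for record in records:
        if record["name"] == value:
            return record["name"], "name"
    for record in records:
        if value in record["synonyms"].split("|"):
            return record["name"], "synonyms"
    return None, None


print(resolve("scRNA-seq", records))     # ('single-cell RNA sequencing', 'synonyms')
print(resolve("T cell", records))        # ('T cell', 'name')
print(resolve("bulk RNA-seq", records))  # (None, None)
```

Whichever way a value is matched, the saved label carries the canonical name, which is why later outputs show 'single-cell RNA sequencing' rather than 'scRNA-seq'.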

experiment = ln.ULabel(name=external_meta["experiment"], description="An experiment")
experiment.save()
experiment

ULabel(uid='8HvfXgYn', name='EXP0001', description='An experiment', created_by_id=1, run_id=1, updated_at='2024-05-25 15:27:12 UTC')

project = ln.ULabel(name=external_meta["project"], description="A project")
project.save()
project

ULabel(uid='UMCN6PGz', name='PRJ0001', description='A project', created_by_id=1, run_id=1, updated_at='2024-05-25 15:27:12 UTC')

Annotate with features

Non-external features are annotated when registering artifacts using the .from_df or .from_anndata methods:

(See the “Annotate with labels stratified by metadata features” section below for adding external features.)

artifact = ln.Artifact.from_anndata(adata, description="my RNA-seq")
artifact.save()

artifact.features.add_from_anndata(var_field=bt.Gene.symbol, organism="human")

💡 parsing feature names of X stored in slot 'var'

✅    3 terms (100.00%) are validated for symbol

✅    linked: FeatureSet(uid='u9xAhMoGe7XZNXDbn31b', n=3, dtype='int', registry='bionty.Gene', hash='f2UVeHefaZxXFjmUwo9O', created_by_id=1, run_id=1)

💡 parsing feature names of slot 'obs'

✅    2 terms (100.00%) are validated for name

✅    linked: FeatureSet(uid='SiiBViNMVo9JsjHoC2kK', n=2, registry='Feature', hash='JU8gQ7nDk7KW67NmsQqB', created_by_id=1, run_id=1)

✅ saved 2 feature sets for slots: 'var','obs'
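The validation messages above report the share of feature names found in the corresponding registry. A minimal plain-Python sketch of that bookkeeping follows; `validate` and the name "NOT_A_GENE" are hypothetical and only illustrate the idea, not LaminDB's implementation.

```python
# Plain-Python sketch of validating names against a registry.
# `validate` is a hypothetical helper; "NOT_A_GENE" is a made-up name.
registered_symbols = {"CD4", "CD8A", "CD14"}  # gene symbols saved earlier
var_names = ["CD8A", "CD4", "CD14"]  # corresponds to adata.var_names


def validate(names, registry):
    """Split names into those found in the registry and those missing."""
    validated = [name for name in names if name in registry]
    missing = [name for name in names if name not in registry]
    return validated, missing


validated, missing = validate(var_names, registered_symbols)
fraction = len(validated) / len(var_names)
print(f"{len(validated)} terms ({fraction:.2%}) are validated for symbol")

# A name that was never registered ends up in the second list.
_, not_found = validate(var_names + ["NOT_A_GENE"], registered_symbols)
print(not_found)  # ['NOT_A_GENE']
```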

This artifact is now annotated with features:

artifact.describe()

Artifact(uid='jcRswCI9gzVM7nAmtiKx', description='my RNA-seq', suffix='.h5ad', accessor='AnnData', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-05-25 15:27:12 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamindb/lamindb/docs/test-annotate'
    .transform = 'Annotate data for developers'
    .run = '2024-05-25 15:27:07 UTC'

Two types of features are now annotated and organized as feature sets by slot:

“var”: data features
“obs”: observational metadata features

artifact.features

Use slots to retrieve corresponding annotated features:

artifact.features["var"].df()

	uid	symbol	stable_id	ensembl_gene_id	ncbi_gene_ids	biotype	description	synonyms	organism_id	public_source_id	run_id	created_by_id	updated_at
id
5	1j4At3x7akJU	CD4	None	ENSG00000010610	920	protein_coding	CD4 molecule	T4\|LEU-3	1	11	NaN	1	2024-05-25 15:26:34.800890+00:00
6	6Aqvc8ckDYeN	CD8A	None	ENSG00000153563	925	protein_coding	CD8 subunit alpha	P32\|CD8\|CD8ALPHA	1	11	NaN	1	2024-05-25 15:26:34.801066+00:00
7	3bhNYquOnA4s	CD14	None	ENSG00000170458	929	protein_coding	CD14 molecule		1	11	1.0	1	2024-05-25 15:27:10.673467+00:00

artifact.features["obs"].df()

	uid	name	dtype	unit	description	synonyms	run_id	created_by_id	updated_at
id
1	31g4zQBqlQBx	cell_type	cat[bionty.CellType]	None	None	None	NaN	1	2024-05-25 15:26:40.999517+00:00
4	hQ1KKcE3fKOa	tissue	cat	None	None	None	1.0	1	2024-05-25 15:27:10.706385+00:00

Annotate with labels

If you simply want to tag an artifact with some descriptive labels, you can pass them to .labels.add. For example, let’s add the experiment label “EXP0001” and project label “PRJ0001” to the artifact:

artifact.labels.add(experiment)
artifact.labels.add(project)

The artifact is now annotated with the labels ‘EXP0001’ and ‘PRJ0001’:

artifact.describe()

Artifact(uid='jcRswCI9gzVM7nAmtiKx', description='my RNA-seq', suffix='.h5ad', accessor='AnnData', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-05-25 15:27:12 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamindb/lamindb/docs/test-annotate'
    .transform = 'Annotate data for developers'
    .run = '2024-05-25 15:27:07 UTC'
  Labels
    .ulabels = 'EXP0001', 'PRJ0001'

To view all annotated labels:

artifact.labels

  Labels
    .ulabels = 'EXP0001', 'PRJ0001'

Since we didn’t specify which features the labels belong to, they are accessible only through “.ulabels”, the default accessor of the ULabel registry.

As you may have noticed, labels that belong to the same registry can be difficult to interpret without features:

artifact.ulabels.df()

	uid	name	description	reference	reference_type	run_id	created_by_id	updated_at
id
5	8HvfXgYn	EXP0001	An experiment	None	None	1	1	2024-05-25 15:27:12.371823+00:00
6	UMCN6PGz	PRJ0001	A project	None	None	1	1	2024-05-25 15:27:12.386165+00:00

Annotate with labels stratified by metadata features

For labels associated with metadata features, you can pass the feature argument to .labels.add to stratify them by feature. (Another way to stratify labels is through ontological hierarchy, which is covered in the Quickstart.)

Let’s add the experiment label “EXP0001” and project label “PRJ0001” to the artifact again, this time specifying their features:

# an auto-complete object of registered features
features = ln.Feature.lookup()

artifact.labels.add(experiment, feature=features.experiment)
artifact.labels.add(project, feature=features.project)

The labels are now stratified by two features, which describe() lists under “Features”:

artifact.describe()

Artifact(uid='jcRswCI9gzVM7nAmtiKx', description='my RNA-seq', suffix='.h5ad', accessor='AnnData', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-05-25 15:27:12 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamindb/lamindb/docs/test-annotate'
    .transform = 'Annotate data for developers'
    .run = '2024-05-25 15:27:07 UTC'
  Labels
    .ulabels = 'EXP0001', 'PRJ0001'
  Features
    'experiment' = 'EXP0001'
    'project' = 'PRJ0001'
  Feature sets
    'var' = 'CD4', 'CD8A', 'CD14'
    'obs' = 'cell_type', 'tissue'

With feature-stratified labels, you can retrieve labels by feature:

artifact.labels.get(features.experiment).df()

	uid	name	description	reference	reference_type	run_id	created_by_id	updated_at
id
5	8HvfXgYn	EXP0001	An experiment	None	None	1	1	2024-05-25 15:27:12.371823+00:00

Note that feature-stratified labels can also be retrieved with the default accessor of their respective registry:

artifact.labels.add(assay, feature=features.assay)

# access labels directly via default accessor "experimental_factors"
artifact.experimental_factors.df()

	uid	name	ontology_id	abbr	synonyms	description	molecule	instrument	measurement	public_source_id	run_id	created_by_id	updated_at
id
1	4WYv9kl0	single-cell RNA sequencing	EFO:0008913	None	single-cell RNA-seq\|scRNA-seq\|single cell RNA ...	A Protocol That Provides The Expression Profil...	RNA assay	single cell sequencing	None	51	None	1	2024-05-25 15:27:12.356465+00:00

# access labels via feature
artifact.labels.get(features.assay).df()

	uid	name	ontology_id	abbr	synonyms	description	molecule	instrument	measurement	public_source_id	run_id	created_by_id	updated_at
id
1	4WYv9kl0	single-cell RNA sequencing	EFO:0008913	None	single-cell RNA-seq\|scRNA-seq\|single cell RNA ...	A Protocol That Provides The Expression Profil...	RNA assay	single cell sequencing	None	51	None	1	2024-05-25 15:27:12.356465+00:00
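One way to picture why both access paths return the same record is to think of each label as indexed twice: once by its registry's default accessor and, if a feature is given, once by that feature. The class below is a toy model under that assumption; `LabelIndex` is a hypothetical name and this is not LaminDB's implementation.

```python
from collections import defaultdict


class LabelIndex:
    """Toy model (not LaminDB's implementation) of labels linked to an artifact."""

    def __init__(self):
        self.by_accessor = defaultdict(list)  # e.g. "ulabels" -> labels
        self.by_feature = defaultdict(list)  # e.g. "experiment" -> labels

    def add(self, label, accessor, feature=None):
        # Every label is reachable through its registry accessor.
        if label not in self.by_accessor[accessor]:
            self.by_accessor[accessor].append(label)
        # Only labels added with a feature are reachable through that feature.
        if feature is not None and label not in self.by_feature[feature]:
            self.by_feature[feature].append(label)


index = LabelIndex()

# Labels added without a feature: only the registry accessor knows them.
index.add("EXP0001", accessor="ulabels")
index.add("PRJ0001", accessor="ulabels")
print(dict(index.by_feature))  # {}

# Adding them again with a feature stratifies them without duplicating them.
index.add("EXP0001", accessor="ulabels", feature="experiment")
index.add("PRJ0001", accessor="ulabels", feature="project")
index.add("single-cell RNA sequencing", accessor="experimental_factors", feature="assay")

print(index.by_accessor["ulabels"])  # ['EXP0001', 'PRJ0001']
print(index.by_feature["experiment"])  # ['EXP0001']
print(index.by_accessor["experimental_factors"] == index.by_feature["assay"])  # True
```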

Let’s annotate the remaining labels:

# labels of obs metadata features
artifact.labels.add(cell_types, feature=features.cell_type)
artifact.labels.add(tissues, feature=features.tissue)

# labels of external metadata features
artifact.labels.add(organism, feature=features.organism)

Now you’ve annotated your artifact with all features and labels:

artifact.describe()

Artifact(uid='jcRswCI9gzVM7nAmtiKx', description='my RNA-seq', suffix='.h5ad', accessor='AnnData', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-05-25 15:27:12 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamindb/lamindb/docs/test-annotate'
    .transform = 'Annotate data for developers'
    .run = '2024-05-25 15:27:07 UTC'
  Labels
    .organisms = 'human'
    .tissues = 'capillary blood', 'arterial blood'
    .cell_types = 'T cell', 'monocyte'
    .experimental_factors = 'single-cell RNA sequencing'
    .ulabels = 'EXP0001', 'PRJ0001'
  Features
    'experiment' = 'EXP0001'
    'project' = 'PRJ0001'
    'organism' = 'human'
    'tissue' = 'capillary blood', 'arterial blood'
    'cell_type' = 'T cell', 'monocyte'
    'assay' = 'single-cell RNA sequencing'
  Feature sets
    'var' = 'CD4', 'CD8A', 'CD14'
    'obs' = 'cell_type', 'tissue'
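As a final sanity check on this summary, the sketch below copies the “Features” and “Feature sets” sections of the output into plain dictionaries and separates observational features from external ones. The dictionaries are transcribed by hand for illustration and are not returned by any LaminDB call.

```python
# Hand-transcribed from the describe() output above (illustration only).
features_with_labels = {
    "experiment": ["EXP0001"],
    "project": ["PRJ0001"],
    "organism": ["human"],
    "tissue": ["capillary blood", "arterial blood"],
    "cell_type": ["T cell", "monocyte"],
    "assay": ["single-cell RNA sequencing"],
}
feature_sets = {
    "var": ["CD4", "CD8A", "CD14"],
    "obs": ["cell_type", "tissue"],
}

obs_features = set(feature_sets["obs"])

# Every observational feature should have received labels.
unlabeled_obs = [f for f in feature_sets["obs"] if f not in features_with_labels]
# Features with labels that are not in the obs slot describe the whole artifact.
external_features = [f for f in features_with_labels if f not in obs_features]

print(unlabeled_obs)  # []
print(external_features)  # ['experiment', 'project', 'organism', 'assay']
```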