Validate, standardize & annotate¶
We’ll walk you through the following flow:
1. define validation criteria
2. validate & standardize metadata
3. save validated & annotated artifacts
How do we validate metadata?
Registries in your database define the “truth” for metadata.
For instance, if “Experiment 1” has been registered as the name of a ULabel record, it is a validated value for field ULabel.name.
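To make the idea concrete, here is a minimal, library-free sketch of validating values against a registry. The registered_names set and the split_validated helper are hypothetical stand-ins for a registry field such as ULabel.name; they are not part of lamindb.

```python
def split_validated(values, registered_names):
    """Partition values into those found in the registry and those that are not."""
    validated = [v for v in values if v in registered_names]
    non_validated = [v for v in values if v not in registered_names]
    return validated, non_validated


# hypothetical registry content standing in for ULabel.name
registered_names = {"Experiment 1"}

validated, non_validated = split_validated(
    ["Experiment 1", "Experiment 2"], registered_names
)
print(validated)      # ['Experiment 1']
print(non_validated)  # ['Experiment 2']
```

A value counts as validated only if it matches a registered value exactly, which is why the flow below includes a standardization step.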
!lamin init --storage ./test-annotate --schema bionty
💡 connected lamindb: testuser1/test-annotate
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad
ln.settings.verbosity = "hint"
💡 connected lamindb: testuser1/test-annotate
Let’s start with a DataFrame that we’d like to validate:
df = pd.DataFrame({
    "temperature": [37.2, 36.3, 38.2],
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df
| | temperature | cell_type | assay_ontology_id | donor |
|---|---|---|---|---|
| 0 | 37.2 | cerebral pyramidal neuron | EFO:0008913 | D0001 |
| 1 | 36.3 | astrocyte | EFO:0008913 | D0002 |
| 2 | 38.2 | oligodendrocyte | EFO:0008913 | DOOO3 |
Validate and standardize metadata¶
# define validation criteria for the categoricals
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}
# create an object to guide validation and annotation
annotate = ln.Annotate.from_df(df, categoricals=categoricals)
# validate
validated = annotate.validate()
validated
✅ added 3 records with Feature.name for columns: 'cell_type', 'assay_ontology_id', 'donor'
❗ 1 non-validated categories are not saved in Feature.name: ['temperature']!
→ to lookup categories, use lookup().columns
→ to save, run add_new_from_columns
💡 mapping cell_type on CellType.name
❗ found 2 terms validated terms: ['astrocyte', 'oligodendrocyte']
→ save terms via .add_validated_from('cell_type')
❗ 1 terms is not validated: 'cerebral pyramidal neuron'
→ save terms via .add_new_from('cell_type')
💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id
❗ found 1 terms validated terms: ['EFO:0008913']
→ save terms via .add_validated_from('assay_ontology_id')
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
💡 mapping donor on ULabel.name
❗ 3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
→ save terms via .add_new_from('donor')
False
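The final False reflects that at least one column still contains non-validated values. The sketch below illustrates that aggregation under stated assumptions: validate_columns is a hypothetical helper, and the registry sets are hand-written to mirror the log above. It is an illustration of the logic, not lamindb's implementation.

```python
def validate_columns(columns, registries):
    """Return (overall, report); overall is True only if every column passes."""
    report = {}
    for name, values in columns.items():
        # values that have no counterpart in the column's registry
        report[name] = sorted(set(values) - registries[name])
    overall = all(len(missing) == 0 for missing in report.values())
    return overall, report


columns = {
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
}
# hypothetical registry contents: known cell types and assay, no donors yet
registries = {
    "cell_type": {"astrocyte", "oligodendrocyte"},
    "assay_ontology_id": {"EFO:0008913"},
    "donor": set(),
}

overall, report = validate_columns(columns, registries)
print(overall)              # False
print(report["cell_type"])  # ['cerebral pyramidal neuron']
```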
Validate using registries in another instance¶
Sometimes you want to validate against registries that already exist in another instance, for example cellxgene.
This lets you transfer values that are missing from your own registries directly from that instance.
annotate = ln.Annotate.from_df(
    df,
    categoricals=categoricals,
    using="laminlabs/cellxgene",  # pass the instance slug
)
annotate.validate()
❗ 1 non-validated categories are not saved in Feature.name: ['temperature']!
→ to lookup categories, use lookup().columns
→ to save, run add_new_from_columns
💡 mapping cell_type on CellType.name
❗ found 2 terms validated terms: ['astrocyte', 'oligodendrocyte']
→ save terms via .add_validated_from('cell_type')
❗ 1 terms is not validated: 'cerebral pyramidal neuron'
→ save terms via .add_new_from('cell_type')
💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id
❗ found 1 terms validated terms: ['EFO:0008913']
→ save terms via .add_validated_from('assay_ontology_id')
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
💡 mapping donor on ULabel.name
❗ 3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
→ save terms via .add_new_from('donor')
False
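Conceptually, transferring from another instance means copying over only those values that are absent locally but present in the other registry. A minimal sketch of that idea follows, assuming a hypothetical transfer_missing helper and plain Python sets in place of real registries:

```python
def transfer_missing(values, local, remote):
    """Copy values that are absent locally but present remotely into local."""
    # dict.fromkeys de-duplicates while preserving the input order
    transferred = [
        v for v in dict.fromkeys(values) if v not in local and v in remote
    ]
    local.update(transferred)
    return transferred


local_cell_types = set()  # our own instance starts out empty
remote_cell_types = {  # hypothetical excerpt of the other instance's registry
    "astrocyte",
    "oligodendrocyte",
    "cerebral cortex pyramidal neuron",
}

added = transfer_missing(
    ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    local_cell_types,
    remote_cell_types,
)
print(added)  # ['astrocyte', 'oligodendrocyte']
```

The misspelled term is found in neither registry, so it stays non-validated until it is corrected.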
Register new metadata labels¶
Our current database instance is empty. Once you have populated its registries, you will only rarely need to save new labels. You’ll mostly use your lamindb instance to validate incoming data and annotate it.
annotate.add_validated_from(df.cell_type.name)
❗ 1 non-validated categories are not saved in CellType.name: ['cerebral pyramidal neuron']!
→ to lookup categories, use lookup().cell_type
→ to save, run .add_new_from('cell_type')
✅ added 2 records from laminlabs/cellxgene with CellType.name for cell_type: 'astrocyte', 'oligodendrocyte'
# use a lookup object to get the correct spelling of categories (here from laminlabs/cellxgene)
# pass "public" to use the public reference instead
lookup = annotate.lookup()
lookup
Lookup objects from the laminlabs/cellxgene:
.cell_type
.assay_ontology_id
.donor
.columns
Example:
→ categories = validator.lookup().cell_type
→ categories.alveolar_type_1_fibroblast_cell
cell_types = lookup[df.cell_type.name]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(uid='2sgq6sE7', name='cerebral cortex pyramidal neuron', ontology_id='CL:4023111', description='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:37:06 UTC')
# fix the typo
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})
annotate.add_validated_from(df.cell_type.name)
✅ added 1 record from laminlabs/cellxgene with CellType.name for cell_type: 'cerebral cortex pyramidal neuron'
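The lookup object above exposes registry names as attribute-style keys, which makes the correct spelling easy to find by auto-completion. The sketch below imitates that behavior with the standard library only; to_key and build_lookup are hypothetical helpers, and the fuzzy suggestion via difflib is an added illustration rather than something lamindb is claimed to do.

```python
import difflib
import re
from types import SimpleNamespace


def to_key(name):
    """Turn a free-text name into an attribute-friendly key."""
    return re.sub(r"\W+", "_", name.strip().lower()).strip("_")


def build_lookup(names):
    """Expose each name under its attribute-friendly key."""
    return SimpleNamespace(**{to_key(n): n for n in names})


registry_names = ["astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"]
lookup = build_lookup(registry_names)
print(lookup.cerebral_cortex_pyramidal_neuron)  # cerebral cortex pyramidal neuron

# suggest the closest registered spelling for a non-validated term
suggestion = difflib.get_close_matches("cerebral pyramidal neuron", registry_names, n=1)
print(suggestion)  # ['cerebral cortex pyramidal neuron']
```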
# register non-validated terms
annotate.add_new_from(df.donor.name)
✅ added 3 records with ULabel.name for donor: 'D0001', 'D0002', 'DOOO3'
# validate again
validated = annotate.validate()
validated
✅ cell_type is validated against CellType.name
💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id
❗ found 1 terms validated terms: ['EFO:0008913']
→ save terms via .add_validated_from('assay_ontology_id')
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✅ donor is validated against ULabel.name
True
Validate an AnnData object¶
Here we specify which field to validate the var index against (var_index) and which obs columns to validate (categoricals).
df.index = ["obs1", "obs2", "obs3"]
X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])
adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 5
obs: 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
annotate = ln.Annotate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals=categoricals,
    organism="human",
)
❗ 1 non-validated categories are not saved in Feature.name: ['temperature']!
→ to lookup categories, use lookup().columns
→ to save, run add_new_from_columns
✅ added 6 records from public with Gene.symbol for var_index: 'TCF7', 'PDCD1', 'CD3E', 'CD4', 'CD8A'
annotate.validate()
✅ var_index is validated against Gene.symbol
✅ cell_type is validated against CellType.name
💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id
❗ found 1 terms validated terms: ['EFO:0008913']
→ save terms via .add_validated_from('assay_ontology_id')
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✅ donor is validated against ULabel.name
True
annotate.add_validated_from("all")
💡 saving labels for 'cell_type'
💡 saving labels for 'assay_ontology_id'
✅ added 1 record from public with ExperimentalFactor.ontology_id for assay_ontology_id: 'EFO:0008913'
💡 saving labels for 'donor'
annotate.validate()
✅ var_index is validated against Gene.symbol
✅ cell_type is validated against CellType.name
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✅ donor is validated against ULabel.name
True
Save an artifact¶
The validated object can subsequently be saved as an Artifact:
artifact = annotate.save_artifact(description="test AnnData")
❗ no run & transform get linked, consider calling ln.track()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/qjr1ChZfVgjWWJQnDqKz.h5ad')
✅ storing artifact 'qjr1ChZfVgjWWJQnDqKz' at '/home/runner/work/lamindb/lamindb/docs/test-annotate/.lamindb/qjr1ChZfVgjWWJQnDqKz.h5ad'
💡 you can auto-track these data as a run input by calling `ln.track()`
💡 parsing feature names of X stored in slot 'var'
✅ 5 terms (100.00%) are validated for symbol
✅ linked: FeatureSet(uid='oWPzxUf81Lw6eXn0orJg', n=6, dtype='int', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅ 3 terms (75.00%) are validated for name
❗ 1 term (25.00%) is not validated for name: temperature
✅ linked: FeatureSet(uid='4iZxtxpDNa206AoIaL4O', n=3, registry='Feature', hash='k4veAGwD_qrc9ztjkKEX', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
artifact.describe()
Artifact(uid='qjr1ChZfVgjWWJQnDqKz', description='test AnnData', suffix='.h5ad', accessor='AnnData', size=20336, hash='wozXf_B6VsK6QXH81skJ8A', hash_type='md5', n_observations=3, visibility=1, key_is_virtual=True, updated_at='2024-05-25 15:26:40 UTC')
Provenance
.created_by = 'testuser1'
.storage = '/home/runner/work/lamindb/lamindb/docs/test-annotate'
Labels
.cell_types = 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
.experimental_factors = 'single-cell RNA sequencing'
.ulabels = 'D0001', 'D0002', 'DOOO3'
Features
'donor' = 'D0001', 'D0002', 'DOOO3'
'cell_type' = 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
'assay_ontology_id' = 'single-cell RNA sequencing'
Feature sets
'var' = 'TCF7', 'PDCD1', 'CD3E', 'CD4', 'CD8A'
'obs' = 'cell_type', 'assay_ontology_id', 'donor'
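The percentages in the save log follow from simple counting: three of the four obs columns are registered as features, so 75% are validated, and temperature is the one that is not. A small sketch of that arithmetic, where coverage is a hypothetical helper and the registered feature names are taken from the log:

```python
def coverage(names, registered):
    """Return (number validated, percentage validated) for a list of names."""
    n_validated = sum(name in registered for name in names)
    return n_validated, 100 * n_validated / len(names)


obs_columns = ["temperature", "cell_type", "assay_ontology_id", "donor"]
registered_features = {"cell_type", "assay_ontology_id", "donor"}

n_validated, pct = coverage(obs_columns, registered_features)
print(f"{n_validated} terms ({pct:.2f}%) are validated")
# 3 terms (75.00%) are validated
```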
Save a collection¶
Register a new collection for the registered artifact:
# register a new collection
collection = annotate.save_collection(
    artifact,  # registered artifact above, can also pass a list of artifacts
    name="Experiment X in brain",  # title of the publication
    description="10.1126/science.xxxxx",  # DOI of the publication
    reference="E-MTAB-xxxxx",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
)
❗ no run & transform get linked, consider calling ln.track()
💡 you can auto-track these data as a run input by calling `ln.track()`
collection.artifacts.df()
| id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | qjr1ChZfVgjWWJQnDqKz | None | test AnnData | None | .h5ad | AnnData | 20336 | wozXf_B6VsK6QXH81skJ8A | md5 | None | 3 | 1 | True | 1 | None | None | 1 | 2024-05-25 15:26:40.973890+00:00 |