Validate & standardize for developers¶
LaminDB makes it easy to validate categorical variables based on registries (CanValidate).
How do I validate based on a public ontology?
CanValidate methods validate against the registries in your LaminDB instance.
In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Registry.public().
By default, from_values() considers a match in a public reference a validated value for any bionty entity.
What to do for non-validated values?
Keep in mind that in a freshly initialized instance nothing validates, because no records have been registered yet.
Run inspect() to get instructions on how to register non-validated values. You may need to standardize your values, fix typos, or simply register them.
Setup¶
!lamin init --storage ./test-validate --schema bionty
💡 connected lamindb: testuser1/test-validate
import lamindb as ln
import bionty as bt
import pandas as pd
💡 connected lamindb: testuser1/test-validate
ln.settings.verbosity = "info"
Pre-populate registries:
df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.Artifact.from_df(df, description="test data").save()
ln.ULabel(name="Project A").save()
ln.ULabel(name="Project B").save()
bt.Disease.from_public(ontology_id="MONDO:0004975").save()
❗ no run & transform get linked, consider calling ln.track()
✅ storing artifact 'VwIvtvMPZySqecE1Pz5W' at '/home/runner/work/lamindb/lamindb/docs/test-validate/.lamindb/VwIvtvMPZySqecE1Pz5W.parquet'
❗ record with similar name exists! did you mean to load it?
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at
---|---|---|---|---|---|---|---|---
1 | QLJCcRSc | Project A | None | None | None | None | 1 | 2024-05-25 15:26:53.472692+00:00
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0004975'
💡 also saving parents of Disease(uid='4F2HPJ3w', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimer disease|Alzheimers dementia|Alzheimer dementia|presenile and senile dementia|Alzheimer's disease|Alzheimer's dementia|Alzheimers disease|AD', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:26:54 UTC')
✅ created 2 Disease records from Bionty matching ontology_id: 'MONDO:0001627', 'MONDO:0005574'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 also saving parents of Disease(uid='6AMrlbw8', name='dementia', ontology_id='MONDO:0001627', synonyms='dementia (disease)|dementia', description='Loss Of Intellectual Abilities Interfering With An Individual'S Social And Occupational Functions. Causes Include Alzheimer'S Disease, Brain Injuries, Brain Tumors, And Vascular Disorders.', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:26:55 UTC')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002039'
💡 also saving parents of Disease(uid='6yfRDD23', name='cognitive disorder', ontology_id='MONDO:0002039', synonyms='cognitive disease|cognitive disorder', description='A Disease Affects Cognitive Processes.', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:26:56 UTC')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002025'
💡 also saving parents of Disease(uid='6HNgrMK9', name='psychiatric disorder', ontology_id='MONDO:0002025', synonyms='Psychiatric disease|Psychiatric disorder', description='A Disorder Characterized By Behavioral And/Or Psychological Abnormalities, Often Accompanied By Physical Symptoms. The Symptoms May Cause Clinically Significant Distress Or Impairment In Social And Occupational Areas Of Functioning. Representative Examples Include Anxiety Disorders, Cognitive Disorders, Mood Disorders And Schizophrenia.', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:26:56 UTC')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0700096'
💡 also saving parents of Disease(uid='3Pcu72hb', name='human disease', ontology_id='MONDO:0700096', synonyms='human disease or disorder', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:26:57 UTC')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0000001'
💡 also saving parents of Disease(uid='6PeduEmE', name='tauopathy', ontology_id='MONDO:0005574', description='Neurodegenerative Disorders Involving Deposition Of Abnormal Tau Protein Isoforms (Tau Proteins) In Neurons And Glial Cells In The Brain. Pathological Aggregations Of Tau Proteins Are Associated With Mutation Of The Tau Gene On Chromosome 17 In Patients With Alzheimer Disease; Dementia; Parkinsonian Disorders; Progressive Supranuclear Palsy (Supranuclear Palsy, Progressive); And Corticobasal Degeneration.', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:26:55 UTC')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0005559'
💡 also saving parents of Disease(uid='6sgNFaDE', name='neurodegenerative disease', ontology_id='MONDO:0005559', synonyms='brain degeneration|central nervous system degenerative disorder|central nervous system neurodegenerative disorder|neurodegenerative disease|degenerative disorder of central nervous system', description='A Disorder Of The Central Nervous System Characterized By Gradual And Progressive Loss Of Neural Tissue And Neurologic Function.', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:26:59 UTC')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002602'
💡 also saving parents of Disease(uid='5dTDVEfc', name='central nervous system disorder', ontology_id='MONDO:0002602', synonyms='disorder of central nervous system|central nervous system disease or disorder|central nervous disease|central nervous system disorder|central nervous system disease|disease or disorder of central nervous system|disease of central nervous system|CNS disorder|disease of the central nervous system', description='A Disease Involving The Central Nervous System.', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:26:59 UTC')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0005071'
💡 also saving parents of Disease(uid='3NKHns2m', name='nervous system disorder', ontology_id='MONDO:0005071', synonyms='disease or disorder of nervous system|neurologic disorder|neurological disorder|disease of nervous system|nervous system disease|nervous system disease or disorder|nervous system disorder|neurological disease|neurologic disease|disorder of nervous system', description='A Non-Neoplastic Or Neoplastic Disorder That Affects The Brain, Spinal Cord, Or Peripheral Nerves.', created_by_id=1, public_source_id=39, updated_at='2024-05-25 15:27:00 UTC')
Standard validation¶
Name duplication¶
Creating a record with the same name field automatically returns the existing record:
ln.ULabel(name="Project A")
💡 returning existing ULabel record with same name: 'Project A'
ULabel(uid='QLJCcRSc', name='Project A', created_by_id=1, updated_at='2024-05-25 15:26:53 UTC')
Bulk creating records using from_values() only returns validated records:
Note: Terms validated against a public reference are also created by from_values(); see Manage biological registries for details.
projects = ["Project A", "Project B", "Project D", "Project E"]
ln.ULabel.from_values(projects)
✅ loaded 2 ULabel records matching name: 'Project A', 'Project B'
❗ did not create ULabel records for 2 non-validated names: 'Project D', 'Project E'
[ULabel(uid='QLJCcRSc', name='Project A', created_by_id=1, updated_at='2024-05-25 15:26:53 UTC'),
ULabel(uid='S8Y16gRY', name='Project B', created_by_id=1, updated_at='2024-05-25 15:26:53 UTC')]
(Versioned records also account for version in addition to name. Also see: idempotency.)
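The behavior described above can be pictured as a get-or-create lookup keyed on the name field. The following is a minimal in-memory sketch, not LaminDB's implementation: `Label`, `LabelRegistry`, and its methods are hypothetical stand-ins written only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Label:
    """Hypothetical stand-in for a registry record."""

    name: str


@dataclass
class LabelRegistry:
    """Hypothetical in-memory registry keyed on the name field."""

    _by_name: dict = field(default_factory=dict)

    def save(self, name: str) -> Label:
        # Return the existing record if the name is already registered.
        if name not in self._by_name:
            self._by_name[name] = Label(name)
        return self._by_name[name]

    def from_values(self, names):
        # Only names that are already registered yield records;
        # unknown names are skipped rather than created.
        return [self._by_name[n] for n in names if n in self._by_name]


registry = LabelRegistry()
first = registry.save("Project A")
again = registry.save("Project A")  # same record, no duplicate
registry.save("Project B")
loaded = registry.from_values(["Project A", "Project B", "Project D", "Project E"])
print(first is again, [label.name for label in loaded])
```

In this sketch, saving "Project A" twice yields the same object, and only the two registered names come back from the bulk lookup.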
Data duplication¶
Creating an artifact or collection with the same content automatically returns the existing record:
ln.Artifact.from_df(df, description="same data")
❗ no run & transform get linked, consider calling ln.track()
💡 returning existing artifact with same hash: Artifact(uid='VwIvtvMPZySqecE1Pz5W', description='test data', suffix='.parquet', accessor='DataFrame', size=2722, hash='-xXHpj8x-liAvd51DtHVnA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, updated_at='2024-05-25 15:26:53 UTC')
❗ updated description from test data to same data
Artifact(uid='VwIvtvMPZySqecE1Pz5W', description='same data', suffix='.parquet', accessor='DataFrame', size=2722, hash='-xXHpj8x-liAvd51DtHVnA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, updated_at='2024-05-25 15:26:53 UTC')
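One way to picture this deduplication is a lookup keyed on a content hash. The sketch below is a simplified illustration under stated assumptions, not how LaminDB stores or encodes its hashes: the `store` dictionary and the `register` function are hypothetical, and a hex MD5 digest is used only because the output above reports `hash_type='md5'`.

```python
import hashlib

# Hypothetical in-memory store mapping a content hash to a record.
store: dict[str, dict] = {}


def register(content: bytes, description: str) -> dict:
    """Return the existing record for identical content, else create one."""
    digest = hashlib.md5(content).hexdigest()
    record = store.get(digest)
    if record is None:
        record = {"hash": digest, "description": description}
        store[digest] = record
    elif record["description"] != description:
        # Same content, new description: update the existing record.
        record["description"] = description
    return record


a = register(b"A,B\n1,2\n", "test data")
b = register(b"A,B\n1,2\n", "same data")
print(a is b, b["description"], len(store))
```

Registering the same bytes twice returns one record whose description has been updated, which mirrors the log messages shown above.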
Schema-based validation¶
Type checks, constraint checks, and Django validators can be configured in the schema.
Registry-based validation¶
validate() validates passed values against reference values in a registry.
It returns a boolean vector indicating whether a value has an exact match in the reference values.
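As a conceptual illustration of what "exact match" means here, the sketch below compares values against a set of reference strings in plain Python. It is not the LaminDB implementation; the `reference` set is an assumed stand-in for the values of one field in a registry.

```python
# Assumed reference values, standing in for the name field of a registry.
reference = {"Alzheimer disease", "dementia", "tauopathy"}
values = ["Alzheimer disease", "Alzheimer's disease", "AD"]

# One boolean per input value: True only for an exact string match.
validated = [value in reference for value in values]

# The values that failed validation, in their original order.
non_validated = [value for value, ok in zip(values, validated) if not ok]

print(validated)
print(non_validated)
```

Synonyms and abbreviations fail an exact comparison, which is why they need a separate standardization step.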
Using dedicated registries¶
For instance, bionty types basic biological entities: every entity has its own registry, a Python class.
By default, the first string field is used for validation. For Disease, it’s name:
diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]
validated = bt.Disease.validate(diseases)
validated
✅ 1 term (33.30%) is validated for name
❗ 2 terms (66.70%) are not validated for name: Alzheimer's disease, AD
array([ True, False, False])
Validate against a non-default field:
bt.Disease.validate(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], bt.Disease.ontology_id
)
✅ 1 term (33.30%) is validated for ontology_id
❗ 2 terms (66.70%) are not validated for ontology_id: MONDO:0004976, MONDO:0004977
array([ True, False, False])
Using the ULabel registry¶
Any entity that doesn’t have its dedicated registry (“is not typed”) can be validated & registered using ULabel:
ln.ULabel.validate(["Project A", "Project B", "Project C"])
✅ 2 terms (66.70%) are validated for name
❗ 1 term (33.30%) is not validated for name: Project C
array([ True, True, False])
Inspect & standardize¶
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a richer result, an InspectResult. Most importantly, it logs recommended curation steps that would render the data validated.
result = bt.Disease.inspect(diseases)
✅ 1 term (33.30%) is validated for name
❗ 2 terms (66.70%) are not validated for name: Alzheimer's disease, AD
detected 2 terms with synonyms: Alzheimer's disease, AD
→ standardize terms via .standardize()
In this case, it suggests calling standardize() to map synonyms to their standardized names:
bt.Disease.standardize(result.non_validated)
💡 standardized 2/2 terms
['Alzheimer disease', 'Alzheimer disease']
For more, see Manage biological registries.
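Synonym standardization can be thought of as a lookup from each synonym to the record's name. The sketch below assumes records that store synonyms as a pipe-separated string, as in the Disease output shown in the setup section. The `records` list and this `standardize` function are hypothetical and do not reproduce the library's internals.

```python
# Assumed records: (name, pipe-separated synonyms).
records = [
    ("Alzheimer disease", "Alzheimer disease|Alzheimers dementia|Alzheimer's disease|AD"),
    ("dementia", "dementia (disease)|dementia"),
]

# Build a lookup from every synonym to its standardized name.
synonym_to_name = {}
for name, synonyms in records:
    for synonym in synonyms.split("|"):
        synonym_to_name[synonym] = name


def standardize(values):
    """Map known synonyms to names; leave unknown values unchanged."""
    return [synonym_to_name.get(value, value) for value in values]


result = standardize(["Alzheimer's disease", "AD", "unknown term"])
print(result)
```

In this sketch, values without a known synonym pass through unchanged, so they can still be inspected or registered afterwards.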
Extend registries¶
Sometimes, we simply want to register new records to extend the content of registries:
result = ln.ULabel.inspect(projects)
✅ 2 terms (50.00%) are validated for name
❗ 2 terms (50.00%) are not validated for name: Project D, Project E
couldn't validate 2 terms: 'Project D', 'Project E'
→ if you are sure, create new records via ln.ULabel() and save to your registry
new_labels = [ln.ULabel(name=name) for name in result.non_validated]
ln.save(new_labels)
new_labels
❗ records with similar names exist! did you mean to load one of them?
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at
---|---|---|---|---|---|---|---|---
1 | QLJCcRSc | Project A | None | None | None | None | 1 | 2024-05-25 15:26:53.472692+00:00
2 | S8Y16gRY | Project B | None | None | None | None | 1 | 2024-05-25 15:26:53.498449+00:00
❗ records with similar names exist! did you mean to load one of them?
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at
---|---|---|---|---|---|---|---|---
1 | QLJCcRSc | Project A | None | None | None | None | 1 | 2024-05-25 15:26:53.472692+00:00
2 | S8Y16gRY | Project B | None | None | None | None | 1 | 2024-05-25 15:26:53.498449+00:00
[ULabel(uid='GxwsBUug', name='Project D', created_by_id=1, updated_at='2024-05-25 15:27:00 UTC'),
ULabel(uid='87eMOzdF', name='Project E', created_by_id=1, updated_at='2024-05-25 15:27:00 UTC')]
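The two inspect outcomes seen in this guide (standardize synonyms, or register new records) amount to sorting values into three groups. The sketch below illustrates that routing in plain Python. `inspect_values` and the shape of its return value are hypothetical and much simpler than the InspectResult returned by the library.

```python
def inspect_values(values, names, synonym_to_name):
    """Sort values into validated, standardizable, and unknown (hypothetical helper)."""
    validated = [v for v in values if v in names]
    synonyms = [v for v in values if v not in names and v in synonym_to_name]
    unknown = [v for v in values if v not in names and v not in synonym_to_name]
    hints = []
    if synonyms:
        hints.append("standardize synonyms")
    if unknown:
        hints.append("fix typos or register new records")
    return {
        "validated": validated,
        "synonyms": synonyms,
        "unknown": unknown,
        "hints": hints,
    }


names = {"Project A", "Project B"}
report = inspect_values(
    ["Project A", "Project B", "Project D", "Project E"], names, {}
)
print(report["unknown"], report["hints"])
```

Only the unknown group calls for a decision by the curator: either the value is a typo to fix, or it is a genuinely new record to register.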
Validate features¶
When calling Artifact.from_... and Collection.from_..., features are automatically validated.
Validated features are grouped in “feature sets” indexed by “slots”.
For a basic example, see Tutorial: Features & labels.
For an overview of data formats used to model different data types, see Data types.
Bulk validation¶
# clean up test instance
!lamin delete --force test-validate
!rm -r test-validate