lamindb.Collection¶
- class lamindb.Collection(artifacts: list[Artifact], name: str, version: str | None = None, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, is_new_version_of: Collection | None = None)¶
Bases: Registry, Data, IsVersioned, TracksRun, TracksUpdates
Collections of artifacts.
For more info, see Tutorial: Artifacts.
- Parameters:
artifacts (List[Artifact]) – A list of artifacts.
name (str) – A name.
description (str | None, default: None) – A description.
version (str | None, default: None) – A version string.
is_new_version_of (Collection | None, default: None) – An old version of the collection.
run (Run | None, default: None) – The run that creates the collection.
meta (Artifact | None, default: None) – An artifact that defines metadata for the collection.
reference (str | None, default: None) – For instance, an external ID or a URL.
reference_type (str | None, default: None) – For instance, "url".
Examples
Create a collection from a list of Artifact objects:
>>> collection = ln.Collection([artifact1, artifact2], name="My collection")
>>> collection.save()
If you have more than 100k artifacts, consider creating the collection directly from the directory as a single artifact, without registering a record per file (e.g., here RxRx: cell imaging):
>>> collection = ln.Artifact("s3://my-bucket/my-images/", name="My collection", meta=df)
>>> collection.save()
Make a new version of a collection:
>>> # a non-versioned collection
>>> collection = ln.Collection(df1, description="My dataframe")
>>> collection.save()
>>> # create new collection from old collection and version both
>>> new_collection = ln.Collection(df2, is_new_version_of=collection)
>>> assert new_collection.stem_uid == collection.stem_uid
>>> assert collection.version == "1"
>>> assert new_collection.version == "2"
Fields
- version CharField
Version (default None). Defines the version of a family of records characterized by the same stem_uid. Consider using semantic versioning with Python versioning.
- created_at DateTimeField
Time of creation of record.
- created_by ForeignKey
Creator of record, a User.
- updated_at DateTimeField
Time of last update to record.
- id AutoField
Internal id, valid only in one DB instance.
- uid CharField
Universal id, valid across DB instances.
- name CharField
Name or title of collection (required).
- description TextField
A description.
- hash CharField
Hash of collection content. 86 base64 chars allow storing 64 bytes (512 bits).
- reference CharField
A reference like a URL or external ID.
- reference_type CharField
Type of reference, e.g., cellxgene Census collection_id.
- transform ForeignKey
Transform whose run created the collection.
- run ForeignKey
Run that created the collection.
- artifact OneToOneField
Storage of collection as a single artifact.
- visibility SmallIntegerField
Visibility of record: 1 default, 0 hidden, -1 trash.
- feature_sets ManyToManyField
The feature sets measured in this collection (see FeatureSet).
- ulabels ManyToManyField
ULabels sampled in the collection (see ULabel).
- input_of ManyToManyField
Runs that use this collection as an input.
- previous_runs ManyToManyField
Sequence of runs that created or updated the record.
- unordered_artifacts ManyToManyField
Storage of collection as multiple artifacts.
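A hedged sketch of reading a few of these fields (Registry builds on Django, so many-to-many fields expose Django-style managers; the values shown are illustrative):
>>> collection.name
'My collection'
>>> collection.unordered_artifacts.count()  # Django-style manager over member artifacts
2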
Methods
- cache(is_run_input=None)¶
Download cloud artifacts in collection to local cache.
Follows syncing logic: only caches outdated artifacts.
Returns paths to locally cached on-disk artifacts.
- Parameters:
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
list[UPath]
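A minimal usage sketch, assuming collection references cloud-backed artifacts (outputs are illustrative):
>>> local_paths = collection.cache()  # downloads only artifacts whose local copy is outdated
>>> local_paths[0].exists()  # each entry is a locally cached UPath
True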
- delete(permanent=None)¶
Delete collection.
- Parameters:
permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).
- Return type:
None
Examples
For any Collection object collection, call:
>>> collection.delete()
- load(join='outer', is_run_input=None, **kwargs)¶
Stage and load to memory.
Returns in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.
- Return type:
Any
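A hedged sketch, assuming the collection's artifacts are concatenable tables (the join mode is forwarded to the underlying concatenation):
>>> df = collection.load()  # outer join across artifacts by default
>>> adata = collection.load(join="inner")  # for AnnData collections, inner-join variables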
- mapped(layers_keys=None, obs_keys=None, obsm_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)¶
Return a map-style dataset.
Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.
If your AnnData collection is in the cloud, move them into a local cache first via cache().
__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
Note
For a guide, see Train a machine learning model on a collection.
This method currently only works for collections of AnnData artifacts.
- Parameters:
layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm.
stream (bool, default: False) – Whether to stream data from the array backend.
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
MappedCollection
Examples
>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.filter(description="my collection").one()
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
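To make the __getitem__ contract above concrete, a sketch of inspecting one sample (keys besides "X" and "_store_idx" depend on the requested obs_keys; the output shown is illustrative):
>>> sample = mapped[0]  # dictionary for a single observation
>>> sorted(sample.keys())
['X', '_store_idx', 'batch', 'cell_type']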
- restore()¶
Restore collection record from trash.
- Return type:
None
Examples
For any Collection object collection, call:
>>> collection.restore()
- save(using=None)¶
Save the collection and underlying artifacts to database & storage.
- Parameters:
using (str | None, default: None) – The database to which you want to save.
- Return type:
None
Examples
>>> collection = ln.Collection("./myfile.csv", name="myfile")
>>> collection.save()
- stage(is_run_input=None)¶
Download cloud artifacts in collection to local cache.
Follows syncing logic: only caches outdated artifacts.
Returns paths to locally cached on-disk artifacts.
- Parameters:
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
list[UPath]