lamindb.Collection¶
- class lamindb.Collection(artifacts: list[Artifact], name: str, version: str | None = None, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, is_new_version_of: Collection | None = None)¶
Bases: Registry, Data, IsVersioned, TracksRun, TracksUpdates
Collections of artifacts.
For more info, see Tutorial: Artifacts.
- Parameters:
artifacts (List[Artifact]) – A list of artifacts.
name (str) – A name.
description (str | None, default: None) – A description.
version (str | None, default: None) – A version string.
is_new_version_of (Collection | None, default: None) – An old version of the collection.
run (Run | None, default: None) – The run that creates the collection.
meta (Artifact | None, default: None) – An artifact that defines metadata for the collection.
reference (str | None, default: None) – For instance, an external ID or a URL.
reference_type (str | None, default: None) – For instance, "url".
Examples
Create a collection from a list of Artifact objects:
>>> collection = ln.Collection([artifact1, artifact2], name="My collection")
>>> collection.save()
If you have more than 100k artifacts, consider creating the collection directly from the directory as a single artifact, without registering a record per file (e.g., here RxRx: cell imaging):
>>> collection = ln.Artifact("s3://my-bucket/my-images/", name="My collection", meta=df)
>>> collection.save()
Make a new version of a collection:
>>> # a non-versioned collection
>>> collection = ln.Collection(df1, description="My dataframe")
>>> collection.save()
>>> # create new collection from old collection and version both
>>> new_collection = ln.Collection(df2, is_new_version_of=collection)
>>> assert new_collection.stem_uid == collection.stem_uid
>>> assert collection.version == "1"
>>> assert new_collection.version == "2"
Fields
- version CharField
Version (default None). Defines the version of a family of records characterized by the same stem_uid. Consider using semantic versioning with Python versioning.
- created_at DateTimeField
Time of creation of record.
- created_by ForeignKey
Creator of record, a User.
- updated_at DateTimeField
Time of last update to record.
- id AutoField
Internal id, valid only in one DB instance.
- uid CharField
Universal id, valid across DB instances.
- name CharField
Name or title of collection (required).
- description TextField
A description.
- hash CharField
Hash of collection content. 86 base64 chars allow storing 64 bytes (512 bits).
- reference CharField
A reference like a URL or external ID.
- reference_type CharField
Type of reference, e.g., cellxgene Census collection_id.
- transform ForeignKey
Transform whose run created the collection.
- run ForeignKey
Run that created the collection.
- artifact OneToOneField
Storage of collection as a single artifact.
- visibility SmallIntegerField
Visibility of record: 1 default, 0 hidden, -1 trash.
- feature_sets ManyToManyField
The feature sets measured in this collection (see FeatureSet).
- ulabels ManyToManyField
ULabels sampled in the collection (see ULabel).
- input_of ManyToManyField
Runs that use this collection as an input.
- previous_runs ManyToManyField
Sequence of runs that created or updated the record.
- unordered_artifacts ManyToManyField
Storage of collection as multiple artifacts.
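A hedged sketch of reading a few of these fields (Registry builds on Django, so many-to-many fields expose Django-style managers; the values shown are illustrative):
>>> collection.name
'My collection'
>>> collection.unordered_artifacts.count()  # Django-style manager over member artifacts
2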
Methods
- cache(is_run_input=None)¶
Download cloud artifacts in collection to local cache.
Follows syncing logic: only caches outdated artifacts.
Returns paths to locally cached on-disk artifacts.
- Parameters:
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
list[UPath]
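A minimal usage sketch, assuming collection references cloud-backed artifacts (outputs are illustrative):
>>> local_paths = collection.cache()  # downloads only artifacts whose local copy is outdated
>>> local_paths[0].exists()  # each entry is a locally cached UPath
True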
- delete(permanent=None)¶
Delete collection.
- Parameters:
permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).
- Return type:
None
Examples
For any Collection object collection, call:
>>> collection.delete()
- load(join='outer', is_run_input=None, **kwargs)¶
Stage and load to memory.
Returns in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.
- Return type:
Any
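A hedged sketch, assuming the collection's artifacts are concatenable tables (the join mode is forwarded to the underlying concatenation):
>>> df = collection.load()  # outer join across artifacts by default
>>> adata = collection.load(join="inner")  # for AnnData collections, inner-join variables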
- mapped(layers_keys=None, obs_keys=None, obsm_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)¶
Return a map-style dataset.
Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.
If your AnnData collection is in the cloud, move them into a local cache first via cache().
__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
Note
For a guide, see Train a machine learning model on a collection.
This method currently only works for collections of AnnData artifacts.
- Parameters:
layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm.
stream (bool, default: False) – Whether to stream data from the array backend.
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
MappedCollection
Examples
>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.filter(description="my collection").one()
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
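To make the __getitem__ contract above concrete, a sketch of inspecting one sample (keys besides "X" and "_store_idx" depend on the requested obs_keys; the output shown is illustrative):
>>> sample = mapped[0]  # dictionary for a single observation
>>> sorted(sample.keys())
['X', '_store_idx', 'batch', 'cell_type']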
- restore()¶
Restore collection record from trash.
- Return type:
None
Examples
For any Collection object collection, call:
>>> collection.restore()
- save(using=None)¶
Save the collection and underlying artifacts to database & storage.
- Parameters:
using (str | None, default: None) – The database to which you want to save.
- Return type:
None
Examples
>>> collection = ln.Collection("./myfile.csv", name="myfile")
>>> collection.save()
- stage(is_run_input=None)¶
Download cloud artifacts in collection to local cache.
Follows syncing logic: only caches outdated artifacts.
Returns paths to locally cached on-disk artifacts.
- Parameters:
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
list[UPath]