Test AnnDataAccessor
import lamindb as ln
ln.setup.init(storage="s3://lamindb-ci/test-anndata")
❗ To use lamindb, you need to connect to an instance.
Connect to an instance: `ln.connect()`. Init an instance: `ln.setup.init()`.
If you used the CLI to set up lamindb in a notebook, restart the Python session.
💡 go to: https://lamin.ai/testuser1/test-anndata
❗ updating & unlocking cloud SQLite 's3://lamindb-ci/test-anndata/183bc48fd12a5d5b8ff8153b79de292c.lndb' of instance 'testuser1/test-anndata'
💡 connected lamindb: testuser1/test-anndata
❗ locked instance (to unlock and push changes to the cloud SQLite file, call: lamin close)
We’ll need some test data:
ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
❗ no run & transform get linked, consider calling ln.track()
❗ will manage storage location s3://lamindb-ci with instance testuser1/test-anndata
Artifact(uid='QZFy86scchybylgjFhLV', key='lndb-storage/pbmc68k.h5ad', suffix='.h5ad', accessor='AnnData', size=638484, hash='-QNUPBbAug3jFmmk3fsOQA', hash_type='md5', visibility=1, key_is_virtual=False, created_by_id=1, storage_id=2, updated_at='2024-05-25 15:34:36 UTC')
An h5ad artifact stored on S3:
artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
artifact.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')
adata = artifact.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
It is possible to access AnnData attributes without loading them into memory:
print(adata.obsm)
print(adata.varm)
print(adata.obsp)
print(adata.varm)
Accessor for the AnnData attribute obsm
with keys: ['X_pca', 'X_umap']
Accessor for the AnnData attribute varm
with keys: ['PCs']
Accessor for the AnnData attribute obsp
with keys: ['connectivities', 'distances']
Accessor for the AnnData attribute varm
with keys: ['PCs']
However, .obs, .var and .uns are always loaded fully into memory on AnnDataAccessor initialization:
adata.obs.head()
index | cell_type | n_genes | percent_mito | louvain |
---|---|---|---|---|
GCAGGGCTGGATTC-1 | Dendritic cells | 1168 | 0.014345 | 2 |
CTTTAGTGGTTACG-6 | CD19+ B | 1121 | 0.019679 | 8 |
TGACTGGAACCATG-7 | Dendritic cells | 1277 | 0.012961 | 1 |
TCAATCACCCTTCG-8 | CD19+ B | 1139 | 0.018467 | 4 |
CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 | 0 |
adata.var.head()
index | n_counts | highly_variable |
---|---|---|
HES4 | 1153.387451 | True |
TNFRSF4 | 304.358154 | True |
SSU72 | 2530.272705 | False |
PARK7 | 7451.664062 | False |
RBP7 | 272.811035 | True |
adata.uns.keys()
dict_keys(['louvain', 'louvain_colors', 'neighbors', 'pca'])
Without subsetting, the AnnDataAccessor object gives references to underlying lazy h5 or zarr arrays:
adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
adata.obsm["X_pca"]
<HDF5 dataset "X_pca": shape (70, 50), type "<f4">
It also gives references to lazy SparseDataset objects from the anndata package:
adata.obsp["distances"]
CSRDataset: backend hdf5, shape (70, 70), data_dtype float64
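The idea of a lazy array reference can be illustrated with h5py alone. The following is a minimal sketch, not the AnnDataAccessor implementation: the file name `toy.h5`, the dataset contents, and the row selection are made up for illustration. It shows that opening a dataset yields a handle with known shape and dtype, and that values are read only when the handle is indexed.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small, made-up matrix to a temporary HDF5 file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "toy.h5")
data = np.arange(20, dtype="float32").reshape(5, 4)
with h5py.File(path, "w") as f:
    f.create_dataset("X", data=data)

with h5py.File(path, "r") as f:
    dset = f["X"]  # a handle: shape and dtype are known, values are not read yet
    handle_type = type(dset).__name__
    shape = dset.shape
    # Values are read on indexing; h5py expects index lists in increasing order,
    # so a boolean mask is converted to sorted integer positions first.
    row_idx = np.flatnonzero(np.array([True, False, True, False, False]))
    rows = dset[row_idx, :]  # a NumPy array holding only rows 0 and 2
```

Here `rows` is an ordinary in-memory NumPy array, while `dset` stays a reference to the data on disk for as long as the file is open.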
Get a subset of the object; attributes are loaded only on explicit access:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Check shapes of the subset
num_idx = sum(obs_idx)
assert adata_subset.shape == (num_idx, adata.shape[1])
assert (adata_subset.obs.cell_type == "CD34+").sum() == 0
adata_subset.obs.cell_type.value_counts()
Dendritic cells 28
CD14+ Monocytes 7
CD4+/CD25 T Reg 0
CD4+/CD45RO+ Memory 0
CD8+ Cytotoxic T 0
CD8+/CD45RA+ Naive Cytotoxic 0
CD19+ B 0
CD34+ 0
CD56+ NK 0
Name: cell_type, dtype: int64
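The zero counts in the output above are what pandas reports for a categorical column: value_counts() lists every category, including those no longer present after filtering. A minimal sketch on a made-up table (the cell labels and values below are hypothetical, not taken from pbmc68k) applies the same kind of boolean mask:

```python
import pandas as pd

# Hypothetical toy metadata; values are made up for illustration only.
obs = pd.DataFrame(
    {
        "cell_type": pd.Categorical(
            ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "Dendritic cells", "CD34+"],
            categories=["Dendritic cells", "CD14+ Monocytes", "CD19+ B", "CD34+"],
        ),
        "percent_mito": [0.01, 0.02, 0.03, 0.08, 0.01],
    },
    index=[f"cell-{i}" for i in range(5)],
)

# Combine two conditions element-wise into one boolean mask.
mask = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
obs_subset = obs[mask]

# Categories absent from the subset are still listed, with a count of 0.
counts = obs_subset.cell_type.value_counts()
```

In this toy table only two rows pass both conditions, and the unobserved categories remain in `counts` with a value of 0.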
Subsets load the arrays into memory only on direct access:
print(adata_subset.X)
[[-0.326 -0.191 0.499 ... -0.21 -0.636 -0.49 ]
[ 0.811 -0.191 -0.728 ... -0.21 0.604 -0.49 ]
[-0.326 -0.191 0.643 ... -0.21 2.303 -0.49 ]
...
[-0.326 -0.191 -0.728 ... -0.21 0.626 -0.49 ]
[-0.326 -0.191 -0.728 ... -0.21 -0.636 -0.49 ]
[-0.326 -0.191 -0.728 ... -0.21 -0.636 -0.49 ]]
print(adata_subset.obsm["X_pca"])
[[-5.750601 -4.096395 -2.9178936 ... -0.3169805 -0.20286919
-0.4912242 ]
[-6.516435 4.5414424 1.629511 ... -2.0872126 2.4427452
0.67004365]
[-2.0939696 4.8808017 -2.0491498 ... -3.3238401 -1.6365678
1.0325491 ]
...
[-2.284083 -4.8995905 -2.5168793 ... -0.22459485 -0.28241014
-0.45557737]
[-7.1581526 5.147818 2.4819682 ... 2.1289759 -0.27535897
0.5335301 ]
[-4.0010567 -6.0705996 -3.1599348 ... 1.1530831 0.48674038
-0.24262637]]
assert adata_subset.obsp["distances"].shape[0] == num_idx
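The load-on-access behavior can be pictured with a small toy model. The classes `CountingStore` and `LazySubset` below are hypothetical and only illustrate the pattern; they are not how lamindb implements AnnDataAccessorSubset. The subset remembers which rows were selected and reads them from the backing store only when an attribute is requested.

```python
class CountingStore:
    """Hypothetical backing store that counts how often data is read."""

    def __init__(self, arrays):
        self._arrays = arrays
        self.reads = 0

    def read_rows(self, name, row_idx):
        self.reads += 1
        return [self._arrays[name][i] for i in row_idx]


class LazySubset:
    """Toy subset: stores row indices, reads rows only on item access."""

    def __init__(self, store, row_idx):
        self._store = store
        self._row_idx = list(row_idx)

    @property
    def n_obs(self):
        # Known from the indices alone, no read required.
        return len(self._row_idx)

    def __getitem__(self, name):
        # The read happens here, on explicit access.
        return self._store.read_rows(name, self._row_idx)


store = CountingStore({"X": [[1, 2], [3, 4], [5, 6]]})
subset = LazySubset(store, [0, 2])
reads_after_subsetting = store.reads  # nothing has been read yet
x_rows = subset["X"]  # triggers exactly one read of rows 0 and 2
```

Creating the subset costs nothing beyond storing the indices; the single read is deferred until `subset["X"]` is evaluated.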
To load the entire subset into memory as an actual AnnData object, use to_memory():
adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
!lamin delete --force test-anndata
💡 deleting instance testuser1/test-anndata
💡 deleted storage record on hub 0b060fdbd72e55ae864c531f35d458ee
💡 deleted storage record on hub 762b2d23bcb752cf88c5b7bab2d4e03e
💡 deleted instance record on hub 183bc48fd12a5d5b8ff8153b79de292c