Standardize metadata on-the-fly#

This use case runs on a LaminDB instance with populated CellType and Pathway registries. Make sure you run the GO Ontology notebook before executing this use case.

Here, we demonstrate how to standardize the metadata on-the-fly during cell type annotation and pathway enrichment analysis using these two registries.

For more information, see:

!lamin load use-cases-registries
💡 connected lamindb: testuser1/use-cases-registries
import lamindb as ln
import bionty as bt
from lamin_usecases import datasets as ds
import scanpy as sc
import matplotlib.pyplot as plt
import celltypist
import gseapy as gp
💡 connected lamindb: testuser1/use-cases-registries
sc.settings.set_figure_params(dpi=50, facecolor="white")
ln.settings.transform.stem_uid = "hsPU1OENv0LS"
ln.settings.transform.version = "0"
ln.track()
💡 notebook imports: bionty==0.42.7 celltypist==1.6.2 gseapy==1.1.2 lamin_usecases==0.0.1 lamindb==0.69.9 matplotlib==3.8.4 scanpy==1.10.1
💡 saved: Transform(uid='hsPU1OENv0LS6K79', name='Standardize metadata on-the-fly', key='analysis-registries', version='0', type='notebook', updated_at=2024-04-10 17:50:13 UTC, created_by_id=1)
💡 saved: Run(uid='iXvY9NGDNSn9IVubhnyS', transform_id=1, created_by_id=1)

An interferon-beta treated dataset#

This is a small peripheral blood mononuclear cell (PBMC) dataset split into control and stimulated groups. The stimulated group was treated with interferon beta.

Let's load the dataset and perform some preprocessing:

adata = ds.anndata_seurat_ifnb(preprocess=False, populate_registries=True)
adata


AnnData object with n_obs × n_vars = 13999 × 9937
    obs: 'stim'
    var: 'symbol'
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=20)
sc.pp.neighbors(adata, n_pcs=10)
sc.tl.umap(adata)
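To make the first two preprocessing steps concrete, here is a minimal pure-Python sketch of what total-count normalization followed by a log(1 + x) transform computes on a toy count matrix. The helper names and the toy numbers are hypothetical; Scanpy operates on (sparse) arrays and handles many more options, so this only illustrates the arithmetic.

```python
import math


def normalize_total(counts, target_sum=1e4):
    """Scale each cell (row) so that its counts sum to target_sum."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Leave all-zero cells at zero to avoid dividing by zero.
        factor = target_sum / total if total > 0 else 0.0
        normalized.append([value * factor for value in row])
    return normalized


def log1p_transform(matrix):
    """Apply log(1 + x) element-wise."""
    return [[math.log1p(value) for value in row] for row in matrix]


# Toy count matrix: 2 cells x 3 genes (hypothetical numbers).
toy_counts = [[1, 3, 6], [20, 0, 20]]
toy_normalized = normalize_total(toy_counts)
toy_logged = log1p_transform(toy_normalized)
```

After normalization every cell has the same total, so differences in sequencing depth no longer dominate the comparison between cells; the log transform then compresses the dynamic range.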

Analysis: cell type annotation using CellTypist#

model = celltypist.models.Model.load(model="Immune_All_Low.pkl")
2024-04-10 17:51:28,805:INFO - 🔎 No available models. Downloading...
2024-04-10 17:51:28,806:INFO - 📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json
2024-04-10 17:51:37,479:INFO - 📚 Total models in list: 44
2024-04-10 17:51:37,481:INFO - 📂 Storing models in /home/runner/.celltypist/data/models
2024-04-10 17:51:37,482:INFO - 💾 Downloading model [1/44]: Immune_All_Low.pkl
2024-04-10 17:51:38,376:INFO - 💾 Downloading model [2/44]: Immune_All_High.pkl
2024-04-10 17:51:42,889:INFO - 💾 Downloading model [3/44]: Adult_CynomolgusMacaque_Hippocampus.pkl
2024-04-10 17:51:44,159:INFO - 💾 Downloading model [4/44]: Adult_Human_PancreaticIslet.pkl
2024-04-10 17:51:49,238:INFO - 💾 Downloading model [5/44]: Adult_Human_Skin.pkl
2024-04-10 17:51:55,276:INFO - 💾 Downloading model [6/44]: Adult_Mouse_Gut.pkl
2024-04-10 17:51:56,269:INFO - 💾 Downloading model [7/44]: Adult_Mouse_OlfactoryBulb.pkl
2024-04-10 17:52:01,287:INFO - 💾 Downloading model [8/44]: Adult_Pig_Hippocampus.pkl
2024-04-10 17:52:06,914:INFO - 💾 Downloading model [9/44]: Adult_RhesusMacaque_Hippocampus.pkl
2024-04-10 17:52:07,628:INFO - 💾 Downloading model [10/44]: Autopsy_COVID19_Lung.pkl
2024-04-10 17:52:08,436:INFO - 💾 Downloading model [11/44]: COVID19_HumanChallenge_Blood.pkl
2024-04-10 17:52:13,639:INFO - 💾 Downloading model [12/44]: COVID19_Immune_Landscape.pkl
2024-04-10 17:52:18,008:INFO - 💾 Downloading model [13/44]: Cells_Fetal_Lung.pkl
2024-04-10 17:52:19,028:INFO - 💾 Downloading model [14/44]: Cells_Intestinal_Tract.pkl
2024-04-10 17:52:20,231:INFO - 💾 Downloading model [15/44]: Cells_Lung_Airway.pkl
2024-04-10 17:52:26,033:INFO - 💾 Downloading model [16/44]: Developing_Human_Brain.pkl
2024-04-10 17:52:26,746:INFO - 💾 Downloading model [17/44]: Developing_Human_Gonads.pkl
2024-04-10 17:52:27,530:INFO - 💾 Downloading model [18/44]: Developing_Human_Hippocampus.pkl
2024-04-10 17:52:28,162:INFO - 💾 Downloading model [19/44]: Developing_Human_Organs.pkl
2024-04-10 17:52:28,855:INFO - 💾 Downloading model [20/44]: Developing_Human_Thymus.pkl
2024-04-10 17:52:29,741:INFO - 💾 Downloading model [21/44]: Developing_Mouse_Brain.pkl
2024-04-10 17:52:30,763:INFO - 💾 Downloading model [22/44]: Developing_Mouse_Hippocampus.pkl
2024-04-10 17:52:31,431:INFO - 💾 Downloading model [23/44]: Fetal_Human_AdrenalGlands.pkl
2024-04-10 17:52:32,057:INFO - 💾 Downloading model [24/44]: Fetal_Human_Pancreas.pkl
2024-04-10 17:52:33,054:INFO - 💾 Downloading model [25/44]: Fetal_Human_Pituitary.pkl
2024-04-10 17:52:33,758:INFO - 💾 Downloading model [26/44]: Fetal_Human_Retina.pkl
2024-04-10 17:52:34,553:INFO - 💾 Downloading model [27/44]: Fetal_Human_Skin.pkl
2024-04-10 17:52:35,175:INFO - 💾 Downloading model [28/44]: Healthy_Adult_Heart.pkl
2024-04-10 17:52:35,976:INFO - 💾 Downloading model [29/44]: Healthy_COVID19_PBMC.pkl
2024-04-10 17:52:36,685:INFO - 💾 Downloading model [30/44]: Healthy_Human_Liver.pkl
2024-04-10 17:52:37,311:INFO - 💾 Downloading model [31/44]: Healthy_Mouse_Liver.pkl
2024-04-10 17:52:37,938:INFO - 💾 Downloading model [32/44]: Human_AdultAged_Hippocampus.pkl
2024-04-10 17:52:38,559:INFO - 💾 Downloading model [33/44]: Human_Developmental_Retina.pkl
2024-04-10 17:52:39,559:INFO - 💾 Downloading model [34/44]: Human_Embryonic_YolkSac.pkl
2024-04-10 17:52:44,197:INFO - 💾 Downloading model [35/44]: Human_IPF_Lung.pkl
2024-04-10 17:52:44,907:INFO - 💾 Downloading model [36/44]: Human_Longitudinal_Hippocampus.pkl
2024-04-10 17:52:45,948:INFO - 💾 Downloading model [37/44]: Human_Lung_Atlas.pkl
2024-04-10 17:52:46,747:INFO - 💾 Downloading model [38/44]: Human_PF_Lung.pkl
2024-04-10 17:52:47,453:INFO - 💾 Downloading model [39/44]: Lethal_COVID19_Lung.pkl
2024-04-10 17:52:53,125:INFO - 💾 Downloading model [40/44]: Mouse_Dentate_Gyrus.pkl
2024-04-10 17:52:53,911:INFO - 💾 Downloading model [41/44]: Mouse_Isocortex_Hippocampus.pkl
2024-04-10 17:52:54,622:INFO - 💾 Downloading model [42/44]: Mouse_Postnatal_DentateGyrus.pkl
2024-04-10 17:52:55,409:INFO - 💾 Downloading model [43/44]: Nuclei_Lung_Airway.pkl
2024-04-10 17:52:56,284:INFO - 💾 Downloading model [44/44]: Pan_Fetal_Human.pkl
predictions = celltypist.annotate(
    adata, model="Immune_All_Low.pkl", majority_voting=True
)
adata.obs["cell_type_celltypist"] = predictions.predicted_labels.majority_voting
2024-04-10 17:52:57,254:INFO - 🔬 Input data has 13999 cells and 9937 genes
2024-04-10 17:52:57,255:INFO - 🔗 Matching reference genes in the model
2024-04-10 17:52:58,417:INFO - 🧬 3701 features used for prediction
2024-04-10 17:52:58,421:INFO - ⚖️ Scaling input data
2024-04-10 17:52:58,905:INFO - 🖋️ Predicting labels
2024-04-10 17:52:59,095:INFO - ✅ Prediction done!
2024-04-10 17:52:59,098:INFO - 👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
2024-04-10 17:52:59,099:INFO - ⛓️ Over-clustering input data with resolution set to 10
2024-04-10 17:53:07,923:INFO - 🗳️ Majority voting the predictions
2024-04-10 17:53:07,976:INFO - ✅ Majority voting done!
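The log above mentions over-clustering followed by majority voting. As a rough illustration of the idea (not CellTypist's actual implementation, which may apply additional rules), the sketch below assigns every cell the most frequent predicted label within its cluster. The function name and the toy labels and cluster ids are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(predicted_labels, cluster_ids):
    """Replace each cell's label with the most common label in its cluster."""
    labels_by_cluster = defaultdict(list)
    for label, cluster in zip(predicted_labels, cluster_ids):
        labels_by_cluster[cluster].append(label)
    # Pick the single most frequent label per cluster.
    winner = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }
    return [winner[cluster] for cluster in cluster_ids]


# Six toy cells in two clusters; one cell in cluster 0 is an outlier.
toy_labels = ["B cells", "B cells", "pDC", "DC2", "DC2", "DC2"]
toy_clusters = [0, 0, 0, 1, 1, 1]
voted = majority_vote(toy_labels, toy_clusters)
```

This explains why only 13 unique labels remain in `predictions.predicted_labels.majority_voting`: isolated per-cell predictions are smoothed away by their cluster's consensus.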
bt.CellType.inspect(adata.obs["cell_type_celltypist"]);
❗ received 13 unique terms, 13986 empty/duplicated terms are ignored
❗ 13 terms (100.00%) are not validated for name: Intermediate macrophages, B cells, Tem/Effector helper T cells, Non-classical monocytes, Regulatory T cells, Tcm/Naive helper T cells, Tem/Trm cytotoxic T cells, Tcm/Naive cytotoxic T cells, CD16+ NK cells, pDC, DC2, DC, Classical monocytes
   detected 2 CellType terms in Bionty as synonyms: 'pDC', 'DC2'
→  add records from Bionty to your CellType registry via .from_values()
   couldn't validate 13 terms: 'Regulatory T cells', 'pDC', 'Tcm/Naive cytotoxic T cells', 'DC', 'Intermediate macrophages', 'DC2', 'CD16+ NK cells', 'Tem/Effector helper T cells', 'Tem/Trm cytotoxic T cells', 'B cells', 'Tcm/Naive helper T cells', 'Classical monocytes', 'Non-classical monocytes'
→  if you are sure, create new records via ln.CellType() and save to your registry
adata.obs["cell_type_celltypist"] = bt.CellType.standardize(
    adata.obs["cell_type_celltypist"]
)
❗ found 2 synonyms in Bionty: ['pDC', 'DC2']
   please add corresponding CellType records via `.from_values(['plasmacytoid dendritic cell'])`
# Register cell type of found synonym
bt.CellType.from_public(name='plasmacytoid dendritic cell').save()
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
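Conceptually, standardization maps known synonyms to canonical names, and validation checks for an exact match against canonical names. The following is a minimal sketch under that assumption; `TOY_REGISTRY`, `standardize`, and `validate` are hypothetical stand-ins and do not reflect how bionty implements these methods. Only the `pDC` synonym is taken from the output above; the `B cell` synonyms are made up for illustration.

```python
# Hypothetical miniature registry: canonical name -> known synonyms.
TOY_REGISTRY = {
    "plasmacytoid dendritic cell": {"pDC"},
    "B cell": {"B lymphocyte", "B-cell"},  # illustrative synonyms
}


def build_synonym_index(registry):
    """Invert the registry into a synonym -> canonical name lookup."""
    index = {}
    for name, synonyms in registry.items():
        for synonym in synonyms:
            index[synonym] = name
    return index


def standardize(values, registry):
    """Map synonyms to canonical names; leave everything else unchanged."""
    index = build_synonym_index(registry)
    return [index.get(value, value) for value in values]


def validate(values, registry):
    """Return True for values that exactly match a canonical name."""
    return [value in registry for value in values]


raw = ["pDC", "B cells", "plasmacytoid dendritic cell"]
standardized = standardize(raw, TOY_REGISTRY)
validated = validate(standardized, TOY_REGISTRY)
```

In this toy run, `pDC` becomes `plasmacytoid dendritic cell`, while `B cells` stays unvalidated because it is neither a canonical name nor a listed synonym. This mirrors why most CellTypist labels above remain unvalidated.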
sc.pl.umap(
    adata,
    color=["cell_type_celltypist", "stim"],
    frameon=False,
    legend_fontsize=10,
    wspace=0.4,
)
... storing 'cell_type_celltypist' as categorical
[Figure: UMAP embedding colored by cell_type_celltypist and stim]

Analysis: Pathway enrichment analysis using Enrichr#

This analysis is based on the GSEApy scRNA-seq Example.

First, we compute differentially expressed genes using a Wilcoxon test between stimulated and control cells.

# compute differentially expressed genes
sc.tl.rank_genes_groups(
    adata,
    groupby="stim",
    use_raw=False,
    method="wilcoxon",
    groups=["STIM"],
    reference="CTRL",
)

rank_genes_groups_df = sc.get.rank_genes_groups_df(adata, "STIM")
rank_genes_groups_df.head()
   names     scores  logfoldchanges  pvals  pvals_adj
0  ISG15  99.456940        7.132795    0.0        0.0
1  ISG20  96.736771        5.074236    0.0        0.0
2   IFI6  94.973534        5.828891    0.0        0.0
3  IFIT3  92.482620        7.432498    0.0        0.0
4  IFIT1  90.699150        8.053581    0.0        0.0
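For intuition about the `scores` column, the sketch below computes a textbook rank-sum z-score with a normal approximation for a single gene: a large positive value means the gene ranks higher in the first group than in the reference. This is a simplified illustration without tie correction; the helper names and toy expression values are hypothetical, and Scanpy's implementation may differ in details.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_z(group, reference):
    """Normal-approximation z-score of the rank sum of `group` (no tie correction)."""
    n, m = len(group), len(reference)
    ranks = average_ranks(list(group) + list(reference))
    rank_sum = sum(ranks[:n])
    expected = n * (n + m + 1) / 2
    std_dev = math.sqrt(n * m * (n + m + 1) / 12)
    return (rank_sum - expected) / std_dev


# Toy log-expression values for one gene (hypothetical numbers).
stim_expr = [5.1, 4.8, 6.0, 5.5]
ctrl_expr = [0.2, 0.0, 1.1, 0.4]
z_score = rank_sum_z(stim_expr, ctrl_expr)
```

Swapping the two groups flips the sign of the score, which is why strongly induced genes such as ISG15 appear at the top of the table with large positive scores.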

Next, we select the significantly up- and down-regulated genes (adjusted p-value < 0.05):

degs_up = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] > 0)
    & (rank_genes_groups_df["pvals_adj"] < 0.05)
]
degs_dw = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] < 0)
    & (rank_genes_groups_df["pvals_adj"] < 0.05)
]

degs_up.shape, degs_dw.shape
((541, 5), (937, 5))
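The filter uses `pvals_adj` rather than raw p-values. Assuming the default Benjamini-Hochberg correction, the adjustment can be sketched as follows; the function name and the toy p-values are hypothetical, and this is an illustration of the procedure rather than Scanpy's code.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvals[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


toy_pvals = [0.001, 0.04, 0.03, 0.2]
toy_adjusted = benjamini_hochberg(toy_pvals)
significant = [p < 0.05 for p in toy_adjusted]
```

In this toy example, three raw p-values are below 0.05 but only the smallest one survives the correction, which shows why thresholding on adjusted values yields a more conservative gene list.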

Run pathway enrichment analysis on DEGs and plot top 10 pathways:

enr_up = gp.enrichr(degs_up.names, gene_sets="GO_Biological_Process_2023").res2d
gp.dotplot(enr_up, figsize=(2, 3), title="Up", cmap=plt.cm.autumn_r);
[Figure: Enrichr dot plot of enriched GO Biological Process terms, titled "Up"]
enr_dw = gp.enrichr(degs_dw.names, gene_sets="GO_Biological_Process_2023").res2d
gp.dotplot(enr_dw, figsize=(2, 3), title="Down", cmap=plt.cm.winter_r);
[Figure: Enrichr dot plot of enriched GO Biological Process terms, titled "Down"]
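Over-representation analyses of this kind commonly ask how surprising the overlap between a query gene list and a gene set is under random sampling. The sketch below computes a one-sided hypergeometric tail probability for that question. It is only an illustration of the general idea: the exact statistic and background used by Enrichr are not specified here, and the function name and all numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(overlap, gene_set_size, query_size, background_size):
    """P(X >= overlap) when drawing query_size genes from background_size genes,
    of which gene_set_size belong to the gene set."""
    max_overlap = min(gene_set_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(gene_set_size, k) * comb(background_size - gene_set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 8 of 10 pathway genes appear among 50 DEGs
# drawn from a background of 1000 genes.
p_enriched = hypergeom_enrichment_pvalue(
    overlap=8, gene_set_size=10, query_size=50, background_size=1000
)
# Requiring an overlap of at least 0 is always satisfied.
p_trivial = hypergeom_enrichment_pvalue(
    overlap=0, gene_set_size=10, query_size=50, background_size=1000
)
```

A small probability indicates that the pathway is represented among the DEGs far more often than expected by chance.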

Register analyzed dataset and annotate with metadata#

Register new features and labels:

new_features = ln.Feature.from_df(adata.obs)
ln.save(new_features)
new_labels = [ln.ULabel(name=i) for i in adata.obs["stim"].unique()]
ln.save(new_labels)
features = ln.Feature.lookup()

Register the dataset using an Artifact object:

artifact = ln.Artifact.from_anndata(
    adata,
    description="seurat_ifnb_activated_Bcells",
)
artifact.save()
artifact.features.add_from_anndata(
    var_field=bt.Gene.symbol,
    organism="human", # optionally, globally set organism via bt.settings.organism = "human"
)

Querying metadata#

artifact.describe()
Artifact(uid='ZU5duSgyIr5dwSU16eEX', suffix='.h5ad', accessor='AnnData', description='seurat_ifnb_activated_Bcells', size=214987291, hash='VIKQ1FPPt5W42c0IPQxD19', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-04-10 17:53:24 UTC)

Provenance:
  📎 storage: Storage(uid='YoUeR3xS', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/use-cases-registries', type='local', updated_at=2024-04-10 17:49:27 UTC, created_by_id=1)
  📎 transform: Transform(uid='hsPU1OENv0LS6K79', name='Standardize metadata on-the-fly', key='analysis-registries', version='0', type='notebook', updated_at=2024-04-10 17:50:13 UTC, created_by_id=1)
  📎 run: Run(uid='iXvY9NGDNSn9IVubhnyS', started_at=2024-04-10 17:50:13 UTC, is_consecutive=True, transform_id=1, created_by_id=1)
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-04-10 17:49:27 UTC)
Features:
  var: FeatureSet(uid='ZBPqiXq7FYhUrgMHMTxg', n=11279, type='number', registry='bionty.Gene', hash='qz7O5E1M_PUEnsch0480', updated_at=2024-04-10 17:53:24 UTC, created_by_id=1)
    'DPH2', 'RPS6KA5', 'IFT88', 'B3GLCT', 'SLAMF1', 'SEZ6', 'HDX', 'RABL3', 'SBF1', 'NOP2', 'SLC12A6', 'WHRN', 'KIF13A', 'PABIR1', 'SEPTIN1', 'OGG1', 'COPG1', 'RUBCNL', 'F8', 'SEC61B', ...
  obs: FeatureSet(uid='a03eIPhlJdYTDX4G3FD2', n=2, registry='core.Feature', hash='t4pw6xsnS9dQ06ieEdZy', updated_at=2024-04-10 17:53:24 UTC, created_by_id=1)
    🔗 stim (2, core.ULabel): 'STIM', 'CTRL'
    🔗 cell_type_celltypist (1, bionty.CellType): 'plasmacytoid dendritic cell'
  STIM-up-DEGs: FeatureSet(uid='8ek9p8RQUylcQX2pHepi', name='Up-regulated DEGs STIM vs CTRL', n=660, type='category', registry='bionty.Gene', hash='cacjBon01FqbJRaMBYQG', updated_at=2024-04-10 17:53:25 UTC, created_by_id=1)
    'IRF9', 'IRF9', 'RIPOR2', 'SCPEP1', 'PAIP1', 'SDS', 'THEMIS2', 'GBP3', 'BAG1', 'HES4', 'PARP12', 'MX2', 'IER2', 'APOBEC3B', 'IFI16', 'TMEM170A', 'CCDC50', 'NFE2L2', 'RAD9A', 'RSAD2', ...
  STIM-down-DEGs: FeatureSet(uid='PfJFT6gRRR7EH9DWnKzI', name='Down-regulated DEGs STIM vs CTRL', n=1094, type='category', registry='bionty.Gene', hash='rZNQozQGraSsn9g0aPR8', updated_at=2024-04-10 17:53:26 UTC, created_by_id=1)
    'HLA-DRA', 'CD27', 'SDC2', 'HLA-DRA', 'HLA-DRA', 'HLA-DRA', 'HLA-DRA', 'HLA-DRA', 'RGCC', 'HLA-DRA', 'HLA-DRA', 'PSMD13', 'BEX4', 'RNF130', 'SYF2', 'XRCC6', 'CHCHD10', 'SEPTIN1', 'CHCHD10', 'ATP6V1E1', ...
Labels:
  📎 cell_types (1, bionty.CellType): 'plasmacytoid dendritic cell'
  📎 ulabels (2, core.ULabel): 'STIM', 'CTRL'

Querying cell types#

Query for cell types that contain "B cell" in the name:

bt.CellType.filter(name__contains="B cell").df().head()
uid name ontology_id abbr synonyms description created_at updated_at public_source_id created_by_id
id
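The result is empty because no registered cell type name contains "B cell". The queries in this notebook use Django-style `field__lookup=value` keyword arguments. As a rough mental model, the sketch below imitates three such lookups on a list of dictionaries. `filter_records` and the toy records are hypothetical, and relation-spanning lookups such as `genes__symbol` are not handled. Note that Django documents `contains` as case-sensitive and `icontains` as case-insensitive, although some database backends (for example SQLite) treat both case-insensitively.

```python
def filter_records(records, **lookups):
    """Tiny imitation of Django-style `field__lookup=value` filtering."""

    def matches(record, key, expected):
        field, _, lookup = key.partition("__")
        value = record[field]
        if lookup in ("", "exact"):
            return value == expected
        if lookup == "contains":
            return expected in value
        if lookup == "icontains":
            return expected.lower() in value.lower()
        raise ValueError(f"unsupported lookup: {lookup}")

    return [
        record
        for record in records
        if all(matches(record, key, expected) for key, expected in lookups.items())
    ]


# Hypothetical records; the lowercase entry is only there to show case handling.
toy_cell_types = [
    {"name": "plasmacytoid dendritic cell"},
    {"name": "B cell"},
    {"name": "memory b cell"},
]
exact_case = filter_records(toy_cell_types, name__contains="B cell")
any_case = filter_records(toy_cell_types, name__icontains="b cell")
```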

Querying for all artifacts annotated with a cell type:

celltypes = bt.CellType.lookup()
celltypes.plasmacytoid_dendritic_cell
Private registry
Entity: CellType
📖 .df(): reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
ln.Artifact.filter(cell_types=celltypes.plasmacytoid_dendritic_cell).df()
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
1 ZU5duSgyIr5dwSU16eEX 1 None .h5ad AnnData seurat_ifnb_activated_Bcells None 214987291 VIKQ1FPPt5W42c0IPQxD19 sha1-fl None None 1 1 1 True 2024-04-10 17:53:23.133926+00:00 2024-04-10 17:53:24.682369+00:00 1

Querying pathways#

Query for pathways that contain "interferon-beta" in the name:

bt.Pathway.filter(name__contains="interferon-beta").df()
uid name ontology_id abbr synonyms description public_source_id created_at updated_at created_by_id
id
684 1l4z0v8W cellular response to interferon-beta GO:0035458 None cellular response to fibroblast interferon|cel... Any Process That Results In A Change In State ... 48 2024-04-10 17:49:36.608349+00:00 2024-04-10 17:49:36.608358+00:00 1
2130 1NzHDJDi negative regulation of interferon-beta production GO:0032688 None down regulation of interferon-beta production|... Any Process That Stops, Prevents, Or Reduces T... 48 2024-04-10 17:49:36.756949+00:00 2024-04-10 17:49:36.756958+00:00 1
3127 3x0xmK1y positive regulation of interferon-beta production GO:0032728 None positive regulation of IFN-beta production|up-... Any Process That Activates Or Increases The Fr... 48 2024-04-10 17:49:36.861406+00:00 2024-04-10 17:49:36.861415+00:00 1
4334 54R2a0el regulation of interferon-beta production GO:0032648 None regulation of IFN-beta production Any Process That Modulates The Frequency, Rate... 48 2024-04-10 17:49:36.986651+00:00 2024-04-10 17:49:36.986660+00:00 1
4953 3VZq4dMe response to interferon-beta GO:0035456 None response to fiblaferon|response to fibroblast ... Any Process That Results In A Change In State ... 48 2024-04-10 17:49:37.168189+00:00 2024-04-10 17:49:37.168198+00:00 1

Query pathways from a gene:

bt.Pathway.filter(genes__symbol="KIR2DL1").df()
uid name ontology_id abbr synonyms description public_source_id created_at updated_at created_by_id
id
1346 7S7qlEkG immune response-inhibiting cell surface recept... GO:0002767 None immune response-inhibiting cell surface recept... The Series Of Molecular Signals Initiated By A... 48 2024-04-10 17:49:36.675887+00:00 2024-04-10 17:49:36.675896+00:00 1

Query artifacts from a pathway:

ln.Artifact.filter(feature_sets__pathways__name__icontains="interferon-beta").first()
Artifact(uid='ZU5duSgyIr5dwSU16eEX', suffix='.h5ad', accessor='AnnData', description='seurat_ifnb_activated_Bcells', size=214987291, hash='VIKQ1FPPt5W42c0IPQxD19', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-04-10 17:53:24 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)

Query feature sets from a pathway to learn which gene set the pathway was computed from:

pathway = bt.Pathway.filter(ontology_id="GO:0035456").one()
pathway
Private registry
Entity: Pathway
📖 .df(): reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
degs = ln.FeatureSet.filter(pathways__ontology_id=pathway.ontology_id).one()

Now we can get the list of genes that are differentially expressed and belong to this pathway:

contributing_genes = pathway.genes.all() & degs.genes.all()
contributing_genes.list("symbol")
['IFITM2',
 'AIM2',
 'BST2',
 'SHFL',
 'PNPT1',
 'PLSCR1',
 'CALM1',
 'IRF1',
 'OAS1',
 'IFITM1',
 'MNDA',
 'XAF1',
 'IFITM3',
 'STAT1',
 'IFI16']
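The `&` operator above intersects two query sets, which is conceptually the same as a set intersection. A minimal sketch with plain Python sets follows; STAT1, IRF1, OAS1, and ISG15 appear in the outputs above, while IFNAR1 is included purely as an illustrative pathway member.

```python
# Genes annotated to the pathway (IFNAR1 is an illustrative assumption).
pathway_genes = {"STAT1", "IRF1", "OAS1", "IFNAR1"}
# Differentially expressed genes from the analysis (toy subset).
deg_genes = {"STAT1", "IRF1", "OAS1", "ISG15"}

# Genes that are both in the pathway and differentially expressed.
contributing = sorted(pathway_genes & deg_genes)
```

Genes that are only in one of the two sets (here IFNAR1 and ISG15) drop out, leaving the genes that actually drove the enrichment of this pathway.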
# clean up test instance
!lamin delete --force use-cases-registries
!rm -r ./use-cases-registries
💡 deleting instance testuser1/use-cases-registries
❗ manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/use-cases-registries