scQUEST Tutorial

In this tutorial we give detailed step-by-step instructions on how to use scQUEST to explore cellular heterogeneity in large mass or flow cytometry datasets. Often, these “atlas” datasets consist of single-cell profiles from multiple patients that collectively describe disease ecosystems. We showcase scQUEST using our previously generated single-cell atlas of breast cancer heterogeneity (Wagner et al., 2019). The dataset contains mass cytometry (CyTOF) measurements of 226 samples from 163 breast cancer patients, with samples originating either from the tumor (T) or from juxta-tumoral non-tumor tissue (N). Additional measurements of mammoplasty samples (H) are also available.

Import needed packages

import scQUEST as scq
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, roc_curve
import warnings

warnings.filterwarnings("ignore")

Step 1: Load and explore CyTOF dataset

scQUEST is built on AnnData, a popular Python library for single-cell analysis. The CyTOF dataset that we will use for this tutorial is included in scQUEST and can be loaded directly as an AnnData object. If you have never used AnnData before, we highly recommend working through the introductory tutorial in the AnnData online documentation to familiarize yourself with the structure of the AnnData object. A basic understanding of numpy and pandas will also help you explore the data stored in the AnnData object.

Exploring the AnnData object, we see that it contains 13,384,828 single-cell measurements of 68 channels (stored in .X), with channel annotation stored in .var. Cell-level annotation is stored in .obs and includes various patient metadata, such as tissue type of origin (H, N, T), histopathology, grade, and patient ID (patient_number).

ad = scq.dataset.breastCancerAtlasRaw()
ad
AnnData object with n_obs × n_vars = 13384828 × 68
    obs: 'tissue_type', 'patient_number', 'breast', 'tumor_region', 'plate', 'plate_location', 'gadolinium_status', 'fcs_file', 'Health Status', 'Gender', 'Age at Surgery', 'Menopause Status', 'Histopathology', 'T Staging', 'N Staging', 'M Staging', 'Grade', 'Grade comment', 'Lymph Node Status (L)', 'Resection', 'Estrogen Receptor Status by IHC', 'Estrogen Receptor IRS Score', 'Estrogen Receptor percent positive cells by IHC', 'Progesterone Receptor Status by IHC', 'Progesterone Receptor IRS Score', 'Progesterone Receptor percent positive cells by IHC', 'HER2 Status by IHC', 'HER2 IHC Score', 'HER2 Status by FISH or SISH', 'Ki-67 percent positive cells by IHC', 'Clinical subtype derived from the receptor status', 'Neoadjuvant Therapy Received', 'Previous Cancer Incidences'
    var: 'channel', 'desc'
    uns: 'fcs_header'
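If you are new to AnnData, a few pandas-style calls are enough to get oriented; the short exploratory sketch below is ours and its output is omitted here:

# peek at the cell-level metadata and the channel annotation
ad.obs.head()
ad.var.head()
ad.X.shape  # (13384828, 68) measurement matrix
ad.obs["tissue_type"].value_counts()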

We can easily see that the data contain 226 files originating from 163 patients:

print("Number of unique files: " + str(len(ad.obs["fcs_file"].unique())))
print("Number of unique patients: " + str(len(ad.obs["patient_number"].unique())))
Number of unique files: 226
Number of unique patients: 163

Step 2: Cell type identification

In this step we show how one can use scQUEST to identify cell populations of interest among millions of highly multiplexed single-cell profiles. We are specifically interested in identifying all cells with an epithelial phenotype among the ~13.5 million cells. To achieve this, we will first load a smaller, already annotated dataset (ad_anno) that contains 687,161 cells that have previously been clustered and annotated into distinct cell types. This dataset will serve as training data for a neural network classifier that can in turn be used to classify all ~13.5 million cells as epithelial or non-epithelial.

Load and explore the pre-annotated dataset ad_anno

ad_anno = scq.dataset.breastCancerAtlas()
ad_anno
AnnData object with n_obs × n_vars = 687161 × 68
    obs: 'tissue_type', 'patient_number', 'breast', 'tumor_region', 'plate', 'plate_location', 'fcs_file', 'cluster', 'celltype', 'celltype_class', 'Health Status', 'Gender', 'Age at Surgery', 'Menopause Status', 'Histopathology', 'T Staging', 'N Staging', 'M Staging', 'Grade', 'Grade comment', 'Lymph Node Status (L)', 'Resection', 'Estrogen Receptor Status by IHC', 'Estrogen Receptor IRS Score', 'Estrogen Receptor percent positive cells by IHC', 'Progesterone Receptor Status by IHC', 'Progesterone Receptor IRS Score', 'Progesterone Receptor percent positive cells by IHC', 'HER2 Status by IHC', 'HER2 IHC Score', 'HER2 Status by FISH or SISH', 'Ki-67 percent positive cells by IHC', 'Clinical subtype derived from the receptor status', 'Neoadjuvant Therapy Received', 'Previous Cancer Incidences'
    var: 'channel', 'desc'
    uns: 'fcs_header'

The single-cell data have previously been clustered into 42 clusters (cluster IDs are stored in obs['cluster']), and the clusters have been annotated with celltype and celltype_class labels, also included in .obs:

print("Number of unique clusters: " + str(len(ad_anno.obs["cluster"].unique())))
Number of unique clusters: 42
print("Number of cells per cell type:\n" + str(ad_anno.obs["celltype"].value_counts()))
Number of cells per cell type:
Leukocyte      211762
Epithelial     202882
Apoptotic      100724
Other           99435
Fibroblast      50045
Endothelial     22313
Name: celltype, dtype: int64
print(
    "Epithelial/non-epithelial cells:\n"
    + str(ad_anno.obs["celltype_class"].value_counts())
)
Epithelial/non-epithelial cells:
nonepithelial    484279
epithelial       202882
Name: celltype_class, dtype: int64

Prepare dataset for classification

Let’s prepare the dataset for classification by first selecting a list of markers that we will use for training the classifier.

# define the markers used in the classifier
marker = scq.utils.DEFAULT_MARKER_CLF
print(marker)
['139La_H3K27me3', '141Pr_K5', '142Nd_PTEN', '143Nd_CD44', '144Nd_K8K18', '145Nd_CD31', '146Nd_FAP', '147Sm_cMYC', '148Nd_SMA', '149Sm_CD24', '150Nd_CD68', '151Eu_HER2', '152Sm_AR', '153Eu_BCL2', '154Sm_p53', '155Gd_EpCAM', '156Gd_CyclinB1', '158Gd_PRB', '159Tb_CD49f', '160Gd_Survivin', '161Dy_EZH2', '162Dy_Vimentin', '163Dy_cMET', '164Dy_AKT', '165Ho_ERa', '166Er_CA9', '167Er_ECadherin', '168Er_Ki67', '169Tm_EGFR', '170Er_K14', '171Yb_HLADR', '172Yb_clCASP3clPARP1', '173Yb_CD3', '174Yb_K7', '175Lu_panK', '176Yb_CD45']

To train your model on a custom list of markers, replace 'marker1', 'marker2', etc. in the cell below with your desired marker names.

Attention: marker names should match the entries in ad_anno.var['desc'].

# marker = ['marker1','marker2','marker3'] <------remove comment and replace with your marker names

Next, we flag the markers in ad_anno.var that are used for the classification and subset the AnnData object to contain only these markers. In this example dataset the marker name is given by the desc column.

# get only the channels needed for classification
mask = []
for m in marker:
    mask.append(ad_anno.var.desc.str.contains(m))
mask = pd.concat(mask, axis=1)
mask = mask.any(axis=1)
ad_anno.var["used_in_clf"] = mask
ad_anno = ad_anno[:, ad_anno.var.used_in_clf]
ad_anno.obs["is_epithelial"] = (ad_anno.obs.celltype_class == "epithelial").astype(int)

Apply the inverse hyperbolic sine (arcsinh) transformation with a cofactor of 5 and save the transformed data in a new layer of the AnnData object:
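In formula form, with cofactor $c = 5$ each raw intensity $x$ is mapped to $x' = \operatorname{arcsinh}(x/c) = \ln\big(x/c + \sqrt{(x/c)^2 + 1}\big)$, a transformation that is approximately linear for small intensities and logarithmic for large ones.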

# get raw data and arcsinh-transform using a cofactor of 5
X = ad_anno.X.copy()
cofactor = 5
np.divide(X, cofactor, out=X)
np.arcsinh(X, out=X)

# add transformed data to a separate layer in the annData object
ad_anno.layers["arcsinh"] = X

# min-max normalization
minMax = MinMaxScaler()
X = minMax.fit_transform(X)
ad_anno.layers["arcsinh_norm"] = X

Before we train the classifier, let’s visualize the data distribution per class.

# create a pandas dataframe with data for plotting with seaborn
df = pd.DataFrame(data=ad_anno.layers["arcsinh"], columns=ad_anno.var.desc)
df.insert(0, "epithelial", ad_anno.obs.is_epithelial.values)
# sample 10,000 cells per category
df_plot = df.groupby("epithelial").sample(n=10000)

fig = plt.figure(figsize=(20, 20))
for m in range(1, df_plot.shape[1]):
    plt.subplot(6, 6, m)
    sns.histplot(data=df_plot, x=df_plot.columns[m], hue="epithelial", stat="percent")
    plt.title(df_plot.columns[m])
    plt.xlim([-0.5, 8])
    plt.xlabel("")

plt.savefig("marker_hist_annotated.pdf", dpi=300)
plt.show()

[Figure: per-marker arcsinh intensity histograms, epithelial vs. non-epithelial (annotated data)]

Train the neural network classifier

We will next train a simple neural network classifier using the annotated data. We first initialize the model, setting the dimension of the input layer equal to the number of markers in the training data. By default, the model architecture consists of one hidden layer of 20 neurons with a ReLU activation function and an output layer of two neurons (one per class). The dataset is split in a stratified fashion into training (50%), validation (25%), and test (25%) sets. The classifier is trained using the scaled conjugate gradient method, and its performance is evaluated using a standard cross-entropy loss function. We also use an early stopping criterion, terminating training when the validation loss fails to improve for 10 consecutive epochs. For reproducibility, you can optionally set a seed.
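For orientation, here is a minimal PyTorch sketch of an architecture matching this description (36 input markers, one hidden layer of 20 ReLU units, 2 output logits). This is our illustration only, not scQUEST’s actual DefaultCLF implementation, whose internals may differ.

import torch.nn as nn

# sketch of a classifier with the described shape (not scQUEST's DefaultCLF)
sketch_clf = nn.Sequential(
    nn.Linear(36, 20),  # hidden layer with 20 neurons
    nn.ReLU(),
    nn.Linear(20, 2),   # two output logits, one per class
)
print(sum(p.numel() for p in sketch_clf.parameters()))  # 782 parameters, matching the summary below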

# define early stopping criterion
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

es = EarlyStopping(monitor="val_loss", mode="min", min_delta=1e-3, patience=10)
# initialize model and train classifier
clf = scq.Classifier(n_in=ad_anno.shape[1], seed=1)
clf.fit(
    ad_anno,
    layer="arcsinh_norm",
    target="is_epithelial",
    callbacks=[es],
    max_epochs=20,
    seed=1,
)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name             | Type             | Params
------------------------------------------------------
0 | model            | DefaultCLF       | 782   
1 | loss             | CrossEntropyLoss | 0     
2 | metric_accuracy  | Accuracy         | 0     
3 | metric_f1score   | F1Score          | 0     
4 | metric_precision | Precision        | 0     
5 | metric_recall    | Recall           | 0     
------------------------------------------------------
782       Trainable params
0         Non-trainable params
782       Total params
0.003     Total estimated model params size (MB)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss          0.014771309681236744
  test_metric_accuracy      0.9948338866233826
   test_metric_f1score      0.9948338866233826
  test_metric_precision     0.9948338866233826
   test_metric_recall       0.9948338866233826
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

The model achieves 99.5% accuracy and 99.5% precision on the test set. Let’s explore its performance in more detail by plotting the loss curves, the confusion matrix, and the ROC curve:

# examine model performance
hist = clf.logger.history
fit_loss = pd.DataFrame.from_records(hist["fit_loss"], columns=["step", "loss"]).assign(
    stage="fit"
)
val_loss = pd.DataFrame.from_records(hist["val_loss"], columns=["step", "loss"]).assign(
    stage="validation"
)
test_loss = pd.DataFrame.from_records(
    hist["test_loss"], columns=["step", "loss"]
).assign(stage="test")
loss = pd.concat((fit_loss, val_loss, test_loss))
loss = loss.reset_index(drop=True)

fig, ax = plt.subplots(figsize=(10, 6))
for stage in ["fit", "validation", "test"]:
    if stage == "test":
        ax.plot(
            [0, ax.get_xlim()[1]],
            [loss[loss.stage == stage].loss, loss[loss.stage == stage].loss],
            label="test",
        )
    else:
        ax.plot(
            loss[loss.stage == stage].step,
            loss[loss.stage == stage].loss,
            ".-",
            label=stage,
        )
plt.xlabel("Training steps")
plt.ylabel("Loss")
ax.legend()
plt.savefig("loss.pdf", dpi=300)
<matplotlib.legend.Legend at 0x7fb5eae1c100>

[Figure: training, validation, and test loss]

# %% confusion matrix
dl = clf.datamodule.test_dataloader()
data = dl.dataset.dataset.data
y = dl.dataset.dataset.targets
yhat = clf.model(data)

m = confusion_matrix(y, yhat, normalize="pred")
m = pd.DataFrame(
    m, columns=["non-epithelial", "epithelial"], index=["non-epithelial", "epithelial"]
)
m.index.name = "True"
m.columns.name = "Predicted"
sns.heatmap(m.T * 100, annot=True, fmt=".3f")
plt.savefig("confusion.pdf", dpi=300)

plt.show()

[Figure: confusion matrix heatmap]

# %% plot ROC curve
import torch

yhat = clf.model(data)
yhat1 = clf.model.model(data).detach()  # here we access the actual pytorch model
yhat1 = torch.softmax(yhat1, dim=1)  # compute probabilities
tmp = np.array([yhat1[i][yhat[i]] for i in range(len(yhat))])
score = np.zeros_like(yhat, dtype=float) - 1
score[yhat == 1] = tmp[yhat == 1]
score[yhat == 0] = 1 - tmp[yhat == 0]
fpr, tpr, thresholds = roc_curve(y, score)
plt.plot(fpr, tpr, ".-")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot([0, 1], [0, 1], "b:")
plt.title("ROC Curve")
plt.savefig("roc.pdf", dpi=300)

plt.show()

[Figure: ROC curve]

Classify whole dataset

Once we are confident about the model performance, we can feed the whole dataset to the classifier to predict the epithelial cells:

# prepare the whole dataset for celltype classification
mask = []
for m in marker:
    mask.append(ad.var.desc.str.contains(m))
mask = pd.concat(mask, axis=1)
mask = mask.any(axis=1)
ad.var["used_in_clf"] = mask
ad_pred = ad[:, ad.var.used_in_clf]
X = ad_pred.X.copy()

# arcsinh transformation and minmax scaling
cofactor = 5
np.divide(X, cofactor, out=X)
np.arcsinh(X, out=X)
ad_pred.layers["arcsinh"] = X
X = minMax.transform(X)
ad_pred.layers["arcsinh_norm"] = X

# feed the data to the classifier
clf.predict(ad_pred, layer="arcsinh_norm")
ad.obs["is_epithelial"] = ad_pred.obs.clf_is_epithelial.values
ad.obs["is_epithelial"].value_counts()
0    9427458
1    3957370
Name: is_epithelial, dtype: int64

Optionally, save all epithelial cells to a .csv file for downstream processing:

mask = ad.obs["is_epithelial"] == 1
temp = pd.DataFrame(data=ad.X[mask], columns=ad.var["desc"])
temp.to_csv("all_epithelial_cells.csv")
ad.obs[mask].to_csv("epithelial_metadata.csv")

Since we do not have cell-level annotations for the large dataset, we can evaluate the predictions by plotting the marker distributions of the predicted classes and comparing them with those of the annotated dataset:

# create pandas dataframes for plotting with seaborn
n = 30000
df1 = pd.DataFrame(data=ad_anno.layers["arcsinh"], columns=ad_anno.var.desc)
df1.insert(0, "annotated", "true")
df1.insert(0, "epithelial", ad_anno.obs.is_epithelial.values)
df_plot1 = df1.groupby("epithelial").sample(n=n)

df2 = pd.DataFrame(data=ad_pred.layers["arcsinh"], columns=ad_pred.var.desc)
df2.insert(0, "annotated", "predicted")
df2.insert(0, "epithelial", ad_pred.obs.clf_is_epithelial.values)
df_plot2 = df2.groupby("epithelial").sample(n=n)
df_plot = pd.concat([df_plot1, df_plot2])
df_plot = df_plot.sample(frac=1, replace=False, ignore_index=True)
multilabel = df_plot["epithelial"].astype(str) + df_plot["annotated"]
# plot marker distributions across all epithelial cells (predicted vs. annotated)
data = df_plot[df_plot["epithelial"] == 1].iloc[:, 1:].reset_index(drop=True)
fig = plt.figure(figsize=(20, 20))
for m in range(1, data.shape[1]):
    plt.subplot(6, 6, m)
    sns.histplot(data=data, x=data.columns[m], hue="annotated", stat="percent")
    plt.title(data.columns[m])
    plt.xlim([-0.5, 8])
    plt.xlabel("")
plt.show()

# plot marker distributions across all non-epithelial cells (predicted vs. annotated)
data = df_plot[df_plot["epithelial"] == 0].iloc[:, 1:].reset_index(drop=True)
fig = plt.figure(figsize=(20, 20))
for m in range(1, data.shape[1]):
    plt.subplot(6, 6, m)
    sns.histplot(data=data, x=data.columns[m], hue="annotated", stat="percent")
    plt.title(data.columns[m])
    plt.xlim([-0.5, 8])
    plt.xlabel("")
plt.show()

[Figure: marker distributions of epithelial cells, predicted vs. annotated]

[Figure: marker distributions of non-epithelial cells, predicted vs. annotated]

import umap

Y1 = umap.UMAP(n_neighbors=50, min_dist=0.5, metric="euclidean").fit_transform(
    df_plot.iloc[:, 2:]
)
labels = list(
    map(
        lambda x: x.replace("0", "non-epithelial_").replace("1", "epithelial_"),
        multilabel,
    )
)
labels_uniq = np.unique(labels)
palette = dict(zip(labels_uniq, ["blue", "lightblue", "orange", "peachpuff"]))
fig = plt.figure(figsize=(10, 8))
sns.scatterplot(
    x=Y1[:, 0], y=Y1[:, 1], s=1, hue=labels, palette=palette, hue_order=palette.keys()
)
plt.xticks([])
plt.yticks([])
plt.xlabel("UMAP_1")
plt.ylabel("UMAP_2")
plt.savefig("umap.png", dpi=300)
plt.show()

[Figure: UMAP embedding of epithelial and non-epithelial cells, annotated vs. predicted]

del ad_pred
del ad_anno

Finally, we can plot the proportion of epithelial cells per sample, stratified by tissue type:

# plot fraction of epithelial cells
data = ad.obs.copy()
frac_epith = (
    data.groupby(["tissue_type", "breast", "patient_number"])
    .is_epithelial.mean()
    .dropna()
    .reset_index()
)
fig, ax = plt.subplots()
sns.boxplot(y="is_epithelial", x="tissue_type", data=frac_epith, ax=ax, whis=[0, 100])
sns.stripplot(y="is_epithelial", x="tissue_type", data=frac_epith, ax=ax, color=".3")
<AxesSubplot:xlabel='tissue_type', ylabel='is_epithelial'>

[Figure: fraction of epithelial cells per sample, by tissue type]

Step 3: Phenotypic abnormality

Prepare dataset

Let’s start by selecting the list of patients that will be used as a reference input dataset to train the Abnormality autoencoder. Here, we use only patients for which we have both a juxta-tumoral and a tumor sample. We then select a set of markers that will serve as the features of the input data.

If using the protocol for other applications, the samples used as reference could be normal vs. disease, untreated vs. treated, etc.

patients = [
    "N_BB013",
    "N_BB028",
    "N_BB034",
    "N_BB035",
    "N_BB037",
    "N_BB046",
    "N_BB051",
    "N_BB055",
    "N_BB058",
    "N_BB064",
    "N_BB065",
    "N_BB072",
    "N_BB073",
    "N_BB075",
    "N_BB076",
    "N_BB090",
    "N_BB091",
    "N_BB093",
    "N_BB094",
    "N_BB096",
    "N_BB099",
    "N_BB101",
    "N_BB102",
    "N_BB110",
    "N_BB120",
    "N_BB131",
    "N_BB144",
    "N_BB147",
    "N_BB153",
    "N_BB154",
    "N_BB155",
    "N_BB167",
    "N_BB192",
    "N_BB194",
    "N_BB197",
    "N_BB201",
    "N_BB204",
    "N_BB209",
    "N_BB210",
    "N_BB214",
    "N_BB221",
]
patients = [i.split("N_")[1] for i in patients]

markers = set(
    [
        "Vol7",
        "H3K27me3",
        "K5",
        "PTEN",
        "CD44",
        "K8K18",
        "cMYC",
        "SMA",
        "CD24",
        "HER2",
        "AR",
        "BCL2",
        "p53",
        "EpCAM",
        "CyclinB1",
        "PRB",
        "CD49f",
        "Survivin",
        "EZH2",
        "Vimentin",
        "cMET",
        "AKT",
        "ERa",
        "CA9",
        "ECadherin",
        "Ki67",
        "EGFR",
        "K14",
        "HLADR",
        "K7",
        "panK",
    ]
)
drop_markers = set(["Vol7", "H3K27me3", "CyclinB1", "Ki67"])
markers = markers - drop_markers

mask = []
for m in markers:
    mask.append(ad.var.desc.str.contains(m))
mask = pd.concat(mask, axis=1)
mask = mask.any(axis=1)
ad.var["used_in_abnormality"] = mask

Next, subset the whole dataset to only include epithelial cells from the selected patients and juxta-tumoral (N) tissue, and preprocess the data as before:

ad_train = ad[
    (ad.obs.patient_number.isin(patients))
    & (ad.obs.tissue_type == "N")
    & (ad.obs.is_epithelial == 1),
    ad.var.used_in_abnormality,
]

X = ad_train.X.copy()
cofactor = 5
np.divide(X, cofactor, out=X)
np.arcsinh(X, out=X)
minMax = MinMaxScaler()
X = minMax.fit_transform(X)
ad_train.layers["arcsinh_norm"] = X

Train the Abnormality autoencoder

# Initialize and fit the model
Abn = scq.Abnormality(n_in=ad_train.shape[1])
Abn.fit(ad_train, layer="arcsinh_norm", max_epochs=20)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name                    | Type             | Params
-------------------------------------------------------------
0 | model                   | DefaultAE        | 650   
1 | loss                    | MSELoss          | 0     
2 | metric_meansquarederror | MeanSquaredError | 0     
-------------------------------------------------------------
650       Trainable params
0         Non-trainable params
650       Total params
0.003     Total estimated model params size (MB)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        Test metric                 DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
         test_loss              0.007171750068664551
test_metric_meansquarederror     0.0071717519313097
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Note: the abnormality model (Abn.model) predicts the abnormality, i.e. the reconstruction error $X - F(X)$, where $F$ is the autoencoder (prediction results are saved to ad.layers['abnormality']). The base torch model (Abn.model.model), on the other hand, predicts the reconstruction $F(X)$.

# estimate reconstruction error on the training data
Abn.predict(ad_train, layer="arcsinh_norm")
mse_train = (ad_train.layers["abnormality"] ** 2).mean(axis=1)
ad_train.obs["abnormality"] = mse_train

y = ad_train.layers["arcsinh_norm"]
yhat = Abn.model.model(torch.tensor(y)).detach().numpy()  # access base torch model
err = ad_train.layers["abnormality"]

fig, axs = plt.subplots(1, 3, figsize=(14, 6))
for ax, dat, title in zip(
    axs.flat, [y, yhat, err], ["Input", "Reconstruction", "Error"]
):
    ax: plt.Axes
    cmap = "seismic" if title == "Error" else "magma"
    vmin = -1 if title == "Error" else 0
    im = ax.imshow(dat, cmap=cmap, vmin=vmin, vmax=1)
    ax.set_aspect(dat.shape[1] * 2 / dat.shape[0])
    ax.set_xticks(range(28), labels=ad_train.var["desc"].values, rotation="vertical")
    ax.set_xlabel("features")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
axs[0].set_ylabel("cells")
fig.show()

[Figure: input, reconstruction, and error heatmaps for the training data]

hist = Abn.logger.history
fit_loss = pd.DataFrame.from_records(hist["fit_loss"], columns=["step", "loss"]).assign(
    stage="fit"
)
val_loss = pd.DataFrame.from_records(hist["val_loss"], columns=["step", "loss"]).assign(
    stage="validation"
)
test_loss = pd.DataFrame.from_records(
    hist["test_loss"], columns=["step", "loss"]
).assign(stage="test")
loss = pd.concat((fit_loss, val_loss, test_loss))
loss = loss.reset_index(drop=True)

fig, ax = plt.subplots(figsize=(10, 6))
for stage in ["fit", "validation", "test"]:
    if stage == "test":
        ax.plot(
            [0, ax.get_xlim()[1]],
            [loss[loss.stage == stage].loss, loss[loss.stage == stage].loss],
            label="test",
        )
    else:
        ax.plot(
            loss[loss.stage == stage].step,
            loss[loss.stage == stage].loss,
            ".-",
            label=stage,
        )
plt.xlabel("Training steps")
plt.ylabel("Loss")
ax.legend()
<matplotlib.legend.Legend at 0x7fb5eddcc1c0>

[Figure: training, validation, and test loss of the autoencoder]

Reconstruct all cells

Now let’s use the model to reconstruct all epithelial cells. The per-cell reconstruction errors are stored in the layer abnormality, and we define the cell-level abnormality as the mean squared error of the reconstruction.
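Concretely, for a cell with marker vector $x$ over the $M$ selected markers (here $M = 28$), the cell-level abnormality is $a = \frac{1}{M}\sum_{j=1}^{M}\big(x_j - F(x)_j\big)^2$, i.e. the mean of the squared entries of the abnormality layer, which is exactly what the mse computation below does.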

# %% pre-process for Abnormality prediction
ad_pred = ad[ad.obs.is_epithelial == 1, ad.var.used_in_abnormality]
X = ad_pred.X.copy()
cofactor = 5
np.divide(X, cofactor, out=X)
np.arcsinh(X, out=X)
X = minMax.transform(X)
ad_pred.layers["arcsinh_norm"] = X

# estimate reconstruction error
Abn.predict(ad_pred, layer="arcsinh_norm")
mse = (ad_pred.layers["abnormality"] ** 2).mean(axis=1)
ad_pred.obs["abnormality"] = mse

To evaluate the results, let’s reconstruct all single cells and visualize the prediction residuals:

y = ad_pred.layers["arcsinh_norm"]
yhat = Abn.model.model(torch.tensor(y)).detach().numpy()  # access base torch model
err = ad_pred.layers["abnormality"]
idx = np.nonzero((ad_pred.obs["tissue_type"] == "N").values)[0]
y1, y2 = idx.min(), idx.max()
fig, axs = plt.subplots(1, 3, figsize=(20, 10))
for ax, dat, title in zip(
    axs.flat, [y, yhat, err], ["Input", "Reconstruction", "Error"]
):
    ax: plt.Axes
    cmap = "seismic" if title == "Error" else "magma"
    vmin = -1 if title == "Error" else 0
    im = ax.imshow(dat, cmap=cmap, vmin=vmin, vmax=1)
    ax.hlines(y1, -0.5, 27.5)  # mark the block of juxta-tumoral (N) cells on each panel
    ax.hlines(y2, -0.5, 27.5)
    ax.set_aspect(dat.shape[1] * 2 / dat.shape[0])
    ax.set_xticks(range(28), labels=ad_pred.var["desc"].values, rotation="vertical")
    ax.set_xlabel("features")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
axs[0].set_ylabel("cells")
plt.savefig("reconstruction.pdf", dpi=300)
fig.show()

[Figure: input, reconstruction, and error heatmaps for all epithelial cells]

Now let’s visualize the abnormality scores across different clinical variables:

clinical_data = [
    "tissue_type",
    "Grade",
    "Estrogen Receptor Status by IHC",
    "Clinical subtype derived from the receptor status",
]
order = [
    ["H", "N", "T"],
    ["G1", "G2", "G3"],
    ["P", "N"],
    ["LumA", "LumB", "LumB-HER2", "HER2", "TN"],
]

fig, axs = plt.subplots(1, 4, figsize=(20, 8), sharey=True)
for i in range(4):
    dat = (
        ad_pred.obs.groupby([clinical_data[i], "breast", "patient_number"])
        .abnormality.agg(np.median)
        .dropna()
        .reset_index()
    )
    sns.boxplot(
        data=dat,
        x=clinical_data[i],
        y="abnormality",
        order=order[i],
        whis=[0, 100],
        ax=axs.flat[i],
    )
    sns.stripplot(
        data=dat, x=clinical_data[i], y="abnormality", order=order[i], ax=axs.flat[i]
    )
plt.savefig("abnormality_boxplots.pdf", dpi=300)

[Figure: median abnormality per sample, by clinical variable]

Step 4: Tumor individuality

To compute individuality, let’s start by preparing the dataset. Because the k-NN graph construction is computationally intensive, we randomly select a subset of cells from each sample.

# preprocessing
ad_indiv = ad[ad.obs.is_epithelial == 1, ad.var.used_in_abnormality]
X = ad_indiv.X.copy()
cofactor = 5
np.divide(X, cofactor, out=X)
np.arcsinh(X, out=X)
ad_indiv.layers["arcsinh"] = X

# generate a sample identifier to enable subsampling
ad_indiv.obs["sample_id"] = ad_indiv.obs.groupby(
    ["tissue_type", "breast", "patient_number"]
).ngroup()

# sub-sampling
tmp = ad_indiv.obs.groupby(["sample_id"]).indices
n_cells = 200
indices = []
for key, item in tmp.items():
    size = min(len(item), n_cells)
    idx = np.random.choice(range(len(item)), size, replace=False)
    indices.extend(item[idx])
indices = np.array(indices)
ad_indiv = ad_indiv[indices]

Let us initialize the individuality model and use it to predict the scores:

Indiv = scq.Individuality()
Indiv.predict(ad_indiv, ad_indiv.obs.sample_id, layer="arcsinh")

The observation-level scores (one vector per single cell, indicating its similarity to all other samples) are saved in the .obsm attribute of the AnnData object, as a matrix of size n_cells x n_samples:

ad_indiv.obsm["individuality"]
matrix([[0.  , 0.02, 0.04, ..., 0.  , 0.  , 0.  ],
        [0.02, 0.01, 0.  , ..., 0.  , 0.  , 0.  ],
        [0.05, 0.  , 0.03, ..., 0.  , 0.  , 0.  ],
        ...,
        [0.02, 0.  , 0.  , ..., 0.  , 0.  , 0.13],
        [0.  , 0.  , 0.01, ..., 0.  , 0.  , 0.62],
        [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.48]])
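Conceptually, each row of this matrix is derived from the cell’s k-nearest neighbors: for every cell, we record the fraction of its nearest neighbors that originate from each sample. The sketch below illustrates this idea with scikit-learn; it is a simplified illustration of ours (the function name, the choice of k, and details such as whether a cell counts as its own neighbor are assumptions, and scQUEST’s Individuality implementation may differ).

from sklearn.neighbors import NearestNeighbors

def individuality_sketch(X, sample_ids, k=100):
    """For each cell, the fraction of its k nearest neighbors coming from each sample."""
    sample_ids = np.asarray(sample_ids)
    samples = np.unique(sample_ids)
    nn_idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)[1]  # (n_cells, k) neighbor indices
    neighbor_samples = sample_ids[nn_idx]  # sample of origin of each neighbor
    scores = np.stack([(neighbor_samples == s).mean(axis=1) for s in samples], axis=1)
    return pd.DataFrame(scores, columns=samples)  # (n_cells, n_samples) fractions

# sample-level aggregation (see below) would average these per-cell vectors within each sample:
# obs_scores = individuality_sketch(ad_indiv.layers["arcsinh"], ad_indiv.obs.sample_id)
# sample_scores = obs_scores.groupby(ad_indiv.obs.sample_id.values).mean()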

The aggregated sample-level scores (one vector per sample, indicating a sample’s similarity to all other samples) are saved in the .uns attribute of the AnnData object, as a matrix of size n_samples x n_samples:

# access sample-level aggregated individuality
ad_indiv.uns["individuality_agg"].head()
0 1 2 3 4 5 6 7 8 9 ... 201 202 203 204 205 206 207 208 209 210
0 0.03165 0.01095 0.00950 0.01265 0.01020 0.00385 0.00025 0.00070 0.00475 0.00250 ... 0.00095 0.00180 0.00610 0.01615 0.00480 0.00885 0.00590 0.00055 0.00155 0.00170
1 0.00970 0.03995 0.00835 0.02820 0.00550 0.00755 0.00000 0.00120 0.00155 0.00100 ... 0.00055 0.00280 0.01625 0.00700 0.00155 0.01080 0.00055 0.00145 0.00160 0.00055
2 0.00670 0.00860 0.04630 0.01245 0.03975 0.00330 0.00000 0.00180 0.01890 0.00075 ... 0.00065 0.00045 0.00580 0.00675 0.00015 0.00555 0.01440 0.00120 0.00060 0.00345
3 0.01010 0.02590 0.00950 0.04170 0.00960 0.00920 0.00000 0.00085 0.00495 0.00085 ... 0.00035 0.00135 0.01000 0.00570 0.00100 0.00995 0.00275 0.00100 0.00140 0.00095
4 0.00975 0.00835 0.04945 0.01310 0.05810 0.00385 0.00000 0.00210 0.03200 0.00125 ... 0.00005 0.00020 0.00540 0.00570 0.00005 0.00805 0.01270 0.00075 0.00040 0.00090

5 rows × 211 columns

To assess a sample’s uniqueness within the cohort, we will use the diagonal of that matrix:

dat = ad_indiv.uns["individuality_agg"].copy()
dat = pd.DataFrame(np.diag(dat), index=dat.index, columns=["individuality"])

Now let’s assess how individuality scores are related to a patient’s clinical data:

res = ad_indiv.obs[
    [
        "tissue_type",
        "breast",
        "Grade",
        "Estrogen Receptor Status by IHC",
        "Clinical subtype derived from the receptor status",
        "patient_number",
        "sample_id",
    ]
]
res = res[~res.duplicated()].set_index("sample_id")
dat = pd.concat((dat, res), axis=1)

clinical_data = [
    "tissue_type",
    "Grade",
    "Estrogen Receptor Status by IHC",
    "Clinical subtype derived from the receptor status",
]
order = [
    ["H", "N", "T"],
    ["G1", "G2", "G3"],
    ["P", "N"],
    ["LumA", "LumB", "LumB-HER2", "HER2", "TN"],
]

fig, axs = plt.subplots(1, 4, figsize=(20, 8), sharey=True)
for i in range(4):
    sns.boxplot(
        data=dat,
        x=clinical_data[i],
        y="individuality",
        order=order[i],
        whis=[0, 100],
        ax=axs.flat[i],
    )
    sns.stripplot(
        data=dat, x=clinical_data[i], y="individuality", order=order[i], ax=axs.flat[i]
    )
plt.savefig("individuality_boxplots.pdf", dpi=300)

[Figure: individuality per sample, by clinical variable]