Quickstart

ATHENA is an open-source computational framework written in Python that facilitates the visualization, processing and analysis of (spatial) heterogeneity from spatial omics data. ATHENA supports any spatially resolved dataset that contains spatial transcriptomic or proteomic measurements, including Imaging Mass Cytometry (IMC), Multiplexed Ion Beam Imaging (MIBI), multiplexed Immunohistochemisty (mIHC) or Immunofluorescence (mIF), seqFISH, MERFISH, Visium.

Important

You are strongly advised to read the sections:

  • Note on Phenotype Encodings

  • Note on Segmentation Masks

at the end of this document.

Requirements

To use all the capabilities of ATHENA one needs the (cell-) segmenation masks for a sample and the omics-profiles of the observations in the sample (as extracted from the high-dimensional images produced by different omics-technologies).

However, it is also possible to use ATHENA (with some limitations) without segmentation masks and the omics-profiles and just the single-observation coordinates and a classification of those observations (phenotypes).

Further Resources

  • A more comprehensive tutorial on IMC data can be found here

  • A tutorial on how to load your data can be found here

import athena as ath
import numpy as np
import pandas as pd
import seaborn as sns
from pathlib import Path

pd.set_option("display.max_columns", 5)
print(ath.__version__)
<frozen importlib._bootstrap>:219: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility. Expected 80 from C header, got 88 from PyObject


0.1.2

Load Example Data

To use ATHENA you need to load the data in a SpatialOmics instance (see below, here for a general description and here for a more detailed tutorial). The structure of the SpatialOmics instance is very similar to the AnnData.

For this tutorial we provide a small example of a populated SpatialOmics instance. Meta data for each stored sample can be found in so.spl. You can access the data by indexing the different attributes by the sample name (so.X[spl], so.obs[spl, …).

so = ath.dataset.imc_quickstart()
so
INFO:numexpr.utils:NumExpr defaulting to 8 threads.


warning: to get the latest version of this dataset use `so = sh.dataset.imc(force_download=True)`






SpatialOmics object with 5505 observations across 2 samples.
    X: 2 samples,
    spl: 2 samples,
        columns: ['pid', 'cell_count', 'immune_cell_count']
    obs: 2 samples,
        columns: ['meta_label', 'id', 'core', 'cell_type', 'meta_id', 'CellId', 'x', 'cell_type_id', 'tumor_immune_id', 'tumor_immune', 'phenograph_cluster', 'y']
    var: 2 samples,
        columns: ['target', 'fullstack_index', 'metal_tag', 'feature_type', 'channel', 'full_target_name']
    G: 0 samples,
        keys: []
    masks: 2 samples
        keys: [{'cellmasks'}]
    images: 0 samples
samples = so.spl.index
samples
Index(['SP43_75_X2Y5', 'SP41_239_X11Y3_165'], dtype='object', name='core')

How to Create a SpatialOmics Instance with Dummy Data

# create dummy data

n_obs = 100  # number of observations (cells)
n_var = 5  # number of features
samples = ['sample_'+ str(i) for i in range(5)]  # sample names
mask_dim = [100,100]  # dimension of segmentation masks

# sample-level meta data
spl = pd.DataFrame({'samples': samples, 'age': np.random.choice(range(50,80), len(samples))}).set_index('samples')

# observation-level meta data
obs = pd.DataFrame({
    'phenotype': range(n_obs),
    'x': np.random.randint(0, mask_dim[0], n_obs),
    'y': np.random.randint(0, mask_dim[1], n_obs)})

# observation-level omics profiles
X = pd.DataFrame(np.random.rand(n_obs, n_var), columns=['var_' + str(i) for i in range(n_var)])
# high-dimensional images of samples, first dimension is the number of features, i.e. multiplexing
img = np.random.rand(n_var, *mask_dim)
# segmentation masks of images
mask = np.zeros(mask_dim, dtype=int)
# feature annotation
var = pd.DataFrame({'target': ['var_' + str(i) for i in range(n_var)]})
# create SpatialOmics instance

from spatialOmics import SpatialOmics
dummy = SpatialOmics()

dummy.spl = spl
for spl in samples:
    dummy.obs[spl] = obs
    dummy.X[spl] = X
    dummy.var[spl] = var
    dummy.images[spl] = img
    dummy.masks[spl] = {'cellmasks': mask}
    
dummy
SpatialOmics object with 500 observations across 5 samples.
    X: 5 samples,
    spl: 5 samples,
        columns: ['age']
    obs: 5 samples,
        columns: ['y', 'phenotype', 'x']
    var: 5 samples,
        columns: ['target']
    G: 0 samples,
        keys: []
    masks: 5 samples
        keys: [{'cellmasks'}]
    images: 5 samples

Minimal Working Example

Depending on your use case you do not need to populate all attributes of the SpatialOmics instance.

For a minimal working example you only need to populate the .obs attribute with the observation coordinates and a label.

dummy_min = SpatialOmics()
dummy_min.obs[spl] = obs
dummy_min
SpatialOmics object with 100 observations across 1 samples.
    X: 0 samples,
    spl: 0 samples,
        columns: []
    obs: 0 samples,
        columns: ['y', 'phenotype', 'x']
    var: 0 samples,
        columns: []
    G: 0 samples,
        keys: []
    masks: 0 samples
        keys: [set()]
    images: 0 samples

Sample-level Meta Data

Sample-level meta data is stored in the so.spl attribute but not required to use ATHENA.

so.spl.head(3) 
pid cell_count immune_cell_count
core
SP43_75_X2Y5 75 2771 940
SP41_239_X11Y3_165 239 2734 412

Observation-level Meta Data

Observation-level meta data is stored in the so.obs attribute and required to use ATHENA. At least the coordinates of the observations along with some classification of those (phenotypes) are required.

spl = so.spl.index[0]
so.obs[spl].head()
core meta_id ... y x
cell_id
1 SP43_75_X2Y5 9 ... 1.142857 33.107143
2 SP43_75_X2Y5 3 ... 4.868613 43.693431
3 SP43_75_X2Y5 10 ... 5.467391 61.326087
4 SP43_75_X2Y5 8 ... 2.353659 79.987805
5 SP43_75_X2Y5 3 ... 2.026316 128.302632

5 rows × 12 columns

Observation-level Omics Profiles

The omics-profiles of the obserations are stores in the so.X attribute. These are the gene / protein expression values of the observations (most likely single cells or patches). You only need to provide this information if you want to use metrics.quadratic_entropy().

so.X[spl].head()
target H3 H3K28me3 ... PARP DNA2
cell_id
1 8.252318 0.442500 ... 0.674024 9.002035
2 12.754854 1.436547 ... 0.852686 19.567230
3 1.946717 0.470739 ... 0.858645 2.316239
4 7.552009 0.562183 ... 0.939330 8.991525
5 15.079700 0.492224 ... 0.828238 12.647001

5 rows × 34 columns

Compute Graphs

ATHENA can construct knn, radius and contact graphs. By default, ATHENA tries to build the graphs from segmentation masks. Without segmentation masks, one can only use the kNN and radius graph functionality of ATHENA. For the contact graph the segmentation masks are required. See the main tutorial to see how the graph construction can be customized.

From Segmentation Masks

Provide the spatialOmics instance and sample name for which the graph should be computed. Furthermore, builder_type defines the type of graph that is constructed and mask_key the segmentation masks that should be used (stored at so.masks[spl][MASK_KEY])

ath.graph.build_graph(so, spl, builder_type='knn', mask_key='cellmasks')
so.G[spl].keys()  # graphs are stored at so.G[spl]
dict_keys(['knn'])

One can build multiple graph-representations for each sample by simply calling build_graph again with another builder_type

ath.graph.build_graph(so, spl, builder_type='radius', mask_key='cellmasks')
so.G[spl].keys()  # graphs are stored at so.G[spl]
dict_keys(['knn', 'radius'])

From Coordinates

One can build the knn and radius graphs from coordinates only by setting mask_key=None and providing the column names of the coordinates with coordinate_keys.

ath.graph.build_graph(so, spl, builder_type='knn', mask_key=None, coordinate_keys=('x', 'y'))
ath.graph.build_graph(so, spl, builder_type='radius', mask_key=None, coordinate_keys=('x', 'y'))
so.G[spl].keys()  # graphs are stored at so.G[spl]
dict_keys(['knn', 'radius'])

Visualise the Data

For some of the plotting functionalities ATHENA requires the x,y coordinates of each observation. They can be extracted from segmentation masks with

ath.pp.extract_centroids(so, spl, mask_key='cellmasks')

The data can then be visualised with

ath.pl.spatial(so, spl, attr='meta_id')
/Users/art/Documents/projects/athena/athena/plotting/visualization.py:217: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  fig.show()

png

or if one wants to use custom coordinates, one can provide coordinate keys that are stored in the so.obs[spl] dataframe ATHENA should use

ath.pl.spatial(so, spl, attr='meta_id', coordinate_keys=['x', 'y'])

png

See the main tutorial and the docs to see how the plotting can be customized.

Compute Metrics

Once the graphs are built we can use the quantifications methods from the ath.metrics and the ath.neigh module. Again, provide the spatialOmics instance and the sample name for which the metric should be computed. Furthermore, since ATHENA quantifies the phenotypic heterogeneity in the data, provide the column name that indicates the different phenotypes of each observation with attr and specify the graph topologie to use with graph_key.

Most of the metrics can be computed either on a sample-level (i.e. 1 value per sample) or on a observation-level (i.e. 1 value per observation). This behaviour can be controlled by setting local={True,False}.

# compute shannon entropy for each observation, once for the radius graph and once for the knn graph
ath.metrics.shannon(so, spl, attr='meta_id',graph_key='radius')
ath.metrics.shannon(so, spl, attr='meta_id',graph_key='knn')

# compute the shannon entropy for the whole sample
ath.metrics.shannon(so, spl, attr='meta_id', local=False)

The results for local=True (default) are stores in so.obs as {method}_{attr}_{graph_key}.

so.obs[spl].head()
core meta_id ... shannon_meta_id_radius shannon_meta_id_knn
cell_id
1 SP43_75_X2Y5 9 ... 1.921928 1.918296
2 SP43_75_X2Y5 3 ... 1.918296 1.918296
3 SP43_75_X2Y5 10 ... 2.446439 2.521641
4 SP43_75_X2Y5 8 ... 2.419382 2.128085
5 SP43_75_X2Y5 3 ... 1.870254 1.459148

5 rows × 14 columns

The results for local=False are stored in so.spl

so.spl.loc[spl].head()
pid                    75.000000
cell_count           2771.000000
immune_cell_count     940.000000
shannon_meta_id         3.311329
Name: SP43_75_X2Y5, dtype: float64

Note On Phenotype Encodings

We advise the users to use numeric phenotype encodings when computing metrics. If one has label names stored in a pd.DataFrame as label_names one can easily create numeric labels by

df = pd.DataFrame({'label_names': ['A', 'B', 'B', 'C']})
df['labels_names_id'] = df.groupby('label_names').ngroup()

Furthermore, some functions require a strict categorical encoding. Thus the columns dtype should be set to categorical and include all categories across the samples. This can be achived by running the following snippet for all categorical columns in so.obs[spl]

# collect all occurences in the dataset
i = set()
for spl in so.spl.index:
    i.update(so.obs[spl]['meta_id'].values)

# define categorical dtype
dtype = pd.CategoricalDtype(i)
for spl in so.spl.index:
    so.obs[spl].loc[:,'meta_id'] = so.obs[spl]['meta_id'].astype(dtype)

Note On Segmentation Masks

Segmentation masks should be stored as np.ndarray and have int encodings with 0 being background. The labels do not need to be sequential but aligned with the index in so.obs[spl], i.e. the label of the segmentation mask should be the same as in the index of the so.obs dataframe for a given observation. One can test that the labels overlap

ids = np.unique(so.masks[spl]['cellmasks'])
ids = ids[ids!=0]  # remove background label
s_mask = set(ids)
s_obs = set(so.obs[spl].index.values)
assert len(s_obs - s_mask) == 0
assert len(s_mask - s_obs) == 0