cellarr package

Subpackages

Submodules

cellarr.CellArrDataset module

Query the CellArrDataset.

This class provides methods to access the directory containing the generated TileDB files, usually created using build_cellarrdataset().

Example

from cellarr import CellArrDataset

cd = CellArrDataset(dataset_path="/path/to/cellar/dir")
gene_list = ["gene_1", "gene_95", "gene_50"]
result1 = cd[0, gene_list]

print(result1)
class cellarr.CellArrDataset.CellArrCellIterator(obj)[source]

Bases: object

Cell iterator to a CellArrDataset object.

__init__(obj)[source]

Initialize the iterator.

Parameters:

obj (CellArrDataset) – Source object to iterate.

__iter__()[source]
__next__()[source]
class cellarr.CellArrDataset.CellArrDataset(dataset_path, assay_tiledb_group='assays', assay_uri='counts', gene_annotation_uri='gene_annotation', cell_metadata_uri='cell_metadata', sample_metadata_uri='sample_metadata', config=None)[source]

Bases: object

A class that represents a collection of cells and their associated metadata in a TileDB-backed store.

__del__()[source]
__enter__()[source]
__exit__(exc_type, exc_val, exc_tb)[source]
__getitem__(args)[source]

Subset a CellArrDataset.

Mostly an alias to get_slice().

Parameters:

args (Union[int, Sequence, tuple]) –

Integer indices, a boolean filter, or (if the current object is named) names specifying the ranges to be extracted.

Alternatively a tuple of length 1. The first entry specifies the rows (or cells) to retain based on their names or indices.

Alternatively a tuple of length 2. The first entry specifies the rows (or cells) to retain, while the second entry specifies the columns (or features/genes) to retain, based on their names or indices.

Note

Slices are inclusive of the upper bounds. This is the default TileDB behavior.

Raises:

ValueError – If too many or too few slices are provided.

Return type:

CellArrDatasetSlice

Returns:

A CellArrDatasetSlice object containing the cell_metadata, gene_annotation and the matrix.
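As a rough illustration of the argument handling described above, the sketch below shows how a bare index or a tuple could be split into a (cells, genes) pair, and how many positions an inclusive slice covers. The helpers normalize_args and inclusive_length are hypothetical names used only for this example; they are not part of cellarr, and the actual implementation may differ.

```python
def normalize_args(args):
    """Split __getitem__-style args into (cell_subset, gene_subset).

    Hypothetical helper: a bare index selects cells only; a tuple of
    length 1 selects cells, a tuple of length 2 selects cells and genes.
    """
    if not isinstance(args, tuple):
        return args, None
    if len(args) == 1:
        return args[0], None
    if len(args) == 2:
        return args[0], args[1]
    # Empty tuples and tuples longer than 2 are rejected.
    raise ValueError("Expected a tuple of length 1 or 2.")


def inclusive_length(s):
    """Number of positions covered by a slice whose upper bound is inclusive."""
    return s.stop - s.start + 1


cells, genes = normalize_args((slice(0, 2), ["gene_1", "gene_95"]))
print(inclusive_length(cells))  # 3 positions: 0, 1 and 2
```

Under inclusive bounds, slice(0, 2) covers three cells, whereas the same slice on a Python list covers two.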

__init__(dataset_path, assay_tiledb_group='assays', assay_uri='counts', gene_annotation_uri='gene_annotation', cell_metadata_uri='cell_metadata', sample_metadata_uri='sample_metadata', config=None)[source]

Initialize a CellArrDataset.

Parameters:
  • dataset_path (str) –

    Path to the directory containing the TileDB stores. Usually the output_path from the build_cellarrdataset().

    You may provide any tiledb compatible base path (e.g. local directory, S3, minio etc.).

  • assay_tiledb_group (str) –

    TileDB group containing the assay matrices.

    If the provided build process was used, the matrices are stored in the “assays” TileDB group.

    May be an empty string or None to specify no group. This is mostly for backwards compatibility of cellarr builds for versions before 0.3.

  • assay_uri (Union[str, List[str]]) – Relative path to matrix store. Must be in tiledb group specified by assay_tiledb_group.

  • gene_annotation_uri (str) – Relative path to gene annotation store.

  • cell_metadata_uri (str) – Relative path to cell metadata store.

  • sample_metadata_uri (str) – Relative path to sample metadata store.

  • config (Config) – Custom TileDB configuration. If None, defaults will be used.

__len__()[source]
__repr__()[source]
Return type:

str

Returns:

A string representation.

get_cell_metadata_column(column_name)[source]

Access a column from the cell_metadata store.

Parameters:

column_name (str) – Name of the column or attribute. Usually one of the column names from get_cell_metadata_columns().

Return type:

DataFrame

Returns:

A list of values for this column.

get_cell_metadata_columns()[source]

Get column names from cell_metadata store.

Return type:

List[str]

Returns:

List of available metadata columns.

get_cell_subset(subset, columns=None)[source]

Slice the cell_metadata store.

Parameters:
  • subset (Union[slice, QueryCondition]) –

    A list of integer indices to subset the cell_metadata store.

    Alternatively, may also provide a tiledb.QueryCondition to query the store.

  • columns

    List of specific column names to access.

    Defaults to None, in which case all columns are extracted.

Return type:

DataFrame

Returns:

A pandas DataFrame of the subset.

get_cells_for_sample(sample)[source]

Slice and access all cells for a sample.

Parameters:

sample (Union[int, str]) –

A string specifying the sample index to access. This must be a value in the cellarr_sample column.

Alternatively, an integer index may be provided to access the sample at the given position.

Return type:

CellArrDatasetSlice

Returns:

A CellArrDatasetSlice object containing the cell_metadata, gene_annotation and the matrix.

get_gene_annotation_column(column_name)[source]

Access a column from the gene_annotation store.

Parameters:

column_name (str) – Name of the column or attribute. Usually one of the column names from get_gene_annotation_columns().

Return type:

DataFrame

Returns:

A list of values for this column.

get_gene_annotation_columns()[source]

Get annotation column names from gene_annotation store.

Return type:

List[str]

Returns:

List of available annotations.

get_gene_annotation_index()[source]

Get index of the gene_annotation store.

Return type:

List[str]

Returns:

List of unique symbols.

get_gene_subset(subset, columns=None)[source]

Slice the gene_annotation store.

Parameters:
  • subset (Union[slice, List[str], QueryCondition]) –

    A list of integer indices to subset the gene_annotation store.

    Alternatively, may provide a tiledb.QueryCondition to query the store.

    Alternatively, may provide a list of strings to match with the index of the gene_annotation store.

  • columns

    List of specific column names to access.

    Defaults to None, in which case all columns are extracted.

Return type:

DataFrame

Returns:

A pandas DataFrame of the subset.

get_matrix_subset(subset)[source]

Slice the assay matrix store(s).

Parameters:

subset (Union[int, Sequence, tuple]) – Any slice supported by TileDB’s array slicing. For more info refer to the TileDB docs (https://docs.tiledb.com/main/how-to/arrays/reading-arrays/basic-reading).

Return type:

DataFrame

Returns:

A dictionary containing the slice for each matrix in the path.

get_number_of_cells()[source]

Get number of cells.

Return type:

int

get_number_of_features()[source]

Get number of features.

Return type:

int

get_number_of_samples()[source]

Get number of samples.

Return type:

int

get_sample_metadata_column(column_name)[source]

Access a column from the sample_metadata store.

Parameters:

column_name (str) – Name of the column or attribute. Usually one of the column names from get_sample_metadata_columns().

Return type:

DataFrame

Returns:

A list of values for this column.

get_sample_metadata_columns()[source]

Get column names from sample_metadata store.

Return type:

List[str]

Returns:

List of available metadata columns.

get_sample_metadata_index()[source]

Get index of the sample_metadata store.

Return type:

List[str]

Returns:

List of unique sample names.

get_sample_subset(subset, columns=None)[source]

Slice the sample_metadata store.

Parameters:
  • subset (Union[slice, QueryCondition]) –

    A list of integer indices to subset the sample_metadata store.

    Alternatively, may also provide a tiledb.QueryCondition to query the store.

  • columns

    List of specific column names to access.

    Defaults to None, in which case all columns are extracted.

Return type:

DataFrame

Returns:

A pandas DataFrame of the subset.

get_slice(cell_subset, gene_subset)[source]

Subset a CellArrDataset.

Parameters:
  • cell_subset (Union[slice, QueryCondition]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the rows (or cells) to retain.

  • gene_subset (Union[slice, List[str], QueryCondition]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the columns (or features/genes) to retain.

Return type:

CellArrDatasetSlice

Returns:

A CellArrDatasetSlice object containing the cell_metadata, gene_annotation and the matrix for the given slice ranges.

itercells()[source]

Iterator over cells.

Return type:

CellArrCellIterator

itersamples()[source]

Iterator over samples.

Return type:

CellArrSampleIterator

property shape
class cellarr.CellArrDataset.CellArrSampleIterator(obj)[source]

Bases: object

Sample iterator to a CellArrDataset object.

__init__(obj)[source]

Initialize the iterator.

Parameters:

obj (CellArrDataset) – Source object to iterate.

__iter__()[source]
__next__()[source]

cellarr.CellArrDatasetSlice module

Class that represents a realized subset of the CellArrDataset.

This module provides a data class for slices, usually generated by the access methods of cellarr.CellArrDataset.CellArrDataset().

Example

from cellarr import CellArrDataset

cd = CellArrDataset(dataset_path="/path/to/cellar/dir")
gene_list = ["gene_1", "gene_95", "gene_50"]
result1 = cd[0, gene_list]

print(result1)
class cellarr.CellArrDatasetSlice.CellArrDatasetSlice(cell_metadata, gene_annotation, matrix)[source]

Bases: object

Class that represents a realized subset of the CellArrDataset.

__annotations__ = {'cell_metadata': <class 'pandas.core.frame.DataFrame'>, 'gene_annotation': <class 'pandas.core.frame.DataFrame'>, 'matrix': typing.Any}
__eq__(other)

Return self==value.

__hash__ = None
__init__(cell_metadata, gene_annotation, matrix)
__len__()[source]
__match_args__ = ('cell_metadata', 'gene_annotation', 'matrix')
__repr__()[source]
Return type:

str

Returns:

A string representation.

cell_metadata: DataFrame
gene_annotation: DataFrame
get_assays(transpose=False)[source]
matrix: Any
property shape
to_anndata()[source]

Convert the realized slice to AnnData.

to_summarizedexperiment()[source]

Convert the realized slice to SummarizedExperiment.

cellarr.autoencoder module

class cellarr.autoencoder.AutoEncoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, lr=0.005, residual=False)[source]

Bases: LightningModule

A class encapsulating training.

__annotations__ = {}
__init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, lr=0.005, residual=False)[source]

Constructor.

Parameters:
  • n_genes (int) – The number of genes in the gene space, representing the input dimensions.

  • latent_dim (int) – The latent space dimensions. Defaults to 128.

  • hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder.

  • dropout (float) – The dropout rate for hidden layers

  • input_dropout (float) – The dropout rate for the input layer

  • lr (float) – The initial learning rate

  • residual (bool) – Use residual connections.
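To make the hidden_dim description concrete, here is a minimal sketch of the layer sizes implied by "in the order of the list for the encoder and in reverse for the decoder". The function layer_dims is a hypothetical helper written for this illustration; it only computes (in_features, out_features) pairs and assumes a plain stack of fully connected layers, which may not match the actual module wiring.

```python
def layer_dims(n_genes, latent_dim=128, hidden_dim=(1024, 1024)):
    """Return (encoder, decoder) lists of (in_features, out_features) pairs."""
    sizes = [n_genes, *hidden_dim, latent_dim]
    # Encoder: n_genes -> hidden layers in list order -> latent_dim.
    encoder = list(zip(sizes[:-1], sizes[1:]))
    # Decoder mirrors the encoder: same layers, reversed and flipped.
    decoder = [(out_f, in_f) for in_f, out_f in reversed(encoder)]
    return encoder, decoder


enc, dec = layer_dims(n_genes=2000, latent_dim=128, hidden_dim=[1024, 512])
print(enc)  # [(2000, 1024), (1024, 512), (512, 128)]
print(dec)  # [(128, 512), (512, 1024), (1024, 2000)]
```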

configure_optimizers()[source]

Configure optimizers.

forward(x)[source]

Forward.

Parameters:

x – Input tensor corresponding to input layer.

Returns:

Output tensor corresponding to the last encoder layer.

Output tensor corresponding to the last decoder layer.

get_loss(batch)[source]

Calculate the loss.

Parameters:

batch – A batch as defined by a pytorch DataLoader.

Returns:

The training loss

load_state(encoder_filename, decoder_filename, use_gpu=False)[source]

Load model state.

Parameters:
  • encoder_filename (str) – Filename containing the encoder model state.

  • decoder_filename (str) – Filename containing the decoder model state.

  • use_gpu (bool) – Boolean indicating whether or not to use GPUs.

on_validation_epoch_end()[source]

Pytorch-lightning validation epoch end evaluation.

on_validation_epoch_start()[source]

Pytorch-lightning validation epoch start.

save_all(model_path)[source]
training_step(batch, batch_idx)[source]

Pytorch-lightning training step.

validation_step(batch, batch_idx)[source]

Pytorch-lightning validation step.

class cellarr.autoencoder.Decoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, residual=False)[source]

Bases: Module

A class that encapsulates the decoder.

__annotations__ = {}
__init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, residual=False)[source]

Constructor.

Parameters:
  • n_genes (int) – The number of genes in the gene space, representing the output dimensions of the decoder.

  • latent_dim (int) – The latent space dimensions

  • hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder.

  • dropout (float) – The dropout rate for hidden layers

  • residual (bool) – Use residual connections.

forward(x)[source]

Forward.

Parameters:

x – Input tensor corresponding to input layer.

Return type:

Tensor

Returns:

Output tensor corresponding to output layer.

load_state(filename, use_gpu=False)[source]

Load model state.

Parameters:
  • filename (str) – Filename containing the model state.

  • use_gpu (bool) – Boolean indicating whether or not to use GPUs.

save_state(filename)[source]

Save model state.

Parameters:

filename (str) – Filename to save the model state.

class cellarr.autoencoder.Encoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, residual=False)[source]

Bases: Module

A class that encapsulates the encoder.

__annotations__ = {}
__init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, residual=False)[source]

Constructor.

Parameters:
  • n_genes (int) – The number of genes in the gene space, representing the input dimensions.

  • latent_dim (int) – The latent space dimensions

  • hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder.

  • dropout (float) – The dropout rate for hidden layers

  • input_dropout (float) – The dropout rate for the input layer

  • residual (bool) – Use residual connections.

forward(x)[source]

Forward.

Parameters:

x (torch.Tensor) – Input tensor corresponding to input layer.

Return type:

Tensor

Returns:

Output tensor corresponding to output layer.

load_state(filename, use_gpu=False)[source]

Load model state.

Parameters:
  • filename (str) – Filename containing the model state.

  • use_gpu (bool) – Boolean indicating whether or not to use GPUs.

save_state(filename)[source]

Save model state.

Parameters:

filename (str) – Filename to save the model state.

cellarr.build_cellarrdataset module

Build the CellArrDataset.

The CellArrDataset method is designed to store single-cell RNA-seq datasets but can be generalized to store any 2-dimensional experimental data.

This method creates four TileDB files in the directory specified by output_path:

  • gene_annotation: A TileDB file containing feature/gene annotations.

  • sample_metadata: A TileDB file containing sample metadata.

  • cell_metadata: A TileDB file containing cell metadata, including the mapping to the samples they are tagged with in sample_metadata.

  • An assay TileDB group containing various matrices. This allows the package to store multiple different matrices, e.g. ‘counts’, ‘normalized’, ‘scaled’, for the same sample/cell and gene attributes.

The TileDB matrix file is stored in a cell X gene orientation. This orientation is chosen because the fastest-changing dimension as new files are added to the collection is usually the cells rather than genes.

Process:

1. Scan the Collection: Scan the entire collection of files to create a unique set of feature ids (e.g. gene symbols). Store this set as the gene_annotation TileDB file.

2. Sample Metadata: Store sample metadata in sample_metadata TileDB file. Each file is typically considered a sample, and an automatic mapping is created between files and samples.

3. Store Cell Metadata: Store cell metadata in the cell_metadata TileDB file.

4. Remap and Orient Data: For each dataset in the collection, remap and orient the feature dimension using the feature set from Step 1. This step ensures consistency in gene measurement and order, even if some genes are unmeasured or ordered differently in the original experiments.
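Steps 1 and 4 above can be pictured with a small, dependency-free sketch. It uses plain dictionaries in place of AnnData objects and TileDB arrays, and the helper names build_gene_index and remap_rows are hypothetical. Sorting the gene set and dropping zero values are assumptions made for the illustration, not documented behavior.

```python
def build_gene_index(datasets):
    """Step 1: collect the union of feature ids and number them."""
    genes = sorted({g for ds in datasets for g in ds["genes"]})
    return {g: i for i, g in enumerate(genes)}


def remap_rows(dataset, gene_index):
    """Step 4: re-express each cell as {global_gene_index: value}.

    Zero values and unmeasured genes are both absent from the result,
    as they would be in a sparse cell x gene store.
    """
    cols = [gene_index[g] for g in dataset["genes"]]
    return [
        {c: v for c, v in zip(cols, row) if v != 0}
        for row in dataset["matrix"]
    ]


# Two toy datasets with different gene sets and gene order.
datasets = [
    {"genes": ["B", "A"], "matrix": [[1, 0], [2, 3]]},
    {"genes": ["C", "A"], "matrix": [[5, 7]]},
]
gene_index = build_gene_index(datasets)
cells = [row for ds in datasets for row in remap_rows(ds, gene_index)]
print(gene_index)  # {'A': 0, 'B': 1, 'C': 2}
print(cells)       # [{1: 1}, {1: 2, 0: 3}, {2: 5, 0: 7}]
```

After remapping, every cell refers to genes by the same global index, regardless of how its source file ordered them.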

Example

import anndata
import numpy as np
import tempfile
from cellarr import (
    build_cellarrdataset,
    CellArrDataset,
    MatrixOptions,
)

# Create a temporary directory
tempdir = tempfile.mkdtemp()

# Read AnnData objects
adata1 = anndata.read_h5ad(
    "path/to/object1.h5ad",
    backed="r",
)
# or just provide the path
adata2 = "path/to/object2.h5ad"

# Build CellArrDataset
dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[
        adata1,
        adata2,
    ],
    matrix_options=MatrixOptions(
        dtype=np.float32
    ),
)
cellarr.build_cellarrdataset.build_cellarrdataset(files, output_path, gene_annotation=None, sample_metadata=None, cell_metadata=None, sample_metadata_options=SampleMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None), cell_metadata_options=CellMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='cell_metadata', column_types=None), gene_annotation_options=GeneAnnotationOptions(skip=False, feature_column='index', dtype=<class 'numpy.uint32'>, tiledb_store_name='gene_annotation', column_types=None), matrix_options=MatrixOptions(skip=False, consolidate_duplicate_gene_func=<built-in function sum>, matrix_name='counts', matrix_attr_name='data', dtype=<class 'numpy.uint16'>, tiledb_store_name='counts'), optimize_tiledb=True, num_threads=1)[source]

Create the CellArrDataset from a list of single-cell experiment objects.

All files are expected to be consistent; any modifications needed to make them consistent are outside the scope of this function and package.

This process makes a few assumptions:

  • If an object in files is an AnnData or H5AD object, it must contain an assay matrix in the layers slot of the object, named as specified by the layer_matrix_name parameter.

  • Feature information must contain a column, defined by the parameter feature_column, that contains feature ids or gene symbols across all files.

  • If no cell_metadata is provided, we scan to count the number of cells and create a simple range index.

  • Each file is considered a sample, and a mapping between cells and samples is automatically created. Hence the sample information provided must match the number of input files and is expected to be in the same order.

Parameters:
  • files (List[Union[str, AnnData]]) – List of file paths to H5AD or AnnData objects.

  • output_path (str) – Path to where the output TileDB files should be stored.

  • gene_annotation (Union[List[str], str, DataFrame]) –

    A DataFrame containing the feature/gene annotations across all objects.

    Alternatively, may provide a path to the file containing a concatenated gene annotations across all datasets. In this case, the first row is expected to contain the column names and an index column containing the feature ids or gene symbols.

    Alternatively, a list or a dictionary of gene symbols.

    Irrespective of the input, the object will be appended with a cellarr_gene_index column that contains numerical gene index across all objects.

    Defaults to None, in which case a gene set is generated by scanning all objects in files.

    Additional options may be specified by gene_annotation_options.

  • sample_metadata (Union[DataFrame, str]) –

    A DataFrame containing the sample metadata for each file in files. Hence the number of rows in the dataframe must match the number of files.

    Alternatively, may provide path to the file containing a concatenated sample metadata across all cells. In this case, the first row is expected to contain the column names.

    Additionally, the order of rows is expected to be in the same order as the input list of files.

    Irrespective of the input, this object is appended with a cellarr_original_gene_set column that contains the original set of feature ids (or gene symbols) from the dataset to differentiate between zero-expressed vs unmeasured genes. Additional columns are added to help with slicing and accessing chunks.

    Defaults to None, in which case, we create a simple sample metadata dataframe containing the list of datasets. Each dataset is named as sample_{i} where i refers to the index position of the object in files.

    Additional options may be specified by sample_metadata_options.

  • cell_metadata (Union[DataFrame, str]) –

    A DataFrame containing the cell metadata for cells across files. Hence the number of rows in the dataframe must match the number of cells across all files.

    Alternatively, may provide path to the file containing a concatenated cell metadata across all cells. In this case, the first row is expected to contain the column names.

    Additionally, the order of cells is expected to be in the same order as the input list of files. If the input is a path, the file is expected to contain mappings between cells and datasets (or samples).

    Defaults to None, in which case we scan all files to count the number of cells, then create a simple cell metadata DataFrame containing mappings from cells to their associated datasets. Each dataset is named as sample_{i} where i refers to the index position of the object in files.

    Additional options may be specified by cell_metadata_options.

  • sample_metadata_options (SampleMetadataOptions) – Optional parameters when generating sample_metadata store.

  • cell_metadata_options (CellMetadataOptions) – Optional parameters when generating cell_metadata store.

  • gene_annotation_options (GeneAnnotationOptions) – Optional parameters when generating gene_annotation store.

  • matrix_options (Union[MatrixOptions, List[MatrixOptions]]) – Optional parameters when generating matrix store.

  • optimize_tiledb (bool) – Whether to run TileDB’s vacuum and consolidation (may take long).

  • num_threads (int) – Number of threads. Defaults to 1.
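When neither sample_metadata nor cell_metadata is supplied, the defaults described above amount to one placeholder sample per file plus a cell-to-sample mapping. The sketch below shows that idea with plain Python structures. The function default_metadata is hypothetical, and whether i in sample_{i} starts at 0 or 1 is not stated here, so the 0-based numbering is an assumption.

```python
def default_metadata(cell_counts):
    """Build placeholder sample and cell metadata from per-file cell counts."""
    # One sample per input file; 0-based numbering is an assumption.
    samples = [f"sample_{i}" for i in range(len(cell_counts))]
    # One record per cell, tagged with the sample (file) it came from.
    cells = [
        {"cellarr_sample": name}
        for name, n in zip(samples, cell_counts)
        for _ in range(n)
    ]
    return samples, cells


samples, cells = default_metadata([2, 1])
print(samples)     # ['sample_0', 'sample_1']
print(len(cells))  # 3
```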

cellarr.build_cellarrdataset.generate_metadata_tiledb_csv(output_uri, input, column_dtype=None, index_col=False, chunksize=1000)[source]

Generate a metadata TileDB from csv.

Unlike generate_metadata_tiledb_frame, this function reads the csv in chunks, so it is suited to files that are too large to fit into memory.

Parameters:
  • output_uri (str) – TileDB URI or path to save the file.

  • input (str) – Path to the csv file. The first row is expected to contain the column names.

  • column_dtype (Dict[str, dtype]) – Dtype for each of the columns. Defaults to None.

  • chunksize – Chunk size to read the dataframe. Defaults to 1000.
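The value of chunked reading can be seen in a short standard-library sketch: only chunksize rows are held in memory at a time. This is an illustration of the general technique rather than the function's implementation; iter_csv_chunks is a hypothetical helper, and the TileDB write step is omitted.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(handle, chunksize=1000):
    """Yield lists of row dicts, at most `chunksize` rows at a time."""
    reader = csv.DictReader(handle)  # first row supplies the column names
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


text = "cell,sample\nc1,s1\nc2,s1\nc3,s2\n"
sizes = [len(chunk) for chunk in iter_csv_chunks(io.StringIO(text), chunksize=2)]
print(sizes)  # [2, 1]
```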

cellarr.build_cellarrdataset.generate_metadata_tiledb_frame(output_uri, input, column_types=None)[source]

Generate metadata TileDB from a DataFrame.

Parameters:
  • output_uri (str) – TileDB URI or path to save the file.

  • input (DataFrame) – Input dataframe.

  • column_types (dict) –

    You can specify type of each column name to cast into. “ascii” or str works best for most scenarios.

    Defaults to None.

cellarr.build_options module

class cellarr.build_options.CellMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='cell_metadata', column_types=None)[source]

Bases: object

Optional arguments for the cell_metadata store for build_cellarrdataset().

skip

Whether to skip generating cell metadata TileDB. Defaults to False.

dtype

NumPy dtype for the cell dimension. Defaults to np.uint32.

Note: make sure the number of cells fits within the integer limits of unsigned-int32.

tiledb_store_name

Name of the TileDB file. Defaults to “cell_metadata”.

column_names

List of cell metadata columns to extract from each data object. If a column is not available, it is represented as ‘NA’.

column_types

A dictionary containing column names as keys and the value representing the type to use in the TileDB. The TileDB will only contain the columns listed here. If the column is not present in a dataset, it is represented as ‘NA’.
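The column selection rule can be sketched as follows: keep only the columns named in column_types, and fill any that a dataset lacks with ‘NA’. The helper select_columns is hypothetical and works on a single record represented as a dictionary; type casting is left out.

```python
def select_columns(record, column_types, missing="NA"):
    """Keep only the columns in `column_types`, filling gaps with `missing`."""
    return {name: record.get(name, missing) for name in column_types}


row = select_columns(
    {"cell_type": "T cell", "batch": "b1"},
    {"cell_type": str, "tissue": str},
)
print(row)  # {'cell_type': 'T cell', 'tissue': 'NA'}
```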

__annotations__ = {'column_types': typing.Dict[str, numpy.dtype], 'dtype': <class 'numpy.dtype'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}
__eq__(other)

Return self==value.

__hash__ = None
__init__(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='cell_metadata', column_types=None)
__match_args__ = ('skip', 'dtype', 'tiledb_store_name', 'column_types')
__repr__()

Return repr(self).

column_types: Dict[str, dtype] = None
dtype

alias of uint32

skip: bool = False
tiledb_store_name: str = 'cell_metadata'
class cellarr.build_options.GeneAnnotationOptions(skip=False, feature_column='index', dtype=<class 'numpy.uint32'>, tiledb_store_name='gene_annotation', column_types=None)[source]

Bases: object

Optional arguments for the gene_annotation store for build_cellarrdataset().

feature_column

Column in var containing the feature ids (e.g. gene symbols). Defaults to the index of the var slot.

skip

Whether to skip generating gene annotation TileDB. Defaults to False.

dtype

NumPy dtype for the gene dimension. Defaults to np.uint32.

Note: make sure the number of genes fits within the integer limits of unsigned-int32.

tiledb_store_name

Name of the TileDB file. Defaults to “gene_annotation”.

column_types

A dictionary containing column names as keys and the value representing the type to use in the TileDB.

If None, all columns are cast as ‘ascii’.

__annotations__ = {'column_types': typing.Dict[str, numpy.dtype], 'dtype': <class 'numpy.dtype'>, 'feature_column': <class 'str'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}
__eq__(other)

Return self==value.

__hash__ = None
__init__(skip=False, feature_column='index', dtype=<class 'numpy.uint32'>, tiledb_store_name='gene_annotation', column_types=None)
__match_args__ = ('skip', 'feature_column', 'dtype', 'tiledb_store_name', 'column_types')
__repr__()

Return repr(self).

column_types: Dict[str, dtype] = None
dtype

alias of uint32

feature_column: str = 'index'
skip: bool = False
tiledb_store_name: str = 'gene_annotation'
class cellarr.build_options.MatrixOptions(skip=False, consolidate_duplicate_gene_func=<built-in function sum>, matrix_name='counts', matrix_attr_name='data', dtype=<class 'numpy.uint16'>, tiledb_store_name='counts')[source]

Bases: object

Optional arguments for the matrix store for build_cellarrdataset().

matrix_name

Matrix name from layers slot to add to TileDB. Must be consistent across all objects in files.

Defaults to “counts”.

matrix_attr_name

Name of the matrix to be stored in the TileDB file. Defaults to “data”.

consolidate_duplicate_gene_func

Function to consolidate when the AnnData object contains multiple rows with the same feature id or gene symbol.

Defaults to sum().
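A minimal sketch of what consolidating duplicates means: values that share a feature id are grouped and reduced with the supplied function. The helper consolidate_duplicates is hypothetical and operates on one cell's values; how the library applies the function across a whole matrix is not specified here.

```python
from collections import defaultdict


def consolidate_duplicates(gene_ids, values, func=sum):
    """Merge values that share a feature id using `func`."""
    grouped = defaultdict(list)
    for gene, value in zip(gene_ids, values):
        grouped[gene].append(value)
    return {gene: func(vals) for gene, vals in grouped.items()}


merged = consolidate_duplicates(["A", "B", "A"], [1, 4, 2])
print(merged)  # {'A': 3, 'B': 4}
```

Passing func=max instead would keep the largest of the duplicated values rather than their total.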

skip

Whether to skip generating matrix TileDB. Defaults to False.

dtype

NumPy dtype for the values in the matrix. Defaults to np.uint16.

Note: make sure the matrix values fit within the range limits of unsigned-int16.
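A quick way to reason about the range limit is to compare the largest value in the data against the largest value an unsigned integer of a given width can hold. The helper fits_unsigned below is a hypothetical, standard-library-only check and is not part of cellarr.

```python
def fits_unsigned(max_value, bits=16):
    """True if 0 <= max_value <= 2**bits - 1."""
    return 0 <= max_value <= (1 << bits) - 1


print(fits_unsigned(65_535))      # True: the largest uint16 value
print(fits_unsigned(70_000))      # False: needs a wider dtype such as uint32
print(fits_unsigned(70_000, 32))  # True
```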

tiledb_store_name

Name of the TileDB file. Defaults to “counts”.

__annotations__ = {'consolidate_duplicate_gene_func': <built-in function callable>, 'dtype': <class 'numpy.dtype'>, 'matrix_attr_name': <class 'str'>, 'matrix_name': <class 'str'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}
__eq__(other)

Return self==value.

__hash__ = None
__init__(skip=False, consolidate_duplicate_gene_func=<built-in function sum>, matrix_name='counts', matrix_attr_name='data', dtype=<class 'numpy.uint16'>, tiledb_store_name='counts')
__match_args__ = ('skip', 'consolidate_duplicate_gene_func', 'matrix_name', 'matrix_attr_name', 'dtype', 'tiledb_store_name')
__repr__()

Return repr(self).

consolidate_duplicate_gene_func(iterable, /, start=0)

Defaults to the built-in sum(), whose documentation follows.

Return the sum of a ‘start’ value (default: 0) plus an iterable of numbers.

When the iterable is empty, return the start value. This function is intended specifically for use with numeric values and may reject non-numeric types.

dtype

alias of uint16

matrix_attr_name: str = 'data'
matrix_name: str = 'counts'
skip: bool = False
tiledb_store_name: str = 'counts'
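
The note on dtype asks that matrix values fit within the unsigned 16-bit range. A pre-flight check along the following lines may help catch overflows before building. This is only a sketch: the helper name fits_unsigned is hypothetical and not part of cellarr, and the bound is computed in plain Python (for 16 bits it equals np.iinfo(np.uint16).max, i.e. 65535).

```python
def fits_unsigned(values, bits=16):
    """Return True if every value lies within [0, 2**bits - 1]."""
    upper = 2**bits - 1  # 65535 for an unsigned 16-bit integer
    return all(0 <= v <= upper for v in values)


counts = [0, 12, 65535]
ok_for_uint16 = fits_unsigned(counts)              # all values fit
too_large = fits_unsigned(counts + [65536])        # one value overflows
ok_for_uint32 = fits_unsigned(counts + [65536], bits=32)
```

If the check fails, passing a wider dtype (for example np.uint32) to the options is the likely remedy.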
class cellarr.build_options.SampleMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None)[source]

Bases: object

Optional arguments for the sample store for build_cellarrdataset().

skip

Whether to skip generating sample TileDB. Defaults to False.

dtype

NumPy dtype for the sample dimension. Defaults to np.uint32.

Note: make sure the number of samples fits within the integer limits of unsigned-int32.

tiledb_store_name

Name of the TileDB file. Defaults to “sample_metadata”.

column_types

A dictionary containing column names as keys and values representing the type to use in TileDB.

If None, all columns are cast as ‘ascii’.

__eq__(other)

Return self==value.

__hash__ = None
__init__(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None)
__match_args__ = ('skip', 'dtype', 'tiledb_store_name', 'column_types')
__repr__()

Return repr(self).

column_types: Dict[str, dtype] = None
dtype

alias of uint32

skip: bool = False
tiledb_store_name: str = 'sample_metadata'
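
The column_types description says that all columns are cast as ‘ascii’ when the dictionary is None. The sketch below illustrates that default with a hypothetical helper, resolve_column_types, which is not part of cellarr. It additionally assumes that columns missing from a partial dictionary also fall back to ‘ascii’, which the documentation does not state. Plain strings stand in for NumPy dtypes.

```python
def resolve_column_types(column_names, column_types=None):
    """Map every column name to a type, using 'ascii' when none is given."""
    overrides = column_types or {}
    # Assumption: columns absent from the overrides fall back to 'ascii'.
    return {name: overrides.get(name, "ascii") for name in column_names}


columns = ["sample_id", "tissue", "age"]
all_ascii = resolve_column_types(columns)
with_override = resolve_column_types(columns, {"age": "int32"})
```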

cellarr.buildutils_tiledb_array module

cellarr.buildutils_tiledb_array.create_group(output_path, group_name)[source]
cellarr.buildutils_tiledb_array.create_tiledb_array(tiledb_uri_path, x_dim_length=None, y_dim_length=None, x_dim_name='cell_index', y_dim_name='gene_index', matrix_attr_name='data', x_dim_dtype=<class 'numpy.uint32'>, y_dim_dtype=<class 'numpy.uint32'>, matrix_dim_dtype=<class 'numpy.uint32'>, is_sparse=True)[source]

Create a TileDB file with the provided attributes to persistent storage.

This will materialize the array directory and all related schema files.

Parameters:
  • tiledb_uri_path (str) – Path to create the array TileDB file.

  • x_dim_length (int) – Number of entries along the x/fastest-changing dimension. e.g. Number of cells. Defaults to None, in which case, the max integer value of x_dim_dtype is used.

  • y_dim_length (int) – Number of entries along the y dimension. e.g. Number of genes. Defaults to None, in which case, the max integer value of y_dim_dtype is used.

  • x_dim_name (str) – Name for the x-dimension. Defaults to “cell_index”.

  • y_dim_name (str) – Name for the y-dimension. Defaults to “gene_index”.

  • matrix_attr_name (str) – Name for the attribute in the array. Defaults to “data”.

  • x_dim_dtype (dtype) – NumPy dtype for the x-dimension. Defaults to np.uint32.

  • y_dim_dtype (dtype) – NumPy dtype for the y-dimension. Defaults to np.uint32.

  • matrix_dim_dtype (dtype) – NumPy dtype for the values in the matrix. Defaults to np.uint32.

  • is_sparse (bool) – Whether the matrix is sparse. Defaults to True.

cellarr.buildutils_tiledb_array.optimize_tiledb_array(tiledb_array_uri, verbose=True)[source]

Consolidate TileDB fragments.

cellarr.buildutils_tiledb_array.write_csr_matrix_to_tiledb(tiledb_array_uri, matrix, value_dtype=<class 'numpy.uint32'>, row_offset=0, batch_size=25000)[source]

Append and save a csr_matrix to TileDB.

Parameters:
  • tiledb_array_uri (Union[str, SparseArray]) – TileDB array object or path to a TileDB object.

  • matrix (csr_matrix) – Input matrix to write to TileDB; must be a csr_matrix.

  • value_dtype (dtype) – NumPy dtype to reformat the matrix values. Defaults to uint32.

  • row_offset (int) – Offset row number to append to matrix. Defaults to 0.

  • batch_size (int) – Batch size. Defaults to 25000.
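
One plausible reading of how row_offset and batch_size interact is sketched below: rows are written in consecutive batches, and each local row index is shifted by row_offset to obtain its position in the TileDB array. The generator iter_row_batches is hypothetical, and the actual implementation in cellarr may differ.

```python
def iter_row_batches(n_rows, batch_size=25000, row_offset=0):
    """Yield (local_start, local_end, global_start) for each batch of rows."""
    for start in range(0, n_rows, batch_size):
        end = min(start + batch_size, n_rows)  # last batch may be shorter
        yield start, end, row_offset + start


# A 60,000-row matrix appended after 100 rows that are already stored.
batches = list(iter_row_batches(60000, batch_size=25000, row_offset=100))
```

Each tuple gives the half-open row range to slice from the input matrix and the row in the target array where that slice would start.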

cellarr.buildutils_tiledb_frame module

cellarr.buildutils_tiledb_frame.append_to_tiledb_frame(tiledb_uri_path, frame, row_offset=0)[source]

Append the rows of a DataFrame to a TileDB file in persistent storage.

This will materialize the array directory and all related schema files.

Parameters:
  • tiledb_uri_path (str) – Path to create the metadata TileDB file.

  • frame (DataFrame) – Pandas Dataframe to append to TileDB.

  • row_offset (int) – Row offset to append new rows to. Defaults to 0.

cellarr.buildutils_tiledb_frame.create_tiledb_frame_from_chunk(tiledb_uri_path, chunk, column_types)[source]

Create a TileDB file from the DataFrame chunk, to persistent storage. This is used by the importer for large datasets stored in csv.

This will materialize the array directory and all related schema files.

Parameters:
  • tiledb_uri_path (str) – Path to create the metadata TileDB file.

  • chunk (DataFrame) – Pandas data frame.

  • column_types (Dict[str, dtype]) – Dictionary specifying the column types for each column in the frame.

cellarr.buildutils_tiledb_frame.create_tiledb_frame_from_column_names(tiledb_uri_path, column_names, column_types)[source]

Create a TileDB file with the provided attributes to persistent storage.

This will materialize the array directory and all related schema files.

Parameters:
  • tiledb_uri_path (str) – Path to create the metadata TileDB file.

  • column_names (List[str]) – Column names of the data frame.

  • column_types (Dict[str, dtype]) – Dictionary specifying the column types for each column in the frame.

cellarr.buildutils_tiledb_frame.create_tiledb_frame_from_dataframe(tiledb_uri_path, frame, column_types=None)[source]

Create a TileDB file with the provided attributes to persistent storage.

This will materialize the array directory and all related schema files.

Parameters:
  • tiledb_uri_path (str) – Path to create the metadata TileDB file.

  • frame – Data frame to write to TileDB.

  • column_types (dict) – Dictionary specifying the column types for each column in the frame.

cellarr.buildutils_tiledb_frame.infer_column_types(frame, col_types)[source]

Infer column types based on pandas types for each column.

Note: Currently sets all columns to ‘ascii’.

Parameters:

frame (DataFrame) – DataFrame to infer column types from.

Return type:

Dict[str, str]

Returns:

Dictionary containing column names as keys and value representing the column types.

cellarr.dataloader module

A dataloader using TileDB files in the pytorch-lightning framework.

This class provides a dataloader over the TileDB files generated by build_cellarrdataset().

Example

from cellarr.dataloader import (
    DataModule,
)

datamodule = DataModule(
    dataset_path="/path/to/cellar/dir",
    cell_metadata_uri="cell_metadata",
    gene_annotation_uri="gene_annotation",
    matrix_uri="counts",
    val_studies=[
        "test3"
    ],
    label_column_name="label",
    study_column_name="study",
    batch_size=100,
    lognorm=True,
    target_sum=1e4,
)

dataloader = datamodule.train_dataloader()
batch = next(
    iter(dataloader)
)
(
    data,
    labels,
    studies,
) = batch
print(
    data,
    labels,
    studies,
)
class cellarr.dataloader.BaseBatchSampler(data_df, int2sample, bsz, shuffle=True, **kwargs)[source]

Bases: Sampler[int]

Simplest sampler class for composition of samples in minibatch.

__init__(data_df, int2sample, bsz, shuffle=True, **kwargs)[source]

Constructor.

Parameters:
  • data_df (DataFrame) – DataFrame with columns “study::::sample”

  • int2sample (dict) – Dictionary mapping integer to sample id

  • bsz (int) – Batch size

  • shuffle (bool) – Whether to shuffle the samples across epochs

__iter__()[source]
__len__()[source]
Return type:

int

class cellarr.dataloader.DataModule(dataset_path, cell_metadata_uri='cell_metadata', gene_annotation_uri='gene_annotation', matrix_uri='assays/counts', label_column_name='celltype_id', study_column_name='study', sample_column_name='cellarr_sample', val_studies=None, gene_order=None, batch_size=100, sample_size=100, num_workers=1, lognorm=True, target_sum=10000.0, sparse=False, sampling_by_class=False, remove_singleton_classes=False, min_sample_size=None, nan_string='nan', sampler_cls=<class 'cellarr.dataloader.BaseBatchSampler'>, dataset_cls=<class 'cellarr.dataloader.scDataset'>, persistent_workers=False, multiprocessing_context='spawn')[source]

Bases: LightningDataModule

A class that extends a pytorch-lightning LightningDataModule to create pytorch dataloaders using TileDB.

The dataloader uniformly samples across training labels and study labels to create a diverse batch of cells.

__del__()[source]
__init__(dataset_path, cell_metadata_uri='cell_metadata', gene_annotation_uri='gene_annotation', matrix_uri='assays/counts', label_column_name='celltype_id', study_column_name='study', sample_column_name='cellarr_sample', val_studies=None, gene_order=None, batch_size=100, sample_size=100, num_workers=1, lognorm=True, target_sum=10000.0, sparse=False, sampling_by_class=False, remove_singleton_classes=False, min_sample_size=None, nan_string='nan', sampler_cls=<class 'cellarr.dataloader.BaseBatchSampler'>, dataset_cls=<class 'cellarr.dataloader.scDataset'>, persistent_workers=False, multiprocessing_context='spawn')[source]

Initialize a DataModule.

Parameters:
  • dataset_path (str) – Path to the directory containing the TileDB stores. Usually the output_path from the build_cellarrdataset().

  • cell_metadata_uri (str) – Relative path to cell metadata store.

  • gene_annotation_uri (str) – Relative path to gene annotation store.

  • matrix_uri (str) – Relative path to matrix store.

  • label_column_name (str) – Column name in cell_metadata_uri containing cell labels.

  • study_column_name (str) – Column name in cell_metadata_uri containing study information.

  • val_studies (Optional[List[str]]) – List of studies to use for validation and test. If None, all studies are used for training.

  • gene_order (Optional[List[str]]) – List of genes to subset to from the gene space. If None, all genes from the gene_annotation are used for training.

  • batch_size (int) – Batch size to use, corresponding to the number of samples in a mini-batch. Defaults to 100.

  • sample_size (int) – Size of each sample used in a mini-batch, corresponding to the number of cells in a sample. Defaults to 100.

  • num_workers (int) – The number of worker threads for dataloaders. Defaults to 1.

  • lognorm (bool) – Whether to return log-normalized expression instead of raw counts.

  • target_sum (float) – Target sum for log-normalization.

  • sparse (bool) – Whether to return a sparse tensor. Defaults to False.

  • sampling_by_class (bool) – Sample based on class counts, where sampling weight is inversely proportional to count. If False, use random sampling. Defaults to False.

  • remove_singleton_classes (bool) – Exclude cells with classes that exist in only one sample. Defaults to False.

  • min_sample_size (Optional[int]) – Set a minimum number of cells in a sample for it to be valid. Defaults to None.

  • nan_string (str) – A string representing NaN. Defaults to “nan”.

  • sampler_cls (Sampler) – Sampler class to use for batching. Defaults to BaseBatchSampler.

  • dataset_cls (Dataset) – Base Dataset class to use. Defaults to scDataset.

  • persistent_workers (bool) – If True, uses persistent workers in the DataLoaders. Defaults to False.

  • multiprocessing_context (str) – Multiprocessing context to use for the DataLoaders. Defaults to “spawn”.
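
The lognorm and target_sum parameters are described only briefly. The sketch below assumes the common single-cell convention: each cell's counts are rescaled so they sum to target_sum, then log(1 + x) is applied. That the DataModule follows exactly this convention is an assumption, and the function name lognorm_cell is hypothetical.

```python
import math


def lognorm_cell(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An all-zero cell stays all zero; avoids division by zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


normalized = lognorm_cell([1, 1, 2], target_sum=4)
```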

__repr__()[source]
Return type:

str

Returns:

A string representation.

collate(batch)[source]

Collate tensors.

Parameters:

batch – Batch to collate.

Returns:

tuple

A Tuple[torch.Tensor, torch.Tensor, np.ndarray, np.ndarray] containing information corresponding to [input, label, study, sample]

filter_db()[source]
train_dataloader()[source]

Load the training dataset.

Return type:

DataLoader

Returns:

A DataLoader object containing the training dataset.

val_dataloader()[source]

Load the validation dataset.

Return type:

DataLoader

Returns:

A DataLoader object containing the validation dataset.
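
The split between the training and validation loaders is driven by val_studies: studies in that list are held out, and when it is None all studies are used for training. A minimal sketch of that partition follows, using plain dictionaries in place of the cell metadata frame. The function split_by_study is hypothetical and only illustrates the documented rule.

```python
def split_by_study(cells, val_studies=None, study_key="study"):
    """Partition cell records into (train, validation) by study name."""
    held_out = set(val_studies or [])
    train = [c for c in cells if c[study_key] not in held_out]
    val = [c for c in cells if c[study_key] in held_out]
    return train, val


cells = [
    {"cell": 0, "study": "test1"},
    {"cell": 1, "study": "test3"},
    {"cell": 2, "study": "test1"},
]
train_cells, val_cells = split_by_study(cells, val_studies=["test3"])
```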

class cellarr.dataloader.scDataset(data_df, int2sample, sample2cells, sample_size, sampling_by_class=False)[source]

Bases: Dataset

A class that extends pytorch Dataset to enumerate cells and cell metadata using TileDB.

__getitem__(idx)[source]
__init__(data_df, int2sample, sample2cells, sample_size, sampling_by_class=False)[source]

Initialize a scDataset.

Parameters:
  • data_df (DataFrame) – Pandas dataframe of valid cells.

  • int2sample (dict) – A mapping of sample index to sample id.

  • sample2cells (dict) – A mapping of sample id to cell indices.

  • sample_size (int) – Number of cells in one sample.

  • sampling_by_class (bool) – Sample based on class counts, where sampling weight is inversely proportional to count. Defaults to False.
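
The sampling_by_class option weights each cell inversely to the size of its class. The sketch below shows one way such weights could be computed and used. The helper names are hypothetical, and sampling with replacement via random.choices is an assumption rather than something the documentation specifies.

```python
import random
from collections import Counter


def inverse_count_weights(labels):
    """Give each cell a weight of 1 / (number of cells sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_cells(labels, sample_size, seed=None):
    """Draw cell indices so that rare classes are picked more often."""
    rng = random.Random(seed)
    weights = inverse_count_weights(labels)
    # random.choices samples with replacement.
    return rng.choices(range(len(labels)), weights=weights, k=sample_size)


labels = ["T cell", "T cell", "T cell", "B cell"]
weights = inverse_count_weights(labels)
picked = sample_cells(labels, sample_size=5, seed=0)
```

With these weights every class has the same total probability mass, so the single B cell is as likely to be drawn as any of the three T cells combined.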

__len__()[source]
__repr__()[source]
Return type:

str

Returns:

A string representation.

cellarr.queryutils_tiledb_frame module

cellarr.queryutils_tiledb_frame.get_a_column(tiledb_obj, column_name)[source]

Access column(s) from the TileDB object.

Parameters:
  • tiledb_obj (Array) – A TileDB object.

  • column_name (Union[str, List[str]]) – Name(s) of the column to access.

Return type:

list

Returns:

List containing the column values.

cellarr.queryutils_tiledb_frame.get_index(tiledb_obj)[source]

Get the index of the TileDB object.

Parameters:

tiledb_obj (Array) – A TileDB object.

Return type:

list

Returns:

A list containing the index values.

cellarr.queryutils_tiledb_frame.get_schema_names_frame(tiledb_obj)[source]

Get the attribute names from a TileDB object's schema.

Parameters:

tiledb_obj (Array) – A TileDB object.

Return type:

List[str]

Returns:

List of schema attributes.

cellarr.queryutils_tiledb_frame.subset_array(tiledb_obj, row_subset, column_subset, shape)[source]

Subset a TileDB storing array data.

Uses multi_index to slice.

Return type:

csr_matrix

Returns:

A sparse array in a csr format.

cellarr.queryutils_tiledb_frame.subset_frame(tiledb_obj, subset, columns, primary_key_column_name=None)[source]

Subset a TileDB object.

Parameters:
  • tiledb_obj (Array) – TileDB object to subset.

  • subset (Union[slice, str]) –

    A slice to subset.

    Alternatively, may also provide a TileDB query expression.

  • columns (list) – List specifying the attributes from the schema to extract.

  • primary_key_column_name (str) – The primary key to filter for matches when a QueryCondition is used.

Return type:

DataFrame

Returns:

A sliced DataFrame with the subset.

cellarr.utils_anndata module

cellarr.utils_anndata.consolidate_duplicate_symbols(matrix, feature_ids, consolidate_duplicate_gene_func)[source]

Consolidate duplicate gene symbols.

Parameters:
  • matrix (Any) – Data matrix with rows for cells and columns for genes.

  • feature_ids (List[str]) – List of feature ids along the column axis of the matrix.

  • consolidate_duplicate_gene_func (callable) –

    Function to consolidate when the AnnData object contains multiple rows with the same feature id or gene symbol.

    Defaults to sum().

Return type:

AnnData

Returns:

AnnData object with duplicate gene symbols consolidated.
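
To make the consolidation step concrete, the sketch below merges columns that share a feature id by applying a reducer (sum by default) across them, row by row. It works on a plain list-of-lists matrix, and the function consolidate_columns is a hypothetical stand-in, not the library's implementation.

```python
def consolidate_columns(matrix, feature_ids, func=sum):
    """Merge columns that share a feature id by applying func per row."""
    groups = {}
    for col, fid in enumerate(feature_ids):
        groups.setdefault(fid, []).append(col)
    unique_ids = list(groups)  # first-seen order of feature ids
    merged = [
        [func(row[c] for c in groups[fid]) for fid in unique_ids]
        for row in matrix
    ]
    return merged, unique_ids


matrix = [[1, 2, 3], [4, 5, 6]]
feature_ids = ["g1", "g2", "g1"]  # "g1" appears twice
summed, ids = consolidate_columns(matrix, feature_ids)
maxed, _ = consolidate_columns(matrix, feature_ids, func=max)
```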

cellarr.utils_anndata.extract_anndata_info(h5ad_or_adata, var_feature_column='index', var_subset_columns=None, obs_subset_columns=None, num_threads=1)[source]

Extract and generate the list of unique feature identifiers and cell counts across files.

Parameters:
  • h5ad_or_adata (List[Union[str, AnnData]]) – List of AnnData objects or paths to H5AD files.

  • var_feature_column (str) – Column containing the feature ids (e.g. gene symbols). Defaults to “index”.

  • var_subset_columns (List[str]) – List of var columns to concatenate across all files. Defaults to None and no metadata columns will be extracted.

  • obs_subset_columns (dict) – List of obs columns to concatenate across all files. Defaults to None and no metadata columns will be extracted.

  • num_threads (int) – Number of threads to use. Defaults to 1.

cellarr.utils_anndata.remap_anndata(h5ad_or_adata, feature_set_order, var_feature_column='index', layer_matrix_name='counts', consolidate_duplicate_gene_func=<built-in function sum>)[source]

Extract and remap the count matrix to the provided feature (gene) set order from the AnnData object.

Parameters:
  • h5ad_or_adata –

    Input AnnData object.

    Alternatively, may also provide a path to the H5ad file.

    The index of the var slot must contain the feature ids for the columns in the matrix.

  • feature_set_order (dict) – A dictionary with the feature ids as keys and their index as value (e.g. gene symbols). The feature ids from the AnnData object are remapped to the feature order from this dictionary.

  • var_feature_column (str) – Column in var containing the feature ids (e.g. gene symbols). Defaults to the index of the var slot.

  • layer_matrix_name (Union[str, List[str]]) –

    Layer containing the matrix to add to TileDB. Defaults to “counts”.

    Alternatively, may provide a list of layers to extract and add to TileDB.

  • consolidate_duplicate_gene_func (Union[callable, List[callable]]) –

    Function to consolidate when the AnnData object contains multiple rows with the same feature id or gene symbol.

    Defaults to sum().

Return type:

Dict[str, csr_matrix]

Returns:

A dictionary mapping each layer name to a csr_matrix representation of the assay matrix.
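
The remapping described above can be pictured on a single row: each value moves from its column in the input to the column index that feature_set_order assigns to its feature id. The sketch below assumes that features missing from the input become zero and that features absent from feature_set_order are dropped. Both are assumptions, and remap_row is a hypothetical helper.

```python
def remap_row(row, feature_ids, feature_set_order):
    """Place each value at the column index given by feature_set_order."""
    out = [0] * len(feature_set_order)
    for value, fid in zip(row, feature_ids):
        target = feature_set_order.get(fid)
        if target is not None:  # assumption: unknown features are dropped
            out[target] = value
    return out


feature_set_order = {"gene_a": 0, "gene_b": 1, "gene_c": 2}
remapped = remap_row([5, 7], ["gene_b", "gene_a"], feature_set_order)
```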

cellarr.utils_anndata.scan_for_cellcounts(cache)[source]

Extract cell counts across files.

Requires calling extract_anndata_info() first.

Parameters:

cache – Info extracted, typically by running extract_anndata_info().

Return type:

List[int]

Returns:

List of cell counts across files.
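
Per-file cell counts are useful for deriving the starting row of each file in a combined store, for example as a row_offset when writing matrices or frames file by file. The sketch below computes those offsets as a running total. The helper row_offsets is hypothetical, and the claim that the build step uses the counts this way is an assumption.

```python
def row_offsets(cell_counts):
    """Return the starting row for each file given per-file cell counts."""
    offsets = []
    total = 0
    for n_cells in cell_counts:
        offsets.append(total)  # this file starts where the previous ended
        total += n_cells
    return offsets


offsets = row_offsets([100, 250, 50])
```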

cellarr.utils_anndata.scan_for_cellmetadata(cache)[source]

Extract and merge all cell metadata data frames across files.

Requires calling extract_anndata_info() first.

Parameters:

cache – Info extracted, typically by running extract_anndata_info().

Return type:

DataFrame

Returns:

A pandas.DataFrame containing all cell metadata.

cellarr.utils_anndata.scan_for_features(cache, unique=True)[source]

Extract and generate the list of unique feature identifiers across files.

Requires calling extract_anndata_info() first.

Return type:

List[str]

Returns:

List of all unique feature ids across all files.
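
Collecting unique feature ids across files amounts to a de-duplicating merge. The sketch below keeps the first-seen order by using a dictionary as an ordered set. Whether cellarr preserves order in the same way is an assumption, and unique_features is a hypothetical helper operating on plain lists.

```python
def unique_features(per_file_features):
    """Merge feature ids from several files, keeping first-seen order."""
    merged = {}
    for features in per_file_features:
        # dict.fromkeys drops duplicates while preserving order.
        merged.update(dict.fromkeys(features))
    return list(merged)


features = unique_features([
    ["gene_1", "gene_2"],
    ["gene_2", "gene_3"],
    ["gene_1"],
])
```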

cellarr.utils_anndata.scan_for_features_annotations(cache, unique=True)[source]

Extract and generate feature annotation metadata across all files in cache.

Requires calling extract_anndata_info() first.

Return type:

List[str]

Returns:

List of all unique feature ids across all files.

Module contents