Reference
Data Input/Output
Functions:
load_csv – Load a csv file.
load_tsv – Load a tsv file.
load_fcs – Load a fcs file.
load_mtx – Load a mtx file.
save_mtx – Save a mtx file.
load_10X – Load data produced from the 10X Cellranger pipeline.
load_10X_HDF5 – Load HDF5 10X data produced from the 10X Cellranger pipeline.
load_10X_zip – Load zipped 10X data produced from the 10X Cellranger pipeline.
-
scprep.io.load_csv(filename, cell_axis='row', delimiter=',', gene_names=True, cell_names=True, sparse=False, chunksize=10000, **kwargs)
Load a csv file.
- Parameters
filename (str) – The name of the csv file to be loaded
cell_axis ({'row', 'column'}, optional (default: 'row')) – If your data has genes on the rows and cells on the columns, use cell_axis='column'
delimiter (str, optional (default: ',')) – Use '\t' for tab separated values (tsv)
gene_names (bool, str, array-like, or None (default: True)) – If True, we assume gene names are in the first row/column. Otherwise expects a filename or an array containing a list of gene symbols or ids
cell_names (bool, str, array-like, or None (default: True)) – If True, we assume cell names are in the first row/column. Otherwise expects a filename or an array containing a list of cell barcodes.
sparse (bool, optional (default: False)) – If True, loads the data as a pd.DataFrame[pd.SparseArray]. This uses less memory but more CPU.
chunksize (int, optional (default: 10000)) – If sparse=True, read this many lines of dense data at a time before converting to sparse.
**kwargs (optional arguments for pd.read_csv.) –
- Returns
data – If either gene or cell names are given, data will be a pd.DataFrame or pd.DataFrame[pd.SparseArray]. If no names are given, data will be a np.ndarray or scipy.sparse.spmatrix
- Return type
array-like, shape=[n_samples, n_features]
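To make the cell_axis convention concrete, here is a minimal sketch using only the Python standard library. It does not call scprep, and the helper name read_cells_by_genes is hypothetical. It parses a small labelled table and transposes it when cells are on the columns, so the result is always cells by genes.

```python
import csv
import io


def read_cells_by_genes(text, cell_axis="row", delimiter=","):
    """Parse a small labelled table into (cell_names, gene_names, values).

    Hypothetical helper for illustration only; not part of scprep.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    col_names = rows[0][1:]               # header row, skipping the corner cell
    row_names = [r[0] for r in rows[1:]]  # first column holds the row names
    values = [[float(v) for v in r[1:]] for r in rows[1:]]
    if cell_axis == "column":
        # Genes are on the rows: transpose so that each row is a cell.
        values = [list(col) for col in zip(*values)]
        return col_names, row_names, values
    return row_names, col_names, values


genes_by_cells = ",cell1,cell2\nGENE1,1,2\nGENE2,3,4\nGENE3,5,6\n"
cells, genes, values = read_cells_by_genes(genes_by_cells, cell_axis="column")
print(cells, genes, values)
```

The output has one row per cell, matching the documented return shape [n_samples, n_features].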
-
scprep.io.load_tsv(filename, cell_axis='row', delimiter='\t', gene_names=True, cell_names=True, sparse=False, **kwargs)
Load a tsv file.
- Parameters
filename (str) – The name of the tsv file to be loaded
cell_axis ({'row', 'column'}, optional (default: 'row')) – If your data has genes on the rows and cells on the columns, use cell_axis='column'
delimiter (str, optional (default: '\t')) – Use ',' for comma separated values (csv)
gene_names (bool, str, array-like, or None (default: True)) – If True, we assume gene names are in the first row/column. Otherwise expects a filename or an array containing a list of gene symbols or ids
cell_names (bool, str, array-like, or None (default: True)) – If True, we assume cell names are in the first row/column. Otherwise expects a filename or an array containing a list of cell barcodes.
sparse (bool, optional (default: False)) – If True, loads the data as a pd.DataFrame[pd.SparseArray]. This uses less memory but more CPU.
**kwargs (optional arguments for pd.read_csv.) –
- Returns
data – If either gene or cell names are given, data will be a pd.DataFrame or pd.DataFrame[pd.SparseArray]. If no names are given, data will be a np.ndarray or scipy.sparse.spmatrix
- Return type
array-like, shape=[n_samples, n_features]
-
scprep.io.load_fcs(filename, gene_names=True, cell_names=True, sparse=None, metadata_channels=['Time', 'Event_length', 'DNA1', 'DNA2', 'Cisplatin', 'beadDist', 'bead1'], channel_naming='$PnS', reformat_meta=True, override=False, **kwargs)
Load a fcs file.
- Parameters
filename (str) – The name of the fcs file to be loaded
gene_names (bool, str, array-like, or None (default: True)) – If True, we assume gene names are contained in the file. Otherwise expects a filename or an array containing a list of gene symbols or ids
cell_names (bool, str, array-like, or None (default: True)) – If True, we assume cell names are contained in the file. Otherwise expects a filename or an array containing a list of cell barcodes.
sparse (bool, optional (default: None)) – If True, loads the data as a pd.DataFrame[SparseArray]. This uses less memory but more CPU.
metadata_channels (list-like, shape=[n_meta], optional (default: ['Time', 'Event_length', 'DNA1', 'DNA2', 'Cisplatin', 'beadDist', 'bead1'])) – Channels to be excluded from the data
channel_naming ({'$PnS', '$PnN'}, optional (default: '$PnS')) – Determines which metadata field is used for naming the channels. The default should be $PnS, even though it is not guaranteed to be unique. $PnN stands for the short name (guaranteed to be unique) and will look like 'FL1-H'. $PnS stands for the actual name (not guaranteed to be unique) and will look like 'FSC-H' (forward scatter). The chosen field will be used to populate self.channels. Note: these names are not flipped in the implementation. It looks like they were swapped for some reason in the official FCS specification.
reformat_meta (bool, optional (default: True)) – If true, the meta data is reformatted with the channel information organized into a DataFrame and moved into the ‘_channels_’ key
override (bool, optional (default: False)) – If true, uses an experimental override of fcsparser. Should only be used in cases where fcsparser fails to load the file, likely due to a malformed header. Credit to https://github.com/pontikos/fcstools
**kwargs (optional arguments for fcsparser.parse.) –
- Returns
channel_metadata (dict) – FCS metadata
cell_metadata (array-like, shape=[n_samples, n_meta]) – Values from metadata channels
data (array-like, shape=[n_samples, n_features]) – If either gene or cell names are given, data will be a pd.DataFrame or pd.DataFrame[SparseArray]. If no names are given, data will be a np.ndarray or scipy.sparse.spmatrix
-
scprep.io.load_mtx(mtx_file, cell_axis='row', gene_names=None, cell_names=None, sparse=None)
Load a mtx file.
- Parameters
mtx_file (str) – The name of the mtx file to be loaded
cell_axis ({'row', 'column'}, optional (default: 'row')) – If your data has genes on the rows and cells on the columns, use cell_axis='column'
gene_names (str, array-like, or None (default: None)) – Expects a filename or an array containing a list of gene symbols or ids
cell_names (str, array-like, or None (default: None)) – Expects a filename or an array containing a list of cell barcodes.
sparse (bool, optional (default: None)) – If True, loads the data as a pd.DataFrame[pd.SparseArray]. This uses less memory but more CPU.
- Returns
data – If either gene or cell names are given, data will be a pd.DataFrame or pd.DataFrame[pd.SparseArray]. If no names are given, data will be a np.ndarray or scipy.sparse.spmatrix
- Return type
array-like, shape=[n_samples, n_features]
-
scprep.io.save_mtx(data, destination, cell_names=None, gene_names=None)
Save a mtx file.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data, saved to destination/matrix.mtx
destination (str) – Directory in which to save the data
cell_names (list-like, shape=[n_samples], optional (default: None)) – Cell names associated with rows, saved to destination/cell_names.tsv. If data is a pandas DataFrame and cell_names is None, these are autopopulated from data.index.
gene_names (list-like, shape=[n_features], optional (default: None)) – Gene names associated with columns, saved to destination/gene_names.tsv. If data is a pandas DataFrame and gene_names is None, these are autopopulated from data.columns.
Examples
>>> import scprep
>>> scprep.io.save_mtx(data, destination="my_data")
>>> reload = scprep.io.load_mtx("my_data/matrix.mtx",
...                             cell_names="my_data/cell_names.tsv",
...                             gene_names="my_data/gene_names.tsv")
-
scprep.io.load_10X(data_dir, sparse=True, gene_labels='symbol', allow_duplicates=None)
Load data produced from the 10X Cellranger pipeline.
A default run of the cellranger count command will generate gene-barcode matrices for secondary analysis. For both “raw” and “filtered” output, directories are created containing three files: ‘matrix.mtx’, ‘barcodes.tsv’, ‘genes.tsv’. Running scprep.io.load_10X(data_dir) will return a Pandas DataFrame with genes as columns and cells as rows.
- Parameters
data_dir (string) – Path to the input data directory. Expects 'matrix.mtx(.gz)', '[genes/features].tsv(.gz)' and 'barcodes.tsv(.gz)' to be present and will raise an error otherwise.
sparse (boolean) – If True, a sparse Pandas DataFrame is returned.
gene_labels (string, {'id', 'symbol', 'both'} optional, default: 'symbol') – Whether the columns of the dataframe should contain gene ids or gene symbols. If ‘both’, returns symbols followed by ids in parentheses.
allow_duplicates (bool, optional (default: None)) – Whether or not to allow duplicate gene names. If None, duplicates are allowed for dense input but not for sparse input.
- Returns
data – If sparse, data will be a pd.DataFrame[pd.SparseArray]. Otherwise, data will be a pd.DataFrame.
- Return type
array-like, shape=[n_samples, n_features]
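The directory layout expected by load_10X can be checked up front. The sketch below uses only the standard library. The helper find_10x_files is hypothetical and simply mirrors the file names listed in the data_dir description; it is not scprep's own validation code.

```python
import tempfile
from pathlib import Path


def find_10x_files(data_dir):
    """Return the matrix, gene and barcode paths in data_dir.

    Hypothetical helper; raises FileNotFoundError if a file is missing.
    """
    data_dir = Path(data_dir)
    candidates = {
        "matrix": ["matrix.mtx", "matrix.mtx.gz"],
        "genes": ["genes.tsv", "features.tsv", "genes.tsv.gz", "features.tsv.gz"],
        "barcodes": ["barcodes.tsv", "barcodes.tsv.gz"],
    }
    found = {}
    for key, names in candidates.items():
        matches = [data_dir / n for n in names if (data_dir / n).is_file()]
        if not matches:
            raise FileNotFoundError(f"none of {names} found in {data_dir}")
        found[key] = matches[0]
    return found


# Build a throwaway directory that looks like gzipped Cellranger output.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz"):
        (Path(tmp) / name).touch()
    found_names = {key: path.name for key, path in find_10x_files(tmp).items()}
print(found_names)
```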
-
scprep.io.load_10X_HDF5(filename, genome=None, sparse=True, gene_labels='symbol', allow_duplicates=None, backend=None)
Load HDF5 10X data produced from the 10X Cellranger pipeline.
Equivalent to load_10X but for HDF5 format.
- Parameters
filename (string) – path to HDF5 input data
genome (str or None, optional (default: None)) – Name of the genome against which Cellranger ran the analysis. If None, selects the first available genome, and prints all available genomes if more than one is available. Invalid for Cellranger 3.0 HDF5 files.
sparse (boolean) – If True, a sparse Pandas DataFrame is returned.
gene_labels (string, {'id', 'symbol', 'both'} optional, default: 'symbol') – Whether the columns of the dataframe should contain gene ids or gene symbols. If ‘both’, returns symbols followed by ids in parentheses.
allow_duplicates (bool, optional (default: None)) – Whether or not to allow duplicate gene names. If None, duplicates are allowed for dense input but not for sparse input.
backend (string, {'tables', 'h5py' or None} optional, default: None) – Selects the HDF5 backend. By default, selects whichever is available, using tables if both are available.
- Returns
data – If sparse, data will be a pd.DataFrame[pd.SparseArray]. Otherwise, data will be a pd.DataFrame.
- Return type
array-like, shape=[n_samples, n_features]
-
scprep.io.load_10X_zip(filename, sparse=True, gene_labels='symbol', allow_duplicates=None)
Load zipped 10X data produced from the 10X Cellranger pipeline.
Runs load_10X after unzipping the data contained in filename.
- Parameters
filename (string) – Path to the zipped input data directory. Expects 'matrix.mtx', 'genes.tsv' and 'barcodes.tsv' to be present and will raise an error otherwise.
sparse (boolean) – If True, a sparse Pandas DataFrame is returned.
gene_labels (string, {'id', 'symbol', 'both'} optional, default: 'symbol') – Whether the columns of the dataframe should contain gene ids or gene symbols. If ‘both’, returns symbols followed by ids in parentheses.
allow_duplicates (bool, optional (default: None)) – Whether or not to allow duplicate gene names. If None, duplicates are allowed for dense input but not for sparse input.
- Returns
data – If sparse, data will be a pd.DataFrame[pd.SparseArray]. Otherwise, data will be a pd.DataFrame.
- Return type
array-like, shape=[n_samples, n_features]
HDF5
Functions:
get_node – Get a subnode from a HDF5 file or group.
get_values – Read values from a HDF5 dataset.
list_nodes – List all first-level nodes in a HDF5 file.
open_file – Open an HDF5 file with either tables or h5py.
-
scprep.io.hdf5.get_node(f, node)
Get a subnode from a HDF5 file or group.
- Parameters
f (tables.File, h5py.File, tables.Group or h5py.Group) – Open HDF5 file handle or node
node (str) – Name of subnode to retrieve
- Returns
g – Requested HDF5 node.
- Return type
tables.Group, h5py.Group, tables.CArray or h5py.Dataset
-
scprep.io.hdf5.get_values(dataset)
Read values from a HDF5 dataset.
- Parameters
dataset (tables.CArray or h5py.Dataset) –
- Returns
data – Data read from HDF5 dataset
- Return type
np.ndarray
-
scprep.io.hdf5.list_nodes(f)
List all first-level nodes in a HDF5 file.
- Parameters
f (tables.File or h5py.File) – Open HDF5 file handle.
- Returns
nodes – List of names of first-level nodes below f
- Return type
list
-
scprep.io.hdf5.open_file(filename, mode='r', backend=None)
Open an HDF5 file with either tables or h5py.
Gives a simple, unified interface for both tables and h5py.
- Parameters
filename (str) – Name of the HDF5 file
mode (str, optional (default: 'r')) – Read/write mode. Choose from ['r', 'w', 'a', 'r+']
backend (str, optional (default: None)) – HDF5 backend to use. Choose from [‘h5py’, ‘tables’]. If not given, scprep will detect which backend is available, using tables if both are installed.
- Returns
f – Open HDF5 file handle.
- Return type
tables.File or h5py.File
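The backend selection rule described above (use the requested backend, otherwise prefer tables when both are installed) can be sketched as follows. The function choose_backend and its is_available hook are hypothetical names introduced for illustration; this is not scprep's implementation.

```python
import importlib.util


def _installed(name):
    """Return True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def choose_backend(backend=None, is_available=_installed):
    """Pick an HDF5 backend name, preferring 'tables' when both are present.

    Hypothetical helper; `is_available` can be swapped out for testing.
    """
    if backend is not None:
        if backend not in ("tables", "h5py"):
            raise ValueError(f"expected 'h5py' or 'tables', got {backend!r}")
        if not is_available(backend):
            raise ImportError(f"{backend} is not installed")
        return backend
    for name in ("tables", "h5py"):  # order encodes the preference
        if is_available(name):
            return name
    raise ImportError("neither tables nor h5py is installed")


# Pretend only h5py is installed.
print(choose_backend(is_available=lambda name: name == "h5py"))
```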
Download
Functions:
download_and_extract_zip – Download a .zip file from a URL and extract it.
download_google_drive – Download a file from Google Drive.
download_url – Download a file from a URL.
unzip – Extract a .zip file and optionally remove the archived version.
-
scprep.io.download.download_and_extract_zip(url, destination)
Download a .zip file from a URL and extract it.
- Parameters
url (string) – URL of file to be downloaded
destination (string) – Directory in which to extract the downloaded zip
-
scprep.io.download.download_google_drive(id, destination)
Download a file from Google Drive.
Requires the file to be available to view by anyone with the URL.
- Parameters
id (string) – Google Drive ID string. You can access this by clicking ‘Get Shareable Link’, which will give a URL of the form <https://drive.google.com/file/d/your_file_id/view?usp=sharing>
destination (string or file) – File to which to save the downloaded data
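Since the id parameter is the file ID rather than the full link, a small helper can pull it out of a shareable URL of the form shown above. This is a sketch with a hypothetical function name, google_drive_id, and it only handles that one URL pattern.

```python
import re


def google_drive_id(share_url):
    """Extract the file ID from a '.../file/d/<id>/view' shareable link."""
    match = re.search(r"/file/d/([^/?#]+)", share_url)
    if match is None:
        raise ValueError(f"no file ID found in {share_url!r}")
    return match.group(1)


file_id = google_drive_id(
    "https://drive.google.com/file/d/your_file_id/view?usp=sharing"
)
print(file_id)
```

The extracted string is what would be passed as id to download_google_drive.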
-
scprep.io.download.download_url(url, destination)
Download a file from a URL.
- Parameters
url (string) – URL of file to be downloaded
destination (string or file) – File to which to save the downloaded data
-
scprep.io.download.unzip(filename, destination=None, delete=True)
Extract a .zip file and optionally remove the archived version.
- Parameters
filename (string) – Path to the zip file
destination (string, optional (default: None)) – Path to the folder in which to extract the zip. If None, extracts to the same directory the archive is in.
delete (boolean, optional (default: True)) – If True, deletes the zip file after extraction
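The behaviour described by these parameters can be sketched with the standard zipfile module. The function unzip_sketch below is an illustration that follows the parameter descriptions, not scprep's own code.

```python
import os
import tempfile
import zipfile


def unzip_sketch(filename, destination=None, delete=True):
    """Extract a .zip archive and optionally delete it afterwards.

    Illustrative only; follows the parameter descriptions above.
    """
    filename = os.path.abspath(filename)
    if destination is None:
        # Default to the directory that contains the archive.
        destination = os.path.dirname(filename)
    with zipfile.ZipFile(filename) as archive:
        archive.extractall(destination)
    if delete:
        os.remove(filename)
    return destination


# Round trip in a temporary directory: create an archive, then extract it.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "data.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("genes.tsv", "GENE1\nGENE2\n")
    unzip_sketch(archive_path)
    extracted = os.path.exists(os.path.join(tmp, "genes.tsv"))
    archive_removed = not os.path.exists(archive_path)
print(extracted, archive_removed)
```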
Filtering
Functions:
filter_duplicates – Filter all duplicate cells.
filter_empty_cells – Remove all cells with zero library size.
filter_empty_genes – Filter all genes with zero counts across all cells.
filter_gene_set_expression – Remove cells with total expression of a gene set above or below a threshold.
filter_library_size – Remove all cells with library size above or below a certain threshold.
filter_rare_genes – Filter all genes with negligible counts in all but a few cells.
filter_values – Remove all cells with values above or below a certain threshold.
-
scprep.filter.filter_duplicates(data, *extra_data, sample_labels=None)
Filter all duplicate cells.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows
sample_labels (Deprecated) –
- Returns
data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples
extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.
-
scprep.filter.filter_empty_cells(data, *extra_data, sample_labels=None)
Remove all cells with zero library size.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows
sample_labels (Deprecated) –
- Returns
data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples
extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.
-
scprep.filter.filter_empty_genes(data, *extra_data)
Filter all genes with zero counts across all cells.
This is equivalent to filter_rare_genes(data, cutoff=0, min_cells=1) but should be faster.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[any, n_features], optional) – Optional additional data objects from which to select the same genes
- Returns
data (array-like, shape=[n_samples, m_features]) – Filtered output data, where m_features <= n_features
extra_data (array-like, shape=[any, m_features]) – Filtered extra data, if passed.
-
scprep.filter.filter_gene_set_expression(data, *extra_data, genes=None, starts_with=None, ends_with=None, exact_word=None, regex=None, cutoff=None, percentile=None, library_size_normalize=False, keep_cells=None, return_expression=False, sample_labels=None, filter_per_sample=None)
Remove cells with total expression of a gene set above or below a threshold.
It is recommended to use plot_gene_set_expression() to choose a cutoff prior to filtering.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows
genes (list-like, optional (default: None)) – Integer column indices or string gene names included in gene set
starts_with (str or None, optional (default: None)) – If not None, select genes that start with this prefix
ends_with (str or None, optional (default: None)) – If not None, select genes that end with this suffix
exact_word (str, list-like or None, optional (default: None)) – If not None, select genes that contain this exact word.
regex (str or None, optional (default: None)) – If not None, select genes that match this regular expression
cutoff (float or tuple of floats, optional (default: None)) – Expression value above or below which to remove cells. Only one of cutoff and percentile should be specified.
percentile (int or tuple of ints, optional (Default: None)) – Percentile above or below which to retain a cell. Must be an integer between 0 and 100. Only one of cutoff and percentile should be specified.
library_size_normalize (bool, optional (default: False)) – Divide gene set expression by library size
keep_cells ({'above', 'below', 'between'} or None, optional (default: None)) – Keep cells above, below or between the cutoff(s). If None, defaults to 'below' for one cutoff and 'between' for two.
return_expression (bool, optional (default: False)) – If True, also return the values corresponding to the retained cells
sample_labels (Deprecated) –
filter_per_sample (Deprecated) –
- Returns
data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples
filtered_expression (list-like, shape=[m_samples]) – Gene set expression corresponding to retained samples, returned only if return_expression is True
extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.
-
scprep.filter.filter_library_size(data, *extra_data, cutoff=None, percentile=None, keep_cells=None, return_library_size=False, sample_labels=None, filter_per_sample=None)
Remove all cells with library size above or below a certain threshold.
It is recommended to use plot_library_size() to choose a cutoff prior to filtering.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows
cutoff (float or tuple of floats, optional (default: None)) – Library size above or below which to retain a cell. Only one of cutoff and percentile should be specified.
percentile (int or tuple of ints, optional (Default: None)) – Percentile above or below which to retain a cell. Must be an integer between 0 and 100. Only one of cutoff and percentile should be specified.
keep_cells ({'above', 'below', 'between'} or None, optional (default: None)) – Keep cells above, below or between the cutoff. If None, defaults to ‘above’ when a single cutoff is given and ‘between’ when two cutoffs are given.
return_library_size (bool, optional (default: False)) – If True, also return the library sizes corresponding to the retained cells
sample_labels (Deprecated) –
filter_per_sample (Deprecated) –
- Returns
data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples
filtered_library_size (list-like, shape=[m_samples]) – Library sizes corresponding to retained samples, returned only if return_library_size is True
extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.
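The interaction between cutoff and keep_cells can be sketched in plain Python. The helper keep_mask is hypothetical, and whether the comparisons in scprep are inclusive or strict is not stated above, so the inclusive bounds used here are an assumption.

```python
def keep_mask(values, cutoff, keep_cells=None):
    """Return one boolean per cell saying whether the cell is retained.

    Hypothetical helper; inclusive comparisons are an assumption.
    """
    two_cutoffs = isinstance(cutoff, (tuple, list))
    if keep_cells is None:
        # Documented defaults: 'above' for one cutoff, 'between' for two.
        keep_cells = "between" if two_cutoffs else "above"
    if keep_cells == "between":
        low, high = cutoff
        return [low <= v <= high for v in values]
    if keep_cells == "above":
        return [v >= cutoff for v in values]
    if keep_cells == "below":
        return [v <= cutoff for v in values]
    raise ValueError(f"unexpected keep_cells: {keep_cells!r}")


counts = [[0, 1, 2], [10, 20, 30], [100, 200, 300]]  # cells x genes
library_size = [sum(cell) for cell in counts]         # one total per cell
mask = keep_mask(library_size, cutoff=(10, 500))
filtered = [cell for cell, keep in zip(counts, mask) if keep]
print(library_size, mask, filtered)
```

The same mask would be applied to any extra_data objects so that their rows stay aligned with the filtered data.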
-
scprep.filter.filter_rare_genes(data, *extra_data, cutoff=0, min_cells=5)
Filter all genes with negligible counts in all but a few cells.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[any, n_features], optional) – Optional additional data objects from which to select the same genes
cutoff (float, optional (default: 0)) – Number of counts above which expression is deemed non-negligible
min_cells (int, optional (default: 5)) – Minimum number of cells above cutoff in order to retain a gene
- Returns
data (array-like, shape=[n_samples, m_features]) – Filtered output data, where m_features <= n_features
extra_data (array-like, shape=[any, m_features]) – Filtered extra data, if passed.
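A minimal sketch of the rule described by cutoff and min_cells: a gene is kept when at least min_cells cells have counts strictly above cutoff. The helper rare_gene_mask is a hypothetical name and the code is plain Python rather than scprep's implementation.

```python
def rare_gene_mask(data, cutoff=0, min_cells=5):
    """Return one boolean per gene: True if the gene should be retained."""
    n_genes = len(data[0])
    # Number of cells in which each gene exceeds the cutoff.
    capture_count = [
        sum(1 for cell in data if cell[j] > cutoff) for j in range(n_genes)
    ]
    return [count >= min_cells for count in capture_count]


data = [  # cells x genes
    [0, 3, 1],
    [0, 0, 2],
    [0, 5, 4],
]
keep = rare_gene_mask(data, cutoff=0, min_cells=3)
filtered = [[v for v, k in zip(cell, keep) if k] for cell in data]
print(keep, filtered)

# With cutoff=0 and min_cells=1 only all-zero genes are dropped, which is
# the equivalence noted under filter_empty_genes.
print(rare_gene_mask(data, cutoff=0, min_cells=1))
```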
-
scprep.filter.filter_values(data, *extra_data, values=None, cutoff=None, percentile=None, keep_cells='above', return_values=False, sample_labels=None, filter_per_sample=None)
Remove all cells with values above or below a certain threshold.
It is recommended to use histogram() to choose a cutoff prior to filtering.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows
values (list-like, shape=[n_samples]) – Value upon which to filter
cutoff (float or tuple of floats, optional (default: None)) – Value above or below which to retain cells. Only one of cutoff and percentile should be specified.
percentile (int or tuple of ints, optional (Default: None)) – Percentile above or below which to retain cells. Must be an integer between 0 and 100. Only one of cutoff and percentile should be specified.
keep_cells ({'above', 'below', 'between'} or None, optional (default: None)) – Keep cells above, below or between the cutoff. If None, defaults to ‘above’ when a single cutoff is given and ‘between’ when two cutoffs are given.
return_values (bool, optional (default: False)) – If True, also return the values corresponding to the retained cells
sample_labels (Deprecated) –
filter_per_sample (Deprecated) –
- Returns
data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples
filtered_values (list-like, shape=[m_samples]) – Values corresponding to retained samples, returned only if return_values is True
extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.
Normalization
Functions:
batch_mean_center – Perform batch mean-centering on the data.
library_size_normalize – Perform L1 normalization on input data.
-
scprep.normalize.batch_mean_center(data, sample_idx=None)
Perform batch mean-centering on the data.
The features of the data are all centered such that the column means are zero. Each batch is centered separately.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
sample_idx (list-like, optional) – Batch indices. If None, data is assumed to be a single batch
- Returns
data – Batch mean-centered output data.
- Return type
array-like, shape=[n_samples, n_features]
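Batch mean-centering as described above can be sketched in a few lines of plain Python: rows are grouped by their batch label and each column is centered within each group. The function name batch_mean_center_sketch is hypothetical.

```python
from collections import defaultdict


def batch_mean_center_sketch(data, sample_idx=None):
    """Subtract the per-batch column mean from every value.

    Illustrative only; treats the data as one batch when sample_idx is None.
    """
    if sample_idx is None:
        sample_idx = [0] * len(data)
    rows_by_batch = defaultdict(list)
    for i, batch in enumerate(sample_idx):
        rows_by_batch[batch].append(i)
    centered = [list(row) for row in data]
    for rows in rows_by_batch.values():
        for j in range(len(data[0])):
            mean = sum(data[i][j] for i in rows) / len(rows)
            for i in rows:
                centered[i][j] = data[i][j] - mean
    return centered


data = [[1.0, 10.0], [3.0, 20.0], [10.0, 0.0], [20.0, 4.0]]
centered = batch_mean_center_sketch(data, sample_idx=["a", "a", "b", "b"])
print(centered)
```

After centering, every column sums to zero within each batch.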
-
scprep.normalize.library_size_normalize(data, rescale=10000, return_library_size=False)
Perform L1 normalization on input data.
Performs L1 normalization on the input data so that the expression values for each cell sum to 1, then rescales the normalized matrix according to rescale (for example, to the median UMI count per cell), effectively scaling all cells as if they were sampled evenly.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
rescale ({‘mean’, ‘median’}, float or None, optional (default: 10000)) – Rescaling strategy. If ‘mean’ or ‘median’, normalized cells are scaled back up to the mean or median expression value. If a float, normalized cells are scaled up to the given value. If None, no rescaling is done and all cells will have normalized library size of 1.
return_library_size (bool, optional (default: False)) – If True, also return the library size pre-normalization
- Returns
data_norm (array-like, shape=[n_samples, n_features]) – Library size normalized output data
filtered_library_size (list-like, shape=[m_samples]) – Library size of cells pre-normalization, returned only if return_library_size is True
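The normalization can be written out as a short sketch. Two points here are assumptions rather than documented behaviour: 'mean' and 'median' are interpreted as the mean or median library size, and cells with zero library size are left as zeros. The function name is hypothetical.

```python
import statistics


def library_size_normalize_sketch(data, rescale=10000):
    """Divide each cell by its library size, then multiply by a scale.

    Illustrative only; returns (normalized data, original library sizes).
    """
    library_size = [sum(cell) for cell in data]
    if rescale == "median":
        scale = statistics.median(library_size)  # assumed interpretation
    elif rescale == "mean":
        scale = statistics.mean(library_size)    # assumed interpretation
    elif rescale is None:
        scale = 1
    else:
        scale = rescale
    normalized = [
        # Assumption: cells with no counts stay all-zero.
        [value / size * scale if size > 0 else 0.0 for value in cell]
        for cell, size in zip(data, library_size)
    ]
    return normalized, library_size


data = [[1, 3], [2, 2], [10, 30]]  # cells x genes
normalized, library_size = library_size_normalize_sketch(data, rescale=100)
print(normalized, library_size)
```

The first and third cells have the same relative expression, so they become identical after normalization even though their library sizes differ tenfold.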
Transformation
Functions:
arcsinh – Inverse hyperbolic sine transform.
log – Log transform.
sqrt – Square root transform.
-
scprep.transform.arcsinh(data, cofactor=5)
Inverse hyperbolic sine transform.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
cofactor (float or None, optional (default: 5)) – Factor by which to divide data before arcsinh transform
- Returns
data – Inverse hyperbolic sine transformed output data
- Return type
array-like, shape=[n_samples, n_features]
- Raises
ValueError – if cofactor <= 0
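The transform amounts to dividing by the cofactor and applying the inverse hyperbolic sine. The sketch below uses math.asinh from the standard library. Treating cofactor=None as "no division" is an assumption, and arcsinh_sketch is a hypothetical name.

```python
import math


def arcsinh_sketch(data, cofactor=5):
    """Apply asinh(value / cofactor) to every entry of a nested list."""
    if cofactor is not None and cofactor <= 0:
        raise ValueError(f"expected cofactor > 0 or None, got {cofactor}")
    divisor = 1 if cofactor is None else cofactor  # assumption for None
    return [[math.asinh(v / divisor) for v in row] for row in data]


transformed = arcsinh_sketch([[0, 5, 50]], cofactor=5)
print(transformed)
```

Zero stays zero, and values near the cofactor are compressed roughly linearly while large values are compressed roughly logarithmically.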
-
scprep.transform.log(data, pseudocount=1, base=10)
Log transform.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
pseudocount (int, optional (default: 1)) – Pseudocount to add to values before log transform. If data is sparse, pseudocount must be 1 such that log(0 + pseudocount) = 0
base ({2, 'e', 10}, optional (default: 10)) – Logarithm base.
- Returns
data – Log transformed output data
- Return type
array-like, shape=[n_samples, n_features]
- Raises
ValueError – if data has zero or negative values
RuntimeWarning – if data is sparse and pseudocount != 1
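The role of the pseudocount is easiest to see in a small sketch: with pseudocount=1, zeros map to log(1) = 0, which is why sparse data can stay sparse. The function log_sketch is hypothetical and written in plain Python.

```python
import math


def log_sketch(data, pseudocount=1, base=10):
    """Apply log(value + pseudocount) in the requested base."""
    log_functions = {2: math.log2, "e": math.log, 10: math.log10}
    if base not in log_functions:
        raise ValueError(f"expected base in [2, 'e', 10], got {base!r}")
    if any(v + pseudocount <= 0 for row in data for v in row):
        raise ValueError("data + pseudocount must be strictly positive")
    log_fn = log_functions[base]
    return [[log_fn(v + pseudocount) for v in row] for row in data]


transformed = log_sketch([[0, 9, 99]], pseudocount=1, base=10)
print(transformed)
```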
Measurements
Functions:
gene_capture_count – Measure the number of cells in which each gene has non-negligible counts.
gene_set_expression – Measure the expression of a set of genes in each cell.
gene_variability – Measure the variability of each gene in a dataset.
library_size – Measure the library size of each cell.
-
scprep.measure.gene_capture_count(data, cutoff=0)
Measure the number of cells in which each gene has non-negligible counts.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
cutoff (float, optional (default: 0)) – Number of counts above which expression is deemed non-negligible
- Returns
capture-count – Capture count for each gene
- Return type
list-like, shape=[m_features]
-
scprep.measure.gene_set_expression(data, genes=None, library_size_normalize=False, starts_with=None, ends_with=None, exact_word=None, regex=None)
Measure the expression of a set of genes in each cell.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
genes (list-like, shape<=[n_features], optional (default: None)) – Integer column indices or string gene names included in gene set
library_size_normalize (bool, optional (default: False)) – Divide gene set expression by library size
starts_with (str or None, optional (default: None)) – If not None, select genes that start with this prefix
ends_with (str or None, optional (default: None)) – If not None, select genes that end with this suffix
exact_word (str, list-like or None, optional (default: None)) – If not None, select genes that contain this exact word.
regex (str or None, optional (default: None)) – If not None, select genes that match this regular expression
- Returns
gene_set_expression – Sum over genes for each cell
- Return type
list-like, shape=[n_samples]
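As an illustration of the starts_with and library_size_normalize options, the sketch below sums the counts of all genes sharing a prefix (for example mitochondrial genes named 'MT-...') in each cell. The function name and the example gene names are hypothetical, and returning 0 for empty cells is an assumption.

```python
def gene_set_expression_sketch(data, gene_names, starts_with,
                               library_size_normalize=False):
    """Sum the expression of genes whose name starts with a prefix."""
    selected = [
        j for j, name in enumerate(gene_names) if name.startswith(starts_with)
    ]
    expression = [sum(cell[j] for j in selected) for cell in data]
    if library_size_normalize:
        # Express the gene set as a fraction of each cell's total counts.
        expression = [
            e / sum(cell) if sum(cell) > 0 else 0.0
            for e, cell in zip(expression, data)
        ]
    return expression


gene_names = ["MT-CO1", "MT-ND1", "ACTB"]
data = [[1, 1, 8], [4, 0, 4]]  # cells x genes
raw = gene_set_expression_sketch(data, gene_names, starts_with="MT-")
fraction = gene_set_expression_sketch(
    data, gene_names, starts_with="MT-", library_size_normalize=True
)
print(raw, fraction)
```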
-
scprep.measure.gene_variability(data, kernel_size=0.005, smooth=5, return_means=False)
Measure the variability of each gene in a dataset.
Variability is computed as the deviation from the rolling median of the mean-variance curve
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
kernel_size (float or int, optional (default: 0.005)) – Width of rolling median window. If a float between 0 and 1, the width is given by kernel_size * data.shape[1]. Otherwise should be an odd integer
smooth (int, optional (default: 5)) – Amount of smoothing to apply to the median filter
return_means (boolean, optional (default: False)) – If True, return the gene means
- Returns
variability – Variability for each gene
- Return type
list-like, shape=[n_features]
Statistics
Functions:
EMD – Compute Earth Mover's Distance between samples.
differential_expression – Calculate the most significant genes between two datasets.
differential_expression_by_cluster – Calculate the most significant genes for each cluster in a dataset.
knnDREMI – Compute kNN conditional Density Resampled Estimate of Mutual Information.
mean_difference – Calculate the mean difference in genes between two datasets.
mutual_information – Compute mutual information score with set number of bins.
pairwise_correlation – Compute pairwise Pearson correlation between columns of two matrices.
plot_knnDREMI – Plot results of DREMI.
rank_sum_statistic – Calculate the Wilcoxon rank-sum (aka Mann-Whitney U) statistic.
t_statistic – Calculate Welch's t statistic.
-
scprep.stats.EMD(x, y)
Compute Earth Mover's Distance between samples.
Calculates an approximation of Earth Mover’s Distance (also called Wasserstein distance) for 2 variables. This can be thought of as the distance between two probability distributions. This metric is useful for identifying differentially expressed genes between two groups of cells. For more information see https://en.wikipedia.org/wiki/Wasserstein_metric.
Note that this is a wrapper function for scipy.stats.wasserstein_distance and assumes the data is 1-dimensional.
- Parameters
x (array-like, shape=[n_samples]) – Input data (feature 1)
y (array-like, shape=[n_samples]) – Input data (feature 2)
- Returns
emd – Earth Mover’s Distance between x and y.
- Return type
float
Examples
>>> import scprep
>>> data = scprep.io.load_csv("my_data.csv")
>>> emd = scprep.stats.EMD(data['GENE1'], data['GENE2'])
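For intuition, when both 1-dimensional samples have the same number of equally weighted observations, the Earth Mover's Distance reduces to the mean absolute difference between the sorted values. The sketch below covers only that special case in plain Python. The function emd_1d_equal_size is hypothetical and is not how scprep or scipy compute the general case.

```python
def emd_1d_equal_size(x, y):
    """1-D Earth Mover's Distance for two samples of equal size."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    # Match the i-th smallest value of x with the i-th smallest value of y.
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


emd = emd_1d_equal_size([0, 1, 3], [8, 5, 6])
print(emd)
```

Shifting every value of one sample by a constant shifts the distance by the same amount, which matches the "cost of moving mass" interpretation.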
-
scprep.stats.differential_expression(X, Y, measure='difference', direction='both', gene_names=None, n_jobs=-2)
Calculate the most significant genes between two datasets.
If using measure="emd", the test statistic is multiplied by the sign of the mean difference in order to allow for distinguishing between positive and negative shifts. To ignore this, use direction="both" to sort by the absolute value.
- Parameters
X (array-like, shape=[n_cells, n_genes]) –
Y (array-like, shape=[m_cells, n_genes]) –
measure ({'difference', 'emd', 'ttest', 'ranksum'}, optional (default: 'difference')) – The measurement to be used to rank genes. 'difference' is the mean difference between genes. 'emd' refers to Earth Mover's Distance. 'ttest' refers to Welch's t-statistic. 'ranksum' refers to the Wilcoxon rank sum statistic (or the Mann-Whitney U statistic).
direction ({'up', 'down', 'both'}, optional (default: 'both')) – The direction in which to consider genes significant. If ‘up’, rank genes where X > Y. If ‘down’, rank genes where X < Y. If ‘both’, rank genes by absolute value.
gene_names (list-like or None, optional (default: None)) – List of gene names associated with the columns of X and Y
n_jobs (int, optional (default: -2)) – Number of threads to use if the measurement is parallelizable (currently used for EMD). If negative, -1 refers to all available cores.
- Returns
result – Ordered DataFrame with a column “gene” and a column named measure.
- Return type
pd.DataFrame
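The ranking implied by measure='difference' and the three direction options can be sketched as follows. This is an illustration under stated assumptions: the difference is taken as mean(X) minus mean(Y) per gene, 'up' sorts from most positive, 'down' from most negative, and 'both' by absolute value. The function name is hypothetical, and the result is a list of tuples rather than the DataFrame that scprep returns.

```python
def rank_by_mean_difference(X, Y, gene_names, direction="both"):
    """Rank genes by the difference of their means between X and Y."""
    n_genes = len(gene_names)
    mean_x = [sum(row[j] for row in X) / len(X) for j in range(n_genes)]
    mean_y = [sum(row[j] for row in Y) / len(Y) for j in range(n_genes)]
    difference = [mx - my for mx, my in zip(mean_x, mean_y)]
    if direction == "up":
        order = sorted(range(n_genes), key=lambda j: -difference[j])
    elif direction == "down":
        order = sorted(range(n_genes), key=lambda j: difference[j])
    elif direction == "both":
        order = sorted(range(n_genes), key=lambda j: -abs(difference[j]))
    else:
        raise ValueError(f"unexpected direction: {direction!r}")
    return [(gene_names[j], difference[j]) for j in order]


X = [[1, 5, 2], [3, 5, 2]]  # cells x genes
Y = [[0, 9, 2], [0, 9, 2]]
genes = ["A", "B", "C"]
print(rank_by_mean_difference(X, Y, genes, direction="both"))
```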
-
scprep.stats.differential_expression_by_cluster(data, clusters, measure='difference', direction='both', gene_names=None, n_jobs=-2)
Calculate the most significant genes for each cluster in a dataset.
Measurements are run for each cluster against the rest of the dataset.
- Parameters
data (array-like, shape=[n_cells, n_genes]) –
clusters (list-like, shape=[n_cells]) –
measure ({'difference', 'emd', 'ttest', 'ranksum'}, optional (default: 'difference')) – The measurement to be used to rank genes. ‘difference’ is the mean difference between genes. ‘emd’ refers to Earth Mover’s Distance. ‘ttest’ refers to Welch’s t-statistic. ‘ranksum’ refers to the Wilcoxon rank sum statistic (or the Mann-Whitney U statistic).
direction ({'up', 'down', 'both'}, optional (default: 'both')) – The direction in which to consider genes significant. If ‘up’, rank genes that are higher in the cluster than in the rest of the dataset. If ‘down’, rank genes that are lower in the cluster than in the rest of the dataset. If ‘both’, rank genes by absolute value.
gene_names (list-like or None, optional (default: None)) – List of gene names associated with the columns of data
n_jobs (int, optional (default: -2)) – Number of threads to use if the measurement is parallelizable (currently used for EMD). If negative, -1 refers to all available cores.
- Returns
result – Dictionary containing an ordered DataFrame with a column “gene” and a column named measure for each cluster.
- Return type
dict(pd.DataFrame)
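The "each cluster against the rest of the dataset" comparison can be sketched as follows. This is a standard-library illustration with a hypothetical helper name, not scprep's code. It returns plain lists of mean differences rather than ordered DataFrames.

```python
from statistics import mean

def one_vs_rest_mean_difference(data, clusters):
    """Per-cluster mean difference against all other cells (illustrative only)."""
    n_genes = len(data[0])
    result = {}
    for cluster in sorted(set(clusters)):
        # Split the cells into the current cluster and everything else.
        inside = [row for row, c in zip(data, clusters) if c == cluster]
        rest = [row for row, c in zip(data, clusters) if c != cluster]
        result[cluster] = [
            mean(r[j] for r in inside) - mean(r[j] for r in rest)
            for j in range(n_genes)
        ]
    return result

data = [[4.0, 0.0], [6.0, 0.0], [1.0, 3.0], [1.0, 5.0]]  # 4 cells x 2 genes
clusters = [0, 0, 1, 1]
diffs = one_vs_rest_mean_difference(data, clusters)
```

The sketch needs at least two clusters, since the "rest" group is otherwise empty.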
-
scprep.stats.
knnDREMI
(x, y, k=10, n_bins=20, n_mesh=3, n_jobs=1, plot=False, return_drevi=False, **kwargs)[source]¶ Compute kNN conditional Density Resampled Estimate of Mutual Information.
Calculates k-Nearest Neighbor conditional Density Resampled Estimate of Mutual Information as defined in Van Dijk et al, 2018. 1
kNN-DREMI is an adaptation of DREMI (Krishnaswamy et al. 2014, 2) for single cell RNA-sequencing data. DREMI captures the functional relationship between two genes across their entire dynamic range. The key change to kNN-DREMI is the replacement of the heat diffusion-based kernel-density estimator from Botev et al., 2010 3 by a k-nearest neighbor-based density estimator (Sricharan et al., 2012 4), which has been shown to be an effective method for sparse and high dimensional datasets.
Note that kNN-DREMI, like DREMI and unlike mutual information, is not symmetric. Here we are estimating I(Y|X).
- Parameters
x (array-like, shape=[n_samples]) – Input data (independent feature)
y (array-like, shape=[n_samples]) – Input data (dependent feature)
k (int, range=[0:n_samples), optional (default: 10)) – Number of neighbors
n_bins (int, range=[0:inf), optional (default: 20)) – Number of bins for density resampling
n_mesh (int, range=[0:inf), optional (default: 3)) – In each bin, density will be calculated around (n_mesh ** 2) points
n_jobs (int, optional (default: 1)) – Number of threads used for kNN calculation
plot (bool, optional (default: False)) – If True, create plots of the data like those seen in Fig 5C/D of van Dijk et al. 2018 (doi:10.1016/j.cell.2018.05.061).
return_drevi (bool, optional (default: False)) – If True, return the DREVI normalized density matrix in addition to the DREMI score.
**kwargs (additional arguments for scprep.stats.plot_knnDREMI) –
- Returns
dremi (float) – kNN conditional Density Resampled Estimate of Mutual Information
drevi (np.ndarray) – DREVI normalized density matrix. Only returned if return_drevi is True.
Examples
>>> import scprep
>>> data = scprep.io.load_csv("my_data.csv")
>>> dremi = scprep.stats.knnDREMI(data['GENE1'], data['GENE2'],
...                               plot=True,
...                               filename='dremi.png')
References
- 1(1,2)
van Dijk D et al. (2018), Recovering Gene Interactions from Single-Cell Data Using Data Diffusion, Cell.
- 2
Krishnaswamy S et al. (2014), Conditional density-based analysis of T cell signaling in single-cell data, Science.
- 3
Botev ZI et al. (2010), Kernel density estimation via diffusion, The Annals of Statistics.
- 4
Sricharan K et al. (2012), Estimation of nonlinear functionals of densities with confidence, IEEE Transactions on Information Theory.
-
scprep.stats.
mean_difference
(X, Y)[source]¶ Calculate the mean difference in genes between two datasets.
In the case where the data has been log normalized, this is equivalent to fold change.
- Parameters
X (array-like, shape=[n_cells, n_genes]) –
Y (array-like, shape=[m_cells, n_genes]) –
- Returns
difference
- Return type
list-like, shape=[n_genes]
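The link to fold change can be checked numerically: a difference of mean log values equals the log of the ratio of geometric means. The toy counts below are invented for illustration, and the example uses a plain log2 transform without a pseudocount, which is an assumption about the normalization.

```python
import math
from statistics import geometric_mean, mean

x_counts = [2.0, 8.0, 4.0]   # one gene, three cells in X
y_counts = [1.0, 2.0, 4.0]   # the same gene, three cells in Y

# Mean difference computed on log-transformed values.
diff_of_log_means = (mean(math.log2(v) for v in x_counts)
                     - mean(math.log2(v) for v in y_counts))

# Log fold change between the geometric means of the raw values.
log2_fold_change = math.log2(geometric_mean(x_counts) / geometric_mean(y_counts))
```

Both quantities equal 1 here, i.e. a two-fold increase in X relative to Y.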
-
scprep.stats.
mutual_information
(x, y, bins=8)[source]¶ Compute mutual information score with set number of bins.
Helper function for sklearn.metrics.mutual_info_score that builds a contingency table over a set number of bins. Credit: Warren Weckesser.
- Parameters
x (array-like, shape=[n_samples]) – Input data (feature 1)
y (array-like, shape=[n_samples]) – Input data (feature 2)
bins (int or array-like, (default: 8)) – Passed to np.histogram2d to calculate a contingency table.
- Returns
mi – Mutual information between x and y.
- Return type
float
Examples
>>> import scprep
>>> data = scprep.io.load_csv("my_data.csv")
>>> mi = scprep.stats.mutual_information(data['GENE1'], data['GENE2'])
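To make the contingency-table idea concrete, here is a dependency-free sketch of mutual information computed from an equal-width 2D histogram. The function binned_mutual_information is hypothetical. It only mirrors the idea of binning followed by a mutual information score and does not call np.histogram2d or sklearn. The result is in nats.

```python
import math
from collections import Counter

def binned_mutual_information(x, y, bins=8):
    """Mutual information (in nats) from an equal-width 2D histogram."""
    def bin_index(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins
        if width == 0:                        # constant input: a single bin
            return [0] * len(values)
        # Clamp the maximum value into the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = bin_index(x), bin_index(y)
    n = len(x)
    joint = Counter(zip(bx, by))              # contingency table counts
    px, py = Counter(bx), Counter(by)         # marginal counts
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )

x = [0, 1, 2, 3, 4, 5, 6, 7] * 4
mi_self = binned_mutual_information(x, x)                 # y fully determined by x
mi_const = binned_mutual_information(x, [1.0] * len(x))   # y carries no information
```

With eight equally populated bins, a variable compared with itself gives log(8), and a constant variable gives zero.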
-
scprep.stats.
pairwise_correlation
(X, Y, ignore_nan=False)[source]¶ Compute pairwise Pearson correlation between columns of two matrices.
From https://stackoverflow.com/a/33651442/3996580
- Parameters
X (array-like, shape=[n_samples, m_features]) – Input data
Y (array-like, shape=[n_samples, p_features]) – Input data
ignore_nan (bool, optional (default: False)) – If True, ignore NaNs, computing correlation over remaining values
- Returns
cor
- Return type
np.ndarray, shape=[m_features, p_features]
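The output shape [m_features, p_features] follows from correlating every column of X with every column of Y. A pure-Python sketch of that computation, using a hypothetical helper and no NaN handling, might look like this:

```python
import math

def pairwise_pearson(X, Y):
    """Pearson correlation between every column of X and every column of Y."""
    def centered_columns(M):
        cols = [list(col) for col in zip(*M)]             # rows -> columns
        return [[v - sum(col) / len(col) for v in col] for col in cols]

    Xc, Yc = centered_columns(X), centered_columns(Y)
    cor = []
    for a in Xc:
        row = []
        for b in Yc:
            cov = sum(p * q for p, q in zip(a, b))
            norm = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
            row.append(cov / norm)
        cor.append(row)
    return cor   # shape [m_features, p_features]

X = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]      # 4 samples x 2 features
Y = [[2.0], [4.0], [6.0], [8.0]]                          # 4 samples x 1 feature
cor = pairwise_pearson(X, Y)
```

A constant column has zero variance and would cause a division by zero in this sketch.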
-
scprep.stats.
plot_knnDREMI
(dremi, mutual_info, x, y, n_bins, n_mesh, density, bin_density, drevi, figsize=(12, 3.5), filename=None, xlabel='Feature 1', ylabel='Feature 2', title_fontsize=18, label_fontsize=16, dpi=150)[source]¶ Plot results of DREMI.
Create plots of the data like those seen in Fig 5C/D of van Dijk et al. 2018. 1 Note that this function is not designed to be called manually. Instead create plots by running scprep.stats.knnDREMI with plot=True.
- Parameters
figsize (tuple, optional (default: (12, 3.5))) – Matplotlib figure size
filename (str or None, optional (default: None)) – If given, saves the results to a file
xlabel (str, optional (default: "Feature 1")) – The name of the gene shown on the x axis
ylabel (str, optional (default: "Feature 2")) – The name of the gene shown on the y axis
title_fontsize (int, optional (default: 18)) – Font size for figure titles
label_fontsize (int, optional (default: 16)) – Font size for axis labels
dpi (int, optional (default: 150)) – Dots per inch for saved figure
Plotting¶
Functions:
histogram – Plot a histogram.
plot_gene_set_expression – Plot the histogram of the expression of a gene set.
plot_library_size – Plot the library size histogram.
jitter – Create a jitter plot.
marker_plot – Plot marker gene enrichment.
rotate_scatter3d – Create a rotating 3D scatter plot.
scatter – Create a scatter plot.
scatter2d – Create a 2D scatter plot.
scatter3d – Create a 3D scatter plot.
scree_plot – Plot the explained variance of each principal component.
plot_gene_variability – Plot the histogram of gene variability.
-
scprep.plot.
histogram
(data, bins=100, log=False, cutoff=None, percentile=None, ax=None, figsize=None, xlabel=None, ylabel='Number of cells', title=None, fontsize=None, histtype='stepfilled', label=None, legend=True, alpha=None, filename=None, dpi=None, **kwargs)[source]¶ Plot a histogram.
- Parameters
data (array-like, shape=[n_samples]) – Input data. Multiple datasets may be given as a list of array-likes.
bins (int, optional (default: 100)) – Number of bins to draw in the histogram
log (bool, or {'x', 'y'}, optional (default: False)) – If True, plot both axes on a log scale. If ‘x’ or ‘y’, only plot the given axis on a log scale. If False, plot both axes on a linear scale.
cutoff (float or None, optional (default: None)) – Absolute cutoff at which to draw a vertical line. Only one of cutoff and percentile may be given.
percentile (float or None, optional (default: None)) – Percentile between 0 and 100 at which to draw a vertical line. Only one of cutoff and percentile may be given.
ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.
figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)
[x,y]label (str, optional) – Labels to display on the x and y axis.
title (str or None, optional (default: None)) – Axis title.
fontsize (float or None (default: None)) – Base font size.
histtype ({'bar', 'barstacked', 'step', 'stepfilled'}, optional (default: 'stepfilled')) – The type of histogram to draw. ‘bar’ is a traditional bar-type histogram. If multiple data are given the bars are arranged side by side. ‘barstacked’ is a bar-type histogram where multiple data are stacked on top of each other. ‘step’ generates a lineplot that is by default unfilled. ‘stepfilled’ generates a lineplot that is by default filled.
label (str or None, optional (default: None)) – String, or sequence of strings to match multiple datasets.
legend (bool, optional (default: True)) – Show the legend if label is given.
alpha (float, optional (default: 1 for a single dataset, 0.5 for multiple)) – Histogram transparency
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**kwargs (additional arguments for matplotlib.pyplot.hist) –
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
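The rule that only one of cutoff and percentile may be given can be sketched as a small resolver that turns either argument into the x position of the vertical line. The helper resolve_cutoff is hypothetical. In particular, the linear interpolation used for the percentile is an assumption and may differ from what scprep does internally.

```python
def resolve_cutoff(data, cutoff=None, percentile=None):
    """Return the x position of the vertical line, or None (illustrative)."""
    if cutoff is not None and percentile is not None:
        raise ValueError("Only one of `cutoff` and `percentile` may be given.")
    if percentile is None:
        return cutoff
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    ordered = sorted(data)
    # Linear interpolation between the two closest ranks.
    rank = percentile * (len(ordered) - 1) / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (rank - low) * (ordered[high] - ordered[low])

library_sizes = list(range(101))           # toy values 0..100
line_at = resolve_cutoff(library_sizes, percentile=90)
absolute = resolve_cutoff(library_sizes, cutoff=5)
```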
-
scprep.plot.
plot_gene_set_expression
(data, genes=None, starts_with=None, ends_with=None, exact_word=None, regex=None, bins=100, log=False, cutoff=None, percentile=None, library_size_normalize=False, ax=None, figsize=None, xlabel='Gene expression', title=None, fontsize=None, filename=None, dpi=None, **kwargs)[source]¶ Plot the histogram of the expression of a gene set.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data. Multiple datasets may be given as a list of array-likes.
genes (list-like, optional (default: None)) – Integer column indices or string gene names included in gene set
starts_with (str or None, optional (default: None)) – If not None, select genes that start with this prefix
ends_with (str or None, optional (default: None)) – If not None, select genes that end with this suffix
exact_word (str, list-like or None, optional (default: None)) – If not None, select genes that contain this exact word.
regex (str or None, optional (default: None)) – If not None, select genes that match this regular expression
bins (int, optional (default: 100)) – Number of bins to draw in the histogram
log (bool, or {'x', 'y'}, optional (default: False)) – If True, plot both axes on a log scale. If ‘x’ or ‘y’, only plot the given axis on a log scale. If False, plot both axes on a linear scale.
cutoff (float or None, optional (default: None)) – Absolute cutoff at which to draw a vertical line. Only one of cutoff and percentile may be given.
percentile (float or None, optional (default: None)) – Percentile between 0 and 100 at which to draw a vertical line. Only one of cutoff and percentile may be given.
library_size_normalize (bool, optional (default: False)) – Divide gene set expression by library size
ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.
figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)
[x,y]label (str, optional) – Labels to display on the x and y axis.
title (str or None, optional (default: None)) – Axis title.
fontsize (float or None (default: None)) – Base font size.
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**kwargs (additional arguments for matplotlib.pyplot.hist) –
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
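The gene selection arguments can be pictured as simple string filters over the gene names. The sketch below uses a hypothetical helper and applies a single rule at a time. It accepts exact_word only as a string, and it makes no claim about how scprep combines several rules or handles list-like input.

```python
import re

def select_genes(gene_names, starts_with=None, ends_with=None,
                 exact_word=None, regex=None):
    """Return the gene names matching one selection rule (illustrative)."""
    if starts_with is not None:
        return [g for g in gene_names if g.startswith(starts_with)]
    if ends_with is not None:
        return [g for g in gene_names if g.endswith(ends_with)]
    if exact_word is not None:
        # \b anchors the match at word boundaries.
        pattern = re.compile(r"\b" + re.escape(exact_word) + r"\b")
        return [g for g in gene_names if pattern.search(g)]
    if regex is not None:
        return [g for g in gene_names if re.search(regex, g)]
    return list(gene_names)

genes = ["MT-CO1", "MT-ND1", "ACTB", "RPL3", "GAPDH (ENSG00000111640)"]
mito = select_genes(genes, starts_with="MT-")
ribo = select_genes(genes, regex=r"^RP[LS]\d+")
gapdh = select_genes(genes, exact_word="GAPDH")
```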
-
scprep.plot.
plot_library_size
(data, bins=100, log=True, cutoff=None, percentile=None, ax=None, figsize=None, xlabel='Library size', title=None, fontsize=None, filename=None, dpi=None, **kwargs)[source]¶ Plot the library size histogram.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data. Multiple datasets may be given as a list of array-likes.
bins (int, optional (default: 100)) – Number of bins to draw in the histogram
log (bool, or {'x', 'y'}, optional (default: True)) – If True, plot both axes on a log scale. If ‘x’ or ‘y’, only plot the given axis on a log scale. If False, plot both axes on a linear scale.
cutoff (float or None, optional (default: None)) – Absolute cutoff at which to draw a vertical line. Only one of cutoff and percentile may be given.
percentile (float or None, optional (default: None)) – Percentile between 0 and 100 at which to draw a vertical line. Only one of cutoff and percentile may be given.
ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.
figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)
[x,y]label (str, optional) – Labels to display on the x and y axis.
title (str or None, optional (default: None)) – Axis title.
fontsize (float or None (default: None)) – Base font size.
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**kwargs (additional arguments for matplotlib.pyplot.hist) –
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
-
scprep.plot.
jitter
(labels, values, sigma=0.1, c=None, cmap=None, cmap_scale='linear', s=None, mask=None, plot_means=True, means_s=100, means_c='lightgrey', discrete=None, ax=None, legend=None, colorbar=None, shuffle=True, figsize=None, ticks=True, xticks=None, yticks=None, ticklabels=True, xticklabels=None, yticklabels=None, xlabel=None, ylabel=None, title=None, fontsize=None, legend_title=None, legend_loc='best', legend_anchor=None, vmin=None, vmax=None, filename=None, dpi=None, **plot_kwargs)[source]¶ Create a jitter plot.
Creates a 2D scatterplot showing the distribution of values for points that have associated labels.
- Parameters
labels (array-like, shape=[n_cells]) – Class labels associated with each point.
values (array-like, shape=[n_cells]) – Values associated with each cell
sigma (float, optional (default: 0.1)) – Adjusts the amount of jitter.
c (list-like or None, optional (default: None)) – Color vector. Can be a single color value (RGB, RGBA, or named matplotlib colors), an array of these of length n_samples, or a list of discrete or continuous values of any data type. If c is not a single or list of matplotlib colors, the values in c will be used to populate the legend / colorbar with colors from cmap
cmap (matplotlib colormap, str, dict or None, optional (default: None)) – matplotlib colormap. If None, uses tab20 for discrete data and inferno for continuous data. If a dictionary, expects one key for every unique value in c, where values are valid matplotlib colors (hsv, rgb, rgba, or named colors)
cmap_scale ({'linear', 'log', 'symlog', 'sqrt'} or matplotlib.colors.Normalize, optional (default: 'linear')) – Colormap normalization scale. For advanced use, see <https://matplotlib.org/users/colormapnorms.html>
s (float, optional (default: None)) – Point size. If None, set to 200 / sqrt(n_samples)
mask (list-like, optional (default: None)) – boolean mask to hide data points
plot_means (bool, optional (default: True)) – If True, plot the mean value for each label.
means_s (float, optional (default: 100)) – Point size for mean values.
means_c (string, list-like or matplotlib color, optional (default: 'lightgrey')) – Point color(s) for mean values.
discrete (bool or None, optional (default: None)) – If True, the legend is categorical. If False, the legend is a colorbar. If None, discreteness is detected automatically. Data containing non-numeric c is always discrete, and numeric data with 20 or fewer unique values is discrete.
ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created
legend (bool, optional (default: None)) – States whether or not to create a legend. If data is continuous, the legend is a colorbar. If None, a legend is created where possible
colorbar (bool, optional (default: None)) – Synonym for legend
shuffle (bool, optional (default: True)) – If True, shuffles the order of points on the plot.
figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.
ticks (True, False, or list-like (default: True)) – If True, keeps default axis ticks. If False, removes axis ticks. If a list, sets custom axis ticks
{x,y}ticks (True, False, or list-like (default: None)) – If set, overrides ticks
ticklabels (True, False, or list-like (default: True)) – If True, keeps default axis tick labels. If False, removes axis tick labels. If a list, sets custom axis tick labels
{x,y}ticklabels (True, False, or list-like (default: None)) – If set, overrides ticklabels
{x,y}label (str or None (default : None)) – Axis labels. If None, no label is set.
title (str or None (default: None)) – axis title. If None, no title is set.
fontsize (float or None (default: None)) – Base font size.
legend_title (str (default: None)) – title for the colorbar or legend
legend_loc (int or string or pair of floats, default: 'best') – Matplotlib legend location. Only used for discrete data. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.
legend_anchor (BboxBase, 2-tuple, or 4-tuple) – Box that is used to position the legend in conjunction with loc. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.
vmin, vmax (float, optional (default: None)) – Range of values to use as the range for the colormap. Only used if data is continuous
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**plot_kwargs (keyword arguments) – Extra arguments passed to matplotlib.pyplot.scatter.
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
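The layout behind a jitter plot can be sketched without matplotlib: each point is placed at the integer position of its label plus a small random offset scaled by sigma, and one mean is computed per label. The helper below is hypothetical, and the use of Gaussian noise is an assumption suggested by the parameter name rather than something stated above.

```python
import random
from statistics import mean

def jitter_positions(labels, values, sigma=0.1, seed=0):
    """Compute jittered x positions and per-label means (illustrative)."""
    rng = random.Random(seed)
    categories = sorted(set(labels))
    index = {label: i for i, label in enumerate(categories)}
    # Each point sits at its label's integer position plus Gaussian noise.
    x = [index[label] + rng.gauss(0, sigma) for label in labels]
    means = {
        label: mean(v for lab, v in zip(labels, values) if lab == label)
        for label in categories
    }
    return x, means

labels = ["ctrl", "ctrl", "treated", "treated"]
values = [1.0, 3.0, 10.0, 14.0]
x, means = jitter_positions(labels, values)
```

The jittered x positions would be plotted against values, with the per-label means drawn on top when plot_means is True.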
-
scprep.plot.
marker_plot
(data, clusters, markers, gene_names=None, normalize_expression=True, normalize_emd=True, reorder_tissues=True, reorder_markers=True, cmap='magma', title=None, figsize=None, ax=None, fontsize=None)[source]¶ Plot marker gene enrichment.
Generate a plot indicating the expression level and enrichment of a set of marker genes for each cluster.
Color of each point indicates the expression of each gene in each cluster. The size of each point indicates how differentially expressed each gene is in each cluster.
- Parameters
data (array-like, shape=[n_cells, n_genes]) – Gene expression data for calculating expression statistics.
clusters (list-like, shape=[n_cells]) – Cluster assignments for each cell. Should be ints like the output of most sklearn.cluster methods.
markers (dict or list-like) – If a dictionary, keys represent tissues and values being a list of marker genes in each tissue. If a list, a list of marker genes.
gene_names (list-like, shape=[n_genes]) – List of gene names.
normalize_{expression,emd} (bool, optional (default: True)) – Normalize the expression and EMD of each row.
reorder_{tissues,markers} (bool, optional (default: True)) – Reorder tissues and markers according to hierarchical clustering.
cmap (str or matplotlib colormap, optional (default: 'magma')) – Colormap with which to color points.
title (str or None, optional (default: None)) – Title for the plot
figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)
ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.
fontsize (int or None, optional (default: None)) – Base fontsize.
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
Example
>>> import scprep
>>> markers = {'Adaxial - Immature': ['myl10', 'myod1'],
...            'Adaxial - Mature': ['myog'],
...            'Presomitic mesoderm': ['tbx6', 'msgn1', 'tbx16'],
...            'Forming somites': ['mespba', 'ripply2'],
...            'Somites': ['meox1', 'ripply1', 'aldh1a2']}
>>> scprep.plot.marker_plot(data, clusters, markers, gene_names=gene_names,
...                         title="Tailbud - PSM")
-
scprep.plot.
rotate_scatter3d
(data, filename=None, rotation_speed=30, fps=10, ax=None, figsize=None, elev=None, ipython_html='jshtml', dpi=None, **kwargs)[source]¶ Create a rotating 3D scatter plot.
Builds upon matplotlib.pyplot.scatter with nice defaults and handles categorical colors / legends better.
- Parameters
data (array-like, phate.PHATE or scanpy.AnnData) – Input data. Only the first three dimensions are used.
filename (str, optional (default: None)) – If not None, saves a .gif or .mp4 with the output
rotation_speed (float, optional (default: 30)) – Speed of axis rotation, in degrees per second
fps (int, optional (default: 10)) – Frames per second. Increase this for a smoother animation
ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created
figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.
elev (int, optional (default: None)) – Elevation angle of viewpoint from horizontal, in degrees
ipython_html ({'html5', 'jshtml'}) – which html writer to use if using a Jupyter Notebook
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**kwargs (keyword arguments) – See scprep.plot.scatter3d.
- Returns
ani – animation object
- Return type
matplotlib.animation.FuncAnimation
Examples
>>> import scprep
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(0, 1, [200, 3])
>>> # Continuous color vector
>>> colors = data[:, 0]
>>> scprep.plot.rotate_scatter3d(data, c=colors, filename="animation.gif")
>>> # Discrete color vector with custom colormap
>>> colors = np.random.choice(['a','b'], data.shape[0], replace=True)
>>> data[colors == 'a'] += 5
>>> scprep.plot.rotate_scatter3d(
...     data, c=colors, cmap={'a' : [1,0,0,1], 'b' : 'xkcd:sky blue'},
...     filename="animation.mp4"
... )
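How rotation_speed and fps interact can be worked out with simple arithmetic. The sketch below assumes the animation covers exactly one full 360 degree turn, which is not stated in the parameter descriptions; rotation_schedule is a hypothetical helper.

```python
def rotation_schedule(rotation_speed=30, fps=10):
    """Frame count and per-frame angle for one full 360 degree turn."""
    degrees_per_frame = rotation_speed / fps       # degrees/second over frames/second
    n_frames = round(360 / degrees_per_frame)
    duration_seconds = n_frames / fps
    return degrees_per_frame, n_frames, duration_seconds

step, n_frames, duration = rotation_schedule()     # uses the documented defaults
```

Under that assumption the defaults give 3 degrees per frame, 120 frames and a 12 second loop. Raising fps adds frames without changing the duration, which is why it yields a smoother animation.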
-
scprep.plot.
scatter
(x, y, z=None, c=None, cmap=None, cmap_scale='linear', s=None, mask=None, discrete=None, ax=None, legend=None, colorbar=None, shuffle=True, figsize=None, ticks=True, xticks=None, yticks=None, zticks=None, ticklabels=True, xticklabels=None, yticklabels=None, zticklabels=None, label_prefix=None, xlabel=None, ylabel=None, zlabel=None, title=None, fontsize=None, legend_title=None, legend_loc='best', legend_anchor=None, legend_ncol=None, vmin=None, vmax=None, elev=None, azim=None, filename=None, dpi=None, **plot_kwargs)[source]¶ Create a scatter plot.
Builds upon matplotlib.pyplot.scatter with nice defaults and handles categorical colors / legends better. For easy access, use scatter2d or scatter3d.
- Parameters
x (list-like) – data for x axis
y (list-like) – data for y axis
z (list-like, optional (default: None)) – data for z axis
c (list-like or None, optional (default: None)) – Color vector. Can be a single color value (RGB, RGBA, or named matplotlib colors), an array of these of length n_samples, or a list of discrete or continuous values of any data type. If c is not a single or list of matplotlib colors, the values in c will be used to populate the legend / colorbar with colors from cmap
cmap (matplotlib colormap, str, dict or None, optional (default: None)) – matplotlib colormap. If None, uses tab20 for discrete data and inferno for continuous data. If a dictionary, expects one key for every unique value in c, where values are valid matplotlib colors (hsv, rgb, rgba, or named colors)
cmap_scale ({'linear', 'log', 'symlog', 'sqrt'} or matplotlib.colors.Normalize, optional (default: 'linear')) – Colormap normalization scale. For advanced use, see <https://matplotlib.org/users/colormapnorms.html>
s (float, optional (default: None)) – Point size. If None, set to 200 / sqrt(n_samples)
mask (list-like, optional (default: None)) – boolean mask to hide data points
discrete (bool or None, optional (default: None)) – If True, the legend is categorical. If False, the legend is a colorbar. If None, discreteness is detected automatically. Data containing non-numeric c is always discrete, and numeric data with 20 or fewer unique values is discrete.
ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created
legend (bool, optional (default: None)) – States whether or not to create a legend. If data is continuous, the legend is a colorbar. If None, a legend is created where possible
colorbar (bool, optional (default: None)) – Synonym for legend
shuffle (bool, optional (default: True)) – If True, shuffles the order of points on the plot.
figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.
ticks (True, False, or list-like (default: True)) – If True, keeps default axis ticks. If False, removes axis ticks. If a list, sets custom axis ticks
{x,y,z}ticks (True, False, or list-like (default: None)) – If set, overrides ticks
ticklabels (True, False, or list-like (default: True)) – If True, keeps default axis tick labels. If False, removes axis tick labels. If a list, sets custom axis tick labels
{x,y,z}ticklabels (True, False, or list-like (default: None)) – If set, overrides ticklabels
label_prefix (str or None (default: None)) – Prefix for all axis labels. Axes will be labelled label_prefix1, label_prefix2, etc. Can be overridden by setting xlabel, ylabel, and zlabel.
{x,y,z}label (str, None or False (default : None)) – Axis labels. Overrides the automatic label given by label_prefix. If None and label_prefix is None, no label is set unless the data is a pandas Series, in which case the series name is used. Override this behavior with {x,y,z}label=False
title (str or None (default: None)) – axis title. If None, no title is set.
fontsize (float or None (default: None)) – Base font size.
legend_title (str (default: None)) – title for the colorbar or legend
legend_loc (int or string or pair of floats, default: 'best') – Matplotlib legend location. Only used for discrete data. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.
legend_anchor (BboxBase, 2-tuple, or 4-tuple) – Box that is used to position the legend in conjunction with loc. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.
legend_ncol (int or None, optional (default: None)) – Number of columns to show in the legend. If None, defaults to a maximum of entries per column.
vmin, vmax (float, optional (default: None)) – Range of values to use as the range for the colormap. Only used if data is continuous
elev (int, optional (default: None)) – Elevation angle of viewpoint from horizontal for 3D plots, in degrees
azim (int, optional (default: None)) – Azimuth angle in x-y plane of viewpoint for 3D plots, in degrees
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**plot_kwargs (keyword arguments) – Extra arguments passed to matplotlib.pyplot.scatter.
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
Examples
>>> import scprep
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(0, 1, [200, 3])
>>> # Continuous color vector
>>> colors = data[:, 0]
>>> scprep.plot.scatter(x=data[:, 0], y=data[:, 1], c=colors)
>>> # Discrete color vector with custom colormap
>>> colors = np.random.choice(['a','b'], data.shape[0], replace=True)
>>> data[colors == 'a'] += 5
>>> scprep.plot.scatter(x=data[:, 0], y=data[:, 1], z=data[:, 2],
...                     c=colors, cmap={'a' : [1,0,0,1], 'b' : 'xkcd:sky blue'})
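Two of the automatic defaults described for the parameters above, the discreteness check for c and the default point size s, are simple enough to sketch directly. Both helpers are hypothetical restatements of the documented rules and are not taken from scprep's source.

```python
import math
from numbers import Number

def is_discrete(c, max_unique=20):
    """Non-numeric values are always discrete; numeric ones if few are unique."""
    if not all(isinstance(v, Number) for v in c):
        return True
    return len(set(c)) <= max_unique

def default_point_size(n_samples):
    """Default marker size described for `s`: 200 / sqrt(n_samples)."""
    return 200 / math.sqrt(n_samples)

cluster_labels = ["a", "b", "a", "b"]          # non-numeric: categorical legend
few_values = [0, 1, 2] * 10                    # 3 unique numbers: categorical legend
many_values = [i / 7 for i in range(100)]      # 100 unique numbers: colorbar
size_for_400_points = default_point_size(400)
```

Passing discrete=True or discrete=False overrides the automatic check.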
-
scprep.plot.
scatter2d
(data, c=None, cmap=None, cmap_scale='linear', s=None, mask=None, discrete=None, ax=None, legend=None, colorbar=None, shuffle=True, figsize=None, ticks=True, xticks=None, yticks=None, ticklabels=True, xticklabels=None, yticklabels=None, label_prefix=None, xlabel=None, ylabel=None, title=None, fontsize=None, legend_title=None, legend_loc='best', legend_anchor=None, legend_ncol=None, filename=None, dpi=None, **plot_kwargs)[source]¶ Create a 2D scatter plot.
Builds upon matplotlib.pyplot.scatter with nice defaults and handles categorical colors / legends better.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data. Only the first two components will be used.
c (list-like or None, optional (default: None)) – Color vector. Can be a single color value (RGB, RGBA, or named matplotlib colors), an array of these of length n_samples, or a list of discrete or continuous values of any data type. If c is not a single or list of matplotlib colors, the values in c will be used to populate the legend / colorbar with colors from cmap
cmap (matplotlib colormap, str, dict, list or None, optional (default: None)) – matplotlib colormap. If None, uses tab20 for discrete data and inferno for continuous data. If a list, expects one color for every unique value in c, otherwise interpolates between given colors for continuous data. If a dictionary, expects one key for every unique value in c, where values are valid matplotlib colors (hsv, rgb, rgba, or named colors)
cmap_scale ({'linear', 'log', 'symlog', 'sqrt'} or matplotlib.colors.Normalize, optional (default: 'linear')) – Colormap normalization scale. For advanced use, see <https://matplotlib.org/users/colormapnorms.html>
s (float, optional (default: None)) – Point size. If None, set to 200 / sqrt(n_samples)
mask (list-like, optional (default: None)) – boolean mask to hide data points
discrete (bool or None, optional (default: None)) – If True, the legend is categorical. If False, the legend is a colorbar. If None, discreteness is detected automatically. Data containing non-numeric c is always discrete, and numeric data with 20 or fewer unique values is discrete.
ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created
legend (bool, optional (default: None)) – States whether or not to create a legend. If data is continuous, the legend is a colorbar. If None, a legend is created where possible.
colorbar (bool, optional (default: None)) – Synonym for legend
shuffle (bool, optional (default: True)) – If True, shuffles the order of points on the plot.
figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.
ticks (True, False, or list-like (default: True)) – If True, keeps default axis ticks. If False, removes axis ticks. If a list, sets custom axis ticks
{x,y}ticks (True, False, or list-like (default: None)) – If set, overrides ticks
ticklabels (True, False, or list-like (default: True)) – If True, keeps default axis tick labels. If False, removes axis tick labels. If a list, sets custom axis tick labels
{x,y}ticklabels (True, False, or list-like (default: None)) – If set, overrides ticklabels
label_prefix (str or None (default: None)) – Prefix for all axis labels. Axes will be labelled label_prefix1, label_prefix2, etc. Can be overridden by setting xlabel and ylabel.
{x,y}label (str, None or False (default : None)) – Axis labels. Overrides the automatic label given by label_prefix. If None and label_prefix is None, no label is set unless the data is a pandas Series, in which case the series name is used. Override this behavior with {x,y}label=False
title (str or None (default: None)) – axis title. If None, no title is set.
fontsize (float or None (default: None)) – Base font size.
legend_title (str (default: None)) – Title for the colorbar or legend.
legend_loc (int or string or pair of floats, default: 'best') – Matplotlib legend location. Only used for discrete data. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.
legend_anchor (BboxBase, 2-tuple, or 4-tuple) – Box that is used to position the legend in conjunction with loc. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.
legend_ncol (int or None, optional (default: None)) – Number of columns to show in the legend. If None, defaults to a maximum of entries per column.
vmin, vmax – Range of values to use as the range for the colormap. Only used if data is continuous
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**plot_kwargs (keyword arguments) – Extra arguments passed to matplotlib.pyplot.scatter.
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
Examples
>>> import scprep
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(0, 1, [200, 2])
>>> # Continuous color vector
>>> colors = data[:, 0]
>>> scprep.plot.scatter2d(data, c=colors)
>>> # Discrete color vector with custom colormap
>>> colors = np.random.choice(['a','b'], data.shape[0], replace=True)
>>> data[colors == 'a'] += 10
>>> scprep.plot.scatter2d(
...     data, c=colors, cmap={'a' : [1,0,0,1], 'b' : 'xkcd:sky blue'}
... )
-
scprep.plot.
scatter3d
(data, c=None, cmap=None, cmap_scale='linear', s=None, mask=None, discrete=None, ax=None, legend=None, colorbar=None, shuffle=True, figsize=None, ticks=True, xticks=None, yticks=None, zticks=None, ticklabels=True, xticklabels=None, yticklabels=None, zticklabels=None, label_prefix=None, xlabel=None, ylabel=None, zlabel=None, title=None, fontsize=None, legend_title=None, legend_loc='best', legend_anchor=None, legend_ncol=None, elev=None, azim=None, filename=None, dpi=None, **plot_kwargs)[source]¶ Create a 3D scatter plot.
Builds upon matplotlib.pyplot.scatter with nice defaults and handles categorical colors / legends better.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data. Only the first three components will be used.
c (list-like or None, optional (default: None)) – Color vector. Can be a single color value (RGB, RGBA, or named matplotlib colors), an array of these of length n_samples, or a list of discrete or continuous values of any data type. If c is not a single or list of matplotlib colors, the values in c will be used to populate the legend / colorbar with colors from cmap
cmap (matplotlib colormap, str, dict, list or None, optional (default: None)) – matplotlib colormap. If None, uses tab20 for discrete data and inferno for continuous data. If a list, expects one color for every unique value in c, otherwise interpolates between given colors for continuous data. If a dictionary, expects one key for every unique value in c, where values are valid matplotlib colors (hsv, rgb, rgba, or named colors)
cmap_scale ({‘linear’, ‘log’, ‘symlog’, ‘sqrt’} or matplotlib.colors.Normalize, optional (default: ‘linear’)) – Colormap normalization scale. For advanced use, see <https://matplotlib.org/users/colormapnorms.html>
s (float, optional (default: None)) – Point size. If None, set to 200 / sqrt(n_samples)
mask (list-like, optional (default: None)) – boolean mask to hide data points
discrete (bool or None, optional (default: None)) – If True, the legend is categorical. If False, the legend is a colorbar. If None, discreteness is detected automatically. Data containing non-numeric c is always discrete, and numeric data with 20 or fewer unique values is discrete.
ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created
legend (bool, optional (default: None)) – States whether or not to create a legend. If data is continuous, the legend is a colorbar. If None, a legend is created where possible.
colorbar (bool, optional (default: None)) – Synonym for legend
shuffle (bool, optional (default: True)) – If True, shuffles the order of points on the plot.
figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.
ticks (True, False, or list-like (default: True)) – If True, keeps default axis ticks. If False, removes axis ticks. If a list, sets custom axis ticks
{x,y,z}ticks (True, False, or list-like (default: None)) – If set, overrides ticks
ticklabels (True, False, or list-like (default: True)) – If True, keeps default axis tick labels. If False, removes axis tick labels. If a list, sets custom axis tick labels
{x,y,z}ticklabels (True, False, or list-like (default: None)) – If set, overrides ticklabels
label_prefix (str or None (default: None)) – Prefix for all axis labels. Axes will be labelled label_prefix1, label_prefix2, etc. Can be overridden by setting xlabel, ylabel, and zlabel.
{x,y,z}label (str or None (default : None)) – Axis labels. Overrides the automatic label given by label_prefix. If None and label_prefix is None, no label is set unless the data is a pandas Series, in which case the series name is used. Override this behavior with {x,y,z}label=False
title (str or None (default: None)) – axis title. If None, no title is set.
fontsize (float or None (default: None)) – Base font size.
legend_title (str (default: None)) – Title for the colorbar or legend.
legend_loc (int or string or pair of floats, default: 'best') – Matplotlib legend location. Only used for discrete data. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.
legend_anchor (BboxBase, 2-tuple, or 4-tuple) – Box that is used to position the legend in conjunction with loc. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.
legend_ncol (int or None, optional (default: None)) – Number of columns to show in the legend. If None, defaults to a maximum of entries per column.
vmin, vmax – Range of values to use as the range for the colormap. Only used if data is continuous
elev (int, optional (default: None)) – Elevation angle of viewpoint from horizontal, in degrees
azim (int, optional (default: None)) – Azimuth angle in x-y plane of viewpoint
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**plot_kwargs (keyword arguments) – Extra arguments passed to matplotlib.pyplot.scatter.
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
Examples
>>> import scprep
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(0, 1, [200, 3])
>>> # Continuous color vector
>>> colors = data[:, 0]
>>> scprep.plot.scatter3d(data, c=colors)
>>> # Discrete color vector with custom colormap
>>> colors = np.random.choice(['a','b'], data.shape[0], replace=True)
>>> data[colors == 'a'] += 5
>>> scprep.plot.scatter3d(
...     data, c=colors, cmap={'a' : [1,0,0,1], 'b' : 'xkcd:sky blue'}
... )
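The automatic defaults described for s and discrete can be illustrated with a small NumPy sketch. The helper names default_point_size and looks_discrete are hypothetical and are not part of scprep; the logic follows only the rules stated in the parameter descriptions (point size 200 / sqrt(n_samples); non-numeric values, or numeric values with 20 or fewer unique entries, are treated as discrete).

```python
import numpy as np


def default_point_size(n_samples):
    """Point size rule from the `s` description: 200 / sqrt(n_samples)."""
    return 200 / np.sqrt(n_samples)


def looks_discrete(c, max_unique=20):
    """Hypothetical check mirroring the documented `discrete=None` rule."""
    c = np.asarray(c)
    # Non-numeric color vectors are always treated as discrete.
    if not np.issubdtype(c.dtype, np.number):
        return True
    # Numeric vectors are discrete when they have few unique values.
    return len(np.unique(c)) <= max_unique


size = default_point_size(400)
labels = np.array(["a", "b", "a"])
few_values = np.array([0, 1, 2, 1, 0])
many_values = np.linspace(0.0, 1.0, 50)
```

Passing discrete=True or discrete=False explicitly bypasses any such detection, which is useful for integer cluster labels with many levels.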
-
scprep.plot.
scree_plot
(singular_values, cumulative=False, ax=None, figsize=None, xlabel='Principal Component', ylabel='Explained Variance (%)', fontsize=None, filename=None, dpi=None, **kwargs)[source]¶ Plot the explained variance of each principal component.
- Parameters
singular_values (list-like, shape=[n_components]) – Singular values returned by scprep.reduce.pca(data, return_singular_values=True)
cumulative (bool, optional (default=False)) – If True, plot the cumulative explained variance
ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.
figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)
{x,y}label (str, optional) – Labels to display on the x and y axis.
fontsize (float or None (default: None)) – Base font size.
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**kwargs (additional arguments for matplotlib.pyplot.plot) –
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
Examples
>>> import scprep
>>> import numpy as np
>>> data = np.random.normal(0, 1, [200, 1000])
>>> pca_data, singular_values = scprep.reduce.pca(
...     data, n_components=100, return_singular_values=True
... )
>>> scprep.plot.scree_plot(singular_values)
>>> scprep.plot.scree_plot(singular_values, cumulative=True)
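To make the relationship between singular values and the plotted percentages concrete, here is a minimal sketch. It assumes that the variance explained by a component is proportional to its squared singular value and that percentages are taken relative to the components supplied; the function name explained_variance_percent is hypothetical and this is not necessarily how scree_plot normalizes internally.

```python
import numpy as np


def explained_variance_percent(singular_values, cumulative=False):
    """Percent of variance per component, relative to the components given."""
    s = np.asarray(singular_values, dtype=float)
    variance = s ** 2  # variance is proportional to the squared singular value
    percent = 100 * variance / variance.sum()
    return np.cumsum(percent) if cumulative else percent


percent = explained_variance_percent([3.0, 2.0, 1.0])
running_total = explained_variance_percent([3.0, 2.0, 1.0], cumulative=True)
```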
-
scprep.plot.
plot_gene_variability
(data, kernel_size=0.005, smooth=5, cutoff=None, percentile=90, ax=None, figsize=None, xlabel='Gene mean', ylabel='Standardized variance', title=None, fontsize=None, filename=None, dpi=None, **kwargs)[source]¶ Plot the histogram of gene variability.
Variability is computed as the deviation from a loess fit to the rolling median of the mean-variance curve
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data. Multiple datasets may be given as a list of array-likes.
kernel_size (float or int, optional (default: 0.005)) – Width of rolling median window. If a float between 0 and 1, the width is given by kernel_size * data.shape[1]. Otherwise should be an odd integer
smooth (int, optional (default: 5)) – Amount of smoothing to apply to the median filter
cutoff (float or None, optional (default: None)) – Absolute cutoff at which to draw a vertical line. Only one of cutoff and percentile may be given.
percentile (float or None, optional (default: 90)) – Percentile between 0 and 100 at which to draw a vertical line. Only one of cutoff and percentile may be given.
ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.
figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)
{x,y}label (str, optional) – Labels to display on the x and y axis.
title (str or None, optional (default: None)) – Axis title.
fontsize (float or None (default: None)) – Base font size.
filename (str or None (default: None)) – file to which the output is saved
dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.
**kwargs (additional arguments for matplotlib.pyplot.hist) –
- Returns
ax – axis on which plot was drawn
- Return type
matplotlib.Axes
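The kernel_size rule (a fraction of the number of features, or an absolute odd integer) can be sketched as follows. The helper resolve_kernel_size is hypothetical, and the choice to bump an even width up to the next odd integer is an assumption made for illustration, not documented scprep behaviour.

```python
def resolve_kernel_size(kernel_size, n_features):
    """Turn a fractional or integer kernel_size into an odd window width."""
    if 0 < kernel_size < 1:
        # Fraction of the number of features, as described for kernel_size.
        width = max(1, round(kernel_size * n_features))
        # Assumption: even widths are bumped up to the next odd integer.
        return width + 1 if width % 2 == 0 else width
    if int(kernel_size) != kernel_size or int(kernel_size) % 2 == 0:
        raise ValueError("an absolute kernel_size should be an odd integer")
    return int(kernel_size)


fractional = resolve_kernel_size(0.005, 2000)
absolute = resolve_kernel_size(7, 2000)
```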
Dimensionality Reduction¶
Classes:
|
Truncated SVD with automatic dimensionality selected by Johnson-Lindenstrauss. |
|
Gaussian random projection with an inverse transform using the pseudoinverse. |
|
Calculate PCA using random projections to handle sparse matrices. |
Functions:
|
Calculate PCA using random projections to handle sparse matrices. |
-
class
scprep.reduce.
AutomaticDimensionSVD
(n_components='auto', eps=0.3, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)[source]¶ Bases:
sklearn.decomposition._truncated_svd.TruncatedSVD
Truncated SVD with automatic dimensionality selected by Johnson-Lindenstrauss.
Methods:
fit
(X)Fit model on training data X.
fit_transform
(X[, y])Fit model to X and perform dimensionality reduction on X.
get_params
([deep])Get parameters for this estimator.
inverse_transform
(X)Transform X back to its original space.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Perform dimensionality reduction on X.
-
fit
(X)[source]¶ Fit model on training data X.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
self – Returns the transformer object.
- Return type
object
-
fit_transform
(X, y=None)¶ Fit model to X and perform dimensionality reduction on X.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
X_new – Reduced version of X. This will always be a dense array.
- Return type
ndarray of shape (n_samples, n_components)
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
-
inverse_transform
(X)¶ Transform X back to its original space.
Returns an array X_original whose transform would be X.
- Parameters
X (array-like of shape (n_samples, n_components)) – New data.
- Returns
X_original – Note that this is always a dense array.
- Return type
ndarray of shape (n_samples, n_features)
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
-
transform
(X)¶ Perform dimensionality reduction on X.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data.
- Returns
X_new – Reduced version of X. This will always be a dense array.
- Return type
ndarray of shape (n_samples, n_components)
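As a rough illustration of how a Johnson-Lindenstrauss bound turns n_samples and eps into a number of components, the sketch below uses the standard bound 4 log(n_samples) / (eps^2 / 2 - eps^3 / 3), which is the formula behind sklearn.random_projection.johnson_lindenstrauss_min_dim. The function name jl_min_dim is hypothetical, and whether AutomaticDimensionSVD applies exactly this bound is an assumption.

```python
import math


def jl_min_dim(n_samples, eps=0.3):
    """Johnson-Lindenstrauss lower bound on the number of components."""
    if not 0 < eps < 1:
        raise ValueError("eps must be in (0, 1)")
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    # Truncate to an integer number of components.
    return int(4 * math.log(n_samples) / denominator)


dim_default = jl_min_dim(1000, eps=0.3)
dim_tighter = jl_min_dim(1000, eps=0.1)
```

A smaller eps gives a larger bound, which matches the documented trade-off between accuracy and computational cost.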
-
-
class
scprep.reduce.
InvertibleRandomProjection
(n_components='auto', eps=0.3, orthogonalize=False, random_state=None)[source]¶ Bases:
sklearn.random_projection.GaussianRandomProjection
Gaussian random projection with an inverse transform using the pseudoinverse.
Methods:
fit
(X)Generate a sparse random projection matrix.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Project the data by using matrix product with the random matrix.
Attributes:
pseudoinverse
Pseudoinverse of the random projection.
-
fit
(X)[source]¶ Generate a sparse random projection matrix.
- Parameters
X ({ndarray, sparse matrix} of shape (n_samples, n_features)) – Training set: only the shape is used to find optimal random matrix dimensions based on the theory referenced in the aforementioned papers.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
self – BaseRandomProjection class instance.
- Return type
object
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
-
property
pseudoinverse
¶ Pseudoinverse of the random projection.
This inverts the projection operation for any vector in the span of the random projection. For small enough eps, this should be close to the correct inverse.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
-
transform
(X)¶ Project the data by using matrix product with the random matrix.
- Parameters
X ({ndarray, sparse matrix} of shape (n_samples, n_features)) – The input data to project into a smaller dimensional space.
- Returns
X_new – Projected array.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_components)
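The statement that the pseudoinverse inverts the projection for any vector in the span of the random projection can be checked directly with NumPy. This is a standalone sketch, not the class's implementation: the matrix projection and its scaling are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components = 50, 10

# Gaussian random projection matrix, shape (n_components, n_features).
projection = rng.normal(size=(n_components, n_features)) / np.sqrt(n_components)
pseudoinverse = np.linalg.pinv(projection)  # shape (n_features, n_components)

# A vector lying in the span of the projection's rows is recovered exactly.
x_in_span = projection.T @ rng.normal(size=n_components)
recovered = pseudoinverse @ (projection @ x_in_span)

# An arbitrary vector is only recovered approximately, because its
# component outside the span is lost by the projection.
x_any = rng.normal(size=n_features)
approx = pseudoinverse @ (projection @ x_any)
```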
-
-
class
scprep.reduce.
SparseInputPCA
(n_components=2, eps=0.3, random_state=None, method='svd', **kwargs)[source]¶ Bases:
sklearn.base.BaseEstimator
Calculate PCA using random projections to handle sparse matrices.
Uses the Johnson-Lindenstrauss Lemma to determine the number of dimensions of random projections prior to subtracting the mean.
- Parameters
n_components (int, optional (default: 2)) – Number of components to keep.
eps (strictly positive float, optional (default=0.3)) – Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. Smaller values lead to more accurate embeddings but higher computational and memory costs
method ({'svd', 'orth_rproj', 'rproj'}, optional (default: 'svd')) – Dimensionality reduction method applied prior to mean centering. The method choice affects accuracy (svd > orth_rproj > rproj) and comes with increased computational cost (but not memory.)
random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
kwargs – Additional keyword arguments for sklearn.decomposition.PCA
Attributes:
components_
Principal axes in feature space, representing directions of maximum variance.
explained_variance_
The amount of variance explained by each of the selected components.
explained_variance_ratio_
Percentage of variance explained by each of the selected components.
singular_values_
Singular values of the PCA decomposition.
Methods:
fit
(X)Fit the model with X.
fit_transform
(X)Fit the model with X and apply the dimensionality reduction on X.
get_params
([deep])Get parameters for this estimator.
inverse_transform
(X)Transform data back to its original space.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Apply dimensionality reduction to X.
-
property
components_
¶ Principal axes in feature space, representing directions of maximum variance.
The components are sorted by explained variance.
-
property
explained_variance_
¶ The amount of variance explained by each of the selected components.
-
property
explained_variance_ratio_
¶ Percentage of variance explained by each of the selected components.
The sum of the ratios is equal to 1.0. If n_components is None then the number of components grows as eps gets smaller.
-
fit_transform
(X)[source]¶ Fit the model with X and apply the dimensionality reduction on X.
- Parameters
X (array-like, shape=(n_samples, n_features)) –
- Returns
X_new
- Return type
array-like, shape=(n_samples, n_components)
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
-
inverse_transform
(X)[source]¶ Transform data back to its original space.
- Parameters
X (array-like, shape=(n_samples, n_components)) –
- Returns
X_new
- Return type
array-like, shape=(n_samples, n_features)
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
-
property
singular_values_
¶ Singular values of the PCA decomposition.
-
scprep.reduce.
pca
(data, n_components=100, eps=0.3, method='svd', seed=None, return_singular_values=False, n_pca=None, svd_offset=None, svd_multiples=None)[source]¶ Calculate PCA using random projections to handle sparse matrices.
Uses the Johnson-Lindenstrauss Lemma to determine the number of dimensions of random projections prior to subtracting the mean. Dense matrices are provided to sklearn.decomposition.PCA directly.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
n_components (int, optional (default: 100)) – Number of PCs to compute
eps (strictly positive float, optional (default=0.3)) – Parameter to control the quality of the embedding of sparse input. Smaller values lead to more accurate embeddings but higher computational and memory costs
method ({'svd', 'orth_rproj', 'rproj', 'dense'}, optional (default: 'svd')) – Dimensionality reduction method applied prior to mean centering of sparse input. The method choice affects accuracy (svd > orth_rproj > rproj) and comes with increased computational cost (but not memory.) On the other hand, method=’dense’ adds a memory cost but is faster.
seed (int, RandomState or None, optional (default: None)) – Random state.
return_singular_values (bool, optional (default: False)) – If True, also return the singular values
n_pca (Deprecated.) –
svd_offset (Deprecated.) –
svd_multiples (Deprecated.) –
- Returns
data_pca (array-like, shape=[n_samples, n_components]) – PCA reduction of data
singular_values (list-like, shape=[n_components]) – Singular values corresponding to principal components. Returned only if return_singular_values is True
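For dense input, the computation amounts to ordinary PCA. The following NumPy sketch shows that case only, using an SVD of the mean-centered data; the function dense_pca is hypothetical and does not reproduce the random-projection path used for sparse matrices.

```python
import numpy as np


def dense_pca(data, n_components=2):
    """PCA of a dense matrix via SVD of the mean-centered data."""
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Scores are the left singular vectors scaled by the singular values.
    data_pca = u[:, :n_components] * s[:n_components]
    return data_pca, s[:n_components]


rng = np.random.default_rng(42)
data = rng.normal(size=(200, 30))
data_pca, singular_values = dense_pca(data, n_components=5)
```

The norm of each column of data_pca equals the corresponding singular value, which is why the singular values are enough to draw a scree plot.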
Row/Column Selection¶
Functions:
|
Get a list of cells from data. |
|
Get a list of genes from data. |
|
Select genes with high variability. |
|
Select columns from a data matrix. |
|
Select rows from a data matrix. |
|
Subsample the number of points in a dataset. |
-
scprep.select.
get_cell_set
(data, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]¶ Get a list of cells from data.
- Parameters
data (array-like, shape=[n_samples, n_features] or [n_samples]) – Input pd.DataFrame, or list of cell names
starts_with (str, list-like or None, optional (default: None)) – If not None, only return cell names that start with this prefix.
ends_with (str, list-like or None, optional (default: None)) – If not None, only return cell names that end with this suffix.
exact_word (str, list-like or None, optional (default: None)) – If not None, only return cell names that contain this exact word.
regex (str, list-like or None, optional (default: None)) – If not None, only return cell names that match this regular expression.
- Returns
cells – List of matching cells
- Return type
list-like, shape<=[n_samples]
-
scprep.select.
get_gene_set
(data, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]¶ Get a list of genes from data.
- Parameters
data (array-like, shape=[n_samples, n_features] or [n_features]) – Input pd.DataFrame, or list of gene names
starts_with (str, list-like or None, optional (default: None)) – If not None, only return gene names that start with this prefix.
ends_with (str, list-like or None, optional (default: None)) – If not None, only return gene names that end with this suffix.
exact_word (str, list-like or None, optional (default: None)) – If not None, only return gene names that contain this exact word.
regex (str, list-like or None, optional (default: None)) – If not None, only return gene names that match this regular expression.
- Returns
genes – List of matching genes
- Return type
list-like, shape<=[n_features]
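The name filters shared by get_cell_set and get_gene_set can be pictured as regular-expression matches. The sketch below is a pure-Python approximation: match_names is a hypothetical helper, it accepts single strings only (not list-likes), it treats an "exact word" as text delimited by word boundaries, and it combines multiple filters with AND. All three choices are assumptions.

```python
import re


def match_names(names, starts_with=None, ends_with=None,
                exact_word=None, regex=None):
    """Return names satisfying every filter that is not None."""
    patterns = []
    if starts_with is not None:
        patterns.append("^" + re.escape(starts_with))
    if ends_with is not None:
        patterns.append(re.escape(ends_with) + "$")
    if exact_word is not None:
        # Assumption: an "exact word" is delimited by word boundaries.
        patterns.append(r"\b" + re.escape(exact_word) + r"\b")
    if regex is not None:
        patterns.append(regex)
    return [n for n in names if all(re.search(p, n) for p in patterns)]


genes = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH", "CD3E", "CD3D"]
mito = match_names(genes, starts_with="MT-")
cd3 = match_names(genes, regex=r"^CD3[DE]$")
ends_b = match_names(genes, ends_with="B")
```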
-
scprep.select.
highly_variable_genes
(data, *extra_data, kernel_size=0.05, smooth=5, cutoff=None, percentile=80)[source]¶ Select genes with high variability.
Variability is computed as the deviation from a loess fit to the rolling median of the mean-variance curve
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[any, n_features], optional) – Optional additional data objects from which to select the same columns
kernel_size (float or int, optional (default: 0.05)) – Width of rolling median window. If a float between 0 and 1, the width is given by kernel_size * data.shape[1]. Otherwise should be an odd integer
smooth (int, optional (default: 5)) – Amount of smoothing to apply to the median filter
cutoff (float, optional (default: None)) – Variability above which expression is deemed significant
percentile (int, optional (Default: 80)) – Percentile above or below which to remove genes. Must be an integer between 0 and 100. Only one of cutoff and percentile should be specified.
- Returns
data (array-like, shape=[n_samples, m_features]) – Filtered output data, where m_features <= n_features
extra_data (array-like, shape=[any, m_features]) – Filtered extra data, if passed.
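The percentile-based selection step can be sketched independently of how variability is measured. In the example below the score is plain per-gene variance, used purely as a placeholder for the loess-based measure described above; select_above_percentile is a hypothetical helper, and keeping genes strictly above the threshold is an assumption.

```python
import numpy as np


def select_above_percentile(data, variability, percentile=80):
    """Keep columns whose variability score exceeds the given percentile."""
    threshold = np.percentile(variability, percentile)
    keep = variability > threshold
    return data[:, keep], keep


rng = np.random.default_rng(0)
data = rng.poisson(2.0, size=(100, 50)).astype(float)
# Placeholder score: plain per-gene variance, NOT the loess-based measure.
variability = data.var(axis=0)
filtered, keep = select_above_percentile(data, variability, percentile=80)
```

The boolean mask keep can be applied to any extra_data with the same number of columns so that all objects stay aligned.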
-
scprep.select.
select_cols
(data, *extra_data, idx=None, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]¶ Select columns from a data matrix.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[any, n_features], optional) – Optional additional data objects from which to select the same columns
idx (list-like, shape=[m_features]) – Integer indices or string column names to be selected
starts_with (str, list-like or None, optional (default: None)) – If not None, select columns that start with this prefix.
ends_with (str, list-like or None, optional (default: None)) – If not None, select columns that end with this suffix.
exact_word (str, list-like or None, optional (default: None)) – If not None, select columns that contain this exact word.
regex (str, list-like or None, optional (default: None)) – If not None, select columns that match this regular expression.
- Returns
data (array-like, shape=[n_samples, m_features]) – Subsetted output data.
extra_data (array-like, shape=[any, m_features]) – Subsetted extra data, if passed.
Examples
data_subset = scprep.select.select_cols(
    data, idx=np.random.choice([True, False], data.shape[1])
)
data_subset, metadata_subset = scprep.select.select_cols(
    data, metadata, starts_with="MT"
)
- Raises
UserWarning – If no columns are selected
-
scprep.select.
select_rows
(data, *extra_data, idx=None, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]¶ Select rows from a data matrix.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows
idx (list-like, shape=[m_samples], optional (default: None)) – Integer indices or string index names to be selected
starts_with (str, list-like or None, optional (default: None)) – If not None, select rows that start with this prefix.
ends_with (str, list-like or None, optional (default: None)) – If not None, select rows that end with this suffix.
exact_word (str, list-like or None, optional (default: None)) – If not None, select rows that contain this exact word.
regex (str, list-like or None, optional (default: None)) – If not None, select rows that match this regular expression.
- Returns
data (array-like, shape=[m_samples, n_features]) – Subsetted output data
extra_data (array-like, shape=[m_samples, any]) – Subsetted extra data, if passed.
Examples
data_subset = scprep.select.select_rows(
    data, idx=np.random.choice([True, False], data.shape[0])
)
data_subset, labels_subset = scprep.select.select_rows(
    data, labels, ends_with="batch1"
)
- Raises
UserWarning – If no rows are selected
-
scprep.select.
subsample
(*data, n=10000, seed=None)[source]¶ Subsample the number of points in a dataset.
Selects a random subset of (optionally multiple) datasets. Helpful for plotting, or for methods with computational constraints.
- Parameters
data (array-like, shape=[n_samples, any]) – Input data. Any number of datasets can be passed at once, so long as n_samples remains the same.
n (int, optional (default: 10000)) – Number of samples to retain. Must be less than n_samples.
seed (int, optional (default: None)) – Random seed
Examples
data_subsample, labels_subsample = scprep.select.subsample(data, labels, n=1000)
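The key property of subsampling several datasets at once is that the same rows are drawn from each, so data and labels stay aligned. A minimal NumPy sketch of that idea follows; subsample_together is a hypothetical stand-in and makes no claim about how scprep draws its indices.

```python
import numpy as np


def subsample_together(*arrays, n=10000, seed=None):
    """Draw the same n random rows (without replacement) from each array."""
    n_samples = arrays[0].shape[0]
    if any(a.shape[0] != n_samples for a in arrays):
        raise ValueError("all inputs must share n_samples")
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_samples, size=n, replace=False)
    return tuple(a[idx] for a in arrays)


data = np.arange(40).reshape(20, 2)
labels = np.arange(20)
data_sub, labels_sub = subsample_together(data, labels, n=5, seed=0)
```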
Utilities¶
Functions:
|
Ensure that a set of data matrices have consistent columns. |
|
Combine data matrices from multiple batches and store a batch label. |
|
Check if a condition is true anywhere in a data matrix. |
|
Get the minimum value from a data matrix. |
|
Check if all values in a matrix are non-negative. |
|
Get the column-wise, row-wise, or total standard deviation of a matrix. |
|
Get the column-wise, row-wise, or total sum of values in a matrix. |
|
Perform a numerical transformation to data. |
|
Transpose a matrix in a memory-efficient manner. |
|
Elementwise multiply a matrix by a vector. |
|
Sort clusters in increasing order of values. |
|
Get the minimum value from a pandas sparse series. |
|
Convert an array-like to a np.ndarray or scipy.sparse.spmatrix. |
|
Convert an array-like to a np.ndarray. |
-
scprep.utils.
check_consistent_columns
(data, common_columns_only=True)[source]¶ Ensure that a set of data matrices have consistent columns.
- Parameters
data (list of array-likes) – List of matrices to be checked
common_columns_only (bool, optional (default: True)) – With pandas inputs, drop any columns that are not common to all matrices
- Returns
data – List of matrices with consistent columns, subsetted if necessary
- Return type
list of array-likes
- Raises
ValueError – Raised if data has inconsistent number of columns and does not have column names for subsetting
-
scprep.utils.
combine_batches
(data, batch_labels, append_to_cell_names=None, common_columns_only=True)[source]¶ Combine data matrices from multiple batches and store a batch label.
- Parameters
data (list of array-like, shape=[n_batch]) – All matrices must be of the same format and have the same number of columns (or genes.)
batch_labels (list of str, shape=[n_batch]) – List of names assigned to each batch
append_to_cell_names (bool, optional (default: None)) – If input is a pandas dataframe, add the batch label corresponding to each cell to its existing index (or cell name / barcode.) Default behavior is True for dataframes and False otherwise.
common_columns_only (bool, optional (default: True)) – With pandas inputs, drop any columns that are not common to all data matrices
- Returns
data (data matrix, shape=[n_samples, n_features]) – Number of samples is the sum of numbers of samples of all batches. Number of features is the same as each of the batches.
sample_labels (list-like, shape=[n_samples]) – Batch labels corresponding to each sample
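For plain arrays, the documented return values can be pictured as a row-wise concatenation plus one batch label per sample. The sketch below assumes dense NumPy inputs and uses a hypothetical helper stack_batches; it does not cover the pandas-specific options append_to_cell_names and common_columns_only.

```python
import numpy as np


def stack_batches(batches, batch_labels):
    """Concatenate batches row-wise and build one label per sample."""
    n_features = {b.shape[1] for b in batches}
    if len(n_features) != 1:
        raise ValueError("all batches must have the same number of columns")
    data = np.vstack(batches)
    sample_labels = np.concatenate(
        [np.repeat(label, b.shape[0]) for label, b in zip(batch_labels, batches)]
    )
    return data, sample_labels


batch1 = np.zeros((3, 4))
batch2 = np.ones((2, 4))
data, sample_labels = stack_batches([batch1, batch2], ["day1", "day2"])
```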
-
scprep.utils.
matrix_any
(condition)[source]¶ Check if a condition is true anywhere in a data matrix.
np.any doesn’t handle matrices of type pd.DataFrame
- Parameters
condition (array-like) – Boolean matrix
- Returns
any – True if condition contains any True values, False otherwise
- Return type
bool
-
scprep.utils.
matrix_min
(data)[source]¶ Get the minimum value from a data matrix.
Pandas SparseDataFrame does not handle np.min.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
- Returns
minimum – Minimum entry in data.
- Return type
float
-
scprep.utils.
matrix_non_negative
(data, allow_equal=True)[source]¶ Check if all values in a matrix are non-negative.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
allow_equal (bool, optional (default: True)) – If True, min(data) can be equal to 0
- Returns
is_non_negative
- Return type
bool
-
scprep.utils.
matrix_std
(data, axis=None)[source]¶ Get the column-wise, row-wise, or total standard deviation of a matrix.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
axis (int or None, optional (default: None)) – Axis across which to calculate standard deviation. axis=0 gives column standard deviation, axis=1 gives row standard deviation. None gives the total standard deviation.
- Returns
std – Standard deviation along desired axis.
- Return type
array-like or float
-
scprep.utils.
matrix_sum
(data, axis=None, ignore_nan=False)[source]¶ Get the column-wise, row-wise, or total sum of values in a matrix.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
axis (int or None, optional (default: None)) – Axis across which to sum. axis=0 gives column sums, axis=1 gives row sums. None gives the total sum.
ignore_nan (bool, optional (default: False)) – If True, uses np.nansum instead of np.sum
- Returns
sums – Sums along desired axis.
- Return type
array-like or float
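The axis convention and the effect of ignore_nan can be seen directly with the NumPy functions the description names. This sketch uses only np.sum and np.nansum on a dense array; it does not call scprep.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 6.0]])

# axis=0 collapses the rows: one sum per column (feature).
column_sums = np.nansum(data, axis=0)
# axis=1 collapses the columns: one sum per row (sample).
row_sums = np.nansum(data, axis=1)
# axis=None sums every entry.
total_ignoring_nan = np.nansum(data)
# np.sum propagates NaN, which is the ignore_nan=False behaviour described above.
total_with_nan = np.sum(data)

print(column_sums, row_sums, total_ignoring_nan, total_with_nan)
```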
-
scprep.utils.
matrix_transform
(data, fun, *args, **kwargs)[source]¶ Perform a numerical transformation to data.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
fun (callable) – Numerical transformation function, np.ufunc or similar.
args, kwargs – Arguments for fun. data is always passed as the first argument.
- Returns
data – Transformed output data
- Return type
array-like, shape=[n_samples, n_features]
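The calling convention (data first, then any extra positional and keyword arguments) can be sketched for dense arrays as follows. `transform_sketch` is a hypothetical stand-in; it does not handle the sparse or pandas inputs that the real function is documented to accept.

```python
import numpy as np


def transform_sketch(data, fun, *args, **kwargs):
    """Apply fun to data, passing data as the first argument."""
    return fun(np.asarray(data), *args, **kwargs)


data = np.array([[0.0, 1.0], [3.0, 8.0]])

sqrt_data = transform_sketch(data, np.sqrt)                  # no extra arguments
clipped = transform_sketch(data, np.clip, 0, 2)              # positional arguments
rounded = transform_sketch(data / 3, np.round, decimals=1)   # keyword argument
print(sqrt_data, clipped, rounded, sep="\n")
```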
-
scprep.utils.
matrix_transpose
(X)[source]¶ Transpose a matrix in a memory-efficient manner.
Pandas sparse dataframes are coerced to dense
- Parameters
X (array-like, shape=[n,m]) – Input data
- Returns
X_T – Transposed input data
- Return type
array-like, shape=[m,n]
-
scprep.utils.
matrix_vector_elementwise_multiply
(data, multiplier, axis=None)[source]¶ Elementwise multiply a matrix by a vector.
- Parameters
data (array-like, shape=[n_samples, n_features]) – Input data
multiplier (array-like, shape=[n_samples, 1] or [1, n_features]) – Vector by which to multiply data
axis (int or None, optional (default: None)) – Axis across which to multiply. axis=0 multiplies each column, axis=1 multiplies each row. None guesses based on dimensions
- Returns
product – Multiplied matrix
- Return type
array-like
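One way to read the axis argument is through NumPy broadcasting. The sketch below assumes that axis=0 means a vector of length n_samples is applied down every column and axis=1 means a vector of length n_features is applied across every row; `multiply_sketch` is a hypothetical helper for dense arrays, not the scprep function.

```python
import numpy as np


def multiply_sketch(data, multiplier, axis):
    """Broadcast a vector against the rows or columns of a dense matrix."""
    data = np.asarray(data)
    multiplier = np.asarray(multiplier).ravel()
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1 in this sketch")
    if multiplier.shape[0] != data.shape[axis]:
        raise ValueError("multiplier length must match data.shape[axis]")
    if axis == 0:
        # One multiplier per sample: reshape to a column vector.
        return data * multiplier.reshape(-1, 1)
    # One multiplier per feature: reshape to a row vector.
    return data * multiplier.reshape(1, -1)


data = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
by_sample = multiply_sketch(data, [10, 100], axis=0)
by_feature = multiply_sketch(data, [1, 2, 3], axis=1)
print(by_sample)
print(by_feature)
```

When the matrix is square the two interpretations cannot be told apart from shapes alone, which is why passing axis explicitly is the safer choice.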
-
scprep.utils.
sort_clusters_by_values
(clusters, values)[source]¶ Sort clusters in increasing order of values.
- Parameters
clusters (array-like) – An array of cluster assignments, like the output of a fit_predict() call.
values (array-like) – An associated value for each index in clusters to use for sorting the clusters.
- Returns
new_clusters – Reordered cluster assignments. np.mean(values[new_clusters == 0]) will be less than np.mean(values[new_clusters == 1]) which will be less than np.mean(values[new_clusters == 2]) and so on.
- Return type
array-like
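The ordering guarantee described above can be sketched in NumPy by relabelling clusters according to their mean value. This is an illustration under the assumption that the mean is the ordering statistic, as the Returns description suggests; `sort_clusters_sketch` is a hypothetical name.

```python
import numpy as np


def sort_clusters_sketch(clusters, values):
    """Relabel clusters 0, 1, 2, ... in increasing order of mean value."""
    clusters = np.asarray(clusters)
    values = np.asarray(values, dtype=float)
    cluster_ids = np.unique(clusters)
    means = np.array([values[clusters == c].mean() for c in cluster_ids])
    # Old labels ordered from smallest to largest mean value.
    ordered_ids = cluster_ids[np.argsort(means)]
    new_clusters = np.empty(clusters.shape, dtype=int)
    for new_label, old_label in enumerate(ordered_ids):
        new_clusters[clusters == old_label] = new_label
    return new_clusters


clusters = [0, 0, 1, 1, 2, 2]
values = [5.0, 6.0, 1.0, 2.0, 3.0, 4.0]
new_clusters = sort_clusters_sketch(clusters, values)
print(new_clusters)  # [2 2 0 0 1 1]
```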
-
scprep.utils.
sparse_series_min
(data)[source]¶ Get the minimum value from a pandas sparse series.
Pandas SparseDataFrame does not handle np.min.
- Parameters
data (pd.Series[SparseArray]) – Input data
- Returns
minimum – Minimum entry in data.
- Return type
float
External Tools¶
Functions:
|
Simulate dataset with cellular backbone. |
|
Install a Bioconductor package. |
|
Install a Github repository. |
|
Perform lineage inference with Slingshot. |
|
Simulate count data from a fictional single-cell RNA-seq experiment Splat. |
Classes:
|
Run an R function from Python. |
-
scprep.run.
DyngenSimulate
(backbone, num_cells=500, num_tfs=100, num_targets=50, num_hks=25, simulation_census_interval=10, compute_cellwise_grn=False, compute_rna_velocity=False, n_jobs=7, random_state=None, verbose=True, force_num_cells=False)[source]¶ Simulate dataset with cellular backbone.
The backbone determines the overall dynamic process during a simulation. It consists of a set of gene modules, which regulate each other such that expression of certain genes change over time in a specific manner.
DyngenSimulate is a Python wrapper for the R package Dyngen. Default values are taken from the GitHub vignettes. For more details, read about Dyngen on GitHub.
- Parameters
backbone (string) – Backbone name from the dyngen list of backbones. Get the list with get_backbones().
num_cells (int, optional (default: 500)) – Number of cells.
num_tfs (int, optional (default: 100)) –
Number of transcription factors. The TFs are the main drivers of the molecular changes in the simulation. A TF can only be regulated by other TFs or itself.
NOTE: If num_tfs input is less than nrow(backbone$module_info), Dyngen will default to nrow(backbone$module_info). This quantity varies between backbones and with each run (without seed). It is generally less than 75. It is recommended to input num_tfs >= 100 to stabilize the output.
num_targets (int, optional (default: 50)) – Number of target genes. Target genes are regulated by a TF or another target gene, but are always downstream of at least one TF.
num_hks (int, optional (default: 25)) – Number of housekeeping genes. Housekeeping genes are completely separate from any TFs or target genes.
simulation_census_interval (int, optional (default: 10)) – Stores the abundance levels only after a specific interval has passed. The lower the interval, the higher detail of simulation trajectory retained, though many timepoints will contain similar information.
compute_cellwise_grn (boolean, optional (default: False)) – If True, computes the ground truth cellwise gene regulatory networks. Also outputs ground truth bulk (entire dataset) regulatory network. NOTE: Increases compute time significantly.
compute_rna_velocity (boolean, optional (default: False)) – If true, computes the ground truth propensity ratios after simulation. NOTE: Increases compute time significantly.
n_jobs (int, optional (default: 7)) – Number of cores to use.
random_state (int, optional (default: None)) – Fixes seed for simulation generator.
verbose (boolean, optional (default: True)) – Data generation verbosity.
force_num_cells (boolean, optional (default: False)) – Dyngen occasionally produces fewer cells than specified. Set this flag to True to rerun Dyngen until the correct cell count is reached.
- Returns
Dictionary data of pd.DataFrames
data[‘cell_info’] (pd.DataFrame, shape (n_cells, 4)) – Columns: cell_id, step_ix, simulation_i, sim_time. sim_time is the simulated timepoint for a given cell.
data[‘expression’] (pd.DataFrame, shape (n_cells, n_genes)) – Log-transformed counts with dropout.
If compute_cellwise_grn is True,
data[‘bulk_grn’] (pd.DataFrame, shape (n_tf_target_interactions, 4)) – Columns: regulator, target, strength, effect. Strength is positive and unbounded. Effect is either +1 (for activation) or -1 (for inhibition).
data[‘cellwise_grn’] (pd.DataFrame, shape (n_tf_target_interactions_per_cell, 4)) – Columns: cell_id, regulator, target, strength. The output does not include all edges per cell. The regulatory effect lies in [−1, 1], where -1 is complete inhibition of the target by the TF, +1 is maximal activation of the target by the TF, and 0 is inactivity of the regulatory interaction between regulator and target.
If compute_rna_velocity is True,
data[‘rna_velocity’] (pd.DataFrame, shape (n_cells, n_genes)) – Propensity ratios for each cell.
Example
>>> import scprep
>>> scprep.run.dyngen.install()
>>> backbones = scprep.run.dyngen.get_backbones()
>>> data = scprep.run.DyngenSimulate(backbone=backbones[0])
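The force_num_cells behaviour (rerun until the requested number of cells is produced) amounts to a retry loop. The sketch below shows that pattern with a stand-in simulator; `simulate_once`, `simulate_until_full` and `max_attempts` are all hypothetical and exist only to make the example self-contained. It does not call Dyngen or scprep.

```python
import random


def simulate_once(num_cells, rng):
    """Hypothetical stand-in for one simulator run that may return too few cells."""
    shortfall = rng.choice([0, 0, 1, 3])
    return list(range(num_cells - shortfall))


def simulate_until_full(num_cells, seed=None, max_attempts=50):
    """Rerun the stand-in simulator until it returns exactly num_cells cells."""
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        cells = simulate_once(num_cells, rng)
        if len(cells) == num_cells:
            return cells, attempt
    raise RuntimeError(f"No run produced {num_cells} cells in {max_attempts} attempts")


cells, attempts = simulate_until_full(20, seed=0)
print(len(cells), attempts)
```

Because each rerun repeats the full simulation, enabling this kind of retry can multiply the run time.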
-
scprep.run.
install_bioconductor
(package=None, site_repository=None, update=False, type='binary', version=None, verbose=True)[source]¶ Install a Bioconductor package.
- Parameters
site_repository (string, optional (default: None)) – additional repository in which to look for packages to install. This repository will be prepended to the default repositories
update (boolean, optional (default: False)) – When False, don’t attempt to update old packages. When True, update old packages automatically.
type ({"binary", "source", "both"}, optional (default: "binary")) – Which package version to install if a newer version is available as source. “both” tries source first and uses binary as a fallback.
version (string, optional (default: None)) – Bioconductor version to install, e.g., version = “3.8”. The special symbol version = “devel” installs the current ‘development’ version. If None, installs from the current version.
verbose (boolean, optional (default: True)) – Install script verbosity.
-
scprep.run.
install_github
(repo, lib=None, dependencies=None, update=False, type='binary', build_vignettes=False, force=False, verbose=True)[source]¶ Install a Github repository.
- Parameters
repo (string) – Github repository name to install.
lib (string) – Directory to install the package. If missing, defaults to the first element of .libPaths().
dependencies (boolean, optional (default: None/NA)) – When True, installs all packages specified under “Depends”, “Imports”, “LinkingTo” and “Suggests”. When False, installs no dependencies. When None/NA, installs all packages specified under “Depends”, “Imports” and “LinkingTo”.
update (string or boolean, optional (default: False)) – One of “default”, “ask”, “always”, or “never”. “default” Respects R_REMOTES_UPGRADE variable if set, falls back to “ask” if unset. “ask” prompts the user for which out of date packages to upgrade. For non-interactive sessions “ask” is equivalent to “always”. TRUE and FALSE also accepted, correspond to “always” and “never” respectively.
type ({"binary", "source", "both"}, optional (default: "binary")) – Which package version to install if a newer version is available as source. “both” tries source first and uses binary as a fallback.
build_vignettes (boolean, optional (default: False)) – Builds Github vignettes.
force (boolean, optional (default: False)) – Forces installation even if remote state has not changed since previous install.
verbose (boolean, optional (default: True)) – Install script verbosity.
-
class
scprep.run.
RFunction
(args='', setup='', body='', cleanup=True, verbose=1)[source]¶ Bases:
object
Run an R function from Python.
- Parameters
args (str, optional (default: "")) – Comma-separated R argument names and optionally default parameters
setup (str, optional (default: "")) – R code to run prior to function definition (e.g. loading libraries)
body (str, optional (default: "")) – R code to run in the body of the function
cleanup (boolean, optional (default: True)) – If true, clear the R workspace after the function is complete. If false, this could result in memory leaks.
verbose (int, optional (default: 1)) – R script verbosity. For verbose==0, no messages are printed. For verbose==1, messages from the function body are printed. For verbose==2, messages from the function setup and body are printed.
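To show how the three string parameters relate to each other, the sketch below assembles them into R source text: setup runs first, args becomes the argument list, and body becomes the function body. This is only a conceptual illustration; `build_r_source` and the function name `fun` are hypothetical, and the template scprep actually uses is not shown in this reference and may differ.

```python
def build_r_source(args="", setup="", body="", name="fun"):
    """Compose setup code and an R function definition as plain text."""
    return f"{setup}\n{name} <- function({args}) {{\n{body}\n}}\n"


source = build_r_source(
    args="n = 5, mean = 0",
    setup="library(stats)",
    body="rnorm(n, mean = mean)",
)
print(source)
```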
-
scprep.run.
Slingshot
(data, cluster_labels, start_cluster=None, end_cluster=None, distance=None, omega=None, shrink=True, extend='y', reweight=True, reassign=True, thresh=0.001, max_iter=15, stretch=2, smoother='smooth.spline', shrink_method='cosine', allow_breaks=True, seed=None, verbose=1, **kwargs)[source]¶ Perform lineage inference with Slingshot.
Given a reduced-dimensional data matrix n by p and a vector of cluster labels (or matrix of soft cluster assignments, potentially including a -1 label for “unclustered”), this function performs lineage inference using a cluster-based minimum spanning tree and constructing simultaneous principal curves for branching paths through the tree.
For more details, read about Slingshot on GitHub and Bioconductor.
- Parameters
data (array-like, shape=[n_samples, n_dimensions]) – matrix of (reduced dimension) coordinates to be used for lineage inference.
cluster_labels (list-like, shape=[n_samples]) – a vector of cluster labels, optionally including -1’s for “unclustered.”
start_cluster (string, optional (default: None)) – indicates the cluster(s) of origin. Lineages will be represented by paths coming out of this cluster.
end_cluster (string, optional (default: None)) – indicates the cluster(s) which will be forced leaf nodes. This introduces a constraint on the MST algorithm.
distance (callable, optional (default: None)) – method for calculating distances between clusters. Must take two matrices as input, corresponding to subsets of reduced_dim. If the minimum cluster size is larger than the number of dimensions, the default is to use the joint covariance matrix to find squared distance between cluster centers. If not, the default is to use the diagonal of the joint covariance matrix. Not currently implemented.
omega (float, optional (default: None)) – this granularity parameter determines the distance between every real cluster and the artificial cluster. It is parameterized such that this distance is omega / 2, making omega the maximum distance between two connected clusters. By default, omega = Inf.
shrink (boolean or float, optional (default: True)) – boolean or numeric between 0 and 1, determines whether and how much to shrink branching lineages toward their average prior to the split.
extend ({'y', 'n', 'pc1'}, optional (default: "y")) – how to handle root and leaf clusters of lineages when constructing the initial, piece-wise linear curve.
reweight (boolean, optional (default: True)) – whether to allow cells shared between lineages to be reweighted during curve-fitting. If True, cells shared between lineages will be iteratively reweighted based on the quantiles of their projection distances to each curve.
reassign (boolean, optional (default: True)) – whether to reassign cells to lineages at each iteration. If True, cells will be added to a lineage when their projection distance to the curve is less than the median distance for all cells currently assigned to the lineage. Additionally, shared cells will be removed from a lineage if their projection distance to the curve is above the 90th percentile and their weight along the curve is less than 0.1.
thresh (float, optional (default: 0.001)) – determines the convergence criterion. Percent change in the total distance from cells to their projections along curves must be less than thresh.
max_iter (int, optional (default: 15)) – maximum number of iterations
stretch (int, optional (default: 2)) – factor between 0 and 2 by which curves can be extrapolated beyond endpoints
smoother ({"smooth.spline", "lowess", "periodic_lowess"}, optional (default: “smooth.spline”)) – choice of smoother. “periodic_lowess” allows one to fit closed curves. Beware, you may want to use iter = 0 with “lowess”.
shrink_method (string, optional (default: "cosine")) – how to determine the appropriate amount of shrinkage for a branching lineage. Accepted values: “gaussian”, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “triweight”, “cosine”, “optcosine”, “density”.
allow_breaks (boolean, optional (default: True)) – determines whether curves that branch very close to the origin should be allowed to have different starting points.
seed (int or None, optional (default: None)) – Seed to use for generating random numbers.
verbose (int, optional (default: 1)) – Logging verbosity between 0 and 2.
- Returns
slingshot (dict) – Contains the following keys:
pseudotime (array-like, shape=[n_samples, n_curves]) – Pseudotime projection of each cell onto each principal curve. Value is np.nan if the cell does not lie on the curve
branch (list-like, shape=[n_samples]) – Branch assignment for each cell
curves (array_like, shape=[n_curves, n_samples, n_dimensions]) – Coordinates of each principal curve in the reduced dimension
Examples
>>> import scprep
>>> import phate
>>> data, clusters = phate.tree.gen_dla(n_branch=4, n_dim=200, branch_length=200)
>>> phate_op = phate.PHATE()
>>> data_phate = phate_op.fit_transform(data)
>>> slingshot = scprep.run.Slingshot(data_phate, clusters)
>>> ax = scprep.plot.scatter2d(
...     data_phate,
...     c=slingshot['pseudotime'][:,0],
...     cmap='magma',
...     legend_title='Branch 1'
... )
>>> scprep.plot.scatter2d(
...     data_phate,
...     c=slingshot['pseudotime'][:,1],
...     cmap='viridis',
...     ax=ax,
...     ticks=False,
...     label_prefix='PHATE',
...     legend_title='Branch 2'
... )
>>> for curve in slingshot['curves']:
...     ax.plot(curve[:,0], curve[:,1], c='black')
>>> ax = scprep.plot.scatter2d(data_phate, c=slingshot['branch'],
...     legend_title='Branch', ticks=False, label_prefix='PHATE')
>>> for curve in slingshot['curves']:
...     ax.plot(curve[:,0], curve[:,1], c='black')
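Because pseudotime holds np.nan for cells that do not lie on a given curve, NaN-aware reductions are useful when summarising it. The sketch below uses a small hand-made array in place of slingshot['pseudotime']; the numbers are invented for illustration and are not Slingshot output.

```python
import numpy as np

# Stand-in for slingshot['pseudotime'], shape=[n_samples, n_curves].
pseudotime = np.array([
    [0.0, 0.0],      # shared cell near the root
    [1.0, np.nan],   # only on curve 0
    [np.nan, 2.0],   # only on curve 1
    [3.0, 2.5],      # shared cell
])

on_curve = ~np.isnan(pseudotime)
cells_per_curve = on_curve.sum(axis=0)    # number of cells assigned to each curve
shared_cells = on_curve.all(axis=1)       # cells lying on every curve
# One pseudotime per cell: average over the curves the cell lies on.
mean_pseudotime = np.nanmean(pseudotime, axis=1)

print(cells_per_curve, shared_cells, mean_pseudotime)
```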
-
scprep.run.
SplatSimulate
(method='paths', batch_cells=100, n_genes=10000, batch_fac_loc=0.1, batch_fac_scale=0.1, mean_rate=0.3, mean_shape=0.6, lib_loc=11, lib_scale=0.2, lib_norm=False, out_prob=0.05, out_fac_loc=4, out_fac_scale=0.5, de_prob=0.1, de_down_prob=0.1, de_fac_loc=0.1, de_fac_scale=0.4, bcv_common=0.1, bcv_df=60, dropout_type='none', dropout_prob=0.5, dropout_mid=0, dropout_shape=-1, group_prob=1, path_from=0, path_n_steps=100, path_skew=0.5, path_nonlinear_prob=0.1, path_sigma_fac=0.8, seed=None, verbose=1, path_length=None)[source]¶ Simulate count data from a fictional single-cell RNA-seq experiment Splat.
SplatSimulate is a Python wrapper for the R package Splatter. For more details, read about Splatter on GitHub and Bioconductor.
- Parameters
batch_cells (list-like or int, optional (default: 100)) – The number of cells in each batch.
n_genes (int, optional (default:10000)) – The number of genes to simulate.
batch_fac_loc (float, optional (default: 0.1)) – Location (meanlog) parameter for the batch effects factor log-normal distribution.
batch_fac_scale (float, optional (default: 0.1)) – Scale (sdlog) parameter for the batch effects factor log-normal distribution.
mean_shape (float, optional (default: 0.6)) – Shape parameter for the mean gamma distribution.
mean_rate (float, optional (default: 0.3)) – Rate parameter for the mean gamma distribution.
lib_loc (float, optional (default: 11)) – Location (meanlog) parameter for the library size log-normal distribution, or mean for the normal distribution.
lib_scale (float, optional (default: 0.2)) – Scale (sdlog) parameter for the library size log-normal distribution, or sd for the normal distribution.
lib_norm (bool, optional (default: False)) – Whether to use a normal distribution instead of the usual log-normal distribution.
out_prob (float, optional (default: 0.05)) – Probability that a gene is an expression outlier.
out_fac_loc (float, optional (default: 4)) – Location (meanlog) parameter for the expression outlier factor log-normal distribution.
out_fac_scale (float, optional (default: 0.5)) – Scale (sdlog) parameter for the expression outlier factor log-normal distribution.
de_prob (float, optional (default: 0.1)) – Probability that a gene is differentially expressed in each group or path.
de_down_prob (float, optional (default: 0.1)) – Probability that a differentially expressed gene is down-regulated.
de_fac_loc (float, optional (default: 0.1)) – Location (meanlog) parameter for the differential expression factor log-normal distribution.
de_fac_scale (float, optional (default: 0.4)) – Scale (sdlog) parameter for the differential expression factor log-normal distribution.
bcv_common (float, optional (default: 0.1)) – Underlying common dispersion across all genes.
bcv_df (float, optional (default: 60)) – Degrees of freedom for the BCV inverse chi-squared distribution.
dropout_type ({'none', 'experiment', 'batch', 'group', 'cell', 'binomial'}, optional (default: ‘none’)) – The type of dropout to simulate. “none” indicates no dropout, “experiment” is global dropout using the same parameters for every cell, “batch” uses the same parameters for every cell in each batch, “group” uses the same parameters for every cell in each group, “cell” uses a different set of parameters for each cell, and “binomial” performs post-hoc binomial undersampling.
dropout_mid (list-like or float, optional (default: 0)) – Midpoint parameter for the dropout logistic function.
dropout_shape (list-like or float, optional (default: -1)) – Shape parameter for the dropout logistic function.
dropout_prob (float, optional (default: 0.5)) – Probability for binomial undersampling dropout.
group_prob (list-like or int, optional (default: 1, shape=[n_groups])) – The probabilities that cells come from particular groups.
path_from (list-like, optional (default: 0, shape=[n_groups])) – Vector giving the originating point of each path.
path_n_steps (list-like, optional (default: 100, shape=[n_groups])) – Vector giving the number of steps to simulate along each path.
path_skew (list-like, optional (default: 0.5, shape=[n_groups])) – Vector giving the skew of each path.
path_nonlinear_prob (float, optional (default: 0.1)) – Probability that a gene changes expression in a non-linear way along the differentiation path.
path_sigma_fac (float, optional (default: 0.8)) – Sigma factor for non-linear gene paths.
seed (int or None, optional (default: None)) – Seed to use for generating random numbers.
verbose (int, optional (default: 1)) – Logging verbosity between 0 and 2.
- Returns
sim – Contains the following keys:
counts : Simulated expression counts.
group : The group or path the cell belongs to.
batch : The batch the cell was sampled from.
exp_lib_size : The expected library size for that cell.
step (paths only) : How far along the path each cell is.
base_gene_mean : The base expression level for that gene.
outlier_factor : Expression outlier factor for that gene. Values of 1 indicate the gene is not an expression outlier.
gene_mean : Expression level after applying outlier factors.
batch_fac_[batch] : The batch effects factor for each gene for a particular batch.
de_fac_[group] : The differential expression factor for each gene in a particular group. Values of 1 indicate the gene is not differentially expressed.
sigma_fac_[path] : Factor applied to genes that have non-linear changes in expression along a path.
batch_cell_means : The mean expression of genes in each cell after adding batch effects.
base_cell_means : The mean expression of genes in each cell after any differential expression and adjusted for expected library size.
bcv : The Biological Coefficient of Variation for each gene in each cell.
cell_means : The mean expression level of genes in each cell adjusted for BCV.
true_counts : The simulated counts before dropout.
dropout : Logical matrix showing which values have been dropped in which cells.
- Return type
dict
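To give a feel for the shape/rate and meanlog/sdlog parameters above, the sketch below draws base gene means from a gamma distribution and library sizes from a log-normal distribution with NumPy. It is an illustration of the parameterisation only, using the default values from the signature, and does not reproduce the Splatter simulation itself. Note that NumPy's gamma sampler takes a scale, which is the reciprocal of the rate.

```python
import numpy as np

rng = np.random.default_rng(42)
n_genes, n_cells = 1000, 100

mean_shape, mean_rate = 0.6, 0.3   # gamma distribution for gene means
lib_loc, lib_scale = 11, 0.2       # log-normal distribution for library sizes

# Gamma(shape, rate) has expectation shape / rate; NumPy wants scale = 1 / rate.
base_gene_mean = rng.gamma(shape=mean_shape, scale=1.0 / mean_rate, size=n_genes)
# Log-normal with meanlog=lib_loc and sdlog=lib_scale has median exp(lib_loc).
exp_lib_size = rng.lognormal(mean=lib_loc, sigma=lib_scale, size=n_cells)

print(base_gene_mean.mean(), mean_shape / mean_rate)
print(np.median(exp_lib_size), np.exp(lib_loc))
```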
Splatter¶
Functions:
|
Simulate count data from a fictional single-cell RNA-seq experiment Splat. |
|
Install the required R packages to run Splatter. |
-
scprep.run.splatter.
SplatSimulate
(method='paths', batch_cells=100, n_genes=10000, batch_fac_loc=0.1, batch_fac_scale=0.1, mean_rate=0.3, mean_shape=0.6, lib_loc=11, lib_scale=0.2, lib_norm=False, out_prob=0.05, out_fac_loc=4, out_fac_scale=0.5, de_prob=0.1, de_down_prob=0.1, de_fac_loc=0.1, de_fac_scale=0.4, bcv_common=0.1, bcv_df=60, dropout_type='none', dropout_prob=0.5, dropout_mid=0, dropout_shape=-1, group_prob=1, path_from=0, path_n_steps=100, path_skew=0.5, path_nonlinear_prob=0.1, path_sigma_fac=0.8, seed=None, verbose=1, path_length=None)[source]¶ Simulate count data from a fictional single-cell RNA-seq experiment Splat.
SplatSimulate is a Python wrapper for the R package Splatter. For more details, read about Splatter on GitHub and Bioconductor.
- Parameters
batch_cells (list-like or int, optional (default: 100)) – The number of cells in each batch.
n_genes (int, optional (default:10000)) – The number of genes to simulate.
batch_fac_loc (float, optional (default: 0.1)) – Location (meanlog) parameter for the batch effects factor log-normal distribution.
batch_fac_scale (float, optional (default: 0.1)) – Scale (sdlog) parameter for the batch effects factor log-normal distribution.
mean_shape (float, optional (default: 0.6)) – Shape parameter for the mean gamma distribution.
mean_rate (float, optional (default: 0.3)) – Rate parameter for the mean gamma distribution.
lib_loc (float, optional (default: 11)) – Location (meanlog) parameter for the library size log-normal distribution, or mean for the normal distribution.
lib_scale (float, optional (default: 0.2)) – Scale (sdlog) parameter for the library size log-normal distribution, or sd for the normal distribution.
lib_norm (bool, optional (default: False)) – Whether to use a normal distribution instead of the usual log-normal distribution.
out_prob (float, optional (default: 0.05)) – Probability that a gene is an expression outlier.
out_fac_loc (float, optional (default: 4)) – Location (meanlog) parameter for the expression outlier factor log-normal distribution.
out_fac_scale (float, optional (default: 0.5)) – Scale (sdlog) parameter for the expression outlier factor log-normal distribution.
de_prob (float, optional (default: 0.1)) – Probability that a gene is differentially expressed in each group or path.
de_down_prob (float, optional (default: 0.1)) – Probability that a differentially expressed gene is down-regulated.
de_fac_loc (float, optional (default: 0.1)) – Location (meanlog) parameter for the differential expression factor log-normal distribution.
de_fac_scale (float, optional (default: 0.4)) – Scale (sdlog) parameter for the differential expression factor log-normal distribution.
bcv_common (float, optional (default: 0.1)) – Underlying common dispersion across all genes.
bcv_df (float, optional (default: 60)) – Degrees of freedom for the BCV inverse chi-squared distribution.
dropout_type ({'none', 'experiment', 'batch', 'group', 'cell', 'binomial'}, optional (default: ‘none’)) – The type of dropout to simulate. “none” indicates no dropout, “experiment” is global dropout using the same parameters for every cell, “batch” uses the same parameters for every cell in each batch, “group” uses the same parameters for every cell in each group, “cell” uses a different set of parameters for each cell, and “binomial” performs post-hoc binomial undersampling.
dropout_mid (list-like or float, optional (default: 0)) – Midpoint parameter for the dropout logistic function.
dropout_shape (list-like or float, optional (default: -1)) – Shape parameter for the dropout logistic function.
dropout_prob (float, optional (default: 0.5)) – Probability for binomial undersampling dropout.
group_prob (list-like or int, optional (default: 1, shape=[n_groups])) – The probabilities that cells come from particular groups.
path_from (list-like, optional (default: 0, shape=[n_groups])) – Vector giving the originating point of each path.
path_n_steps (list-like, optional (default: 100, shape=[n_groups])) – Vector giving the number of steps to simulate along each path.
path_skew (list-like, optional (default: 0.5, shape=[n_groups])) – Vector giving the skew of each path.
path_nonlinear_prob (float, optional (default: 0.1)) – Probability that a gene changes expression in a non-linear way along the differentiation path.
path_sigma_fac (float, optional (default: 0.8)) – Sigma factor for non-linear gene paths.
seed (int or None, optional (default: None)) – Seed to use for generating random numbers.
verbose (int, optional (default: 1)) – Logging verbosity between 0 and 2.
- Returns
sim – Contains the following keys:
counts : Simulated expression counts.
group : The group or path the cell belongs to.
batch : The batch the cell was sampled from.
exp_lib_size : The expected library size for that cell.
step (paths only) : How far along the path each cell is.
base_gene_mean : The base expression level for that gene.
outlier_factor : Expression outlier factor for that gene. Values of 1 indicate the gene is not an expression outlier.
gene_mean : Expression level after applying outlier factors.
batch_fac_[batch] : The batch effects factor for each gene for a particular batch.
de_fac_[group] : The differential expression factor for each gene in a particular group. Values of 1 indicate the gene is not differentially expressed.
sigma_fac_[path] : Factor applied to genes that have non-linear changes in expression along a path.
batch_cell_means : The mean expression of genes in each cell after adding batch effects.
base_cell_means : The mean expression of genes in each cell after any differential expression and adjusted for expected library size.
bcv : The Biological Coefficient of Variation for each gene in each cell.
cell_means : The mean expression level of genes in each cell adjusted for BCV.
true_counts : The simulated counts before dropout.
dropout : Logical matrix showing which values have been dropped in which cells.
- Return type
dict
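The dropout_mid and dropout_shape parameters describe a logistic function. The sketch below assumes the standard logistic form evaluated on the log of the mean expression, which is how Splatter's dropout is commonly described; treat that as an assumption rather than a statement of the exact implementation. `dropout_probability` is a hypothetical helper.

```python
import numpy as np


def dropout_probability(cell_means, dropout_mid=0.0, dropout_shape=-1.0):
    """Logistic dropout probability as a function of log mean expression.

    Assumed form: 1 / (1 + exp(-shape * (log(mean) - mid))).
    """
    log_means = np.log(np.asarray(cell_means, dtype=float))
    return 1.0 / (1.0 + np.exp(-dropout_shape * (log_means - dropout_mid)))


cell_means = np.array([0.1, 1.0, 10.0])
probs = dropout_probability(cell_means)
print(probs)
```

With the default negative shape, the probability equals 0.5 where the log mean matches dropout_mid and falls as expression rises, so lowly expressed genes are dropped more often.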
-
scprep.run.splatter.
install
(site_repository=None, update=False, version=None, verbose=True)[source]¶ Install the required R packages to run Splatter.
- Parameters
site_repository (string, optional (default: None)) – additional repository in which to look for packages to install. This repository will be prepended to the default repositories
update (boolean, optional (default: False)) – When False, don’t attempt to update old packages. When True, update old packages automatically.
version (string, optional (default: None)) – Bioconductor version to install, e.g., version = “3.8”. The special symbol version = “devel” installs the current ‘development’ version. If None, installs from the current version.
verbose (boolean, optional (default: True)) – Install script verbosity.
Slingshot¶
Functions:
|
Perform lineage inference with Slingshot. |
|
Install the required R packages to run Slingshot. |
-
scprep.run.slingshot.
Slingshot
(data, cluster_labels, start_cluster=None, end_cluster=None, distance=None, omega=None, shrink=True, extend='y', reweight=True, reassign=True, thresh=0.001, max_iter=15, stretch=2, smoother='smooth.spline', shrink_method='cosine', allow_breaks=True, seed=None, verbose=1, **kwargs)[source]¶ Perform lineage inference with Slingshot.
Given a reduced-dimensional data matrix n by p and a vector of cluster labels (or matrix of soft cluster assignments, potentially including a -1 label for “unclustered”), this function performs lineage inference using a cluster-based minimum spanning tree and constructing simultaneous principal curves for branching paths through the tree.
For more details, read about Slingshot on GitHub and Bioconductor.
- Parameters
data (array-like, shape=[n_samples, n_dimensions]) – matrix of (reduced dimension) coordinates to be used for lineage inference.
cluster_labels (list-like, shape=[n_samples]) – a vector of cluster labels, optionally including -1’s for “unclustered.”
start_cluster (string, optional (default: None)) – indicates the cluster(s) of origin. Lineages will be represented by paths coming out of this cluster.
end_cluster (string, optional (default: None)) – indicates the cluster(s) which will be forced leaf nodes. This introduces a constraint on the MST algorithm.
distance (callable, optional (default: None)) – method for calculating distances between clusters. Must take two matrices as input, corresponding to subsets of reduced_dim. If the minimum cluster size is larger than the number of dimensions, the default is to use the joint covariance matrix to find squared distance between cluster centers. If not, the default is to use the diagonal of the joint covariance matrix. Not currently implemented.
omega (float, optional (default: None)) – this granularity parameter determines the distance between every real cluster and the artificial cluster. It is parameterized such that this distance is omega / 2, making omega the maximum distance between two connected clusters. By default, omega = Inf.
shrink (boolean or float, optional (default: True)) – boolean or numeric between 0 and 1, determines whether and how much to shrink branching lineages toward their average prior to the split.
extend ({'y', 'n', 'pc1'}, optional (default: "y")) – how to handle root and leaf clusters of lineages when constructing the initial, piece-wise linear curve.
reweight (boolean, optional (default: True)) – whether to allow cells shared between lineages to be reweighted during curve-fitting. If True, cells shared between lineages will be iteratively reweighted based on the quantiles of their projection distances to each curve.
reassign (boolean, optional (default: True)) – whether to reassign cells to lineages at each iteration. If True, cells will be added to a lineage when their projection distance to the curve is less than the median distance for all cells currently assigned to the lineage. Additionally, shared cells will be removed from a lineage if their projection distance to the curve is above the 90th percentile and their weight along the curve is less than 0.1.
thresh (float, optional (default: 0.001)) – determines the convergence criterion. Percent change in the total distance from cells to their projections along curves must be less than thresh.
max_iter (int, optional (default: 15)) – maximum number of iterations
stretch (int, optional (default: 2)) – factor between 0 and 2 by which curves can be extrapolated beyond endpoints
smoother ({"smooth.spline", "lowess", "periodic_lowess"}, optional (default: "smooth.spline")) – choice of smoother. “periodic_lowess” allows one to fit closed curves. Beware, you may want to use max_iter = 0 with “lowess”.
shrink_method (string, optional (default: "cosine")) – how to determine the appropriate amount of shrinkage for a branching lineage. Accepted values: “gaussian”, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “triweight”, “cosine”, “optcosine”, “density”.
allow_breaks (boolean, optional (default: True)) – determines whether curves that branch very close to the origin should be allowed to have different starting points.
seed (int or None, optional (default: None)) – Seed to use for generating random numbers.
verbose (int, optional (default: 1)) – Logging verbosity between 0 and 2.
- Returns
slingshot (dict) – Contains the following keys:
pseudotime (array-like, shape=[n_samples, n_curves]) – Pseudotime projection of each cell onto each principal curve. Value is np.nan if the cell does not lie on the curve
branch (list-like, shape=[n_samples]) – Branch assignment for each cell
curves (array-like, shape=[n_curves, n_samples, n_dimensions]) – Coordinates of each principal curve in the reduced dimension
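The stopping rule described by thresh and max_iter can be illustrated with a small sketch. This is not the Slingshot implementation: the helper names are hypothetical, the list of totals stands in for the total cell-to-curve distance after each curve-fitting pass, and treating "percent change" as a fractional change (so that 0.001 means 0.1%) is an assumption.

```python
def percent_change(previous_total, current_total):
    # Fractional change in the total distance from cells to their
    # projections (assumed interpretation of "percent change").
    return abs(previous_total - current_total) / previous_total


def iterate_until_converged(totals, thresh=0.001, max_iter=15):
    """Hypothetical helper: walk through per-iteration total distances.

    totals[0] is the distance for the initial curves; each later entry
    is the distance after one more fitting pass.
    Returns (iterations_run, converged).
    """
    previous = totals[0]
    for iteration, current in enumerate(totals[1:max_iter + 1], start=1):
        if percent_change(previous, current) < thresh:
            return iteration, True
        previous = current
    # Ran out of iterations (or data) without meeting the criterion.
    return min(len(totals) - 1, max_iter), False


totals = [100.0, 90.0, 89.95, 89.94]
result = iterate_until_converged(totals)
capped = iterate_until_converged(totals, max_iter=1)
```

With these made-up totals, the second pass changes the total distance by less than 0.1%, so the loop stops there; with max_iter=1 it stops after one pass without converging.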
Examples
>>> import scprep
>>> import phate
>>> data, clusters = phate.tree.gen_dla(n_branch=4, n_dim=200, branch_length=200)
>>> phate_op = phate.PHATE()
>>> data_phate = phate_op.fit_transform(data)
>>> slingshot = scprep.run.Slingshot(data_phate, clusters)
>>> ax = scprep.plot.scatter2d(
...     data_phate,
...     c=slingshot['pseudotime'][:,0],
...     cmap='magma',
...     legend_title='Branch 1'
... )
>>> scprep.plot.scatter2d(
...     data_phate,
...     c=slingshot['pseudotime'][:,1],
...     cmap='viridis',
...     ax=ax,
...     ticks=False,
...     label_prefix='PHATE',
...     legend_title='Branch 2'
... )
>>> for curve in slingshot['curves']:
...     ax.plot(curve[:,0], curve[:,1], c='black')
>>> ax = scprep.plot.scatter2d(data_phate, c=slingshot['branch'],
...                        legend_title='Branch', ticks=False, label_prefix='PHATE')
>>> for curve in slingshot['curves']:
...     ax.plot(curve[:,0], curve[:,1], c='black')
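Because pseudotime is np.nan for cells that do not lie on a given curve, downstream code usually needs NaN-aware reductions. The following is a minimal sketch using a small synthetic array in place of slingshot['pseudotime']; the values and the choice to average across curves are illustrative assumptions, not something scprep prescribes.

```python
import numpy as np

# Synthetic stand-in for slingshot['pseudotime'], shape=[n_samples, n_curves].
# NaN marks a cell that does not lie on that curve.
pseudotime = np.array([
    [0.0, 0.0],
    [1.0, np.nan],
    [np.nan, 2.0],
    [3.0, 5.0],
])

# Boolean mask of which cells lie on which curve.
on_curve = ~np.isnan(pseudotime)

# Cells that lie on more than one curve are shared between lineages.
shared = on_curve.sum(axis=1) > 1

# One summary value per cell, ignoring curves the cell is not on.
mean_pseudotime = np.nanmean(pseudotime, axis=1)
```

Note that np.nanmean emits a warning and returns NaN for any row that is NaN on every curve, so such cells may need to be filtered first.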
-
scprep.run.slingshot.
install
(site_repository=None, update=False, version=None, verbose=True)[source]¶ Install the required R packages to run Slingshot.
- Parameters
site_repository (string, optional (default: None)) – additional repository in which to look for packages to install. This repository will be prepended to the default repositories
update (boolean, optional (default: False)) – When False, don’t attempt to update old packages. When True, update old packages automatically.
version (string, optional (default: None)) – Bioconductor version to install, e.g., version = “3.8”. The special symbol version = “devel” installs the current ‘development’ version. If None, installs from the current version.
verbose (boolean, optional (default: True)) – Install script verbosity.