API Reference

Main module

IO

pyrepseq.io.isvalidaa(string)[source]: returns True if string is composed only of characters from the standard amino acid alphabet

pyrepseq.io.isvalidcdr3(string)[source]

returns True if string is a valid CDR3 sequence

Checks the following:

first amino acid is a cysteine (C)
last amino acid is either phenylalanine (F), tryptophan (W), or cysteine (C)
each amino acid is part of the standard amino acid alphabet

See http://www.imgt.org/IMGTScientificChart/Numbering/IMGTIGVLsuperfamily.html and also https://doi.org/10.1093/nar/gkac190
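The three checks listed above can be sketched in a few lines of plain Python. This is a hypothetical re-implementation for illustration, not pyrepseq's source: the helper names are made up, and treating the empty string as invalid is an assumption.

```python
# Hypothetical sketch of the documented checks; not the library's code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_aa_sketch(seq):
    """True if seq is a non-empty string over the 20 standard amino acids.

    Rejecting the empty string is an assumption of this sketch.
    """
    return isinstance(seq, str) and len(seq) > 0 and set(seq) <= STANDARD_AA


def is_valid_cdr3_sketch(seq):
    """Apply the three documented CDR3 checks in turn."""
    return (
        is_valid_aa_sketch(seq)      # only standard amino acids
        and seq[0] == "C"            # starts with cysteine
        and seq[-1] in ("F", "W", "C")  # ends with F, W, or C
    )


print(is_valid_cdr3_sketch("CASSDWGSQNTLYF"))  # True
print(is_valid_cdr3_sketch("ASSDWGSQNTLYF"))   # False: does not start with C
```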

pyrepseq.io.multimerge(dfs, on, suffixes=None, **kwargs)[source]

Merge multiple dataframes on a common column.

Provides support for custom suffixes.

Parameters:

on ('index' or column name)
suffixes ([list-like | None]) – list of suffixes to append to the data
**kwargs (keyword arguments passed along to pd.merge)

Return type:

merged dataframe

pyrepseq.io.standardize_dataframe(df: DataFrame | None = None, col_mapper: Mapping | None = None, standardize: bool = True, species: str = 'HomoSapiens', tcr_enforce_functional: bool = True, tcr_precision: str = 'gene', mhc_precision: str = 'gene', strict_cdr3_standardization: bool = False, suppress_warnings: bool = False, df_old: DataFrame | None = None)[source]

This is a utility function to organise a table of TCR-pMHC data into the standard pyrepseq format and perform data cleaning/standardization to ensure that the TCR/MHC gene symbols are IMGT-compliant, the epitopes are all valid amino acid strings, and the CDR3s look valid. For further notes on data standardization, see below. The standard format is a table with some or all of the following columns (not necessarily in order):

Column Name	Column should contain	Data type
TRAV	TRAV gene symbol	str
CDR3A	TCR alpha chain CDR3 amino acid sequence	str
TRAJ	TRAJ gene symbol	str
TRBV	TRBV gene symbol	str
CDR3B	TCR beta chain CDR3 amino acid sequence	str
TRBJ	TRBJ gene symbol	str
Epitope	Epitope amino acid sequence	str
MHCA	MHC alpha chain gene symbol	str
MHCB	MHC beta chain gene symbol	str

If the input DataFrame contains the necessary data in columns that are named differently, this can be resolved by providing the mapping to the col_mapper argument (see parameters and examples).

If standardization is enabled (True by default), the function will additionally attempt to standardize the TCR and MHC gene symbols to be IMGT-compliant, and CDR3/Epitope amino acid sequences to be valid. However, for the standardization to happen, the columns with the relevant data must either be correctly named, or the necessary re-naming scheme must be specified by supplying an argument to the col_mapper parameter. During standardization, most non-standardizable/nonsensical values will be replaced with None. However, since epitopes are not necessarily always amino acid sequences, values in the Epitope column that fail standardization will be kept as their original value.

Deprecated since version 1.4: df_old will be removed in pyrepseq 2.0 in favour of the more simply named df parameter.

Parameters:

df (pandas.DataFrame) – Source DataFrame from which to pull data.
df_old (pandas.DataFrame) – Alias for df. Now deprecated and will be removed in version 2.0.
col_mapper (Mapping) – A mapping object, such as a dictionary, which maps the old column names to the new column names. This should not be set if no column re-naming is necessary. Defaults to None.
standardize (bool) – When set to False, gene name standardisation is not attempted. Defaults to True.
species (str) – Name of the species from which the TCR data is derived, in their binomial nomenclature, camel-cased. Defaults to 'HomoSapiens'.
tcr_enforce_functional (bool) – When set to True, TCR genes that are not functional (i.e. ORF or pseudogene) are removed, and replaced with None. Defaults to True.
tcr_precision (str) – Level of precision to trim the TCR gene data to ('gene' or 'allele'). Defaults to 'gene'.
mhc_precision (str) – Level of precision to trim the MHC gene data to ('gene', 'protein' or 'allele'). Defaults to 'gene'.
strict_cdr3_standardization (bool) – If True, any string that does not look like a CDR3 sequence is rejected. If False, inputs that are valid amino acid sequences but do not start with C and end with F/W are not rejected; instead they are corrected by having a C prepended and an F appended. Defaults to False.
suppress_warnings (bool) – If True, suppresses warnings that are emitted when the standardisation of certain values fails. Defaults to False.

Returns:

Standardized DataFrame containing the original data, cleaned.

Return type:

pandas.DataFrame

Examples

If you already have a DataFrame in the standard format, standardize_dataframe can perform data standardization for you. In the examples shown here, we omit any standardization warnings for ease of reading.

Say you have the following DataFrame:

>>> from pyrepseq import io
>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[
...         ["av26.1*1",  "CIVRAPGRADMRF", "aj43*1",    "bv13*1",      "CASSYLPGQGDHYSNQPQHF","bj1.5*1",    "FLKEKGGL",       "b8",         "b2m"],
...         ["TCRAV20*01","CAVPSGAGSYQLTF","TCRAJ28*01","TCRBV28S1*01","CASSLGQSGANVLTF",     "TCRBJ2S6*01","LQPFPQPELPYPQPQ","HLA-DQA1*05","HLA-DQB1*02"],
...         ["unknown",   "unknown",       "unknown",   "TRBV7-2*01",  "CASSDWGSQNTLYF",      "TRBJ2-4*01", "YMPYFFTLL",      "HLA-A*02",   "B2M"]
...     ],
...     columns=["TRAV","CDR3A","TRAJ","TRBV","CDR3B","TRBJ","Epitope","MHCA","MHCB"]
... )
>>> df
         TRAV           CDR3A        TRAJ          TRBV                 CDR3B         TRBJ          Epitope         MHCA         MHCB
0    av26.1*1   CIVRAPGRADMRF      aj43*1        bv13*1  CASSYLPGQGDHYSNQPQHF      bj1.5*1         FLKEKGGL           b8          b2m
1  TCRAV20*01  CAVPSGAGSYQLTF  TCRAJ28*01  TCRBV28S1*01       CASSLGQSGANVLTF  TCRBJ2S6*01  LQPFPQPELPYPQPQ  HLA-DQA1*05  HLA-DQB1*02
2     unknown         unknown     unknown    TRBV7-2*01        CASSDWGSQNTLYF   TRBJ2-4*01        YMPYFFTLL     HLA-A*02          B2M

By passing this to standardize_dataframe, you will get a cleaned version of the data.

>>> io.standardize_dataframe(df, suppress_warnings=True)
       TRAV           CDR3A    TRAJ     TRBV                 CDR3B     TRBJ          Epitope      MHCA      MHCB
0  TRAV26-1   CIVRAPGRADMRF  TRAJ43   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5         FLKEKGGL     HLA-B       B2M
1    TRAV20  CAVPSGAGSYQLTF  TRAJ28   TRBV28       CASSLGQSGANVLTF  TRBJ2-6  LQPFPQPELPYPQPQ  HLA-DQA1  HLA-DQB1
2      None            None    None  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4        YMPYFFTLL     HLA-A       B2M

If you want to have extra columns on the DataFrame, that is allowed.

>>> extended_df = df.copy()
>>> extended_df["clone_count"] = [1,2,3]
>>> io.standardize_dataframe(extended_df, suppress_warnings=True)
       TRAV           CDR3A    TRAJ     TRBV                 CDR3B     TRBJ          Epitope      MHCA      MHCB  clone_count
0  TRAV26-1   CIVRAPGRADMRF  TRAJ43   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5         FLKEKGGL     HLA-B       B2M            1
1    TRAV20  CAVPSGAGSYQLTF  TRAJ28   TRBV28       CASSLGQSGANVLTF  TRBJ2-6  LQPFPQPELPYPQPQ  HLA-DQA1  HLA-DQB1            2
2      None            None    None  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4        YMPYFFTLL     HLA-A       B2M            3

Having only a subset of the standard columns is also allowed.

>>> beta_only_df = df.copy()
>>> beta_only_df = beta_only_df[["TRBV","CDR3B","TRBJ"]]
>>> io.standardize_dataframe(beta_only_df, suppress_warnings=True)
      TRBV                 CDR3B     TRBJ
0   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1   TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4

Columns can be renamed by supplying a mapping to the col_mapper parameter.

>>> beta_only_misnamed = beta_only_df.copy()
>>> beta_only_misnamed.columns = ["foo", "bar", "baz"]
>>> beta_only_misnamed
            foo                   bar          baz
0        bv13*1  CASSYLPGQGDHYSNQPQHF      bj1.5*1
1  TCRBV28S1*01       CASSLGQSGANVLTF  TCRBJ2S6*01
2    TRBV7-2*01        CASSDWGSQNTLYF   TRBJ2-4*01
>>> col_mapper = {
...     "foo": "TRBV",
...     "bar": "CDR3B",
...     "baz": "TRBJ"
... }
>>> io.standardize_dataframe(beta_only_misnamed, col_mapper=col_mapper)
      TRBV                 CDR3B     TRBJ
0   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1   TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4

Stats

pyrepseq.stats.chao1(counts)[source]

Estimate richness from sampled counts.

\(\hat{S}_{Chao1} = S_{obs} + f_1^2/(2 f_2)\)
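As a worked illustration of the formula, here is a minimal sketch in plain Python. It assumes the standard Chao1 reading of the symbols: the observed richness is the number of elements with a non-zero count, f1 is the number of singletons (count 1) and f2 the number of doubletons (count 2). The function name is hypothetical, and how the library handles the case of no doubletons is not stated here, so the sketch simply raises.

```python
from collections import Counter


def chao1_sketch(counts):
    """Chao1 richness from a vector of counts, using the formula as written."""
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)              # number of observed distinct elements
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]        # singletons and doubletons
    if f2 == 0:
        # Assumption of this sketch: refuse rather than guess a correction.
        raise ValueError("formula as written needs at least one doubleton")
    return s_obs + f1 ** 2 / (2 * f2)


counts = [1, 1, 1, 2, 2, 5, 10]
print(chao1_sketch(counts))  # 7 + 3**2 / (2 * 2) = 9.25
```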

pyrepseq.stats.chao2(counts, m)[source]

Estimate richness from incidence data

counts: incidence count vector
m: number of replicates

pyrepseq.stats.jaccard_index(A, B)[source]

Calculate the Jaccard index for two sets.

This measure is defined as

\(J(A, B) = |A \cap B| / |A \cup B|\)

A, B: iterables (will be converted to sets). If A, B are pd.Series na values will be dropped first

pyrepseq.stats.overlap(A, B)[source]

Calculate the number of overlapping elements of two sets.

This measure is defined as \(|A \cap B|\)

A, B: iterables (will be converted to sets). na values will be dropped first

pyrepseq.stats.overlap_coefficient(A, B)[source]

Calculate the overlap coefficient for two sets.

This measure is defined as \(O(A, B) = |A \cap B| / min(|A|, |B|)\)

A, B: iterables (will be converted to sets). na values will be dropped first

pyrepseq.stats.pc(array: Iterable, array2: Iterable | None = None)[source]

Estimate the coincidence probability \(p_C\) from a sample. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)

Note: This measure is also known as the Simpson or Hunter-Gaston index

Parameters:

array (Iterable) – Iterable of sampled elements
array2 (Optional[Iterable]) – Second Iterable of sampled elements: if provided probability of cross-coincidences is calculated as \(p_C = (\sum_i n_{1i} n_{2i}) / (N_1 N_2)\)
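Both estimators follow directly from the element counts. The sketch below is a plain-Python illustration of the two formulas, not the library's implementation; the function name is hypothetical.

```python
from collections import Counter


def pc_sketch(array, array2=None):
    """Coincidence probability within one sample, or across two samples."""
    c1 = Counter(array)
    n1 = sum(c1.values())
    if array2 is None:
        # Within-sample: ordered pairs of distinct draws that coincide.
        return sum(n * (n - 1) for n in c1.values()) / (n1 * (n1 - 1))
    c2 = Counter(array2)
    n2 = sum(c2.values())
    # Cross-sample: pairs (one draw from each sample) that coincide.
    return sum(n * c2[k] for k, n in c1.items()) / (n1 * n2)


sample = ["A", "A", "B", "C", "C", "C"]   # counts 2, 1, 3 and N = 6
other = ["A", "C", "D", "D"]              # counts 1, 1, 2 and N = 4

print(pc_sketch(sample))         # (2*1 + 1*0 + 3*2) / (6*5) = 8/30
print(pc_sketch(sample, other))  # (2*1 + 3*1) / (6*4) = 5/24
```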

pyrepseq.stats.pc_conditional(df, by, on, group_weights=None)[source]

Conditional coincidence probability estimator

Parameters:

df (pandas DataFrame)
by (list) – conditioning parameters used to group input dataframe
on (string/list of strings) – column or columns to compute probability of coincidence or joint probability of coincidence on. If type(on) == list then joint pc is computed on the concatenations of each specified column
group_weights (array-like) – weight groups non-uniformly according to square of these values

Returns:

pc of df[on] averaged over groups

Return type:

pandas DataFrame/float

pyrepseq.stats.pc_grouped_cross(df, by, on)[source]

Cross-group coincidence probability estimator

Parameters:

df (pandas DataFrame)
by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby
on (list of strings) – columns on which to obtain a joint probability of coincidence

Returns:

pc computed on the concatenation of each specified column in on

Return type:

pd.DataFrame

pyrepseq.stats.pc_joint(df, on, df_2=None, gap_token='_')[source]

Joint coincidence probability estimator

Parameters:

df (pandas DataFrame)
on (list of strings) – columns on which to obtain a joint probability of coincidence
df_2 (None or pd.DataFrame) – second DataFrame for cross-coincidence calculations
gap_token (string) – character inserted between features when they are concatenated

Returns:

pc computed on the concatenation of each specified column in on

Return type:

float
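To illustrate what "pc computed on the concatenation" means, the sketch below joins the selected columns with the gap token and then applies the single-sample coincidence formula to the joined keys. A list of dicts stands in for the DataFrame, and the function name is hypothetical.

```python
from collections import Counter


def pc_joint_sketch(rows, on, gap_token="_"):
    """Joint coincidence probability over the columns listed in `on`."""
    # Two rows coincide only if they agree on every selected column.
    keys = [gap_token.join(str(row[col]) for col in on) for row in rows]
    counts = Counter(keys)
    n = len(keys)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


rows = [
    {"TRBV": "TRBV13", "CDR3B": "CASSF"},
    {"TRBV": "TRBV13", "CDR3B": "CASSF"},
    {"TRBV": "TRBV28", "CDR3B": "CASSF"},
    {"TRBV": "TRBV28", "CDR3B": "CAWSF"},
]

print(pc_joint_sketch(rows, on=["CDR3B"]))          # 6 / 12 = 0.5
print(pc_joint_sketch(rows, on=["TRBV", "CDR3B"]))  # 2 / 12
```

Adding a column to `on` can only lower the estimate, since a joint coincidence requires agreement on every column.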

pyrepseq.stats.pc_n(n)[source]

Estimate the coincidence probability \(p_C\) from sampled counts. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)

Note: This measure is also known as the Simpson or Hunter-Gaston index

Parameters:

n (array-like) – list of counts

pyrepseq.stats.powerlaw_mle_alpha(c, cmin=1.0, method='exact', **kwargs)[source]

Maximum likelihood estimate of the power-law exponent.

Parameters:

c (counts)
cmin (only counts >= cmin are included in fit)
continuitycorrection (use continuitycorrection (more accurate for integer counts))
method (one of ['simple', 'continuitycorrection', 'exact']) –
'simple': uses an analytical formula that is exact in the continuous case (Eq. B17 in Clauset et al. arXiv 0706.1062v2)
'continuitycorrection': applies a continuity correction to the analytical formula
'exact': numerically maximizes the discrete log-likelihood
kwargs (dict) – passed on to scipy.optimize.minimize_scalar. Default: bounds=[1.5, 4.5], method='bounded'

Return type:

estimated power-law exponent
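For orientation, the two closed-form estimators can be sketched as follows. The 'simple' variant is the continuous maximum likelihood formula from Clauset et al.; for the continuity correction the sketch assumes the common choice of replacing cmin by cmin - 1/2 in the logarithm, which may differ from what the library does. The function names are hypothetical, and the 'exact' numerical method is not reproduced.

```python
import math


def powerlaw_alpha_simple(c, cmin=1.0):
    """Continuous MLE: alpha = 1 + n / sum(ln(c_i / cmin))."""
    tail = [x for x in c if x >= cmin]
    return 1.0 + len(tail) / sum(math.log(x / cmin) for x in tail)


def powerlaw_alpha_cc(c, cmin=1.0):
    """Continuity-corrected variant (assumed form: cmin -> cmin - 1/2)."""
    tail = [x for x in c if x >= cmin]
    return 1.0 + len(tail) / sum(math.log(x / (cmin - 0.5)) for x in tail)


counts = [1, 1, 1, 2, 2, 3, 5, 8, 20]
print(powerlaw_alpha_simple(counts))
print(powerlaw_alpha_cc(counts))
```

Note that the simple formula divides by zero when every retained count equals cmin, so it needs at least one count above the cutoff.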

pyrepseq.stats.powerlaw_sample(size=1, xmin=1.0, alpha=2.0)[source]

Draw samples from a discrete power-law.

Uses an approximate transformation technique, see Eq. D6 in Clauset et al. arXiv 0706.1062v2 for details.

Parameters:

size (number of values to draw)
xmin (minimal value)
alpha (power-law exponent)

Return type:

array of integer samples
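The approximate transformation from Clauset et al. maps a uniform random number r to an integer via floor((xmin - 1/2)(1 - r)^(-1/(alpha - 1)) + 1/2). A minimal sketch using only the standard library is shown below; the function name and the rng argument are hypothetical additions for reproducibility, and the library's own implementation may differ in detail.

```python
import math
import random


def powerlaw_sample_sketch(size=1, xmin=1.0, alpha=2.0, rng=None):
    """Draw approximate discrete power-law samples by transformation."""
    rng = rng or random.Random()
    samples = []
    for _ in range(size):
        r = rng.random()  # uniform in [0, 1), so 1 - r is in (0, 1]
        x = (xmin - 0.5) * (1.0 - r) ** (-1.0 / (alpha - 1.0)) + 0.5
        samples.append(math.floor(x))
    return samples


draws = powerlaw_sample_sketch(size=1000, xmin=1.0, alpha=2.0,
                               rng=random.Random(0))
print(min(draws), max(draws))
```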

pyrepseq.stats.stdpc(array)[source]: Std.dev. estimator for Simpson’s index

pyrepseq.stats.stdpc_joint(df, on, gap_token='_')[source]: Std.dev. estimator for joint Simpson’s index

pyrepseq.stats.stdpc_n(n)[source]: Std.dev. estimator for Simpson’s index

pyrepseq.stats.subsample(counts, n)[source]

Randomly subsample from a vector of counts without replacement.

Parameters:

counts – Vector of counts (integers) to randomly subsample from.
n – Number of items to subsample from counts. Must be less than or equal to the sum of counts.

Returns:

indices, counts

Return type:

Subsampled vector of counts where the sum of the elements equals n
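Conceptually, subsampling without replacement amounts to expanding the counts into one label per individual item, drawing n items, and counting them again. The sketch below shows that idea with the standard library; it is an illustration rather than the library's (likely more efficient) implementation, and the function name and rng argument are hypothetical.

```python
import random
from collections import Counter


def subsample_sketch(counts, n, rng=None):
    """Subsample n items without replacement from a vector of counts."""
    rng = rng or random.Random()
    if n > sum(counts):
        raise ValueError("n must not exceed the sum of counts")
    # One label per individual item: category i appears counts[i] times.
    labels = [i for i, c in enumerate(counts) for _ in range(c)]
    drawn = Counter(rng.sample(labels, n))
    return [drawn[i] for i in range(len(counts))]


counts = [5, 3, 1, 1]
sub = subsample_sketch(counts, 4, rng=random.Random(0))
print(sub, sum(sub))
```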

pyrepseq.stats.var_chao1(counts)[source]: Variance estimator for Chao1 richness.

pyrepseq.stats.var_chao2(counts, m)[source]

Variance estimator for Chao2 richness.

counts: incidence count vector
m: number of replicates

pyrepseq.stats.varpc_n(n)[source]: Variance estimator for Simpson’s index

Distance

pyrepseq.distance.calculate_neighbor_numbers(seqs, reference=None, neighborhood=<function levenshtein_neighbors>)[source]

Calculate the number of neighbors for each sequence in a list.

Parameters:

seqs (list of sequences)
reference (list of sequences, set(seqs) if None)
neighborhood (function returning iterator over neighbors)

Return type:

integer array of number of neighboring sequences

pyrepseq.distance.cdist(stringsA, stringsB, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]

Compute distance between each pair of the two collections of strings. (scipy.spatial.distance.cdist equivalent for strings)

Parameters:

stringsA (iterable of strings) – An mA-length iterable.
stringsB (iterable of strings) – An mB-length iterable.
metric (function, optional) – The distance metric to use. Default: Levenshtein distance.
dtype (np.dtype) – data type of the distance matrix, default: np.uint8 Note: make sure to change the dtype, if the metric does not return integers

Returns:

Y – An \(m_A\) by \(m_B\) distance matrix is returned. For each \(i\) and \(j\), the metric dist(u=stringsA[i], v=stringsB[j]) is computed and stored in the \(ij\) th entry.

Return type:

ndarray
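The shape and layout of the result can be illustrated with a small pure-Python sketch that uses a textbook dynamic-programming Levenshtein distance as the metric. Nested lists stand in for the ndarray, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cdist_sketch(strings_a, strings_b, metric=levenshtein):
    """Row i, column j holds metric(strings_a[i], strings_b[j])."""
    return [[metric(a, b) for b in strings_b] for a in strings_a]


A = ["CASSF", "CAWSF"]
B = ["CASSF", "CASF", "CATTF"]
print(cdist_sketch(A, B))  # [[0, 1, 2], [1, 1, 2]]
```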

pyrepseq.distance.downsample(seqs: Iterable[str] | DataFrame | None, maxseqs: int | None = None)[source]

Random downsampling of a list of sequences. Also works for standard pyrepseq TCR DataFrames (see pyrepseq.io.standardize_dataframe()).

Parameters:

seqs (Union[Iterable[str], DataFrame]) – Input Iterable of strings, or TCR DataFrame.
maxseqs (Optional[int]) – Max number of sequences to keep. Defaults to None.

Returns:

Random subset of maxseqs elements from the input collection.
If maxseqs is None, returns the input collection without modification.

pyrepseq.distance.find_neighbor_pairs(seqs, neighborhood=<function hamming_neighbors>)[source]

Find neighboring sequences in a list of unique sequences.

Parameters:

neighborhood (callable returning an iterable of neighbors)

Return type:

list of tuples (seq1, seq2)

pyrepseq.distance.find_neighbor_pairs_index(seqs, neighborhood=<function hamming_neighbors>)[source]

Find neighboring sequences in a list of unique sequences.

Parameters:

neighborhood (callable returning an iterable of neighbors)

Return type:

list of tuples (index1, index2)

pyrepseq.distance.hamming_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY', variable_positions=None)[source]

Iterator over Hamming neighbors of a string x.

Parameters:

alphabet (iterable of characters)
variable_positions (iterable of positions to be varied (default: all))
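A Hamming neighbor differs from x by exactly one substitution, so the iterator can be sketched as a generator over positions and replacement characters. The function name is hypothetical; the default alphabet is the one shown in the signature above.

```python
def hamming_neighbors_sketch(x, alphabet="ACDEFGHIKLMNPQRSTVWY",
                             variable_positions=None):
    """Yield every string that differs from x at exactly one position."""
    positions = range(len(x)) if variable_positions is None else variable_positions
    for i in positions:
        for aa in alphabet:
            if aa != x[i]:
                yield x[:i] + aa + x[i + 1:]


neighbors = list(hamming_neighbors_sketch("CAF"))
print(len(neighbors))  # 3 positions * 19 alternatives = 57

# Restricting variation to the middle position only:
print(len(list(hamming_neighbors_sketch("CAF", variable_positions=[1]))))  # 19
```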

pyrepseq.distance.hierarchical_clustering(seqs: Iterable, metric: Metric | None = None, linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6})[source]

Hierarchical clustering by sequence similarity.

Parameters:

seqs (Iterable) – A collection of elements to cluster.
metric (Metric) – The metric used to compute distances between elements. If not set, a default is inferred from the input data type of seqs. If seqs is a standard pyrepseq TCR DataFrame (see pyrepseq.io.standardize_dataframe()), then the metric can default to either a pyrepseq.metric.tcr_metric.Cdr3Levenshtein, pyrepseq.metric.tcr_metric.AlphaCdr3Levenshtein, or pyrepseq.metric.tcr_metric.BetaCdr3Levenshtein, depending on what columns are available. In all other cases, the metric defaults to pyrepseq.metric.Levenshtein.
linkage_kws – keyword arguments for linkage algorithm
cluster_kws – keyword arguments for clustering algorithm

pyrepseq.distance.isdist1(x, reference, neighborhood=<function levenshtein_neighbors>)[source]: Is the string x distance 1 away from any of the strings in the reference set

pyrepseq.distance.levenshtein_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY')[source]: Iterator over Levenshtein neighbors of a string x
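Levenshtein neighbors add single insertions and deletions to the substitutions of the Hamming case, and a distance-1 test then reduces to checking whether any neighbor is in the reference set. The sketch below illustrates both ideas with hypothetical function names; de-duplicating the neighbors is a choice of this sketch, not a documented property of the library.

```python
def levenshtein_neighbors_sketch(x, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Yield each distinct string one edit away from x."""
    def candidates():
        for i in range(len(x)):
            yield x[:i] + x[i + 1:]               # deletion
            for aa in alphabet:
                if aa != x[i]:
                    yield x[:i] + aa + x[i + 1:]  # substitution
        for i in range(len(x) + 1):
            for aa in alphabet:
                yield x[:i] + aa + x[i:]          # insertion

    seen = set()
    for s in candidates():
        if s not in seen:  # different edits can produce the same string
            seen.add(s)
            yield s


def isdist1_sketch(x, reference, neighborhood=levenshtein_neighbors_sketch):
    """True if some neighbor of x is contained in the reference set."""
    reference = set(reference)
    return any(n in reference for n in neighborhood(x))


print(isdist1_sketch("CASF", ["CASSF", "CATTF"]))  # True: CASSF is one insertion away
print(isdist1_sketch("CASF", ["CATTF"]))           # False
```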

pyrepseq.distance.load_pcDelta_background(return_bins=True)[source]

Loads pre-computed background pcDelta distributions calculated for PBMC TCRs.

Data: Sample W_F1_2018 from Minervina et al. https://zenodo.org/record/4065547/

Returns:

back (pd.DataFrame) – DataFrame with coincidence probabilities
bins (ndarray [if return_bins = True]) – Delta bins to be used as bins for other data

pyrepseq.distance.next_nearest_neighbors(x, neighborhood, maxdistance=2)[source]

Set of next nearest neighbors of a string x.

Parameters:

alphabet (iterable of characters)
neighborhood (neighborhood iterator)
maxdistance (go up to maxdistance nearest neighbor)

Return type:

set of neighboring sequences

pyrepseq.distance.nndist_hamming(seq, reference, maxdist=4)[source]

Calculate the nearest-neighbor distance by Hamming distance

Parameters:

seqs (list of sequences)
seq (sequence instance)
reference (set of reference sequences)
maxdist (distance beyond which to cut off the calculation (needs to be <=4))

Returns:

distance of nearest neighbor

Note: This function does not check if strings are of the same length.

pyrepseq.distance.pcDelta

Calculates binned near-coincidence probabilities \(p_C(\Delta)\) among input sequences.

Parameters:

seqs (Iterable) – A collection of elements to measure distances between.
seqs2 (Optional[Iterable]) – A second collection of elements for cross-comparisons.
metric (pyrepseq.metric.Metric) – The metric used to compute distances between elements. If not set, a default is inferred from the input data type of seqs. If seqs is a standard pyrepseq TCR DataFrame (see pyrepseq.io.standardize_dataframe()), then the metric can default to either a pyrepseq.metric.tcr_metric.Cdr3Levenshtein, pyrepseq.metric.tcr_metric.AlphaCdr3Levenshtein, or pyrepseq.metric.tcr_metric.BetaCdr3Levenshtein, depending on what columns are available. In all other cases, the metric defaults to pyrepseq.metric.Levenshtein.
bins (Union[int, Iterable]) – bins for the distances Delta. (Default: range(0, 25)) bins=0: Calculate exact coincidence probability
normalize (bool) – whether to return pc (normalized) or raw counts
pseudocount (float) – for a Bayesian estimation of coincidence frequencies e.g. can use Jeffrey’s prior value of 0.5
maxseqs (Optional[int]) – maximal number of sequences to keep by random downsampling

Returns:

(normalized) histogram of sequence distances

Return type:

np.ndarray
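The binned near-coincidence probabilities \(p_C(\Delta)\) described above can be pictured as a normalized histogram of all pairwise distances. The sketch below makes several simplifying assumptions that are not taken from the library: it uses a textbook Levenshtein distance, treats bins as a list of integer distance values rather than bin edges, normalizes by the number of pairs, and ignores pseudocounts and downsampling. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def pc_delta_sketch(seqs, bins=range(0, 25)):
    """Fraction of sequence pairs found at each integer distance in bins."""
    dists = [levenshtein(a, b) for a, b in combinations(seqs, 2)]
    hist = Counter(dists)
    return [hist[d] / len(dists) for d in bins]


seqs = ["CASSF", "CASSF", "CASF", "CATTF"]
print(pc_delta_sketch(seqs, bins=range(0, 3)))  # 1/6, 2/6, 3/6 of the 6 pairs
```

At distance 0 the value reduces to the exact coincidence probability: one coinciding pair out of six.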

pyrepseq.distance.pcDelta_grouped(df, by, seq_columns, **kwargs)[source]

Near-coincidence probabilities conditioned to within-group comparisons.

Parameters:

df (pd.DataFrame)
by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby
seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis
**kwargs (keyword arguments) – passed on to pcDelta

Returns:

pcs – Returns a DataFrame of pC(delta) for each group

Return type:

pd.DataFrame

pyrepseq.distance.pcDelta_grouped_cross(df, by, seq_columns, condensed=False, **kwargs)[source]

Near-coincidence probabilities conditioned to cross-group comparisons.

Parameters:

df (pd.DataFrame)
by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby
seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis
condensed (bool) – Return a condensed instead of squareform matrix (default: False)
**kwargs (keyword arguments) – passed on to pcDelta

Returns:

pcs – Returns a DataFrame of pC(delta) across pairs of groups

Return type:

pd.DataFrame

pyrepseq.distance.pdist(strings, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]

Pairwise distances between collection of strings. (scipy.spatial.distance.pdist equivalent for strings)

Parameters:

strings (iterable of strings) – An m-length iterable.
metric (function, optional) – The distance metric to use. Default: Levenshtein distance.
dtype (np.dtype) – data type of the distance matrix, default: np.uint8 Note: make sure to change the dtype, if the metric does not return integers

Returns:

Y – Returns a condensed distance matrix Y. For each \(i\) and \(j\) (where \(i<j<m\) and m is the number of input strings), the metric dist(u=strings[i], v=strings[j]) is computed and stored in entry m * i + j - ((i + 2) * (i + 1)) // 2.

Return type:

ndarray
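The index formula for the condensed matrix is easy to check by enumeration: walking through all pairs with i < j in row-major order should produce consecutive positions starting at 0. The helper name below is hypothetical.

```python
from itertools import combinations


def condensed_index(i, j, m):
    """Position of the pair (i, j), with i < j < m, in the condensed vector."""
    return m * i + j - ((i + 2) * (i + 1)) // 2


m = 5
pairs = list(combinations(range(m), 2))
indices = [condensed_index(i, j, m) for i, j in pairs]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The condensed vector therefore has m * (m - 1) / 2 entries, one per unordered pair.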

Nearest Neighbor

class pyrepseq.nn.LookupDB(seqs)[source]

Lookup string variants in a dictionary.

The dictionary has sequences as keys, and list of sequence indices as values.

Parameters:

seqs (iterable of strings) – list of sequences

lookup(seqs2, max_edits=1, pdist_mode=False, custom_distance=None, max_custom_distance=inf, output_type='triplets', progress=False)[source]

Query the database

Parameters:

seqs2 (iterable of strings or None) – list of query sequences
max_edits (int) – maximum number of edits
pdist_mode (Boolean) – if True, assume seqs2=seqs and filter diagonal
custom_distance (Function(str1, str2) or "hamming") – custom distance metric
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
progress (bool) – show progress bar

Returns:

neighbors – Neighbors along with their edit distances, in the format given by output_type. If “triplets”, returns a list [(x_index, y_index, edit_distance)]. If “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(X_i, X_j), or 0 if not a neighbor. If “ndarray”, returns a 2D numpy array representing the dense matrix.

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

class pyrepseq.nn.SymdelDB(seqs, max_edits)[source]

Generate a deletion variant dictionary.

The dictionary has deletion variants as keys, and list of sequence indices as values.

Parameters:

seqs (iterable of strings) – list of sequences
max_edits (int) – maximum deletion distance

lookup(seqs2, custom_distance=None, max_custom_distance=inf, output_type='triplets', progress=False)[source]

Query the database

Parameters:

seqs2 (iterable of strings or None) – list of query sequences
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy the 4 properties of a distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
progress (bool) – show progress bar

Returns:

neighbors – Neighbors along with their edit distances, in the format given by output_type. If “triplets”, returns a list [(seqs_index, seqs2_index, edit_distance)]. If “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(seqs[i], seqs2[j]), or 0 if not a neighbor. If “ndarray”, returns a 2D numpy array representing the dense matrix.

Return type:

array of 3D-tuples, sparse matrix, or dense matrix
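The idea behind a deletion variant dictionary can be sketched as follows: every stored sequence is indexed under all strings obtained by deleting up to max_edits characters, and a query is matched by generating its own deletion variants and looking them up. This is a simplified, hypothetical illustration and not the class's actual code. In particular it only returns candidate pairs: two strings that share a deletion variant may be up to 2 * max_edits edits apart, so a real lookup still has to verify each candidate with an actual distance computation.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(seq, max_edits):
    """All strings obtained from seq by deleting 0..max_edits characters."""
    variants = {seq}
    for k in range(1, min(max_edits, len(seq)) + 1):
        for keep in combinations(range(len(seq)), len(seq) - k):
            variants.add("".join(seq[i] for i in keep))
    return variants


def build_index(seqs, max_edits):
    """Map each deletion variant to the indices of sequences producing it."""
    index = defaultdict(set)
    for i, s in enumerate(seqs):
        for v in deletion_variants(s, max_edits):
            index[v].add(i)
    return index


def candidate_pairs(index, seqs2, max_edits):
    """Pairs (seqs_index, seqs2_index) sharing at least one variant."""
    pairs = set()
    for j, q in enumerate(seqs2):
        for v in deletion_variants(q, max_edits):
            for i in index.get(v, ()):
                pairs.add((i, j))
    return pairs


seqs = ["CASSF", "CATTF"]
seqs2 = ["CASF", "CAWSF"]
index = build_index(seqs, max_edits=1)
print(sorted(candidate_pairs(index, seqs2, max_edits=1)))  # [(0, 0), (0, 1)]
```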

pyrepseq.nn.calculate_sceptrdist_sparse(edges, tcr_data_array)[source]: Efficiently calculate sparse pairwise distances between vector representations of TCRs.

pyrepseq.nn.hash_based(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', progress=False)[source]

List all neighboring CDR3B sequences efficiently for small edit distances. The idea is to enumerate all possible sequences within a given distance and look each one up in the dictionary. This implementation is faster than the kdtree implementation for max_edits == 1

Parameters:

seqs (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – not implemented
n_cpu (int) – not implemented
custom_distance (Function(str1, str2) or "hamming") – custom distance metric
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
progress (bool) – show progress bar

Returns:

neighbors – Neighbors along with their edit distances, in the format given by output_type. If “triplets”, returns a list [(x_index, y_index, edit_distance)]. If “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(X_i, X_j), or 0 if not a neighbor. If “ndarray”, returns a 2D numpy array representing the dense matrix.

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

pyrepseq.nn.kdtree(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', compression=1)[source]

List all neighboring CDR3B sequences efficiently within the given edit distance. With KDTree, the algorithm runs in O(N log N), eliminating unnecessary comparisons. With the RapidFuzz library, the edit distance comparison is efficiently implemented in C++. With multiprocessing, the algorithm can take advantage of multiple CPU cores. This implementation is faster than the hash-based implementation for max_edits > 1

Parameters:

seqs (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – number of CPU cores running in parallel
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy the 4 properties of a distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

Returns:

neighbors – Neighbors along with their edit distances, in the format given by output_type. If “triplets”, returns a list [(x_index, y_index, edit_distance)]. If “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(X_i, X_j), or 0 if not a neighbor. If “ndarray”, returns a 2D numpy array representing the dense matrix.

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

pyrepseq.nn.nearest_neighbor(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]

List all neighboring sequences efficiently within a given distance. The distance can be given in terms of hamming, levenshtein, or custom.

If seqs2 is not provided, every sequence is compared against every other sequence.

Parameters:

seqs (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – ignored
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy the 4 properties of a distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
seqs2 (iterable of strings or None) – another list of CDR3B sequences to compare against

Returns:

neighbors – Neighbors along with their edit distances, in the format given by output_type. If “triplets”, returns a list [(x_index, y_index, edit_distance)]. If “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(X_i, X_j), or 0 if not a neighbor. If “ndarray”, returns a 2D numpy array representing the dense matrix.

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

pyrepseq.nn.nearest_neighbor_sceptrdist(df, chain='beta', max_edits=2, max_sceptrdist=1.0, **kwargs)[source]

List all neighboring TCR sequences efficiently within a given edit and SCEPTR radius.

[Requires optional dependency sceptr]

Parameters:

chain ('alpha', 'beta') – chain to use for edit distance prefiltering
max_edits (only return neighbors up to <= this edit distance)
max_sceptrdist (only return neighbor up to <= this TCR distance)
**kwargs (passed on to nearest_neighbor function)

Return type:

sparse matrix in (i, j, dist) format

pyrepseq.nn.nearest_neighbor_tcrdist(df: DataFrame, chain='beta', max_edits=2, edit_on_trimmed=True, max_tcrdist=20, tcrdist_kwargs={}, df2: DataFrame | None = None, **kwargs)[source]

List all neighboring TCR sequences efficiently within a given edit and TCRdist radius.

[Requires optional dependency pwseqdist]

Parameters:

df (pandas DataFrame) – A pandas DataFrame in the pyrepseq format. If df2 is not set, then the function computes the nearest neighbor TCRs within this set. If df2 is set, then the function computes the nearest neighbors across this and df2.
chain ('alpha', 'beta', or 'both') – if both finds candidate neighbors using the beta chain, but filter on paired sequence at the end
max_edits (only return neighbors up to <= this edit distance)
edit_on_trimmed (boolean) – apply TCRdist trimming on sequences before calculating edit distance
max_tcrdist (only return neighbor up to <= this TCR distance)
tcrdist_kwargs (dict) – customized parameters for TCRdist calculation
df2 (pandas DataFrame) – A pandas DataFrame in the pyrepseq format.
**kwargs (passed on to nearest_neighbor function)

Return type:

sparse matrix in (i, j, dist) format

pyrepseq.nn.symdel(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None, progress=False)[source]

List all neighboring sequences efficiently within the given distance. This is an improved version of the hash-based approach.

If seqs2 is not provided, every sequence is compared against every other sequence, resulting in \(N(seqs)^2\) combinations. Otherwise, seqs is compared against seqs2, resulting in \(N(seqs) \cdot N(seqs2)\) combinations.

Parameters:

seqs (iterable of strings) – list of sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – ignored
n_cpu (int) – ignored
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use; must satisfy the 4 properties of a distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, “ndarray”
seqs2 (iterable of strings or None) – another list of sequences to compare against
progress (bool) – show progress bar

Returns:

neighbors – Neighbors along with their edit distances according to the given output_type. If “triplets”, returns [(seqs_index, seqs2_index, edit_distance)]. If “coo_matrix”, returns scipy’s sparse matrix where C[i,j] = distance(seqs[i], seqs2[j]), or 0 if not a neighbor. If “ndarray”, returns numpy’s 2d array representing the dense matrix.

Return type:

array of 3-tuples, sparse matrix, or dense matrix
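
To make the “triplets” output format concrete, here is a minimal brute-force sketch of a neighbor search. It is not the symdel algorithm itself: brute_force_neighbors and levenshtein are hypothetical helper names, and the sketch assumes that a sequence is not reported as its own neighbor and that both (i, j) and (j, i) appear when seqs2 is not given.

```python
def levenshtein(a, b):
    """Plain edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def brute_force_neighbors(seqs, max_edits=1, seqs2=None):
    """Reference O(N*M) search returning (i, j, dist) triplets."""
    within = seqs2 is None
    targets = seqs if within else seqs2
    triplets = []
    for i, s in enumerate(seqs):
        for j, t in enumerate(targets):
            if within and i == j:
                continue  # assumption: a sequence is not its own neighbor
            d = levenshtein(s, t)
            if d <= max_edits:
                triplets.append((i, j, d))
    return triplets


seqs = ["CASSLG", "CASSLA", "CASSPGQ"]
print(brute_force_neighbors(seqs, max_edits=1))
# [(0, 1, 1), (1, 0, 1)]
print(brute_force_neighbors(seqs, max_edits=1, seqs2=["CASSLA"]))
# [(0, 0, 1), (1, 0, 0)]
```

A brute-force reference like this is mainly useful for checking the output of a faster search on a small input.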

Plotting submodule

pyrepseq.plotting.align_seqs(seqs, debug=False)[source]

Align multiple sequences using mafft-linsi with default parameters.

Requires external dependency mafft-linsi to be installed.

Parameters:

seqs (iterable of strings)
debug (Boolean) – if True, prints mafft-linsi output

Returns:

aligned sequences (with gaps)

Return type:

list of strings

pyrepseq.plotting.density_scatter(x, y, ax=None, discrete=False, sort=True, bins=20, trans=None, cbar=False, **kwargs)[source]

Scatter plot with color indicating point density estimated by local binning.

Parameters:

ax (matplotlib.Axes) – axes on which to plot
discrete (Boolean) – whether the data is discrete; if so, a count-based density is used
bins (int) – number of bins for density estimation
trans (function) – transformation to apply before density estimation
sort (Boolean) – sort the data points by density so that the densest points are plotted last
**kwargs (passed on to ax.scatter)
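
The idea of estimating density by binning, then plotting the densest points last, can be sketched in plain Python. This is an illustration under assumptions, not the function's actual implementation: binned_density is a hypothetical helper that simply counts how many points share each 2D bin, with no smoothing or interpolation.

```python
def binned_density(x, y, bins=20):
    """Return, for each point, the number of points falling in its 2D bin."""
    def bin_index(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
        # clamp so that the maximum value lands in the last bin
        return [min(int((v - lo) / width), bins - 1) for v in values]

    keys = list(zip(bin_index(x), bin_index(y)))
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return [counts[key] for key in keys]


x = [0.0, 0.1, 0.2, 5.0, 10.0]
y = [0.0, 0.1, 0.2, 5.0, 10.0]
density = binned_density(x, y, bins=2)
print(density)  # [3, 3, 3, 2, 2]

# Sort point indices by density so the densest points would be drawn last.
order = sorted(range(len(density)), key=density.__getitem__)
print(order)  # [3, 4, 0, 1, 2]
```

Drawing in this order keeps points in dense regions from being hidden under points from sparse regions.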

pyrepseq.plotting.label_axes(fig_or_axes, labels='ABCDEFGHIJKLMNOPQRSTUVWXYZ', labelstyle='%s', xy=(-0.1, 0.95), xycoords='axes fraction', **kwargs)[source]

Walks through the axes and labels each one. kwargs are collected and passed to annotate.

Parameters:

fig_or_axes (Figure or Axes to work on)
labels (iterable or None) – iterable of strings to use to label the axes. If None, lower case letters are used.
xy (where to put the label (len=2 tuple of floats))
xycoords (coordinate system for xy: relative to axes, figure, etc.)
kwargs (to be passed to annotate)

pyrepseq.plotting.labels_to_colors_hls(labels, palette_kws={'l': 0.5, 's': 0.8}, min_count=None)[source]

Map a list of labels to a list of unique colors. Uses seaborn.hls_palette.

Parameters:

labels (list of labels)
min_count (map all labels seen less than min_count to black)
palette_kws (passed to seaborn.hls_palette)

pyrepseq.plotting.labels_to_colors_tableau(labels, min_count=None)[source]

Map a list of labels to a list of unique colors. Uses Tableau_10 colors

Parameters:

labels (list of labels)
min_count (map all labels seen less than min_count to black)

pyrepseq.plotting.rankfrequency(data, ax=None, normalize_x=True, normalize_y=False, transform_x=None, transform_y=None, log_x=True, log_y=True, scalex=1.0, scaley=1.0, **kwargs)[source]

Plot rank frequency plots.

Parameters:

data (array-like) – count data
ax (matplotlib.Axes) – axes on which to plot the data
normalize_x (bool, default:True) – whether to normalize counts to relative frequencies
normalize_y (bool, default:False) – whether to normalize ranks to cumulative probabilities
transform_x (function, default:None) – transform to apply to x-values before plotting
transform_y (function, default:None) – transform to apply to y-values before plotting

Returns:

Objects representing the plotted data.

Return type:

list of Line2D
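
The data behind such a plot can be sketched in plain Python. The helper rank_frequency_data below is hypothetical and only mirrors the documented meaning of normalize_x and normalize_y; how rankfrequency itself handles ties, transforms, or drawing is not covered by this sketch.

```python
def rank_frequency_data(counts, normalize_x=True, normalize_y=False):
    """Return (x, y): sorted counts or frequencies, and their ranks."""
    sorted_counts = sorted(counts, reverse=True)  # largest count gets rank 1
    total = sum(sorted_counts)
    n = len(sorted_counts)
    # x: counts, optionally normalized to relative frequencies
    x = [c / total for c in sorted_counts] if normalize_x else sorted_counts
    # y: ranks, optionally normalized to cumulative probabilities
    ranks = range(1, n + 1)
    y = [r / n for r in ranks] if normalize_y else list(ranks)
    return x, y


x, y = rank_frequency_data([1, 5, 2, 2])
print(x)  # [0.5, 0.2, 0.2, 0.1]
print(y)  # [1, 2, 3, 4]

_, y_norm = rank_frequency_data([1, 5, 2, 2], normalize_y=True)
print(y_norm)  # [0.25, 0.5, 0.75, 1.0]
```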

pyrepseq.plotting.seqlogos(seqs, ax=None, **kwargs)[source]

Display a sequence logo.

Aligns sequences using align_seqs if they are not of equal length.

Parameters:

seqs (iterable of strings) – sequences to be displayed
ax (matplotlib.axes.Axes, optional) – The axes to plot on. If None, a new figure and axes will be created.
**kwargs (dict) – passed on to logomaker.Logo

Return type:

axes, counts_matrix

pyrepseq.plotting.seqlogos_vj(df, cdr3_column, v_column, j_column, axes=None, **kwargs)[source]

Display a sequence logo with V and J gene information.

Parameters:

df (pd.DataFrame) – input data
cdr3_column (str) – column name for cdr3 sequences
v_column (str) – column name for v genes
j_column (str) – column name for j genes
**kwargs (dict) – passed on to seqlogos

pyrepseq.plotting.similarity_clustermap(df, alpha_column='cdr3a', beta_column='cdr3b', norm=None, bounds=array([0, 1, 2, 3, 4, 5, 6]), linkage_kws={'method': 'average', 'optimal_ordering': True}, show_clusters=True, cluster_kws={'criterion': 'distance', 't': 6}, cbar_kws={'format': '%d', 'label': 'Sequence Distance', 'orientation': 'horizontal'}, meta_columns=None, meta_to_colors=None, **kws)[source]

Plots a sequence-similarity clustermap.

Parameters:

df (pandas DataFrame with data)
alpha_column (column name with alpha chain amino acid information (set one to None for single chain plotting))
beta_column (column name with beta chain amino acid information (set one to None for single chain plotting))
norm (matplotlib.colors.Normalize subclass for turning distances into colors)
bounds (bounds used for colormap matplotlib.colors.BoundaryNorm (only used if norm = None))
linkage_kws (keyword arguments for linkage algorithm)
show_clusters (display clusters as annotation column (default: True))
cluster_kws (keyword arguments for clustering algorithm)
cbar_kws (keyword arguments for colorbar)
meta_columns (list-like) – metadata to plot alongside the cluster assignment
meta_to_colors (list-like) – list of functions mapping metadata labels to colors; the first element of the list is for clusters
kws (keyword arguments passed on to the clustermap.)

Metrics

General Metrics

class pyrepseq.metric.Metric[source]

Base abstract class for all metrics in pyrepseq. This class outlines the interface that all metrics will implement. If a variable or function parameter can be any type of metric, then it should be typed to this class.

abstract calc_cdist_matrix(anchors: Iterable, comparisons: Iterable) → ndarray[source]

Calculates a cdist matrix between two collections of objects.

Parameters:

anchors (Iterable) – A collection of objects to measure distances from.
comparisons (Iterable) – A collection of objects to measure distances to.

Returns:

A matrix of shape (N,M) where N is the number of elements in anchors and M is the number of elements in comparisons. The element in the ith row and jth column will contain the distance between the ith element of anchors and the jth element of comparisons.

Return type:

numpy.ndarray

abstract calc_pdist_vector(instances: Iterable) → ndarray[source]

Calculates a pdist vector given a collection of objects.

Parameters:

instances (Iterable) – A collection of objects to measure distances between.

Returns:

A vector of shape (N*(N-1)/2,) where N is the number of elements in instances. The vector contains all distances that are possible between each possible pair of objects in instances.

Return type:

numpy.ndarray
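
The two shapes can be illustrated with a small pure-Python sketch. Everything here is a stand-in: toy_distance, cdist_matrix, and pdist_vector are hypothetical names, nested lists replace numpy arrays, and the pair ordering in the pdist vector is assumed to follow scipy's condensed form, (0,1), (0,2), ..., (1,2), ..., which the text above does not state explicitly.

```python
from itertools import combinations


def toy_distance(a, b):
    """Hypothetical stand-in metric: absolute difference in length."""
    return abs(len(a) - len(b))


def cdist_matrix(anchors, comparisons):
    """N x M nested list; entry [i][j] is the distance anchors[i] -> comparisons[j]."""
    return [[toy_distance(a, c) for c in comparisons] for a in anchors]


def pdist_vector(instances):
    """Flat list of the N*(N-1)/2 distances between distinct pairs."""
    return [toy_distance(a, b) for a, b in combinations(instances, 2)]


instances = ["A", "AAA", "AAAAAA"]
print(pdist_vector(instances))                # [2, 5, 3]
print(cdist_matrix(["A", "AA"], instances))   # [[0, 2, 5], [1, 1, 4]]
```

With N = 3 instances the pdist vector has 3*(3-1)/2 = 3 entries, while the cdist matrix for 2 anchors and 3 comparisons has shape (2, 3).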

abstract property name: str: The name of the metric as a string.

class pyrepseq.metric.Levenshtein[source]: Levenshtein distance, also known as edit distance.

class pyrepseq.metric.WeightedLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1)[source]

A generalised Levenshtein distance which supports different weights for insertions, deletions, and substitutions.

Parameters:

insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
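
As a rough illustration of how the three weights interact, the following sketch implements a weighted edit distance with the standard dynamic program. The function weighted_levenshtein is hypothetical and is not the class's implementation; in particular, it assumes that edits are counted as transforming the first string into the second, which determines what counts as an insertion versus a deletion.

```python
def weighted_levenshtein(a, b, insertion_weight=1, deletion_weight=1,
                         substitution_weight=1):
    """Minimum total cost of edits that turn string a into string b."""
    # Cost of building each prefix of b from the empty string: insertions only.
    prev = [j * insertion_weight for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Cost of reducing a[:i] to the empty string: deletions only.
        curr = [i * deletion_weight]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + deletion_weight,        # delete ca
                curr[j - 1] + insertion_weight,   # insert cb
                prev[j - 1] + (substitution_weight if ca != cb else 0),
            ))
        prev = curr
    return prev[-1]


print(weighted_levenshtein("CASSL", "CASSF"))                         # 1
print(weighted_levenshtein("CASSL", "CASSF", substitution_weight=3))  # 2
print(weighted_levenshtein("CAS", "CASS", insertion_weight=2))        # 2
print(weighted_levenshtein("CASS", "CAS", insertion_weight=2))        # 1
```

The second call shows that a substitution weight above the combined cost of one deletion and one insertion is never used, since deleting and re-inserting is cheaper.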

TCR Metrics

class pyrepseq.metric.tcr_metric.TcrMetric[source]

Base abstract class for all metrics that operate on TCRs. TcrMetrics should expect DataFrames with each row representing a TCR, in the standard pyrepseq format (see pyrepseq.io.standardize_dataframe()). The input DataFrames must also have at least one TCR-related column. Furthermore, if the input DataFrame(s) do not have the columns required by the specific metric, the metric will throw a ValueError explaining which columns are missing. All values in the table should be IMGT-standardized.

abstract calc_cdist_matrix(anchors: DataFrame, comparisons: DataFrame) → ndarray[source]

Calculates a cdist matrix between two DataFrames containing TCR data.

Parameters:

anchors (DataFrame) – A DataFrame containing data on TCRs to measure distances from.
comparisons (DataFrame) – A DataFrame containing data on TCRs to measure distances to.

Returns:

A matrix of shape (N,M) where N is the number of TCRs in anchors and M is the number of TCRs in comparisons. The element in the ith row and jth column will contain the distance between the ith TCR of anchors and the jth TCR of comparisons.

Return type:

numpy.ndarray

abstract calc_pdist_vector(instances: DataFrame) → ndarray[source]

Calculates a pdist vector given a DataFrame of TCRs.

Parameters:

instances (DataFrame) – A DataFrame of TCRs to measure distances between.

Returns:

A vector of shape (N*(N-1)/2,) where N is the number of TCRs in instances. The vector contains all distances that are possible between each possible pair of TCRs in instances.

Return type:

numpy.ndarray

class pyrepseq.metric.tcr_metric.AlphaCdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the alpha chain CDR3 sequences.

Parameters:

insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.

class pyrepseq.metric.tcr_metric.BetaCdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the beta chain CDR3 sequences.

Parameters:

insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.

class pyrepseq.metric.tcr_metric.Cdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, alpha_weight: int = 1, beta_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the alpha and beta chain CDR3 sequences.

Parameters:

insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
alpha_weight (int) – An integer multiplier for edits on the alpha chain. Defaults to 1.
beta_weight (int) – An integer multiplier for edits on the beta chain. Defaults to 1.
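
One plausible reading of the chain weights is that the paired distance is a weighted sum of the per-chain edit distances. The sketch below works under that assumption and is not taken from the class's implementation: paired_cdr3_distance and levenshtein are hypothetical helpers, TCRs are represented as plain dicts keyed by the CDR3A and CDR3B column names, and the example sequences are made up for illustration.

```python
def levenshtein(a, b):
    """Plain (unweighted) edit distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def paired_cdr3_distance(tcr1, tcr2, alpha_weight=1, beta_weight=1):
    """Assumed combination: weighted sum of alpha and beta CDR3 distances."""
    d_alpha = levenshtein(tcr1["CDR3A"], tcr2["CDR3A"])
    d_beta = levenshtein(tcr1["CDR3B"], tcr2["CDR3B"])
    return alpha_weight * d_alpha + beta_weight * d_beta


tcr1 = {"CDR3A": "CAVRDSNYQLIW", "CDR3B": "CASSIRSSYEQYF"}
tcr2 = {"CDR3A": "CAVKDSNYQLIW", "CDR3B": "CASSLRSAYEQYF"}

print(paired_cdr3_distance(tcr1, tcr2))                 # 3
print(paired_cdr3_distance(tcr1, tcr2, beta_weight=2))  # 5
```

Here the alpha chains differ by one substitution and the beta chains by two, so doubling beta_weight raises the total from 1 + 2 to 1 + 4.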

class pyrepseq.metric.tcr_metric.AlphaCdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the alpha chain CDR1, CDR2, and CDR3 sequences.

Parameters:

insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.
cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.
cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.

class pyrepseq.metric.tcr_metric.BetaCdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the beta chain CDR1, CDR2, and CDR3 sequences.

Parameters:

insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.
cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.
cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.

class pyrepseq.metric.tcr_metric.CdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, alpha_weight: int = 1, beta_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the alpha and beta chain CDR1, CDR2, and CDR3 sequences.

Parameters:

insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.
cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.
cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.
alpha_weight (int) – An integer multiplier for edits on the alpha chain. Defaults to 1.
beta_weight (int) – An integer multiplier for edits on the beta chain. Defaults to 1.

class pyrepseq.metric.tcr_metric.AlphaCdr3Tcrdist[source]

TcrDist applied to the alpha chain CDR3 sequences.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.BetaCdr3Tcrdist[source]

TcrDist applied to the beta chain CDR3 sequences.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.Cdr3Tcrdist[source]

TcrDist applied to the alpha and beta chain CDR3 sequences.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.AlphaTcrdist[source]

TcrDist applied to the alpha chain.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.BetaTcrdist[source]

TcrDist applied to the beta chain.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.Tcrdist[source]

TcrDist applied to the alpha and beta chain.

[Requires optional tcrdist dependency.]