API Reference
Main module
IO
- pyrepseq.io.isvalidaa(string)[source]
Returns True if string is composed only of characters from the standard amino acid alphabet.
- pyrepseq.io.isvalidcdr3(string)[source]
returns True if string is a valid CDR3 sequence
- Checks the following:
first amino acid is a cysteine (C)
last amino acid is either phenylalanine (F), tryptophan (W), or cysteine (C)
each amino acid is part of the standard amino acid alphabet
See http://www.imgt.org/IMGTScientificChart/Numbering/IMGTIGVLsuperfamily.html and also https://doi.org/10.1093/nar/gkac190
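The three checks listed above can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the library source: the helper name looks_like_cdr3 is hypothetical, and treating the empty string as invalid is an assumption.

```python
# Illustrative sketch of the documented checks; not the pyrepseq implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard amino acid alphabet


def looks_like_cdr3(seq: str) -> bool:
    """Return True if seq passes the three checks described above."""
    if not seq:  # assumption: an empty string is not a valid CDR3
        return False
    starts_with_c = seq[0] == "C"
    ends_with_fwc = seq[-1] in "FWC"
    all_standard = all(aa in AMINO_ACIDS for aa in seq)
    return starts_with_c and ends_with_fwc and all_standard


print(looks_like_cdr3("CASSDWGSQNTLYF"))  # True
print(looks_like_cdr3("ASSDWGSQNTLYF"))   # False: first residue is not C
```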
- pyrepseq.io.multimerge(dfs, on, suffixes=None, **kwargs)[source]
Merge multiple dataframes on a common column.
Provides support for custom suffixes.
- Parameters:
on ('index' or column name)
suffixes ([list-like | None]) – list of suffixes to append to the data
**kwargs (keyword arguments passed along to pd.merge)
- Return type:
merged dataframe
- pyrepseq.io.standardize_dataframe(df: DataFrame | None = None, col_mapper: Mapping | None = None, standardize: bool = True, species: str = 'HomoSapiens', tcr_enforce_functional: bool = True, tcr_precision: str = 'gene', mhc_precision: str = 'gene', strict_cdr3_standardization: bool = False, suppress_warnings: bool = False, df_old: DataFrame | None = None)[source]
This is a utility function to organise a table of TCR-pMHC data into the standard pyrepseq format and perform data cleaning/standardization to ensure that the TCR/MHC gene symbols are IMGT-compliant, the epitopes are all valid amino acid strings, and the CDR3s look valid. For further notes on data standardization, see below. The standard format is a table with some or all of the following columns (not necessarily in order):
Column Name   Column should contain                        Data type
TRAV          TRAV gene symbol                             str
CDR3A         TCR alpha chain CDR3 amino acid sequence     str
TRAJ          TRAJ gene symbol                             str
TRBV          TRBV gene symbol                             str
CDR3B         TCR beta chain CDR3 amino acid sequence      str
TRBJ          TRBJ gene symbol                             str
Epitope       Epitope amino acid sequence                  str
MHCA          MHC alpha chain gene symbol                  str
MHCB          MHC beta chain gene symbol                   str
If the input DataFrame contains the necessary data in columns that are named differently, this can be resolved by providing the mapping to the col_mapper argument (see parameters and examples).
If standardization is enabled (True by default), the function will additionally attempt to standardize the TCR and MHC gene symbols to be IMGT-compliant, and CDR3/Epitope amino acid sequences to be valid. However, for the standardization to happen, the columns with the relevant data must either be correctly named, or the necessary re-naming scheme must be specified by supplying an argument to the col_mapper parameter. During standardization, most non-standardizable/nonsensical values will be removed and replaced with None. However, since epitopes are not necessarily always amino acid sequences, values in the Epitope column that fail standardization will be kept as their original value.
Deprecated since version 1.4: df_old will be removed in pyrepseq 2.0 in favour of the more simply named df parameter.
- Parameters:
df (pandas.DataFrame) – Source DataFrame from which to pull data.
df_old (pandas.DataFrame) – Alias for df. Now deprecated and will be removed in version 2.0.
col_mapper (Mapping) – A mapping object, such as a dictionary, which maps the old column names to the new column names. This should not be set if no column re-naming is necessary. Defaults to None.
standardize (bool) – When set to False, gene name standardisation is not attempted. Defaults to True.
species (str) – Name of the species from which the TCR data is derived, in their binomial nomenclature, camel-cased. Defaults to 'HomoSapiens'.
tcr_enforce_functional (bool) – When set to True, TCR genes that are not functional (i.e. ORF or pseudogene) are removed, and replaced with None. Defaults to True.
tcr_precision (str) – Level of precision to trim the TCR gene data to ('gene' or 'allele'). Defaults to 'gene'.
mhc_precision (str) – Level of precision to trim the MHC gene data to ('gene', 'protein' or 'allele'). Defaults to 'gene'.
strict_cdr3_standardization (bool) – If True, any string that does not look like a CDR3 sequence is rejected. If False, any inputs that are valid amino acid sequences but do not start with C and end with F/W are not rejected and instead are corrected by having a C appended to the beginning and an F appended at the end. Defaults to False.
suppress_warnings (bool) – If True, suppresses warnings that are emitted when the standardisation of certain values fails. Defaults to False.
- Returns:
Standardized DataFrame containing the original data, cleaned.
- Return type:
pandas.DataFrame
Examples
If you already have a DataFrame in the standard format, standardize_dataframe can perform data standardization for you. In the examples shown here, we omit any standardization warnings for ease of reading.
Say you have the following DataFrame:
>>> from pyrepseq import io
>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[
...         ["av26.1*1", "CIVRAPGRADMRF", "aj43*1", "bv13*1", "CASSYLPGQGDHYSNQPQHF","bj1.5*1", "FLKEKGGL", "b8", "b2m"],
...         ["TCRAV20*01","CAVPSGAGSYQLTF","TCRAJ28*01","TCRBV28S1*01","CASSLGQSGANVLTF", "TCRBJ2S6*01","LQPFPQPELPYPQPQ","HLA-DQA1*05","HLA-DQB1*02"],
...         ["unknown", "unknown", "unknown", "TRBV7-2*01", "CASSDWGSQNTLYF", "TRBJ2-4*01", "YMPYFFTLL", "HLA-A*02", "B2M"]
...     ],
...     columns=["TRAV","CDR3A","TRAJ","TRBV","CDR3B","TRBJ","Epitope","MHCA","MHCB"]
... )
>>> df
         TRAV           CDR3A        TRAJ          TRBV                 CDR3B         TRBJ          Epitope         MHCA         MHCB
0    av26.1*1   CIVRAPGRADMRF      aj43*1        bv13*1  CASSYLPGQGDHYSNQPQHF      bj1.5*1         FLKEKGGL           b8          b2m
1  TCRAV20*01  CAVPSGAGSYQLTF  TCRAJ28*01  TCRBV28S1*01       CASSLGQSGANVLTF  TCRBJ2S6*01  LQPFPQPELPYPQPQ  HLA-DQA1*05  HLA-DQB1*02
2     unknown         unknown     unknown    TRBV7-2*01        CASSDWGSQNTLYF   TRBJ2-4*01        YMPYFFTLL     HLA-A*02          B2M
By passing this to standardize_dataframe, you will get a cleaned version of the data.
>>> io.standardize_dataframe(df, suppress_warnings=True)
       TRAV           CDR3A    TRAJ     TRBV                 CDR3B     TRBJ          Epitope      MHCA      MHCB
0  TRAV26-1   CIVRAPGRADMRF  TRAJ43   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5         FLKEKGGL     HLA-B       B2M
1    TRAV20  CAVPSGAGSYQLTF  TRAJ28   TRBV28       CASSLGQSGANVLTF  TRBJ2-6  LQPFPQPELPYPQPQ  HLA-DQA1  HLA-DQB1
2      None            None    None  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4        YMPYFFTLL     HLA-A       B2M
If you want to have extra columns on the DataFrame, that is allowed.
>>> extended_df = df.copy()
>>> extended_df["clone_count"] = [1,2,3]
>>> io.standardize_dataframe(extended_df, suppress_warnings=True)
       TRAV           CDR3A    TRAJ     TRBV                 CDR3B     TRBJ          Epitope      MHCA      MHCB  clone_count
0  TRAV26-1   CIVRAPGRADMRF  TRAJ43   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5         FLKEKGGL     HLA-B       B2M            1
1    TRAV20  CAVPSGAGSYQLTF  TRAJ28   TRBV28       CASSLGQSGANVLTF  TRBJ2-6  LQPFPQPELPYPQPQ  HLA-DQA1  HLA-DQB1            2
2      None            None    None  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4        YMPYFFTLL     HLA-A       B2M            3
Having only a subset of the standard columns is also allowed.
>>> beta_only_df = df.copy()
>>> beta_only_df = beta_only_df[["TRBV","CDR3B","TRBJ"]]
>>> io.standardize_dataframe(beta_only_df, suppress_warnings=True)
      TRBV                 CDR3B     TRBJ
0   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1   TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4
Columns can be renamed by supplying a mapping to the col_mapper parameter.
>>> beta_only_misnamed = beta_only_df.copy()
>>> beta_only_misnamed.columns = ["foo", "bar", "baz"]
>>> beta_only_misnamed
            foo                   bar          baz
0        bv13*1  CASSYLPGQGDHYSNQPQHF      bj1.5*1
1  TCRBV28S1*01       CASSLGQSGANVLTF  TCRBJ2S6*01
2    TRBV7-2*01        CASSDWGSQNTLYF   TRBJ2-4*01
>>> col_mapper = {
...     "foo": "TRBV",
...     "bar": "CDR3B",
...     "baz": "TRBJ"
... }
>>> io.standardize_dataframe(beta_only_misnamed, col_mapper=col_mapper)
      TRBV                 CDR3B     TRBJ
0   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1   TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4
Stats
- pyrepseq.stats.jaccard_index(A, B)[source]
Calculate the Jaccard index for two sets.
This measure is defined as
\(J(A, B) = |A \cap B| / |A \cup B|\)
A, B: iterables (will be converted to sets). If A and B are pd.Series, na values will be dropped first
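As a quick check of the formula, here is a minimal standard-library sketch. The function name jaccard is hypothetical, and dropping None values stands in for the na handling described above; pyrepseq's own handling may differ.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two iterables."""
    # Assumption: None plays the role of an "na" value and is dropped.
    set_a = {x for x in a if x is not None}
    set_b = {x for x in b if x is not None}
    union = set_a | set_b
    if not union:
        raise ValueError("Jaccard index is undefined for two empty sets")
    return len(set_a & set_b) / len(union)


repertoire_1 = ["CASSLGF", "CASSPGF", "CASSQGF"]
repertoire_2 = ["CASSPGF", "CASSQGF", "CASSRGF", "CASSTGF"]
print(jaccard(repertoire_1, repertoire_2))  # 2 shared / 5 total = 0.4
```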
- pyrepseq.stats.overlap(A, B)[source]
Calculate the number of overlapping elements of two sets.
This measure is defined as \(|A \cap B|\)
A, B: iterables (will be converted to sets). na values will be dropped first
- pyrepseq.stats.overlap_coefficient(A, B)[source]
Calculate the overlap coefficient for two sets.
This measure is defined as \(O(A, B) = |A \cap B| / min(|A|, |B|)\)
A, B: iterables (will be converted to sets). na values will be dropped first
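The overlap and the overlap coefficient differ only in the normalization by the smaller set, which the following sketch makes explicit. The function names are hypothetical, and None is used as a stand-in for na values.

```python
def overlap_count(a, b):
    """Number of shared elements |A ∩ B|."""
    set_a = {x for x in a if x is not None}
    set_b = {x for x in b if x is not None}
    return len(set_a & set_b)


def overlap_coeff(a, b):
    """Overlap coefficient |A ∩ B| / min(|A|, |B|)."""
    set_a = {x for x in a if x is not None}
    set_b = {x for x in b if x is not None}
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        raise ValueError("overlap coefficient is undefined for an empty set")
    return len(set_a & set_b) / smaller


small = ["CASSLGF", "CASSPGF", "CASSQGF"]
large = ["CASSPGF", "CASSQGF", "CASSRGF", "CASSTGF"]
print(overlap_count(small, large))            # 2
print(round(overlap_coeff(small, large), 3))  # 2 / min(3, 4) = 0.667
```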
- pyrepseq.stats.pc(array: Iterable, array2: Iterable | None = None)[source]
Estimate the coincidence probability \(p_C\) from a sample. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)
Note: This measure is also known as the Simpson or Hunter-Gaston index
- Parameters:
array (Iterable) – Iterable of sampled elements
array2 (Optional[Iterable]) – Second Iterable of sampled elements: if provided probability of cross-coincidences is calculated as \(p_C = (\sum_i n_{1i} n_{2i}) / (N_1 N_2)\)
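Both estimators above reduce to sums over element counts. A minimal sketch using collections.Counter follows; the function name coincidence_probability is hypothetical and this is not the library implementation.

```python
from collections import Counter


def coincidence_probability(sample, sample2=None):
    """Estimate p_C within one sample, or across two samples."""
    counts = Counter(sample)
    n_total = sum(counts.values())
    if sample2 is None:
        # p_C = sum_i n_i (n_i - 1) / (N (N - 1))
        return sum(n * (n - 1) for n in counts.values()) / (n_total * (n_total - 1))
    counts2 = Counter(sample2)
    n_total2 = sum(counts2.values())
    # p_C = sum_i n_1i n_2i / (N_1 N_2); missing keys count as zero
    return sum(n * counts2[key] for key, n in counts.items()) / (n_total * n_total2)


within = coincidence_probability(["a", "a", "b", "c"])                  # 2 / 12
across = coincidence_probability(["a", "a", "b", "c"], ["a", "b", "b"])  # 4 / 12
```

In the first call only the element "a" contributes (2 · 1 ordered pairs out of 4 · 3). In the second call "a" contributes 2 · 1 and "b" contributes 1 · 2.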
- pyrepseq.stats.pc_conditional(df, by, on, take_mean=True, weight_uniformly=False)[source]
Conditional coincidence probability estimator
- Parameters:
df (pandas DataFrame)
by (list) – conditioning parameters used to group input data frame
on (string/list of strings) – column or columns to compute probability of coincidence or joint probability of coincidence on. If type(on) == list then pc is computed on the concatenations of each specified column
take_mean (bool) – specify whether to take the average once pc has been computed for each specified group
- Returns:
pc of df[on] computed over each group specified in by. If take_mean=True, the average of these per-group pcs is returned
- Return type:
pandas DataFrame/float
- pyrepseq.stats.pc_n(n)[source]
Estimate the coincidence probability \(p_C\) from sampled counts. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)
Note: This measure is also known as the Simpson or Hunter-Gaston index
- Parameters:
n (array-like) – list of counts
- pyrepseq.stats.powerlaw_mle_alpha(c, cmin=1.0, method='exact', **kwargs)[source]
Maximum likelihood estimate of the power-law exponent.
- Parameters:
c (counts)
cmin (only counts >= cmin are included in fit)
continuitycorrection (use continuitycorrection (more accurate for integer counts))
method (one of ['simple', 'continuitycorrection', 'exact']) –
- ‘simple’: Uses an analytical formula that is exact in the continuous case (Eq. B17 in Clauset et al. arXiv 0706.1062v2)
- ‘continuitycorrection’: applies a continuity correction to the analytical formula
- ‘exact’: Numerically maximizes the discrete loglikelihood
kwargs (dict) – passed on to scipy.optimize.minimize_scalar. Default: bounds=[1.5, 4.5], method=’bounded’
- Return type:
estimated power-law exponent
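For orientation, the two closed-form estimators can be sketched as follows. This assumes the 'simple' method is the continuous maximum likelihood formula alpha = 1 + n / sum(ln(c_i / cmin)) and that the continuity correction replaces cmin by cmin - 1/2 inside the logarithm, as in Clauset et al.; whether pyrepseq uses exactly these forms is an assumption. The function name is hypothetical, and the 'exact' method (numerical maximization) is not shown.

```python
import math


def powerlaw_alpha_closed_form(counts, cmin=1.0, continuity_correction=False):
    """Closed-form estimate of the power-law exponent.

    Raises ZeroDivisionError if every retained count equals cmin and no
    continuity correction is applied (the log-sum is then zero).
    """
    kept = [c for c in counts if c >= cmin]  # only counts >= cmin enter the fit
    shift = 0.5 if continuity_correction else 0.0
    log_sum = sum(math.log(c / (cmin - shift)) for c in kept)
    return 1.0 + len(kept) / log_sum


# Every count equal to e gives ln(c / cmin) = 1, hence alpha = 1 + n / n = 2.
alpha_simple = powerlaw_alpha_closed_form([math.e] * 4)
alpha_corrected = powerlaw_alpha_closed_form([1, 1, 2, 3, 10], continuity_correction=True)
```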
- pyrepseq.stats.powerlaw_sample(size=1, xmin=1.0, alpha=2.0)[source]
Draw samples from a discrete power-law.
Uses an approximate transformation technique, see Eq. D6 in Clauset et al. arXiv 0706.1062v2 for details.
- Parameters:
size (number of values to draw)
xmin (minimal value)
alpha (power-law exponent)
- Return type:
array of integer samples
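The approximate transformation can be sketched as below, assuming it takes the form x = floor((xmin - 1/2) (1 - r)^(-1/(alpha - 1)) + 1/2) with r uniform on [0, 1). The function name and the rng argument are hypothetical, and the result is a plain list rather than an array.

```python
import math
import random


def sample_discrete_powerlaw(size=1, xmin=1.0, alpha=2.0, rng=None):
    """Draw approximate discrete power-law samples by transforming uniforms."""
    rng = rng or random.Random()
    samples = []
    for _ in range(size):
        r = rng.random()  # uniform on [0, 1), so 1 - r is never zero
        x = (xmin - 0.5) * (1.0 - r) ** (-1.0 / (alpha - 1.0)) + 0.5
        samples.append(math.floor(x))
    return samples


draws = sample_discrete_powerlaw(size=1000, xmin=1.0, alpha=2.0, rng=random.Random(0))
```

Since (1 - r)^(-1/(alpha - 1)) is at least 1 for alpha > 1, every draw is an integer greater than or equal to xmin.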
- pyrepseq.stats.renyi2_entropy(df, features, by=None, base=2.0)[source]
Compute Renyi-Simpson entropies
- pyrepseq.stats.subsample(counts, n)[source]
Randomly subsample from a vector of counts without replacement.
- Parameters:
counts (Vector of counts (integers) to randomly subsample from.)
n (Number of items to subsample from counts. Must be less than or equal to the sum of counts.)
- Returns:
indices, counts
- Return type:
Subsampled vector of counts where the sum of the elements equals n
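One simple way to subsample counts without replacement is to expand the count vector into one label per item and draw from that list. The sketch below does this with the standard library; the function name is hypothetical, it returns only the subsampled count vector, and the library's algorithm and return layout may differ.

```python
import random
from collections import Counter


def subsample_counts(counts, n, rng=None):
    """Subsample n items without replacement from a vector of integer counts."""
    rng = rng or random.Random()
    if n > sum(counts):
        raise ValueError("n must be less than or equal to the sum of counts")
    # One label per individual item: category i appears counts[i] times.
    labels = [i for i, c in enumerate(counts) for _ in range(c)]
    drawn = Counter(rng.sample(labels, n))
    return [drawn[i] for i in range(len(counts))]


original = [5, 0, 3, 2]
subsampled = subsample_counts(original, 4, rng=random.Random(1))
```

Expanding the labels costs memory proportional to the total count, so this approach is only reasonable for small inputs.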
Distance
- pyrepseq.distance.calculate_neighbor_numbers(seqs, reference=None, neighborhood=<function levenshtein_neighbors>)[source]
Calculate the number of neighbors for each sequence in a list.
- pyrepseq.distance.cdist(stringsA, stringsB, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]
Compute distance between each pair of the two collections of strings. (scipy.spatial.distance.cdist equivalent for strings)
Deprecated since version 1.4: pyrepseq.cdist() is now deprecated in favour of the Metric object system (see pyrepseq.metric.Metric). Metric objects implement the calc_cdist_matrix method which will perform the cdist computation. pyrepseq.cdist() will be removed in version 2.0.
- Parameters:
stringsA (iterable of strings) – An mA-length iterable.
stringsB (iterable of strings) – An mB-length iterable.
metric (function, optional) – The distance metric to use. Default: Levenshtein distance.
dtype (np.dtype) – data type of the distance matrix, default: np.uint8. Note: make sure to change the dtype if the metric does not return integers
- Returns:
Y – A \(m_A\) by \(m_B\) distance matrix is returned. For each \(i\) and \(j\), the metric dist(u=XA[i], v=XB[j]) is computed and stored in the \(ij\) th entry.
- Return type:
ndarray
- pyrepseq.distance.downsample(seqs: Iterable[str] | DataFrame | None, maxseqs: int | None = None)[source]
Random downsampling of a list of sequences. Also works for standard pyrepseq TCR DataFrames (see pyrepseq.io.standardize_dataframe()).
- Parameters:
- Returns:
Random subset of maxseqs elements from the input collection.
If maxseqs is None, returns the input collection without modification.
- pyrepseq.distance.find_neighbor_pairs(seqs, neighborhood=<function hamming_neighbors>)[source]
Find neighboring sequences in a list of unique sequences.
- Parameters:
neighborhood (callable returning an iterable of neighbors)
- Return type:
list of tuples (seq1, seq2)
- pyrepseq.distance.find_neighbor_pairs_index(seqs, neighborhood=<function hamming_neighbors>)[source]
Find neighboring sequences in a list of unique sequences.
- Parameters:
neighborhood (callable returning an iterable of neighbors)
- Return type:
list of tuples (index1, index2)
- pyrepseq.distance.hamming_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY', variable_positions=None)[source]
Iterator over Hamming neighbors of a string x.
- Parameters:
alphabet (iterable of characters)
variable_positions (iterable of positions to be varied (default: all))
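A Hamming neighbor differs from x by exactly one substitution, so the iterator can be sketched as a double loop over positions and alphabet letters. The function name below is hypothetical and this is not the library source.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hamming_neighbors_sketch(x, alphabet=AMINO_ACIDS, variable_positions=None):
    """Yield every string that differs from x by a single substitution."""
    positions = range(len(x)) if variable_positions is None else variable_positions
    for i in positions:
        for letter in alphabet:
            if letter != x[i]:
                yield x[:i] + letter + x[i + 1:]


all_neighbors = list(hamming_neighbors_sketch("CAF"))
middle_only = list(hamming_neighbors_sketch("CAF", variable_positions=[1]))
print(len(all_neighbors), len(middle_only))  # 57 19
```

With a 20-letter alphabet, each position contributes 19 neighbors, hence 3 · 19 = 57 for a sequence of length 3.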
- pyrepseq.distance.hierarchical_clustering(seqs: Iterable, metric: Metric | None = None, linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6})[source]
Hierarchical clustering by sequence similarity.
- Parameters:
seqs (Iterable) – A collection of elements to cluster.
metric (Metric) – The metric used to compute distances between elements. If not set, a default is inferred from the input data type of seqs. If seqs is a standard pyrepseq TCR DataFrame (see pyrepseq.io.standardize_dataframe()), then the metric can default to either a pyrepseq.metric.tcr_metric.Cdr3Levenshtein, pyrepseq.metric.tcr_metric.AlphaCdr3Levenshtein, or pyrepseq.metric.tcr_metric.BetaCdr3Levenshtein, depending on what columns are available. In all other cases, the metric defaults to pyrepseq.metric.Levenshtein.
linkage_kws – keyword arguments for linkage algorithm
cluster_kws – keyword arguments for clustering algorithm
- pyrepseq.distance.isdist1(x, reference, neighborhood=<function levenshtein_neighbors>)[source]
Is the string x distance 1 away from any of the strings in the reference set
- pyrepseq.distance.levenshtein_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY')[source]
Iterator over Levenshtein neighbors of a string x
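Levenshtein neighbors add single deletions and insertions to the substitutions of the Hamming case. The sketch below enumerates all three edit types and uses them for a distance-1 membership test in the spirit of isdist1. Both function names are hypothetical. The sketch may yield the same neighbor more than once, and whether the library deduplicates is not specified here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def levenshtein_neighbors_sketch(x, alphabet=AMINO_ACIDS):
    """Yield strings one substitution, deletion, or insertion away from x."""
    for i in range(len(x)):  # substitutions
        for letter in alphabet:
            if letter != x[i]:
                yield x[:i] + letter + x[i + 1:]
    for i in range(len(x)):  # deletions
        yield x[:i] + x[i + 1:]
    for i in range(len(x) + 1):  # insertions (duplicates are possible)
        for letter in alphabet:
            yield x[:i] + letter + x[i:]


def is_distance_one(x, reference):
    """True if any single-edit neighbor of x is in the reference collection."""
    reference_set = set(reference)
    return any(n in reference_set for n in levenshtein_neighbors_sketch(x))


print(is_distance_one("CAF", ["CAAF", "CASSLGF"]))  # True: one insertion away
print(is_distance_one("CAF", ["CAF"]))              # False: distance 0, not 1
```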
- pyrepseq.distance.load_pcDelta_background(return_bins=True)[source]
Loads pre-computed background pcDelta distributions calculated for PBMC TCRs.
Data: Sample W_F1_2018 from Minervina et al. https://zenodo.org/record/4065547/
- Returns:
back (pd.DataFrame) – DataFrame with coincidence probabilities
bins (ndarray [if return_bins = True]) – Delta bins to be used as bins for other data
- pyrepseq.distance.next_nearest_neighbors(x, neighborhood, maxdistance=2)[source]
Set of next nearest neighbors of a string x.
- Parameters:
alphabet (iterable of characters)
neighborhood (neighborhood iterator)
maxdistance (go up to maxdistance nearest neighbor)
- Return type:
set of neighboring sequences
- pyrepseq.distance.nndist_hamming(seq, reference, maxdist=4)[source]
Calculate the nearest-neighbor distance by Hamming distance
- pyrepseq.distance.pcDelta(seqs: Iterable, seqs2: Iterable | None = None, metric: Metric | None = None, bins: int | Iterable | None = None, normalize: bool = True, pseudocount: float = 0.0, maxseqs: int | None = None)[source]
Calculates binned near-coincidence probabilities \(p_C(\Delta)\) among input sequences.
- Parameters:
seqs (Iterable) – A collection of elements to measure distances between.
seqs2 (Optional[Iterable]) – A second collection of elements for cross-comparisons.
metric (pyrepseq.metric.Metric) – The metric used to compute distances between elements. If not set, a default is inferred from the input data type of seqs. If seqs is a standard pyrepseq TCR DataFrame (see pyrepseq.io.standardize_dataframe()), then the metric can default to either a pyrepseq.metric.tcr_metric.Cdr3Levenshtein, pyrepseq.metric.tcr_metric.AlphaCdr3Levenshtein, or pyrepseq.metric.tcr_metric.BetaCdr3Levenshtein, depending on what columns are available. In all other cases, the metric defaults to pyrepseq.metric.Levenshtein.
bins (Union[int, Iterable]) – bins for the distances Delta. (Default: range(0, 25)) bins=0: Calculate exact coincidence probability
normalize (bool) – whether to return pc (normalized) or raw counts
pseudocount (float) – for a Bayesian estimation of coincidence frequencies, e.g. the Jeffreys prior value of 0.5
maxseqs (Optional[int]) – maximal number of sequences to keep by random downsampling
- Returns:
(normalized) histogram of sequence distances
- Return type:
np.ndarray
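Conceptually, the near-coincidence probability at distance Delta is the fraction of sequence pairs separated by exactly that distance. The sketch below computes such a histogram with a plain Levenshtein distance. The function names, the max_distance cutoff, and normalization by the total number of pairs are assumptions for illustration. Cross-comparisons, custom bins, pseudocounts and downsampling are not covered.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def near_coincidence_histogram(seqs, max_distance=3, normalize=True):
    """Histogram of pairwise distances 0..max_distance among seqs."""
    histogram = [0] * (max_distance + 1)
    n_pairs = 0
    for a, b in combinations(list(seqs), 2):
        n_pairs += 1
        d = levenshtein(a, b)
        if d <= max_distance:
            histogram[d] += 1
    if normalize:
        return [count / n_pairs for count in histogram]
    return histogram


raw = near_coincidence_histogram(["CASSF", "CASSF", "CASTF", "CAF"], normalize=False)
print(raw)  # [1, 2, 3, 0]: one exact coincidence among the six pairs
```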
- pyrepseq.distance.pcDelta_grouped(df, by, seq_columns, **kwargs)[source]
Near-coincidence probabilities conditioned to within-group comparisons.
- Parameters:
df (pd.DataFrame)
by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby
seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis
**kwargs (keyword arguments) – passed on to pcDelta
- Returns:
pcs – Returns a DataFrame of pC(delta) for each group
- Return type:
pd.DataFrame
- pyrepseq.distance.pcDelta_grouped_cross(df, by, seq_columns, condensed=False, **kwargs)[source]
Near-coincidence probabilities conditioned to cross-group comparisons.
- Parameters:
df (pd.DataFrame)
by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby
seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis
condensed (bool) – Return a condensed instead of squareform matrix (default: False)
**kwargs (keyword arguments) – passed on to pcDelta
- Returns:
pcs – Returns a DataFrame of pC(delta) across pairs of groups
- Return type:
pd.DataFrame
- pyrepseq.distance.pdist(strings, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]
Pairwise distances between collection of strings. (scipy.spatial.distance.pdist equivalent for strings)
Deprecated since version 1.4: pyrepseq.pdist() is now deprecated in favour of the Metric object system (see pyrepseq.metric.Metric). Metric objects implement the calc_pdist_vector method which will perform the pdist computation. pyrepseq.pdist() will be removed in version 2.0.
- Parameters:
strings (iterable of strings) – An m-length iterable.
metric (function, optional) – The distance metric to use. Default: Levenshtein distance.
dtype (np.dtype) – data type of the distance matrix, default: np.uint8. Note: make sure to change the dtype if the metric does not return integers
- Returns:
Y – Returns a condensed distance matrix Y. For each \(i\) and \(j\) (where \(i<j<m\) and m is the number of original observations), the metric dist(u=X[i], v=X[j]) is computed and stored in entry m * i + j - ((i + 2) * (i + 1)) // 2.
- Return type:
ndarray
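The condensed index formula can be checked directly: enumerating all pairs i < j in lexicographic order should give consecutive positions 0, 1, 2, and so on. The helper name condensed_index is hypothetical.

```python
from itertools import combinations


def condensed_index(i, j, m):
    """Position of the pair (i, j), i < j < m, in the condensed vector."""
    return m * i + j - ((i + 2) * (i + 1)) // 2


m = 5
pairs = list(combinations(range(m), 2))  # (0, 1), (0, 2), ..., (3, 4)
indices = [condensed_index(i, j, m) for i, j in pairs]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The vector therefore has m (m - 1) / 2 entries, one per unordered pair.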
Nearest Neighbor
- pyrepseq.nn.hash_based(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets')[source]
List all neighboring CDR3B sequences efficiently for small edit distances. The idea is to enumerate all possible sequences within a given distance and look each one up in a dictionary to check whether it exists. This implementation is faster than the kdtree implementation for max_edits == 1
- Parameters:
seqs (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – number of CPU cores running in parallel
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
- Returns:
neighbors – neighbors along with their edit distances according to the given output_type:
if “triplets”, returns are [(x_index, y_index, edit_distance)];
if “coo_matrix”, returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor;
if “ndarray”, returns numpy’s 2d array representing dense matrix
- Return type:
array of 3D-tuples, sparse matrix, or dense matrix
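The hash-based idea for max_edits == 1 can be sketched as follows: index the input sequences in a dictionary, generate every single-edit variant of each sequence, and keep the variants found in the dictionary. The function names are hypothetical. Reporting each pair in both directions and omitting exact duplicates (distance 0) are choices of this sketch, not documented library behavior.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_edit_variants(x, alphabet=AMINO_ACIDS):
    """Set of strings exactly one substitution, deletion, or insertion from x."""
    variants = set()
    for i in range(len(x)):
        variants.add(x[:i] + x[i + 1:])  # deletion
        for letter in alphabet:
            if letter != x[i]:
                variants.add(x[:i] + letter + x[i + 1:])  # substitution
    for i in range(len(x) + 1):
        for letter in alphabet:
            variants.add(x[:i] + letter + x[i:])  # insertion
    return variants


def hash_neighbors(seqs):
    """Return sorted (index, index, distance) triplets for distance-1 pairs."""
    index = {}
    for i, seq in enumerate(seqs):
        index.setdefault(seq, []).append(i)  # sequence -> positions in input
    triplets = []
    for i, seq in enumerate(seqs):
        for variant in single_edit_variants(seq):
            for j in index.get(variant, []):
                triplets.append((i, j, 1))
    return sorted(triplets)


triplets = hash_neighbors(["CASSF", "CASTF", "CAF", "CASF"])
print(triplets[:3])  # [(0, 1, 1), (0, 3, 1), (1, 0, 1)]
```

The cost grows with the number of variants per sequence rather than with the number of sequence pairs, which is why this strategy suits small edit distances.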
- pyrepseq.nn.kdtree(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', compression=1)[source]
List all neighboring CDR3B sequences efficiently within the given edit distance. With KDTree, the algorithm runs in O(N log N) by eliminating unnecessary comparisons. With the RapidFuzz library, the edit distance comparison is efficiently implemented in C++. With multiprocessing, the algorithm can take advantage of multiple CPU cores. This implementation is faster than the hash-based implementation for max_edits > 1
- Parameters:
seqs (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – number of CPU cores running in parallel
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
- Returns:
neighbors – neighbors along with their edit distances according to the given output_type:
if “triplets”, returns are [(x_index, y_index, edit_distance)];
if “coo_matrix”, returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor;
if “ndarray”, returns numpy’s 2d array representing dense matrix
- Return type:
array of 3D-tuples, sparse matrix, or dense matrix
- pyrepseq.nn.nearest_neighbor(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]
List all neighboring sequences efficiently within a given distance. The distance can be given in terms of hamming, levenshtein, or custom.
If seqs2 is not provided, every sequence is compared against every other sequence.
- Parameters:
seqs (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – ignored
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
seqs2 (iterable of strings or None) – another list of CDR3B sequences to compare against
- Returns:
neighbors – neighbors along with their edit distances according to the given output_type:
if “triplets”, returns are [(x_index, y_index, edit_distance)];
if “coo_matrix”, returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor;
if “ndarray”, returns numpy’s 2d array representing dense matrix
- Return type:
array of 3D-tuples, sparse matrix, or dense matrix
- pyrepseq.nn.nearest_neighbor_tcrdist(df, chain='beta', max_edits=2, edit_on_trimmed=True, max_tcrdist=20, tcrdist_kwargs={}, **kwargs)[source]
List all neighboring TCR sequences efficiently within a given edit and TCRdist radius.
[Requires optional dependency pwseqdist]
- Parameters:
chain ('alpha' or 'beta')
max_edits (only return neighbors up to <= this edit distance)
edit_on_trimmed (boolean) – apply TCRdist trimming on sequences before calculating edit distance
max_tcrdist (only return neighbor up to <= this TCR distance)
tcrdist_kwargs (dict) – customized parameters for TCRdist calculation
**kwargs (passed on to nearest_neighbor function)
- Return type:
sparse matrix in (i, j, dist) format
- pyrepseq.nn.symdel(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]
List all neighboring CDR3B sequences efficiently within the given distance. This is an improved version of the hash-based implementation.
If seqs2 is not provided, every sequence is compared against every other sequence, resulting in N(seqs)**2 combinations. Otherwise, seqs are compared against seqs2, resulting in N(seqs)*N(seqs2) combinations.
- Parameters:
seqs (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – ignored
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
seqs2 (iterable of strings or None) – another list of CDR3B sequences to compare against
- Returns:
neighbors – neighbors along with their edit distances according to the given output_type:
if “triplets”, returns are [(x_index, y_index, edit_distance)];
if “coo_matrix”, returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor;
if “ndarray”, returns numpy’s 2d array representing dense matrix
- Return type:
array of 3D-tuples, sparse matrix, or dense matrix
Plotting submodule
- pyrepseq.plotting.align_seqs(seqs)[source]
Align multiple sequences using mafft-linsi with default parameters.
Requires external dependency mafft-linsi to be installed.
- Parameters:
seqs (iterable of strings)
- Returns:
aligned sequences (with gaps)
- Return type:
list of strings
- pyrepseq.plotting.label_axes(fig_or_axes, labels='ABCDEFGHIJKLMNOPQRSTUVWXYZ', labelstyle='%s', xy=(-0.1, 0.95), xycoords='axes fraction', **kwargs)[source]
Walks through axes and labels each. kwargs are collected and passed to annotate
- Parameters:
fig (Figure or Axes to work on)
labels (iterable or None) – iterable of strings to use to label the axes. If None, lower case letters are used.
loc (Where to put the label units (len=2 tuple of floats))
xycoords (loc relative to axes, figure, etc.)
kwargs (to be passed to annotate)
- pyrepseq.plotting.labels_to_colors_hls(labels, palette_kws={'l': 0.5, 's': 0.8}, min_count=None)[source]
Map a list of labels to a list of unique colors. Uses seaborn.hls_palette.
- Parameters:
df (pandas DataFrame with data)
labels (list of labels)
min_count (map all labels seen less than min_count to black)
palette_kws (passed to seaborn.hls_palette)
- pyrepseq.plotting.labels_to_colors_tableau(labels, min_count=None)[source]
Map a list of labels to a list of unique colors. Uses Tableau_10 colors
- Parameters:
df (pandas DataFrame with data)
labels (list of labels)
min_count (map all labels seen less than min_count to black)
- pyrepseq.plotting.rankfrequency(data, ax=None, normalize_x=True, normalize_y=False, log_x=True, log_y=True, scalex=1.0, scaley=1.0, **kwargs)[source]
Plot rank frequency plots.
- Parameters:
- Returns:
Objects representing the plotted data.
- Return type:
list of Line2D
- pyrepseq.plotting.seqlogos(seqs, ax=None, **kwargs)[source]
Display a sequence logo.
Aligns sequences using align_seqs if they are not of equal length.
- Parameters:
seqs (iterable of strings) – sequences to be displayed
ax (matplotlib.axes) – if None create new figure
**kwargs (dict) – passed on to logomaker.Logo
- Return type:
axes, counts_matrix
- pyrepseq.plotting.seqlogos_vj(df, cdr3_column, v_column, j_column, axes=None, **kwargs)[source]
Display a sequence logo with V and J gene information.
- pyrepseq.plotting.similarity_clustermap(df, alpha_column='cdr3a', beta_column='cdr3b', norm=None, bounds=array([0, 1, 2, 3, 4, 5, 6]), linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6}, cbar_kws={'format': '%d', 'label': 'Sequence Distance', 'orientation': 'horizontal'}, meta_columns=None, meta_to_colors=None, **kws)[source]
Plots a sequence-similarity clustermap.
- Parameters:
df (pandas DataFrame with data)
alpha_column (column name with alpha and beta amino acid information (set one to None for single chain plotting))
beta_column (column name with alpha and beta amino acid information (set one to None for single chain plotting))
norm (matplotlib.colors.Normalize subclass for turning distances into colors)
bounds (bounds used for colormap matplotlib.colors.BoundaryNorm (only used if norm = None))
linkage_kws (keyword arguments for linkage algorithm)
cluster_kws (keyword arguments for clustering algorithm)
cbar_kws (keyword arguments for colorbar)
meta_columns (list-like) – metadata to plot alongside the cluster assignment
meta_to_colors (list-like) – list of functions mapping metadata labels to colors first element of list is for clusters
kws (keyword arguments passed on to the clustermap.)
Metrics
General Metrics
- class pyrepseq.metric.Metric[source]
Base abstract class for all metrics in pyrepseq. This class outlines the interface that all metrics will implement. If a variable or function parameter can be any type of metric, then it should be typed to this class.
- abstract calc_cdist_matrix(anchors: Iterable, comparisons: Iterable) → ndarray [source]
Calculates a cdist matrix between two collections of objects.
- Parameters:
anchors (Iterable) – A collection of objects to measure distances from.
comparisons (Iterable) – A collection of objects to measure distances to.
- Returns:
A matrix of shape (N,M) where N is the number of elements in anchors and M is the number of elements in comparisons. The element in the ith row and jth column will contain the distance between the ith element of anchors and the jth element of comparisons.
- Return type:
numpy.ndarray
- abstract calc_pdist_vector(instances: Iterable) → ndarray [source]
Calculates a pdist vector given a collection of objects.
- Parameters:
instances (Iterable) – A collection of objects to measure distances between.
- Returns:
A vector of shape (N*(N-1)/2,) where N is the number of elements in instances. The vector contains all distances that are possible between each possible pair of objects in instances.
- Return type:
numpy.ndarray
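To illustrate the shape contract of the two methods, here is a dependency-free sketch. MetricInterface is a local stand-in for the abstract base class, not pyrepseq.metric.Metric itself, and LengthDifference is a hypothetical toy metric. The methods return nested lists instead of numpy arrays so that the example runs with the standard library only.

```python
from abc import ABC, abstractmethod
from itertools import combinations


class MetricInterface(ABC):
    """Local stand-in mirroring the two abstract methods described above."""

    @abstractmethod
    def calc_cdist_matrix(self, anchors, comparisons):
        """Return an N x M table of distances from anchors to comparisons."""

    @abstractmethod
    def calc_pdist_vector(self, instances):
        """Return the N * (N - 1) / 2 pairwise distances in condensed order."""


class LengthDifference(MetricInterface):
    """Toy metric: absolute difference in sequence length."""

    def calc_cdist_matrix(self, anchors, comparisons):
        comparisons = list(comparisons)  # allow one-shot iterables
        return [[abs(len(a) - len(b)) for b in comparisons] for a in anchors]

    def calc_pdist_vector(self, instances):
        return [abs(len(a) - len(b)) for a, b in combinations(list(instances), 2)]


metric = LengthDifference()
print(metric.calc_cdist_matrix(["CAF", "CASSF"], ["CASF"]))  # [[1], [1]]
print(metric.calc_pdist_vector(["CAF", "CASF", "CASSF"]))    # [1, 2, 1]
```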
TCR Metrics
- class pyrepseq.metric.tcr_metric.TcrMetric[source]
Base abstract class for all metrics that operate on TCRs. TcrMetrics should expect DataFrames with each row representing a TCR, in the standard pyrepseq format (see pyrepseq.io.standardize_dataframe()). The input DataFrames must also have at least one TCR-related column. Furthermore, if the input DataFrame(s) do not have the required column for the function of the specific metric, the metric will throw a ValueError explaining which columns are missing. All values in the table should be IMGT-standardized.
- abstract calc_cdist_matrix(anchors: DataFrame, comparisons: DataFrame) → ndarray [source]
Calculates a cdist matrix between two DataFrames containing TCR data.
- Parameters:
anchors (DataFrame) – A DataFrame containing data on TCRs to measure distances from.
comparisons (DataFrame) – A DataFrame containing data on TCRs to measure distances to.
- Returns:
A matrix of shape (N,M) where N is the number of TCRs in anchors and M is the number of TCRs in comparisons. The element in the ith row and jth column will contain the distance between the ith TCR of anchors and the jth TCR of comparisons.
- Return type:
numpy.ndarray
- abstract calc_pdist_vector(instances: DataFrame) → ndarray[source]
Calculates a pdist vector given a DataFrame of TCRs.
- Parameters:
instances (DataFrame) – A DataFrame of TCRs to measure distances between.
- Returns:
A vector of shape (N*(N-1)/2,) where N is the number of TCRs in instances. The vector contains the distance between every unordered pair of TCRs in instances.
- Return type:
numpy.ndarray
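The missing-column behaviour described for TcrMetric can be sketched as follows. This is not the pyrepseq implementation: missing_columns and check_required_columns are hypothetical helpers, the column names come from the standard format table, and the wording of the error message is an assumption.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in the order requested."""
    return [name for name in required if name not in available]


def check_required_columns(available, required):
    """Raise ValueError naming every missing column (hypothetical message format)."""
    missing = missing_columns(available, required)
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))


# A beta-chain-only table cannot serve a metric that also needs CDR3A.
table_columns = ["TRBV", "CDR3B", "TRBJ"]
absent = missing_columns(table_columns, ["CDR3A", "CDR3B"])
```

With a pandas DataFrame, the same check could be run on list(df.columns).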
- class pyrepseq.metric.tcr_metric.AlphaCdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1)[source]
A TcrMetric that measures the Levenshtein distance between the alpha chain CDR3 sequences.
- class pyrepseq.metric.tcr_metric.BetaCdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1)[source]
A TcrMetric that measures the Levenshtein distance between the beta chain CDR3 sequences.
- class pyrepseq.metric.tcr_metric.Cdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, alpha_weight: int = 1, beta_weight: int = 1)[source]
A TcrMetric that measures the Levenshtein distance between the alpha and beta chain CDR3 sequences.
- Parameters:
insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
alpha_weight (int) – An integer multiplier for edits on the alpha chain. Defaults to 1.
beta_weight (int) – An integer multiplier for edits on the beta chain. Defaults to 1.
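To make the roles of these weights concrete, here is a minimal pure-Python sketch of a weighted Levenshtein distance. It is not the pyrepseq implementation: weighted_levenshtein and paired_cdr3_distance are hypothetical names. Two points are assumptions: that the per-chain distances are combined as a weighted sum, and that an insertion means adding a character to the first sequence to reach the second.

```python
def weighted_levenshtein(a, b, insertion_weight=1, deletion_weight=1,
                         substitution_weight=1):
    """Minimum total cost of edits turning string a into string b."""
    # previous[j] holds the cost of turning the processed prefix of a into b[:j].
    previous = [j * insertion_weight for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [i * deletion_weight]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (
                0 if char_a == char_b else substitution_weight
            )
            deletion = previous[j] + deletion_weight
            insertion = current[j - 1] + insertion_weight
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def paired_cdr3_distance(tcr_a, tcr_b, alpha_weight=1, beta_weight=1,
                         **edit_weights):
    """Weighted sum of the alpha and beta CDR3 distances (assumed combination)."""
    alpha = weighted_levenshtein(tcr_a["CDR3A"], tcr_b["CDR3A"], **edit_weights)
    beta = weighted_levenshtein(tcr_a["CDR3B"], tcr_b["CDR3B"], **edit_weights)
    return alpha_weight * alpha + beta_weight * beta


tcr_1 = {"CDR3A": "CAVRDF", "CDR3B": "CASSLGF"}
tcr_2 = {"CDR3A": "CAVSDF", "CDR3B": "CASSF"}
unweighted = paired_cdr3_distance(tcr_1, tcr_2)
alpha_heavy = paired_cdr3_distance(tcr_1, tcr_2, alpha_weight=2)
```

Here the alpha chains differ by one substitution and the beta chains by two deletions, so doubling alpha_weight adds one to the total under the assumed weighted-sum rule.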
- class pyrepseq.metric.tcr_metric.AlphaCdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]
A TcrMetric that measures the Levenshtein distance between the alpha chain CDR1, CDR2, and CDR3 sequences.
- Parameters:
insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.
cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.
cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.
- class pyrepseq.metric.tcr_metric.BetaCdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]
A TcrMetric that measures the Levenshtein distance between the beta chain CDR1, CDR2, and CDR3 sequences.
- Parameters:
insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.
cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.
cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.
- class pyrepseq.metric.tcr_metric.CdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, alpha_weight: int = 1, beta_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]
A TcrMetric that measures the Levenshtein distance between the alpha and beta chain CDR1, CDR2, and CDR3 sequences.
- Parameters:
insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.
deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.
substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.
cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.
cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.
alpha_weight (int) – An integer multiplier for edits on the alpha chain. Defaults to 1.
beta_weight (int) – An integer multiplier for edits on the beta chain. Defaults to 1.
- class pyrepseq.metric.tcr_metric.AlphaCdr3Tcrdist[source]
TcrDist applied to the alpha chain CDR3 sequences.
[Requires optional tcrdist dependency.]
- class pyrepseq.metric.tcr_metric.BetaCdr3Tcrdist[source]
TcrDist applied to the beta chain CDR3 sequences.
[Requires optional tcrdist dependency.]
- class pyrepseq.metric.tcr_metric.Cdr3Tcrdist[source]
TcrDist applied to the alpha and beta chain CDR3 sequences.
[Requires optional tcrdist dependency.]
- class pyrepseq.metric.tcr_metric.AlphaTcrdist[source]
TcrDist applied to the alpha chain.
[Requires optional tcrdist dependency.]