API Reference

Main module

IO

pyrepseq.io.isvalidaa(string)[source]

Returns True if string is composed only of characters from the standard amino acid alphabet.

pyrepseq.io.isvalidcdr3(string)[source]

Returns True if string is a valid CDR3 sequence.

Checks the following:
  • first amino acid is a cysteine (C)

  • last amino acid is either phenylalanine (F), tryptophan (W), or cysteine (C)

  • each amino acid is part of the standard amino acid alphabet

See http://www.imgt.org/IMGTScientificChart/Numbering/IMGTIGVLsuperfamily.html and also https://doi.org/10.1093/nar/gkac190
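
The three checks listed above can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not the code behind pyrepseq.io.isvalidcdr3; the name looks_like_cdr3 and the constant AMINO_ACIDS are hypothetical, and treating the empty string as invalid is a choice made for this sketch.

```python
# Standard 20-letter amino acid alphabet (one-letter codes).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def looks_like_cdr3(seq: str) -> bool:
    """Hypothetical sketch of the three documented checks."""
    if not seq:
        # Assumption of this sketch: an empty string is never valid.
        return False
    starts_with_c = seq[0] == "C"
    ends_with_f_w_or_c = seq[-1] in "FWC"
    all_standard = all(aa in AMINO_ACIDS for aa in seq)
    return starts_with_c and ends_with_f_w_or_c and all_standard
```

A sequence such as "CASSLGQSGANVLTF" passes all three checks, while "CASSXF" fails because X is not part of the standard alphabet.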

pyrepseq.io.multimerge(dfs, on, suffixes=None, **kwargs)[source]

Merge multiple dataframes on a common column.

Provides support for custom suffixes.

Parameters:
  • on ('index' or column name) –

  • suffixes ([list-like | None]) – list of suffixes to append to the data

  • **kwargs (keyword arguments passed along to pd.merge) –

Return type:

merged dataframe

pyrepseq.io.standardize_dataframe(df: DataFrame | None = None, col_mapper: Mapping | None = None, standardize: bool = True, species: str = 'HomoSapiens', tcr_enforce_functional: bool = True, tcr_precision: str = 'gene', mhc_precision: str = 'gene', strict_cdr3_standardization: bool = False, suppress_warnings: bool = False, df_old: DataFrame | None = None)[source]

This is a utility function to organise a table of TCR-pMHC data into the standard pyrepseq format and perform data cleaning/standardization to ensure that the TCR/MHC gene symbols are IMGT-compliant, the epitopes are all valid amino acid strings, and the CDR3s look valid. For further notes on data standardization, see below. The standard format is a table with some or all of the following columns (not necessarily in order):

Column Name   Column should contain                       Data type
TRAV          TRAV gene symbol                            str
CDR3A         TCR alpha chain CDR3 amino acid sequence    str
TRAJ          TRAJ gene symbol                            str
TRBV          TRBV gene symbol                            str
CDR3B         TCR beta chain CDR3 amino acid sequence     str
TRBJ          TRBJ gene symbol                            str
Epitope       Epitope amino acid sequence                 str
MHCA          MHC alpha chain gene symbol                 str
MHCB          MHC beta chain gene symbol                  str

If the input DataFrame contains the necessary data in columns that are named differently, this can be resolved by providing the mapping to the col_mapper argument (see parameters and examples).

If standardization is enabled (True by default), the function will additionally attempt to standardize the TCR and MHC gene symbols to be IMGT-compliant, and CDR3/Epitope amino acid sequences to be valid. However, for the standardization to happen, the columns with the relevant data must either be correctly named, or the necessary re-naming scheme must be specified by supplying an argument to the col_mapper parameter. During standardization, most non-standardizable or nonsensical values will be removed and replaced with None. However, since epitopes are not necessarily always amino acid sequences, values in the Epitope column that fail standardization will be kept as their original value.

Deprecated since version 1.4: df_old will be removed in pyrepseq 2.0 in favour of the more simply named df parameter.

Parameters:
  • df (pandas.DataFrame) – Source DataFrame from which to pull data.

  • df_old (pandas.DataFrame) – Alias for df. Now deprecated and will be removed in version 2.0.

  • col_mapper (Mapping) – A mapping object, such as a dictionary, which maps the old column names to the new column names. This should not be set if no column re-naming is necessary. Defaults to None.

  • standardize (bool) – When set to False, gene name standardisation is not attempted. Defaults to True.

  • species (str) – Name of the species from which the TCR data is derived, in their binomial nomenclature, camel-cased. Defaults to 'HomoSapiens'.

  • tcr_enforce_functional (bool) – When set to True, TCR genes that are not functional (i.e. ORF or pseudogene) are removed, and replaced with None. Defaults to True.

  • tcr_precision (str) – Level of precision to trim the TCR gene data to ('gene' or 'allele'). Defaults to 'gene'.

  • mhc_precision (str) – Level of precision to trim the MHC gene data to ('gene', 'protein' or 'allele'). Defaults to 'gene'.

  • strict_cdr3_standardization (bool) – If True, any string that does not look like a CDR3 sequence is rejected. If False, any inputs that are valid amino acid sequences but do not start with C and end with F/W are not rejected and are instead corrected by having a C prepended to the beginning and an F appended at the end. Defaults to False.

  • suppress_warnings (bool) – If True, suppresses warnings that are emitted when the standardisation of certain values fails. Defaults to False.

Returns:

Standardized DataFrame containing the original data, cleaned.

Return type:

pandas.DataFrame

Examples

If you already have a DataFrame in the standard format, standardize_dataframe can perform data standardization for you. In the examples shown here, we omit any standardization warnings for ease of reading.

Say you have the following DataFrame:

>>> from pyrepseq import io
>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[
...         ["av26.1*1",  "CIVRAPGRADMRF", "aj43*1",    "bv13*1",      "CASSYLPGQGDHYSNQPQHF","bj1.5*1",    "FLKEKGGL",       "b8",         "b2m"],
...         ["TCRAV20*01","CAVPSGAGSYQLTF","TCRAJ28*01","TCRBV28S1*01","CASSLGQSGANVLTF",     "TCRBJ2S6*01","LQPFPQPELPYPQPQ","HLA-DQA1*05","HLA-DQB1*02"],
...         ["unknown",   "unknown",       "unknown",   "TRBV7-2*01",  "CASSDWGSQNTLYF",      "TRBJ2-4*01", "YMPYFFTLL",      "HLA-A*02",   "B2M"]
...     ],
...     columns=["TRAV","CDR3A","TRAJ","TRBV","CDR3B","TRBJ","Epitope","MHCA","MHCB"]
... )
>>> df
         TRAV           CDR3A        TRAJ          TRBV                 CDR3B         TRBJ          Epitope         MHCA         MHCB
0    av26.1*1   CIVRAPGRADMRF      aj43*1        bv13*1  CASSYLPGQGDHYSNQPQHF      bj1.5*1         FLKEKGGL           b8          b2m
1  TCRAV20*01  CAVPSGAGSYQLTF  TCRAJ28*01  TCRBV28S1*01       CASSLGQSGANVLTF  TCRBJ2S6*01  LQPFPQPELPYPQPQ  HLA-DQA1*05  HLA-DQB1*02
2     unknown         unknown     unknown    TRBV7-2*01        CASSDWGSQNTLYF   TRBJ2-4*01        YMPYFFTLL     HLA-A*02          B2M

By passing this to standardize_dataframe, you will get a cleaned version of the data.

>>> io.standardize_dataframe(df, suppress_warnings=True)
       TRAV           CDR3A    TRAJ     TRBV                 CDR3B     TRBJ          Epitope      MHCA      MHCB
0  TRAV26-1   CIVRAPGRADMRF  TRAJ43   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5         FLKEKGGL     HLA-B       B2M
1    TRAV20  CAVPSGAGSYQLTF  TRAJ28   TRBV28       CASSLGQSGANVLTF  TRBJ2-6  LQPFPQPELPYPQPQ  HLA-DQA1  HLA-DQB1
2      None            None    None  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4        YMPYFFTLL     HLA-A       B2M

If you want to have extra columns on the DataFrame, that is allowed.

>>> extended_df = df.copy()
>>> extended_df["clone_count"] = [1,2,3]
>>> io.standardize_dataframe(extended_df, suppress_warnings=True)
       TRAV           CDR3A    TRAJ     TRBV                 CDR3B     TRBJ          Epitope      MHCA      MHCB  clone_count
0  TRAV26-1   CIVRAPGRADMRF  TRAJ43   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5         FLKEKGGL     HLA-B       B2M            1
1    TRAV20  CAVPSGAGSYQLTF  TRAJ28   TRBV28       CASSLGQSGANVLTF  TRBJ2-6  LQPFPQPELPYPQPQ  HLA-DQA1  HLA-DQB1            2
2      None            None    None  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4        YMPYFFTLL     HLA-A       B2M            3

Having only a subset of the standard columns is also allowed.

>>> beta_only_df = df.copy()
>>> beta_only_df = beta_only_df[["TRBV","CDR3B","TRBJ"]]
>>> io.standardize_dataframe(beta_only_df, suppress_warnings=True)
      TRBV                 CDR3B     TRBJ
0   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1   TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4

Columns can be renamed by supplying a mapping to the col_mapper parameter.

>>> beta_only_misnamed = beta_only_df.copy()
>>> beta_only_misnamed.columns = ["foo", "bar", "baz"]
>>> beta_only_misnamed
            foo                   bar          baz
0        bv13*1  CASSYLPGQGDHYSNQPQHF      bj1.5*1
1  TCRBV28S1*01       CASSLGQSGANVLTF  TCRBJ2S6*01
2    TRBV7-2*01        CASSDWGSQNTLYF   TRBJ2-4*01
>>> col_mapper = {
...     "foo": "TRBV",
...     "bar": "CDR3B",
...     "baz": "TRBJ"
... }
>>> io.standardize_dataframe(beta_only_misnamed, col_mapper=col_mapper)
      TRBV                 CDR3B     TRBJ
0   TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1   TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2  TRBV7-2        CASSDWGSQNTLYF  TRBJ2-4

Stats

pyrepseq.stats.chao1(counts)[source]

Estimate richness from sampled counts.
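
For orientation, the classic Chao1 estimator adds a correction based on the number of singletons (f1) and doubletons (f2) to the observed richness: S_obs + f1^2 / (2 f2). The sketch below implements that textbook form with a bias-corrected fallback when there are no doubletons. Which variant pyrepseq.stats.chao1 actually uses is not stated in this reference, so treat the function chao1_sketch as a hypothetical illustration only.

```python
from collections import Counter


def chao1_sketch(counts):
    """Classic Chao1 richness estimate: S_obs + f1**2 / (2 * f2).

    Falls back to the bias-corrected term f1 * (f1 - 1) / 2 when there are
    no doubletons. Illustrative assumption, not pyrepseq's implementation.
    """
    observed = [c for c in counts if c > 0]
    freq_of_freq = Counter(observed)
    s_obs = len(observed)  # number of distinct elements actually seen
    f1, f2 = freq_of_freq[1], freq_of_freq[2]
    if f2 > 0:
        return s_obs + f1 ** 2 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2
```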

pyrepseq.stats.chao2(counts, m)[source]

Estimate richness from incidence data

pyrepseq.stats.jaccard_index(A, B)[source]

Calculate the Jaccard index for two sets.

This measure is defined as

\(J(A, B) = |A \cap B| / |A \cup B|\)

A, B: iterables (will be converted to sets). If A and B are pd.Series, na values will be dropped first.
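
The definition above can be sketched with built-in sets. This is a dependency-free illustration rather than the library code: the name jaccard_sketch is hypothetical, None stands in for missing values, and returning 0.0 when both sets are empty is a choice made for this sketch.

```python
def jaccard_sketch(a, b):
    """Jaccard index |A & B| / |A | B| on plain Python iterables."""
    set_a = {x for x in a if x is not None}  # drop missing values first
    set_b = {x for x in b if x is not None}
    union = set_a | set_b
    if not union:
        # Assumption of this sketch: two empty sets give 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)
```

For example, the sets {A, B, C} and {B, C, D} share two of four distinct elements, giving an index of 0.5.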

pyrepseq.stats.overlap(A, B)[source]

Calculate the number of overlapping elements of two sets.

This measure is defined as \(|A \cap B|\)

A, B: iterables (will be converted to sets). na values will be dropped first

pyrepseq.stats.overlap_coefficient(A, B)[source]

Calculate the overlap coefficient for two sets.

This measure is defined as \(O(A, B) = |A \cap B| / min(|A|, |B|)\)

A, B: iterables (will be converted to sets). na values will be dropped first
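
Both overlap measures reduce to a set intersection, as the following sketch shows. The function names are hypothetical, None stands in for missing values, and the code is an illustration of the formulas rather than the library implementation.

```python
def _as_clean_set(values):
    """Convert an iterable to a set, dropping missing (None) entries."""
    return {x for x in values if x is not None}


def overlap_sketch(a, b):
    """Number of shared elements, |A & B|."""
    return len(_as_clean_set(a) & _as_clean_set(b))


def overlap_coefficient_sketch(a, b):
    """Overlap coefficient |A & B| / min(|A|, |B|)."""
    set_a, set_b = _as_clean_set(a), _as_clean_set(b)
    return len(set_a & set_b) / min(len(set_a), len(set_b))
```

Unlike the Jaccard index, the overlap coefficient reaches 1 whenever the smaller set is fully contained in the larger one.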

pyrepseq.stats.pc(array: Iterable, array2: Iterable | None = None)[source]

Estimate the coincidence probability \(p_C\) from a sample. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)

Note: This measure is also known as the Simpson or Hunter-Gaston index

Parameters:
  • array (Iterable) – Iterable of sampled elements

  • array2 (Optional[Iterable]) – Second Iterable of sampled elements: if provided probability of cross-coincidences is calculated as \(p_C = (\sum_i n_{1i} n_{2i}) / (N_1 N_2)\)
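
The two formulas above translate directly into a few lines of Python. The sketch below uses collections.Counter to obtain the counts; pc_sketch is a hypothetical name, and the code illustrates the estimator rather than reproducing the library implementation.

```python
from collections import Counter


def pc_sketch(array, array2=None):
    """Coincidence probability within one sample, or across two samples."""
    counts = Counter(array)
    n_total = sum(counts.values())
    if array2 is None:
        # p_C = sum_i n_i (n_i - 1) / (N (N - 1))
        pairs = sum(n * (n - 1) for n in counts.values())
        return pairs / (n_total * (n_total - 1))
    counts2 = Counter(array2)
    n_total2 = sum(counts2.values())
    # p_C = sum_i n_1i n_2i / (N_1 N_2); missing keys count as zero
    shared = sum(n * counts2[key] for key, n in counts.items())
    return shared / (n_total * n_total2)
```

For the sample a, a, b there is one coinciding pair out of three possible pairs, so the estimate is 1/3.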

pyrepseq.stats.pc_conditional(df, by, on, take_mean=True, weight_uniformly=False)[source]

Conditional coincidence probability estimator

Parameters:
  • df (pandas DataFrame) –

  • by (list) – conditioning parameters used to group input data frame

  • on (string/list of strings) – column or columns to compute probability of coincidence or joint probability of coincidence on. If type(on) == list then pc is computed on the concatenations of each specified column

  • take_mean (bool) – specify whether to take the average once pc has been computed for each specified group

Returns:

pc of df[on] computed over each group specified in by. If take_mean=True, the average of these per-group pcs is returned.

Return type:

pandas DataFrame/float

pyrepseq.stats.pc_joint(df, on)[source]

Joint coincidence probability estimator

Parameters:
  • df (pandas DataFrame) –

  • on (list of strings) – columns on which to obtain a joint probability of coincidence

Returns:

pc computed on the concatenations of each specified column in on

Return type:

float

pyrepseq.stats.pc_n(n)[source]

Estimate the coincidence probability \(p_C\) from sampled counts. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)

Note: This measure is also known as the Simpson or Hunter-Gaston index

Parameters:

n (array-like) – list of counts

pyrepseq.stats.powerlaw_mle_alpha(c, cmin=1.0, method='exact', **kwargs)[source]

Maximum likelihood estimate of the power-law exponent.

Parameters:
  • c (counts) –

  • cmin (only counts >= cmin are included in fit) –

  • continuitycorrection (use continuitycorrection (more accurate for integer counts)) –

  • method (one of ['simple', 'continuitycorrection', 'exact']) –

    ‘simple’: Uses an analytical formula that is exact in the continuous case (Eq. B17 in Clauset et al. arXiv 0706.1062v2)

    ‘continuitycorrection’: Applies a continuity correction to the analytical formula

    ‘exact’: Numerically maximizes the discrete loglikelihood

  • kwargs (dict) – passed on to scipy.optimize.minimize_scalar. Default: bounds=[1.5, 4.5], method=’bounded’

Return type:

estimated power-law exponent
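
As a rough illustration of the two analytical options, the sketch below implements the well-known closed-form estimate alpha = 1 + n / sum(ln(c_i / cmin)) and its continuity-corrected variant, which replaces cmin by cmin - 1/2 inside the logarithm. The function name and the continuity_correction flag are hypothetical; the numerical 'exact' method is not covered, and the library's own implementation may differ in detail.

```python
import math


def powerlaw_alpha_sketch(c, cmin=1.0, continuity_correction=False):
    """Closed-form power-law exponent estimate (illustrative only)."""
    tail = [x for x in c if x >= cmin]  # only counts >= cmin enter the fit
    # The continuity correction shifts the lower cutoff by one half.
    lower = cmin - 0.5 if continuity_correction else cmin
    log_sum = sum(math.log(x / lower) for x in tail)
    # Note: log_sum is zero if every retained count equals the cutoff.
    return 1.0 + len(tail) / log_sum
```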

pyrepseq.stats.powerlaw_sample(size=1, xmin=1.0, alpha=2.0)[source]

Draw samples from a discrete power-law.

Uses an approximate transformation technique, see Eq. D6 in Clauset et al. arXiv 0706.1062v2 for details.

Parameters:
  • size (number of values to draw) –

  • xmin (minimal value) –

  • alpha (power-law exponent) –

Return type:

array of integer samples
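
The approximate transformation technique referenced above maps a uniform random number r to an integer via floor((xmin - 1/2) (1 - r)^(-1/(alpha - 1)) + 1/2). A minimal sketch using only the standard library is given below; the function name and the rng argument are hypothetical, and the code returns a plain list rather than an array.

```python
import math
import random


def powerlaw_sample_sketch(size=1, xmin=1.0, alpha=2.0, rng=None):
    """Draw approximate discrete power-law samples (illustrative only)."""
    rng = rng or random.Random()
    samples = []
    for _ in range(size):
        r = rng.random()  # uniform on [0, 1), so 1 - r is never zero
        x = (xmin - 0.5) * (1.0 - r) ** (-1.0 / (alpha - 1.0)) + 0.5
        samples.append(math.floor(x))
    return samples
```

Because (1 - r)^(-1/(alpha - 1)) is at least 1 for alpha > 1, every sample is at least xmin.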

pyrepseq.stats.renyi2_entropy(df, features, by=None, base=2.0)[source]

Compute Renyi-Simpson entropies

Parameters:
  • df (pandas DataFrame) –

  • features (list) –

  • by (string/list of strings) –

  • base (float) –

Return type:

float

pyrepseq.stats.shannon_entropy(df, features, by=None, base=2.0)[source]

Compute Shannon entropies

Parameters:
  • df (pandas DataFrame) –

  • features (list) –

  • by (string/list of strings) –

  • base (float) –

Return type:

float
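
For reference, the plug-in Shannon entropy of a list of observations is H = -sum_i p_i log(p_i), with p_i the empirical frequency of the i-th unique element. The sketch below computes this for a plain iterable. It does not reproduce the df, features, and by interface, and whether the library applies this plug-in estimate or a corrected one is not stated here, so the code is an assumption-laden illustration.

```python
import math
from collections import Counter


def shannon_entropy_sketch(items, base=2.0):
    """Plug-in Shannon entropy of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log(n / total, base) for n in counts.values()
    )
```

Four equally frequent elements give an entropy of 2 bits with the default base of 2.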

pyrepseq.stats.stdpc(array)[source]

Std.dev. estimator for Simpson’s index

pyrepseq.stats.stdpc_n(n)[source]

Std.dev. estimator for Simpson’s index

pyrepseq.stats.subsample(counts, n)[source]

Randomly subsample from a vector of counts without replacement.

Parameters:
  • counts (Vector of counts (integers) to randomly subsample from.) –

  • n (Number of items to subsample from counts. Must be less than or equal to the sum of counts.) –

Returns:

indices, counts

Return type:

Subsampled vector of counts where the sum of the elements equals n
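
One simple way to subsample a count vector without replacement is to expand it into one label per individual, draw n labels, and count them again. The sketch below does this with the standard library. The name subsample_sketch and the rng argument are hypothetical, the return value is a single list with zeros kept in place, and the approach is not necessarily the one used by the library.

```python
import random
from collections import Counter


def subsample_sketch(counts, n, rng=None):
    """Subsample n individuals without replacement from a count vector."""
    rng = rng or random.Random()
    # Expand counts into one label per individual: [2, 1] -> [0, 0, 1].
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    if n > len(pool):
        raise ValueError("n must not exceed the sum of counts")
    drawn = Counter(rng.sample(pool, n))
    # Re-count per original position; positions never drawn get zero.
    return [drawn[i] for i in range(len(counts))]
```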

pyrepseq.stats.var_chao1(counts)[source]

Variance estimator for Chao1 richness.

pyrepseq.stats.var_chao2(counts, m)[source]

Variance estimator for Chao2 richness.

pyrepseq.stats.varpc_n(n)[source]

Variance estimator for Simpson’s index

Distance

pyrepseq.distance.calculate_neighbor_numbers(seqs, reference=None, neighborhood=<function levenshtein_neighbors>)[source]

Calculate the number of neighbors for each sequence in a list.

Parameters:
  • seqs (list of sequences) –

  • reference (list of sequences, set(seqs) if None) –

  • neighborhood (function returning iterator over neighbors) –

Return type:

integer array of number of neighboring sequences

pyrepseq.distance.cdist(stringsA, stringsB, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]

Compute distance between each pair of the two collections of strings. (scipy.spatial.distance.cdist equivalent for strings)

Deprecated since version 1.4: pyrepseq.cdist() is now deprecated in favour of the Metric object system (see pyrepseq.metric.Metric). Metric objects implement the calc_cdist_matrix method which will perform the cdist computation. pyrepseq.cdist() will be removed in version 2.0.

Parameters:
  • stringsA (iterable of strings) – An mA-length iterable.

  • stringsB (iterable of strings) – An mB-length iterable.

  • metric (function, optional) – The distance metric to use. Default: Levenshtein distance.

  • dtype (np.dtype) – data type of the distance matrix, default: np.uint8. Note: make sure to change the dtype if the metric does not return integers.

Returns:

Y – A \(m_A\) by \(m_B\) distance matrix is returned. For each \(i\) and \(j\), the metric dist(u=XA[i], v=XB[j]) is computed and stored in the \(ij\) th entry.

Return type:

ndarray

pyrepseq.distance.downsample(seqs: Iterable[str] | DataFrame | None, maxseqs: int | None = None)[source]

Random downsampling of a list of sequences. Also works for standard pyrepseq TCR DataFrames (see pyrepseq.io.standardize_dataframe()).

Parameters:
  • seqs (Union[Iterable[str], DataFrame]) – Input Iterable of strings, or TCR DataFrame.

  • maxseqs (Optional[int]) – Max number of sequences to keep. Defaults to None.

Returns:

  • Random subset of maxseqs elements from the input collection.

  • If maxseqs is None, returns the input collection without modification.

pyrepseq.distance.find_neighbor_pairs(seqs, neighborhood=<function hamming_neighbors>)[source]

Find neighboring sequences in a list of unique sequences.

Parameters:

neighborhood (callable returning an iterable of neighbors) –

Return type:

list of tuples (seq1, seq2)

pyrepseq.distance.find_neighbor_pairs_index(seqs, neighborhood=<function hamming_neighbors>)[source]

Find neighboring sequences in a list of unique sequences.

Parameters:

neighborhood (callable returning an iterable of neighbors) –

Return type:

list of tuples (index1, index2)

pyrepseq.distance.hamming_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY', variable_positions=None)[source]

Iterator over Hamming neighbors of a string x.

Parameters:
  • alphabet (iterable of characters) –

  • variable_positions (iterable of positions to be varied (default: all)) –
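
A generator over Hamming neighbors can be sketched by substituting every alternative letter at every allowed position. The function below is a hypothetical stand-in that mirrors the documented signature; whether the library iterator also yields x itself is not stated here, and this sketch does not.

```python
def hamming_neighbors_sketch(
    x, alphabet="ACDEFGHIKLMNPQRSTVWY", variable_positions=None
):
    """Yield all strings that differ from x by exactly one substitution."""
    if variable_positions is None:
        positions = range(len(x))  # default: vary every position
    else:
        positions = variable_positions
    for i in positions:
        for aa in alphabet:
            if aa != x[i]:
                yield x[:i] + aa + x[i + 1:]
```

A string of length L over the 20-letter alphabet has 19 * L Hamming neighbors.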

pyrepseq.distance.hierarchical_clustering(seqs: Iterable, metric: Metric | None = None, linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6})[source]

Hierarchical clustering by sequence similarity.

pyrepseq.distance.isdist1(x, reference, neighborhood=<function levenshtein_neighbors>)[source]

Is the string x distance 1 away from any of the strings in the reference set

pyrepseq.distance.levenshtein_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY')[source]

Iterator over Levenshtein neighbors of a string x
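
Levenshtein neighbors extend Hamming neighbors with single deletions and insertions. The sketch below enumerates all three edit types. It is a hypothetical illustration that may yield the same string more than once (for example, inserting A before or after an existing A), so callers wanting unique neighbors should collect the output in a set.

```python
def levenshtein_neighbors_sketch(x, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Yield strings one substitution, deletion, or insertion away from x."""
    # Substitutions: replace one position with a different letter.
    for i in range(len(x)):
        for aa in alphabet:
            if aa != x[i]:
                yield x[:i] + aa + x[i + 1:]
    # Deletions: drop one position.
    for i in range(len(x)):
        yield x[:i] + x[i + 1:]
    # Insertions: add one letter at any of the len(x) + 1 gaps.
    for i in range(len(x) + 1):
        for aa in alphabet:
            yield x[:i] + aa + x[i:]
```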

pyrepseq.distance.load_pcDelta_background(return_bins=True)[source]

Loads pre-computed background pcDelta distributions calculated for PBMC TCRs.

Data: Sample W_F1_2018 from Minervina et al. https://zenodo.org/record/4065547/

Returns:

  • back (pd.DataFrame) – DataFrame with coincidence probabilities

  • bins (ndarray [if return_bins = True]) – Delta bins to be used as bins for other data

pyrepseq.distance.next_nearest_neighbors(x, neighborhood, maxdistance=2)[source]

Set of next nearest neighbors of a string x.

Parameters:
  • alphabet (iterable of characters) –

  • neighborhood (neighborhood iterator) –

  • maxdistance (go up to maxdistance nearest neighbor) –

Return type:

set of neighboring sequences

pyrepseq.distance.nndist_hamming(seq, reference, maxdist=4)[source]

Calculate the nearest-neighbor distance by Hamming distance

Parameters:
  • seqs (list of sequences) –

  • seq (sequence instance) –

  • reference (set of reference sequences) –

  • maxdist (distance beyond which to cut off the calculation (needs to be <=4)) –

Returns:

  • distance of nearest neighbor

  • Note: This function does not check if strings are of the same length.

pyrepseq.distance.pcDelta(seqs: Iterable, seqs2: Iterable | None = None, metric: Metric | None = None, bins: int | Iterable | None = None, normalize: bool = True, pseudocount: float = 0.0, maxseqs: int | None = None)[source]

Calculates binned near-coincidence probabilities \(p_C(\Delta)\) among input sequences.

Returns:

(normalized) histogram of sequence distances

Return type:

np.ndarray

pyrepseq.distance.pcDelta_grouped(df, by, seq_columns, **kwargs)[source]

Near-coincidence probabilities conditioned to within-group comparisons.

Parameters:
  • df (pd.DataFrame) –

  • by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby

  • seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis

  • **kwargs (keyword arguments) – passed on to pcDelta

Returns:

pcs – Returns a DataFrame of pC(delta) for each group

Return type:

pd.DataFrame

pyrepseq.distance.pcDelta_grouped_cross(df, by, seq_columns, condensed=False, **kwargs)[source]

Near-coincidence probabilities conditioned to cross-group comparisons.

Parameters:
  • df (pd.DataFrame) –

  • by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby

  • seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis

  • condensed (bool) – Return a condensed instead of squareform matrix (default: False)

  • **kwargs (keyword arguments) – passed on to pcDelta

Returns:

pcs – Returns a DataFrame of pC(delta) across pairs of groups

Return type:

pd.DataFrame

pyrepseq.distance.pdist(strings, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]

Pairwise distances between collection of strings. (scipy.spatial.distance.pdist equivalent for strings)

Deprecated since version 1.4: pyrepseq.pdist() is now deprecated in favour of the Metric object system (see pyrepseq.metric.Metric). Metric objects implement the calc_pdist_vector method which will perform the pdist computation. pyrepseq.pdist() will be removed in version 2.0.

Parameters:
  • strings (iterable of strings) – An m-length iterable.

  • metric (function, optional) – The distance metric to use. Default: Levenshtein distance.

  • dtype (np.dtype) – data type of the distance matrix, default: np.uint8. Note: make sure to change the dtype if the metric does not return integers.

Returns:

Y – Returns a condensed distance matrix Y. For each \(i\) and \(j\) (where \(i<j<m\) and \(m\) is the number of original observations), the metric dist(u=X[i], v=X[j]) is computed and stored in entry m * i + j - ((i + 2) * (i + 1)) // 2.

Return type:

ndarray
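
The condensed layout described above can be illustrated with a small pure-Python sketch that fills the vector using the stated index formula. The functions levenshtein and pdist_sketch are hypothetical helpers written for this example; they return plain lists and are not the library implementation.

```python
def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb))
            )
        prev = curr
    return prev[-1]


def pdist_sketch(strings, metric=levenshtein):
    """Condensed pairwise distances, stored at the documented index."""
    strings = list(strings)
    m = len(strings)
    condensed = [0] * (m * (m - 1) // 2)
    for i in range(m):
        for j in range(i + 1, m):
            index = m * i + j - ((i + 2) * (i + 1)) // 2
            condensed[index] = metric(strings[i], strings[j])
    return condensed
```

For three strings, the pairs (0, 1), (0, 2), and (1, 2) land in entries 0, 1, and 2 respectively.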

Nearest Neighbor

pyrepseq.nn.hash_based(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets')[source]

List all neighboring CDR3B sequences efficiently for small edit distances. The idea is to enumerate all possible sequences within a given distance and look each one up in a dictionary of the input sequences. This implementation is faster than the kdtree implementation for max_edits == 1.

Parameters:
  • seqs (iterable of strings) – list of CDR3B sequences

  • max_edits (int) – maximum edit distance defining the neighbors

  • max_returns (int or None) – maximum neighbor size

  • n_cpu (int) – number of CPU cores running in parallel

  • custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)

  • max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied

  • output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

Returns:

neighbors – neighbors along with their edit distances, in the format given by output_type:

  • if “triplets”, returns a list of (x_index, y_index, edit_distance) tuples

  • if “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(X_i, X_j), or 0 if not neighbors

  • if “ndarray”, returns a numpy 2D array representing the dense matrix

Return type:

array of 3D-tuples, sparse matrix, or dense matrix
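
The lookup idea behind the hash-based search can be sketched for an edit distance of 1: generate every single-edit variant of each sequence and check whether it occurs among the inputs. The helper names below are hypothetical, only max_edits == 1 is covered, and reporting each pair once with x_index < y_index is a choice made for this sketch that may differ from the library's convention. Identical sequences (distance 0) are not reported.

```python
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def single_edit_variants(x, alphabet=ALPHABET):
    """Set of strings exactly one edit away from x."""
    variants = set()
    for i in range(len(x)):
        variants.add(x[:i] + x[i + 1:])  # deletion
        for aa in alphabet:
            variants.add(x[:i] + aa + x[i + 1:])  # substitution
    for i in range(len(x) + 1):
        for aa in alphabet:
            variants.add(x[:i] + aa + x[i:])  # insertion
    variants.discard(x)  # substituting a letter by itself gives x back
    return variants


def hash_neighbors_sketch(seqs):
    """Return sorted (x_index, y_index, 1) triplets for distance-1 pairs."""
    index = defaultdict(list)  # sequence -> positions where it occurs
    for i, s in enumerate(seqs):
        index[s].append(i)
    triplets = []
    for i, s in enumerate(seqs):
        for variant in single_edit_variants(s):
            for j in index.get(variant, ()):
                if i < j:
                    triplets.append((i, j, 1))
    return sorted(triplets)
```

The cost per sequence depends on its length and the alphabet size rather than on the number of input sequences, which is why this strategy works well for small edit distances.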

pyrepseq.nn.kdtree(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', compression=1)[source]

List all neighboring CDR3B sequences efficiently within the given edit distance. With a KDTree, the algorithm runs in O(N log N) time, eliminating unnecessary comparisons. With the RapidFuzz library, the edit distance comparison is efficiently implemented in C++. With multiprocessing, the algorithm can take advantage of multiple CPU cores. This implementation is faster than the hash-based implementation for max_edits > 1.

Parameters:
  • seqs (iterable of strings) – list of CDR3B sequences

  • max_edits (int) – maximum edit distance defining the neighbors

  • max_returns (int or None) – maximum neighbor size

  • n_cpu (int) – number of CPU cores running in parallel

  • custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)

  • max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied

  • output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

Returns:

neighbors – neighbors along with their edit distances, in the format given by output_type:

  • if “triplets”, returns a list of (x_index, y_index, edit_distance) tuples

  • if “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(X_i, X_j), or 0 if not neighbors

  • if “ndarray”, returns a numpy 2D array representing the dense matrix

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

pyrepseq.nn.nearest_neighbor(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]

List all neighboring sequences efficiently within a given distance. The distance can be the Hamming distance, the Levenshtein distance, or a custom distance.

If seqs2 is not provided, every sequence is compared against every other sequence.

Parameters:
  • seqs (iterable of strings) – list of CDR3B sequences

  • max_edits (int) – maximum edit distance defining the neighbors

  • max_returns (int or None) – maximum neighbor size

  • n_cpu (int) – ignored

  • custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)

  • max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied

  • output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

  • seqs2 (iterable of strings or None) – another list of CDR3B sequences to compare against

Returns:

neighbors – neighbors along with their edit distances, in the format given by output_type:

  • if “triplets”, returns a list of (x_index, y_index, edit_distance) tuples

  • if “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(X_i, X_j), or 0 if not neighbors

  • if “ndarray”, returns a numpy 2D array representing the dense matrix

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

pyrepseq.nn.nearest_neighbor_tcrdist(df, chain='beta', max_edits=2, edit_on_trimmed=True, max_tcrdist=20, tcrdist_kwargs={}, **kwargs)[source]

List all neighboring TCR sequences efficiently within a given edit and TCRdist radius.

[Requires optional dependency pwseqdist]

Parameters:
  • chain ('alpha' or 'beta') –

  • max_edits (only return neighbors up to <= this edit distance) –

  • edit_on_trimmed (boolean) – apply TCRdist trimming on sequences before calculating edit distance

  • max_tcrdist (only return neighbor up to <= this TCR distance) –

  • tcrdist_kwargs (dict) – customized parameters for TCRdist calculation

  • **kwargs (passed on to nearest_neighbor function) –

Return type:

sparse matrix in (i, j, dist) format

pyrepseq.nn.symdel(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]

List all neighboring CDR3B sequences efficiently within the given distance. This is an improved version of the hash-based implementation.

If seqs2 is not provided, every sequence is compared against every other sequence, resulting in N(seqs)**2 combinations. Otherwise, seqs are compared against seqs2, resulting in N(seqs)*N(seqs2) combinations.

Parameters:
  • seqs (iterable of strings) – list of CDR3B sequences

  • max_edits (int) – maximum edit distance defining the neighbors

  • max_returns (int or None) – maximum neighbor size

  • n_cpu (int) – ignored

  • custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must satisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)

  • max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied

  • output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

  • seqs2 (iterable of strings or None) – another list of CDR3B sequences to compare against

Returns:

neighbors – neighbors along with their edit distances, in the format given by output_type:

  • if “triplets”, returns a list of (x_index, y_index, edit_distance) tuples

  • if “coo_matrix”, returns a scipy sparse matrix where C[i,j] = distance(X_i, X_j), or 0 if not neighbors

  • if “ndarray”, returns a numpy 2D array representing the dense matrix

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

Plotting submodule

pyrepseq.plotting.align_seqs(seqs)[source]

Align multiple sequences using mafft-linsi with default parameters.

Requires external dependency mafft-linsi to be installed.

Parameters:

seqs (iterable of strings) –

Returns:

aligned sequences (with gaps)

Return type:

list of strings

pyrepseq.plotting.label_axes(fig_or_axes, labels='ABCDEFGHIJKLMNOPQRSTUVWXYZ', labelstyle='%s', xy=(-0.1, 0.95), xycoords='axes fraction', **kwargs)[source]

Walks through axes and labels each. kwargs are collected and passed to annotate

Parameters:
  • fig (Figure or Axes to work on) –

  • labels (iterable or None) – iterable of strings to use to label the axes. If None, lower case letters are used.

  • loc (Where to put the label units (len=2 tuple of floats)) –

  • xycoords (loc relative to axes, figure, etc.) –

  • kwargs (to be passed to annotate) –

pyrepseq.plotting.labels_to_colors_hls(labels, palette_kws={'l': 0.5, 's': 0.8}, min_count=None)[source]

Map a list of labels to a list of unique colors. Uses seaborn.hls_palette.

Parameters:
  • df (pandas DataFrame with data) –

  • labels (list of labels) –

  • min_count (map all labels seen less than min_count to black) –

  • palette_kws (passed to seaborn.hls_palette) –

pyrepseq.plotting.labels_to_colors_tableau(labels, min_count=None)[source]

Map a list of labels to a list of unique colors. Uses Tableau_10 colors

Parameters:
  • df (pandas DataFrame with data) –

  • labels (list of labels) –

  • min_count (map all labels seen less than min_count to black) –

pyrepseq.plotting.rankfrequency(data, ax=None, normalize_x=True, normalize_y=False, log_x=True, log_y=True, scalex=1.0, scaley=1.0, **kwargs)[source]

Plot rank frequency plots.

Parameters:
  • data (array-like) – count data

  • ax (matplotlib.Axes) – axes on which to plot the data

  • normalize_x (bool, default:True) – whether to normalize counts to relative frequencies

  • normalize_y (bool, default:False) – whether to normalize ranks to cumulative probabilities

Returns:

Objects representing the plotted data.

Return type:

list of Line2D

pyrepseq.plotting.seqlogos(seqs, ax=None, **kwargs)[source]

Display a sequence logo.

Aligns sequences using align_seqs if they are not of equal length.

Parameters:
  • seqs (iterable of strings) – sequences to be displayed

  • ax (matplotlib.axes) – if None create new figure

  • **kwargs (dict) – passed on to logomaker.Logo

Return type:

axes, counts_matrix

pyrepseq.plotting.seqlogos_vj(df, cdr3_column, v_column, j_column, axes=None, **kwargs)[source]

Display a sequence logo with V and J gene information.

Parameters:
  • df (pd.DataFrame) – input data

  • cdr3_column (str) – column name for cdr3 sequences

  • v_column (str) – column name for v genes

  • j_column (str) – column name for j genes

  • **kwargs (dict) – passed on to seqlogos

pyrepseq.plotting.similarity_clustermap(df, alpha_column='cdr3a', beta_column='cdr3b', norm=None, bounds=array([0, 1, 2, 3, 4, 5, 6]), linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6}, cbar_kws={'format': '%d', 'label': 'Sequence Distance', 'orientation': 'horizontal'}, meta_columns=None, meta_to_colors=None, **kws)[source]

Plots a sequence-similarity clustermap.

Parameters:
  • df (pandas DataFrame with data) –

  • alpha_column (column name with alpha and beta amino acid information (set one to None for single chain plotting)) –

  • beta_column (column name with alpha and beta amino acid information (set one to None for single chain plotting)) –

  • norm (matplotlib.colors.Normalize subclass for turning distances into colors) –

  • bounds (bounds used for colormap matplotlib.colors.BoundaryNorm (only used if norm = None)) –

  • linkage_kws (keyword arguments for linkage algorithm) –

  • cluster_kws (keyword arguments for clustering algorithm) –

  • cbar_kws (keyword arguments for colorbar) –

  • meta_columns (list-like) – metadata to plot alongside the cluster assignment

  • meta_to_colors (list-like) – list of functions mapping metadata labels to colors; the first element of the list is for clusters

  • kws (keyword arguments passed on to the clustermap.) –

Metrics

General Metrics

class pyrepseq.metric.Metric[source]

Base abstract class for all metrics in pyrepseq. This class outlines the interface that all metrics will implement. If a variable or function parameter can be any type of metric, then it should be typed to this class.

abstract calc_cdist_matrix(anchors: Iterable, comparisons: Iterable) → ndarray[source]

Calculates a cdist matrix between two collections of objects.

Parameters:
  • anchors (Iterable) – A collection of objects to measure distances from.

  • comparisons (Iterable) – A collection of objects to measure distances to.

Returns:

A matrix of shape (N,M) where N is the number of elements in anchors and M is the number of elements in comparisons. The element in the ith row and jth column will contain the distance between the ith element of anchors and the jth element of comparisons.

Return type:

numpy.ndarray
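The (N, M) layout described above can be sketched in plain Python. This is only an illustration of the shape and indexing convention: cdist_matrix and hamming are hypothetical helpers, nested lists stand in for the numpy array, and Hamming distance stands in for whatever distance a concrete metric implements.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cdist_matrix(anchors, comparisons, distance):
    """Nested-list stand-in for an (N, M) cross-distance matrix."""
    anchors, comparisons = list(anchors), list(comparisons)
    # Row i holds the distances from anchors[i] to every comparison
    return [[distance(a, c) for c in comparisons] for a in anchors]


matrix = cdist_matrix(["CASSF", "CASTF"], ["CASSF", "CAGSF", "CATTF"], hamming)
print(matrix)  # [[0, 1, 2], [1, 2, 1]]
```

Here matrix[i][j] is the distance between the ith anchor and the jth comparison, matching the row and column convention stated in the return description.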

abstract calc_pdist_vector(instances: Iterable) → ndarray[source]

Calculates a pdist vector given a collection of objects.

Parameters:

instances (Iterable) – A collection of objects to measure distances between.

Returns:

A vector of shape (N*(N-1)/2,) where N is the number of elements in instances. The vector contains the distance between every possible pair of objects in instances.

Return type:

numpy.ndarray
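A condensed pairwise vector of length N*(N-1)/2 can be sketched as follows. The ordering of pairs is not stated here; this sketch assumes the convention used by scipy.spatial.distance.pdist, that is (0,1), (0,2), ..., (1,2), and so on. The names pdist_vector, condensed_index, and hamming are hypothetical helpers.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pdist_vector(instances, distance):
    """List stand-in for a condensed pairwise distance vector."""
    instances = list(instances)
    # combinations() yields pairs (i, j) with i < j in row-major order
    return [distance(a, b) for a, b in combinations(instances, 2)]


def condensed_index(i, j, n):
    """Position of pair (i, j), with i < j, in a condensed vector over n items."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)


seqs = ["CASSF", "CASTF", "CAGSF", "CATTF"]
vector = pdist_vector(seqs, hamming)
print(vector)                               # [1, 1, 2, 2, 1, 2]
print(vector[condensed_index(1, 3, len(seqs))])  # 1: distance between seqs[1] and seqs[3]
```

Under the same ordering assumption, scipy.spatial.distance.squareform converts such a vector into a full symmetric matrix.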

abstract property name: str

The name of the metric as a string.

class pyrepseq.metric.Levenshtein[source]

Levenshtein distance, also known as edit distance.

class pyrepseq.metric.WeightedLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1)[source]

A generalised Levenshtein distance which supports different weights for insertions, deletions, and substitutions.

Parameters:
  • insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.

  • deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.

  • substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.
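To make the effect of the three weights concrete, here is a minimal dynamic-programming sketch of a weighted edit distance. It is not the library's implementation: weighted_levenshtein is a hypothetical function, and the direction convention (edits transform the first string into the second) is an assumption.

```python
def weighted_levenshtein(a, b, insertion_weight=1, deletion_weight=1,
                         substitution_weight=1):
    """Minimum total cost of edits that turn string a into string b."""
    # prev[j] = cost of turning the current prefix of a into b[:j]
    prev = [j * insertion_weight for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        curr = [i * deletion_weight]  # delete all of a[:i] to reach ""
        for j, char_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if char_a == char_b else substitution_weight)
            delete = prev[j] + deletion_weight
            insert = curr[j - 1] + insertion_weight
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


print(weighted_levenshtein("CASSF", "CASTF"))                         # 1
print(weighted_levenshtein("CASSF", "CASSLF", insertion_weight=3))    # 3
print(weighted_levenshtein("CASSF", "CASTF", substitution_weight=5))  # 2
```

With all weights at 1 this reduces to the ordinary Levenshtein distance. The last call shows that a costly substitution is replaced by a cheaper deletion plus insertion.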

TCR Metrics

class pyrepseq.metric.tcr_metric.TcrMetric[source]

Base abstract class for all metrics that operate on TCRs. TcrMetrics should expect DataFrames with each row representing a TCR, in the standard pyrepseq format (see pyrepseq.io.standardize_dataframe()). The input DataFrames must also have at least one TCR-related column. Furthermore, if the input DataFrame(s) do not have the columns required by the specific metric, the metric will raise a ValueError explaining which columns are missing. All values in the table should be IMGT-standardized.
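The missing-column behaviour can be pictured with a small sketch. check_required_columns is a hypothetical helper, not part of pyrepseq, and which columns a given metric requires is an assumption made for illustration; the column names follow the standard format.

```python
def check_required_columns(columns, required):
    """Raise ValueError naming any required column that is absent.

    Hypothetical helper for illustration only; not part of pyrepseq.
    """
    present = set(columns)
    missing = [name for name in required if name not in present]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))


# Assumed requirement for a paired-chain CDR3 metric (illustrative only)
required = ["CDR3A", "CDR3B"]

check_required_columns(["TRAV", "CDR3A", "TRBV", "CDR3B"], required)  # passes silently

try:
    check_required_columns(["TRAV", "CDR3A"], required)
except ValueError as error:
    print(error)  # missing required columns: CDR3B
```

With a pandas DataFrame, the first argument would be df.columns.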

abstract calc_cdist_matrix(anchors: DataFrame, comparisons: DataFrame) → ndarray[source]

Calculates a cdist matrix between two DataFrames containing TCR data.

Parameters:
  • anchors (DataFrame) – A DataFrame containing data on TCRs to measure distances from.

  • comparisons (DataFrame) – A DataFrame containing data on TCRs to measure distances to.

Returns:

A matrix of shape (N,M) where N is the number of TCRs in anchors and M is the number of TCRs in comparisons. The element in the ith row and jth column will contain the distance between the ith TCR of anchors and the jth TCR of comparisons.

Return type:

numpy.ndarray

abstract calc_pdist_vector(instances: DataFrame) → ndarray[source]

Calculates a pdist vector given a DataFrame of TCRs.

Parameters:

instances (DataFrame) – A DataFrame of TCRs to measure distances between.

Returns:

A vector of shape (N*(N-1)/2,) where N is the number of TCRs in instances. The vector contains the distance between every possible pair of TCRs in instances.

Return type:

numpy.ndarray

class pyrepseq.metric.tcr_metric.AlphaCdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the alpha chain CDR3 sequences.

Parameters:
  • insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.

  • deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.

  • substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.

class pyrepseq.metric.tcr_metric.BetaCdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the beta chain CDR3 sequences.

Parameters:
  • insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.

  • deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.

  • substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.

class pyrepseq.metric.tcr_metric.Cdr3Levenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, alpha_weight: int = 1, beta_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the alpha and beta chain CDR3 sequences.

Parameters:
  • insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.

  • deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.

  • substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.

  • alpha_weight (int) – An integer multiplier for edits on the alpha chain. Defaults to 1.

  • beta_weight (int) – An integer multiplier for edits on the beta chain. Defaults to 1.
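How the per-chain distances are combined is not spelled out here. One plausible reading, sketched below purely as an assumption, is a weighted sum of the alpha and beta CDR3 edit distances. paired_cdr3_distance and levenshtein are hypothetical helpers, each TCR is represented as a plain dict keyed by the standard column names CDR3A and CDR3B, and the sequences are made up.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (char_a != char_b))) # substitution
        prev = curr
    return prev[-1]


def paired_cdr3_distance(tcr_a, tcr_b, alpha_weight=1, beta_weight=1):
    """Weighted sum of per-chain CDR3 edit distances (assumed combination rule)."""
    alpha = levenshtein(tcr_a["CDR3A"], tcr_b["CDR3A"])
    beta = levenshtein(tcr_a["CDR3B"], tcr_b["CDR3B"])
    return alpha_weight * alpha + beta_weight * beta


tcr_1 = {"CDR3A": "CAVRDF", "CDR3B": "CASSLGF"}
tcr_2 = {"CDR3A": "CAVKDF", "CDR3B": "CASSF"}
print(paired_cdr3_distance(tcr_1, tcr_2))                  # 3 (alpha 1 + beta 2)
print(paired_cdr3_distance(tcr_1, tcr_2, alpha_weight=2))  # 4
```

Under this reading, setting one chain weight to 0 leaves only the other chain's distance.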

class pyrepseq.metric.tcr_metric.AlphaCdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the alpha chain CDR1, CDR2, and CDR3 sequences.

Parameters:
  • insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.

  • deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.

  • substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.

  • cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.

  • cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.

  • cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.

class pyrepseq.metric.tcr_metric.BetaCdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the beta chain CDR1, CDR2, and CDR3 sequences.

Parameters:
  • insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.

  • deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.

  • substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.

  • cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.

  • cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.

  • cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.

class pyrepseq.metric.tcr_metric.CdrLevenshtein(insertion_weight: int = 1, deletion_weight: int = 1, substitution_weight: int = 1, alpha_weight: int = 1, beta_weight: int = 1, cdr1_weight: int = 1, cdr2_weight: int = 1, cdr3_weight: int = 1)[source]

A TcrMetric that measures the Levenshtein distance between the alpha and beta chain CDR1, CDR2, and CDR3 sequences.

Parameters:
  • insertion_weight (int) – An integer multiplier for insertions. Defaults to 1.

  • deletion_weight (int) – An integer multiplier for deletions. Defaults to 1.

  • substitution_weight (int) – An integer multiplier for substitutions. Defaults to 1.

  • cdr1_weight (int) – An integer multiplier for edits on the CDR1. Defaults to 1.

  • cdr2_weight (int) – An integer multiplier for edits on the CDR2. Defaults to 1.

  • cdr3_weight (int) – An integer multiplier for edits on the CDR3. Defaults to 1.

  • alpha_weight (int) – An integer multiplier for edits on the alpha chain. Defaults to 1.

  • beta_weight (int) – An integer multiplier for edits on the beta chain. Defaults to 1.

class pyrepseq.metric.tcr_metric.AlphaCdr3Tcrdist[source]

TcrDist applied to the alpha chain CDR3 sequences.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.BetaCdr3Tcrdist[source]

TcrDist applied to the beta chain CDR3 sequences.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.Cdr3Tcrdist[source]

TcrDist applied to the alpha and beta chain CDR3 sequences.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.AlphaTcrdist[source]

TcrDist applied to the alpha chain.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.BetaTcrdist[source]

TcrDist applied to the beta chain.

[Requires optional tcrdist dependency.]

class pyrepseq.metric.tcr_metric.Tcrdist[source]

TcrDist applied to the alpha and beta chain.

[Requires optional tcrdist dependency.]