API Reference

main module

IO

pyrepseq.io.isvalidaa(string)

Returns True if string is composed only of characters from the standard amino acid alphabet.

pyrepseq.io.isvalidcdr3(string)

Returns True if string is a valid CDR3 sequence.

Checks the following:
  • first amino acid is a cysteine (C)

  • last amino acid is either phenylalanine (F), tryptophan (W), or cysteine (C)

  • each amino acid is part of the standard amino acid alphabet

See http://www.imgt.org/IMGTScientificChart/Numbering/IMGTIGVLsuperfamily.html and also https://doi.org/10.1093/nar/gkac190
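The three checks listed above can be sketched in plain Python. This is an illustrative re-implementation, not the pyrepseq source: the name is_valid_cdr3_sketch and the handling of empty or non-string input are assumptions, and the alphabet is the 20-letter default used elsewhere in this reference.

```python
# Hypothetical sketch of the checks listed above; not the library's own code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_cdr3_sketch(seq):
    """Return True if seq passes the three CDR3 checks."""
    # Assumption: empty or non-string input counts as invalid.
    if not isinstance(seq, str) or len(seq) == 0:
        return False
    return (
        seq[0] == "C"                              # first residue is cysteine
        and seq[-1] in "FWC"                       # last residue is F, W, or C
        and all(aa in AMINO_ACIDS for aa in seq)   # standard alphabet only
    )

print(is_valid_cdr3_sketch("CASSLGQAYEQYF"))  # True
print(is_valid_cdr3_sketch("ASSLGQAYEQYF"))   # False: no leading C
print(is_valid_cdr3_sketch("CASSXGQAYEQYF"))  # False: X is not a standard residue
```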

pyrepseq.io.multimerge(dfs, on, suffixes=None, **kwargs)

Merge multiple dataframes on a common column.

Provides support for custom suffixes.

Parameters:
  • on ('index' or column name) –

  • suffixes ([list-like | None]) – list of suffixes to append to the data

  • **kwargs (keyword arguments passed along to pd.merge) –

Return type:

merged dataframe

pyrepseq.io.standardize_dataframe(df_old, col_mapper: Mapping, standardize: bool = True, species='HomoSapiens', tcr_enforce_functional=True, tcr_precision='gene', mhc_precision='gene', suppress_warnings=False)

Utility function to organise TCR data into a standardised format.

If standardization is enabled (True by default), the function will additionally attempt to standardise the TCR and MHC gene symbols to be IMGT-compliant, and CDR3 sequences to be valid. The appropriate standardization procedures will be applied for columns with the following names:

  • TRAV / TRBV

  • TRAJ / TRBJ

  • CDR3A / CDR3B

  • MHCA / MHCB

Parameters:
  • df_old (pandas.DataFrame) – Source DataFrame from which to pull data.

  • col_mapper (Mapping) – A mapping object, such as a dictionary, which maps the old column names to the new column names.

  • standardize (bool) – When set to False, gene name standardisation is not attempted. Defaults to True.

  • species (str) – Name of the species from which the TCR data is derived, given in binomial nomenclature and camel-cased. Defaults to 'HomoSapiens'.

  • tcr_enforce_functional (bool) – When set to True, TCR genes that are not functional (i.e. ORF or pseudogene) are replaced with None. Defaults to True.

  • tcr_precision (str) – Level of precision to trim the TCR gene data to ('gene' or 'allele'). Defaults to 'gene'.

  • mhc_precision (str) – Level of precision to trim the MHC gene data to ('gene', 'protein' or 'allele'). Defaults to 'gene'.

  • suppress_warnings (bool) – If True, suppresses warnings that are emitted when the standardisation of certain values fails. Defaults to False.

Returns:

Standardised DataFrame containing the original data, cleaned.

Return type:

pandas.DataFrame

Stats

pyrepseq.stats.jaccard_index(A, B)

Calculate the Jaccard index for two sets.

This measure is defined as \(J(A, B) = |A \cap B| / |A \cup B|\)

A, B: iterables (will be converted to sets). If A, B are pd.Series, NA values will be dropped first
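A minimal sketch of the Jaccard formula using built-in sets. The function name jaccard_sketch and the error raised for two empty inputs are assumptions made for illustration; NA handling for pd.Series is not reproduced here.

```python
def jaccard_sketch(a, b):
    """Jaccard index of two iterables, each converted to a set."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Assumption: treat the 0/0 case as an error rather than returning a value.
    if not union:
        raise ValueError("Jaccard index is undefined for two empty sets")
    return len(set_a & set_b) / len(union)

# One shared sequence out of four distinct sequences overall.
print(jaccard_sketch(["CASSF", "CASTF", "CAWSF"], ["CASSF", "CATTF"]))  # 0.25
```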

pyrepseq.stats.overlap_coefficient(A, B)

Calculate the overlap coefficient for two sets.

This measure is defined as \(O(A, B) = |A \cap B| / \min(|A|, |B|)\)

A, B: iterables (will be converted to sets). na values will be dropped first
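As a sketch of the definition above, the overlap coefficient equals 1 whenever one set is contained in the other, because the denominator is the size of the smaller set. The function name and the error on empty input are illustrative assumptions.

```python
def overlap_coefficient_sketch(a, b):
    """Overlap coefficient of two iterables, each converted to a set."""
    set_a, set_b = set(a), set(b)
    smaller = min(len(set_a), len(set_b))
    # Assumption: an empty set makes the ratio undefined, so raise.
    if smaller == 0:
        raise ValueError("overlap coefficient is undefined for an empty set")
    return len(set_a & set_b) / smaller

# B is a subset of A, so the coefficient is 1 even though the sets differ.
print(overlap_coefficient_sketch([1, 2, 3, 4], [3, 4]))  # 1.0
print(overlap_coefficient_sketch([1, 2, 3, 4], [4, 5]))  # 0.5
```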

pyrepseq.stats.powerlaw_mle_alpha(c, cmin=1.0, method='exact', **kwargs)

Maximum likelihood estimate of the power-law exponent.

Parameters:
  • c (counts) –

  • cmin (only counts >= cmin are included in fit) –

  • method (one of ['simple', 'continuitycorrection', 'exact']) –

    'simple': Uses an analytical formula that is exact in the continuous case (Eq. B17 in Clauset et al. arXiv 0706.1062v2)

    'continuitycorrection': Applies a continuity correction to the analytical formula (more accurate for integer counts)

    'exact': Numerically maximizes the discrete log-likelihood

  • kwargs (dict) – passed on to scipy.optimize.minimize_scalar. Default: bounds=[1.5, 4.5], method='bounded'

Returns:

estimated power-law exponent

Return type:

float
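The two closed-form estimators can be sketched as follows, assuming the standard forms from Clauset et al.: alpha = 1 + n / sum(log(c / c0)), with c0 = cmin in the continuous case and c0 = cmin - 0.5 with the continuity correction. The function name and the boolean flag are hypothetical and do not mirror the library's method argument; the 'exact' numerical maximization is not shown.

```python
import math

def powerlaw_alpha_simple(counts, cmin=1.0, continuity_correction=False):
    """Closed-form power-law exponent estimate (illustrative sketch).

    Only counts >= cmin enter the fit. With continuity_correction=True the
    reference value cmin is shifted down by 0.5, as is commonly done for
    integer counts.
    """
    kept = [c for c in counts if c >= cmin]
    c0 = cmin - 0.5 if continuity_correction else cmin
    log_sum = sum(math.log(c / c0) for c in kept)
    if log_sum == 0:
        raise ValueError("estimate is undefined when all counts equal cmin")
    return 1.0 + len(kept) / log_sum

counts = [1, 1, 1, 2, 2, 3, 5, 8, 20]
print(powerlaw_alpha_simple(counts))
print(powerlaw_alpha_simple(counts, continuity_correction=True))
```

On this toy sample the corrected estimate is lower than the uncorrected one, because shifting the reference value down increases every log term.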

pyrepseq.stats.powerlaw_sample(size=1, xmin=1.0, alpha=2.0)

Draw samples from a discrete power-law.

Uses an approximate transformation technique, see Eq. D6 in Clauset et al. arXiv 0706.1062v2 for details.

Parameters:
  • size (number of values to draw) –

  • xmin (minimal value) –

  • alpha (power-law exponent) –

Return type:

array of integer samples
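A sketch of the approximate transformation method, assuming the form x = floor((xmin - 1/2) (1 - r)^(-1/(alpha - 1)) + 1/2) for r uniform on [0, 1). The function name and the seed argument are hypothetical additions, and the sketch returns a plain list rather than an array.

```python
import math
import random

def powerlaw_sample_sketch(size=1, xmin=1.0, alpha=2.0, seed=None):
    """Draw approximate discrete power-law samples (illustrative sketch)."""
    rng = random.Random(seed)  # seed is a hypothetical convenience argument
    samples = []
    for _ in range(size):
        r = rng.random()  # uniform on [0, 1), so 1 - r is never zero
        x = (xmin - 0.5) * (1.0 - r) ** (-1.0 / (alpha - 1.0)) + 0.5
        samples.append(math.floor(x))
    return samples

samples = powerlaw_sample_sketch(size=10, xmin=1.0, alpha=2.0, seed=0)
print(samples)
```

For integer xmin, r = 0 maps to exactly xmin, so no sample falls below the minimal value.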

Distance

pyrepseq.distance.calculate_neighbor_numbers(seqs, neighborhood=<function levenshtein_neighbors>)

Calculate the number of neighbors for each sequence in a list.

Parameters:
  • seqs (list of sequences) –

  • neighborhood (function returning iterator over neighbors) –

Return type:

integer array of number of neighboring sequences

pyrepseq.distance.cdist(stringsA, stringsB, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)

Compute distance between each pair of the two collections of strings.

(scipy.spatial.distance.cdist equivalent for strings)

Parameters:
  • stringsA (iterable of strings) – An mA-length iterable.

  • stringsB (iterable of strings) – An mB-length iterable.

  • metric (function, optional) – The distance metric to use. Default: Levenshtein distance.

  • dtype (np.dtype) – data type of the distance matrix (default: np.uint8). Note: make sure to change the dtype if the metric does not return integers.

Returns:

Y – An \(m_A\) by \(m_B\) distance matrix. For each \(i\) and \(j\), the metric dist(u=stringsA[i], v=stringsB[j]) is computed and stored in the \(ij\)-th entry.

Return type:

ndarray

pyrepseq.distance.downsample(seqs, maxseqs)

Random downsampling of a list of sequences.

Also works for tuples (seqs_alpha, seqs_beta).

pyrepseq.distance.find_neighbor_pairs(seqs, neighborhood=<function hamming_neighbors>)

Find neighboring sequences in a list of unique sequences.

Parameters:

neighborhood (callable returning an iterable of neighbors) –

Return type:

list of tuples (seq1, seq2)
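One way to find neighbor pairs without comparing all sequences against each other is to enumerate each sequence's neighborhood and look the candidates up in a set. The sketch below illustrates that idea under assumptions: both function names are hypothetical, and reporting each unordered pair once is a choice of this sketch, not a documented property of the library.

```python
def hamming_neighbors_sketch(x, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Yield every string that differs from x at exactly one position."""
    for i, old in enumerate(x):
        for new in alphabet:
            if new != old:
                yield x[:i] + new + x[i + 1:]

def find_neighbor_pairs_sketch(seqs, neighborhood=hamming_neighbors_sketch):
    """Return pairs of sequences in seqs that are neighbors of each other."""
    reference = set(seqs)  # constant-time membership tests
    pairs = []
    for seq in seqs:
        for neighbor in neighborhood(seq):
            # seq < neighbor keeps each unordered pair only once.
            if neighbor in reference and seq < neighbor:
                pairs.append((seq, neighbor))
    return pairs

print(find_neighbor_pairs_sketch(["CASSF", "CATSF", "CAWWF"]))
# [('CASSF', 'CATSF')]
```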

pyrepseq.distance.find_neighbor_pairs_index(seqs, neighborhood=<function hamming_neighbors>)

Find neighboring sequences in a list of unique sequences.

Parameters:

neighborhood (callable returning an iterable of neighbors) –

Return type:

list of tuples (index1, index2)

pyrepseq.distance.hamming_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY', variable_positions=None)

Iterator over Hamming neighbors of a string x.

Parameters:
  • alphabet (iterable of characters) –

  • variable_positions (iterable of positions to be varied (default: all)) –
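To make the role of variable_positions concrete, here is a minimal generator with the same parameters. It is a sketch written for this reference, not the library's implementation.

```python
def hamming_neighbors_sketch(x, alphabet="ACDEFGHIKLMNPQRSTVWY",
                             variable_positions=None):
    """Yield strings that differ from x by one substitution."""
    # Default: every position of x may be varied.
    positions = range(len(x)) if variable_positions is None else variable_positions
    for i in positions:
        for letter in alphabet:
            if letter != x[i]:
                yield x[:i] + letter + x[i + 1:]

# 3 positions, 19 alternative letters each.
print(len(list(hamming_neighbors_sketch("CAF"))))                          # 57
# Restricting variation to the middle position.
print(len(list(hamming_neighbors_sketch("CAF", variable_positions=[1]))))  # 19
```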

pyrepseq.distance.hierarchical_clustering(seqs, pdist_kws={}, linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6})

Hierarchical clustering by sequence similarity.

Parameters:
  • pdist_kws (keyword arguments for distance calculation) –

  • linkage_kws (keyword arguments for linkage algorithm) –

  • cluster_kws (keyword arguments for clustering algorithm) –

pyrepseq.distance.isdist1(x, reference, neighborhood=<function levenshtein_neighbors>)

Checks whether the string x is at distance 1 from any of the strings in the reference set.

pyrepseq.distance.levenshtein_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY')

Iterator over Levenshtein neighbors of a string x
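A Levenshtein neighborhood consists of all single substitutions, deletions, and insertions. The sketch below enumerates them and uses the result for a distance-1 membership test in the spirit of isdist1. Both function names are hypothetical, and removing duplicate neighbors is a choice of this sketch.

```python
def levenshtein_neighbors_sketch(x, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Yield each string at Levenshtein distance 1 from x exactly once."""
    seen = set()
    candidates = []
    for i in range(len(x)):
        # Substitutions at position i.
        candidates.extend(x[:i] + a + x[i + 1:] for a in alphabet if a != x[i])
        # Deletion of position i.
        candidates.append(x[:i] + x[i + 1:])
    for i in range(len(x) + 1):
        # Insertions before position i (or at the end).
        candidates.extend(x[:i] + a + x[i:] for a in alphabet)
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            yield candidate

def isdist1_sketch(x, reference, neighborhood=levenshtein_neighbors_sketch):
    """True if any neighbor of x is contained in the reference set."""
    return any(n in reference for n in neighborhood(x))

reference = {"CASSF", "CATF"}
print(isdist1_sketch("CASF", reference))   # True: one insertion gives CASSF
print(isdist1_sketch("CWWWF", reference))  # False
```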

pyrepseq.distance.load_pcDelta_background(return_bins=True)

Loads pre-computed background pcDelta distributions calculated for PBMC TCRs.

Data: Sample W_F1_2018 from Minervina et al. https://zenodo.org/record/4065547/

Returns:

  • back (pd.DataFrame) – DataFrame with coincidence probabilities

  • bins (ndarray [if return_bins = True]) – Delta bins to be used as bins for other data

pyrepseq.distance.next_nearest_neighbors(x, neighborhood, maxdistance=2)

Set of next nearest neighbors of a string x.

Parameters:
  • x (string) –

  • neighborhood (neighborhood iterator) –

  • maxdistance (go up to maxdistance nearest neighbor) –

Return type:

set of neighboring sequences

pyrepseq.distance.nndist_hamming(seq, reference, maxdist=4)

Calculate the nearest-neighbor distance by Hamming distance

Parameters:
  • seq (sequence instance) –

  • reference (set of reference sequences) –

  • maxdist (distance beyond which to cut off the calculation (needs to be <=4)) –

Returns:

distance of nearest neighbor

Note: This function does not check if strings are of the same length.

pyrepseq.distance.pcDelta(seqs, seqs2=None, bins=None, normalize=True, pseudocount=0.0, maxseqs=None, **kwargs)

Calculates binned near-coincidence probabilities \(p_C(\Delta)\) among input sequences.

Parameters:
  • seqs ([list of strings | tuple of lists]) – sequences, or (seqs_alpha, seqs_beta)

  • seqs2 ([list of strings | tuple of lists] (optional)) – second list of sequences for cross-comparisons

  • bins (iterable) – bins for the distances Delta (default: range(0, 25)). With bins=0, the exact coincidence probability is calculated.

  • normalize (bool) – whether to return pc (normalized) or raw counts

  • pseudocount (float) – pseudocount for a Bayesian estimation of coincidence frequencies, e.g. the Jeffreys prior value of 0.5

  • maxseqs (int) – maximal number of sequences to keep by random downsampling

  • **kwargs (dict) – passed on to pdist or cdist

Returns:

(normalized) histogram of sequence distances

Return type:

np.ndarray
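Conceptually, the result is a histogram of pairwise sequence distances. The pure-Python sketch below illustrates this for a single list of sequences. It assumes one bin per integer distance and normalization by the number of pairs; the names pc_delta_sketch and max_delta are hypothetical, and pseudocounts, downsampling, and cross-comparisons are left out.

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def pc_delta_sketch(seqs, max_delta=3, normalize=True):
    """Histogram of pairwise distances 0..max_delta among seqs."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    counts = Counter(levenshtein(a, b) for a, b in pairs)
    hist = [counts.get(delta, 0) for delta in range(max_delta + 1)]
    if normalize:
        # Assumption: normalize by the total number of pairs.
        return [h / len(pairs) for h in hist]
    return hist

seqs = ["CASSF", "CASSF", "CATSF", "CASSLF"]
print(pc_delta_sketch(seqs, normalize=False))  # [1, 4, 1, 0]
```

The first entry counts exact coincidences (distance 0), which here come from the duplicated sequence.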

pyrepseq.distance.pcDelta_grouped(df, by, seq_columns, **kwargs)

Near-coincidence probabilities conditioned on within-group comparisons.

Parameters:
  • df (pd.DataFrame) –

  • by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby

  • seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis

  • **kwargs (keyword arguments) – passed on to pcDelta

Returns:

pcs – Returns a DataFrame of pC(delta) for each group

Return type:

pd.DataFrame

pyrepseq.distance.pcDelta_grouped_cross(df, by, seq_columns, condensed=False, **kwargs)

Near-coincidence probabilities conditioned on cross-group comparisons.

Parameters:
  • df (pd.DataFrame) –

  • by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby

  • seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis

  • condensed (bool) – Return a condensed instead of squareform matrix (default: False)

  • **kwargs (keyword arguments) – passed on to pcDelta

Returns:

pcs – Returns a DataFrame of pC(delta) across pairs of groups

Return type:

pd.DataFrame

pyrepseq.distance.pdist(strings, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)

Pairwise distances between collection of strings.

(scipy.spatial.distance.pdist equivalent for strings)

Parameters:
  • strings (iterable of strings) – An m-length iterable.

  • metric (function, optional) – The distance metric to use. Default: Levenshtein distance.

  • dtype (np.dtype) – data type of the distance matrix (default: np.uint8). Note: make sure to change the dtype if the metric does not return integers.

Returns:

Y – A condensed distance matrix. For each \(i\) and \(j\) (where \(i<j<m\), and \(m\) is the number of strings), the metric dist(u=strings[i], v=strings[j]) is computed and stored in entry m * i + j - ((i + 2) * (i + 1)) // 2.

Return type:

ndarray
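The condensed index formula can be checked against the order in which itertools.combinations enumerates pairs. The sketch below uses a simple Hamming metric for brevity rather than the default Levenshtein distance; condensed_index and hamming are hypothetical helper names.

```python
from itertools import combinations

def condensed_index(i, j, m):
    """Position of pair (i, j), with i < j < m, in a condensed distance vector."""
    return m * i + j - ((i + 2) * (i + 1)) // 2

def hamming(u, v):
    """Number of mismatching positions (assumes equal-length strings)."""
    return sum(a != b for a, b in zip(u, v))

strings = ["CASSF", "CATSF", "CAWWF", "CASSY"]
m = len(strings)

# Condensed vector: one entry per unordered pair, in row-major order.
condensed = [hamming(strings[i], strings[j]) for i, j in combinations(range(m), 2)]
print(condensed)  # [1, 2, 1, 2, 2, 3]

# Look up the distance between strings[1] and strings[3].
print(condensed[condensed_index(1, 3, m)])  # 2
```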

plotting submodule

pyrepseq.plotting.align_seqs(seqs)

Align multiple sequences using mafft-linsi with default parameters.

Parameters:

seqs (iterable of strings) –

Returns:

aligned sequences (with gaps)

Return type:

list of strings

pyrepseq.plotting.label_axes(fig_or_axes, labels='ABCDEFGHIJKLMNOPQRSTUVWXYZ', labelstyle='%s', xy=(-0.1, 0.95), xycoords='axes fraction', **kwargs)

Walks through axes and labels each. kwargs are collected and passed to annotate

Parameters:
  • fig_or_axes (Figure or Axes to work on) –

  • labels (iterable or None) – iterable of strings to use to label the axes. If None, lower case letters are used.

  • xy (where to put the label, in the units given by xycoords (len=2 tuple of floats)) –

  • xycoords (coordinate system of xy: relative to axes, figure, etc.) –

  • kwargs (to be passed to annotate) –

pyrepseq.plotting.labels_to_colors_hls(labels, palette_kws={'l': 0.5, 's': 0.8}, min_count=None)

Map a list of labels to a list of unique colors. Uses seaborn.hls_palette.

Parameters:
  • labels (list of labels) –

  • min_count (map all labels seen fewer than min_count times to black) –

  • palette_kws (passed to seaborn.hls_palette) –

pyrepseq.plotting.labels_to_colors_tableau(labels, min_count=None)

Map a list of labels to a list of unique colors. Uses Tableau_10 colors

Parameters:
  • labels (list of labels) –

  • min_count (map all labels seen fewer than min_count times to black) –

pyrepseq.plotting.rankfrequency(data, ax=None, normalize_x=True, normalize_y=False, log_x=True, log_y=True, scalex=1.0, scaley=1.0, **kwargs)

Plot rank frequency plots.

Parameters:
  • data (array-like) – count data

  • ax (matplotlib.Axes) – axes on which to plot the data

  • normalize_x (bool, default:True) – whether to normalize counts to relative frequencies

  • normalize_y (bool, default:False) – whether to normalize ranks to cumulative probabilities

Returns:

Objects representing the plotted data.

Return type:

list of Line2D

pyrepseq.plotting.seqlogos(seqs, ax=None, **kwargs)

Display a sequence logo.

Aligns sequences using align_seqs if they are not of equal length.

Parameters:
  • seqs (iterable of strings) – sequences to be displayed

  • ax (matplotlib.axes) – if None create new figure

  • **kwargs (dict) – passed on to logomaker.Logo

Return type:

axes, counts_matrix

pyrepseq.plotting.seqlogos_vj(df, cdr3_column, v_column, j_column, axes=None, **kwargs)

Display a sequence logo with V and J gene information.

Parameters:
  • df (pd.DataFrame) – input data

  • cdr3_column (str) – column name for cdr3 sequences

  • v_column (str) – column name for v genes

  • j_column (str) – column name for j genes

  • **kwargs (dict) – passed on to seqlogos

pyrepseq.plotting.similarity_clustermap(df, alpha_column='cdr3a', beta_column='cdr3b', norm=None, linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6}, cbar_kws={'format': '%d', 'label': 'Sequence Distance', 'orientation': 'horizontal'}, meta_columns=None, meta_to_colors=None, **kws)

Plots a sequence-similarity clustermap.

Parameters:
  • df (pandas DataFrame with data) –

  • alpha_column (column name with alpha chain amino acid information) –

  • beta_column (column name with beta chain amino acid information) –

  • norm (function to normalize distances) –

  • linkage_kws (keyword arguments for linkage algorithm) –

  • cluster_kws (keyword arguments for clustering algorithm) –

  • cbar_kws (keyword arguments for colorbar) –

  • meta_columns (list-like) – metadata to plot alongside the cluster assignment

  • meta_to_colors (list-like) – list of functions mapping metadata labels to colors; the first element of the list is for clusters

  • kws (keyword arguments passed on to the clustermap.) –