
sequence

protein_letters_alph1 module-attribute

protein_letters_alph1: str = join(get_args(protein_letters_alph1_literal))

ACDEFGHIKLMNPQRSTVWY

protein_letters_alph1_extended module-attribute

protein_letters_alph1_extended: str = join(get_args(protein_letters_alph1_extended_literal))

ACDEFGHIKLMNPQRSTVWYBXZJUO

protein_letters_extended_and_gap module-attribute

protein_letters_extended_and_gap: str = join(get_args(protein_letters_alph1_extended_and_gap_literal))

ACDEFGHIKLMNPQRSTVWYBXZJUO-

ProfileDict module-attribute

ProfileDict = dict[int, ProfileEntry]

{1: {'A': 0.04, 'C': 0.12, ..., 'lod': {'A': -5, 'C': -9, ...}, 'type': 'W', 'info': 0.00, 'weight': 0.00}, 2: {...}, ...}

MutationEntry module-attribute

MutationEntry = TypedDict('MutationEntry', {'to': protein_letters_literal, 'from': protein_letters_literal})

A mapping holding the reference sequence amino acid type under 'to' and the resulting sequence amino acid type under 'from'.

mutation_dictionary module-attribute

mutation_dictionary = dict[int, MutationEntry]

The mapping of a residue number to a mutation entry containing the reference, 'to', and sequence, 'from', amino acid type.

sequence_dictionary module-attribute

sequence_dictionary = dict[int, protein_letters_literal]

The mapping of a residue number to the corresponding amino acid type.
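
For orientation, a minimal, hypothetical example of these container shapes (the residue numbers and amino acid letters below are invented):

# A mutation_dictionary maps residue number -> MutationEntry, following the
# 'to'/'from' convention described above
mutations = {15: {'to': 'A', 'from': 'K'}}

# A sequence_dictionary simply maps residue number -> amino acid type
sequence = {1: 'M', 2: 'G', 3: 'S'}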

MultipleSequenceAlignment

MultipleSequenceAlignment(alignment: MultipleSeqAlignment, aligned_sequence: str = None, alphabet: str = protein_letters_alph1_gaped, **kwargs)

Parameters:

  • alignment (MultipleSeqAlignment) –

    A MultipleSeqAlignment object which contains an array-like object of SeqRecords

  • aligned_sequence (str, default: None ) –

    Provide the sequence on which the alignment is based, otherwise the first sequence will be used from the alignment

  • alphabet (str, default: protein_letters_alph1_gaped ) –

    'ACDEFGHIKLMNPQRSTVWY-'

Sets

self.alignment - (Bio.Align.MultipleSeqAlignment)
self.number_of_sequences - 214
self.query - 'MGSTHLVLK...' from aligned_sequence argument OR alignment argument, index 0
self.query_with_gaps - 'MGS--THLVLK...'
self.counts - {1: {'A': 13, 'C': 1, 'D': 23, ...}, 2: {}, ...}
self.frequencies - {1: {'A': 0.05, 'C': 0.001, 'D': 0.1, ...}, 2: {}, ...}
self.observations - {1: 210, 2: 211, ...}

Source code in symdesign/sequence/__init__.py
def __init__(self, alignment: MultipleSeqAlignment, aligned_sequence: str = None,
             alphabet: str = protein_letters_alph1_gaped, **kwargs):
    """Take a Biopython MultipleSeqAlignment object and process for residue specific information. One-indexed

    Args:
        alignment: A MultipleSeqAlignment object which contains an array-like object of SeqRecords
        aligned_sequence: Provide the sequence on which the alignment is based, otherwise the first
            sequence will be used from the alignment
        alphabet: 'ACDEFGHIKLMNPQRSTVWY-'

    Sets:
        self.alignment - (Bio.Align.MultipleSeqAlignment)
        self.number_of_sequences - 214
        self.query - 'MGSTHLVLK...' from aligned_sequence argument OR alignment argument, index 0
        self.query_with_gaps - 'MGS--THLVLK...'
        self.counts - {1: {'A': 13, 'C': 1, 'D': 23, ...}, 2: {}, ...}
        self.frequencies - {1: {'A': 0.05, 'C': 0.001, 'D': 0.1, ...}, 2: {}, ...}
        self.observations - {1: 210, 2:211, ...}
    """
    # count_gaps: bool = False
    # count_gaps: Whether gaps (-) should be counted in column weights
    # if alignment is None:
    #     raise NotImplementedError(
    #         f"Can't create a {MultipleSequenceAlignment.__name__} with alignment=None")

    self.alignment = alignment
    self.alphabet = alphabet
    if aligned_sequence is None:
        self.query_aligned = str(alignment[0].seq)
    else:
        self.query_aligned = aligned_sequence

query instance-attribute

query: str

The sequence used to perform the MultipleSequenceAlignment search

sequence_weights property writable

sequence_weights: list[float]

Weights for each sequence in the alignment. By default, weights are based on each sequence's "surprise"; explicitly provided weights can also be set

alphabet_length property

alphabet_length

The number of sequence characters in the character alphabet

counts_by_position property

counts_by_position: ndarray

The counts of each alphabet character for each residue position in the alignment with shape (number of residues, alphabet size)

gap_index property

gap_index: int

The index in the alphabet where the gap character resides

query_aligned property writable

query_aligned: str

The sequence used to create the MultipleSequenceAlignment, potentially containing gaps from the alignment

query_length property

query_length: int

The number of residues in the MultipleSequenceAlignment query

length property

length: int

The number of sequences in the MultipleSequenceAlignment

number_of_positions property

number_of_positions: int

The number of residues plus gaps found in each sequence of the MultipleSequenceAlignment

sequences property

sequences: Iterator[str]

Iterate over the sequences present in the alignment

sequence_identifiers property

sequence_identifiers: list[str]

Return the identifiers associated with each sequence in the alignment

alphabet_type property writable

alphabet_type: alphabet_types_literal

The type of alphabet that the alignment is mapped to numerically

query_indices property

query_indices: ndarray

View the query as a boolean array (1, sequence_length) where gap positions, "-", are False

sequence_indices property writable

sequence_indices: ndarray

View the alignment as a boolean array (length, number_of_positions) where gap positions, "-", are False

numerical_alignment property

numerical_alignment: ndarray

Return the alignment as an integer array (length, number_of_positions) of the amino acid characters

Maps the instance .alphabet characters to their resulting sequence index

array property

array: ndarray

Return the alignment as a character array (length, number_of_positions) with numpy.string_ dtype

deletion_matrix property

deletion_matrix: ndarray

Return the number of deletions at every query aligned sequence index for each sequence in the alignment

frequencies property

frequencies: ndarray

Access the per-residue, alphabet frequencies with shape (number of residues, alphabet characters). Bounded between 0 and 1

observations_by_position property

observations_by_position: ndarray

The number of sequences with observations at each residue position in the alignment

gaps_per_position property

gaps_per_position: ndarray

The number of gap characters at each position in the sequence with shape (number of residues,)

from_dictionary classmethod

from_dictionary(named_sequences: dict[str, str], **kwargs) -> MultipleSequenceAlignment

Create a MultipleSequenceAlignment from a dictionary of named sequences

Parameters:

  • named_sequences (dict[str, str]) –

    Where name and sequence must be a string, i.e. {'1': 'MNTEELQVAAFEI...', ...}

Returns: The MultipleSequenceAlignment object for the provided sequences

Source code in symdesign/sequence/__init__.py
@classmethod
def from_dictionary(cls, named_sequences: dict[str, str], **kwargs) -> MultipleSequenceAlignment:
    """Create a MultipleSequenceAlignment from a dictionary of named sequences

    Args:
        named_sequences: Where name and sequence must be a string, i.e. {'1': 'MNTEELQVAAFEI...', ...}
    Returns:
        The MultipleSequenceAlignment object for the provided sequences
    """
    return cls(alignment=MultipleSeqAlignment(
        [SeqRecord(Seq(sequence), id=name)  # annotations={'molecule_type': 'Protein'},
         for name, sequence in named_sequences.items()]), **kwargs)
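
A brief usage sketch (the names and sequences are illustrative, not from the package):

from symdesign.sequence import MultipleSequenceAlignment

# Two aligned (equal-length, possibly gapped) sequences keyed by name
msa = MultipleSequenceAlignment.from_dictionary({
    'query': 'MNTEELQVAA',
    'hit_1': 'MNTDE-LQVA',
})
print(msa.length)               # 2 sequences
print(msa.number_of_positions)  # 10 alignment columns
print(msa.query)                # first sequence (no aligned_sequence was given)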

from_seq_records classmethod

from_seq_records(seq_records: Iterable[SeqRecord], **kwargs) -> MultipleSequenceAlignment

Create a MultipleSequenceAlignment from a SeqRecord Iterable

Parameters:

  • seq_records (Iterable[SeqRecord]) –

    An Iterable of Biopython SeqRecord instances, each providing a sequence and its identifier

Source code in symdesign/sequence/__init__.py
@classmethod
def from_seq_records(cls, seq_records: Iterable[SeqRecord], **kwargs) -> MultipleSequenceAlignment:
    """Create a MultipleSequenceAlignment from a SeqRecord Iterable

    Args:
        seq_records: An Iterable of Biopython SeqRecord instances, each providing a sequence and its identifier
    """
    return cls(alignment=MultipleSeqAlignment(seq_records), **kwargs)

weight_alignment_by_sequence

weight_alignment_by_sequence() -> list[float]

Measure diversity/surprise when comparing a single alignment entry to the rest of the alignment

Default means for weighting sequences. Important for creating representative sequence populations in the MSA as was described by Henikoff and Henikoff, 1994 (PMID: 7966282)

Operation is: SUM(1 / (column_j_aa_representation * aa_ij_count))

Returns:

  • list[float]

    Weight of each sequence in the MSA - [2.390, 2.90, 5.33, 1.123, ...]

Source code in symdesign/sequence/__init__.py
def weight_alignment_by_sequence(self) -> list[float]:
    """Measure diversity/surprise when comparing a single alignment entry to the rest of the alignment

    Default means for weighting sequences. Important for creating representative sequence populations in the MSA as
    was described by Henikoff and Henikoff, 1994 (PMID: 7966282)

    Operation is: SUM(1 / (column_j_aa_representation * aa_ij_count))

    Returns:
        Weight of each sequence in the MSA - [2.390, 2.90, 5.33, 1.123, ...]
    """
    # create a 1/ obs * counts = positional_weights
    #  alignment.number_of_positions - 0   1   2  ...
    #      / obs 0 [[32]   count seq 0 '-' -  2   0   0  ...   [[ 64   0   0 ...]  \
    # 1 / |  obs 1  [33] * count seq 1 'A' - 10  10   0  ... =  [330 330   0 ...]   |
    #      \ obs 2  [33]   count seq 2 'C' -  8   8   1  ...    [270 270  33 ...]] /
    #   ...   ...]               ...  ... ... ...
    position_weights = 1 / (self.observations_by_position[None, :] * self.counts_by_position)
    # take_along_axis from this with the transposed numerical_alignment (na) where each successive na idx
    # is the sequence position at the na and therefore is grabbing the position_weights by that index
    # finally sum along each sequence
    # The position_weights count seq idx must be taken by a sequence index. This happens to be on NA axis 1
    # at the moment so specified with .T and take using axis=0. Keeping both as axis=0 doesn't index
    # correctly. Maybe this is a case where 'F' array ordering is needed?
    sequence_weights = np.take_along_axis(position_weights, self.numerical_alignment.T, axis=0).sum(axis=0)
    logger.critical('New sequence_weights_', sequence_weights)

    # # Old calculation
    # counts_ = [[0 for _ in self.alphabet] for _ in range(self.number_of_positions)]  # list[list]
    # for sequence in self.sequences:
    #     for _count, aa in zip(counts_, sequence):
    #         _count[numerical_translation_alph1_gaped[aa]] += 1
    #         # self.counts[i][aa] += 1
    #
    # self._counts = counts_
    # logger.critical('OLD self._counts', self._counts)
    # self._observations = [sum(_aa_counts[:self.gap_index]) for _aa_counts in self._counts]  # list[list]

    observations_by_position = self.observations_by_position
    counts_by_position = self.counts_by_position
    numerical_alignment = self.numerical_alignment
    # sequence_weights_ = weight_sequences(self._counts, self.alignment, self.observations_by_position)
    # sequence_weights_ = weight_sequences(counts_by_position, self.alignment, observations_by_position)
    sequence_weights_ = []
    for sequence_idx in range(self.length):
        sequence_weights_.append(
            (1 / (observations_by_position * counts_by_position[numerical_alignment[sequence_idx]])).sum())

    logger.critical('OLD sequence_weights_', sequence_weights_)

    return sequence_weights
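
The per-position weighting can be reproduced with plain NumPy on a toy alignment. This sketch implements the classic Henikoff & Henikoff position-based scheme cited above (1 / (number of observed character types in a column * count of this sequence's character)); it is only an illustration, not the class internals:

import numpy as np

# Toy alignment: 3 sequences x 4 columns, encoded into a 3-character alphabet
# (0='A', 1='C', 2='-')
numerical_alignment = np.array([[0, 0, 1, 2],
                                [0, 1, 1, 2],
                                [0, 1, 1, 0]])
alphabet_size = 3

# counts_by_position[j, c] = number of sequences with character c in column j
counts_by_position = np.stack(
    [np.bincount(column, minlength=alphabet_size) for column in numerical_alignment.T])
# Number of distinct character types observed in each column
types_per_position = (counts_by_position > 0).sum(axis=1)

# Each sequence's weight sums 1 / (types_j * count of its own character in column j)
weights = [
    sum(1 / (types_per_position[j] * counts_by_position[j, aa]) for j, aa in enumerate(seq))
    for seq in numerical_alignment
]
print(weights)  # sequences carrying rarer characters receive larger weights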

update_counts_by_position_with_sequence_weights

update_counts_by_position_with_sequence_weights()

Overwrite the current counts with weighted counts

Source code in symdesign/sequence/__init__.py
def update_counts_by_position_with_sequence_weights(self):  # UNUSED
    """Overwrite the current counts with weighted counts"""
    # Add each sequence weight to the indices indicated by the numerical_alignment
    self._counts_by_position = np.zeros((self.number_of_positions, self.alphabet_length))
    numerical_alignment = self.numerical_alignment
    sequence_weights = self.sequence_weights
    try:
        for sequence_idx in range(self.length):
            self._counts_by_position[:, numerical_alignment[sequence_idx]] += sequence_weights[sequence_idx]
    except IndexError:  # sequence_weights is the wrong length
        raise IndexError(
            f"Couldn't index the provided 'sequence_weights' with length {len(sequence_weights)}")
    logger.info('sequence_weight self.counts', self.counts_by_position)
    logger.info('May need to refactor weight sequences() to MultipleSequenceAlignment. Take particular care in '
                'putting the alignment back together after .insert()/.delete() <- if written')

get_probabilities_from_profile

get_probabilities_from_profile(profile: numerical_profile) -> ndarray

For each sequence in the alignment, extract the values from a profile corresponding to the amino acid type of each residue in each sequence

Parameters:

  • profile (numerical_profile) –

    A profile of values with shape (length, alphabet_length) where length is the number_of_positions

Returns: The array with shape (length, number_of_positions) with the value for each amino acid index in profile

Source code in symdesign/sequence/__init__.py
def get_probabilities_from_profile(self, profile: numerical_profile) -> np.ndarray:
    """For each sequence in the alignment, extract the values from a profile corresponding to the amino acid type
    of each residue in each sequence

    Args:
        profile: A profile of values with shape (length, alphabet_length) where length is the number_of_positions
    Returns:
        The array with shape (length, number_of_positions) with the value for each amino acid index in profile
    """
    # transposed_alignment = self.numerical_alignment.T
    return np.take_along_axis(profile, self.numerical_alignment.T, axis=1).T
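
The indexing behind this method can be shown on a toy profile; this is only an illustration of the numpy.take_along_axis pattern, not the method itself:

import numpy as np

# Toy profile: values for 4 positions over a 3-character alphabet
profile = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4],
                    [0.5, 0.4, 0.1]])
# Two sequences encoded as alphabet indices, shape (length, number_of_positions)
numerical_alignment = np.array([[0, 1, 2, 0],
                                [1, 1, 0, 2]])

# For each sequence, pick the profile value of its character at every position
per_residue_values = np.take_along_axis(profile, numerical_alignment.T, axis=1).T
print(per_residue_values)  # [[0.7, 0.8, 0.4, 0.5], [0.2, 0.8, 0.3, 0.1]]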

insert

insert(at: int, sequence: str, msa_index: bool = False)

Insert new sequence in the MultipleSequenceAlignment where the added sequence is added to all columns

Parameters:

  • at (int) –

    The index to insert the sequence at. By default, the index is in reference to where self.query_indices are True, i.e. the query sequence

  • sequence (str) –

    The sequence to insert. Will be inserted for every sequence of the alignment

  • msa_index (bool, default: False ) –

    Whether the insertion index is in the frame of the entire multiple sequence alignment. Default, False, indicates the index is in the frame of the query sequence index, i.e. no gaps

Sets

self.alignment: The existing alignment updated with the new sequence in alignment form

Source code in symdesign/sequence/__init__.py
def insert(self, at: int, sequence: str, msa_index: bool = False):
    """Insert new sequence in the MultipleSequenceAlignment where the added sequence is added to all columns

    Args:
        at: The index to insert the sequence at. By default, the index is in reference to where self.query_indices
            are True, i.e. the query sequence
        sequence: The sequence to insert. Will be inserted for every sequence of the alignment
        msa_index: Whether the insertion index is in the frame of the entire multiple sequence alignment.
            Default, False, indicates the index is in the frame of the query sequence index, i.e. no gaps

    Sets:
        self.alignment: The existing alignment updated with the new sequence in alignment form
    """
    if msa_index:
        at = at
    else:
        try:  # To get the index 'at' for those indices that are present in the query
            at = np.flatnonzero(self.query_indices)[at]
        except IndexError:  # This index is outside of query
            if at >= self.query_length:
                # Treat as append
                at = self.number_of_positions
            else:
                raise NotImplementedError(f"Couldn't index with a negative index...")
    begin_slice = slice(at)
    end_slice = slice(at, None)

    logger.debug(f'Insertion is occurring at {self.__class__.__name__} index {at}')

    new_sequence = Seq(sequence)
    new_alignment = MultipleSeqAlignment(
        [SeqRecord(new_sequence, id=id_)  # annotations={'molecule_type': 'Protein'},
         for id_ in self.sequence_identifiers])

    logger.debug(f'number of sequences in new_alignment: {len(new_alignment)}')
    start_alignment_slice = self.alignment[:, begin_slice]
    start_alignment_len = len(start_alignment_slice)
    if start_alignment_len:
        logger.debug(f'number of sequences in start_alignment_slice: {start_alignment_len}')
        new_alignment = start_alignment_slice + new_alignment

    end_alignment_slice = self.alignment[:, end_slice]
    end_alignment_len = len(end_alignment_slice)
    if end_alignment_len:
        logger.debug(f'number of sequences in end_alignment_slice: {end_alignment_len}')
        new_alignment = new_alignment + end_alignment_slice

    # Set the alignment
    self.alignment = new_alignment
    self.query_aligned = str(new_alignment[0].seq)

    # Update alignment dependent features
    self.reset_state()
    logger.debug(f'Inserted alignment has shape ({self.length}, {self.number_of_positions})')
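
A hedged usage sketch, continuing the hypothetical msa built in the from_dictionary example above:

# Insert three residues before query position 0 (query frame, the default);
# the same characters are added to every sequence in the alignment
msa.insert(0, 'GSH')

# Append three gap-only columns using the full MSA frame
msa.insert(msa.number_of_positions, '---', msa_index=True)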

reset_state

reset_state()

Remove any state attributes

Source code in symdesign/sequence/__init__.py
def reset_state(self):
    """Remove any state attributes"""
    for attr in ['_array', '_deletion_matrix', '_numerical_alignment', '_sequence_indices',
                 '_sequence_identifiers', '_observations_by_position', '_counts_by_position', '_gaps_per_position',
                 '_frequencies']:
        try:
            self.__delattr__(attr)
        except AttributeError:
            continue

pad_alignment

pad_alignment(length, axis: int = 0)

Extend the alignment by a set length

Parameters:

  • length

    The length to pad the alignment

  • axis (int, default: 0 ) –

    The axis to pad. 0 pads the sequences, 1 pads the residues

Sets

self.alignment with the specified padding

Source code in symdesign/sequence/__init__.py
def pad_alignment(self, length, axis: int = 0):
    """Extend the alignment by a set length

    Args:
        length: The length to pad the alignment
        axis: The axis to pad. 0 pads the sequences, 1 pads the residues

    Sets:
        self.alignment with the specified padding
    """
    if axis == 0:
        dummy_record = SeqRecord(Seq('-' * self.number_of_positions), id='dummy')
        self.alignment.extend([dummy_record for _ in range(length)])
        self.reset_state()
    else:  # axis == 1
        self.insert(self.number_of_positions, '-' * length, msa_index=True)

    logger.debug(f'padded alignment has shape ({self.length}, {self.number_of_positions})')

create_numeric_translation_table

create_numeric_translation_table(alphabet: Sequence[str], bytes_: bool = True) -> dict[bytes | str, int]

Return the numeric translation from an alphabet to the integer position in that alphabet

Parameters:

  • alphabet (Sequence[str]) –

    The alphabet to use. Example 'ARNDCQEGHILKMFPSTWYVBZX*'

  • bytes_ (bool, default: True ) –

    Whether to map from byte characters

Returns:

  • dict[bytes | str, int]

    The mapping from the character to the positional integer

Source code in symdesign/sequence/__init__.py
def create_numeric_translation_table(alphabet: Sequence[str], bytes_: bool = True) -> dict[bytes | str, int]:
    """Return the numeric translation from an alphabet to the integer position in that alphabet

    Args:
        alphabet: The alphabet to use. Example 'ARNDCQEGHILKMFPSTWYVBZX*'
        bytes_: Whether to map from byte characters

    Returns:
        The mapping from the character to the positional integer
    """
    if bytes_:
        alphabet = (char.encode() for char in alphabet)

    return dict(zip(alphabet, count()))
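
The returned mapping is simply each character's index in the alphabet, for example:

table = create_numeric_translation_table('ACDE', bytes_=False)
# {'A': 0, 'C': 1, 'D': 2, 'E': 3}

byte_table = create_numeric_translation_table('ACDE')
# {b'A': 0, b'C': 1, b'D': 2, b'E': 3}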

get_numeric_translation_table

get_numeric_translation_table(alphabet_type: alphabet_types_literal) -> defaultdict[str, int] | dict[str, int]

Given an amino acid alphabet type, return the corresponding numerical translation table. If a table is passed, just return it

Returns:

  • defaultdict[str, int] | dict[str, int]

    The integer mapping to the sequence of the requested alphabet

Source code in symdesign/sequence/__init__.py
def get_numeric_translation_table(alphabet_type: alphabet_types_literal) -> defaultdict[str, int] | dict[str, int]:
    """Given an amino acid alphabet type, return the corresponding numerical translation table.
    If a table is passed, just return it

    Returns:
        The integer mapping to the sequence of the requested alphabet
    """
    # try:
    #     match alphabet_type:
    #         case 'protein_letters_alph1':
    #             numeric_translation_table = numerical_translation_alph1_bytes
    #         case 'protein_letters_alph3':
    #             numeric_translation_table = numerical_translation_alph3_bytes
    #         case 'protein_letters_alph1_gaped':
    #             numeric_translation_table = numerical_translation_alph1_gaped_bytes
    #         case 'protein_letters_alph3_gaped':
    #             numeric_translation_table = numerical_translation_alph3_gaped_bytes
    #         case 'protein_letters_alph1_unknown':
    #             numeric_translation_table = numerical_translation_alph1_unknown_bytes
    #         case 'protein_letters_alph3_unknown':
    #             numeric_translation_table = numerical_translation_alph3_unknown_bytes
    #         case 'protein_letters_alph1_unknown_gaped':
    #             numeric_translation_table = numerical_translation_alph1_unknown_gaped_bytes
    #         case 'protein_letters_alph3_unknown_gaped':
    #             numeric_translation_table = numerical_translation_alph3_unknown_gaped_bytes
    #         case _:
    #             try:  # To see if we already have the alphabet, and just return defaultdict
    #                 numeric_translation_table = alphabet_to_type[alphabet_type]
    #             except KeyError:
    #                 # raise ValueError(wrong_alphabet_type)
    #                 logger.warning(wrong_alphabet_type)
    #                 numeric_translation_table = create_numeric_translation_table(alphabet_type)
    # except SyntaxError:  # python version not 3.10
    if alphabet_type == 'protein_letters_alph1':
        numeric_translation_table = numerical_translation_alph1_bytes
    elif alphabet_type == 'protein_letters_alph3':
        numeric_translation_table = numerical_translation_alph3_bytes
    elif alphabet_type == 'protein_letters_alph1_gaped':
        numeric_translation_table = numerical_translation_alph1_gaped_bytes
    elif alphabet_type == 'protein_letters_alph3_gaped':
        numeric_translation_table = numerical_translation_alph3_gaped_bytes
    elif alphabet_type == 'protein_letters_alph1_unknown':
        numeric_translation_table = numerical_translation_alph1_unknown_bytes
    elif alphabet_type == 'protein_letters_alph3_unknown':
        numeric_translation_table = numerical_translation_alph3_unknown_bytes
    elif alphabet_type == 'protein_letters_alph1_unknown_gaped':
        numeric_translation_table = numerical_translation_alph1_unknown_gaped_bytes
    elif alphabet_type == 'protein_letters_alph3_unknown_gaped':
        numeric_translation_table = numerical_translation_alph3_unknown_gaped_bytes
    else:
        try:  # To see if we already have the alphabet, and return the defaultdict
            _type = alphabet_to_alphabet_type[alphabet_type]
        except KeyError:
            raise KeyError(
                f"The alphabet '{alphabet_type}' isn't an allowed alphabet_type."
                f" See {', '.join(alphabet_types)}")
            # raise ValueError(wrong_alphabet_type)
        logger.warning(f"{get_numeric_translation_table.__name__}: The alphabet_type '{alphabet_type}' "
                       "isn't viable. Attempting to create it")
        numeric_translation_table = create_numeric_translation_table(alphabet_type)

    return numeric_translation_table

generate_alignment

generate_alignment(seq1: Sequence[str], seq2: Sequence[str], matrix: str = default_substitution_matrix_name, local: bool = False, top_alignment: bool = True, **alignment_kwargs) -> Alignment | PairwiseAlignments

Use Biopython's PairwiseAligner to generate a sequence alignment

Parameters:

  • seq1 (Sequence[str]) –

    The first sequence to align

  • seq2 (Sequence[str]) –

    The second sequence to align

  • matrix (str, default: default_substitution_matrix_name ) –

    The matrix used to compare character similarities

  • local (bool, default: False ) –

    Whether to run a local alignment. Only use for generally similar sequences!

  • top_alignment (bool, default: True ) –

    Only include the highest scoring alignment

Other Parameters:

  • query_left_open_gap_score

    int = 0 - The score used for opening a gap in the alignment procedure

  • query_left_extend_gap_score

    int = 0 - The score used for extending a gap in the alignment procedure

  • target_left_open_gap_score

    int = 0 - The score used for opening a gap in the alignment procedure

  • target_left_extend_gap_score

    int = 0 - The score used for extending a gap in the alignment procedure

  • query_internal_open_gap_score

    int = -12 - The score used for opening a gap in the alignment procedure

  • query_internal_extend_gap_score

    int = -1 - The score used for extending a gap in the alignment procedure

  • target_internal_open_gap_score

    int = -12 - The score used for opening a gap in the alignment procedure

  • target_internal_extend_gap_score

    int = -1 - The score used for extending a gap in the alignment procedure

  • query_right_open_gap_score

    int = 0 - The score used for opening a gap in the alignment procedure

  • query_right_extend_gap_score

    int = 0 - The score used for extending a gap in the alignment procedure

  • target_right_open_gap_score

    int = 0 - The score used for opening a gap in the alignment procedure

  • target_right_extend_gap_score

    int = 0 - The score used for extending a gap in the alignment procedure

Returns:

  • Alignment | PairwiseAlignments

    The resulting alignment(s). Will be an Alignment object if top_alignment is True else PairwiseAlignments object

Source code in symdesign/sequence/__init__.py
def generate_alignment(seq1: Sequence[str], seq2: Sequence[str], matrix: str = default_substitution_matrix_name,
                       local: bool = False, top_alignment: bool = True, **alignment_kwargs) \
        -> Alignment | PairwiseAlignments:
    """Use Biopython's PairwiseAligner to generate a sequence alignment

    Args:
        seq1: The first sequence to align
        seq2: The second sequence to align
        matrix: The matrix used to compare character similarities
        local: Whether to run a local alignment. Only use for generally similar sequences!
        top_alignment: Only include the highest scoring alignment

    Keyword Args:
        query_left_open_gap_score: int = 0 - The score used for opening a gap in the alignment procedure
        query_left_extend_gap_score: int = 0 - The score used for extending a gap in the alignment procedure
        target_left_open_gap_score: int = 0 - The score used for opening a gap in the alignment procedure
        target_left_extend_gap_score: int = 0 - The score used for extending a gap in the alignment procedure
        query_internal_open_gap_score: int = -12 - The score used for opening a gap in the alignment procedure
        query_internal_extend_gap_score: int = -1 - The score used for extending a gap in the alignment procedure
        target_internal_open_gap_score: int = -12 - The score used for opening a gap in the alignment procedure
        target_internal_extend_gap_score: int = -1 - The score used for extending a gap in the alignment procedure
        query_right_open_gap_score: int = 0 - The score used for opening a gap in the alignment procedure
        query_right_extend_gap_score: int = 0 - The score used for extending a gap in the alignment procedure
        target_right_open_gap_score: int = 0 - The score used for opening a gap in the alignment procedure
        target_right_extend_gap_score: int = 0 - The score used for extending a gap in the alignment procedure

    Returns:
        The resulting alignment(s). Will be an Alignment object if top_alignment is True else PairwiseAlignments object
    """
    matrix_ = _substitution_matrices_cache.get(matrix)
    if matrix_ is None:
        try:  # To get the new matrix and store for future ops
            matrix_ = _substitution_matrices_cache[matrix] = substitution_matrices.load(matrix)
        except FileNotFoundError:  # Missing this
            raise KeyError(
                f"Couldn't find the substitution matrix '{matrix}' ")

    if local:
        mode = 'local'
    else:
        mode = 'global'

    if alignment_kwargs:
        protein_alignment_variables_ = protein_alignment_variables.copy()
        protein_alignment_variables_.update(**alignment_kwargs)
    else:
        protein_alignment_variables_ = protein_alignment_variables
    # logger.debug(f'Generating sequence alignment between:\n{seq1}\n\tAND:\n{seq2}')
    # Create sequence alignment
    aligner = PairwiseAligner(mode=mode, substitution_matrix=matrix_, **protein_alignment_variables_)
    try:
        alignments = aligner.align(seq1, seq2)
    except ValueError:  # sequence contains letters not in the alphabet
        print(f'Sequence1: {seq1}')
        print(f'Sequence2: {seq2}')
        raise
    first_alignment = alignments[0]
    logger.debug(f'Found alignment with score: {alignments.score}\n{first_alignment}')
    # print("Number of alignments: %d" % len(alignments))
    #   Number of alignments: 1
    # alignment = alignments[0]
    # print("Score = %.1f" % alignment.score)
    #   Score = 13.0
    # print(alignment)
    #   target            0 KEVLA 5
    #                     0 -|||- 5
    #   query             0 -EVL- 3
    # print(alignment.target)
    # print(alignment.query)
    # print(alignment.indices)  # returns array([[ 0,  1,  2,  3,  4], [ -1,  0,  1,  2, -1]]) where -1 are gaps
    # This would be useful in checking for gaps during generate_mutations()
    # print(alignment.inverse_indices)  # returns [array([ 0,  1,  2,  3,  4], [ 1,  2,  3]])
    # where -1 are outside array and each index is the position in the alignment. This would be useful for instance with
    # get_equivalent_indices() which is precalculated and now does this routine twice during Entity.__init__()
    # logger.debug(f'Generated alignment:\n{pairwise2.format_alignment(*align[0])}')

    # return align[0] if top_alignment else align
    return first_alignment if top_alignment else alignments
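
A minimal usage sketch (the sequences are illustrative; the substitution matrix defaults to whatever default_substitution_matrix_name names in this module):

# Highest-scoring global alignment between two similar sequences
alignment = generate_alignment('MNTEELQVAAFEI', 'MNTELQVAAFEI')
print(alignment.score)
print(alignment)  # Biopython Alignment showing target, match line, and query

# Keep every returned alignment instead of only the best one
alignments = generate_alignment('MNTEELQVAAFEI', 'MNTELQVAAFEI', top_alignment=False)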

read_fasta_file

read_fasta_file(file_name: AnyStr, **kwargs) -> Iterable[SeqRecord]

Open a fasta file and return a parser object to load the sequences to SeqRecords

Parameters:

  • file_name (AnyStr) –

    The location of the file on disk

Returns:

  • Iterable[SeqRecord]

    An iterator of the sequences in the file [record1, record2, ...]

Source code in symdesign/sequence/__init__.py
def read_fasta_file(file_name: AnyStr, **kwargs) -> Iterable[SeqRecord]:
    """Open a fasta file and return a parser object to load the sequences to SeqRecords

    Args:
        file_name: The location of the file on disk

    Returns:
        An iterator of the sequences in the file [record1, record2, ...]
    """
    return SeqIO.parse(file_name, 'fasta')

read_sequence_file

read_sequence_file(file_name: AnyStr, **kwargs) -> Iterable[SeqRecord]

Open a sequence file and return a parser object to load the sequences to SeqRecords (currently not implemented)

Parameters:

  • file_name (AnyStr) –

    The location of the file on disk

Returns:

  • Iterable[SeqRecord]

    An iterator of the sequences in the file [record1, record2, ...]

Source code in symdesign/sequence/__init__.py
def read_sequence_file(file_name: AnyStr, **kwargs) -> Iterable[SeqRecord]:
    """Open a sequence file and return a parser object to load the sequences to SeqRecords

    Args:
        file_name: The location of the file on disk

    Returns:
        An iterator of the sequences in the file [record1, record2, ...]
    """
    raise NotImplementedError()
    return SeqIO.parse(file_name, 'csv')

read_alignment

read_alignment(file_name: AnyStr, alignment_type: str = 'fasta', **kwargs) -> MultipleSeqAlignment

Open an alignment file and parse the alignment to a Biopython MultipleSeqAlignment

Parameters:

  • file_name (AnyStr) –

    The location of the file on disk

  • alignment_type (str, default: 'fasta' ) –

    The type of file that the alignment is stored in. Used for parsing

Returns:

  • MultipleSeqAlignment

    The parsed alignment

Source code in symdesign/sequence/__init__.py
@utils.handle_errors(errors=(FileNotFoundError,))
def read_alignment(file_name: AnyStr, alignment_type: str = 'fasta', **kwargs) -> MultipleSeqAlignment:
    """Open an alignment file and parse the alignment to a Biopython MultipleSeqAlignment

    Args:
        file_name: The location of the file on disk
        alignment_type: The type of file that the alignment is stored in. Used for parsing

    Returns:
        The parsed alignment
    """
    return AlignIO.read(file_name, alignment_type)
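
A hedged sketch tying this back to the MultipleSequenceAlignment class above, assuming both names are imported from symdesign.sequence (the file path is hypothetical):

bio_alignment = read_alignment('family_alignment.fasta')  # Bio.Align.MultipleSeqAlignment
msa = MultipleSequenceAlignment(bio_alignment)            # residue-indexed wrapper
print(msa.length, msa.number_of_positions)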

write_fasta

write_fasta(sequence_records: Iterable[SeqRecord], file_name: AnyStr = None, name: str = None, out_dir: AnyStr = os.getcwd()) -> AnyStr

Write an iterator of SeqRecords to a .fasta file with fasta format. '.fasta' is appended if not specified in name

Parameters:

  • sequence_records (Iterable[SeqRecord]) –

    The sequences to write. Should be Biopython SeqRecord format

  • file_name (AnyStr, default: None ) –

    The explicit name of the file

  • name (str, default: None ) –

    The name of the file to output

  • out_dir (AnyStr, default: getcwd() ) –

    The location on disk to output file

Returns:

  • AnyStr

    The name of the output file

Source code in symdesign/sequence/__init__.py
def write_fasta(sequence_records: Iterable[SeqRecord], file_name: AnyStr = None, name: str = None,
                out_dir: AnyStr = os.getcwd()) -> AnyStr:
    """Write an iterator of SeqRecords to a .fasta file with fasta format. '.fasta' is appended if not specified in name

    Args:
        sequence_records: The sequences to write. Should be Biopython SeqRecord format
        file_name: The explicit name of the file
        name: The name of the file to output
        out_dir: The location on disk to output file

    Returns:
        The name of the output file
    """
    if file_name is None:
        file_name = os.path.join(out_dir, name)
        if not file_name.endswith('.fasta'):
            file_name = f'{file_name}.fasta'

    SeqIO.write(sequence_records, file_name, 'fasta')

    return file_name

write_sequence_to_fasta

write_sequence_to_fasta(sequence: str, file_name: AnyStr, name: str, out_dir: AnyStr = os.getcwd()) -> AnyStr

Write a single sequence to a .fasta file in fasta format. '.fasta' is appended if not present in the file name

Parameters:

  • sequence (str) –

    The sequence to write

  • name (str) –

    The name of the sequence. Will be used as the default file_name base name if file_name not provided

  • file_name (AnyStr) –

    The explicit name of the file

  • out_dir (AnyStr, default: getcwd() ) –

    The location on disk to output the file. Only used if file_name not explicitly provided

Returns:

  • AnyStr

    The path to the output file

Source code in symdesign/sequence/__init__.py
def write_sequence_to_fasta(sequence: str, file_name: AnyStr, name: str, out_dir: AnyStr = os.getcwd()) -> AnyStr:
    """Write a single sequence to a .fasta file in fasta format. '.fasta' is appended if not present in the file name

    Args:
        sequence: The sequence to write
        name: The name of the sequence. Will be used as the default file_name base name if file_name not provided
        file_name: The explicit name of the file
        out_dir: The location on disk to output the file. Only used if file_name not explicitly provided

    Returns:
        The path to the output file
    """
    if file_name is None:
        file_name = os.path.join(out_dir, name)
        if not file_name.endswith('.fasta'):
            file_name = f'{file_name}.fasta'

    with open(file_name, 'w') as outfile:
        outfile.write(f'>{name}\n{sequence}\n')

    return file_name

concatenate_fasta_files

concatenate_fasta_files(file_names: Iterable[AnyStr], out_path: str = 'concatenated_fasta') -> AnyStr

Take multiple fasta files and concatenate into a single file

Parameters:

  • file_names (Iterable[AnyStr]) –

    The name of the files to concatenate

  • out_path (str, default: 'concatenated_fasta' ) –

    The location on disk to output file

Returns:

  • AnyStr

    The name of the output file

Source code in symdesign/sequence/__init__.py
def concatenate_fasta_files(file_names: Iterable[AnyStr], out_path: str = 'concatenated_fasta') -> AnyStr:
    """Take multiple fasta files and concatenate into a single file

    Args:
        file_names: The name of the files to concatenate
        out_path: The location on disk to output file

    Returns:
        The name of the output file
    """
    seq_records = []
    for file in file_names:
        seq_records.extend(list(read_fasta_file(file)))
    return write_fasta(seq_records, out_dir=out_path)

write_sequences

write_sequences(sequences: Sequence | dict[str, Sequence], names: Sequence = None, out_path: AnyStr = os.getcwd(), file_name: AnyStr = None, csv: bool = False) -> AnyStr

Write a fasta file from sequence(s). If a single sequence is provided, pass as a string

Parameters:

  • sequences (Sequence | dict[str, Sequence]) –

If a list, can be a list of (name, sequence) tuples, or a list of sequences whose names are provided by the names argument. If a dict, uses the key as the name and the value as the sequence. If a str, treats it as the sequence

  • names (Sequence, default: None ) –

    The name or names of the sequence record(s). If a single name, will be used as the default file_name base name if file_name not provided. Otherwise, will be used iteratively

  • out_path (AnyStr, default: getcwd() ) –

    The location on disk to output file

  • file_name (AnyStr, default: None ) –

    The explicit name of the file

  • csv (bool, default: False ) –

    Whether the file should be written as a .csv. Default is .fasta

Returns:

  • AnyStr

    The name of the output file

Source code in symdesign/sequence/__init__.py
def write_sequences(sequences: Sequence | dict[str, Sequence], names: Sequence = None,
                    out_path: AnyStr = os.getcwd(), file_name: AnyStr = None, csv: bool = False) -> AnyStr:
    """Write a fasta file from sequence(s). If a single sequence is provided, pass as a string

    Args:
        sequences: If a list, can be list of tuples(name, sequence), or list[sequence] where names contain the
            corresponding sequence names. If dict, uses key as name, value as sequence. If str, treats as the sequence
        names: The name or names of the sequence record(s). If a single name, will be used as the default file_name
            base name if file_name not provided. Otherwise, will be used iteratively
        out_path: The location on disk to output file
        file_name: The explicit name of the file
        csv: Whether the file should be written as a .csv. Default is .fasta

    Returns:
        The name of the output file
    """
    if file_name is None:
        if isinstance(names, str):  # Not an iterable
            file_name = os.path.join(out_path, names)
        else:
            raise ValueError(
                f"Must provide argument 'file_name' or 'names' as a str to {write_sequences.__name__}()")

    if csv:
        start, sep = '', ','
        extension = '.csv'
    else:
        start, sep = '>', '\n'
        extension = '.fasta'

    provided_extension = os.path.splitext(file_name)[-1]
    if not file_name.endswith(extension) or provided_extension != extension:
        file_name = f'{os.path.splitext(file_name)[0]}{extension}'

    def data_dump():
        return f"{write_sequences.__name__} can't parse data to make fasta\n" \
               f'names={names}, sequences={sequences}, extension={extension}, out_path={out_path}, ' \
               f'file_name={file_name}'

    with open(file_name, 'a') as outfile:
        if isinstance(sequences, np.ndarray):
            sequences = sequences.tolist()

        if not sequences:
            raise ValueError(f"No sequences provided, couldn't write anything")

        if isinstance(sequences, list):
            if isinstance(sequences[0], tuple):  # Where seq[0] is name, seq[1] is seq
                formatted_sequence_gen = (f'{start}{name}{sep}{seq}' for name, seq, *_ in sequences)
            elif isinstance(names, Sequence):
                if isinstance(sequences[0], str):
                    formatted_sequence_gen = (f'{start}{name}{sep}{seq}' for name, seq in zip(names, sequences))
                elif isinstance(sequences[0], Sequence):
                    formatted_sequence_gen = \
                        (f'{start}{name}{sep}{"".join(seq)}' for name, seq in zip(names, sequences))
                else:
                    raise TypeError(data_dump())
            # elif isinstance(sequences[0], list):  # Where interior list is alphabet (AA or DNA)
            #     for idx, seq in enumerate(sequences):
            #         outfile.write(f'>{name}_{idx}\n')  # Write header
            #         # Check if alphabet is 3 letter protein
            #         outfile.write(f'{" ".join(seq)}\n' if len(seq[0]) == 3 else f'{"".join(seq)}\n')
            # elif isinstance(sequences[0], str):  # Likely 3 aa format...
            #     outfile.write(f'>{name}\n{" ".join(sequences)}\n')
            else:
                raise TypeError(data_dump())
        elif isinstance(sequences, dict):
            formatted_sequence_gen = (f'{start}{name}{sep}{"".join(seq)}' for name, seq in sequences.items())
        elif isinstance(sequences, tuple):  # Where seq[0] is name, seq[1] is seq
            try:
                name, seq, *_ = sequences
            except ValueError:
                raise ValueError(f"When using a tuple, expected that the tuple contain (name, sequence) pairs")
            formatted_sequence_gen = (f'{start}{name}{sep}{seq}',)
        elif isinstance(names, str):  # Assume sequences is a str or tuple
            formatted_sequence_gen = (f'{start}{names}{sep}{"".join(sequences)}',)
        else:
            raise TypeError(data_dump())
        outfile.write('%s\n' % '\n'.join(formatted_sequence_gen))

    return file_name
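
A short usage sketch covering the accepted input shapes (names, sequences, and file names are made up):

# dict input: keys become record names
write_sequences({'design_1': 'MNTEELQVAA', 'design_2': 'MNTDELQVAA'}, file_name='designs.fasta')

# list of (name, sequence) tuples
write_sequences([('design_1', 'MNTEELQVAA'), ('design_2', 'MNTDELQVAA')], file_name='designs.fasta')

# a single sequence string; the single name also provides the default file name
write_sequences('MNTEELQVAA', names='design_1', out_path='.')

# csv=True writes 'name,sequence' lines to a .csv instead of fasta records
write_sequences({'design_1': 'MNTEELQVAA'}, file_name='designs', csv=True)

Note that the file is opened in append mode, so repeated calls with the same file_name accumulate records.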

hhblits

hhblits(name: str, sequence_file: Sequence[str] = None, sequence: Sequence[str] = None, out_dir: AnyStr = os.getcwd(), threads: int = hhblits_threads, return_command: bool = False, **kwargs) -> list[str] | None

Generate a position specific scoring matrix from HHblits using Hidden Markov Models

When the command is run, it is possible to create six files in out_dir with the pattern '/outdir/name.*' where the '*' extensions are:

  • hhblits profile - .hmm

  • hhblits resulting cluster description - .hrr

  • sequence alignment in a3m format - .a3m

  • sequence file - .seq (if sequence_file)

  • sequence alignment in stockholm format - .sto (if return_command is False)

  • sequence alignment in fasta format - .fasta (if return_command is False)

Parameters:

  • name (str) –

    The name associated with the sequence

  • sequence_file (Sequence[str], default: None ) –

    The file containing the sequence to use

  • sequence (Sequence[str], default: None ) –

    The sequence to use. Used in place of sequence_file

  • out_dir (AnyStr, default: getcwd() ) –

    Disk location where generated files should be written

  • threads (int, default: hhblits_threads ) –

    Number of cpu's to use for the process

  • return_command (bool, default: False ) –

    Whether to simply return the hhblits command

Returns:

  • list[str] | None

    The command if return_command is True, otherwise None

Source code in symdesign/sequence/__init__.py
def hhblits(name: str, sequence_file: Sequence[str] = None, sequence: Sequence[str] = None,
            out_dir: AnyStr = os.getcwd(), threads: int = hhblits_threads,
            return_command: bool = False, **kwargs) -> list[str] | None:
    """Generate a position specific scoring matrix from HHblits using Hidden Markov Models

    When the command is run, it is possible to create six files in out_dir with the pattern '/outdir/name.*'
    where the '*' extensions are:
     hhblits profile - .hmm

     hhblits resulting cluster description - .hrr

     sequence alignment in a3m format - .a3m

     sequence file - .seq (if sequence_file)

     sequence alignment in stockholm format - .sto (if return_command is False)

     sequence alignment in fasta format - .fasta (if return_command is False)

    Args:
        name: The name associated with the sequence
        sequence_file: The file containing the sequence to use
        sequence: The sequence to use. Used in place of sequence_file
        out_dir: Disk location where generated files should be written
        threads: Number of cpu's to use for the process
        return_command: Whether to simply return the hhblits command

    Raises:
        RuntimeError if hhblits command is run and returns a non-zero exit code

    Returns:
        The command if return_command is True, otherwise None
    """
    if sequence_file is None:
        if sequence is None:
            raise ValueError(
                f"{hhblits.__name__}: Can't proceed without argument 'sequence_file' or 'sequence'")
        else:
            sequence_file = write_sequences((name, sequence), file_name=os.path.join(out_dir, f'{name}.seq'))
    pssm_file = os.path.join(out_dir, f'{name}.hmm')
    a3m_file = os.path.join(out_dir, f'{name}.a3m')
    # Todo for higher performance set up https://www.howtoforge.com/storing-files-directories-in-memory-with-tmpfs
    #  Create a ramdisk to store a database chunk to make hhblits/Jackhmmer run fast.
    #  sudo mkdir -m 777 --parents /tmp/ramdisk
    #  sudo mount -t tmpfs -o size=9G ramdisk /tmp/ramdisk
    cmd = [putils.hhblits_exe, '-d', putils.uniclust_db, '-i', sequence_file, '-ohhm', pssm_file, '-oa3m', a3m_file,
           '-hide_cons', '-hide_pred', '-hide_dssp', '-E', '1E-06', '-v', '1', '-cpu', str(threads)]

    if return_command:
        return cmd

    logger.debug(f'{name} Profile Command: {subprocess.list2cmdline(cmd)}')
    p = subprocess.Popen(cmd)
    p.communicate()
    if p.returncode != 0:
        # temp_file = os.path.join(out_path, f'{self.name}.hold')
        temp_file = Path(out_dir, f'{name}.hold')
        temp_file.unlink(missing_ok=True)
        # if os.path.exists(temp_file):  # remove hold file blocking progress
        #     os.remove(temp_file)
        raise utils.SymDesignException(
            f'Profile generation for {name} got stuck. Found return code {p.returncode}')  #

    # Preferred alignment type
    msa_file = os.path.join(out_dir, f'{name}.sto')
    p = subprocess.Popen([putils.reformat_msa_exe_path, a3m_file, msa_file, '-num', '-uc'])
    p.communicate()
    fasta_msa = os.path.join(out_dir, f'{name}.fasta')
    p = subprocess.Popen([putils.reformat_msa_exe_path, a3m_file, fasta_msa, '-M', 'first', '-r'])
    p.communicate()

    return None
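
Because the external hhblits binary and sequence databases may not be available, return_command=True is a convenient way to inspect or defer the invocation; a hedged sketch (paths are hypothetical):

# Build, but do not run, the hhblits command for an existing sequence file
cmd = hhblits('my_design', sequence_file='profiles/my_design.seq',
              out_dir='profiles', return_command=True)
print(' '.join(cmd))  # the hhblits call that would have been executed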

optimize_protein_sequence

optimize_protein_sequence(sequence: str, species: optimization_species_literal = 'e_coli') -> str

Optimize a sequence for expression in a desired organism

Parameters:

  • sequence (str) –

    The sequence of interest

  • species (optimization_species_literal, default: 'e_coli' ) –

    The species context to optimize nucleotide sequence usage

Returns: The input sequence optimized to nucleotides for expression considerations

Source code in symdesign/sequence/__init__.py
def optimize_protein_sequence(sequence: str, species: optimization_species_literal = 'e_coli') -> str:
    """Optimize a sequence for expression in a desired organism

    Args:
        sequence: The sequence of interest
        species: The species context to optimize nucleotide sequence usage
    Returns:
        The input sequence optimized to nucleotides for expression considerations
    """
    seq_length = len(sequence)
    species = species.lower()
    try:
        from dnachisel import reverse_translate, AvoidHairpins, AvoidPattern, AvoidRareCodons, CodonOptimize, \
        DnaOptimizationProblem, EnforceGCContent, EnforceTranslation,  UniquifyAllKmers
    except ModuleNotFoundError:
        raise RuntimeError(
            f"Can't {optimize_protein_sequence.__name__} as the dependency `DnaChisel` isn`t available")

    try:
        dna_sequence = reverse_translate(sequence)
    except KeyError as error:
        raise KeyError(
            f'Warning an invalid character was found in the protein sequence: {error}')

    problem = DnaOptimizationProblem(
        sequence=dna_sequence, logger=None,  # max_random_iters=20000,
        objectives=[CodonOptimize(species=species)],
        # method='harmonize_rca')] <- Useful for folding speed when original organism known
        constraints=[EnforceGCContent(mini=0.25, maxi=0.65),  # Twist required
                     EnforceGCContent(mini=0.35, maxi=0.65, window=50),  # Twist required
                     AvoidHairpins(stem_size=20, hairpin_window=48),  # Efficient translate
                     AvoidPattern('GGAGG', location=(1, seq_length, 1)),  # Ribosome bind
                     AvoidPattern('TAAGGAG', location=(1, seq_length, 1)),  # Ribosome bind
                     AvoidPattern('AAAAA', location=(1, seq_length, 0)),  # Terminator
                     # AvoidPattern('TTTTT', location=(1, seq_length, 1)),  # Terminator
                     AvoidPattern('GGGGGGGGGG', location=(1, seq_length, 0)),  # Homopoly
                     # AvoidPattern('CCCCCCCCCC', location=(1, seq_length)),  # Homopoly
                     UniquifyAllKmers(20),  # Twist required
                     AvoidRareCodons(0.08, species=species),
                     EnforceTranslation(),
                     # EnforceMeltingTemperature(mini=10,maxi=62,location=(1, seq_length)),
                     ])

    # Solve constraints and solve in regard to the objective
    problem.max_random_iters = 20000
    problem.resolve_constraints()
    problem.optimize()

    # Display summaries of constraints that pass
    # print(problem.constraints_text_summary())
    # print(problem.objectives_text_summary())

    # Get final sequence as string
    final_sequence = problem.sequence
    # Get final sequence as Biopython record
    # final_record = problem.to_record(with_sequence_edits=True)

    return final_sequence

make_mutations

make_mutations(sequence: Sequence, mutations: dict[int, dict[str, str]], find_orf: bool = True) -> str

Modify a sequence to contain mutations specified by a mutation dictionary

Assumes a zero-index sequence and zero-index mutations

Parameters:

  • sequence (Sequence) –

    'Wild-type' sequence to mutate

  • mutations (dict[int, dict[str, str]]) –

    {mutation_index: {'from': AA, 'to': AA}, ...}

  • find_orf (bool, default: True ) –

    Whether to find the correct ORF for the mutations and the seq

Returns: The mutated sequence

Source code in symdesign/sequence/__init__.py
def make_mutations(sequence: Sequence, mutations: dict[int, dict[str, str]], find_orf: bool = True) -> str:
    """Modify a sequence to contain mutations specified by a mutation dictionary

    Assumes a zero-index sequence and zero-index mutations
    Args:
        sequence: 'Wild-type' sequence to mutate
        mutations: {mutation_index: {'from': AA, 'to': AA}, ...}
        find_orf: Whether to find the correct ORF for the mutations and the seq
    Returns:
        seq: The mutated sequence
    """
    # Seq can be either list or string
    if find_orf:
        offset = find_orf_offset(sequence, mutations)
        logger.info(f'Found ORF. Offset = {offset}')
    else:
        offset = 0  # ZERO_OFFSET

    seq = sequence
    index_errors = []
    for key, mutation in mutations.items():
        index = key + offset
        try:
            if seq[index] == mutation['from']:  # Adjust key for zero index slicing
                seq = seq[:index] + mutation['to'] + seq[index + 1:]
            else:  # Find correct offset, or mark mutation source as doomed
                index_errors.append(key)
        except IndexError:
            logger.error(key + offset)
    if index_errors:
        logger.warning(f'{make_mutations.__name__} index errors: {", ".join(map(str, index_errors))}')

    return seq
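
A small worked example using the documented zero-indexed mutation dictionary (the sequence and mutations are illustrative):

wild_type = 'MKTAYIAKQR'
# Change position 3 from 'A' to 'V' and position 8 from 'Q' to 'E' (zero-indexed)
mutations = {3: {'from': 'A', 'to': 'V'}, 8: {'from': 'Q', 'to': 'E'}}

mutant = make_mutations(wild_type, mutations, find_orf=False)
print(mutant)  # 'MKTVYIAKER'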

find_orf_offset

find_orf_offset(sequence: Sequence, mutations: mutation_dictionary) -> int

Using a sequence and mutation data, find the open reading frame that matches mutations closest

Parameters:

  • sequence (Sequence) –

    Sequence to search for ORF in 1 letter format

  • mutations (mutation_dictionary) –

    {mutation_index: {'from': AA, 'to': AA}, ...} zero-indexed sequence dictionary

Returns: The zero-indexed integer to offset the provided sequence to best match the provided mutations

Source code in symdesign/sequence/__init__.py
def find_orf_offset(sequence: Sequence, mutations: mutation_dictionary) -> int:
    """Using a sequence and mutation data, find the open reading frame that matches mutations closest

    Args:
        sequence: Sequence to search for ORF in 1 letter format
        mutations: {mutation_index: {'from': AA, 'to': AA}, ...} zero-indexed sequence dictionary
    Returns:
        The zero-indexed integer to offset the provided sequence to best match the provided mutations
    """
    def gen_offset_repr():
        return f'Found the orf_offsets: {",".join(f"{k}={v}" for k, v in orf_offsets.items())}'

    unsolvable = False
    orf_start_idx = 0
    orf_offsets = {idx: 0 for idx, aa in enumerate(sequence) if aa == 'M'}
    methionine_positions = list(orf_offsets.keys())
    while True:
        if not orf_offsets:  # MET is missing from the sequence/we haven't found the ORF start and need to scan a range
            orf_offsets = {start_idx: 0 for start_idx in range(-30, 50)}

        # Weight potential MET offsets by finding the one which gives the highest number correct mutation sites
        for test_orf_index in orf_offsets:
            for mutation_index, mutation in mutations.items():
                try:
                    if sequence[test_orf_index + mutation_index] == mutation['from']:
                        orf_offsets[test_orf_index] += 1
                except IndexError:  # We have reached the end of the sequence
                    break

        # logger.debug(gen_offset_repr())
        max_count = max(list(orf_offsets.values()))
        # Check if likely ORF has been identified (count < number mutations/2). If not, MET is missing/not the ORF start
        if max_count < len(mutations) / 2:
            if unsolvable:
                raise RuntimeError(f"Couldn't find a orf_offset max_count {max_count} < {len(mutations) / 2} (half the "
                                   f"mutations). The orf_start_idx={orf_start_idx} still\n\t{gen_offset_repr()}")
            orf_offsets = {}
            unsolvable = True  # if we reach this spot again, the problem is deemed unsolvable
        else:  # Find the index of the max_count
            for idx, count_ in orf_offsets.items():
                if max_count == count_:  # orf_offsets[offset]:
                    orf_start_idx = idx  # Select the first occurrence of the max count
                    break
            else:
                raise RuntimeError(f"Couldn't find a orf_offset count == {max_count} (the max_count). "
                                   f"The orf_start_idx={orf_start_idx} still\n\t{gen_offset_repr()}")

            # For cases where the orf doesn't begin on a Met, try to find a prior Met. Otherwise, select the identified Met
            closest_met = None
            for met_index in methionine_positions:
                if met_index <= orf_start_idx:
                    closest_met = met_index
                else:  # We have passed the identified orf_start_idx
                    if closest_met is not None:
                        orf_start_idx = closest_met
                    break
            break

    return orf_start_idx
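
A minimal sketch with hypothetical values (not from the package docs). The mutation 'from' entries describe the untagged reference, so the returned offset locates where that reading frame starts in an N-terminally extended sequence:

# Sketch only: the tag and sequence below are illustrative
from symdesign.sequence import find_orf_offset

tagged = 'GSHMKTAYIAKQR'  # the reference ORF begins at the Met at index 3
mutations = {0: {'from': 'M', 'to': 'M'}, 1: {'from': 'K', 'to': 'R'}, 2: {'from': 'T', 'to': 'S'}}
offset = find_orf_offset(tagged, mutations)
# offset == 3, i.e. tagged[3 + mutation_index] matches each 'from' residue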

generate_mutations

generate_mutations(reference: Sequence, query: Sequence, offset: bool = True, keep_gaps: bool = False, remove_termini: bool = True, remove_query_gaps: bool = True, only_gaps: bool = False, zero_index: bool = False, return_all: bool = False, return_to: bool = False, return_from: bool = False) -> mutation_dictionary | sequence_dictionary

Create mutation data in a typical A5K format. Integer indexed dictionary keys with the index matching reference sequence. Sequence mutations accessed by "from" and "to" keys. By default, only mutated positions are returned and all gaped sequences are excluded

For PDB comparison, reference should be expression sequence (SEQRES), query should be atomic sequence (ATOM)

Parameters:

  • reference (Sequence) –

    Reference sequence to align mutations against. Character values are returned to the "from" key

  • query (Sequence) –

    Query sequence. Character values are returned to the "to" key

  • offset (bool, default: True ) –

    Whether sequences are different lengths. Will create an alignment of the two sequences

  • keep_gaps (bool, default: False ) –

    Return gaped indices, i.e. outside the aligned sequences or missing internal characters

  • remove_termini (bool, default: True ) –

    Remove indices that are outside the reference sequence boundaries

  • remove_query_gaps (bool, default: True ) –

    Remove indices where there are gaps present in the query sequence

  • only_gaps (bool, default: False ) –

    Only include reference indices that are missing query residues. All "to" values will be a gap "-"

  • zero_index (bool, default: False ) –

    Whether to return the indices zero-indexed (like python) or one-indexed

  • return_all (bool, default: False ) –

    Whether to return all the indices and their corresponding mutational data

  • return_to (bool, default: False ) –

    Whether to return only the 'to' amino acid type

  • return_from (bool, default: False ) –

    Whether to return only the 'from' amino acid type

Returns: Mutation index to mutations with format {1: {'from': 'A', 'to': 'K'}, ...} unless return_to or return_from is True, then {1: 'K', ...} or {1: 'A', ...}, respectively

Source code in symdesign/sequence/__init__.py (lines 799-872)
def generate_mutations(reference: Sequence, query: Sequence, offset: bool = True, keep_gaps: bool = False,
                       remove_termini: bool = True, remove_query_gaps: bool = True, only_gaps: bool = False,
                       zero_index: bool = False,
                       return_all: bool = False, return_to: bool = False, return_from: bool = False) \
        -> mutation_dictionary | sequence_dictionary:
    """Create mutation data in a typical A5K format. Integer indexed dictionary keys with the index matching reference
    sequence. Sequence mutations accessed by "from" and "to" keys. By default, only mutated positions are returned and
    all gaped sequences are excluded

    For PDB comparison, reference should be expression sequence (SEQRES), query should be atomic sequence (ATOM)

    Args:
        reference: Reference sequence to align mutations against. Character values are returned to the "from" key
        query: Query sequence. Character values are returned to the "to" key
        offset: Whether sequences are different lengths. Will create an alignment of the two sequences
        keep_gaps: Return gaped indices, i.e. outside the aligned sequences or missing internal characters
        remove_termini: Remove indices that are outside the reference sequence boundaries
        remove_query_gaps: Remove indices where there are gaps present in the query sequence
        only_gaps: Only include reference indices that are missing query residues. All "to" values will be a gap "-"
        zero_index: Whether to return the indices zero-indexed (like python) or one-indexed
        return_all: Whether to return all the indices and their corresponding mutational data
        return_to: Whether to return only the 'to' amino acid type
        return_from: Whether to return only the 'from' amino acid type
    Returns:
        Mutation index to mutations with format
            {1: {'from': 'A', 'to': 'K'}, ...}
            unless return_to or return_from is True, then
            {1: 'K', ...} or {1: 'A', ...}, respectively
    """
    if offset:
        alignment = generate_alignment(reference, query)
        align_seq_1, align_seq_2 = alignment
    else:
        align_seq_1, align_seq_2 = reference, query

    idx_offset = 0 if zero_index else ZERO_OFFSET

    # Get the first matching index of the reference sequence
    starting_idx_of_seq1 = align_seq_1.find(reference[0])
    # Ensure iteration sequence1/reference starts at idx 0       v
    sequence_iterator = enumerate(zip(align_seq_1, align_seq_2), -starting_idx_of_seq1 + idx_offset)
    # Extract differences from the alignment
    if return_all:
        mutations = {idx: {'from': char1, 'to': char2} for idx, (char1, char2) in sequence_iterator}
    else:
        mutations = {idx: {'from': char1, 'to': char2} for idx, (char1, char2) in sequence_iterator if char1 != char2}

    # Find last index of reference (including internal gaps)
    starting_key_of_seq1 = idx_offset
    ending_key_of_seq1 = starting_key_of_seq1 + align_seq_1.rfind(reference[-1])
    remove_mutation_list = []
    if only_gaps:  # Remove the actual mutations, keep internal and external gap indices and the reference sequence
        keep_gaps = True
        remove_mutation_list.extend([entry for entry, mutation in mutations.items()
                                     if mutation['from'] != '-' and mutation['to'] != '-'])
    if keep_gaps:  # Leave all types of keep_gaps, otherwise check for each requested type
        remove_termini = remove_query_gaps = False

    if remove_termini:  # Remove indices outside of sequence 1
        remove_mutation_list.extend([entry for entry in mutations
                                     if entry < starting_key_of_seq1 or entry > ending_key_of_seq1])

    if remove_query_gaps:  # Remove indices where sequence 2 is gaped
        remove_mutation_list.extend([entry for entry, mutation in mutations.items()
                                     if starting_key_of_seq1 < entry <= ending_key_of_seq1 and mutation['to'] == '-'])
    for entry in remove_mutation_list:
        mutations.pop(entry, None)

    if return_to:
        mutations = {idx: _mutation_dictionary['to'] for idx, _mutation_dictionary in mutations.items()}
    elif return_from:
        mutations = {idx: _mutation_dictionary['from'] for idx, _mutation_dictionary in mutations.items()}

    return mutations
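
A minimal sketch with hypothetical, equal-length sequences (offset=False skips the alignment). By default the keys are one-indexed and only mismatched positions are returned:

# Sketch only: sequences are illustrative
from symdesign.sequence import generate_mutations

reference = 'MKTAYIAKQR'
query = 'MKTHYIAKQL'
generate_mutations(reference, query, offset=False)
# {4: {'from': 'A', 'to': 'H'}, 10: {'from': 'R', 'to': 'L'}}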

pdb_to_pose_offset

pdb_to_pose_offset(reference_sequence: dict[Any, Sequence]) -> dict[Any, int]

Take a dictionary with chain names as keys and return the Pose numbering offset for each chain

Parameters:

  • reference_sequence (dict[Any, Sequence]) –

    {key1: 'MSGKLDA...', ...} or {key2: {1: 'A', 2: 'S', ...}, ...}

Returns: {key1: 0, key2: 123, ...}

Source code in symdesign/sequence/__init__.py (lines 875-893)
def pdb_to_pose_offset(reference_sequence: dict[Any, Sequence]) -> dict[Any, int]:
    """Take a dictionary with chain name as keys and return the length of Pose numbering offset

    Args:
        reference_sequence: {key1: 'MSGKLDA...', ...} or {key2: {1: 'A', 2: 'S', ...}, ...}
    Returns:
        {key1: 0, key2: 123, ...}
    """
    offset = {}
    # prior_chain = None
    prior_chains_len = prior_key = 0  # prior_key not used as 0 but to ensure initialized nonetheless
    for idx, key in enumerate(reference_sequence):
        if idx > 0:
            prior_chains_len += len(reference_sequence[prior_key])
        offset[key] = prior_chains_len
        # insert function here? Make this a decorator!?
        prior_key = key

    return offset
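
A minimal sketch with hypothetical chain sequences; each chain's offset is the summed length of the chains that precede it:

# Sketch only: chain identifiers and sequences are illustrative
from symdesign.sequence import pdb_to_pose_offset

reference_sequence = {'A': 'MKTAYIAKQR', 'B': 'MSGKLDA'}
pdb_to_pose_offset(reference_sequence)
# {'A': 0, 'B': 10}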

generate_multiple_mutations

generate_multiple_mutations(reference: dict[str, str], sequences: dict[str, dict[str, str]], pose_num: bool = True) -> dict[str, dict[str, mutation_dictionary]]

Extract mutation data from multiple sequence dictionaries with regard to a reference. Default is Pose numbering

Parameters:

  • reference (dict[str, str]) –

    {chain: sequence, ...} The reference sequence to compare sequences to

  • sequences (dict[str, dict[str, str]]) –

    {pdb_code: {chain: sequence, ...}, ...}

  • pose_num (bool, default: True ) –

    Whether to return the mutations in Pose numbering with the first Entity as 1 and the second Entity as Entity1 last residue + 1

Returns:

  • dict[str, dict[str, mutation_dictionary]]

    {pdb_code: {chain_id: {mutation_index: {'from': 'A', 'to': 'K'}, ...}, ...}, ...}

Source code in symdesign/sequence/__init__.py (lines 896-939)
def generate_multiple_mutations(
    reference: dict[str, str], sequences: dict[str, dict[str, str]], pose_num: bool = True
) -> dict[str, dict[str, mutation_dictionary]]:
    """Extract mutation data from multiple sequence dictionaries with regard to a reference. Default is Pose numbering

    Args:
        reference: {chain: sequence, ...} The reference sequence to compare sequences to
        sequences: {pdb_code: {chain: sequence, ...}, ...}
        pose_num: Whether to return the mutations in Pose numbering with the first Entity as 1 and the
            second Entity as Entity1 last residue + 1

    Returns:
        {pdb_code: {chain_id: {mutation_index: {'from': 'A', 'to': 'K'}, ...}, ...}, ...}
    """
    # Add reference sequence mutations
    mutations = {'reference': {chain: {sequence_idx: {'from': aa, 'to': aa}
                                       for sequence_idx, aa in enumerate(ref_sequence, 1)}
                               for chain, ref_sequence in reference.items()}}
    #                         returns {1: {'from': 'A', 'to': 'K'}, ...}
    # mutations = {pdb: {chain: generate_mutations(sequence, reference[chain], offset=False)
    #                    for chain, sequence in chain_sequences.items()}
    #              for pdb, chain_sequences in pdb_sequences.items()}
    try:
        for name, chain_sequences in sequences.items():
            mutations[name] = {}
            for chain, sequence in chain_sequences.items():
                mutations[name][chain] = generate_mutations(reference[chain], sequence, offset=False)
    except KeyError:
        raise RuntimeError(
            f"The 'reference' and 'sequences' have different chains. Chain '{chain}' isn't in the reference")
    if pose_num:
        offset_dict = pdb_to_pose_offset(reference)
        # pose_mutations = {}
        # for chain, offset in offset_dict.items():
        #     for pdb_code in mutations:
        #         if pdb_code not in pose_mutations:
        #             pose_mutations[pdb_code] = {}
        #         pose_mutations[pdb_code][chain] = {}
        #         for mutation_idx in mutations[pdb_code][chain]:
        #             pose_mutations[pdb_code][chain][mutation_idx + offset] = mutations[pdb_code][chain][mutation_idx]
        # mutations = pose_mutations
        mutations = {name: {chain: {idx + offset: mutation for idx, mutation in chain_mutations[chain].items()}
                            for chain, offset in offset_dict.items()} for name, chain_mutations in mutations.items()}
    return mutations
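
A minimal sketch with hypothetical inputs (not from the package docs). The 'reference' entry holds identity "mutations" for every position, and with a single chain the Pose-numbering offset is zero:

# Sketch only: names, chains and sequences are illustrative
from symdesign.sequence import generate_multiple_mutations

reference = {'A': 'MKTAYIAKQR'}
sequences = {'design_1': {'A': 'MKTHYIAKQR'}}
generate_multiple_mutations(reference, sequences, pose_num=True)
# {'reference': {'A': {1: {'from': 'M', 'to': 'M'}, ..., 10: {'from': 'R', 'to': 'R'}}},
#  'design_1': {'A': {4: {'from': 'A', 'to': 'H'}}}}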

generate_mutations_from_reference

generate_mutations_from_reference(reference: Sequence[str], sequences: dict[str, Sequence[str]], **kwargs) -> dict[str, mutation_dictionary | sequence_dictionary]

Generate mutation data from multiple alias mapped sequence dictionaries with regard to a single reference.

Defaults to returning only mutations (return_all=False) and forgoes any sequence alignment (offset=False)

Parameters:

  • reference (Sequence[str]) –

    The reference sequence to align each sequence against. Character values are returned to the "from" key

  • sequences (dict[str, Sequence[str]]) –

    The template sequences to align, i.e. {alias: sequence, ...}. Character values are returned to the "to" key

Other Parameters:

  • offset

    bool = True - Whether sequences are different lengths. Will create an alignment of the two sequences

  • keep_gaps

    bool = False - Return gaped indices, i.e. outside the aligned sequences or missing internal characters

  • remove_termini

    bool = True - Remove indices that are outside the reference sequence boundaries

  • remove_query_gaps

    bool = True - Remove indices where there are gaps present in the query sequence

  • only_gaps

    bool = False - Only include reference indices that are missing query residues. All "to" values will be a gap "-"

  • zero_index

    bool = False - Whether to return the indices zero-indexed (like python Sequence) or one-indexed

  • return_all

    bool = False - Whether to return all the indices and their corresponding mutational data

  • return_to

    bool = False - Whether to return only the "to" amino acid type

  • return_from

    bool = False - Whether to return only the "from" amino acid type

Returns:

  • dict[str, mutation_dictionary | sequence_dictionary]

    {alias: {mutation_index: {'from': 'A', 'to': 'K'}, ...}, ...} unless return_to or return_from is True, then {alias: {mutation_index: 'K', ...}, ...}

Source code in symdesign/sequence/__init__.py (lines 942-984)
def generate_mutations_from_reference(reference: Sequence[str], sequences: dict[str, Sequence[str]], **kwargs) -> \
        dict[str, mutation_dictionary | sequence_dictionary]:
    """Generate mutation data from multiple alias mapped sequence dictionaries with regard to a single reference.

    Defaults to returning only mutations (return_all=False) and forgoes any sequence alignment (offset=False)

    Args:
        reference: The reference sequence to align each sequence against.
            Character values are returned to the "from" key
        sequences: The template sequences to align, i.e. {alias: sequence, ...}.
            Character values are returned to the "to" key

    Keyword Args:
        offset: bool = True - Whether sequences are different lengths. Will create an alignment of the two sequences
        keep_gaps: bool = False - Return gaped indices, i.e. outside the aligned sequences or missing internal
            characters
        remove_termini: bool = True - Remove indices that are outside the reference sequence boundaries
        remove_query_gaps: bool = True - Remove indices where there are gaps present in the query sequence
        only_gaps: bool = False - Only include reference indices that are missing query residues.
            All "to" values will be a gap "-"
        zero_index: bool = False - Whether to return the indices zero-indexed (like python Sequence) or one-indexed
        return_all: bool = False - Whether to return all the indices and their corresponding mutational data
        return_to: bool = False - Whether to return only the "to" amino acid type
        return_from: bool = False - Whether to return only the "from" amino acid type

    Returns:
        {alias: {mutation_index: {'from': 'A', 'to': 'K'}, ...}, ...} unless return_to or return_from is True, then
            {alias: {mutation_index: 'K', ...}, ...}
    """
    # offset_value = kwargs.get('offset')
    kwargs['offset'] = (kwargs.get('offset') or False)  # Default to False if there was no argument passed
    mutations = {alias: generate_mutations(reference, sequence, **kwargs)  # offset=False,
                 for alias, sequence in sequences.items()}

    # Add the reference sequence to mutation data
    if kwargs.get('return_to') or kwargs.get('return_from'):
        mutations[putils.reference_name] = dict(enumerate(reference, 0 if kwargs.get('zero_index') else 1))
    else:
        mutations[putils.reference_name] = \
            {sequence_idx: {'from': aa, 'to': aa}
             for sequence_idx, aa in enumerate(reference, 0 if kwargs.get('zero_index') else 1)}

    return mutations
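
A minimal sketch with hypothetical aliases and sequences; the reference itself is added back under the key given by putils.reference_name:

# Sketch only: aliases and sequences are illustrative
from symdesign.sequence import generate_mutations_from_reference

reference = 'MKTAYIAKQR'
sequences = {'design_1': 'MKTHYIAKQR', 'design_2': 'MKTAYIAKQL'}
generate_mutations_from_reference(reference, sequences)
# {'design_1': {4: {'from': 'A', 'to': 'H'}},
#  'design_2': {10: {'from': 'R', 'to': 'L'}},
#  <putils.reference_name>: {1: {'from': 'M', 'to': 'M'}, ..., 10: {'from': 'R', 'to': 'R'}}}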

make_sequences_from_mutations

make_sequences_from_mutations(wild_type: str, pdb_mutations: dict, aligned: bool = False) -> dict

Apply sets of sequence mutations to a wild-type sequence and return the mutated sequences

Parameters:

  • wild_type (str) –

    Sequence to mutate

  • pdb_mutations (dict) –

    {name: {mutation_index: {'from': AA, 'to': AA}, ...}, ...}

  • aligned (bool, default: False ) –

    Whether the input sequences are already aligned

Returns:

  • dict

    {name: sequence, ...}

Source code in symdesign/sequence/__init__.py (lines 987-998)
def make_sequences_from_mutations(wild_type: str, pdb_mutations: dict, aligned: bool = False) -> dict:
    """Takes a list of sequence mutations and returns the mutated form on wildtype

    Args:
        wild_type: Sequence to mutate
        pdb_mutations: {name: {mutation_index: {'from': AA, 'to': AA}, ...}, ...}
        aligned: Whether the input sequences are already aligned

    Returns:
        {name: sequence, ...}
    """
    return {pdb: make_mutations(wild_type, mutations, find_orf=not aligned) for pdb, mutations in pdb_mutations.items()}
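
A minimal sketch with hypothetical values; aligned=True passes find_orf=False through to make_mutations, so the zero-indexed keys are applied directly:

# Sketch only: the design name and mutation are illustrative
from symdesign.sequence import make_sequences_from_mutations

wild_type = 'MKTAYIAKQR'
pdb_mutations = {'design_1': {3: {'from': 'A', 'to': 'V'}}}
make_sequences_from_mutations(wild_type, pdb_mutations, aligned=True)
# {'design_1': 'MKTVYIAKQR'}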

generate_sequences

generate_sequences(wild_type_sequences: dict, all_design_mutations: dict) -> dict

Separate chains from mutation dictionary and generate mutated sequences

Parameters:

  • wild_type_sequences (dict) –

    {chain: sequence, ...}

  • all_design_mutations (dict) –

    {'name': {chain: {mutation_index: {'from': AA, 'to': AA}, ...}, ...}, ...} Indexed so that mutation_index starts at 1

Returns:

  • dict

    {chain: {name: sequence, ...}, ...}

Source code in symdesign/sequence/__init__.py (lines 1001-1021)
def generate_sequences(wild_type_sequences: dict, all_design_mutations: dict) -> dict:
    """Separate chains from mutation dictionary and generate mutated sequences

    Args:
        wild_type_sequences: {chain: sequence, ...}
        all_design_mutations: {'name': {chain: {mutation_index: {'from': AA, 'to': AA}, ...}, ...}, ...}
            Indexed so that mutation_index starts at 1

    Returns:
        {chain: {name: sequence, ...}, ...}
    """
    mutated_sequences = {}
    for chain in wild_type_sequences:
        # pdb_chain_mutations = {pdb: chain_mutations.get(chain) for pdb, chain_mutations in all_design_mutations.items()}
        pdb_chain_mutations = {}
        for pdb, chain_mutations in all_design_mutations.items():
            if chain in chain_mutations:
                pdb_chain_mutations[pdb] = all_design_mutations[pdb][chain]
        mutated_sequences[chain] = make_sequences_from_mutations(wild_type_sequences[chain], pdb_chain_mutations,
                                                                 aligned=True)
    return mutated_sequences
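
A minimal sketch with hypothetical values showing the expected shapes. Note that the docstring above states one-indexed mutation keys while make_mutations applies them zero-indexed, so confirm the indexing convention of your mutation data before relying on the exact positions:

# Sketch only: chains, design names and mutations are illustrative
from symdesign.sequence import generate_sequences

wild_type_sequences = {'A': 'MKTAYIAKQR'}
all_design_mutations = {'design_1': {'A': {3: {'from': 'A', 'to': 'V'}}}}
generate_sequences(wild_type_sequences, all_design_mutations)
# {'A': {'design_1': <mutated sequence>}}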

numeric_to_sequence

numeric_to_sequence(numeric_sequence: ndarray, translation_table: dict[int, str] = None, alphabet_order: int = 1) -> ndarray

Convert a numeric sequence array into a sequence array

Parameters:

  • numeric_sequence (ndarray) –

    The sequence to convert

  • translation_table (dict[int, str], default: None ) –

    If a translation table is provided, it will be used. If not, use alphabet_order

  • alphabet_order (int, default: 1 ) –

    The alphabetical order of the amino acid alphabet. Can be either 1 or 3

Returns: The alphabetically encoded sequence where each entry along axis=-1 is the one letter amino acid

Source code in symdesign/sequence/__init__.py (lines 1024-1043)
def numeric_to_sequence(numeric_sequence: np.ndarray, translation_table: dict[int, str] = None,
                        alphabet_order: int = 1) -> np.ndarray:
    """Convert a numeric sequence array into a sequence array

    Args:
        numeric_sequence: The sequence to convert
        translation_table: If a translation table is provided, it will be used. If not, use alphabet_order
        alphabet_order: The alphabetical order of the amino acid alphabet. Can be either 1 or 3
    Returns:
        The alphabetically encoded sequence where each entry along axis=-1 is the one letter amino acid
    """
    if translation_table is not None:
        return np.vectorize(translation_table.__getitem__)(numeric_sequence)
    else:
        if alphabet_order == 1:
            return np.vectorize(sequence_translation_alph1.__getitem__)(numeric_sequence)
        elif alphabet_order == 3:
            return np.vectorize(sequence_translation_alph3.__getitem__)(numeric_sequence)
        else:
            raise ValueError(f"The 'alphabet_order' {alphabet_order} isn't valid. Choose from either 1 or 3")
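
A minimal sketch, assuming the default alphabet_order=1 translation maps 0-based indices onto 'ACDEFGHIKLMNPQRSTVWY':

# Sketch only: the array values are illustrative, and the index-to-letter mapping is assumed
import numpy as np
from symdesign.sequence import numeric_to_sequence

numeric = np.array([[0, 3, 4], [1, 2, 0]])
numeric_to_sequence(numeric)
# array([['A', 'E', 'F'],
#        ['C', 'D', 'A']])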

get_equivalent_indices

get_equivalent_indices(target: Sequence = None, query: Sequence = None, mutation_allowed: bool = False) -> tuple[list[int], list[int]]

From two sequences, find the indices where both sequences are equal

Parameters:

  • target (Sequence, default: None ) –

    The first sequence to compare

  • query (Sequence, default: None ) –

    The second sequence to compare

  • mutation_allowed (bool, default: False ) –

    Whether equivalent indices can exist at mutation sites

Returns: The pair of indices where the sequences align. Ex:
    sequence1 = A B C D E F ...
    sequence2 = A B - D E F ...
    returns [0, 1, 3, 4, 5, ...], [0, 1, 2, 3, 4, ...]

Source code in symdesign/sequence/__init__.py (lines 1046-1114)
def get_equivalent_indices(target: Sequence = None, query: Sequence = None, mutation_allowed: bool = False) \
        -> tuple[list[int], list[int]]:
    """From two sequences, find the indices where both sequences are equal

    Args:
        target: The first sequence to compare
        query: The second sequence to compare
        mutation_allowed: Whether equivalent indices can exist at mutation sites
    Returns:
        The pair of indices where the sequences align.
            Ex: sequence1 = A B C D E F ...
                sequence2 = A B - D E F ...
            returns        [0,1,  3,4,5, ...],
                           [0,1,  2,3,4, ...]
    """
    # alignment: Sequence = None
    #     alignment: An existing Bio.Align.Alignment object
    # if alignment is None:
    if target is not None and query is not None:
        # # Get all mutations from the alignment of sequence1 and sequence2
        mutations = generate_mutations(target, query, keep_gaps=True, return_all=True)
        # alignment = generate_alignment(target, query)
        # alignment.inverse_indices
        # return
    else:
        raise ValueError(f"Can't {get_equivalent_indices.__name__} without passing either 'alignment' or "
                         f"'target' and 'query'")
    # else:  # Todo this may not be ever useful since the alignment needs to go into the generate_mutations()
    #     raise NotImplementedError(f"Set {get_equivalent_indices.__name__} up with an Alignment object from Bio.Align")

    target_mutations = ''.join([mutation['from'] for mutation in mutations.values()])
    query_mutations = ''.join([mutation['to'] for mutation in mutations.values()])
    logger.debug(f"Sequence info:\ntarget :{target_mutations}\nquery  :{query_mutations}")
    # Get only those indices where there is an aligned aa on the opposite chain
    sequence1_indices, sequence2_indices = [], []
    # to_idx = from_idx = 0
    to_idx, from_idx = count(), count()
    # sequence1 'from' is fixed, sequence2 'to' is moving
    for mutation_idx, mutation in enumerate(mutations.values()):
        if mutation['to'] == mutation['from']:  # They are equal
            sequence1_indices.append(next(from_idx))  # from_idx)
            sequence2_indices.append(next(to_idx))  # to_idx)
        elif mutation['from'] == '-':  # increment to_idx/fixed_idx
            # to_idx += 1
            next(to_idx)
        elif mutation['to'] == '-':  # increment from_idx/moving_idx
            # from_idx += 1
            next(from_idx)
        elif mutation['to'] != mutation['from']:
            if mutation_allowed:
                sequence1_indices.append(next(from_idx))
                sequence2_indices.append(next(to_idx))
            else:
                next(from_idx)
                next(to_idx)
            # to_idx += 1
            # from_idx += 1
        else:  # What else is there
            target_mutations = ''.join([mutation['from'] for mutation in mutations.values()])
            query_mutations = ''.join([mutation['to'] for mutation in mutations.values()])
            raise RuntimeError(f"This should never be reached. Ran into error at index {mutation_idx}:\n"
                               f"{mutation}\nSequence info:\ntarget :{target_mutations}\nquery  :{query_mutations}")
        # else:  # They are equal
        #     sequence1_indices.append(next(from_idx))  # from_idx)
        #     sequence2_indices.append(next(to_idx))  # to_idx)
        #     # to_idx += 1
        #     # from_idx += 1

    return sequence1_indices, sequence2_indices
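
A minimal sketch with hypothetical sequences, assuming the pairwise alignment places the single gap opposite the deleted residue:

# Sketch only: sequences are illustrative and the alignment outcome is assumed
from symdesign.sequence import get_equivalent_indices

target = 'MKTAYIAKQR'
query = 'MKTYIAKQR'  # the 'A' at target index 3 is absent
get_equivalent_indices(target, query)
# ([0, 1, 2, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8])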

get_lod

get_lod(frequencies: dict[protein_letters_literal, float], background: dict[protein_letters_literal, float], as_int: bool = True) -> dict[str, int | float]

Get the log of the odds that an amino acid is in a frequency distribution compared to a background frequency

Parameters:

  • frequencies (dict[protein_letters_literal, float]) –

    {'A': 0.11, 'C': 0.01, 'D': 0.034, ...}

  • background (dict[protein_letters_literal, float]) –

    {'A': 0.10, 'C': 0.02, 'D': 0.04, ...}

  • as_int (bool, default: True ) –

    Whether to round the lod values to an integer

Returns: The log of odds for each amino acid type {'A': 2, 'C': -9, 'D': -1, ...}

Source code in symdesign/sequence/__init__.py (lines 1123-1155)
def get_lod(frequencies: dict[protein_letters_literal, float],
            background: dict[protein_letters_literal, float], as_int: bool = True) -> dict[str, int | float]:
    """Get the log of the odds that an amino acid is in a frequency distribution compared to a background frequency

    Args:
        frequencies: {'A': 0.11, 'C': 0.01, 'D': 0.034, ...}
        background: {'A': 0.10, 'C': 0.02, 'D': 0.04, ...}
        as_int: Whether to round the lod values to an integer
    Returns:
         The log of odds for each amino acid type {'A': 2, 'C': -9, 'D': -1, ...}
    """
    lods = {}
    for aa, freq in frequencies.items():
        try:  # Todo why is this 2. * the log2? I believe this is a heuristic of BLOSUM62. This should be removed...
            lods[aa] = float(2. * log2(freq / background[aa]))  # + 0.0
        except ValueError:  # math domain error
            lods[aa] = -9
        except KeyError:
            if aa in protein_letters_alph1:
                raise KeyError(f'{aa} was not in the background frequencies: {", ".join(background)}')
            else:  # We shouldn't worry about a missing value if it's not an amino acid
                continue
        except ZeroDivisionError:  # background is 0. We may need a pseudocount...
            raise ZeroDivisionError(f'{aa} has a background frequency of 0. Consider adding a pseudocount')
        # if lods[aa] < -9:
        #     lods[aa] = -9
        # elif round_lod:
        #     lods[aa] = round(lods[aa])

    if as_int:
        return {aa: (int(value) if value >= -9 else -9) for aa, value in lods.items()}
    else:  # ensure that -9 is the lowest value (formatting issues if 2 digits)
        return {aa: (value if value >= -9 else -9) for aa, value in lods.items()}
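
A minimal sketch using the frequency values from the docstring; with as_int=True each 2 * log2(frequency / background) value is truncated to an integer and floored at -9:

# Sketch only: the background values are illustrative
from symdesign.sequence import get_lod

frequencies = {'A': 0.11, 'C': 0.01, 'D': 0.034}
background = {'A': 0.10, 'C': 0.02, 'D': 0.04}
get_lod(frequencies, background)
# {'A': 0, 'C': -2, 'D': 0}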

parse_pssm

parse_pssm(file: AnyStr, **kwargs) -> dict[int, dict[str, str | float | int | dict[str, int]]]

Take the contents of a pssm file, parse, and input into a pose profile dictionary.

Resulting dictionary is indexed according to the values in the pssm file

Parameters:

  • file (AnyStr) –

    The location of the file on disk

Returns: Dictionary containing residue indexed profile information i.e. {1: {'A': 0, 'R': 0, ..., 'lod': {'A': -5, 'R': -5, ...}, 'type': 'W', 'info': 3.20, 'weight': 0.73}, 2: {}, ...}

Source code in symdesign/sequence/__init__.py (lines 1158-1196)
def parse_pssm(file: AnyStr, **kwargs) -> dict[int, dict[str, str | float | int | dict[str, int]]]:
    """Take the contents of a pssm file, parse, and input into a pose profile dictionary.

    Resulting dictionary is indexed according to the values in the pssm file

    Args:
        file: The location of the file on disk
    Returns:
        Dictionary containing residue indexed profile information
            i.e. {1: {'A': 0, 'R': 0, ..., 'lod': {'A': -5, 'R': -5, ...}, 'type': 'W', 'info': 3.20, 'weight': 0.73},
                  2: {}, ...}
    """
    with open(file, 'r') as f:
        lines = f.readlines()

    pose_dict = {}
    for line in lines:
        line_data = line.strip().split()
        if len(line_data) == 44:
            residue_number = int(line_data[0])
            pose_dict[residue_number] = \
                dict(zip(protein_letters_alph3,
                         [x / 100. for x in map(int, line_data[22:len(
                             protein_letters_alph3) + 22])]))
            # pose_dict[residue_number] = aa_counts_alph3.copy()
            # for i, aa in enumerate(protein_letters_alph3, 22):
            #     # Get normalized counts for pose_dict
            #     pose_dict[residue_number][aa] = int(line_data[i]) / 100.

            # for i, aa in enumerate(protein_letters_alph3, 2):
            #     pose_dict[residue_number]['lod'][aa] = line_data[i]
            pose_dict[residue_number]['lod'] = \
                dict(zip(protein_letters_alph3, line_data[2:len(
                    protein_letters_alph3) + 2]))
            pose_dict[residue_number]['type'] = line_data[1]
            pose_dict[residue_number]['info'] = float(line_data[42])
            pose_dict[residue_number]['weight'] = float(line_data[43])

    return pose_dict
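
A minimal access sketch (the file path is hypothetical). Each residue entry carries the normalized counts, the 'lod' column values (kept as the strings parsed from the file), the reference residue 'type', and the 'info'/'weight' statistics:

# Sketch only: the path is illustrative
from symdesign.sequence import parse_pssm

pssm = parse_pssm('design.pssm')
pssm[1]['type']      # one-letter code of the reference residue at position 1
pssm[1]['A']         # normalized observation frequency for alanine
pssm[1]['lod']['A']  # log-odds value parsed from the PSSM columns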

parse_hhblits_pssm

parse_hhblits_pssm(file: AnyStr, null_background: bool = True, **kwargs) -> ProfileDict

Parse the contents of a protein .hmm file into a profile dictionary. Amino acid columns follow single-letter alphabetical order

Parameters:

  • file (AnyStr) –

    The file to parse, typically with the extension '.hmm'

  • null_background (bool, default: True ) –

    Whether to use the null background for the specific protein

Returns: Dictionary containing residue indexed profile information Ex: {1: {'A': 0.04, 'C': 0.12, ..., 'lod': {'A': -5, 'C': -9, ...}, 'type': 'W', 'info': 0.00, 'weight': 0.00}, {...}}

Source code in symdesign/sequence/__init__.py (lines 1199-1245)
def parse_hhblits_pssm(file: AnyStr, null_background: bool = True, **kwargs) -> ProfileDict:
    """Take contents of protein.hmm, parse file and input into pose_dict. File is Single AA code alphabetical order

    Args:
        file: The file to parse, typically with the extension '.hmm'
        null_background: Whether to use the null background for the specific protein
    Returns:
        Dictionary containing residue indexed profile information
            Ex: {1: {'A': 0.04, 'C': 0.12, ..., 'lod': {'A': -5, 'C': -9, ...}, 'type': 'W', 'info': 0.00,
                     'weight': 0.00}, {...}}
    """
    null_bg = latest_uniclust_background_frequencies

    def to_freq(value: str) -> float:
        if value == '*':  # When frequency is zero
            return 0.0001
        else:
            # Equation: value = -1000 * log_2(frequency)
            return 2 ** (-int(value) / 1000)

    with open(file, 'r') as f:
        lines = f.readlines()

    evolutionary_profile = {}
    dummy = 0.
    read = False
    for line in lines:
        if not read:
            if line[0:1] == '#':
                read = True
        else:
            if line[0:4] == 'NULL':
                if null_background:  # Use the provided null background from the profile search
                    null, *background_values = line.strip().split()
                    # null = 'NULL', background_values = list[str] ['3706', '5728', ...]
                    null_bg = {aa: to_freq(value) for value, aa in zip(background_values,
                                                                       protein_letters_alph3)}

            if len(line.split()) == 23:
                residue_type, residue_number, *position_values = line.strip().split()
                aa_freqs = {aa: to_freq(value) for value, aa in zip(position_values,
                                                                    protein_letters_alph1)}

                evolutionary_profile[int(residue_number)] = \
                    dict(lod=get_lod(aa_freqs, null_bg), type=residue_type, info=dummy, weight=dummy, **aa_freqs)

    return evolutionary_profile
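
A minimal access sketch (the file path is hypothetical). The stored frequencies are decoded from the hhblits encoding value = -1000 * log2(frequency), and 'info'/'weight' are placeholder zeros:

# Sketch only: the path is illustrative
from symdesign.sequence import parse_hhblits_pssm

profile = parse_hhblits_pssm('protein.hmm')
profile[1]['type']        # query residue type at position 1
profile[1]['A']           # decoded emission frequency for alanine
profile[1]['lod']['A']    # log-odds versus the (null) background frequencies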