Scoring matrices – my.bioinfo.guru

The logic and math behind scoring sequence alignments.

All differences are not same.

There are twenty different natural amino acids that makeup proteins. Some of these amino acids share similar physicochemical properties e.g. there are amino acids with aliphatic functional groups (viz. Alanine and Valine) whereas some have an aromatic sidechain (viz. Tyrosine and Phenylalanine); some are positively charged (viz. Lysine and Arginine) while others are negatively charged (viz. Aspartate and Glutamate), etc. While scoring sequence alignments we must account for these similarities within different amino acids because many times a substitution of an amino acid with another amino acid having similar physicochemical properties would have a little or no functional consequence. So, the mismatch score should be less for an amino acid pair having similar physicochemical properties compared to an amino acid pair having different physicochemical properties. The scoring matrices, however, do not consider the physicochemical properties of the amino acids. Instead, they are based on the statistical distribution of different substitutions; as observed in a selected sub-set of sequences. This approach effectively ensures that all differences are not scored the same. In nature, the type-to-type substitutions would be more frequent compared to an amino acid substituting for an amino acid with different physicochemical properties.

Scoring matrix construction

Let’s gain a better understanding of the concept of a scoring matrix by constructing one. Here, we’ll start with a set of dummy DNA sequences to derive the substitution frequency. This would in turn would be used to build a scoring system for different substitutions aka a scoring matrix. The procedure will be same for a set of protein sequences. First, we’ll import some useful libraries for this exercise - Biopython, math, and numpy.

from Bio.Seq import Seq
import math
import numpy as np

Alignment data

# %load seqs.fa
>s1
TACAA
>s2
TACGC
>s3
TACAC
>s4
TACCC
>s5
AACTC
>s6
AATGC
>s7
AATGC

We’ll read our alignment file using the AlignIO module and calculate the pairwise substitution for all the pairs in our data. To do this, for each column, we find out how many times a particular character changes to another. Let’s take the last column where we have one A and six C's. For this column the total number of AC (or CA) substitutions would be six because the A in the first sequence changes C in other six sequences. Similarly, calculating the A $\to$ C substitution for all the columns, we get a total of 8 such substitutions and save this in subs_arr. The total_pos stores the total count of all nucleotides in our sequence dataset.

from Bio import AlignIO
s1 = AlignIO.read("seqs.fa","fasta")
print(s1)
print(s1.substitutions)
subs_arr = s1.substitutions.flatten()
subs_arr

Alignment with 7 rows and 5 columns
TACAA s1
TACGC s2
TACAC s3
TACCC s4
AACTC s5
AATGC s6
AATGC s7
     A    C   G   T
A 25.0  4.0 3.0 7.0
C  4.0 25.0 1.5 5.5
G  3.0  1.5 3.0 1.5
T  7.0  5.5 1.5 7.0

Array([25. ,  4. ,  3. ,  7. ,  4. , 25. ,  1.5,  5.5,  3. ,  1.5,  3. ,
        1.5,  7. ,  5.5,  1.5,  7. ],
         alphabet='ACGT')

from collections import Counter
k = ""
for x in range(0,s1.get_alignment_length()):
    k = k+s1[:,x]
total_pos = (Counter(k))
total_pos

Counter({'T': 7, 'A': 13, 'C': 12, 'G': 3})

Note that the substitution matrix is square symmetric and the total number of pairwise substitions are equally divided into the two possible combinations for each pair. For example, A $\to$ C is equivalent to C $\to$ A, so the total number of substitutions between A and C (8 in our dataset) are equally divided among AC and CA.

Next, we make a list having all possible pairs given the four nucleotides. There’ll be 4x4 = 16 elements in this list.

import itertools
all_pairs = ["".join(x) for x in list(itertools.product(subs_arr.alphabet,subs_arr.alphabet))]
print(all_pairs)

['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']

Expected probability of pairwise substitution

Now, we’ll calculate the expected probability of pairwise substitution, in our dataset, for each of the pairs in the list above using the formula below.

$\frac{Frequency of the first nucleotide}{Total number of nucleotides} X \frac{Frequency of the second nucleotide}{Total number of nucleotides}$

For the AA (adenine-adenine) pair, the corresponding value would be $\frac{13}{35} X \frac{13}{35}$ which is 0.137.

Similarly, for the AC pair the value would be $\frac{13}{35} X \frac{12}{35}$ which is 0.127. Note that the value for CA substitution would be equal to AC.

exp_arr = []
for p in all_pairs:
    exp_arr.append(total_pos[p[0]]/sum(total_pos.values())*total_pos[p[1]]/sum(total_pos.values()))
print(exp_arr)

[0.1379591836734694, 0.1273469387755102, 0.03183673469387755, 0.07428571428571429, 0.1273469387755102, 0.11755102040816326, 0.029387755102040815, 0.06857142857142857, 0.03183673469387755, 0.029387755102040815, 0.007346938775510204, 0.017142857142857144, 0.07428571428571429, 0.06857142857142857, 0.017142857142857144, 0.04]

Observed probability of pairwise substitution

We now need the observed probability of pairwise substitution for each of the nucleotide pair. Note that we already have the substitution frequency for our alignment dataset - subs_arr. Dividing the elements of this array by the total number of position would give the require probabilities.

obs_arr = []
for a1 in subs_arr:
    obs_arr.append(a1/sum(total_pos.values()))
print(obs_arr)

[0.7142857142857143, 0.11428571428571428, 0.08571428571428572, 0.2, 0.11428571428571428, 0.7142857142857143, 0.04285714285714286, 0.15714285714285714, 0.08571428571428572, 0.04285714285714286, 0.08571428571428572, 0.04285714285714286, 0.2, 0.15714285714285714, 0.04285714285714286, 0.2]

Scoring matrix

Finally, we generate the scores by calculating the log odds ratio as follows

$Substitution Score for ij = 2 l o g_{2} \frac{Observed probability for ij}{Expected probability for ij}$

final_mat_list = []
for i,j in zip(obs_arr,exp_arr):
    final_mat_list.append(2*math.log2(i/j))
final_mat = np.array(final_mat_list)
print(final_mat.reshape(4,4))

[[ 4.74451954 -0.3122384   2.8576866   2.8576866 ]
 [-0.3122384   5.20642841  1.08864103  2.39279443]
 [ 2.8576866   1.08864103  7.08864103  2.64385619]
 [ 2.8576866   2.39279443  2.64385619  4.64385619]]

Final scoring matrix is generated by rounding the values.

print(np.round_(final_mat.reshape(4,4)))

[[ 5. -0.  3.  3.]
 [-0.  5.  1.  2.]
 [ 3.  1.  7.  3.]
 [ 3.  2.  3.  5.]]

import pandas as pd
s_mat = pd.DataFrame(np.rint(final_mat.reshape(4,4)), columns=["A","C","G","T"],\
             index=["A","C","G","T"],dtype=int)
print(s_mat)

   A  C  G  T
A  5  0  3  3
C  0  5  1  2
G  3  1  7  3
T  3  2  3  5

The pandas library is used to add some visual appeal. This link has useful information about styling a dataframe.

	A	C	G	T
A	5	0	3	3
C	0	5	1	2
G	3	1	7	3
T	3	2	3	5

BLOSUM matrix

The algorithm presented above was originally proposed by Henikoff and Henikoffis in 1992 and is the basis of BLOck SUbstitution Matrix published in PNAS. In that research article they “derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins”. Here blocks refer to the regions of aligned sequences with no gaps.