<Protein Secondary Structure Prediction Based on
Position-specific Scoring Matrices>
As more and more genome sequences have been determined, predicting protein structure from amino acid sequence has become increasingly popular. When a template structure is available, an unknown protein structure is predicted by comparative modelling. If no applicable template can be found, fold recognition methods are used. There are also ab initio tertiary structure methods, which try to predict the protein structure without comparing similarities between proteins. However, the most commonly used ab initio approach is secondary structure prediction.
Early methods for predicting secondary structure relied on either simple stereochemical principles or statistics. Analyzing a whole family of sequences is more useful than analyzing a single sequence because it provides more information. Multiple sequence analysis of this kind succeeded in predicting the secondary structure of the alpha-subunit of tryptophan synthase, and also that of the cAMP-dependent kinases. The most important part of predicting secondary structure is identifying the most conserved regions, because these are functionally important and form the protein core. Benner and Gerloff found that the degree of solvent accessibility of an amino acid residue can be predicted by grouping the sequences of one family and examining the degree of sequence variability between similar pairs. They then predicted the secondary structure by comparing the accessibility patterns.
There have been several efforts to automate these ideas. Rost and Sander developed the PHD method, which uses a set of feed-forward neural networks trained by back propagation, and it has become the de facto standard for secondary structure prediction. The authors of this paper introduce a new method. Although the method is greatly simplified, it achieves high prediction accuracy, with the highest score in the third CASP experiment. It is also easy to run on any computer system.
The prediction method can be explained in three stages. The first is generating a sequence profile. The aim is to be able to generate a sequence profile, and indeed to make the prediction itself, on any workstation. A disadvantage of PHD is that its server relies on a multiprocessor computer system to build multiple sequence alignments, so it is not easy to run predictions at another site. Also, the degree of divergence among the aligned sequences correlates with the prediction accuracy of such methods. PSI-BLAST is a new method based on BLAST, designed for very sensitive sequence comparison. With suitably chosen parameters and filtered data banks, it is known to perform better than a standard Smith-Waterman search in detecting distant homologues of a query sequence. An advantage of using PSI-BLAST is that it radically reduces the total time needed to predict secondary structure. Although it is powerful in this respect, it can fail for several reasons. In particular, because the algorithm is iterative, it depends heavily on the sequences in the data bank, especially when they contain repetitive sequence. To solve this problem, the authors built a custom sequence data bank. The final position-specific scoring matrix from PSI-BLAST is used as input to the neural network.
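As a rough illustration of how a position-specific scoring matrix could be turned into neural network input, the sketch below squashes each raw score into the range 0 to 1 with a logistic function and collects a sliding window of matrix rows around each residue. The window width, the extra flag marking window positions that fall beyond the chain ends, and the function names are assumptions made for illustration, not details given in the text above.

```python
import math

def scale(score):
    """Squash a raw PSSM score into (0, 1) with the standard logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

def window_inputs(pssm, center, half_width=7):
    """Build a flat input vector for the residue at index `center`.

    pssm: list of rows, one per residue, each a list of raw scores.
    Every window position contributes its scaled scores plus one flag that
    is 1.0 when the position lies outside the chain (assumed convention).
    """
    n_cols = len(pssm[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            features.extend(scale(s) for s in pssm[pos])
            features.append(0.0)  # position is inside the chain
        else:
            features.extend([0.0] * n_cols)
            features.append(1.0)  # position is beyond a chain terminus
    return features

# Toy PSSM: 3 residues, 20 columns, all scores zero (hypothetical data).
toy_pssm = [[0] * 20 for _ in range(3)]
vec = window_inputs(toy_pssm, center=0, half_width=1)
print(len(vec))  # 3 window positions * 21 values = 63
```

With a real matrix, each row would hold the 20 amino acid scores for one residue of the query sequence, and one such vector would be built per residue.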
For PSIPRED, a standard feed-forward back propagation network architecture was used. A second network is used to filter successive outputs from the main network. Only three possible inputs were used for each amino acid position in this second network, and it had a smaller hidden layer of 60 units. The weights were optimized with an online back propagation training procedure, in which they are updated after each pattern presentation. Also, to prevent overfitting, 10% of the training set was withheld from training and used to evaluate the network's performance.
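The training procedure described above can be sketched in a few lines. This is a minimal illustration of online back propagation with a held-out set, assuming a single hidden layer of sigmoid units and a squared-error loss. The class name `TinyNet`, the layer sizes, the learning rate, the toy data, and the rule of stopping once the held-out error has not improved for ten epochs are all illustrative assumptions, not details taken from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """One-hidden-layer feed-forward network with sigmoid units."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias term.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1] + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return xb, hb, y

    def train_one(self, x, target, lr=0.5):
        """Online update: adjust the weights right after this one pattern."""
        xb, hb, y = self.forward(x)
        # Output deltas for squared error with sigmoid outputs.
        d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
        # Hidden deltas, computed before the output weights are changed.
        d_hid = [hb[j] * (1.0 - hb[j]) * sum(d * row[j] for d, row in zip(d_out, self.w2))
                 for j in range(len(self.w1))]
        for d, row in zip(d_out, self.w2):
            for j in range(len(row)):
                row[j] -= lr * d * hb[j]
        for d, row in zip(d_hid, self.w1):
            for i in range(len(row)):
                row[i] -= lr * d * xb[i]

def mean_error(net, data):
    """Mean squared error of the network over a list of (input, target) pairs."""
    total = 0.0
    for x, t in data:
        y = net.forward(x)[2]
        total += sum((yk - tk) ** 2 for yk, tk in zip(y, t))
    return total / len(data)

# Toy task (hypothetical data): output 1 when the first input exceeds the second.
rng = random.Random(1)
data = []
for _ in range(100):
    a, b = rng.random(), rng.random()
    data.append(([a, b], [1.0 if a > b else 0.0]))
train, held_out = data[:90], data[90:]  # 10% is never used for weight updates

net = TinyNet(n_in=2, n_hidden=4, n_out=1)
start_error = mean_error(net, held_out)
best_error, stale = start_error, 0
for epoch in range(200):
    for x, t in train:
        net.train_one(x, t)
    err = mean_error(net, held_out)
    if err < best_error:
        best_error, stale = err, 0
    else:
        stale += 1
        if stale >= 10:  # assumed stopping rule
            break
print(start_error, best_error)
```

The held-out patterns never contribute to a weight update, so their error gives an estimate of how well the network generalizes while training is still in progress.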
To evaluate the accuracy of secondary structure prediction, the authors used a cross-validated testing procedure. Until then, such tests had been based on significant sequence similarity, but that criterion raised a serious complication. So, instead of removing from the training set any protein with a significant degree of sequence similarity to a member of the test set, they removed any protein with a similar fold to a member of the test set. When the training and test sets were built, each pair of proteins from the test and training sets was assessed using the CATH classification. As a further check, five iterations of PSI-BLAST were run to find any missed relationships, but no overlap between the training and test sets was found. On this basis, three independent pairs of training and test sets were used. The three secondary structure states (helix, strand, and coil) were derived from the definitions produced by DSSP. The eight DSSP states (H, I, G, E, B, S, T, and the blank unassigned state) were combined into three: H and G were treated as helix, E and B as strand, and all others as coil.
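The three-state reduction described above is simple enough to write down directly. In the sketch below, the function name and the output letters H, E, and C for helix, strand, and coil are assumptions chosen for illustration. Any character that is not H, G, E, or B is treated as coil.

```python
# Three-state reduction: H and G -> helix, E and B -> strand,
# every other DSSP state -> coil.
HELIX_STATES = {"H", "G"}
STRAND_STATES = {"E", "B"}

def reduce_dssp(dssp_string):
    """Map a string of DSSP states to H (helix), E (strand), or C (coil)."""
    reduced = []
    for state in dssp_string:
        if state in HELIX_STATES:
            reduced.append("H")
        elif state in STRAND_STATES:
            reduced.append("E")
        else:
            reduced.append("C")
    return "".join(reduced)

print(reduce_dssp("HHGGIEEBTS-"))  # HHHHCEEECCC
```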
When tested on 187 proteins, the method scored high on both the Q3 and Sov3 measures. With the simpler DSSP mapping, it also showed a high Q3 score. Although the results are impressive, there is still a possibility of bias toward experimentally determined structures.
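For reference, Q3 is the percentage of residues whose three-state prediction matches the observed state. A minimal sketch follows, in which the function name and the example strings are made up for illustration. The segment-overlap measure Sov3 is more involved and is not sketched here.

```python
def q3(predicted, observed):
    """Percentage of positions where predicted and observed states agree."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Hypothetical example: 6 of 8 positions agree.
print(q3("HHHECCCC", "HHEECCCH"))  # 75.0
```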
So far, it is not clear exactly which factors contributed to the success of the PSIPRED method, and the question is still under discussion. However, three aspects of the PSI-BLAST program contributed to the success of PSIPRED. First, the alignments from PSI-BLAST are based on pairwise local alignments. Second, the use of iterated profiles improves the sensitivity of PSI-BLAST. Finally, using PSI-BLAST alignments rather than structure comparison contributes to higher accuracy. The most significant result is that a very simple and straightforward neural network prediction from PSI-BLAST profiles ranks at the very top.