Prediction of protein Post-Translational Modification sites: An overview

Post-translational modification (PTM) refers to the covalent and enzymatic modification of proteins during or after protein biosynthesis. During protein biosynthesis, ribosomes translate mRNA into polypeptide chains, which may then undergo PTM to form the mature protein product [1]. PTM is a common biological mechanism in both eukaryotic and prokaryotic organisms. It regulates protein function, the proteolytic cleavage of regulatory subunits and the degradation of entire proteins, and it affects all aspects of cellular life. The PTM of a protein can also determine its cell signaling state, turnover, localization, and interactions with other proteins [2]. Therefore, the analysis of proteins and their PTMs is particularly important for the study of heart disease, cancer, neurodegenerative diseases and diabetes [3,4]. Although the characterization of PTMs provides invaluable insight into cellular functions in etiological processes, challenges remain. Technically, the major challenge in studying PTMs is the development of specific detection and purification methods.


Background
PTMs of proteins have been detected by a variety of experimental techniques, including mass spectrometry (MS) [5,6], liquid chromatography [7], radioactive chemical methods [8], chromatin immunoprecipitation (ChIP) [9], western blotting [10], and eastern blotting [7]. MS is one of the mainstay routes for detecting PTMs in a high-throughput manner. New MS and capillary liquid chromatography instrumentation, together with advances in enrichment strategies, have revolutionized our growing knowledge of various PTMs [11]. Over the last decade, a detailed picture of the complexity of many PTMs has emerged through these diverse technologies, and thousands of precise modification sites can now be identified with high confidence [12][13][14][15][16][17][18][19][20]. One fragmentation strategy for PTM identification is beam-type collision-induced dissociation, also called higher-energy collisional dissociation [21]. This type of fragmentation is characterized by a higher activation energy. Other fragmentation methods for precursor ions are based on radical anions or thermal electrons [22]. These methods are advantageous over collisionally activated dissociation for detecting unstable PTMs (e.g. O-GlcNAc and phosphorylation), because peptide backbone fragmentation is effectively independent of the amino acid sequence [23][24][25]. To date, more than 350 types of PTMs have been experimentally discovered in vivo [26]. Common PTMs include phosphorylation, ubiquitination, succinylation, acetylation, sumoylation, glycosylation, and so on. In addition, pupylation, the modification of lysine residues with a prokaryotic ubiquitin-like protein (Pup), is another PTM found in bacteria.
In general, the experimental analysis of PTMs often requires labor-intensive sample preparation and hazardous or expensive chemical reagents. For instance, radioactive kinase assays require a separation step from nonradioactive ATP and generate radioactive waste [8]. Since most radioactive substances have a short half-life, fresh reagents are frequently required for identifying PTMs. Moreover, the substrate concentration used in the assay is often much higher than the expected substrate concentration [27]. In summary, identifying PTMs by experimental techniques is laborious, time-consuming and usually expensive. As an alternative, computational methods are more efficient for identifying novel PTM substrates on a large scale.
The last several decades have seen remarkable progress in the identification and functional analysis of PTMs in proteins. PTMs play a vital role in protein folding, protein function, and interactions with other proteins [28,29]. Given these important biological functions, it is essential to analyze and understand the function of PTMs. In contrast to traditional experimental methods, computational analysis of PTMs has become an attractive alternative approach owing to its accuracy, cost-effectiveness and speed. Computational tools can narrow down the number of potential candidates and rapidly generate useful information to guide further experiments. The prediction of protein PTMs has thus become an important research topic in the field of protein bioinformatics. Although great progress has been made by employing various feature representations and statistical learning approaches with numerous feature vectors, the problem is still far from solved. An overview of protein PTM site prediction is presented in Figure 1.

Feature representation
Feature representation is one of the most important steps in predicting PTM sites. Suitable features allow the prediction model to predict protein PTMs precisely. In general, these features describe the sequences and local structures around the protein functional sites. Ideally, the features clearly distinguish PTM sites from unmodified residues. In the real world, however, the features of protein functional sites can also occur at non-functional sites of proteins. In the prediction of PTM sites, this problem is particularly prominent because of sequence diversity. For instance, some motifs are very weak and some cannot be detected without sequence evolutionary information [30][31][32][33][34][35].
Figure 1: A brief flowchart template for the computational prediction of PTM sites. First, the dataset is collected from published databases. Second, the collected datasets are preprocessed to produce proper positive and negative samples. Third, the resulting encoded feature vectors are independently fed into statistical learning models to produce independent prediction scores. Finally, optimum performance scores are calculated using cross-validation and parameter optimization, and a confident cutoff is chosen to identify PTM sites.
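As an illustration of how sequence-based samples are typically prepared, the following minimal Python sketch cuts a fixed-length window around every candidate residue, labels it as positive or negative, and converts it into a binary feature vector. The window length, the padding symbol X, the function names and the toy sequence are assumptions made for this example and are not taken from any particular predictor.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed padding symbol for positions beyond the sequence termini


def extract_windows(sequence, target_residue, flank, known_sites):
    """Return (window, label) pairs for every occurrence of target_residue.

    known_sites holds 0-based positions of verified modification sites.
    These become positive samples (label 1); every other occurrence of
    the target residue becomes a negative sample (label 0).
    """
    padded = PAD * flank + sequence + PAD * flank
    samples = []
    for pos, residue in enumerate(sequence):
        if residue != target_residue:
            continue
        # In the padded string the residue sits at pos + flank, so this
        # slice is centred on it and has length 2 * flank + 1.
        window = padded[pos:pos + 2 * flank + 1]
        samples.append((window, 1 if pos in known_sites else 0))
    return samples


def one_hot(window):
    """Binary-encode a window: 20 bits per position, all zeros for padding."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


# Toy usage: an invented five-residue sequence with one annotated lysine.
samples = extract_windows("MKTAK", "K", flank=2, known_sites={1})
features = [one_hot(window) for window, _ in samples]
labels = [label for _, label in samples]
```

Binary encoding is only one of many possible sequence encodings. It is used here because it needs no external data.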
Numerous protein structure features have also been proposed. For example, one can examine the solvent accessibility of the amino acids at PTM sites. Examining the residue interactions that uphold the stability of protein structures (including electrostatic interactions, hydrophobic interactions, van der Waals interactions, disulfide bonds, hydrogen bonds, and so on) may also be helpful [43,44]. Additionally, structural flexibility information for the residues, such as the root mean square deviation and the B-factor, is sometimes useful. Last but not least, some residue contact network parameters (betweenness, closeness, degree, and clustering coefficient) have been used as features for protein PTM prediction [45]. Note that in real-world prediction tasks, researchers usually use an integrated feature set to identify protein PTM sites.
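To make the residue contact network parameters concrete, the sketch below computes the degree, clustering coefficient and closeness of a residue in a small contact graph. The graph is an invented toy example stored as a dictionary of neighbour sets, and the residue names are hypothetical. Betweenness is omitted for brevity, and how contacts are derived from a structure (for example, with a distance cutoff) is left outside the sketch.

```python
from collections import deque


def degree(graph, node):
    """Number of residues in direct contact with `node`."""
    return len(graph[node])


def clustering_coefficient(graph, node):
    """Fraction of neighbour pairs of `node` that are also in contact."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))


def closeness(graph, node):
    """Reachable node count divided by the sum of shortest-path lengths."""
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search over the unweighted contact graph
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


# Invented contact network; keys are hypothetical residue identifiers.
contacts = {
    "K12": {"E15", "L16", "D40"},
    "E15": {"K12", "L16"},
    "L16": {"K12", "E15"},
    "D40": {"K12", "R41"},
    "R41": {"D40"},
}

site_features = [
    degree(contacts, "K12"),
    clustering_coefficient(contacts, "K12"),
    closeness(contacts, "K12"),
]
```

Such values can be appended to the sequence-derived features of the same residue to form an integrated feature vector.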

Statistical algorithms for PTM site prediction
After the appropriate features have been determined, the next task is to use a suitable machine learning algorithm to integrate these features for the prediction of protein PTM sites. An appropriate prediction algorithm improves the accuracy of the prediction. The prediction algorithms for PTM sites can be classified into two categories, i.e. statistical probabilistic algorithms and machine learning algorithms. In the following, we discuss some of these algorithms.

Naïve Bayes
Naïve Bayes is a predictive algorithm based on Bayes' theorem from statistical learning theory. Its advantage is that it is simple and easy to compute. In Bayes' theorem, the posterior probability of a random event is the conditional probability assigned after the relevant evidence has been taken into account. Naïve Bayes assumes that, given the class, the value of one attribute is independent of the values of the other attributes. This assumption often does not hold in practice, so its accuracy can be lower than that of other predictive models, such as linear regression and logistic regression models. Nevertheless, many biologists regard Naïve Bayes as an important algorithm for analyzing biological data [46], although these methods are affected by outliers and do not model noise [47]. Naïve Bayes algorithms are widely used in bioinformatics research [48][49][50].
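A minimal sketch of a Naïve Bayes classifier over peptide windows is given below. It treats each window position as a categorical feature, multiplies the per-position conditional probabilities under the independence assumption, and compares the resulting posterior scores of the classes. The Laplace smoothing constant, the padding symbol X, the function names and the toy windows are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus an assumed pad symbol X


def train(windows, labels, alpha=1.0):
    """Count residues per class and position; alpha is the Laplace smoothing."""
    class_counts = Counter(labels)
    counts = {c: defaultdict(Counter) for c in class_counts}
    for window, label in zip(windows, labels):
        for pos, residue in enumerate(window):
            counts[label][pos][residue] += 1
    return {"class_counts": class_counts, "counts": counts,
            "alpha": alpha, "total": len(labels)}


def log_posteriors(model, window):
    """Unnormalised log posterior of each class for one window."""
    scores = {}
    for c, n_c in model["class_counts"].items():
        score = math.log(n_c / model["total"])  # log prior of the class
        for pos, residue in enumerate(window):
            # Conditional independence: each position contributes one factor.
            numerator = model["counts"][c][pos][residue] + model["alpha"]
            denominator = n_c + model["alpha"] * len(ALPHABET)
            score += math.log(numerator / denominator)
        scores[c] = score
    return scores


def predict(model, window):
    scores = log_posteriors(model, window)
    return max(scores, key=scores.get)


# Invented three-residue windows centred on lysine; 1 = modified, 0 = not.
toy_windows = ["AKS", "GKS", "AKT", "LKD", "VKE", "LKE"]
toy_labels = [1, 1, 1, 0, 0, 0]
model = train(toy_windows, toy_labels)
```

Working with log probabilities avoids numerical underflow when windows are long, and the smoothing term keeps a residue that was never seen at a position from forcing the posterior to zero.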
For more than 20 years, machine learning has been widely used as an interdisciplinary field. It is related to probability theory, approximation theory, convex analysis, complexity theory and other disciplines. Machine learning algorithms are widely used to make predictions on unknown data (http://en.wikipedia.org/wiki/Machine_learning). Since they are highly automated, accurate and predictive, they have a very wide range of applications, such as data mining, computer vision, natural language processing and biometrics. Although machine learning models can achieve very good accuracy, they do not help researchers understand the underlying mechanisms and their biological significance [51]. Thus, machine learning methods are sometimes criticized as "black box" learning.
As early as 1959, Arthur Lee Samuel defined machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed" [52]. For the prediction of protein PTMs, some common machine learning algorithms are widely used, such as the support vector machine (SVM), artificial neural network (ANN) and random forest (RF). Below, we discuss these three common machine learning algorithms.

Random forest
RF is an ensemble supervised learning algorithm [53]. It integrates multiple classifiers to improve prediction performance [54][55][56]. It is well known that, for a supervised classifier, the classification error is partly attributable to the different distributions of the training samples and the unknown samples (Figure 2A). In contrast, if the individual classifiers are trained on sets that introduce a certain degree of perturbation to the training set, the ensemble can produce a more general prediction and reduce the bias of a single classifier [57][58]. Several advantages of RF are as follows: 1) Given reliable individual features, RF can produce a highly accurate classifier. 2) It can handle a large number of input variables. 3) It can estimate the importance of each variable for the classification. 4) During the construction of the forest, it produces an internal unbiased estimate of the generalization error. 5) It includes a good method for estimating missing data and can maintain accuracy even when a large part of the data is missing. 6) For unbalanced classification problems, it can balance the errors. 7) It can calculate the proximity between cases, which is useful for detecting outliers in data mining and for data visualization. 8) It can be extended to unlabeled datasets, for example for unsupervised clustering. 9) Its learning process is fast compared with other algorithms. RF has a high predictive accuracy and a good tolerance of outliers and noise. It has been widely used in bioinformatics research [59][60][61][62][63].
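The ensemble idea can be sketched in a few lines of Python. The example below is a deliberately simplified stand-in for a random forest: each "tree" is a one-level decision stump trained on a bootstrap resample with a random subset of features, and the forest predicts by majority vote. Real implementations grow much deeper trees. The toy data, function names and parameter defaults are assumptions made for the illustration.

```python
import math
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Best single-threshold split over the candidate features."""
    best = {"feature": None, "label": majority(y)}  # fallback: constant vote
    best_error = sum(1 for label in y if label != best["label"])
    for f in features:
        values = sorted({row[f] for row in X})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            error = (sum(lab != left_label for lab in left)
                     + sum(lab != right_label for lab in right))
            if error < best_error:
                best_error = error
                best = {"feature": f, "threshold": threshold,
                        "left": left_label, "right": right_label}
    return best


def predict_stump(stump, row):
    if stump["feature"] is None:
        return stump["label"]
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    n_candidates = max(1, int(math.sqrt(n_features)))
    forest = []
    for _ in range(n_trees):
        # Bootstrap resampling perturbs the training set for each tree.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        features = rng.sample(range(n_features), n_candidates)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    return majority([predict_stump(stump, row) for stump in forest])


# Invented two-feature toy data; class 1 has larger values on both features.
TRAIN_X = [[0.1, 1.0], [0.2, 1.5], [0.3, 0.5], [0.7, 3.0], [0.8, 3.5], [0.9, 2.5]]
TRAIN_Y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(TRAIN_X, TRAIN_Y)
```

Because every stump sees a different resample and a different feature subset, the errors of individual stumps tend to cancel in the vote, which is the effect the perturbation argument above describes.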

Support vector machine
SVM was first proposed in 1995 by Cortes and Vapnik [64] to solve nonlinear and multidimensional pattern recognition problems. It uses a nonlinear transformation to map low-dimensional data into a high-dimensional feature space, where it looks for a hyperplane that maximizes the margin between the two types of data (Figure 2B). In other words, given a suitable kernel function, SVM can solve high-dimensional classification problems. In SVM theory, different kernel functions lead to different algorithms. The most commonly used kernel is the radial basis function (RBF) kernel. To date, many SVM software packages have been developed, such as SVM-Light (http://svmlight.joachims.org/), LIBSVM [65], Gist [66], Weka [67], and so on.
Recently, SVM has been widely used in various topics of bioinformatics research, including protein PTM prediction, protein residue contact prediction [70], protein fold recognition [71], protein secondary structure prediction [72], etc.
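The role of the nonlinear transformation can be illustrated with a small Python sketch. Four XOR-labelled points cannot be separated by a line in two dimensions, but after an explicit mapping into three dimensions a separating hyperplane exists. The mapping, the hyperplane coefficients and the value of gamma are hand-picked assumptions for this illustration. The hyperplane is not the output of an SVM optimizer, and in practice a package such as LIBSVM would be used to find the maximum-margin solution.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def feature_map(x):
    """Assumed explicit map from 2-D input to a 3-D feature space."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


# Hand-picked hyperplane w . phi(x) + b = 0 in the 3-D feature space.
W = (1.0, 1.0, -2.0)
B = -0.5


def decision_value(x):
    return sum(w * f for w, f in zip(W, feature_map(x))) + B


def classify(x):
    return 1 if decision_value(x) > 0 else -1


# XOR-labelled points: not linearly separable in the original 2-D space.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
LABELS = [-1, 1, 1, -1]

predictions = [classify(p) for p in POINTS]

# Geometric margin of this particular hyperplane: distance of the closest point.
margin = min(abs(decision_value(p)) for p in POINTS) / math.sqrt(sum(w * w for w in W))
```

A kernel such as the RBF kernel plays the same role implicitly: it returns the inner product of two points in a high-dimensional feature space without constructing the mapped vectors.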

Artificial neural network
Neural network research stagnated after the publication of machine learning research by Marvin Minsky and Seymour Papert in 1969 [73]. They identified two key issues with the computational machines that processed neural networks. The first was that single-layer neural networks were incapable of processing the exclusive-or circuit. The second was that computers lacked the processing power to effectively handle the long run time required by large neural networks.
In machine learning and cognitive science, ANNs are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, in particular the brain). They are used to estimate or approximate functions of the input samples. An ANN is generally presented as a system of interconnected "neurons" that exchange messages with each other. The connections carry numeric weights that can be tuned based on experience. In general, an ANN consists of three layers: an input layer, a hidden layer and an output layer (Figure 2C). The ANN learns the underlying relationship between the independent variables and the dependent variable, which it then uses to compute outputs for new input data [74].
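The three-layer structure can be sketched as a forward pass in Python. The network below has two inputs, two hidden neurons and one output neuron with sigmoid activations. Its weights are hand-picked assumptions chosen so that the network computes the exclusive-or function, which a single-layer network cannot represent. In a real application the weights would be learned from training data, and that step is omitted here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hand-picked weights (assumed, not trained).
# Hidden neuron 1 behaves like OR, hidden neuron 2 like AND.
HIDDEN_W = [[20.0, 20.0], [20.0, 20.0]]
HIDDEN_B = [-10.0, -30.0]
# The output neuron fires when OR is on and AND is off, i.e. exclusive-or.
OUTPUT_W = [[20.0, -20.0]]
OUTPUT_B = [-10.0]


def forward(inputs):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]


outputs = [forward([a, b]) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Rounding the four outputs gives 0, 1, 1, 0. Tuning the weights "based on experience" corresponds to adjusting HIDDEN_W, HIDDEN_B, OUTPUT_W and OUTPUT_B so that the outputs on training samples move towards their known labels.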
In the field of bioinformatics, ANNs also have a wide range of applications, such as protein functional site prediction [75][76][77], protein secondary structure prediction [78,79] and tertiary structure prediction [80]. Common implementations of ANN software are FANN (http://leenissen.dk/Fann/WP/) and SNNS (http://www.ra.cs.uni-tuebingen.de/SNNS/). In summary, machine learning is a subfield of computer science and statistics that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. For PTM prediction, the machine learning algorithm is an essential step in building the model and testing its performance.
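Testing model performance usually combines cross-validation with a score cutoff, as outlined in Figure 1. The following Python sketch shows one possible way to split samples into k folds and to evaluate prediction scores at a given cutoff. The interleaved fold assignment, the choice of metrics (sensitivity, specificity, accuracy and the Matthews correlation coefficient) and the toy scores are assumptions made for the example and are not prescribed by the text.

```python
import math


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; fold i tests every k-th sample from i."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def evaluate(scores, labels, cutoff):
    """Confusion-matrix metrics when scores >= cutoff are called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(labels) if labels else 0.0,
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Invented prediction scores for three positive and three negative samples.
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
metrics = evaluate(toy_scores, toy_labels, cutoff=0.5)
folds = list(k_fold_indices(len(toy_labels), k=3))
```

In a full workflow, a model would be trained on each training fold and scored on the matching test fold, and the cutoff would be chosen from the pooled cross-validation scores.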

Conclusions
The development and application of PTM site prediction is emerging as a promising field in protein bioinformatics research. High-throughput omics-based techniques have been widely used in the study of PTMs. To better understand the function of PTMs, more accurate computational analysis is required. Combining experimental and computational schemes will certainly accelerate the growth of our knowledge through the analysis of PTM datasets.

Figure 2: Machine learning algorithms for PTM prediction. (A) Random forest classifier. Here, red denotes training samples, green denotes testing samples, and each node represents a tree. (B) SVM classifier. (C) Artificial neural network. Here, each circular node represents an artificial neuron, and an arrow represents a connection from the output of one neuron to the input of another.