Elsevier

Journal of Theoretical Biology

Volume 336, 7 November 2013, Pages 11-17

Prediction of pupylation sites using the composition of k-spaced amino acid pairs

https://doi.org/10.1016/j.jtbi.2013.07.009

Highlights

  • The composition of k-spaced amino acid pairs is utilized to predict and analyze pupylation sites.

  • The proposed method is more accurate than existing methods.

  • Top ranked k-spaced amino acid pairs are analyzed to provide insights into the substrate specificity of pupylation.

  • Both web server and standalone software are available for prediction.

  • Probability scores are available for prioritizing pupylation sites.

Abstract

Pupylation is an important post-translational modification in prokaryotes. A prokaryotic ubiquitin-like protein (Pup) is attached to proteins as a signal for selective degradation by the proteasome. Several proteomics methods have been developed for the identification of pupylated proteins and pupylation sites. However, the pupylation sites of many experimentally identified pupylated proteins are still unknown. Sequence-based prediction methods can help to accelerate the identification of pupylation sites and to gain insights into the substrate specificity and regulatory functions of pupylation. A novel tool, iPUP, is developed for the computational identification of pupylation sites. The composition of k-spaced amino acid pairs is utilized to represent a peptide sequence. Top ranked k-spaced amino acid pairs are subsequently selected by using a sequential backward feature elimination algorithm. Trained with support vector machines on the composition of the 150 top ranked k-spaced amino acid pairs, iPUP achieves an area under the receiver operating characteristic curve of 0.83 in 10-fold cross-validation. The importance analysis of k-spaced amino acid pairs shows that pairs containing a terminal space are useful for discriminating pupylation sites from non-pupylation sites. A sequence analysis confirms that lysines close to the C-terminus tend to be pupylated. In contrast, lysines close to the N-terminus are less likely to be pupylated. The iPUP tool can predict pupylation sites with probability scores for prioritizing promising pupylation sites. Both the online server and the standalone software of iPUP are freely available for academic use at http://cwtung.kmu.edu.tw/ipup.

Introduction

Selective protein degradation is essential for regulating protein functions in all living organisms. In eukaryotes, ubiquitylation plays important roles in the regulation of DNA repair and transcription, the control of signal transduction, and endocytosis and sorting (Herrmann et al., 2007, Welchman et al., 2005). The conjugation of ubiquitin to substrate proteins leads to selective degradation by the proteasome (Herrmann et al., 2007, Welchman et al., 2005). Pupylation was recently identified as a functional analogue of ubiquitylation in prokaryotes. A prokaryotic ubiquitin-like protein (Pup) is linked to specific lysine residues of target proteins through isopeptide bonds, marking the targets for proteasomal degradation (Burns et al., 2009, Pearce et al., 2008). Pup is an intrinsically disordered protein (Chen et al., 2009, Liao et al., 2009) of 64 amino acids. Although Pup and ubiquitin are functional analogues, their sequences share only a diglycine motif (GGQ) at the C-terminus (Knipfer and Shrader, 1997, Pearce et al., 2008). In contrast to ubiquitylation, which requires three enzymes, pupylation involves only two. First, the C-terminal glutamine of Pup is deamidated to glutamate by deamidase of Pup (Dop) (Striebel et al., 2009). Subsequently, proteasome accessory factor A (PafA) attaches the deamidated Pup to specific lysine residues of substrate proteins (Guth et al., 2011).

To better understand the functions of pupylation, several large-scale proteomics methods were developed to identify pupylated proteins and corresponding pupylation sites (Cerda-Maira et al., 2011, Festa et al., 2010, Poulsen et al., 2010, Watrous et al., 2010). As the number of identified pupylated proteins and sites grows, a structured and searchable database PupDB has been developed for management and analysis of pupylation sites by integrating information of pupylated proteins and sites, protein structures and functional annotations (Tung, 2012). However, there are still many pupylation sites to be discovered. It is desirable to develop computational methods to accelerate the identification of pupylated proteins and pupylation sites.

Unlike sumoylation sites, which have a well-defined motif of [ILVMF]KxE (Melchior, 2000), the sites of most post-translational modifications, such as phosphorylation and ubiquitylation, have no clearly defined motif, possibly because of the vast number of enzymes that attach modifiers to substrates (Li et al., 2008). As with ubiquitylation sites, no clear motif has been found in pupylation sites by a two-sample logo analysis of the PupDB dataset (Tung, 2012). This motivates the development of composition-based prediction methods for pupylation sites.

To develop a prediction method for identifying pupylation sites, pupylation sites are first extracted from PupDB and divided into a training dataset and an independent test dataset. Subsequently, the composition of k-spaced amino acid pairs (CKSAAP) is utilized to encode pupylation sites. The original CKSAAP has been successfully applied to the prediction of protein flexible/rigid regions (Chen et al., 2007a), protein crystallization (Chen et al., 2007b), mucin-type O-glycosylation sites (Chen et al., 2008), palmitoylation sites (Wang et al., 2009), ubiquitylation sites (Chen et al., 2011) and phosphorylation sites (Zhao et al., 2012), and to the identification of weakly conserved sequence motifs (Dong et al., 2013). By introducing an additional symbol for a terminal space, CKSAAP can also be utilized to analyze k-spaced amino acid pairs that contain a terminal space.
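The encoding described above can be illustrated with a minimal Python sketch. It is not the iPUP implementation: the flank length, the maximum spacing k_max, the "-" terminal symbol and the function names lysine_window and cksaap are all assumptions chosen for illustration, since the values used by iPUP are not stated here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TERMINAL = "-"  # assumed symbol for positions beyond a protein terminus
ALPHABET = AMINO_ACIDS + TERMINAL


def lysine_window(protein, site, flank=10):
    """Return the peptide centered on protein[site], padded with TERMINAL.

    `flank` (residues kept on each side) is an illustrative assumption.
    """
    if protein[site] != "K":
        raise ValueError("site must point at a lysine (K)")
    padded = TERMINAL * flank + protein + TERMINAL * flank
    # Index `site` in the protein sits at `site + flank` in the padded string.
    return padded[site : site + 2 * flank + 1]


def cksaap(peptide, k_max=4):
    """Map (k, pair) to the fraction of k-spaced positions holding that pair.

    Two residues are k-spaced when exactly k residues lie between them.
    `k_max` is an illustrative assumption.
    """
    features = {}
    for k in range(k_max + 1):
        n_pairs = len(peptide) - k - 1
        counts = {a + b: 0 for a, b in product(ALPHABET, repeat=2)}
        for i in range(n_pairs):
            pair = peptide[i] + peptide[i + k + 1]
            if pair in counts:  # pairs with non-standard residues are not counted
                counts[pair] += 1
        for pair, count in counts.items():
            features[(k, pair)] = count / n_pairs
    return features
```

With a 21-symbol alphabet this yields 441 features per spacing. A lysine close to the N-terminus produces windows that begin with terminal symbols, so pairs such as ("-", "K") become non-zero features.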

Feature selection is a critical step for eliminating unrelated features and improving prediction performance. A sequential backward feature elimination algorithm is proposed to iteratively remove the k-spaced amino acid pairs with the lowest ranks obtained from a chi-square test. Finally, the top 150 k-spaced amino acid pairs are selected to construct the prediction method iPUP, which is based on a support vector machine (SVM) classifier. Compared to the only existing predictor GPS-PUP (Liu et al., 2011), iPUP shows better prediction performance in both 10-fold cross-validation (10-CV) on the training dataset and the independent test on the test dataset, with 8% and 5% improvements in the area under the receiver operating characteristic curve (AUC), respectively. An analysis of the top 150 k-spaced amino acid pairs shows that lysines near the protein C-terminus tend to be pupylated. In contrast, lysines near the protein N-terminus are less likely to be pupylated.
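The selection procedure can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the published algorithm: features are binarized to present/absent before the chi-square test, the ranking is computed once rather than after every removal, and the classifier is abstracted behind a caller-supplied evaluate function (for iPUP this would presumably be the cross-validated AUC of an SVM). The names chi_square, rank_features and backward_elimination are hypothetical.

```python
def chi_square(present, labels):
    """Chi-square statistic of the 2x2 table: feature present/absent vs. class."""
    n = len(labels)
    a = sum(1 for p, y in zip(present, labels) if p and y)      # present, positive
    b = sum(1 for p, y in zip(present, labels) if p and not y)  # present, negative
    c = sum(1 for p, y in zip(present, labels) if not p and y)  # absent, positive
    d = n - a - b - c                                           # absent, negative
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def rank_features(rows, labels):
    """Return feature names sorted from most to least class-associated.

    `rows` is a list of {feature_name: value} dicts, one per peptide.
    """
    names = rows[0].keys()
    scores = {f: chi_square([row[f] > 0 for row in rows], labels) for f in names}
    return sorted(scores, key=scores.get, reverse=True)


def backward_elimination(ranked, evaluate, step=1, min_features=1):
    """Drop the lowest-ranked features step by step; return the best subset seen.

    `evaluate(subset)` must return a score where higher is better.
    """
    current = list(ranked)
    best_subset, best_score = list(current), evaluate(current)
    while len(current) - step >= min_features:
        current = current[:-step]
        score = evaluate(current)
        if score >= best_score:  # ties favor the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score
```

Passing the classifier in as a callback keeps the sketch free of third-party dependencies; any scoring routine that accepts a list of feature names can be plugged in.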

Previous proteomics studies identified many pupylated proteins whose pupylation sites are still unknown. To facilitate the identification of their pupylation sites, a total of 1116 experimentally identified pupylated proteins without known pupylation sites are first extracted from PupDB. iPUP is subsequently applied to predict pupylation sites for these proteins. Using a high-specificity threshold, iPUP predicted pupylation sites for 828 of the 1116 (74%) proteins. The average number of predicted pupylation sites for each pupylated protein is 2.59. The prediction method iPUP has been implemented as a web server that predicts pupylation sites with probability scores for prioritizing promising pupylation sites. A standalone version of iPUP has also been implemented for large-scale prediction over a large number of proteins. It is a cross-platform program running on the Java Virtual Machine (JVM). Both the web server and the software are freely accessible and downloadable at http://cwtung.kmu.edu.tw/ipup.
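The thresholding step can be sketched as below, assuming that each lysine receives a probability score and that the cutoff is calibrated on the scores of known non-pupylated peptides. The target specificity, the data layout and the function names are hypothetical; the actual threshold used by iPUP is not stated here.

```python
import math


def specificity(negative_scores, threshold):
    """Fraction of known negatives scoring below the threshold."""
    return sum(s < threshold for s in negative_scores) / len(negative_scores)


def threshold_for_specificity(negative_scores, target=0.95):
    """Smallest observed cutoff that reaches the target specificity.

    A site is called positive when its score is >= the returned cutoff.
    Returns math.inf if no finite observed score satisfies the target.
    """
    candidates = sorted(set(negative_scores)) + [math.inf]
    for t in candidates:
        if specificity(negative_scores, t) >= target:
            return t


def summarize(protein_scores, threshold):
    """Return predicted sites per protein and the fraction of proteins covered.

    `protein_scores` maps protein id -> {lysine position: probability score}.
    """
    predicted = {
        pid: sorted(pos for pos, s in sites.items() if s >= threshold)
        for pid, sites in protein_scores.items()
    }
    covered = sum(1 for sites in predicted.values() if sites)
    return predicted, covered / len(protein_scores)
```

Applied to a set of proteins, the second return value of summarize corresponds to the kind of coverage figure reported above (828 of 1116 proteins with at least one predicted site).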

Section snippets

Datasets

Datasets of pupylated proteins and pupylation sites identified by large-scale proteomics experiments are extracted from PupDB, a collection of pupylated proteins and pupylation sites (Tung, 2012). There are 76, 51 and 55 pupylated proteins with known pupylation sites in datasets of Mycobacterium smegmatis, Mycobacterium tuberculosis and Escherichia coli, respectively. Because pupylation occurs on lysine residues, positive and negative datasets for machine learning are represented as peptide

Development of iPUP

The development of computational methods for the identification of pupylation sites from protein sequences is desirable to accelerate the costly and labor-intensive experimental process. To develop a computational method for identifying pupylation sites, 162 proteins with 183 pupylated and 2258 non-pupylated peptides are extracted from PupDB as the training dataset for building classifiers. The encoding scheme of the composition of k-spaced amino acid pairs (CKSAAP) is utilized to encode

Conclusion

Pupylation is recently discovered as an important post-translational modification in prokaryotes. It functions as an important signal for selective protein degradation by proteasome. The identification of pupylated proteins and pupylation sites can provide better understanding of the substrate specificity and regulatory functions of pupylation. To efficiently identify pupylated proteins and pupylation sites, the development of computational methods for predicting and prioritizing pupylation

Acknowledgments

CWT would like to thank the National Science Council (NSC 101-2311-B-037-001-MY2) of Taiwan and Kaohsiung Medical University Research Foundation (KMU-Q110015 and KMU-Q102012) for financially supporting this research. CWT thanks the anonymous reviewers for their constructive suggestions.

References (27)

  • J. Herrmann et al. Ubiquitin and ubiquitin-like proteins in protein regulation. Circulation Research (2007)

  • Y. Huang et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics (2010)

  • N. Knipfer et al. Inactivation of the 20S proteasome in Mycobacterium smegmatis. Molecular Microbiology (1997)