Prediction of pupylation sites using the composition of k-spaced amino acid pairs

doi:10.1016/j.jtbi.2013.07.009

Journal of Theoretical Biology

Volume 336, 7 November 2013, Pages 11-17

Highlights

• The composition of k-spaced amino acid pairs is utilized to predict and analyze pupylation sites.
• The proposed method is more accurate than existing methods.
• Top ranked k-spaced amino acid pairs are analyzed to provide insights into the substrate specificity of pupylation.
• Both web server and standalone software are available for prediction.
• Probability scores are available for prioritizing pupylation sites.

Abstract

Pupylation is an important post-translational modification in prokaryotes. A prokaryotic ubiquitin-like protein (Pup) is attached to proteins as a signal for selective degradation by the proteasome. Several proteomics methods have been developed for the identification of pupylated proteins and pupylation sites. However, the pupylation sites of many experimentally identified pupylated proteins are still unknown. The development of sequence-based prediction methods can help to accelerate the identification of pupylation sites and to gain insights into the substrate specificity and regulatory functions of pupylation. A novel tool, iPUP, is developed for the computational identification of pupylation sites. The composition of k-spaced amino acid pairs is utilized to represent a peptide sequence. Top-ranked k-spaced amino acid pairs are subsequently selected by using a sequential backward feature elimination algorithm. iPUP, trained with support vector machines on the composition of the 150 top-ranked k-spaced amino acid pairs, achieves an area under the receiver operating characteristic curve of 0.83 in 10-fold cross-validation. The importance analysis of k-spaced amino acid pairs shows that terminal space-containing pairs are useful for discriminating pupylation sites from non-pupylation sites. A sequence analysis confirms that lysines close to the C-terminus tend to be pupylated. In contrast, lysines close to the N-terminus are less likely to be pupylated. The iPUP tool predicts pupylation sites with probability scores for prioritizing promising pupylation sites. Both the online server and the standalone software of iPUP are freely available for academic use at http://cwtung.kmu.edu.tw/ipup.

Introduction

Selective protein degradation is essential for regulating protein functions in all living organisms. In eukaryotes, ubiquitylation plays important roles in the regulation of DNA repair and transcription and the control of signal transduction, and is implicated in endocytosis and sorting (Herrmann et al., 2007, Welchman et al., 2005). The conjugation of ubiquitins to substrate proteins leads to selective degradation by the proteasome (Herrmann et al., 2007, Welchman et al., 2005). Pupylation was recently identified as a functional analogue of ubiquitylation in prokaryotes. A prokaryotic ubiquitin-like protein (Pup) is linked to specific lysine residues of target proteins by forming isopeptide bonds, which marks the targets for proteasomal degradation (Burns et al., 2009, Pearce et al., 2008). Pup is an intrinsically disordered protein (Chen et al., 2009, Liao et al., 2009) with 64 amino acids. Although Pup and ubiquitin are functional analogues, their sequences share only a diglycine motif (GGQ) at the C-terminus (Knipfer and Shrader, 1997, Pearce et al., 2008). In contrast to ubiquitylation, which requires three enzymes, only two enzymes are involved in pupylation. First, the C-terminal glutamine of Pup is deamidated to glutamate by deamidase of Pup (Dop) (Striebel et al., 2009). Subsequently, proteasome accessory factor A (PafA) attaches the deamidated Pup to specific lysine residues of substrate proteins (Guth et al., 2011).

To better understand the functions of pupylation, several large-scale proteomics methods were developed to identify pupylated proteins and the corresponding pupylation sites (Cerda-Maira et al., 2011, Festa et al., 2010, Poulsen et al., 2010, Watrous et al., 2010). As the number of identified pupylated proteins and sites grows, a structured and searchable database, PupDB, has been developed for the management and analysis of pupylation sites by integrating information on pupylated proteins and sites, protein structures and functional annotations (Tung, 2012). However, many pupylation sites remain to be discovered. Computational methods are therefore desirable to accelerate the identification of pupylated proteins and pupylation sites.

Unlike sumoylation sites, which have a well-defined motif of [ILVMF]KxE (Melchior, 2000), most post-translational modifications have no clearly defined motif. This might be due to the vast number of enzymes that attach modifiers to substrates, as in phosphorylation and ubiquitylation (Li et al., 2008). Similar to ubiquitylation sites, no clear motif has been found in pupylation sites using a two-sample logo analysis based on the dataset of PupDB (Tung, 2012). This motivates the development of composition-based prediction methods for pupylation sites.
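For contrast, a motif as well defined as [ILVMF]KxE can be located with a plain pattern match, which is exactly what is not available for pupylation sites. The following is a minimal sketch only: the function name and the toy sequence are hypothetical and are not taken from the paper or from any real protein.

```python
import re

# Sumoylation consensus quoted in the text: [ILVMF]KxE, where x is any residue.
# The lookahead makes the scan report overlapping occurrences as well.
SUMO_MOTIF = re.compile(r"(?=([ILVMF]K.E))")


def find_motif_lysines(sequence):
    """Return the 1-based positions of lysines that sit inside the motif."""
    # m.start() is the 0-based index of the hydrophobic residue; the lysine
    # is the next residue, so its 1-based position is m.start() + 2.
    return [m.start() + 2 for m in SUMO_MOTIF.finditer(sequence)]


# Hypothetical toy sequence used only to exercise the pattern.
toy_sequence = "MAVKTEGGLKPEAAKQ"
motif_lysines = find_motif_lysines(toy_sequence)
print(motif_lysines)  # [4, 10]
```

The third lysine in the toy sequence is not reported because it lacks the flanking residues required by the motif. When no such consensus exists, as for pupylation, a composition-based encoding is one way to capture weaker sequence preferences.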

To develop a prediction method for identifying pupylation sites, pupylation sites are first extracted from PupDB and divided into a training dataset and an independent test dataset. Subsequently, a composition of k-spaced amino acid pairs (CKSAAP) is utilized to encode pupylation sites. The original CKSAAP has been successfully applied to the prediction of protein flexible/rigid regions (Chen et al., 2007a), protein crystallization (Chen et al., 2007b), mucin-type O-glycosylation sites (Chen et al., 2008), palmitoylation sites (Wang et al., 2009), ubiquitylation sites (Chen et al., 2011) and phosphorylation sites (Zhao et al., 2012), and to the identification of weakly conserved sequence motifs (Dong et al., 2013). By introducing an additional symbol for a terminal space, CKSAAP can be utilized to analyze k-spaced amino acid pairs containing a terminal space.
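The encoding described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the iPUP implementation: the terminal-space symbol "X", the window half-width `flank`, the maximum spacing `k_max`, and the normalization by the number of pairs are all assumed here, since the text shown does not specify them.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
GAP = "X"  # assumed symbol for a terminal space (position beyond a protein end)
ALPHABET = AMINO_ACIDS + GAP


def extract_window(sequence, lysine_index, flank):
    """Return the peptide of length 2*flank+1 centred on a lysine (0-based index),
    padding with GAP wherever the window runs past either terminus."""
    if sequence[lysine_index] != "K":
        raise ValueError("centre residue must be a lysine")
    return "".join(
        sequence[i] if 0 <= i < len(sequence) else GAP
        for i in range(lysine_index - flank, lysine_index + flank + 1)
    )


def cksaap(peptide, k_max):
    """Encode a peptide as the composition of k-spaced pairs for k = 0..k_max.
    Each feature is the count of residue pair (a, b) separated by k positions,
    divided by the number of k-spaced pairs in the peptide.
    Assumes the peptide contains only symbols from ALPHABET."""
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    features = {}
    for k in range(k_max + 1):
        n_pairs = len(peptide) - k - 1
        counts = dict.fromkeys(pairs, 0)
        for i in range(n_pairs):
            counts[peptide[i] + peptide[i + k + 1]] += 1
        for pair in pairs:
            features[f"{pair}_k{k}"] = counts[pair] / n_pairs if n_pairs > 0 else 0.0
    return features


# Hypothetical toy protein: the lysine at index 1 is close to the N-terminus,
# so the window is padded on the left with two terminal-space symbols.
window = extract_window("MKAAG", 1, flank=3)
features = cksaap(window, k_max=2)
print(window)             # XXMKAAG
print(features["XK_k1"])  # 0.2
```

With 21 symbols and three spacings this toy setting yields 21 x 21 x 3 = 1323 features. Pairs that contain the assumed "X" symbol are the terminal space-containing pairs, which lets the encoding carry information about how close a lysine lies to a protein terminus.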

Feature selection is a critical step for eliminating unrelated features and improving prediction performance. A sequential backward feature elimination algorithm is proposed to iteratively remove the k-spaced amino acid pairs with the lowest ranks obtained from a chi-square test. Finally, the top 150 k-spaced amino acid pairs are selected to construct the prediction method iPUP, which is based on a support vector machine (SVM) classifier. Compared to the only existing predictor GPS-PUP (Liu et al., 2011), iPUP shows better prediction performance in both 10-fold cross-validation (10-CV) on the training dataset and the independent test on the test dataset, with 8% and 5% improvements in the area under the receiver operating characteristic curve (AUC), respectively. An analysis of the top 150 k-spaced amino acid pairs shows that lysines near the protein C-terminus tend to be pupylated. In contrast, lysines near the protein N-terminus are less likely to be pupylated.
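A rough sketch of this selection loop is given below. It is not the published algorithm: the chi-square formulation (summed feature values per class against the expected share), the step size, and the `evaluate` callback are assumptions, and the toy scoring function merely stands in for the cross-validated SVM AUC used in the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum identity; ties count as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def chi_square_scores(X, y):
    """Chi-square statistic of each non-negative feature against a binary label.
    Observed = summed feature value per class; expected = class fraction * total."""
    n = len(y)
    class_frac = {c: sum(1 for v in y if v == c) / n for c in (0, 1)}
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        stat = 0.0
        for c in (0, 1):
            observed = sum(row[j] for row, v in zip(X, y) if v == c)
            expected = class_frac[c] * total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
        scores.append(stat)
    return scores


def backward_elimination(X, y, evaluate, step=1, min_features=1):
    """Repeatedly drop the `step` lowest-ranked features and remember the subset
    with the best performance. `evaluate(X, y, indices)` returns a score where
    higher is better."""
    chi2 = chi_square_scores(X, y)
    kept = sorted(range(len(chi2)), key=lambda j: chi2[j], reverse=True)  # best first
    best_perf, best_subset = evaluate(X, y, kept), list(kept)
    while len(kept) - step >= min_features:
        kept = kept[:-step]  # remove the lowest-ranked features
        perf = evaluate(X, y, kept)
        if perf >= best_perf:  # on ties, prefer the smaller subset
            best_perf, best_subset = perf, list(kept)
    return best_subset, best_perf


def toy_evaluate(X, y, indices):
    # Stand-in for cross-validated SVM AUC: score = sum of the kept features.
    return auc(y, [sum(row[j] for j in indices) for row in X])


# Hypothetical toy data: feature 0 is informative, 1 is misleading, 2 is constant.
X = [[9, 1, 5], [8, 6, 5], [7, 2, 5], [2, 7, 5], [1, 3, 5], [3, 8, 5]]
y = [1, 1, 1, 0, 0, 0]
subset, perf = backward_elimination(X, y, toy_evaluate)
print(subset, perf)  # [0] 1.0
```

Because the ranking is computed once and only the evaluation is repeated, the cost is dominated by the `evaluate` calls. Removing several features per iteration through `step` is one way to keep that cost manageable when the initial feature set is large.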

Previous proteomics studies identified many pupylated proteins whose pupylation sites are still unknown. To facilitate the identification of pupylation sites in these proteins, a total of 1116 experimentally identified pupylated proteins without known pupylation sites are first extracted from PupDB. iPUP is subsequently applied to predict pupylation sites for these proteins. Based on the threshold of high specificity, iPUP successfully identified 828 out of the 1116 proteins (74%). The average number of predicted pupylation sites for each pupylated protein is 2.59. The prediction method iPUP has been implemented as a web server. It predicts pupylation sites with probability scores for prioritizing promising pupylation sites. A standalone version of the iPUP software has also been implemented for the large-scale prediction of pupylation sites in a large number of proteins. iPUP is a cross-platform program running on the Java Virtual Machine (JVM). Both the web server and the software are freely accessible and downloadable at http://cwtung.kmu.edu.tw/ipup.
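The way probability scores can be used to prioritize sites and to summarize coverage can be sketched as follows. The data layout, the function name, the protein identifiers, the threshold value and the choice of averaging over proteins with at least one predicted site are all assumptions made for illustration; they do not describe the iPUP output format.

```python
def prioritize_sites(predictions, threshold):
    """predictions maps protein id -> {lysine position: probability score}.
    Returns, for each protein with at least one site scoring at or above the
    threshold, its predicted sites sorted from highest to lowest probability."""
    ranked = {}
    for protein, sites in predictions.items():
        hits = [(pos, p) for pos, p in sites.items() if p >= threshold]
        if hits:
            ranked[protein] = sorted(hits, key=lambda t: t[1], reverse=True)
    return ranked


# Hypothetical scores for three proteins; the threshold of 0.6 is arbitrary.
toy_predictions = {
    "protA": {12: 0.91, 47: 0.35, 88: 0.78},
    "protB": {5: 0.20, 60: 0.41},
    "protC": {101: 0.66},
}
ranked = prioritize_sites(toy_predictions, threshold=0.6)

# Fraction of proteins with at least one predicted site, and the mean number
# of predicted sites among those proteins.
coverage = len(ranked) / len(toy_predictions)
mean_sites = sum(len(sites) for sites in ranked.values()) / len(ranked)

print(ranked["protA"])  # [(12, 0.91), (88, 0.78)]
print(mean_sites)       # 1.5

# The coverage reported in the text follows the same arithmetic: 828 / 1116.
print(round(828 / 1116 * 100))  # 74
```

Raising the threshold trades coverage for specificity: fewer proteins receive a predicted site, but the sites that remain carry higher scores and are the natural first candidates for experimental validation.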

Section snippets

Datasets

Datasets of pupylated proteins and pupylation sites identified by large-scale proteomics experiments are extracted from PupDB, a collection of pupylated proteins and pupylation sites (Tung, 2012). There are 76, 51 and 55 pupylated proteins with known pupylation sites in datasets of Mycobacterium smegmatis, Mycobacterium tuberculosis and Escherichia coli, respectively. Because pupylation occurs on lysine residues, positive and negative datasets for machine learning are represented as peptide

Development of iPUP

The development of computational methods for the identification of pupylation sites from protein sequences is desirable to accelerate the costly and labor-intensive experimental process. To develop a computational method for identifying pupylation sites, 162 proteins with 183 pupylated and 2258 non-pupylated peptides are extracted from PupDB as the training dataset for building classifiers. The encoding scheme of the composition of k-spaced amino acid pairs (CKSAAP) is utilized to encode

Conclusion

Pupylation was recently discovered as an important post-translational modification in prokaryotes. It functions as an important signal for selective protein degradation by the proteasome. The identification of pupylated proteins and pupylation sites can provide a better understanding of the substrate specificity and regulatory functions of pupylation. To efficiently identify pupylated proteins and pupylation sites, the development of computational methods for predicting and prioritizing pupylation

Acknowledgments

CWT would like to thank the National Science Council (NSC 101-2311-B-037-001-MY2) of Taiwan and Kaohsiung Medical University Research Foundation (KMU-Q110015 and KMU-Q102012) for financially supporting this research. CWT thanks the anonymous reviewers for their constructive suggestions.

References (27)

K.E. Burns et al., Proteasomal protein degradation in Mycobacteria is dependent upon a prokaryotic ubiquitin-like protein, Journal of Biological Chemistry (2009)
K. Chen et al., Prediction of protein crystallization using collocation of amino acid pairs, Biochemical and Biophysical Research Communications (2007)
X. Chen et al., Prokaryotic ubiquitin-like protein pup is intrinsically disordered, Journal of Molecular Biology (2009)
Y.Z. Chen et al., Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs, BMC Bioinformatics (2008)
E. Guth et al., Mycobacterial ubiquitin-like protein ligase PafA follows a two-step reaction pathway with a phosphorylated pup intermediate, Journal of Biological Chemistry (2011)
F.A. Cerda-Maira et al., Reconstitution of the Mycobacterium tuberculosis pupylation pathway in Escherichia coli, EMBO Reports (2011)
K. Chen et al., Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs, BMC Structural Biology (2007)
Z. Chen et al., Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs, PLoS One (2011)
X. Dong et al., Using weakly conserved motifs hidden in secretion signals to identify type-III effectors from bacterial pathogen genomes, PLoS One (2013)
R.A. Festa et al., Prokaryotic ubiquitin-like protein (Pup) proteome of Mycobacterium tuberculosis, PLoS One (2010)
J. Herrmann et al., Ubiquitin and ubiquitin-like proteins in protein regulation, Circulation Research (2007)
Y. Huang et al., CD-HIT Suite: a web server for clustering and comparing biological sequences, Bioinformatics (2010)
N. Knipfer et al., Inactivation of the 20S proteasome in Mycobacterium smegmatis, Molecular Microbiology (1997)
