A Novel Approach for Predicting Type IV Secretion System (T4SS) Effector Proteins
Esna Ashari Esfahani, Zhila
Type IV secretion systems (T4SS) are multi-protein complexes that some bacterial pathogens use to secrete effector proteins directly into host cells. Upon entry, these effectors manipulate the host cell's machinery, resulting in serious illness or even death of the host. Identification of T4SS effectors is therefore an important subject in bioinformatics. In recent years, multiple scoring and machine learning-based methods have been suggested for effector prediction. These approaches have used different sets of features, and their predictions have been inconsistent. In this work, an optimal set of features for predicting T4SS effector proteins is first presented, obtained using a multi-level feature selection approach. Next, we focus on the best way to use these optimal features by designing several machine learning classifiers, comparing the results with those of others, and obtaining de novo results. We chose the pathogen Legionella pneumophila strain Philadelphia-1, a cause of Legionnaires’ disease, for these experiments. An important contribution was the development of a new comprehensive and user-friendly software package called OPT4e, for Optimal-features Predictor for T4SS Effector proteins. OPT4e was used to predict candidate effectors from the proteomes of Anaplasma phagocytophilum strains HZ and HGE-1. A. phagocytophilum is the causative agent of anaplasmosis in humans and is currently an important pathogen for research because few of its effectors are known. OPT4e predicted 48 and 46 candidates for strains HZ and HGE-1, respectively, with 16 and 18 most probable effectors. Two new algorithms, t-Tree and t-Forest, were developed as variations of the decision tree and random forest algorithms. The new algorithms improved on the originals by accounting for the relevance of features to the output classes, in addition to the standard Gini index, when creating split points. Known T4SS effector proteins for L. pneumophila were used to test the new algorithms as well as several variations of them. Finally, a method for predicting protein secondary structure using the DAgger algorithm was considered as a possible improvement to OPT4e, and parallelization of PSSM protein profile calculations is presented, tested, and discussed.
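
As a rough illustration of the split criterion mentioned above, the following minimal Python sketch computes the standard Gini index for a candidate split and then combines the resulting impurity reduction with a feature relevance value. The Gini computations are standard. The function `adjusted_score`, its `relevance` input (assumed here to be pre-scaled to the range 0 to 1), and the mixing weight `alpha` are hypothetical and only show one way such a combination could look; they are not the actual t-Tree formulation.

```python
from collections import Counter


def gini(labels):
    """Standard Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting one feature at a threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


def adjusted_score(values, labels, threshold, relevance, alpha=0.5):
    """Hypothetical split score: Gini gain blended with feature relevance.

    ASSUMPTION: `relevance` is a number in [0, 1] describing how strongly the
    feature relates to the output classes, and `alpha` sets its weight.
    This blend is illustrative only, not the method defined in the thesis.
    """
    gain = gini(labels) - split_gini(values, labels, threshold)
    return (1.0 - alpha) * gain + alpha * relevance


if __name__ == "__main__":
    feature = [1, 2, 3, 4]   # toy feature values
    classes = [0, 0, 1, 1]   # toy labels: 1 = effector, 0 = non-effector
    print(gini(classes))
    print(split_gini(feature, classes, threshold=2))
    print(adjusted_score(feature, classes, threshold=2, relevance=1.0))
```

In this sketch a higher `adjusted_score` marks a more attractive split, so two candidate splits with equal Gini gain would be ranked by the relevance of their features.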