Bioinformatics Group, Biology Division, Indian Institute of Chemical Technology, Uppal Road, Tarnaka, Hyderabad-500607, Andhra Pradesh, India
In this study, a data mining approach was used to derive decision rules for predicting average flexibility from derived sequence and structural features. Twenty-one parameters were calculated for 101 sequences of the CaMK kinase family from mouse and human, and variable importance was determined using Classification and Regression Tree (CART). Coils were found to have the maximum influence on average flexibility, while parallel beta strands were found to exert the minimum impact. Knowledge of variable importance should prove useful for building a simple predictor of flexibility from an amino acid sequence. It should also aid understanding of the phenomena underlying average flexibility and thus pave the way for rational design of therapeutics and for development of appropriate parametric weight distributions in existing molecular dynamics and protein folding algorithms.
Average flexibility, CaMK Kinase, Bioinformatics, Data mining, Classification & Regression tree (CART).
In this data-rich, information-poor world, extracting meaningful information from the flood of data is a formidable task. Though still at a nascent stage, data mining is enabling researchers to demystify biological processes. The multiplicity of functions of proteins is attributed to their structure. Given the dynamic nature of proteins, their structure-function relationship is being actively investigated. Protein flexibility constitutes a significant link between protein structure and function: the conformational changes required in biological processes are facilitated by the inherent flexibility of proteins. Because proteins are the lead players in a varied range of functions such as transport of metabolites [1,2], catalysis [3,4] and regulation of protein activity [5,6], average flexibility holds prime importance in this context. Protein flexibility may underlie anything from small changes in conformation to large-scale molecular motions. The varying degrees of flexibility exhibited by protein molecules often perplex researchers. Numerous studies have been prompted by the discovery that some highly flexible proteins have implications in pathologies such as AIDS (HIV gp41) and scrapie.
A comprehensive knowledge of the fundamental nature of average flexibility will facilitate the unraveling of structure-function relationships and aid in the development of novel therapeutics. Understanding the intricate relationship among the factors influencing protein flexibility will therefore support the rational design of such therapeutics.
The Ca2+/calmodulin-dependent kinase (CaMK) family, which is activated in response to elevation of intracellular Ca2+, includes CaMKI, CaMKII, CaMKIV and CaMK-kinases (CaMKKs). The CaMKK/CaMK cascade plays an important role in regulating Ca2+-mediated cellular responses. There is no dearth of data on the flexibility of proteins, but most studies have focused only on 3-D structure and related parameters. This study is an attempt to investigate the significance of diverse parameters influencing the average flexibility of the CaMK kinase family by means of a data mining approach.
2.1 Sequence Collection and Pre-Processing
Protein sequences of the enzymes belonging to CaMK kinases were collected in FASTA format from the NCBI’s protein database (https://www.ncbi.nlm.nih.gov) (Supplement). The collected sequences were filtered in order to exclude redundancy. From the available sequences, 101 sequences belonging to Homo sapiens (55) and Mus musculus (46) were considered for this study.
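The redundancy filter described above can be sketched as follows. The text does not state which redundancy criterion was applied, so this minimal sketch assumes the simplest one, exact sequence identity; the function names and the example records are hypothetical and are not taken from the study's dataset.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_exact_duplicates(records):
    """Keep the first record for each distinct sequence (assumed criterion)."""
    seen = set()
    unique = []
    for header, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((header, sequence))
    return unique


# Hypothetical records, for illustration only.
EXAMPLE_FASTA = """>seq_a hypothetical record
MKTAYIAK
QRQISFVK
>seq_b hypothetical duplicate of seq_a
MKTAYIAKQRQISFVK
>seq_c hypothetical record
MSLLKKPE
"""

unique_records = remove_exact_duplicates(parse_fasta(EXAMPLE_FASTA))
for header, sequence in unique_records:
    print(header, len(sequence))
```

A similarity-based filter (for example, clustering at an identity threshold) would need a dedicated tool and is outside the scope of this sketch.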
2.2 Feature Extraction
Sequence features were extracted for these sequences using ProtScale (https://expasy.org/tools/protscale.html). Twenty-one scales were calculated for all the sequences: molecular weight, number of codons, bulkiness, polarity, refractivity, recognition factors, hydrophobicity, transmembrane tendency, % buried residues, % accessible residues, average area buried, average flexibility, alpha-helix, beta-sheet, beta-turn, coil, total beta-strand, antiparallel beta-strand, parallel beta-strand, amino acid composition and relative mutability. Accession numbers, being a categorical variable of little importance for further analysis, were excluded.
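To illustrate how a per-residue amino acid scale becomes a sequence feature, the sketch below computes a sliding-window profile and a whole-sequence mean. It rests on several assumptions: the Kyte-Doolittle hydropathy values stand in for the hydrophobicity scale (the text does not name the scale used), the window size of 9 is an assumed setting, and reducing a scale to a per-sequence mean is an assumed way of obtaining one value per sequence. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_profile(sequence, scale, window=9):
    """Unweighted mean of the scale over every full window of the sequence."""
    values = [scale[residue] for residue in sequence]
    if len(values) < window:
        return []
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


def sequence_mean(sequence, scale):
    """Mean scale value over the whole sequence (assumed per-sequence summary)."""
    values = [scale[residue] for residue in sequence]
    return sum(values) / len(values)


# Hypothetical peptide, for illustration only.
peptide = "MKTAYIAKQRQISFVK"
profile = window_profile(peptide, KYTE_DOOLITTLE)
print(len(profile), round(sequence_mean(peptide, KYTE_DOOLITTLE), 3))
```

Any other per-residue scale can be substituted by passing a different dictionary with the same 20 keys.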
2.3 Data mining
CART (Classification And Regression Tree) from Salford Systems Inc, USA is a data-mining tool based on recursive binary partitioning [21]. To gain a comprehensive understanding of the influence of the different variables on average flexibility, CART was employed to determine variable importance. Twenty parameters were treated as predictor (independent) variables and average flexibility was treated as the target (dependent) variable. As the target variable is continuous, a regression model using the Least Squares (LS) method was selected. Ten-fold cross-validation and the default penalty options were used for the analysis.
CART reported basic descriptive statistics for all the parameters; these are presented in Table 1, and the corresponding frequency distributions are presented in Figure 1.
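The least-squares criterion behind recursive binary partitioning can be sketched for a single node as below. This is a minimal illustration of the general technique, not the Salford Systems implementation; the function names, feature names and data values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_ls_split(rows, target):
    """Find the binary split that most reduces the sum of squared errors.

    rows   : list of dicts mapping feature name -> numeric value
    target : list of numeric target values, aligned with rows
    Returns (feature, threshold, improvement); feature is None if no
    split improves on the parent node.
    """
    parent_sse = sse(target)
    best = (None, None, 0.0)
    for feature in rows[0]:
        distinct = sorted(set(row[feature] for row in rows))
        # Candidate thresholds lie midway between consecutive distinct values.
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2.0
            left = [y for row, y in zip(rows, target) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, target) if row[feature] > threshold]
            improvement = parent_sse - (sse(left) + sse(right))
            if improvement > best[2]:
                best = (feature, threshold, improvement)
    return best


# Hypothetical toy data: two predictors and a continuous target.
rows = [
    {"coil": 0.8, "molecular_weight": 110.0},
    {"coil": 0.9, "molecular_weight": 130.0},
    {"coil": 1.1, "molecular_weight": 112.0},
    {"coil": 1.2, "molecular_weight": 128.0},
]
target = [0.40, 0.41, 0.45, 0.46]

feature, threshold, improvement = best_ls_split(rows, target)
print(feature, threshold, improvement)
```

A full tree is grown by applying the same search recursively to the left and right subsets until a stopping rule is met; cross-validation then estimates how well each pruned tree generalizes.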
While CART highlights the optimal tree based on the lowest cross-validated relative error, the overall goal here was to obtain a tree that yields the maximum number of association rules. For simplicity, the best regression tree would have the fewest nodes, whereas for accuracy it would have the maximum possible number of nodes. The 14 trees of differing complexity and error obtained with CART according to the splitting criteria are listed in Table 2. Of these, the tree with 20 terminal nodes (Figure 2), with minimum complexity, a resubstitution relative error of 0.03218 and a cross-validated error of 0.34002 ± 0.09877, generated with the Least Squares splitting criterion, was selected for generating decision rules. The decision rules obtained using CART are summarized in Suppl. Table 1.
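The relative error quoted above can be read as the tree's squared error divided by the squared error of a root-only tree that predicts the overall mean. The sketch below computes it under that assumed definition; the function name and data values are hypothetical and do not reproduce the reported figures.

```python
def relative_error(actual, predicted):
    """Squared error of the predictions relative to predicting the mean."""
    mean = sum(actual) / len(actual)
    baseline = sum((y - mean) ** 2 for y in actual)
    residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return residual / baseline


# Hypothetical target values and the leaf means a two-leaf tree would predict.
actual = [0.40, 0.41, 0.45, 0.46]
predicted = [0.405, 0.405, 0.455, 0.455]

print(relative_error(actual, predicted))
```

Computed on the training data, this gives the resubstitution value; computed on held-out folds, it gives the cross-validated value, which is typically higher, as it is for the selected tree.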
The tree selected for deriving decision rules is shown in Figure 3 along with error rate.
To calculate a variable importance score, CART looks at the improvement measure attributable to each variable in its role as a surrogate to the primary split. The values of these improvements are summed over each node and totaled, and are then scaled relative to the best performing variable. The variable with the highest sum of improvements is scored 100, and all other variables have lower scores ranging downwards towards zero. The importance of the different variables was calculated and is summarized in Table 3.
Rules derived from CART can be interpreted as simple “If” and “Then” statements and are thus self-explanatory.
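The scoring described above amounts to summing each variable's improvements across nodes and rescaling so that the largest total equals 100. A minimal sketch follows; the function name and the improvement values are hypothetical and are not taken from Table 3.

```python
def variable_importance(node_improvements):
    """Sum per-node improvements by variable and scale the top total to 100.

    node_improvements : list of dicts, one per internal node, mapping a
                        variable name to the improvement credited to it there
    Returns a dict of scores ordered from most to least important.
    """
    totals = {}
    for node in node_improvements:
        for variable, improvement in node.items():
            totals[variable] = totals.get(variable, 0.0) + improvement
    top = max(totals.values())
    if top == 0:
        return {variable: 0.0 for variable in totals}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {variable: 100.0 * total / top for variable, total in ranked}


# Hypothetical improvements at two internal nodes.
nodes = [
    {"coil": 4.0, "alpha_helix": 1.0},
    {"coil": 1.0, "parallel_beta_strand": 0.5},
]

scores = variable_importance(nodes)
print(scores)
```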
For example, Rule 1 can be interpreted as:
Rule 1: IF “RECOGNITION FACTORS <= 81.5417” & “MOLECULAR WEIGHT <= 114.042” & “% ACCESSIBLE RESIDUES <= 5.497” & “BETA SHEET <= 1” THEN “AVERAGE FLEXIBILITY = 0.374”.
Rule 14 can be explained as:
Rule 14: IF “RECOGNITION FACTORS > 87.4445” & “% ACCESSIBLE RESIDUES > 5.8055” & “COIL <= 1” & “ALPHA HELIX > 1” & “PARALLEL BETA SHEET > 1” & “AVERAGE AREA BURIED <= 129.268” THEN “AVERAGE FLEXIBILITY = 0.444”.
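The two rules quoted above can be applied programmatically as in the following sketch. The thresholds and predicted values are copied from Rule 1 and Rule 14; the data structure, the function name and the example feature values are hypothetical.

```python
import operator

OPERATORS = {"<=": operator.le, ">": operator.gt}

# Thresholds and predictions transcribed from Rule 1 and Rule 14 in the text.
RULES = {
    "Rule 1": {
        "conditions": [
            ("RECOGNITION FACTORS", "<=", 81.5417),
            ("MOLECULAR WEIGHT", "<=", 114.042),
            ("% ACCESSIBLE RESIDUES", "<=", 5.497),
            ("BETA SHEET", "<=", 1),
        ],
        "average_flexibility": 0.374,
    },
    "Rule 14": {
        "conditions": [
            ("RECOGNITION FACTORS", ">", 87.4445),
            ("% ACCESSIBLE RESIDUES", ">", 5.8055),
            ("COIL", "<=", 1),
            ("ALPHA HELIX", ">", 1),
            ("PARALLEL BETA SHEET", ">", 1),
            ("AVERAGE AREA BURIED", "<=", 129.268),
        ],
        "average_flexibility": 0.444,
    },
}


def apply_rules(features, rules=RULES):
    """Return (rule name, predicted average flexibility) or None if no rule fires."""
    for name, rule in rules.items():
        if all(
            OPERATORS[op](features[variable], threshold)
            for variable, op, threshold in rule["conditions"]
        ):
            return name, rule["average_flexibility"]
    return None


# Hypothetical feature values, for illustration only.
sample = {
    "RECOGNITION FACTORS": 90.0,
    "MOLECULAR WEIGHT": 120.0,
    "% ACCESSIBLE RESIDUES": 6.0,
    "BETA SHEET": 1.02,
    "COIL": 0.98,
    "ALPHA HELIX": 1.05,
    "PARALLEL BETA SHEET": 1.02,
    "AVERAGE AREA BURIED": 128.0,
}

print(apply_rules(sample))
```

Only two of the rules are encoded here, so a sequence that satisfies neither returns None rather than a prediction.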
Many biological processes require changes in the conformation of proteins and are therefore influenced by the flexibility of the particular protein. This very property allows a spectrum of interactions, such as those between enzyme and substrate or inhibitor in catalysis and between hormone and receptor in biological systems. Thus, average flexibility, an inherent property of protein molecules, is correlated with function [22-26]. The discovery that some flexible proteins have implications in pathological conditions has fuelled studies of the average flexibility of proteins. The complexity of such studies is often bewildering, given the enormous amount of data available.
Data mining approaches based on decision trees have been successfully exploited to elucidate the importance of features affecting important biological processes. Decision tree based methods are a simple and effective means of sifting complex biological data for hidden patterns and information. More and more biological studies are harnessing CART methodologies owing to their simplicity and ability to handle missing values. The CART methodology is being increasingly employed in ecological studies, diagnosis decision processes, epidemiology, microbiology, histology, genetics and biochemical analysis.
CaMKK is known to control the activity of both CaMKI and CaMKIV. CaMK kinase, a part of the CaMK cascade, has been characterized in many organisms. Although various studies have focused on the kinetics of CaMK kinases, the impact of the various factors influencing their average flexibility is yet to be explored. Our analysis revealed that, among the sequence features of CaMK kinases, recognition factors, amino acid composition, molecular weight, percent accessible residues, bulkiness, hydrophobicity, refractivity, polarity, transmembrane tendency, relative mutability, average area buried and number of codons influence average flexibility, in descending order of importance. Among secondary structures, coil, antiparallel beta strand, alpha helix, beta sheet, beta turn, total beta strand and parallel beta strand influence average flexibility, in decreasing order of importance.
Given the recent enthusiasm for including protein flexibility in docking algorithms, it will be interesting to gain insight into the features influencing the flexibility of proteins.
It is anticipated that an extensive knowledge of protein flexibility and of the various parameters contributing to it will be important for rational drug design. Such an approach will lead to a better understanding of the underlying biological phenomena and aid in enzyme engineering processes.
The authors thank Dr. J.S. Yadav, Director, IICT, for his continuous support and encouragement. The authors are also thankful to the anonymous reviewers for their critical suggestions for the improvement of the manuscript.