GUD-VE visualization tool for physicochemical properties of proteins

Abstract
The physicochemical properties of a protein's primary sequence help determine both its structure and its biological function. Sequence analysis of proteins and nucleic acids is one of the most fundamental elements of bioinformatics; without it, it is impossible to gain deeper insight into molecular and biochemical mechanisms. Computational methods such as bioinformatics tools therefore help experts and novices alike resolve problems in protein analysis. The work proposed here is a graphical user interface (GUI) based tool for predicting and visualizing these properties. It was developed in Jupyter Notebook with the tkinter package, which allows the program to be created and run on a local host, where it is accessed by the programmer.
• When queried with a protein sequence, it predicts the physicochemical parameters of the peptide.
• Users can view the results anonymously or have them sent to a user-specified email address, and can compare the biophysical properties of one protein with those of another using their amino acid (AA) sequences.
The aim of this paper is to meet the needs of experimentalists, not just dedicated bioinformaticians, for predicting biophysical properties and comparing them between proteins. The code has been uploaded to GitHub (an online code repository) in private mode.

Introduction
The function of a protein is defined by its unique structure, which is determined by its amino acid (AA) composition and fold. Proteins with similar AA sequences tend to exhibit similar structure, function, and bioactivity. However, point mutations, deletions, or alterations in gene expression result in mutant proteins and can lead to many incurable diseases: proteins translated from these genes have a different AA sequence than the native protein, which alters the protein's structure, and different structures lead to different functions or to deregulation of biological processes. Deregulated biomolecules such as nucleic acids and proteins (more specifically enzymes, cytokines, antibodies, aptamers, etc.) that are crucial to the pathogenesis of complex diseases such as cancer, diabetes, Alzheimer's disease, and Parkinson's disease are known as biomarkers. The identification of biomarkers that are indicative of a specific biological state is a major research topic in biomedical applications of computational biology [1][2][3]. Thus, a protein's family, superfamily, subcellular localization, and physicochemical properties give researchers insight for further work such as peptide binding site prediction, drug development, functional annotation, and the study of disordered proteins.
Proteins play diverse roles in the body, acting as catalysts, cofactors, hormones, transporters, and structural components, signaling in biochemical reactions, maintaining physiology, and more. It is therefore necessary to examine protein sequence and structure [2]. Studying protein structure in vivo and in vitro requires technical expertise, resources, and considerable expense. Many innovative and user-friendly research tools and software packages have been developed to reduce these complications and to narrow the gap between experimental results and information technology based analysis [4][5][6]. The first protein sequence analysis tool, COMPROTEIN, was reported in 1965. The first version of the 'Atlas' by Dayhoff and Robert S. Ledley included 65 protein sequences, the majority of which were interspecific variations of a few proteins. As a result, the first Atlas was a perfect data set for two researchers who theorized that protein sequences indicate a species' evolutionary history. It is now maintained as PIR-International [7, 8]. Because of the massive increase in the number of protein sequences and structures recorded in biological databases, several bioinformatics tools and systems have been developed to organize, validate, compare, and analyze this massive amount of data. Furthermore, effective algorithms that use this sequence and structural data to aid protein classification and infer the biological function of newly found proteins are being created.
The ExPASy SIB Bioinformatics Resource Portal, a catalog of bioinformatics resources, offers several protein sequence analysis tools for biophysical properties among other key resources. The online tools Compute pI/mW, ProtParam, ProtScale, and AAcompIdent are related to the functionality of our tool and are discussed here [9]. The Compute pI/mW (isoelectric point/molecular weight) tool, available at ExPASy, predicts the isoelectric point and molecular weight of proteins. It takes a UniProtKB/Swiss-Prot protein ID or a UniProt Knowledgebase accession number as input and can predict the pI/mW for many proteins at once [10]. ExPASy ProtParam calculates the physicochemical parameters of a protein. It computes both pI and mW and also predicts AA composition, atomic composition, extinction coefficient, estimated half-life, and instability index, among others. It takes as input the AA sequence of the protein or a Swiss-Prot/TrEMBL accession number or ID [10]. ProtScale computes and displays the profile of a protein using 57 amino acid scales, such as hydrophobicity and hydrophilicity scales. These tools require only the amino acid sequence to give their output; no additional information is needed [10]. AAcompIdent compares the theoretical percent AA composition of entries in the Swiss-Prot/TrEMBL database with the empirically measured percent AA composition of an unknown protein. The extent of the difference in composition between the unknown protein and a protein entry in the database is expressed as a score. This score is the sum of the squared differences between the percent of each AA in the unknown protein and in the database protein. The matched proteins are then ranked by this score, with the best matches at the top of the list and the poorest at the bottom [10].
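To make the composition score described above concrete, the following is a minimal Python sketch that ranks database entries by the sum of squared differences in percent AA composition. It follows only the description given in this paragraph and is not the actual AAcompIdent implementation; the function names and the toy database are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def percent_composition(sequence):
    """Return the percent of each standard amino acid in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def composition_score(measured, entry):
    """Sum of squared differences between two percent compositions."""
    return sum((measured[aa] - entry[aa]) ** 2 for aa in AMINO_ACIDS)


def rank_matches(measured, database):
    """Rank database entries from best (lowest score) to poorest match."""
    scored = [
        (name, composition_score(measured, percent_composition(seq)))
        for name, seq in database.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical toy database, for illustration only.
toy_database = {"entry_1": "AAGG", "entry_2": "KKKK", "entry_3": "AGAG"}
# Stand-in for an empirically measured composition (50% Ala, 50% Gly).
measured = percent_composition("GGAA")
ranking = rank_matches(measured, toy_database)
```

Entries with the same composition as the unknown protein score zero and appear first; the all-lysine entry is ranked last.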
PepDraw allows users to draw the chemical structure of an AA sequence entered in the input box of its user interface, along with peptide analysis ( http://www.tulane.edu/~biochem/WW/PepDraw/ ). Most of these tools report individually computed parameters; none of them offers an option to compare the biophysical properties computed for a query protein sequence with those of another protein. These tools also lack an interactive interface. Instances of microheterogeneity can be identified using our program. In classical biochemistry, highly concentrated solutions are required to measure the physicochemical properties of proteins, whereas our program predicts them from the AA sequence and compares them with those of another protein.
Hence, our proposed work has been created to investigate the physicochemical properties of proteins. The program computes these properties directly from the sequence and presents them through a user-friendly interface. It compares two protein sequences using a set of parameters: atomic composition, molecular weight, theoretical pI, AA composition, number of amino acids in the sequence, aliphatic index, and GRAVY. These parameters can be used in various studies such as protein structure prediction, function prediction, and mutation prediction. It also generates output in the form of graphs and pie charts, for example of the amino acid composition and atomic composition. AA sequences taken from a protein database can be used to investigate a protein's properties. The application is made up of three parts: a graphical user interface, data visualization, and e-mail. Python's tkinter module is used to create the user interface, which has three windows in all. The primary window is where the user enters the target protein sequences that need to be compared; the parameters are displayed in the other sub-windows. We used data visualization so that the user can easily understand the parameter analysis. GUD-VE also compares protein thermostability and hydrophobicity, using the aliphatic index and the grand average of hydropathicity (GRAVY), respectively. It not only displays the output but also allows it to be preserved for future use, with the option of receiving and saving the output by e-mail. This will aid in the identification of proteins based on their AA compositions.

Method details
The GUD-VE architecture not only computes the parameters individually but also compares the user-provided sequences using data visualizations such as graphs and comparison templates. The work was developed in Jupyter Notebook using the tkinter package and runs on the local host, where it is accessed only by the programmer.

Architecture design
The architecture of this work is based on three domains: Graphical User Interface (GUI), Data Visualization, and E-mail. The GUI of the application is built with the tkinter library, the data visualization with the Matplotlib library, and the e-mail feature with Python's smtplib library, all run from Jupyter Notebook. The flowchart is depicted in Fig. 1.

Architecture principle
The architecture is divided into three parts: GUI, data visualization, and e-mail. The first root window, built with the tkinter library, takes as user input two protein sequences, an e-mail address (if the reports are to be mailed), and a user-chosen path for saving the plots. Buttons labeled 'submit', 'reset', and 'compute parameters' perform the functions their names indicate. Because a user may want to calculate the parameters for a single protein only, the computed results for the two proteins open in two separate windows. The input protein sequences are used to calculate the parameters, which are elaborated in later parts of this work, and these in turn are used to make the plots and charts for data visualization. PyPlot from the Matplotlib library is used for this part, along with the PIL library for saving the created plots and charts to, and opening them from, the chosen location.
Users can also receive the report of computed parameters as a text document, with the plots attached in PNG format, by e-mail. The smtplib library is used for this, connecting to Gmail's SMTP server (smtp.gmail.com) and its port to send the e-mails. The architecture is thus named GUD-VE for GUI, Data Visualization, and E-mail, and works on user inputs of protein sequences, a recipient e-mail address, and a directory path.

Architecture prototype
This application uses data visualization and a graphical user interface to make it more user-friendly. In this section, we describe the program's procedures, separated into three main components: graphical user interface, data visualization, and e-mail.

Graphical user interface (GUI)
The application is written in the Python programming language, and its GUI is built with Python's tkinter library. Tkinter is the standard GUI library for Python and provides a powerful object-oriented interface to the Tk GUI toolkit. The application is programmed with three operational root windows: the first is the main window, and the remaining two display the parameters of the protein sequences provided by the user, namely Protein-1 and Protein-2. This is achieved with various widgets from the tkinter library. Label widgets place the required text on the window. The Button widget gives the application the command to perform a particular function. The Entry widget takes information from the user, such as the e-mail ID and path. The Text widget holds the protein sequences whose parameters need to be determined and compared. Algorithm 1, which we used to develop the GUI, is given below.

Algorithm 1
For GUI of Protein-1 and Protein-2.
1: Import the tkinter library
2: Create and place the label 'Protein-1' on the canvas of the first root window
3: Create and place a label beneath the previous label that informs the user to give manual input
4: Create and place a button below the previous label that carries the 'Compute Parameters' command
5: Create and place a button below the previous label that carries the 'RESET' command
6: Create and place a text box where the user will give the manual input

Sub-root windows
The sub-root windows for Protein-1 and Protein-2 are opened by the "COMPUTE PARAMETER" button on the main root window. As shown in Fig. 3 (A) and (B), the results can be saved to the location given by the user-provided path, sent to the e-mail address, or both, as the user chooses. The computed data, including molecular weight, number of amino acids, theoretical pI, amino acid composition, and atomic composition, are displayed in this window using labels, as shown in Fig. 3 (C) and (D).
sequence. These amino acids are ambiguous, and their atomic composition cannot be estimated. So, the button for the atomic composition plot is not visible.
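Returning to Algorithm 1, its six steps could be laid out roughly as in the sketch below. This is an illustrative assumption, not the authors' code: the function names, the grid layout, the label text, and the `normalize_sequence` helper (which strips a FASTA header and non-letter characters) are all hypothetical.

```python
def normalize_sequence(raw_text):
    """Drop FASTA header lines, whitespace and digits; upper-case the rest.

    Hypothetical helper, not described in the paper.
    """
    lines = [ln for ln in raw_text.splitlines() if not ln.startswith(">")]
    return "".join(ch for ch in "".join(lines).upper() if ch.isalpha())


def build_protein1_panel(root, on_compute):
    """Steps 2-6 of Algorithm 1; returns the text box holding the sequence."""
    # Step 1: import tkinter (done here so the helper above needs no display).
    import tkinter as tk

    # Step 2: the 'Protein-1' label.
    tk.Label(root, text="Protein-1").grid(row=0, column=0, sticky="w")
    # Step 3: a label asking for manual input.
    tk.Label(root, text="Enter the amino acid sequence below").grid(
        row=1, column=0, sticky="w")

    seq_box = tk.Text(root, height=8, width=60)

    def compute():
        # Hand the cleaned sequence to the caller-supplied callback.
        on_compute(normalize_sequence(seq_box.get("1.0", tk.END)))

    def reset():
        # Clear the text box.
        seq_box.delete("1.0", tk.END)

    # Steps 4 and 5: the two buttons.
    tk.Button(root, text="Compute Parameters", command=compute).grid(
        row=2, column=0, sticky="w")
    tk.Button(root, text="RESET", command=reset).grid(
        row=3, column=0, sticky="w")
    # Step 6: the text box for the sequence.
    seq_box.grid(row=4, column=0)
    return seq_box
```

To try it, create a root window with `tkinter.Tk()`, call `build_protein1_panel(root, print)`, and start `root.mainloop()`; the same function could be reused for a Protein-2 panel with a different label.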

Data visualization
In this proposed work, visual data representation is used alongside the numerical output of the computed parameters to make those parameters easier to understand and analyze. Python's Matplotlib library is used for this purpose. The physicochemical properties that the application calculates from the AA sequences entered in the first root window are used to make the plots and charts. The tool takes two protein sequences as input and assesses them to plot two types of graphs and one pie chart. The first is a bar plot showing the percentage composition of amino acids. The pie chart visualizes the atomic composition of the protein, and an overlapping line plot gives a head-to-head comparison of the amino acid composition of the two proteins. Additionally, the tool compares the aliphatic indexes and hydrophobic character of the two proteins, which helps determine which protein is more thermostable and which is more hydrophobic. These two parameters are also presented visually on an image template that can be viewed and saved to a user-defined location.
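The three plot types described above could be produced along the lines of the following sketch. It is a minimal illustration under stated assumptions rather than the tool's actual plotting code: the function names, figure sizes, titles, and output paths are hypothetical, and the non-interactive "Agg" backend is chosen here only so that the figures are written straight to file.

```python
from collections import Counter

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def percent_composition(sequence):
    """Percent of each standard amino acid, keyed by three-letter code."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in THREE_LETTER)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {name: 100.0 * counts[aa] / total
            for aa, name in THREE_LETTER.items()}


def _pyplot():
    """Import pyplot lazily with a file-only backend."""
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    return plt


def plot_composition_bar(sequence, out_path):
    """Bar plot of percent amino acid composition for one protein."""
    plt = _pyplot()
    comp = percent_composition(sequence)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(list(comp), list(comp.values()))
    ax.set_ylabel("Composition (%)")
    fig.savefig(out_path)
    plt.close(fig)


def plot_atomic_pie(atom_counts, out_path):
    """Pie chart of atomic composition, e.g. {'Carbon': 4, 'Hydrogen': 8}."""
    plt = _pyplot()
    fig, ax = plt.subplots()
    ax.pie(list(atom_counts.values()), labels=list(atom_counts),
           autopct="%1.1f%%")
    fig.savefig(out_path)
    plt.close(fig)


def plot_comparison_lines(seq1, seq2, out_path):
    """Overlapping line plot comparing the composition of two proteins."""
    plt = _pyplot()
    comp1, comp2 = percent_composition(seq1), percent_composition(seq2)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(list(comp1), list(comp1.values()), marker="o", label="Protein-1")
    ax.plot(list(comp2), list(comp2.values()), marker="o", label="Protein-2")
    ax.set_ylabel("Composition (%)")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
```

Keeping the composition calculation separate from the plotting functions means the same numbers can feed both the on-screen labels and the saved images.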

E-mail
In the proposed work, the parameters calculated from the provided FASTA sequence are visible in the output root window, but for future reference or user convenience they can also be mailed to a user-provided e-mail address. A report containing the computed parameters for both proteins is attached as a .TXT file. A function send() is defined for this purpose; it holds the sender's e-mail address and password and takes the recipient's e-mail address from email.get(). The smtplib library is used for this, as it is an easy-to-use and flexible library for handling mail-related tasks.
The text file containing the computed parameters is attached via set_payload() and encoded with encoders.encode_base64(). The encoded text file is then sent through Gmail's SMTP server, i.e., smtp.gmail.com on port 587. The server checks the login credentials and, if they are correct, the text file is sent, and the session is ended with quit() at the end of the block.
The smtplib library is also used to send the plots by mail. Like send(), the send_img() function holds the sender's e-mail address and password for authentication. To attach the plots, a variable path is defined that holds the location of the saved plots and is used to open the PNG images with open() in 'rb' mode. The data of these plots are then attached using msg.add_attachment(), which takes the image data, main type, and subtype as arguments; for an image these are 'image' and 'png', respectively.
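The e-mail step can be sketched with Python's standard library as follows. This is an assumption-laden illustration, not the authors' send() and send_img() functions: it uses a single `EmailMessage` with `add_attachment()` for both the report and the plots instead of `set_payload()` with `encoders.encode_base64()`, and the function names, subject line, and attachment file names are hypothetical.

```python
import smtplib
from email.message import EmailMessage


def build_report_message(sender, recipient, report_text, plots):
    """Build one e-mail with the text report and PNG plots attached.

    `plots` maps a file name to raw PNG bytes, e.g. read with
    open(path, "rb").read().
    """
    msg = EmailMessage()
    msg["Subject"] = "GUD-VE report"  # hypothetical subject line
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The computed parameters and plots are attached.")
    # Text report as a text/plain attachment.
    msg.add_attachment(report_text, subtype="plain", filename="report.txt")
    # Each plot as an image/png attachment.
    for filename, png_bytes in plots.items():
        msg.add_attachment(png_bytes, maintype="image", subtype="png",
                           filename=filename)
    return msg


def send_report(msg, sender, password, host="smtp.gmail.com", port=587):
    """Send the message over an authenticated STARTTLS session."""
    # Leaving the with-block closes the session (equivalent to quit()).
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(sender, password)
        server.send_message(msg)
```

Separating message construction from sending makes it possible to inspect the attachments without contacting a mail server; only `send_report()` needs network access and valid credentials.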

Calculations
The parameters included, along with the basis for their calculation, are described below. User-provided sequence: displays the amino acid sequence divided into blocks of 10 residues, making it easy for the user to locate the position of each amino acid in the sequence. Number of AA: displays the total number of AA present in the protein sequence, obtained as the length of the amino acid sequence. Molecular weight: displays the mass of the protein sequence on the scale in which carbon-12 has a mass of 12. It is calculated by summing the average isotopic masses of the amino acid residues in the protein and the average isotopic mass of one water molecule.
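The three quantities just described can be sketched in a few lines of Python. The function names are hypothetical, and the table of average residue masses (each amino acid minus one water, in Da) uses standard textbook values that are assumed here rather than taken from the tool's source.

```python
# Average residue masses in Da (free amino acid minus one water molecule).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # average mass of one water molecule in Da


def in_blocks_of_ten(sequence):
    """Return the sequence with a space after every 10 residues."""
    return " ".join(sequence[i:i + 10] for i in range(0, len(sequence), 10))


def number_of_residues(sequence):
    """The number of AA is simply the length of the sequence."""
    return len(sequence)


def molecular_weight(sequence):
    """Sum of average residue masses plus one water molecule.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS
```

For example, `in_blocks_of_ten("MKTAYIAKQRQISFVK")` gives `"MKTAYIAKQR QISFVK"`, and the single water added in `molecular_weight` accounts for the free N- and C-termini of the chain.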
Total number of negatively charged residues: displays the sum of the numbers of aspartic acid and glutamic acid residues. Total number of positively charged residues: displays the sum of the numbers of arginine and lysine residues [14]. Atomic composition: displays the total number of carbon (C), hydrogen (H), oxygen (O), nitrogen (N), and sulfur (S) atoms present in the sequence. The first column contains the names of the atoms, the second column their symbols, and the third column the number of each atom in the sequence.
Formula: displays the formula of the amino acid sequence using the numbers of carbon, hydrogen, oxygen, nitrogen, and sulfur atoms. Total number of atoms: displays the total number of atoms in the amino acid sequence, calculated by summing the numbers of carbon, hydrogen, oxygen, nitrogen, and sulfur atoms present in the sequence.
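A minimal sketch of the charged-residue counts, atomic composition, formula, and total atom count is given below. The function names are hypothetical; the approach assumed here sums the atoms of the free amino acids and removes one water per peptide bond. Unlike the tool as described, this sketch raises an error on ambiguous letters, and it omits elements with a zero count from the formula string.

```python
# (C, H, N, O, S) atom counts of each free amino acid.
FREE_AA_ATOMS = {
    "A": (3, 7, 1, 2, 0), "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0), "C": (3, 7, 1, 2, 1), "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0), "G": (2, 5, 1, 2, 0), "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0), "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0), "T": (4, 9, 1, 3, 0), "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}
ELEMENTS = ("C", "H", "N", "O", "S")


def charged_residues(sequence):
    """Return (negative, positive) counts: Asp + Glu and Arg + Lys."""
    negative = sequence.count("D") + sequence.count("E")
    positive = sequence.count("R") + sequence.count("K")
    return negative, positive


def atomic_composition(sequence):
    """Atom counts of the peptide as a dict keyed by element symbol."""
    if not sequence:
        raise ValueError("empty sequence")
    totals = [0, 0, 0, 0, 0]
    for aa in sequence:
        if aa not in FREE_AA_ATOMS:
            raise ValueError(f"ambiguous or unknown residue: {aa!r}")
        for i, n in enumerate(FREE_AA_ATOMS[aa]):
            totals[i] += n
    # Each peptide bond releases one water molecule (2 H and 1 O).
    bonds = len(sequence) - 1
    totals[1] -= 2 * bonds
    totals[3] -= bonds
    return dict(zip(ELEMENTS, totals))


def formula(composition):
    """Formula string such as 'C4H8N2O3'; zero counts are omitted."""
    return "".join(f"{el}{n}" for el, n in composition.items() if n)


def total_atoms(composition):
    """Sum of the C, H, N, O and S counts."""
    return sum(composition.values())
```

For the dipeptide Gly-Gly this gives the composition C4H8N2O3 and 17 atoms in total, which matches the known formula of glycylglycine.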
Aliphatic index: calculated and displayed according to the formula given in Eq. (1):
$$\text{Aliphatic index} = X(\text{Ala}) + a \cdot X(\text{Val}) + b \cdot \left[ X(\text{Ile}) + X(\text{Leu}) \right] \qquad (1)$$
where X(Ala), X(Val), X(Ile), and X(Leu) are the mole percent (100 × mole fraction) of alanine, valine, isoleucine, and leucine. The coefficients a and b are the relative volumes of the valine side chain ( a = 2.9) and of the Leu/Ile side chains ( b = 3.9) compared with the side chain of alanine. The aliphatic index is regarded as a positive factor for the thermostability of proteins [15]. GRAVY is calculated by summing the hydropathy values of all amino acids in the protein sequence and dividing by the number of residues in the sequence [16].
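Both quantities can be written directly from their definitions, as in the sketch below. The function names are hypothetical, and the hydropathy values are assumed to be those of the Kyte-Doolittle scale, which is the scale commonly used for GRAVY; the source does not list the values it uses.

```python
# Kyte-Doolittle hydropathy values (assumed scale for GRAVY).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mole_percent(sequence, residue):
    """Mole percent (100 x mole fraction) of one residue type."""
    return 100.0 * sequence.count(residue) / len(sequence)


def aliphatic_index(sequence, a=2.9, b=3.9):
    """X(Ala) + a * X(Val) + b * (X(Ile) + X(Leu))."""
    return (mole_percent(sequence, "A")
            + a * mole_percent(sequence, "V")
            + b * (mole_percent(sequence, "I") + mole_percent(sequence, "L")))


def gravy(sequence):
    """Sum of hydropathy values divided by the number of residues."""
    return sum(HYDROPATHY[aa] for aa in sequence) / len(sequence)
```

As a worked example, the four-residue sequence AVIL has 25% of each residue, so its aliphatic index is 25 + 2.9 × 25 + 3.9 × 50 = 292.5 and its GRAVY is (1.8 + 4.2 + 4.5 + 3.8) / 4 = 3.575. Comparing these two values between Protein-1 and Protein-2 is what identifies the more thermostable and the more hydrophobic protein.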

Results
This section presents the results obtained from evaluating the data entered by the user. The results are presented separately under the headings introduced in the methods section.

Graphical user interface (GUI)
The graphical layout, built with the tkinter library, has already been explained in the GUI part of the methodology (section 2.3.1). The final framework, shown in Fig. 2, includes all three sections for manual input, i.e., the e-mail ID, the Protein-1 and Protein-2 sequences, and the path. The inputs are described here in the order in which the user encounters them. First come the protein sequences. In this part, the user is asked to enter the amino acid sequence in the box below the label. The architecture is built so that it takes the protein sequence from a FASTA sequence as input. The user can either paste the sequence from a source such as the NCBI website or type it in manually. Second comes the path where the user wants to save the plots as images. Here the user copies the path on their own computer where the images that pop up during execution, after the buttons are pressed, should be saved. The path is a required input; without it, data visualization and e-mail are not possible.
Third comes the e-mail feature, where the user must enter the e-mail ID of the receiver, i.e., whoever the result should be sent to. The main root window has been described above; the sub-root windows are described next. The sub-root windows are opened by pressing the "Compute Parameters" button, which is provided separately for each protein, i.e., Protein-1 and Protein-2. The sub-root windows contain the computed parameters and the buttons for obtaining the image results, as shown in Fig. 3 (A) and (B).

Data visualization
As explained in the previous section, the amino acid composition is represented as a bar graph, with three-letter abbreviations of the amino acid names on the x-axis and the number of occurrences on the y-axis. This bar graph, shown in Fig. 4 (A), can be accessed separately for each protein from its respective sub-root window. It opens in a new pop-up window using the PIL library.
The second available visual representation is a pie chart showing the atomic composition of the protein, with the names of the atoms as labels and the percentage composition as tags. This pie chart is also available for both proteins in their respective sub-root windows. Fig. 4 (B) shows the type of pie chart that opens when the button labelled "atomic composition plot" is pressed.
In addition to these two plots for a single protein, a line graph is plotted for comparison, showing a one-to-one overlap for every amino acid present in the two proteins. This graph is plotted once the details of both proteins have been entered. The plot, shown in Fig. 5 (A), can be accessed from the second sub-root window through the button labelled "comparison plot", which opens a new window with this image.
The last button, labelled "compare parameters", opens an image template edited at run time that compares two important aspects of the proteins: thermostability, assessed using the aliphatic index, and hydrophobic nature, assessed using GRAVY, both described in the earlier section. The template shows which of the two proteins is more thermostable and which is more hydrophobic. In the example of Fig. 5 (B), Protein-1 is more thermostable and more hydrophobic than Protein-2.

Discussion
The physicochemical properties of proteins are directly related to their biological functions, especially GRAVY, thermostability, length of the AA sequence, pI, and mW [17]. Proteins belonging to different species can also be identified on the basis of the physicochemical properties of their AA sequences [11]. Furthermore, an advantage of our proposed work is that it uses computational methods that can both predict and then compare the physicochemical properties of proteins, which may help researchers reduce the cost and time spent on gaining a better understanding of those properties.
Unfortunately, only a limited number of studies so far provide reasonable prediction and analysis of the physicochemical properties of proteins. Cellular assays may be employed to quantify the effect of a mutation on function and the effect of an AA change on a protein's properties. A second framework integrates the protein's sequence data, variation information, and its effect on physicochemical properties, and interprets this information through bioinformatics-based methods. Combining these two approaches is necessary for the reliability of research and medical implementation [18][19][20].
Many aspects of proteomic research can be addressed using the physicochemical properties of proteins. Bioinformatics research such as identifying disordered protein regions, predicting peptide binding sites, and differentiating favorable from unfavorable AA residues in water-soluble and transmembrane proteins has already been carried out using the physicochemical properties of AA sequences. Disordered protein regions were identified with the help of biophysical properties because such regions are enriched in polar and charged AA and depleted in hydrophobic AA. Physicochemical properties of the AA sequence, namely hydrophobicity, volume, area, pI, and the indicator variables aliphatic, aromatic, branched, and sulfur, were used to predict the binding of short peptides to the Major Histocompatibility Complex [21][22][23]. Herein, we have proposed a tool that compares two protein sequences and computes a range of physical and chemical parameters.
Furthermore, this proposed work provides the user with a text file that includes the number of AA, molecular weight, theoretical pI, and AA composition, as well as three types of plot: amino acid composition, atomic composition, and a comparison between the two protein sequences. This allows a new interpretation of the underlying physicochemical behavior of the query protein sequences. The architecture not only computes the parameters individually but also compares the user-provided sequences using data visualizations such as graphs and comparison templates. The most salient feature of the application is that it can send the computed data to a user-provided e-mail address, which makes the application more user-friendly.
The first limitation of this proposed work is that it compares and analyzes only two amino acid sequences. Second, because the calculations of the protein parameters are implemented entirely in our own code and the BioPython module is not used, the results for some exceptional protein sequences containing ambiguous amino acids can sometimes be predicted incorrectly. Third, if the user copies a faulty amino acid sequence by mistake, the tool will not show any error message. Moreover, the program is not server-based, so its outcomes are limited to the data programmed into it. It is not supported by mainstream bioinformatics organizations, so the information is limited. We will extend this work to overcome these limitations and to broaden the range of physicochemical properties it can compute, compare, and visualize.

Ethics statements
Not Applicable.

Funding
The present study was supported by the University Grant Commission (

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.