Graph Theoretical Analysis of Protein Networks

 Graph Theoretical Analysis of Protein Networks

                                                             

Part of the  Research Science Initiative Chennai, 2018, IIT-M

Co written with S. V. Shanmugha Balan

Under the Guidance of:-

Dr. Karthik Raman, Department of Biotechnology, IIT Madras

Aarthi Ravikrishnan


Introduction

Ever since Alexander Fleming discovered Penicillin, humans have used antibiotics to fight pathogenic invasions.  Broad-spectrum antibiotics tend to work by disrupting enzymes involved in biosynthesis, nucleic acid metabolism, protein synthesis and membrane structure. In recent years, abuse of antibiotics has led to antibiotic resistant ‘superbugs’. This is due to the process of divergent evolution.  Therefore, we must make the transition to narrow spectrum antibiotics, which are more personalised and efficient in their action.

In this report, we use graph theory, to analyze protein networks and to identify important proteins. Then we remove these proteins and analyse the changes to the network. We interpret our results and discuss the practical removal of these proteins. We also list a few ways in which we can improve our analysis.

Graph Theory

Graphs are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of nodes (points) which are connected by edges (lines).  Graphs can be directed, like (gene protein interactions) or undirected (like protein-protein reaction). The edges in a graph may also be weighted with numerical values. For example, in a graph representing the road networks in a city, wider roads will have a higher weight than narrow road.


Centrality Measures

Centrality measures are the identifiers of the most central nodes of a graph. This project used the following centrality measures. 

  1. Degree Centrality: - The degree of a node refers to the number of edges connected to it. Here, degree centrality of a node is the fraction of the total nodes of a graph to which a node is connected to.

  2. Betweenness Centrality: - The betweenness of a node is the ratio of the number of shortest paths between any pair two other nodes through that node to the total number of shortest paths between them.

  1. Average Shortest Path Length: - The mean of the length of the shortest path between each pair of nodes.

  2. Degree Distribution: -   A plot of P(k) versus k. Where P(k) is the probability that a node has a degree k.

Protein Networks

Proteins are bio macromolecules which are vital to the survival of living organisms. They perform various functions such as catalysing reactions, transporting molecules, DNA replication and providing cellular structure. Recent research shows that many processes are not carried out by proteins themselves but rather because of interaction between various proteins.

These highly specific interactions are regulated by electrostatic forces and can occur both in and out of cells. They can be represented using graphs. Research shows that these graphs have very specific properties.

Most protein networks tend to follow a power law degree distribution. This means that P(k) is proportional to k where is called the degree exponent. In graph theory language, highly connected nodes are called hubs.  A scale free distribution tends to have a few hubs to which other nodes are connected to.

Most biological networks are scale free and have 2< γ<3 with increase in leading to decrease in the degree centrality of a hub. The log-log plot of the degree distribution is a straight line. Another property is that clustering coefficient which characterises the tendency of the node to form groups or clusters also follows a power law. The average clustering coefficient of a node with degree k is proportional to k-1. The log-log plot is a constant.

Creating a Model

The Lethality Centrality Rule states that higher the connectivity of a node, higher the probability of it being essential to the organism. On the basis of this assumption the following model was made:- 

Methodology

  • The protein-protein interaction data was obtained for E. coli from the STRING database. The NetworkX package of Python was used.

  • For computational purposes, only the nodes connected to edges with weights greater than 800 were considered.

  • The degree centralities and betweenness centralities of all the nodes and average shortest path length for the graph was computed.

  • The nodes with the ten highest values of those centralities were removed separately. The changes to the shortest path length were computed.

  • As the graph produced was disconnected, the largest subgraph was used.






Results

Applications

  • This method will allow more specific targeting of pathogens and will help combat against the increasing antibiotic resistance. Once the most essential protein is identified :-

  • 1.) A molecular model of the enzyme involved in the activation of that protein can be made. Induced Fit hypothesis states that enzymes have active sites that are highly dynamic and alter the structure of the substrate to produce product molecules. Inhibitors are compounds that either bind to the active site and prevent substrate from binding or side sites (allosteric sites) and conformationally change the enzyme. Using the model and by calculating electrostatic potentials, various inhibitory compounds can be made and activation of the protein will be stopped.

  • 2.) The gene responsible for the protein can be silenced using methods like mRNA interference. A RNA molecule can be used to inhibit selected mRNA molecules this affects the expression of genes and the production of proteins.


Limitations and Scope for Improvement

  • The most connected node may not actually be part of an essential pathway for the organism.

  • This model does not take into account the highly dynamical nature of protein-protein interactions. 

  • Only two centrality measures (betweenness and degree) were considered here. Others like Eigen-Value Centrality and Closeness were not considered.

  • Different centrality measures were considered separately. They should be combined together using regression analysis into an overall centrality measure and that should be used for finding the most influential node.

  • Detailed protein-protein interaction data is needed.

  • This method serves no purpose in the analysis of fast evolving pathogens or viruses.

Conclusions

Through Graph Theoretical Analysis it’s possible to identify the most important proteins in organism. These can then be specifically targeted. This approach can be used to counter the increasing menace of drug resistant pathogens. The authors of this report will like to conclude with a word of caution. The prevalence of superbugs is due to the abuse of antibiotics. Similarly any artificial method of immunity will lead to problems. There must be a focus on more natural and organic ways of immunity, with an increased reliance on the body’s inbuilt defence capability. Such methods must be a last resort approach only.

Acknowledgements

  • Dr Karthik Raman

  • Miss Aarthi Ravikrishnan


References

  • Wikipedia

  • Embryo Project Encyclopaedia

  • NetworkX Documentation

  • String Database

  • Network Biology-Understanding the Cell's Functional Organization (Paper by Albert-László Barabási & Zoltán N. Oltvai)

  • Trueman’s Elementary Biology Volume I & II

  • A protocol for generating a high-quality genome-scale metabolic reconstruction (Paper by Ines Thiele & Bernard Ø Palsson)


  


 



Comments