PyGenePlexus

PyGenePlexus is a Python package for running the [GenePlexus] model.

PyGenePlexus enables researchers to predict novel genes similar to their genes of interest based on their patterns of connectivity in genome-scale molecular interaction networks.

Overview of PyGenePlexus

Given a list of input genes and a gene set collection (GSC) used to select negative examples, the package trains a logistic regression model using one of three network-derived feature types (adjacency, influence, or embedding) and generates the following outputs:

  1. Genome-wide prediction of how functionally similar each gene is to the input gene list. The model is evaluated with k-fold cross validation. By default, 3-fold cross validation is performed when at least 15 input genes are supplied. These parameters can be changed when accessing the Python class. PyGenePlexus does not enforce a minimum or maximum number of genes, but note that the model was evaluated on gene sets ranging from 5 to 500 genes. See fit_and_predict().

  2. (Optional) Interpretability of the model is provided by comparing the model trained on the user gene set to models pretrained on thousands of known gene sets from [GO] biological processes and [DisGeNet] diseases. See make_sim_dfs().

  3. (Optional) Interpretability of the top predicted genes is provided by returning their network connectivity. See make_small_edgelist().
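The cross-validation rule in item 1 can be illustrated with a minimal, standard-library sketch. This is not the PyGenePlexus implementation: the function name `kfold_splits` and its parameters are hypothetical, and the only details taken from the text above are the defaults (3 folds, a 15-gene threshold).

```python
import random


def kfold_splits(genes, k=3, min_genes=15, seed=0):
    """Return (train, test) gene lists for k-fold cross validation.

    Hypothetical helper for illustration only. Returns an empty list
    when fewer than `min_genes` genes are supplied, mirroring the idea
    that evaluation is skipped for small gene sets.
    """
    if len(genes) < min_genes:
        return []
    shuffled = list(genes)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled genes into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [g for j, fold in enumerate(folds) if j != i for g in fold]
        splits.append((train, test))
    return splits


genes = [f"gene{i}" for i in range(17)]
for train, test in kfold_splits(genes):
    print(len(train), len(test))
```

Each gene appears in exactly one test fold, so every input gene is scored once by a model that did not see it during training.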

Quick start

PyGenePlexus comes with an easy-to-use command line interface (CLI) that runs the full GenePlexus pipeline on an input gene list. To get started, install via pip and run a quick example as follows.

pip install geneplexus
geneplexus -i my_gene_list.txt --output_dir my_result

Note that you need to supply the my_gene_list.txt file, a line-separated text file of genes (NCBI Entrez IDs, gene symbols, or Ensembl IDs are accepted). An example can be found on the GitHub page under example/input_genes.txt. More info can be found in PyGenePlexus CLI.
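As a small illustration of the input format described above, the following sketch writes a gene list with one identifier per line and reads it back. The helper names `write_gene_list` and `read_gene_list` are hypothetical and are not part of PyGenePlexus; dropping blank and repeated entries is an assumption made here for tidiness, not a documented requirement.

```python
import tempfile
from pathlib import Path


def write_gene_list(genes, path):
    """Write one gene identifier per line, dropping blanks and repeats."""
    cleaned = []
    for gene in genes:
        gene = gene.strip()
        if gene and gene not in cleaned:
            cleaned.append(gene)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def read_gene_list(path):
    """Read the file back as a list of non-empty, stripped lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demo in a temporary directory; point `path` at my_gene_list.txt for real use.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_gene_list.txt"
    written = write_gene_list(["ARL6", "BBS1", " BBS1 ", "", "BBS10"], path)
    round_trip = read_gene_list(path)
```

The resulting file can then be passed to the CLI through the -i option shown in the command above.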

Warning

All files needed for a given selection of parameters (network, feature type, and gene set collection) will be downloaded automatically and saved under ~/.data/geneplexus. Users can also specify where the data are saved using the --output_dir argument. The example provided will download files that occupy ~300MB of space.
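To see how much disk space the downloaded files take, a short standard-library sketch like the one below can help. The helper `dir_size_mb` is hypothetical (not part of PyGenePlexus), and it assumes the default location ~/.data/geneplexus mentioned above; it reports 0.0 if that directory does not exist yet.

```python
import os
from pathlib import Path


def dir_size_mb(path):
    """Total size of regular files under `path`, in MB (10**6 bytes)."""
    root = Path(os.path.expanduser(str(path)))
    if not root.is_dir():
        return 0.0
    total_bytes = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total_bytes / 1e6


# Default data location taken from the warning above.
print(f"{dir_size_mb('~/.data/geneplexus'):.1f} MB")
```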

Using the API

The following is a quick example of generating predictions from an input gene list. More info can be found in PyGenePlexus API.

>>> from geneplexus import GenePlexus
>>> input_genes = ["ARL6", "BBS1", "BBS10", "BBS12", "BBS2", "BBS4",
...                "BBS5", "BBS7", "BBS9", "CCDC28B", "CEP290", "KIF7",
...                "MKKS", "MKS1", "TRIM32", "TTC8", "WDPCP"]
>>> gp = GenePlexus(net_type="STRING", features="Embedding", gsc="DisGeNet",
...                 input_genes=input_genes, auto_download=True, log_level="INFO")
>>> df_probs = gp.fit_and_predict()[1]
>>> df_probs.iloc[:10]
    Entrez  Symbol                                             Name  Probability Known/Novel Class-Label  Rank
0     8100   IFT88                      intraflagellar transport 88     0.995984       Novel           U     1
1      585    BBS4                          Bardet-Biedl syndrome 4     0.992909       Known           P     2
2   261734   NPHP4                                   nephrocystin 4     0.990705       Novel           U     3
3    91147  TMEM67                         transmembrane protein 67     0.986072       Novel           U     4
4     9657   IQCB1                           IQ motif containing B1     0.983366       Novel           U     5
5      582    BBS1                          Bardet-Biedl syndrome 1     0.979287       Known           P     6
6   200894  ARL13B          ADP ribosylation factor like GTPase 13B     0.977565       Novel           U     7
7     8481    OFD1  OFD1 centriole and centriolar satellite protein     0.974288       Novel           U     8
8    80184  CEP290                          centrosomal protein 290     0.963544       Known           P     9
9    54903    MKS1            MKS transition zone complex subunit 1     0.960611       Known           P    10
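A common follow-up is to keep only the novel predictions. The sketch below does this on plain dictionaries so that it runs without any dependencies; the rows are copied from the first five lines of the output above (a subset of the columns), and the helper `top_novel` is hypothetical. With a real result, a list like this could presumably be obtained via the pandas method `df_probs.to_dict("records")`.

```python
# Rows copied from the example output above (subset of columns).
records = [
    {"Symbol": "IFT88", "Probability": 0.995984, "Known/Novel": "Novel", "Rank": 1},
    {"Symbol": "BBS4", "Probability": 0.992909, "Known/Novel": "Known", "Rank": 2},
    {"Symbol": "NPHP4", "Probability": 0.990705, "Known/Novel": "Novel", "Rank": 3},
    {"Symbol": "TMEM67", "Probability": 0.986072, "Known/Novel": "Novel", "Rank": 4},
    {"Symbol": "IQCB1", "Probability": 0.983366, "Known/Novel": "Novel", "Rank": 5},
]


def top_novel(rows, n=3):
    """Return the n highest-probability rows labeled as novel."""
    novel = [r for r in rows if r["Known/Novel"] == "Novel"]
    return sorted(novel, key=lambda r: r["Probability"], reverse=True)[:n]


novel_symbols = [r["Symbol"] for r in top_novel(records)]
print(novel_symbols)  # ['IFT88', 'NPHP4', 'TMEM67']
```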

Supported networks

Currently, GenePlexus comes with four networks: [BioGRID], [STRING] (default), [STRING-EXP], and [GIANT-TN]. Predictions can also be made using a custom network; see Using custom networks. However, the model similarity analysis is not available for custom networks because no pretrained models exist for them.
