PyGenePlexus

PyGenePlexus is a Python package for running the [GenePlexus] model.

Warning

The documentation on the “latest” version of ReadTheDocs may not yet reflect code pushed to the main branch of the GitHub repository. However, the documentation for releases on PyPI will be correct on ReadTheDocs.

PyGenePlexus enables researchers to predict genes similar to an uploaded gene set of interest based on patterns of connectivity in genome-scale molecular interaction networks, with the ability to translate these findings across species.

Overview of PyGenePlexus

Given a list of input genes and a gene set collection (GSC) to help select negative examples, the package trains a logistic regression model using node embeddings as features and generates the following outputs, either in the same species as the input genes or translated to a model species.

  1. Genome-wide prediction of how functionally similar each gene is to the input gene list. The model is evaluated with k-fold cross validation; the default is 3-fold cross validation when at least 15 input genes are supplied. PyGenePlexus does not enforce a minimum or maximum number of genes (the minimum number of genes can be set in fit()), but note that the model was evaluated on gene sets ranging between 15 and 500 genes. See fit() and predict().

  2. (Optional) Interpretability of the model is provided by comparing the model trained on the user gene set to models pretrained on thousands of known gene sets from [GO] biological processes, [Monarch] phenotypes and [Mondo] diseases. See make_sim_dfs().

  3. (Optional) Interpretability of the top predicted genes is provided by returning their network connectivity. See make_small_edgelist().
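
The k-fold evaluation described in item 1 can be illustrated with a minimal, standard-library sketch. The helper name `k_fold_splits`, its parameters, and the choice to return no splits below the minimum gene count are assumptions made for illustration; this is not the PyGenePlexus implementation, which may split and stratify genes differently.

```python
import random


def k_fold_splits(genes, n_folds=3, min_genes=15, seed=0):
    """Return a list of (train, test) gene lists, one pair per fold.

    Hypothetical helper for illustration only; not part of PyGenePlexus.
    """
    if len(genes) < min_genes:
        # Assumption: with too few genes, no cross validation is performed.
        return []
    shuffled = list(genes)
    random.Random(seed).shuffle(shuffled)
    # Deal genes round-robin so fold sizes differ by at most one.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # Training genes are all genes outside the held-out fold.
        train = [g for j, fold in enumerate(folds) if j != i for g in fold]
        splits.append((train, test))
    return splits


genes = [f"GENE{i}" for i in range(17)]
for train, test in k_fold_splits(genes):
    print(len(train), len(test))
```

Each gene appears in exactly one held-out fold, so every input gene is scored once by a model that did not see it during training.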

Quick start

PyGenePlexus comes with an easy-to-use command line interface (CLI) to run the full GenePlexus pipeline given an input gene list. To get started, install via pip and run a quick example as follows.

pip install geneplexus
geneplexus --input_file my_gene_list.txt --output_dir my_result

Note that you need to supply the my_gene_list.txt file, which is a line-separated gene list text file (NCBI Entrez IDs, Symbols or Ensembl IDs are accepted). An example can be found on the GitHub page under example/input_genes.txt. More info can be found in PyGenePlexus CLI.
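
For readers who want to load the same line-separated file in Python, the following is a minimal sketch. The helper name `read_gene_list` is hypothetical and not part of PyGenePlexus; it assumes one gene identifier per line and skips blank lines. The demo writes a small temporary file so the example runs on its own.

```python
import os
import tempfile
from pathlib import Path


def read_gene_list(path):
    """Read a line-separated gene list, dropping blank lines and whitespace.

    Hypothetical helper for illustration only; not part of PyGenePlexus.
    """
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demo: write a tiny gene list to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "my_gene_list.txt")
    Path(demo_path).write_text("ARL6\nBBS1\n\n  BBS10 \n")
    demo_genes = read_gene_list(demo_path)
    print(demo_genes)
```

A list produced this way has the same shape as the `input_genes` list used in the API example.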

Warning

All necessary files for a specific selection of parameters (network, feature, species, and gene set collection) will be downloaded automatically and saved under ~/.data/geneplexus. Users can also specify the location where data is saved using the --output_dir argument. The example provided will download files that occupy ~4GB of space.
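
Because the example download is large, it can be worth checking free disk space first. The following is a minimal sketch using only the Python standard library; the helper name `has_enough_space` and the 4 GB default are assumptions based on the approximate size quoted above, and the check is not something PyGenePlexus performs itself.

```python
import shutil
from pathlib import Path


def has_enough_space(data_dir, required_gb=4.0):
    """Return True if the disk holding data_dir has at least required_gb free.

    Hypothetical helper for illustration only; not part of PyGenePlexus.
    """
    path = Path(data_dir).expanduser()
    # The data directory may not exist yet, so walk up to the nearest
    # existing parent directory before querying the file system.
    while not path.exists():
        path = path.parent
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    print(has_enough_space("~/.data/geneplexus"))
```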

Using the API

The following quick example generates predictions from an input gene list. More info can be found in PyGenePlexus API.

from geneplexus import GenePlexus

# Input gene list (gene symbols)
input_genes = ["ARL6", "BBS1", "BBS10", "BBS12", "BBS2", "BBS4",
               "BBS5", "BBS7", "BBS9", "CCDC28B", "CEP290", "KIF7",
               "MKKS", "MKS1", "TRIM32", "TTC8", "WDPCP"]

# Configure the run; auto_download=True fetches any missing data files
gp = GenePlexus(net_type="STRING", features="SixSpeciesN2V",
                sp_trn="Human", sp_res="Human",
                gsc_trn="Combined", gsc_res="Combined",
                input_genes=input_genes, auto_download=True,
                log_level="INFO")
gp.fit()      # train the model on the input genes
gp.predict()  # generate genome-wide predictions

# Show the first 10 rows of the prediction table
df_probs = gp.model_info["All-Genes"].results["Human-Combined"].df_probs
print(df_probs.iloc[:10])

Note

v3 of PyGenePlexus is significantly different from v2 and v1. Documentation for older stable releases can be found at https://pygeneplexus.readthedocs.io/en/v2.0.4/ or https://pygeneplexus.readthedocs.io/en/v1.0.1/
