PyGenePlexus API

Download datasets

Manual download

The examples below download data to my_data/ in two ways: 1) all tasks for the [STRING] network, using Embedding features and the gene set collections (GSCs) [GO] and [DisGeNet], and 2) the full data.

Warning

PROCEED WITH CAUTION: The first example below (STRING network using Embedding features with GO and DisGeNet GSCs) will occupy ~300MB of disk space. The second example (full download) will occupy ~32GB.

>>> from geneplexus.download import download_select_data
>>> download_select_data("my_data", tasks="All", networks="STRING",
...                      features="Embedding", gscs=["GO", "DisGeNet"])
>>> download_select_data("my_data")  # alternatively, download all data at once

See geneplexus.download.download_select_data() for more information.
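Given the sizes quoted in the warning above, it may be worth checking free disk space before starting a download. The following is a minimal sketch using only the Python standard library; has_enough_space and the two size constants are hypothetical helpers for illustration and are not part of PyGenePlexus.

```python
import shutil
from pathlib import Path

# Approximate sizes taken from the warning above (decimal units).
SELECT_DOWNLOAD_BYTES = 300 * 10**6  # STRING + Embedding + GO/DisGeNet
FULL_DOWNLOAD_BYTES = 32 * 10**9     # full data


def has_enough_space(target_dir, required_bytes):
    """Return True if the filesystem holding target_dir has required_bytes free.

    Hypothetical helper, not part of PyGenePlexus.
    """
    path = Path(target_dir).expanduser().resolve()
    # target_dir may not exist yet, so walk up to the nearest existing ancestor.
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free >= required_bytes


print("Room for the selected data:", has_enough_space("my_data", SELECT_DOWNLOAD_BYTES))
print("Room for the full data:", has_enough_space("my_data", FULL_DOWNLOAD_BYTES))
```

The check is advisory only: free space can change while a download is running, and the sizes are the approximate figures quoted above.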

Data options:

Networks	[BioGRID], [STRING], [STRING-EXP], [GIANT-TN]
Features	Adjacency, Influence, Embedding
GSCs	[GO], [DisGeNet]
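Because the option names are plain strings, a typo such as "String" is easy to make. The sketch below checks a choice against the options listed in the table above before any download is attempted. VALID_OPTIONS and check_option are hypothetical names introduced here; how PyGenePlexus itself validates arguments is not shown in this document.

```python
# Option names copied from the data options table above.
VALID_OPTIONS = {
    "networks": {"BioGRID", "STRING", "STRING-EXP", "GIANT-TN"},
    "features": {"Adjacency", "Influence", "Embedding"},
    "gscs": {"GO", "DisGeNet"},
}


def check_option(kind, value):
    """Return value unchanged if it is a listed option of the given kind.

    Hypothetical helper, not part of PyGenePlexus. Raises ValueError otherwise.
    """
    allowed = VALID_OPTIONS[kind]
    if value not in allowed:
        raise ValueError(
            f"Unknown {kind} option {value!r}; expected one of {sorted(allowed)}"
        )
    return value


network = check_option("networks", "STRING")
feature = check_option("features", "Embedding")
gscs = [check_option("gscs", name) for name in ["GO", "DisGeNet"]]
```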

Note

The Influence and Adjacency data representations take the longest to download, from ~10 minutes up to an hour depending on the download speed. The Embedding data representation takes the least time (within a minute).
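As a rough guide to how download time scales with connection speed, the arithmetic can be sketched as below. The sizes are the approximate figures from the warning above; the 100 Mbit/s bandwidth is an assumption chosen for illustration, and estimate_download_seconds is a hypothetical helper. Real transfers are also limited by server throughput, so treat the result as a lower bound.

```python
def estimate_download_seconds(size_mb, bandwidth_mbps):
    """Ideal transfer time in seconds for size_mb megabytes at bandwidth_mbps megabits/s."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    return size_mb * 8 / bandwidth_mbps  # 8 bits per byte


# Assumed bandwidth for illustration only.
BANDWIDTH_MBPS = 100

select_seconds = estimate_download_seconds(300, BANDWIDTH_MBPS)    # ~300MB example
full_seconds = estimate_download_seconds(32_000, BANDWIDTH_MBPS)   # ~32GB full data

print(f"Selected data: about {select_seconds:.0f} s")
print(f"Full data: about {full_seconds / 60:.0f} min")
```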

Auto download

Optionally, set the auto_download keyword argument to True to automatically download the necessary data when the GenePlexus object is initialized.

from geneplexus import GenePlexus
gp = GenePlexus(net_type="STRING", features="Embedding", gsc="GO", auto_download=True)

Note

The default data location is ~/.data/geneplexus/. You can change this by setting the file_loc argument of GenePlexus.
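To see where data would be stored, or whether a directory already exists before relying on auto download, the location can be resolved with pathlib. This is a minimal sketch: resolve_data_dir is a hypothetical helper, and it only mirrors the default path quoted in the note above rather than querying PyGenePlexus.

```python
from pathlib import Path

# Default location as quoted in the note above.
DEFAULT_DATA_DIR = Path("~/.data/geneplexus")


def resolve_data_dir(file_loc=None):
    """Return an absolute path for file_loc, or for the documented default if None.

    Hypothetical helper, not part of PyGenePlexus.
    """
    chosen = DEFAULT_DATA_DIR if file_loc is None else Path(file_loc)
    return chosen.expanduser().resolve()


data_dir = resolve_data_dir()
status = "exists" if data_dir.is_dir() else "does not exist yet"
print(f"{data_dir} {status}")
```

The resolved path could then be passed as the file_loc argument mentioned above.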

Run the PyGenePlexus pipeline

First, specify the input genes. The list can mix gene ID types, i.e. any combination of Entrez IDs, gene symbols, or Ensembl IDs.

input_genes = ["6457", "7037", "3134", "TTC8", "BBS5", "BBS12", ...]
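Hand-assembled gene lists often pick up stray whitespace or duplicate entries, and with mixed ID types it can be useful to see roughly what the list contains. The sketch below tidies a list and applies a simple heuristic to guess each ID type. Both functions are hypothetical helpers; the heuristic is an assumption for illustration and is not how PyGenePlexus converts IDs.

```python
import re


def clean_gene_list(genes):
    """Strip whitespace, drop empty entries, and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for gene in genes:
        gene = str(gene).strip()
        if gene and gene not in seen:
            seen.add(gene)
            cleaned.append(gene)
    return cleaned


def guess_id_type(gene):
    """Rough heuristic only: all digits -> Entrez, ENSG prefix -> Ensembl, else Symbol."""
    if gene.isdigit():
        return "Entrez"
    if re.fullmatch(r"ENSG\d+(\.\d+)?", gene):
        return "Ensembl"
    return "Symbol"


# "ENSG00000000000" is a placeholder ID used only to exercise the Ensembl branch.
raw_genes = ["6457", "7037", " BBS5", "BBS5", "", "ENSG00000000000"]
genes = clean_gene_list(raw_genes)
for gene in genes:
    print(gene, guess_id_type(gene))
```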

Alternatively, read the gene list from a file:

import geneplexus
input_genes = geneplexus.util.read_gene_list("my_gene_list.txt")

Next, run the pipeline using the GenePlexus object.

gp = geneplexus.GenePlexus(net_type="STRING", features="Embedding", gsc="GO")

# Load input genes and set up positives/negatives for training
gp.load_genes(input_genes)

# Train logistic regression model and get genome-wide gene predictions
mdl_weights, df_probs, avgps = gp.fit_and_predict()

# Optionally, compute model similarity to models pretrained on GO and DisGeNet gene sets
df_sim_GO, df_sim_Dis, weights_GO, weights_Dis = gp.make_sim_dfs()

# Optionally, extract the subgraph induced by the top (50 by default) predicted genes
df_edge, isolated_genes, df_edge_sym, isolated_genes_sym = gp.make_small_edgelist()