PyGenePlexus API

Download datasets

Manual download

The examples below download data to my_data/ in two ways: 1) the data for all tasks using the [STRING] network, Embedding features, and the [GO] and [DisGeNet] gene set collections (GSCs), and 2) the full data set.

Warning

PROCEED WITH CAUTION: The first example below (STRING network using Embedding features with GO and DisGeNet GSCs) will occupy ~300MB of disk space. The second example (full download) will occupy ~32GB of disk space.

>>> from geneplexus.download import download_select_data
>>> download_select_data("my_data", tasks="All", networks="STRING",
...                      features="Embedding", gscs=["GO", "DisGeNet"])
>>> download_select_data("my_data")  # alternatively, download all data at once

See geneplexus.download.download_select_data() for more information.
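
Because the full download is large, it can help to confirm that the target disk has enough free space before starting. The following is a minimal sketch using only the Python standard library; the helper has_room and the two size constants are hypothetical names introduced here (the sizes are the approximate figures from the warning above, taken as decimal units), and nothing in it is part of the geneplexus API.

```python
import shutil
from pathlib import Path

# Approximate sizes quoted in the warning above (decimal units assumed).
SELECT_DOWNLOAD_BYTES = 300 * 10**6  # ~300MB
FULL_DOWNLOAD_BYTES = 32 * 10**9     # ~32GB


def has_room(target_dir, required_bytes):
    """Return True if the filesystem holding target_dir has enough free space."""
    path = Path(target_dir).expanduser().resolve()
    # The target directory may not exist yet, so walk up to an existing parent.
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free >= required_bytes


for label, size in [("select", SELECT_DOWNLOAD_BYTES), ("full", FULL_DOWNLOAD_BYTES)]:
    print(f"{label} download fits in my_data/: {has_room('my_data', size)}")
```

The check is advisory only: free space can change while a download is running, so it is worth leaving some headroom.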

Data options:

Networks

[BioGRID], [STRING], [STRING-EXP], [GIANT-TN]

Features

Adjacency, Influence, Embedding

GSCs

[GO], [DisGeNet]

Note

The Influence and Adjacency data representations take the longest to download, from ~10 minutes up to an hour depending on the download speed. The Embedding data representation takes the least time to download (within a minute).
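
To see how download time scales with connection speed, the arithmetic can be sketched as below. This is an idealized lower bound that ignores server throughput and protocol overhead; the function estimate_seconds is a hypothetical helper, the 100 Mbit/s link speed is an arbitrary assumption, and the sizes are the approximate figures from the warning earlier in this section.

```python
def estimate_seconds(size_bytes, mbit_per_s):
    """Ideal transfer time: bits to move divided by bits per second."""
    return size_bytes * 8 / (mbit_per_s * 1_000_000)


# Sizes are the approximate figures from the warning; 100 Mbit/s is an assumption.
for label, size in [("~300MB selection", 300e6), ("~32GB full download", 32e9)]:
    minutes = estimate_seconds(size, mbit_per_s=100) / 60
    print(f"{label}: about {minutes:.1f} min at 100 Mbit/s")
```

Under these assumptions the ~300MB selection needs about 24 seconds and the ~32GB full download about 43 minutes; real downloads will usually take longer.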

Auto download

Optionally, set the auto_download keyword argument to True to automatically download the necessary data when the GenePlexus object is initialized.

from geneplexus import GenePlexus
gp = GenePlexus(net_type="STRING", features="Embedding", gsc="GO", auto_download=True)

Note

The default data location is ~/.data/geneplexus/. You can change this by setting the file_loc argument of GenePlexus.
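
The tilde in the default location refers to the current user's home directory. As a minimal sketch of what that path expands to on a given machine, and whether it already exists, the standard library can be used as follows. The helper resolve_data_dir is hypothetical; it merely borrows the name file_loc for its parameter and does not show how geneplexus resolves paths internally.

```python
from pathlib import Path

# Default location quoted in the note above.
DEFAULT_DATA_DIR = "~/.data/geneplexus"


def resolve_data_dir(file_loc=None):
    """Expand '~' and return an absolute path for the data directory."""
    return Path(file_loc or DEFAULT_DATA_DIR).expanduser().resolve()


data_dir = resolve_data_dir()
print("Data directory:", data_dir)
print("Already present:", data_dir.is_dir())
```

Passing a custom directory, for example resolve_data_dir("my_data"), returns the absolute form of that path instead.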

Run the PyGenePlexus pipeline

First, specify the input genes. The list can mix gene ID types, i.e. any combination of Entrez IDs, gene symbols, and Ensembl IDs.

input_genes = ["6457", "7037", "3134", "TTC8", "BBS5", "BBS12", ...]

Alternatively, read the gene list from a file:

import geneplexus
input_genes = geneplexus.util.read_gene_list("my_gene_list.txt")
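
Hand-typed gene lists often contain stray whitespace, empty entries, or duplicates. Whether the library normalizes these is not stated here, so the following sketch shows one defensive way to tidy a list before passing it on. The helper clean_gene_list and the sample values are hypothetical and are not part of geneplexus.

```python
def clean_gene_list(genes):
    """Strip whitespace, drop empty entries, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for gene in genes:
        gene = str(gene).strip()
        if gene and gene not in seen:
            seen.add(gene)
            cleaned.append(gene)
    return cleaned


# Illustrative input with a leading space, an empty entry, and a duplicate.
raw_genes = ["6457", " BBS5", "", "TTC8", "TTC8"]
print(clean_gene_list(raw_genes))
```

The function preserves the original order of first occurrences, which keeps the cleaned list easy to compare against the source file.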

Next, run the pipeline using the GenePlexus object.

gp = geneplexus.GenePlexus(net_type="STRING", features="Embedding", gsc="GO")

# Load input genes and set up positives/negatives for training
gp.load_genes(input_genes)

# Train logistic regression model and get genome-wide gene predictions
mdl_weights, df_probs, avgps = gp.fit_and_predict()

# Optionally, compute model similarity to models pretrained on GO and DisGeNet gene sets
df_sim_GO, df_sim_Dis, weights_GO, weights_Dis = gp.make_sim_dfs()

# Optionally, extract the subgraph induced by the top (50 by default) predicted genes
df_edge, isolated_genes, df_edge_sym, isolated_genes_sym = gp.make_small_edgelist()
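
To illustrate what "the subgraph induced by the top predicted genes" means, here is a small stand-alone sketch in plain Python. It is not the implementation behind make_small_edgelist; the function induced_subgraph, the scores, and the edges are all hypothetical. The idea is to keep the k highest-scoring genes, retain only edges whose two endpoints are both in that set, and report any selected gene left without an edge as isolated.

```python
def induced_subgraph(scores, edges, k=50):
    """Return (edges among the top-k genes, top-k genes with no retained edge)."""
    # Rank genes by score, highest first, and keep the first k.
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    top_set = set(top)
    # An edge survives only if both endpoints are among the top-k genes.
    kept = [(u, v) for u, v in edges if u in top_set and v in top_set]
    connected = {gene for edge in kept for gene in edge}
    isolated = [gene for gene in top if gene not in connected]
    return kept, isolated


# Toy data for illustration only.
toy_scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1}
toy_edges = [("A", "B"), ("B", "D"), ("C", "D")]
kept_edges, isolated = induced_subgraph(toy_scores, toy_edges, k=3)
print(kept_edges, isolated)
```

With k=3 the toy example keeps genes A, B, and C; only the A-B edge has both endpoints in that set, so C is reported as isolated.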