PyGenePlexus CLI
PyGenePlexus provides a command line interface to run the full [GenePlexus] pipeline on a user defined geneset (a text file with gene IDs seprated by line).
geneplexus --input_file my_gene_list.txt --output_dir my_result --file_loc my_data
The command above reads the gene list file my_gene_list.txt, downloads the necessary
data files and saves them to the directory my_data/. If --file_loc is not supplied,
the data files will be saved under ~/.data/geneplexus/ by default. Finally, all
output files will be saved under my_result/.
Note
If the direcory my_result/ already exists, the program will try to append
a number, e.g., my_result_1/, to prevent overwriting. If you would like
to overwrite, you can do so by specifying the --overwrite CLI option.
Full CLI options (check out with geneplexus --help)
Run the GenePlexus pipline on a input gene list.
options:
-h, --help show this help message and exit
-i , --input_file Input gene list file (eg. (.txt file)). (default: None)
-d , --gene_list_delimiter
Delimiter used in the gene list. Use 'newline' if the genes are separated
by new line, and use 'tab' if the genes are seperate by tabs. If not
newline or tab, will use argument directly, so /t, /n, , (default:
newline)
-fl , --file_loc Directory in which the data are stored, if set to None, then use the
default data directory ~/.data/geneplexus (default: None)
-n , --net_type Network to use. The choices are: {BioGRID, STRING, IMP} (default: STRING)
-f , --features Types of feature to use. The choices are: {SixSpeciesN2V} (default:
SixSpeciesN2V)
-st , --sp_trn Species of training data The choices are: {Human, Mouse, Fly, Worm,
Zebrafish, Yeast} (default: Human)
-sr , --sp_res Species of results data The choices are: {Human, Mouse, Fly, Worm,
Zebrafish, Yeast}. If more than one species make comma seaprated.
(default: Human)
-gt , --gsc_trn Geneset collection used to generate negatives. The choices are: {GO,
Monarch, Mondo, Combined} (default: Combined)
-gr , --gsc_res Geneset collection used for model similarities. The choices are: {GO,
Monarch, Mondo, Combined}. If more than one gsc can be comma spearated.
(default: Combined)
-in , --input_negatives
Input negative gene list (.txt) file. (default: None)
-l , --log_level Logging level. The choices are: {CRITICAL, ERROR, WARNING, INFO, DEBUG}.
Set to CRITICAL for quietest logging. (default: INFO)
-ad, --auto_download When added turns on autodownloader which is off by default. (default:
False)
--clear-data When added will allow user to interactively clear file_loc data and exit.
(default: False)
--do_clustering When added cluster_input() function will be run. (default: False)
--skip-mdl-sim When added make_sim_dfs() will not be run (default: False)
--skip-sm-edgelist When added make_small_edgelist() will not be run (default: False)
-cm , --clust_method
Sets the clustering method in cluster_input(). The choices are: {louvain,
domino} (default: louvain)
-cmin , --clust_min_size
Sets the minimum size of clusters allowed in cluster_input(). (default:
15)
-cw, --clust_weighted
When added will set clust_weight argument to False in cluster_input().
(default: True)
-ck , --clust_kwargs
Sets the clustering keyword arguments in cluster_input(). (default:
{'louvain_max_size': 70, 'louvain_max_tries': 3, 'louvain_res': 1,
'louvain_seed': 123, 'domino_res': 1, 'domino_slice_thresh': 0.3,
'domino_n_steps': 20, 'domino_module_threshold': 0.05, 'domino_seed':
123})
-lk , --logreg_kwargs
Set the logistic regression keyword arguments in fit(). (default:
{'max_iter': 10000, 'solver': 'lbfgs', 'penalty': 'l2', 'C': 1.0})
-s, --scale When added, will set scale to True in fit(). See docs for more info of
when this is good to do. (default: False)
-mnp , --min_num_pos
Minimum umber of genes needed to fit a model in fit(). (default: 15)
-mnpcv , --min_num_pos_cv
Minumum number of genes needed to do cross validation in fit(). (default:
15)
-nf , --num_folds Number of folds to do for cross validation in fit(). (default: 3)
-nv , --null_val Value to use when CV can't be done in fit(). (default: None)
-rs , --random_state
Random state value to use in fit(). (default: 0)
-cv, --cross_validate
When added, will set cross validate to False in fit(). (default: True)
-nn , --num_nodes Number of nodes in make_small_edgelist(). (default: 50)
-od , --output_dir Output directory with respect to the repo root directory used in
save_class(). if set to None, then use the default output directory
~/.data/geneplexus_outputs/results (default: None)
-svt , --save_type Which file saving method to use in save_class(). The choices are: {all,
results_only} (default: all)
-z, --zip-output When added, zip_ouput is set to True in save_class(). (default: False)
-o, --overwrite When added, overwrite is set to True in save_class(). (default: False)
The output file structure is as follows. This is for –save_type all, if –save_type results_only is used then only select files will be saved.
my_result/Output directorygeneplexus.logThe logger messages.top_level_config.jsonContains configuration infomration for GenePlexus class.df_convert_out.tsvTable showing conversion of input genes to Entrez IDs for all networks. (seegeneplexus.GenePlexus.load_genes())Model DirectoriesFolders containing information for each of the trained models. All-Genes for full input gene list and Cluster-N for each cluster if clustering was performed.clf.joblibSerialized version of the trained model.std_scale.joblibSerialized version of the standard scaler used (Noneif scale=False ingeneplexus.GenePlexus.fit()).model_level_config.jsonContains configuration information specific to each model including evaluation metrics and positive, megative and neutral genes, and model weights.df_convert_out_for_model.tsvTable showing conversion of input genes for each model. (seegeneplexus.GenePlexus.fit())Result DirectoriesFolders containing results for eachsp_resandgsc_rescombinationdf_probs.tsvTop predicted genes related to the input gene list. (seegeneplexus.GenePlexus.predict())df_sim.tsvSimilarity of model trained on user gene list to models trained on known gene sets. (seegeneplexus.GenePlexus.make_sim_dfs())df_edge.tsvEdgelist (Entrez ID) of subgraph induced by top predicted genes. (seegeneplexus.GenePlexus.make_small_edgelist())df_edge_sym.tsvEdgelist (Symbol) of subgraph induced by top predicted genes. (seegeneplexus.GenePlexus.make_small_edgelist())isoloated_genes.txtList of top predicted genes (Entrez ID) that have no edges in the network. (seegeneplexus.GenePlexus.make_small_edgelist())isoloated_genes_sym.txtList of top predicted genes (Symbol) that have no edges in the network. (seegeneplexus.GenePlexus.make_small_edgelist())