.. _cli:

PyGenePlexus CLI
================

PyGenePlexus provides a command line interface to run the full [GenePlexus]_
pipeline on a user-defined gene set (a text file with gene IDs separated by
line).

.. code-block:: bash

   geneplexus --input_file my_gene_list.txt --output_dir my_result --file_loc my_data

The command above reads the gene list file ``my_gene_list.txt`` and uses the
data files in the directory ``my_data/``. Because the autodownloader is off by
default, add ``--auto_download`` to have missing data files downloaded into
that directory. If ``--file_loc`` is not supplied, the default data directory
``~/.data/geneplexus/`` is used. Finally, all output files are saved under
``my_result/``.

.. note::

   If the directory ``my_result/`` already exists, the program will try to
   append a number, e.g., ``my_result_1/``, to prevent overwriting. To
   overwrite the existing directory instead, specify the ``--overwrite`` CLI
   option.

Full CLI options (also available via ``geneplexus --help``):

.. code-block:: text

   Run the GenePlexus pipeline on an input gene list.

   options:
     -h, --help            show this help message and exit
     -i , --input_file     Input gene list file (eg. (.txt file)). (default: None)
     -d , --gene_list_delimiter
                           Delimiter used in the gene list. Use 'newline' if the
                           genes are separated by new line, and use 'tab' if the
                           genes are separated by tabs. If not newline or tab,
                           will use argument directly, so /t, /n, , (default:
                           newline)
     -fl , --file_loc      Directory in which the data are stored, if set to
                           None, then use the default data directory
                           ~/.data/geneplexus (default: None)
     -n , --net_type       Network to use. The choices are: {BioGRID, STRING,
                           IMP} (default: STRING)
     -f , --features       Types of feature to use. The choices are:
                           {SixSpeciesN2V} (default: SixSpeciesN2V)
     -st , --sp_trn        Species of training data. The choices are: {Human,
                           Mouse, Fly, Worm, Zebrafish, Yeast} (default: Human)
     -sr , --sp_res        Species of results data. The choices are: {Human,
                           Mouse, Fly, Worm, Zebrafish, Yeast}. If more than one
                           species make comma separated.
                           (default: Human)
     -gt , --gsc_trn       Geneset collection used to generate negatives. The
                           choices are: {GO, Monarch, Mondo, Combined} (default:
                           Combined)
     -gr , --gsc_res       Geneset collection used for model similarities. The
                           choices are: {GO, Monarch, Mondo, Combined}. If more
                           than one gsc can be comma separated. (default:
                           Combined)
     -in , --input_negatives
                           Input negative gene list (.txt) file. (default: None)
     -l , --log_level      Logging level. The choices are: {CRITICAL, ERROR,
                           WARNING, INFO, DEBUG}. Set to CRITICAL for quietest
                           logging. (default: INFO)
     -ad, --auto_download  When added turns on autodownloader which is off by
                           default. (default: False)
     --clear-data          When added will allow user to interactively clear
                           file_loc data and exit. (default: False)
     --do_clustering       When added cluster_input() function will be run.
                           (default: False)
     --skip-mdl-sim        When added make_sim_dfs() will not be run (default:
                           False)
     --skip-sm-edgelist    When added make_small_edgelist() will not be run
                           (default: False)
     -cm , --clust_method  Sets the clustering method in cluster_input(). The
                           choices are: {louvain, domino} (default: louvain)
     -cmin , --clust_min_size
                           Sets the minimum size of clusters allowed in
                           cluster_input(). (default: 15)
     -cw, --clust_weighted
                           When added will set clust_weight argument to False in
                           cluster_input(). (default: True)
     -ck , --clust_kwargs  Sets the clustering keyword arguments in
                           cluster_input(). (default: {'louvain_max_size': 70,
                           'louvain_max_tries': 3, 'louvain_res': 1,
                           'louvain_seed': 123, 'domino_res': 1,
                           'domino_slice_thresh': 0.3, 'domino_n_steps': 20,
                           'domino_module_threshold': 0.05, 'domino_seed': 123})
     -lk , --logreg_kwargs
                           Set the logistic regression keyword arguments in
                           fit(). (default: {'max_iter': 10000, 'solver':
                           'lbfgs', 'penalty': 'l2', 'C': 1.0})
     -s, --scale           When added, will set scale to True in fit(). See docs
                           for more info of when this is good to do. (default:
                           False)
     -mnp , --min_num_pos  Minimum number of genes needed to fit a model in
                           fit().
                           (default: 15)
     -mnpcv , --min_num_pos_cv
                           Minimum number of genes needed to do cross validation
                           in fit(). (default: 15)
     -nf , --num_folds     Number of folds to do for cross validation in fit().
                           (default: 3)
     -nv , --null_val      Value to use when CV can't be done in fit().
                           (default: None)
     -rs , --random_state  Random state value to use in fit(). (default: 0)
     -cv, --cross_validate
                           When added, will set cross validate to False in
                           fit(). (default: True)
     -nn , --num_nodes     Number of nodes in make_small_edgelist(). (default:
                           50)
     -od , --output_dir    Output directory with respect to the repo root
                           directory used in save_class(). if set to None, then
                           use the default output directory
                           ~/.data/geneplexus_outputs/results (default: None)
     -svt , --save_type    Which file saving method to use in save_class(). The
                           choices are: {all, results_only} (default: all)
     -z, --zip-output      When added, zip_output is set to True in
                           save_class(). (default: False)
     -o, --overwrite       When added, overwrite is set to True in save_class().
                           (default: False)

The output file structure is as follows. This listing is for
``--save_type all``; if ``--save_type results_only`` is used, only select
files are saved.

* ``my_result/`` Output directory

  * ``geneplexus.log`` The logger messages.
  * ``top_level_config.json`` Contains configuration information for the
    GenePlexus class.
  * ``df_convert_out.tsv`` Table showing conversion of input genes to Entrez
    IDs for all networks. (see :meth:`geneplexus.GenePlexus.load_genes`)
  * ``Model Directories`` Folders containing information for each of the
    trained models: ``All-Genes`` for the full input gene list and
    ``Cluster-N`` for each cluster if clustering was performed.

    * ``clf.joblib`` Serialized version of the trained model.
    * ``std_scale.joblib`` Serialized version of the standard scaler used
      (``None`` if ``scale=False`` in :meth:`geneplexus.GenePlexus.fit`).
    * ``model_level_config.json`` Contains configuration information specific
      to each model, including evaluation metrics, positive, negative and
      neutral genes, and model weights.
    * ``df_convert_out_for_model.tsv`` Table showing conversion of input genes
      for each model. (see :meth:`geneplexus.GenePlexus.fit`)
    * ``Result Directories`` Folders containing results for each ``sp_res``
      and ``gsc_res`` combination.

      * ``df_probs.tsv`` Top predicted genes related to the input gene list.
        (see :meth:`geneplexus.GenePlexus.predict`)
      * ``df_sim.tsv`` Similarity of the model trained on the user gene list
        to models trained on known gene sets. (see
        :meth:`geneplexus.GenePlexus.make_sim_dfs`)
      * ``df_edge.tsv`` Edgelist (Entrez ID) of the subgraph induced by the
        top predicted genes. (see
        :meth:`geneplexus.GenePlexus.make_small_edgelist`)
      * ``df_edge_sym.tsv`` Edgelist (Symbol) of the subgraph induced by the
        top predicted genes. (see
        :meth:`geneplexus.GenePlexus.make_small_edgelist`)
      * ``isoloated_genes.txt`` List of top predicted genes (Entrez ID) that
        have no edges in the network. (see
        :meth:`geneplexus.GenePlexus.make_small_edgelist`)
      * ``isoloated_genes_sym.txt`` List of top predicted genes (Symbol) that
        have no edges in the network. (see
        :meth:`geneplexus.GenePlexus.make_small_edgelist`)
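
Once a run has finished, the result tables listed above can be collected for
downstream analysis. The following is a minimal sketch using only the Python
standard library, not part of PyGenePlexus itself. The helper names
``find_result_tables``, ``read_tsv`` and ``summarize_results`` are
hypothetical. The sketch assumes that each ``.tsv`` file starts with a header
row, and it searches the output directory recursively so that it does not
depend on the exact nesting of the model and result directories.

.. code-block:: python

   import csv
   from pathlib import Path

   # File names taken from the output listing above.
   RESULT_FILES = ("df_probs.tsv", "df_sim.tsv", "df_edge.tsv", "df_edge_sym.tsv")


   def find_result_tables(output_dir):
       """Map each result file name to the sorted list of matching paths."""
       root = Path(output_dir)
       return {name: sorted(root.rglob(name)) for name in RESULT_FILES}


   def read_tsv(path):
       """Read a tab-separated file into (header, rows).

       Assumption: the first line is a header row. No column names are assumed.
       """
       with open(path, newline="") as handle:
           reader = csv.reader(handle, delimiter="\t")
           header = next(reader, [])
           return header, list(reader)


   def summarize_results(output_dir):
       """Return (path, header, number_of_rows) for every result table found."""
       summary = []
       for paths in find_result_tables(output_dir).values():
           for path in paths:
               header, rows = read_tsv(path)
               summary.append((path, header, len(rows)))
       return summary

Calling ``summarize_results("my_result")`` after the example command at the
top of this page would return one entry per result table found under
``my_result/``, which is a quick way to check which ``sp_res`` and ``gsc_res``
combinations produced output.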