|
| 1 | +# GOcats |
| 2 | + |
| 3 | +GOcats is an Open Biomedical Ontology (OBO) parser and categorizing utility--currently specialized for the Gene Ontology (GO)--which can sort ontology terms into conceptual categories that a user provides. |
| 4 | + |
| 5 | +## Important note: cite your use of GOcats |
| 6 | +See the CITATION file for instructions. |
| 7 | + |
| 8 | +## Getting Started |
| 9 | + |
| 10 | +We recommend cloning this repository into a project directory within your home directory. |
| 11 | + |
| 12 | +You will also need a local copy of the Gene Ontology OBO flat file, available here: http://purl.obolibrary.org/obo/go.obo |
| 13 | + |
| 14 | +GOcats can map annotations within Gene Association Files (GAFs) into categories specified by the user. To specify categories, create a CSV file in which column 1 is the name of the category and column 2 is a list of keywords associated with that category concept, separated by semicolons (;). See GOcats/gocats/exampledata/examplecategories.csv for an example with 25 subcellular location categories. In the current version, this is the main use of GOcats. |
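As a minimal sketch of the category file layout described above, the following Python snippet writes a two-column CSV with one category per row and semicolon-separated keywords in the second column. The category names, keywords, output file name, and the absence of a header row are illustrative assumptions; consult examplecategories.csv for the layout GOcats actually ships with.

```python
import csv


def write_category_file(path, categories):
    """Write {category name: [keywords]} as a two-column CSV.

    Column 1 holds the category name; column 2 holds the keywords
    joined by semicolons. No header row is written (an assumption).
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for name, keywords in categories.items():
            writer.writerow([name, ";".join(keywords)])


# Hypothetical categories, for illustration only.
example_categories = {
    "nucleus": ["nucleus", "nuclear"],
    "mitochondrion": ["mitochondria", "mitochondrion", "mitochondrial"],
}

write_category_file("my_categories.csv", example_categories)
```

The resulting file can then be passed to GOcats in place of the example category file.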
| 15 | + |
| 16 | +If you would like to reproduce the analyses carried out during the development of GOcats, which compare its mappings to OWLTools' Map2Slim and to UniProt's Subcellular Location Controlled Vocabulary, please install the "Additional Packages" listed under the Prerequisites section and see the "Running the tests and producing manuscript results" section. |
| 17 | + |
| 18 | +### Prerequisites |
| 19 | + |
| 20 | +#### Generating GOcats category mapping and mapping GAFs (standard usage) |
| 21 | + |
| 22 | +##### Python3 / pip |
| 23 | + |
| 24 | +Fedora 24 |
| 25 | +``` |
| 26 | +sudo dnf install python3-devel |
| 27 | +sudo dnf install python3-pip |
| 28 | +``` |
| 29 | + |
| 30 | +Ubuntu 16.04 |
| 31 | +``` |
| 32 | +sudo apt-get install python3-dev |
| 33 | +sudo apt-get install python3-pip |
| 34 | +``` |
| 35 | + |
| 36 | +##### Docopt / JSONPickle |
| 37 | + |
| 38 | +Fedora 24 / Ubuntu 16.04 |
| 39 | +``` |
| 40 | +sudo pip3 install docopt |
| 41 | +sudo pip3 install jsonpickle |
| 42 | +``` |
| 43 | + |
| 44 | +#### Additional Packages (for running development tests, and scripts for producing manuscript results) |
| 45 | + |
| 46 | +##### OWLTools prerequisites (see Installing OWLTools under Installing or visit https://github.com/owlcollab/owltools): |
| 47 | + |
| 48 | +###### Maven / Java |
| 49 | + |
| 50 | +Fedora 24 |
| 51 | +``` |
| 52 | +sudo dnf install maven java-1.8.0-openjdk-devel |
| 53 | +``` |
| 54 | + |
| 55 | +Ubuntu |
| 56 | +``` |
| 57 | +sudo apt-get install maven openjdk-8-jdk |
| 58 | +``` |
| 59 | + |
| 60 | +#### Plotting figures and tables |
| 61 | + |
| 62 | +Fedora 24 |
| 63 | +``` |
| 64 | +sudo dnf install gcc-c++ libpng-devel freetype-devel libffi-devel python3-tkinter |
| 65 | +sudo pip3 install --upgrade pip |
| 66 | +sudo pip3 install numpy pandas tabulate cairocffi pyupset py2cytoscape matplotlib |
| 67 | +``` |
| 68 | +Ubuntu 16.04 |
| 69 | +``` |
| 70 | +sudo apt-get install gcc libpng-dev freetype2-demos libffi-dev python3-tk |
| 71 | +sudo pip3 install --upgrade pip |
| 72 | +sudo pip3 install numpy pandas tabulate cairocffi pyupset py2cytoscape matplotlib |
| 73 | +``` |
| 74 | + |
| 75 | +### Installing |
| 76 | + |
| 77 | +#### GOcats |
| 78 | + |
| 79 | +Clone the repo after installing the dependencies. You will need permission to |
| 80 | + access the GitLab server. If you do not have access, you probably obtained this |
| 81 | + project directory from FigShare, in which case these steps are unnecessary. |
| 82 | +``` |
| 83 | +cd |
| 84 | +git clone https://[email protected]/eugene/GOcats.git |
| 85 | +``` |
| 86 | + |
| 87 | +Check out the manuscript_3 branch for the most recent version |
| 88 | +``` |
| 89 | +cd GOcats |
| 90 | +git fetch |
| 91 | +git checkout manuscript_3 |
| 92 | +``` |
| 93 | + |
| 94 | +#### OWLTools (optional) |
| 95 | + |
| 96 | +Clone the repo after installing the dependencies |
| 97 | +``` |
| 98 | +cd |
| 99 | +git clone https://github.com/owlcollab/owltools |
| 100 | +``` |
| 101 | + |
| 102 | +Install owltools using maven |
| 103 | +``` |
| 104 | +cd ~/owltools/OWLTools-Parent |
| 105 | +mvn clean package |
| 106 | +``` |
| 107 | + |
| 108 | +You may get build errors. If so, the following command skips running the tests during the build, which works around the errors without affecting how OWLTools is used in this project |
| 109 | +``` |
| 110 | +mvn clean package -D maven.test.skip.exec=true |
| 111 | +``` |
| 112 | + |
| 113 | +#### Example usage |
| 114 | + |
| 115 | +Creating a mapping of GO terms from the Gene Ontology using a category file |
| 116 | +``` |
| 117 | +python3 ~/GOcats/gocats/gocats.py create_subgraphs /path_to_ontology_file ~/GOcats/gocats/exampledata/examplecategories.csv ~/Output --supergraph_namespace=cellular_component --subgraph_namespace=cellular_component --output_termlist |
| 118 | +``` |
| 119 | +This will output several files in the 'Output' directory including: |
| 120 | +``` |
| 121 | +GC_content_mapping.json_pickle # A python dictionary with category-defining GO terms as keys and a list of all subgraph contents as values. |
| 122 | +GC_id_mapping.json_pickle # A python dictionary with every GO term of the specified namespace as keys and a list of category root terms as values. |
| 123 | +``` |
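The two mappings listed above are views of the same information from opposite directions. As a rough sketch (not GOcats' own code), the snippet below inverts a content-style mapping into an ID-style mapping. The GO identifiers are hypothetical placeholders. In practice the dictionaries would presumably be read from the .json_pickle files with the jsonpickle package listed under Prerequisites, which is an assumption about how the files are encoded.

```python
from collections import defaultdict


def invert_content_mapping(content_mapping):
    """Turn {category root: [member terms]} into {member term: [category roots]}."""
    id_mapping = defaultdict(list)
    for root, members in content_mapping.items():
        for term in members:
            id_mapping[term].append(root)
    return dict(id_mapping)


# Hypothetical placeholder identifiers, not real GO terms.
content_mapping = {
    "GO:ROOT_A": ["GO:ROOT_A", "GO:TERM_1", "GO:TERM_2"],
    "GO:ROOT_B": ["GO:ROOT_B", "GO:TERM_2"],
}

id_mapping = invert_content_mapping(content_mapping)
print(id_mapping["GO:TERM_2"])  # ['GO:ROOT_A', 'GO:ROOT_B']
```

Note that this inversion only contains terms that fall into at least one category, whereas the ID mapping file is described as having every term of the namespace as a key.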
| 124 | + |
| 125 | +Mapping GO terms in a GAF |
| 126 | +``` |
| 127 | +python3 ~/GOcats/gocats/gocats.py categorize_dataset YOUR_GAF.goa YOUR_OUTPUT_DIRECTORY/GC_id_mapping.json_pickle YOUR_OUTPUT_DIRECTORY MAPPED_GAF_NAME.goa |
| 128 | +``` |
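To illustrate what mapping a GAF conceptually involves, here is a minimal sketch under stated assumptions; it is not the implementation behind categorize_dataset. It relies on two facts from the GAF format (comment lines start with `!`, and the GO ID is the fifth tab-separated column). The function name, the shortened example rows, and the choice to drop annotations that have no category are all hypothetical.

```python
def map_gaf_lines(gaf_lines, id_mapping):
    """Rewrite the GO ID column of each annotation to its category root(s).

    An annotation whose term maps to several categories yields one output
    row per category; unmapped annotations are dropped in this sketch.
    """
    mapped = []
    for line in gaf_lines:
        if line.startswith("!"):  # GAF header/comment line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:  # not a usable annotation row
            continue
        for root in id_mapping.get(fields[4], []):
            mapped.append("\t".join(fields[:4] + [root] + fields[5:]))
    return mapped


# Hypothetical rows, truncated to 7 columns for brevity (real GAFs have 17).
gaf_lines = [
    "!gaf-version: 2.1\n",
    "DB\tOBJ1\tSYM1\t\tGO:TERM_1\tREF:1\tIDA\n",
    "DB\tOBJ2\tSYM2\t\tGO:TERM_9\tREF:2\tIDA\n",
]
id_mapping = {"GO:TERM_1": ["GO:ROOT_A", "GO:ROOT_B"]}

mapped = map_gaf_lines(gaf_lines, id_mapping)
for row in mapped:
    print(row)
```

Here the first annotation produces two rows, one per category root, and the second is skipped because its term has no category in the hypothetical mapping.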
| 129 | + |
| 130 | +## Running the tests and producing manuscript results |
| 131 | + |
| 132 | +##### The following run scripts are located in GOcats/runscripts. See doc strings in each script for information on how to run each. NOTE: All prerequisites must be installed before running the following scripts. Make sure to check each script to ensure that the installation path to OWLTools is correct. |
| 133 | + |
| 134 | +**run.sh** - This script runs all figure and table-producing scripts in the GOcats/runscripts directory and places output tables, figures and data in the specified <output_dir>. |
| 135 | + |
| 136 | +**GenerateHindererCategories.sh** - Used to produce S1 and Tables 1 and 2. This script produces inclusion index values, Jaccard index values, and other information for the example subgraph categories described by Hinderer and Moseley. |
| 137 | + |
| 138 | +**GenerateHPAMappingComparison.sh** - Used to produce Figures 7a and 8a and Tables 6 and 8. Note: Requires OWLTools Map2Slim (see https://github.com/owlcollab/owltools/wiki/Map2Slim). The script assumes OWLTools is installed under $HOME/owltools; if it is not, edit OWLTOOLS_DIR to point to the appropriate directory. |
| 139 | + |
| 140 | +**GenerateGenericHPAMappingComparison.sh** - Used to produce Figures 7b and 8b. This script produces knowledgebase mappings from the HPA raw data and from the knowledgebases to a set of categories representing a more generic version of HPA's localization annotations. These were chosen by Hinderer and Moseley to resolve discrepancies in term granularity observed between knowledgebase annotations and experimental data annotations. |
| 141 | + |
| 142 | +**GenerateVisualizationData.sh** - Used to produce data for Figure 3a-c. Network tables produced by this script can be loaded into Cytoscape for network visualization. GOcats/runscripts/run.sh can automatically load and format the Cytoscape networks if an active Cytoscape session is listening on port 1234. To do this, navigate to your Cytoscape directory and run the following before executing run.sh: sh cytoscape.sh -R 1234 |
| 143 | + |
| 144 | +**SpeedTest.sh** - Used to report speed comparisons between GOcats and Map2Slim. |
| 145 | + |
| 146 | +##### The following test and supporting scripts are located in GOcats/gocats: |
| 147 | + |
| 148 | +**hpmappingtesting.py** - Produces the data for Table 4. |
| 149 | + |
| 150 | +**gofull.py** - Used to gather graph information across all of GO or specific sections of GO. Specifically used to gather information about the number of each relation in GO. |
| 151 | + |
| 152 | +**plotfigures.py** - Creates figures 7a-b and 8a-b from the data produced from other run scripts. Be sure to run all run scripts and note the |
| 153 | +location of the output directories before running this script. |
| 154 | + |
| 155 | +**cytoscapegraph.py** - Loads and automatically formats visualization data produced by GenerateVisualizationData.sh in an active Cytoscape session. See comments in GOcats/runscripts/run.sh for more information. |
| 156 | + |
| 157 | +**testfindancestors.py** - Creates ancestor lists of GO terms from annotations in a Gene Annotation File using several methods of ancestor finding. |
| 158 | + |
| 159 | +##### The following run scripts are located in GOcats/gocats/tests/Map2SlimMappingTest: |
| 160 | + |
| 161 | +**run.sh** - Produces the data used in Table 5. Once run, the results are stored in GOcats/gocats/tests/Map2SlimMappingTest/logs. NOTE! These scripts contain custom commands for a TORQUE cluster that can only be run in-house and are thus not reproducible outside of our lab. Contact corresponding author for questions. |
| 162 | + |
| 163 | +##### Other results: |
| 164 | + |
| 165 | +Information for Table 7 was entered manually to describe how the custom generic categories encompassed the previously-used categories. |
| 166 | + |
| 167 | +Information for Table 9 was compiled using the build_graph_interpreter command in gocats.py for each constraint (all GO, cellular_component, molecular_function, and biological_process) and accessing the graph object's 'relationship_count' variable to tally the use of each relationship type. The rest of the information was entered manually. |
| 168 | + |
| 169 | +## Authors |
| 170 | + |
| 171 | +* **Eugene Hinderer** - [ehinderer](https://github.com/ehinderer) |