ERT1 not in network database #1106
Comments
Let’s audit this—we can look at both the database directly and also the downloaded original files; the latter can be supplied by @ntran18 |
A quick look at our archived data indicates that
We should check the original YeastMine downloads for the presence of |
@ntran18 found an off-by-one column issue while investigating this one. It may or may not be related, but it should be fixed regardless. @kdahlquist will look at the YeastMine downloads to track down what might have happened with |
This file is all the genes in network table |
I searched the files. If the database scripts load genes only from all_gene.csv, that would account for why ERT1 (and potentially 200 more genes) is missing. Should we instead load the union of the gene files into the database?
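A minimal sketch of what loading the union of the gene files could look like, assuming each file is a CSV with a `gene_id` column. The column names, the second file, and the `union_genes` helper are hypothetical and not taken from the GRNsight scripts; only `all_gene.csv` and ERT1/YBR239C come from this thread.

```python
import csv
import io


def union_genes(*sources):
    """Merge gene rows from several CSV sources, keyed by gene_id.

    The first row seen for a gene_id is kept; later duplicates are ignored.
    """
    merged = {}
    for source in sources:
        for row in csv.DictReader(source):
            merged.setdefault(row["gene_id"], row)
    return merged


# In-memory stand-ins for all_gene.csv and a second (hypothetical) gene file.
all_gene_csv = io.StringIO("gene_id,display_gene_id\nYAL001C,TFC3\n")
other_gene_csv = io.StringIO(
    "gene_id,display_gene_id\nYAL001C,TFC3\nYBR239C,ERT1\n"
)

union = union_genes(all_gene_csv, other_gene_csv)
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. Keeping the first row seen is only one possible policy; conflicting rows may need a closer look.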
A few questions are emerging based on this discovery; some may be answerable with a review of the code while others may need investigation of the data. The current BioDB class can do some of this as part of their final assignment:
@ntran18 observes that the gene tables for GRNs and PPIs are currently distinct. The GRN gene table appears to have one more column than the PPI gene table. Ideally, we would have a unified gene table that contains all of the genes in YeastMine, but to get there we will need a better understanding of the current database code and content. It was also observed that the PPI database dropdown includes the 2024 import, which was not expected. This is another side issue to track down.
… cols regulator/rows targetETR1. Fix by adding \t to the header
I met with Dondi on Wednesday to discuss a solution to this problem. We decided to build a union gene table and use it to create 3 duplicated gene tables for expression, network, and protein-protein interactions. The network, expression, and protein-protein interaction data might have different gene tables because they use different queries against the YeastMine API. To understand the cause, I have to research the YeastMine API and how it works. However, I was sick last week, so I couldn't make any updates this week except for fixing the off-by-one column issue.
Intermediate plan: before going into the InterMine API logic, the gene tables can be unioned after the fact for now. Further analysis can be the next step.
….py. Adding all populating data to database in loader.py. Need to work on Readme file to update the command in the next commit, and fixing update scripts
The current script can load everything from a fresh start but cannot update an existing database. Currently, the pipeline to load the database from a fresh start is: |
Here is the union gene table. |
@ntran18 will post the union table and @kdahlquist will double-check it against the YeastMine Feature Type-->Genes Query to make sure that all 6511 genes are in our database. Our database has 7098 records in the union table. It's OK to have more genes, but we want to make sure that we have all 6511. |
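The check described above is a set difference: every YeastMine gene must appear in the union table, while extra records in the union table are acceptable. A minimal sketch, assuming both sides are available as lists of systematic names; the helper name and the sample IDs are illustrative only, and `YZZ999W` is a made-up placeholder.

```python
def missing_from_database(reference_ids, database_ids):
    """Return reference gene ids that are absent from the database, sorted."""
    return sorted(set(reference_ids) - set(database_ids))


# Tiny illustrative inputs, not real query results.
reference_ids = ["YBR239C", "YMR295C", "YAL001C"]  # stand-in for the YeastMine gene list
database_ids = ["YAL001C", "YMR295C", "YZZ999W"]   # stand-in for the union table

missing = missing_from_database(reference_ids, database_ids)
```

An empty result would mean that all reference genes are present, no matter how many additional records the union table holds.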
I have to write scripts for updating the network and protein-protein databases, create union missing-genes and updating-gene tables, and create a file for |
I haven't yet done a comparison with the YeastMine data, but I did a visual inspection of the union table and found a number of issues:
I'll try to find time to do the YeastMine comparison later. |
I did notice the "None" display names in both the network and protein-protein interactions databases. There is a chance that our production database also has some None values for the display name.
…pdate, can create a union missing_gene and update_gene files. The code for loader_update will be in later commit.
…s, updated genes, missing protein, updated proteins for both network and protein-protein interactions table
There are some problems when I create the union gene table. A lot of genes from the protein table have a |
…ng instructions on how to update database
The gene id is equivalent to the systematic name in SGD, and the display name id is equivalent to the standard name in SGD. So for the example of SBH1:
In the history of SGD, all genes were given a systematic name because it literally encodes the position on the chromosome:
Not all genes have a separate standard name (display gene id). Standard names are assigned by a committee to be (somewhat) meaningful names. They take the form of three letters and a number. There are some rare examples that do not follow this rule. For example, one standard name has a ' character and another has a , character. If a gene does not have a standard name, then the systematic name becomes the standard name and they will be the same. If you find a case where the gene id and display gene id are the same, they should both be systematic names. If one dataset is later than the other, the gene could have been assigned a standard name in the newer dataset.
In all cases, SGD should be the final authority on which is correct. Individual genes can be looked up at www.yeastgenome.org. Alternatively, we could pull down the entire list of genes from YeastMine and compare it to what we have: https://yeastmine.yeastgenome.org/yeastmine/begin.do
When I look up YMR295C, SGD says it should have a display name of GSR1. There should be no case where the display gene id is "none". I think the best way to populate that list is to refer to a gene list from YeastMine and correct it from there.
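The naming rules above can be sketched as a small resolver. This is a hedged illustration, not the project's code: `resolve_display_name` and the `reference_names` mapping (a stand-in for a gene list pulled from YeastMine) are hypothetical names.

```python
def resolve_display_name(gene_id, display_name, reference_names=None):
    """Pick a display name for a gene.

    1. If the reference list (treated as the authority) has a name, use it.
    2. Otherwise, if the stored display name is missing or "None",
       fall back to the systematic name (gene_id).
    3. Otherwise keep the stored display name.
    """
    reference_names = reference_names or {}
    if gene_id in reference_names:
        return reference_names[gene_id]
    is_missing = display_name is None or display_name.strip().lower() in ("", "none")
    return gene_id if is_missing else display_name


# Example from this thread: SGD lists GSR1 as the display name of YMR295C.
reference_names = {"YMR295C": "GSR1"}
fixed = resolve_display_name("YMR295C", "None", reference_names)
```

Without a reference entry, a missing display name falls back to the systematic name, which matches the rule that the two are identical when no standard name has been assigned.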
We will table the full union work until after this semester due to what’s involved; we’ll explore the SQL. We also checked whether the off-by-one fix needs to be deployed, and it looks like it doesn’t, but that isn’t consistent with the commit history that we looked at. @dondi will look into how this file is used in order to get a conclusive picture of the bug’s impact.
So the immediate goal for now is to ensure that @ntran18’s code refactor is indeed functional and we can close the semester with that as the final accomplishment |
#1106 Writing Scripts for Updating Database and Refactor Code
I double-checked the database. The data for PPI shouldn't have been updated; only the source table was updated. However, the data for GRN was updated on 3/19. It's possible that when I updated the GRN table, I accidentally updated the PPI source table. In other words, when users select the 2024 source for PPI, genes or proteins are still available, but no connections are shown for them. I copied all the data in the beta database and uploaded it to GRNsight Box (folder name is GRNsight 2024 Database Debugging). @kdahlquist can you please look over these tables and make sure they are correct? I think there is something wrong with the genes table for GRN.
#1077: Progress update on adding ERT1 to the GRN gene table. After adding ERT1 to the GRN gene table, we still can't load it in GRN. So I looked deeper into the code and saw this. The gene has to be in the network table to be loaded. However, the network table currently doesn't have ERT1 for the 2022 - 2024 data.
I wrote a doc about this issue here. |
Here is a PDF capture of the Google Doc above as of 10/2: |
Please commit the file to the repository and then link to it from the wiki, so that it is more easily accessible than trying to remember it is on this issue. |
#1106 Allowing query for any gene in gene table even if it's not in network table
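One way to picture the change named above: look the gene up in the gene table first and report network membership as a separate flag, so a gene without network rows is still found. The sketch below uses an in-memory SQLite database with a hypothetical two-table schema. The real GRNsight schema, table names, and column names may differ, and the single network row is invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (gene_id TEXT PRIMARY KEY, display_gene_id TEXT);
    CREATE TABLE network (regulator_gene_id TEXT, target_gene_id TEXT);
    INSERT INTO gene VALUES ('YBR239C', 'ERT1'), ('YAL001C', 'TFC3');
    -- Invented edge, only so that one gene has a network row.
    INSERT INTO network VALUES ('YAL001C', 'YAL001C');
    """
)


def find_gene(conn, name):
    """Find a gene by systematic or display name, even with no network rows.

    Returns (gene_id, display_gene_id, in_network) or None if the gene
    is not in the gene table at all.
    """
    return conn.execute(
        """
        SELECT g.gene_id,
               g.display_gene_id,
               EXISTS (SELECT 1 FROM network n
                       WHERE n.regulator_gene_id = g.gene_id
                          OR n.target_gene_id = g.gene_id) AS in_network
        FROM gene g
        WHERE g.gene_id = ? OR g.display_gene_id = ?
        """,
        (name, name),
    ).fetchone()


ert1 = find_gene(conn, "ERT1")
```

Here `ert1` carries an `in_network` flag of 0, so the caller can still show the gene while noting that it has no connections in the selected network.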
We are going to start using GRNsight in BIOL 367 this week. I was writing the protocol and had occasion to look up "ERT1" in the "Load from database" GRN. It was not found. Can we check to see that it is not in the network database? Also, will it be there in the 2024 update?
I then tried to look it up by its systematic name "YBR239C" and got an error saying that it did not conform to the naming convention it was expecting. But I was unable to reproduce this.