ERT1 not in network database #1106

Closed
kdahlquist opened this issue Mar 26, 2024 · 26 comments

@kdahlquist
Collaborator

We are going to start using GRNsight in BIOL 367 this week. I was writing the protocol and had occasion to look up "ERT1" in the "Load from database" GRN. It was not found. Can we check to see that it is not in the network database? Also, will it be there in the 2024 update?

I then tried to look it up by its systematic name "YBR239C" and got an error saying that it did not conform to the naming convention it was expecting. But I was unable to reproduce this.

@dondi
Owner

dondi commented Mar 26, 2024

Let’s audit this—we can look at both the database directly and also the downloaded original files; the latter can be supplied by @ntran18

@dondi
Owner

dondi commented Mar 26, 2024

A quick look at our archived data indicates that ERT1 was in the original database load from fall 2021 but appears to have been dropped in spring 2022:

postgres=> \dn
             List of schemas
             Name             |  Owner   
------------------------------+----------
 fall2021                     | postgres
 gene_expression              | postgres
 gene_regulatory_network      | postgres
 protein_protein_interactions | postgres
 public                       | postgres
 settings                     | postgres
 spring2022_network           | postgres
(7 rows)

postgres=> set search_path=fall2021;
SET
postgres=> \dt
                 List of relations
  Schema  |        Name         | Type  |  Owner   
----------+---------------------+-------+----------
 fall2021 | degradation_rate    | table | postgres
 fall2021 | expression          | table | postgres
 fall2021 | expression_metadata | table | postgres
 fall2021 | gene                | table | postgres
 fall2021 | production_rate     | table | postgres
 fall2021 | ref                 | table | postgres
(6 rows)

postgres=> \d gene
                        Table "fall2021.gene"
     Column      |       Type        | Collation | Nullable | Default 
-----------------+-------------------+-----------+----------+---------
 gene_id         | character varying |           | not null | 
 display_gene_id | character varying |           |          | 
 species         | character varying |           |          | 
 taxon_id        | character varying |           | not null | 
Indexes:
    "gene_pkey" PRIMARY KEY, btree (gene_id, taxon_id)
Referenced by:
    TABLE "degradation_rate" CONSTRAINT "degradation_rate_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)
    TABLE "expression" CONSTRAINT "expression_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)
    TABLE "production_rate" CONSTRAINT "production_rate_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)

postgres=> select gene_id, display_gene_id from gene where gene_id='YBR239C' or display_gene_id='ERT1';
 gene_id | display_gene_id 
---------+-----------------
 YBR239C | ERT1
(1 row)

postgres=> \dn
             List of schemas
             Name             |  Owner   
------------------------------+----------
 fall2021                     | postgres
 gene_expression              | postgres
 gene_regulatory_network      | postgres
 protein_protein_interactions | postgres
 public                       | postgres
 settings                     | postgres
 spring2022_network           | postgres
(7 rows)

postgres=> set search_path=spring2022_network;
SET
postgres=> \dt
                List of relations
       Schema       |  Name   | Type  |  Owner   
--------------------+---------+-------+----------
 spring2022_network | gene    | table | postgres
 spring2022_network | network | table | postgres
 spring2022_network | source  | table | postgres
(3 rows)

postgres=> select gene_id, display_gene_id from gene where gene_id='YBR239C' or display_gene_id='ERT1';
 gene_id | display_gene_id 
---------+-----------------
(0 rows)

@dondi
Owner

dondi commented Mar 26, 2024

We should check the original YeastMine downloads for the presence of ERT1 and proceed based on what we find. If ERT1 is in the downloads, then we have a lurking bug in our database scripts that prevents this gene from being included in the database; if we do not find ERT1 in the downloads, then this might be a YeastMine issue.

@dondi
Owner

dondi commented Mar 26, 2024

@ntran18 found an off-by-one column issue while investigating this one. It may or may not be related, but it should be fixed regardless. @kdahlquist will look at the YeastMine downloads to track down what might have happened with ERT1.

@ntran18
Collaborator

ntran18 commented Mar 27, 2024

This file contains all the genes in the network table:
all_gene.csv
This file contains all the proteins in the PPI table:
protein.csv
This file contains all the genes in the PPI table:
gene.csv

@dondi
Owner

dondi commented Apr 1, 2024

I searched the files and ERT1 is present in gene.csv but not all_gene.csv. Might this be a lead? Based on line count, all_gene.csv has 6514 records whereas gene.csv has 6715—that’s 201 more

If the database scripts are only loading genes from all_gene.csv, then this will account for why ERT1 (and potentially 200 more genes) is missing. Should we instead be loading the union of the gene files into the database?
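
A set comparison would confirm the lead more directly than line counts, since a count difference only equals the number of missing genes if one file is a strict subset of the other. The following is a minimal sketch, not the project's actual script: the column name `gene_id`, the helper `read_gene_ids`, and the tiny inline CSV contents are assumptions made for illustration (the real layout of all_gene.csv and gene.csv may differ).

```python
import csv
import io


def read_gene_ids(csv_text, id_column="gene_id"):
    """Return the set of gene IDs found in one CSV export (hypothetical layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column].strip() for row in reader if row[id_column].strip()}


# Stand-ins for all_gene.csv (network) and gene.csv (PPI). The rows are
# illustrative only; the real files hold thousands of records.
all_gene_csv = "gene_id,display_gene_id\nYER087C-B,SBH1\nYMR295C,GSR1\n"
gene_csv = "gene_id,display_gene_id\nYER087C-B,SBH1\nYBR239C,ERT1\nYMR295C,GSR1\n"

network_ids = read_gene_ids(all_gene_csv)
ppi_ids = read_gene_ids(gene_csv)

# Genes that a loader reading only the network file would never see.
only_in_ppi = sorted(ppi_ids - network_ids)
# Genes that a loader reading only the PPI file would never see.
only_in_network = sorted(network_ids - ppi_ids)
# The union is what a combined gene table would contain.
union_ids = sorted(network_ids | ppi_ids)

print(only_in_ppi, only_in_network, len(union_ids))
```

With the real files, swapping the inline strings for `open(...).read()` would list exactly which genes are on each side of the gap.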

@dondi
Owner

dondi commented Apr 2, 2024

A few questions are emerging based on this discovery; some may be answerable with a review of the code while others may need investigation of the data. The current BioDB class can do some of this as part of their final assignment:

  • How exactly are all_gene.csv and gene.csv derived and used by our database scripts? …this can be looked at via code inspection. Also, more descriptive filenames can be used
  • How feasible is it to revise our gene-loading code so that it either loads the union of these files or runs a fresh query that unconditionally downloads all genes? The latter is what we’re actually after when loading genes into our database
  • How does our app behave when activating node coloring when there is a gene in the network that doesn’t have expression data? This will require some database querying in order to identify such genes, then we can test them in the app
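
For the last question, the database query could look roughly like the sketch below. It uses an in-memory SQLite database only so that it runs anywhere; the table and column names (`network.regulator_gene_id`, `network.target_gene_id`, `expression.gene_id`) and all rows are assumptions for illustration, not the actual GRNsight schema, and the same SQL would need adapting for PostgreSQL schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables; the real schema may differ.
    CREATE TABLE network (regulator_gene_id TEXT, target_gene_id TEXT);
    CREATE TABLE expression (gene_id TEXT, time_point REAL, value REAL);
    INSERT INTO network VALUES ('YBR239C', 'YER087C-B'), ('YMR295C', 'YBR239C');
    INSERT INTO expression VALUES ('YER087C-B', 15, 0.42), ('YMR295C', 15, -0.10);
""")

# Collect every gene that appears as a regulator or a target, then keep
# only those with no row at all in the expression table.
query = """
    SELECT gene_id FROM (
        SELECT regulator_gene_id AS gene_id FROM network
        UNION
        SELECT target_gene_id FROM network
    ) AS network_genes
    WHERE gene_id NOT IN (SELECT gene_id FROM expression)
    ORDER BY gene_id
"""
genes_without_expression = [row[0] for row in conn.execute(query)]
conn.close()

print(genes_without_expression)
```

Genes returned by such a query would be the candidates to load in the app with node coloring turned on.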

@dondi
Owner

dondi commented Apr 2, 2024

@ntran18 observes that the gene tables for GRNs and PPIs are distinct as of now. The GRN gene table appears to have an additional column over the PPI gene table. Ideally, we have a unified gene table that contains all of the genes in YeastMine, but in order to get there, we will need a better understanding of the current database code and content

It was also observed that the PPI database dropdown includes the 2024 import, which was not expected. This is another side issue to track down

ntran18 added a commit that referenced this issue Apr 9, 2024
… cols regulator/rows targetETR1. Fix by adding \t to the header
@ntran18
Collaborator

ntran18 commented Apr 9, 2024

I met with Dondi on Wednesday to discuss a solution to this problem. We decided to create a union gene table and use it to populate three identical gene tables, one each for expression, network, and protein-protein interactions.

The network, expression, and protein-protein interaction data might have different gene tables because they use different queries against the YeastMine API. To understand the cause, I need to research how the YeastMine API works. However, I was sick last week, so the only update I could make this week was fixing the off-by-one column issue.

dondi added a commit that referenced this issue Apr 9, 2024
#1106 fixing off-by-one column issue.
@dondi
Owner

dondi commented Apr 9, 2024

With PR #1111 merged, we will need a database reload to test the off-by-one fix. @ntran18 can choose whether to try this first before moving on to the unioned gene tables or—because this will incur a reload of the 2024 data sets—whether to try a full load after the union itself is done

@dondi
Owner

dondi commented Apr 9, 2024

Intermediate plan: before going into the Intermine API logic, the gene tables can be unioned after the fact for now. Further analysis can be the next step

ntran18 added a commit that referenced this issue Apr 16, 2024
….py. Adding all populating data to database in loader.py. Need to work on Readme file to update the command in the next commit, and fixing update scripts
ntran18 added a commit that referenced this issue Apr 16, 2024
@ntran18
Collaborator

ntran18 commented Apr 16, 2024

The current script can load everything from a fresh start but cannot yet update an existing database. combine_all_genes.csv contains the logic to combine all genes from expression, network, and protein-protein interactions.

Currently, the pipeline to load the database from a fresh start is:
1/ Create schemas
2/ Load schema structures to the database
3/ Run preprocessing.py in expression-database/scripts
4/ Run generate_network.py for network-database/scripts
5/ Run generate_network.py for protein-protein-database/scripts
6/ After getting all genes from expression, network, and protein-protein-interactions, run combine_all_genes.csv to create union-genes.csv.
7/ Run loader.py to load everything into the database except the settings and public schemas

@ntran18
Collaborator

ntran18 commented Apr 16, 2024

Here is the union gene table.
union_genes.csv

@kdahlquist
Collaborator Author

@ntran18 will post the union table and @kdahlquist will double-check it against the YeastMine Feature Type-->Genes Query to make sure that all 6511 genes are in our database. Our database has 7098 records in the union table. It's OK to have more genes, but we want to make sure that we have all 6511.

@ntran18
Collaborator

ntran18 commented Apr 16, 2024

I still have to write scripts for updating the network and protein-protein databases, create the union missing-genes and updating-genes tables, and create loader-update.py, which will update both the protein and network databases. I also have to update readME.md to document how to run it.

@kdahlquist
Collaborator Author

I haven't yet done a comparison with the YeastMine data, but I did a visual inspection of the union table and found a number of issues:

  • There are actually 7097 records because field names are the first row.
  • There are 329 records that have "none" as the display name. They should all have display names. If they don't have their own standard name (display name), then the systematic name (gene ID) should be used as the display name.
  • There are 22 records that have an issue with the Gene ID (systematic name). I put notes in a notes field. Many of these have a "/" ID that needs to be either removed or separated. Others have other comments. I've attached the file.
    union_genes_2024-04-16_with-notes.xlsx

I'll try to find time to do the YeastMine comparison later.
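
The display-name rule and the "/" check described above could be applied mechanically before any manual review. This is a minimal sketch under stated assumptions: the function name `normalize_gene`, the note text, and the sample rows are hypothetical (`YZZ999W` and `ID1/ID2` are placeholders, not real gene IDs). Falling back to the systematic name only fills the gap; it does not recover a standard name that one source simply failed to report, which would still need a check against YeastMine.

```python
def normalize_gene(gene_id, display_gene_id):
    """Return (gene_id, display_gene_id, note) for one union-table row."""
    gene_id = gene_id.strip()
    display = (display_gene_id or "").strip()
    # Treat empty values and the literal text "none"/"None" as missing,
    # and fall back to the systematic name (gene ID).
    if display == "" or display.lower() == "none":
        display = gene_id
    # Flag IDs that carry a "/" for manual review instead of guessing
    # whether to remove or split them.
    note = "needs review: '/' in gene ID" if "/" in gene_id else ""
    return gene_id, display, note


# Illustrative rows only: one clean record, one with a missing display
# name, and one with a "/" in the ID.
rows = [("YBR239C", "ERT1"), ("YZZ999W", "None"), ("ID1/ID2", "none")]
normalized = [normalize_gene(gene_id, display) for gene_id, display in rows]

for record in normalized:
    print(record)
```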

@ntran18
Collaborator

ntran18 commented Apr 17, 2024

I did notice the "None" display names in both the network and protein-protein interactions databases. There is a chance that our production database also has some None values for display name.

ntran18 added a commit that referenced this issue Apr 23, 2024
…pdate, can create a union missing_gene and update_gene files. The code for loader_update will be in later commit.
ntran18 added a commit that referenced this issue Apr 23, 2024
…s, updated genes, misisng protein, updated proteins for both network and protein-protein interactions table
@ntran18
Collaborator

ntran18 commented Apr 23, 2024

There are some problems when I create the union gene table.

Many genes in the protein table have a None display name, while the same gene in the network table has a display name, e.g., YMR295C. In another case, the protein table has a gene ID and display name that differ from each other, while the network table has a display name identical to the gene ID, e.g., SBH1. I don't know which one is correct.

ntran18 added a commit that referenced this issue Apr 23, 2024
ntran18 added a commit that referenced this issue Apr 23, 2024
ntran18 added a commit that referenced this issue Apr 23, 2024
@kdahlquist
Collaborator Author

The gene id is equivalent to the systematic name in SGD, and the display name id is equivalent to the standard name in SGD. So for the example of SBH1:

  • YER087C-B is the gene id (systematic name)
  • SBH1 is the display gene id (standard name)

In the history of SGD, all genes were given a systematic name because it literally encodes the position on the chromosome:

  • "Y" stands for "yeast"
  • "A-P" refer to each chromosome where A is chromosome 1, B is chromosome 2, etc.
  • "L" or "R" refers to whether the gene location is to the left (short arm) or right (long arm) of the centromere.
  • "###" refers to the order the gene appears counting from the centromere outward.
  • "W" or "C" refers to which strand the gene is encoded on (stands for "Watson" or "Crick")
  • "-A", "-B", "-C" is optional. This occurs when they found a new gene in between two other genes that were previously annotated. They didn't want to renumber the genes, so they found a way to create a systematic name that would indicate it is in between two other genes.
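
The naming scheme in the list above can be expressed as a regular expression, which may also help when checking what a "naming convention" validator would accept. This is a sketch, not GRNsight's actual validation code: the pattern and the function `parse_systematic_name` are assumptions, and the pattern takes chromosome letters A through P for the 16 yeast chromosomes. Names that follow any other pattern simply return `None` here.

```python
import re

# Y + chromosome letter + arm + three-digit order + strand + optional suffix.
SYSTEMATIC_NAME = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])(?:-([A-Z]))?$")


def parse_systematic_name(name):
    """Split a systematic name into its parts, or return None if it does not fit."""
    match = SYSTEMATIC_NAME.match(name.strip().upper())
    if match is None:
        return None
    chrom, arm, order, strand, suffix = match.groups()
    return {
        "chromosome": ord(chrom) - ord("A") + 1,  # A -> 1, B -> 2, ...
        "arm": "left" if arm == "L" else "right",
        "order": int(order),  # position counted outward from the centromere
        "strand": "Watson" if strand == "W" else "Crick",
        "suffix": suffix,  # e.g. "B" for a gene inserted between two others
    }


ert1 = parse_systematic_name("YBR239C")
sbh1 = parse_systematic_name("YER087C-B")
not_systematic = parse_systematic_name("ERT1")
print(ert1, sbh1, not_systematic)
```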

Not all genes have a separate standard name (display gene id). Standard names are assigned by a committee to be (somewhat) meaningful names. They take the form of three letters and a number. There are some rare examples that do not follow this rule. For example, one standard name contains a ' character and another contains a , character.

If a gene does not have a standard name, then the systematic name becomes the standard name and they will be the same.

If you find an example where in one case both the gene id and display gene id are the same, they should both be systematic names. If one dataset was later than the other, the gene could have been assigned a standard name in the newer dataset.

In all cases, SGD should be the final authority on which is correct. Individual genes can be looked up at www.yeastgenome.org. Alternately, we could pull down the entire list of genes from YeastMine and compare to what we have: https://yeastmine.yeastgenome.org/yeastmine/begin.do

When I look up YMR295C, SGD says it should have a display name of GSR1.

There should be no case where the display gene id is "none". I think the best thing to do to populate that list is to refer to a gene list from YeastMine to correct that.
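
Until a YeastMine gene list is pulled down, conflicting entries from the two tables could be reconciled provisionally along the lines described above. This is a hedged sketch: the function `pick_display_name` is hypothetical, and it only chooses a candidate from the values already on hand. SGD remains the final authority, and if two sources report two different standard names, this sketch just returns the first one it sees rather than deciding which is right.

```python
def pick_display_name(gene_id, candidates):
    """Choose a display name for gene_id from the values seen in each source."""
    for name in candidates:
        cleaned = (name or "").strip()
        # A usable standard name is present, is not the text "none",
        # and is not just the systematic name repeated.
        if cleaned and cleaned.lower() != "none" and cleaned != gene_id:
            return cleaned
    # No source supplied a standard name, so the systematic name stands in.
    return gene_id


# The two cases raised in this thread, plus a placeholder gene with no
# standard name in any source ("YZZ999W" is not a real ID).
gsr1 = pick_display_name("YMR295C", ["None", "GSR1"])
sbh1_name = pick_display_name("YER087C-B", ["SBH1", "YER087C-B"])
unnamed = pick_display_name("YZZ999W", [None, "none"])
print(gsr1, sbh1_name, unnamed)
```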

@dondi
Owner

dondi commented Apr 23, 2024

We will table the full union work for after this semester due to what’s involved; we’ll explore the SQL UNION command in order to make the database do the heavy lifting plus also remove duplicates automatically. The premise to doing that, though, is to make sure that the GRN and PPI gene tables have been normalized into having the corresponding values (e.g., correct IDs, etc.)
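
The sketch below illustrates both halves of that point: `UNION` drops rows that are identical across the two tables, but a gene whose display name differs between them survives as two rows, which is why normalization has to come first. It runs against an in-memory SQLite database purely for portability; the table names `grn_gene` and `ppi_gene` and the rows are assumptions for illustration, not the actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the GRN and PPI gene tables.
    CREATE TABLE grn_gene (gene_id TEXT, display_gene_id TEXT);
    CREATE TABLE ppi_gene (gene_id TEXT, display_gene_id TEXT);
    INSERT INTO grn_gene VALUES ('YER087C-B', 'SBH1'), ('YMR295C', 'GSR1');
    INSERT INTO ppi_gene VALUES ('YER087C-B', 'SBH1'), ('YMR295C', 'None'),
                                ('YBR239C', 'ERT1');
""")

# UNION (unlike UNION ALL) removes exact duplicate rows.
union_rows = conn.execute("""
    SELECT gene_id, display_gene_id FROM grn_gene
    UNION
    SELECT gene_id, display_gene_id FROM ppi_gene
    ORDER BY gene_id, display_gene_id
""").fetchall()
conn.close()

for row in union_rows:
    print(row)
```

In this toy data the identical SBH1 rows collapse into one, while YMR295C appears twice because the two tables disagree on its display name.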

We also checked to see if the off-by-one fix needs to be deployed and it looks like it doesn’t, but that isn’t consistent with the commit history that we looked at— @dondi will look into how this file is used in order to get a conclusive picture of the bug’s impact

@dondi
Owner

dondi commented Apr 23, 2024

So the immediate goal for now is to ensure that @ntran18’s code refactor is indeed functional and we can close the semester with that as the final accomplishment

dondi added a commit that referenced this issue Aug 28, 2024
#1106 Writing Scripts for Updating Database and Refactor Code
@ntran18
Collaborator

ntran18 commented Aug 30, 2024

I double-checked the database. The data for PPI shouldn't have been updated; only its source table was updated. However, the data for GRN was updated on 3/19. It's possible that when I updated the GRN tables, I accidentally updated the PPI source table. In other words, when users select the 2024 source for PPI, genes or proteins are still available, but no connections are shown between them. I copied all the data in the beta database and uploaded it to GRNsight Box (the folder name is GRNsight 2024 Database Debugging). @kdahlquist can you please look over these tables and make sure they are correct? I think there is something wrong with the genes table for GRN.

Link to the folder

@dondi dondi mentioned this issue Sep 4, 2024
41 tasks
@ntran18
Collaborator

ntran18 commented Sep 16, 2024

#1077: Progress update on adding ERT1 to the GRN gene table. After adding ERT1 to the GRN gene table, we still can't load it in GRN. So I looked deeper into the code and saw this.
image

The gene has to be inside the network table to be loaded. However, currently, the network table doesn't have ERT1 for 2022 - 2024 data.

@ntran18
Collaborator

ntran18 commented Sep 29, 2024

I wrote a doc about this issue here.

@dondi
Owner

dondi commented Oct 2, 2024

Here is a PDF capture of the Google Doc above as of 10/2:

GRNsight database.pdf

@dondi dondi closed this as completed Oct 2, 2024
@kdahlquist
Collaborator Author

Please commit the file to the repository and then link to it from the wiki, so that it is more easily accessible than trying to remember it is on this issue.

dondi added a commit that referenced this issue Oct 16, 2024
#1106 Allowing query for any gene in gene table even if it's not in network table