ERT1 not in network database #1106

kdahlquist · 2024-03-26T00:10:03Z

We are going to start using GRNsight in BIOL 367 this week. I was writing the protocol and had occasion to look up "ERT1" in the "Load from database" GRN. It was not found. Can we check to see that it is not in the network database? Also, will it be there in the 2024 update?

I then tried to look it up by its systematic name "YBR239C" and got an error saying that it did not conform to the naming convention it was expecting. But I was unable to reproduce this.

dondi · 2024-03-26T17:23:13Z

Let’s audit this—we can look at both the database directly and also the downloaded original files; the latter can be supplied by @ntran18

dondi · 2024-03-26T17:33:23Z

A quick look at our archived data indicates that ERT1 was in the original database load from fall 2021 but appears to have been dropped in spring 2022:

postgres=> \dn
             List of schemas
             Name             |  Owner   
------------------------------+----------
 fall2021                     | postgres
 gene_expression              | postgres
 gene_regulatory_network      | postgres
 protein_protein_interactions | postgres
 public                       | postgres
 settings                     | postgres
 spring2022_network           | postgres
(7 rows)

postgres=> set search_path=fall2021;
SET
postgres=> \dt
                 List of relations
  Schema  |        Name         | Type  |  Owner   
----------+---------------------+-------+----------
 fall2021 | degradation_rate    | table | postgres
 fall2021 | expression          | table | postgres
 fall2021 | expression_metadata | table | postgres
 fall2021 | gene                | table | postgres
 fall2021 | production_rate     | table | postgres
 fall2021 | ref                 | table | postgres
(6 rows)

postgres=> \d gene
                        Table "fall2021.gene"
     Column      |       Type        | Collation | Nullable | Default 
-----------------+-------------------+-----------+----------+---------
 gene_id         | character varying |           | not null | 
 display_gene_id | character varying |           |          | 
 species         | character varying |           |          | 
 taxon_id        | character varying |           | not null | 
Indexes:
    "gene_pkey" PRIMARY KEY, btree (gene_id, taxon_id)
Referenced by:
    TABLE "degradation_rate" CONSTRAINT "degradation_rate_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)
    TABLE "expression" CONSTRAINT "expression_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)
    TABLE "production_rate" CONSTRAINT "production_rate_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)

postgres=> select gene_id, display_gene_id from gene where gene_id='YBR239C' or display_gene_id='ERT1';
 gene_id | display_gene_id 
---------+-----------------
 YBR239C | ERT1
(1 row)

postgres=> \dn
             List of schemas
             Name             |  Owner   
------------------------------+----------
 fall2021                     | postgres
 gene_expression              | postgres
 gene_regulatory_network      | postgres
 protein_protein_interactions | postgres
 public                       | postgres
 settings                     | postgres
 spring2022_network           | postgres
(7 rows)

postgres=> set search_path=spring2022_network;
SET
postgres=> \dt
                List of relations
       Schema       |  Name   | Type  |  Owner   
--------------------+---------+-------+----------
 spring2022_network | gene    | table | postgres
 spring2022_network | network | table | postgres
 spring2022_network | source  | table | postgres
(3 rows)

postgres=> select gene_id, display_gene_id from gene where gene_id='YBR239C' or display_gene_id='ERT1';
 gene_id | display_gene_id 
---------+-----------------
(0 rows)

dondi · 2024-03-26T17:34:41Z

We should check the original YeastMine downloads for the presence of ERT1 and proceed based on what we find. If ERT1 is in the downloads, then we have a lurking bug in our database scripts that prevents this gene from being included in the database; if we do not find ERT1 in the downloads, then this might be a YeastMine issue

dondi · 2024-03-26T17:48:55Z

@ntran18 found an off-by-one column issue while investigating this one—it may or may not be related, but should also be fixed nonetheless. @kdahlquist will look at the YeastMine downloads to track down what might have happened with ERT1

ntran18 · 2024-03-27T18:16:36Z

This file contains all the genes in the network table:
all_gene.csv
This file contains all the proteins in the PPI table:
protein.csv
This file contains all the genes in the PPI table:
gene.csv

dondi · 2024-04-01T07:10:08Z

I searched the files and ERT1 is present in gene.csv but not all_gene.csv. Might this be a lead? Based on line count, all_gene.csv has 6514 records whereas gene.csv has 6715—that’s 201 more

If the database scripts are only loading genes from all_gene.csv, then this will account for why ERT1 (and potentially 200 more genes) is missing. Should we instead be loading the union of the gene files into the database?
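As a rough illustration of the comparison above, here is a minimal Python sketch that finds genes present in one export but missing from the other and builds their union. The `gene_id` and `display_gene_id` header names, the `read_gene_ids` helper, and the sample rows are assumptions for illustration only; the real all_gene.csv and gene.csv may use different column names.

```python
import csv
import io


def read_gene_ids(csv_text, id_column="gene_id"):
    """Return the set of non-empty gene IDs found in one CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column].strip() for row in reader if row[id_column].strip()}


# Tiny illustrative stand-ins for all_gene.csv (network) and gene.csv (PPI).
# The header names here are assumed, not taken from the real files.
network_csv = "gene_id,display_gene_id\nYAL001C,TFC3\nYBR240C,THI2\n"
ppi_csv = "gene_id,display_gene_id\nYAL001C,TFC3\nYBR239C,ERT1\nYBR240C,THI2\n"

network_genes = read_gene_ids(network_csv)
ppi_genes = read_gene_ids(ppi_csv)

# Genes that a network-only load would drop.
missing_from_network = sorted(ppi_genes - network_genes)

# The union is what a combined gene table would contain.
union_genes = sorted(network_genes | ppi_genes)

print(missing_from_network)
print(len(union_genes))
```

Run against the real files, the length of `missing_from_network` would show whether the 201-record gap is entirely made up of genes absent from all_gene.csv.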

dondi · 2024-04-02T17:41:24Z

A few questions are emerging based on this discovery; some may be answerable with a review of the code while others may need investigation of the data. The current BioDB class can do some of this as part of their final assignment:

How exactly are all_gene.csv and gene.csv derived and used by our database scripts? …this can be looked at via code inspection. Also, more descriptive filenames can be used
What’s the feasibility of revising our gene-loading code so that it either loads the union of these files or runs a fresh query that unconditionally downloads all genes? The latter is what we’re actually after when loading genes into our database
How does our app behave when activating node coloring when there is a gene in the network that doesn’t have expression data? This will require some database querying in order to identify such genes, then we can test them in the app

dondi · 2024-04-02T17:50:00Z

@ntran18 observes that the gene tables for GRNs and PPIs are distinct as of now. The GRN gene table appears to have an additional column over the PPI gene table. Ideally, we have a unified gene table that contains all of the genes in YeastMine, but in order to get there, we will need a better understanding of the current database code and content

It was also observed that the PPI database dropdown includes the 2024 import, which was not expected. This is another side issue to track down

ntran18 · 2024-04-09T08:04:01Z

I met with Dondi on Wednesday to discuss a solution to this problem. We decided to create a union gene table and use it to populate three identical gene tables, one each for expression, network, and protein-protein interactions.

Network, expression, and protein-protein interactions might have different gene tables because they use different queries against the YeastMine API. To understand the cause better, I have to research the YeastMine API and how it works. However, I was sick last week, so I couldn't make any updates this week except for fixing the off-by-one column issue.

#1106 fixing off-by-one column issue.

dondi · 2024-04-09T17:28:14Z

With PR #1111 merged, we will need a database reload to test the off-by-one fix. @ntran18 can choose whether to try this first before moving on to the unioned gene tables or—because this will incur a reload of the 2024 data sets—whether to try a full load after the union itself is done

dondi · 2024-04-09T17:34:50Z

Intermediate plan: before going into the Intermine API logic, the gene tables can be unioned after the fact for now. Further analysis can be the next step

….py. Adding all populating data to database in loader.py. Need to work on Readme file to update the command in the next commit, and fixing update scripts

ntran18 · 2024-04-16T17:09:23Z

The current script is able to load everything from a fresh start but not update the database. combine_all_genes.csv contains the logic to combine all genes from expression, network, and protein-protein interactions.

Currently, the pipeline to load the database from a fresh start is:
1/ Create schemas
2/ Load schema structures into the database
3/ Run preprocessing.py in expression-database/scripts
4/ Run generate_network.py in network-database/scripts
5/ Run generate_network.py in protein-protein-database/scripts
6/ After getting all genes from expression, network, and protein-protein interactions, run combine_all_genes.csv to create union-genes.csv.
7/ Run loader.py to load everything into the database except settings and public

ntran18 · 2024-04-16T17:35:33Z

Here is the union gene table.
union_genes.csv

kdahlquist · 2024-04-16T17:36:36Z

@ntran18 will post the union table and @kdahlquist will double-check it against the YeastMine Feature Type-->Genes Query to make sure that all 6511 genes are in our database. Our database has 7098 records in the union table. It's OK to have more genes, but we want to make sure that we have all 6511.

ntran18 · 2024-04-16T17:38:22Z

I still have to write scripts for updating the network and protein-protein databases, create union missing-genes and updating-gene tables, and create loader-update.py, which will update both the protein and network databases. I also have to update readME.md to explain how to run it.

kdahlquist · 2024-04-16T23:42:40Z

I haven't yet done a comparison with the YeastMine data, but I did a visual inspection of the union table and found a number of issues:

There are actually 7097 records because field names are the first row.
There are 329 records that have "none" as the display name. They should all have display names. If they don't have their own standard name (display name), then the systematic name (gene ID) should be used as the display name.
There are 22 records that have an issue with the Gene ID (systematic name). I put notes in a notes field. Many of these have a "/" ID that needs to be either removed or separated. Others have other comments. I've attached the file.
union_genes_2024-04-16_with-notes.xlsx

I'll try to find time to do the YeastMine comparison later.
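The display-name rule described above (fall back to the systematic name when no standard name exists) could be sketched roughly as follows. The function name `normalize_gene_record` and the example IDs are hypothetical; this only flags IDs containing "/" for manual review and does not try to decide how to split them.

```python
def normalize_gene_record(gene_id, display_gene_id):
    """Return (gene_id, display_gene_id, needs_review) for one union-table row."""
    gene_id = gene_id.strip()
    display = (display_gene_id or "").strip()

    # Treat a missing value or the literal text "none"/"None" as
    # "no standard name" and fall back to the systematic name.
    if display == "" or display.lower() == "none":
        display = gene_id

    # IDs containing "/" need a human decision (remove or separate),
    # so they are only flagged here, never rewritten.
    needs_review = "/" in gene_id

    return gene_id, display, needs_review


print(normalize_gene_record("YAL069W", "none"))
print(normalize_gene_record("YBR239C", "ERT1"))
print(normalize_gene_record("YAL001C/YAL002W", None))  # made-up "/" ID
```

Note that this fallback only guarantees that no display name is left as "none"; it cannot recover a standard name that SGD has assigned, so a comparison against a YeastMine gene list would still be needed.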

ntran18 · 2024-04-17T01:19:29Z

I did notice the "None" display names in both the network and protein-protein interactions databases. There is a chance that our production database also has some None values for the display name.

ntran18 · 2024-04-23T09:22:10Z

I ran into some problems while creating the union gene table.

Many genes from the protein table have a display name of None, while the same gene in the network table has a display name (e.g., YMR295C). In another case, the protein table has a gene ID and display name that differ from each other, while the network table has a display name identical to the gene ID (e.g., SBH1). I don't know which one is correct.

kdahlquist · 2024-04-23T15:56:06Z

The gene id is equivalent to the systematic name in SGD, and the display name id is equivalent to the standard name in SGD. So for the example of SBH1:

YER087C-B is the gene id (systematic name)
SBH1 is the display gene id (standard name)

In the history of SGD, all genes were given a systematic name because it literally encodes the position on the chromosome:

"Y" stands for "yeast"
"A-P" refer to each of the 16 chromosomes, where A is chromosome 1, B is chromosome 2, etc.
"L" or "R" refers to whether the gene location is to the left (short arm) or right (long arm) of the centromere.
"###" refers to the order the gene appears counting from the centromere outward.
"W" or "C" refers to which strand the gene is encoded on (stands for "Watson" or "Crick")
"-A", "-B", "-C" is optional. This occurs when they found a new gene in between two other genes that were previously annotated. They didn't want to renumber the genes, so they found a way to create a systematic name that would indicate it is in between two other genes.
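The naming convention above can be expressed as a regular expression. The following is a minimal Python sketch, assuming chromosome letters A through P for the 16 nuclear chromosomes and a three-digit order number; the pattern and the `parse_systematic_name` helper are illustrative and are not the validation GRNsight itself performs. Names outside this convention (mitochondrial features, for example) would simply not match.

```python
import re

# Y + chromosome letter + arm + 3-digit order + strand + optional "-A"/"-B"/... suffix
SYSTEMATIC_NAME = re.compile(r"Y([A-P])([LR])(\d{3})([WC])(?:-([A-Z]))?")


def parse_systematic_name(name):
    """Split a systematic name into its parts, or return None if it does not fit."""
    match = SYSTEMATIC_NAME.fullmatch(name.strip().upper())
    if match is None:
        return None
    chromosome_letter, arm, order, strand, suffix = match.groups()
    return {
        "chromosome": ord(chromosome_letter) - ord("A") + 1,  # A -> 1, B -> 2, ...
        "arm": "left" if arm == "L" else "right",
        "order": int(order),
        "strand": "Watson" if strand == "W" else "Crick",
        "suffix": suffix,  # None when there is no "-A"/"-B"/... part
    }


print(parse_systematic_name("YBR239C"))
print(parse_systematic_name("YER087C-B"))
print(parse_systematic_name("ERT1"))
```

A check like this could also help separate systematic names from standard names when auditing rows where the gene id and display gene id look swapped.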

Not all genes have a separate standard name (display gene id). Standard names are assigned by a committee to be (somewhat) meaningful names. They take the form of three letters and a number. There are some rare examples that do not follow this rule. For example, one standard name contains an apostrophe (') and another contains a comma (,).

If a gene does not have a standard name, then the systematic name becomes the standard name and they will be the same.

If you find an example where in one case both the gene id and display gene id are the same, they should both be systematic names. If one dataset was later than the other, the gene could have been assigned a standard name in the newer dataset.

In all cases, SGD should be the final authority on which is correct. Individual genes can be looked up at www.yeastgenome.org. Alternately, we could pull down the entire list of genes from YeastMine and compare to what we have: https://yeastmine.yeastgenome.org/yeastmine/begin.do

When I look up YMR295C, SGD says it should have a display name of GSR1.

There should be no case where the display gene id is "none". I think the best thing to do to populate that list is to refer to a gene list from YeastMine to correct that.

dondi · 2024-04-23T18:05:06Z

We will table the full union work until after this semester due to what’s involved; we’ll explore the SQL UNION command in order to make the database do the heavy lifting and also remove duplicates automatically. The prerequisite for doing that, though, is to make sure that the GRN and PPI gene tables have been normalized so that they hold corresponding values (e.g., correct IDs, etc.)
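To illustrate why normalization has to come first, here is a small Python sketch that runs a SQL UNION over two gene tables. It uses the standard-library `sqlite3` module as a stand-in for our PostgreSQL database, and the table names `grn_gene` and `ppi_gene` plus the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE grn_gene (gene_id TEXT, display_gene_id TEXT);
    CREATE TABLE ppi_gene (gene_id TEXT, display_gene_id TEXT);

    -- Illustrative rows: one identical in both tables, one with a
    -- mismatched display name, and one present only in the PPI table.
    INSERT INTO grn_gene VALUES ('YAL001C', 'TFC3'), ('YMR295C', 'GSR1');
    INSERT INTO ppi_gene VALUES ('YAL001C', 'TFC3'), ('YMR295C', 'None'),
                                ('YBR239C', 'ERT1');
    """
)

# UNION (unlike UNION ALL) drops rows that are identical in every column.
rows = conn.execute(
    """
    SELECT gene_id, display_gene_id FROM grn_gene
    UNION
    SELECT gene_id, display_gene_id FROM ppi_gene
    ORDER BY gene_id, display_gene_id
    """
).fetchall()

for row in rows:
    print(row)

conn.close()
```

The identical YAL001C rows collapse into one, but YMR295C appears twice because its display name differs between the two tables. UNION only removes exact duplicates, so mismatched values have to be reconciled before the union is taken.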

We also checked to see if the off-by-one fix needs to be deployed and it looks like it doesn’t, but that isn’t consistent with the commit history that we looked at— @dondi will look into how this file is used in order to get a conclusive picture of the bug’s impact

dondi · 2024-04-23T18:05:53Z

So the immediate goal for now is to ensure that @ntran18’s code refactor is indeed functional and we can close the semester with that as the final accomplishment

#1106 Writing Scripts for Updating Database and Refactor Code

ntran18 · 2024-08-30T06:04:29Z

I double-checked the database. The PPI data was not supposed to be updated, but its source table was. The GRN data, however, was updated on 3/19. It's possible that when I updated the GRN tables, I accidentally updated the PPI source table as well. In other words, when users select the 2024 source for PPI, genes or proteins are still available, but no connections are shown between them. I copied all the data in the beta database and uploaded it to GRNsight Box (the folder name is GRNsight 2024 Database Debugging). @kdahlquist, can you please look over these tables and make sure they are correct? I think there is something wrong with the gene table for GRN.

Link to the folder

ntran18 · 2024-09-16T06:13:56Z

#1077: Progress update on adding ERT1 to the GRN gene table. After adding ERT1 to the GRN gene table, we still can't load it in GRN. So I looked deeper into the code and saw this.

The gene has to be inside the network table to be loaded. However, currently, the network table doesn't have ERT1 for 2022 - 2024 data.

ntran18 · 2024-09-29T22:16:12Z

I wrote a doc about this issue here.

dondi · 2024-10-02T20:29:31Z

Here is a PDF capture of the Google Doc above as of 10/2:

GRNsight database.pdf

kdahlquist · 2024-10-02T20:30:43Z

Please commit the file to the repository and then link to it from the wiki, so that it is more easily accessible than trying to remember it is on this issue.

#1106 Allowing query for any gene in gene table even if it's not in network table

kdahlquist added bug priority 0.5 database labels Mar 26, 2024

dondi assigned ntran18 Mar 26, 2024

ntran18 added a commit that referenced this issue Apr 9, 2024

#1106 fixing off-by-one column issue. Previously the first column was…

9e7b437

… cols regulator/rows targetETR1. Fix by adding \t to the header

dondi added a commit that referenced this issue Apr 9, 2024

Merge pull request #1111 from dondi/maika-1106

31df56b

#1106 fixing off-by-one column issue.

dondi added priority 0 and removed priority 0.5 labels Apr 9, 2024

ntran18 added a commit that referenced this issue Apr 16, 2024

#1106 delete commented code

786c18a

ntran18 added a commit that referenced this issue Apr 23, 2024

#1106 Make create_union_file a Utils method for reuse so for loader_u…

4cf1a07

…pdate, can create a union missing_gene and update_gene files. The code for loader_update will be in later commit.

ntran18 added a commit that referenced this issue Apr 23, 2024

#1106 Create constants.py file to contain all file path directory

8d4dde0

ntran18 added a commit that referenced this issue Apr 23, 2024

#1106 Adding a combine filter_update.py file to find all missing gene…

315710f

…s, updated genes, misisng protein, updated proteins for both network and protein-protein interactions table

ntran18 added a commit that referenced this issue Apr 23, 2024

#1106 Modify the path to store union-gene-data

46c077b

ntran18 added a commit that referenced this issue Apr 23, 2024

#1106 delete unused files

d1be504

ntran18 added a commit that referenced this issue Apr 23, 2024

#1106 Modify instructions on how to load data into database, and addi…

b9efb4d

…ng instructions on how to update database

ntran18 added a commit that referenced this issue Apr 23, 2024

#1106 remove all the testing namespace

b0e3cf6

ntran18 added a commit that referenced this issue Apr 23, 2024

#1106 Adding update to expression gene table

6dc0eaa

dondi added a commit that referenced this issue Aug 28, 2024

Merge pull request #1113 from dondi/maika-1106

94256ce

#1106 Writing Scripts for Updating Database and Refactor Code

dondi added the for next release label Aug 28, 2024

dondi mentioned this issue Sep 4, 2024

Release checklist for v7.1 #1077

Closed

41 tasks

dondi closed this as completed Oct 2, 2024

ntran18 added a commit that referenced this issue Oct 10, 2024

#1106 Allowing query all the genes even if it's not in grn network table

295e5cd

dondi added a commit that referenced this issue Oct 16, 2024

Merge pull request #1124 from dondi/maika-1106-2

56d436f

#1106 Allowing query for any gene in gene table even if it's not in network table
