Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snp_asGeneticPos() for hg38 #200

Closed
JingZhang1227 opened this issue Mar 29, 2021 · 37 comments
Closed

snp_asGeneticPos() for hg38 #200

JingZhang1227 opened this issue Mar 29, 2021 · 37 comments

Comments

@JingZhang1227
Copy link

Hi Florian,

Thank you very much again for answering my earlier questions.

I was wondering if I were to build the LD reference using a dataset with genome build hg38, can I still use the snp_asGeneticPos() function? My understanding is the genetic map the function used is based on hg19. I was wondering if you have any recommendations for hg38.

Thank you very much!

@privefl
Copy link
Owner

privefl commented Mar 29, 2021

It seems these files are not available for hg38 (see e.g. joepickrell/1000-genomes-genetic-maps#2).

If not on Windows, you can use snp_modifyBuild() to convert from builds. If would be nice if someone could do that and create a pull request to add these to the repo there.
Then, I could add some parameter to choose from the build needed.

@JingZhang1227
Copy link
Author

Thanks for getting back to me! I was wondering if it is possible to match SNPs with rsid instead of the physical position.

@privefl
Copy link
Owner

privefl commented Mar 30, 2021

I've just added a new parameter rsid to the function.
You need to install the latest version with remotes::install_github('privefl/bigsnpr').

@JingZhang1227
Copy link
Author

Thank you very much! This is very helpful! I will try matching with rsid.

@pjordab
Copy link

pjordab commented Apr 1, 2021

Hi Florian and users,

Thank you very much for the previous comments and answers. They have been very helpful.

I have the same issue as JinZhang1227 as my datasets are in hg38.

I am following your tutorial, and I'd really appreciate if you could tell me how should I modify the code to use the new rsid function, I've already installed the latest version of bigsnpr.

Many thanks in advance,

Paloma

@privefl
Copy link
Owner

privefl commented Apr 1, 2021

@pjordab Just add rsid = <rsid> to your snp_asGeneticPos() call.

@privefl
Copy link
Owner

privefl commented Apr 13, 2021

Does it work? Any update on this?

@pjordab
Copy link

pjordab commented Apr 13, 2021

Good morning,

I have converted the documents with the position in cM to hg38, as my data contains snpid in this format: chr:position:allele:allele so I cannot use the new rsid function you have suggested.

There are some SNPs where the conversion has failed, either they are not included in the downloaded UCSC liftover document, or I have deleted them because the new chr was described as unidentified/random/ or another chromosome was written different from the original one. I would like to share the files in case they are useful to other users, but I don't know how to do it.

Despite the new files, which are sorted by position, the script stops with this error 'infos.pos' is not sorted. Any idea how I can solve this?

Many thanks!

Paloma

@privefl
Copy link
Owner

privefl commented Apr 13, 2021

For which function do you get this error? For snp_asGeneticPos()?

@pjordab
Copy link

pjordab commented Apr 13, 2021 via email

@privefl
Copy link
Owner

privefl commented Apr 13, 2021

POS2 is probably not sorted; this can happen when switching from builds I guess.
You can probably get ord <- order(POS2[ind.chr2]) and then use ind.chr2[ord] instead of both ind.chr2.
But then you have to make sure the sumstats are also in the same order that you're using here.. that's a bit annoying.

@pjordab
Copy link

pjordab commented Apr 16, 2021

Hi Florian,

Thank you very much for your answers.

I've tried your solution and multiple variants of it, but it gives me an out of memory error:
slurmstepd: error: Detected 1 oom-kill event(s) in step 18267081.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

for (chr in 1:22){
ind.chr <- which(info_snp$chr == chr)
ind.chr2 <- info_snp$_NUM_ID_[ind.chr]
sort(POS2[ind.chr2])->POS2sorted
ind.chr2[POS2sorted]->ind.chr2sorted
corr0 <- snp_cor(
genotype,
ind.col = ind.chr2sorted,
ncores = 30,
infos.pos = POS2[ind.chr2sorted],
size = 3 / 1000
)
if (chr == 1) {
ld <- Matrix::colSums(corr0^2)
corr <- as_SFBM(corr0, tmp)
} else {
ld <- c(ld, Matrix::colSums(corr0^2))
corr$add_columns(corr0, nrow(corr))
}
}

I'd really appreciate if you have other suggestions.

Many thanks!

Paloma

@privefl
Copy link
Owner

privefl commented Apr 16, 2021

For the current problem, I don't think you can use POS2sorted as indices. Do what I proposed, and you need to form df_beta as well by rbinding each sorted subset of the sumstats for each chromosome (I guess info_snp[ind.chr[ord], ]).

For the issue about memory, please open another issue, as we are very far from the initial subject here.

@pjordab
Copy link

pjordab commented Apr 16, 2021

Ok, many thanks! For the first step, you mean this:

for (chr in 1:22) {
ind.chr <- which(info_snp$chr == chr)
ind.chr2 <- info_snp$_NUM_ID_[ind.chr]
ord <- order(POS2[ind.chr2])
corr0 <- snp_cor(
genotype,
ind.col = ind.chr2[ord],
ncores = NCORES,
infos.pos = POS2[ind.chr2[ord]],
size = 3 / 1000
)
if (chr == 1) {
ld <- Matrix::colSums(corr0^2)
corr <- as_SFBM(corr0, tmp)
} else {
ld <- c(ld, Matrix::colSums(corr0^2))
corr$add_columns(corr0, nrow(corr))
}
}

@privefl
Copy link
Owner

privefl commented Apr 16, 2021

Yes, and df_beta <- info_snp[ind.chr[ord], ] and df_beta <- rbind(df_beta, info_snp[ind.chr[ord], ]) in the if-else.

@pjordab
Copy link

pjordab commented Apr 21, 2021

Hi Florian,
As a follow-up:
I've reviewed the ind.chr, ind.chr2 and POS2[ind.chr2] files using and without using the new vector order( POS2[ind.chr2])->ord.
I've realised that when doing order for the position in centimorgans, the index file I obtain (ord) keeps exactly the same order in all cases, for all 22 chromosomes. Using is.unsorted and is.unsorted (strictly=T) I think a possible explanation for the unsorted issue for infos.pos is that there are some subsequent lines with equal value to the precedent:
i.e.
45
45
58
60
60
(..)
But when doing order, it keeps the order 1,2,3,4,5... in all chromosomes.
Using [ord] the repeated values are interpreted in ascending order without reordering.
So this is good news as I don't need to reorder the sumstats files
Many thanks for your help!

@privefl
Copy link
Owner

privefl commented Apr 21, 2021

I'm not sure I understand how the problem is solved, since you say the reordering is actually not doing anything.

And I'm not checking for strict sorting, so it is okay to have consecutive equal values.

@pjordab
Copy link

pjordab commented Apr 21, 2021 via email

@privefl privefl closed this as completed Apr 21, 2021
@pjordab
Copy link

pjordab commented Jun 15, 2021

Hi Florian,

I'd like to use the function snp_asGeneticPos using the rsID to get the position in cM.

After installing the latest version of your package with remotes::install_github('privefl/bigsnpr').

I do:

POS2 <- snp_asGeneticPos(CHR, POS, dir = ".", ncores=1, rsid=RSID)
Error in snp_asGeneticPos(CHR, POS, dir = ".", ncores = 1, rsid = RSID) :
unused argument (rsid = RSID)

Being RSID the vector that contains my rsIDs.

Am I using the new function correctly?

Many thanks!

Paloma

@privefl
Copy link
Owner

privefl commented Jun 15, 2021

Yes, this should work.
Are you sure the installation was successful? What is packageVersion("bigsnpr")?

@pjordab
Copy link

pjordab commented Jun 15, 2021 via email

@privefl
Copy link
Owner

privefl commented Jun 15, 2021

Yes, I am able to run

info <- readRDS(url("https://ndownloader.figshare.com/files/25503788"))
with(info[1:100, ], bigsnpr::snp_asGeneticPos(chr, pos, rsid = rsid))

@pjordab
Copy link

pjordab commented Jun 15, 2021

This is working POS2 <- snp_asGeneticPos(CHR, POS, rsid = RSID)

(instead of what I was using before: POS2 <- snp_asGeneticPos(CHR, POS, dir = ".", ncores=1, rsid=RSID)

Many thanks!

@privefl
Copy link
Owner

privefl commented Jun 15, 2021

I still do not understand why you get the error with the other call, though.

@pjordab
Copy link

pjordab commented Jun 15, 2021

Neither do I, sorry.

In order to get POS2 from RSID I am using the info_snp file, since my target data does not contain rsid.

Could you please tell me if this is correct:

CHR <-info_snp$chr
POS <-info_snp$pos
RSID <- info_snp$rsid.ss
POS2 <- snp_asGeneticPos(CHR, POS, rsid = RSID)

for (chr in 1:22){
ind.chr <- which(info_snp$chr == chr)
ind.chr2 <- info_snp$_NUM_ID_[ind.chr]
corr0 <- snp_cor(
genotype,
ind.col = ind.chr2,
ncores = 30, #is it correct to use parallelism in here?
infos.pos = POS2[ind.chr2], #should I modify the infos.pos information considering that I only have the snps after snp_match in the POS2 vector? i.e. POS2[ind.chr] ?
size = 3 / 1000
)
if (chr == 1) {
ld <- Matrix::colSums(corr0^2)
corr <- as_SFBM(corr0, tmp)
} else {
ld <- c(ld, Matrix::colSums(corr0^2))
corr$add_columns(corr0, nrow(corr))
}
}

Many thanks!!

@privefl
Copy link
Owner

privefl commented Jun 15, 2021

You need to use POS2[ind.chr] indeed, as ind.chr corresponds to the indices for info_snp, whereas ind.chr2 corresponds to the indices for genotype.

@AmmyDK
Copy link

AmmyDK commented Jun 25, 2021

Dear developer:
Now I am running POS2 <- snp_asGeneticPos(CHR, POS, dir = "."), but my servers are without web, so I downloaded all files one by one from the https://github.com/joepickrell/1000-genomes-genetic-maps here. But I don't know how to read and mange them as snp_asGeneticPos function did. So how could run snp_asGeneticPos function locally?
Many thanks
Ammy

@privefl
Copy link
Owner

privefl commented Jun 25, 2021

You need the uncompressed files locally.
To uncompress them, you can use e.g.

R.utils::gunzip(gzfile)

@AmmyDK
Copy link

AmmyDK commented Jun 25, 2021

Dear developer:
So I do assert_length()but it tells me that cannot find assert_length()function, this is the code I plan to run, only remove url()part:
library("R.utils")
assert_lengths(infos.chr, infos.pos)#got wrong messages, cannot find"infos.chr, infos.pos"
if (!is.null(rsid)) assert_lengths(rsid, infos.pos)

snp_split(infos.chr, function(ind.chr, pos, dir, rsid) {

chr <- attr(ind.chr, "chr")
basename <- paste0("chr", chr, ".OMNI.interpolated_genetic_map")
mapfile <- file.path(dir, basename)
if (!file.exists(mapfile)) {

gzfile <- paste0(mapfile, ".gz")
R.utils::gunzip(gzfile)
}
map.chr <- bigreadr::fread2(mapfile, showProgress = FALSE)

if (is.null(rsid)) {
  ind <- bigutilsr::knn_parallel(as.matrix(map.chr$V2), as.matrix(pos[ind.chr]),
                                 k = 1, ncores = 1)$nn.idx
  new_pos <- map.chr$V3[ind]
} else {
  ind <- match(rsid[ind.chr], map.chr$V1)
  new_pos <- map.chr$V3[ind]

  indNA <- which(is.na(ind))
  if (length(indNA) > 0) {
    pos.chr <- pos[ind.chr]
    new_pos[indNA] <- suppressWarnings(
      stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman")$y)
  }
}
new_pos

}, combine = "c", pos = infos.pos, dir = dir, rsid = rsid, ncores = ncores)
}

Best
Ammy

@privefl
Copy link
Owner

privefl commented Jun 25, 2021

Just gunzip all the 22 chromosome files and run POS2 <- snp_asGeneticPos(CHR, POS, dir = ".") again (maybe changing dir).

@AmmyDK
Copy link

AmmyDK commented Jun 25, 2021

Dear developer,
Yes, it works now, thank you so much.

Best Regards
Ammy

@SiyiJiang41
Copy link

Dear developer, Yes, it works now, thank you so much.

Best Regards Ammy

I come across the same problem with you, can you tell me how you deal with it at last? I don't understant how to change dir, and what to do next.

@privefl
Copy link
Owner

privefl commented Oct 23, 2021

That's the dir parameter.

@SiyiJiang41
Copy link

It works! Thank you!
I have another question that if I can get the PRS of each individual. I want to compare the PRS outcomes with other traditional methods( overestimated or underestimated). Maybe I should learn about the function snp_PRS?
bigsnpr is really a comprehensive package for a beginner in biostatistics like me. Many thanks for your help!

@privefl
Copy link
Owner

privefl commented Oct 24, 2021

If you an unrelated question, please open a new issue.

@zjppdozen
Copy link

zjppdozen commented Apr 28, 2022

Yes, and df_beta <- info_snp[ind.chr[ord], ] and df_beta <- rbind(df_beta, info_snp[ind.chr[ord], ]) in the if-else.

Hi Florian,

Could you specify where should I put df_beta <- info_snp[ind.chr[ord], ] and df_beta <- rbind(df_beta, info_snp[ind.chr[ord], ]) in the if-else clause (as shown in my code)?

My code is as the following.

for (chr in 1:22) { 

  ind.chr <- which(df_beta$chr == chr)

  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
  ord <- order(POS2[ind.chr2])
  
  corr0 <- snp_cor(G, ind.col = ind.chr2[ord], size = 3 / 1000,
                   infos.pos = POS2[ind.chr2[ord]], ncores = NCORES)
  
  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp, compact = TRUE)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

Many thanks

@privefl
Copy link
Owner

privefl commented Apr 29, 2022

In the ifelse I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants