snp_asGeneticPos() for hg38 #200

JingZhang1227 · 2021-03-29T19:35:20Z

Hi Florian,

Thank you very much again for answering my earlier questions.

I was wondering if I were to build the LD reference using a dataset with genome build hg38, can I still use the snp_asGeneticPos() function? My understanding is the genetic map the function used is based on hg19. I was wondering if you have any recommendations for hg38.

Thank you very much!

privefl · 2021-03-29T20:25:56Z

It seems these files are not available for hg38 (see e.g. joepickrell/1000-genomes-genetic-maps#2).

If not on Windows, you can use snp_modifyBuild() to convert between builds. It would be nice if someone could do that and create a pull request to add these to the repo there.
Then, I could add a parameter to choose the build needed.

JingZhang1227 · 2021-03-30T03:42:09Z

Thanks for getting back to me! I was wondering if it is possible to match SNPs with rsid instead of the physical position.

privefl · 2021-03-30T09:54:46Z

I've just added a new parameter rsid to the function.
You need to install the latest version with remotes::install_github('privefl/bigsnpr').

JingZhang1227 · 2021-03-30T14:06:36Z

Thank you very much! This is very helpful! I will try matching with rsid.

pjordab · 2021-04-01T14:33:11Z

Hi Florian and users,

Thank you very much for the previous comments and answers. They have been very helpful.

I have the same issue as JingZhang1227, as my datasets are in hg38.

I am following your tutorial, and I'd really appreciate it if you could tell me how I should modify the code to use the new rsid parameter. I've already installed the latest version of bigsnpr.

Many thanks in advance,

Paloma

privefl · 2021-04-01T14:35:33Z

@pjordab Just add rsid = <rsid> to your snp_asGeneticPos() call.
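For readers wondering what matching by rsID amounts to, here is a minimal base-R sketch. It assumes a made-up three-column map (rsID, physical position, genetic position in cM) in place of the real genetic map files, and the helper name cm_by_rsid is hypothetical, not part of bigsnpr. The point is only that the lookup is keyed on the ID, so the physical positions (and hence the build) are not used for variants whose rsID is found.

```r
# Toy genetic map with three columns (all values are made up):
# V1 = rsID, V2 = physical position, V3 = genetic position in cM.
map_chr <- data.frame(
  V1 = c("rs1", "rs2", "rs3", "rs4"),
  V2 = c(1000, 2000, 3000, 4000),
  V3 = c(0.10, 0.25, 0.40, 0.90),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up cM by rsID, ignoring physical positions.
cm_by_rsid <- function(rsid, map) {
  ind <- match(rsid, map$V1)  # NA where the rsID is absent from the map
  map$V3[ind]
}

# Query variants; "rs9" is not in the toy map, so it gets NA here.
query_rsid <- c("rs3", "rs1", "rs9")
cm <- cm_by_rsid(query_rsid, map_chr)
print(cm)
```

In this sketch an unmatched rsID simply stays NA; the function source pasted later in this thread fills such gaps by interpolation instead.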

privefl · 2021-04-13T06:54:39Z

Does it work? Any update on this?

pjordab · 2021-04-13T12:55:39Z

Good morning,

I have converted the files with the positions in cM to hg38, because my data contains SNP IDs in this format: chr:position:allele:allele, so I cannot use the new rsid parameter you suggested.

For some SNPs the conversion failed: either they are not included in the downloaded UCSC liftOver file, or I deleted them because the new chromosome was described as unidentified/random, or because it differed from the original one. I would like to share the files in case they are useful to other users, but I don't know how to do it.

Even with the new files, which are sorted by position, the script stops with this error: 'infos.pos' is not sorted. Any idea how I can solve this?

Many thanks!

Paloma

privefl · 2021-04-13T13:22:29Z

For which function do you get this error? For snp_asGeneticPos()?

pjordab · 2021-04-13T13:41:33Z

It happens after 2h of running the script. This is my error.log:

Loading required package: bigstatsr
Warning message:
NAs introduced by coercion
Warning message:
NAs introduced by coercion
Attaching package: ‘dplyr’
The following objects are masked from ‘package:data.table’:
between, first, last
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Warning message:
NAs introduced by coercion
1,111,908 variants to be matched.
0 ambiguous SNPs have been removed.
665,308 variants have been matched; 0 were flipped and 326,513 were reversed.
Error: 'infos.pos' is not sorted.
Execution halted

I think it is related to this part of the script:

for (chr in 1:22) {
  ind.chr <- which(info_snp$chr == chr)
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
  corr0 <- snp_cor(
    genotype,
    ind.col = ind.chr2,
    ncores = NCORES,
    infos.pos = POS2[ind.chr2],
    size = 3 / 1000
  )
  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

privefl · 2021-04-13T13:53:11Z

POS2 is probably not sorted; this can happen when switching between builds, I guess.
You can probably get ord <- order(POS2[ind.chr2]) and then use ind.chr2[ord] in place of both occurrences of ind.chr2.
But then you have to make sure the sumstats are also in the same order that you're using here, which is a bit annoying.
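As a rough illustration of that suggestion, the following base-R sketch applies one permutation to everything that must stay aligned. The data are made up, and the names pos_cm, ind_col and sumstats are placeholders for POS2[ind.chr2], ind.chr2 and the per-chromosome summary statistics.

```r
# Toy per-chromosome data (made-up values).
pos_cm  <- c(0.30, 0.10, 0.20)   # genetic positions, not sorted
ind_col <- c(11L, 12L, 13L)      # column indices in the genotype matrix
sumstats <- data.frame(
  id   = c("a", "b", "c"),
  beta = c(0.5, -0.2, 0.1),
  stringsAsFactors = FALSE
)

# One permutation that sorts the positions...
ord <- order(pos_cm)

# ...applied consistently to every object that must stay aligned.
ind_col_sorted  <- ind_col[ord]     # would be passed as ind.col
pos_cm_sorted   <- pos_cm[ord]      # would be passed as infos.pos
sumstats_sorted <- sumstats[ord, ]  # sumstats rows in the same order

print(ind_col_sorted)
print(sumstats_sorted)
```

The same ord is reused for all three objects, which is what keeps the columns of the correlation matrix and the rows of the summary statistics in correspondence.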

pjordab · 2021-04-16T02:04:05Z

Hi Florian,

Thank you very much for your answers.

I've tried your solution and multiple variants of it, but it gives me an out of memory error:
slurmstepd: error: Detected 1 oom-kill event(s) in step 18267081.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

for (chr in 1:22) {
  ind.chr <- which(info_snp$chr == chr)
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
  sort(POS2[ind.chr2]) -> POS2sorted
  ind.chr2[POS2sorted] -> ind.chr2sorted
  corr0 <- snp_cor(
    genotype,
    ind.col = ind.chr2sorted,
    ncores = 30,
    infos.pos = POS2[ind.chr2sorted],
    size = 3 / 1000
  )
  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

I'd really appreciate if you have other suggestions.

Many thanks!

Paloma

privefl · 2021-04-16T06:21:12Z

For the current problem, I don't think you can use POS2sorted as indices. Do what I proposed, and you need to form df_beta as well by rbinding each sorted subset of the sumstats for each chromosome (I guess info_snp[ind.chr[ord], ]).
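To make the point about indices concrete, here is a small base-R sketch with made-up values: sort() returns the sorted values themselves, whereas order() returns the permutation of indices, and only the latter can be used for subsetting.

```r
pos_cm  <- c(0.30, 0.10, 0.20)   # made-up genetic positions
ind_col <- c(11L, 12L, 13L)      # made-up genotype column indices

sorted_values <- sort(pos_cm)    # the values in increasing order
ord           <- order(pos_cm)   # the indices that put them in order

# Numeric subscripts are truncated to integers, so 0.1, 0.2 and 0.3 all
# become 0, and an index of 0 selects nothing.
wrong <- ind_col[sorted_values]
right <- ind_col[ord]

print(length(wrong))
print(right)
```

With real centimorgan values larger than 1, the truncated values would select arbitrary (and possibly repeated or out-of-range) columns rather than nothing, which is arguably worse because it fails silently.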

For the issue about memory, please open another issue, as we are very far from the initial subject here.

pjordab · 2021-04-16T13:39:14Z

Ok, many thanks! For the first step, you mean this:

for (chr in 1:22) {
  ind.chr <- which(info_snp$chr == chr)
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
  ord <- order(POS2[ind.chr2])
  corr0 <- snp_cor(
    genotype,
    ind.col = ind.chr2[ord],
    ncores = NCORES,
    infos.pos = POS2[ind.chr2[ord]],
    size = 3 / 1000
  )
  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

privefl · 2021-04-16T17:24:56Z

Yes, and df_beta <- info_snp[ind.chr[ord], ] and df_beta <- rbind(df_beta, info_snp[ind.chr[ord], ]) in the if-else.
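A minimal sketch of that bookkeeping, leaving out the snp_cor() and SFBM steps so that it runs with base R alone. The data are made up, only two chromosomes are used, and rbind() onto an initial NULL stands in for the if-else; that shortcut is my simplification, not something stated in the thread.

```r
# Toy stand-ins (made-up values): matched variants and genetic positions.
info_snp <- data.frame(
  chr        = c(1, 1, 1, 2, 2),
  beta       = c(0.1, 0.2, 0.3, 0.4, 0.5),
  `_NUM_ID_` = 1:5,
  check.names = FALSE
)
POS2 <- c(0.2, 0.3, 0.1, 0.9, 0.8)  # one value per genotype column

df_beta <- NULL
for (chr in 1:2) {
  ind.chr  <- which(info_snp$chr == chr)     # rows of info_snp
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]   # columns of the genotype matrix
  ord <- order(POS2[ind.chr2])
  # The LD step would go here, using ind.chr2[ord] and POS2[ind.chr2[ord]].
  # Append the sumstats rows in the same order as the LD columns.
  df_beta <- rbind(df_beta, info_snp[ind.chr[ord], ])
}

print(df_beta)
```

After the loop, the rows of df_beta follow the same per-chromosome order as the columns that would have been added to the correlation matrix.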

pjordab · 2021-04-21T15:55:02Z

Hi Florian,
As a follow-up:
I've compared ind.chr, ind.chr2 and POS2[ind.chr2] with and without the new vector ord <- order(POS2[ind.chr2]).
I've realised that when ordering by the position in centimorgans, the index vector I obtain (ord) keeps exactly the same order in all cases, for all 22 chromosomes. Using is.unsorted and is.unsorted(strictly = TRUE), I think a possible explanation for the unsorted issue for infos.pos is that some consecutive values are equal to the preceding one:
i.e.
45
45
58
60
60
(..)
But when calling order, it keeps the order 1, 2, 3, 4, 5... in all chromosomes.
Using [ord], the repeated values are treated as already being in ascending order, without reordering.
So this is good news, as I don't need to reorder the sumstats files.
Many thanks for your help!

privefl · 2021-04-21T17:03:49Z

I'm not sure I understand how the problem is solved, since you say the reordering is actually not doing anything.

And I'm not checking for strict sorting, so it is okay to have consecutive equal values.
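The distinction between the two checks can be reproduced in base R with the values quoted earlier in this thread. This is only a sketch of how is.unsorted() and order() behave on tied values; it does not show how bigsnpr itself performs the check.

```r
pos_cm <- c(45, 45, 58, 60, 60)   # values from the example in this thread

non_strict <- is.unsorted(pos_cm)                   # ties are allowed
strict     <- is.unsorted(pos_cm, strictly = TRUE)  # ties count as unsorted
ord        <- order(pos_cm)                         # ties keep their original order

print(non_strict)
print(strict)
print(ord)
```

Because the vector is already non-decreasing, order() returns the identity permutation, so indexing with [ord] leaves everything unchanged.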

pjordab · 2021-04-21T20:49:45Z

After observing that order gives me exactly the same order, I've done this for the 22 chr files:

ind.chr <- which(info_snp$chr == 22)
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
ord <- order(POS2[ind.chr2])

is.unsorted(POS2[ind.chr2],strictly=F)

[1] FALSE

is.unsorted(POS2[ind.chr2],strictly=T)

[1] TRUE

is.unsorted(ind.chr2,strictly=T)

[1] FALSE

is.unsorted(ord,strictly=T)

[1] FALSE

is.unsorted(POS2[ind.chr2[ord]])

[1] FALSE

is.unsorted(POS2[ind.chr2[ord]],strictly=T)

[1] TRUE

From this: POS2[ind.chr2] is only "strictly" unsorted. The new "ord" vector gives me exactly the same order, but this extra index vector resolves the "'infos.pos' is not sorted" issue.

pjordab · 2021-06-15T02:15:59Z

Hi Florian,

I'd like to use the function snp_asGeneticPos using the rsID to get the position in cM.

After installing the latest version of your package with remotes::install_github('privefl/bigsnpr'), I do:

POS2 <- snp_asGeneticPos(CHR, POS, dir = ".", ncores=1, rsid=RSID)
Error in snp_asGeneticPos(CHR, POS, dir = ".", ncores = 1, rsid = RSID) :
unused argument (rsid = RSID)

RSID is the vector that contains my rsIDs.

Am I using the new function correctly?

Many thanks!

Paloma

privefl · 2021-06-15T06:05:36Z

Yes, this should work.
Are you sure the installation was successful? What is packageVersion("bigsnpr")?
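One way to see which definition R is actually calling is to inspect the function's formal arguments, since an "unused argument" error means the function being called does not declare that argument. The sketch below uses two hypothetical stand-in functions (old_fun and new_fun) so that it runs without bigsnpr installed; the check on the real function is only attempted if the package is available.

```r
# Helper: does a function declare a given argument?
has_arg <- function(fun, arg) arg %in% names(formals(fun))

# Hypothetical stand-ins for an older and a newer definition.
old_fun <- function(infos.chr, infos.pos, dir, ncores) NULL
new_fun <- function(infos.chr, infos.pos, dir, ncores, rsid = NULL) NULL

old_ok <- has_arg(old_fun, "rsid")
new_ok <- has_arg(new_fun, "rsid")
print(c(old = old_ok, new = new_ok))

# Only runs when bigsnpr is installed; reports what this session would use.
if (requireNamespace("bigsnpr", quietly = TRUE)) {
  print(packageVersion("bigsnpr"))
  print(has_arg(bigsnpr::snp_asGeneticPos, "rsid"))
}
```

Restarting the R session after reinstalling a package is usually worth trying too, since an already loaded namespace is not replaced by a new installation.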

pjordab · 2021-06-15T11:59:16Z

This one:

packageVersion("bigsnpr")

[1] ‘1.8.1’

Is this the latest one?

privefl · 2021-06-15T12:26:19Z

Yes, I am able to run

info <- readRDS(url("https://ndownloader.figshare.com/files/25503788"))
with(info[1:100, ], bigsnpr::snp_asGeneticPos(chr, pos, rsid = rsid))

pjordab · 2021-06-15T12:53:23Z

This is working: POS2 <- snp_asGeneticPos(CHR, POS, rsid = RSID)

(instead of what I was using before: POS2 <- snp_asGeneticPos(CHR, POS, dir = ".", ncores=1, rsid=RSID))

Many thanks!

privefl · 2021-06-15T13:36:43Z

I still do not understand why you get the error with the other call, though.

pjordab · 2021-06-15T14:33:53Z

Neither do I, sorry.

In order to get POS2 from RSID I am using the info_snp file, since my target data does not contain rsid.

Could you please tell me if this is correct:

CHR <- info_snp$chr
POS <- info_snp$pos
RSID <- info_snp$rsid.ss
POS2 <- snp_asGeneticPos(CHR, POS, rsid = RSID)

for (chr in 1:22) {
  ind.chr <- which(info_snp$chr == chr)
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
  corr0 <- snp_cor(
    genotype,
    ind.col = ind.chr2,
    ncores = 30,  # is it correct to use parallelism here?
    infos.pos = POS2[ind.chr2],  # should I modify infos.pos, considering that POS2 only contains the SNPs kept after snp_match? i.e. POS2[ind.chr]?
    size = 3 / 1000
  )
  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

Many thanks!!

privefl · 2021-06-15T14:48:43Z

You need to use POS2[ind.chr] indeed, as ind.chr corresponds to the indices for info_snp, whereas ind.chr2 corresponds to the indices for genotype.
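The difference between the two index vectors can be seen on a toy example in base R. The values are made up: the genotype matrix is imagined to have six variants, four of which were matched and kept in info_snp, and POS2 has one value per row of info_snp.

```r
# Four matched variants out of six genotype columns (made-up values).
info_snp <- data.frame(
  chr        = c(1, 1, 2, 2),
  `_NUM_ID_` = c(2L, 3L, 5L, 6L),  # column indices in the genotype matrix
  check.names = FALSE
)
POS2 <- c(0.1, 0.4, 0.2, 0.7)      # one genetic position per row of info_snp

chr <- 2
ind.chr  <- which(info_snp$chr == chr)     # rows of info_snp
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]   # columns of the genotype matrix

pos_ok  <- POS2[ind.chr]    # indexes POS2 by info_snp row: correct here
pos_bad <- POS2[ind.chr2]   # indexes POS2 by genotype column: wrong here

print(pos_ok)
print(pos_bad)
```

In this toy case the wrong indexing runs past the end of POS2 and yields NA; with real data it may instead return valid-looking positions of the wrong variants.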

AmmyDK · 2021-06-25T08:56:43Z

Dear developer:
Now I am running POS2 <- snp_asGeneticPos(CHR, POS, dir = "."), but my servers have no internet access, so I downloaded all the files one by one from https://github.com/joepickrell/1000-genomes-genetic-maps. But I don't know how to read and manage them the way the snp_asGeneticPos function does. So how can I run the snp_asGeneticPos function locally?
Many thanks
Ammy

privefl · 2021-06-25T09:05:35Z

You need the uncompressed files locally.
To uncompress them, you can use e.g.

bigsnpr/R/match-alleles.R

Line 287 in ff127e2

R.utils::gunzip(gzfile)

AmmyDK · 2021-06-25T09:18:20Z

Dear developer:
So I ran assert_lengths(), but it tells me that it cannot find the assert_lengths() function. This is the code I plan to run, with only the url() part removed:
library("R.utils")
assert_lengths(infos.chr, infos.pos)  # got an error: cannot find "infos.chr", "infos.pos"
if (!is.null(rsid)) assert_lengths(rsid, infos.pos)

snp_split(infos.chr, function(ind.chr, pos, dir, rsid) {

chr <- attr(ind.chr, "chr")
basename <- paste0("chr", chr, ".OMNI.interpolated_genetic_map")
mapfile <- file.path(dir, basename)
if (!file.exists(mapfile)) {

gzfile <- paste0(mapfile, ".gz")
R.utils::gunzip(gzfile)
}
map.chr <- bigreadr::fread2(mapfile, showProgress = FALSE)

if (is.null(rsid)) {
  ind <- bigutilsr::knn_parallel(as.matrix(map.chr$V2), as.matrix(pos[ind.chr]),
                                 k = 1, ncores = 1)$nn.idx
  new_pos <- map.chr$V3[ind]
} else {
  ind <- match(rsid[ind.chr], map.chr$V1)
  new_pos <- map.chr$V3[ind]

  indNA <- which(is.na(ind))
  if (length(indNA) > 0) {
    pos.chr <- pos[ind.chr]
    new_pos[indNA] <- suppressWarnings(
      stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman")$y)
  }
}
new_pos

}, combine = "c", pos = infos.pos, dir = dir, rsid = rsid, ncores = ncores)
}

Best
Ammy

privefl · 2021-06-25T09:40:54Z

Just gunzip all the 22 chromosome files and run POS2 <- snp_asGeneticPos(CHR, POS, dir = ".") again (maybe changing dir).
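For an offline machine, the decompression step might look like the sketch below. It uses base-R connections rather than R.utils::gunzip so that it has no extra dependency; gunzip_base and prepare_maps are hypothetical helper names, the file naming follows the pattern in the code pasted above, and the demo writes a made-up, tiny map file into a temporary directory rather than touching real data.

```r
# Decompress one .gz file with base R connections (the .gz file is kept).
gunzip_base <- function(gz_path, out_path) {
  con_in <- gzfile(gz_path, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    chunk <- readBin(con_in, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, con_out)
  }
  invisible(out_path)
}

# Hypothetical helper: make sure the uncompressed map files exist in `dir`.
# Returns one logical per chromosome: TRUE if the uncompressed file is there.
prepare_maps <- function(dir = ".", chrs = 1:22) {
  vapply(chrs, function(chr) {
    mapfile <- file.path(dir, paste0("chr", chr, ".OMNI.interpolated_genetic_map"))
    gz <- paste0(mapfile, ".gz")
    if (!file.exists(mapfile) && file.exists(gz)) gunzip_base(gz, mapfile)
    file.exists(mapfile)
  }, logical(1))
}

# Demo in a temporary directory with a made-up map file for chr22.
demo_dir <- tempfile("maps_")
dir.create(demo_dir)
gz_con <- gzfile(file.path(demo_dir, "chr22.OMNI.interpolated_genetic_map.gz"), open = "w")
writeLines(c("rs1 100 0.1", "rs2 200 0.2"), gz_con)
close(gz_con)

ready <- prepare_maps(demo_dir, chrs = 22)
print(ready)
```

Once all 22 uncompressed files sit in one directory, that directory is what gets passed as dir in snp_asGeneticPos(CHR, POS, dir = ...).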

AmmyDK · 2021-06-25T09:43:44Z

Dear developer,
Yes, it works now, thank you so much.

Best Regards
Ammy

SiyiJiang41 · 2021-10-23T12:50:30Z

AmmyDK wrote: "Dear developer, Yes, it works now, thank you so much. Best Regards Ammy"

I have come across the same problem as you. Can you tell me how you dealt with it in the end? I don't understand how to change dir, and what to do next.

privefl · 2021-10-23T16:00:14Z

That's the dir parameter.

SiyiJiang41 · 2021-10-24T09:47:10Z

It works! Thank you!
I have another question: can I get the PRS of each individual? I want to compare the PRS outcomes with other traditional methods (overestimated or underestimated). Maybe I should learn about the function snp_PRS?
bigsnpr is really a comprehensive package for a beginner in biostatistics like me. Many thanks for your help!

privefl · 2021-10-24T10:30:51Z

If you have an unrelated question, please open a new issue.

zjppdozen · 2022-04-28T20:13:16Z

privefl wrote: "Yes, and df_beta <- info_snp[ind.chr[ord], ] and df_beta <- rbind(df_beta, info_snp[ind.chr[ord], ]) in the if-else."

Hi Florian,

Could you specify where I should put df_beta <- info_snp[ind.chr[ord], ] and df_beta <- rbind(df_beta, info_snp[ind.chr[ord], ]) in the if-else clause (as shown in my code)?

My code is as follows.

for (chr in 1:22) { 

  ind.chr <- which(df_beta$chr == chr)

  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
  ord <- order(POS2[ind.chr2])
  
  corr0 <- snp_cor(G, ind.col = ind.chr2[ord], size = 3 / 1000,
                   infos.pos = POS2[ind.chr2[ord]], ncores = NCORES)
  
  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp, compact = TRUE)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

Many thanks

privefl · 2022-04-29T14:37:06Z

In the ifelse I guess.

JingZhang1227 mentioned this issue Mar 29, 2021

Quality control of summary statistics #195

Closed

privefl closed this as completed Apr 21, 2021
