Commit 151bc19

Coverage table-valued function (#80)
* table valued function
* Coverage refactored
* Coverage refactored
* coverage without implementation
* coverage done
* coverage table valued function
* Jenkinsfile fix for snapshot doc
* Jenkinsfile fix for snapshot doc
* doc update
* Minor code refactoring
* Docker for shiny
* fix to array type in coverage
* Partition pruning added
* Documentation update for coverage, pruning and sparklyr
* Doc update
* fix serialization issue SequilaAnalyzer
* doc fix
1 parent 03ce5a0 commit 151bc19

24 files changed (+1194 −122 lines)

Docker/bdg-sequila-shiny/Dockerfile

+50 (new file)

FROM rocker/shiny

RUN apt-get update && apt-get install --yes git sudo curl libssl-dev libxml2-dev

#install devtools
RUN Rscript -e "install.packages('devtools')"

#install sequila
RUN Rscript -e "devtools::install_github('ZSI-Bio/bdg-sparklyr-sequila')"

#install spark (installed by .onLoad when package loaded)
RUN Rscript -e "library(sequila)"

#install jdk8
RUN apt-get install --yes gnupg2
##A quick & dirty fix for failing Oracle JDK installer
RUN if [ ! -d /usr/share/man/man1 ]; then mkdir -p /usr/share/man/man1; fi
RUN \
  echo "===> add webupd8 repository..." && \
  echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu xenial main" | tee /etc/apt/sources.list.d/webupd8team-java.list && \
  echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu xenial main" | tee -a /etc/apt/sources.list.d/webupd8team-java.list && \
  apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886 && \
  apt-get update && \
  \
  echo "===> install Java" && \
  echo debconf shared/accepted-oracle-license-v1-1 select true | debconf-set-selections && \
  echo debconf shared/accepted-oracle-license-v1-1 seen true | debconf-set-selections && \
  cd /var/lib/dpkg/info && \
  DEBIAN_FRONTEND=noninteractive apt-get install -y --force-yes oracle-java8-installer oracle-java8-set-default && \
  \
  echo "===> clean up..." && \
  rm -rf /var/cache/oracle-jdk8-installer && \
  apt-get clean && \
  rm -rf /var/lib/apt/lists/*

ENV JAVA_HOME /usr/lib/jvm/java-8-oracle

#copy test data
COPY NA12878.slice.bam /tmp/NA12878.slice.bam
COPY warmcache.scala /tmp/warmcache.scala

#sequila versions
ARG BDG_VERSION=0.4-SNAPSHOT
ENV BGD_VERSION 0.4-SNAPSHOT
RUN /root/spark/spark-2.2.1-bin-hadoop2.7/bin/spark-shell --packages org.biodatageeks:bdg-sequila_2.11:${BGD_VERSION} \
    -i /tmp/warmcache.scala --repositories https://zsibio.ii.pw.edu.pl/nexus/repository/maven-releases/,https://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/
Binary file not shown (356 KB).
+1 (new file)

System.exit(0)

Docker/bdg-sequila/Dockerfile

+18 −1

@@ -15,6 +15,13 @@ ENV BGD_VERSION={{COMPONENT_VERSION}}
 
 
 
+RUN apt-get update && apt-get install --yes git sudo curl libssl-dev libxml2-dev
+
+
+
+
+
+
 RUN mkdir /tmp/bdg-toolset
 
 ###once the repo is public we can use git instead
@@ -31,7 +38,7 @@ COPY bin/bdg-sequilaR.sh /tmp/bdg-toolset/bdg-sequilaR
 
 #featureCounts scripts
 COPY bin/featureCounts.sh /tmp/bdg-toolset/featureCounts
-RUN bash -c " if [[ $BDG_VERSION =~ *SNAPSHOT ]]; then \
+RUN bash -c " if [[ $BDG_VERSION =~ SNAPSHOT ]]; then \
 wget https://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/org/biodatageeks/bdg-sequila_2.11/${BGD_VERSION}/bdg-sequila_2.11-${BGD_VERSION}-assembly.jar -O /tmp/bdg-toolset/bdg-sequila-assembly-${BGD_VERSION}.jar ; \
 else wget https://zsibio.ii.pw.edu.pl/nexus/repository/maven-releases/org/biodatageeks/bdg-sequila_2.11/${BGD_VERSION}/bdg-sequila_2.11-${BGD_VERSION}-assembly.jar -O /tmp/bdg-toolset/bdg-sequila-assembly-${BGD_VERSION}.jar ; \
 fi"
@@ -105,11 +112,21 @@ RUN apt-get update \
 RUN Rscript -e 'source("http://bioconductor.org/biocLite.R")' -e 'biocLite("edgeR")'
 RUN Rscript -e 'source("http://bioconductor.org/biocLite.R")' -e 'biocLite("DESeq2")'
 
+#install devtools
+RUN Rscript -e "install.packages('devtools')"
+#install sequila
+RUN Rscript -e "devtools::install_github('ZSI-Bio/bdg-sparklyr-sequila')"
+
 USER tempuser
 
 WORKDIR /home/tempuser
 ##just to download all depencies and speedup start
 RUN bdg-shell -i /tmp/bdg-toolset/warmcache.scala build
 
+
+
+#install spark (installed by .onLoad when package loaded)
+RUN Rscript -e "library(sequila)"
+
 USER root
 ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]

Docker/bdg-sequila/bin/bdg-sequilaR.sh

+3 −2

@@ -16,6 +16,7 @@ echo $BGD_VERSION
 echo -e "\n"
 
 rm -rf ~/metastore_db
-sparkR --packages org.biodatageeks:bdg-sequila_2.11:${BGD_VERSION} \
-    --repositories https://zsibio.ii.pw.edu.pl/nexus/repository/maven-releases/,https://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/ $@
+R
+#sparkR --packages org.biodatageeks:bdg-sequila_2.11:${BGD_VERSION} \
+#    --repositories https://zsibio.ii.pw.edu.pl/nexus/repository/maven-releases/,https://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/ $@
 

Docker/bdg-sequila/bin/bdginit.scala

+9 −6

@@ -1,17 +1,20 @@
+import org.apache.spark.sql.SequilaSession
 import org.biodatageeks.utils.{SequilaRegister, UDFRegister}
 
 /*set params*/
 
-spark.sqlContext.setConf("spark.biodatageeks.rangejoin.useJoinOrder","false")
-spark.sqlContext.setConf("spark.biodatageeks.rangejoin.maxBroadcastSize", (128*1024*1024).toString)
+val ss = SequilaSession(spark)
 
-spark.sqlContext.setConf("spark.biodatageeks.rangejoin.minOverlap","1")
-spark.sqlContext.setConf("spark.biodatageeks.rangejoin.maxGap","0")
+ss.sqlContext.setConf("spark.biodatageeks.rangejoin.useJoinOrder","false")
+ss.sqlContext.setConf("spark.biodatageeks.rangejoin.maxBroadcastSize", (128*1024*1024).toString)
+
+ss.sqlContext.setConf("spark.biodatageeks.rangejoin.minOverlap","1")
+ss.sqlContext.setConf("spark.biodatageeks.rangejoin.maxGap","0")
 
 /*register UDFs*/
 
-UDFRegister.register(spark)
+UDFRegister.register(ss)
 
 /*inject bdg-granges strategy*/
-SequilaRegister.register(spark)
+SequilaRegister.register(ss)

build.sbt

+15-11
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@ import scala.util.Properties
22

33
name := """bdg-sequila"""
44

5-
version := "0.3"
5+
version := "0.4-SNAPSHOT"
66

77
organization := "org.biodatageeks"
88

@@ -42,13 +42,17 @@ libraryDependencies += "com.github.potix2" %% "spark-google-spreadsheets" % "0.5
4242

4343
libraryDependencies += "ch.cern.sparkmeasure" %% "spark-measure" % "0.11"
4444

45-
//fork := true
45+
//libraryDependencies += "pl.edu.pw.ii.zsibio" % "common-routines_2.11" % "0.1-SNAPSHOT"
46+
47+
fork := false
4648
fork in Test := true
47-
parallelExecution in Test := false
49+
//parallelExecution in Test := false
4850
javaOptions in test += "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=9999"
49-
javaOptions in run ++= Seq(
50-
"-Dlog4j.debug=true",
51-
"-Dlog4j.configuration=log4j.properties")
51+
javaOptions in run += "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=9999"
52+
53+
//javaOptions in run ++= Seq(
54+
// "-Dlog4j.debug=true",
55+
// "-Dlog4j.configuration=log4j.properties")
5256

5357
javaOptions ++= Seq("-Xms512M", "-Xmx8192M", "-XX:+CMSClassUnloadingEnabled")
5458

@@ -93,11 +97,11 @@ assemblyMergeStrategy in assembly := {
9397
}
9498

9599
/* only for releasing assemblies*/
96-
artifact in (Compile, assembly) := {
97-
val art = (artifact in (Compile, assembly)).value
98-
art.withClassifier(Some("assembly"))
99-
}
100-
addArtifact(artifact in (Compile, assembly), assembly)
100+
//artifact in (Compile, assembly) := {
101+
// val art = (artifact in (Compile, assembly)).value
102+
// art.withClassifier(Some("assembly"))
103+
//}
104+
//addArtifact(artifact in (Compile, assembly), assembly)
101105

102106
publishConfiguration := publishConfiguration.value.withOverwrite(true)
103107

build.sh

+1-1
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ do
3434
#diffTs=`echo "$(date +%s) - $(git log -n 1 --pretty=format:%at ${dir})" | bc`
3535
#if [ $diffTs -lt $MAX_COMMIT_TS_DIFF ]; then
3636
cd $dir
37-
docker build -t $image:$version .
37+
docker build --no-cache -t $image:$version .
3838
docker build -t $image:latest .
3939
if [[ ${BUILD_MODE} != "local" ]]; then
4040
docker push docker.io/$image:latest

build_docs.sh

+1 −1

@@ -10,7 +10,7 @@ cd docs && ./docs.sh html
 if [[ $version =~ SNAPSHOT ]]; then
 docker build -t zsi-bio/bdg-sequila-snap-doc .
 if [ $(docker ps | grep bdg-sequila-snap-doc | wc -l) -gt 0 ]; then docker stop bdg-sequila-snap-doc && docker rm bdg-sequila-snap-doc; fi
-docker run -v 80:81 -d --name bdg-sequila-snap-doc zsi-bio/bdg-sequila-snap-doc
+docker run -p 81:80 -d --name bdg-sequila-snap-doc zsi-bio/bdg-sequila-snap-doc
 else
 docker build -t zsi-bio/bdg-sequila-doc .
 if [ $(docker ps | grep bdg-sequila-doc | wc -l) -gt 0 ]; then docker stop bdg-sequila-doc && docker rm bdg-sequila-doc; fi

docs/source/function/function.rst

+109

@@ -76,6 +76,55 @@ process and query them using a SQL interface:
 |
 """.stripMargin)
 spark.sql("SELECT sampleId,contigName,start,end,cigar FROM reads").show(5)
+
+Implicit partition pruning for BAM data source
+##############################################
+
+The BAM data source supports an implicit `partition pruning <https://docs.oracle.com/database/121/VLDBG/GUID-E677C85E-C5E3-4927-B3DF-684007A7B05D.htm#VLDBG00401>`_
+mechanism to speed up queries that are restricted to only a subset of samples from a table. Consider the following example:
+
+.. code-block:: bash
+
+    MacBook-Pro:multisample marek$ ls -ltr
+    total 2136
+    -rw-r--r--  1 marek  staff  364043 May 15 18:53 NA12877.slice.bam
+    -rw-r--r--  1 marek  staff  364043 May 15 18:53 NA12878.slice.bam
+    -rw-r--r--  1 marek  staff  364043 May 15 18:53 NA12879.slice.bam
+
+    MacBook-Pro:multisample marek$ pwd
+    /Users/marek/data/multisample
+
+.. code-block:: scala
+
+    import org.apache.spark.sql.{SequilaSession, SparkSession}
+    val bamPath = "/Users/marek/data/multisample/*.bam"
+    val tableNameBAM = "reads"
+    val ss: SparkSession = SequilaSession(spark)
+    ss.sql(
+      s"""
+         |CREATE TABLE ${tableNameBAM}
+         |USING org.biodatageeks.datasources.BAM.BAMDataSource
+         |OPTIONS(path "${bamPath}")
+         |
+       """.stripMargin)
+
+    val query =
+      """
+        |SELECT sampleId,count(*) FROM reads where sampleId IN('NA12878','NA12879')
+        |GROUP BY sampleId order by sampleId
+      """.stripMargin
+    ss.sql(query)
+
+If you run the above query, you should see a warning confirming that SeQuiLa optimized the physical plan and will read only 2 BAM files
+instead of 3 to answer your query:
+
+.. code-block:: bash
+
+    WARN BAMRelation: Partition pruning detected,reading only files for samples: NA12878,NA12879
+
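The effect of the pruning described above can be sketched outside Spark. Assuming, purely for illustration, that a sample ID is derived from each BAM file name (as in the `NA12878.slice.bam` layout listed above), the file list is filtered against the sample IDs in the query predicate before any file is opened. The helper names `sample_id` and `prune_files` are hypothetical, not SeQuiLa API:

```python
import os

def sample_id(path):
    # Hypothetical helper: the file-name prefix before the first dot,
    # matching the "<sampleId>.slice.bam" layout listed above.
    return os.path.basename(path).split(".")[0]

def prune_files(paths, wanted_samples):
    # Keep only the BAM files whose sample ID occurs in the WHERE clause.
    return [p for p in paths if sample_id(p) in wanted_samples]

files = [
    "/Users/marek/data/multisample/NA12877.slice.bam",
    "/Users/marek/data/multisample/NA12878.slice.bam",
    "/Users/marek/data/multisample/NA12879.slice.bam",
]
# Only 2 of the 3 files remain to be scanned.
print(prune_files(files, {"NA12878", "NA12879"}))
```

The point of the sketch is that pruning happens at planning time, on file metadata alone, which is why the warning above is emitted before any read is parsed.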
 Using UDFs
 ##########
 
@@ -258,3 +307,63 @@ Parameter is set via coniguration:
 spark.sqlContext.setConf("spark.biodatageeks.rangejoin.useJoinOrder", "true")
 
 
+
+Coverage
+##########
+
+In order to compute the coverage for your sample, you can run a set of queries as follows:
+
+.. code-block:: scala
+
+    val tableNameBAM = "reads"
+    val bamPath = "/data/samples/*.bam"
+    ss.sql("CREATE DATABASE dna")
+    ss.sql("USE dna")
+    ss.sql(
+      s"""
+         |CREATE TABLE ${tableNameBAM}
+         |USING org.biodatageeks.datasources.BAM.BAMDataSource
+         |OPTIONS(path "${bamPath}")
+         |
+       """.stripMargin)
+    ss.sql(s"SELECT * FROM coverage('${tableNameBAM}')").show(5)
+
+    +--------+----------+--------+--------+
+    |sampleId|contigName|position|coverage|
+    +--------+----------+--------+--------+
+    | NA12878|      chr1|     137|       1|
+    | NA12878|      chr1|     138|       1|
+    | NA12878|      chr1|     139|       1|
+    | NA12878|      chr1|     140|       1|
+    | NA12878|      chr1|     141|       1|
+    +--------+----------+--------+--------+
+
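Conceptually, per-base coverage is a pileup: every aligned read increments a counter at each position it spans. The idea behind the output above can be sketched with a difference-array sweep. This is an illustration in plain Python with 1-based inclusive read spans and no CIGAR handling, not SeQuiLa's implementation:

```python
from collections import defaultdict

def coverage(reads):
    """reads: (start, end) alignment spans, 1-based inclusive."""
    # Difference array: +1 where a read starts, -1 just past its end.
    events = defaultdict(int)
    for start, end in reads:
        events[start] += 1
        events[end + 1] -= 1
    # Sweep positions in order, accumulating the running depth.
    positions = sorted(events)
    depth, cov = 0, {}
    for pos, nxt in zip(positions, positions[1:]):
        depth += events[pos]
        if depth > 0:
            for p in range(pos, nxt):
                cov[p] = depth
    return cov

# A single read spanning chr1:137-141 gives depth 1 at each base,
# matching the first rows of the table above.
print(sorted(coverage([(137, 141)]).items()))
# → [(137, 1), (138, 1), (139, 1), (140, 1), (141, 1)]
```

The difference-array form touches each read only twice regardless of its length, which is what makes this formulation attractive for a distributed scan.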
+If you would like to do additional prefiltering of short reads, you can create a temporary table and use it as input to the coverage function, e.g.:
+
+.. code-block:: scala
+
+    ss.sql(s"CREATE TABLE filtered_reads AS SELECT * FROM ${tableNameBAM} WHERE mapq > 10 AND start > 200")
+    ss.sql(s"SELECT * FROM coverage('filtered_reads')").show(5)
+
+    +--------+----------+--------+--------+
+    |sampleId|contigName|position|coverage|
+    +--------+----------+--------+--------+
+    | NA12878|      chr1|     361|       1|
+    | NA12878|      chr1|     362|       1|
+    | NA12878|      chr1|     363|       1|
+    | NA12878|      chr1|     364|       1|
+    | NA12878|      chr1|     365|       1|
+    +--------+----------+--------+--------+
+
+(Experimental, WIP) If you are interested in coverage histograms, e.g. by mapping quality, you can use the following table-valued function:
+
+.. code-block:: scala
+
+    ss.sql(s"SELECT * FROM coverage_hist('${tableNameBAM}') WHERE position=20204").show()
+
+    +--------+----------+--------+------------------+-------------+
+    |sampleId|contigName|position|          coverage|coverageTotal|
+    +--------+----------+--------+------------------+-------------+
+    | NA12878|      chr1|   20204|[1017, 0, 2, 0, 0]|         1019|
+    +--------+----------+--------+------------------+-------------+
+
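The histogram output above can be pictured as splitting the depth at a position across mapping-quality bins, with `coverageTotal` the sum over bins. A sketch under assumed bin edges — the actual bins used by `coverage_hist` are not documented here, and `coverage_hist_at` is a hypothetical helper:

```python
def coverage_hist_at(reads, position, bin_edges=(0, 10, 20, 30, 40)):
    """reads: (start, end, mapq) tuples, spans 1-based inclusive.
    Returns (per-bin read counts, total) for reads overlapping `position`;
    bin i counts mapq values in [bin_edges[i], bin_edges[i+1]),
    with the last bin open-ended. Bin edges here are an assumption."""
    bins = [0] * len(bin_edges)
    for start, end, mapq in reads:
        if start <= position <= end:
            # Index of the right-most edge not exceeding this mapq.
            i = max(j for j, edge in enumerate(bin_edges) if mapq >= edge)
            bins[i] += 1
    return bins, sum(bins)

# Two overlapping reads at position 200: one high-quality (mapq 60),
# one low-quality (mapq 5).
print(coverage_hist_at([(100, 300, 60), (150, 250, 5)], 200))
# → ([1, 0, 0, 0, 1], 2)
```

Under this reading, the `[1017, 0, 2, 0, 0]` array in the table above would mean 1017 reads in the lowest-quality bin and 2 in the middle bin, summing to the `coverageTotal` of 1019.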
