
Commit 339c0f9

nchammas authored and HyukjinKwon committed
[SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options
### What changes were proposed in this pull request?

This PR adds a doc builder for Spark SQL's configuration options. Here's what the new Spark SQL config docs look like ([configuration.html.zip](https://github.com/apache/spark/files/4172109/configuration.html.zip)):

![Screen Shot 2020-02-07 at 12 13 23 PM](https://user-images.githubusercontent.com/1039369/74050007-425b5480-49a3-11ea-818c-42700c54d1fb.png)

Compare this to the [current docs](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql):

![Screen Shot 2020-02-04 at 4 55 10 PM](https://user-images.githubusercontent.com/1039369/73790828-24a5a980-476f-11ea-998c-12cd613883e8.png)

### Why are the changes needed?

There is no visibility into the various Spark SQL configs on [the config docs page](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql).

### Does this PR introduce any user-facing change?

No, apart from new documentation.

### How was this patch tested?

I tested this manually by building the docs and reviewing them in my browser.

Closes apache#27459 from nchammas/SPARK-30510-spark-sql-options.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent a7ae77a commit 339c0f9

File tree: 8 files changed, +163 −67 lines changed


docs/.gitignore

+1
@@ -0,0 +1 @@
+sql-configs.html

docs/configuration.md

+7 −39

@@ -2399,47 +2399,15 @@ the driver or executor, or, in the absence of that value, the number of cores av
 Please refer to the [Security](security.html) page for available options on how to secure different
 Spark subsystems.
 
-### Spark SQL
-
-Running the <code>SET -v</code> command will show the entire list of the SQL configuration.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
 
-{% highlight scala %}
-// spark is an existing SparkSession
-spark.sql("SET -v").show(numRows = 200, truncate = false)
-{% endhighlight %}
-
-</div>
-
-<div data-lang="java" markdown="1">
-
-{% highlight java %}
-// spark is an existing SparkSession
-spark.sql("SET -v").show(200, false);
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-
-{% highlight python %}
-# spark is an existing SparkSession
-spark.sql("SET -v").show(n=200, truncate=False)
-{% endhighlight %}
-
-</div>
-
-<div data-lang="r" markdown="1">
-
-{% highlight r %}
-sparkR.session()
-properties <- sql("SET -v")
-showDF(properties, numRows = 200, truncate = FALSE)
-{% endhighlight %}
+{% for static_file in site.static_files %}
+{% if static_file.name == 'sql-configs.html' %}
+### Spark SQL
 
-</div>
-</div>
+{% include_relative sql-configs.html %}
+{% break %}
+{% endif %}
+{% endfor %}
 
 
 ### Spark Streaming
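The removed tabs are still a valid way to inspect SQL configuration at runtime. For reference, a minimal PySpark sketch equivalent to the deleted Python example:

```python
from pyspark.sql import SparkSession

# Minimal sketch, mirroring the Python example removed from configuration.md above:
# `SET -v` lists every SQL configuration property with its value and meaning.
spark = SparkSession.builder.getOrCreate()
spark.sql("SET -v").show(n=200, truncate=False)
```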

sql/README.md

+1 −1

@@ -9,4 +9,4 @@ Spark SQL is broken up into four subprojects:
 - Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
 - HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
 
-Running `./sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`.
+Running `./sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`, and SQL configuration documentation that gets included as part of `configuration.md` in the main `docs` directory.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+18 −17

@@ -324,11 +324,11 @@ object SQLConf {
     .doc("Configures the maximum size in bytes for a table that will be broadcast to all worker " +
       "nodes when performing a join. By setting this value to -1 broadcasting can be disabled. " +
       "Note that currently statistics are only supported for Hive Metastore tables where the " +
-      "command <code>ANALYZE TABLE &lt;tableName&gt; COMPUTE STATISTICS noscan</code> has been " +
+      "command `ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan` has been " +
       "run, and file-based data source tables where the statistics are computed directly on " +
       "the files of data.")
     .bytesConf(ByteUnit.BYTE)
-    .createWithDefault(10L * 1024 * 1024)
+    .createWithDefaultString("10MB")
 
   val LIMIT_SCALE_UP_FACTOR = buildConf("spark.sql.limit.scaleUpFactor")
     .internal()
@@ -402,7 +402,7 @@ object SQLConf {
         s"an effect when '${ADAPTIVE_EXECUTION_ENABLED.key}' and " +
         s"'${REDUCE_POST_SHUFFLE_PARTITIONS_ENABLED.key}' is enabled.")
       .bytesConf(ByteUnit.BYTE)
-      .createWithDefault(64 * 1024 * 1024)
+      .createWithDefaultString("64MB")
 
   val SHUFFLE_MAX_NUM_POSTSHUFFLE_PARTITIONS =
     buildConf("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions")
@@ -436,7 +436,7 @@ object SQLConf {
       .doc("Configures the minimum size in bytes for a partition that is considered as a skewed " +
        "partition in adaptive skewed join.")
      .bytesConf(ByteUnit.BYTE)
-      .createWithDefault(64 * 1024 * 1024)
+      .createWithDefaultString("64MB")
 
   val ADAPTIVE_EXECUTION_SKEWED_PARTITION_FACTOR =
     buildConf("spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionFactor")
@@ -770,7 +770,7 @@ object SQLConf {
   val BROADCAST_TIMEOUT = buildConf("spark.sql.broadcastTimeout")
     .doc("Timeout in seconds for the broadcast wait time in broadcast joins.")
     .timeConf(TimeUnit.SECONDS)
-    .createWithDefault(5 * 60)
+    .createWithDefaultString(s"${5 * 60}")
 
   // This is only used for the thriftserver
   val THRIFTSERVER_POOL = buildConf("spark.sql.thriftserver.scheduler.pool")
@@ -830,7 +830,7 @@ object SQLConf {
     .createWithDefault(true)
 
   val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.sources.bucketing.maxBuckets")
-    .doc("The maximum number of buckets allowed. Defaults to 100000")
+    .doc("The maximum number of buckets allowed.")
     .intConf
     .checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets must be greater than 0")
     .createWithDefault(100000)
@@ -1022,7 +1022,7 @@ object SQLConf {
       "This configuration is effective only when using file-based sources such as Parquet, JSON " +
       "and ORC.")
     .bytesConf(ByteUnit.BYTE)
-    .createWithDefault(128 * 1024 * 1024) // parquet.block.size
+    .createWithDefaultString("128MB") // parquet.block.size
 
   val FILES_OPEN_COST_IN_BYTES = buildConf("spark.sql.files.openCostInBytes")
     .internal()
@@ -1161,7 +1161,8 @@ object SQLConf {
 
   val VARIABLE_SUBSTITUTE_ENABLED =
     buildConf("spark.sql.variable.substitute")
-      .doc("This enables substitution using syntax like ${var} ${system:var} and ${env:var}.")
+      .doc("This enables substitution using syntax like `${var}`, `${system:var}`, " +
+        "and `${env:var}`.")
       .booleanConf
       .createWithDefault(true)
 
@@ -1171,7 +1172,7 @@ object SQLConf {
     .doc("Enable two-level aggregate hash map. When enabled, records will first be " +
       "inserted/looked-up at a 1st-level, small, fast map, and then fallback to a " +
       "2nd-level, larger, slower map when 1st level is full or keys cannot be found. " +
-      "When disabled, records go directly to the 2nd level. Defaults to true.")
+      "When disabled, records go directly to the 2nd level.")
     .booleanConf
     .createWithDefault(true)
 
@@ -1325,10 +1326,10 @@ object SQLConf {
 
   val STREAMING_STOP_TIMEOUT =
     buildConf("spark.sql.streaming.stopTimeout")
-      .doc("How long to wait for the streaming execution thread to stop when calling the " +
-        "streaming query's stop() method in milliseconds. 0 or negative values wait indefinitely.")
+      .doc("How long to wait in milliseconds for the streaming execution thread to stop when " +
+        "calling the streaming query's stop() method. 0 or negative values wait indefinitely.")
       .timeConf(TimeUnit.MILLISECONDS)
-      .createWithDefault(0L)
+      .createWithDefaultString("0")
 
   val STREAMING_NO_DATA_PROGRESS_EVENT_INTERVAL =
     buildConf("spark.sql.streaming.noDataProgressEventInterval")
@@ -1611,10 +1612,10 @@ object SQLConf {
   val PANDAS_UDF_BUFFER_SIZE =
     buildConf("spark.sql.execution.pandas.udf.buffer.size")
       .doc(
-        s"Same as ${BUFFER_SIZE} but only applies to Pandas UDF executions. If it is not set, " +
-        s"the fallback is ${BUFFER_SIZE}. Note that Pandas execution requires more than 4 bytes. " +
-        "Lowering this value could make small Pandas UDF batch iterated and pipelined; however, " +
-        "it might degrade performance. See SPARK-27870.")
+        s"Same as `${BUFFER_SIZE.key}` but only applies to Pandas UDF executions. If it is not " +
+        s"set, the fallback is `${BUFFER_SIZE.key}`. Note that Pandas execution requires more " +
+        "than 4 bytes. Lowering this value could make small Pandas UDF batch iterated and " +
+        "pipelined; however, it might degrade performance. See SPARK-27870.")
       .fallbackConf(BUFFER_SIZE)
 
   val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_NAME =
@@ -2039,7 +2040,7 @@ object SQLConf {
     .checkValue(i => i >= 0 && i <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH, "Invalid " +
       "value for 'spark.sql.maxPlanStringLength'. Length must be a valid string length " +
      "(nonnegative and shorter than the maximum size).")
-    .createWithDefault(ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH)
+    .createWithDefaultString(s"${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}")
 
   val SET_COMMAND_REJECTS_SPARK_CORE_CONFS =
     buildConf("spark.sql.legacy.setCommandRejectsSparkCoreConfs")

sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala

+7
@@ -29,6 +29,7 @@ import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.execution.{ExplainMode, QueryExecution}
 import org.apache.spark.sql.execution.arrow.ArrowConverters
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.DataType
 
 private[sql] object PythonSQLUtils {
@@ -39,6 +40,12 @@ private[sql] object PythonSQLUtils {
     FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
   }
 
+  def listSQLConfigs(): Array[(String, String, String)] = {
+    val conf = new SQLConf()
+    // Py4J doesn't seem to translate Seq well, so we convert to an Array.
+    conf.getAllDefinedConfs.toArray
+  }
+
   /**
    * Python callable function to read a file in Arrow stream format and create a [[RDD]]
    * using each serialized ArrowRecordBatch as a partition.
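For context, `listSQLConfigs()` is consumed from Python over Py4J. A rough sketch of that call path, mirroring what `sql/gen-sql-config-docs.py` below does (assumes a built Spark so `launch_gateway` can start a JVM):

```python
from pyspark.java_gateway import launch_gateway

# Rough sketch of the Py4J call path used by sql/gen-sql-config-docs.py.
# Assumes a Spark build is available (the script itself is run via bin/spark-submit).
jvm = launch_gateway().jvm
configs = jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listSQLConfigs()

# Each element is a Scala Tuple3 of (config name, default value string, description).
for i in range(min(3, len(configs))):
    entry = configs[i]
    print(entry._1(), "|", entry._2(), "|", entry._3())
```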

sql/create-docs.sh

+8 −6

@@ -17,7 +17,7 @@
 # limitations under the License.
 #
 
-# Script to create SQL API docs. This requires `mkdocs` and to build
+# Script to create SQL API and config docs. This requires `mkdocs` and to build
 # Spark first. After running this script the html docs can be found in
 # $SPARK_HOME/sql/site
 
@@ -39,14 +39,16 @@ fi
 
 pushd "$FWDIR" > /dev/null
 
-# Now create the markdown file
 rm -fr docs
 mkdir docs
-echo "Generating markdown files for SQL documentation."
-"$SPARK_HOME/bin/spark-submit" gen-sql-markdown.py
 
-# Now create the HTML files
-echo "Generating HTML files for SQL documentation."
+echo "Generating SQL API Markdown files."
+"$SPARK_HOME/bin/spark-submit" gen-sql-api-docs.py
+
+echo "Generating SQL configuration table HTML file."
+"$SPARK_HOME/bin/spark-submit" gen-sql-config-docs.py
+
+echo "Generating HTML files for SQL API documentation."
 mkdocs build --clean
 rm -fr docs
 
sql/gen-sql-markdown.py → sql/gen-sql-api-docs.py

+4 −4

@@ -15,10 +15,11 @@
 # limitations under the License.
 #
 
-import sys
 import os
 from collections import namedtuple
 
+from pyspark.java_gateway import launch_gateway
+
 ExpressionInfo = namedtuple(
     "ExpressionInfo", "className name usage arguments examples note since deprecated")
 
@@ -219,8 +220,7 @@ def generate_sql_markdown(jvm, path):
 
 
 if __name__ == "__main__":
-    from pyspark.java_gateway import launch_gateway
-
     jvm = launch_gateway().jvm
-    markdown_file_path = "%s/docs/index.md" % os.path.dirname(sys.argv[0])
+    spark_root_dir = os.path.dirname(os.path.dirname(__file__))
+    markdown_file_path = os.path.join(spark_root_dir, "sql/docs/index.md")
     generate_sql_markdown(jvm, markdown_file_path)

sql/gen-sql-config-docs.py

+117
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import re
+from collections import namedtuple
+from textwrap import dedent
+
+# To avoid adding a new direct dependency, we import markdown from within mkdocs.
+from mkdocs.structure.pages import markdown
+from pyspark.java_gateway import launch_gateway
+
+SQLConfEntry = namedtuple(
+    "SQLConfEntry", ["name", "default", "description"])
+
+
+def get_public_sql_configs(jvm):
+    sql_configs = [
+        SQLConfEntry(
+            name=_sql_config._1(),
+            default=_sql_config._2(),
+            description=_sql_config._3(),
+        )
+        for _sql_config in jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listSQLConfigs()
+    ]
+    return sql_configs
+
+
+def generate_sql_configs_table(sql_configs, path):
+    """
+    Generates an HTML table at `path` that lists all public SQL
+    configuration options.
+
+    The table will look something like this:
+
+    ```html
+    <table class="table">
+    <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+
+    <tr>
+        <td><code>spark.sql.adaptive.enabled</code></td>
+        <td>false</td>
+        <td><p>When true, enable adaptive query execution.</p></td>
+    </tr>
+
+    ...
+
+    </table>
+    ```
+    """
+    value_reference_pattern = re.compile(r"^<value of (\S*)>$")
+
+    with open(path, 'w') as f:
+        f.write(dedent(
+            """
+            <table class="table">
+            <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+            """
+        ))
+        for config in sorted(sql_configs, key=lambda x: x.name):
+            if config.default == "<undefined>":
+                default = "(none)"
+            elif config.default.startswith("<value of "):
+                referenced_config_name = value_reference_pattern.match(config.default).group(1)
+                default = "(value of <code>{}</code>)".format(referenced_config_name)
+            else:
+                default = config.default
+
+            if default.startswith("<"):
+                raise Exception(
+                    "Unhandled reference in SQL config docs. Config '{name}' "
+                    "has default '{default}' that looks like an HTML tag."
+                    .format(
+                        name=config.name,
+                        default=config.default,
+                    )
+                )
+
+            f.write(dedent(
+                """
+                <tr>
+                    <td><code>{name}</code></td>
+                    <td>{default}</td>
+                    <td>{description}</td>
+                </tr>
+                """
+                .format(
+                    name=config.name,
+                    default=default,
+                    description=markdown.markdown(config.description),
+                )
+            ))
+        f.write("</table>\n")
+
+
+if __name__ == "__main__":
+    jvm = launch_gateway().jvm
+    sql_configs = get_public_sql_configs(jvm)
+
+    spark_root_dir = os.path.dirname(os.path.dirname(__file__))
+    sql_configs_table_path = os.path.join(spark_root_dir, "docs/sql-configs.html")
+
+    generate_sql_configs_table(sql_configs, path=sql_configs_table_path)
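As a usage note, `generate_sql_configs_table` only needs `(name, default, description)` triples, so it can be smoke-tested without a JVM. A hypothetical sketch, assuming the module has been made importable (for example, copied to `gen_sql_config_docs.py`) and that `mkdocs` is installed; the entries below are illustrative only:

```python
# Hypothetical smoke test; the module name and config entries are illustrative.
from gen_sql_config_docs import SQLConfEntry, generate_sql_configs_table

sample_configs = [
    SQLConfEntry(
        name="spark.sql.adaptive.enabled",
        default="false",
        description="When true, enable adaptive query execution.",
    ),
    SQLConfEntry(
        name="spark.sql.execution.pandas.udf.buffer.size",
        default="<value of spark.buffer.size>",  # exercises the fallback-reference branch
        description="Same as `spark.buffer.size` but only applies to Pandas UDF executions.",
    ),
]

# Writes a small HTML table like the one embedded in docs/configuration.md.
generate_sql_configs_table(sample_configs, path="/tmp/sql-configs-test.html")
```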
