You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SPARK-31672][SQL] Fix loading of timestamps before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle especially `TimestampType` when the passed parameter `rebaseDateTime` is true. In that case, decoded milliseconds/microseconds are rebased from the hybrid calendar to Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianMicros()`.
### Why are the changes needed?
This fixes the bug of loading timestamps before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala>
Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS")
.select($"tsS".cast("timestamp").as("ts")).repartition(1)
.write
.option("parquet.enable.dictionary", true)
.parquet(path)
```
Load the dates back:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts |
+-----------------------+
|1001-01-07 00:32:20.123|
...
|1001-01-07 00:32:20.123|
+-----------------------+
```
Expected values **must be 1001-01-01 01:02:03.123** but not 1001-01-07 00:32:20.123.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts |
+-----------------------+
|1001-01-01 01:02:03.123|
...
|1001-01-01 01:02:03.123|
+-----------------------+
```
### How was this patch tested?
Modified the test `SPARK-31159: rebasing timestamps in write` in `ParquetIOSuite` to checked reading dictionary encoded dates.
Closesapache#28489 from MaxGekk/fix-ts-rebase-parquet-dict-enc.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
0 commit comments