WIP: Add Dataset chapter and refactor read/write content into individual and multi-file sections #221

Closed · wants to merge 8 commits
4 changes: 3 additions & 1 deletion r/content/_bookdown.yml
@@ -24,7 +24,9 @@ edit: https://github.com/apache/arrow-cookbook/edit/main/r/content/%s

 rmd_files: [
   "index.Rmd",
-  "reading_and_writing_data.Rmd",
+  "reading_and_writing_individual_files.Rmd",
+  "reading_and_writing_multifiles.Rmd",
+  "compressed_data.Rmd",
   "creating_arrow_objects.Rmd",
   "specify_data_types_and_schemas.Rmd",
   "arrays.Rmd",
124 changes: 124 additions & 0 deletions r/content/compressed_data.Rmd
@@ -0,0 +1,124 @@
# Reading and Writing Compressed Data

To save a compressed data file to disk, you specify the compression algorithm with the `compression` argument in the Arrow `write_*()` functions.

## Write compressed data

You want to save a file, compressed with a specified compression algorithm.

### Solution

```{r, parquet_gzip}
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write data compressed with the gzip algorithm instead of the default
write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
```

```{r, test_parquet_gzip, opts.label = "test"}
test_that("parquet_gzip", {
file.exists(file.path(td, "iris.parquet"))
})
```

### Discussion

Note that `write_parquet()` uses compression by default. See
`default_parquet_compression()` to check which default is configured on your
machine.
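
If you are unsure whether a given codec was enabled in your build of arrow,
a quick check with `codec_is_available()` (a minimal sketch, not part of the
recipe above) looks like this:

```{r, codec_available}
# Check whether specific compression codecs are enabled in this build of arrow
codec_is_available("gzip")
codec_is_available("zstd")
```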

You can also supply the `compression` argument to `write_dataset()`, as long as
the compression algorithm is compatible with the chosen format.

```{r, dataset_gzip}
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write dataset to file
write_dataset(iris, path = td, compression = "gzip")
```

```{r}
# View files in the directory
list.files(td, recursive = TRUE)
```
```{r, test_dataset_gzip, opts.label = "test"}
test_that("dataset_gzip", {
file.exists(file.path(td, "part-0.parquet"))
})
```

### See also

Some formats write compressed data by default. For more information
on the supported compression algorithms and default settings, see:

* `?write_parquet()`
* `?write_feather()`
* `?write_dataset()`
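
As a further sketch (assuming your build of arrow includes zstd support,
which you can verify with `codec_is_available("zstd")`), the same
`compression` argument works with `write_feather()`:

```{r, feather_zstd}
# Write a Feather (Arrow IPC) file compressed with zstd instead of the
# default lz4
write_feather(iris, file.path(td, "iris.feather"), compression = "zstd")
```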

## Read compressed data

You want to read in data which has been compressed.

### Solution

```{r, read_parquet_compressed}
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write a compressed Parquet file to read back in
write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")

# Read in data
ds <- read_parquet(file.path(td, "iris.parquet")) %>%
  collect()

ds
```

```{r, test_read_parquet_compressed, opts.label = "test"}
test_that("read_parquet_compressed", {
expect_s3_class(ds, "data.frame")
expect_named(
ds,
c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
)
})
```

### Discussion

Note that Arrow automatically detects the compression and you do not have to
supply it in the call to `open_dataset()` or the `read_*()` functions.
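
For example (a minimal sketch reusing the gzip-compressed file written in the
Solution above), the same file opens as a dataset without naming a codec:

```{r, open_compressed_dataset}
# Arrow detects the gzip compression automatically when opening the file
open_dataset(file.path(td, "iris.parquet")) %>%
  collect()
```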

Although the CSV format does not itself support compression, Arrow can read
CSV data that has been compressed, provided the file extension is `.gz`.

```{r, read_compressed_csv}
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write a gzip-compressed CSV file to read back in
write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE)

# Read in data
ds <- open_dataset(td, format = "csv") %>%
  collect()

ds
```

```{r, test_read_compressed_csv, opts.label = "test"}
test_that("read_compressed_csv", {
expect_s3_class(ds, "data.frame")
expect_named(
ds,
c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
)
})
```