WIP: Add Dataset chapter and refactor read/write content into individual and multi-file sections #221

Closed · wants to merge 8 commits
4 changes: 3 additions & 1 deletion r/content/_bookdown.yml
@@ -24,7 +24,9 @@ edit: https://github.com/apache/arrow-cookbook/edit/main/r/content/%s

 rmd_files: [
   "index.Rmd",
-  "reading_and_writing_data.Rmd",
+  "reading_and_writing_individual_files.Rmd",
+  "reading_and_writing_multifiles.Rmd",
+  "compressed_data.Rmd",
   "creating_arrow_objects.Rmd",
   "specify_data_types_and_schemas.Rmd",
   "arrays.Rmd",
124 changes: 124 additions & 0 deletions r/content/compressed_data.Rmd
@@ -0,0 +1,124 @@
# Reading and Writing Compressed Data

To save a compressed data file to disk, you specify the compression algorithm with the `compression` argument in the Arrow `write_*()` functions.

## Write compressed data

You want to save a file, compressed with a specified compression algorithm.

### Solution

```{r, parquet_gzip}
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write data compressed with the gzip algorithm instead of the default
write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
```

```{r, test_parquet_gzip, opts.label = "test"}
test_that("parquet_gzip", {
file.exists(file.path(td, "iris.parquet"))
})
```

### Discussion

Note that `write_parquet()` uses compression by default. See
`default_parquet_compression()` to check which default is configured on your
machine.
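
If you are unsure whether a given codec was enabled in your build of arrow,
a quick check with `codec_is_available()` (a minimal sketch, not part of the
recipe above) looks like this:

```{r, codec_available}
# Check whether specific compression codecs are enabled in this build of arrow
codec_is_available("gzip")
codec_is_available("zstd")
```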

You can also supply the `compression` argument to `write_dataset()`, as long as
the compression algorithm is compatible with the chosen format.

```{r, dataset_gzip}
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write dataset to file
write_dataset(iris, path = td, compression = "gzip")
```

```{r}
# View files in the directory
list.files(td, recursive = TRUE)
```
```{r, test_dataset_gzip, opts.label = "test"}
test_that("dataset_gzip", {
file.exists(file.path(td, "part-0.parquet"))
})
```

### See also

Some formats write compressed data by default. For more information
on the supported compression algorithms and default settings, see:

* `?write_parquet()`
* `?write_feather()`
* `?write_dataset()`
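
As a further sketch (assuming your build of arrow includes zstd support,
which you can verify with `codec_is_available("zstd")`), the same
`compression` argument works with `write_feather()`:

```{r, feather_zstd}
# Write a Feather (Arrow IPC) file compressed with zstd instead of the
# default lz4
write_feather(iris, file.path(td, "iris.feather"), compression = "zstd")
```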

## Read compressed data

You want to read in data which has been compressed.

### Solution

```{r, read_parquet_compressed}
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write a compressed Parquet file to read back in
write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")

# Read in data
ds <- read_parquet(file.path(td, "iris.parquet")) %>%
  collect()

ds
```

```{r, test_read_parquet_compressed, opts.label = "test"}
test_that("read_parquet_compressed", {
expect_s3_class(ds, "data.frame")
expect_named(
ds,
c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
)
})
```

### Discussion

Note that Arrow automatically detects the compression and you do not have to
supply it in the call to `open_dataset()` or the `read_*()` functions.
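
For example (a minimal sketch reusing the gzip-compressed file written in the
Solution above), the same file opens as a dataset without naming a codec:

```{r, open_compressed_dataset}
# Arrow detects the gzip compression automatically when opening the file
open_dataset(file.path(td, "iris.parquet")) %>%
  collect()
```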

Although the CSV format does not itself support compression, Arrow can read
CSV data that has been compressed, provided the file extension is `.gz`.

```{r, read_compressed_csv}
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write a gzip-compressed CSV file to read back in
write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE)

# Read in data
ds <- open_dataset(td, format = "csv") %>%
  collect()

ds
```

```{r, test_read_compressed_csv, opts.label = "test"}
test_that("read_compressed_csv", {
expect_s3_class(ds, "data.frame")
expect_named(
ds,
c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
)
})
```