[R] Initial datasets content #159

thisisnic · 2022-03-01T21:26:36Z

This PR adds a chapter on working with Datasets, though it is still in draft form right now.

stephhazlitt · 2022-05-18T22:35:36Z

@thisisnic following up on our chat re: the Datasets chapter. I think one content approach might be to follow the cheatsheet chunks, refactoring the current Reading & Writing Data chapter to Reading & Writing Individual Data Files, and moving the small amount of Dataset content from there into a stand-alone chapter on Reading & Writing Multi-file Datasets (which could include addressing #172, #152, #120). I then envision an additional chapter that compares and contrasts Tables and Datasets (#92). What do you think?

thisisnic · 2022-08-30T09:21:37Z

@stephhazlitt Apologies, only saw your comment from May just now as there was some activity on this, but that sounds good to me.

stephhazlitt · 2022-08-30T15:28:32Z

Thanks @thisisnic! After (finally) coming back to this, I decided my suggested approach fragmented the content too much. I have been working on your original PR, so stay tuned :)

thisisnic · 2022-08-30T15:33:38Z

> Thanks @thisisnic! After (finally) coming back to this, I decided my suggested approach fragmented the content too much. I have been working on your original PR, so stay tuned :)

OK, that's fine too, look forward to seeing it! :D

thisisnic

Thanks for updating this PR @stephhazlitt , this is looking great, and just a few changes to suggest here!

r/content/datasets.Rmd

thisisnic · 2022-10-20T09:55:34Z

r/content/datasets.Rmd

+further advantages when using Arrow, as Arrow will only 
+read in the necessary partitioned files needed for any given analysis.
+
+It's possible to read in partitioned data in Parquet, Feather (aka Arrow), and CSV (or 


I wonder if, given the discussions on the mailing list lately, we should refer to it as "Arrow (formerly known as Feather)" or similar? Not sure what the latest with those discussions is though, and how it impacts us in R.

I think this is the way to go given the thread, good catch. I think we should open a separate ticket and review + update the feather/arrow/arrow-ipc naming in the R package and the corresponding documentation.

https://arrow.apache.org/docs/r/reference/write_feather.html
https://arrow.apache.org/cookbook/r/reading-and-writing-data.html#write-an-ipcfeather-v2-file
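
For context on the hunk quoted in this thread, which says Arrow reads only the partitioned files an analysis needs, here is a minimal sketch. It assumes the arrow and dplyr packages are installed and R 4.1 or later for the native pipe; the `airquality` data, the `Month` partition column, and the output path are illustrative choices, not text from the chapter.

```r
library(arrow)
library(dplyr)

# Illustrative output location (an assumption, not a path from the chapter)
out_dir <- file.path(tempdir(), "airquality_by_month")
unlink(out_dir, recursive = TRUE)

# Write one sub-directory per value of Month (Month=5, Month=6, ...)
write_dataset(airquality, out_dir, partitioning = "Month")

# Opening the dataset only discovers the files; rows are read on collect().
# The filter is on the partition column, so only matching partitions are needed.
may <- open_dataset(out_dir) |>
  filter(Month == 5) |>
  collect()
```

Filtering on a column that is not a partition key would still work, but the filter could no longer be answered from the directory names alone.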

thisisnic · 2022-10-20T09:56:48Z

r/content/datasets.Rmd

+
+It's possible to read in partitioned data in Parquet, Feather (aka Arrow), and CSV (or 
+other text-delimited) formats.  If you are choosing a partitioned or multifile format, we 
+recommend Parquet or Feather, both of which can have improved performance 


Is there something we can link to which compares these formats and helps people pick which? If the answer is no, do we want to create a ticket somewhere to suggest someone write something on this topic?
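
Until such a comparison exists, a reader could get a rough first impression by writing the same data in each format and looking at the bytes on disk. The sketch below assumes the arrow package is installed; `airquality` and the directory names are placeholders, and size on a tiny data frame says little about read or query performance, so this is not a benchmark.

```r
library(arrow)

# Illustrative scratch location (an assumption)
base_dir <- file.path(tempdir(), "format_demo")
unlink(base_dir, recursive = TRUE)

formats <- c("parquet", "feather", "csv")

# Write the same data frame once per format and total the file sizes in bytes
sizes <- vapply(formats, function(fmt) {
  out_dir <- file.path(base_dir, fmt)
  dir.create(out_dir, recursive = TRUE)
  write_dataset(airquality, out_dir, format = fmt)
  sum(file.size(list.files(out_dir, full.names = TRUE, recursive = TRUE)))
}, numeric(1))

sizes
```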

thisisnic · 2022-10-20T09:59:58Z

r/content/datasets.Rmd

+
+  expect_true(file.exists("starwars_data"))
+  expect_length(list.files("starwars_data"), 1)
+


Do we want to delete these directories we've created afterwards?
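
One way the cleanup could look, using only base R. This is a sketch: the directory is created under `tempdir()` with a placeholder file so the snippet runs on its own, whereas the recipe writes `starwars_data` with `write_dataset()`.

```r
# Stand-in for the directory the recipe creates (path is an assumption)
out_dir <- file.path(tempdir(), "starwars_data")
dir.create(out_dir, showWarnings = FALSE)
file.create(file.path(out_dir, "part-0.parquet"))

# After the expectations have run, remove the directory and its contents
unlink(out_dir, recursive = TRUE)
dir_removed <- !dir.exists(out_dir)
```

Writing under `tempdir()` in the first place is another option, since R removes that directory when the session ends.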

thisisnic · 2022-10-20T10:00:57Z

r/content/datasets.Rmd

+Note that in the example above, when there was an `NA` value in the `homeworld`
+column, these values are written to the `homeworld=__HIVE_DEFAULT_PARTITION__`
+directory.


This is an excellent detail to include
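
A small sketch of the behaviour described in the hunk, assuming the arrow package is installed. The three-row data frame and the output path are made up for illustration; only the `homeworld` column name comes from the chapter text.

```r
library(arrow)

# Toy data with one missing value in the partition column (hypothetical rows)
df <- data.frame(
  name = c("Luke", "Leia", "Yoda"),
  homeworld = c("Tatooine", "Alderaan", NA)
)

out_dir <- file.path(tempdir(), "hive_na_demo")
unlink(out_dir, recursive = TRUE)

# Hive-style partitioning is the default, giving homeworld=<value> directories
write_dataset(df, out_dir, partitioning = "homeworld")

partition_dirs <- list.files(out_dir)
partition_dirs
```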

thisisnic · 2022-10-20T10:03:10Z

r/content/datasets.Rmd

+### Solution
+
+```{r, write_dataset_csv}
+# Need to update this example as we can't write list columns to CSV :(


How about we either make a subset of starwars dataset at the start of this chapter/section to use later, which doesn't include the list column, or just acknowledge the list column issue in the discussion section?

I refactored to stick with the airquality dataset, to be more consistent with the rest of the read/write material in the R cookbook.
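
For reference, the first option raised above (dropping list columns before writing CSV) could be sketched as follows. It assumes the arrow package is installed; the toy data frame stands in for `starwars`, and the column names and output path are illustrative.

```r
library(arrow)

# Toy data with a list column, standing in for starwars (hypothetical rows)
df <- data.frame(name = c("Luke", "Leia"), height = c(172, 150))
df$films <- list(c("A New Hope", "Return of the Jedi"), "A New Hope")

# Keep only the columns that are not lists, since CSV cannot hold nested values
is_list_col <- vapply(df, is.list, logical(1))
flat_df <- df[, !is_list_col, drop = FALSE]

out_dir <- file.path(tempdir(), "flat_csv_demo")
unlink(out_dir, recursive = TRUE)
write_dataset(flat_df, out_dir, format = "csv")

csv_files <- list.files(out_dir, pattern = "\\.csv$")
```

Using a data frame with no list columns, as the airquality refactor does, avoids this step entirely.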

stephhazlitt · 2022-10-27T04:32:06Z

@thisisnic Thanks for the review and great suggestions. I have incorporated them and/or opened tickets as placeholders for further work.

I was inspired by the single file API and Dataset API approach in @fmichonneau's blog post, and have tried to subtly weave in this framing by having two separate read+write chapters. The datasets.Rmd was already mostly read+write recipes, so I changed the title and pulled over some content from reading_and_writing_data.Rmd. Let me know if you think this approach is promising.

I wonder about getting what is here clean enough to merge, and then tackling improvements+adding more content in subsequent (and smaller) PRs?

thisisnic

Thanks @stephhazlitt for taking over the task of getting this PR moving again! I'll merge this shortly! :D

thisisnic added 5 commits March 1, 2022 21:25

Initial datasets content

324a31a

Add intro

fc5b305

Add more recipes

5b6ad34

Remove unnecessary dplyr calls

7df0589

Add something on column names

65e675d

stephhazlitt mentioned this pull request May 26, 2022

WIP: Add Dataset chapter and refactor read/write content into individual and multi-file sections #221

Closed

stephhazlitt added 3 commits August 26, 2022 15:47

Merge branch 'main' into datasets

4824adf

add apache header

13d1334

include Dataset objects

ebb677a

stephhazlitt added 8 commits September 12, 2022 15:14

Merge branch 'apache:main' into datasets

86b77bc

text edit

ddaa1fd

more intro edits

f23b09c

Merge branch 'main' into datasets

a4405f6

consistent camel case

34ee5c7

temp fix for test

033eed1

data vs dataset

6041d34

Merge branch 'main' into datasets

342be13

thisisnic commented Oct 20, 2022

stephhazlitt and others added 8 commits October 20, 2022 13:51

Update r/content/datasets.Rmd

fabaad0

Co-authored-by: Nic Crane <[email protected]>

Update r/content/reading_and_writing_data.Rmd

e127b92

Co-authored-by: Nic Crane <[email protected]>

use single vs dataset api theme

85a769e

Merge branch 'apache:main' into datasets

ef80ad8

mv up datasets

e2b6f18

focus on single file API

08138bc

focus on read-write dataset api

4f303ec

Merge branch 'datasets' of https://github.com/thisisnic/arrow-cookbook into datasets

4d9da6e

thisisnic marked this pull request as ready for review November 8, 2022 12:47

thisisnic commented Nov 8, 2022

thisisnic merged commit 7df8c28 into apache:main Nov 8, 2022
