-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R] Initial datasets content #159
Conversation
@thisisnic following up on our chat re: the |
@stephhazlitt Apologies, only saw your comment from May just now as there was some activity on this, but that sounds good to me. |
Thanks @thisisnic! After (finally) coming back to this, I decided my suggested approach fragmented the content too much. I have been working on your original PR, so stay tuned :) |
OK, that's fine too, look forward to seeing it! :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this PR @stephhazlitt , this is looking great, and just a few changes to suggest here!
r/content/datasets.Rmd
Outdated
further advantages when using Arrow, as Arrow will only | ||
read in the necessary partitioned files needed for any given analysis. | ||
|
||
It's possible to read in partitioned data in Parquet, Feather (aka Arrow), and CSV (or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if, given the discussions on the mailing list lately, we refer to it as "Arrow (formerly known as Feather)" or similar? Not sure what the latest with those discussions is though, and how it impacts us in R.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the way to go given the thread, good catch. I think we should open a separate ticket and review + update the feather/arrow/arrow-ipc naming in the R package and the corresponding documentation.
https://arrow.apache.org/docs/r/reference/write_feather.html
https://arrow.apache.org/cookbook/r/reading-and-writing-data.html#write-an-ipcfeather-v2-file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r/content/datasets.Rmd
Outdated
|
||
It's possible to read in partitioned data in Parquet, Feather (aka Arrow), and CSV (or | ||
other text-delimited) formats. If you are choosing a partitioned or multifile format, we | ||
recommend Parquet or Feather, both of which can have improved performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something we can link to which compares these formats and helps people pick which? If the answer is no, do we want to create a ticket somewhere to suggest someone write something on this topic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r/content/datasets.Rmd
Outdated
|
||
expect_true(file.exists("starwars_data")) | ||
expect_length(list.files("starwars_data"), 1) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to delete these directories we've created afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
r/content/datasets.Rmd
Outdated
Note that in the example above, when there was an `NA` value in the `homeworld` | ||
column, these values are written to the `homeworld=__HIVE_DEFAULT_PARTITION__` | ||
directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent detail to include
r/content/datasets.Rmd
Outdated
### Solution | ||
|
||
```{r, write_dataset_csv} | ||
# Need to update this example as we can't write list columns to CSV :( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we either make a subset of starwars dataset at the start of this chapter/section to use later, which doesn't include the list column, or just acknowledge the list column issue in the discussion section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored to stick with the airquality
dataset, to be more consistent with the rest fo the read/write material in the R cookbook.
Co-authored-by: Nic Crane <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
@thisisnic Thanks for the review and great suggestions. I have incorporated them and/or opened tickets as placeholders for further work. I was inspired by the single file API and Dataset API approach in @fmichonneau's blog post, and have tried to subtly weave in this framing by having two separate read+write chapters. The I wonder about getting what is here clean enough to merge, and then tackling improvements+adding more content in subsequent (and smaller) PRs? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @stephhazlitt for taking over the task of getting this PR moving again! I'll merge this shortly! :D
This PR adds a chapter on working with Datasets, though is still in draft form right now.