Update reproducibility chapter
RohanAlexander committed Oct 19, 2024
1 parent d944bc5 commit ff02685
Showing 396 changed files with 961 additions and 884 deletions.
4 changes: 2 additions & 2 deletions 00-errata.qmd
@@ -10,9 +10,9 @@ Chapman and Hall/CRC published this book in July 2023. You can purchase that [he
This online version has some updates to what was printed. An online version that matches the print version is available [here](https://rohanalexander.github.io/telling_stories-published/).
:::

*Last updated: 15 October 2024.*
*Last updated: 19 October 2024.*

The book was reviewed by Piotr Fryzlewicz in *The American Statistician* [@Fryzlewicz2024] and Nick Cox on [Amazon](https://www.amazon.com/gp/customer-reviews/R3S602G9RUDOF/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=1032134771). I am grateful they gave such a lot of their time to provide the review, as well as their corrections and suggestions.
The book was reviewed by Piotr Fryzlewicz in *The American Statistician* [@Fryzlewicz2024] and Nick Cox on [Amazon](https://www.amazon.com/gp/customer-reviews/R3S602G9RUDOF/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=1032134771). I am grateful that they gave such a lot of their time to provide the review, as well as their corrections and suggestions.

Since the publication of this book in July 2023, there have been a variety of changes in the world. The rise of generative AI has changed the way that people code, Python has become easier to integrate alongside R because of Quarto, and packages continue to update (not to mention a new cohort of students has started going through the book). One advantage of having an online version is that I can make improvements.

2 changes: 1 addition & 1 deletion 02-drinking_from_a_fire_hose.qmd
@@ -168,7 +168,7 @@ install.packages("tidyverse")
install.packages("janitor")
```

Now that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code can be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (@fig-quarto-australian-elections-5).
Now that the packages are installed, they need to be loaded. As that package installation step only needs to be done once per computer, that code should be commented out so that it is not accidentally run, or even just removed. Additionally, we can remove the message that printed when we installed the packages (@fig-quarto-australian-elections-5).

```{r}
#| echo: true
76 changes: 3 additions & 73 deletions 03-workflow.qmd
@@ -43,7 +43,6 @@ Chapman and Hall/CRC published this book in July 2023. You can purchase that [he
- `renv` [@renv]
- `reprex` [@reprex]
- `styler` [@styler]
- `tictoc` [@Izrailev2014]
- `tidyverse` [@tidyverse]
- `tinytex` [@tinytex]
- `usethis` [@usethis]
@@ -60,7 +59,6 @@ library(lintr)
library(renv)
library(reprex)
library(styler)
library(tictoc)
library(tidyverse)
library(tinytex)
library(usethis)
@@ -80,7 +78,7 @@ If science is about systematically building and organizing knowledge in terms of
- Reproducible research is when "[a]uthors provide all the necessary data and the computer codes to run the analysis again, re-creating the results."
- A replication is a study "that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses."

Regardless of what it is specifically called, @Gelman2016 identifies how large an issue the lack of it is in various social sciences. The problem with work that is not reproducible is that it does not contribute to our stock of knowledge about the world. This is wasteful and potentially even unethical. Since @Gelman2016, a great deal of work has been done in many social sciences and the situation has improved a little, but much work remains. That is also the case in the life sciences [@heil2021reproducibility] and computer science [@pineau2021improving].\index{computer science}
Regardless of what it is specifically called, @Gelman2016 identifies how large an issue the lack of it is in various social sciences. Work that is not reproducible does not contribute to our stock of knowledge about the world. This is wasteful and potentially even unethical. Since @Gelman2016, a great deal of work has been done in many social sciences and the situation has improved a little, but much work remains. That is also the case in the life sciences [@heil2021reproducibility], cancer research [@Begley2012; @Mullard2021], and computer science [@pineau2021improving].\index{computer science}

Some of the examples that @Gelman2016 talks about are not that important in the scheme of things. But at the same time, we saw, and continue to see, similar approaches being used in areas with big impacts. For instance, many governments have created "nudge" units that implement public policy [@sunstein2017economics] even though there is evidence that some of the claims lack credibility [@nonudge; @gelmannudge].\index{public policy} Governments are increasingly using algorithms that they do not make open [@chouldechova18a]. And @herndon2014does document how research in economics that was used by governments to justify austerity policies following the 2007–2008 financial crisis turned out to not be reproducible.\index{economics}

@@ -782,7 +780,7 @@ It is difficult to define a complete and general suite of tests, but broadly we
5) duplicates, and
6) regression results.
We do all this initially on our simulated data and then move to the real data. It is possible to write an infinite number of tests but a smaller number of high-quality tests is better than many thoughtless tests.
We do all this initially on our simulated data and then move to the real data. This mirrors the evolution of testing during the Apollo Program. Initially testing occurred based on expectations of requirements, and these tests were later updated to take into account actual launch measurements [@testingforsuccess, p. 21]. It is possible to write an infinite number of tests but a smaller number of high-quality tests is better than many thoughtless tests.
One type of test is an "assertion".\index{testing!assertion} Assertions are written throughout the code to check whether something is true and stop the code from running if not [@researchsoftware, p. 272]. For instance, you might assert that a variable should be numeric. If it was tested against this assertion and found to be a character, then the test would fail and the script would stop running. Assertion tests in data science will typically be used in data cleaning and preparation scripts. We have more to say about these in @sec-clean-and-prepare. Unit tests check some complete aspect of code [@researchsoftware, p. 274].\index{testing!unit tests} We will consider them more in @sec-its-just-a-linear-model when we consider modeling.
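The assertion described above, that a variable should be numeric, can be sketched in base R with `stopifnot()`. This is a minimal illustration rather than the book's own code: the `survey_data` data frame and its `age` column are hypothetical, and `tryCatch()` is used only so that the failing assertion does not halt the sketch itself.

```r
# Hypothetical data: `age` was accidentally read in as character
survey_data <- data.frame(
  age = c("34", "51", "27"),
  stringsAsFactors = FALSE
)

# The assertion fails because `age` is not numeric. In a cleaning script the
# error would stop the script; tryCatch() captures it here only so that this
# sketch can run to completion.
assertion_outcome <- tryCatch(
  {
    stopifnot(is.numeric(survey_data$age))
    "passed"
  },
  error = function(e) "failed"
)

# After fixing the class, the same assertion passes silently
survey_data$age <- as.numeric(survey_data$age)
stopifnot(is.numeric(survey_data$age))
```

In a real data cleaning script the assertion would be left bare, without `tryCatch()`, so that a failure stops the script at the point where the problem occurs.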
@@ -957,72 +955,9 @@ mtcars <-
```
### Parallel processing
Sometimes code is slow because the computer needs to do the same thing many times.\index{efficiency!parallel processing} We may be able to take advantage of this and enable these jobs to be done at the same time using parallel processing. This will be especially useful starting from @sec-its-just-a-linear-model for modeling.
After installing and loading `tictoc` we can use `tic()` and `toc()` to time various aspects of our code.\index{efficiency!timing} This is useful with parallel processing, but also more generally, to help us find out where the largest delays are.
```{r}
#| message: false
#| warning: false
tic("First bit of code")
print("Fast code")
toc()
tic("Second bit of code")
Sys.sleep(3)
print("Slow code")
toc()
```
And so we know that there is something slowing down the code. (In this artificial case it is `Sys.sleep()` causing a delay of three seconds.)
We could use `parallel` which is part of base R to run functions in parallel.\index{efficiency!parallel processing} We could also use `future` which brings additional features. After installing and loading `future` we use `plan()` to specify whether we want to run things sequentially ("sequential") or in parallel ("multisession"). We then wrap what we want this applied to within `future()`.
To see this in action we will create a dataset and then implement a function on a row-wise basis.
```{r}
#| eval: false
simulated_data <-
tibble(
random_draws = runif(n = 1000000, min = 0, max = 1000) |> round(),
more_random_draws = runif(n = 1000000, min = 0, max = 1000) |> round()
)
plan(sequential)
tic()
simulated_data <-
simulated_data |>
rowwise() |>
mutate(which_is_smaller =
min(c(random_draws,
more_random_draws)))
toc()
plan(multisession)
tic()
simulated_data <-
future(simulated_data |>
rowwise() |>
mutate(which_is_smaller =
min(c(
random_draws,
more_random_draws
))))
toc()
```
The sequential approach takes about 5 seconds, while the multisession approach takes about 0.3 seconds.
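As a point of comparison, the same kind of element-wise task can be sketched with `parallel`, which is part of base R as noted above. This is an illustrative sketch under stated assumptions, not the approach used in the text: the helper `smaller_of()` is hypothetical, the dataset is much smaller than the one above, and timing uses base R's `system.time()` rather than `tictoc`. It also checks that the sequential and parallel results agree, which is worth doing before comparing timings.

```r
library(parallel)

# Hypothetical helper: return the smaller of two numbers
smaller_of <- function(a, b) min(c(a, b))

set.seed(853)
random_draws <- round(runif(n = 1000, min = 0, max = 1000))
more_random_draws <- round(runif(n = 1000, min = 0, max = 1000))

# Sequential: apply the helper to each pair of values in turn
sequential_time <- system.time(
  sequential_result <- mapply(smaller_of, random_draws, more_random_draws)
)

# Parallel: spread the same pairs across two worker processes
cluster <- makeCluster(2)
parallel_time <- system.time(
  parallel_result <- clusterMap(
    cluster, smaller_of, random_draws, more_random_draws,
    SIMPLIFY = TRUE
  )
)
stopCluster(cluster)

# Both approaches should give the same answer
results_match <- identical(sequential_result, parallel_result)
```

For a task this small the overhead of starting workers and passing data to them may outweigh any gain, so the parallel version is not guaranteed to be faster.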
## Concluding remarks
In this chapter we have considered much and it is normal to be overwhelmed. Come back to the Quarto section as needed. Many people are confused by Git and Github and just know enough to get by. And while there was a lot of material in efficiency, the most important aspect of performant code is making it easier for another person to read it, even if that person is just yourself returning after a break.
In this chapter we have considered much and it is normal to be overwhelmed. Come back to the Quarto section as needed. Many people are confused by Git and GitHub and just know enough to get by. And while there was a lot of material in efficiency, the most important aspect of performant code is making it easier for another person to read it, even if that person is just yourself returning after a break.
@@ -1186,11 +1121,6 @@ election_results |>
b. Sharing code and data as well as results.
c. Using Quarto for integrating R and Python code in papers.
d. Using version control with Git and GitHub.
26. What is one advantage of parallel processing (pick one)?
a. It ensures code reproducibility.
b. It allows multiple chunks of code to run simultaneously, reducing execution time.
c. It improves the readability of code.
d. It helps automatically generate documentation.
### Activity {.unnumbered}
Binary file not shown.