Dealing with formula #339

Closed
devSJR opened this issue Mar 13, 2025 · 10 comments · Fixed by #341

Comments

@devSJR

devSJR commented Mar 13, 2025

Hello. It would be great if tinyplot could deal with more complex formulas, because it currently behaves differently from base plot. I think this is on the to-do list; I just wanted to record it here.

mtcars

par(mfrow = c(1,2))
plt(mpg/hp ~ wt, data = mtcars)
plot(mpg/hp ~ wt, data = mtcars)

[Image: output of the two calls above, side by side]

@grantmcdermott
Owner

Thanks @devSJR. As you probably know, the workaround here is to wrap with I()...

plt(I(mpg/hp) ~ wt, data = mtcars)

... but I certainly agree that consistency with vanilla plot would be desirable if we can achieve it.

@zeileis
Collaborator

zeileis commented Mar 13, 2025

The fact that tinyplot handles much more complex formulas than base plot is the reason for this deviation. Internally, tinyplot converts a formula like y ~ x | a + b into ~ y + x + a + b in order to first set up a model frame with all variables and then extracts and processes the building blocks ~ y, ~ x, and ~ a + b.
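To make the flattening step concrete, here is a minimal base R sketch. It is not tinyplot's actual code: the helper `vars()` and the choice of mtcars columns are assumptions made for illustration, and only `model.frame()` and `terms()` from base R are used.

```r
# Illustrative only: a flattened version of mpg ~ wt | cyl + am
full = ~ mpg + wt + cyl + am
mf = model.frame(full, data = mtcars)  # one model frame holding all variables

# Hypothetical helper: variable names of a one-sided sub-formula,
# read off its terms object
vars = function(f) sapply(attr(terms(f), "variables")[-1L], deparse)

# Extract the building blocks ~ y, ~ x, and ~ a + b from the same frame
y  = mf[, vars(~ mpg), drop = FALSE]
x  = mf[, vars(~ wt), drop = FALSE]
by = mf[, vars(~ cyl + am), drop = FALSE]

names(by)
```

Because every block is taken from the same model frame, all parts share the same rows after missing-value handling.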

If the blocks for y and x contain more than one variable, then an informative warning is issued, e.g.,

plt(mpg/hp ~ wt, data = mtcars)
## Warning message:
## In tinyplot.formula(mpg/hp ~ wt, data = mtcars) :
##   formula should specify at most one y-variable, using: mpg

This is done the same way for the x variable:

plt(mpg ~ wt/hp, data = mtcars)
## Warning message:
## In tinyplot.formula(mpg ~ wt/hp, data = mtcars) :
##   formula should specify exactly one x-variable, using: wt

So tinyplot is consistent here between the left-hand side and right-hand side. In contrast, base plot is not consistent:

  • On the left-hand side mpg/hp has its arithmetic meaning: mpg divided by hp.
  • On the right-hand side wt/hp has its symbolic formula meaning: hp nested within wt, which here gets translated into a sequence of two plots, mpg ~ wt and mpg ~ hp.
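The asymmetry in base R can be checked with `model.frame()` and `terms()` alone, without drawing anything. A small sketch, with the formula chosen only for illustration:

```r
f = mpg/hp ~ wt/hp
mf = model.frame(f, data = mtcars)

# Left-hand side: a single column holding the arithmetic ratio mpg / hp
names(mf)  # "mpg/hp" "wt" "hp"
all.equal(mf[[1L]], mtcars$mpg / mtcars$hp)

# Right-hand side: "/" is expanded symbolically into wt + wt:hp
attr(terms(f), "term.labels")  # "wt" "wt:hp"
```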

I think that this behavior of base R is inconsistent and very confusing. Of course, it would be possible to mimic this behavior but I don't think we should do it. Instead we should be consistent in the processing of the left-hand side and right-hand side. Either by warning about this situation (current solution) or by using the arithmetic meaning for both the y and the x variable.

The latter would create another inconsistency, though, namely that + would have to be handled differently in the x and the by part, e.g., y ~ x1 + x2 | a + b. For the x part we would then use the arithmetic meaning (x1 plus x2) but for the by part the symbolic formula meaning (two variables a and b).

So to cut a long story short: My opinion is that tinyplot's current solution is the only one that is consistent and not confusing. If the users want to use operators with their arithmetic meaning, they need to insulate them, e.g., via I(), both on the left-hand side and the right-hand side. Thus, you can use I(mpg/hp) ~ wt or mpg ~ I(wt/hp) which both work exactly the same in tinyplot and in base R.
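As a quick check of what I() does on either side of the formula, the following sketch uses only base R's `model.frame()`. The plotting calls are left as comments so that the snippet runs without tinyplot installed.

```r
mf = model.frame(I(mpg/hp) ~ I(wt/hp), data = mtcars)

# Each insulated expression becomes exactly one variable
names(mf)  # "I(mpg/hp)" "I(wt/hp)"
all.equal(as.numeric(mf[[1L]]), mtcars$mpg / mtcars$hp)
all.equal(as.numeric(mf[[2L]]), mtcars$wt / mtcars$hp)

# Corresponding plotting calls from the discussion:
# plot(I(mpg/hp) ~ wt, data = mtcars)
# tinyplot::plt(I(mpg/hp) ~ wt, data = mtcars)
```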

@zeileis
Collaborator

zeileis commented Mar 13, 2025

Some additional side remarks (aka rant). Feel free to ignore this, it's not really related to formulas for plotting:

  • The formula support (added with S3 in the white book) is an incredibly powerful feature for doing statistics and it's great to have it wired into the base language.
  • However, the choice that model.frame() et al. would keep the arithmetic meaning of operators on the left-hand side but not on the right-hand side is just wrong IMO.
  • Why can I use lm(y ~ x1 + x2) to specify a model with two regressors but have to use lm(cbind(y1, y2) ~ x) rather than lm(y1 + y2 ~ x) for a model with two dependent variables?
  • This shortcoming also means that I cannot easily specify two factor response variables (say for a bivariate probit model) because cbind(y1, y2) would drop the factor attributes and data.frame(y1, y2) is not allowed in a formula processed with model.frame().
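The cbind() versus + point and the loss of factor attributes can be reproduced in a few lines. This is a sketch with simulated data; all variable names are made up for illustration.

```r
set.seed(1)
d = data.frame(x = rnorm(20), y1 = rnorm(20), y2 = rnorm(20))

# Two dependent variables need cbind() and give a multivariate "mlm" fit
m_multi = lm(cbind(y1, y2) ~ x, data = d)
class(m_multi)

# On the left-hand side "+" is arithmetic: one response, the sum y1 + y2
m_sum = lm(y1 + y2 ~ x, data = d)
class(m_sum)

# cbind() on factors keeps only the integer codes, not the factor attributes
f1 = factor(c("a", "b", "a"))
f2 = factor(c("u", "u", "v"))
cb = cbind(f1, f2)
is.factor(cb)
```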

@zeileis
Collaborator

zeileis commented Mar 14, 2025

If we want to disable the symbolic interpretation of the formula operators in tinyframe() we can do so as follows:

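tinyframe = function(formula, data, drop = FALSE, symbolic = TRUE) {
  ## input
  ## - formula: (sub-)formula
  ## - data: model.frame from full formula
  ## - symbolic: keep the symbolic formula meaning of operators (TRUE) or
  ##   evaluate the expression arithmetically as a single variable (FALSE)
  if (is.null(formula)) return(NULL)
  if (symbolic) {
    ## one column per variable in the terms of the formula
    names = sapply(attr(terms(formula), "variables")[-1L], deparse, width.cutoff = 500L)
  } else {
    ## evaluate the whole expression in data and store it as a single column
    rhs = formula[[2L]]
    names = deparse(rhs, width.cutoff = 500L)
    data[[names]] = with(data, eval(rhs))
  }
  data[, names, drop = drop]
}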

The default symbolic = TRUE is the behavior we have up to now:

d <- data.frame(a = 1:3, b = 3:1)
tinyframe(~ a + b, data = d)
##   a b
## 1 1 3
## 2 2 2
## 3 3 1

But then we can switch to symbolic = FALSE:

tinyframe(~ a + b, data = d, symbolic = FALSE)
##   a + b
## 1     4
## 2     4
## 3     4

Thus, with that modification we could turn off the symbolic interpretation of the y and/or x part of the formula inside tinyplot.formula.

So the changes to the code are really minimal and we can easily implement any of the following strategies:

  1. Current tinyplot behavior: Use symbolic = TRUE for both y and x. This forces users to use I() on both the left-hand side and right-hand side.
  2. Mimic base plot: Use symbolic = TRUE for x but symbolic = FALSE for y.
  3. Force single y and x variable: Use symbolic = FALSE in both x and y.

My personal preference is (1) >> (3) > (2). But if Grant and/or Vincent clearly prefer consistency with base R, I'm also willing to implement that 😜

zeileis added a commit that referenced this issue Mar 14, 2025
…ymbolic interpretation of formula parts (discussed in #339)
@devSJR
Author

devSJR commented Mar 14, 2025

I started my morning with a coffee and some additional side remarks (aka rant), and finished with a smile.
You rock!
Keep up the good work.

I think a small section in the vignette (even a verbatim copy of this discussion) would do the trick.

@zeileis
Collaborator

zeileis commented Mar 14, 2025

Thanks for the nice words - and thanks for raising the issue in the first place!

After some sleep, I also don't feel quite as strongly about strategy 2. So if Grant decides that he wants to go with consistency with base R, I'm also very fine with that 😇

@grantmcdermott
Owner

grantmcdermott commented Mar 14, 2025

I defer to @zeileis on all matters related to formulae!

In seriousness, this is a great discussion with excellent points. The base inconsistency of / across the formula lhs and rhs is a killer point... as is the cbind() versus + point, which I too have been much frustrated by in the past.

Summarizing... @zeileis I think you make a perfectly compelling case that we should leave the current behaviour as-is, perhaps with some additional documentation or an example (as suggested by @devSJR). I can see that you've opened a branch that would enable users to select into the alternative symbolic = FALSE behaviour, so perhaps that's the place to do it. I also agree that (1) > (3) > (2), so let's ensure that symbolic = FALSE turns off symbolic behaviour for both x and y. Better to be internally consistent than follow "for compatibility with S" legacy behaviour IMO :-)

@devSJR
Author

devSJR commented Mar 14, 2025

I will keep watching this. I guess more users will come across this, and they will certainly be able to find the symbolic parameter.

Moreover, examples in the example section are always great. Actually, they are better than the vignette! I tell my students to go to the examples first. I guess others do the same.

@zeileis
Collaborator

zeileis commented Mar 14, 2025

OK, good, let's stick with the current behavior then. We could also expand the warning message, e.g.,

formula should specify at most one y-variable, using: mpg
if you want to use arithmetic operators, make sure to wrap them inside I()

or something along those lines?


As for tinyframe(..., symbolic = FALSE): my thought was that I would implement the option in tinyformula() so that we don't forget and can choose to use it or not. I wouldn't export the symbolic argument to the user, though. We should set (or not set) this internally.

However, in the meantime I thought of one problem that the current implementation does not cover: arithmetic operators in combination with other functions, e.g., log(y1) + y2 ~ x. This is admittedly contrived but it would necessitate a different implementation.
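One plausible reading of this problem (an assumption, not confirmed in the thread) is that the data handed to tinyframe() is a model frame built from the full formula. It then contains a column named log(y1) but no column y1, so re-evaluating log(y1) + y2 in it fails. A sketch with made-up data:

```r
d = data.frame(y1 = c(1, 2, 3), y2 = c(3, 2, 1), x = c(0, 1, 2))

# Model frame from the flattened formula: log(y1) is stored under that name
mf = model.frame(~ log(y1) + y2 + x, data = d)
names(mf)  # "log(y1)" "y2" "x"

# Evaluating the arithmetic expression works in the original data ...
expr = quote(log(y1) + y2)
ok = eval(expr, d, baseenv())

# ... but not in the model frame, where y1 itself is no longer available
res = tryCatch(eval(expr, mf, baseenv()), error = function(e) e)
inherits(res, "error")
```

Using `baseenv()` as the enclosure keeps the lookup from accidentally finding a `y1` in the calling environment.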

So if we stick to the current behavior, I would probably just discard the tinyframe-symbolic branch.

@grantmcdermott
Owner

"So if we stick to the current behavior, I would probably just discard the tinyframe-symbolic branch."

I must confess I haven't been keeping track of all the ins and outs here, and I desperately need to switch to "day job" work now as I've got a bunch of looming deadlines.

@zeileis my feeling is that I'm happy with whatever you think is best. So I'll leave you to close this issue or resolve in whatever fashion you feel is most appropriate.


3 participants