Citation links #15


Klafyvel
Contributor

This is heavily work in progress, and is meant to centralize discussion about the syntax for citing references.

@d-r-a-b

d-r-a-b commented Apr 20, 2023

Proposed order of operations in general:

  1. Collect examples of real-world citations from many fields (crowdsource?)
  2. Decide what subset is worth supporting in neorg
    • in-text and parenthetical?
    • what bibliographic styles (conversion rules for et al, author-year vs numerical, parens vs brackets, etc)? how to configure? full-blown support for something like .sty files or just a few switches?
    • contextual notations? (see Jones, 2020, p44; Smith, 2022)
    • footnote styles?
  3. Examine other software to see how prior solutions handle citations
  4. Propose neorg format
    • easily parsed
    • expose data fields in such a way that the supported feature subset is implementable (by the macro system?)
    • consistency with other norg conventions

For this post, I am shortcutting 1-3 somewhat by leaning on natbib, a mature citation package used in latex documents, but it would be very useful to look at other options. An illustrative subset of citation types supported by natbib follows:

Author-year styles
\citet{jon90}                ⇒ Jones et al. (1990)
\citet[chap.~2]{jon90}       ⇒ Jones et al. (1990, chap. 2)
\citep{jon90}                ⇒ (Jones et al., 1990)
\citep[chap.~2]{jon90}       ⇒ (Jones et al., 1990, chap. 2)
\citep[see][]{jon90}         ⇒ (see Jones et al., 1990)
\citep[see][chap.~2]{jon90}  ⇒ (see Jones et al., 1990, chap. 2)
\citet*{jon90}               ⇒ Jones, Baker, and Williams (1990)
\citep*{jon90}               ⇒ (Jones, Baker, and Williams, 1990)
\citet{jon90,jam91}          ⇒ Jones et al. (1990); James et al. (1991)
\citep{jon90,jam91}          ⇒ (Jones et al., 1990; James et al. 1991)
\citep{jon90,jon91}          ⇒ (Jones et al., 1990, 1991)
\citep{jon90a,jon90b}        ⇒ (Jones et al., 1990a,b)

Numerical Styles
\citet{jon90}                ⇒ Jones et al. [21]
\citet[chap.~2]{jon90}       ⇒ Jones et al. [21, chap. 2]
\citep{jon90}                ⇒ [21]
\citep[chap.~2]{jon90}       ⇒ [21, chap. 2]
\citep[see][]{jon90}         ⇒ [see 21]
\citep[see][chap.~2]{jon90}  ⇒ [see 21, chap. 2]
\citep{jon90a,jon90b}        ⇒ [21, 32]

In terms of spec features that could be re-used (I'm sure I'm missing some):

  1. It seems that a {link}-like syntax would be appropriate, as it references data from outside the file
  2. The : joiner can profitably be used to separate multiple citations. If we define exactly 1 citation key within each citation entry section, it may also make it easier to implement "see"/"page" type additions.
    • alternatively the | contextual separator might serve a similar purpose.
  3. The attached modifier extensions may be useful for specifying things like in-text/parenthetical
  4. The @document.meta block may be useful for specifying bibliography style
    • alternatively it would also make sense for this to be a part of a [core.citation] config block, perhaps specified default and per-workspace

One potential mapping that might feel appropriate, showing proposed ⇒ natbib ⇒ rendered using citation-style :

@document.meta
citation-style: style
@end

Author-year styles
{& &jon90}                   ⇒ \citet{jon90}                ⇒ Jones et al. (1990)
{& &jon90, chap. 2}          ⇒ \citet[chap.~2]{jon90}       ⇒ Jones et al. (1990, chap. 2)
{& &jon90}(p)                ⇒ \citep{jon90}                ⇒ (Jones et al., 1990)
{& &jon90, chap. 2}(p)       ⇒ \citep[chap.~2]{jon90}       ⇒ (Jones et al., 1990, chap. 2)
{& see &jon90}(p)            ⇒ \citep[see][]{jon90}         ⇒ (see Jones et al., 1990)
{& see &jon90, chap. 2}(p)   ⇒ \citep[see][chap.~2]{jon90}  ⇒ (see Jones et al., 1990, chap. 2)
{& &jon90}(allauthors)       ⇒ \citet*{jon90}               ⇒ Jones, Baker, and Williams (1990)
{& &jon90}(p|allauthors)     ⇒ \citep*{jon90}               ⇒ (Jones, Baker, and Williams, 1990)
{& &jon90 : &jam91}          ⇒ \citet{jon90,jam91}          ⇒ Jones et al. (1990); James et al. (1991)
{& &jon90 : &jam91}(p)       ⇒ \citep{jon90,jam91}          ⇒ (Jones et al., 1990; James et al. 1991)
{& &jon90 : &jon91}(p)       ⇒ \citep{jon90,jon91}          ⇒ (Jones et al., 1990, 1991)
{& &jon90a : &jon90b}(p)     ⇒ \citep{jon90a,jon90b}        ⇒ (Jones et al., 1990a,b)

Numerical Styles
{& &jon90}                   ⇒ \citet{jon90}                ⇒ Jones et al. [21]
{& &jon90, chap. 2}          ⇒ \citet[chap.~2]{jon90}       ⇒ Jones et al. [21, chap. 2]
{& &jon90}(p)                ⇒ \citep{jon90}                ⇒ [21]
{& &jon90, chap. 2}(p)       ⇒ \citep[chap.~2]{jon90}       ⇒ [21, chap. 2]
{& see &jon90}(p)            ⇒ \citep[see][]{jon90}         ⇒ [see 21]
{& see &jon90, chap. 2}(p)   ⇒ \citep[see][chap.~2]{jon90}  ⇒ [see 21, chap. 2]
{& &jon90a : &jon90b}(p)     ⇒ \citep{jon90a,jon90b}        ⇒ [21, 32]
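The proposed ⇒ natbib column of this mapping can be sketched in Python. This is only an illustration of how mechanical the translation is, not a proposed implementation: `to_natbib`, the two regular expressions, and the rule that the prenote comes from the first entry and the postnote from the last are all assumptions made for the example. The `. ` to `.~` replacement just mimics the table above.

```python
import re

# Whole citation: {& <body>} optionally followed by (<extensions>).
CITATION = re.compile(r"^\{&\s*(?P<body>.*?)\}(?:\((?P<ext>[^)]*)\))?$")
# One entry: optional prefix text, &key, optional ", suffix".
ENTRY = re.compile(r"^(?P<prefix>[^&]*)&(?P<key>[^\s,]+)(?:\s*,\s*(?P<suffix>.*))?$")


def to_natbib(citation: str) -> str:
    """Translate the proposed {& ...}(ext) form into a natbib command."""
    match = CITATION.match(citation.strip())
    if match is None:
        raise ValueError(f"not a citation: {citation!r}")
    flags = set(filter(None, (match.group("ext") or "").split("|")))

    entries = []
    for part in match.group("body").split(":"):  # ':' joins multiple keys
        entry = ENTRY.match(part.strip())
        if entry is None:
            raise ValueError(f"bad citation entry: {part!r}")
        entries.append(entry)

    # Textual (citet) is the default; (p) selects parenthetical.
    command = "\\citep" if "p" in flags else "\\citet"
    if "allauthors" in flags:
        command += "*"

    # Assumption: natbib has one prenote/postnote per command, so take the
    # prefix of the first entry and the suffix of the last one.
    prefix = entries[0].group("prefix").strip()
    suffix = (entries[-1].group("suffix") or "").strip().replace(". ", ".~")
    if prefix:
        notes = f"[{prefix}][{suffix}]"
    elif suffix:
        notes = f"[{suffix}]"
    else:
        notes = ""

    keys = ",".join(entry.group("key") for entry in entries)
    return f"{command}{notes}{{{keys}}}"


print(to_natbib("{& see &jon90, chap. 2}(p)"))  # \citep[see][chap.~2]{jon90}
print(to_natbib("{& &jon90 : &jam91}"))         # \citet{jon90,jam91}
```

Splitting on `:` is exactly where this sketch breaks if a prefix or suffix contains a colon.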

A couple of questionable decisions I made above

  • Maybe | really would be better than : -- at least I can think of more cases where I might want to use a : character in the text contextualizing a citation
  • how to deal with , e.g. , chap. 2. I attach it to the citekey but not sure if that is wise
  • I omitted a (t) for the citet equivalents, effectively making it the default. Not sure if (t) and (p) make more sense represented succinctly (whether or not a default is chosen/configurable), or if it should be a key:value type
  • Similarly the allauthors seems somewhat verbose, but very clear. I don't overly like * as a metadata char here and not sure if it is valid in attached modifier extensions. Perhaps all as a shorter but clear form?
  • & was chosen arbitrarily because @ is already taken for timestamps and & means reference in many proglangs

@Klafyvel
Contributor Author

Klafyvel commented Apr 20, 2023

An idea I gave on the Discord server to account for @vhyrro 's proposal:

Ok, so there may be a tradeoff here: have a very lightweight spec for citations, something like what @vhyrro described: {= jones2022}(my_bib). It is stupidly simple for parsers to implement. But then the semantics and standard library can build on that.

Someone wanting to define a new style would implement a set of functions, e.g. my_citep etc. as listed by @d-r-a-b in his github post. Then, you could just do {= jones2022}(my_citep). To solve the issue of easily changing citation style, we could say that the standard library defines some citep macro that knows how to find the correct function my_citep (for example from @document.meta) and replace a .citep call in norg files with {= jones2022}(my_citep).

What do you think ?
Parser-wise that means basically no change at all, stdlib-wise there's some work, but there's a lot of work to be done on stdlib anyway.

And later we figured out an example of what a citation style library could look like:

@document.meta
title: my_library
@end

@document.bib
citep: my_citep
citet: my_citet
@end

=name my_citet
#eval
@code janet
...
@end
=end
...
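
A rough Python sketch of the lookup described here, purely illustrative: the generic citep/citet names are resolved through a table that mirrors the @document.bib block, so changing style means swapping the table rather than editing the document. `DATABASE`, the two render functions, and `expand` are hypothetical names, not part of any neorg API, and Python only stands in for whatever language the macros end up being written in.

```python
# Hypothetical stand-in for entries loaded from a bibliography file.
DATABASE = {"jones2022": {"authors": "Jones et al.", "year": "2022"}}


def my_citep(entry):
    """Parenthetical form provided by the style library."""
    return f"({entry['authors']}, {entry['year']})"


def my_citet(entry):
    """Textual form provided by the style library."""
    return f"{entry['authors']} ({entry['year']})"


# Mirrors the @document.bib block: generic macro name -> style function.
STYLE_LIBRARY = {"citep": my_citep, "citet": my_citet}


def expand(macro, key, library=STYLE_LIBRARY, database=DATABASE):
    """Resolve a generic macro name through the active style library."""
    if macro not in library:
        raise LookupError(f"style library does not define {macro!r}")
    return library[macro](database[key])


print(expand("citep", "jones2022"))  # (Jones et al., 2022)
print(expand("citet", "jones2022"))  # Jones et al. (2022)
```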

@bdarcus

bdarcus commented Apr 20, 2023

I've not read this carefully at all, but I'll just repeat the essence of what I earlier posted on a citation issue here:

Look at the pandoc citation model, and the newer org-cite one.

The former has been around longer, and been battle-tested by a lot of users, and the latter in turn learned from that (as well as BibLaTeX).

Having a model that is conceptually close will not only get you needed features without having to reinvent the wheel, but also can ensure it's easy to losslessly convert back and forth.

As for styling, I'm obviously biased, since I created CSL (used for styling in both), but I think we collectively solved a lot of challenges in this domain.

But it's really difficult, with trade-offs. If you design an approach organized around natbib styling, that leaves out better, more general, options in the TeX world (namely BibLaTeX), but also newer solutions like CSL, that work outside the TeX world.

I think pandoc is the best balance; it's rich enough (its citation model is richer than natbib's AND more concise), but can export to key TeX styles, while also supporting native formatting in the CSL-based citeproc processor.

In org-cite, we adopted a heavier weight approach to styling, which likely wouldn't work here. But an advantage of it is a document's citations are portable across included export backends.

The org-biblatex module actually has a variable that allows the users to define their own mappings, if they prefer, somewhat like the macro idea here. But the default value is curated so that the styles map more-or-less consistently to other backends.

Pandoc has that as well, but supports a more limited range of local citation styles (just the equivalent of citet, citep, and I guess citea).

@d-r-a-b

d-r-a-b commented Apr 21, 2023

Thanks for offering an experienced viewpoint in this domain!

In org-cite, we adopted a heavier weight approach to styling, which likely wouldn't work here. But an advantage of it is a document's citations are portable across included export backends.

Can you explain what aspects of org-cite specifically seem non-workable in a norg context? Looking into it, there is just a cite token with optional /style specifier for things like textual vs. parenthetical, a universal prefix/suffix and a per-publication prefix/suffix. Considering that the likely candidate for macros will have builtin parsing expression grammars, this looks very tractable in general. It actually seems a lot easier from an implementation standpoint than the pandoc format, since the pandoc in-text citation style is not clearly delineated from the surrounding text. I can't think of a reason why norg couldn't have {= universal prefix; prefix @key suffix; . . . ; universal suffix}{cite|style}. Ultimately the cite macro (and composable style macro) defines the format of the preceding block, not the norg-spec per se, so there's a question in my mind about whether this is a macro-implementation detail (in much the way that there was org-ref and then org-cite), or whether norg-spec takes a stance.
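
To give a feel for how tractable that body is, here is a minimal Python sketch that splits `universal prefix; prefix @key suffix; ...; universal suffix` into a small model. The `Citation` and `Reference` classes, `parse_citation`, and the rule that a keyless first or last section is a universal affix are assumptions for illustration, loosely following the org-cite layout. A real implementation would presumably use a PEG rather than a regex.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reference:
    key: str
    prefix: str = ""  # per-key prefix, e.g. "cf."
    suffix: str = ""  # per-key suffix, e.g. "p. 4"


@dataclass
class Citation:
    references: List[Reference] = field(default_factory=list)
    prefix: str = ""  # universal prefix
    suffix: str = ""  # universal suffix


KEY = re.compile(r"@([\w-]+)")


def parse_citation(body: str) -> Citation:
    """Parse 'UP; KP @key KS; ...; US' into a Citation."""
    citation = Citation()
    sections = [section.strip() for section in body.split(";")]
    for index, section in enumerate(sections):
        match = KEY.search(section)
        if match:
            citation.references.append(
                Reference(
                    key=match.group(1),
                    prefix=section[: match.start()].strip(),
                    suffix=section[match.end():].strip(),
                )
            )
        elif index == 0:
            citation.prefix = section  # keyless first section
        elif index == len(sections) - 1:
            citation.suffix = section  # keyless last section
        else:
            raise ValueError(f"section without a key: {section!r}")
    return citation


parsed = parse_citation("see; @jones2022 p. 4; cf. @smith2020; and others")
print(parsed.prefix, [r.key for r in parsed.references], parsed.suffix)
```

A single reference is just the one-section case of the same grammar.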

On thinking about it more, I tend to like a generalized macro approach with norg-spec not necessarily taking a strong stance. Then it's purely a macro-implementation detail, and the macro can also decide what subset of databases (.bib, biber .bib, zotero sqlite, new custom one, etc) and subset of style specs it supports (.bst, .csl, new custom one, etc). A robust norg-cite macro would then either be its own repo or part of the modules that ship with nvim-neorg/neorg.

@d-r-a-b

d-r-a-b commented Apr 21, 2023

Skimmed through oc.el and oc-basic.el at https://git.savannah.gnu.org/cgit/emacs/org-mode.git/. Also the (brief) doc at https://orgmode.org/manual/Citation-handling.html. Very interesting ideas, make sure to read the ;;; Commentary section that immediately follows the license stuff. Very succinct and easy to miss if you just try to skip to the code.

The basic idea seems to be that oc.el leverages macros to create a unified interface for registering processors like oc-basic.el/oc-natbib.el/oc-csl.el. Each processor can provide 4 primary capabilities: activate, follow, insert, export. Somewhat translating to neovim/neorg terms:

  • activate applies highlight/parse groups to the contents of the macro. This allows e.g. [cite:see @key] to be shown with @key highlighted based on whether or not @key is actually a valid key in any of the sourced .bib, .json databases
  • follow implements essentially [core.esupports.hop] functionality for @keys. Could also implement "hover" functionality which is really just hop in a new split/new float.
  • insert is basically autocompletion. Haven't looked into whether processors actually implement the completion menu display+filtering or just provide a list of candidates similar to nvim-cmp sources
  • export handles the actual transform to a target format
    • [cite: @jones2022] => "Jones et al. (2022)" if output is to text/html/pdf etc
    • [cite: @jones2022] => \cite{jones2022} if output is to latex
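
The registration idea can be sketched in a few lines of Python. This is not how oc.el is written; it only illustrates "a processor provides some of four capabilities and the core dispatches to whichever one has it". `Processor`, `REGISTRY`, `register`, `dispatch`, and the in-memory `DATABASE` are all hypothetical names, and the LaTeX branch emits the bare key without the `@`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory stand-in for a parsed .bib or .json database.
DATABASE = {"jones2022": {"authors": "Jones et al.", "year": "2022"}}


@dataclass
class Processor:
    """A processor may provide any subset of the four capabilities."""
    name: str
    activate: Optional[Callable[[str], bool]] = None  # is this key valid?
    follow: Optional[Callable[[str], Dict[str, str]]] = None  # jump target
    insert: Optional[Callable[[str], List[str]]] = None  # completion candidates
    export: Optional[Callable[[str, str], str]] = None  # render for a target


REGISTRY: Dict[str, Processor] = {}


def register(processor: Processor) -> None:
    REGISTRY[processor.name] = processor


def dispatch(capability: str, preferred: List[str], *args):
    """Call the capability on the first preferred processor that has it."""
    for name in preferred:
        handler = getattr(REGISTRY.get(name), capability, None)
        if handler is not None:
            return handler(*args)
    raise LookupError(f"no registered processor provides {capability!r}")


def export_basic(key: str, target: str) -> str:
    if target == "latex":
        return f"\\cite{{{key}}}"
    entry = DATABASE[key]
    return f"{entry['authors']} ({entry['year']})"


register(Processor(
    name="basic",
    activate=lambda key: key in DATABASE,
    follow=lambda key: DATABASE[key],
    insert=lambda typed: sorted(k for k in DATABASE if k.startswith(typed)),
    export=export_basic,
))
# A second processor that only knows how to export.
register(Processor(name="latex-only", export=lambda key, target: f"\\cite{{{key}}}"))

print(dispatch("export", ["basic"], "jones2022", "html"))
print(dispatch("activate", ["latex-only", "basic"], "jones2022"))
```

The second call falls through to `basic` because `latex-only` does not provide activate.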

@d-r-a-b

d-r-a-b commented Apr 21, 2023

@vhyrro, would this activate idea be interesting to explore as a general macro feature so they pass information about their DSLs back to neovim/neorg ?

@bdarcus

bdarcus commented Apr 21, 2023

Can you explain what aspects of org-cite specifically seem non-workable in a norg context?

I should indeed clarify that.

I only meant the cite/t/c bit below, where the t is a shortcut for text and c a substyle for initial capitalization:

[cite/t/c@jones; ...]

That design is really powerful and flexible, but I was thinking "cite" and such wasn't really consistent with the syntax approach in neorg.

@bdarcus

bdarcus commented Apr 21, 2023

Haven't looked into whether processors actually implement the completion menu display+filtering or just provide a list of candidates similar to nvim-cmp sources.

The default insert processor uses minibuffer completion.

My citar package provides a richer alternative for that, and also a cmp-like completion at point (see two screenshots at the top).

https://github.com/emacs-citar/citar

The latter does use org for finding the citations, but is otherwise independent of org-cite.

The minibuffer completion is provided as an insert processor, so that it will be used when calling org-cite-insert.

It's a really well-designed system that makes that easy to do on my end. The insert processor is just a few lines of code.

Basically the insert and follow processor frameworks allow you to plug in custom functions to standard org commands.

@bdarcus

bdarcus commented Apr 21, 2023

@d-r-a-b - just looking at your earlier note a bit more closely.

I can't think of a reason why norg couldn't have {= universal prefix; prefix @key suffix; . . . ; universal suffix}{cite|style}.

+1

I tend to like a generalized macro approach with norg-spec not necessarily taking a strong stance. Then it's purely a macro-implementation detail, and the macro can also decide what subset of databases (.bib, biber .bib, zotero sqlite, new custom one, etc) and subset of style specs it supports (.bst, .csl, new custom one, etc). A robust norg-cite macro would then either be its own repo or part of the modules that ship with nvim-neorg/neorg.

I only casually follow neorg, so forgive me: WDYM by "macro" in this context?

It sounds like maybe your suggestion would be coupling citation syntax to a particular output backend? Or am I misreading?

@d-r-a-b

d-r-a-b commented Apr 22, 2023

It sounds like maybe your suggestion would be coupling citation syntax to a particular output backend? Or am I misreading?

Depending on how you define an "output backend", yes. To put it in org-cite terms, the idea I was describing above would be that a given citation format (the ascii/UTF characters entered into a norg file) would be determined by a library analogous to org-cite/oc.el. I very specifically do not mean for the citation format to be specific to an output like oc-basic or oc-biblatex.el or oc-csl.el. In theory, a third-party could either write additional backends for the org-cite analogue, or they could write their own macro library entirely, much like org-ref was once written. My hope would be that norg-cite would be good enough (and built into neorg from an early stage) so that efforts could be more consolidated around it.

On the other hand, I am advocating for some conceptual separation of norg-cite-spec from norg-spec itself to allow for different approaches to be trialed by the community if they really want to experiment with their own norg-refs.

@d-r-a-b

d-r-a-b commented Apr 22, 2023

I only casually follow neorg, so forgive me: WDYM by "macro" in this context?

It's probably better to excerpt from the spec for this. It's a bit longish, but a quick skim overall.

*** Macro Tags
Macro tags (also known as /macro definitions/) are a tag type designed to declare and define
macros. Macros are templates that can be executed with parameters in order to place some
structured text into the document.

The content of the macro tag is /any/ Norg markup - this includes {** structural detached
modifiers} and nested {* tags}. The macro tag is closed with the =end statement.

Under the hood, all other {* tags}[tag] types are implemented as macros with special parameters
and contents. . . (continues with examples)

The specific {= ...}(macro) syntax I used above is described:

**** Extendable Links (=)
Apart from having links with set behaviors Norg also features an extendable link marked with
the = character. This link has its behaviors governed by {** attached modifier extensions}
supplied to the link and by the software running the Norg format (e.g. [Neorg]).

Syntax:
|example
+bibliography ./myreferences.bib
% my_bibliography

This is a reference to a bibliography: {= Neorg2022}(my_bibliography).
|end

For a more detailed explanation of the behavior of this link consult the [semantics document].

See also

**** Edge Cases and Semantic Interpretation
A commonly arising question is "how are these interpreted at parse time?" - can you link to
elements within \|comment tags? What governs the behavior of these differing tags?

The answer may be illustrated simply by showing how these tags are implemented.
As mentioned in the {*** macro tags} section, all tag types (apart from the macro tag) are a
Macro invocation under the hood. Below are the implementations for \|comment and \|group,
respectively:

  • \|comment:
    |example
    =comment ...
    =end
    |end
  • \|group:
    |example
    =group ...
    &...&
    =end
    |end

The \|comment tag evaluates to /no value/. Anything that is placed within a comment during
invocation (the ... parameter) is simply dropped. Because of this it is not possible to
link to elements within comment tags. The \|group tag returns everything that you give
it, because of this it is possible to freely link to any element within a \|group.

To summarize - the behavior of each individual standard ranged tag is fully governed by its
implementation - see the [semantics document] for more details.

Macros are an active focus of neorg development right now. At the moment, I believe that the "macro" system is completely implemented in lua. However, there are experiments to embed Janet as a very lightweight LISP-like that comes with Parsing Expression Grammars built in as part of the standard library. My understanding is that this would then form the basis of most macros, but you'd really have to pick Vhyrro's brain for a better explanation of how the responsibilities are going to be delegated.

@bdarcus

bdarcus commented Apr 22, 2023

edited for clarity

In theory, a third-party could either write additional backends for the org-cite analogue, or they could write their own macro library entirely, much like org-ref was once written. My hope would be that norg-cite would be good enough (and built into neorg from an early stage) so that efforts could be more consolidated around it.

OK, I get it. So in effect, you are meaning a default citation syntax that should be good enough for the vast majority of cases, but that's not hard-coded? If yes, that seems reasonable.

By "backend" I was meaning TeX (natbib vs biblatex, etc.) vs a CSL solution (including the lua-based one for TeX) vs a newer example like Typst.

Aside: have you all identified the list of requirements? I would suggest this as one of them:

  • A citation should work across output targets without surprises.

IMO, the org-ref/org-cite thing is a problem ATM; it fragments the development ecosystem, and forces users to choose between incompatible approaches (you can't mix the two syntaxes in the same document, or things will break), all because some people insist on "style" names that look like natbib's command names, which is more of a UI issue than anything.

But I guess that's neither here-nor-there; users and developers do have different priorities, and that's unavoidable at some level.

The natbib citation model, BTW, is more limited than all the alternatives we've raised here.

@d-r-a-b

d-r-a-b commented Apr 22, 2023

OK, I get it. So in effect, you are meaning a default citation syntax that should be good enough for the vast majority of cases, but that's not hard-coded? If yes, that seems reasonable.

I think I am advocating for a neorg-blessed norg-cite macro library, which would be implemented as a neorg module either in the neorg dev tree or as a separate repo. This norg-cite library would in turn define its own norg-cite-spec. Given that in this scenario neorg is blessing norg-cite as being under the general umbrella of the neorg project and not a totally independent third-party effort, it would make sense for norg-cite-spec to offer a relatively complete set of features, but I'll leave that for a separate reply to try and consolidate some requirements under a single post.

This means that norg-spec would not have 1st class support for citations, but rather treat them as a subset of {=....}-style Extendable Link. The alternative is a new type of link, e.g. {& ... }, which would have 1st class support as a norg-citation object.

Pros/cons as I see it revolve around

  • whether other modules that are not directly part of norg-cite have easy access to the citation keys and universal/per-key affixes
  • flexibility to move to other citation engines or in-text representations without having to alter norg-spec and tree-sitter-norg

@d-r-a-b

d-r-a-b commented Apr 22, 2023

Aside: have you all identified the list of requirements? I would suggest this as one of them:

  • A citation should work across output targets without surprises.

IMO, the org-ref/org-cite thing is a problem ATM; it fragments the development ecosystem, and forces users to choose between incompatible approaches (you can't mix the two syntaxes in the same document, or things will break),

Fragmentation is always a problem. To be fair, I don't expect to use both Zotero and Endnote in my word docs and for them to play well together. At least with these markup languages it's easy to see at a glance which package is being used.

all because some people insist on "style" names that look like natbib's command names, which is more of a UI issue than anything.

If I understand correctly, org-cite came after org-ref was already a pretty featureful package. It could have implemented a syntax that was a superset of org-ref but chose not to. I'm sure there were good technical reasons for implementing a different syntax, but it didn't and so there is fragmentation. I'm really not surprised that people who spent 4-5 years learning to use org-ref productively would be reluctant to reformat all their notes into an org-cite compatible format and convert all their tooling if they don't have need for the additional flexibility offered by org-cite. Of course, we'd like to head this fragmentation off at the pass for neorg and that's why we're trying to discuss it now.

The natbib citation model, BTW, is more limited than all the alternatives we've raised here.

For the discussion, I think it's useful to be explicit. I mentally hold distinct the following 3 concepts and these are how I use these terms.

  • model: set of citation-related data including
    • references into 1-or-more database backends
    • metadata (affixes)
    • metadata (display variant - in-text, parenthetical, capitalize last names starting with von, allauthors)
    • relationship/hierarchy between these references and associated metadata
  • syntax: how that model is communicated in a plain-text format
  • package: specific implementation of a set of features that translates a syntax into a model (like an AST) and then back to an output format on export
    • output to another syntax, which may potentially be done without any additional lookups
    • output to a final text form, which likely needs to reference 1-or-more databases to complete its information and 1-or-more types of style formats to decide how to translate the model into an author-year, numerical, author-title, etc citation

It's fair to say that certain syntaxes limit what models are supported, and if you want to convert from a package that only supports a .bib backend to one that only supports a sqlite backend there are some squicky conversions that probably need to happen outside of neorg, with neorg probably assuming that the cite-key remains constant or can be transformed in a regular way.

I more strongly align to your earlier statement that much of this is a UI issue. Is there really a difference in the model just because the syntax looks a bit different? I will say the natbib package does offer a more limited model because it only offers universal affixes and not per-key affixes. norg-cite-spec could choose to address this or not, although my strong preference would be that it does, even at the cost of only supporting lossy export to a natbib latex backend. On the other hand, it isn't clear at all that choosing a natbib-inspired syntax means you have to inherit that particular limitation. Whether you call it citep or cite/p doesn't seem to make a real difference, nor does whether you order your prefix/suffix arguments as UP, KP @key KS, . . . US or UP | US | KP & KS & @key | . . . or whether you put the "citep" equivalent at the beginning or at the end of the syntax. I know what I like better, but my sense is the resulting model is the same in both even if one syntax feels more elegant.
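
To make the "lossy export to natbib" point concrete, a small Python sketch: a model with both universal and per-key affixes is exported to a single natbib command, and whatever natbib cannot express is reported rather than silently dropped. The `Reference`/`Citation` classes and `to_natbib` are hypothetical. Folding per-key affixes into the notes when there is only one key is one possible policy, not a recommendation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Reference:
    key: str
    prefix: str = ""
    suffix: str = ""


@dataclass
class Citation:
    references: List[Reference] = field(default_factory=list)
    prefix: str = ""  # universal prefix
    suffix: str = ""  # universal suffix


def to_natbib(citation: Citation, command: str = "citep") -> Tuple[str, List[Tuple[str, str]]]:
    """Return (latex, lost), where lost lists the per-key affixes dropped."""
    prenote, postnote = citation.prefix, citation.suffix
    lost: List[Tuple[str, str]] = []
    refs = citation.references

    if len(refs) == 1:
        # With a single key, per-key affixes fit into the universal notes.
        prenote = " ".join(filter(None, [prenote, refs[0].prefix]))
        postnote = " ".join(filter(None, [refs[0].suffix, postnote]))
    else:
        # natbib has one prenote/postnote per command: per-key affixes are lost.
        for ref in refs:
            for affix in (ref.prefix, ref.suffix):
                if affix:
                    lost.append((ref.key, affix))

    if prenote:
        notes = f"[{prenote}][{postnote}]"
    elif postnote:
        notes = f"[{postnote}]"
    else:
        notes = ""
    keys = ",".join(ref.key for ref in refs)
    return f"\\{command}{notes}{{{keys}}}", lost


latex, lost = to_natbib(Citation(
    prefix="see",
    references=[Reference("jon90", suffix="chap. 2"), Reference("jam91", prefix="cf.")],
))
print(latex)  # \citep[see][]{jon90,jam91}
print(lost)   # [('jon90', 'chap. 2'), ('jam91', 'cf.')]
```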

@d-r-a-b

d-r-a-b commented Apr 22, 2023

Some possible requirements to get us going, all of which are up for debate

Model

  • single references (suggestion: as a case of the multiple-reference syntax)
  • multiple references
  • what metadata fields (per key vs per citation vs universal)
    • affix (suggestion: both per citation and per key)
    • citation style variant (suggestion: per citation)
    • citation style (APA, MLA, IEEE) (suggestion: universal)
    • org-cite offers a location reference as part of its cite (suggestion: per key)
      • technically a per-key suffix allows the same textual output, but a data-aware locator lets style guides work directly with the range
      • multiple locator case?
  • I think @bdarcus "A citation should work across output targets without surprises" goes here, or else in Package

Syntax

  • first-class norg-spec support for citations vs second-class as macro arguments
  • parseable by tree-sitter, whether first-class (split into components) or second-class (a single object that gets passed along)
  • defining the databases which hold the actual data that the citekeys reference
    • 1 database vs several?
    • precedence on duplicated citekey?
  • preferably consistent with other aspects of norg-spec
  • preferably similar to other packages unless there's a good technical reason to diverge
    • the general organization of [UP; KP @key KS; ... ; US] seems reasonable to most people who have seen it here
    • do you specify the punctuation between P/S and whatever is ultimately generated by @key, or is that up to the styles ultimately?
  • mechanism to indicate citation should generate:
    • entry in works cited (suggest: simply default behavior)
    • no entry if generating a works cited, but an entry if generating a bibliography
  • mechanism for specifying stylistic variant
    • different processor vs argument to processor
    • in-text
      • style-guide default
      • parenthetical
      • genitive/possessive (Smith's (1980) paper)
      • all-authors
      • one-author
      • date-only (primary use case seems to be to "get around" lack of stylistic variant for e.g. genitive)
      • title-only
    • parenthetical
    • "invisible"
      • mark citation as to be included in the bibliography vs. the bibliography implies every entry in the .bib or sqlite database

Package

  • addition of citations to some list for bibliography generation
    • automatic with citations
    • concept of works cited vs bibliography, to be supported by the syntax features above
  • export to other markup formats (org-mode with org-cite), latex (with biblatex or natbib or whatever)
    • how to gracefully degrade for output to other syntaxes/models that don't support the same feature set offered by norg-cite and norg-cite-spec.
  • export to text
    • referencing database
      • .bib (seems like a basic thing that everyone wants to support)
        • with biber extensions
      • Zotero sqlite
      • others?
    • referencing style definition
      • norg-specific (vote against)
      • CSL
      • sty
      • others?
  • much of the above functionality relies on stateful tracking, at least in part
    • generation of works cited
    • numerical styles, for re-using numbers when the same source is cited twice
    • numerical styles where "works cited" specifies different order from citation order (e.g. alphabetical)
    • more contextual citation styles, such as MLA preference to simply cite page number if citing the same source as the most recent citation (https://kristofferbalintona.me/posts/202206141852/), which additionally require tracking most recent citation.
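
The numerical cases in this list are easy to sketch in Python. `NumericTracker` and `alphabetical_numbers` are hypothetical names and the bracket formatting is arbitrary. The point is only that numbering in citation order can be done in one pass with a little state, while an alphabetical works-cited order needs every cited key before any number can be assigned.

```python
class NumericTracker:
    """Assigns numbers in order of first citation and re-uses them."""

    def __init__(self):
        self._numbers = {}  # cite key -> assigned number

    def cite(self, *keys):
        numbers = []
        for key in keys:
            if key not in self._numbers:
                self._numbers[key] = len(self._numbers) + 1
            numbers.append(self._numbers[key])
        return "[" + ", ".join(str(n) for n in numbers) + "]"

    def works_cited(self):
        """Cited keys in numbering order; uncited entries never appear."""
        return sorted(self._numbers, key=self._numbers.get)


def alphabetical_numbers(cited_keys):
    """Two-pass variant: collect all cited keys, then number them sorted."""
    return {key: n for n, key in enumerate(sorted(set(cited_keys)), start=1)}


tracker = NumericTracker()
print(tracker.cite("jon90"))           # [1]
print(tracker.cite("jam91", "jon90"))  # [2, 1]
print(tracker.works_cited())           # ['jon90', 'jam91']
print(alphabetical_numbers(["jon90", "jam91", "jon90"]))  # {'jam91': 1, 'jon90': 2}
```

The MLA-style "same source as the previous citation" rule would need one more piece of state, the most recently cited key.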

Compelling User story

  • Write citation syntax which allows package to generate a model.
  • Model used as basis for encoding an output into text vs. other markup, on the basis of style guide (IEEE, APA...) and style variant
  • Package uses model and possibly syntax to decide whether or not to add a citation to 1 or more internal lists which are used for works-cited vs bibliography generation
  • Package enables low-friction integration of whatever reference management solution they are already using, or to work with some format that their collaborators need to use
  • Package enables low-friction (ideally seamless) change of style guide using a .sty or .csl that they find
  • Package vs. Syntax makes it easy to interop with other tools/formats

EDITED: add genitive citations
EDITED: distinguish between universal, per-citation and per-key
EDITED: bibliography vs works-cited
EDITED: location specifier
EDITED: interop with e.g. pandoc
EDITED: stateful tracking

@bdarcus

bdarcus commented Apr 22, 2023

Is there really a difference in the model just because the syntax looks a bit different?

Absolutely not. I do indeed mean model; the abstractions behind the syntax.

I will say the natbib package does offer a more limited model because it only offers universal affixes and not per-key affixes.

Yes, this is what I meant; that's an arbitrary limitation that makes it impractical in many fields; one that pandoc, biblatex, org-cite don't have.

Well, maybe not "arbitrary" exactly; I am guessing one has to do some gymnastics to support it in TeX that aren't necessary elsewhere, which would explain why biblatex has two different syntaxes for single vs multiple.

Whether you call it citep or cite/p doesn't seem to make a real difference.

That's correct; it's why the org-cite code is completely agnostic on it, with just some best-attempt reasonable defaults, which allow documents to work pretty consistently across different output targets.

For example, cite/t will render the same in default natbib, biblatex, and csl export processors.


Org-cite history, etc ...

Don't want to get too focused on this; I really just raised this issue to encourage you all to avoid this yourselves, which you have an opportunity to do since this is a pretty new project.

If I understand correctly, org-cite came after org-ref was already a pretty featureful package. It could have implemented a syntax that was a superset of org-ref but chose not to. I'm sure there were good technical reasons for implementing a different syntax, but it didn't and so there is fragmentation.

I'd say more like the development of org-cite took place in public on the org mailing list over many years, preceding even the development of org-ref ;-)

I was only involved in the last six months or so of it, long after the syntax/model discussions were settled. But the org-ref folks had every opportunity to make their case there; either they didn't, or they failed to convince.

In org-element, citations and their components (citation-reference) are first-class objects, so you can have code like this in citar-org:

          (current-citation (if (eq 'citation (org-element-type datum)) datum
                             (org-element-property :parent datum)))
          (current-ref (when (eq 'citation-reference (org-element-type datum)) datum))
          (refs (org-cite-get-references current-citation))

I do know some of the org-cite syntax decisions were guided by technical reasons; for example, parsing.

@d-r-a-b

d-r-a-b commented Apr 23, 2023

Org-cite history, etc ...

Don't want to get too focused on this; I really just raised this issue to encourage you all to avoid this yourselves, which you have an opportunity to do since this is a pretty new project.

Understood and thank you for your input. Is there anything you've run into that the model generated by citations using org-cite has been unable to handle? Since this is a fresh opportunity it would be good to know what you perceive as difficulties in implementing any citation-related activities using an org-cite-like approach.

In org-element, citations and their components (citation-reference) are first-class objects, so you can have code like this in citar-org:

Thank you for the example! Is there any technical reason that this code example would require first-class object support, instead of just depending on the macro library that would provide the same AST? From a theoretical standpoint my somewhat limited understanding is that the primary difference is whether a citar-like tool/library would have access to:

  1. 2nd class approach: just the citation AST, perhaps also complemented by another argument that passes the document AST and an indication of which element represents the citation currently being processed. This document AST argument could also be replaced by a node representing the citation currently being processed, after which parent and root methods could be called to traverse the document AST more fully vs
  2. 1st class approach: a citation AST contained within a single node, upon which you could call some parent method and traverse into the main document AST

In what cases do you find a unified AST provides useful functionality over and above the 2nd class model? I'm struggling a bit to come up with strong use cases for this, but that doesn't mean there isn't a good strong use case.

@d-r-a-b

d-r-a-b commented Apr 23, 2023

Another part of step 1 - collect examples of citation styles we want to support.

https://www.overleaf.com/learn/latex/Questions/How_do_I_create_a_possessive_or_genitive_citation%3F

How do I create a possessive or genitive citation?

If you're submitting to a journal or conference proceedings, the publisher's template may have defined a custom command for creating genitive or possessive cites (e.g. Smith's (1990) study shows...), so check the template provided—and any instructions to authors.

Otherwise, you can create such cites depending on which package is used. Here are some examples:

Using harvard

\usepackage{harvard}
\possessivecite{Smith:1990}

Using natbib

\usepackage{natbib}
\citeauthor{Smith:1990}'s \citeyearpar{Smith:1990}

Using biblatex

\usepackage[backend=biber,style=authoryear]{biblatex}
\citeauthor{Smith:1990}'s (\citeyear{Smith:1990})

This is essentially a style variant in line with parenthetical, in-text, etc.

@d-r-a-b d-r-a-b mentioned this pull request Apr 23, 2023
@d-r-a-b

d-r-a-b commented Apr 23, 2023

I said:

In what cases do you find a unified AST provides useful functionality over and above the 2nd class model? I'm struggling a bit to come up with strong use cases for this, but that doesn't mean there isn't a good strong use case.

Pandoc defines a citation in its AST (https://hackage.haskell.org/package/pandoc-types). 1st class definition of a citation in norg-spec will likely make conversion between formats easier, but it's not impossible with a 2nd class citation. There's a bunch of nuance here. What workflows should be supported (i.e. for an export workflow, does norg-spec worry mostly about how neorg might achieve this, or about making implementation of export easy with any other tool)? How much should norg-spec be constrained by third-party considerations (like supporting pandoc or other useful tools)? How much does the spec want to take direct ownership vs delegate to extensions?

Anyone else have specific pros/cons for 1st class vs 2nd class citation objects?

@Klafyvel

You guys have been productive! I think @d-r-a-b's list is a good starting point. Some random thoughts:

  • Model: I think we would gain a lot by specifying it here, or at least in the spec repository. That way all various syntaxes that users decide to implement remain somewhat compatible. In its essence, each citation must be able to embed a set of metadata fields (maybe the full list can come from CSL?) and maybe a link to the next citation for grouped citations.
  • Syntax:
    • Second-class norg syntax is OK for me, even for various interpreters such as Norg.jl that should not be an issue once the macro system works fine. Even better, if a "norg-blessed" library emerges and its syntax is correctly specified, people can reimplement it in the parser's language for more efficiency. (That's what I plan to do for the standard library in Norg.jl).
    • Tree-sitter: that should not be an issue. Tree-sitter can define "sub-parsers" so we can write one for each library and have it loaded in Neorg.
    • databases: This is mostly a semantics issue, but we can decide to mix a bit of semantics here and have, say, a list of sources in a hypothetic @document.bibliography that get read in order until a citation key is found.
    • More generally, I think most configurations can go in @document.bibliography, and use a syntax similar to the one in @document.meta.
    • the [UP; KP @key KS,; ... ; US] syntax seems ok to me. That means most citations would look like {= @key} (I guess most people use the norg syntax to take notes rather than typesetting big documents, so we need to keep in mind that they are most likely going to set per-citation style only rarely).
    • We could define some kinds of layers, as in the norg spec, for parsers to implement. For example, layer 1 would be "handle {= @key} only", etc.
  • Package: I'm Ok with everything here. CSL seems to be the best option for text generation, and for neorg devs, there is https://github.com/zepinglee/citeproc-lua .
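The "read sources in order until a citation key is found" idea from the databases bullet above could be sketched roughly as follows. This is only an illustration: the `Source` shape, `resolve_key`, and the sample entries are hypothetical, and nothing here is part of the norg spec or of Neorg.

```python
from typing import Optional

# Hypothetical: each source maps citation keys to metadata fields, listed in
# the order they would appear in a @document.bibliography block.
Source = dict[str, dict[str, str]]


def resolve_key(key: str, sources: list[Source]) -> Optional[dict[str, str]]:
    """Return the entry from the first source that defines `key`, else None."""
    for source in sources:
        if key in source:
            return source[key]
    return None


# Sample data, invented for the example.
local = {"jon90": {"author": "Jones", "issued": "1990"}}
shared = {
    "jon90": {"author": "Jones (shared copy)", "issued": "1990"},
    "smith15": {"author": "Smith", "issued": "2015"},
}

print(resolve_key("jon90", [local, shared]))    # the local entry wins
print(resolve_key("smith15", [local, shared]))  # found in the second source
print(resolve_key("missing", [local, shared]))  # None
```

With this ordering rule, a document-local source can shadow a shared library simply by being listed first.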

Overall, I think we should aim for something not too huge to implement, but that can be extended afterwards, i.e. support one source for now (BibTeX for example, as a lot of tools can export to that), support one export format specifier (CSL), etc.

@d-r-a-b

d-r-a-b commented Apr 23, 2023

  • Model: I think we would gain a lot by specifying it here, or at least in the spec repository. ... [E]ach citation must ... embed a set of metadata fields (maybe the full list can come from CSL?) and maybe a link to the next citation for grouped citations.

For sure. My understanding is that the CSL specification only defines the style component in the diagram at https://docs.citationstyles.org/en/stable/primer.html, which essentially states

CSL processor : function( Style, Item Metadata, Locale, Citation Details) -> (Citation, Bibliography)

I haven't found a place that really describes how a CSL processor is meant to implement things like per-key prefix/affixes within https://citationstyles.org, nor a strong description of what we have been terming a model. These are things that I think correspond to "Citation Details" in the diagram referenced above.
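To make the shape of that function concrete, here is a toy sketch in which "Citation Details" carries per-key prefixes and suffixes. It is not a CSL processor and does not follow any CSL schema; `CitationItem`, `toy_processor`, and the dictionary layouts for style, metadata, and locale are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CitationItem:
    """One cited key plus its per-key affixes (hypothetical model)."""
    key: str
    prefix: str = ""  # e.g. "see "
    suffix: str = ""  # e.g. ", p. 4"


def toy_processor(style, metadata, locale, details):
    """(Style, Item Metadata, Locale, Citation Details) -> (Citation, Bibliography)."""
    def year_of(entry):
        # The locale supplies the term used when an item has no date.
        return entry.get("year", locale["no-date"])

    parts = []
    for item in details:
        entry = metadata[item.key]
        parts.append(f"{item.prefix}{entry['author']} {year_of(entry)}{item.suffix}")
    citation = style["open"] + style["delimiter"].join(parts) + style["close"]

    bibliography = sorted(
        f"{metadata[i.key]['author']} ({year_of(metadata[i.key])}). {metadata[i.key]['title']}."
        for i in details
    )
    return citation, bibliography


# Sample inputs, invented for the example.
STYLE = {"open": "(", "close": ")", "delimiter": "; "}
LOCALE = {"no-date": "n.d."}
METADATA = {
    "jon90": {"author": "Jones", "year": "1990", "title": "A Title"},
    "smith": {"author": "Smith", "title": "Undated Notes"},
}
DETAILS = [CitationItem("jon90", prefix="see ", suffix=", p. 4"), CitationItem("smith")]

citation, bibliography = toy_processor(STYLE, METADATA, LOCALE, DETAILS)
print(citation)
print(bibliography)
```

The point of the sketch is only that affixes travel with the citation details rather than with the style, which is the part the CSL documentation appears to leave to each processor.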

For Citation Details/Model level details, the best references that I have seen are pandoc.types and citeproc.types. My sense is that CSL as a spec leaves open the interpretation of a citation model to be implementation-specific for a processor. There is also a "test suite maintained by Frank Bennett for testing of [the citation processor] citeproc-js. The test suite can be used by authors of other CSL processors, but contains tests that go beyond the scope of the CSL specification." @bdarcus I would really appreciate your input on whether my understanding of what CSL defines is accurate here.

  • Syntax:
    • Second-class norg syntax is OK for me, ...
    • Tree-sitter: that should not be an issue. ...
    • databases: ...generally, I think most configurations can go in @document.bibliography, and use a syntax similar to the one in @document.meta ...
    • We could define some kinds of layers, as in the norg spec, for parsers to implement. For example, layer 1 would be "handle {= @key} only", etc. ...

Overall, I think we should aim for something not too huge to implement, but that can be extended afterwards, i.e. support one source for now (BibTeX for example, as a lot of tools can export to that), support one export format specifier (CSL), etc.

+1

  • the [UP; KP @key KS,; ... ; US] syntax seems ok to me. That means most citations would look like {= @key}

I would consider adding support for org-cite style locator as well as a part of the norg-cite-spec. It functionally looks exactly the same as a suffix but has recognized keywords for e.g. page ranges, and would still mean that the base cite looks like {= @key}(cite). Since we are currently proposing to borrow norg-spec extendable links, it needs the "(cite)" too.

(I guess most people use the norg syntax to take notes rather than typesetting big documents, so we need to keep in mind that they are most likely going to set per-citation style only rarely).

I don't think it's rare at all. A note such as "{= @key1}(cite/t) showed something really cool. I also wonder if it's related to this other idea {= @key2}(cite)" seems like a really common use-case, because I don't expect a note-taker to always put their citations in-text or always at the end. The value-add of using a citation key + reference manager solution within norg is really only apparent (to me) if you require export in a somewhat serious way. For simpler notes, it seems much, much more useful to just do {https://your.resource.com} or {/ /path/to/your/pdf} or {:other_norg_doc_summarizing_or_standing_in_for_ref:}. There's less hassle involved to get the exact same benefit of referencing a base document and quickly accessing it if that's all you actually want. Using native norg links would also have serious benefits once backlinks functionality comes online. Backlink support for citations would likely lag as a feature compared to native links, even more so if implemented in a 2nd class way as is being suggested here.

I fully agree that a lightweight syntax for the simplest use-case is desirable, but I think the main benefit is it makes documents more readable because the intent is clearer. Reducing syntax-related pain points also helps to preserve a writing flow that is focused on content instead of syntax.

@bdarcus

bdarcus commented Apr 23, 2023

@d-r-a-b

Yeah, there is no real CSL API, though I think we have enough experience and implementations to define one.

I haven't found a place that really describes how a CSL processor is meant to implement things like per-key prefix/affixes within https://citationstyles.org/, nor a strong description of what we have been terming a model.

Yes, because when we published the first version and docs, we weren't sure, and other priorities took over.

But then citeproc-js came along, which Zotero used but was independent from, and Frank developed a couple of JSON Schemas that described a kind of API between the two.

The CSL schemas repo now hosts versions we adapted from that work.

https://github.com/citation-style-language/schema/tree/master/schemas/input

Newer implementations certainly studied that, and evolved it.

I'd say pandoc and its citeproc is a good place to look, since it's much newer, and very well designed.

https://github.com/jgm/citeproc

See the JSON CLI server, for example.

https://github.com/jgm/citeproc/blob/master/man/citeproc.1.md#notes

As I say, I think we should better formalize those, and an API.

I'm actually working on an experiment that may address all this, including the API.

https://github.com/bdarcus/csl-next.js

But it's very tentative ATM.

There is also a "test suite maintained by Frank Bennett for testing of [the citation processor] citeproc-js. The test suite can be used by authors of other CSL processors, but contains tests that go beyond the scope of the CSL specification."

This is the test suite pretty much all the CSL projects use, which was adapted from Frank's, but now is a broader community effort. We've mostly removed tests specific to citeproc-js and CSL-M.

https://github.com/citation-style-language/test-suite

I would consider adding support for org-cite style locator as well as a part of the norg-cite-spec.

A citation that only allowed a single key wouldn't be usable for people in many fields.

@Klafyvel

I guess most people use the norg syntax to take notes rather than typesetting big documents, so we need to keep in mind that they are most likely going to set per-citation style only rarely)

I'd expect many would want to be able to publish finished manuscripts in PDF or OpenDocument from their norg documents, at least in time.

But even if one is only using norg for note-taking, a key part of that is properly citing while doing that.

A single key without additional metadata is basically a non-starter for people in many fields. For example, many fields in the humanities and social sciences do a lot of quotation of source material. If one is doing that, they must include the page number(s) or other "locators" (what we call them in CSL land).

So not sure your distinction between basically two modes of citation holds.

@bdarcus

bdarcus commented Apr 24, 2023

Is there anything you've run into that the model generated by citations using org-cite has been unable to handle?

No.

As I said, it's just an iterative improvement on the pandoc model, which works pretty well.

The only differences:

  • global affixes (though John is open (edit: omitted initially) to adding those to the Djot citation syntax)
  • the style/substyle system

In what cases do you find a unified AST provides useful functionality over and above the 2nd class model?

Hard to say, since as I said, I don't follow norg or neovim much at all.

I can just say that org-cite makes it really easy to write functional integration.

@d-r-a-b

d-r-a-b commented Apr 24, 2023

Amazing, thank you so much for your comments and clarification. It's also exciting to hear about developments happening in this space for CSL!

I would consider adding support for org-cite style locator as well as a part of the norg-cite-spec.

A citation that only allowed a single key wouldn't be usable for people in many fields.

Just to clarify, was this comment interleaved correctly? As I understand them, supporting locators does not imply a limitation to single key. For example, {= for more details see; @key1; @key2 pp. 5-12 for clarification}, where "pp. 5-12" is the locator and "for clarification" is the suffix. Following org-cite for their particular parsing (locator is a list of keywords pp., p., lines, etc followed directly by a number or numerical range) also doesn't prevent {= @key more on pp. 5-12}, where "more on pp. 5-12" is then a suffix with no locator represented in the parse tree.

@Klafyvel, I do understand your concerns about raising the scope too far especially for an initial implementation/minimum viable product. They are very reasonable and cogent concerns. I hope that I haven't been giving the impression that all the features being discussed need to be implemented early (or even at all). My goal in enumerating all of the various features that a user of citations might want is to make sure they are being considered so that 1 of 2 decisions can be made. 1) we would eventually like to support so-and-so feature and so we should make design decisions that will not require kludges later. 2) we say that some feature is explicitly never supported (because we think it's harmful, because it creates some form of ambiguity in parsing or output, because it's way too complicated and you should use some other tool if you want that, etc) and then we don't feel bad later if some citation doesn't work with norg because we made the decision consciously and after deliberation.

Parsing citation-pre; key-pre @key key-loc key-suf; . . . ; citation-suf into an AST is a very easy task within the overall "implement citations" project and is compatible with a lightweight syntax {= @key}(cite) in the norg doc. The difficult parts will definitely be what to do with the AST afterwards to make it interact usefully with the other tools in the space (reference managers, .bib, .sty, sqlite, CSL, output to md/txt/pdf etc). As you've suggested, limiting ourselves to 1 database provider (current proposal: .bib) and 1 style format (current proposal: .csl) is a great feature subset to begin with. I actually think we can go even further by only supporting 1 data provider (.bib) and 1 processor that doesn't worry about implementing multiple citation styles as implied by CSL. See oc-basic for an example of this. From an iterative dev standpoint to shorten time to MVP, very early iterations can limit themselves to discarding anything in the AST other than the first @key. Then, progressively handle more of the AST for multiple keys or affixes. Then, increase the range of output processors. By only doing a norg-basic processor first it also means we would get to a state where the package implementation can think about architecture for multiple processors relatively early but with the benefit of having explored the problem domain with some actual code.

@d-r-a-b

d-r-a-b commented Apr 24, 2023

To illustrate informal, incomplete grammar for what seems like the current iteration of the citation syntax under discussion:

citation: (citation-pre ';')? citation-item (';' citation-item)* (';' citation-post)?
citation-item: item-pre? cite-key locator? item-post?
cite-key: '@' key-ref
key-ref: [A-Za-z0-9_]+

citation-pre, citation-post, item-pre, item-post: string-but-escape-semicolons-and-at-signs

locator: loc-specifier loc-quantify
loc-specifier: 'p.' | 'pp.' | 'page' | 'pages' | 'l.' | 'll.' | 'line' | 'lines' | ... 
loc-quantify: num | num-range
num-range: num '-' num

Considering how much the current proposal mirrors the org-cite syntax, it's probably worth looking for their formal grammar definition, or else just ripping it from their parser.
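As a rough illustration of how little code the informal grammar above needs, here is a minimal parsing sketch. It makes several simplifying assumptions that are not part of the proposal: it splits on every ';' (no escape handling), treats a first or last segment without an '@' as the citation-level prefix or suffix, and recognizes only the locator specifiers spelled out in the grammar. The names `Item`, `Citation`, and `parse_citation` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# item-pre? '@' key-ref locator? item-post?
# The locator is only recognized when it directly follows the key.
ITEM = re.compile(
    r"""
    ^(?P<pre>[^@]*)
    @(?P<key>[A-Za-z0-9_]+)
    (?:\s+(?P<loc>(?:pp?\.|pages?|ll?\.|lines?)\s*\d+(?:-\d+)?))?
    (?P<post>.*)$
    """,
    re.VERBOSE,
)


@dataclass
class Item:
    key: str
    pre: str = ""
    locator: Optional[str] = None
    post: str = ""


@dataclass
class Citation:
    items: list[Item]
    pre: str = ""
    post: str = ""


def parse_citation(body: str) -> Citation:
    """Parse the text between '{=' and '}' into a small AST."""
    segments = [s.strip() for s in body.split(";")]  # no escape handling here
    pre = post = ""
    if segments and "@" not in segments[0]:
        pre = segments.pop(0)
    if segments and "@" not in segments[-1]:
        post = segments.pop()
    items = []
    for segment in segments:
        m = ITEM.match(segment)
        if m is None:
            raise ValueError(f"not a citation item: {segment!r}")
        items.append(
            Item(key=m["key"], pre=m["pre"].strip(), locator=m["loc"], post=m["post"].strip())
        )
    if not items:
        raise ValueError("a citation needs at least one @key")
    return Citation(items=items, pre=pre, post=post)


print(parse_citation("for more details see; @key1; @key2 pp. 5-12 for clarification"))
print(parse_citation("@key more on pp. 5-12"))
```

In the second call the locator field stays empty and "more on pp. 5-12" lands in the item suffix, matching the behavior described earlier in the thread for a locator that does not directly follow the key.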

The (cite/variant) part, as currently proposed, will just be a convention for naming the cite macros provided by a given output processor. I imagine Janet has something similar to Lua metatable for calling functions that don't technically exist and mapping them to a base fallback function.
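The fallback idea can be sketched in Python, whose `__getattr__` hook plays a role similar to a Lua metatable `__index`. This is only an analogy: `BasicProcessor`, `call_macro`, and the "cite/t" to "cite_t" name mapping are hypothetical, and it says nothing about how Janet or the norg macro system would actually do it.

```python
class BasicProcessor:
    """Hypothetical output processor exposing cite macros as methods."""

    def cite(self, keys):
        # Base macro: parenthetical citation.
        return "(" + "; ".join(keys) + ")"

    def cite_t(self, keys):
        # One explicitly implemented variant: in-text citation.
        return ", ".join(keys)

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, so unknown
        # variants such as cite_possessive fall back to the base macro.
        if name.startswith("cite_"):
            return self.cite
        raise AttributeError(name)


def call_macro(processor, macro, keys):
    """Map a macro name like 'cite/t' onto a method name like 'cite_t'."""
    return getattr(processor, macro.replace("/", "_"))(keys)


processor = BasicProcessor()
print(call_macro(processor, "cite/t", ["jon90"]))
print(call_macro(processor, "cite/possessive", ["jon90", "smith15"]))
```

A document written against a richer processor would then still render, with unsupported variants degrading to the base style.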

@d-r-a-b

d-r-a-b commented Apr 24, 2023

hmmm... If multiple locators are supported within the AST, the syntax for that needs to be figured out. Would correspond to output of something like "(Jones 1988, pp 12-15, 30-34, 88 for more details)".

To some extent, if an author needs multiple locators they can utilize item-suffix to cover 90% of cases. However, the main use cases for an AST representation of a locator that come quickly to mind are for consistency (always put "see" in front of the locator in the output citation), for localization of that "see" string, and for being able to support a style directive that removes all locators in the output citations. That style directive will act very poorly in the 1 locator case because it will look like it works for most citations and then the author will be surprised by the multiple locator situation that only hides the first locator in the output. Of course in the 0 locator case a style directive would do nothing, but there are no surprises.

Decision: 1 locator, multiple locators, 0 locators in AST
Subsequent decision: syntax to accommodate

My impression is that explicitly 1 locator in the AST is worse than 0 or multiple locators for reasons listed above.

@bdarcus

bdarcus commented Apr 24, 2023

Just to clarify, was this comment interleaved correctly?

It was awkward; just meant to agree with you in that, and suggest there's no reason to allow only one.

If multiple locators are supported within the AST, the syntax for that needs to be figured out.

The existing implementations I am aware of (elisp, Haskell, JS, rust) accept citation-reference suffix strings as input and parse them into lists of locators, assuming a standardized syntax. I don't have the details handy, but they're documented in those implementations.

EDIT: See the very precise English description in the oc-csl.el commentary.

He derived it from citeproc-org, which likely borrowed from pandoc :-)

It's also exciting to hear about developments happening in this space for CSL.

I don't want to overstate it: it's currently a personal experiment.

I am, however, starting to document the model (I adapted the locators docs I link above to a docstring there), and have added a docs target for make that will generate documentation. A relevant example:

image

@d-r-a-b

d-r-a-b commented Apr 25, 2023

EDIT: See the very precise English description in the oc-csl.el commentary.

[the locator] ends with the last comma or digit in the suffix, whichever comes last, or runs till the end of the suffix.

I really dislike their parse method if this is correct. It leads to surprising parse behavior if you have an org-cite citation like [cite: @key pp. 5-9 for 5 different methods]. In this case, the locator as parsed by their definition would be "pp. 5-9 for 5" and the suffix would be "different methods", likely an inaccurate model of the intention. It also breaks on [cite: @key pp 3, 8, 12 for descriptions of A, B and C, respectively]. My view is that actively fighting against software is always a worse experience than simply not having software offer a feature that might help you. This is something I think we could iterate on, or decide to not support a syntax for locators.
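The surprise can be reproduced with a small sketch of the rule as quoted above. This is my reading of that one sentence, not the actual oc-csl.el code: the locator-term list is abbreviated, the locator must start the suffix, and the cut is made right after the last comma or digit. `LOCATOR_START` and `split_suffix` are hypothetical names.

```python
import re

# A locator term followed by blank(s) and a digit, at the start of the suffix.
LOCATOR_START = re.compile(r"\b(?:pp?\.?|pages?|ll?\.?|lines?)\s+\d")


def split_suffix(suffix):
    """Split a per-key suffix into (locator, remaining suffix)."""
    text = suffix.strip()
    if LOCATOR_START.match(text) is None:
        return None, text
    # "ends with the last comma or digit in the suffix, whichever comes last"
    end = max(i for i, ch in enumerate(text) if ch.isdigit() or ch == ",") + 1
    return text[:end], text[end:].strip()


print(split_suffix("pp. 5-9 for 5 different methods"))
print(split_suffix("pp 3, 8, 12 for descriptions of A, B and C, respectively"))
print(split_suffix("more on pp. 5-12"))
```

In the first two calls the "locator" swallows prose that the author clearly meant as suffix text, because a later digit or comma extends it.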

I am, however, starting to document the model

That looks really awesome!

@Klafyvel

@Klafyvel, I do understand your concerns about raising the scope too far especially for an initial implementation/minimum viable product. They are very reasonable and cogent concerns. I hope that I haven't been giving the impression that all the features being discussed need to be implemented early (or even at all). My goal in enumerating all of the various features that a user of citations might want is to make sure they are being considered so that 1 of 2 decisions can be made. 1) we would eventually like to support so-and-so feature and so we should make design decisions that will not require kludges later. 2) we say that some feature is explicitly never supported (because we think it's harmful, because it creates some form of ambiguity in parsing or output, because it's way too complicated and you should use some other tool if you want that, etc) and then we don't feel bad later if some citation doesn't work with norg because we made the decision consciously and after deliberation.

Thank you for the clarification!

@bdarcus

bdarcus commented Apr 25, 2023

I really dislike their parse method if this is correct.

Pandoc has some additional options to handle the problems you note, I believe, using TeX-like brackets.

These data issues are really tricky. You need to support well by far the most important case here, which is page numbers, but not foreclose other options.

E.g. the old make the common easy and the complex possible.

When we were working on enhancements to CSL early in the pandemic, we actually converted it to an array of objects. But that's awfully complex for the common case.

@d-r-a-b

d-r-a-b commented Apr 25, 2023

You need to support well by far the most important case here, which is page numbers, but not foreclose other options. E.g. the old make the common easy and the complex possible.

I think we all agree that the ideal scenario would be a syntax that allows explicit specification of multiple locators and still degrades nicely to a simple/easy syntax for the simple or no locator case. I'll be putting some thought into how to achieve that.

What I really, really dislike is when the software makes the complex impossible. When things that are meant to automate and ease your life turn into a slogfest of trying to just get the thing to do what you want, it's much worse than never automating it at all. Consider an author who writes their citation with multiple logical locators, [cite: @key pp 5-6,16-18 for 2 examples], and who then changes style guides in a way that wants to manipulate locator data.

  • syntax supports multiple locators properly: magic happens and everyone is happy
  • syntax supports only prefix/suffix and doesn't parse locators to an AST: no magic, but the solution is conceptually simple -> author manually changes suffix
  • syntax supports only single locator, or parses locators wrong: magic goes awry. Author only notices if they read their draft carefully and author needs to find some magic invocation to fix things, sometimes involving patching the citation software itself

If the processing of locators can be wrong, then there needs to be a simple way to turn locator processing off or to provide it in a more verbose syntax. If that cannot be done, then I'd rather have no explicit support and say that authors will have to manually fix their suffixes if they switch style guides and now need "pages" to show up as "pps.".

we actually converted it to an array of objects

In essence, I think the AST for the locator should look like an array of objects. To achieve the ideal case, my sense is that we would have to figure out how to create a syntax that hides this underlying model in the simple case of 0 or 1 locators. It would be really cool if we can hide it in the multiple locator case too, although I wonder how feasible that will be.

The current org-cite solution seems to achieve the goal of hiding the model for the simple case, but break down in a lot of specific cases that an author might want. You mentioned that pandoc has some optional facilities to deal with this. I'll see if I can track down that syntax for ideas.

EDITS: many for formatting. I'm a klutz this morning.

@bdarcus

bdarcus commented Apr 25, 2023

You mentioned that pandoc has some optional facilities to deal with this. I'll see if I can track down that syntax for ideas.

See here, in para that starts "In complex cases ...".

[@smith{ii, A, D-Z}, with a suffix]
[@smith, {pp. iv, vi-xi, (xv)-(xvii)} with suffix here]
[@smith{}, 99 years later]

Edit: Of course, you can play with pandoc -t json ... to see how it handles these internally.
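A brace-delimited locator like the three examples above is easy to pull out without heuristics. The sketch below works on the text that follows the key inside the brackets; it is an illustration of the idea only, not pandoc's parser, and `BRACED` and `split_explicit_locator` are hypothetical names. It returns a flag so that empty braces ("explicitly no locator") can be told apart from no braces at all.

```python
import re

# Optional comma and blanks, then a brace-delimited locator, then the rest.
BRACED = re.compile(r"^,?\s*\{(?P<loc>[^{}]*)\}(?P<rest>.*)$")


def split_explicit_locator(after_key):
    """Return (explicit, locator, suffix) for the text following '@key'."""
    m = BRACED.match(after_key)
    if m is None:
        # No braces: a caller could fall back to heuristic locator detection.
        return False, None, after_key
    locator = m["loc"].strip() or None
    return True, locator, m["rest"]


print(split_explicit_locator("{ii, A, D-Z}, with a suffix"))
print(split_explicit_locator(", {pp. iv, vi-xi, (xv)-(xvii)} with suffix here"))
print(split_explicit_locator("{}, 99 years later"))
print(split_explicit_locator(", pp. 33-35"))
```

The third call shows the escape hatch: empty braces mark the item as having no locator, so "99 years later" is never mistaken for one.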

@d-r-a-b

d-r-a-b commented Apr 29, 2023

Still thinking about the locator syntax problem, but something I came across: https://list.orgmode.org/orgmode/[email protected]/

From the org-mode mailing list, relating to macros and citations and citation processors. There is more to see in that thread and another 2 threads which reference it, but a summary is that figuring out how to make their CSL export processor interoperate with Org elements in the prefix/suffix elements is a tricky problem to solve because CSL has its own concept of formatted text that does not always map to what org-mode believes in (i.e. a smallcaps format). I wanted to highlight the user who is trying to make a macro work in the prefix.

That being said, I /found/ an alternative that works, albeit it is a bit ugly. I can create an explicit footnote, use a [cite/default/bare:] construct (to suppress the terminal period) within it and terminate the citation before the macro begins. That way, the macro is outside of the citation construct. This construction is however unfortunate when I want to cite multiple sources and have the macro used on an earlier one, e.g.:

[fn:1] [cite/default/bare:@foo p. 5], countering {{{name(Doe’s)}}} argument; [cite/default/bare:@bar p. 37].

It would be nicer if I could just write into the main text

[cite:@foo p. 5, countering {{{name(Doe’s)}}} argument;@bar p. 37]

This is precisely what I mean about having to fight with the citation system, although in this case I would say that it originates more from the package than from the syntax. The user encountered the problem in Oct 2022 and was trying to submit patches to get it fixed through Jan 2023. I didn't see a resolution, but maybe in another thread. It's worth making sure that there is at least a fallback that minimally provides this kind of hacky solution, but it would be nice if we either had a better fallback, or made it very clear that you could always ask the processor for an individual bibliographic element, ideally formatted according to the main rules of the CSL style.

@bdarcus

bdarcus commented Apr 29, 2023

FWIW, here's what we came up with for the JSON schema model for CSL v1.1.

Here's an example:

              {
                "locators": [
                  { "page": 23 },
                  { "begin": { "page": 25 }, "end": { "page": 28 } }
                ]
              }

We basically concluded our priority for these input files, which aren't likely to be touched by users, is correctness, and in this case wanting to allow processing of those lists.

A possibly reasonable alternative could be something like this, but it raises other issues (for example, it would rely on a processor treating the value as a plain string):

              {
                "locators": [
                  { "page": "23, 25-28" }
                ]
              }
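The relationship between the two shapes above can be sketched as a small conversion from the plain-string form to the array-of-objects form. This is an assumption-laden illustration, not part of any CSL schema or processor: `to_locators` is a hypothetical helper that only understands comma-separated arabic numbers and simple "a-b" ranges.

```python
import json


def to_locators(label, text):
    """Convert e.g. ("page", "23, 25-28") into the array-of-objects form."""
    locators = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            begin, end = (int(piece) for piece in part.split("-", 1))
            locators.append({"begin": {label: begin}, "end": {label: end}})
        else:
            locators.append({label: int(part)})
    return {"locators": locators}


print(json.dumps(to_locators("page", "23, 25-28"), indent=2))
```

Roman numerals, abbreviated ranges such as "585-90", or bracketed page numbers would raise a `ValueError` here, which is a small demonstration of why the plain-string form pushes the hard parsing questions onto every processor.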

PS - it's not clear what the future of that 1.1 branch is, which is why I'm also experimenting with the alternative.

@d-r-a-b

d-r-a-b commented Apr 29, 2023

Thanks for the input!

Just making sure I am remembering correctly - CSL has nothing to say on author-supplied affixes right? These are purely implementation-defined by whichever citeproc variant is parsing the CSL?

@bdarcus

bdarcus commented Apr 29, 2023

CSL has nothing to say on author-supplied affixes right? These are purely implementation-defined by whichever citeproc variant is parsing the CSL?

Correct.

But they all have settled on a similar approach, so seems past time to standardize.

Aside: suffice to say, the success of CSL is sometimes a bit of a challenge. Imagine ten different implementations of neorg!

Worth keeping in mind, though, there are two broad groups here:

  1. GUI apps (like the biggest, Zotero), where users are putting strings in fields, and formatting is constantly updated.
  2. batch processors (pandoc, org-cite CSL, citeproc-lua in TeX), where users are hand-editing content

In both cases, you have to expose a sane UI to users, whether in the form of GUI field(s), or a markup syntax.

That tension between machine-friendly and human-friendly is the nub of the challenge.

@d-r-a-b

d-r-a-b commented Apr 29, 2023

But they all have settled on a similar approach, so seems past time to standardize.

Standardize all the things!

But also, I don't like the existing implementations in terms of the functionality they currently offer around more ad-hoc citations, mostly around these bugbears:

  • Multiple locators
    • In particular per-locator prefix and suffix
    • How to model a list of locator types?
      • [ (loctype, numeric|range, affixes) ]
      • [ (loctype, [numeric|range], affixes) ]
      • [ ( loctype, [(numeric|range), affixes] ) ]
      • [ ( loctype, [ (numeric|range, affixes) ] ) ]
      • can a list contain sublists with per-sublist affixes?
  • Weird capitalization and formatting things, e.g. when you need a formatted symbol, like the one for LaTeX, to appear.
  • "Dynamic" or macro content
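
As one way to compare the candidate shapes, here is a minimal sketch of the last one, `[ ( loctype, [ (numeric|range, affixes) ] ) ]`, where every locator value carries its own affixes. All class and function names are hypothetical, and the label table and the " and " joiner stand in for whatever a real style would specify.

```python
from dataclasses import dataclass, field

@dataclass
class LocatorValue:
    value: str            # a position ("12") or a range ("585-90")
    prefix: str = ""      # author-supplied text before this value
    suffix: str = ""      # author-supplied text after this value

@dataclass
class LocatorGroup:
    loctype: str          # e.g. "chapter" or "page"
    values: list[LocatorValue] = field(default_factory=list)

def render(groups: list[LocatorGroup], labels: dict[str, str]) -> str:
    """Render each value with its style label and its own affixes."""
    parts = []
    for group in groups:
        label = labels[group.loctype]
        for v in group.values:
            pieces = (v.prefix, f"{label} {v.value}", v.suffix)
            parts.append(" ".join(p for p in pieces if p))
    return " and ".join(parts)  # joiner is an assumption, not a style rule

groups = [
    LocatorGroup("chapter", [LocatorValue("12", prefix="see", suffix="for problem sets")]),
    LocatorGroup("page", [LocatorValue("585-90", suffix="for worked solutions")]),
]
print(render(groups, {"chapter": "Ch.", "page": "pp."}))
```

With these inputs the sketch produces "see Ch. 12 for problem sets and pp. 585-90 for worked solutions", which is the per-locator-affix case from this comment. It does not address inter-locator dependence.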

These support things like

  1. dynamic content: [cite:{{{name(Interviewee)}}} in @interviewer p. 5]
  2. per-locator affixes: . . . innovative proofs (Smith, 2015, see Ch. 12 for problem sets and pp. 585-90 for worked solutions)
  3. inter-locator dependence: . . . innovative proofs (Smith, 2015, see pp. 585-90 in the back of Ch. 14 for worked solutions)
  4. capitalization, symbols and an author-supplied link: Von Surname was an amazing contributor to many projects (von Surname, 2015, for details on contributions to LaTeX, excerpts of which are available at website schema://website.clickable)

The citation syntaxes and resulting models that I've seen, especially in the lightweight markup world, don't seem to deal with these citations well. I would be happy to see counter-examples of syntaxes that do cover these cases well and can produce the appropriate output as citation styles change. Some of this is modeling and syntax and needs to be fixed at those levels, some of it is just the difficult problem of software interop, but these examples show some of the real edge cases that authors want to be able to express. My hope for this thread is for us to come to some conclusion about whether or not we want to support authors who want these things and if so, how to do it in a way that parses cleanly and limits the introduction of new markup unless absolutely needed.

@d-r-a-b

d-r-a-b commented Apr 29, 2023

I suppose the other thing is there is a weird asymmetry in all the syntaxes. Why isn't there any support for the concept of a citation like

(See p. 5 in Jones, 2023)

?

I know that I would never write that, but there are many things I wouldn't do that other authors might want or have a business need to do.

@bdarcus

bdarcus commented Apr 29, 2023

I suppose the other thing is there is a weird asymmetry in all the syntaxes. Why isn't there any support for the concept of a citation like ...

I agree, but the answer is because it's never come up AFAIK.

We also haven't talked about "rich" markup there, which has come up.

@d-r-a-b

d-r-a-b commented May 11, 2023

Still very interested in this, but balancing a number of priorities. See https://gist.github.com/d-r-a-b/e359904b2e8f1bd4e9eca2574b8e6265 for a flash-frozen state of my thoughts at the moment. The only really useful bit of it is the list of terms, but other bits might be scavenged either to see online resources related to citations or for sentences that might be useful for an actual draft document.

It's also missing the idea I'm about to suggest.

I think I have a viable suggestion for how to model a citation (abstract), and from that some ideas on how it might be realized syntactically.

Assumption: citation style (MLA/APA/etc) for the document is defined elsewhere. This post only concerns the citation object itself and how it exposes sufficient information that an implementing citation processor can use the AST to produce a practical citation export in combination with some style specification (CSL, .sty etc)

Continuing from the basic grammar in #15 (comment), I am suggesting conceptually the same {= citation-prefix? ; key-prefix? @key key-suffix? ; citation-suffix?}(cite/var/subvar) but with the added twist that all affixes take the form of a list conceptually, which may contain the following 3 types of objects in any order and combination

  1. locator data: consists of a single locator-type and 1 or more locator-position | locator-range. Locator type ideally should be among the keywords understood by the processor, and may be localized according to the style specification by the processor. Positions and ranges do not need to be numeric, as they may correspond to e.g. "verse A", "section 5.2-6.1" etc. When the citation processor is able to parse the range, it may choose to manipulate it on the basis of the style specification.
  2. invariant data: data which acts just like a string from the perspective of the citation processor. This invariant data may contain formatting information or syntax that will be expanded at a later time by some macro engine, but the basic idea is that the citation processor processes it as a static string. The citation processor does no localization or alteration to this invariant data, it simply places it accordingly
  3. insertion-markers: allow the user to specify where per-citation or per-key affixes are inserted.
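
A minimal sketch of that model, assuming nothing beyond the three item types described above. The class names, the `render_affix` helper, and the label table are hypothetical; a real processor would take labels and the style affix from the style specification.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Locator:
    loctype: str           # ideally a keyword the processor understands
    positions: list[str]   # positions or ranges; need not be numeric

@dataclass
class Invariant:
    text: str              # placed as-is, never localized or altered

@dataclass
class InsertionMarker:
    name: str = ""         # optional identifier, e.g. "main"

AffixItem = Union[Locator, Invariant, InsertionMarker]

def render_affix(items: list[AffixItem], labels: dict[str, str], style_affix: str) -> str:
    """Place invariant text untouched, label locators, fill markers from the style."""
    out = []
    for item in items:
        if isinstance(item, Invariant):
            out.append(item.text)
        elif isinstance(item, Locator):
            label = labels.get(item.loctype, item.loctype)
            out.append(f"{label} {', '.join(item.positions)}")
        else:
            out.append(style_affix)
    return " ".join(part for part in out if part)

print(render_affix(
    [Locator("pages", ["5-12"]), Invariant("on the bottom sections of the pages")],
    {"pages": "pp."},
    style_affix="consult",
))
```

Because prefixes and suffixes are both just lists of these items, the same renderer serves either position.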

The exact syntax to achieve this is up to debate, but this might be more easily understood if:

  1. affix tokens are invariant by default
  2. locator data is bounded by some characters, perhaps { and } to be more familiar with the pandoc syntax
  3. insertion-marker is a single character #, which may also be immediately followed by an attached alphanumeric identifier such as #main.

This would allow the citation (consult Smith, 1995, pp. 5-12 on the bottom sections of the pages; for an example, consult Walters, 2000; consult pages 8-9 in Adams, 2015; the day of the week is Monday; I hate that day and that is why this citation is ugly) to be represented as

{= @smith1995 {pages 5-12} on the bottom sections of the pages ; for an example, # @walters2000 ; # pages 8-9 in @adams2015 ; the day of the week is {= }(dayoftheweek)\; which I hate and that is why this citation is ugly}(cite/variant)

This illustrates several points about the proposed model

  1. only the locators within delimiters get processed by the style, so {pages 5-12} becomes "pp 5-12" according to whatever the style rules are, but pages 8-9 are part of invariant data and so are left alone.
  2. this example presumes a style that uses a per-key prefix of "consult". Usage of the # permits the citation object to indicate a preference for insertion location. In the absence of any insertion-markers, a reasonable default would be for the citation processor to infer an insertion-marker immediately adjacent to the citation key.
  3. citation-prefix and citation-suffix are identified by lack of citation key and overall position
  4. dynamic content is represented as part of an invariant data. It may be macro expanded either before processing by the citation processor or after. Probably makes more sense to expand beforehand?
  5. Need for an escape character for ;, #, {, } and the escape character itself
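
To check that the proposal parses cleanly, here is a rough tokenizer sketch for a single affix (the text on one side of a key, with `;` splitting already done). It assumes backslash as the escape character and `{`/`}` as locator delimiters; the function name and token labels are hypothetical. It deliberately does not handle nested macro calls such as `{= }(dayoftheweek)` inside invariant data, nor escapes inside a locator, and it trims whitespace around tokens.

```python
def tokenize_affix(text: str) -> list[tuple[str, str]]:
    """Split affix text into ("invariant" | "locator" | "marker", value) tokens."""
    tokens: list[tuple[str, str]] = []
    buf: list[str] = []

    def flush() -> None:
        # Emit any pending invariant text (whitespace-trimmed for simplicity).
        s = "".join(buf).strip()
        if s:
            tokens.append(("invariant", s))
        buf.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            buf.append(text[i + 1])      # escaped character is kept literally
            i += 2
        elif ch == "{":
            flush()
            end = text.find("}", i)
            if end == -1:
                raise ValueError("unterminated locator")
            tokens.append(("locator", text[i + 1:end].strip()))
            i = end + 1
        elif ch == "#":
            flush()
            j = i + 1
            while j < len(text) and text[j].isalnum():
                j += 1                   # optional attached identifier
            tokens.append(("marker", text[i + 1:j]))
            i = j
        else:
            buf.append(ch)
            i += 1
    flush()
    return tokens

print(tokenize_affix("{pages 5-12} on the bottom sections of the pages"))
print(tokenize_affix("for an example, #"))
```

The clash between `{` as a locator delimiter and `{= }` as a macro call is visible immediately in a sketch like this, so the final delimiter choice probably needs to account for it.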

What do people think of the general idea (model or syntax)? It allows you to do something simple like {= @King2023, p 5} and get what you expect, while asking you to start adding more syntax if you want the processor to do any locator manipulation or need to insert the style-specified affix in a more sensible place while still specifying your own.

@bdarcus

bdarcus commented May 11, 2023

What do people think of the general idea (model or syntax)? It allows you to do something simple like {= @King2023, p 5} and get what you expect, while asking you to start adding more syntax if you want the processor to do any locator manipulation or need to insert the style-specified affix in a more sensible place while still specifying your own.

I've only quickly read this @d-r-a-b, but I like it.

I think the only reason org didn't go with the wrapper for locator is some technical reason.

And as you note, the really common simple case remains simple.

Except, why does the simple example not include the wrapper for the locator? Did I miss some exception?

I guess I should mention, since I don't think we discussed it, and it could impact details: some styles require resorting and grouping of multi-reference citations for output. It's one reason why a distinction between global and local affixes is useful.

So ...

[@doe20; @smith20; @doe10]

... might become:

(Doe, 2010, 2020; Smith, 2020)
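
A minimal sketch of that sort-and-group step, assuming an author-year style that sorts by author then year and collapses repeated authors. The `BIB` table is made-up illustration data and `render_citation` is a hypothetical name, not any citeproc API.

```python
from itertools import groupby

# Hypothetical bibliography: citation key -> (author, year)
BIB = {
    "doe20": ("Doe", 2020),
    "smith20": ("Smith", 2020),
    "doe10": ("Doe", 2010),
}

def render_citation(keys: list[str]) -> str:
    """Sort references by author and year, then group years under each author."""
    refs = sorted(BIB[key] for key in keys)
    groups = []
    for author, items in groupby(refs, key=lambda ref: ref[0]):
        years = ", ".join(str(year) for _, year in items)
        groups.append(f"{author}, {years}")
    return "(" + "; ".join(groups) + ")"

print(render_citation(["doe20", "smith20", "doe10"]))
```

Once the references are reordered like this, an affix attached to `@doe20` has to travel with that reference, while a citation-level affix stays at the edge of the whole citation.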

@d-r-a-b

d-r-a-b commented May 11, 2023

Except, why does the simple example not include the wrapper for the locator? Did I miss some exception?

To my mind, the truly simple case is to reduce the amount of magic that the citation processor will do without prompting; hence by default the simple case that I chose to show as {= @King2023, p 5}(cite) will just treat the , p 5 as a plain string affix. It will not attempt to do anything fancy like consistently put locators in full/abbreviated form, re-order ranges, localize to region, etc. If you need these they are available through the syntax, but there is no greedy parsing that will get things right 98% of the time but be almost impossible to fix 2% of the time. To get the processor to treat the locator as such and do something special with it, you opt in: if you want locator processing, then this syntax would suggest {= @author {p 5}}(cite). The changes are the additional brackets and the omission of the comma, since the processor should know if it wants to add a comma for a locator.

I guess I should mention, since I don't think we discussed it, and it could impact details: some styles require resorting and grouping of multi-reference citations for output. It's one reason why a distinction between global and local affixes is useful.

Unless I'm missing something, I believe the syntax I discussed still has the concept of "global" or per-citation affixes and "local" or per-key affixes. The primary difference is that the structure that represents any of them is identical: a list of Option<invariant-data|locator-data|insertion-marker>. Suffixes and Prefixes are not different data structures from one another, locators are not limited to suffixes, and there is a mechanism to specify where the affixes that the style guideline wants to insert should actually go. From a parse perspective on the opening of the citation, the first non-whitespace character after {= may be

  1. @, in which case there is no per-citation or per-key prefix
  2. A non-whitespace, non-@ character (or the escaped \@ sequence if you need to put in @ for some reason), in which case it is known that there is a prefix. The decision on whether it is a per-citation or per-key prefix is determined upon the first encounter with a @ character or a ; character, corresponding respectively to a per-key/local prefix or a per-citation/global prefix.
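
The two cases can be sketched as a small scanner over the text that follows `{=`. This assumes backslash escaping as described above; the function name and the returned labels are hypothetical, and `;` or `@` appearing inside a `{...}` locator is not handled.

```python
def classify_opening(body: str) -> str:
    """Classify what follows `{=`: no prefix, a per-key prefix, or a per-citation prefix."""
    text = body.lstrip()
    if text.startswith("@"):
        return "no prefix"
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2          # skip the escaped character, e.g. \@ or \;
            continue
        if ch == "@":
            return "per-key prefix"        # key reached before any separator
        if ch == ";":
            return "per-citation prefix"   # separator reached before any key
        i += 1
    return "no key found"

print(classify_opening(" @smith1995 {pages 5-12}"))
print(classify_opening("see @doe20"))
print(classify_opening("for discussion; @doe20"))
```

The decision needs only a single left-to-right pass with no backtracking.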

@bdarcus

bdarcus commented May 11, 2023

If you want locator processing, then this syntax would suggest {= @author {p 5}}(cite).

Right, so the simple example is maybe a little too simple for practical use.

But I think that's fine. As we discussed, there are no magic bullets here that nicely balance all priorities, and if you want to have rigorous parsing without magic, that's a totally cool design decision.

The changes are the additional brackets and the omission of the comma, since the processor should know if it wants to add a comma for a locator.

It's really, at least in the CSL world, the style that governs whether it's a colon, comma, etc.

Unless I'm missing something, I believe the syntax I discussed still has the concept of "global" or per-citation affixes and "local" or per-key affixes.

Yes, I wasn't meaning to suggest otherwise. It just occurred to me worth mentioning in this context.

@d-r-a-b

d-r-a-b commented May 12, 2023

Right, so the simple example is maybe a little too simple for practical use.

I would argue that depends on the user, but it's true that it makes locator magic opt-in and that does make the simplest version undesirable for peer-reviewed publishing (as opposed to personal or even school-assignment level citation needs, where the pure string version could serve many people well).

I suppose the other option is to provide a set of delimiters that would actually make something explicitly a string instead, thus making locators more opt-out. Tbh, I'm not sure which decision is more elegant or practical. There's a certain purity to requiring a delimiter to "promote" a string into a locator that is very nice and reduces the surprises a lot. It also makes it clear that the user expects this next token to be a locator-unit, so a mistyping can trigger a macro error instead of silently failing to convert into a locator AST. It also seems potentially a bit weird to require a delimiter to make something a string when the surrounding invariant-data is already treated as a string without requiring any such delimiters. There's a consistency and principle of least surprise that feels like it works well with the rest of the norg-spec philosophy.

OTOH, if most people would prefer the magic, there is an elegance to making the syntax for it as straightforward as possible and relegating the less common option to the more cumbersome syntax.

@bdarcus

bdarcus commented May 22, 2023

In the change I just pushed, the YAML representation would be:

suffix: [see, page: 23, section: V]

... which is reasonably elegant for both a human writer and a machine parser.
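
As I understand YAML flow sequences, a loader would read that line as a list holding one plain string and two single-pair mappings. The sketch below writes that shape as a Python literal rather than calling a YAML library, and renders it with a label table. `LABELS`, `render_suffix`, and the space joiner are hypothetical; a real processor would take labels and delimiters from the style and locale.

```python
# Assumed parse result of: suffix: [see, page: 23, section: V]
suffix = ["see", {"page": 23}, {"section": "V"}]

# Hypothetical label table; swapping it is where localization would happen.
LABELS = {"page": "p.", "section": "sec."}

def render_suffix(items: list, labels: dict[str, str] = LABELS) -> str:
    """Emit strings unchanged and prefix each locator value with its label."""
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        else:
            for loctype, value in item.items():
                parts.append(f"{labels.get(loctype, loctype)} {value}")
    return " ".join(parts)

print(render_suffix(suffix))
```

Only the mapping keys (`page`, `section`) need to be recognized by the processor; the plain strings pass through untouched.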

Edit: except it's biased towards English speakers, since all the symbols are English.

Do you have any conventions for that sort of thing in neorg?

@d-r-a-b

d-r-a-b commented May 23, 2023

Nice to see the "affix as array of option<locator,string>" idea get a concrete commit! Would commas in the strings be escaped?

Do you have any conventions for that sort of thing in neorg?

Conventions for what exactly? Lists? Key-value pairs?

Tags are one of the mechanisms for calling macros; they can take a series of space-delimited parameters:

  There are 6 different tag types, each with their own way of changing the way text in Norg is
  interpreted. Before we discuss those, however, we should discuss the syntax rules for tags:
  - A tag is similar to a {# detached modifiers}[detached modifier] in the sense that it must begin
    at the beginning of a line with optional {*** whitespace} (but nothing else) preceding it.
  - After that you will encounter a special tag character (`=`, `|`, `@`, `#`, `+` and `.`), /none/
    of which are attached modifiers (see {^ disambiguating tags and attached modifiers}). The
    special tag character is then /immediately/ followed by text, which becomes the /tag name/. Said
    tag name can consist of any {# regular characters}[regular character] and/or `-` and `_`.
  - Tags can have their names delimited by a `.` in order to create a "hierarchy", e.g.
    `document.meta`.
  - ::
    After a {*** whitespace} character any number of parameters on the same line may follow:
    |example
    #tag-name.subtag parameter1 parameter2
    |end
    By default parameters are space-separated. In order to create multi-word parameters, you may
    escape the space character with a backslash (`\`).
    |example
    #tag-name.subtag parameter1\ with\ spaces parameter2
    |end
    Parameters may consist of any character (apart from a {*** line endings}[line ending], of course).
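
A minimal sketch of the parameter rules quoted above: space-separated parameters, with a backslash escaping a space. The function name is hypothetical, treating a backslash before any other character as a literal escape is an assumption, and validation of the tag character and tag name is left out.

```python
def split_tag_line(line: str) -> tuple[str, list[str], list[str]]:
    """Return (tag character, hierarchical name parts, parameters) for one tag line."""
    stripped = line.lstrip()                 # optional leading whitespace
    tag_char, rest = stripped[0], stripped[1:]
    name, _, param_text = rest.partition(" ")

    params: list[str] = []
    buf: list[str] = []
    i = 0
    while i < len(param_text):
        ch = param_text[i]
        if ch == "\\" and i + 1 < len(param_text):
            buf.append(param_text[i + 1])    # escaped space stays in the parameter
            i += 2
        elif ch == " ":
            if buf:
                params.append("".join(buf))
                buf = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    if buf:
        params.append("".join(buf))
    return tag_char, name.split("."), params

print(split_tag_line(r"#tag-name.subtag parameter1\ with\ spaces parameter2"))
```

A citation macro called through a tag would receive its affixes through this same parameter list, so affix text containing spaces would need the backslash escapes.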

Attributes are also in the spec, and are closest to the idea of a key-value pair. Multiple items are | delimited and then key-value pairs are : delimited. However, the last I heard in the Discord, @vhyrro was thinking of scrapping them in favor of a more general macro syntax.

** Attached Modifier Extensions
   Similarly to {** detached modifier extensions}, attached modifier extensions serve as a way
   to attach metadata to {* attached modifiers}.
   The metadata that you can attach, however, differs from {** detached modifier extensions}, as they
   serve different use cases.

   The content of attached modifier extensions consists of a set of references to many
   {*** attributes}. These attributes are delimited by the {* contextual `|` delimiter}.
   If the attribute is part of a hierarchy (see {*** attributes}), you may use the `:`
   character to link them together. Some inbuilt attributes are the `lang` and `color` hierarchies
   (a comprehensive list can be found in the [semantics document]).

*** Examples
    |example
    `print("This is some python")`(lang:python) <- The lang:python attribute highlights the text as python
    *some green and bold text!*(color:green)    <- some green and bold text

    {* Link location}[this is an important link](important|color:red) <- Highlights the link as big,
                                                                         bold (important) and red.
    |end
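
For comparison with the YAML form, a minimal sketch of reading the extension text between the parentheses: attributes split on `|`, hierarchy levels split on `:`. The function name is hypothetical, and escaping of `|` or `:` is not covered because the quoted spec text does not describe it.

```python
def parse_attributes(text: str) -> list[list[str]]:
    """Split an attached modifier extension into attribute hierarchies."""
    return [
        [level.strip() for level in attr.split(":")]
        for attr in text.split("|")
        if attr.strip()
    ]

print(parse_attributes("important|color:red"))
print(parse_attributes("lang:python"))
```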

Is this what you were asking about?

@Atreyu-94

Hey! Is there any update on this topic? I think this could be helpful. It is a simple-to-use telescope plugin that allows you to render and pick BibTeX citations from .bib files. You can find it here: Telescope-BibTex

BibTeX/BibLaTeX is widely used in science, and I believe it's one of the most commonly used formats.


4 participants