Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(smart-apply): stripping of markdown code blocks #7105

Merged
merged 14 commits into from
Feb 19, 2025

Conversation

ichim-david
Copy link
Collaborator

@ichim-david ichim-david commented Feb 15, 2025

Description

Issue: https://linear.app/sourcegraph/issue/CODY-4428/markdown-code-blocks-render-incorrectly-due-to-unintended-triple

Test plan

Ask Cody to generate a readme that documents a file with code examples.
Notice when applying the changes that the code blocks remain

Add instructions in the chat preamble to close fenced code blocks correctly and introduce a transform that wraps code blocks starting with ```markdown in backticks to preserve their formatting in the UI when it contains other code blocks.
- After removing the filtering having these tests means they fail
… constant

We need to check for PromptString as that is what prompt contains, not just simple strings
`${MARKDOWN_CODE_BLOCK_DELIMITER_START}\\s*([\\s\\S]*?)\\s*${MARKDOWN_CODE_BLOCK_DELIMITER_END}`,
'g'
)
// const MARKDOWN_CODE_BLOCK_DELIMITER_START = '```(?:\\w+)?'
Copy link
Contributor

@umpox umpox Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ichim-david!

Just trying to understand this PR more, do we need to remove these? Or are they just commented out for this branch.

We need these for typical edit commands, as often LLMs will give us code in Markdown blocks and in 99% of cases we do not want to apply those to the document.

Is this essentially:

  1. We get a chat response with some Markdown text, it contains code blocks within it.
  2. We try to apply this Markdown to the document (smart apply), but our edit logic removes these code blocks.

…mprove text stripping logic

- Remove markdown tags from non markdown files
- Avoid transformation until message is no longer in progress to avoid expensive
  mutations
@ichim-david ichim-david marked this pull request as ready for review February 17, 2025 22:01
@ichim-david
Copy link
Collaborator Author

@umpox Tests fail but cleaning logic is now sound and this work can now be checked for any potential fixes or regressions from this work

@ichim-david
Copy link
Collaborator Author

:( tests were green before merge from master to fix conflict and adapt to how others referenced the preamble chats 58e9b46

@julialeex
Copy link
Collaborator

julialeex commented Feb 19, 2025

Hey @ichim-david!
nit: Can you add a more detailed PR description to what you fixed and why you did this? Can use this guide for reference. For example, you can link the linear ticket in the PR title

Comment on lines 46 to 56
let strippedText = text
// Strip specific XML tags referenced in the prompt, e.g. <CODE511>
.replaceAll(PROMPT_TOPIC_REGEX, '')
// Strip Markdown syntax for code blocks, e.g. ```typescript.
.replaceAll(MARKDOWN_CODE_BLOCK_REGEX, block =>

// Strip Markdown syntax for code blocks, e.g. ```typescript, leaving them for markdown files
if (task?.document?.languageId !== 'markdown') {
strippedText = strippedText.replaceAll(MARKDOWN_CODE_BLOCK_REGEX, block =>
block.replace(MARKDOWN_CODE_BLOCK_START, '').replace(MARKDOWN_CODE_BLOCK_END, '')
)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to extract this into a function and add some more docs for future.

Then we can just do an early return if it's a Markdown file which makes this a bit easier to read too!

/**
 * Strips the text of any unnecessary content.
 * This includes:
 * 1. Prompt topics, e.g. <CODE511>. These are used by the LLM to wrap the output code.
 * 2. Markdown code blocks, e.g. ```typescript. Most LLMs are trained to produce Markdown-suitable responses.
 */
function stripText(text: string, task: FixupTask): string {
    const strippedText = text
        // Strip specific XML tags referenced in the prompt, e.g. <CODE511>
        .replaceAll(PROMPT_TOPIC_REGEX, '')

    if (task.document.languageId === 'markdown') {
        // Return this text as is, we do not want to strip Markdown blocks as they may be valuable
        // in Markdown files
        return strippedText
    }

    // Strip Markdown syntax for code blocks, e.g. ```typescript.
    return strippedText.replaceAll(MARKDOWN_CODE_BLOCK_REGEX, block =>
        block.replace(MARKDOWN_CODE_BLOCK_START, '').replace(MARKDOWN_CODE_BLOCK_END, '')
    )
}

@umpox umpox merged commit 74aeb4b into main Feb 19, 2025
21 checks passed
@umpox umpox deleted the ichimdav/markdown_escape branch February 19, 2025 11:13
@sourcegraph-release-bot
Copy link
Collaborator

The backport to M72 failed at https://github.com/sourcegraph/cody/actions/runs/13411266762:

The process '/usr/bin/git' failed with exit code 1

To backport this PR manually, you can either:

Via the sg tool

Use the sg backport command to backport your commit to the release branch.

sg backport -r M72 -p 7105
Via your terminal

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-M72 M72
# Navigate to the new working tree
cd .worktrees/backport-M72
# Create a new branch
git switch --create backport-7105-to-M72
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 74aeb4bf086670f28dca061e43be322b1e8cf850
# Push it to GitHub
git push --set-upstream origin backport-7105-to-M72
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-M72

If you encouter conflict, first resolve the conflict and stage all files, then run the commands below:

git cherry-pick --continue
# Push it to GitHub
git push --set-upstream origin backport-7105-to-M72
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-M72
  • Follow above instructions to backport the commit.
  • Create a pull request where the base branch is M72 and the compare/head branch is backport-7105-to-M72., click here to create the pull request.

Once the pull request has been created, please ensure the following:

  • Make sure to tag @sourcegraph/release in the pull request description.

  • kindly remove the release-blocker from this pull request.

umpox pushed a commit that referenced this pull request Feb 19, 2025
Issue:
https://linear.app/sourcegraph/issue/CODY-4428/markdown-code-blocks-render-incorrectly-due-to-unintended-triple

Ask Cody to generate a readme that documents a file with code examples.
Notice when applying the changes that the code blocks remain

<!-- Required. See
https://docs-legacy.sourcegraph.com/dev/background-information/testing_principles.
-->

---------

Co-authored-by: Tom Ross <[email protected]>
(cherry picked from commit 74aeb4b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants