Github Pages Connector #2282

Open
Weves opened this issue Aug 31, 2024 · 26 comments · May be fixed by #4233
Labels
💎 Bounty maintainer approved Maintainer has approved this feature/fix/request. Contributors feel free to take it up.

Comments

@Weves
Contributor

Weves commented Aug 31, 2024

Indexes all pages that are part of a GitHub Pages-based website. It should use the GitHub APIs directly, since we want to be able to index sites that are behind authentication or internal firewalls. Where possible, we should reuse functionality and common utilities from the existing GitHub and Web connectors.

Check out the connector creation README for more details on the best way to add new connectors: https://github.com/danswer-ai/danswer/blob/main/backend/danswer/connectors/README.md

Weves added the "maintainer approved" label Aug 31, 2024
@Weves
Contributor Author

Weves commented Aug 31, 2024

/bounty 250


algora-pbc bot commented Aug 31, 2024

💎 $250 bounty • Onyx (YC W24)

Steps to solve:

  1. Start working: Comment /attempt #2282 with your implementation plan
  2. Submit work: Create a pull request including /claim #2282 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to onyx-dot-app/onyx!


Attempt | Started (GMT+0) | Solution
🟢 @Myestery | Aug 31, 2024, 12:32:30 AM | WIP
🟢 @PlanetKumbhaj | Aug 31, 2024, 9:11:36 AM | WIP
🟢 @webbdays | Sep 1, 2024, 1:32:37 PM | WIP
🔴 @ashish111333 | Sep 7, 2024, 9:47:18 AM | WIP
🟢 @techsouvik | Sep 10, 2024, 7:18:57 PM | WIP
🔴 @Rutik7066 | Oct 26, 2024, 12:14:23 PM | WIP
🔴 @Haroldg1020 | Dec 13, 2024, 3:57:55 PM | WIP
🟢 @maq796113 | Dec 15, 2024, 9:29:23 AM | WIP
🟢 @akhilender-bongirwar | | #3411

@Weves
Contributor Author

Weves commented Aug 31, 2024

Note on the above: please check the Before opening PR section in the Connector README for additional requirements (testing + style)!

@Myestery

Myestery commented Aug 31, 2024

/attempt #2282


@thekumbhaj

thekumbhaj commented Aug 31, 2024

/attempt #2282

@webbdays

webbdays commented Sep 1, 2024

/attempt #2282


@webbdays

webbdays commented Sep 1, 2024

@Weves After some research, I found this: https://github.com/orgs/community/discussions/59659

@webbdays

webbdays commented Sep 1, 2024

But if the user can provide the GitHub Pages repo, we can get the repo content (i.e., the page contents) via the GitHub API.
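
As a rough illustration of the repo-based idea above, the sketch below lists candidate page files through GitHub's Git Trees endpoint (`GET /repos/{owner}/{repo}/git/trees/{ref}?recursive=1`). The helper names, the extension set, and the `source_dir` parameter are assumptions made for this example and are not taken from any existing connector. Note that a repo processed by a static site generator holds source files rather than the rendered HTML.

```python
import json
import urllib.request
from posixpath import splitext

GITHUB_API = "https://api.github.com"
# Assumption: which file types count as "pages" is a choice for this sketch.
PAGE_EXTENSIONS = {".html", ".htm", ".md"}


def build_tree_request(owner, repo, branch, token=None):
    """Build a request for the recursive Git tree of a branch."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/git/trees/{branch}?recursive=1"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # A token is needed for private repos or internal sites.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def select_page_paths(tree_entries, source_dir=""):
    """Keep file entries under source_dir that look like page sources."""
    prefix = source_dir.strip("/")
    prefix = f"{prefix}/" if prefix else ""
    paths = []
    for entry in tree_entries:
        path = entry.get("path", "")
        if entry.get("type") != "blob" or not path.startswith(prefix):
            continue
        if splitext(path)[1].lower() in PAGE_EXTENSIONS:
            paths.append(path)
    return sorted(paths)


def list_page_paths(owner, repo, branch, token=None, source_dir=""):
    """Fetch the tree over the network and return candidate page paths."""
    request = build_tree_request(owner, repo, branch, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Very large trees come back truncated; a real connector would need to
    # handle that case, which this sketch ignores.
    return select_page_paths(payload.get("tree", []), source_dir)
```

For example, `list_page_paths("some-org", "some-site", "gh-pages", token=...)` (hypothetical names) would return the Markdown and HTML paths on that branch, which could then be fetched one at a time.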

@ashish111333

@Weves OK, how do we start? Do we need to be assigned to work on this, or can we just submit a PR?

@ashish111333

ashish111333 commented Sep 7, 2024

/attempt #2282

@techsouvik

techsouvik commented Sep 10, 2024

/attempt #2282

techsouvik added a commit to techsouvik/danswer that referenced this issue Sep 11, 2024
techsouvik added a commit to techsouvik/danswer that referenced this issue Sep 12, 2024
@aadarsh-nagrath

Status?

@Rutik7066
Contributor

Rutik7066 commented Oct 26, 2024

/attempt #2282


@Haroldg1020

Haroldg1020 commented Dec 13, 2024

/attempt #2282

@maq796113

maq796113 commented Dec 15, 2024

/attempt #2282

@serunkuma

/attempt #2282

@PaulHLiatrio

PR #3411 looks promising. Are there any plans to release this in the near future? My org could really use this.

@akhilender-bongirwar

@Weves, regarding PR #3411, is it still needed? If so, I'll resolve the conflicts; otherwise, I can close it.

@AayushSaini101

AayushSaini101 commented Mar 6, 2025

I want to work on this issue. Can you please assign it to me, @yuhongsun96? Also, this link is broken: https://github.com/onyx-dot-app/onyx/blob/main/backend/danswer/connectors/README.md

@yuhongsun96
Contributor

assigned

@akhilender-bongirwar

> @Weves, regarding PR #3411, is it still needed? If so, I'll resolve the conflicts; otherwise, I can close it.

@yuhongsun96, I had already submitted a PR for this feature back in December (#3411) and followed up last week asking if it was still needed, but didn’t receive a response. Since my PR is already implemented (though it needs conflict resolution), I would appreciate some clarity on whether it is being considered or if I should close it. Thanks!

@yuhongsun96
Contributor

I see, let me take a look tomorrow or over the weekend then. @AayushSaini101 maybe hold off until I can verify

akhilender-bongirwar linked a pull request Mar 7, 2025 that will close this issue
@yuhongsun96
Contributor

I'm going to pass this one to @AayushSaini101. Some feedback on the proposed PR:
GitHub Pages Connector Feedback:

  • The load_from_state method isn’t implemented properly.
  • The poll_source method is missing logic to fetch pages based on start and end times.
  • The yield statement isn’t used correctly.
  • Right now, the site’s HTML is fetched twice: once for the URL and again for the page content. These fetches can be combined.
  • _crawl_github_pages won’t capture all links if it hits the batch size limit, causing some links to be skipped.
  • There’s no check of the updated_at time or date, which means the same documents might get indexed repeatedly.
  • The retry mechanism could be handled better.
  • _get_all_crawled_urls may run infinitely.
  • The GitHub Pages connector could also reuse the existing GitHub credentials.
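
Several of the points above (fetching each page once, guaranteeing the crawl terminates, filtering on update time, and not dropping links at a batch boundary) can be illustrated with a small, self-contained sketch. This is not the PR's code or the project's connector interface: `Page`, `crawl`, `poll_batches`, the `max_pages` cap, and the half-open window `start < updated_at <= end` are all assumptions made for this example.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical result of one fetch: content metadata plus outgoing links."""
    url: str
    updated_at: datetime
    links: tuple[str, ...] = ()


def crawl(
    start_url: str,
    fetch: Callable[[str], Optional[Page]],
    max_pages: int = 1000,
) -> Iterator[Page]:
    """Breadth-first crawl that fetches each URL once and always terminates."""
    seen = {start_url}          # URLs already queued, so cycles cannot loop
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:  # hard cap as a second safety net
        page = fetch(queue.popleft())     # single fetch: links and content together
        fetched += 1
        if page is None:
            continue
        for link in page.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
        yield page


def poll_batches(
    pages: Iterable[Page],
    start: datetime,
    end: datetime,
    batch_size: int,
) -> Iterator[list[Page]]:
    """Yield batches of pages updated inside the (start, end] window."""
    batch: list[Page] = []
    for page in pages:
        if not (start < page.updated_at <= end):
            continue  # unchanged in this window, so it is not re-indexed
        batch.append(page)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```

Because link discovery happens inside `crawl` and batching happens afterwards in `poll_batches`, reaching the batch size only affects how documents are grouped, not which links get followed.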

@akhilender-bongirwar

@yuhongsun96, thank you for reviewing my PR and providing feedback. I understand that the areas you mentioned need improvement, and I'm ready to make those changes. As I've already put in the effort to create a PR, I'd like to ensure I have the opportunity to implement your suggestions and see this through.

@AayushSaini101

> @yuhongsun96, thank you for reviewing my PR and providing feedback. I understand that the areas you mentioned need improvement, and I'm ready to make those changes. As I've already put in the effort to create a PR, I'd like to ensure I have the opportunity to implement your suggestions and see this through.

@akhilender-bongirwar I have already started working on this. Please check the Slack channel for the current status and direction of your PR. Thanks! cc: @yuhongsun96

@akhilender-bongirwar

akhilender-bongirwar commented Mar 26, 2025

> @akhilender-bongirwar I have already started working on this. Please check the Slack channel for the current status and direction of your PR. Thanks! cc: @yuhongsun96

@AayushSaini101, I understand you started working on this yesterday, as @yuhongsun96 suggested. However, I've already invested significant time and effort in PRs #3009, #3411, and #4233, including waiting for approvals and feedback. Just yesterday, I received the first review feedback on #4233, only to see the issue reassigned.

You could use my existing implementation in #4233 as a foundation for your work, building upon it with the suggested modifications. This would acknowledge the work I've already done.

@yuhongsun96, at the very least, I'd appreciate some form of recognition or compensation for the effort I've put into this issue, especially considering the bounty associated with it.
