Tag extraction from packages #115

Open
tchoutri opened this issue May 29, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@tchoutri
Contributor

tchoutri commented May 29, 2022

We can extract tags from packages using RAKE (Rapid Automatic Keyword Extraction).
This will require tuning and filtering, with more of the filtering automated over time. Datalog can be useful here.
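
To make the idea concrete, here is a minimal sketch of RAKE-style scoring in Python. It is illustrative only and not a proposed implementation for this project: the stop-word list is a tiny placeholder, the names `candidate_phrases` and `rake` are made up for the example, and a real run over package descriptions and READMEs would need the tuning and filtering mentioned above.

```python
import re
from collections import defaultdict

# Placeholder stop-word list (assumption); a real run needs a much fuller one.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "for", "in", "is",
    "of", "on", "or", "the", "to", "with",
})


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    # Punctuation (anything that is not a word character, space or hyphen)
    # always ends a phrase.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    # A phrase scores the sum of its word scores.
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

For instance, `rake("Fast JSON parser. A JSON parser combinator library for streaming.")` ranks "json parser combinator library" first, because words that appear in longer phrases get a higher degree-to-frequency ratio. Long phrases like that one are exactly where the filtering step would have to cut things down to usable tags.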

This requires the following:

  • Pre-compute some tags from categories, such as the "phantom types" category and the "yesod" category.
  • Store more information about the source repo (service, url, description, topics)
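
A rough sketch of how the two points above could fit together, in Python for brevity. Everything here is hypothetical: the `SourceRepo` record only mirrors the four fields listed above, the `CATEGORY_TAGS` mapping is invented sample data, and treating repository topics as extra seed tags is an assumption rather than something decided in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRepo:
    """Source repository metadata (hypothetical shape, fields as listed above)."""
    service: str
    url: str
    description: str
    topics: tuple = ()


# Hypothetical mapping from category name to pre-computed tags.
CATEGORY_TAGS = {
    "phantom types": ("phantom-types",),
    "yesod": ("yesod",),
}


def seed_tags(categories, repo=None):
    """Collect pre-computed tags from categories, then from repo topics."""
    tags = []
    for category in categories:
        tags.extend(CATEGORY_TAGS.get(category.strip().lower(), ()))
    if repo is not None:
        # Assumption: repository topics are usable as tags once lower-cased.
        tags.extend(topic.strip().lower() for topic in repo.topics)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(tags))
```

With a repo whose topics are `("yesod", "web")`, calling `seed_tags(["Yesod", "Phantom Types"], repo)` gives `["yesod", "phantom-types", "web"]`. These seed tags could then be merged with whatever the keyword extraction produces.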

In terms of normalisation, we can learn a great deal from lib.rs:

I normalize keywords to kebab-case, except CJK and a few exceptions like "iOS" which looks silly.
I had to manage synonyms mostly manually: https://gitlab.com/crates.rs/crates.rs/-/blob/main/data/tag-synonyms.csv
Joining adjacent keywords into pairs helps ["data", "structures"] => ["data-structures"].
Each keyword has a weight, and for similarity search I add hidden keywords: https://gitlab.com/crates.rs/crates.rs/-/blob/main/crate_db/src/lib_crate_db.rs#L306
For keyword extraction I take markdown sections into account: https://gitlab.com/crates.rs/crates.rs/-/blob/main/feat_extractor/src/lib.rs#L44
and use only never-seen-before sentences.

https://mastodon.social/@kornel/109508654611639728
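
The normalisation steps described in that quote could look roughly like the Python sketch below. This is not lib.rs's actual code: the exception list, the synonym table and the set of joinable pairs are invented sample data, the CJK check only covers a few Unicode ranges, and how lib.rs decides which adjacent keywords to join is not stated, so a fixed set of known pairs is assumed here.

```python
import re

# Hypothetical sample data; lib.rs keeps its real synonyms in a CSV file.
EXCEPTIONS = {"ios": "iOS"}
SYNONYMS = {"ds": "data-structures", "datastructure": "data-structures"}
KNOWN_PAIRS = {("data", "structures")}

# Partial CJK detection (assumption): kana, common ideographs, Hangul syllables.
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def kebab(keyword):
    """Normalise a keyword to kebab-case, skipping CJK and listed exceptions."""
    keyword = keyword.strip()
    if CJK.search(keyword):
        return keyword
    if keyword.lower() in EXCEPTIONS:
        return EXCEPTIONS[keyword.lower()]
    # Split camelCase boundaries, then turn separator runs into single hyphens.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", keyword)
    return re.sub(r"[\W_]+", "-", spaced.lower()).strip("-")


def join_pairs(keywords, known_pairs):
    """Join adjacent keywords that form a known pair into one keyword."""
    out, i = [], 0
    while i < len(keywords):
        if i + 1 < len(keywords) and (keywords[i], keywords[i + 1]) in known_pairs:
            out.append(f"{keywords[i]}-{keywords[i + 1]}")
            i += 2
        else:
            out.append(keywords[i])
            i += 1
    return out


def normalise(keywords):
    """Kebab-case, join known pairs, map synonyms, and drop duplicates."""
    cleaned = [k for k in (kebab(k) for k in keywords) if k]
    joined = join_pairs(cleaned, KNOWN_PAIRS)
    canonical = [SYNONYMS.get(k, k) for k in joined]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(canonical))
```

Under these assumptions, `normalise(["Data", "structures", "ds", "iOS", "日本語"])` yields `["data-structures", "iOS", "日本語"]`. Keyword weights and the "never-seen-before sentences" filter from the quote are left out of this sketch.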

@tchoutri tchoutri pinned this issue Sep 11, 2022
@tchoutri tchoutri mentioned this issue Nov 4, 2022
3 tasks
@tchoutri tchoutri added enhancement New feature or request help wanted Help is needed for this ticket labels Feb 16, 2023
@tchoutri tchoutri added Warwick 2023 and removed help wanted Help is needed for this ticket labels Apr 12, 2023
@tchoutri
Contributor Author

@qw04 will take this
