Clustering Data With Embeddings #136

Open
1 of 11 tasks
NadimKawwa opened this issue Apr 13, 2024 · 0 comments


Information

The question or comment is about chapter:

  • Introduction
  • Text Classification
  • Transformer Anatomy
  • Multilingual Named Entity Recognition
  • Text Generation
  • Summarization
  • Question Answering
  • Making Transformers Efficient in Production
  • Dealing with Few to No Labels
  • Training Transformers from Scratch
  • Future Directions

Question or comment

Hello, I noticed that the book doesn't have much information about clustering unlabeled data. I'm aware that some resources out there address this question. However, it would be nice to know which techniques work best for clustering text, especially ones that don't rely on API calls that might be rate limited.
I have been thinking about this lately, and the best-performing method so far is:

  1. Generate embeddings.
  2. Apply a MinMax scaler to the features.
  3. Use an algorithm like K-means and plot the number of clusters versus the silhouette score.
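
For concreteness, the steps above could be sketched roughly as follows. This is a minimal illustration, not a tuned pipeline: it assumes scikit-learn is the library in use, the helper name `silhouette_by_k` is made up for this example, and the synthetic `make_blobs` data only stands in for real embeddings from step 1 (in practice these would be vectors from whatever locally run model is available).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def silhouette_by_k(embeddings, k_values, seed=0):
    """Hypothetical helper: scale features to [0, 1], then score K-means for each k."""
    # Step 2: MinMax scaling of every embedding dimension.
    scaled = MinMaxScaler().fit_transform(embeddings)
    scores = {}
    for k in k_values:
        # Step 3: fit K-means and record the mean silhouette score for this k.
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
        scores[k] = float(silhouette_score(scaled, labels))
    return scores


# Stand-in for step 1 (assumption): synthetic 32-dimensional vectors in 4 groups.
# Replace this with an (n_samples, n_features) array of real text embeddings.
embeddings, _ = make_blobs(
    n_samples=400, n_features=32, centers=4, cluster_std=1.0, random_state=0
)

scores = silhouette_by_k(embeddings, k_values=range(2, 9))
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

The `scores` dictionary can be passed directly to a plotting library to draw the number of clusters against the silhouette score; the peak of that curve is the candidate cluster count.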

I would appreciate your thoughts on this.

Best,
Nadim
