Add drop_duplicates #4
@jbesomi Should it check line by line and remove a line if it is a duplicate?
The idea here is to compare the long texts of documents and try to find those that are too similar; in this case, it might mean that the documents are indeed duplicates. There are many applications for this, for instance detecting plagiarism in papers. A naive approach is to apply TF-IDF and look at the distance between vectors.
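The naive TF-IDF approach mentioned above might be sketched as follows. This is a hypothetical example, not texthero code; the documents are illustrative.

```python
# Sketch of the naive approach: vectorize documents with TF-IDF and
# compare pairwise cosine distances. Near-duplicate documents end up
# with a much smaller distance than unrelated ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumped over the lazy dog",
    "An entirely different sentence about pandas",
]

vectors = TfidfVectorizer().fit_transform(docs)
dist = cosine_distances(vectors)  # 3x3 matrix of pairwise distances

# Documents 0 and 1 are near-duplicates, so their distance is small
# compared to the distance to the unrelated document 2.
print(dist[0, 1] < dist[0, 2])  # True
```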
I suggest having several methods for handling duplicated content. In the very simplest form, you might just need to check against a hash (sha1, for instance) to be sure you don't have exact duplicates (ok, this might be a preprocessing job). The interface might look like `Pandas.Series.unique()` but specifying a method / way to do the deduplication: `unique(method='hash | jaccard | etc.', threshold=xx)`.
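The hash-based check for exact duplicates suggested above might look like this minimal sketch (illustrative only; it only catches byte-identical texts, not near-duplicates):

```python
# Hash each document (sha1) and keep only the first occurrence of
# each hash value. Identical texts map to identical digests.
import hashlib
import pandas as pd

s = pd.Series(["foo bar", "baz", "foo bar"])

hashes = s.map(lambda text: hashlib.sha1(text.encode("utf-8")).hexdigest())
deduped = s[~hashes.duplicated()]

print(list(deduped))  # ['foo bar', 'baz']
```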
Hey @igponce, exactly, the interface would look like the one sketched below. A simple yet powerful solution is to compute a good representation of each text and remove documents that have very similar vectors. Right, as you point out, the function will take the deduplication method as an argument. Would you be interested in implementing this solution? Jaccard might work as well, but it's easy to do better and use word vectors instead of just counting. Food for thought: what if the input must already be a representation? This would be an even better solution. In this case, the argument might be the representation itself rather than the raw text.
Add

`hero.drop_duplicates(s, representation, distance_algorithm, threshold)`

Where:
- `s` is a Pandas Series.
- `representation` is either a Flair embedding or a hero representation function. Need to define a default value.
- `distance_algorithm` is either a string or a function that takes as input two vectors and computes their distance. An example of such a function is `sklearn.metrics.pairwise.euclidean_distances` (see the scikit-learn repository).
- `threshold` is a float. All vectors that share a distance less than this value will be considered a single document; the first in order of appearance in the Pandas Series will be kept.

Task:
Drop all duplicates from the given Pandas Series and return a cleaned version of it.
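A hypothetical implementation of the proposed signature might look like the sketch below. The TF-IDF fallback for `representation`, the default `threshold`, and the greedy first-occurrence loop are assumptions for illustration, not settled design:

```python
# Sketch of the proposed hero.drop_duplicates: compute a vector
# representation, measure pairwise distances, and keep only the first
# occurrence of each group of near-duplicate documents.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

def drop_duplicates(s, representation=None,
                    distance_algorithm=euclidean_distances,
                    threshold=0.5):
    """Drop near-duplicate documents from a Pandas Series.

    Keeps the first occurrence of each group of documents whose
    pairwise distance falls below `threshold`.
    """
    if representation is None:
        # Hypothetical default: a plain TF-IDF representation.
        vectors = TfidfVectorizer().fit_transform(s)
    else:
        vectors = representation(s)
    dist = distance_algorithm(vectors)
    keep = []
    for i in range(len(s)):
        # Keep row i only if it is not too close to an earlier kept row.
        if all(dist[i, j] >= threshold for j in keep):
            keep.append(i)
    return s.iloc[keep]

s = pd.Series([
    "the cat sat on the mat",
    "the cat sat on the mat!",
    "a completely unrelated document",
])
cleaned = drop_duplicates(s, threshold=0.5)
print(list(cleaned))
```

The second document differs only by punctuation, which the default TF-IDF tokenizer strips, so it is dropped as a duplicate of the first.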
TODO:
It would be interesting to support drop_duplicates on a DataFrame, specifying which column to deduplicate on (as done in Pandas).