
Add infer_lang function (Issue number #3) #79

Draft · wants to merge 6 commits into base: master

Conversation

@tmankita
No description provided.

@tmankita changed the title from "Add infer_lang function" to "Add infer_lang function #3" on Jul 14, 2020
@tmankita changed the title from "Add infer_lang function #3" to "Add infer_lang function (Issue number #3)" on Jul 14, 2020
Update detect_language function
Add padding_tuple function
@tmankita (Author)
@jbesomi, why did Travis CI fail? What did I do wrong? On my local machine, all the tests succeed.

@jbesomi (Owner) commented Jul 14, 2020

Thank you, this is a good start! 👍

If you click "Details" and then select a job, you will see the log: example

The problem is that you use some dependencies that are not installed. You will need to update setup.cfg.

Review:

  1. If you use "private" functions, you should prefix them with an underscore: _example()
  2. By default, we probably want to return just a single language for each cell. Then we might have an argument that, when set to True, makes the function return a list of tuples (this is the improvement)
  3. What does this code do?
l = list(doc._.language["result"].items())
curr_size = len(l)
t = foldl(operator.add, (), l)
if max_list_size < curr_size:
   padding_list(infer_languages, curr_size)
   max_list_size = curr_size
elif curr_size < max_list_size:
   padding_tuple(t, max_list_size)

Why can't we simply do infer_languages.append(l)? Also, if possible, we should avoid for loops; as in the rest of the code (with the exception of the for pipe loop, which is multi-threaded and fast), we almost never use for/while loops as they are generally slow. (See the sketch at the end of this review.)

  4. We should test the function more extensively, for instance by passing a Pandas Series composed of X different languages
  5. The docstring should be more precise, stating exactly which languages can be detected
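
A hypothetical sketch of what "simply appending" could look like; for brevity it calls langdetect directly rather than going through the PR's spaCy pipeline, and `texts` is an illustrative placeholder, not code from this PR:

```python
from langdetect import detect_langs

texts = ["This is an English sentence.", "Ceci est une phrase en français."]

infer_languages = []
for text in texts:
    # detect_langs returns objects with .lang and .prob attributes; keeping a
    # list of (lang, prob) tuples per cell avoids the padding bookkeeping.
    infer_languages.append([(lang.lang, lang.prob) for lang in detect_langs(text)])
```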

…ssary code -Update infer_lang documentation
@tmankita (Author)
@jbesomi Thank you! Very helpful comments and review. I tried as much as I could to stick to your instructions in the review.
If there is anything else I need to adjust, I will be glad to know about it.

@jbesomi (Owner) commented Jul 14, 2020

Hi Tomer,

thank you for your improvements! You are clearly going in the right direction; we still need to solve some issues, but we will do it together 👍

Review:

  1. Great that you mention the ISO code!
  2. Testing: we will need a test for each single ISO code (i.e. each language)...
  3. If you look carefully at spacy-langdetect, you will discover that it just calls langdetect ... so we can clearly avoid spacy-langdetect and use langdetect directly. Have a look at the LanguageDetector code; it's quite simple.
  4. There are two ways we can write infer_lang. We will need to test which one is faster on a large dataset (see the sketch after this list)
    • Keep using spaCy (by adding a custom element to the pipeline). I'm not sure this makes much sense as we are not really making use of spaCy ... the only advantage I see is that the pipe is multi-threaded. But we might just want to write our own multi-threading code that computes the language for each cell of the Pandas Series.
    • Use Pandas apply. This might be slower than a multi-threading approach, but I'm not sure about that.
  5. _infer_lang_ret_list / _infer_lang: these two are about the same code. What about keeping the for loop in the main function and having an if/else condition when setting up the pipe? (only relevant if you keep spaCy, which is probably unnecessary)
  6. The default spaCy pipe might have other components we would like to disable, doesn't it? (again, only relevant if you keep spaCy)
  7. ret_list: what about calling it probability and, when set to True, returning a list of tuples as you did? When set to False, returning the ISO code should be enough (no need to return a tuple)
  8. _Language_to_dict: return a tuple, not a dict
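
A hedged sketch of the Pandas apply variant from point 4, calling langdetect directly as suggested in point 3; the name and the probability flag follow the naming discussed above, but the body is only illustrative, not the PR's implementation:

```python
import pandas as pd
from langdetect import detect, detect_langs


def infer_lang(s: pd.Series, probability: bool = False) -> pd.Series:
    if probability:
        # One list of (language, probability) tuples per cell.
        return s.apply(lambda text: [(l.lang, l.prob) for l in detect_langs(text)])
    # A single language code per cell, e.g. "en".
    return s.apply(detect)


# Example usage on a small Series:
s = pd.Series(["This is English.", "Questo è italiano."])
print(infer_lang(s))
print(infer_lang(s, probability=True))
```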

@jbesomi marked this pull request as draft on July 15, 2020, 08:23
… ISO code -Change name ret_list to probability - Change name _Language_to_dict to _Language_to_tuple
@tmankita (Author) commented Jul 17, 2020

Hi @jbesomi,
I ran the test on 21,000 sentences.
The results are:

Using Pandas apply infer_lang no probability: 1.0 min 58.464749813079834 sec
Using spacy infer_lang no probability: 3.0 min 6.3345091342926025 sec
Using Pandas apply infer_lang with probability: 2.0 min 0.9787819385528564 sec
Using spacy infer_lang with probability: 3.0 min 7.99077296257019 sec

You are right, using pandas apply definitely saves time!

Thanks for the detailed comments; I read them very carefully and tried to apply everything you wrote.

@jbesomi (Owner) commented Jul 17, 2020

Hi Tomer,

wow, congrats; you did such a great job! that's super cool, I'm impressed 🎉 👍

For the stats, which dataset did you use? I'm impressed by how long it takes ...
The reason it takes that long is probably that langdetect is non-deterministic (it uses a model to find the language), as you said.
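
For reproducible timings, langdetect's randomness can be pinned with a fixed seed (standard langdetect usage, not something from this thread):

```python
# Make langdetect deterministic across runs, which helps when comparing timings.
from langdetect import DetectorFactory

DetectorFactory.seed = 0
```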

Review:

  1. Everything looks super!
  2. The only problem is that probably langdetect is too slow; 21k rows is quite a small dataset ... (but maybe the documents were super long?)
  3. One way to speed up the process is to cut the length of each cell; we might want to test this (I can do it) and see if it speeds up the procedure (see the sketch after this list). Can you please attach to this issue the notebook you used to do the comparison? Another approach would be to use a simpler, deterministic algorithm (at the cost of a higher error rate)
  4. Some of your functions still use the Google docstring style (:param) instead of the NumPy docstring style. Also, why do some section titles start with "gured out"?
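
A rough sketch of the cell-truncation idea from point 3, assuming langdetect via Pandas apply; the 500-character cutoff is an arbitrary illustrative value, not a number from this PR:

```python
import pandas as pd
from langdetect import detect


def infer_lang_truncated(s: pd.Series, max_chars: int = 500) -> pd.Series:
    # Detecting on a prefix of each cell trades a little accuracy for speed
    # when documents are long; short cells pass through unchanged.
    return s.apply(lambda text: detect(text[:max_chars]))
```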

Regards,

@tmankita (Author) commented Jul 17, 2020

@jbesomi
I used the same dataset I presented in the infer_lang issue
(link to the relevant comment: #3 (comment);
at the end of that comment there is a link to download the dataset).

Regarding points 1 + 2 + 3 from the infer_lang issue:

+----------------+----------------+------------------+
| library        | average time   | accuracy score   |
+================+================+==================+
| langdetect     | 0.00606 sec    | 86.308%          |
+----------------+----------------+------------------+

The average time refers to processing one sentence (without considering the sentence length).
So in the average case we end up with (0.00606 * 21000) / 60 = 2.121 minutes. Therefore, the time it took to process all 21k sentences is pretty good in my opinion. Maybe by cutting the sentences we can improve it; take a look at the dataset and see whether doing so is relevant (maybe the sentences are too short). If that suggestion fails, I would recommend using the Cld3 library, with an average time of 0.00026 sec and an accuracy score of 84.973%.
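
For reference, a minimal sketch of what the Cld3 fallback could look like, assuming the pycld3 package (not code from this PR):

```python
import cld3  # provided by the pycld3 package
import pandas as pd


def infer_lang_cld3(s: pd.Series) -> pd.Series:
    # cld3.get_language returns a prediction with .language and .probability;
    # it is deterministic and much faster than langdetect, at a small accuracy cost.
    return s.apply(lambda text: cld3.get_language(text).language)
```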

Regarding point 4, I will remove the docstrings from the helper functions; the "guard out" note refers to the try/except expression in the function.

I am waiting for you to examine your suggestion. Once you have some conclusions, share them with me and I will start working on the solution.

Thanks for your comments, I learned a lot from them.

@jbesomi (Owner) commented Jul 17, 2020

🎉

Can you please attach here the Jupyter notebook you used to test the apply vs. spaCy implementations? I will start from there and test different approaches, to see if we can reduce the execution time a bit.

@tmankita (Author)
Sure!
The link:

ApplyVsPipe.ipynb.zip

@jbesomi (Owner) commented Jul 18, 2020

Thanks, that's cool! :)

@tmankita (Author)
Hi @jbesomi ,
Long time no hear; just wondering if I can help.

Thanks

@jbesomi (Owner) commented Jul 27, 2020

Hey @tmankita, I'm very sorry, I haven't yet compared the different versions. I will get back to you as soon as possible. In the meantime, you can have a look at the other open issues 👍

@henrifroese added the "enhancement" (New feature or request) label on Aug 3, 2020
Labels
enhancement New feature or request
3 participants