-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor edits to instructions in readme, requirements, fuzzy and semantic dedupe flags #548
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Rucha Apte <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>
pre-commit.ci autofix |
sim_metric: "cosine" | ||
which_to_keep: "hard" | ||
batched_cosine_similarity: 1024 | ||
sort_clusters: true | ||
kmeans_with_cos_dist: false | ||
clustering_input_partition_size: "2gb" | ||
partition_size: "2gb" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has this key changed? I believe this key is still called clustering_input_partition_size
NeMo-Curator/nemo_curator/modules/config.py
Line 222 in 85d9589
clustering_input_partition_size: str = "2gb" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted the change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VibhuJawa / @praateekmahajan - I tested with this change in the 25.02 image, however clustering_input_partition_size: "2gb"
it gave me a key error.
When I reverted it to partition_size: "2gb"
I was able to run the code to completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the error that pops up upon changing the line to clustering_input_partition_size: "2gb"
Traceback (most recent call last): File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 301, in <module> main() File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 282, in main run_curation_pipeline(args, text_files, code_files) File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 189, in run_curation_pipeline duplicates = semantic_dedupe( ^^^^^^^^^^^^^^^^ File "/opt/NeMo-Curator/tutorials/dapt-curation/code/utils.py", line 343, in semantic_dedupe semdedup_config = SemDedupConfig.from_yaml(sem_dedupe_config_yaml_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/NeMo-Curator/nemo_curator/modules/config.py", line 27, in from_yaml yaml_dict = yaml.safe_load(file) ^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/__init__.py", line 125, in safe_load return load(stream, SafeLoader) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/__init__.py", line 81, in load return loader.get_single_data() ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/constructor.py", line 49, in get_single_data node = self.get_single_node() ^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 36, in get_single_node document = self.compose_document() ^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 55, in compose_document node = self.compose_node(None, None) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 84, in compose_node node = self.compose_mapping_node(anchor) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 127, in compose_mapping_node while not self.check_event(MappingEndEvent): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/parser.py", line 98, in check_event self.current_event = self.state() ^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/parser.py", line 428, in parse_block_mapping_key if self.check_token(KeyToken): ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 116, in check_token self.fetch_more_tokens() File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 223, in fetch_more_tokens return self.fetch_value() ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 577, in fetch_value raise ScannerError(None, None, yaml.scanner.ScannerError: mapping values are not allowed here in "/opt/NeMo-Curator/tutorials/dapt-curation/code/configs/text_semantic_dedupe_config.yaml", line 46, column 28
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. So in we added clustering_input_partition_size
in : https://github.com/NVIDIA/NeMo-Curator/pull/564/files
I wonder if you just remove this line/config because for both versions it should hit the default (which you anyways are using).
qgrid | ||
tesseract-ocr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We recently discussed this, let's pin version here so that we don't see breakage with subsequent releases cc @ayushdg @ryantwolf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look okay to me. Have general follow up questions
tutorials/dapt-curation/code/configs/text_semantic_dedupe_config.yaml
Outdated
Show resolved
Hide resolved
…ig.yaml Co-authored-by: Vibhu Jawa <[email protected]> Signed-off-by: Rucha Apte <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>
Description
The PR inscludes:
Usage
Checklist