Minor edits to instructions in readme, requirements, fuzzy and semantic dedupe flags #548
base: main
Changes from all commits
e661439
a8d2513
dab3789
8e86a1a
517a219
4e3a848
d3b3144
bec0bfe
6b14249
0c5868f
31a62da
6cb179f
8d09ae3
8026d7d
063f5ed
82682a7
Why has this key changed? I believe this key is still called
clustering_input_partition_size
NeMo-Curator/nemo_curator/modules/config.py
Line 222 in 85d9589
Reverted the change
@VibhuJawa / @praateekmahajan - I tested this change in the 25.02 image. With
clustering_input_partition_size: "2gb"
it gave me a key error. When I reverted it to
partition_size: "2gb"
I was able to run the code to completion.
Here is the error that pops up upon changing the line to
clustering_input_partition_size: "2gb"
Traceback (most recent call last):
  File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 301, in <module>
    main()
  File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 282, in main
    run_curation_pipeline(args, text_files, code_files)
  File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 189, in run_curation_pipeline
    duplicates = semantic_dedupe(
                 ^^^^^^^^^^^^^^^^
  File "/opt/NeMo-Curator/tutorials/dapt-curation/code/utils.py", line 343, in semantic_dedupe
    semdedup_config = SemDedupConfig.from_yaml(sem_dedupe_config_yaml_path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/NeMo-Curator/nemo_curator/modules/config.py", line 27, in from_yaml
    yaml_dict = yaml.safe_load(file)
                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 127, in compose_mapping_node
    while not self.check_event(MappingEndEvent):
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
                         ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/parser.py", line 428, in parse_block_mapping_key
    if self.check_token(KeyToken):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 223, in fetch_more_tokens
    return self.fetch_value()
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 577, in fetch_value
    raise ScannerError(None, None,
yaml.scanner.ScannerError: mapping values are not allowed here
  in "/opt/NeMo-Curator/tutorials/dapt-curation/code/configs/text_semantic_dedupe_config.yaml", line 46, column 28
Interesting. So we added clustering_input_partition_size in https://github.com/NVIDIA/NeMo-Curator/pull/564/files. I wonder if you could just remove this line from the config, because in both versions it should hit the default (which you are using anyway).
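The idea of dropping the key so that the default applies can be sketched as follows. This is a toy model under stated assumptions, not the actual SemDedupConfig: ClusteringConfig and config_from_dict are hypothetical names, and the "2gb" default is assumed from the value used in this thread.

```python
from dataclasses import dataclass, fields


@dataclass
class ClusteringConfig:
    # Hypothetical stand-in for a versioned config class.
    # The "2gb" default is an assumption taken from the thread, not from the library.
    clustering_input_partition_size: str = "2gb"


def config_from_dict(raw: dict) -> ClusteringConfig:
    """Build the config, rejecting keys this (hypothetical) version does not know."""
    known = {f.name for f in fields(ClusteringConfig)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise KeyError(f"unknown config keys: {unknown}")
    return ClusteringConfig(**raw)


if __name__ == "__main__":
    # Omitting the key works regardless of what the key is called in this version.
    print(config_from_dict({}))
    # A key name from another version is rejected.
    try:
        config_from_dict({"partition_size": "2gb"})
    except KeyError as err:
        print("rejected:", err)
```

In this sketch, a config that leaves the key out loads under either naming, which is why removing the line sidesteps the rename as long as the default is acceptable.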
We recently discussed this. Let's pin the version here so that we don't see breakage with subsequent releases. cc @ayushdg @ryantwolf