Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor edits to instructions in readme, requirements, fuzzy and semantic dedupe flags #548

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

ruchaa-apte
Copy link
Contributor

@ruchaa-apte ruchaa-apte commented Feb 14, 2025

Description

The PR inscludes:

  1. Changes in README to install packages
  2. Changes in requirements.txt for updates to the new packages
  3. Changes in flags for fuzzy and semantic dedupe

Usage

python main.py --device "gpu"

Checklist

  • [Y] I am familiar with the Contributing Guide.
  • [Y] New or Existing tests cover these changes.
  • [Y]The documentation is up to date with these changes.

@ruchaa-apte
Copy link
Contributor Author

pre-commit.ci autofix

ruchaa-apte and others added 2 commits February 26, 2025 14:55
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>
@ruchaa-apte
Copy link
Contributor Author

pre-commit.ci autofix

@ayushdg ayushdg requested a review from sarahyurick March 12, 2025 17:54
sim_metric: "cosine"
which_to_keep: "hard"
batched_cosine_similarity: 1024
sort_clusters: true
kmeans_with_cos_dist: false
clustering_input_partition_size: "2gb"
partition_size: "2gb"
Copy link
Collaborator

@praateekmahajan praateekmahajan Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this key changed? I believe this key is still called clustering_input_partition_size

clustering_input_partition_size: str = "2gb"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the change

Copy link
Contributor Author

@ruchaa-apte ruchaa-apte Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VibhuJawa / @praateekmahajan - I tested with this change in the 25.02 image, however clustering_input_partition_size: "2gb" it gave me a key error.
When I reverted it to partition_size: "2gb" I was able to run the code to completion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the error that pops up upon changing the line to clustering_input_partition_size: "2gb"

Traceback (most recent call last): File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 301, in <module> main() File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 282, in main run_curation_pipeline(args, text_files, code_files) File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 189, in run_curation_pipeline duplicates = semantic_dedupe( ^^^^^^^^^^^^^^^^ File "/opt/NeMo-Curator/tutorials/dapt-curation/code/utils.py", line 343, in semantic_dedupe semdedup_config = SemDedupConfig.from_yaml(sem_dedupe_config_yaml_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/NeMo-Curator/nemo_curator/modules/config.py", line 27, in from_yaml yaml_dict = yaml.safe_load(file) ^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/__init__.py", line 125, in safe_load return load(stream, SafeLoader) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/__init__.py", line 81, in load return loader.get_single_data() ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/constructor.py", line 49, in get_single_data node = self.get_single_node() ^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 36, in get_single_node document = self.compose_document() ^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 55, in compose_document node = self.compose_node(None, None) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 84, in compose_node node = self.compose_mapping_node(anchor) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 127, in compose_mapping_node while not self.check_event(MappingEndEvent): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/parser.py", line 98, in check_event self.current_event = self.state() ^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/parser.py", line 428, in parse_block_mapping_key if self.check_token(KeyToken): ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 116, in check_token self.fetch_more_tokens() File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 223, in fetch_more_tokens return self.fetch_value() ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 577, in fetch_value raise ScannerError(None, None, yaml.scanner.ScannerError: mapping values are not allowed here in "/opt/NeMo-Curator/tutorials/dapt-curation/code/configs/text_semantic_dedupe_config.yaml", line 46, column 28

Copy link
Collaborator

@VibhuJawa VibhuJawa Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. So in we added clustering_input_partition_size in : https://github.com/NVIDIA/NeMo-Curator/pull/564/files

I wonder if you just remove this line/config because for both versions it should hit the default (which you anyways are using).

Comment on lines +6 to +7
qgrid
tesseract-ocr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently discussed this, let's pin version here so that we don't see breakage with subsequent releases cc @ayushdg @ryantwolf

Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look okay to me. Have general follow up questions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants