-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kdbai v1.4 #112
base: main
Are you sure you want to change the base?
Kdbai v1.4 #112
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider implementing the following changes to improve the code.
pbar = tqdm(total=df.shape[0], desc="Inserting data") | ||
while i < df.shape[0]: | ||
chunk = df[i : i + batch_size].reset_index(drop=True) | ||
# Assuming 'table' has an 'insert' method | ||
try: | ||
table.insert(chunk) | ||
pbar.update(chunk.shape[0]) | ||
i += batch_size | ||
except kdbai.KDBAIException as e: | ||
if "smaller batches" in str(e): | ||
tqdm.write( | ||
f"Reducing batch size to {batch_size * 2 // 3}" | ||
) | ||
batch_size = batch_size * 2 // 3 | ||
else: | ||
|
||
i = 0 | ||
try: | ||
while i < df.shape[0]: | ||
chunk = df.iloc[ | ||
i : min(i + batch_size, df.shape[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: Potential performance issue with large data inserts.
Solution: Consider optimizing the insertion logic by using bulk inserts or asynchronous processing if supported by the underlying database.
!! Make sure the following suggestion is correct before committing it !!
pbar = tqdm(total=df.shape[0], desc="Inserting data") | |
while i < df.shape[0]: | |
chunk = df[i : i + batch_size].reset_index(drop=True) | |
# Assuming 'table' has an 'insert' method | |
try: | |
table.insert(chunk) | |
pbar.update(chunk.shape[0]) | |
i += batch_size | |
except kdbai.KDBAIException as e: | |
if "smaller batches" in str(e): | |
tqdm.write( | |
f"Reducing batch size to {batch_size * 2 // 3}" | |
) | |
batch_size = batch_size * 2 // 3 | |
else: | |
i = 0 | |
try: | |
while i < df.shape[0]: | |
chunk = df.iloc[ | |
i : min(i + batch_size, df.shape[0]) | |
for start in range(0, df.shape[0], batch_size): | |
chunk = df.iloc[start:start + batch_size].reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Reviewed everything up to fd5b324 in 33 seconds
More details
- Looked at
478
lines of code in3
files - Skipped
1
files when reviewing. - Skipped posting
5
drafted comments based on config settings.
1. src/vdf_io/export_vdf/kdbai_export.py:57
- Draft comment:
Consider creating a helper function to handle the repeated pattern of checking if a key exists inargs
and setting it using a function if it doesn't. This will reduce redundancy and improve readability. - Reason this comment was not posted:
Confidence changes required:50%
The code inkdbai_export.py
andkdbai_import.py
has a repeated pattern where it checks if a key exists in theargs
dictionary and then sets it using a function if it doesn't. This pattern can be simplified using a helper function to reduce redundancy and improve readability.
2. src/vdf_io/import_vdf/kdbai_import.py:56
- Draft comment:
Consider creating a helper function to handle the repeated pattern of checking if a key exists inargs
and setting it using a function if it doesn't. This will reduce redundancy and improve readability. - Reason this comment was not posted:
Confidence changes required:50%
The code inkdbai_import.py
has a repeated pattern where it checks if a key exists in theargs
dictionary and then sets it using a function if it doesn't. This pattern can be simplified using a helper function to reduce redundancy and improve readability.
3. src/vdf_io/export_vdf/kdbai_export.py:141
- Draft comment:
Ensuretable.indexes
is not empty before accessing its first element to avoid potentialIndexError
. - Reason this comment was not posted:
Comment did not seem useful.
4. src/vdf_io/import_vdf/kdbai_import.py:119
- Draft comment:
Ensureindexes_content
is not empty before accessing its first element to avoid potential errors. - Reason this comment was not posted:
Marked as duplicate.
5. src/vdf_io/import_vdf/kdbai_import.py:194
- Draft comment:
Replace the hardcoded index name 'flat' with the actual index name from the arguments or configuration for flexibility and correctness. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment points out a potential issue with hardcoding the index name 'flat'. If the index name should be dynamic, this is a valid concern. The code does not show any logic for dynamically setting the index name, which suggests the comment is correct.
The comment assumes that the index name should be dynamic without evidence from the code. It's possible that 'flat' is intentionally hardcoded for a specific reason.
The lack of any logic to dynamically set the index name suggests that the comment is valid. If 'flat' is meant to be dynamic, the code should reflect that.
The comment is valid as it points out a potential issue with hardcoding the index name 'flat'. It should be kept for further review by the author.
Workflow ID: wflow_UynEPN91xaP1IA8b
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider implementing the following changes to improve the code.
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"execution_count": null, | ||
"id": "5f9392fc-87fd-42ae-bb5c-1518c2022028", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"#Set up KDB.AI endpoint and API key\n", | ||
"KDBAI_ENDPOINT = (\n", | ||
" os.environ[\"KDBAI_ENDPOINT\"]\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: Use environment variables for sensitive information
Solution: Use environment variables for the KDB.AI endpoint and API key, and fall back to user input if the environment variables are not set.
!! Make sure the following suggestion is correct before committing it !!
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"execution_count": null, | |
"id": "5f9392fc-87fd-42ae-bb5c-1518c2022028", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"#Set up KDB.AI endpoint and API key\n", | |
"KDBAI_ENDPOINT = (\n", | |
" os.environ[\"KDBAI_ENDPOINT\"]\n", | |
['import os', 'from getpass import getpass', '', "KDBAI_ENDPOINT = os.environ.get('KDBAI_ENDPOINT', input('KDB.AI endpoint: '))", "KDBAI_API_KEY = os.environ.get('KDBAI_API_KEY', getpass('KDB.AI API key: '))"] |
batch_size = self.args.get("batch_size", 10_000) or 10_000 | ||
pbar = tqdm(total=df.shape[0], desc="Inserting data") | ||
while i < df.shape[0]: | ||
chunk = df[i : i + batch_size].reset_index(drop=True) | ||
# Assuming 'table' has an 'insert' method | ||
try: | ||
table.insert(chunk) | ||
pbar.update(chunk.shape[0]) | ||
i += batch_size | ||
except kdbai.KDBAIException as e: | ||
if "smaller batches" in str(e): | ||
tqdm.write( | ||
f"Reducing batch size to {batch_size * 2 // 3}" | ||
) | ||
batch_size = batch_size * 2 // 3 | ||
else: | ||
|
||
i = 0 | ||
try: | ||
while i < df.shape[0]: | ||
chunk = df.iloc[ | ||
i : min(i + batch_size, df.shape[0]) | ||
].reset_index(drop=True) | ||
|
||
try: | ||
table.insert(chunk) | ||
pbar.update(chunk.shape[0]) | ||
i += batch_size | ||
except kdbai.KDBAIException as e: | ||
raise RuntimeError(f"Error inserting chunk: {e}") | ||
continue | ||
self.total_imported_count += len(df) | ||
if max_hit: | ||
break | ||
if max_hit: | ||
break | ||
if max_hit: | ||
tqdm.write( | ||
f"Max rows to be imported {self.args['max_num_rows']} hit. Exiting" | ||
) | ||
break | ||
finally: | ||
pbar.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: Use batch inserts for improved performance
Solution: Implement batch inserts using the table.insert()
method with a configurable batch size.
!! Make sure the following suggestion is correct before committing it !!
batch_size = self.args.get("batch_size", 10_000) or 10_000 | |
pbar = tqdm(total=df.shape[0], desc="Inserting data") | |
while i < df.shape[0]: | |
chunk = df[i : i + batch_size].reset_index(drop=True) | |
# Assuming 'table' has an 'insert' method | |
try: | |
table.insert(chunk) | |
pbar.update(chunk.shape[0]) | |
i += batch_size | |
except kdbai.KDBAIException as e: | |
if "smaller batches" in str(e): | |
tqdm.write( | |
f"Reducing batch size to {batch_size * 2 // 3}" | |
) | |
batch_size = batch_size * 2 // 3 | |
else: | |
i = 0 | |
try: | |
while i < df.shape[0]: | |
chunk = df.iloc[ | |
i : min(i + batch_size, df.shape[0]) | |
].reset_index(drop=True) | |
try: | |
table.insert(chunk) | |
pbar.update(chunk.shape[0]) | |
i += batch_size | |
except kdbai.KDBAIException as e: | |
raise RuntimeError(f"Error inserting chunk: {e}") | |
continue | |
self.total_imported_count += len(df) | |
if max_hit: | |
break | |
if max_hit: | |
break | |
if max_hit: | |
tqdm.write( | |
f"Max rows to be imported {self.args['max_num_rows']} hit. Exiting" | |
) | |
break | |
finally: | |
pbar.close() | |
['batch_size = 10_000', 'pbar = tqdm(total=df.shape[0], desc="Inserting data")', 'i = 0', 'while i < df.shape[0]:', ' chunk = df.iloc[', ' i : min(i + batch_size, df.shape[0])', ' ].reset_index(drop=True)', ' table.insert(chunk)', ' pbar.update(chunk.shape[0])', ' i += batch_size', 'pbar.close()'] |
try: | ||
if new_index_name in self.session.list(): | ||
table = self.session.table(new_index_name) | ||
if new_index_name in [name.name for name in self.db.tables]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: Sanitize user input for table name
Solution: Implement a function to sanitize the table name before using it to create or access the table.
!! Make sure the following suggestion is correct before committing it !!
if new_index_name in [name.name for name in self.db.tables]: | |
['def _sanitize_table_name(self, name):', ' # Implement sanitization logic here, e.g.:', " return re.sub(r'[^a-zA-Z0-9_]', '_', name)", '', 'def _create_table(self, name, schema, index):', ' name = self._sanitize_table_name(name)', ' # ... (existing code)'] |
Signed-off-by: alexgiannak <[email protected]>
embedding_dist = standardize_metric( | ||
tab_schema["columns"][i]["vectorIndex"]["metric"], self.DB_NAME_SLUG | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove the commented out code above as well
Thanks for sending this out, @alexgiannak! |
🔍 Code Review Summary❗ Attention Required: This push has potential issues. 🚨 Overview
🚨 Critical Issuesbest_practices (2 issues)1. Use environment variables for sensitive information📁 File: src/vdf_io/notebooks/kdbai_end_to_end_vectorIO.ipynb 💡 Solution: Current Code: ['KDBAI_ENDPOINT = (', ' os.environ["KDBAI_ENDPOINT"]', ' if "KDBAI_ENDPOINT" in os.environ', ' else input("KDB.AI endpoint: ")', ')', 'KDBAI_API_KEY = (', ' os.environ["KDBAI_API_KEY"]', ' if "KDBAI_API_KEY" in os.environ', ' else getpass("KDB.AI API key: ")', ')'] Suggested Code: ['KDBAI_ENDPOINT = os.environ.get("KDBAI_ENDPOINT", input("KDB.AI endpoint: "))', 'KDBAI_API_KEY = os.environ.get("KDBAI_API_KEY", getpass("KDB.AI API key: "))'] 2. Validate input for table name📁 File: src/vdf_io/import_vdf/kdbai_import.py 💡 Solution: Current Code: ['new_index_name = self.compliant_name(index_name)'] Suggested Code: ['import re', '', 'def validate_table_name(name):', " allowed_pattern = r'^[a-zA-Z0-9_]+$'", ' if not re.match(allowed_pattern, name):', " raise ValueError(f'Invalid table name:{name}. Only alphanumeric characters and underscores are allowed.')", ' return name', '', 'new_index_name = validate_table_name(self.compliant_name(index_name))']
Useful Commands
|
Update KDB.AI Integration and CLI Enhancements
Improve KDB.AI client integration and enhance command-line interface for exporting and importing data.
kdbai-client
dependency to version 1.4.0.These changes streamline the user experience and improve the robustness of KDB.AI data operations.
Original Description
# Update KDB.AI Client and Improve Import/Export Functionality**
Update the KDB.AI client library and enhance the import/export functionality for the VDF (Vector Data Format) IO module.
kdbai-client
dependency to version 1.4.0 or higher.ExportKDBAI
andImportKDBAI
classes to handle various data types and provide a more robust import/export experience.ImportKDBAI
class.ImportKDBAI
class to handle larger datasets more efficiently.**
These changes will improve the reliability and usability of the VDF IO module when interacting with KDB.AI cloud or server instances. Users can now export and import a wider range of data types and index configurations, leading to a more seamless integration with KDB.AI.
Original Description
# Update KDB.AI Client and Improve Import/Export Workflows**
**
Update the KDB.AI client library, simplify the import/export workflows, and enhance the overall functionality.
max_num_rows
limit and batch size handling, allowing full data import.**
**
The changes improve the overall user experience and reliability of the KDB.AI integration, making it easier to import and export data to/from the KDB.AI platform.
Original Description
# Update KDB.AI Integration**
**
**
Enhance the KDB.AI integration by updating dependencies and improving argument handling.
kdbai-client
dependency to version>=1.4.0
.kdbai_endpoint
,kdbai_api_key
, andtables_names
inkdbai_export.py
.None
values and prompt for input if necessary.kdbai_import.py
.**
**
**
These changes streamline the integration process and improve usability for developers working with KDB.AI.
Original Description
Update to work with KDB.AI v1.4 that is coming out on Monday 21st