PGVector impl #100

Open
wants to merge 10 commits into
base: main

dhruv-anand-aintech (Member) commented May 21, 2024

fixes #54

sweep-ai bot (Contributor) commented May 21, 2024

Sweep: PR Review

README.md

The changes in the README.md file involve reordering and updating the status of various vector databases in the "In Progress" and "Not Supported" sections.

Potential Issues

Sweep isn't 100% sure if the following are issues or not but they may be worth taking a look at.

  • The removal of "Neo4j" and "Apache Solr" from the "Not Supported" section contradicts their addition in the same section, leading to potential confusion about their support status.
    vector-io/README.md, lines 84 to 88 in cb7e7ff:

    | Neo4j |||
    | Marqo |||
    | OpenSearch |||
    | Elasticsearch |||
    | Apache Solr |||


src/vdf_io/export_vdf/pgvector_export.py

The changes introduce the ExportPGVector class to handle exporting data from PGVector tables in a PostgreSQL database, including methods for argument parsing, data retrieval, and metadata generation.

Sweep Found These Issues

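  • The get_all_schemas and get_all_table_names methods call self.conn.execute, which a psycopg2 connection object does not provide. Queries must be run through a cursor (self.conn.cursor()), and because cursor.execute() returns None in psycopg2, the rows have to be read from the cursor itself, for example with fetchall().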
    def get_all_schemas(self):
        schemas = self.conn.execute(
            "SELECT schema_name FROM information_schema.schemata"
        )
        self.all_schemas = [schema[0] for schema in schemas]
        return [schema[0] for schema in schemas]

    def get_all_table_names(self):
        tables = self.conn.execute(
            "SELECT table_name FROM information_schema.tables WHERE table_schema='public'"
        )
        self.all_tables = [table[0] for table in tables]
        return [table[0] for table in tables]
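One way to route these queries through a cursor is sketched below. This is a minimal sketch, not the PR's code: fetch_first_column is a hypothetical helper, the functions take the connection as an argument instead of living on the class, and contextlib.closing is used so the helper works with any DB-API connection. The %s placeholder assumes psycopg2's parameter style.

```python
from contextlib import closing


def fetch_first_column(conn, query, params=None):
    """Run a query through a cursor and return the first column of each row."""
    # closing() calls cursor.close() on exit, whatever the driver.
    with closing(conn.cursor()) as cur:
        if params is None:
            cur.execute(query)
        else:
            cur.execute(query, params)
        # Read the rows from the cursor, not from the return value of execute().
        return [row[0] for row in cur.fetchall()]


def get_all_schemas(conn):
    return fetch_first_column(
        conn, "SELECT schema_name FROM information_schema.schemata"
    )


def get_all_table_names(conn, schema="public"):
    # Pass the schema as a bound parameter instead of inlining it in the SQL.
    return fetch_first_column(
        conn,
        "SELECT table_name FROM information_schema.tables WHERE table_schema = %s",
        (schema,),
    )
```

Because the helper returns a plain list, the result can be stored on self.all_schemas (or self.all_tables) and returned as-is, without iterating the query result a second time.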


src/vdf_io/import_vdf/pgvector_import.py

Introduced a new class ImportPGVector for importing data into a PGVector database, including methods for database connection, schema and table retrieval, and data upsertion from Parquet files.

Sweep Found These Issues

  • The get_all_schemas and get_all_table_names methods use self.conn.execute which is not a valid method for a psycopg2 connection object; it should use a cursor object to execute SQL queries.
        schemas = self.conn.execute(
            "SELECT schema_name FROM information_schema.schemata"
        )
        self.all_schemas = [schema[0] for schema in schemas]
        return [schema[0] for schema in schemas]

    def get_all_table_names(self):
        tables = self.conn.execute(
            "SELECT table_name FROM information_schema.tables WHERE table_schema='public'"
        )
        self.all_tables = [table[0] for table in tables]

  • The upsert_data method assumes that self.vdf_meta is already populated, but there is no code to load or initialize this attribute, which may lead to AttributeError.
    indexes_content: Dict[str, List[NamespaceMeta]] = self.vdf_meta["indexes"]
    index_names: List[str] = list(indexes_content.keys())
    if len(index_names) == 0:
        raise ValueError("No indexes found in VDF_META.json")
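A guard for the missing metadata could look like the sketch below. It assumes that VDF_META.json (the file name comes from the error message above) sits at the top of the import directory. load_vdf_meta and its import_dir parameter are hypothetical names, not part of the PR.

```python
import json
from pathlib import Path


def load_vdf_meta(import_dir):
    """Load VDF_META.json from import_dir and check that it lists indexes."""
    # Assumption: the metadata file lives directly inside the import directory.
    meta_path = Path(import_dir) / "VDF_META.json"
    if not meta_path.is_file():
        raise FileNotFoundError(f"VDF_META.json not found in {import_dir}")
    with meta_path.open(encoding="utf-8") as f:
        meta = json.load(f)
    # Fail early, before any table is touched, if there is nothing to import.
    if not meta.get("indexes"):
        raise ValueError("No indexes found in VDF_META.json")
    return meta
```

Assigning the result to self.vdf_meta in the constructor, before upsert_data runs, would turn the possible AttributeError into an explicit, earlier failure.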

Potential Issues

Sweep isn't 100% sure if the following are issues or not but they may be worth taking a look at.

  • The upsert_data method uses self.conn.create_table and self.conn.open_table which are not valid methods for a psycopg2 connection object; these should be replaced with appropriate SQL commands or ORM methods.
            table = self.conn.create_table(
                new_index_name, schema=pq.read_schema(parquet_files[0])
            )
            tqdm.write(f"Created table {new_index_name}")
        else:
            table = self.conn.open_table(new_index_name)
            tqdm.write(f"Opened table {new_index_name}")
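If the table is created with plain SQL instead, the statement could be assembled roughly as follows. This is a sketch under assumptions: build_create_table_sql and quote_ident are hypothetical helpers, the column types are trusted strings supplied by the caller, and the embedding column uses pgvector's vector(n) type. In real psycopg2 code, psycopg2.sql.Identifier is the safer way to compose identifiers.

```python
def quote_ident(name):
    """Quote an SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_create_table_sql(table, columns, vector_column, dim):
    """Build a CREATE TABLE statement with one pgvector embedding column.

    columns maps column names to SQL type strings, which are inserted as-is,
    so they must come from trusted code and not from user input.
    """
    col_defs = [f"{quote_ident(name)} {sql_type}" for name, sql_type in columns.items()]
    # int() rejects anything that is not a valid dimension count.
    col_defs.append(f"{quote_ident(vector_column)} vector({int(dim)})")
    return f"CREATE TABLE IF NOT EXISTS {quote_ident(table)} ({', '.join(col_defs)})"
```

The resulting string would be passed to cursor.execute(), after CREATE EXTENSION IF NOT EXISTS vector has been run once in the target database. IF NOT EXISTS covers both branches of the create-or-open logic quoted above.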


src/vdf_io/names.py

A new class attribute PGVECTOR was added to the DBNames class to include the "pgvector" database.


src/vdf_io/notebooks/jsonl_to_parquet.ipynb

The changes include reordering the execution of code cells, updating the file path for loading data, modifying the DataFrame display headers and data, and adding new functionality to calculate DataFrame length, display a summary, add a new column, and save the DataFrame as a Parquet file.

Sweep Found These Issues

  • The change in the file path for the jsonl_file variable may introduce issues if the new file does not exist or is not accessible, leading to a FileNotFoundError or similar error.
    "jsonl_file = '/Users/dhruvanand/Downloads/579aac087579b5acbc881164eb7af8b8-662d020154513ab1ef571e325a8fc649ebf64561/o200k_tokens.jsonl'\n"
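One possible way to keep a machine-specific path out of the notebook is to read it from the environment and fail with a clear message. In this sketch the function name, the JSONL_FILE variable name, and the relative default are all assumptions made for illustration.

```python
import os
from pathlib import Path


def resolve_jsonl_path(default="o200k_tokens.jsonl", env_var="JSONL_FILE"):
    """Return the JSONL input path from the environment, or a relative default."""
    path = Path(os.environ.get(env_var, default)).expanduser()
    # Check up front so the error names the path that was actually tried.
    if not path.is_file():
        raise FileNotFoundError(f"JSONL input not found: {path}")
    return path
```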


src/vdf_io/pgvector_util.py

Introduced a new module pgvector_util.py with functions to create a command-line argument parser and prompt for Postgres connection details.

Sweep Found These Issues

  • The function set_pgv_args_from_prompt does not validate the connection string format, which could lead to runtime errors if an invalid connection string is provided.
    set_arg_from_input(
        args,
        "connection_string",
        "Enter the connection string to Postgres instance: ",
        str,
    )

  • The default password "postgres" is set if the user does not provide one, which could be a security risk.
    if not args.get("password"):
        # If password is not provided, set it to "postgres"
        args["password"] = "postgres"
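Both findings could be addressed along the lines of the sketch below. It rests on assumptions: validate_postgres_uri and resolve_password are hypothetical helpers, the validation covers only URI-style connection strings and is stricter than libpq (which also accepts keyword/value strings and URIs without a host), and the password falls back to libpq's PGPASSWORD environment variable before prompting.

```python
import getpass
import os
from urllib.parse import urlsplit


def validate_postgres_uri(conn_str):
    """Check that conn_str looks like a postgresql:// URI and return it stripped."""
    conn_str = conn_str.strip()
    parts = urlsplit(conn_str)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError("Connection string must start with postgresql://")
    if not parts.hostname:
        raise ValueError("Connection string is missing a host")
    # Accessing .port raises ValueError when the port is not a valid integer.
    parts.port
    return conn_str


def resolve_password(args, env=os.environ, prompt=getpass.getpass):
    """Fill args['password'] from args, PGPASSWORD, or a hidden prompt."""
    password = args.get("password") or env.get("PGPASSWORD")
    if not password:
        # getpass hides the typed characters; no default is ever substituted.
        password = prompt("Enter the Postgres password: ")
    if not password:
        raise ValueError("A Postgres password is required")
    args["password"] = password
    return password
```

Making env and prompt parameters keeps the function testable without a terminal. Callers would normally rely on the defaults.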


Development

Successfully merging this pull request may close these issues.

Add support for pgvector