Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__init__.py: rework imports #334

Merged
merged 5 commits into from
Mar 6, 2025

Conversation

mistydemeo
Copy link
Contributor

Although doublethink is an optional dependency to allow brozzler to be used as a library without it, in practice we had some mandatory import statements that prevented brozzler from being imported without it. This fixes that by gating off some of the imports and exports.

If doublethink is available, brozzler works as it is now. But if it isn't, we make a few changes:

  • brozzler.worker, brozzler.cli and brozzler.model reexports are disabled
  • One brozzler.cli function, which is used outside brozzler's own cli, has been moved into brozzler's __init__.py. For compatibility, it's reexported from brozzler.cli.

@mistydemeo
Copy link
Contributor Author

I've cherry-picked @avdempsey's two commits from #335, with black formatting applied and the necessary tweaking to apply on top of my patch.

@mistydemeo mistydemeo force-pushed the doublethink-optional branch from 45556ca to a0a06b5 Compare March 4, 2025 22:33
@mistydemeo mistydemeo force-pushed the doublethink-optional branch 2 times, most recently from 1f89a91 to 91322ab Compare March 5, 2025 18:13
mistydemeo and others added 4 commits March 5, 2025 11:02
Although doublethink is an optional dependency to allow brozzler to be
used as a library without it, in practice we had some mandatory import
statements that prevented brozzler from being imported without it.
This fixes that by gating off some of the imports and exports.

If doublethink is available, brozzler works as it is now. But if it
isn't, we make a few changes:

* brozzler.worker, brozzler.cli and brozzler.model reexports are
  disabled
* One brozzler.cli function, which is used outside brozzler's own cli,
  has been moved into brozzler's __init__.py. For compatibility, it's
  reexported from brozzler.cli.
@@ -260,6 +269,38 @@ def thumb_jpeg(self, full_jpeg):
img.save(out, "jpeg", quality=95)
return out.getbuffer()

def should_ytdlp(self, logger, site, page, page_status, skip_av_seeds):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More a note to self than anything, but I have changes to this method in #288 (per-seed video exclusion) that will need to be moved over.

@mistydemeo mistydemeo force-pushed the doublethink-optional branch from 4879e00 to 907eace Compare March 5, 2025 19:39
@mistydemeo
Copy link
Contributor Author

In local testing, I realized that the brozzle-page command still had the yt-dlp import error since we bypass worker's constructor check for yt-dlp and go directly to the CLI. Added a commit to address that.

@gretchenleighmiller
Copy link
Contributor

gretchenleighmiller commented Mar 5, 2025

I've been doing some local testing, and I've found that rethinkdb==2.4.9 does not play well with Python 3.12. I tried locally relaxing that constraint to "rethinkdb>=2.4.9" and upgrading to 2.4.10post1; that worked.

Before with rethinkdb==2.4.9, I was getting distutils errors and could not install a version that satisfied requirements:

(brozzler-updated-deps) gretchenmiller@Gretchens-MacBook-Pro brozzler-updated-deps % brozzle-page https://example.org/      
Traceback (most recent call last):
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/bin/brozzle-page", line 5, in <module>
    from brozzler.cli import brozzle_page
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/lib/python3.12/site-packages/brozzler/cli.py", line 22, in <module>
    import brozzler.worker
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/lib/python3.12/site-packages/brozzler/worker.py", line 36, in <module>
    import doublethink
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/lib/python3.12/site-packages/doublethink/__init__.py", line 20, in <module>
    import rethinkdb
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/lib/python3.12/site-packages/rethinkdb/__init__.py", line 93, in <module>
    r = RethinkDB()
        ^^^^^^^^^^^
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/lib/python3.12/site-packages/rethinkdb/__init__.py", line 32, in __init__
    from rethinkdb import (
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/lib/python3.12/site-packages/rethinkdb/_dump.py", line 35, in <module>
    from rethinkdb import _export, utils_common
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/lib/python3.12/site-packages/rethinkdb/_export.py", line 40, in <module>
    from rethinkdb import errors, query, utils_common
  File "/Users/gretchenmiller/Repos/brozzler-updated-deps/.venv/lib/python3.12/site-packages/rethinkdb/utils_common.py", line 22, in <module>
    import distutils.version
ModuleNotFoundError: No module named 'distutils'
(brozzler-updated-deps) gretchenmiller@Gretchens-MacBook-Pro brozzler-updated-deps % uv pip install distutils
  × No solution found when resolving dependencies:
  ╰─▶ Because there are no versions of distutils and you require distutils, we can conclude that your requirements are unsatisfiable.

This is under Python 3.12. I'm using uv to manage the virtual env, and I'm installing from a local copy of the doublethink-optional branch and making just this one change.

--- a/setup.py
+++ b/setup.py
@@ -85,7 +85,7 @@ setuptools.setup(
             "warcprox>=2.4.31",
         ],
         "rethinkdb": [
-            "rethinkdb==2.4.9",
+            "rethinkdb>=2.4.9",
             "doublethink==0.4.9",
         ],
     },

Virtual env packages installed:

(brozzler-updated-deps) gretchenmiller@Gretchens-MacBook-Pro misty-brozzler % uv pip list
Using Python 3.12.7 environment at /Users/gretchenmiller/Repos/brozzler-updated-deps/.venv
Package            Version
------------------ ------------
brozzler           1.6.9
cerberus           1.3.7
certifi            2025.1.31
cffi               1.17.1
charset-normalizer 3.4.1
cryptography       44.0.2
doublethink        0.4.9
idna               3.10
jinja2             3.1.6
looseversion       1.3.0
markupsafe         3.0.2
pillow             11.1.0
prometheus-client  0.21.1
publicsuffix       1.1.1
pycparser          2.22
python-dateutil    2.9.0.post0
python-magic       0.4.27
pyyaml             6.0.2
reppy              0.3.4
requests           2.32.3
rethinkdb          2.4.10.post1
six                1.17.0
structlog          25.1.0
url                0.1.7
urlcanon           0.3.1
urllib3            2.3.0
websocket-client   1.8.0

Upgrading rethinkdb followed by brozzle-page success

(brozzler-updated-deps) gretchenmiller@Gretchens-MacBook-Pro brozzler-updated-deps % uv pip install --upgrade rethinkdb
Resolved 3 packages in 75ms
Prepared 2 packages in 33ms
Uninstalled 1 package in 9ms
Installed 2 packages in 3ms
 + looseversion==1.3.0
 - rethinkdb==2.4.9
 + rethinkdb==2.4.10.post1
(brozzler-updated-deps) gretchenmiller@Gretchens-MacBook-Pro brozzler-updated-deps % brozzle-page https://example.org                      
2025-03-05 20:08:46 [info     ] adding ssurt to scope accept rules [brozzler.model.Site._accept_ssurt_if_not_redundant(model.py:284)] ssurt=org,example,//https:/
2025-03-05 20:08:46 [info     ] optional yt-dlp extra not installed; setting skip_youtube_dl to True [brozzler.worker.BrozzlerWorker.__init__(worker.py:103)]
2025-03-05 20:08:46 [warning  ] not starting prometheus scrape endpoint: metrics_port is undefined [brozzler.worker.BrozzlerWorker.__init__(worker.py:138)]
2025-03-05 20:08:46 [info     ] running                        [brozzler.chrome.Chrome.start(chrome.py:224)] chrome_args='"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" -v --remote-debugging-port=9222 --remote-allow-origins=http://localhost:9222 --use-mock-keychain --user-data-dir=/var/folders/y4/pf777_x15f18g2vwk93s2l040000gn/T/tmp1rjnrfy6/chrome-user-data --disable-background-networking --disable-breakpad --disable-renderer-backgrounding --disable-hang-monitor --disable-background-timer-throttling --mute-audio --disable-web-sockets --window-size=1400,900 --no-default-browser-check --disable-first-run-ui --no-first-run --homepage=about:blank --disable-features=HttpsUpgrades,OptimizationGuideModelDownloading,OptimizationHintsFetching,OptimizationTargetPrediction,OptimizationHints --disable-direct-npapi-requests --disable-web-security --disable-notifications --disable-extensions --disable-save-password-bubble --disable-sync --headless=new about:blank'
2025-03-05 20:08:46 [info     ] chrome running                 [brozzler.chrome.Chrome.start(chrome.py:240)] pid=38116
2025-03-05 20:08:46 [info     ] got chrome window websocket debug url [brozzler.chrome.Chrome._websocket_url(chrome.py:259)] debug_url=ws://localhost:9222/devtools/page/3649B104DEB6BE1084650A9038D2DFE0 json_url=http://localhost:9222/json
2025-03-05 12:08:47,501 38004 INFO WebsockThread:9222 websocket.info(_logging.py:89) Websocket connected
2025-03-05 20:08:48 [info     ] brozzling                      [brozzler.worker.BrozzlerWorker.brozzle_page(worker.py:316)] page={'url': 'https://example.org', 'site_id': -1, 'retry_after': None, 'failed_attempts': 0, 'hops_from_seed': 0, 'hop_path': None, 'via_page_url': None, 'brozzle_count': 0, 'claimed': False, 'hops_off': 0, 'needs_robots_check': False, 'priority': 13, 'id': '5267dc6e659be65bdda96c120d131df3a4b25f04'}
2025-03-05 20:08:48 [info     ] getting page headers           [brozzler.worker.BrozzlerWorker._get_page_headers(worker.py:382)] url=https://example.org
2025-03-05 20:08:48 [info     ] needs browsing                 [brozzler.worker.BrozzlerWorker.brozzle_page(worker.py:325)] page={'url': 'https://example.org', 'site_id': -1, 'retry_after': None, 'failed_attempts': 0, 'hops_from_seed': 0, 'hop_path': None, 'via_page_url': None, 'brozzle_count': 0, 'claimed': False, 'hops_off': 0, 'needs_robots_check': False, 'priority': 13, 'id': '5267dc6e659be65bdda96c120d131df3a4b25f04'}
2025-03-05 20:08:48 [info     ] navigating to page             [brozzler.browser.Browser.navigate_to_page(browser.py:686)] page_url=https://example.org
2025-03-05 20:08:48 [info     ] rendering template             [brozzler.behavior_script(__init__.py:139)] parameters={'interval': 10, 'actions': [{'selector': 'button#didomi-notice-agree-button, button.sc-button-play, .playButton, div.soundItem, .jwlist>a, audio, video'}]} template=umbraBehavior.js.j2 url=https://example.org
2025-03-05 20:09:03 [info     ] behavior decided it has finished [brozzler.browser.Browser.run_behavior(browser.py:818)]
2025-03-05 20:09:04 [info     ] taking screenshot              [brozzler.browser.Browser.screenshot(browser.py:728)]
2025-03-05 20:09:04 [info     ] wrote screenshot               [brozzler.cli.on_screenshot(cli.py:382)]
2025-03-05 20:09:04 [info     ] extracting outlinks            [brozzler.browser.Browser.extract_outlinks(browser.py:693)]
2025-03-05 20:09:05 [info     ] outlinks                       [brozzler.cli.brozzle_page(cli.py:398)] outlinks=['https://www.iana.org/domains/example']
2025-03-05 20:09:05 [info     ] shutting down websocket connection [brozzler.browser.Browser.stop(browser.py:430)]
2025-03-05 20:09:05 [info     ] terminating chrome             [brozzler.chrome.Chrome.stop(chrome.py:338)] pid=38116
2025-03-05 20:09:05 [info     ] chrome exited normally         [brozzler.chrome.Chrome.stop(chrome.py:348)] pid=38116

@vbanos
Copy link
Contributor

vbanos commented Mar 6, 2025

Looks good to me.

@mistydemeo
Copy link
Contributor Author

I've been asked about the impact of this on users more broadly, so I wanted to summarize exactly what users should expect. Broadly: there are no impacts, except in very specific circumstances. We haven't broken compatibility with anything, and users will only see any kinds of changes if they don't have optional modules installed that were formerly mandatory.

For users who use brozzler's commandline tools (brozzle-page or any of the worker entrypoints), there are only minimal changes. These usecases still require doublethink/rethinkdb like they did before, and users who had a working installation of brozzler before will have no issues upgrading to this version. The only change in this workflow is that yt-dlp is truly optional. It's been an optional dependency for some time, but our import structure meant that you couldn't actually run brozzler without it; it would error out about not being able to import the module. If users run brozzler with yt-dlp missing, they'll be informed in the logs and all yt-dlp functionality will be turned off.

The biggest impact is for users who use brozzler as a library, rather than run the worker or the CLI tool. Although we intended to make doublethink and yt-dlp optional to support this case, it wasn't actually possible to import brozzler without doublethink/yt-dlp being present. If users do have those libraries installed, nothing changes. But if they don't, it's now possible to import brozzler and safely use brozzler's APIs where it wasn't before. This shouldn't be a breaking change, just a purely positive one. To accommodate these users, doublethinkless library users need to import suggest_default_chrome_exe from brozzler instead of brozzler.cli because brozzler.cli is unsafe to import from without doublethink.

@mistydemeo mistydemeo merged commit 21102ca into internetarchive:master Mar 6, 2025
3 checks passed
@mistydemeo mistydemeo deleted the doublethink-optional branch March 6, 2025 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants