
Use SystemLimit instead of settings for device rate limiting #35781

Open
wants to merge 13 commits into base: master

Conversation


@gherceg gherceg commented Feb 14, 2025

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-16355
https://dimagi.atlassian.net/browse/SAAS-16561

I recently added a SystemLimit class that makes it much easier to manage limits, and this PR moves this limit over from settings to SystemLimit. A few other changes accompany this refactor, like moving the setting that enables the device rate limiter over to a feature flag, and a slight bug fix (sort of) for the Redis key not being scoped to a domain.

Feature Flag

DEVICE_RATE_LIMITER

Safety Assurance

Safety story

Automated test coverage

QA Plan

No, I'll test myself.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

Realistically I don't think this is a likely scenario on our current
platform, since a mobile worker is scoped to a domain, but I noticed this
vulnerability in the test and decided it was easy enough to protect
against.
This commit did a lot. In addition to replacing the limit in settings
with SystemLimit, I removed the feature flag for an increased limit,
since SystemLimit solves that, and moved the setting that enables the
device rate limiter to a feature flag for ease of use. I also updated
metrics to differentiate between an actual rate-limited attempt and a
would-be rate-limited one.
@gherceg gherceg marked this pull request as ready for review February 17, 2025 15:20
@gherceg gherceg requested a review from esoergel as a code owner February 17, 2025 15:20
Contributor Author

gherceg commented Feb 17, 2025

@coderabbitai review


coderabbitai bot commented Feb 17, 2025

Walkthrough

The changes modify how device rate limiting is implemented. In the user application, the DeviceRateLimiter class now retrieves the per-user device limit by calling SystemLimit.for_key with the domain provided, rather than relying on a feature flag to increase limits. The rate_limit_device method’s control flow now checks if the device rate limiter toggle is enabled for the given domain. If enabled, it logs a rate-limited metric and returns True; otherwise, it logs an alternative metric and returns False. Additionally, the Redis key generation has been updated to include the domain, ensuring uniqueness per domain and user pair. In the tests, the class now uses Django’s TestCase (with database access) instead of SimpleTestCase and includes additional scenarios for domain-specific limits and overlapping domains. Also, a new method for_key has been added to the SystemLimit model with tests validating its behavior. Finally, the old toggle and related settings have been removed in favor of the new restrictive approach using the DEVICE_RATE_LIMITER toggle.

Sequence Diagram(s)

sequenceDiagram
    participant U as User/Application
    participant DRL as DeviceRateLimiter
    participant Toggle as Toggle Module
    participant Redis as Redis Store
    participant Metric as Metrics Logger

    U->>DRL: Call rate_limit_device(domain, user_id, device_id)
    DRL->>Toggle: Check if DEVICE_RATE_LIMITER toggle enabled for domain
    alt Toggle Enabled
        DRL->>Metric: Log rate-limited event
        DRL->>DRL: Generate Redis key using (domain, user_id)
        DRL->>Redis: Retrieve/Update usage data with unique key
        DRL->>U: Return True
    else Toggle Disabled
        DRL->>Metric: Log non rate-limited event
        DRL->>U: Return False
    end
sequenceDiagram
    participant Caller as Caller
    participant SL as SystemLimit
    participant DB as Database

    Caller->>SL: Call for_key(key, domain)
    SL->>DB: Query SystemLimit objects filtering by key and domain (if provided)
    DB->>SL: Return sorted matching records (descending by domain)
    SL->>Caller: Return first matching limit or None
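The for_key lookup described in the second diagram can be sketched in plain Python. This models only the precedence semantics — an in-memory list of records stands in for the real Django ORM, and the record shape here is an assumption, not the code from this PR:

```python
# Sketch of the precedence semantics of SystemLimit.for_key, using an
# in-memory list of records instead of the real Django ORM (assumption).
# A record with domain == "" is a global limit; a record with a non-empty
# domain overrides the global one for that domain only.

def for_key(records, key, domain=""):
    # Collect limits matching the key, in either the global scope or the
    # requested domain.
    matches = [r for r in records if r["key"] == key and r["domain"] in ("", domain)]
    if not matches:
        return None
    # Sorting by domain descending puts the domain-specific record (non-empty
    # string) ahead of the global one (empty string), mirroring the
    # order_by("-domain") described in the walkthrough.
    matches.sort(key=lambda r: r["domain"], reverse=True)
    return matches[0]["limit"]

records = [
    {"key": "foo", "domain": "", "limit": 1},     # global limit
    {"key": "foo", "domain": "bar", "limit": 2},  # domain-specific override
]

print(for_key(records, "foo"))                # global: 1
print(for_key(records, "foo", domain="bar"))  # override: 2
print(for_key(records, "foo", domain="baz"))  # falls back to global: 1
print(for_key(records, "missing"))            # no match: None
```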

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
corehq/apps/users/device_rate_limiter.py (1)

30-30: Consider caching domain limit lookups.
Repeatedly calling SystemLimit.for_key on every request might be costly under heavy load. For high-volume domains, you could cache the result of SystemLimit.for_key for a short duration (e.g., using Django’s caching frameworks) to improve performance.

 def device_limit_per_user(self, domain):
-    return SystemLimit.for_key(DEVICE_LIMIT_PER_USER_KEY, domain=domain) or DEVICE_LIMIT_PER_USER_DEFAULT
+    limit = SystemLimit.for_key(DEVICE_LIMIT_PER_USER_KEY, domain=domain)
+    if limit is None:
+        limit = DEVICE_LIMIT_PER_USER_DEFAULT
+    return limit
corehq/project_limits/models.py (1)

101-111: LGTM! The implementation effectively handles domain-specific limits.

The for_key method is well-implemented with:

  • Clear prioritization of domain-specific limits over global limits
  • Efficient query construction using Q objects
  • Proper handling of the optional domain parameter

Consider adding docstring examples to illustrate the prioritization behavior:

     """
     Return the value associated with the given key, prioritizing specificity
+
+    Examples:
+        >>> SystemLimit.objects.create(key='foo', limit=1)  # global limit
+        >>> SystemLimit.objects.create(key='foo', domain='bar', limit=2)  # domain-specific
+        >>> SystemLimit.for_key('foo')  # returns 1 (global)
+        >>> SystemLimit.for_key('foo', domain='bar')  # returns 2 (domain-specific)
+        >>> SystemLimit.for_key('foo', domain='baz')  # returns 1 (falls back to global)
     """
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e96195 and d7b0205.

📒 Files selected for processing (6)
  • corehq/apps/users/device_rate_limiter.py (5 hunks)
  • corehq/apps/users/tests/test_device_rate_limiter.py (1 hunks)
  • corehq/project_limits/models.py (1 hunks)
  • corehq/project_limits/tests/test_system_limit.py (1 hunks)
  • corehq/toggles/__init__.py (1 hunks)
  • settings.py (0 hunks)
💤 Files with no reviewable changes (1)
  • settings.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py: Review the code following best practices and standards

  • corehq/project_limits/models.py
  • corehq/apps/users/tests/test_device_rate_limiter.py
  • corehq/project_limits/tests/test_system_limit.py
  • corehq/toggles/__init__.py
  • corehq/apps/users/device_rate_limiter.py
🔇 Additional comments (9)
corehq/apps/users/device_rate_limiter.py (3)

17-18: Consistent naming for keys and fallback default.
Defining separate constants for the device limit key and its default value is good practice and keeps the logic clear.


67-74: Toggle-based rate limiting logic is well-structured.
Using the DEVICE_RATE_LIMITER toggle here is clear. The contrasting metric counters (rate_limited vs. rate_limited.test) help differentiate environments or usage scenarios. Ensure that the “test” metric is being tracked intentionally.


76-84: Domain-scoped Redis key fixes domain collision issues.
Including the domain in the Redis key creation is a solid fix to scope usage data more accurately. This prevents conflicts across domains.
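As a rough illustration of that fix, the domain-scoped key could look like the following. The prefix and separator here are assumptions for illustration, not the actual key format from this PR:

```python
# Hypothetical sketch of a Redis key scoped to both domain and user, so two
# users with the same user_id on different domains no longer share a usage
# bucket. The "device-rate-limiter" prefix and ":" separator are assumptions.

def device_usage_key(domain, user_id):
    return f"device-rate-limiter:{domain}:{user_id}"

# Keys for the same user_id now differ across domains, which is the fix here:
print(device_usage_key("domain-a", "user-123"))
print(device_usage_key("domain-b", "user-123"))
```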

corehq/toggles/__init__.py (1)

2998-3005: New feature toggle aligns with the PR’s objective.
The DEVICE_RATE_LIMITER toggle is declared clearly and includes an owner. Consider adding a help_link if there’s existing documentation for domain admins.

corehq/project_limits/tests/test_system_limit.py (1)

1-19: Comprehensive test coverage for system limits.
These tests thoroughly confirm the fallback to None, the general limit, and the domain-specific limit. The approach is straightforward and covers core scenarios.

corehq/apps/users/tests/test_device_rate_limiter.py (4)

11-12: LGTM! Good transition to TestCase for database access.

The change from SimpleTestCase to TestCase is appropriate given the need to interact with the database for SystemLimit objects.


65-74: LGTM! Well-structured test for domain-specific higher limits.

The test effectively verifies that:

  1. Domain-specific limits take precedence
  2. Different domains can have different limits
  3. Higher domain-specific limits allow more devices

75-85: LGTM! Comprehensive test for domain-specific lower limits.

The test properly verifies that:

  1. Global limits can be overridden by domain-specific limits
  2. Lower domain-specific limits are enforced correctly
  3. Other domains still use the global limit

97-101: LGTM! Good test for domain isolation.

The test ensures that:

  1. Device limits are properly scoped to domains
  2. Activity in one domain doesn't affect limits in another
  3. Same device ID can be rate-limited differently across domains

def test_allowed_if_no_devices_have_been_used_yet(self):
    self.assertFalse(device_rate_limiter.rate_limit_device(self.domain, 'user-id', 'new-device-id'))

@override_settings(DEVICE_LIMIT_PER_USER=2)
@flag_enabled("DEVICE_RATE_LIMITER")
Contributor


Can you do this on the class itself?

Contributor Author


It works at the class level, but then @flag_disabled didn't work for the one test where I wanted to use it, since the patch had already been applied by the @flag_enabled decorator. I briefly looked to see if it would be an easy fix but came up empty-handed, so I decided to just do this instead.

Contributor


that's annoying

self.addCleanup(clear_redis)

@flag_enabled("DEVICE_RATE_LIMITER")
Contributor


Can this decorator not be applied to the class? If not, wonder how hard it would be to change it to work at the class level?

Contributor Author


It works at the class level, but then @flag_disabled didn't work for the one test where I wanted to use it, since the patch had already been applied by the @flag_enabled decorator. I briefly looked to see if it would be an easy fix but came up empty-handed, so I decided to just do this instead.

Contributor


🙁

Contributor


I added these decorators originally and I vaguely recall also running into that shortcoming and not seeing an easy fix. Would love to see that addressed if anyone can figure it out.

Contributor Author


Added it as a hackathon idea for tomorrow 🤷🏻

- add more assertions in SystemLimit.for_key tests
- handle case where limit is 0
- use tag to determine if rate limited metric was enforced or not
Contributor Author

gherceg commented Feb 17, 2025

Expecting tests to fail due to not clearing redis. Offlining with @millerdev about what the proper solution is.

Contributor Author

gherceg commented Feb 18, 2025

Expecting tests to fail due to not clearing redis.

Just kidding this was because I wasn't clearing the classmethod cache correctly! Fixed in dbdd553

- For testing, skip caching for_key as it creates too many problems
- Make sure to clear the cache when a SystemLimit is deleted too
Clearing all of redis masked issues with the caching logic in
SystemLimit.for_key, which isn't good. The right approach is to skip
caching when running in a test.
Contributor Author

gherceg commented Feb 19, 2025

Alright no more commits! Sorry for the churn.

This doesn't actually matter much since the data being saved in redis is
very small, but this change caches more wisely. Previously each key+domain
combo would have a cached key, even though in most cases a domain will
not have a value distinct from the global one. This change only stores a
key+domain value in cache when necessary.
Contributor Author

gherceg commented Feb 19, 2025

I lied. Made a slight change to the caching strategy for SystemLimit that came up during the dev block. No more changes starting now.

Comment on lines 111 to 112
self._get_global_limit.clear(self.__class__, self.key)
self._get_domain_specific_limit.clear(self.__class__, self.key, self.domain)
Contributor Author


I could be smarter about only clearing the cache for the specific methods when relevant (if self.domain is a blank string vs something specific)

Contributor Author


...hmm, but what if someone modifies an existing limit to be scoped to a domain (or vice versa)? That would not be ideal usage, but it would result in us not clearing the correct cache with this updated logic...

We know which cache to clear depending on the domain field of the object
being saved/deleted.
domain_filter |= models.Q(domain=domain)
filters = models.Q(key=key) & domain_filter
return SystemLimit.objects.filter(filters).order_by("-domain").values_list("limit", flat=True).first()
domain_limit = cls._get_domain_specific_limit(key, domain) if domain else None
Contributor


This change only stores a key+domain value in cache when necessary.

Not quite clear what was meant by "when necessary", but this will cache a value (possibly None) for key+domain any time bool(domain) is True, right?

Could the second DB query below be avoided in that case (and a useful limit cached for each key+domain) if _get_domain_specific_limit used the original query?

SystemLimit.objects.filter(filters).order_by("-domain").values_list("limit", flat=True).first()

Contributor


That might complicate cache clearing though since the global limit would be cached per domain, and those would not be cleared when the global limit is changed.

I'm curious how much more performant this is with caching applied? The SQL query should be very fast, possibly not much slower than hitting redis. Would be interesting to measure.

This reverts commit ce7b010.

It is possible to modify an existing limit and add a domain, in which
case we would be clearing the wrong cache, so we need to revert.
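The approach the code reverted to — clearing both the global and the domain-specific entries unconditionally on save/delete — can be sketched with a plain dict standing in for the real cache backend. The key tuples and helper name here are hypothetical:

```python
# Sketch: clear both cache entries unconditionally on save/delete, rather
# than branching on the object's current domain field. A plain dict stands
# in for the real cache backend, and the key format is a hypothetical.

cache = {}

def clear_limit_caches(key, domain):
    # Always drop the global entry AND the domain-specific entry; branching
    # on the current domain value could leave a stale entry behind when a
    # limit's domain field changed between saves.
    cache.pop(("global", key), None)
    cache.pop(("domain", key, domain), None)

cache[("global", "foo")] = 1
cache[("domain", "foo", "bar")] = 2
clear_limit_caches("foo", "bar")
print(cache)  # both entries are gone: {}
```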
Labels: None yet
Projects: None yet
4 participants