
Use SystemLimit instead of settings for device rate limiting #35781

Open
wants to merge 13 commits into base: master

Conversation


@gherceg gherceg commented Feb 14, 2025

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-16355
https://dimagi.atlassian.net/browse/SAAS-16561

I recently added a SystemLimit class that makes it much easier to manage limits, and this PR moves this limit over from settings to SystemLimit. A few other changes accompany this refactor, like moving the setting that enables the device rate limiter over to a feature flag, and a slight bug fix (sort of) for the Redis key not being scoped to a domain.

Feature Flag

DEVICE_RATE_LIMITER

Safety Assurance

Safety story

Automated test coverage

QA Plan

No, I'll test myself.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

Realistically I don't think this is a likely scenario on our current
platform, since a mobile worker is scoped to a domain, but I noticed this
vulnerability in the test and decided it was easy enough to protect
against.
This commit did a lot. In addition to replacing the limit in settings
with SystemLimit, I removed the feature flag for an increased limit,
since SystemLimit solves that, and moved the setting that enables the
device rate limiter to a feature flag for ease of use. I also updated
metrics to differentiate between an actual rate-limited attempt and a
would-be rate-limited one.
@gherceg gherceg marked this pull request as ready for review February 17, 2025 15:20
@gherceg gherceg requested a review from esoergel as a code owner February 17, 2025 15:20
Contributor Author

gherceg commented Feb 17, 2025

@coderabbitai review


coderabbitai bot commented Feb 17, 2025

Walkthrough

The changes modify how device rate limiting is implemented. In the user application, the DeviceRateLimiter class now retrieves the per-user device limit by calling SystemLimit.for_key with the domain provided, rather than relying on a feature flag to increase limits. The rate_limit_device method’s control flow now checks if the device rate limiter toggle is enabled for the given domain. If enabled, it logs a rate-limited metric and returns True; otherwise, it logs an alternative metric and returns False. Additionally, the Redis key generation has been updated to include the domain, ensuring uniqueness per domain and user pair. In the tests, the class now uses Django’s TestCase (with database access) instead of SimpleTestCase and includes additional scenarios for domain-specific limits and overlapping domains. Also, a new method for_key has been added to the SystemLimit model with tests validating its behavior. Finally, the old toggle and related settings have been removed in favor of the new restrictive approach using the DEVICE_RATE_LIMITER toggle.

Sequence Diagram(s)

sequenceDiagram
    participant U as User/Application
    participant DRL as DeviceRateLimiter
    participant Toggle as Toggle Module
    participant Redis as Redis Store
    participant Metric as Metrics Logger

    U->>DRL: Call rate_limit_device(domain, user_id, device_id)
    DRL->>Toggle: Check if DEVICE_RATE_LIMITER toggle enabled for domain
    alt Toggle Enabled
        DRL->>Metric: Log rate-limited event
        DRL->>DRL: Generate Redis key using (domain, user_id)
        DRL->>Redis: Retrieve/Update usage data with unique key
        DRL->>U: Return True
    else Toggle Disabled
        DRL->>Metric: Log non rate-limited event
        DRL->>U: Return False
    end
sequenceDiagram
    participant Caller as Caller
    participant SL as SystemLimit
    participant DB as Database

    Caller->>SL: Call for_key(key, domain)
    SL->>DB: Query SystemLimit objects filtering by key and domain (if provided)
    DB->>SL: Return sorted matching records (descending by domain)
    SL->>Caller: Return first matching limit or None
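The for_key lookup described in the second diagram can be sketched in plain Python. This models only the precedence semantics — an in-memory list of records stands in for the real Django ORM, and the record shape here is an assumption, not the code from this PR:

```python
# Sketch of the precedence semantics of SystemLimit.for_key, using an
# in-memory list of records instead of the real Django ORM (assumption).
# A record with domain == "" is a global limit; a record with a non-empty
# domain overrides the global one for that domain only.

def for_key(records, key, domain=""):
    # Collect limits matching the key, in either the global scope or the
    # requested domain.
    matches = [r for r in records if r["key"] == key and r["domain"] in ("", domain)]
    if not matches:
        return None
    # Sorting by domain descending puts the domain-specific record (non-empty
    # string) ahead of the global one (empty string), mirroring the
    # order_by("-domain") described in the walkthrough.
    matches.sort(key=lambda r: r["domain"], reverse=True)
    return matches[0]["limit"]

records = [
    {"key": "foo", "domain": "", "limit": 1},     # global limit
    {"key": "foo", "domain": "bar", "limit": 2},  # domain-specific override
]

print(for_key(records, "foo"))                # global: 1
print(for_key(records, "foo", domain="bar"))  # override: 2
print(for_key(records, "foo", domain="baz"))  # falls back to global: 1
print(for_key(records, "missing"))            # no match: None
```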

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
corehq/apps/users/device_rate_limiter.py (1)

30-30: Consider caching domain limit lookups.
Repeatedly calling SystemLimit.for_key on every request might be costly under heavy load. For high-volume domains, you could cache the result of SystemLimit.for_key for a short duration (e.g., using Django’s caching frameworks) to improve performance.

 def device_limit_per_user(self, domain):
-    return SystemLimit.for_key(DEVICE_LIMIT_PER_USER_KEY, domain=domain) or DEVICE_LIMIT_PER_USER_DEFAULT
+    limit = SystemLimit.for_key(DEVICE_LIMIT_PER_USER_KEY, domain=domain)
+    if limit is None:
+        limit = DEVICE_LIMIT_PER_USER_DEFAULT
+    return limit
corehq/project_limits/models.py (1)

101-111: LGTM! The implementation effectively handles domain-specific limits.

The for_key method is well-implemented with:

  • Clear prioritization of domain-specific limits over global limits
  • Efficient query construction using Q objects
  • Proper handling of the optional domain parameter

Consider adding docstring examples to illustrate the prioritization behavior:

     """
     Return the value associated with the given key, prioritizing specificity
+
+    Examples:
+        >>> SystemLimit.objects.create(key='foo', limit=1)  # global limit
+        >>> SystemLimit.objects.create(key='foo', domain='bar', limit=2)  # domain-specific
+        >>> SystemLimit.for_key('foo')  # returns 1 (global)
+        >>> SystemLimit.for_key('foo', domain='bar')  # returns 2 (domain-specific)
+        >>> SystemLimit.for_key('foo', domain='baz')  # returns 1 (falls back to global)
     """
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e96195 and d7b0205.

📒 Files selected for processing (6)
  • corehq/apps/users/device_rate_limiter.py (5 hunks)
  • corehq/apps/users/tests/test_device_rate_limiter.py (1 hunks)
  • corehq/project_limits/models.py (1 hunks)
  • corehq/project_limits/tests/test_system_limit.py (1 hunks)
  • corehq/toggles/__init__.py (1 hunks)
  • settings.py (0 hunks)
💤 Files with no reviewable changes (1)
  • settings.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py: Review the code following best practices and standards

  • corehq/project_limits/models.py
  • corehq/apps/users/tests/test_device_rate_limiter.py
  • corehq/project_limits/tests/test_system_limit.py
  • corehq/toggles/__init__.py
  • corehq/apps/users/device_rate_limiter.py
🔇 Additional comments (9)
corehq/apps/users/device_rate_limiter.py (3)

17-18: Consistent naming for keys and fallback default.
Defining separate constants for the device limit key and its default value is good practice and keeps the logic clear.


67-74: Toggle-based rate limiting logic is well-structured.
Using the DEVICE_RATE_LIMITER toggle here is clear. The contrasting metric counters (rate_limited vs. rate_limited.test) help differentiate environments or usage scenarios. Ensure that the “test” metric is being tracked intentionally.


76-84: Domain-scoped Redis key fixes domain collision issues.
Including the domain in the Redis key creation is a solid fix to scope usage data more accurately. This prevents conflicts across domains.
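As a rough illustration of that fix, the domain-scoped key could look like the following. The prefix and separator here are assumptions for illustration, not the actual key format from this PR:

```python
# Hypothetical sketch of a Redis key scoped to both domain and user, so two
# users with the same user_id on different domains no longer share a usage
# bucket. The "device-rate-limiter" prefix and ":" separator are assumptions.

def device_usage_key(domain, user_id):
    return f"device-rate-limiter:{domain}:{user_id}"

# Keys for the same user_id now differ across domains, which is the fix here:
print(device_usage_key("domain-a", "user-123"))
print(device_usage_key("domain-b", "user-123"))
```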

corehq/toggles/__init__.py (1)

2998-3005: New feature toggle aligns with the PR’s objective.
The DEVICE_RATE_LIMITER toggle is declared clearly and includes an owner. Consider adding a help_link if there’s existing documentation for domain admins.

corehq/project_limits/tests/test_system_limit.py (1)

1-19: Comprehensive test coverage for system limits.
These tests thoroughly confirm the fallback to None, the general limit, and the domain-specific limit. The approach is straightforward and covers core scenarios.

corehq/apps/users/tests/test_device_rate_limiter.py (4)

11-12: LGTM! Good transition to TestCase for database access.

The change from SimpleTestCase to TestCase is appropriate given the need to interact with the database for SystemLimit objects.


65-74: LGTM! Well-structured test for domain-specific higher limits.

The test effectively verifies that:

  1. Domain-specific limits take precedence
  2. Different domains can have different limits
  3. Higher domain-specific limits allow more devices

75-85: LGTM! Comprehensive test for domain-specific lower limits.

The test properly verifies that:

  1. Global limits can be overridden by domain-specific limits
  2. Lower domain-specific limits are enforced correctly
  3. Other domains still use the global limit

97-101: LGTM! Good test for domain isolation.

The test ensures that:

  1. Device limits are properly scoped to domains
  2. Activity in one domain doesn't affect limits in another
  3. Same device ID can be rate-limited differently across domains

def test_allowed_if_no_devices_have_been_used_yet(self):
    self.assertFalse(device_rate_limiter.rate_limit_device(self.domain, 'user-id', 'new-device-id'))

@override_settings(DEVICE_LIMIT_PER_USER=2)
@flag_enabled("DEVICE_RATE_LIMITER")
Contributor


Can you do this on the class itself?

Contributor Author


It works at the class level, but then @flag_disabled didn't work for the one test where I wanted to use it, since the patch had already been applied by the @flag_enabled decorator. I briefly looked to see if it would be an easy fix but came up empty-handed, so I decided to just do this instead.

Contributor


that's annoying

self.addCleanup(clear_redis)

@flag_enabled("DEVICE_RATE_LIMITER")
Contributor


Can this decorator not be applied to the class? If not, wonder how hard it would be to change it to work at the class level?

Contributor Author


It works at the class level, but then @flag_disabled didn't work for the one test where I wanted to use it, since the patch had already been applied by the @flag_enabled decorator. I briefly looked to see if it would be an easy fix but came up empty-handed, so I decided to just do this instead.

Contributor


🙁

Contributor


I added these decorators originally and I vaguely recall also running into that shortcoming and not seeing an easy fix. Would love to see that addressed if anyone can figure it out.

Contributor Author


Added it as a hackathon idea for tomorrow 🤷🏻

- add more assertions in SystemLimit.for_key tests
- handle case where limit is 0
- use tag to determine if rate limited metric was enforced or not
Contributor Author

gherceg commented Feb 17, 2025

Expecting tests to fail due to not clearing redis. Offlining with @millerdev about what the proper solution is.

Contributor Author

gherceg commented Feb 18, 2025

Expecting tests to fail due to not clearing redis.

Just kidding this was because I wasn't clearing the classmethod cache correctly! Fixed in dbdd553

- For testing, skip caching for_key as it creates too many problems
- Make sure to clear the cache when a SystemLimit is deleted too
Clearing all of redis masked issues with the caching logic in
SystemLimit.for_key, which isn't good. The right approach is to skip
caching when running in a test.
Contributor Author

gherceg commented Feb 19, 2025

Alright no more commits! Sorry for the churn.

This doesn't actually matter much since the data being saved in redis is
very small, but this change caches more wisely. Previously each key+domain
combo would have a cached key, even though in most cases a domain will
not have a value distinct from the global one. This change only stores a
key+domain value in cache when necessary.
Contributor Author

gherceg commented Feb 19, 2025

I lied. Made a slight change to the caching strategy for SystemLimit that came up during the dev block. No more changes starting now.

Comment on lines 111 to 112
self._get_global_limit.clear(self.__class__, self.key)
self._get_domain_specific_limit.clear(self.__class__, self.key, self.domain)
Contributor Author


I could be smarter about only clearing the cache for the specific methods when relevant (if self.domain is a blank string vs something specific)

Contributor Author


...hmm, but what if someone modifies an existing limit to be scoped to a domain (or vice versa)? That would not be ideal usage, but it would result in us not clearing the correct cache with this updated logic...

We know which cache to clear depending on the domain field of the object
being saved/deleted.
domain_filter |= models.Q(domain=domain)
filters = models.Q(key=key) & domain_filter
return SystemLimit.objects.filter(filters).order_by("-domain").values_list("limit", flat=True).first()
domain_limit = cls._get_domain_specific_limit(key, domain) if domain else None
Contributor


This change only stores a key+domain value in cache when necessary.

Not quite clear what was meant by "when necessary", but this will cache a value (possibly None) for key+domain any time bool(domain) is True, right?

Could the second DB query below be avoided in that case (and a useful limit cached for each key+domain) if _get_domain_specific_limit used the original query?

SystemLimit.objects.filter(filters).order_by("-domain").values_list("limit", flat=True).first()

Contributor


That might complicate cache clearing though since the global limit would be cached per domain, and those would not be cleared when the global limit is changed.

I'm curious how much more performant this is with caching applied? The SQL query should be very fast, possibly not much slower than hitting redis. Would be interesting to measure.

This reverts commit ce7b010.

It is possible to modify an existing limit and add a domain, in which
case we would be clearing the wrong cache, so we need to revert.
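The approach the code reverted to — clearing both the global and the domain-specific entries unconditionally on save/delete — can be sketched with a plain dict standing in for the real cache backend. The key tuples and helper name here are hypothetical:

```python
# Sketch: clear both cache entries unconditionally on save/delete, rather
# than branching on the object's current domain field. A plain dict stands
# in for the real cache backend, and the key format is a hypothetical.

cache = {}

def clear_limit_caches(key, domain):
    # Always drop the global entry AND the domain-specific entry; branching
    # on the current domain value could leave a stale entry behind when a
    # limit's domain field changed between saves.
    cache.pop(("global", key), None)
    cache.pop(("domain", key, domain), None)

cache[("global", "foo")] = 1
cache[("domain", "foo", "bar")] = 2
clear_limit_caches("foo", "bar")
print(cache)  # both entries are gone: {}
```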
Labels: None yet
Projects: None yet
4 participants