Add docs to achieve 200K scale. (#7812)

agnivade · mattermost-build · web-flow · commit f65a5f6594f3 · 2025-03-20T11:05:46.000+05:30
* Add docs to achieve 200K scale.

* Addressed review comments

---------

Co-authored-by: Mattermost Build <build@mattermost.com>
diff --git a/source/scale/scale-to-200000-users.rst b/source/scale/scale-to-200000-users.rst
@@ -0,0 +1,64 @@
+Scale Mattermost up to 200000 users
+====================================
+
+.. include:: ../_static/badges/ent-selfhosted.rst
+  :start-after: :nosearch:
+
+This page describes the Mattermost reference architecture designed for a load of up to 200000 concurrent users. Unsure which reference architecture to use? See the :doc:`scaling for enterprise </scale/scaling-for-enterprise>` documentation for details.
+
+- **High Availability**: Required
+- **Database Configuration**: writer, multiple readers
+
+.. note::
+  - Usage of CPU, RAM, and storage space can vary significantly based on user behavior. These hardware recommendations are based on traditional deployments and may grow or shrink depending on how active your users are.
+  - From Mattermost v10.4, Mattermost Enterprise customers can configure `Redis <https://redis.io/>`_ (Remote Dictionary Server) as an alternative cache backend. Using Redis can help ensure that Mattermost remains performant and efficient, even under heavy usage. See the :ref:`Redis cache backend <configure/environment-configuration-settings:redis cache backend>` configuration settings documentation for details.
+  - While the following Elasticsearch specifications may be more than sufficient for some use cases, we have not extensively tested configurations with lower resource allocations for this user scale. If cost optimization is a priority, admins may choose to experiment with smaller configurations, but we recommend starting with the tested specifications to ensure system stability and performance. Keep in mind that under-provisioning can lead to degraded user experience and additional troubleshooting effort.
+
+Requirements
+------------
+
++------------------------+-----------+----------------+-----------------------+
+| **Resource Type**      | **Nodes** | **vCPU/        | **AWS Instance**      |
+|                        |           | Memory (GiB)** |                       |
++========================+===========+================+=======================+
+| Mattermost Application | 14        | 16/32          | c7i.4xlarge           |
++------------------------+-----------+----------------+-----------------------+
+| RDS Writer             | 1         | 16/128         | db.r7g.4xlarge        |
++------------------------+-----------+----------------+-----------------------+
+| RDS Reader             | 6         | 16/128         | db.r7g.4xlarge        |
++------------------------+-----------+----------------+-----------------------+
+| Elasticsearch cluster  | 4         | 8/64           | r6g.2xlarge.search    |
++------------------------+-----------+----------------+-----------------------+
+| Proxy                  | 4         | 32/128         | m6in.8xlarge          |
++------------------------+-----------+----------------+-----------------------+
+| Redis                  | 1         | 8/32           | cache.m7g.2xlarge     |
++------------------------+-----------+----------------+-----------------------+
+
+Lifetime storage
+----------------
+
+.. include:: ../scale/lifetime-storage.rst
+  :start-after: :nosearch:
+
+Estimated storage per user, per month
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. include:: ../scale/estimated-storage-per-user-per-month.rst
+  :start-after: :nosearch:
+
+Example
+~~~~~~~
+
+A 200000-person team with medium usage (with a safety factor of 2x) would require between 24TB :sup:`1` and 120TB :sup:`2` of free space per annum.
+
+:sup:`1` 200000 users * 5 MB * 12 months * 2x safety factor
+
+:sup:`2` 200000 users * 25 MB * 12 months * 2x safety factor
+
+We strongly recommend that you review storage utilization at least quarterly to ensure adequate free space is available.
+
+Additional considerations
+-------------------------
+
+.. include:: ../scale/additional-ha-considerations.rst
+  :start-after: :nosearch:
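The storage example in the new page can be checked with a short script. This is a minimal sketch and not part of the commit: it evaluates the two footnote formulas exactly as written (users * MB per user per month * 12 months * 2x safety factor) and assumes decimal units (1 TB = 1,000,000 MB). The function name ``annual_storage_tb`` and its parameters are hypothetical, chosen only for illustration.

```python
def annual_storage_tb(users, mb_per_user_per_month, months=12, safety_factor=2):
    """Estimate yearly file storage in decimal terabytes.

    Assumption: 1 TB = 1,000,000 MB (decimal units).
    """
    total_mb = users * mb_per_user_per_month * months * safety_factor
    return total_mb / 1_000_000


# Medium usage bounds taken from the footnotes: 5 MB and 25 MB per user per month.
low = annual_storage_tb(200_000, 5)
high = annual_storage_tb(200_000, 25)
print(f"{low:.2f} TB to {high:.2f} TB per year")
```

Changing the ``users`` argument gives the same estimate for the other reference architectures, as long as the per-user figures from the included storage table still apply.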
diff --git a/source/scale/scaling-for-enterprise.rst b/source/scale/scaling-for-enterprise.rst
@@ -22,6 +22,7 @@ The following reference architectures are available as recommended starting poin
 * :doc:`Scale up to 80000 users </scale/scale-to-80000-users>` - Learn how to scale Mattermost to up to 80000 users.
 * :doc:`Scale up to 90000 users </scale/scale-to-90000-users>` - Learn how to scale Mattermost to up to 90000 users.
 * :doc:`Scale up to 100000 users </scale/scale-to-100000-users>` - Learn how to scale Mattermost to up to 100000 users.
+* :doc:`Scale up to 200000 users </scale/scale-to-200000-users>` - Learn how to scale Mattermost to up to 200000 users.
 
 .. important::
 
@@ -32,10 +33,12 @@ Testing methodology and updates
 
 All tests were executed with the custom load test tool built by the Mattermost development teams to determine supported users for each deployment size. Over time, this guide will be updated with new deployment sizes and deployment architectures, and newer versions of the Mattermost Server will be tested using an ESR.
 
-At a high level, each deployment size was fixed (Mattermost server node count/sizing, database reader/writer count/sizing), and unbounded tests were used to report the maximum numbers of concurrent users the deployment can support. Each test included populated PostgreSQL v14 databases and a post table history of 100 million posts, ~3000 users, 20 teams, and ~720000 channels to provide a test simulation of a production Mattermost deployment. 
+At a high level, each deployment size was fixed (Mattermost server node count/sizing, database reader/writer count/sizing), and unbounded tests were used to report the maximum number of concurrent users the deployment can support. Each test included populated PostgreSQL v14 databases and a post table history of 100 million posts, ~200000 users, 20 teams, and ~720000 channels to provide a test simulation of a production Mattermost deployment.
 
 Tests were defined by configuration of the actions executed by each simulated user (and the frequency of these actions) where the coordinator metrics define a healthy system under load. Tests were performed using the Mattermost v9.5 Extended Support Release (ESR). Job servers weren't used. All tests with more than a single app node had an NGINX proxy running in front of them.
 
+For the 200000-user test, further infrastructure changes were made: Elasticsearch nodes and a Redis instance were added, and multiple NGINX proxies were used to distribute traffic evenly across all nodes in the cluster. See the :doc:`Scale up to 200000 users </scale/scale-to-200000-users>` documentation for details.
+
 Full testing methodology, configuration, and setup are available, including a `fixed database dump with 100 million posts <https://us-east-1.console.aws.amazon.com/backup/home?region=us-east-1#/resources/arn%3Aaws%3Ards%3Aus-east-1%3A729462591288%3Acluster%3Adb-pg-100m-posts-v9-5-5>`_. Visit the `Mattermost Community <https://community.mattermost.com/>`_ and join the `Developers: Performance channel <https://community.mattermost.com/core/channels/developers-performance>`_ for details.
 
 Mattermost load testing tools
@@ -49,4 +52,4 @@ Visit the `Mattermost Load Test Tool <https://github.com/mattermost/mattermost-l
 
     - The Mattermost Load Test Tool was designed by and is used by our performance engineers to compare and benchmark the performance of the service from month to month to prepare for new releases. It's also used extensively in developing our recommended hardware sizing. 
     - We recommend deploying :doc:`Prometheus and Grafana </scale/deploy-prometheus-grafana-for-performance-monitoring>` with our :ref:`dashboards <scale/deploy-prometheus-grafana-for-performance-monitoring:getting started>` for ongoing monitoring and scale guidance.
-    - If you encounter performance concerns, we recommend :doc:`collecting performance metrics </scale/collect-performance-metrics>` and sharing them with us as a first troubleshooting step.
+    - If you encounter performance concerns, we recommend :doc:`collecting performance metrics </scale/collect-performance-metrics>` and sharing them with us as a first troubleshooting step.
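For a rough sense of the overall footprint of the 200K reference architecture, the Requirements table in ``source/scale/scale-to-200000-users.rst`` can be totalled. The sketch below is illustrative only: the rows are copied from that table, while the names ``REQUIREMENTS`` and ``fleet_totals`` are hypothetical and not part of any Mattermost tooling.

```python
# (resource, nodes, vCPU per node, memory in GiB per node), copied from the Requirements table.
REQUIREMENTS = [
    ("Mattermost Application", 14, 16, 32),
    ("RDS Writer", 1, 16, 128),
    ("RDS Reader", 6, 16, 128),
    ("Elasticsearch cluster", 4, 8, 64),
    ("Proxy", 4, 32, 128),
    ("Redis", 1, 8, 32),
]


def fleet_totals(rows):
    """Return (total nodes, total vCPUs, total memory in GiB) across all rows."""
    nodes = sum(count for _, count, _, _ in rows)
    vcpus = sum(count * vcpu for _, count, vcpu, _ in rows)
    memory_gib = sum(count * mem for _, count, _, mem in rows)
    return nodes, vcpus, memory_gib


nodes, vcpus, memory_gib = fleet_totals(REQUIREMENTS)
print(f"{nodes} nodes, {vcpus} vCPUs, {memory_gib} GiB memory")
```

These totals only sum the table as published. As the note in the new page says, actual CPU, RAM, and storage usage can vary significantly with user behavior.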