Add docs to achieve 200K scale. #7812

Merged · 3 commits · Mar 20, 2025
64 changes: 64 additions & 0 deletions source/scale/scale-to-200000-users.rst
@@ -0,0 +1,64 @@
Scale Mattermost up to 200000 users
====================================

.. include:: ../_static/badges/ent-selfhosted.rst
  :start-after: :nosearch:

This page describes the Mattermost reference architecture designed for loads of up to 200000 concurrent users. Unsure which reference architecture to use? See the :doc:`scaling for enterprise </scale/scaling-for-enterprise>` documentation for details.
Contributor:

Do we want to add here that 200k concurrent users translates to approximately 500k supported users, depending on expected usage?

Member Author:

IMO we should retire that heuristic. The ratio of concurrent:total varies highly from customer to customer. It is clearer to simply focus on concurrent users.


- **High Availability**: Required
- **Database Configuration**: one writer, multiple readers (a configuration sketch follows)
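
A minimal ``config.json`` sketch of that topology, assuming PostgreSQL; the hostnames and credentials are placeholders, while ``DriverName``, ``DataSource``, and ``DataSourceReplicas`` are the standard Mattermost ``SqlSettings`` keys for a writer with read replicas:

.. code-block:: json

  {
    "SqlSettings": {
      "DriverName": "postgres",
      "DataSource": "postgres://mmuser:<password>@writer.db.internal:5432/mattermost?sslmode=require",
      "DataSourceReplicas": [
        "postgres://mmuser:<password>@reader-1.db.internal:5432/mattermost?sslmode=require",
        "postgres://mmuser:<password>@reader-2.db.internal:5432/mattermost?sslmode=require"
      ]
    }
  }

Writes go to ``DataSource`` while most reads are spread across ``DataSourceReplicas``; the requirements table below calls for six readers at this scale, so the replica list would grow accordingly.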

.. note::
  - Usage of CPU, RAM, and storage space can vary significantly based on user behavior. These hardware recommendations are based on traditional deployments and may grow or shrink depending on how active your users are.
  - From Mattermost v10.4, Mattermost Enterprise customers can configure `Redis <https://redis.io/>`_ (Remote Dictionary Server) as an alternative cache backend. Using Redis can help ensure that Mattermost remains performant and efficient, even under heavy usage. See the :ref:`Redis cache backend <configure/environment-configuration-settings:redis cache backend>` configuration settings documentation for details.
Member:

Do we need the Remote Dictionary Server parenthesis?

Suggested change
- From Mattermost v10.4, Mattermost Enterprise customers can configure `Redis <https://redis.io/>`_ (Remote Dictionary Server) as an alternative cache backend. Using Redis can help ensure that Mattermost remains performant and efficient, even under heavy usage. See the :ref:`Redis cache backend <configure/environment-configuration-settings:redis cache backend>` configuration settings documentation for details.
- From Mattermost v10.4, Mattermost Enterprise customers can configure `Redis <https://redis.io/>`_ as an alternative cache backend. Using Redis can help ensure that Mattermost remains performant and efficient, even under heavy usage. See the :ref:`Redis cache backend <configure/environment-configuration-settings:redis cache backend>` configuration settings documentation for details.

Member Author:

I don't know, this was already there. Will defer to Carrie.

  - While the following Elasticsearch specifications may be more than sufficient for some use cases, we have not extensively tested configurations with lower resource allocations at this user scale. If cost optimization is a priority, admins may choose to experiment with smaller configurations, but we recommend starting with the tested specifications to ensure system stability and performance. Keep in mind that under-provisioning can lead to degraded user experience and additional troubleshooting effort. Configuration sketches for Redis and Elasticsearch follow this note.
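
The note above mentions the Redis cache backend; the following is a minimal sketch of what enabling it might look like in ``config.json``. The address and password are placeholders, and the key names should be verified against the linked Redis cache backend settings documentation:

.. code-block:: json

  {
    "CacheSettings": {
      "CacheType": "redis",
      "RedisAddress": "redis.internal:6379",
      "RedisPassword": "<password>",
      "RedisDB": 0
    }
  }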

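For the Elasticsearch cluster in the requirements below, a corresponding ``ElasticsearchSettings`` sketch; the connection URL and credentials are placeholders for your own cluster endpoint:

.. code-block:: json

  {
    "ElasticsearchSettings": {
      "ConnectionURL": "https://elasticsearch.internal:9200",
      "Username": "<username>",
      "Password": "<password>",
      "EnableIndexing": true,
      "EnableSearching": true,
      "EnableAutocomplete": true
    }
  }
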
Requirements
------------

+------------------------+-----------+----------------+-----------------------+
| **Resource Type** | **Nodes** | **vCPU/ | **AWS Instance** |
| | | Memory (GiB)** | |
+========================+===========+================+=======================+
| Mattermost Application | 14 | 16/32 | c7i.4xlarge |
+------------------------+-----------+----------------+-----------------------+
| RDS Writer | 1 | 16/128 | db.r7g.4xlarge |
+------------------------+-----------+----------------+-----------------------+
| RDS Reader | 6 | 16/128 | db.r7g.4xlarge |
+------------------------+-----------+----------------+-----------------------+
| Elasticsearch cluster | 4 | 8/64 | r6g.2xlarge.search |
+------------------------+-----------+----------------+-----------------------+
| Proxy | 4 | 32/128 | m6in.8xlarge |
+------------------------+-----------+----------------+-----------------------+
| Redis | 1 | 8/32 | cache.m7g.2xlarge |
+------------------------+-----------+----------------+-----------------------+

Lifetime storage
----------------

.. include:: ../scale/lifetime-storage.rst
  :start-after: :nosearch:

Estimated storage per user, per month
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. include:: ../scale/estimated-storage-per-user-per-month.rst
  :start-after: :nosearch:

Example
~~~~~~~

A 200000-person team with medium usage (with a safety factor of 2x) would require between 21.12TB :sup:`1` and 105.6TB :sup:`2` of free space per annum.

:sup:`1` 200000 users * 4.4 MB * 12 months * 2x safety factor

:sup:`2` 200000 users * 22 MB * 12 months * 2x safety factor

We strongly recommend that you review storage utilization at least quarterly to ensure adequate free space is available.

Additional considerations
-------------------------

.. include:: ../scale/additional-ha-considerations.rst
  :start-after: :nosearch:
7 changes: 5 additions & 2 deletions source/scale/scaling-for-enterprise.rst
@@ -22,6 +22,7 @@ The following reference architectures are available as recommended starting points
* :doc:`Scale up to 80000 users </scale/scale-to-80000-users>` - Learn how to scale Mattermost to up to 80000 users.
* :doc:`Scale up to 90000 users </scale/scale-to-90000-users>` - Learn how to scale Mattermost to up to 90000 users.
* :doc:`Scale up to 100000 users </scale/scale-to-100000-users>` - Learn how to scale Mattermost to up to 100000 users.
* :doc:`Scale up to 200000 users </scale/scale-to-200000-users>` - Learn how to scale Mattermost to up to 200000 users.

.. important::

@@ -32,10 +33,12 @@ Testing methodology and updates

All tests were executed with the custom load test tool built by the Mattermost development teams to determine the number of supported users for each deployment size. Over time, this guide will be updated with new deployment sizes and deployment architectures, and newer versions of the Mattermost Server will be tested using an ESR.

At a high level, each deployment size was fixed (Mattermost server node count/sizing, database reader/writer count/sizing), and unbounded tests were used to report the maximum number of concurrent users the deployment can support. Each test included populated PostgreSQL v14 databases and a post table history of 100 million posts, ~3000 users, 20 teams, and ~720000 channels to provide a test simulation of a production Mattermost deployment.
At a high level, each deployment size was fixed (Mattermost server node count/sizing, database reader/writer count/sizing), and unbounded tests were used to report the maximum number of concurrent users the deployment can support. Each test included populated PostgreSQL v14 databases and a post table history of 100 million posts, ~200000 users, 20 teams, and ~720000 channels to provide a test simulation of a production Mattermost deployment.

Tests were defined by configuration of the actions executed by each simulated user (and the frequency of these actions), where the coordinator metrics define a healthy system under load. Tests were performed using the Mattermost v9.5 Extended Support Release (ESR). Job servers weren't used. All tests with more than a single app node had an NGINX proxy running in front of them.

For the final test at 200K users, further infrastructure changes were made: Elasticsearch nodes were added, a Redis instance was introduced, and multiple NGINX proxies were used to distribute traffic evenly across all nodes in the cluster, as sketched below. More details can be found in the :doc:`scale up to 200000 users </scale/scale-to-200000-users>` documentation.
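
As an illustration only (not drawn from the test configuration itself), a single proxy in that tier could spread load across the application nodes with a standard NGINX ``upstream`` block; hostnames and ports here are placeholders:

.. code-block:: nginx

  upstream mattermost_app {
      least_conn;                  # prefer the node with the fewest active connections
      server app-1.internal:8065;
      server app-2.internal:8065;
      # ...one entry per Mattermost application node
  }

  server {
      listen 80;                   # TLS termination omitted for brevity
      location / {
          proxy_pass http://mattermost_app;
          proxy_http_version 1.1;
          # WebSocket upgrade headers are required for Mattermost real-time traffic
          proxy_set_header Upgrade $http_upgrade;
          proxy_set_header Connection "upgrade";
      }
  }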

Full testing methodology, configuration, and setup are available, including a `fixed database dump with 100 million posts <https://us-east-1.console.aws.amazon.com/backup/home?region=us-east-1#/resources/arn%3Aaws%3Ards%3Aus-east-1%3A729462591288%3Acluster%3Adb-pg-100m-posts-v9-5-5>`_. Visit the `Mattermost Community <https://community.mattermost.com/>`_ and join the `Developers: Performance channel <https://community.mattermost.com/core/channels/developers-performance>`_ for details.

Mattermost load testing tools
@@ -49,4 +52,4 @@ Visit the `Mattermost Load Test Tool <https://github.com/mattermost/mattermost-l

- The Mattermost Load Test Tool was designed by and is used by our performance engineers to compare and benchmark the performance of the service from month to month to prepare for new releases. It's also used extensively in developing our recommended hardware sizing.
- We recommend deploying :doc:`Prometheus and Grafana </scale/deploy-prometheus-grafana-for-performance-monitoring>` with our :ref:`dashboards <scale/deploy-prometheus-grafana-for-performance-monitoring:getting started>` for ongoing monitoring and scale guidance.
- If you encounter performance concerns, we recommend :doc:`collecting performance metrics </scale/collect-performance-metrics>` and sharing them with us as a first troubleshooting step.