Add saving of logs to disk & combining into CSV #13

Open: michaeltremeer wants to merge 8 commits into main

Conversation

michaeltremeer (Contributor)

The code adds the following:

  • Adds a json-save-dir arg to the load parser, so that JSON logs can be saved to disk

    • A FileHandler is added to the base logger, and the output folder & file are created automatically (see the sketch after this list).
    • Pipes the logs out, including most of the run args, so that each log file contains all the relevant configuration when runs are compared later.
  • Adds a combine_logs subcommand to the parser

    • Extracts run information and final metrics (captured prior to termination/max requests/duration) from the raw log files and combines them into a single output CSV file, ready for analysis.
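
For orientation, the logging piece boils down to roughly the following (a minimal sketch of the idea, not the PR's actual code; the helper name, file naming scheme and format string are illustrative assumptions):

import argparse
import json
import logging
import os
import time

logger = logging.getLogger()  # base (root) logger used by the benchmark

def attach_file_handler(log_save_dir: str, run_args: argparse.Namespace) -> None:
    # create the output folder and a timestamped log file, then attach a FileHandler
    os.makedirs(log_save_dir, exist_ok=True)
    log_path = os.path.join(log_save_dir, f"run_{int(time.time())}.log")
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # record the run configuration at the top of the file so each log is self-describing
    logger.info("run args: %s", json.dumps(vars(run_args), default=str))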

@technicianted (Contributor) left a comment:

Thanks for your contribution!

else:
print(f"{timestamp} rpm: {rpm:<5} requests: {self.requests_count:<5} failures: {self.failed_count:<4} throttled: {self.throttled_count:<4} tpm: {tokens_per_minute:<6} ttft_avg: {ttft_avg:<6} ttft_95th: {ttft_95th:<6} tbt_avg: {tbt_avg:<6} tbt_95th: {tbt_95th:<6} e2e_avg: {e2e_latency_avg:<6} e2e_95th: {e2e_latency_95th:<6} util_avg: {util_avg:<6} util_95th: {util_95th:<6}", flush=True)
logger.info(f"rpm: {rpm:<5} requests: {self.requests_count:<5} failures: {self.failed_count:<4} throttled: {self.throttled_count:<4} tpm: {tokens_per_minute:<6} ttft_avg: {ttft_avg:<6} ttft_95th: {ttft_95th:<6} tbt_avg: {tbt_avg:<6} tbt_95th: {tbt_95th:<6} e2e_avg: {e2e_latency_avg:<6} e2e_95th: {e2e_latency_95th:<6} util_avg: {util_avg:<6} util_95th: {util_95th:<6}")
@technicianted (Contributor):

The idea behind using print rather than logger is to separate stdout from stderr, so that you can use shell redirection to get only the stats or the jsonl. You will need to add conditionals here to make sure that this behavior doesn't break.
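
A minimal sketch of the kind of conditional being asked for here (the log_save_dir name mirrors the diff; everything else is an illustrative assumption, not the tool's actual code):

import logging
import sys

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))  # diagnostics stay on stderr

def report_stats(stats_line: str, log_save_dir=None) -> None:
    # stats always go to stdout, so `... > stats.log` still captures only the stats
    print(stats_line, flush=True)
    # duplicate the stats into the logger only when file logging was requested,
    # so users who rely on stream redirection see no change in behavior
    if log_save_dir is not None:
        logger.info(stats_line)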

@@ -0,0 +1,97 @@
import argparse
@technicianted (Contributor):

I think this feature is too specific for this tool. In addition, by outputting stats in jsonl, you can use simple command line tools such as jq to aggregate.

Perhaps you should take this feature out into a separate PR for better discussion.
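
As a rough illustration of that jq-based route (the file name and field names below are hypothetical, not the tool's actual jsonl schema), the final stats record of a run could be reduced to a few columns with a one-liner:

$ jq -s 'last | {rpm, tpm, e2e_avg}' my/output/dir/stats.jsonl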

args = parser.parse_args()

if args.func is load and args.log_save_dir is not None:
@technicianted (Contributor):

I'm wondering how this is different from using shell redirection of stderr? I could do something like this:

$ python -m benchmark.bench load ... 2> my/output/dir/output.log

Because this captures only stderr, I will get only the logger output in output.log.

@@ -33,7 +33,7 @@ $ docker run azure-openai-benchmarking load --help
Consider the following guidelines when creating your benchmark tests

1. **Ensure call characteristics match your production expectations**. The number of calls per minute and total tokens you are able to process varies depending on the prompt size, generation size and call rate.
1. **Run your test long enough to reach a stable state**. Throttling is based on the total compute you have deployed and are utilizing. The utilization includes active calls. As a result you will see a higher call rate when ramping up on an unloaded deployment because there are no existing active calls being processed. Once your deployment is fully loaded with a utilization near 100%, throttling will increase as calls can only be processed as earlier ones are completed. To ensure an accurate measure, set the duration long enough for the throughput to stabilize, especially when running at or close to 100% utilization.
1. **Run your test long enough to reach a stable state**. Throttling is based on the total compute you have deployed and are utilizing. The utilization includes active calls. As a result you will see a higher call rate when ramping up on an unloaded deployment because there are no existing active calls being processed. Once your deployment is fully loaded with a utilization near 100%, throttling will increase as calls can only be processed as earlier ones are completed. To ensure an accurate measure, set the duration long enough for the throughput to stabilize, especially when running at or close to 100% utilization. Also note that once the test ends (either by termination, or reaching the maximum duration or number of requests), any pending requests will continue to drain, which can result in lower throughput values as the load on the endpoint gradually decreases to 0.
@technicianted (Contributor):

Great clarification!

@michaeltremeer (Contributor, Author) commented Dec 21, 2023

Good points in the comments, and agreed that the combine_logs functionality is a stretch beyond the core purpose of the repo. The only note I would add (inspired by my own use): with shell redirection you do miss any warnings or errors that occurred during the run, as well as the run parameters that make it easier to tell what the config was for each run (useful when you're analysing a number of permutations of shape, model and concurrency). You also miss the time the run finished or was terminated, which means the post-test period where requests are still draining can impact the numbers.

If you think it's valid and worth a change, I can resubmit just the first set of changes for saving a single run's logs, otherwise I can just submit the updated README for clarity.

@technicianted (Contributor) commented Dec 21, 2023

> you do miss any warnings or errors that occurred during the run

You can redirect both stdout and stderr to the same file if needed:

$ python -m benchmark.bench load ... > my/output/dir/full_output.log 2>&1

or separate files:

$ python -m benchmark.bench load ... > my/output/dir/stats_output.log 2> my/output/dir/logs.log

Does that cover the use-case?

> as well as any run parameters which make it easier to tell what the config was for each run

Yes I agree this is super useful. Let's add that in a separate PR.

> I can resubmit just the first set of changes for saving a single run's logs, otherwise I can just submit the updated README for clarity.

Yes, let's break them down. Generally, many smaller PRs are always better :)

Thanks!
