[Benchmark] Prepare for execuTorch failure handling (#6391)

yangw-dev · web-flow · commit 136177d82dee · 2025-03-12T17:37:55.000-07:00
# Description

Issue: #6294

Prepare the mobile_job yml to generate a benchmark record when a job fails.

## Background

When a git benchmark job fails (or some of its mobile jobs fail), we need to generate a benchmark record to indicate that the model has failures.

For instance, take a benchmark job named `benchmark-on-device (ic3, coreml_fp16, apple_iphone_15, arn:aws:devicefarm:us-west-2:308535385114... / mobile-job (ios)`:

- when the whole job fails, we want to indicate that the model ic3 with backend coreml_fp16 on iOS has failed for all metrics
- when one of the devices in the job fails (iPhone 15 with OS 17.1), we want to indicate that the model ic3 with backend coreml_fp16 has failed for iPhone 15 with OS 17.1, while the others succeeded

Key: always generate the artifact json with the git job name.

## Change Details

- [yaml] add logic to generate artifact.json if any previous step fails and the expected artifact.json does not exist; this makes sure we always have the artifact json with the git job name
- [script] add a flag `--new-json-output-format` to toggle the mobile job to generate artifact.json with the new format
  - see example of new json result ([s3 link](https://gha-artifacts.s3.us-east-1.amazonaws.com/device_farm/13821036006/1/artifacts/ios-artifacts-38666170088.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=ASIAUPVRELQNEU5O2WYP%2F20250312%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250312T212644Z&X-Amz-Expires=300&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEH4aCXVzLWVhc3QtMSJHMEUCIQC7%2BkVAOsGTimttLszL6u3N4HeFdSzwmPzlOYQBh%2BU%2BzwIgNjk%2FM73TZ9YfN6W92yjuRBUevYQ1BWWf0M7rmky4IT0q0AMIx%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwzMDg1MzUzODUxMTQiDCWs46GorlC4PkgCmCqkA7TQ41pTu7Pw2vUyPArSC95%2FUUHvRy5DCUEGOUwKmscwv%2B0D9jRdGfQ05E4dtVKliXhNnBRu2oH2u9WIPGKgR3fFjrVRvy2bzQhMYVjAqfUnG%2BhVO2hOKC6U33bMMNJ4SziagDSsAwHBRXl2YLsd9x4ToLubWcHFd4RtE5ZTFQFBHoB05KmzRJ5O00P6m%2BmzBvNh0T%2F2nj2l5c66VmBOe5xeyqEEHXsw3jD98NGrff7nQrONMDpRLjS74Hz%2Fz%2BGJL9RNwNQ2yJYSUdmkrTk4wi7ToNGrzpJm4Lh7wOprHQVwqpVnYaZjw7bJrTk4of4%2FE0%2FBsI1L3GqCxCt6kig02JKYBOy2nFNeRMR09xCSVQCvZE39zKZxrbilH%2FwBzHCS8KvqP14hhGbo%2F%2F08DWVBTZIgrQii0lNaPkB6c%2F0%2BCghTCQv1hUqhIY3avR3TquZzdZNeavNVU6is%2ByJtFpVZzCCH1AzeCRMcnJAlHdGyv9guD5q5wMpRICAihdmFnFy1LQZNAjSisMr0Z4zFfRKJzGdKSpdyL9D5O063WU0VVtmfI0U4fzCz38e%2BBjqEApAZr2cVZ87wIvVZOhcPBDmz%2F9mBgH5LSIK0bfkuZz6vhkUpJbmHbID6YjraMitF1ht1%2FgQtCQkHaejdA9y99K0KEwcT5JVEFaiJNhm5o7KvZJ1jlDqNAklD8brH63PQ705eszJeILnBAmKdOxTrqb83EEmg5Z2eSIjf7Cl04Si21S%2FZomsjHG1zlcHT4jZ9%2FzXPHNHFVmuMwqOVSTzMXx2BKHrOrtwW%2BbpQ8x8rOC5E9P85c86MSDefTk%2BC9Hoee16B45ywR%2BbH7I9fK%2FZ27v%2BCE0gHQglXCHTFVSp7mk18KQw67BJqq5nJDAQ%2BtEdezGj2O5iiG2Amto3XgUbeSRvTi7iF&X-Amz-Signature=49b1065e9246c807c434b8fd2dc510c014fb12a3ceb2605034da70ee2a64ca68&X-Amz-SignedHeaders=host&response-content-disposition=inline))
- [script] add git_job_name, run_report and job_reports to artifacts.json
  - git_job_name: used to build the benchmark record if a git job fails (a tricky way to grab model info)
  - job_reports & run_report: we currently don't have extra info about mobile job conclusions; these can be uploaded to time_series or a notification system for failure details

## PRs that simulate failure cases for the generation logic

- Mimic a step failing before the benchmark test (no json generated): #6397
- Mimic the benchmark test step failing but with an artifact: #6398
- ExecuTorch Sync Test: pytorch/executorch#9204

## Details

When the flag is on, artifact.json is converted from

```
[ .... ]
```

to

```
{
  "git_job_name": str,
  "artifacts": [ ],
  "run_report": {},
  "job_reports": [....]
}
```

This flag is temporary, to keep the logic in sync between repos.
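As a rough illustration of how a downstream consumer might cope with the shapes described above (the legacy list, the new dict, and the minimal dict that only carries `git_job_name` after a failure), here is a minimal sketch. The function names `normalize_artifact_json` and `parse_git_job_name` are hypothetical and not part of this PR, and the assumption that the job name's first parenthesized group starts with model, backend and device is inferred only from the example job name in the Background section.

```python
import json
from typing import Any, Dict, Optional


def normalize_artifact_json(text: str) -> Dict[str, Any]:
    """Map the legacy list and the new dict onto one dict shape."""
    data = json.loads(text)
    if isinstance(data, list):
        # Legacy format: a bare list of artifacts with no job metadata.
        data = {"artifacts": data}
    if not isinstance(data, dict):
        raise ValueError("unexpected artifact json shape")
    # Missing keys get empty defaults, which also covers the minimal
    # failure file that only contains "git_job_name".
    return {
        "git_job_name": data.get("git_job_name"),
        "artifacts": data.get("artifacts", []),
        "run_report": data.get("run_report", {}),
        "job_reports": data.get("job_reports", []),
    }


def parse_git_job_name(name: str) -> Optional[Dict[str, str]]:
    """Hypothetical helper: read model, backend and device from a job name.

    Assumes the text after the first "(" starts with
    "model, backend, device, ..." as in the example job name.
    """
    start = name.find("(")
    if start == -1:
        return None
    fields = [field.strip() for field in name[start + 1 :].split(",")]
    if len(fields) < 3:
        return None
    return {"model": fields[0], "backend": fields[1], "device": fields[2]}
```

With this shape, a record builder could treat an empty `artifacts` list together with a non-empty `git_job_name` as a signal that the whole job failed, and fall back to the parsed name for model information.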
diff --git a/.github/workflows/mobile_job.yml b/.github/workflows/mobile_job.yml
@@ -34,6 +34,11 @@ on:
         description: The device pool associated with the project
         default: 'arn:aws:devicefarm:us-west-2::devicepool:082d10e5-d7d7-48a5-ba5c-b33d66efa1f5'
         type: string
+      new-output-format-flag:
+        description: experimental flag to enable the new artifact json format
+        required: false
+        default: false
+        type: boolean
 
       # Pulling test-infra itself for device farm runner script
       test-infra-repository:
@@ -310,7 +315,9 @@ jobs:
           RUN_ID: ${{ github.run_id }}
           RUN_ATTEMPT: ${{ github.run_attempt }}
           JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
+          GIT_JOB_NAME:  ${{ steps.get-job-id.outputs.job-name }}
           WORKING_DIRECTORY: test-infra/tools/device-farm-runner
+          NEW_OUTPUT_FORMAT_FLAG: ${{ inputs.new-output-format-flag }}
         uses: nick-fields/retry@v3.0.0
         with:
           shell: bash
@@ -331,20 +338,11 @@ jobs:
               --name-prefix "${JOB_NAME}-${DEVICE_TYPE}" \
               --workflow-id "${RUN_ID}" \
               --workflow-attempt "${RUN_ATTEMPT}" \
-              --output "ios-artifacts-${JOB_ID}.json"
+              --output "ios-artifacts-${JOB_ID}.json" \
+              --git-job-name "${GIT_JOB_NAME}" \
+              --new-json-output-format "${NEW_OUTPUT_FORMAT_FLAG}"
             popd
 
-      - name: Upload iOS artifacts to S3
-        uses: seemethere/upload-artifact-s3@v5
-        if: always()
-        with:
-          retention-days: 14
-          s3-bucket: gha-artifacts
-          s3-prefix: |
-            device_farm/${{ github.run_id }}/${{ github.run_attempt }}/artifacts
-          path: |
-            test-infra/tools/device-farm-runner/ios-artifacts-${{ steps.get-job-id.outputs.job-id }}.json
-
       - name: Run Android tests on devices
         id: android-test
         if: ${{ inputs.device-type == 'android' }}
@@ -361,7 +359,9 @@ jobs:
           RUN_ID: ${{ github.run_id }}
           RUN_ATTEMPT: ${{ github.run_attempt }}
           JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
+          GIT_JOB_NAME:  ${{ steps.get-job-id.outputs.job-name }}
           WORKING_DIRECTORY: test-infra/tools/device-farm-runner
+          NEW_OUTPUT_FORMAT_FLAG: ${{ inputs.new-output-format-flag }}
         uses: nick-fields/retry@v3.0.0
         with:
           shell: bash
@@ -382,10 +382,26 @@ jobs:
               --name-prefix "${JOB_NAME}-${DEVICE_TYPE}" \
               --workflow-id "${RUN_ID}" \
               --workflow-attempt "${RUN_ATTEMPT}" \
-              --output "android-artifacts-${JOB_ID}.json"
+              --output "android-artifacts-${JOB_ID}.json" \
+              --git-job-name "${GIT_JOB_NAME}" \
+              --new-json-output-format "${NEW_OUTPUT_FORMAT_FLAG}"
             popd
 
-      - name: Upload Android artifacts to S3
+      - name: Check artifacts if any job fails
+        if: failure()
+        working-directory: test-infra/tools/device-farm-runner
+        shell: bash
+        env:
+          DEVICE_TYPE: ${{ inputs.device-type }}
+          BENCHMARK_OUTPUT: ${{ inputs.device-type }}-artifacts-${{ steps.get-job-id.outputs.job-id }}.json
+          GIT_JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
+        run: |
+          if [[ ! -f "$BENCHMARK_OUTPUT" ]]; then
+            echo "missing artifact json file for ${DEVICE_TYPE} with name ${BENCHMARK_OUTPUT}, generating ... "
+            echo "{\"git_job_name\": \"$GIT_JOB_NAME\"}" >> "$BENCHMARK_OUTPUT"
+          fi
+
+      - name: Upload artifacts to S3
         uses: seemethere/upload-artifact-s3@v5
         if: always()
         with:
@@ -394,4 +410,4 @@ jobs:
           s3-prefix: |
             device_farm/${{ github.run_id }}/${{ github.run_attempt }}/artifacts
           path: |
-            test-infra/tools/device-farm-runner/android-artifacts-${{ steps.get-job-id.outputs.job-id }}.json
+            test-infra/tools/device-farm-runner/${{ inputs.device-type }}-artifacts-${{ steps.get-job-id.outputs.job-id }}.json
diff --git a/.github/workflows/test_mobile_job.yml b/.github/workflows/test_mobile_job.yml
@@ -18,7 +18,7 @@ jobs:
       device-type: ios
       # For iOS testing, the runner just needs to call AWS Device Farm, so there is no need to run this on macOS
       runner: ubuntu-latest
-      # There values are prepared beforehand for the test
+      # These values are prepared beforehand for the test
       project-arn: arn:aws:devicefarm:us-west-2:308535385114:project:b531574a-fb82-40ae-b687-8f0b81341ae0
       device-pool-arn: arn:aws:devicefarm:us-west-2:308535385114:devicepool:b531574a-fb82-40ae-b687-8f0b81341ae0/da5d902d-45db-477b-ae0a-766e06ef3845
       ios-ipa-archive: https://ossci-assets.s3.amazonaws.com/DeviceFarm.ipa
@@ -34,10 +34,45 @@ jobs:
       device-type: android
       runner: ubuntu-latest
       timeout: 120
-      # There values are prepared beforehand for the test
+      # These values are prepared beforehand for the test
       project-arn: arn:aws:devicefarm:us-west-2:308535385114:project:b531574a-fb82-40ae-b687-8f0b81341ae0
       device-pool-arn: arn:aws:devicefarm:us-west-2:308535385114:devicepool:b531574a-fb82-40ae-b687-8f0b81341ae0/bd86eb80-74a6-4511-8183-09aa66e3ccc4
       android-app-archive: https://ossci-assets.s3.amazonaws.com/app-debug.apk
       android-test-archive: https://ossci-assets.s3.amazonaws.com/app-debug-androidTest.apk
       test-spec: https://ossci-assets.s3.amazonaws.com/android-llm-device-farm-test-spec.yml
       extra-data: https://ossci-assets.s3.amazonaws.com/executorch-android-llama2-7b-0717.zip
+
+  test-ios-job-with-new-output-flag:
+    permissions:
+      id-token: write
+      contents: read
+    uses: ./.github/workflows/mobile_job.yml
+    with:
+      device-type: ios
+      # For iOS testing, the runner just needs to call AWS Device Farm, so there is no need to run this on macOS
+      runner: ubuntu-latest
+      # These values are prepared beforehand for the test
+      project-arn: arn:aws:devicefarm:us-west-2:308535385114:project:b531574a-fb82-40ae-b687-8f0b81341ae0
+      device-pool-arn: arn:aws:devicefarm:us-west-2:308535385114:devicepool:b531574a-fb82-40ae-b687-8f0b81341ae0/da5d902d-45db-477b-ae0a-766e06ef3845
+      ios-ipa-archive: https://ossci-assets.s3.amazonaws.com/DeviceFarm.ipa
+      ios-xctestrun-zip: https://ossci-assets.s3.amazonaws.com/MobileNetClassifierTest_MobileNetClassifierTest_iphoneos17.4-arm64.xctestrun.zip
+      test-spec: https://ossci-assets.s3.amazonaws.com/default-ios-device-farm-appium-test-spec.yml
+      new-output-format-flag: true
+
+  test-android-llama2-job-with-new-output-flag:
+    permissions:
+      id-token: write
+      contents: read
+    uses: ./.github/workflows/mobile_job.yml
+    with:
+      device-type: android
+      runner: ubuntu-latest
+      timeout: 120
+      # These values are prepared beforehand for the test
+      project-arn: arn:aws:devicefarm:us-west-2:308535385114:project:b531574a-fb82-40ae-b687-8f0b81341ae0
+      device-pool-arn: arn:aws:devicefarm:us-west-2:308535385114:devicepool:b531574a-fb82-40ae-b687-8f0b81341ae0/bd86eb80-74a6-4511-8183-09aa66e3ccc4
+      android-app-archive: https://ossci-assets.s3.amazonaws.com/app-debug.apk
+      android-test-archive: https://ossci-assets.s3.amazonaws.com/app-debug-androidTest.apk
+      test-spec: https://ossci-assets.s3.amazonaws.com/android-llm-device-farm-test-spec.yml
+      extra-data: https://ossci-assets.s3.amazonaws.com/executorch-android-llama2-7b-0717.zip
+      new-output-format-flag: true
diff --git a/tools/device-farm-runner/run_on_aws_devicefarm.py b/tools/device-farm-runner/run_on_aws_devicefarm.py
@@ -195,6 +195,11 @@ def parse_args() -> Any:
         default=0,
         help="the workflow run attempt",
     )
+
+    parser.add_argument(
+        "--git-job-name", type=str, required=True, help="the name of the git job name."
+    )
+
     parser.add_argument(
         "--output",
         type=str,
@@ -208,12 +213,19 @@ def parse_args() -> Any:
     )
 
     parser.add_argument(
-        "--new-json-output",
-        action="store_true",
-        help="enable new json artifact output format with jobrun, and list of artifacts, this is temporary ",
+        "--new-json-output-format",
+        type=str,
+        choices=["true", "false"],
+        default="false",
+        required=False,
+        help="enable new json artifact output format with mobile job reports and list of artifacts",
     )
 
-    return parser.parse_args()
+    # Tolerate unknown flags so that mobile jobs do not fail on an unrecognized flag once this flag is removed.
+    args, unknown = parser.parse_known_args()
+    if len(unknown) > 0:
+        info(f"detected unknown flags: {unknown}")
+    return args
 
 
 def upload_file(
@@ -409,6 +421,7 @@ class DeviceFarmReport:
     status: str
     result: str
     counters: Dict[str, str]
+    app_type: str
     infos: Dict[str, str]
     parent_arn: str
 
@@ -545,6 +558,7 @@ def _to_job_report(
         return JobReport(
             arn=arn,
             name=name,
+            app_type=self.app_type,
             report_type=ReportType.JOB.value,
             status=status,
             result=result,
@@ -564,6 +578,7 @@ def _to_run_report(self, report: Dict[str, Any], infos: Dict[str, str] = dict())
         return DeviceFarmReport(
             name=name,
             arn=arn,
+            app_type=self.app_type,
             report_type=ReportType.RUN.value,
             status=status,
             result=result,
@@ -661,7 +676,8 @@ def get_run_report(self):
             return DeviceFarmReport(
                 name="",
                 arn="",
-                report_type="",
+                app_type=self.app_type,
+                report_type=ReportType.RUN.value,
                 status="",
                 result="",
                 counters={},
@@ -699,9 +715,30 @@ def _upload_file_to_s3(self, file_name: str, bucket: str, key: str) -> None:
         )
 
 
+def generate_artifacts_output(
+    artifacts: List[Dict[str, str]],
+    run_report: DeviceFarmReport,
+    job_reports: List[JobReport],
+    git_job_name: str,
+):
+    output = {
+        "artifacts": artifacts,
+        "run_report": asdict(run_report),
+        "job_reports": [asdict(job_report) for job_report in job_reports],
+        "git_job_name": git_job_name,
+    }
+    return output
+
+
 def main() -> None:
     args = parse_args()
 
+    # TODO: remove this once the flag is removed.
+    if args.new_json_output_format == "true":
+        info(f"use new json output format for {args.output}")
+    else:
+        info(f"use legacy json output format for {args.output}")
+
     project_arn = args.project_arn
     name_prefix = args.name_prefix
     workflow_id = args.workflow_id
@@ -788,6 +825,11 @@ def main() -> None:
             time.sleep(30)
     except Exception as error:
         warn(f"Failed to run {unique_prefix}: {error}")
+        # just use the new json output format
+        json_file = {
+            "git_job_name": args.git_job_name,
+        }
+        set_output(json.dumps(json_file), "artifacts", args.output)
         sys.exit(1)
     finally:
         info(f"Run {unique_prefix} finished with state {state} and result {result}")
@@ -797,10 +839,12 @@ def main() -> None:
         )
         artifacts = processor.start(r.get("run"))
 
-        if args.new_json_output:
-            info("Generating new json output")
+        if args.new_json_output_format == "true":
             output = generate_artifacts_output(
-                artifacts, processor.get_run_report(), processor.get_job_reports()
+                artifacts,
+                processor.get_run_report(),
+                processor.get_job_reports(),
+                git_job_name=args.git_job_name,
             )
             set_output(json.dumps(output), "artifacts", args.output)
         else:
@@ -811,18 +855,5 @@ def main() -> None:
         sys.exit(1)
 
 
-def generate_artifacts_output(
-    artifacts: List[Dict[str, str]],
-    run_report: DeviceFarmReport,
-    job_reports: List[JobReport],
-):
-    output = {
-        "artifacts": artifacts,
-        "run_report": asdict(run_report),
-        "job_reports": [asdict(job_report) for job_report in job_reports],
-    }
-    return output
-
-
 if __name__ == "__main__":
     main()
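
The flag handling in the script diff above can be sketched in isolation. This is a minimal, standalone illustration, not the script itself: `parse_flags` is a hypothetical name, and only the two flags shown in the diff are reproduced. It assumes the workflow always passes the flag with a text value of `true` or `false`, as the `--new-json-output-format "${NEW_OUTPUT_FORMAT_FLAG}"` line in the yml suggests.

```python
import argparse
from typing import List, Optional, Tuple


def parse_flags(
    argv: Optional[List[str]] = None,
) -> Tuple[argparse.Namespace, List[str]]:
    parser = argparse.ArgumentParser()
    parser.add_argument("--git-job-name", type=str, required=True)
    # The workflow passes the flag value as text, so the flag takes a
    # string restricted to "true"/"false" instead of action="store_true".
    parser.add_argument(
        "--new-json-output-format",
        type=str,
        choices=["true", "false"],
        default="false",
    )
    # parse_known_args returns unrecognized arguments as a list instead of
    # exiting, so callers that still pass a removed flag keep working.
    return parser.parse_known_args(argv)


if __name__ == "__main__":
    args, unknown = parse_flags(
        ["--git-job-name", "job", "--new-json-output-format", "true", "--old", "x"]
    )
    print(args.new_json_output_format, unknown)
```

One consequence of the string-valued flag is that the comparison must be against the text `"true"`, as the script does, because any non-empty string (including `"false"`) is truthy in Python.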