Skip to content

Commit 3209462

Browse files
c-p-i-opytorchmergebot
authored andcommitted
[fr] fix OSS broken flight recorder (pytorch#140973)
Summary: OSS flight recorder does not work because we renamed `trace_dir` to `folder` in the internal version to reuse the loader. Fixes item #2 in reported issue: pytorch#140879 Test Plan: BEFORE: ``` ❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node1_ tabulate is not installed. Proceeding without it. Traceback (most recent call last): File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module> main() File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 44, in main details, version = read_dir(args) File "/home/cpio/local/pytorch/tools/flight_recorder/components/loader.py", line 89, in read_dir assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}" AttributeError: 'Namespace' object has no attribute 'folder' ``` AFTER: ``` python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_ tabulate is not installed. Proceeding without it. Traceback (most recent call last): File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module> main() File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main db = build_db(details, args, version) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db check_no_missing_dump_files(entries, memberships) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files dumps_ranks == all_ranks AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119} ❯ git status fatal: not a git repository (or any parent up to mount point /data/users/cpio) Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set). ❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_ tabulate is not installed. Proceeding without it. Traceback (most recent call last): File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module> main() File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main db = build_db(details, args, version) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db check_no_missing_dump_files(entries, memberships) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files dumps_ranks == all_ranks AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119} ``` Differential Revision: D66117013 Pull Request resolved: pytorch#140973 Approved by: https://github.com/Skylion007, https://github.com/fduwjj
1 parent 241d225 commit 3209462

File tree

1 file changed

+5
-3
lines changed

1 file changed

+5
-3
lines changed

tools/flight_recorder/components/loader.py

+5-3
Original file line numberDiff line numberDiff line change
@@ -74,8 +74,8 @@ def read_dir(args: argparse.Namespace) -> Tuple[Dict[str, Dict[str, Any]], str]:
7474
t0 = time.time()
7575
version = ""
7676
filecount = 0
77-
assert os.path.isdir(args.folder), f"folder {args.folder} does not exist"
78-
for root, _, files in os.walk(args.folder):
77+
assert os.path.isdir(args.trace_dir), f"folder {args.trace_dir} does not exist"
78+
for root, _, files in os.walk(args.trace_dir):
7979
if prefix is None:
8080
prefix = _determine_prefix(files)
8181
for f in files:
@@ -86,6 +86,8 @@ def read_dir(args: argparse.Namespace) -> Tuple[Dict[str, Dict[str, Any]], str]:
8686
if not version:
8787
version = str(details[f]["version"])
8888
tb = time.time()
89-
assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}"
89+
assert (
90+
len(details) > 0
91+
), f"no files loaded from {args.trace_dir} with prefix {prefix}"
9092
logger.debug("loaded %s files in %ss", filecount, tb - t0)
9193
return details, version

0 commit comments

Comments
 (0)