Commit 0cc88e2

Simplify running commands(single node and multi nodes)
1. Update training logs(glint360k) 2. Update install docs 3. Fix distributed training
1 parent 873e94c commit 0cc88e2

6 files changed: +80 -52 lines changed

recognition/arcface_torch/README.md

+23 -39
@@ -1,48 +1,34 @@
 # Arcface Pytorch (Distributed Version of ArcFace)

-
 ## Contents

 ## Set Up
 ```shell
 torch >= 1.6.0
-```
-
-## Train on a single node
-If you want to use 8 GPU to train, you should set `--nproc_per_node=8` and set `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 `
-If you want to use 4 GPU to train, you should set `--nproc_per_node=4` and set `CUDA_VISIBLE_DEVICES=0,1,2,3`
-If you want to use 1 GPU to train, you should set `--nproc_per_node=1` ...
+```
+More details see [install.md](docs/install.md) in docs.

+## Training
+### 1. Single node, 1 GPU:
 ```shell
-export OMP_NUM_THREADS=4
-export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-python -m torch.distributed.launch \
---nproc_per_node=8 --nnodes=1 \
---node_rank=0 --master_addr="127.0.0.1" \
---master_port=1234 train.py
-ps -ef | grep "train" | grep -v grep | awk '{print "kill -9 "$2}' | sh
+python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=1234 train.py
 ```
-
-## Train on multi-node
+### 2. Single node, 8 GPUs:
 ```shell
-pass
+python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=1234 train.py
 ```
-
-## Evaluation
+### 3. Multiple nodes, each node 8 GPUs:
+Node 0:
+```shell
+python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr="ip1" --master_port=1234 train.py
+```
+Node 1:
 ```shell
-# model-prefix your model path
-# image-path your IJBC path
-# result-dir your result path
-# network your backbone
-CUDA_VISIBLE_DEVICES=0,1 python eval_ijbc.py \
---model-prefix ms1mv3_arcface_r50/backbone.pth \
---image-path IJB_release/IJBC \
---result-dir ms1mv3_arcface_r50 \
---batch-size 128 \
---job ms1mv3_arcface_r50 \
---target IJBC \
---network iresnet50
+python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="ip1" --master_port=1234 train.py
 ```
+
+
+## Evaluation IJBC
 More details see [eval.md](docs/eval.md) in docs.

 ## Speed Benchmark
@@ -89,14 +75,12 @@ All Model Can be found in here.
 ### Glint360k
 | Datasets | log |backbone | IJBC(1e-05) | IJBC(1e-04) |agedb30|cfp_fp|lfw |
 | :---: | :--- |:--- | :--- | :--- |:--- |:--- |:--- |
-| Glint360k-Cosface |[log](https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/glint360k_cosface_r100/training.log) |r100 | 96.19 | 97.39 | 98.52 | 99.26 | 99.83 |
-| Glint360k-Cosface |[log](https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/glint360k_cosface_r100_fp16_0.1/training.log)|r100-fp16-sample-0.1 | 95.95 | 97.35 | 98.57 | 99.30 | 99.85 |
-| Glint360k-Cosface | - | - | - | - | - | - | - |
-| Glint360k-Cosface | - | - | - | - | - | - | - |
-| Glint360k-Cosface | - | - | - | - | - | - | - |
-
-
-
+| Glint360k-Cosface |[log](https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/glint360k_cosface_r18_fp16_0.1/training.log) |r18-fp16-0.1 | 93.16 | 95.33 | 97.72 | 97.73 | 99.77 |
+| Glint360k-Cosface |[log](https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/glint360k_cosface_r34_fp16_0.1/training.log) |r34-fp16-0.1 | 95.16 | 96.56 | 98.33 | 98.78 | 99.82 |
+| Glint360k-Cosface |[log](https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/glint360k_cosface_r50_fp16_0.1/training.log) |r50-fp16-0.1 | 95.61 | 96.97 | 98.38 | 99.20 | 99.83 |
+| Glint360k-Cosface |[log](https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/glint360k_cosface_r100_fp16_0.1/training.log)|r100-fp16-0.1 | 95.88 | 97.32 | 98.48 | 99.29 | 99.82 |
+
+The `0.1` suffix in the backbone name means the sample rate is 0.1.

 More details see [modelzoo.md](docs/modelzoo.md) in docs.
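
The single-node and multi-node commands in the README diff differ only in `--nproc_per_node`, `--nnodes`, `--node_rank`, and `--master_addr`. As a rough illustration, the sketch below assembles the same command line from those values. The helper name `build_launch_cmd` and its defaults are hypothetical and not part of the repository; only the flags themselves come from the README.

```python
import shlex


def build_launch_cmd(nproc_per_node, nnodes, node_rank, master_addr,
                     master_port=1234, script="train.py"):
    """Return the torch.distributed.launch command line as a list of arguments.

    Hypothetical helper: it only mirrors the flags shown in the README.
    """
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Two nodes with 8 GPUs each; "ip1" is the README's placeholder
    # for the address of node 0 and must be replaced with a real address.
    for node_rank in range(2):
        print(shlex.join(build_launch_cmd(8, 2, node_rank, "ip1")))
```

Every node runs the same command except for `--node_rank`, which is why the README lists one command per node.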

recognition/arcface_torch/config.py

+3
@@ -37,6 +37,9 @@ def lr_step_func(epoch):
     config.lr_func = lr_step_func

 elif config.dataset == "glint360k":
+    # make training faster
+    # our RAM is 256G
+    # mount -t tmpfs -o size=140G tmpfs /train_tmp
     config.rec = "/train_tmp/glint360k"
     config.num_classes = 360232
     config.num_image = 17091657
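
The new comments suggest keeping the dataset on a RAM-backed tmpfs mounted at `/train_tmp`. Before copying data there, it may help to confirm the mount has enough free space. The following is a minimal sketch under that assumption; the helper `has_room` is hypothetical, and the 140 GiB figure simply reuses the `size=140G` value from the comment, assuming the `G` suffix is read as GiB.

```python
import shutil


def has_room(path, required_bytes):
    """Return True if the filesystem containing `path` has at least
    `required_bytes` of free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    # Size taken from the tmpfs comment in the diff (assumed to mean GiB).
    required = 140 * 1024 ** 3
    mount_point = "/train_tmp"
    try:
        print("enough room:", has_room(mount_point, required))
    except FileNotFoundError:
        print(mount_point, "does not exist; mount the tmpfs first")
```
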
+9 -4
@@ -1,15 +1,20 @@
 ## Eval IJBC

 ```shell
+# model-prefix your model path
+# image-path your IJBC path
+# result-dir your result path
+# network your backbone
 CUDA_VISIBLE_DEVICES=0,1 python eval_ijbc.py \
---model-prefix tmp_models/backbone.pth \
---image-path /data/anxiang/IJB_release/IJBC \
---result-dir result \
+--model-prefix ms1mv3_arcface_r50/backbone.pth \
+--image-path IJB_release/IJBC \
+--result-dir ms1mv3_arcface_r50 \
 --batch-size 128 \
---job cosface \
+--job ms1mv3_arcface_r50 \
 --target IJBC \
 --network iresnet50
 ```

 ## Eval MegaFace
+pass

+36
@@ -0,0 +1,36 @@
+## v1.7.1
+### Linux and Windows
+```shell
+# CUDA 11.0
+pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
+
+# CUDA 10.2
+pip install torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2
+
+# CUDA 10.1
+pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
+
+# CUDA 9.2
+pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
+
+# CPU only
+pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
+```
+
+
+## v1.6.0
+
+### Linux and Windows
+```shell
+# CUDA 10.2
+pip install torch==1.6.0 torchvision==0.7.0
+
+# CUDA 10.1
+pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
+
+# CUDA 9.2
+pip install torch==1.6.0+cu92 torchvision==0.7.0+cu92 -f https://download.pytorch.org/whl/torch_stable.html
+
+# CPU only
+pip install torch==1.6.0+cpu torchvision==0.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+```

recognition/arcface_torch/run.sh

+1 -3
@@ -1,4 +1,2 @@
-export OMP_NUM_THREADS=4
-CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 \
---node_rank=0 --master_addr="127.0.0.1" --master_port=1234 train.py
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=1234 train.py
 ps -ef | grep "train" | grep -v grep | awk '{print "kill -9 "$2}' | sh

recognition/arcface_torch/train.py

+8 -6
@@ -22,11 +22,13 @@


 def main(args):
-    dist.init_process_group(backend='nccl', init_method='env://')
+
+    world_size = int(os.environ['WORLD_SIZE'])
+    rank = int(os.environ['RANK'])
+    dist_url = "tcp://{}:{}".format(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
+    dist.init_process_group(backend='nccl', init_method=dist_url, rank=rank, world_size=world_size)
     local_rank = args.local_rank
     torch.cuda.set_device(local_rank)
-    rank = dist.get_rank()
-    world_size = dist.get_world_size()

     if not os.path.exists(cfg.output) and rank == 0:
         os.makedirs(cfg.output)
@@ -124,8 +126,8 @@ def main(args):
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description='PyTorch ArcFace Training')
     parser.add_argument('--local_rank', type=int, default=0, help='local_rank')
-    parser.add_argument('--network', type=str, default="iresnet50", help="backbone network")
-    parser.add_argument('--loss', type=str, default="ArcFace", help="loss function")
-    parser.add_argument('--resume', type=int, default=0, help="model resuming")
+    parser.add_argument('--network', type=str, default='iresnet50', help='backbone network')
+    parser.add_argument('--loss', type=str, default='ArcFace', help='loss function')
+    parser.add_argument('--resume', type=int, default=0, help='model resuming')
     args_ = parser.parse_args()
     main(args_)
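
The `train.py` change replaces `init_method='env://'` with an explicit `tcp://` URL built from `MASTER_ADDR` and `MASTER_PORT`, and reads `RANK` and `WORLD_SIZE` directly from the environment. A minimal sketch of that lookup, isolated from PyTorch so it can be run anywhere, might look like the following. The function name `dist_settings_from_env` and the missing-variable check are assumptions added for illustration; they are not in the commit.

```python
import os

REQUIRED_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def dist_settings_from_env(env=None):
    """Collect rank, world size and a tcp:// init URL from launcher-style
    environment variables. Hypothetical helper, not part of train.py."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
        "init_method": "tcp://{}:{}".format(env["MASTER_ADDR"], env["MASTER_PORT"]),
    }


if __name__ == "__main__":
    # Illustrative values for the second process of an 8-process, single-node job.
    fake_env = {"RANK": "1", "WORLD_SIZE": "8",
                "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "1234"}
    print(dist_settings_from_env(fake_env))
```

In a real training script the returned values would presumably be passed on as `dist.init_process_group(backend='nccl', **settings)`, which matches the keyword arguments used in the diff.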
