본문 바로가기
NVIDIA

[NVIDIA] MIG(Multi-Instance-GPU) Docker 컨테이너에 할당

by Yoon_estar 2024. 8. 3.
728x90
반응형

1. MIG 활성화

  • 확인(비활성화 되어 있음)
# nvidia-smi
Wed Jul 31 10:57:26 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:19:00.0 Off |                    0 |
| N/A   37C    P0            115W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2D:00.0 Off |                    0 |
| N/A   39C    P0            115W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3F:00.0 Off |                    0 |
| N/A   39C    P0            116W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:66:00.0 Off |                    0 |
| N/A   37C    P0            118W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9B:00.0 Off |                    0 |
| N/A   36C    P0            116W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:AE:00.0 Off |                    0 |
| N/A   41C    P0            128W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:BF:00.0 Off |                    0 |
| N/A   40C    P0            123W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:E4:00.0 Off |                    0 |
| N/A   37C    P0            116W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • 2번GPU MIG 활성화 후 활성화 확인
# nvidia-smi -i 2 -mig 1 

# nvidia-smi
Wed Jul 31 10:58:40 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:19:00.0 Off |                    0 |
| N/A   37C    P0            115W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2D:00.0 Off |                   On |
| N/A   38C    P0            115W /  700W |       1MiB /  81559MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3F:00.0 Off |                   On |
| N/A   39C    P0            116W /  700W |       1MiB /  81559MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:66:00.0 Off |                   On |
| N/A   37C    P0            118W /  700W |       1MiB /  81559MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9B:00.0 Off |                   On |
| N/A   36C    P0            115W /  700W |       1MiB /  81559MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:AE:00.0 Off |                    0 |
| N/A   40C    P0            127W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:BF:00.0 Off |                    0 |
| N/A   40C    P0            123W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:E4:00.0 Off |                    0 |
| N/A   37C    P0            115W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

2. MIG GI, CI 생성

  • 2.1 GI 생성
# nvidia-smi mig -i 2 -cgi 5,9
Successfully created GPU instance ID  1 on GPU  2 using profile MIG 4g.40gb (ID  5)
Successfully created GPU instance ID  2 on GPU  2 using profile MIG 3g.40gb (ID  9)

# nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   2  MIG 3g.40gb          9        2          4:4     |
+-------------------------------------------------------+
|   2  MIG 4g.40gb          5        1          0:4     |
+-------------------------------------------------------+
  • 2.2 CI 생성
# nvidia-smi mig -i 2 -cci -gi 2,1
Successfully created compute instance ID  0 on GPU  2 GPU instance ID  2 using profile MIG 3g.40gb (ID  2)
Successfully created compute instance ID  0 on GPU  2 GPU instance ID  1 using profile MIG 4g.40gb (ID  3)

# nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   2      2       MIG 3g.40gb          2         0          0:4     |
+--------------------------------------------------------------------+
|   2      1       MIG 4g.40gb          3         0          0:4     |
+--------------------------------------------------------------------+
  • 2.3 MIG 생성 확인
# nvidia-smi 

...
+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  2    1   0   0  |              51MiB / 40320MiB    | 64      0 |  4   0    4    0    4 |
|                  |                 0MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    2   0   1  |              38MiB / 40320MiB    | 60      0 |  3   0    3    0    3 |
|                  |                 0MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
...

3. DOCKER

  • Docker, nvidia-docker 리파지토리 추가 후 docker 와 nvidia docker 설치
# dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
# dnf -y install containerd.io
# dnf -y install docker-ce

# curl https://nvidia.github.io/nvidia-docker/rhel8.0/nvidia-docker.repo > /etc/yum.repos.d/nvidia-docker.repo

# dnf -y install nvidia-docker2
# systemctl restart docker
# systemctl status docker

4. 작업제출

  • 4.1 pytorch 작업 제출
# docker run --gpus '"device=2:0"' nvcr.io/nvidia/pytorch:21.12-py3 /bin/bash -c "cd /opt/pytorch/examples/upstream/mnist && python main.py"

# docker ps
CONTAINER ID   IMAGE                              COMMAND                  CREATED          STATUS          PORTS             
   NAMES
463d966bb1f1   nvcr.io/nvidia/pytorch:21.12-py3   "/opt/nvidia/nvidia_…"   12 minutes ago   Up 12 minutes   6006/tcp, 8888/tcp
   cranky_benz
   
# nvidia-smi
.....
+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  2    1   0   0  |             724MiB / 40320MiB    | 64      0 |  4   0    4    0    4 |
|                  |                 3MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    2   0   1  |              38MiB / 40320MiB    | 60      0 |  3   0    3    0    3 |
|                  |                 0MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    2    1    0     743978      C   python                                        664MiB |
+-----------------------------------------------------------------------------------------+
.....
  • 3.2 작업 추가 후 확인
# docker run --gpus '"device=2:1"' nvcr.io/nvidia/pytorch:21.12-py3 /bin/bash -c "cd /opt/pytorch/examples/upstream/mnist && python main.py"

# docker ps
CONTAINER ID   IMAGE                              COMMAND                  CREATED          STATUS          PORTS             
   NAMES
ae6d21ebdd37   nvcr.io/nvidia/pytorch:21.12-py3   "/opt/nvidia/nvidia_…"   6 minutes ago    Up 6 minutes    6006/tcp, 8888/tcp
   upbeat_ganguly
463d966bb1f1   nvcr.io/nvidia/pytorch:21.12-py3   "/opt/nvidia/nvidia_…"   24 minutes ago   Up 24 minutes   6006/tcp, 8888/tcp
   cranky_benz

# nvidia-smi
...

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  2    1   0   0  |             890MiB / 40320MiB    | 64      0 |  4   0    4    0    4 |
|                  |                 3MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    2   0   1  |             295MiB / 40320MiB    | 60      0 |  3   0    3    0    3 |
|                  |                 3MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    2    1    0     743978      C   python                                        830MiB |
|    2    2    0     755998      C   python                                        248MiB |
+-----------------------------------------------------------------------------------------+
반응형