What is the DGX-1?

The DGX-1 is an appliance designed and fitted for deep learning and AI use cases.

 

Software details

The operating system is based on Ubuntu 16.04. Additional tools are provided to help you get the most out of the hardware. In particular, common deep learning frameworks can be deployed quickly from prebuilt Docker containers.

[Figure: software stack]

 

Available containers

The list of available containers is here: http://nvcr.ovh/list.txt

  caffe:17.10
  caffe2:17.10
  cntk:17.10
  cuda:9.0-cudnn7-devel-ubuntu16.04
  digits:17.10
  mxnet:17.10
  pytorch:17.10
  tensorflow:17.10
  theano:17.10
  torch:17.10

 

How do I fetch them?


docker pull nvcr.ovh/nvidia/$container:$release
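
In the command above, $container and $release stand for a name and a tag from the list, for example tensorflow and 17.10. As a minimal sketch, assuming a POSIX shell, the two helpers below build the full image reference and pull several containers of one release in a row. The names image_ref and pull_release are hypothetical; they are not tools shipped with the DGX-1.

```shell
# Registry prefix taken from the docker pull command above.
REGISTRY="nvcr.ovh/nvidia"

# Print the full image reference for a container name and a release tag.
image_ref() {
    printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"
}

# Pull several containers of the same release, stopping at the first failure.
# Hypothetical helper: first argument is the release, the rest are names.
pull_release() {
    release="$1"
    shift
    for container in "$@"; do
        docker pull "$(image_ref "$container" "$release")" || return 1
    done
}
```

For example, pull_release 17.10 tensorflow pytorch would fetch both images one after the other.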
 

Usage example

 

docker pull nvcr.ovh/nvidia/tensorflow:17.10

mkdir -p /raid/datasets/tensorflow

nvidia-docker run -ti --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /raid/datasets/tensorflow:/datasets nvcr.ovh/nvidia/tensorflow:17.10

 

Then, from within the container itself, launch your training. Many examples are provided inside the containers themselves, in the /workspace directory.

 

Bind your containers to specific graphics cards

NV_GPU=0,1 nvidia-docker run [...]
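
NV_GPU takes a comma-separated list of GPU indices; with 8 GPUs in the DGX-1, valid indices run from 0 to 7. The sketch below checks the indices and then lists the GPUs a container actually sees. It assumes that the cuda image from the list exposes nvidia-smi inside the container, which is the usual behaviour with nvidia-docker but is not stated on this page. gpu_list and visible_gpus are hypothetical helper names.

```shell
# Image used for the check (assumption: nvidia-smi is available inside it).
IMAGE="nvcr.ovh/nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04"

# Join GPU indices into the comma-separated form NV_GPU expects,
# rejecting anything outside 0-7.
gpu_list() {
    list=""
    for idx in "$@"; do
        case "$idx" in
            [0-7]) list="${list:+$list,}$idx" ;;
            *) echo "invalid GPU index: $idx" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "$list"
}

# List the GPUs visible to a container restricted to the given indices.
visible_gpus() {
    gpus="$(gpu_list "$@")" || return 1
    NV_GPU="$gpus" nvidia-docker run --rm "$IMAGE" nvidia-smi -L
}
```

Calling visible_gpus 0 1 should print two GPU entries if the binding works as intended.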

 

More details about how to use nvidia-docker:

http://docs.nvidia.com/deeplearning/dgx/pdf/User-Guide.pdf

 

Hardware topology

Eight V100 graphics cards are included and connected to each other through a mesh, using NVLink technology. More precisely, every GPU is directly connected to 4 other ones (non-uniform memory access). Each NVLink link supports up to 50 GB/s of bidirectional bandwidth. Communication between the processors and the graphics cards goes over PCIe Gen 3.

There are 5 drives: the first one, a 480 GB SSD, holds the operating system, while the other four 1.92 TB SSDs are combined in a hardware RAID 0 intended for storing datasets. This way, training is not penalized by disk I/O.

[Figure: hardware topology]

 

This topology can be retrieved on the DGX-1 itself using the installed tools:

[Figure: nvidia-smi topology output]
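
A likely way to retrieve this topology, assuming the NVIDIA driver utilities are in the PATH, is the topo subcommand of nvidia-smi. The sketch below wraps it in a small function; show_topology is a hypothetical name.

```shell
# Print the GPU interconnect matrix for the machine.
# Hypothetical wrapper around "nvidia-smi topo -m".
show_topology() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: run this on the DGX-1 itself" >&2
        return 1
    fi
    nvidia-smi topo -m
}
```

In the resulting matrix, entries starting with NV denote GPU pairs linked by NVLink, while the other entries describe paths that cross PCIe or the CPUs.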

 

You can find more details here:

https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-1/dgx-1-ai-supercomputer-datasheet-v4.pdf

https://devblogs.nvidia.com/parallelforall/dgx-1-fastest-deep-learning-system/

 

Raid and partitioning layout details

You may wonder why any custom hardware RAID or partitioning layout selected in the OVH installation wizard is ignored for the DGX-1. This is deliberate, as we want to stick to the native layout settings of the DGX appliance.

The DGX-1 has 5 SSDs:

1. The first one (480 GB) hosts the operating system.

2. The other four (1.92 TB each) are combined in a hardware RAID 0, so that striping minimizes the I/O bottleneck when datasets are read during training or simulation.

The RAID 0 array is formatted as ext4 and mounted on /raid: this is where datasets are intended to be cached.

[Figure: lsblk output on the DGX-1]
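
To inspect this layout yourself, the standard util-linux tools are enough. The following is a sketch rather than an official procedure: show_disks and check_raid are hypothetical helper names, and the check only confirms that the mount point exists and is ext4, as described above.

```shell
# Show block devices with size, type, filesystem and mount point.
show_disks() {
    lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
}

# Check that /raid (or the mount point given as $1) is mounted as ext4,
# then print its free space.
check_raid() {
    mountpoint="${1:-/raid}"
    fstype="$(findmnt -n -o FSTYPE "$mountpoint")" || {
        echo "$mountpoint is not a mount point" >&2
        return 1
    }
    [ "$fstype" = "ext4" ] || {
        echo "$mountpoint is $fstype, expected ext4" >&2
        return 1
    }
    df -h "$mountpoint"
}
```

Running check_raid before copying a dataset helps avoid filling the 480 GB system disk by mistake if /raid is not mounted.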

 

Monitor GPU usage

[Figure: GPU resource usage]
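
One way to follow GPU usage from the command line, assuming nvidia-smi is available on the host, is its query mode. The sketch below prints one CSV line per GPU at a fixed interval; monitor_gpus is a hypothetical wrapper name and the selected fields are only a suggestion.

```shell
# Print utilization, memory and temperature for every GPU,
# refreshed every $1 seconds (default 5) until interrupted with Ctrl-C.
monitor_gpus() {
    interval="${1:-5}"
    nvidia-smi \
        --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu \
        --format=csv \
        -l "$interval"
}
```

For a quick interactive view, watch -n 1 nvidia-smi refreshes the default summary table every second.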