What is the DGX-1?
The DGX-1 is an appliance designed and fitted for deep learning and AI use cases.
The operating system is based on Ubuntu 16.04. Additional tools are provided to help you get the most out of the hardware. In particular, common deep learning frameworks can be deployed quickly using prebuilt Docker containers.
The list of available containers is here: http://nvcr.ovh/list.txt
How do I fetch them?
docker pull nvcr.ovh/nvidia/$container:$release
docker pull nvcr.ovh/nvidia/tensorflow:17.09
mkdir -p /raid/datasets/tensorflow
nvidia-docker run -ti --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /raid/datasets/tensorflow:/datasets nvcr.ovh/nvidia/tensorflow:17.09
Then, from within the container itself, launch your training:
Many examples are provided inside the containers themselves, in the /workspace directory.
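The pull and run steps above can be wrapped in a small script. The sketch below is only an illustration: the image_ref and run_or_print helpers, the RUN switch and the DATASET_DIR variable are hypothetical names, not part of the DGX-1 tooling. By default it only prints the commands; setting RUN=1 on the DGX-1 would execute them.

```shell
#!/bin/sh
# Hypothetical helper: build the image reference from a container name and a
# release tag, following the nvcr.ovh/nvidia/$container:$release pattern.
image_ref() {
    printf 'nvcr.ovh/nvidia/%s:%s\n' "$1" "$2"
}

# Hypothetical helper: print the command, and execute it only when RUN=1.
run_or_print() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

image="$(image_ref tensorflow 17.09)"
# Assumption: datasets live under /raid, as in the example above.
dataset_dir="${DATASET_DIR:-/raid/datasets/tensorflow}"

run_or_print docker pull "$image"
run_or_print mkdir -p "$dataset_dir"
run_or_print nvidia-docker run -ti --rm --shm-size=1g \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v "$dataset_dir:/datasets" "$image"
```

Printing first makes it easy to check the image name and the mounted directory before anything is downloaded or started.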
Bind your containers to specific graphics cards
NV_GPU=0,1 nvidia-docker run [...]
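Since the DGX-1 has 8 GPUs, the indices passed through NV_GPU should range from 0 to 7. As a minimal sketch, a hypothetical valid_gpu_list helper (not part of nvidia-docker) could check a list before it is used:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is a comma-separated list of GPU
# indices between 0 and 7 (the DGX-1 has 8 GPUs), for example "0,1".
valid_gpu_list() {
    case "$1" in
        # Reject empty input, characters other than digits and commas,
        # and leading, trailing or doubled commas.
        ''|*[!0-9,]*|,*|*,|*,,*) return 1 ;;
    esac
    old_ifs=$IFS
    IFS=,
    for idx in $1; do
        case "$idx" in
            [0-7]) ;;                         # single digit in range
            *) IFS=$old_ifs; return 1 ;;      # e.g. "8" or "10"
        esac
    done
    IFS=$old_ifs
    return 0
}

gpus="0,1"
if valid_gpu_list "$gpus"; then
    echo "GPU list '$gpus' is valid; use it as NV_GPU=$gpus in front of nvidia-docker run"
else
    echo "GPU list '$gpus' is not valid for an 8-GPU machine"
fi
```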
More details about how to use nvidia-docker:
Eight V100 graphics cards are included and connected to each other in a mesh using NVLink technology. More precisely, every GPU is directly connected to 4 others (non-uniform memory access). The direct links support up to 40Gb/s of bidirectional bandwidth. Communication between the processors and the graphics cards goes over PCIe Gen 3.
There are 5 drives: the first one, a 480 GB SSD, holds the operating system, while the four 1920 GB SSDs are combined in a hardware RAID 0 intended for storing datasets. This way, training is not penalized by disk I/O.
This topology can be displayed on the DGX-1 itself using the installed tools:
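One way to do this, assuming the nvidia-smi utility shipped with the NVIDIA driver is the tool in question, is its topology matrix. The show_topology wrapper below is a hypothetical name used only for this sketch; it falls back to a message when nvidia-smi is not installed.

```shell
#!/bin/sh
# Hypothetical wrapper: print the GPU interconnect matrix if nvidia-smi exists.
show_topology() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # "topo -m" prints how each pair of GPUs is connected (NVLink, PCIe, ...).
        nvidia-smi topo -m
    else
        echo "nvidia-smi not found: run this on the DGX-1 itself"
    fi
}

show_topology
```

In the resulting matrix, entries starting with NV mark pairs of GPUs joined by NVLink, while the other entries describe paths that go through PCIe.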
You can find more details here:
RAID and partitioning layout details
You may wonder why any custom hardware RAID or partitioning layout chosen in the OVH installation wizard is ignored for the DGX-1. This is deliberate: we want to stick to the native layout settings of the DGX appliance:
The DGX-1 has 5 SSDs:
1. The first one, 480 GB, hosts the operating system
2. The other four, 1920 GB (roughly 2 TB) each, are combined in a hardware RAID 0, so that striping minimizes the I/O bottleneck when reading datasets during training or simulation
The RAID 0 array is formatted as ext4 and mounted on /raid: this is where datasets are intended to be cached.
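The mount point and filesystem type described above can be checked from a shell. This is a minimal sketch assuming GNU df, as found on Ubuntu; check_dataset_mount is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: report the filesystem behind a dataset directory.
check_dataset_mount() {
    dir="$1"
    if [ -d "$dir" ]; then
        # -h: human-readable sizes, -T: also print the filesystem type (GNU df).
        df -hT "$dir"
    else
        echo "$dir does not exist on this machine"
    fi
}

check_dataset_mount /raid
```

On the DGX-1, the Type column should read ext4 and the Mounted on column should read /raid if the native layout is in place.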
Monitor GPU usage
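One possible approach, assuming the nvidia-smi utility shipped with the NVIDIA driver, is to query per-GPU utilization and memory usage in CSV form. The gpu_usage wrapper is a hypothetical name used only for this sketch; extra arguments are passed through to nvidia-smi.

```shell
#!/bin/sh
# Hypothetical wrapper: print per-GPU utilization and memory usage as CSV.
gpu_usage() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi \
            --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
            --format=csv "$@"
    else
        echo "nvidia-smi not found: run this on the DGX-1 itself"
    fi
}

# One sample. On the DGX-1, "gpu_usage -l 5" would print a new sample
# every 5 seconds until interrupted.
gpu_usage
```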