Training of deep learning models: from local CPU to cloud GPU (Part I)

This guide walks you through the steps Nextbit uses to configure a cloud environment for training deep learning models.

In Part I we will see how to configure an AWS EC2 instance optimized for deep learning.

Luca Grementieri - lgrementieri@nextbit.it

EC2 instance configuration

Assuming you already have an AWS account, you can access the EC2 Dashboard from here. The Launch Instance button allows you to configure a new instance. AWS resources are distributed around the world in regions; the link points to one of the cheapest regions, us-east-1 (N. Virginia).

First, you have to choose an AMI, i.e. the operating system image of your virtual machine. To get the best performance we use the AMI distributed by Nvidia, since it provides access to Nvidia GPU Cloud (NGC) optimized containers. In the search bar, type NVIDIA Volta Deep Learning AMI, then select AWS Marketplace and choose the only result with the Select button. The Continue button in the popup window confirms the choice of the AMI.

If the first search result does not have the correct name, the AMI is not available in the chosen region. Be careful if you decide to move to another region, because not all services are available in every region: for example, the GPU instance we are going to use is available only in certain regions, and the same applies to compatible AMIs. See the P3 instances page for details on region availability.
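If you prefer the command line, a similar search can be sketched with the AWS CLI. This is a hypothetical sketch: the name filter below is an assumption based on the Marketplace listing name, and the command is only echoed because actually running it requires configured AWS credentials.

```shell
# Sketch: look up the Nvidia AMI in us-east-1 with the AWS CLI.
# Remove "echo" to actually run the query (requires configured credentials).
echo aws ec2 describe-images \
    --region us-east-1 \
    --filters 'Name=name,Values=*NVIDIA Volta Deep Learning AMI*' \
    --query 'Images[].[ImageId,Name]' --output text
```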

Choice of the AMI.

The choice of the instance type is very easy, since the AMI only supports P3 instances. The default choice, p3.2xlarge, should be sufficient for most needs of a common deep learning practitioner. Without going into the details of the configuration, it is sufficient to select Review and Launch to confirm the default settings.
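For reference, the same launch can be sketched with the AWS CLI. This is a hypothetical sketch: the image id is a placeholder, and the command is only echoed because running it for real requires configured credentials and a Marketplace subscription to the AMI.

```shell
# Sketch: launching a p3.2xlarge from the CLI (echoed, not executed).
# ami-xxxxxxxx is a placeholder for the Nvidia AMI id from the Marketplace.
echo aws ec2 run-instances \
    --region us-east-1 \
    --image-id ami-xxxxxxxx \
    --instance-type p3.2xlarge
```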

As you can see from the Storage tab of the review page, an EBS volume associated with the virtual machine has been created. An EBS volume is an SSD drive, with a default capacity of 32 GB, that stores the operating system of the virtual machine (the AMI); it is also where you will put the model code. It is not necessary to increase the default capacity, because all training data will be put on a separate drive in order to minimize data transfer costs.

After reviewing the instance details, you are ready to start it with the Launch button. A popup will appear asking you to Choose an existing key pair among the available ones or to Create a new key pair, i.e. the private and public keys used to securely access the instance via SSH. If you do not have access to any private key of the selected region, select Create a new key pair, pick a name like north_virginia and use the Download Key Pair button to download the private key file (.pem file). Put the .pem file in your home directory to access it easily.
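If you manage keys from the terminal, the same flow can be sketched as below. The AWS CLI call is only echoed because it requires configured credentials; the key name north_virginia is the one used in the example above, and the touch line merely simulates the downloaded key locally to demonstrate the permission fix SSH requires.

```shell
# Sketch: create the key pair with the AWS CLI (echoed, not executed).
echo aws ec2 create-key-pair --key-name north_virginia \
    --query KeyMaterial --output text '>' north_virginia.pem

# Simulate the downloaded key locally and restrict its permissions,
# as SSH refuses keys that are readable by other users.
touch north_virginia.pem
chmod 400 north_virginia.pem
ls -l north_virginia.pem
```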

Finally, the Launch Instances button creates the chosen virtual machine. You will be redirected to a summary page; at the bottom you can find the View Instances button, which links to the Instances Dashboard. When the state switches from pending to running, your instance is up and running (and you are paying for it). Clicking on the name field of your instance, you can edit its name, choosing Volta GPU for example. For now you can stop the instance, since you are going to use it later. The Actions dropdown button has a sub-menu Instance State where you can choose among the Start, Stop, Reboot and Terminate commands. While the Stop command switches the virtual machine off, the Terminate command permanently deletes the instance and the associated EBS volume, so it should be used carefully.
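The same Start/Stop/Terminate actions have AWS CLI equivalents, sketched below. The instance id is a placeholder, and the commands are only echoed; drop the echo to actually run them (requires configured credentials).

```shell
# Sketch: instance lifecycle commands from the CLI (echoed, not executed).
INSTANCE_ID="i-0123456789abcdef0"   # placeholder: use your instance id
echo aws ec2 start-instances --instance-ids "$INSTANCE_ID"
echo aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
# Terminate permanently deletes the instance and its EBS volume: use with care.
echo aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```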

Nvidia GPU Cloud (NGC)

To access the Nvidia AMI installed on the virtual machine, you need a Nvidia API Key. If you have already generated and saved an API Key, you can skip this step; otherwise, it can be obtained upon registration on the NGC website.

After registration you are redirected to a web application where you can request your API Key using the Get API Key and then Generate API Key buttons. As stated on the website, the API Key is shown just once, so it has to be copied and saved locally for later use. If you lose your API Key, you can generate a new one following the same procedure; the previous one will be invalidated.

Generation of Nvidia API Key.

Connection to EC2 instance

Equipped with your Nvidia API Key, you can finally connect to your EC2 instance. Go to the Instances Dashboard, select Volta GPU and start the instance. When the instance is running, the Connect button opens a popup with all the information needed to connect to it.

Instance connection parameters.

To connect to the instance, open a terminal. As explained in the popup, the first time you use your private key file you need to run the command chmod 400 north_virginia.pem.

Then, depending on your SSH client, set up a connection to the user ubuntu on the machine identified by the public DNS address shown. In the case shown in the figure above, on a Linux/macOS system, the command below does the job. The terminal working directory has to contain your .pem file; for this reason it is preferable to keep the private key in your home directory, the default working directory of a new terminal window. Otherwise, you have to specify the full path to the .pem file.

ssh -i "north_virginia.pem" ubuntu@ec2-34-239-172-92.compute-1.amazonaws.com

If the SSH connection is established successfully, the virtual machine will ask you to confirm the authenticity of the host. Type yes and the connection procedure continues. Finally, you are prompted to insert your NGC API Key, and after that you will have access to your GPU virtual machine on AWS.
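To avoid typing the full command every time, you can also add an entry to your ~/.ssh/config file. A sketch is shown below: the host alias volta-gpu is a hypothetical name, and the DNS address is the one from the example above, so it will differ on your instance.

```
# Sketch of a ~/.ssh/config entry; afterwards, connect with just: ssh volta-gpu
Host volta-gpu
    HostName ec2-34-239-172-92.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/north_virginia.pem
```

Note that the public DNS address changes every time a stopped instance is started again, so you may need to update the HostName line after each restart.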

Summary

The first part of this guide has shown you how to set up and connect to an AWS EC2 instance equipped with a Nvidia Volta GPU, one of the best GPUs available for training deep learning models.

The next part will cover data transfer from your local system to the cloud environment, focusing on the best (and cheapest) way to upload the large amounts of data necessary to train state-of-the-art deep learning systems.

Open positions in Nextbit

At Nextbit we are always looking for open minds. If you feel this is the right place for you, please contact us at contact@nextbit.it.

  • Cloud Solutions Architect
    • GCP
    • AWS
    • Java
    • Python
    • Docker
  • Full Stack Engineer
    • JS
    • Vue
    • React
    • Node
    • Java
    • Python
    • GraphQL


At Nextbit we provide consulting services for customers’ proprietary projects as well as cutting-edge, industry-specific solutions; visit our website to learn more.