Deep Learning with Google Cloud Platform (GCP)

This is a simple resource to get started with Deep Learning using Docker and GCP. The goal is to set up a GCP Linux instance with a GPU and install the necessary dependencies to run TensorFlow models. Luckily, Google is giving away $300 worth of computational resources, which typically translates to roughly 300 hours of training on a Tesla K80 (12 GB).

If you do not want to attach your credit card and just want to follow along with a CPU version, skip any instructions that correspond to adding a GPU. You will not have to upgrade your account or attach a credit card.

First, why GCP, GPU, TensorFlow, and Docker?

  • GCP: Google Compute Engine, part of Google Cloud Platform, allows us to create and run an instance. An instance is a virtual machine that we can access to perform computation. You can add a number of GPUs, request whatever CPU power you would like, and attach an independent volume (point your instance at a storage device).
  • GPU: A graphics processing unit decreases computation time by executing many operations in parallel. In the near future, Google will provide access to TPUs (Tensor Processing Units), custom chips built specifically for machine learning that are designed to work together and will greatly increase computational speed.
  • TensorFlow: TensorFlow (TF) is Google’s open-source software library for machine learning. There are a lot of other options for building deep learning models, but TF is made by Google! It will be optimized for the upcoming TPUs, it is built to go to production with TensorFlow Serving, there are packages that simplify model construction, and there is a nice Python API!
  • Docker: Docker can be thought of as a lightweight virtual machine. It allows us to create containers: independent environments instantiated from Docker images, which are defined in a Dockerfile. You can run many containers on a single machine without the overhead a virtual machine creates, and it is easy to orchestrate containers so that more launch when needed. For now, we use Docker to simplify TensorFlow’s dependencies. It will be helpful when you need to launch a new instance or want to take your model to production with TensorFlow Serving.

Set up your Google Cloud Account:

  • In order to use Google Compute Engine, you will have to create a Google account.
  • You are required to upgrade your Google Cloud Platform account in order to increase your GPU quota. This means you will have to add credit card information. You will still be able to use your $300 credit toward your GPU allowance, but you will get charged if you exceed that $300 limit.
  • In your console, go to Billing and create a budget if you want to be warned when you approach a certain threshold.
  • You will be charged for the time your instance is running: see the explanation of how to stop an instance in the section Setting up a GPU Instance.

Setting up a GPU Instance:

We will set up a Linux instance with a GPU. First, navigate to your console. On the left under Compute, click on “Compute Engine” and then “VM instances”. Click “Create instance.” Notice the estimated charge in the top right corner. You get charged while the instance is running, so make sure it is off when you aren’t using it. Next, we will follow the steps below to specify the instance details:

  1. Give your instance a cool name
  2. zone: choose us-west1-b (this zone has GPUs)
  3. Machine Type: click “Customize”. I tend to increase the memory to 6.5 GB for 1 core (maybe 12 GB on 4 cores if no GPU). Then click “GPUs” and choose 1 NVIDIA Tesla K80 (you will have to increase your GPU quota, as explained here).
  4. Boot Disk: Ubuntu 14.04 LTS (this is the base image your instance will run on). Choose a standard persistent disk and the disk size you need; specify 100 GB.
  5. Firewall: allow HTTP/HTTPS traffic.
  6. Click “create”
  7. Once the instance is created, click its name, click “Edit”, and scroll to Network interfaces. Click “default” under Network. Scroll to Firewall rules and click “Add firewall rule.” Give it a name like “tensorboard.” Under “Targets” specify “All instances in the network” and add “0.0.0.0/0” to “Source IP ranges” so this rule applies to any IP address. Then, under specified protocols and ports, add “tcp:81”. Click “Create.” (This step lets us forward TensorBoard to our browser.)
  8. In order to connect, you are going to need to set up SSH or PuTTY. Follow the instructions here. Make sure you generate an SSH key pair, which is a step in the link (a minimal sketch of generating one follows this list).
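
If you would rather generate the key pair from the command line, here is a minimal sketch using ssh-keygen; the key filename and username are examples, so substitute your own:

# Generate an RSA key pair (the filename and username are examples).
ssh-keygen -t rsa -f ~/.ssh/my-ssh-key -C lstrnad

# Restrict permissions on the private key.
chmod 400 ~/.ssh/my-ssh-key

# Print the public key, then paste it into your instance's SSH Keys
# section (or project-wide metadata) in the GCP console.
cat ~/.ssh/my-ssh-key.pub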

Turning Off Instance: By default, the instance is running after you create it. Make sure to turn off your instance when you are not using it or you will be charged for the time it runs. Start and stop options are available on the VM instances page in the GCP console. The picture below shows that ‘instance-1’ is running and ‘tensorflow’ is not; options to start and stop an instance are at the top.

[Image: VM instances page with start and stop options]
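
If you have the Cloud SDK installed on your own machine, you can also stop and start an instance from the command line; the instance name and zone below are examples, so use your own:

# Stop the instance when you are done working (compute charges stop).
gcloud compute instances stop instance-1 --zone us-west1-b

# Start it back up when you need it again.
gcloud compute instances start instance-1 --zone us-west1-b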

Now, assuming all the steps above are complete, the instance is started, and you have SSH on your computer, you can connect with:

sudo ssh -i ~/.ssh/my-ssh-key lstrnad@35.197.20.53
  • -i ~/.ssh/my-ssh-key specifies that you have an SSH key and where it is located
  • lstrnad would be your GCP username
  • 35.197.20.53 is the external IP address of your instance, found on the VM instances page of the GCP console.
  • If you have a Windows computer, you will have to download and use PuTTY, found here.

You should now be inside your Linux instance with a GPU!

Setup: Installing Drivers and Dependencies

This section walks through installing the NVIDIA drivers necessary for using the GPU on the machine, then setting up Docker and nvidia-docker. First, let's set up the NVIDIA driver.

1. Install NVIDIA Drivers

Go to this article, which explains how to attach a GPU to your instance. First, scroll down to Installing GPU drivers using scripts, select Ubuntu, and copy the script in the 14.04 section. We need to make a file drivers.sh to paste this into and execute. To do this we

  • vim drivers.sh in the command line
  • enter insert mode by pressing i and paste the script, which looks like
#!/bin/bash
echo "Checking for CUDA and installing."
# Check for CUDA and try to install.
if ! dpkg-query -W cuda; then
  curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_8.0.61-1_amd64.deb
  dpkg -i ./cuda-repo-ubuntu1404_8.0.61-1_amd64.deb
  apt-get update
  apt-get install cuda -y
  apt-get install linux-headers-$(uname -r) -y
fi
  • to save and quit, press <esc> and then :, which places you in a command-line prompt. Type wq and press enter.
  • Now, we should be back at the command line. Make the script executable by typing chmod +x drivers.sh, then execute it by typing sudo ./drivers.sh (let it install).
  • Finally, we can test whether the drivers installed successfully by running nvidia-smi. You should get something like

[Image: nvidia-smi output showing the Tesla K80]

Awesome!

2. Install Docker

In order to install Docker, navigate here or follow the steps below: remove any old versions, install the recommended extra packages, set up the Docker CE repository (the recommended approach; use amd64), and finally install with apt-get. In short (it may be worth running these line by line):

sudo apt-get remove docker docker-engine docker.io
sudo apt-get update
sudo apt-get install \
    linux-image-extra-$(uname -r) \
    linux-image-extra-virtual
sudo apt-get update
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update
sudo apt-get install docker-ce

To verify that Docker is installed and running, run

sudo docker run hello-world

and you should get a “Hello from Docker!” message.

Success!

3. Install nvidia-docker

nvidia-docker is a wrapper around Docker that lets containers access the NVIDIA drivers and GPUs on the host. Follow the steps below to install it.

  • First, navigate here, copy the link for nvidia-docker_1.0.1-1_amd64.deb, and download it as follows:
    wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
    
  • Next, we need to install the package:
    sudo dpkg -i nvidia-docker_1.0.1-1_amd64.deb
    
  • Then, we can similarly check that it installed properly by running sudo nvidia-docker run hello-world. Note: any time we want to run Docker, we are going to have to use sudo.
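
To confirm that containers can actually see the GPU (not just that nvidia-docker runs), you can also run nvidia-smi inside NVIDIA's cuda image, which this command pulls from Docker Hub:

# If the Tesla K80 appears in the output, containers have GPU access.
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi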

Excellent! All of the dependencies have been satisfied. Let's pull down my TensorFlow image from my Docker Hub. Navigate to the TensorFlow and Docker tab.

Docker and TensorFlow

Brief Explanation of Docker:

This page is intended to demonstrate how to launch a TensorFlow container using Docker. In order to get started, we should get familiar with Docker; the best place to start is the Get Started page Docker offers. This is what they say:

An image is a lightweight, stand-alone, executable package that includes everything needed to run a piece of software, including the code, a runtime, libraries, environment variables, and config files.

A container is a runtime instance of an image–what the image becomes in memory when actually executed. It runs completely isolated from the host environment by default, only accessing host files and ports if configured to do so.

Containers run apps natively on the host machine’s kernel. They have better performance characteristics than virtual machines that only get virtual access to host resources through a hypervisor. Containers can get native access, each one running in a discrete process, taking no more memory than any other executable.

In short, Docker allows us to easily run a lightweight VM-like environment that contains all of the necessary dependencies for a coding environment or application.
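
The image/container distinction is easy to see from the command line; a quick sketch:

# Images are the packages stored on disk.
sudo docker images

# A container is an instance of an image; each `docker run` creates a new one.
sudo docker run hello-world

# List all containers, including stopped ones.
sudo docker ps -a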

Development Environment

Why Docker? Here is a use case: you are developing an app for a company, and your Python/R app requires a lot of dependencies. When distributing the app to other computers at the company, you would have to install Python/R and all of those dependencies on each machine, which we all know can be a challenge. Different computer architectures tend to require different software… Docker says:

With Docker, you can just grab a portable Python runtime as an image, no installation necessary. Then, your build can include the base Python image right alongside your app code, ensuring that your app, its dependencies, and the runtime, all travel together.

These portable images are defined by something called a Dockerfile; a toy sketch is below.
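
As a toy illustration only (the base image, file names, and script below are hypothetical, not the DietNet Dockerfile), a Dockerfile is just a list of build instructions; here we write one with a heredoc and build an image from it:

# Write a toy Dockerfile (hypothetical example).
cat > Dockerfile <<'EOF'
# Base image to build on.
FROM ubuntu:14.04
# Install dependencies into the image.
RUN apt-get update && apt-get install -y python-pip
RUN pip install numpy
# Copy our (hypothetical) training script into the image.
COPY train.py /app/train.py
WORKDIR /app
# Default command when a container is launched from this image.
CMD ["python", "train.py"]
EOF

# Build and tag an image from the Dockerfile (train.py must exist
# next to it for the COPY step to succeed).
sudo docker build -t my_toy_image .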

Dockerfile and DockerHub:

I have made a Docker image for a nice TensorFlow environment. Go to my GitHub and pull DietNet. In the future, we will look under the hood of this model. For now, we are going to navigate to the Dockerfile.gpu file and build our first image.

  • First, connect to your instance: something like
    sudo ssh -i ~/.ssh/my-ssh-key lstrnad@35.199.164.29
    
  • The commands below will pull either the GPU or CPU version of my TensorFlow Docker image:

    GPU:

    sudo docker pull ljstrnadiii/tensorflow_env:0.1
    

    CPU:

    sudo docker pull ljstrnadiii/tf_env_cpu:0.1
    

Now you should have a working Docker image containing my TensorFlow working environment. Let's see how to use it!
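
You can confirm the pull succeeded by listing your local images:

# The pulled image should appear in this list.
sudo docker images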

TensorFlow and Docker

Now that we have the TensorFlow image pulled, we can run the image and work in our container.

  • First, let's launch the container:

    GPU:

    sudo nvidia-docker run -it -p 81:6006 ljstrnadiii/tensorflow_env:0.1
    

    CPU:

    sudo docker run -it -p 81:6006 ljstrnadiii/tf_env_cpu:0.1
    

    We have to specify nvidia-docker for GPU TensorFlow training. The -it flags launch the image with an interactive shell. We also map host port 81 to the container's port 6006 for TensorBoard usage (see the TensorBoard sketch after this list). We should see

    root@705a3ac25b0b:/usr/local/diet_code/dietnetwork#
    

    Success!

  • Let’s make sure that tensorflow is working. Launch python by typing python in the command line and let’s copy this script into the interpretter (it should return a scalar of the final loss):

    import tensorflow as tf
    
    # Model parameters
    W = tf.Variable([.3], dtype=tf.float32)
    b = tf.Variable([-.3], dtype=tf.float32)
    # Model input and output
    x = tf.placeholder(tf.float32)
    linear_model = W * x + b
    y = tf.placeholder(tf.float32)
    
    # loss
    loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
    # optimizer
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    train = optimizer.minimize(loss)
    
    # training data
    x_train = [1, 2, 3, 4]
    y_train = [0, -1, -2, -3]
    # training loop
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init) # reset values to wrong
    for i in range(1000):
      sess.run(train, {x: x_train, y: y_train})
    
    # evaluate training accuracy
    curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
    print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
    

    Voila, our first linear model! Exit Python with quit() in the interpreter, then exit and kill the running container simply by typing exit in the terminal.
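
Since we mapped host port 81 to the container's port 6006, you can also try TensorBoard from inside the container; the log directory below is an arbitrary example and assumes your script wrote summaries there:

# Inside the container: serve any event files under /tmp/logs on port 6006.
tensorboard --logdir /tmp/logs --port 6006

# Then browse to http://<your-instance-external-IP>:81 from your own machine.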

Useful Docker Features:

Docker has some useful features that allow us to enter and exit a container, attach a volume (data source), etc. We are going to briefly cover some of these useful utilities:

  • You may want to launch a python script in a container and let it continue to run while you sign out of your instance. There are many ways to do this, but for now we will assume the workflow is
    1. Launch the image
    2. Launch the python script
    3. Log out and go for a run while the model trains.

    In order to do this, detach from the running container by pressing Ctrl+p followed by Ctrl+q.

  • You may also want to check whether any containers are currently running. sudo docker ps will give you a list of currently running containers: (if you have a running container you will get something like this)
    CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS                            NAMES
    705a3ac25b0b        tensorflow_env      "/bin/bash"         23 minutes ago      Up 23 minutes       8888/tcp, 0.0.0.0:81->6006/tcp   compassionate_clarke
    
  • After detaching from a container, you may want to get back inside it. To do this, locate the container ID (from docker ps) and run:
    sudo docker exec -it <containerID> /bin/bash
    

    This will put you back into the running container.

  • Another important feature is being able to attach a volume (a directory or files). This essentially links your specified file or directory into the container environment. Find a file or directory you want to attach, then use the -v flag when launching the container:
    sudo nvidia-docker run -it -p 81:6006 -v /path/to/your/dir:/name/and/place/dir/in/container ljstrnadiii/tensorflow_env:0.1
    
  • Lastly, if you want to kill a running container:
    sudo docker kill <containerID>
    
  • bonus: you can also launch your container in detached mode and specify the python script you want to run (a sketch follows this list). Check out the DietNet launch.sh file and the commented-out ENTRYPOINT line in the Dockerfile.
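
A minimal sketch of detached mode; the script path below is hypothetical, so substitute whatever script your image actually contains:

# -d runs the container in the background and prints its container ID.
sudo nvidia-docker run -d ljstrnadiii/tensorflow_env:0.1 \
    python /usr/local/diet_code/dietnetwork/train.py

# Follow the script's output without attaching to the container.
sudo docker logs -f <containerID>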

Exercise: Try to run a container, detach so it keeps running, check that it is running, enter back into the running container, exit again, and then kill the container.