Diet Networks: Neural Networks and the p >> n problem
Diet Networks: Thin Parameters for Fat Genomics^{1}
Diet Networks is a deep learning approach to predicting ancestry from genomic data. The number of free parameters in a fully connected neural network grows with the input dimension, and the dimension of genomic data often exceeds the number of observations by orders of magnitude. The paper proposes an alternative to a fully connected first layer that reduces the number of free parameters significantly.
Summary:
 Discuss neural networks and deep learning
 Discuss genomic data and motivate the approach of Diet Networks
 Discuss the Diet Network architecture
 Discuss the TensorFlow implementation and results
Neural Networks and Deep Learning
 Neural Networks are represented as graphical structures
 The weights, $W$, are the free parameters and are learned through maximum likelihood estimation and back propagation.
 This structure can be used to represent: Linear Regression, Multivariate Regression, Binomial Regression, Softmax Regression
 Nodes following the input layer are computed with an activation function
What about the notion of Deep Learning?
 Adding hidden layers allows the model to learn a ‘deeper’ representation.
 The Universal Approximation Theorem: a network with a single hidden layer and a nonlinear activation function can approximate any continuous function over a compact subset of $\mathbb{R}^n$.
 The parameters of the model can be represented as matrices.
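To make the matrix view concrete, here is a minimal NumPy sketch of a two-hidden-layer forward pass, where each layer is a matrix multiplication followed by a nonlinear activation (all shapes and values are illustrative, not from the paper):

```python
import numpy as np

def forward(X, W1, b1, W2, b2, W_out, b_out):
    """Two-hidden-layer forward pass: each layer is a matrix
    multiplication followed by a nonlinear activation."""
    h1 = np.tanh(X @ W1 + b1)       # first hidden representation
    h2 = np.tanh(h1 @ W2 + b2)      # second hidden representation
    logits = h2 @ W_out + b_out     # linear output layer
    # softmax over classes (softmax regression output)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Illustrative shapes: 4 samples, 10 features, two hidden layers of 5 units, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))
W1, b1 = rng.normal(size=(10, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 5)), np.zeros(5)
W_out, b_out = rng.normal(size=(5, 3)), np.zeros(3)
print(forward(X, W1, b1, W2, b2, W_out, b_out).shape)  # (4, 3)
```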
Representation Learning
 We want to learn a new representation of the data such that the classes become linearly separable in the new space.
Example:
(Image above borrowed from here)
 Nonlinear activation functions allow the model to learn this discriminating function as a linear function in a new feature space.
(Image above borrowed from here)
 Nodes in the hidden layers with nonlinear activation functions are represented as $\sigma(w^{\top}x + b)$, where $\sigma$ is the nonlinear activation function.
 The new representation of $x$ is then represented as $h = \sigma(Wx + b)$.
 The algorithm essentially explores the weight matrices, $W$, that lie along the path of gradient descent.
 These weight matrices construct the hypothesis space of functions considered in the function approximation task.
Convolutional Layers
The beginning of “deep” learning started with convolutional neural networks. The main idea is to convolve a small kernel across an image or audio signal. Navigate here for arithmetic or here for visualization.
(Image borrowed from here)
 Demonstrates convolving a kernel over the larger blue image to generate the “downsampled” output in green.
(Image borrowed from here)
 Expresses how a convolutional layer can be represented by a matrix. Notice the reduction in learnable parameters.
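To make the parameter reduction concrete, here is a small Keras comparison (layer sizes are illustrative) between a fully connected layer and a convolutional layer on the same input:

```python
import tensorflow as tf

# Dense layer on a flattened 28x28 image: every input connects to every unit.
dense = tf.keras.layers.Dense(32)
dense.build((None, 28 * 28))
print(dense.count_params())  # (784 * 32) + 32 = 25,120 parameters

# Conv layer on the same image: one 3x3 kernel is shared across all positions.
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3)
conv.build((None, 28, 28, 1))
print(conv.count_params())   # (3 * 3 * 1 * 32) + 32 = 320 parameters
```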
Unfortunately, genomic data does not have the obvious local structure between neighboring entries that image and audio data have.
Genomic Data
 The 1000 Genomes Project released one of the largest genomic data sets, covering 26 different populations.
 The data are roughly 150,000 single nucleotide polymorphisms (SNPs) for roughly 2500 people.
 SNPs are single-nucleotide variations that occur at an appreciable frequency across populations.
 The goal is to classify the ancestry of an individual based on this SNP data.
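As a rough sketch of the resulting design matrix (random placeholder values; only the shapes and the {0, 1, 2} genotype coding reflect the actual data):

```python
import numpy as np

# N ≈ 2,500 individuals, d ≈ 150,000 SNPs, 26 ancestry populations.
N, d, num_classes = 2500, 150_000, 26

rng = np.random.default_rng(0)
# Each entry is a genotype coded as 0, 1, or 2 (allocates ~375 MB as int8).
X = rng.integers(0, 3, size=(N, d), dtype=np.int8)  # placeholder genotypes
y = rng.integers(0, num_classes, size=N)            # placeholder ancestry labels

print(X.shape, y.shape)  # (2500, 150000) (2500,)
```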
Diet Networks Structure
 Diet Networks proposes a fully connected network with two auxiliary networks.
 The main use of the auxiliary network is to predict the weights of the first layer in the discriminative network.
(Image taken from Diet Networks^{1})
 A fully connected network with $d$-dimensional input data will have a $d \times n_h$ weight matrix in the first layer of the discriminative network.
 If $d = 150{,}000$ and $n_h = 100$, then we have 15,000,000 free parameters!
 The method proposed to predict the weight matrix will reduce this number significantly.
Auxiliary Network for Encoding
 The auxiliary network for encoding predicts the weight matrix, $W_e$, in the first layer of the discriminative network.
 note:
 $X$ is of size $N \times d$
 $X^T$ is of size $d \times N$
 Let hidden layers have $n_h$ units
 The first layer of the discriminative network is represented by the weight matrix, $W_e$, which is $d \times n_h$.
 The first layer in the auxiliary network has a weight matrix, $W_1$, with size $N \times n_h$.
 Then the output of the auxiliary network is $W_e = \sigma(X^T W_1)$.
 $X^T W_1$ has size $(d \times N)(N \times n_h) = d \times n_h$.
 Thus, $W_e$ is the appropriate size for the first layer in the discriminative network.
 The final number of learnable parameters to construct $W_e$ is $N \times n_h = 2{,}500 \times 100 = 250{,}000$, down from 15,000,000.
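A minimal NumPy sketch of this shape bookkeeping, using a single-layer auxiliary network (the paper uses an MLP, and the dimensions here are scaled down so the sketch runs quickly):

```python
import numpy as np

# Scaled-down sizes; in the real data N ≈ 2,500, d ≈ 150,000, n_h = 100.
N, d, n_h = 250, 15_000, 100
rng = np.random.default_rng(0)

X  = rng.normal(size=(N, d)).astype(np.float32)    # data, N x d
W1 = rng.normal(size=(N, n_h)).astype(np.float32)  # auxiliary weights, N x n_h

# The auxiliary network predicts the first-layer weights of the
# discriminative network from the transposed data matrix.
W_e = np.tanh(X.T @ W1)    # (d x N)(N x n_h) -> d x n_h

H = np.tanh(X @ W_e)       # first discriminative layer output, N x n_h
print(W_e.shape, H.shape)  # (15000, 100) (250, 100)
```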
Auxiliary Network for Decoding
 The same idea applies to the decoding auxiliary network.
 note:
 The decoding auxiliary network also outputs a $d \times n_h$ matrix, which implies the transpose of this output, $W_d$, gives a shape $n_h \times d$.
 The output of the first MLP layer, $H$, in the discriminative network is $N \times n_h$.
 Thus, $H W_d$ gives $\hat{X}$ of size $(N \times n_h)(n_h \times d) = N \times d$.
 The reconstruction $\hat{X}$ is used because it gives better results and helps with gradient flow.
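The decoder side of the same sketch (self-contained, same scaled-down sizes; $W_d$ is the transpose of the decoding auxiliary network's output):

```python
import numpy as np

# Same scaled-down sizes as the encoder sketch above.
N, d, n_h = 250, 15_000, 100
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d)).astype(np.float32)
H = np.tanh(X @ np.tanh(X.T @ rng.normal(size=(N, n_h)).astype(np.float32)))

# The decoding auxiliary network outputs a d x n_h matrix; its transpose,
# W_d (n_h x d), maps the hidden representation back to the input space.
W_d_T = np.tanh(X.T @ rng.normal(size=(N, n_h)).astype(np.float32))
X_hat = H @ W_d_T.T        # (N x n_h)(n_h x d) -> N x d reconstruction

print(X_hat.shape)                       # (250, 15000)
print(float(((X - X_hat) ** 2).mean()))  # reconstruction error term for the loss
```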
The Embedding Layer
 This implementation focuses on the histogram embedding.
 The histogram embedding is generated by calculating the frequency of each possible genotype value {0, 1, 2} for each class {1, …, 26} across each SNP {1, …, d}.
 This information is contained in a $d \times 78$ matrix, since 3 input values × 26 classes gives 78.
 This embedding is the input to a hidden layer which has $n_h$ nodes.
 Therefore, we will have a $78 \times n_h$ weight matrix to learn, but the corresponding output, $W_e$, will still be $d \times n_h$.
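A hedged NumPy sketch of the histogram embedding (the column ordering and any normalization details may differ from the paper's exact construction):

```python
import numpy as np

def histogram_embedding(X, y, num_classes=26):
    """Per-SNP histogram embedding: for each SNP, the frequency of each
    genotype value {0, 1, 2} within each class -> d x (3 * num_classes)."""
    parts = []
    for c in range(num_classes):
        Xc = X[y == c]  # individuals belonging to class c
        for v in (0, 1, 2):
            # fraction of class-c individuals with genotype v at each SNP
            parts.append((Xc == v).mean(axis=0))
    return np.stack(parts, axis=1)  # d x 78 for 26 classes

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 1000))  # toy data: 500 people, 1,000 SNPs
y = rng.integers(0, 26, size=500)
print(histogram_embedding(X, y).shape)    # (1000, 78)
```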
TensorFlow Implementation and Results
 My TensorFlow implementation can be found here.

The goal is to replicate the results of the paper.
 They provide information on the model such as
 the number of hidden units and hidden layers
 norm constraints on the gradients
 using an adaptive learning rate stochastic gradient descent optimizer
 The paper does not specify
 exactly how they regularize the parameters
 if they used batch norm
 if they used dropout
 which activation functions were used
 how they initialized the weights of the hidden layers
 or which specific optimizers were used
 The goal of this implementation is to be specific about the regularization, weight initialization, and optimizers used.
Regularization
Regularization is a way of preventing our model from overfitting. It helps decrease the generalization error.

The paper specifies that they limit the norm of the gradients (gradient clipping).

This implementation uses the following regularization techniques:
 an L2 penalty on each weight matrix (as in ridge regression)
 gradient clipping (cap the norm of the gradients at a threshold before updating)
 weight initialization (a distribution with mean zero and small variance)
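A sketch of how these three pieces look in TensorFlow 2 (the penalty weight, clipping threshold, and standard deviation are illustrative choices, not the paper's):

```python
import tensorflow as tf

# Weight initialization: zero-mean Normal with small variance.
init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)

# L2 penalty on the weight matrix, as in ridge regression.
layer = tf.keras.layers.Dense(
    100,
    kernel_initializer=init,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
)

# Gradient clipping: cap each gradient's norm at a threshold before updating.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
```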
Batch Norm
 A batch is a subset of data used for back propagation.
 Batch norm normalizes the activations over each batch during the forward pass.
 This prevents layer inputs from drifting in scale as earlier parameters change.
 That problem is known as internal covariate shift.
Dropout
 Dropout is the process of randomly turning off neurons in the model during training.
 It allows each neuron the opportunity to “vote” and prevents a subset of neurons from taking over.
 It can be viewed as a computationally cheap approximation to ensemble learning.
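A sketch of how batch norm and dropout slot into one hidden block (layer size and dropout rate are illustrative):

```python
import tensorflow as tf

hidden_block = tf.keras.Sequential([
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),  # normalize activations over each batch
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dropout(0.5),          # randomly zero half the units in training
])
```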
Activation Functions
 Each activation function has its own pros and cons.
 This implementation considers the tanh and relu nonlinear activation functions.
Optimizers
 The Diet Networks paper specifies only that they used an adaptive learning rate stochastic gradient descent back propagation learning algorithm.
 This implementation considers the Adam and RMSprop optimizers in the model selection process.
TensorFlow Implementation
 The following diagram illustrates the structure of this TensorFlow implementation
The left structure represents the auxiliary network. The right structure represents the discriminative network.
 Everywhere there is an `act_fun` or `w_init`, the choice is left open for model selection.
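As a condensed, hedged rendering of that structure (not the exact code from the linked implementation; the decoding auxiliary network and reconstruction loss are omitted for brevity):

```python
import tensorflow as tf

class DietNetwork(tf.keras.Model):
    """Sketch: an auxiliary MLP predicts the first-layer weights W_e of the
    discriminative network from the fixed d x 78 feature embedding."""

    def __init__(self, embedding, n_h=100, num_classes=26,
                 act_fun="tanh", w_init="glorot_uniform"):
        super().__init__()
        self.embedding = tf.constant(embedding, dtype=tf.float32)  # d x 78
        self.act = tf.keras.activations.get(act_fun)
        self.aux = tf.keras.layers.Dense(n_h, activation=act_fun,
                                         kernel_initializer=w_init)
        self.hidden = tf.keras.layers.Dense(n_h, activation=act_fun,
                                            kernel_initializer=w_init)
        self.out = tf.keras.layers.Dense(num_classes, activation="softmax",
                                         kernel_initializer=w_init)

    def call(self, x):
        W_e = self.aux(self.embedding)   # d x n_h, predicted each forward pass
        h = self.act(tf.matmul(x, W_e))  # first discriminative layer, N x n_h
        return self.out(self.hidden(h))  # class probabilities, N x 26
```

Both `act_fun` and `w_init` are constructor arguments, so they can be swapped out during model selection.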
Model Selection
TensorFlow has a feature called TensorBoard which helps visualize learning. TensorBoard is a web app that displays specified summary statistics. In order to perform model selection, many models are constructed.
Models Considered:
 Weight initialization using the Normal and Uniform distributions with standard deviations of 0.1 and 0.01
 tanh and relu activation functions
 Adam and RMSprop optimizers
 learning rates of 0.001 and 0.0001
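The full cross of these choices gives the 32 configurations; a sketch of the enumeration (training and logging are left as comments):

```python
import itertools

# 2 distributions x 2 std devs x 2 activations x 2 optimizers x 2 learning rates = 32
grid = list(itertools.product(
    ["normal", "uniform"],  # weight-initialization distribution
    [0.1, 0.01],            # initialization standard deviation
    ["tanh", "relu"],       # activation function
    ["Adam", "RMSprop"],    # optimizer
    [1e-3, 1e-4],           # learning rate
))
print(len(grid))  # 32

for dist, std, act, opt, lr in grid:
    run_name = f"{dist}-{std}-{act}-{opt}-{lr}"  # one TensorBoard run per model
    # build, train, and log each model under run_name ...
```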
Test set accuracy of the 32 models
The optimal model achieves about 93% accuracy, which matches the results of Diet Networks.