What's the progression of learning Deep Learning in this course?
What is the softmax function?
A function that turns a vector of raw scores (logits) into probabilities: each score is exponentiated and divided by the sum of all the exponentiated scores, so the outputs are all between 0 and 1 and sum to 1.
What happens when you multiply the scores going into the softmax function by 10?
The probabilities get closer to 0 or 1. Conversely, if you divide the scores by 10, they'll get closer to the uniform distribution. In other words, as you increase the size of the scores, the classifier becomes very confident in its predictions.
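A quick numpy sketch of this effect (the scores and scaling factors here are just illustrative):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; doesn't change the result.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([3.0, 1.0, 0.2])
print(softmax(scores))       # ~[0.836, 0.113, 0.051] -- moderately confident
print(softmax(scores * 10))  # -> close to [1, 0, 0]: very confident
print(softmax(scores / 10))  # -> close to uniform: very unsure
```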
What is a way to measure the distance between the softmax probability vector and one hot encoding vector?
Cross-entropy, denoted D for distance: D(S, L) = -Σ L_i log(S_i), where S is the softmax probability vector and L is the one-hot label vector. Note that D is not symmetric: the log is taken of the softmax probabilities, never of the one-hot labels (which contain zeros).

Give a 4 step process for Multinomial Logistic Classification

">

Describe the Loss of Multinomial Logistic Classification and its notation.
The loss is the average cross-entropy over the entire training set: Loss = (1/N) Σ_i D(S(W X_i + b), L_i). Training means finding the weights W and biases b that minimize this loss.
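A minimal numpy sketch of the whole pipeline for a single hypothetical example (shapes and values are illustrative; the training loss averages this over all examples):

```python
import numpy as np

# A hypothetical single training example: 4 input features, 3 classes.
X = np.array([0.5, -1.2, 3.0, 0.8])
W = np.random.randn(3, 4) * 0.1   # small random weights, zero mean
b = np.zeros(3)

logits = W @ X + b                 # 1) linear model: WX + b
S = np.exp(logits - logits.max())
S = S / S.sum()                    # 2) softmax turns logits into probabilities
L = np.array([0.0, 1.0, 0.0])      # 3) one-hot encoded label
loss = -np.sum(L * np.log(S))      # 4) cross-entropy D(S, L)
```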
What is the rule of 30?
Only trust an improvement if at least ~30 examples change from incorrectly to correctly classified; smaller changes are likely noise. Ex: with 3000 validation examples, you'd need at least 30 label changes (a 1% change in accuracy) before trusting the improvement as something other than noise.
Describe Stochastic Gradient Descent and its use for NNs.
Very important for NNs because full gradient descent takes a lot of computational effort and doesn't scale well. SGD instead takes a small random sample of the training data (a mini-batch), computes the gradient on that sample alone, and takes a small step; the steps are noisy, but on average they move in the right direction, and many cheap steps beat a few expensive ones.

Very important that the inputs have zero mean and equal (small) variance, and that the weights are initialized randomly with zero mean and equal variance.
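A minimal sketch of the idea, assuming a grad_fn that computes the gradient on a batch (all names and hyper-parameters here are placeholders):

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, grad_fn, lr=0.01):
    # Gradient on a small random sample instead of the full dataset.
    g = grad_fn(w, X_batch, y_batch)
    return w - lr * g

def train(w, X, y, grad_fn, lr=0.01, batch_size=128, steps=1000):
    # Sample a random mini-batch at every step.
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, size=batch_size, replace=False)
        w = sgd_step(w, X[idx], y[idx], grad_fn, lr)
    return w
```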

What are two techniques to be used with SGD?
Momentum and Learning Rate Decay.
What is Momentum?
Instead of stepping in the direction of the current (noisy) mini-batch gradient, keep a running average of past gradients and step in the direction of that running average (e.g., m <- 0.9 m + g, then w <- w - lr * m).
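As a sketch (beta = 0.9 and the learning rate are typical illustrative values, not prescribed by the course):

```python
import numpy as np

def momentum_step(w, m, grad, lr=0.01, beta=0.9):
    # Keep a running average of gradients; step along it instead of
    # the raw, noisy mini-batch gradient.
    m = beta * m + grad
    w = w - lr * m
    return w, m

w, m = np.zeros(3), np.zeros(3)
w, m = momentum_step(w, m, grad=np.array([1.0, -2.0, 0.5]))
```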
Describe why SGD has a reputation for black magic?
There are many hyper-parameters to tune, such as:
- initial learning rate
- learning rate decay
- momentum
- batch size
- weight initialization
A good rule of thumb: lower your learning rate if you run into problems.

What is AdaGrad?
It is a modification of SGD which implicitly does momentum and learning rate decay for you. Often makes learning less sensitive to hyper-parameters. HOWEVER, it often tends to be a little worse than precisely tuned SGD with momentum.
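A sketch of the textbook AdaGrad update (lr and eps are illustrative defaults, not the course's code):

```python
import numpy as np

def adagrad_step(w, cache, grad, lr=0.01, eps=1e-8):
    # Accumulate squared gradients per parameter; parameters that have
    # seen large gradients get an effectively smaller learning rate.
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w, cache = np.zeros(3), np.zeros(3)
w, cache = adagrad_step(w, cache, grad=np.array([1.0, -2.0, 0.5]))
```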
How many parameters does a linear model with n inputs, and k outputs have?
(n + 1) * k
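For example, classifying 28x28 grayscale images (n = 784 inputs) into 10 classes (k = 10) takes (784 + 1) * 10 = 7850 parameters: 784 * 10 weights plus 10 biases.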
Give an example of how a linear model will not capture relationships between inputs.
An additive relationship like y = x1 + x2 can be captured by the model. HOWEVER, if the output depends on the multiplication x1 * x2, a linear model will not be able to accurately capture this relationship.

What is Early Termination?
Stop training as soon as performance on the validation set stops improving. It is a simple and widely used way to prevent a network from overfitting the training data.
What is dropout?
Another important technique for regularization. For each training example, a random selection of the activations flowing through the network is set to zero. This forces your model to learn a redundant representation for everything, to make sure that at least some of the information remains.

If dropout doesn't work for you, you should probably be using a bigger network.

What do you do to evaluate the network trained with dropout?
You want to take the consensus of the redundant models by taking the average of the activations. Here's the trick... during training, you zero out the activations but ALSO scale the remaining activations by a factor of 2.

This way, when it comes time to average them during evaluation, you just remove the dropouts and scaling operations from your neural net. And the result is the average of the activations that is properly scaled.
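A minimal sketch of this "inverted dropout" trick (keep_prob = 0.5 matches the factor-of-2 scaling described above):

```python
import numpy as np

def dropout_train(activations, keep_prob=0.5):
    # During training: zero out a random half of the activations and
    # scale the survivors by 1/keep_prob (a factor of 2 for keep_prob=0.5),
    # so the expected activation matches evaluation time.
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

def dropout_eval(activations):
    # During evaluation: no dropout, no scaling -- the activations are
    # already a properly scaled average of the redundant sub-models.
    return activations
```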

What is a shorter way to refer to convolutional networks?
convnets
What are convolutional networks?
Neural networks that share their parameters across space.


Describe convolutional lingo to help understand convnets.
Patches are sometimes called kernels. Each slice in the output stack is called a feature map, and the number of feature maps is the depth.

Stride is the number of pixels that you're shifting each time you move your filter. A stride of 1 makes the output roughly the same size as the input; a stride of 2, roughly half the size.

What is the general idea of having stacks of convolutions?
At the bottom, you have the original big image: wide (r, g, b) but shallow. You then apply convolutions that progressively squeeze the spatial dimensions while increasing the depth, which corresponds to the semantic complexity of the representation, i.e., shapes and features.

It forms a representation where all the spatial information has been squeezed out and only parameters that map to content of the image remain.
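A sketch of such a stack in Keras, assuming a small 32x32 RGB input; the layer sizes are illustrative, not from the course:

```python
import tensorflow as tf

# Spatial dimensions shrink (stride 2) while depth grows, ending in a
# classifier over the squeezed-out representation.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, strides=2, padding="same",
                           activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, strides=2, padding="same",
                           activation="relu"),
    tf.keras.layers.Conv2D(64, 3, strides=2, padding="same",
                           activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```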

What's the difference between valid and same padding?
With valid padding, you don't go past the edge of the original image, so the output map shrinks. With same padding, you go off the edge and pad with zeros in such a way that the output map is the exact same size as the input map.
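For example, with a 28x28 input, a 3x3 filter, and stride 1: valid padding gives a 26x26 output (28 - 3 + 1 = 26), while same padding keeps it 28x28.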
What is pooling and max-pooling?
When we take all the convolutions in a neighborhood and combine them somehow. Max pooling is the most common way to go about this.

At every point in the feature map, look at a small neighborhood around that point and compute the maximum of all the responses around it.
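A minimal numpy sketch of 2x2 max pooling with stride 2 (assumes a single 2-D feature map with even height and width):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Take the maximum response in each non-overlapping 2x2 neighborhood.
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(fm))  # 2x2 output of neighborhood maxima
```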

Describe a 1x1 convolution.
If we look at the classic convolution setting, it's basically a small classifier for a patch of the image, but it's only a linear classifier. But if we add a 1x1 convolution (a 1px by 1px patch) in the middle, we have a mini neural network running over the patch instead of a linear classifier.
What is the inception model?
The idea is that at each layer of your convnet, you can make a choice: have a pooling operation or have a convolution, and then decide whether it's a 1x1, 3x3, or 5x5. So why choose? Instead of having a single convolution, you use several in parallel: average pooling followed by a 1x1 convolution, a 1x1 convolution on its own, a 1x1 followed by a 3x3, and a 1x1 followed by a 5x5.

And at the top, you simply concatenate the output of each of them.
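A sketch of one such module in Keras (the filter count f and pool size are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f=16):
    # Four parallel branches over the same input.
    p1 = layers.Conv2D(f, 1, padding="same", activation="relu")(x)
    p2 = layers.Conv2D(f, 1, padding="same", activation="relu")(x)
    p2 = layers.Conv2D(f, 3, padding="same", activation="relu")(p2)
    p3 = layers.Conv2D(f, 1, padding="same", activation="relu")(x)
    p3 = layers.Conv2D(f, 5, padding="same", activation="relu")(p3)
    p4 = layers.AveragePooling2D(3, strides=1, padding="same")(x)
    p4 = layers.Conv2D(f, 1, padding="same", activation="relu")(p4)
    # Concatenate the branch outputs along the depth dimension.
    return layers.Concatenate()([p1, p2, p3, p4])

inputs = tf.keras.Input(shape=(32, 32, 3))
outputs = inception_module(inputs)
```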

What is t-SNE?
t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction. It is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points.

What is the idea behind Recurrent Neural Network?
We used convolutional neural networks to share parameters across space to extract patterns in an image. An RNN does the same thing, except over time instead of space. Imagine you have a sequence of events, and at each point in time you want to make a decision about what has happened so far in the sequence. Since it is a sequence, you want to take into account the past, so you use the state of the previous classifier as a recursive summary of what happened before. You end up with a network with a relatively simple repeating pattern: part of your classifier connects to the input at each time step, and another part, called the recurrent connection, connects you to the past at each step.
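A minimal numpy sketch of one vanilla RNN step with shared weights across time (all dimensions here are illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    # One time step: combine the new input with the previous state
    # (the "summary of the past") using the same shared weights each step.
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

# Hypothetical dimensions: 4 input features, hidden state of size 8.
Wx, Wh, b = np.random.randn(8, 4), np.random.randn(8, 8), np.zeros(8)
h = np.zeros(8)
for x_t in np.random.randn(5, 4):  # a sequence of 5 inputs
    h = rnn_step(x_t, h, Wx, Wh, b)
```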

What is LSTM?
Long Short-Term Memory. Conceptually, an RNN consists of repeating simple units that take as input the past and the new inputs, produce a new prediction through a neural net, and connect to the future. With LSTM, we replace that basic neural net with an LSTM 'cell'.
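In a library like Keras, that swap is a one-line change (the unit count is illustrative):

```python
import tensorflow as tf

simple = tf.keras.layers.SimpleRNN(32)  # basic recurrent unit
lstm = tf.keras.layers.LSTM(32)         # LSTM cell as a drop-in replacement
```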
What is a beam search?
Beam search is an optimization of best-first search that reduces its memory requirements. Best-first search is a graph search which orders all partial solutions (states) according to some heuristic which attempts to predict how close a partial solution is to a complete solution (goal state). But in beam search, only a predetermined number of best partial solutions are kept as candidates.
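A generic Python sketch (expand, score, and is_goal are hypothetical problem-specific callbacks):

```python
import heapq

def beam_search(start, expand, score, is_goal, beam_width=3, max_steps=10):
    """Keep only the best `beam_width` partial solutions at each step.

    expand(state) -> iterable of successor states (assumed provided);
    score(state)  -> heuristic estimate, lower is better;
    is_goal(state) -> True for complete solutions.
    """
    beam = [start]
    for _ in range(max_steps):
        candidates = [s for state in beam for s in expand(state)]
        if not candidates:
            break
        # Prune: keep only the beam_width best-scoring partial solutions.
        beam = heapq.nsmallest(beam_width, candidates, key=score)
        for state in beam:
            if is_goal(state):
                return state
    return min(beam, key=score) if beam else None
```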