On Machine Intelligence

Why Artificial Intelligence and Machine Learning are changing the world

The Softmax Function Derivative (Part 3)

1 July 2020 - 23:00

Previously I’ve shown how to work out the derivative of the Softmax Function combined with the summation function, typical in artificial neural networks.

In this final part, we’ll look at how the weights in a Softmax layer change with respect to a Loss Function. The Loss Function is a measure of how “bad” the estimate from the network is. We’ll then modify the weights in the network in order to improve the “Loss”, i.e. make it less bad.

The Python code is based on the excellent article by Eli Bendersky which can be found here.

Cross Entropy Loss Function

There are different kinds of Cross Entropy functions, depending on what kind of classification you want your network to estimate. In this example, we’re going to use the Categorical Cross Entropy. This function is typically used when the network is required to estimate which class something belongs to, when there are many classes. The output of the Softmax Function is a vector of probabilities, where each element represents the network’s estimate that the input is in that class. For example:

[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]

The first element, 0.19091352, represents the network’s estimate that the input is in the first class, and so on.

Usually, the input is in one class, and we can represent the correct class for an input as a one-hot vector. In other words, the class vector is all zeros, except for a 1 in the index corresponding to the class.

[0 0 1 0 0]

In this example, the input is in class 3, represented by a 1 in the third element.
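As a quick illustration (a sketch, not from the original post: the `one_hot` helper name is hypothetical), NumPy’s `eye` gives a neat way to build such vectors:

```python
import numpy as np

def one_hot(index, num_classes):
    # take the row of the identity matrix corresponding to the class
    return np.eye(num_classes)[index]

y = one_hot(2, 5)  # class 3 is index 2 (zero-based)
print(y)  # [0. 0. 1. 0. 0.]
```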

The multi-class Cross Entropy Function is defined as follows:

XE(y, S) = -Σ_{c=1..M} y_{o,c} · log(S_{o,c})

where M is the number of classes, y is the one-hot vector representing the correct classification c for the observation o (i.e. the input), and S is the Softmax output for the class c for the observation o. Here is some code to calculate that (which continues from my previous posts on this topic):

def x_entropy(y, S):
    return np.sum(-1 * y * np.log(S))

y = np.zeros(5)
y[2] = 1 # picking the third class for example purposes
xe = x_entropy(y, S)
print(xe)

1.5279347484961026

Cross Entropy Derivative

Just like the other derivatives we’ve looked at before, the Cross Entropy derivative is a vector of partial derivatives with respect to its input:

∂XE/∂S = [∂XE/∂S_1, ∂XE/∂S_2, ..., ∂XE/∂S_M]

We can make this a little simpler by observing that since y (i.e. the ground truth classification vector) is all zeros, except for the target class, c, the Cross Entropy derivative vector is also going to be all zeros, except for the class c.

To see why this is the case, let’s examine the Cross Entropy function itself. We calculate it by summing up a product. Each product is the value from Y multiplied by the log of the corresponding value from S. Since all the elements in Y are actually 0 (except for the target class, c), then the corresponding derivative will also be 0. No matter how much we change the values in S, the result will still be 0.
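We can illustrate this numerically (a sketch using the loss function and example values from this post): nudging a non-target element of S leaves the loss unchanged, while nudging the target element does not.

```python
import numpy as np

# Categorical Cross Entropy, as defined in this post
def x_entropy(y, S):
    return np.sum(-1 * y * np.log(S))

# the Softmax output and one-hot target from the example above
S = np.array([0.19091352, 0.20353145, 0.21698333, 0.23132428, 0.15724743])
y = np.zeros(5)
y[2] = 1

base = x_entropy(y, S)

# nudge a non-target element of S: the loss does not move
S_non_target = S.copy()
S_non_target[0] += 0.01
diff_non_target = x_entropy(y, S_non_target) - base

# nudge the target element: the loss changes
S_target = S.copy()
S_target[2] += 0.01
diff_target = x_entropy(y, S_target) - base

print(diff_non_target)  # 0.0
print(diff_target)      # negative, since S_c increased
```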


We can rewrite this a little, expanding out the XE function for the target class c:

∂XE/∂S_c = ∂/∂S_c ( -y_c · log(S_c) )

We already know that y_c is 1, so we are left with:

∂XE/∂S_c = -∂/∂S_c log(S_c)

So we are just looking for the derivative of the log of S_c:

∂XE/∂S_c = -1/S_c

The rest of the elements in the vector will be 0. Here is the code that works that out:

def xe_dir(y, S):
    return (-1 / S) * y

DXE = xe_dir(y, S)
print(DXE)

[-0. -0. -4.60864 -0. -0. ]

Bringing it all together

When we have a neural network layer, we want to change the weights in order to make the loss as small as possible. So we are trying to calculate:

∂XE/∂W

for each of the input instances X. Since XE is a function that depends on the Softmax function, which itself depends on the summation function in the neurons, we can use the calculus chain rule as follows:

∂XE/∂W = ∂XE/∂S · ∂S/∂Z · ∂Z/∂W

In this post, we’ve calculated ∂XE/∂S, and in the previous posts, we calculated ∂S/∂Z and ∂Z/∂W. To calculate the overall changes to the weights, we simply carry out a dot product of all those matrices:

print(np.dot(DXE, DL_shortcut).reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
 [ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
 [ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]

Shortcut

Now that we’ve seen how to calculate the individual parts of the derivative, we can look to see whether there is a shortcut that avoids all that matrix multiplication, especially since so many of the elements are zeros.

Previously, we had established that the elements in the derivative matrix of the Softmax layer can be calculated using:

∂S_t/∂w_ij = S_t · (1 - S_i) · x_j

where the input and output indices are the same (i = t), and

∂S_t/∂w_ij = -S_t · S_i · x_j

where they are different.

Using this result, we can see that an element in the derivative of the Cross Entropy function XE, with respect to the weights W, is (swapping c for t):

∂XE/∂w_ij = ∂XE/∂S_c · ∂S_c/∂w_ij

We’ve shown above that the derivative of XE with respect to S_c is just -1/S_c. So each element in the derivative where i = c becomes:

∂XE/∂w_ij = (-1/S_c) · S_c · (1 - S_i) · x_j

This simplifies to:

∂XE/∂w_ij = (S_i - 1) · x_j

Similarly, where i ≠ c:

∂XE/∂w_ij = (-1/S_c) · (-S_c · S_i) · x_j = S_i · x_j

Here is the corresponding Python code for that:

def xe_dir_shortcut(W, S, x, y):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1]))
    for i in range(0, W.shape[1]):
        for j in range(0, W.shape[0]):
            dir_matrix[(i*W.shape[0]) + j] = (S[i] - y[i]) * x[j]
    return dir_matrix

delta_w = xe_dir_shortcut(W, h, x, y)

Let’s verify that this gives us the same results as the longer matrix multiplication above:

print(delta_w.reshape(W.shape))

[[ 0.01909135  0.09545676  0.07636541  0.02035314  0.10176572]
 [ 0.08141258 -0.07830167 -0.39150833 -0.31320667  0.02313243]
 [ 0.11566214  0.09252971  0.01572474  0.07862371  0.06289897]]

Now we have a simple function that will calculate the changes to the weights for a seemingly complicated single-layer of a neural network.
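As a sketch of how this gradient might then be used (the `learning_rate` value and the stand-in gradient here are assumptions, not from the original post), a single gradient-descent step subtracts the scaled adjustments from the weights:

```python
import numpy as np

# the weights from the worked example
W = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
              [0.6, 0.7, 0.8, 0.9, 0.1],
              [0.11, 0.12, 0.13, 0.14, 0.15]])

# stand-in gradient: in the post this would come from xe_dir_shortcut(W, h, x, y)
delta_w = np.full(W.size, 0.05)

learning_rate = 0.1  # assumed value, not from the original post
W_new = W - learning_rate * delta_w.reshape(W.shape)
print(W_new)
```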


The Softmax Function Derivative (Part 2)

14 June 2020 - 18:58

In a previous post, I showed how to calculate the derivative of the Softmax function. This function is widely used in Artificial Neural Networks, typically in the final layer, in order to estimate the probability that the network’s input is in one of a number of classes.

In this post, I’ll show how to calculate the derivative of the whole Softmax Layer rather than just the function itself.

The Python code is based on the excellent article by Eli Bendersky which can be found here.

The Softmax Layer

A Softmax Layer in an Artificial Neural Network is typically composed of two functions. The first is the usual sum of all the weighted inputs to the layer. The output of this is then fed into the Softmax function which will output the probability distribution across the classes we are trying to predict. Here’s an example with three inputs and five classes:

For a given output z_i, the calculation is very straightforward:

z_i = Σ_j w_ij · x_j

We simply multiply each input to the node by its corresponding weight. Expressing this in vector notation gives us the familiar:

Z = Wᵀ · x

The vector w is two dimensional, so it’s actually a matrix, and we can visualise the formula for our example as follows:

I’ve already covered the Softmax Function itself in the previous post, so I’ll just repeat it here for completeness:

S_i = e^(z_i) / Σ_j e^(z_j)

Here’s the python code for that:

import numpy as np

# input vector
x = np.array([0.1, 0.5, 0.4])

# using some hard coded values for the weights
# rather than random numbers to illustrate how
# it works
W = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
              [0.6, 0.7, 0.8, 0.9, 0.1],
              [0.11, 0.12, 0.13, 0.14, 0.15]])

# Softmax function
def softmax(Z):
    eZ = np.exp(Z)
    sm = eZ / np.sum(eZ)
    return sm

Z = np.dot(np.transpose(W), x)
h = softmax(Z)
print(h)

Which should give us the output h (the hypothesis):

[0.19091352 0.20353145 0.21698333 0.23132428 0.15724743]

Calculating the Derivative

The Softmax layer is a combination of two functions, the summation followed by the Softmax function itself. Mathematically, this is usually written as:

h = S(Z(x, W))

The next thing to note is that we will be trying to calculate the change in the hypothesis h with respect to changes in the weights, not the inputs. The overall derivative of the layer that we are looking for is:

∂h/∂W

We can use the differential chain rule to calculate the derivative of the layer as follows:

∂h/∂W = ∂S/∂Z · ∂Z/∂W

In the previous post, I showed how to work out dS/dZ and just for completeness, here is a short Python function to carry out the calculation:

def sm_dir(S):
    S_vector = S.reshape(S.shape[0], 1)
    S_matrix = np.tile(S_vector, S.shape[0])
    S_dir = np.diag(S) - (S_matrix * np.transpose(S_matrix))
    return S_dir

DS = sm_dir(h)
print(DS)

The output of that function is a matrix as follows:

[[ 0.154465 -0.038856 -0.041425 -0.044162 -0.030020]
 [-0.038856  0.162106 -0.044162 -0.047081 -0.032004]
 [-0.041425 -0.044162  0.169901 -0.050193 -0.034120]
 [-0.044162 -0.047081 -0.050193  0.177813 -0.036375]
 [-0.030020 -0.032004 -0.034120 -0.036375  0.132520]]

Derivative of Z

Let’s next look at the derivative of the function Z() with respect to W, i.e. dZ/dW. We are trying to find the change in each of the elements of Z(), z_k, when each of the weights w_ij is changed.

So right away, we are going to need a matrix to hold all of those values. Let’s assume that the output vector of Z() has K elements. There are (i × j) individual weights in W. Therefore, our matrix of derivatives is going to be of dimensions (K, (i × j)). Each of the elements of the matrix will be a partial derivative of the output z_k with respect to the particular weight w_ij:

∂z_k/∂w_ij

Taking one of those elements, using our example above, we can see how to work out the derivative:

z_1 = w_11 · x_1 + w_12 · x_2 + w_13 · x_3

None of the other weights are used in z_1. The partial derivative of z_1 with respect to w_11 is x_1. Likewise, the partial derivative of z_1 with respect to w_12 is x_2, and with respect to w_13 is x_3. The derivative of z_1 with respect to the rest of the weights is 0.
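We can sanity-check this with a quick finite-difference sketch, using the x and W values from the code above: nudging w11 by a small epsilon changes z1 by approximately x1 times epsilon, and leaves the other outputs untouched.

```python
import numpy as np

x = np.array([0.1, 0.5, 0.4])
W = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
              [0.6, 0.7, 0.8, 0.9, 0.1],
              [0.11, 0.12, 0.13, 0.14, 0.15]])

# the summation function of the layer
def Z(W, x):
    return np.dot(np.transpose(W), x)

eps = 1e-6
W_pert = W.copy()
W_pert[0][0] += eps  # nudge w11

# finite-difference estimate of dZ/dw11
dZ = (Z(W_pert, x) - Z(W, x)) / eps
print(dZ)  # approximately [0.1, 0, 0, 0, 0], i.e. x1 for z1 and 0 elsewhere
```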

This makes the whole matrix rather simple to derive, since it is mostly zeros. Where the elements are not zero (i.e. where i = k), the value is x_j. Here is the corresponding Python code to calculate that matrix.

# derivative of the Summation Function Z w.r.t. weight matrix W, given inputs x
def z_dir(Z, W, x):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1], Z.shape[0]))
    for k in range(0, Z.shape[0]):
        for i in range(0, W.shape[1]):
            for j in range(0, W.shape[0]):
                if i == k:
                    dir_matrix[(i*W.shape[0]) + j][k] = x[j]
    return dir_matrix

If we use the example above, then the derivative matrix will look like this:

DZ = z_dir(Z, W, x)
print(DZ)

[[0.1 0.  0.  0.  0. ]
 [0.5 0.  0.  0.  0. ]
 [0.4 0.  0.  0.  0. ]
 [0.  0.1 0.  0.  0. ]
 [0.  0.5 0.  0.  0. ]
 [0.  0.4 0.  0.  0. ]
 [0.  0.  0.1 0.  0. ]
 [0.  0.  0.5 0.  0. ]
 [0.  0.  0.4 0.  0. ]
 [0.  0.  0.  0.1 0. ]
 [0.  0.  0.  0.5 0. ]
 [0.  0.  0.  0.4 0. ]
 [0.  0.  0.  0.  0.1]
 [0.  0.  0.  0.  0.5]
 [0.  0.  0.  0.  0.4]]

Going back to the formula for the derivative of the Softmax Layer:

∂h/∂W = ∂S/∂Z · ∂Z/∂W

We now just take the dot product of both of the derivative matrices to get the derivative for the whole layer:

DL = np.dot(DS, np.transpose(DZ))
print(DL)

[[ 0.01544  0.07723  0.06178 -0.00388 -0.01942 -0.01554 -0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766 -0.00300 -0.01501 -0.01200]
 [-0.00388 -0.01942 -0.01554  0.01621  0.0810   0.06484 -0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883 -0.00320 -0.01600 -0.01280]
 [-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766  0.01699  0.08495  0.06796 -0.00501 -0.02509 -0.02007 -0.00341 -0.01706 -0.01364]
 [-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883 -0.00501 -0.02509 -0.02007  0.01778  0.08890  0.07112 -0.00363 -0.01818 -0.01455]
 [-0.00300 -0.01501 -0.01200 -0.00320 -0.01600 -0.01280 -0.00341 -0.01706 -0.01364 -0.00363 -0.01818 -0.01455  0.01325  0.06626  0.05300]]

Shortcut!

While it is instructive to see the matrices being derived explicitly, it is possible to manipulate the formulas to make things easier. Starting with one of the entries in the matrix DL, it looks like this:

∂S_t/∂w_ij = Σ_k ( ∂S_t/∂z_k · ∂z_k/∂w_ij )

Since the matrix dZ/dW is mostly zeros, we can try to simplify this. dZ/dW is non-zero when i = k, and then it is equal to x_j, as we worked out above. So we can simplify the non-zero entries to:

∂S_t/∂w_ij = ∂S_t/∂z_i · x_j

In the previous post, we established that when the indices are the same (i = t), then:

∂S_t/∂z_i = S_t · (1 - S_i)

so:

∂S_t/∂w_ij = S_t · (1 - S_i) · x_j

When the indices are not the same, we use:

∂S_t/∂z_i = -S_t · S_i

giving:

∂S_t/∂w_ij = -S_t · S_i · x_j

What these two formulas show is that it is possible to calculate each of the entries in the derivative matrix by using only the input values X and the Softmax output S, skipping the matrix dot product altogether.

Here is the Python code corresponding to that:

def l_dir_shortcut(W, S, x):
    dir_matrix = np.zeros((W.shape[0] * W.shape[1], W.shape[1]))
    for t in range(0, W.shape[1]):
        for i in range(0, W.shape[1]):
            for j in range(0, W.shape[0]):
                dir_matrix[(i*W.shape[0]) + j][t] = S[t] * ((i==t) - S[i]) * x[j]
    return dir_matrix

DL_shortcut = np.transpose(l_dir_shortcut(W, h, x))

To verify that, we can cross check it with the matrix we derived from first principles:

print(DL_shortcut)

[[ 0.01544  0.07723  0.06178 -0.00388 -0.01942 -0.01554 -0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766 -0.00300 -0.01501 -0.01200]
 [-0.00388 -0.01942 -0.01554  0.01621  0.08105  0.06484 -0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883 -0.00320 -0.01600 -0.01280]
 [-0.00414 -0.02071 -0.01657 -0.00441 -0.02208 -0.01766  0.01699  0.08495  0.06796 -0.00501 -0.02509 -0.02007 -0.00341 -0.01706 -0.01364]
 [-0.00441 -0.02208 -0.01766 -0.00470 -0.02354 -0.01883 -0.00501 -0.02509 -0.02007  0.01778  0.08890  0.07112 -0.00363 -0.01818 -0.01455]
 [-0.00300 -0.01501 -0.01200 -0.00320 -0.01600 -0.01280 -0.00341 -0.01706 -0.01364 -0.00363 -0.01818 -0.01455  0.01325  0.06626  0.05300]]

Lastly, it’s worth noting that in order to actually modify each of the weights, we need to sum up the individual adjustments in each of the corresponding columns.
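As a sketch of that final step (assuming a derivative matrix shaped like DL_shortcut above, with one row per output class and one column per weight; the values here are stand-ins, not from the post), the per-weight adjustment is just the column sum:

```python
import numpy as np

# stand-in for the (5, 15) derivative matrix computed above
DL_shortcut = np.arange(75, dtype=float).reshape(5, 15)

# sum the adjustments for each weight across all outputs
weight_adjustments = DL_shortcut.sum(axis=0)
print(weight_adjustments.shape)  # (15,) - one total adjustment per weight
```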
