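Think of a neuron as a computational unit whose inputs are $x_1, x_2, x_3$ (plus a +1 intercept term) and whose output is $h_{W,b}(x) = f(W^T x) = f(\sum_{i=1}^{3} W_i x_i + b)$, where the function $f$ is called the activation function. Here the activation function is the sigmoid function: $f(z) = \frac{1}{1 + \exp(-z)}$.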
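As a small illustration of the single neuron described above, the following MATLAB sketch evaluates $f(W^T x + b)$ with a sigmoid activation. The weights `W`, bias `b`, and input `x` are made-up values chosen only for this example.

```matlab
% Sigmoid activation: f(z) = 1 / (1 + exp(-z)), applied elementwise.
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters for one neuron with three inputs.
W = [0.5; -1.0; 2.0];   % one weight per input x1, x2, x3
b = 0.25;               % weight on the +1 intercept term
x = [1.0; 2.0; 0.5];    % example input

z = W' * x + b;         % weighted sum of the inputs plus the bias
h = sigmoid(z);         % neuron output, always strictly between 0 and 1

fprintf('z = %.4f, h = %.4f\n', z, h);
```

Because the sigmoid maps any real number into (0, 1), the output can be read as a soft on/off signal.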


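Let $n_l$ denote the number of layers in the network, so $n_l = 3$ in this example. Let $L_l$ denote layer $l$, so $L_1$ is the input layer and $L_{n_l}$ is the output layer. The network parameters are $(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where $W^{(l)}_{ij}$ is the weight of the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$, and $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$. In this example, therefore, $W^{(1)} \in \mathbb{R}^{3 \times 3}$ and $W^{(2)} \in \mathbb{R}^{1 \times 3}$. Note that bias units have no inputs, because a bias unit always outputs +1. Let $s_l$ denote the number of units in layer $l$ (not counting the bias unit).
$a^{(l)}_i$ denotes the activation (output value) of unit $i$ in layer $l$. For $l = 1$, $a^{(1)}_i = x_i$ denotes the $i$-th input. Given parameters $W, b$, the network's hypothesis $h_{W,b}(x)$ outputs a real number. It is computed as follows:

$a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)$
$a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)$
$a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)$
$h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1)$

$z^{(l)}_i$ denotes the total weighted sum of inputs to unit $i$ in layer $l$ (including the bias term), for example $z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i$, so $a^{(l)}_i = f(z^{(l)}_i)$.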
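The forward computation described above can be sketched in vectorized form. This is a minimal MATLAB example for a 3-3-1 network. All parameter values are arbitrary placeholders, not values from the tutorial.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters for a 3-3-1 network (values are arbitrary).
W1 = [0.1 0.2 0.3; -0.1 0.0 0.1; 0.5 -0.5 0.2];  % 3x3: layer 1 -> layer 2
b1 = [0.0; 0.1; -0.1];                            % 3x1
W2 = [0.3 -0.2 0.4];                              % 1x3: layer 2 -> layer 3
b2 = 0.05;                                        % 1x1

x  = [1; 0; -1];          % a^(1) = x

z2 = W1 * x + b1;         % z^(2) = W^(1) a^(1) + b^(1)
a2 = sigmoid(z2);         % a^(2) = f(z^(2))
z3 = W2 * a2 + b2;        % z^(3) = W^(2) a^(2) + b^(2)
h  = sigmoid(z3);         % h_{W,b}(x) = a^(3)

disp(size(a2));           % one activation per hidden unit
disp(h);                  % a single real number in (0, 1)
```

Writing each layer as a matrix-vector product avoids explicit loops over units. That is also how the cost function later in this post processes all training examples at once.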

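Training such a network requires training examples $(x^{(i)}, y^{(i)})$ in which each $y^{(i)}$ is a vector with one entry per output unit. This kind of network is useful when you need to predict several values at once. For example, in a medical diagnosis application, the vector $x$ holds a patient's features, and the different outputs $y_i$ might indicate whether different diseases are present.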

Exercise: Sparse Autoencoder
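An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation with the target values set equal to the inputs, i.e. $y^{(i)} = x^{(i)}$.
The average activation of hidden unit $j$ (averaged over the training set) is $\hat\rho_j = \frac{1}{m} \sum_{i=1}^{m} a^{(2)}_j(x^{(i)})$.
We impose the constraint $\hat\rho_j = \rho$, where $\rho$ is the sparsity parameter, typically a small value close to 0 (for example $\rho = 0.05$).
To enforce this constraint, we add an extra penalty term to the optimization objective. It penalizes $\hat\rho_j$ deviating significantly from $\rho$, which keeps the average activation of the hidden units small (sparsity).
Property of the KL divergence: $\mathrm{KL}(\rho \| \hat\rho_j) = 0$ when $\hat\rho_j = \rho$, and it increases monotonically as $\hat\rho_j$ moves away from $\rho$.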
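To make this property concrete, the sketch below evaluates the KL divergence between two Bernoulli distributions with means `rho` and `rhoHat`, which is the form used in the cost function code later in this post. The sample values in `rhoHat` are hypothetical.

```matlab
% KL divergence between Bernoulli(rho) and Bernoulli(rhoHat), elementwise.
kl = @(rho, rhoHat) rho .* log(rho ./ rhoHat) + ...
                    (1 - rho) .* log((1 - rho) ./ (1 - rhoHat));

rho    = 0.05;                      % target sparsity
rhoHat = [0.01 0.05 0.2 0.5 0.9];   % hypothetical average activations

penalty = kl(rho, rhoHat);

% First row: rhoHat, second row: penalty. The penalty is zero only
% where rhoHat equals rho and grows as rhoHat moves away from it.
disp([rhoHat; penalty]);
```

The penalty also blows up as `rhoHat` approaches 0 or 1, so it discourages average activations near either extreme.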
To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement trick involving only a small change to your code.
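For reference, the penalized objective and the resulting change to the hidden-layer error term can be written out as follows. This is the standard form from the UFLDL notes, and it matches the `Jsparse` and `sterm` lines in sparseAutoencoderCost.m later in this post. For the sigmoid activation, $f'(z^{(2)}_i) = a^{(2)}_i (1 - a^{(2)}_i)$.

```latex
\[
J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat\rho_j),
\qquad
\mathrm{KL}(\rho \,\|\, \hat\rho_j) = \rho \log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j}
\]
\[
\delta^{(2)}_i = \left( \sum_{j=1}^{s_3} W^{(2)}_{ji}\,\delta^{(3)}_j
  + \beta\left( -\frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i)
\]
```

The only change relative to plain backpropagation is the extra $\beta(\cdot)$ term inside the parentheses. Because it depends on $\hat\rho_i$, a forward pass over the whole training set is needed before the backward pass.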
Exercise: Sparse Autoencoder experiment
The first step is to sample 10000 image patches (8×8) from the 10 images (512×512 each).
sample 10000 image patches (8×8) and concatenate them into a 64×10000 matrix
display a random sample of 204 patches from the dataset
visibleSize = 8*8; % number of input units
hiddenSize = 25; % number of hidden units
sparsityParam = 0.01; % desired average activation of the hidden units.
% (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
% in the lecture notes).
lambda = 0.0001; % weight decay parameter
beta = 3; % weight of sparsity penalty term
Iteration   FunEvals   Step Length   Function Val   Opt Cond
        1          3
        2          4
      ...
      399        414
      400        415
Exceeded Maximum Number of Iterations
Elapsed time is 20.942873 seconds.
The tutorial says: "Our implementation took around 5 minutes to run on a fast computer."
Result: the sparse autoencoder algorithm learns a set of edge detectors.
%% CS294A/CS294W Programming Assignment Starter Code
%  This file contains code that helps you get started on the
%  programming assignment. You will need to complete the code in sampleIMAGES.m,
%  sparseAutoencoderCost.m and computeNumericalGradient.m.
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file.
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to
%  change the parameters below.
visibleSize = 8*8;      % number of input units
hiddenSize = 25;        % number of hidden units
sparsityParam = 0.01;   % desired average activation of the hidden units.
                        % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                        %  in the lecture notes).
lambda = 0.0001;        % weight decay parameter
beta = 3;               % weight of sparsity penalty term
%% STEP 1: Implement sampleIMAGES
%  After implementing sampleIMAGES, the display_network command should
%  display a random sample of 200 patches from the dataset
patches = sampleIMAGES;
%  Obtain random parameters theta
theta = initializeParameters(hiddenSize, visibleSize);
%% STEP 2: Implement sparseAutoencoderCost
%  You can implement all of the components (squared error cost, weight decay term,
%  sparsity penalty) in the cost function at once, but it may be easier to do
%  it step-by-step and run gradient checking (see STEP 3) after each step.  We
%  suggest implementing the sparseAutoencoderCost function using the following steps:
%
%  (a) Implement forward propagation in your neural network, and implement the
%      squared error term of the cost function.  Implement backpropagation to
%      compute the derivatives.  Then (using lambda=beta=0), run Gradient Checking
%      to verify that the calculations corresponding to the squared error cost
%      term are correct.
%
%  (b) Add in the weight decay term (in both the cost function and the derivative
%      calculations), then re-run Gradient Checking to verify correctness.
%
%  (c) Add in the sparsity penalty term, then re-run Gradient Checking to
%      verify correctness.
%
%  Feel free to change the training settings when debugging your
%  code.  (For example, reducing the training set size or
%  number of hidden units may make your code run faster; and setting beta
%  and/or lambda to zero may be helpful for debugging.)  However, in your
%  final submission of the visualized weights, please use parameters we
%  gave in Step 0 above.
[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, lambda, ...
sparsityParam, beta, patches);
%% STEP 3: Gradient Checking
% Hint: If you are debugging your code, performing gradient checking on smaller models
% and smaller training sets (e.g., using only 10 training examples and 1-2 hidden
% units) may speed things up.
% First, let's make sure your numerical gradient computation is correct for a
% simple function.  After you have implemented computeNumericalGradient.m,
% run the following:
% checkNumericalGradient();
% % Now we can use it to check your cost function and derivative calculations
% % for the sparse autoencoder.
% numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
%                                                   hiddenSize, lambda, ...
%                                                   sparsityParam, beta, ...
%                                                   patches), theta);
% % Use this to visually compare the gradients side by side
% disp([numgrad grad]);
% % Compare numerically computed gradients with the ones obtained from backpropagation
% diff = norm(numgrad-grad)/norm(numgrad+grad);
% disp(diff); % Should be small. In our implementation, these values are
% usually less than 1e-9.
% When you got this working, Congratulations!!!
%% STEP 4: After verifying that your implementation of
%  sparseAutoencoderCost is correct, you can start training your sparse
%  autoencoder with minFunc (L-BFGS).
%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, visibleSize);
%  Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
% function. Generally, for minFunc to work, you
% need a function pointer with two outputs: the
% function value and the gradient. In our problem,
% sparseAutoencoderCost.m satisfies this.
options.maxIter = 400;
% Maximum number of iterations of L-BFGS to run
options.display = 'on';
[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
visibleSize, hiddenSize, ...
lambda, sparsityParam, ...
beta, patches), ...
theta, options);
%% STEP 5: Visualization
W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
display_network(W1', 12);
print -djpeg weights.jpg
% save the visualization to a file
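The `reshape` call in STEP 5 relies on how the parameters are packed into the single vector that minFunc optimizes. The sketch below shows the round trip with deliberately tiny, hypothetical sizes. The packing order (W1, W2, b1, b2) follows the cost function code in this post.

```matlab
visibleSize = 4;  hiddenSize = 2;       % small hypothetical sizes

% Placeholder parameters with recognizable values.
W1 = reshape(1:hiddenSize*visibleSize, hiddenSize, visibleSize);
W2 = -reshape(1:hiddenSize*visibleSize, visibleSize, hiddenSize);
b1 = zeros(hiddenSize, 1);
b2 = ones(visibleSize, 1);

% Unroll into a single column vector, in the order W1, W2, b1, b2.
theta = [W1(:); W2(:); b1(:); b2(:)];

% Recover each block by slicing and reshaping.
nW  = hiddenSize * visibleSize;
W1r = reshape(theta(1:nW), hiddenSize, visibleSize);
W2r = reshape(theta(nW+1:2*nW), visibleSize, hiddenSize);
b1r = theta(2*nW+1 : 2*nW+hiddenSize);
b2r = theta(2*nW+hiddenSize+1 : end);

disp(numel(theta));                     % 2*nW + hiddenSize + visibleSize = 22
```

With the exercise's sizes (64 visible and 25 hidden units) the same layout gives a vector of 2*64*25 + 25 + 64 = 3289 parameters.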
function patches = sampleIMAGES()
% sampleIMAGES
% Returns 10000 patches for training
load IMAGES;
% load images from disk
patchsize = 8;
% we'll use 8x8 patches
numpatches = 10000;
% Initialize patches with zeros.  Your code will fill in this matrix--one
% column per patch, 10000 columns.
patches = zeros(patchsize*patchsize, numpatches);
%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Fill in the variable called "patches" using data
%  from IMAGES.
%
%  IMAGES is a 3D array containing 10 images.
%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
%  and you can type "imagesc(IMAGES(:,:,6))" to visualize
%  it. (The contrast on these images looks a bit off because they have
%  been preprocessed using "whitening."  See the lecture notes for
%  more details.) As a second example, IMAGES(21:30,21:30,1) is an image
%  patch corresponding to the pixels in the block (21,21) to (30,30) of
%  Image 1.
for imageNum = 1 : 10
    img = IMAGES(:, :, imageNum);
    [rowNum, colNum] = size(img);
    for patchNum = 1 : 1000
        % Random top-left corner, chosen so the patch stays inside the image.
        xPos = randi(rowNum - patchsize + 1);
        yPos = randi(colNum - patchsize + 1);
        patches(:, 1000 * (imageNum - 1) + patchNum) = ...
            reshape(img(xPos : xPos + patchsize - 1, yPos : yPos + patchsize - 1), ...
                    patchsize * patchsize, 1);
    end
end
%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure
% the range of pixel values is also bounded between [0,1]
patches = normalizeData(patches);
%% ---------------------------------------------------------------
function patches = normalizeData(patches)
% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer
% Remove DC (mean of images).
patches = bsxfun(@minus, patches, mean(patches));
% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;
% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;
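To see what the three normalization steps do, here is a small self-contained sketch that applies them to made-up data containing one extreme outlier. The input values are hypothetical; the steps themselves are copied from `normalizeData` above.

```matlab
% Hypothetical data: two 64-pixel "patches", the second with one extreme outlier.
patches = [linspace(-1, 1, 64)', [zeros(63, 1); 50]];

% Remove the per-patch mean (DC component).
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to [-1, 1].
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;

% Rescale from [-1, 1] to [0.1, 0.9].
patches = (patches + 1) * 0.4 + 0.1;

fprintf('min = %.4f, max = %.4f\n', min(patches(:)), max(patches(:)));
```

The outlier is clipped at three standard deviations, so it lands exactly on the upper bound 0.9 instead of stretching the scale for every other pixel. Keeping the data inside [0.1, 0.9] keeps the targets away from the saturated ends of the sigmoid output.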
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
lambda, sparsityParam, beta, data)
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data.  So, data(:,i) is the i-th training example.
% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));
%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
%
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.
Jcost = 0;
Jweight = 0;
Jsparse = 0;
m = size(data, 2);                          % number of training examples

% Forward propagation over all examples at once
z2 = W1*data+repmat(b1,1,m);
a2 = sigmoid(z2);
z3 = W2*a2+repmat(b2,1,m);
a3 = sigmoid(z3);

% Squared-error term, weight decay term, and sparsity penalty
Jcost = (0.5/m)*sum(sum((a3-data).^2));
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));
rho = (1/m).*sum(a2,2);                     % average activation of each hidden unit
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...
              (1-sparsityParam).*log((1-sparsityParam)./(1-rho)));
cost = Jcost+lambda*Jweight+beta*Jsparse;

% Backpropagation; sterm is the extra sparsity term in the hidden-layer delta
d3 = -(data-a3).*dsigmoid(a3);
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));
d2 = (W2'*d3+repmat(sterm,1,m)).*dsigmoid(a2);

W1grad = W1grad+d2*data';
W1grad = (1/m)*W1grad+lambda*W1;
W2grad = W2grad+d3*a2';
W2grad = (1/m).*W2grad+lambda*W2;
b1grad = b1grad+sum(d2,2);
b1grad = (1/m)*b1grad;
b2grad = b2grad+sum(d3,2);
b2grad = (1/m)*b2grad;
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).
% Specifically, we will unroll
% your gradient matrices into a vector.
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.
% This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
function dsigm = dsigmoid(a)
dsigm = a .* (1.0 - a);
function numgrad = computeNumericalGradient(J, theta)
% numgrad = computeNumericalGradient(J, theta)
% theta: a vector of parameters
% J: a function that outputs a real-number. Calling y = J(theta) will return the
% function value at theta.
% Initialize numgrad with zeros
numgrad = zeros(size(theta));
%% ---------- YOUR CODE HERE --------------------------------------
% Instructions:
% Implement numerical gradient checking, and return the result in numgrad.
% (See Section 2.3 of the lecture notes.)
% You should write code so that numgrad(i) is (the numerical approximation to) the
% partial derivative of J with respect to the i-th input argument, evaluated at theta.
% I.e., numgrad(i) should be the (approximately) the partial derivative of J with
% respect to theta(i).
% Hint: You will probably want to compute the elements of numgrad one at a time.
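epsilon = 1e-4;
n = numel(theta);
for i = 1:n
    % Perturb only the i-th parameter (avoids building an n-by-n identity matrix).
    delta = zeros(size(theta));
    delta(i) = epsilon;
    % Central difference approximation of the i-th partial derivative.
    numgrad(i) = (J(theta+delta)-J(theta-delta))/(epsilon*2.0);
end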
%% ---------------------------------------------------------------
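The central-difference check can be tried on a function whose gradient is known in closed form before trusting it on the autoencoder. The sketch below uses a simple quadratic chosen for illustration; the function `J`, the point `theta`, and the name `relErr` are assumptions of this example, and the loop is repeated inline so the snippet runs on its own.

```matlab
% Simple test function and its exact gradient.
J            = @(x) x(1)^2 + 3 * x(1) * x(2);
analyticGrad = @(x) [2*x(1) + 3*x(2); 3*x(1)];

theta   = [4; 10];
epsilon = 1e-4;

% Central-difference approximation, one parameter at a time.
numgrad = zeros(size(theta));
for i = 1:numel(theta)
    delta    = zeros(size(theta));
    delta(i) = epsilon;
    numgrad(i) = (J(theta + delta) - J(theta - delta)) / (2 * epsilon);
end

grad   = analyticGrad(theta);
relErr = norm(numgrad - grad) / norm(numgrad + grad);

disp([numgrad grad]);   % the two columns should agree closely
disp(relErr);           % should be very small
```

The same relative-error measure is the one used in STEP 3 to compare the numerical gradient with the backpropagation gradient.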
Hugo Larochelle nn course
