Softmax regression and tanh: which is better?

This material draws on Andrew Ng's machine learning course (http://cs229.stanford.edu) and the Stanford unsupervised learning UFLDL tutorial (http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial).
1. Introduction

Regression in machine learning belongs to supervised learning. Given a D-dimensional input variable x, where each input vector x has a corresponding value y, the goal of regression is to predict the corresponding continuous target value for new data. For example, suppose we have a data set with the floor area and price of 47 houses:

We can plot this data set in Matlab:

Looking at the plotted points, do they not look roughly like a straight line? We can fit a curve to these data points as closely as possible; then, for a new input, we return the corresponding point on the fitted curve as the prediction. If the value to be predicted is continuous, such as the house price above, the problem is a regression problem; if the value to be predicted is discrete, i.e. a label, it is a classification problem. The learning process is illustrated in the figure below:

Common terminology for this learning process: the data set containing house areas and prices is the training set; the input variables x (here, the area) are the features; the predicted output y (here, the price) is the target; the fitted curve, usually written y = h(x), is the hypothesis; and the number of entries in the training set is the number of training examples, here 47.
2. The Linear Regression Model
The linear regression model assumes that the input features and the corresponding result satisfy a linear relationship. Add one more dimension to the data set above -- the number of rooms -- so that the data set becomes:

The input feature x is now a two-dimensional vector; for example, x_1^(i) is the floor area of the i-th house in the data set and x_2^(i) is its number of rooms. We can then assume that the input features x and the house price y satisfy a linear function, for example:

h(x) = θ_0 + θ_1·x_1 + θ_2·x_2

Here the θ_i are the parameters of the hypothesis, i.e. of the linear function h mapping the input features x to the result y. To simplify notation we add x_0 = 1 to the input features and obtain:

h(x) = Σ_{i=0}^{n} θ_i·x_i = θ^T x

The parameters θ and the input features x are both vectors, and n is the number of input features x (not counting x_0).

Now, given a training set, how should we learn the parameters θ so as to obtain a good fit? An intuitive idea is to make the prediction h(x) as close to y as possible. To this end we define, for the parameters θ, a cost function that measures how close h(x^(i)) is to the corresponding y^(i):

J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2

The factor 1/2 in front makes the constant coefficient disappear when taking derivatives. Our goal then becomes adjusting θ so that the cost function J(θ) attains its minimum; available methods include gradient descent, least squares, and others.
2.1 Gradient descent

We now want to adjust θ so that J(θ) attains its minimum. To do this we can give θ a random initial value (random initialization breaks symmetry), then iteratively change θ to decrease J(θ), until it finally converges to a value of θ that minimizes J(θ). Gradient descent adopts exactly this idea: set θ to a random initial value θ_0, then iterate the update

θ_j := θ_j − α · ∂J(θ)/∂θ_j

until convergence. Here α is called the learning rate.

The direction is determined by the partial derivative of J(θ) with respect to θ; since we want the minimum, we take the negative of the partial derivative as the update direction. Substituting J(θ) gives, for a single training example, the update rule

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

This update rule is called the LMS update rule (least mean squares), also known as the Widrow-Hoff learning rule.
Consider the following algorithm for updating the parameters:

Repeat until convergence {
    θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i)    (for every j)
}

Because every iteration examines all the samples in the training set, this is called batch gradient descent. Running this algorithm on the housing data set from the introduction gives θ_0 = 71.27, θ_1 = 1.1345, and the fitted line is shown in the figure below:
If instead the parameter update is computed as follows:

Loop {
    for i = 1 to m {
        θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)    (for every j)
    }
}

then θ is updated from a single training example at a time, which is called stochastic gradient descent. Comparing the two: batch gradient descent considers the whole data set at every step, so each step is expensive, while stochastic gradient descent converges more quickly, and in practice the value of J(θ) reached by either variant is usually close to the true minimum. For large data sets, the more efficient stochastic gradient descent is therefore generally preferred.
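To make the two update rules concrete, here is a minimal NumPy sketch of batch and stochastic gradient descent for this linear model. It is my own illustration on synthetic data, not part of the original notes.

import numpy as np

# Synthetic stand-in for the 47-house data set: x = (1, area), y = price.
np.random.seed(0)
m = 47
X = np.column_stack([np.ones(m), np.random.rand(m)])      # x0 = 1, x1 = (scaled) area
y = 4.0 + 3.0 * X[:, 1] + 0.1 * np.random.randn(m)

alpha = 0.01                                               # learning rate

# Batch gradient descent: each step applies the LMS update summed over all m examples.
theta = np.zeros(2)
for _ in range(2000):
    theta = theta + alpha * (y - X @ theta) @ X            # sum_i (y_i - h(x_i)) * x_i

# Stochastic gradient descent: update after every single training example.
theta_sgd = np.zeros(2)
for _ in range(200):
    for i in np.random.permutation(m):
        theta_sgd = theta_sgd + alpha * (y[i] - X[i] @ theta_sgd) * X[i]

print(theta, theta_sgd)    # both end up close to the true parameters (4, 3)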
2.2 Least squares (the normal equations)

Gradient descent gives one way to compute θ, but it requires an iterative process, which takes time and is not very direct. The least-squares method introduced below is a direct approach that obtains θ in closed form using matrix operations. To understand it, first recall some matrix operations:

Suppose f is a function mapping an m×n matrix to a real number, f : R^{m×n} → R. For a matrix A, define the gradient of f(A) with respect to A as the matrix of partial derivatives ∂f/∂A_{ij}, so the gradient is itself an m×n matrix. For example, for a 2×2 matrix A with the mapping defined as f(A) = 1.5·A_11 + 5·A_12^2 + A_21·A_22, the gradient is

∇_A f(A) = [ 1.5     10·A_12 ]
           [ A_22    A_21    ]

In addition, the gradient of the trace of a matrix obeys rules such as:

∇_A tr(AB) = B^T,   ∇_{A^T} f(A) = (∇_A f(A))^T,   ∇_A tr(ABA^T C) = CAB + C^T A B^T,   ∇_A |A| = |A|·(A^{-1})^T
Now write the input features x in the training set and the corresponding results y in matrix and vector form:

X = [ (x^(1))^T ; (x^(2))^T ; … ; (x^(m))^T ],   y = [ y^(1) ; y^(2) ; … ; y^(m) ]

Since the prediction model is h_θ(x^(i)) = (x^(i))^T θ, the vector Xθ stacks the predictions for all training examples, and it easily follows that

Xθ − y = [ h_θ(x^(1)) − y^(1) ; … ; h_θ(x^(m)) − y^(m) ]

and therefore

J(θ) = (1/2) (Xθ − y)^T (Xθ − y) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2.

Having expressed the cost function J(θ) in matrix form, we can use the matrix operations above to obtain its gradient:

∇_θ J(θ) = X^T X θ − X^T y

Setting this gradient to 0 gives the normal equation X^T X θ = X^T y, from which

θ = (X^T X)^{-1} X^T y.

This is the value of the hypothesis parameters obtained by least squares.
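As a quick check of the closed-form solution, here is a small NumPy sketch of my own on synthetic data; solving the normal equation X^T X θ = X^T y directly is numerically safer than forming the explicit inverse.

import numpy as np

np.random.seed(0)
m = 47
X = np.column_stack([np.ones(m), np.random.rand(m, 2)])   # design matrix with x0 = 1
theta_true = np.array([4.0, 3.0, -2.0])
y = X @ theta_true + 0.1 * np.random.randn(m)

# theta = (X^T X)^{-1} X^T y, computed by solving the normal equation
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                                              # close to theta_true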
2.3 Weighted linear regression

First consider the curve-fitting situations in the figure below:

The leftmost plot uses a linear fit, but the data points clearly do not all lie on a straight line, so the fit is not very good. If we add an x^2 term and fit y = θ_0 + θ_1·x + θ_2·x^2, as in the middle plot, the quadratic curve fits the data points better. Continuing to add higher-order terms gives the fit in the rightmost plot, which passes through the data points perfectly; the curve there is a 5th-order polynomial. Yet we are well aware that this curve is too perfect: its predictions on new data may not be nearly as good. The leftmost case is called underfitting -- too small a feature set makes the model too simple to capture the structure of the data; the rightmost case is called overfitting -- too large a feature set makes the model overly complex.
As this example shows, the choice of features has a large influence on the performance of the learned model, and the questions of which features to use and how important each one is lead to weighted linear regression. In traditional linear regression, learning proceeds by

fitting θ to minimize Σ_{i} (y^(i) − θ^T x^(i))^2, then returning θ^T x;

whereas weighted linear regression learns by

fitting θ to minimize Σ_{i} w^(i) (y^(i) − θ^T x^(i))^2, then returning θ^T x.

The difference between the two is that a non-negative weight is attached to each training example; the larger the weight, the greater its influence on the cost function. A common choice for the weights is

w^(i) = exp( −(x^(i) − x)^2 / (2τ^2) )

where x is the query point to be predicted: samples close to x get larger weights, while samples far away have less influence.
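Below is a minimal NumPy sketch of locally weighted linear regression (my own illustration, not from the notes): for each query point x, a weighted least-squares problem is solved with the weights defined above, and τ controls how quickly the influence of distant samples falls off.

import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    # w_i = exp(-||x_i - x_query||^2 / (2 tau^2)): nearby samples get larger weight.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

np.random.seed(0)
X = np.column_stack([np.ones(100), np.linspace(0, 3, 100)])   # x0 = 1 plus one real feature
y = np.sin(X[:, 1]) + 0.1 * np.random.randn(100)              # clearly non-linear target
print(lwr_predict(X, y, np.array([1.0, 1.5])))                # close to sin(1.5)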
3. Logistic Regression and Softmax Regression

3.1 Logistic regression

Although its name says regression, logistic regression is actually used for classification. It is still essentially a linear regression model, except that a function is applied on top of the continuous regression output: the features are summed linearly and then mapped through g(z), which maps the continuous value to discrete values (0/1 for the sigmoid function; 1/-1 for the hyperbolic tangent, tanh). The hypothesis used is

h_θ(x) = g(θ^T x) = 1 / (1 + e^{−θ^T x})

where the sigmoid function g(z) is

g(z) = 1 / (1 + e^{−z}).

As z → −∞, g(z) → 0, and as z → +∞, g(z) → 1, which is what allows classification. A useful property of the sigmoid here is that its derivative satisfies g′(z) = g(z)·(1 − g(z)).
How, then, do we fit the parameters θ of such a logistic model? Assume

P(y = 1 | x; θ) = h_θ(x),   P(y = 0 | x; θ) = 1 − h_θ(x),

which, since this is a two-class problem, can be written compactly as p(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^{1−y}. The likelihood is then

L(θ) = Π_{i=1}^{m} (h_θ(x^(i)))^{y^(i)} (1 − h_θ(x^(i)))^{1−y^(i)}.

Taking the logarithm of the likelihood makes it easier to work with:

ℓ(θ) = Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ].

Next we maximize this likelihood in θ; applying gradient ascent (the counterpart of the gradient descent above) gives the analogous update rule

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i).

Although this update rule looks the same as the LMS formula, the two are different algorithms, because here h_θ(x^(i)) is a nonlinear function of θ^T x^(i).
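Here is a minimal NumPy sketch of this stochastic gradient-ascent rule for logistic regression; it is my own illustration with made-up data, not part of the original notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
m = 200
X = np.column_stack([np.ones(m), np.random.randn(m, 2)])   # x0 = 1 plus two features
y = (X[:, 1] + X[:, 2] > 0).astype(float)                  # linearly separable labels

theta = np.zeros(3)
alpha = 0.1
for _ in range(100):
    for i in np.random.permutation(m):
        # theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij
        theta = theta + alpha * (y[i] - sigmoid(X[i] @ theta)) * X[i]

print(theta)   # learned decision boundary is roughly x1 + x2 = 0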
3.2 Softmax regression

Logistic regression handles two-class problems; what if the target takes several discrete values? Softmax regression solves this: it is the generalization of the logistic model to multi-class problems. In softmax regression the class label y can take k different values (k > 2), so y^(i) ∈ {1, 2, …, k}.

Given a test input x, we want the hypothesis to estimate the probability p(y = j | x) for each class j. The hypothesis h_θ(x^(i)) therefore takes the form

h_θ(x^(i)) = [ p(y^(i)=1 | x^(i); θ), …, p(y^(i)=k | x^(i); θ) ]^T = (1 / Σ_{j=1}^{k} e^{θ_j^T x^(i)}) · [ e^{θ_1^T x^(i)}, …, e^{θ_k^T x^(i)} ]^T

where θ_1, θ_2, …, θ_k are the parameters of the model and the coefficient on the right normalizes the distribution so that the probabilities sum to 1. Analogously to logistic regression, the generalized cost function is

J(θ) = −(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y^(i) = j} · log( e^{θ_j^T x^(i)} / Σ_{l=1}^{k} e^{θ_l^T x^(i)} ).
The softmax cost function is very similar in form to the logistic cost function, except that softmax sums over the k possible classes; in softmax, the probability that x belongs to class j is

p(y^(i) = j | x^(i); θ) = e^{θ_j^T x^(i)} / Σ_{l=1}^{k} e^{θ_l^T x^(i)}.

To minimize J(θ) for softmax by gradient descent, the gradient is

∇_{θ_j} J(θ) = −(1/m) Σ_{i=1}^{m} [ x^(i) ( 1{y^(i) = j} − p(y^(i) = j | x^(i); θ) ) ]

which is the partial derivative of J(θ) with respect to the j-th parameter vector θ_j, and each iteration performs the update θ_j := θ_j − α ∇_{θ_j} J(θ) for every j.
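To make the gradient formula concrete, here is a vectorized NumPy sketch of my own; labels are 0-indexed here, and the weight-decay term sometimes added in the UFLDL tutorial is omitted.

import numpy as np

def softmax_cost_grad(Theta, X, y, k):
    """Theta: k x (n+1) parameters, X: m x (n+1) inputs, y: m labels in {0,...,k-1}."""
    m = X.shape[0]
    scores = X @ Theta.T                            # scores[i, j] = theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)     # subtract max for numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)               # p[i, j] = p(y^(i) = j | x^(i); theta)
    indicator = np.eye(k)[y]                        # indicator[i, j] = 1{y^(i) = j}
    cost = -np.sum(indicator * np.log(p)) / m
    grad = -(indicator - p).T @ X / m               # one row per theta_j
    return cost, grad

np.random.seed(0)
X = np.column_stack([np.ones(30), np.random.randn(30, 4)])
y = np.random.randint(0, 3, size=30)
Theta = np.zeros((3, 5))
print(softmax_cost_grad(Theta, X, y, 3)[0])         # initial cost equals log(3)
# One gradient-descent step would be: Theta = Theta - alpha * grad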
3.3 Softmax regression vs. logistic regression

In particular, when k = 2 softmax regression reduces to logistic regression. For k = 2 the softmax hypothesis is

h_θ(x) = (1 / (e^{θ_1^T x} + e^{θ_2^T x})) · [ e^{θ_1^T x}, e^{θ_2^T x} ]^T.

Exploiting the fact that the softmax parameterization is redundant, we subtract θ_1 from both parameter vectors and set θ′ = θ_2 − θ_1, obtaining

h_θ(x) = [ 1 / (1 + e^{θ′^T x}),  e^{θ′^T x} / (1 + e^{θ′^T x}) ]^T = [ 1 − 1/(1 + e^{−θ′^T x}),  1/(1 + e^{−θ′^T x}) ]^T

so the two class probabilities predicted by softmax regression have exactly the same form as in logistic regression.
Now, given a k-class classification task, should we choose softmax regression or k independent logistic regression classifiers?

The choice depends on whether the k classes are mutually exclusive. For example, if there are four movie categories -- Hollywood, Hong Kong/Taiwan, Japanese/Korean, and mainland Chinese -- and every training movie must be given exactly one label, then softmax regression with k = 4 is the right choice. If, however, the four categories are action, comedy, romance, and European/American, these are not mutually exclusive, and in that case using 4 separate logistic regression classifiers is more reasonable.
4. Generalized Linear Models

First define a general exponential-family distribution:

p(y; η) = b(y) · exp( η^T T(y) − a(η) ).

Consider the Bernoulli distribution: it can be put in this form with T(y) = y, b(y) = 1, natural parameter η = log(φ/(1 − φ)), and a(η) = log(1 + e^η).

Now consider the Gaussian distribution (with unit variance): p(y; μ) = (1/√(2π)) e^{−y²/2} · exp( μ·y − μ²/2 ), so T(y) = y, η = μ, a(η) = η²/2, and b(y) = (1/√(2π)) e^{−y²/2}.

A generalized linear model assumes: 1. y | x; θ follows an exponential-family distribution ExponentialFamily(η); 2. given features x, the prediction is h(x) = E[T(y) | x]; 3. the natural parameter satisfies η = θ^T x.
For the linear model of Part 2, we assume the result y follows a Gaussian distribution N(μ, σ²), so the expectation is μ = η, and therefore

h_θ(x) = E[y | x; θ] = μ = η = θ^T x.

The hypothesis of Part 2 thus clearly follows from the generalized-linear-model view.

For the logistic model, since the result is assumed to fall into one of two classes, the Bernoulli distribution is the natural choice, so y | x; θ ~ Bernoulli(φ) with E[y | x; θ] = φ, and therefore

h_θ(x) = E[y | x; θ] = φ = 1 / (1 + e^{−η}) = 1 / (1 + e^{−θ^T x}),

which is exactly the logistic hypothesis; this also explains why logistic regression uses this particular function.
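For completeness, here is a short worked derivation (standard material, not spelled out in the original notes) of how the sigmoid falls out of writing the Bernoulli distribution in exponential-family form:

p(y; φ) = φ^y (1 − φ)^{1−y} = exp( y·log(φ/(1 − φ)) + log(1 − φ) )

so b(y) = 1, T(y) = y, η = log(φ/(1 − φ)), and a(η) = −log(1 − φ) = log(1 + e^η). Inverting the link gives φ = 1/(1 + e^{−η}), and with the GLM assumption η = θ^T x this is h_θ(x) = 1/(1 + e^{−θ^T x}).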
Training a Convolutional Network on CIFAR-10 with Keras

Note: with newer versions of Keras you may get the error "MaxPooling2D got an unexpected keyword argument 'mode'"; this is a bug in the old version of Theano, and it is enough to download and build the latest Theano source from the official site.

Also note: this code was written for an old version of Keras. The Keras API has since changed considerably, so the code no longer runs as-is, but you can rewrite it from the model described below.
Dataset introduction

CIFAR-10 is a data set for general object recognition collected by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.

CIFAR-10 consists of 60,000 32*32 RGB color images in 10 classes: 50,000 for training and 10,000 for testing (cross-validation). The distinguishing feature of this data set is that it moves recognition towards general objects and poses it as multi-class classification (its sister data set CIFAR-100 goes to 100 classes, and the ILSVRC competition has 1,000 categories).

The data set is available for download.

Compared with the relatively mature field of face recognition, general object recognition is a huge challenge: the data contain a large number of features and much noise, and the objects appear at different scales and proportions.
Development tools

There are many popular deep learning libraries at the moment; the one drawing the most attention is probably Caffe on GitHub. Personally, though, I find Caffe packaged too rigidly: too much is wrapped up inside the library, so if you want to learn the underlying principles it is still better to read the Theano version.

The library I use (Keras) was recommended by friends; it is built on Theano, and its advantage is that it is easy to use and allows rapid development.
Network architecture

The network architecture follows Caffe's CIFAR-10 example, with some modifications of my own.

The architecture is as follows:

layer1 Conv1: 32 kernels, kernel size 5, activation: relu, dropout: 0.25
layer2 Conv2: 32 kernels, kernel size 5, activation: relu, dropout: 0.25
layer3 MaxPooling1: poolsize 2
layer4 Conv3: 64 kernels, kernel size 3, activation: relu, dropout: 0.25
layer5 MaxPooling2: poolsize 2
layer6 Fully connected: 512 units, activation: tanh
layer7 Softmax
Training results

Comparing with the training-error plot in Alex Krizhevsky's paper: the paper says that using ReLU instead of tanh as the activation function makes the algorithm converge much faster, but in practice I did not feel it was that much faster.

Also, the trend of the loss is not as clean as in the paper; of course, my architecture is not exactly the same as the authors'.

After 35 epochs, the final training accuracy is 0.86 and the cross-validation accuracy is 0.78.

It already seemed close to overfitting, so I did not increase the number of epochs.
Training essentials

The CIFAR-10 data must be preprocessed: subtract the mean, normalize, and apply whitening.
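A minimal sketch of the first two steps (my own illustration, not the author's preprocessing script; ZCA whitening is left out to keep it short, and the statistics are computed on the training set only):

import numpy as np

def preprocess(x_train, x_test):
    # x_*: uint8 arrays of shape (num_images, 3, 32, 32) with values in [0, 255]
    x_train = x_train.astype('float32') / 255.0
    x_test = x_test.astype('float32') / 255.0
    mean = x_train.mean(axis=0)              # per-pixel mean image from the training set
    std = x_train.std(axis=0) + 1e-7         # per-pixel standard deviation
    return (x_train - mean) / std, (x_test - mean) / std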
When ReLU is used in place of tanh as the activation function, the learning rate must be reduced by an order of magnitude, otherwise the model overfits.

Which layers should use ReLU? The answer: every layer can use ReLU except the fully connected layer feeding the softmax, which uses tanh.

As for the dropout ratio, the paper uses 0.5, but I found that 0.25 gives better results here; this probably needs to be tuned to the data.
Activation functions

The activation functions used in deep learning include sigmoid, tanh, ReLU, and softplus.

The most effective at present is the rectified linear unit (ReLU).

If a multi-layer neural network uses sigmoid or tanh activations without pre-training, it runs into the vanishing gradient problem and fails to converge; with ReLU this is not an issue.

Pre-training serves to regularize, compress the data, strengthen the features, and speed up convergence.

The standard sigmoid output is not sparse; producing sparse representations requires penalty terms (such as L1, L1/L2, or Student-t penalties) to push a large amount of redundant, near-zero data towards zero, which is why unsupervised pre-training is needed.

ReLU, by contrast, is a rectified linear function, g(x) = max(0, x): if the computed value is less than 0 the output is 0, otherwise the value passes through unchanged. This is a simple, even crude, way of forcing some activations to 0, yet in practice networks trained this way show a moderate degree of sparsity, and the features they learn look very similar to those obtained with traditional pre-training, which suggests that ReLU can by itself induce moderate sparsity.
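A tiny NumPy illustration (my own) of the rectifier and the sparsity it produces:

import numpy as np

def relu(x):
    # g(x) = max(0, x): negative activations are clipped to exactly zero.
    return np.maximum(0.0, x)

a = relu(np.random.randn(10000))
print((a == 0).mean())    # about half of the activations come out exactly zero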
The role of dropout

Dropout prevents overfitting. In practice: during the forward pass of training, the outputs of hidden-layer nodes are randomly set to 0 according to a chosen ratio, and in the corresponding backward pass the error of those zeroed hidden nodes is also set to 0, which makes the representation sparse. This forces each neuron to learn more robust and more abstract features.
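A small NumPy sketch (my own, following the classic non-inverted dropout described above) of how the forward and backward passes share the same mask:

import numpy as np

def dropout_forward(a, p=0.25, training=True):
    if not training:
        return a * (1.0 - p), None                            # at test time, scale by the keep probability
    mask = (np.random.rand(*a.shape) >= p).astype(a.dtype)    # keep each unit with probability 1 - p
    return a * mask, mask

def dropout_backward(grad_out, mask):
    # Units zeroed in the forward pass get zero error signal in backprop.
    return grad_out * mask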
Training code
# -*- coding: utf-8 -*-
# Created on Thu Aug 27 11:27:34 2015
# @author: lab-liu.longpo
# Note: written against the old (2015) Keras/Theano API; it will not run
# unmodified on current Keras versions.
from __future__ import absolute_import
from __future__ import print_function
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.optimizers import SGD, Adadelta, Adagrad
from keras.utils import np_utils, generic_utils
import matplotlib.pyplot as plt
import numpy as np
import scipy.io as sio

# Load the preprocessed CIFAR-10 data (50000 RGB images of 32*32) and labels
d = sio.loadmat('data.mat')
data = d['d']
label = d['l']
data = np.reshape(data, (50000, 3, 32, 32))
label = np_utils.to_categorical(label, 10)
print('finish loading data')

model = Sequential()

# layer1: Conv1, 32 kernels of size 5, relu, dropout 0.25
model.add(Convolution2D(32, 3, 5, 5, border_mode='valid'))
model.add(Activation('relu'))
model.add(Dropout(0.25))

# layer2: Conv2, 32 kernels of size 5, relu, then max pooling and dropout 0.25
model.add(Convolution2D(32, 32, 5, 5, border_mode='valid'))
model.add(Activation('relu'))
model.add(MaxPooling2D(poolsize=(2, 2)))
model.add(Dropout(0.25))

# layer4: Conv3, 64 kernels of size 3, relu, then max pooling and dropout 0.25
model.add(Convolution2D(64, 32, 3, 3, border_mode='valid'))
model.add(Activation('relu'))
model.add(MaxPooling2D(poolsize=(2, 2)))
model.add(Dropout(0.25))

# layer6: fully connected layer with 512 units and tanh activation
model.add(Flatten())
model.add(Dense(64 * 5 * 5, 512, init='normal'))
model.add(Activation('tanh'))

# layer7: softmax output over the 10 classes
model.add(Dense(512, 10, init='normal'))
model.add(Activation('softmax'))

sgd = SGD(l2=0.001, lr=0.0065, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, class_mode="categorical")

result = model.fit(data, label, batch_size=50, nb_epoch=35,
                   shuffle=True, verbose=1, show_accuracy=True,
                   validation_split=0.2)

# Plot training / validation accuracy
plt.figure()
plt.plot(result.epoch, result.history['acc'], label="acc")
plt.plot(result.epoch, result.history['val_acc'], label="val_acc")
plt.scatter(result.epoch, result.history['acc'], marker='*')
plt.scatter(result.epoch, result.history['val_acc'])
plt.legend(loc='lower right')
plt.show()

# Plot training / validation loss
plt.figure()
plt.plot(result.epoch, result.history['loss'], label="loss")
plt.plot(result.epoch, result.history['val_loss'], label="val_loss")
plt.scatter(result.epoch, result.history['loss'], marker='*')
plt.scatter(result.epoch, result.history['val_loss'], marker='*')
plt.legend(loc='upper right')
plt.show()
Today I implemented a machine learning / deep learning algorithm that I had studied earlier. The data set, the data preprocessing, and the training code from this post will also be put on GitHub; if you find it useful, please give it a star.
