cart gini系数计算指数怎么计算

点击联系发帖人 时间：2016-08-27 06:07

gini index如何计算

数据挖掘系列（6）决策树分类算法 – MyDetail | 查问题
汇聚最新编程技术，编程问题一网打尽
& 数据挖掘系列（6）决策树分类算法 – MyDetail
数据挖掘系列（6）决策树分类算法 – MyDetail
[ 分类： ]
　　从这篇开始，我将介绍分类问题，主要介绍决策树算法、朴素贝叶斯、支持向量机、BP神经网络、懒惰学习算法、随机森林与自适应增强算法、分类模型选择和结果评价。总共7篇，欢迎关注和交流。
　　这篇先介绍分类问题的一些基本知识，然后主要讲述决策树算法的原理、实现，最后利用决策树算法做一个泰坦尼克号船员生存预测应用。
一、分类基本介绍
　　物以类聚，人以群分，分类问题只古以来就出现我们的生活中。分类是数据挖掘中一个重要的分支，在各方面都有着广泛的应用，如医学疾病判别、垃圾邮件过滤、垃圾短信拦截、客户分析等等。分类问题可以分为两类：
　　归类：归类是指对离散数据的分类，比如对根据一个人的笔迹判别这个是男还是女，这里的类别只有两个，类别是离散的集合空间{男，女}的。
　　预测：预测是指对连续数据的分类，比如预测明天8点天气的湿度情况，天气的湿度在随时变化，8点时的天气是一个具体值，它不属于某个有限集合空间。预测也叫回归分析，在金融领域有着广泛应用。
　　虽然对离散数据和连续数据的处理方式有所不同，但其实他们之间相互转化，比如我们可以根据比较的某个特征值判断，如果值大于0.5就认定为男性，小于等于0.5就认为是女性，这样就转化为连续处理方式；将天气湿度值分段处理也就转化为离散数据。
　　数据分类分两个步骤：
构造模型，利用训练数据集训练分类器；
利用建好的分类器模型对测试数据进行分类。
　　好的分类器具有很好的泛化能力，即它不仅在训练数据集上能达到很高的正确率，而且能在未见过得测试数据集也能达到较高的正确率。如果一个分类器只是在训练数据上表现优秀，但在测试数据上表现稀烂，这个分类器就已经过拟合了，它只是把训练数据记下来了，并没有抓到整个数据空间的特征。
二、决策树分类
　　决策树算法借助于树的分支结构实现分类。下图是一个决策树的示例，树的内部结点表示对某个属性的判断，该结点的分支是对应的判断结果；叶子结点代表一个类标。
　　上表是一个预测一个人是否会购买购买电脑的决策树，利用这棵树，我们可以对新记录进行分类，从根节点（年龄）开始，如果某个人的年龄为中年，我们就直接判断这个人会买电脑，如果是青少年，则需要进一步判断是否是学生；如果是老年则需要进一步判断其信用等级，直到叶子结点可以判定记录的类别。
　　决策树算法有一个好处，那就是它可以产生人能直接理解的规则，这是贝叶斯、神经网络等算法没有的特性；决策树的准确率也比较高，而且不需要了解背景知识就可以进行分类，是一个非常有效的算法。决策树算法有很多变种，包括ID3、C4.5、C5.0、CART等，但其基础都是类似的。下面来看看决策树算法的基本思想：
算法：GenerateDecisionTree(D,attributeList)根据训练数据记录D生成一棵决策树.
数据记录D，包含类标的训练数据集;
属性列表attributeList，候选属性集，用于在内部结点中作判断的属性.
属性选择方法AttributeSelectionMethod()，选择最佳分类属性的方法.
输出：一棵决策树.
构造一个节点N；
如果数据记录D中的所有记录的类标都相同（记为C类）：
则将节点N作为叶子节点标记为C，并返回结点N；
如果属性列表为空：
则将节点N作为叶子结点标记为D中类标最多的类，并返回结点N；
调用AttributeSelectionMethod(D,attributeList)选择最佳的分裂准则splitC
将节点N标记为最佳分裂准则splitC
如果分裂属性取值是离散的，并且允许决策树进行多叉分裂：
从属性列表中减去分裂属性，attributeLsit -= splitA
对分裂属性的每一个取值j:
记D中满足j的记录集合为Dj;
如果Dj为空：
则新建一个叶子结点F，标记为D中类标最多的类，并且把结点F挂在N下;
递归调用GenerateDecisionTree(Dj,attributeList)得到子树结点Nj，将Nj挂在N下;
返回结点N;
&　　算法的1、2、3步骤都很显然，第4步的最佳属性选择函数会在后面专门介绍，现在只有知道它能找到一个准则，使得根据判断结点得到的子树的类别尽可能的纯，这里的纯就是只含有一个类标；第5步根据分裂准则设置结点N的测试表达式。在第6步中，对应构建多叉决策树时，离散的属性在结点N及其子树中只用一次，用过之后就从可用属性列表中删掉。比如前面的图中，利用属性选择函数，确定的最佳分裂属性是年龄，年龄有三个取值，每一个取值对应一个分支，后面不会再用到年龄这个属性。算法的时间复杂度是O(k*|D|*log(|D|))，k为属性个数，|D|为记录集D的记录数。
三、属性选择方法
　　属性选择方法总是选择最好的属性最为分裂属性，即让每个分支的记录的类别尽可能纯。它将所有属性列表的属性进行按某个标准排序，从而选出最好的属性。属性选择方法很多，这里我介绍三个常用的方法：信息增益（Information gain）、增益比率（gain ratio）、基尼指数（Gini index）。
信息增益（Information gain）
　　信息增益基于香浓的信息论，它找出的属性R具有这样的特点：以属性R分裂前后的信息增益比其他属性最大。这里信息的定义如下：
　　其中的m表示数据集D中类别C的个数，Pi表示D中任意一个记录属于Ci的概率，计算时Pi=(D中属于Ci类的集合的记录个数/|D|)。Info(D)表示将数据集D不同的类分开需要的信息量。
　　如果了解信息论，就会知道上面的信息Info实际上就是信息论中的熵Entropy，熵表示的是不确定度的度量，如果某个数据集的类别的不确定程度越高，则其熵就越大。比如我们将一个立方体A抛向空中，记落地时着地的面为f1，f1的取值为{1,2,3,4,5,6}，f1的熵entropy(f1)=-(1/6*log(1/6)+&#*log(1/6))=-1*log(1/6)=2.58；现在我们把立方体A换为正四面体B，记落地时着地的面为f2，f2的取值为{1,2,3,4}，f2的熵entropy(1)=-（1/4*log(1/4)+1/4*log(1/4)+1/4*log(1/4)+1/4*log(1/4)) =-log(1/4)=2；如果我们再换成一个球C，记落地时着地的面为f3，显然不管怎么扔着地都是同一个面，即f3的取值为{1}，故其熵entropy(f3)=-1*log(1)=0。可以看到面数越多，熵值也越大，而当只有一个面的球时，熵值为0，此时表示不确定程度为0，也就是着地时向下的面是确定的。
　　有了上面关于熵的简单理解，我们接着讲信息增益。假设我们选择属性R作为分裂属性，数据集D中，R有k个不同的取值{V1,V2,…,Vk}，于是可将D根据R的值分成k组{D1,D2,…,Dk}，按R进行分裂后，将数据集D不同的类分开还需要的信息量为：&
　　信息增益的定义为分裂前后，两个信息量只差：
　　信息增益Gain(R)表示属性R给分类带来的信息量，我们寻找Gain最大的属性，就能使分类尽可能的纯，即最可能的把不同的类分开。不过我们发现对所以的属性Info(D)都是一样的，所以求最大的Gain可以转化为求最新的InfoR(D)。这里引入Info(D)只是为了说明背后的原理，方便理解，实现时我们不需要计算Info(D)。举一个例子，数据集D如下：
是否购买电脑
　　这个数据集是根据一个人的年龄、收入、是否学生以及信用等级来确定他是否会购买电脑，即最后一列&是否购买电脑&是类标。现在我们用信息增益选出最最佳的分类属性，计算按年龄分裂后的信息量：
　　整个式子由三项累加而成，第一项为青少年，14条记录中有5条为青少年，其中2（占2/5）条购买电脑，3（占3/5）条不购买电脑。第二项为中年，第三项为老年。类似的，有：
　　可以得出Info年龄(D)最小，即以年龄分裂后，分得的结果中类标最纯，此时已年龄作为根结点的测试属性，根据青少年、中年、老年分为三个分支：
　　注意，年龄这个属性用过后，之后的操作就不需要年龄了，即把年龄从attributeList中删掉。往后就按照同样的方法，构建D1,D2,D3对应的决策子树。ID3算法使用的就是基于信息增益的选择属性方法。
增益比率（gain ratio）
　　信息增益选择方法有一个很大的缺陷，它总是会倾向于选择属性值多的属性，如果我们在上面的数据记录中加一个姓名属性，假设14条记录中的每个人姓名不同，那么信息增益就会选择姓名作为最佳属性，因为按姓名分裂后，每个组只包含一条记录，而每个记录只属于一类（要么购买电脑要么不购买），因此纯度最高，以姓名作为测试分裂的结点下面有14个分支。但是这样的分类没有意义，它每一任何泛化能力。增益比率对此进行了改进，它引入一个分裂信息：
　　增益比率定义为信息增益与分裂信息的比率：
　　我们找GainRatio最大的属性作为最佳分裂属性。如果一个属性的取值很多，那么SplitInfoR(D)会大，从而使GainRatio(R)变小。不过增益比率也有缺点，SplitInfo(D)可能取0，此时没有计算意义；且当SplitInfo(D)趋向于0时，GainRatio(R)的值变得不可信，改进的措施就是在分母加一个平滑，这里加一个所有分裂信息的平均值：
基尼指数（Gini index）
　　基尼指数是另外一种数据的不纯度的度量方法，其定义如下：
　　其中的m仍然表示数据集D中类别C的个数，Pi表示D中任意一个记录属于Ci的概率，计算时Pi=(D中属于Ci类的集合的记录个数/|D|)。如果所有的记录都属于同一个类中，则P1=1，Gini(D)=0，此时不纯度最低。在CART(Classification and Regression Tree)算法中利用基尼指数构造二叉决策树，对每个属性都会枚举其属性的非空真子集，以属性R分裂后的基尼系数为：
　　D1为D的一个非空真子集，D2为D1在D的补集，即D1+D2=D，对于属性R来说，有多个真子集，即GiniR(D)有多个值，但我们选取最小的那么值作为R的基尼指数。最后：
　　我们转Gini(R)增量最大的属性作为最佳分裂属性。
　　本来打算还讲一下实现的，今天要回家，帮家里收下谷子，本来买的前天的票的，台风影响京广线没走成，先写到这里，收拾东西去。9月30回来，再总结一下实现和实践，并持续后面的话题，欢迎持续关注。
参考文献：
　[1]：HanJiaWei. Data Mining: concept and &techniques.
　感谢关注，欢迎回帖交流。
转载请注明出处：
本文链接：，转载请注明。决策树系列（五）——CART - 学会分享~ - 博客园
随笔 - 16, 文章 - 0, 评论 - 0, 引用 - 0
CART，又名分类回归树，是在ID3的基础上进行优化的决策树，学习CART记住以下几个关键点：
（1）CART既能是分类树，又能是分类树；
（2）当CART是分类树时，采用GINI值作为节点分裂的依据；当CART是回归树时，采用样本的最小方差作为节点分裂的依据；
（3）CART是一棵二叉树。
接下来将以一个实际的例子对CART进行介绍：
&&&&&&&& &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 　　　　　　　　　　　　表1 原始数据表
看电视时间
从以下的思路理解CART：
分类树？回归树？
& & & 分类树的作用是通过一个对象的特征来预测该对象所属的类别，而回归树的目的是根据一个对象的信息预测该对象的属性，并以数值表示。
& & & CART既能是分类树，又能是决策树，如上表所示，如果我们想预测一个人是否已婚，那么构建的CART将是分类树；如果想预测一个人的年龄，那么构建的将是回归树。
分类树和回归树是怎么做决策的？假设我们构建了两棵决策树分别预测用户是否已婚和实际的年龄，如图1和图2所示：
&&&&&&&&&&&&&&&&&&&&&&& 　　　　　　　　　　　　　　图1 预测婚姻情况决策树 & & & & & & & & & & & & & & & & & & & & & & & 图2 预测年龄的决策树
& & & &图1表示一棵分类树，其叶子节点的输出结果为一个实际的类别，在这个例子里是婚姻的情况（已婚或者未婚），选择叶子节点中数量占比最大的类别作为输出的类别；
& & & &图2是一棵回归树，预测用户的实际年龄，是一个具体的输出值。怎样得到这个输出值？一般情况下选择使用中值、平均值或者众数进行表示，图2使用节点年龄数据的平均值作为输出值。
CART如何选择分裂的属性？
& & & 分裂的目的是为了能够让数据变纯，使决策树输出的结果更接近真实值。那么CART是如何评价节点的纯度呢？如果是分类树，CART采用GINI值衡量节点纯度；如果是回归树，采用样本方差衡量节点纯度。节点越不纯，节点分类或者预测的效果就越差。
GINI值的计算公式：
　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　&
& & & 节点越不纯，GINI值越大。以二分类为例，如果节点的所有数据只有一个类别，则，如果两类数量相同，则。
回归方差计算公式：
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&　　　　　　　　　　　　　　　&&&&&&&&&&&&&&&&&&& &&&&&&&&
& & & 方差越大，表示该节点的数据越分散，预测的效果就越差。如果一个节点的所有数据都相同，那么方差就为0，此时可以很肯定得认为该节点的输出值；如果节点的数据相差很大，那么输出的值有很大的可能与实际值相差较大。
& & & 因此，无论是分类树还是回归树，CART都要选择使子节点的GINI值或者回归方差最小的属性作为分裂的方案。即最小化（分类树）：
&　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　
或者（回归树）：
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&　　　　　　　　　　　　　　　　　　&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& &&&&&&&&
CART如何分裂成一棵二叉树？
& & &节点的分裂分为两种情况，连续型的数据和离散型的数据。
& & &CART对连续型属性的处理与C4.5差不多，通过最小化分裂后的GINI值或者样本方差寻找最优分割点，将节点一分为二，在这里不再叙述，详细请看C4.5。
& & &对于离散型属性，理论上有多少个离散值就应该分裂成多少个节点。但CART是一棵二叉树，每一次分裂只会产生两个节点，怎么办呢？很简单，只要将其中一个离散值独立作为一个节点，其他的离散值生成另外一个节点即可。这种分裂方案有多少个离散值就有多少种划分的方法，举一个简单的例子：如果某离散属性一个有三个离散值X，Y，Z，则该属性的分裂方法有{X}、{Y，Z}，{Y}、{X，Z}，{Z}、{X，Y}，分别计算每种划分方法的基尼值或者样本方差确定最优的方法。
& & &以属性&职业&为例，一共有三个离散值，&学生&、&老师&、&上班族&。该属性有三种划分的方案，分别为{&学生&}、{&老师&、&上班族&}，{&老师&}、{&学生&、&上班族&}，{&上班族&}、{&学生&、&老师&}，分别计算三种划分方案的子节点GINI值或者样本方差，选择最优的划分方法，如下图所示：
第一种划分方法：{&学生&}、{&老师&、&上班族&}
预测是否已婚（分类）：
&&&&&&&&&&&&&&&&&& &
预测年龄（回归）：
&&&&&&&　　　　&
第二种划分方法：{&老师&}、{&学生&、&上班族&}
预测是否已婚（分类）：
&&&&&&&&&&&&&&&&&& &
预测年龄（回归）：
&&&&&&　　　　&&
第三种划分方法：{&上班族&}、{&学生&、&老师&}
&预测是否已婚（分类）：
&&&&&&&&&&&&&&&&&& &
预测年龄（回归）：
&&&&&&&&　　　　
综上，如果想预测是否已婚，则选择{&上班族&}、{&学生&、&老师&}的划分方法，如果想预测年龄，则选择{&老师&}、{&学生&、&上班族&}的划分方法。
如何剪枝？
& & & CART采用CCP（代价复杂度）剪枝方法。代价复杂度选择节点表面误差率增益值最小的非叶子节点，删除该非叶子节点的左右子节点，若有多个非叶子节点的表面误差率增益值相同小，则选择非叶子节点中子节点数最多的非叶子节点进行剪枝。
可描述如下：
令决策树的非叶子节点为。
a）计算所有非叶子节点的表面误差率增益值&
b）选择表面误差率增益值最小的非叶子节点（若多个非叶子节点具有相同小的表面误差率增益值，选择节点数最多的非叶子节点）。
c）对进行剪枝
表面误差率增益值的计算公式：
&　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　
表示叶子节点的误差代价，，为节点的错误率，为节点数据量的占比；
表示子树的误差代价，，为子节点i的错误率，表示节点i的数据节点占比；
表示子树节点个数。
下图是其中一颗子树，设决策树的总数据量为40。
该子树的表面误差率增益值可以计算如下：
求出该子树的表面错误覆盖率为，只要求出其他子树的表面误差率增益值就可以对决策树进行剪枝。
程序实际以及源代码
（1）数据处理
&&&&&&&& 对原始的数据进行数字化处理，并以二维数据的形式存储，每一行表示一条记录，前n-1列表示属性，最后一列表示分类的标签。
&&&&&&&& 如表1的数据可以转化为表2：
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 表2 初始化后的数据
看电视时间
& & & 其中，对于&婚姻情况&属性，数字{1，2}分别表示{未婚，已婚 }；对于&职业&属性{1，2，3， }分别表示{学生、老师、上班族}；
代码如下所示：
&&&&&&&& static double[][] allD&&&&&&&&&&&&&&&&&&&&&&&&&&&&& //存储进行训练的数据
　　　&static List&String&[] featureV&&&&&&&&&&&&&&&&&&& //离散属性对应的离散值
featureValues是链表数组，数组的长度为属性的个数，数组的每个元素为该属性的离散值链表。
（2）两个类：节点类和分裂信息
a）节点类Node
& & & 该类表示一个节点，属性包括节点选择的分裂属性、节点的输出类、孩子节点、深度等。注意，与ID3中相比，新增了两个属性：leafWrong和leafNode_Count分别表示叶子节点的总分类误差和叶子节点的个数，主要是为了方便剪枝。
1 class Node
/// &summary&
/// 每一个节点的分裂值
/// &/summary&
public List&String& features { get; set; }
/// &summary&
/// 分裂属性的类型{离散、连续}
/// &/summary&
public String feature_Type { get; set; }
/// &summary&
/// 分裂属性的下标
/// &/summary&
public String SplitFeature { get; set; }
//List&int& nums = new List&int&();
/// &summary&
/// 每一个类对应的数目
/// &/summary&
public double[] ClassCount { get; set; }
//int[] isUsed = new int[0];
//属性的使用情况 1：已用 2:未用
/// &summary&
/// 孩子节点
/// &/summary&
public List&Node& childNodes { get; set; }
Node Parent = null;
/// &summary&
/// 该节点占比最大的类别
/// &/summary&
public String finalResult { get; set; }
/// &summary&
/// 树的深度
/// &/summary&
public int deep { get; set; }
/// &summary&
/// 最大的类下标
/// &/summary&
public int result { get; set; }
/// &summary&
/// 子节点误差
/// &/summary&
public int leafWrong { get; set; }
/// &summary&
/// 子节点数目
/// &/summary&
public int leafNode_Count { get; set; }
/// &summary&
/// 数据量
/// &/summary&
public int rowCount { get; set; }
public void setClassCount(double[] count)
this.ClassCount =
double max = ClassCount[0];
int result = 0;
for (int i = 1; i & ClassCount.L i++)
if (max & ClassCount[i])
max = ClassCount[i];
this.result =
public double getErrorCount()
return rowCount - ClassCount[result];
b）分裂信息类，该类存储节点进行分裂的信息，包括各个子节点的行坐标、子节点各个类的数目、该节点分裂的属性、属性的类型等。
class SplitInfo
/// &summary&
/// 分裂的属性下标
/// &/summary&
public int splitIndex { get; set; }
/// &summary&
/// 数据类型
/// &/summary&
public int type { get; set; }
/// &summary&
/// 分裂属性的取值
/// &/summary&
public List&String& features { get; set; }
/// &summary&
/// 各个节点的行坐标链表
/// &/summary&
public List&int&[] temp { get; set; }
/// &summary&
/// 每个节点各类的数目
/// &/summary&
public double[][] class_Count { get; set; }
主方法findBestSplit(Node node,List&int& nums,int[] isUsed)，该方法对节点进行分裂
node表示即将进行分裂的节点；
nums表示节点数据对一个的行坐标列表；
isUsed表示到该节点位置所有属性的使用情况；
findBestSplit的这个方法主要有以下几个组成部分：
1）节点分裂停止的判定
节点分裂条件如上文所述，源代码如下：
public static bool ifEnd(Node node, double shang,int[] isUsed)
double[] count = node.ClassC
int rowCount = node.rowC
int maxResult = 0;
double maxRate = 0;
#region 数达到某一深度
int deep = node.
if (deep &= 10)
maxResult = node.result + 1;
node.feature_Type="result";
node.features=new List&String&() { maxResult + ""
node.leafWrong=rowCount - Convert.ToInt32(count[maxResult-1]);
node.leafNode_Count=1;
return true;
#endregion
#region 纯度(其实跟后面的有点重了，记得要修改)
//maxResult = 1;
//for (int i = 1; i & count.L i++)
if (count[i] / rowCount &= 0.95)
node.feature_Type="result";
node.features=new List&String& { "" + (i +
node.leafNode_Count=1;
node.leafWrong=rowCount - Convert.ToInt32
36 (count[i]);
#endregion
#region 熵为0
if (shang == 0)
maxRate = count[0] / rowC
maxResult = 1;
for (int i = 1; i & count.L i++)
if (count[i] / rowCount &= maxRate)
maxRate = count[i] / rowC
maxResult = i + 1;
node.feature_Type="result";
node.features=new List&String& { maxResult + ""
node.leafWrong=rowCount - Convert.ToInt32(count
60 [maxResult - 1]);
node.leafNode_Count=1;
return true;
#endregion
#region 属性已经分完
//int[] isUsed = node.getUsed();
bool flag = true;
for (int i = 0; i & isUsed.Length - 1; i++)
if (isUsed[i] == 0)
flag = false;
maxRate = count[0] / rowC
maxResult = 1;
for (int i = 1; i & count.L i++)
if (count[i] / rowCount &= maxRate)
maxRate = count[i] / rowC
maxResult = i + 1;
node.feature_Type=("result");
node.features=(new List&String& { "" +
91 (maxResult) });
node.leafWrong=(rowCount - Convert.ToInt32(count
94 [maxResult - 1]));
node.leafNode_Count=(1);
return true;
#endregion
#region 几点数少于100
if (rowCount & Limit_Node)
maxRate = count[0] / rowC
maxResult = 1;
for (int i = 1; i & count.L i++)
if (count[i] / rowCount &= maxRate)
maxRate = count[i] / rowC
maxResult = i + 1;
node.feature_Type="result";
node.features=new List&String& { "" + (maxResult)
node.leafWrong=rowCount - Convert.ToInt32(count
118 [maxResult - 1]);
node.leafNode_Count=1;
return true;
#endregion
return false;
catch (Exception e)
return false;
停止分裂的条件
2）寻找最优的分裂属性
寻找最优的分裂属性需要计算每一个分裂属性分裂后的GINI值或者样本方差，计算公式上文已给出，其中GINI值的计算代码如下：
public static double getGini(double[] counts, int countAll)
double Gini = 1;
for (int i = 0; i & counts.L i++)
Gini = Gini - Math.Pow(counts[i] / countAll, 2);
GINI值计算
3）进行分裂，同时对子节点进行迭代处理
其实就是递归的过程，对每一个子节点执行findBestSplit方法进行分裂。
findBestSplit源代码：
public static Node findBestSplit(Node node,List&int& nums,int[] isUsed)
//判断是否继续分裂
double totalShang = getGini(node.ClassCount, node.rowCount);
if (ifEnd(node, totalShang, isUsed))
#region 变量声明
SplitInfo info = new SplitInfo();
info.initial();
int RowCount = nums.C
//样本总数
double jubuMax = 1;
//局部最大熵
int splitPoint = 0;
//分裂的点
double splitValue = 0;
//分裂的值
#endregion
for (int i = 0; i & isUsed.Length - 1; i++)
if (isUsed[i] == 1)
#region 离散变量
if (type[i] == 0)
double[][] allCount = new double[allNum[i]][];
for (int j = 0; j & allCount.L j++)
allCount[j] = new double[classCount];
int[] countAllFeature = new int[allNum[i]];
List&int&[] temp = new List&int&[allNum[i]];
double[] allClassCount = node.ClassC
//所有类别的数量
for (int j = 0; j & temp.L j++)
temp[j] = new List&int&();
for (int j = 0; j & nums.C j++)
int index = Convert.ToInt32(allData[nums[j]][i]);
temp[index - 1].Add(nums[j]);
countAllFeature[index - 1]++;
allCount[index - 1][Convert.ToInt32(allData[nums[j]][lieshu - 1]) - 1]++;
double allShang = 1;
int choose = 0;
double[][] jubuCount = new double[2][];
for (int k = 0; k & allCount.L k++)
if (temp[k].Count == 0)
double JubuShang = 0;
double[][] tempCount = new double[2][];
tempCount[0] = allCount[k];
tempCount[1] = new double[allCount[0].Length];
for (int j = 0; j & tempCount[1].L j++)
tempCount[1][j] = allClassCount[j] - allCount[k][j];
JubuShang = JubuShang + getGini(tempCount[0], countAllFeature[k]) * countAllFeature[k] / RowC
int nodecount = RowCount - countAllFeature[k];
JubuShang = JubuShang + getGini(tempCount[1], nodecount) * nodecount / RowC
if (JubuShang & allShang)
allShang = JubuS
jubuCount = tempC
if (allShang & jubuMax)
info.type = 0;
jubuMax = allS
info.class_Count = jubuC
info.temp[0] = temp[choose];
info.temp[1] = new List&int&();
info.features = new List&string&();
info.features.Add((choose + 1) + "");
info.features.Add("");
for (int j = 0; j & temp.L j++)
if (j == choose)
for (int k = 0; k & temp[j].C k++)
info.temp[1].Add(temp[j][k]);
if (temp[j].Count != 0)
info.features[1] = info.features[1] + (j + 1) + ",";
info.splitIndex =
#endregion
#region 连续变量
double[] leftCunt = new double[classCount];
//做节点各个类别的数量
double[] rightCount = new double[classCount];
//右节点各个类别的数量
double[] count1 = new double[classCount];
//子集1的统计量
double[] count2 = new double
114 [node.ClassCount.Length];
//子集2的统计量
for (int j = 0; j & node.ClassCount.L
count2[j] = node.ClassCount[j];
int all1 = 0;
//子集1的样本量
int all2 = nums.C
//子集2的样本量
double lastValue = 0;
//上一个记录的类别
double currentValue = 0;
//当前类别
double lastPoint = 0;
//上一个点的值
double currentPoint = 0;
//当前点的值
double[] values = new double[nums.Count];
for (int j = 0; j & values.L j++)
values[j] = allData[nums[j]][i];
QSort(values, nums, 0, nums.Count - 1);
double lianxuMax = 1;
//连续型属性的最大熵
#region 寻找最佳的分割点
for (int j = 0; j & nums.Count - 1; j++)
currentValue = allData[nums[j]][lieshu -
currentPoint = (allData[nums[j]][i]);
if (j == 0)
lastValue = currentV
lastPoint = currentP
if (currentValue != lastValue &&
162 currentPoint != lastPoint)
double shang1 = getGini(count1,
166 all1);
double shang2 = getGini(count2,
169 all2);
double allShang = shang1 * all1 /
172 (all1 + all2) + shang2 * all2 / (all1 + all2);
//allShang = (totalShang - allShang);
if (lianxuMax & allShang)
lianxuMax = allS
for (int k = 0; k &
179 count1.L k++)
leftCunt[k] = count1[k];
rightCount[k] = count2[k];
splitPoint =
splitValue = (currentPoint +
187 lastPoint) / 2;
count1[Convert.ToInt32(currentValue) -
count2[Convert.ToInt32(currentValue) -
lastValue = currentV
lastPoint = currentP
#endregion
#region 如果超过了局部值，重设
if (lianxuMax & jubuMax)
info.type = 1;
info.splitIndex =
info.features=new List&string&()
209 {splitValue+""};
//finalPoint = splitP
jubuMax = lianxuM
info.temp[0] = new List&int&();
info.temp[1] = new List&int&();
for (int k = 0; k & splitP k++)
info.temp[0].Add(nums[k]);
for (int k = splitP k & nums.C
info.temp[1].Add(nums[k]);
info.class_Count[0] = new double
226 [leftCunt.Length];
info.class_Count[1] = new double
229 [leftCunt.Length];
for (int k = 0; k & leftCunt.L k++)
info.class_Count[0][k] = leftCunt[k];
info.class_Count[1][k] = rightCount
#endregion
#endregion
#region 没有寻找到最佳的分裂点，则设置为叶节点
if (info.splitIndex == -1)
double[] finalCount = node.ClassC
double max = finalCount[0];
int result = 1;
for (int i = 1; i & finalCount.L i++)
if (finalCount[i] & max)
max = finalCount[i];
result = (i + 1);
node.feature_Type="result";
node.features=new List&String& { "" + result };
#endregion
#region 分裂
int deep = node.
node.SplitFeature = ("" + info.splitIndex);
List&Node& childNode = new List&Node&();
int[][] used = new int[2][];
used[0] = new int[isUsed.Length];
used[1] = new int[isUsed.Length];
for (int i = 0; i & isUsed.L i++)
used[0][i] = isUsed[i];
used[1][i] = isUsed[i];
if (info.type == 0)
used[0][info.splitIndex] = 1;
node.feature_Type = ("离散");
//used[info.splitIndex] = 0;
node.feature_Type = ("连续");
List&int&[] rowIndex = info.
List&String& features = info.
Node node1 = new Node();
Node node2 = new Node();
node1.setClassCount(info.class_Count[0]);
node2.setClassCount(info.class_Count[1]);
node1.rowCount = info.temp[0].C
node2.rowCount = info.temp[1].C
node1.deep = deep + 1;
node2.deep = deep + 1;
node1 = findBestSplit(node1, info.temp[0],used[0]);
node2 = findBestSplit(node2, info.temp[1], used[1]);
node.leafNode_Count = (node1.leafNode_Count
297 +node2.leafNode_Count);
node.leafWrong = (node1.leafWrong+node2.leafWrong);
node.features = (features);
childNode.Add(node1);
childNode.Add(node2);
node.childNodes = childN
#endregion
catch (Exception e)
Console.WriteLine(e.StackTrace);
节点选择属性和分裂
代价复杂度剪枝方法（CCP）：
public static void getSeries(Node node)
Stack&Node& nodeStack = new Stack&Node&();
if (node != null)
nodeStack.Push(node);
if (node.feature_Type == "result")
List&Node& childs = node.childN
for (int i = 0; i & childs.C i++)
getSeries(node);
CCP代价复杂度剪枝
CART全部核心代码：
/// &summary&
/// 判断是否还需要分裂
/// &/summary&
/// &param name="node"&&/param&
/// &returns&&/returns&
public static bool ifEnd(Node node, double shang,int[] isUsed)
double[] count = node.ClassC
int rowCount = node.rowC
int maxResult = 0;
double maxRate = 0;
#region 数达到某一深度
int deep = node.
if (deep &= 10)
maxResult = node.result + 1;
node.feature_Type="result";
node.features=new List&String&() { maxResult + ""
node.leafWrong=rowCount - Convert.ToInt32(count[maxResult-1]);
node.leafNode_Count=1;
return true;
#endregion
#region 纯度(其实跟后面的有点重了，记得要修改)
//maxResult = 1;
//for (int i = 1; i & count.L i++)
if (count[i] / rowCount &= 0.95)
node.feature_Type="result";
node.features=new List&String& { "" + (i +
node.leafNode_Count=1;
node.leafWrong=rowCount - Convert.ToInt32
41 (count[i]);
#endregion
#region 熵为0
if (shang == 0)
maxRate = count[0] / rowC
maxResult = 1;
for (int i = 1; i & count.L i++)
if (count[i] / rowCount &= maxRate)
maxRate = count[i] / rowC
maxResult = i + 1;
node.feature_Type="result";
node.features=new List&String& { maxResult + ""
node.leafWrong=rowCount - Convert.ToInt32(count
65 [maxResult - 1]);
node.leafNode_Count=1;
return true;
#endregion
#region 属性已经分完
//int[] isUsed = node.getUsed();
bool flag = true;
for (int i = 0; i & isUsed.Length - 1; i++)
if (isUsed[i] == 0)
flag = false;
maxRate = count[0] / rowC
maxResult = 1;
for (int i = 1; i & count.L i++)
if (count[i] / rowCount &= maxRate)
maxRate = count[i] / rowC
maxResult = i + 1;
node.feature_Type=("result");
node.features=(new List&String& { "" +
96 (maxResult) });
node.leafWrong=(rowCount - Convert.ToInt32(count
99 [maxResult - 1]));
node.leafNode_Count=(1);
return true;
#endregion
#region 几点数少于100
if (rowCount & Limit_Node)
maxRate = count[0] / rowC
maxResult = 1;
for (int i = 1; i & count.L i++)
if (count[i] / rowCount &= maxRate)
maxRate = count[i] / rowC
maxResult = i + 1;
node.feature_Type="result";
node.features=new List&String& { "" + (maxResult)
node.leafWrong=rowCount - Convert.ToInt32(count
123 [maxResult - 1]);
node.leafNode_Count=1;
return true;
#endregion
return false;
catch (Exception e)
return false;
#region 排序算法
public static void InsertSort(double[] values, List&int& arr,
138 int StartIndex, int endIndex)
for (int i = StartIndex + 1; i &= endI i++)
int key = arr[i];
double init = values[i];
int j = i - 1;
while (j &= StartIndex && values[j] & init)
arr[j + 1] = arr[j];
values[j + 1] = values[j];
arr[j + 1] =
values[j + 1] =
static int SelectPivotMedianOfThree(double[] values, List&int& arr, int low, int high)
int mid = low + ((high - low) && 1);//计算数组中间的元素的下标
//使用三数取中法选择枢轴
if (values[mid] & values[high])//目标: arr[mid] &= arr[high]
swap(values, arr, mid, high);
if (values[low] & values[high])//目标: arr[low] &= arr[high]
swap(values, arr, low, high);
if (values[mid] & values[low]) //目标: arr[low] &= arr[mid]
swap(values, arr, mid, low);
//此时，arr[mid] &= arr[low] &= arr[high]
//low的位置上保存这三个位置中间的值
//分割时可以直接使用low位置的元素作为枢轴，而不用改变分割函数了
static void swap(double[] values, List&int& arr, int t1, int t2)
double temp = values[t1];
values[t1] = values[t2];
values[t2] =
int key = arr[t1];
arr[t1] = arr[t2];
static void QSort(double[] values, List&int& arr, int low, int high)
int first =
int last =
int left =
int right =
int leftLen = 0;
int rightLen = 0;
if (high - low + 1 & 10)
InsertSort(values, arr, low, high);
//一次分割
int key = SelectPivotMedianOfThree(values, arr, low,
206 high);//使用三数取中法选择枢轴
double inti = values[key];
int currentKey = arr[key];
while (low & high)
while (high & low && values[high] &= inti)
if (values[high] == inti)//处理相等元素
swap(values, arr, right, high);
rightLen++;
arr[low] = arr[high];
values[low] = values[high];
while (high & low && values[low] &= inti)
if (values[low] == inti)
swap(values, arr, left, low);
leftLen++;
arr[high] = arr[low];
values[high] = values[low];
arr[low] = currentK
values[low] = values[key];
//一次快排结束
//把与枢轴key相同的元素移到枢轴最终位置周围
int i = low - 1;
while (j & left && values[i] != inti)
swap(values, arr, i, j);
i = low + 1;
while (j & right && values[i] != inti)
swap(values, arr, i, j);
QSort(values, arr, first, low - 1 - leftLen);
QSort(values, arr, low + 1 + rightLen, last);
#endregion
/// &summary&
/// 寻找最佳的分裂点
/// &/summary&
/// &param name="num"&&/param&
/// &param name="node"&&/param&
public static Node findBestSplit(Node node,List&int& nums,int[] isUsed)
//判断是否继续分裂
double totalShang = getGini(node.ClassCount, node.rowCount);
if (ifEnd(node, totalShang, isUsed))
#region 变量声明
SplitInfo info = new SplitInfo();
info.initial();
int RowCount = nums.C
//样本总数
double jubuMax = 1;
//局部最大熵
int splitPoint = 0;
//分裂的点
double splitValue = 0;
//分裂的值
#endregion
for (int i = 0; i & isUsed.Length - 1; i++)
if (isUsed[i] == 1)
#region 离散变量
if (type[i] == 0)
double[][] allCount = new double[allNum[i]][];
for (int j = 0; j & allCount.L j++)
allCount[j] = new double[classCount];
int[] countAllFeature = new int[allNum[i]];
List&int&[] temp = new List&int&[allNum[i]];
double[] allClassCount = node.ClassC
//所有类别的数量
for (int j = 0; j & temp.L j++)
temp[j] = new List&int&();
for (int j = 0; j & nums.C j++)
int index = Convert.ToInt32(allData[nums[j]][i]);
temp[index - 1].Add(nums[j]);
countAllFeature[index - 1]++;
allCount[index - 1][Convert.ToInt32(allData[nums[j]][lieshu - 1]) - 1]++;
double allShang = 1;
int choose = 0;
double[][] jubuCount = new double[2][];
for (int k = 0; k & allCount.L k++)
if (temp[k].Count == 0)
double JubuShang = 0;
double[][] tempCount = new double[2][];
tempCount[0] = allCount[k];
tempCount[1] = new double[allCount[0].Length];
for (int j = 0; j & tempCount[1].L j++)
tempCount[1][j] = allClassCount[j] - allCount[k][j];
JubuShang = JubuShang + getGini(tempCount[0], countAllFeature[k]) * countAllFeature[k] / RowC
int nodecount = RowCount - countAllFeature[k];
JubuShang = JubuShang + getGini(tempCount[1], nodecount) * nodecount / RowC
if (JubuShang & allShang)
allShang = JubuS
jubuCount = tempC
if (allShang & jubuMax)
info.type = 0;
jubuMax = allS
info.class_Count = jubuC
info.temp[0] = temp[choose];
info.temp[1] = new List&int&();
info.features = new List&string&();
info.features.Add((choose + 1) + "");
info.features.Add("");
for (int j = 0; j & temp.L j++)
if (j == choose)
for (int k = 0; k & temp[j].C k++)
info.temp[1].Add(temp[j][k]);
if (temp[j].Count != 0)
info.features[1] = info.features[1] + (j + 1) + ",";
info.splitIndex =
#endregion
#region 连续变量
double[] leftCunt = new double[classCount];
//做节点各个类别的数量
double[] rightCount = new double[classCount];
//右节点各个类别的数量
double[] count1 = new double[classCount];
//子集1的统计量
double[] count2 = new double
379 [node.ClassCount.Length];
//子集2的统计量
for (int j = 0; j & node.ClassCount.L
count2[j] = node.ClassCount[j];
int all1 = 0;
//子集1的样本量
int all2 = nums.C
//子集2的样本量
double lastValue = 0;
//上一个记录的类别
double currentValue = 0;
//当前类别
double lastPoint = 0;
//上一个点的值
double currentPoint = 0;
//当前点的值
double[] values = new double[nums.Count];
for (int j = 0; j & values.L j++)
values[j] = allData[nums[j]][i];
QSort(values, nums, 0, nums.Count - 1);
double lianxuMax = 1;
//连续型属性的最大熵
#region 寻找最佳的分割点
for (int j = 0; j & nums.Count - 1; j++)
currentValue = allData[nums[j]][lieshu -
currentPoint = (allData[nums[j]][i]);
if (j == 0)
lastValue = currentV
lastPoint = currentP
if (currentValue != lastValue &&
427 currentPoint != lastPoint)
double shang1 = getGini(count1,
431 all1);
double shang2 = getGini(count2,
434 all2);
double allShang = shang1 * all1 /
437 (all1 + all2) + shang2 * all2 / (all1 + all2);
//allShang = (totalShang - allShang);
if (lianxuMax & allShang)
lianxuMax = allS
for (int k = 0; k &
444 count1.L k++)
leftCunt[k] = count1[k];
rightCount[k] = count2[k];
splitPoint =
splitValue = (currentPoint +
452 lastPoint) / 2;
count1[Convert.ToInt32(currentValue) -
count2[Convert.ToInt32(currentValue) -
lastValue = currentV
lastPoint = currentP
#endregion
#region 如果超过了局部值，重设
if (lianxuMax & jubuMax)
info.type = 1;
info.splitIndex =
info.features=new List&string&()
474 {splitValue+""};
//finalPoint = splitP
jubuMax = lianxuM
info.temp[0] = new List&int&();
info.temp[1] = new List&int&();
for (int k = 0; k & splitP k++)
info.temp[0].Add(nums[k]);
for (int k = splitP k & nums.C
info.temp[1].Add(nums[k]);
info.class_Count[0] = new double
491 [leftCunt.Length];
info.class_Count[1] = new double
494 [leftCunt.Length];
for (int k = 0; k & leftCunt.L k++)
info.class_Count[0][k] = leftCunt[k];
info.class_Count[1][k] = rightCount
#endregion
#endregion
#region 没有寻找到最佳的分裂点，则设置为叶节点
if (info.splitIndex == -1)
double[] finalCount = node.ClassC
double max = finalCount[0];
int result = 1;
for (int i = 1; i & finalCount.L i++)
if (finalCount[i] & max)
max = finalCount[i];
result = (i + 1);
node.feature_Type="result";
node.features=new List&String& { "" + result };
#endregion
#region 分裂
int deep = node.
node.SplitFeature = ("" + info.splitIndex);
List&Node& childNode = new List&Node&();
int[][] used = new int[2][];
used[0] = new int[isUsed.Length];
used[1] = new int[isUsed.Length];
for (int i = 0; i & isUsed.L i++)
used[0][i] = isUsed[i];
used[1][i] = isUsed[i];
if (info.type == 0)
used[0][info.splitIndex] = 1;
node.feature_Type = ("离散");
//used[info.splitIndex] = 0;
node.feature_Type = ("连续");
List&int&[] rowIndex = info.
List&String& features = info.
Node node1 = new Node();
Node node2 = new Node();
node1.setClassCount(info.class_Count[0]);
node2.setClassCount(info.class_Count[1]);
node1.rowCount = info.temp[0].C
node2.rowCount = info.temp[1].C
node1.deep = deep + 1;
node2.deep = deep + 1;
node1 = findBestSplit(node1, info.temp[0],used[0]);
node2 = findBestSplit(node2, info.temp[1], used[1]);
node.leafNode_Count = (node1.leafNode_Count
562 +node2.leafNode_Count);
node.leafWrong = (node1.leafWrong+node2.leafWrong);
node.features = (features);
childNode.Add(node1);
childNode.Add(node2);
node.childNodes = childN
#endregion
catch (Exception e)
Console.WriteLine(e.StackTrace);
/// &summary&
/// GINI值
/// &/summary&
/// &param name="counts"&&/param&
/// &param name="countAll"&&/param&
/// &returns&&/returns&
public static double getGini(double[] counts, int countAll)
double Gini = 1;
for (int i = 0; i & counts.L i++)
Gini = Gini - Math.Pow(counts[i] / countAll, 2);
#region CCP剪枝
public static void getSeries(Node node)
Stack&Node& nodeStack = new Stack&Node&();
if (node != null)
nodeStack.Push(node);
if (node.feature_Type == "result")
List&Node& childs = node.childN
for (int i = 0; i & childs.C i++)
getSeries(node);
/// &summary&
/// 遍历剪枝
/// &/summary&
/// &param name="node"&&/param&
public static Node getNode1(Node node, Node nodeCut)
//List&Node& childNodes = node.getChild();
//double min = 100000;
////Node nodeCut = new Node();
//double temp = 0;
//for (int i = 0; i & childNodes.C i++)
if (childNodes[i].getType() != "result")
//if (!cutTree(childNodes[i]))
min = cutTree(childNodes[i], min);
if (min & temp)
nodeCut = childNodes[i];
getNode1(childNodes[i], nodeCut);
//node.setChildNode(childNodes);
return null;
/// &summary&
/// 对每一个节点剪枝
/// &/summary&
public static double cutTree(Node node, double minA)
int rowCount = node.rowC
double leaf = node.getErrorCount();
double[] values = getError1(node, 0, 0);
double treeWrong = values[0];
double son = values[1];
double rate = (leaf - treeWrong) / (son - 1);
if (minA & rate)
//double var = Math.Sqrt(treeWrong * (1 - treeWrong /
649 rowCount));
//double panbie = treeWrong + var -
//if (panbie & 0)
node.setFeatureType("result");
node.setChildNode(null);
int result = (node.getResult() + 1);
node.setFeatures(new List&String&() { "" + result
return minA;
/// &summary&
/// 获得子树的错误个数
/// &/summary&
/// &param name="node"&&/param&
/// &returns&&/returns&
public static double[] getError1(Node node, double treeError,
670 double son)
if (node.feature_Type == "result")
double error = node.getErrorCount();
return new double[] { treeError + error, son };
List&Node& childNode = node.childN
for (int i = 0; i & childNode.C i++)
double[] values = getError1(childNode[i], treeError,
treeError = values[0];
son = values[1];
return new double[] { treeError, son };
#endregion
CART核心代码
（1）CART是一棵二叉树，每一次分裂会产生两个子节点，对于连续性的数据，直接采用与C4.5相似的处理方法，对于离散型数据，选择最优的两种离散值组合方法。
（2）CART既能是分类数，又能是二叉树。如果是分类树，将选择能够最小化分裂后节点GINI值的分裂属性；如果是回归树，选择能够最小化两个节点样本方差的分裂属性。
（3）CART跟C4.5一样，需要进行剪枝，采用CCP（代价复杂度的剪枝方法）。}

久游无息网