Decision Trees - RDD-Based API

Basic algorithm

Decision trees and their ensembles are popular methods for classification and regression.

Node impurity and information gain

Node impurity is a measure of the homogeneity of the labels at the node. The current implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance).
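The three measures can be written down directly. The sketch below is a minimal, self-contained illustration (not MLlib's implementation); it assumes entropy is taken in base 2.

```python
# Minimal sketch of the three impurity measures described above:
# Gini impurity and entropy for classification, variance for regression.
import math
from collections import Counter

def gini(labels):
    """Gini impurity: sum over classes of f_i * (1 - f_i)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum over classes of f_i * log2(f_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def variance(values):
    """Variance: mean squared deviation from the node's mean label."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

# A pure node has zero impurity under every measure;
# a 50/50 binary split has entropy 1.0 and Gini impurity 0.5.
print(gini([1, 1, 1]))   # 0.0
print(entropy([0, 1]))   # 1.0
```

Information gain at a split is then the parent's impurity minus the weighted average of the two children's impurities.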

Split candidates

Continuous features

Categorical features

Stopping rule

The recursive tree construction is stopped at a node when one of the following conditions is met:

  • The node depth is equal to the maxDepth training parameter.
  • No split candidate leads to an information gain greater than minInfoGain.
  • No split candidate produces child nodes which each have at least minInstancesPerNode training instances.

Usage tips

We include a few guidelines for using decision trees by discussing the various parameters. The parameters are listed below roughly in order of descending importance. New users should mainly consider the "Problem specification parameters" section and the maxDepth parameter.

Problem specification parameters

These parameters describe the problem you want to solve and your dataset. They should be specified and do not require tuning.

  • algo: Type of decision tree, either Classification or Regression.
  • numClasses: Number of classes (for Classification only).
  • categoricalFeaturesInfo: Specifies which features are categorical and how many categorical values each of those features can take.

    This is given as a map from feature indices to feature arity (number of categories). Any features not in this map are treated as continuous.
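As a plain-Python illustration of that convention, the dict and helper below are hypothetical (they are not part of MLlib); they only show a feature-index-to-arity map and the rule that absent features are continuous.

```python
# Hypothetical example of a categoricalFeaturesInfo-style map:
# feature 0 is binary, feature 4 has 10 categories; all other features
# are absent from the map and therefore treated as continuous.
categorical_features_info = {0: 2, 4: 10}

def feature_kind(index, info):
    """Describe a feature according to the arity map."""
    if index in info:
        return "categorical with %d categories" % info[index]
    return "continuous"

print(feature_kind(0, categorical_features_info))  # categorical with 2 categories
print(feature_kind(1, categorical_features_info))  # continuous
```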

Stopping criteria

Tunable parameters

Caching and checkpointing

Scaling

Examples

Classification

Regression