
Random Forest Algorithm Demo (Python / Spark)

Source: Tuling Education (图灵教育)
Date: 2023-06-01 09:51:37

Key Parameters

Most importantly, two parameters usually need to be tuned to improve the algorithm's performance: numTrees and maxDepth.

  • numTrees (the number of decision trees): increasing the number of trees decreases the variance of the predictions, giving higher accuracy at test time. Training time grows roughly linearly with numTrees.
  • maxDepth: the maximum depth of each decision tree in the forest; this is the same parameter discussed in the decision tree guide. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that random forests and single decision trees place different demands on this parameter. Because a random forest votes on, or averages, the predictions of many decision trees, the variance of its predictions is reduced, so it is less likely to overfit than a single tree. A random forest can therefore use a larger maxDepth than a single decision tree model; some references even suggest growing each tree in the forest to its maximum depth without pruning. In any case, it is worth experimenting with maxDepth to see whether it improves predictive performance.

There are two further parameters, subsamplingRate and featureSubsetStrategy, that generally do not need tuning. They can, however, be adjusted to speed up training; note that doing so may affect the model's predictive performance (if you need to tune them, read the English guidelines below carefully).
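The variance-reduction argument above can be illustrated without Spark at all. The following is a minimal, library-free sketch (the names, such as noisy_tree_prediction, are hypothetical and not part of any Spark API): each "tree" is simulated as an unbiased but noisy estimator, and averaging more of them shrinks the variance of the ensemble prediction roughly like 1/numTrees.

```python
import random
import statistics

random.seed(42)
TRUE_VALUE = 1.0

def noisy_tree_prediction():
    # Stand-in for a single decision tree: unbiased but noisy.
    return TRUE_VALUE + random.gauss(0, 0.5)

def ensemble_prediction(num_trees):
    # Random-forest-style averaging of independent tree predictions.
    return sum(noisy_tree_prediction() for _ in range(num_trees)) / num_trees

def prediction_variance(num_trees, trials=2000):
    # Empirical variance of the ensemble prediction over many repetitions.
    preds = [ensemble_prediction(num_trees) for _ in range(trials)]
    return statistics.pvariance(preds)

print("variance with 1 tree:  ", round(prediction_variance(1), 4))
print("variance with 50 trees:", round(prediction_variance(50), 4))
```

With 50 simulated trees the measured variance drops to a small fraction of the single-tree variance, which is why larger numTrees tends to improve test-time accuracy.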

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide. The first two parameters we mention are the most important, and tuning them can often improve performance:

(1) numTrees: Number of trees in the forest. Increasing the number of trees will decrease the variance in predictions, improving the model's test-time accuracy. Training time increases roughly linearly in the number of trees.

(2) maxDepth: Maximum depth of each tree in the forest. Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting. In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).

The next two parameters generally do not require tuning. However, they can be tuned to speed up training.

(3) subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.

(4) featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
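What subsamplingRate and featureSubsetStrategy control can be sketched in plain Python. This is an illustrative toy, not Spark's implementation: rows_for_tree mimics per-tree bootstrap sampling at a given fraction of the dataset, and split_candidates mimics drawing a random feature subset at each node (the "sqrt" heuristic is a common choice for classification forests).

```python
import math
import random

random.seed(0)
dataset = list(range(1000))   # stand-in for 1000 training rows
NUM_FEATURES = 64

def rows_for_tree(subsampling_rate=1.0):
    # subsamplingRate: each tree trains on a bootstrap sample whose size is
    # this fraction of the original dataset (sampled with replacement).
    n = int(len(dataset) * subsampling_rate)
    return [random.choice(dataset) for _ in range(n)]

def split_candidates(strategy="sqrt"):
    # featureSubsetStrategy: how many features a node considers when splitting.
    # "sqrt" draws sqrt(total features) random candidates; "all" uses every one.
    if strategy == "sqrt":
        k = int(math.sqrt(NUM_FEATURES))
    else:
        k = NUM_FEATURES
    return random.sample(range(NUM_FEATURES), k)

print(len(rows_for_tree(0.5)))       # half-size samples -> faster trees
print(len(split_candidates("sqrt"))) # 8 of 64 features considered per node
```

Shrinking either quantity means less work per tree, which is the speed-up the guidelines describe; shrinking them too far starves each tree of data or features, which is the performance risk.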

"""Random Forest Classification Example."""
from __future__ import print_function

from pyspark import SparkContext
# $example on$
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
# $example off$

if __name__ == "__main__":
    sc = SparkContext(appName="PythonRandomForestClassificationExample")
    # $example on$
    # Load and parse the data file into an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.
    #  Empty categoricalFeaturesInfo indicates all features are continuous.
    #  Note: Use larger numTrees in practice.
    #  Setting featureSubsetStrategy="auto" lets the algorithm choose.
    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=3, featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=4, maxBins=32)

    # Evaluate model on test instances and compute test error.
    # Note: tuple unpacking in lambdas (lambda (v, p): ...) is Python 2 only;
    # index into the (label, prediction) pair instead.
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testErr = labelsAndPredictions.filter(
        lambda vp: vp[0] != vp[1]).count() / float(testData.count())
    print('Test Error = ' + str(testErr))
    print('Learned classification forest model:')
    print(model.toDebugString())

    # Save and load model
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
    # $example off$

What the model looks like:

TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
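The printed ensemble can be read by hand. As a sketch, each tree from the debug output above is transcribed below into a plain function over a sparse feature dict {index: value} (missing indices default to 0.0, as in the libsvm format), and the forest's class is the majority vote of the three trees. The function names and the example input are hypothetical, chosen only for illustration.

```python
def tree0(f):
    # Tree 0 from the debug output above.
    if f.get(511, 0.0) <= 0.0:
        return 0.0 if f.get(434, 0.0) <= 0.0 else 1.0
    return 0.0

def tree1(f):
    # Tree 1: a single split on feature 490.
    return 0.0 if f.get(490, 0.0) <= 31.0 else 1.0

def tree2(f):
    # Tree 2: nested splits on features 302, 461, and 208.
    if f.get(302, 0.0) <= 0.0:
        if f.get(461, 0.0) <= 0.0:
            return 1.0 if f.get(208, 0.0) <= 107.0 else 0.0
        return 1.0
    return 0.0

def forest_predict(f):
    # Classification forests predict by majority vote across the trees.
    votes = [tree0(f), tree1(f), tree2(f)]
    return max(set(votes), key=votes.count)

example = {490: 40.0, 434: 5.0}   # hypothetical sparse input
print(forest_predict(example))    # all three trees vote 1.0
```

Tracing an input through the trees this way is a handy sanity check that toDebugString() output matches what model.predict returns.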