Difference between min_samples_split and min_samples_leaf in sklearn DecisionTreeClassifier

Question

From the documentation:

The main difference between the two is that min_samples_leaf guarantees a minimum number of samples in a leaf, while min_samples_split can create arbitrary small leaves, though min_samples_split is more common in the literature.

To understand this passage, it helps to distinguish between a leaf (also called an external node) and an internal node. An internal node is split further, so it has children, while a leaf is by definition a node without any children (it is not split any further).

min_samples_split specifies the minimum number of samples required to split an internal node, while min_samples_leaf specifies the minimum number of samples required to be at a leaf node.

For instance, if min_samples_split = 5 and there are 7 samples at an internal node, then the node is eligible to be split. But suppose the split would result in two leaves, one with 1 sample and another with 6 samples. If min_samples_leaf = 2, then that split won’t be allowed (even though the node has 7 samples), because one of the resulting leaves would have fewer than the minimum number of samples required to be at a leaf node.
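The two checks in this example can be written out as a small sketch. The helper split_is_allowed below is hypothetical (it is not part of sklearn) and only mirrors the two rules as described here, not the library's actual implementation:

```python
def split_is_allowed(n_node, n_left, n_right, min_samples_split, min_samples_leaf):
    """Hypothetical helper mirroring the two rules; not sklearn code."""
    # Rule 1: the node needs at least min_samples_split samples
    # to be considered for splitting at all.
    if n_node < min_samples_split:
        return False
    # Rule 2: each child produced by the split needs at least
    # min_samples_leaf samples.
    return min(n_left, n_right) >= min_samples_leaf


# 7 samples split into 1 and 6, with min_samples_split = 5:
print(split_is_allowed(7, 1, 6, min_samples_split=5, min_samples_leaf=1))  # True
print(split_is_allowed(7, 1, 6, min_samples_split=5, min_samples_leaf=2))  # False
# A different split of the same node (2 and 5) would still pass both rules:
print(split_is_allowed(7, 2, 5, min_samples_split=5, min_samples_leaf=2))  # True
```

Note that rejecting the 1/6 split does not necessarily turn the node into a leaf: another candidate split that leaves enough samples on both sides may still be chosen.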

As the documentation referenced above mentions, min_samples_leaf guarantees a minimum number of samples in every leaf, no matter the value of min_samples_split.
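One way to check this guarantee is to fit a few small trees and look at the leaf sizes. The following is a minimal sketch: the 7-sample toy data and the helper leaf_sizes are assumptions made up for illustration, and leaves are identified through the fitted tree_ attribute, where children_left is -1 for a leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration): 7 samples, one feature,
# with a single positive sample at the far left.
X = np.arange(7).reshape(-1, 1)
y = np.array([1, 0, 0, 0, 0, 0, 0])


def leaf_sizes(min_samples_split, min_samples_leaf):
    """Hypothetical helper: fit a tree and return the sorted leaf sizes."""
    clf = DecisionTreeClassifier(
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=0,
    ).fit(X, y)
    tree = clf.tree_
    is_leaf = tree.children_left == -1  # leaves have no children
    return sorted(int(n) for n in tree.n_node_samples[is_leaf])


# Print the leaf sizes for several parameter combinations.
for mss, msl in [(5, 1), (5, 2), (2, 3)]:
    print(f"min_samples_split={mss}, min_samples_leaf={msl}:", leaf_sizes(mss, msl))
```

On this data, min_samples_leaf = 1 lets the tree isolate the single positive sample in a leaf of size 1, while larger values of min_samples_leaf force every leaf to hold at least that many samples, whatever min_samples_split is set to.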
