Datafilos: Missing values in a decision tree

Thursday, December 31, 2015

There are multiple ways a decision tree can deal with missing values in the data.

When a decision has to be made on an attribute that is missing, scoring of the instance can stop and the class probabilities of the current node can be returned as the prediction. Note that the implementation then has to keep class probabilities for every node in the tree, not only for the leaves.
Alternatively, we may keep statistics on how many training samples go into each node. If a decision has to be made on a missing value, the instance is sent to the most frequent descendant, that is, the child that received the most training samples.
We may also train the tree how to score a missing value. For example, suppose a split is learned on a continuous attribute and 90% of the training samples go into the right descendant. The model can still learn that instances with a missing value are, judging by their class labels, more similar to the instances in the left descendant. Hence it sends instances with a missing attribute value left.
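
The three strategies above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular library: the `Node` class, its fields (`class_probs`, `n_samples`, `missing_goes_left`), the strategy names and the toy tree are all assumptions made up for this example. The sketch also assumes that the default direction for missing values was already learned during training and is simply stored in the node.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    # Hypothetical node layout, invented for this sketch.
    class_probs: Dict[str, float]    # class distribution of training samples at this node
    n_samples: int                   # number of training samples that reached this node
    feature: Optional[str] = None    # attribute tested here; None marks a leaf
    threshold: float = 0.0           # go left if value <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    missing_goes_left: bool = True   # default direction assumed to be learned in training


def is_missing(value) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def predict(node: Node, instance: dict, strategy: str) -> Dict[str, float]:
    """Score one instance; `strategy` selects how a missing value is handled."""
    while node.feature is not None:
        value = instance.get(node.feature)
        if is_missing(value):
            if strategy == "stop":
                # Strategy 1: stop here and return this node's class probabilities.
                return node.class_probs
            if strategy == "majority":
                # Strategy 2: follow the child that received more training samples.
                go_left = node.left.n_samples >= node.right.n_samples
            elif strategy == "learned":
                # Strategy 3: follow the direction stored during training.
                go_left = node.missing_goes_left
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        else:
            go_left = value <= node.threshold
        node = node.left if go_left else node.right
    return node.class_probs


# Toy tree: 90% of the training samples go right, but missing values are sent left.
left_leaf = Node(class_probs={"a": 0.9, "b": 0.1}, n_samples=10)
right_leaf = Node(class_probs={"a": 0.1, "b": 0.9}, n_samples=90)
tree = Node(class_probs={"a": 0.18, "b": 0.82}, n_samples=100,
            feature="x", threshold=5.0,
            left=left_leaf, right=right_leaf, missing_goes_left=True)

for name in ("stop", "majority", "learned"):
    print(name, predict(tree, {"x": None}, name))
```

With `x` missing, the `stop` strategy returns the root's probabilities, `majority` ends in the right leaf, and `learned` ends in the left leaf. An instance with a known value, such as `{"x": 7.0}`, follows the ordinary split regardless of the strategy.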
