Predictive Equivalence in Decision Trees
This blog post is based on joint work with Hayden McTavish and Jon Donnelly, advised by Drs. Margo Seltzer and Cynthia Rudin, and is presented in our ICML 2025 paper, "Leveraging Predictive Equivalence in Decision Trees".
A Top Down View of Decision Trees
Figure 1: Two decision trees which encode precisely the same logical expression. Splits to the right correspond to the variable being True, left to False.
Decision trees are generally accepted as a very interpretable model class for understanding tabular machine learning problems. Look through a classical machine learning textbook, and you'll find that algorithms for 'optimizing' decision trees are a dime a dozen -- CART, ID3, C4.5, and C5.0, to name a few (Scikit-Learn, for instance, implements an optimized version of CART). Modern research (see GOSDT and SPLIT, STreeD, and DL8.5, among others) has taken these algorithms far beyond the old days of greedy optimization into the world of true optimality, subject to depth constraints and limits on the number of leaves in the tree. All of these methods embed the same (very natural) intuition about trees -- that you evaluate them from the top down. In the example above, to make a prediction on a new sample, you keep testing variables until you reach a leaf, which finally outputs either 0 or 1. This is entirely unsurprising -- all of machine learning pedagogy describes trees as flowcharts that you follow from the top down to make predictions.
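To make that flowchart intuition concrete, here's a minimal sketch (my own illustration, not code from the paper) of top-down evaluation, using a nested-tuple encoding of the right-hand tree from Figure 1 as I read it -- right branch taken when the feature is True, per the caption:

```python
# Illustrative top-down evaluation of a decision tree.
# Internal nodes are (feature, subtree_if_False, subtree_if_True); leaves are 0/1.
RIGHT_TREE = ("A",
              ("B", 0, 1),   # A is False: the prediction is just B
              ("C", 0, 1))   # A is True: the prediction is just C

def predict_top_down(node, sample):
    """Walk from the root to a leaf, testing one feature at a time."""
    while isinstance(node, tuple):
        feature, if_false, if_true = node
        node = if_true if sample[feature] else if_false
    return node

print(predict_top_down(RIGHT_TREE, {"A": True,  "B": False, "C": True}))   # 1
print(predict_top_down(RIGHT_TREE, {"A": False, "B": False, "C": True}))   # 0
```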
Let's dive deeper into this example, beyond what your machine learning textbook will tell you. If we write out each positive leaf as a conjunction of the conditions along its path, and then take the OR over those leaves, the left tree represents the logical expression,
(B and C) or
(B and (not C) and (not A)) or
((not B) and C and A)
The right tree, similarly, encodes the expression
(A and C) or
((not A) and B)
So what? Although the two trees seem to have different paths from top to bottom, these two expressions are equivalent! We leave proving this as an exercise for the reader.
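If you'd rather not push the boolean algebra around by hand, a brute-force truth-table check over the eight possible inputs settles it; here's a quick sketch (mine, not the paper's):

```python
from itertools import product

def left_tree(A, B, C):
    return (B and C) or (B and not C and not A) or (not B and C and A)

def right_tree(A, B, C):
    return (A and C) or (not A and B)

# Enumerate all 2^3 assignments of (A, B, C) and confirm the expressions agree.
assert all(left_tree(A, B, C) == right_tree(A, B, C)
           for A, B, C in product([False, True], repeat=3))
print("The two trees agree on every possible input.")
```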
That means that the two trees will make the same prediction on any data you give to them; we refer to such trees as “predictively equivalent”. Predictive equivalence is more than just a neat observation. It has huge implications for your understanding of your model, both in terms of which features are important to it and how those features are combined to make a final prediction.
Variable Variable Importance
Figure 2: Gini impurity calculations when sending one sample down each leaf.
Let's say we have fit these trees with the Scikit-Learn package, and we're using the reduction in Gini impurity attributable to each feature as our measure of feature importance. This is the default feature importance metric in Scikit-Learn, so you could hardly be blamed for using it! However, if we do exactly that, in the left tree we find that feature A has a Gini importance of 0.5 while B and C each have 0.25; in the right tree, feature A has a Gini importance of 0 while B and C each have 0.5. If you and I, as practitioners, found these two trees independently of one another and reported feature importances, we would disagree entirely about the relative importance of every single feature -- even though the trees always make the same predictions!
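You can watch this instability happen without computing anything by hand. The sketch below (again mine, not the paper's experiment) fits Scikit-Learn's DecisionTreeClassifier on the eight-row truth table of the expression above; because several candidate splits tie on Gini improvement, and Scikit-Learn randomly permutes features at each split, different random_state values can produce different -- predictively equivalent -- trees that report different feature_importances_:

```python
from itertools import product
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Truth table of the target concept (A and C) or ((not A) and B).
X = np.array(list(product([0, 1], repeat=3)))          # columns: A, B, C
y = np.array([(a and c) or ((not a) and b) for a, b, c in X], dtype=int)

# Scikit-Learn permutes features randomly at each split, so when several
# splits tie on Gini improvement, different seeds can produce different
# (predictively equivalent) trees with different reported importances.
for seed in (0, 1, 2, 3):
    clf = DecisionTreeClassifier(criterion="gini", random_state=seed).fit(X, y)
    print(seed, clf.feature_importances_.round(2))
```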
"Great," you say, "but I have one over on you. Let's use a feature importance metric, like the Rashomon Importance Distribution, which only depends on the predictions, so predictive equivalence can't hurt us." This is certainly a step in the right direction, but we sometimes encounter a sneakier problem. I will refer you to our paper for the details, but as a teaser -- predictive equivalence can bias variable importance metrics which operate over the Rashomon set, or the entire set of near-optimal decision trees, by duplicating certain models and overemphasizing the importance of variables in those duplicated models!
Who Needs Imputation?
Figure 3: Percent of samples for which we can still use a tree to make predictions when X% of feature values are artificially missing at random. The used-features method gives up if any feature used by the tree is missing; the path-based method follows the tree from root to leaf; and our method uses the logical form of the tree.
I've just told you a bit about what predictive equivalence does to us -- it forces us to shed Gini importance as a reliable metric for learning about data. Now let's talk about something predictive equivalence can do for us. Let's assume that, in the example above, variable B is missing: somebody forgot to record it in the database, they didn't order the right test, or the bits storing it on disk were flipped by a cosmic ray. The left tree is catastrophically affected by this -- it cannot make any predictions (tracing from the top down) if variable B is missing. The right tree, however, only cares about the outcome of B when A is False!
Now flip the example and assume that we're missing variable A instead. If variables B and C are both True, or if they're both False, then the value of A does not matter (you can see this by inspecting either tree, or by plugging these values into the logical expressions and simplifying down to True or False). Yet the tree on the right, evaluated from the top down, would mistakenly throw its hands up on all of these examples! This matters in the real world -- on the four datasets shown above, taking predictive equivalence into account lets us make predictions on a much higher proportion of examples than the usual top-down evaluation method when we artificially inject missingness into the data.
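One simple way to exploit this (a sketch of the general idea, not the paper's exact algorithm) is to enumerate every completion of the missing features: if the tree's logical form gives the same answer for all of them, we can safely predict, and otherwise we abstain.

```python
from itertools import product

def tree_as_boolean(A, B, C):
    # Logical form shared by both trees in Figure 1.
    return (A and C) or (not A and B)

def predict_with_missing(sample, features=("A", "B", "C")):
    """Return True/False if every completion of the missing features agrees,
    or None (abstain) if the missing values could change the prediction."""
    missing = [f for f in features if sample.get(f) is None]
    outcomes = set()
    for completion in product([False, True], repeat=len(missing)):
        filled = dict(sample, **dict(zip(missing, completion)))
        outcomes.add(bool(tree_as_boolean(filled["A"], filled["B"], filled["C"])))
    return outcomes.pop() if len(outcomes) == 1 else None

print(predict_with_missing({"A": None, "B": True,  "C": True}))   # True: A is irrelevant
print(predict_with_missing({"A": None, "B": False, "C": False}))  # False: A is irrelevant
print(predict_with_missing({"A": None, "B": True,  "C": False}))  # None: depends on A
```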
Cost Optimization
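The paper goes into detail here, but the core observation is easy to illustrate: predictively equivalent trees can require different numbers of (possibly expensive) feature tests. Below is a small sketch of that point -- my own reconstruction of the two trees in Figure 1 as nested tuples, under the simplifying assumption that each feature is an independent fair coin flip:

```python
# Internal nodes: (feature, subtree_if_False, subtree_if_True); leaves: 0/1.
# Structures reconstructed from the logical expressions above.
LEFT_TREE = ("B",
             ("C", 0, ("A", 0, 1)),   # B False: test C, then sometimes A
             ("C", ("A", 1, 0), 1))   # B True: test C, then sometimes A
RIGHT_TREE = ("A",
              ("B", 0, 1),            # A False: exactly one more test (B)
              ("C", 0, 1))            # A True: exactly one more test (C)

def expected_num_tests(node):
    """Expected number of feature tests, assuming each feature is an
    independent fair coin flip (a simplifying assumption for illustration)."""
    if not isinstance(node, tuple):
        return 0.0
    _, if_false, if_true = node
    return 1.0 + 0.5 * expected_num_tests(if_false) + 0.5 * expected_num_tests(if_true)

print(expected_num_tests(LEFT_TREE))   # 2.5 tests on average
print(expected_num_tests(RIGHT_TREE))  # 2.0 tests on average
```

Under this (purely illustrative) input distribution, the left tree needs 2.5 tests on average while the right tree always needs exactly 2 -- so if measuring a feature costs money or time, two predictively equivalent forms are not equally good.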
Conclusion
- Decision trees have several, and sometimes many, predictively equivalent forms.
- These forms have different properties, including different measures of variable importance, abilities to handle missing data, and evaluation costs.
- Predictive equivalence almost certainly affects many other decision tree applications.