Predictive Equivalence in Decision Trees

This blog post is based on joint work with Hayden McTavish and Jon Donnelly, advised by Drs. Margo Seltzer and Cynthia Rudin, and presented in our ICML 2025 paper, "Leveraging Predictive Equivalence in Decision Trees".

A Top Down View of Decision Trees

Figure 1: Two decision trees which encode precisely the same logical expression. Splits to the right correspond to the variable being True, left to False.


Decision trees are generally accepted as a highly interpretable model class for tabular machine learning problems. Look through a classical machine learning textbook and you'll find that algorithms for 'optimizing' decision trees are a dime a dozen -- ID3, C4.5, C5.0, and CART (the last of which Scikit-Learn implements). Modern research (see GOSDT and SPLIT, STreeD, and DL8.5, among others) has taken these algorithms far beyond the old days of greedy optimization into the world of true optimality, subject to depth constraints and limits on the number of leaves in the tree. All of these methods embed the same (very natural) intuition about trees -- that you evaluate them from the top down. In the example above, to make a prediction on a new sample, you keep testing variables until you reach a leaf, which finally outputs either 0 or 1. This is entirely unsurprising -- machine learning pedagogy describes trees as flowcharts that you follow from the top down to make predictions.
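
To make the top-down view concrete, here is a minimal sketch (in Python) of the left tree from Figure 1 written as a nested if/else flowchart. The split order (B at the root, then C, then A where needed) is our reading of the figure, inferred from the leaf clauses listed below.

# Top-down evaluation of the left tree of Figure 1, with boolean features.
# The split order (B, then C, then A where needed) is assumed from the figure.
def predict_left_tree(A: bool, B: bool, C: bool) -> int:
    if B:
        if C:
            return 1            # leaf: B and C
        return 0 if A else 1    # leaves: B and (not C) and A / (not A)
    if C:
        return 1 if A else 0    # leaves: (not B) and C and A / (not A)
    return 0                    # leaf: (not B) and (not C)

print(predict_left_tree(A=False, B=True, C=True))  # walks B -> C and prints 1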


Let's dive deeper into this example, beyond what your machine learning textbook will tell you. If we write each positively-labeled leaf as the conjunction of the conditions along its path, and take the disjunction over those leaves, the left tree represents the logical expression


(B and C) or 

(B and (not C) and (not A)) or

((not B) and C and A)


The right tree, similarly, encodes the expression


(A and C) or

((not A) and B)


So what? Although the two trees test their variables in different orders and have entirely different paths from top to bottom, these two expressions are equivalent! We leave proving this as an exercise for the reader.
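
You can also let a computer do the exercise: with only three boolean variables, a brute-force check over all eight assignments confirms that the two expressions above always agree. A quick sketch:

from itertools import product

def left_expr(A, B, C):
    return (B and C) or (B and not C and not A) or (not B and C and A)

def right_expr(A, B, C):
    return (A and C) or (not A and B)

# Only 2^3 = 8 assignments to check.
assert all(left_expr(A, B, C) == right_expr(A, B, C)
           for A, B, C in product([False, True], repeat=3))
print("The two trees agree on every possible input.")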


That means that the two trees will make the same prediction on any data you give to them; we refer to such trees as “predictively equivalent”. Predictive equivalence is more than just a neat observation. It has huge implications for your understanding of your model, both in terms of which features are important to it and how those features are combined to make a final prediction. 

Variable Variable Importance

Figure 2: Gini impurity calculations when sending one sample down each leaf.


Let's say we've fit these trees with the Scikit-Learn package, and we're using the reduction in Gini impurity attributable to each feature as a measure of feature importance. This is Scikit-Learn's default feature importance metric, so you wouldn't be remiss in using it! However, if we do exactly that, in the left tree we find that feature A has a Gini importance of 0.5 while B and C each have 0.25; in the right tree, feature A has a Gini importance of 0 while B and C each have 0.5. If you and I, as practitioners, found these two trees independently of one another and reported feature importances, we would disagree entirely about the relative importance of every single feature -- even though the trees always make the same predictions!
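
If you'd like to see how easily this happens in practice, here's a small sketch: we fit a Scikit-Learn tree on the truth table of the boolean function from Figure 1 and read off the default Gini-based importances. Which tree (and therefore which importance values) you get depends on the form the learner happens to build, so the printed numbers describe that particular tree, not the underlying function.

from itertools import product
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Truth table of the function from Figure 1; feature order is (A, B, C).
X = np.array(list(product([0, 1], repeat=3)))
y = np.array([int((a and c) or ((not a) and b)) for a, b, c in X])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ is the mean decrease in Gini impurity -- it summarizes
# whichever predictively equivalent tree was learned, not the function itself.
for name, importance in zip(["A", "B", "C"], clf.feature_importances_):
    print(f"{name}: {importance:.2f}")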


"Great," you say, "but I have one over on you. Let's use a feature importance metric, like the Rashomon Importance Distribution, which only depends on the predictions, so predictive equivalence can't hurt us." This is certainly a step in the right direction, but we sometimes encounter a sneakier problem. I will refer you to our paper for the details, but as a teaser -- predictive equivalence can bias variable importance metrics which operate over the Rashomon set, or the entire set of near-optimal decision trees, by duplicating certain models and overemphasizing the importance of variables in those duplicated models!

Who Needs Imputation?

Figure 3: Percentage of samples for which a tree can still make predictions when X% of feature values are artificially missing at random. The used-features method gives up if any feature used by the tree is missing; the path-based method follows the tree from root to leaf; and our method uses the logical form of the tree.


I've just told you a bit about what predictive equivalence does to us -- it forces us to abandon Gini importance as a reliable metric for learning about our data. Now let's talk about something predictive equivalence can do for us. Let's assume that, up above, variable B is missing: somebody forgot to record it in the database, the right test was never ordered, or the bit it occupied on disk was struck by a cosmic ray. The left tree is catastrophically affected by this -- tracing from the top down, it cannot make any predictions if variable B is missing. The right tree, however, only cares about the value of B if A is False!


Let's flip the example and assume instead that we're missing variable A. If variables B and C are both True, or both False, then the value of A does not matter (you can see this by inspecting either tree, or by plugging these values into the logical expressions and simplifying down to True or False). Evaluated from the top down, the tree on the right would mistakenly throw its hands up on all of these examples! This matters in the real world -- on the four datasets shown above, taking predictive equivalence into account allows us to make predictions on a much higher proportion of examples than the usual top-down evaluation when we artificially inject missingness into the data.
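
Here's a small sketch of the idea (not the algorithm from the paper, just a brute force that is fine for a handful of features): treat the tree as its logical form, try every completion of the missing variables, and abstain only if the completions disagree.

from itertools import product

# Prediction with missing features via the tree's logical form: enumerate every
# completion of the unknown variables and answer only if they all agree.
def predict_with_missing(expr, known, variables=("A", "B", "C")):
    missing = [v for v in variables if v not in known]
    outcomes = {expr(**{**known, **dict(zip(missing, values))})
                for values in product([False, True], repeat=len(missing))}
    return outcomes.pop() if len(outcomes) == 1 else None  # None means "abstain"

tree = lambda A, B, C: (A and C) or (not A and B)  # the right tree of Figure 1
print(predict_with_missing(tree, {"B": True, "C": True}))   # True: A is irrelevant here
print(predict_with_missing(tree, {"B": True, "C": False}))  # None: the answer depends on A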

Cost Optimization

Figure 4: Average cost of evaluating trees over the test set for several common datasets, where the cost of purchasing each feature is randomly assigned. The naive method purchases every feature, the path-based method follows the tree from the top down, and our method collects features in a way that minimizes average cost over the training data.

Let us now consider a scenario where we start without knowing any feature values and must decide in real time which features to collect in order to make a prediction. You can imagine variables A, B, and C as three medical tests, each offering different evidence, that a doctor may sequentially order to diagnose a patient. Each of these tests has a different cost, and we are interested in minimizing the average cost of making predictions. Let's assume that it costs $10 to measure variable A and $1 to measure variables B and C. We immediately notice that, in our two example decision trees, the left tree often avoids measuring variable A, but the right tree always needs to measure it.
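
For a back-of-the-envelope comparison, suppose each of A, B, and C is independently True or False with probability 1/2 (an assumption we make just for this sketch) and use the costs above. Evaluated top-down, the left tree buys A only when B and C disagree, while the right tree buys it every time:

from itertools import product

COST = {"A": 10, "B": 1, "C": 1}

def left_tree_features(A, B, C):
    # Top-down, the left tree tests B, then C, and needs A only when B != C.
    return {"B", "C"} if B == C else {"A", "B", "C"}

def right_tree_features(A, B, C):
    # The right tree tests A first, then exactly one of B or C.
    return {"A", "C"} if A else {"A", "B"}

inputs = list(product([False, True], repeat=3))
for name, features_used in [("left", left_tree_features), ("right", right_tree_features)]:
    avg = sum(sum(COST[f] for f in features_used(*x)) for x in inputs) / len(inputs)
    print(f"{name} tree: average cost ${avg:.2f}")  # $7.00 vs. $11.00 under these assumptions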

The experiment above explores a related question: given a decision tree, how do we decide which variables to purchase, and in which order? In Figure 4, we find that following a tree from the top down is not the optimal way to measure features to reach a conclusion! So how do we find a better order? We give an algorithm that finds a predictively equivalent form of a decision tree which approximately minimizes the average cost. This form makes exactly the same predictions as the original tree, but much more cheaply! In our running example, the left tree has additional top-level splits that avoid querying variable A quite so often -- it's predictively equivalent to, and less sparse than, the right tree, but has lower cost. We refer the interested reader to our paper for our formulation and solution to this problem.
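
As a toy illustration of the search problem (emphatically not the paper's algorithm), with only three features you can brute-force the best adaptive purchasing strategy by recursion: stop as soon as the prediction is determined, otherwise buy the feature that minimizes the expected remaining cost. Under the same uniform-feature assumption as the sketch above, this recovers the $7 average of the left tree:

from itertools import product

COST = {"A": 10, "B": 1, "C": 1}
VARS = ("A", "B", "C")
f = lambda A, B, C: (A and C) or (not A and B)  # the function both trees compute

def best_expected_cost(known):
    # All ways to fill in the variables we have not purchased yet.
    unknown = [v for v in VARS if v not in known]
    completions = [{**known, **dict(zip(unknown, vals))}
                   for vals in product([False, True], repeat=len(unknown))]
    if len({f(**c) for c in completions}) == 1:
        return 0.0  # prediction already determined: stop buying features
    # Otherwise buy the feature whose expected total cost is smallest.
    return min(COST[v] + 0.5 * best_expected_cost({**known, v: False})
                       + 0.5 * best_expected_cost({**known, v: True})
               for v in unknown)

print(f"Best achievable average cost: ${best_expected_cost({}):.2f}")  # $7.00 here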

Conclusion

We hope you've enjoyed our tour through the impacts of predictive equivalence, and that we've convinced you it's worth keeping this phenomenon in mind when thinking about your own decision tree problems! We hope you take several key things away from this work:
  1. Decision trees have several, and sometimes many, predictively equivalent forms.
  2. These forms have different properties, including different measures of variable importance, abilities to handle missing data, and evaluation costs.
  3. Predictive equivalence almost certainly affects many other decision tree applications.
Beyond our paper, we're particularly interested in understanding the structure of predictively equivalent trees. We don't currently have a general method for enumerating the predictively equivalent forms of a given tree -- we can collapse a collection of trees into their logical forms and check whether they are indeed equivalent, but we can't go from a logical form back to the collection of trees that encode it. We're also interested in exploring the impact of predictive equivalence on tree ensembles.

Thank you for reading! Let us know what you found interesting, and please do not hesitate to reach out with questions and comments about our work.
