Using a white box model to approximate a black box model in the neighborhood of a given input point is a well-known technique for explaining what the black box model is doing. However, is that really an explanation? In this post, I’ll try to explain why I think this is not a good approach for explaining AI.
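To make the setup concrete, here is a minimal sketch of the technique in question, assuming a scikit-learn workflow: a gradient-boosted regressor plays the black box, and a linear model is fitted to the black box's own predictions on points sampled around a single input. The model choices, the Gaussian neighborhood sampling, and names like x0 are illustrative assumptions, not any particular library's method.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Black box: a gradient-boosted regressor trained on synthetic data.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
black_box = GradientBoostingRegressor(random_state=0).fit(X, y)

x0 = X[0]  # the input point A we want to "explain"
rng = np.random.default_rng(0)
# Sample points in a small neighborhood around A.
neighborhood = x0 + rng.normal(scale=0.1, size=(200, X.shape[1]))

# White box: a linear model fitted to the black box's outputs on those samples.
white_box = LinearRegression().fit(neighborhood, black_box.predict(neighborhood))

print("black box at A:", black_box.predict(x0.reshape(1, -1))[0])
print("white box at A:", white_box.predict(x0.reshape(1, -1))[0])
print("local 'explanation' (coefficients):", white_box.coef_)
```

The coefficients of the white box are then presented as the explanation of what the black box did at A. The question is whether that is an explanation at all.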
If the black box model takes an input A and arrives at prediction B, is that the same as a white box model also arriving at B, given A? Both models could have taken very different routes to get to B. A white box model, e.g., a linear model, would likely take a straight path to B, whereas the black box model would likely take a much more complicated one. The end result is the same, but you're really only explaining the white box model, not the black box model.

When you change the input A a little bit, the white box model's output could remain B or, in the case of regression, stay close to B. The black box model, on the other hand, could be much more unstable: a small change in A could result in a change from B to C or, in the case of regression, a value far removed from B. The black box model could also run into problems along its complicated route to B, or something might happen that keeps it from ever arriving at B in the first place. It's exactly these kinds of problems that we'd like to explain to the user. It's really no use letting the white box model take a straight and predictable shortcut to B, pretending that all is well.
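The instability argument can be illustrated with a similar sketch, again assuming a scikit-learn setup: a random forest classifier as the black box, a linear surrogate fitted to its class-1 probabilities around A, and a handful of growing perturbations of A. The specific models and perturbation sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Black box: a random forest classifier on synthetic data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

x0 = X[0]  # the input point A
rng = np.random.default_rng(1)
neighborhood = x0 + rng.normal(scale=0.1, size=(200, X.shape[1]))

# White box: a linear surrogate for the black box's class-1 probability near A.
white_box = LinearRegression().fit(
    neighborhood, black_box.predict_proba(neighborhood)[:, 1]
)

# Perturb A a little and compare how the two models respond.
for eps in (0.0, 0.05, 0.1, 0.2):
    x = (x0 + eps).reshape(1, -1)
    print(
        f"eps={eps:.2f}  "
        f"black box p(class 1)={black_box.predict_proba(x)[0, 1]:.2f}  "
        f"surrogate={white_box.predict(x)[0]:.2f}"
    )
```

The surrogate's output drifts smoothly with the perturbation, while the black box's probability can jump as the input crosses one of its decision boundaries. That jump is precisely what the straight-line explanation hides.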