Probabilistically Sampled Forests

18 April 2012


Random subspaces are the key idea in random forests. While an optimal size s of the random subspaces can significantly improve classification accuracy, a suboptimal choice can be detrimental, especially in high-dimensional feature spaces, which typically contain only a few discriminative features among a majority of uninformative ones. To overcome this problem, we propose an alternative to the subspace method: learning a forest by probabilistically sampling trees from the Boltzmann distribution at temperature T. This approach includes sampling from the posterior distribution as the special case T = 1. An increased temperature T > 1 serves as a conceptually well-established measure of the additional randomness introduced into the learning process. Moreover, this approach suggests a novel relationship between the bootstrap and the random subspace method. Beyond that, in our experiments on UCI data sets, we also found improvements on the practical side: classification accuracy does not suffer as much from a suboptimal T as it does from a suboptimal s in high-dimensional feature spaces.
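To make the sampling idea concrete, here is a minimal sketch of Boltzmann sampling over candidate trees, assuming each candidate tree has a training log-likelihood available; the function name `boltzmann_sample_trees` and the toy log-likelihood values are illustrative, not from the paper. The weights are proportional to exp(log-likelihood / T), so T = 1 corresponds to sampling from the (unnormalised) posterior and larger T flattens the distribution, injecting extra randomness:

```python
import numpy as np

def boltzmann_sample_trees(log_likelihoods, n_trees, T, rng=None):
    """Sample tree indices from the Boltzmann distribution at temperature T.

    Weights are proportional to exp(log_likelihood / T). T = 1 recovers
    posterior sampling; T > 1 flattens the distribution and adds randomness.
    """
    rng = np.random.default_rng() if rng is None else rng
    log_w = np.asarray(log_likelihoods, dtype=float) / T
    log_w -= log_w.max()          # subtract the max for numerical stability
    p = np.exp(log_w)
    p /= p.sum()                  # normalise to a probability distribution
    return rng.choice(len(p), size=n_trees, p=p)

# Toy example: three candidate trees with made-up log-likelihoods.
rng = np.random.default_rng(1)
idx_cold = boltzmann_sample_trees([-10.0, -1.0, -5.0], 1000, T=0.5, rng=rng)
idx_hot = boltzmann_sample_trees([-10.0, -1.0, -5.0], 1000, T=100.0, rng=rng)
```

At low temperature the forest concentrates on the best-fitting tree, while at high temperature the selection becomes nearly uniform; tuning T trades off fit against diversity much as the subspace size s does.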