StatiX: Making XML Count
01 January 2002
The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries to cost-based storage design and query optimization. In this paper, we propose a new statistics framework for XML that is based on two principles: XML Schema transformations, which decompose schema types into finer-grained structures for the purpose of gathering statistics; and histograms, which provide scalable summaries of both the structure of the decomposed types and the values in the data. This approach leverages standard XML technology (i.e., XML Schema and validating parsers) and can be easily integrated with existing database systems. We propose algorithms for schema decomposition and statistics gathering, and present an experimental evaluation that demonstrates the accuracy and scalability of our approach. We also show how these statistics are used in cost-based storage design for XML.
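The abstract does not spell out the algorithms, so the following is only a minimal illustrative sketch of the general idea of histogram-based summaries over XML data, not the method proposed in the paper. It assumes element tags stand in for schema types, uses a plain equi-width histogram, and works on a made-up sample document; the names `collect_stats`, `build_histogram`, `estimate_range`, and `DOC` are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample document; tags stand in for schema types (an assumption).
DOC = (
    "<catalog>"
    "<book><price>10</price></book>"
    "<book><price>25</price></book>"
    "<book><price>40</price><review/><review/></book>"
    "</catalog>"
)

def collect_stats(xml_text):
    """Return per-tag element counts and the list of numeric <price> values."""
    root = ET.fromstring(xml_text)
    tag_counts = Counter(el.tag for el in root.iter())
    prices = [float(el.text) for el in root.iter("price")]
    return tag_counts, prices

def build_histogram(values, num_buckets, lo, hi):
    """Equi-width histogram: count the values falling in each bucket of [lo, hi]."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that v == hi lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

def estimate_range(counts, lo, hi, a, b):
    """Estimate how many values lie in [a, b), assuming values are
    uniformly distributed inside each bucket."""
    width = (hi - lo) / len(counts)
    total = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        total += count * overlap / width
    return total
```

A query estimator could then answer a question such as "how many prices fall between 15 and 30?" with `estimate_range(build_histogram(prices, 5, 0, 50), 0, 50, 15, 30)` without rescanning the document. The estimate is approximate because of the uniformity assumption inside each bucket.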