Engineering Motif Search for Large Graphs
01 January 2018
Given a vertex-colored graph H and a multiset M of colors as input, the graph motif problem asks us to decide whether H has a connected induced subgraph whose multiset of colors agrees with M. The graph motif problem is NP-complete but known to admit randomized algorithms based on constrained multilinear sieving over GF(2^b) that run in time O(2^k k^2 m M(2^b)) and with a false-negative probability of at most k/2^(b-1) for a connected m-edge input and a motif of size k. On modern CPU microarchitectures such algorithms have practical edge-linear scalability to inputs with billions of edges for small motif sizes, as demonstrated by Björklund, Kaski, Kowalik, and Lauri [ALENEX'15]. This scalability to large graphs prompts the dual question whether it is possible to scale to large motif sizes. We present a vertex-localized variant of the constrained multilinear sieve that enables us to obtain, in time O(2^k k^2 m M(2^b)) and for every vertex simultaneously, whether the vertex participates in at least one match with the motif, with a per-vertex probability of at most k/2^(b-1) for a false negative. Furthermore, the algorithm is easily vector-parallelizable for up to 2^k threads, and parallelizable for up to 2^kn threads, where n is the number of vertices in H. Here M(2^b) is the time complexity to multiply in GF(2^b). We demonstrate with an open-source implementation that our variant of constrained multilinear sieving can be engineered for vector-parallel microarchitectures to yield hardware utilization that is bound by the available memory bandwidth. Our main engineering contributions are (a) a version of the recurrence for tightly labeled arborescences that can be executed as a sequence of memory-and-arithmetic coalescent parallel workloads on multiple GPUs, and (b) a bit-sliced low-level implementation for arithmetic in characteristic 2 to support (a). On a single NVIDIA Tesla P100 Accelerator (NVIDIA GP100 GPU with 3584 cores at 1189 MHz), our implementation achieves, at peak, an empirical global memory bandwidth of up to half a terabyte per second, essentially matching the peak empirical transfer bandwidth for ondevice memory. Furthermore, we obtain simultaneously an arithmetic bandwidth of over 300 billion GF(2^8)-multiplications per second. For example, with b = 8 we can find all vertices that participate in at least one match with a motif of size k = 10 on a one-million-vertex 20-regular graph (m = 10000000) in less than four seconds. With k = 20 and m = 1000000, we run in less than 25 minutes. When scaled up to a four-P100-card configuration (14336 cores), we obtain a three-tofour-fold speed-up, with up to two terabytes per second of empirical memory bandwidth and over 1.2 trillion GF(2^8)-multiplications per second when executing the sieve.