Authors: Jiecao Chen,He Sun,David P. Woodruff,Qin Zhang
Where published:
NeurIPS 2016 12
ArXiv: 1702.00196
Document:
PDF
DOI
Abstract URL: http://arxiv.org/abs/1702.00196v1
Clustering large datasets is a fundamental problem with a number of
applications in machine learning. Data is often collected on different sites
and clustering needs to be performed in a distributed manner with low
communication. We would like the quality of the clustering in the distributed
setting to match that in the centralized setting for which all the data resides
on a single site. In this work, we study both graph and geometric clustering
problems in two distributed models: (1) a point-to-point model, and (2) a model
with a broadcast channel. We give protocols in both models which we show are
nearly optimal by proving almost matching communication lower bounds. Our work
highlights the surprising power of a broadcast channel for clustering problems;
roughly speaking, to spectrally cluster $n$ points or $n$ vertices in a graph
distributed across $s$ servers, for a worst-case partitioning the communication
complexity in a point-to-point model is $n \cdot s$, while in the broadcast
model it is $n + s$. A similar phenomenon holds for the geometric setting as
well. We implement our algorithms and demonstrate this phenomenon on real life
datasets, showing that our algorithms are also very efficient in practice.