BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:2.0
BEGIN:VEVENT
DTSTART:20141120T193000Z
DTEND:20141120T200000Z
LOCATION:393-94-95
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: Fault tolerance poses a major challenge for future large-scale systems. However, few insights into the selection and tuning of fault-tolerance protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes, causing a cascading series of delays. Lastly, we demonstrate how hierarchical uncoordinated checkpointing protocols, designed to reduce log volumes, can be tuned to significantly reduce these synchronization overheads at scale.
SUMMARY:Understanding the Effects of Communication and Coordination on Checkpointing at Scale
PRIORITY:3
END:VEVENT
END:VCALENDAR