Overview

Resource scheduling remains a key building block of modern data-intensive clusters. As data volumes increase and the need to analyze them through multiple lenses explodes, diverse coexisting jobs from many users and applications contend for the same pool of resources in shared clusters. Consequently, today’s cluster schedulers have to handle multiple resource types, consider jobs with complex directed acyclic graph (DAG) structures, and honor job-specific constraints. Schedulers must provide performance isolation between users and groups through fair resource sharing, while ensuring performance (low job completion times) and efficiency (high cluster resource utilization). We tackle these cluster-scheduling challenges by developing schedulers that optimize one or more of these metrics (fairness, average job completion time, and makespan) by exploiting various DAG-level properties. In ongoing work (QOOP), we are exploring further optimizations that change the DAG associated with a job mid-execution in response to resource dynamics. Below is a brief overview of our existing systems:

Tetris: Multi-Resource Packing for Cluster Schedulers

Paper
Multi-Resource Packing for Cluster Schedulers, SIGCOMM 2014
Paper Abstract

Tasks in modern data-parallel clusters have highly diverse resource requirements along CPU, memory, disk and network. We present Tetris, a multi-resource cluster scheduler that packs tasks to machines based on their requirements of all resource types. Doing so avoids resource fragmentation as well as over-allocation of the resources that are not explicitly allocated, both of which are drawbacks of current schedulers. Tetris adapts heuristics for the multidimensional bin packing problem to the context of cluster schedulers wherein task arrivals and machine availability change in an online manner and wherein a task’s resource needs change with time and with the machine that the task is placed at. In addition, Tetris improves average job completion time by preferentially serving jobs that have less remaining work. We observe that fair allocations do not offer the best performance and the above heuristics are compatible with a large class of fairness policies; hence, we show how to simultaneously achieve good performance and fairness. Trace-driven simulations and deployment of our Apache YARN prototype on a 250-node cluster show gains of over 30% in makespan and job completion time while achieving nearly perfect fairness.
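
To make the packing idea concrete, here is a minimal sketch (not the actual Tetris implementation) of the two ingredients described above: a dot-product alignment score between a task's multi-resource demands and a machine's free resources, combined with a preference for jobs with less remaining work. The resource dimensions, field names, and the srtf_weight parameter are illustrative assumptions.

```python
# Illustrative sketch: dot-product packing score plus a
# shortest-remaining-work preference. Names and weights are
# assumptions, not the actual Tetris code.

RESOURCES = ("cpu", "memory", "disk", "network")

def alignment_score(task_demand, machine_free):
    """Dot product of task demands and free machine resources.

    A task that fits well along all dimensions scores high; a task
    whose demand exceeds free capacity in any dimension cannot be
    placed and scores -inf.
    """
    if any(task_demand[r] > machine_free[r] for r in RESOURCES):
        return float("-inf")
    return sum(task_demand[r] * machine_free[r] for r in RESOURCES)

def pick_task(pending_tasks, machine_free, srtf_weight=0.5):
    """Pick the pending task to place on this machine.

    Combines the packing score with the owning job's remaining work,
    so jobs closer to completion are preferred. Each task is a dict
    with 'demand' and 'job_remaining_work'.
    """
    def combined(task):
        pack = alignment_score(task["demand"], machine_free)
        return pack - srtf_weight * task["job_remaining_work"]

    best = max(pending_tasks, key=combined, default=None)
    if best is not None and combined(best) == float("-inf"):
        return None  # nothing fits on this machine right now
    return best

# Example: two pending tasks, one machine with spare capacity.
machine = {"cpu": 4, "memory": 8, "disk": 2, "network": 1}
tasks = [
    {"demand": {"cpu": 2, "memory": 4, "disk": 1, "network": 1},
     "job_remaining_work": 10},
    {"demand": {"cpu": 1, "memory": 2, "disk": 1, "network": 0},
     "job_remaining_work": 2},
]
print(pick_task(tasks, machine))  # picks the task with the better combined score
```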

Talks

Slides from talk at SIGCOMM 2014

Code

Please e-mail the first author for access to code.

Graphene: Scheduling Dependent Computations in Data-Analytics Clusters

Paper
Packing and Dependency-aware Scheduling for Data-Parallel Clusters, OSDI 2016
Paper Abstract

We present a scheduler that improves cluster utilization and job completion times by packing tasks having multi-resource requirements and inter-dependencies. While the problem is algorithmically very hard, we achieve near-optimality on the job DAGs that appear in production clusters at a large enterprise and in benchmarks such as TPC-DS. A key insight is that carefully handling the long-running tasks and those with tough-to-pack resource needs will produce good-enough schedules. However, which subset of tasks to treat carefully is not clear (and intractable to discover). Hence, we offer a search procedure that evaluates various possibilities and outputs a preferred schedule order over tasks. An online component enforces the schedule orders desired by the various jobs running on the cluster. In addition, it packs tasks, overbooks the fungible resources and guarantees bounded unfairness for a variety of desirable fairness schemes. Relative to the state-of-the-art schedulers, we speed up 50% of the jobs by over 30% each.
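
As a rough illustration of the idea of carefully handling "troublesome" tasks, the sketch below (our simplified stand-in, not Graphene's search procedure) marks tasks that are long-running or nearly saturate some resource dimension, and then produces a dependency-respecting order that pulls those tasks forward as early as the DAG allows. The thresholds and field names are assumptions.

```python
# Illustrative sketch: order a job DAG so that "troublesome" tasks
# (long-running or hard to pack) run as early as dependencies allow.
# Thresholds and field names are assumptions.

def is_troublesome(task, duration_thresh=100.0, demand_thresh=0.75):
    """A task is troublesome if it runs long or nearly saturates some
    resource dimension (demand expressed as a fraction of capacity)."""
    return (task["duration"] > duration_thresh or
            max(task["demand_fractions"]) > demand_thresh)

def schedule_order(tasks, edges):
    """Return a topological task order that prefers troublesome tasks
    among the currently runnable ones.

    tasks: dict task_id -> {"duration": float, "demand_fractions": [...]}
    edges: list of (parent_id, child_id) dependencies
    """
    indegree = {t: 0 for t in tasks}
    children = {t: [] for t in tasks}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    ready = [t for t in tasks if indegree[t] == 0]
    order = []
    while ready:
        # Among runnable tasks, take troublesome ones first,
        # then longer tasks first as a tiebreaker.
        ready.sort(key=lambda t: (not is_troublesome(tasks[t]),
                                  -tasks[t]["duration"]))
        t = ready.pop(0)
        order.append(t)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

# Example: a small diamond DAG where "b" is long-running and "c" is
# hard to pack; both are pulled forward as soon as "a" finishes.
tasks = {
    "a": {"duration": 10,  "demand_fractions": [0.2, 0.1]},
    "b": {"duration": 200, "demand_fractions": [0.3, 0.2]},
    "c": {"duration": 15,  "demand_fractions": [0.9, 0.1]},
    "d": {"duration": 5,   "demand_fractions": [0.1, 0.1]},
}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(schedule_order(tasks, edges))  # ['a', 'b', 'c', 'd']
```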

Talks

Slides from talk at OSDI 2016

Code

Please e-mail the first author for access to code.

Carbyne: Altruistic Scheduling in Multi-Resource Clusters

Paper
Altruistic Scheduling in Multi-Resource Clusters, OSDI 2016
Paper Abstract

Given the well-known tradeoffs between fairness, performance, and efficiency, modern cluster schedulers often prefer instantaneous fairness as their primary objective to ensure performance isolation between users and groups. However, instantaneous, short-term convergence to fairness often does not result in noticeable long-term benefits. Instead, we propose an altruistic, long-term approach, CARBYNE, where jobs yield fractions of their allocated resources without impacting their own completion times. We show that leftover resources collected via altruisms of many jobs can then be rescheduled to further secondary goals such as application-level performance and cluster efficiency without impacting performance isolation. Deployments and large-scale simulations show that CARBYNE closely approximates the state-of-the-art solutions (e.g., DRF) in terms of performance isolation, while providing 1.26x better efficiency and 1.59x lower average job completion time.
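
The sketch below conveys the altruism intuition under strong simplifying assumptions (a single resource type and a job modeled only by its fair share, its currently runnable tasks, and its remaining work): a job keeps only what its runnable frontier can use right now, donates the rest of its fair share for the round, and the pooled leftovers are redistributed toward a secondary goal such as finishing short jobs sooner. None of these names or the model come from the CARBYNE paper.

```python
# Minimal sketch of altruistic yielding: each job keeps only what its
# currently runnable tasks can use and donates the rest of its fair
# share this round; pooled leftovers go to jobs with the least
# remaining work (a secondary, completion-time goal). Field names and
# the single-resource model are illustrative assumptions.

def leftover(job):
    """Resources a job can yield now without hurting its own finish
    time: its fair share minus what its runnable frontier can use."""
    usable_now = min(job["fair_share"], job["runnable_tasks"])
    return job["fair_share"] - usable_now

def altruistic_round(jobs):
    """One scheduling round: collect leftovers, then redistribute them
    to jobs with spare runnable work, shortest remaining work first."""
    pool = sum(leftover(j) for j in jobs)
    allocation = {j["name"]: j["fair_share"] - leftover(j) for j in jobs}
    for j in sorted(jobs, key=lambda j: j["remaining_work"]):
        extra_usable = j["runnable_tasks"] - allocation[j["name"]]
        give = max(0, min(pool, extra_usable))
        allocation[j["name"]] += give
        pool -= give
    return allocation

jobs = [
    # Job A's DAG exposes only 2 runnable tasks now, so it yields 2 slots.
    {"name": "A", "fair_share": 4, "runnable_tasks": 2, "remaining_work": 50},
    # Job B can use more than its fair share and is close to finishing.
    {"name": "B", "fair_share": 4, "runnable_tasks": 6, "remaining_work": 5},
]
print(altruistic_round(jobs))  # {'A': 2, 'B': 6}
```

In this toy round, job A's completion time is unaffected (it could not have used the yielded slots anyway), while job B finishes sooner, mirroring the long-term benefit the abstract describes.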

Talks

Slides from talk at OSDI 2016

Code

Please e-mail the first author for access to code.