
Open Source Cluster Managers

The following is a non-comprehensive list of open source cluster managers.

See Wikipedia's List of cluster management software and Comparison of cluster software for a more comprehensive list.

We've written this list to give our analysis of cluster managers for our particular problem described in ClusterManager.

Kubernetes

http://kubernetes.io/

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

It groups containers that make up an application into logical units for easy management and discovery. Kubernetes builds upon 15 years of experience of running production workloads at Google, combined with best-of-breed ideas and practices from the community.

There are very few hits on MPI and Kubernetes. The Kubernetes model is ephemeral, that is, it assumes it can kill nodes at any time.

https://github.com/ContinUSE/kubernetes-coreos-cluster

Demonstrates the possibility of running MPI on top of Kubernetes, but the configuration is complex, and it is not clear that it scales. The MPI setup inside the Docker container is very detailed.
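A minimal sketch of the basic wiring such a setup needs: collecting pod addresses into an MPI hostfile. This assumes the official kubernetes Python client; the app=mpi-worker label and the default namespace are hypothetical.

```python
# Minimal sketch: build an MPI hostfile from Kubernetes pod IPs.
# Assumes the official `kubernetes` Python client; the label
# "app=mpi-worker" and the "default" namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(namespace="default",
                              label_selector="app=mpi-worker")

with open("hostfile", "w") as f:
    for pod in pods.items:
        if pod.status.pod_ip:  # unscheduled pods have no IP yet
            f.write(f"{pod.status.pod_ip} slots=1\n")
```

Because pods are ephemeral, this hostfile can go stale at any time, which is the mismatch with MPI noted above.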

https://kismatic.com/company/qa-with-malte-schwarzkopf-on-distributed-systems-orchestration-in-the-modern-data-center/

The second problem is hoarding, which especially affects “gang-scheduled” jobs. These are jobs that must acquire all of the requested resources before any of their tasks can start running. An MPI job is a traditional example of such a computation, but there are others, too: stateful stream processing systems also require their entire pipelines to be intact to get started. In a two-level model such as Mesos, this poses a problem: the application-level scheduler can either wait until offered sufficient resources in one offer to get everything started (which may never happen), or accept a series of smaller offers to accumulate sufficient resources. If it does the latter, the resources are uselessly hoarded until sufficiently many offers have been accepted, even though they could have been used productively for other workloads (such as low-priority MapReduce jobs) in the meantime.
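A toy illustration of the hoarding behavior described above, with all numbers invented: the application-level scheduler accepts small offers one at a time, and everything it has accepted sits idle until the gang is complete.

```python
# Toy model of offer hoarding under two-level (Mesos-style) scheduling.
# A gang-scheduled job cannot start any task until all cores are held.
def offers_hoarded_before_start(offers, cores_needed):
    held = 0
    for i, cores in enumerate(offers, start=1):
        held += cores  # these cores now sit idle, usable by no one else
        if held >= cores_needed:
            return i   # number of offers held before the job can launch
    return None        # gang never completes; held cores are wasted

# A 64-core MPI job fed 4-core offers holds 16 offers' worth of
# resources idle before a single rank runs.
print(offers_hoarded_before_start([4] * 20, 64))  # -> 16
```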

OpenStack

http://www.openstack.org/

OpenStack software controls large pools of compute, storage, and networking resources throughout a datacenter, managed through a dashboard or via the OpenStack API. OpenStack works with popular enterprise and open source technologies making it ideal for heterogeneous infrastructure.

OpenStack is a behemoth. It's powerful for setting up hardware and managing private clouds. Does not support queueing or anything else that might be useful to this specific problem.

Docker Swarm

https://docs.docker.com/swarm/

Docker Swarm is native clustering for Docker. It turns a pool of Docker hosts into a single, virtual Docker host. Because Docker Swarm serves the standard Docker API, any tool that already communicates with a Docker daemon can use Swarm to transparently scale to multiple hosts. [...] Docker Swarm is still in its infancy and under active development.

Not clear what the advantage is over starting the containers over ssh. You need a list of hosts to configure MPI, and sshd has to be running with the appropriate keys for that particular instance of the cluster.
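For comparison, the ssh path Swarm would have to beat is short. A minimal sketch, assuming Open MPI's mpirun on the PATH and sshd already running on each host with the right keys; the host list and ./my_mpi_app are placeholders.

```python
# Sketch: launch an MPI job across known hosts over ssh.
# Host IPs, slot counts, and ./my_mpi_app are hypothetical.
import subprocess

hosts = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

with open("hostfile", "w") as f:
    for host in hosts:
        f.write(f"{host} slots=4\n")

# With no resource manager detected, Open MPI starts remote
# ranks over ssh using the keys already installed on each host.
subprocess.run(
    ["mpirun", "--hostfile", "hostfile", "-np", "12", "./my_mpi_app"],
    check=True,
)
```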

SLURM

http://slurm.schedmd.com/

The Simple Linux Utility for Resource Management (Slurm) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

http://schedmd.com/slurmdocs/elastic_computing.html

Slurm has the ability to support a cluster that grows and shrinks on demand, typically relying upon a service such as Amazon Elastic Computing Cloud (Amazon EC2) for resources. These resources can be combined with an existing cluster to process excess workload (cloud bursting) or it can operate as an independent self-contained cluster. Good responsiveness and throughput can be achieved while you only pay for the resources needed.

but

Slurm's Elastic Computing logic relies heavily upon the existing power save logic. Review of Slurm's Power Saving Guide is strongly recommended.

Lots of work remains. SLURM itself is focused on being the manager for the whole cluster, not on invoking multiple independent clusters.
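Driving the queue it does provide is straightforward. A sketch of submitting an MPI job to an existing Slurm cluster; the resource numbers and ./my_mpi_app are placeholders.

```python
# Sketch: submit an MPI batch job to an already-configured Slurm cluster.
import subprocess

job_script = """#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
srun ./my_mpi_app
"""

with open("job.sh", "w") as f:
    f.write(job_script)

# On success, sbatch prints "Submitted batch job <jobid>".
subprocess.run(["sbatch", "job.sh"], check=True)
```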

Not designed for user-containers. Unix users are the access control mechanism.

StarCluster

http://star.mit.edu/cluster/

StarCluster has been designed to automate and simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud. StarCluster allows anyone to easily create a cluster computing environment in the cloud suited for distributed and parallel computing applications and systems.

Bound to Unix user IDs. Not designed to create a cluster on demand for each user, but rather to build a cluster and then manage it with Grid Engine or another queuing system. Not designed to run user-containers. Unix users are the access control mechanism.

Does not seem to be maintained.

Last update of git://github.com/jtriley/StarCluster.git was 11/12/15

Torque

http://www.adaptivecomputing.com/products/open-source/torque/ https://github.com/adaptivecomputing/torque

TORQUE Resource Manager provides control over batch jobs and distributed computing resources. It is an advanced open-source product based on the original PBS project* and incorporates the best of both community and professional development. It incorporates significant advances in the areas of scalability, reliability, and functionality and is currently in use at tens of thousands of leading government, academic, and commercial sites throughout the world. TORQUE may be freely used, modified, and distributed under the constraints of the included license.

Maintained by Adaptive Computing Inc.

It's a basic queuing system built on the idea of static cluster. Does not support user-containers. Unix users are used for permissioning.

xCAT

https://xcat.org/ https://github.com/xcat2/xcat-core

xCAT is the Extreme Cluster/Cloud Administration Toolkit. xCAT offers complete management for HPC clusters, RenderFarms, Grids, WebFarms, Online Gaming Infrastructure, Clouds, Datacenters, and whatever tomorrow's buzzwords may be. It is agile, extensible, and based on years of system administration best practices and experience. It enables you to:

  • Provision Operating Systems on physical or virtual machines: RHEL, CentOS, Fedora, SLES, Ubuntu, AIX, Windows, VMWare, KVM, PowerVM, PowerKVM, zVM.
  • Provision using scripted install, stateless, statelite, iSCSI, or cloning
  • Remotely manage systems: lights-out management, remote console, and distributed shell support
  • Quickly configure and control management node services: DNS, HTTP, DHCP, TFTP, NFS

xCAT is old and hardware-oriented. It ensures your cluster stays running. It can manage nodes coming and going dynamically. It is not a queuing system.

Does not support user-containers.

Celery

http://www.celeryproject.org/

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. The execution units, called tasks, are executed concurrently on a single or more worker servers using multiprocessing, Eventlet, or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).

We are running Celery inside Docker with user-containers. A Celery worker runs on a single node, and you can start MPI within that worker. However, there is no concept larger than the worker process, so there is no way to define a cluster within Celery. A cluster, once defined, can start Celery workers on each of its nodes, but those nodes cannot communicate with each other, because there is no global registry.
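A sketch of the pattern just described: a Celery task that shells out to mpirun on the worker's own node. The broker URL and binary name are placeholders.

```python
# Sketch: a Celery task that runs MPI confined to the worker's node.
# Celery has no cluster concept, so there is no hostfile to give mpirun.
import subprocess
from celery import Celery

app = Celery("mpi_tasks", broker="redis://localhost:6379/0")

@app.task
def run_mpi(num_ranks, binary):
    # All ranks run on this worker's host; num_ranks is capped
    # by the local core count.
    result = subprocess.run(["mpirun", "-np", str(num_ranks), binary])
    return result.returncode
```

A worker started with `celery -A mpi_tasks worker` can then execute `run_mpi.delay(8, "./my_mpi_app")`, but only ever on its own host.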

Cobbler

http://cobbler.github.io/

Cobbler is a Linux installation server that allows for rapid setup of network installation environments. It glues together and automates many associated Linux tasks so you do not have to hop between many various commands and applications when deploying new systems, and, in some cases, changing existing ones. Cobbler can help with provisioning, managing DNS and DHCP, package updates, power management, configuration management orchestration, and much more.

Is not a queueing system. Has no concept of "cluster", just nodes.

HTCondor

https://research.cs.wisc.edu/htcondor/

https://github.com/htcondor/htcondor

Our goal is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources. Guided by both the technological and sociological challenges of such a computing environment, the Center for High Throughput Computing at UW-Madison has been building the open source HTCondor distributed computing software (pronounced "aitch-tee-condor") and related technologies to enable scientists and engineers to increase their computing throughput.

http://research.cs.wisc.edu/htcondor/manual/latest/2_12Docker_Universe.html

A docker universe job instantiates a Docker container from a Docker image, and HTCondor manages the running of that container as an HTCondor job, on an execute machine. This running container can then be managed as any HTCondor job. For example, it can be scheduled, removed, put on hold, or be part of a workflow managed by DAGMan.
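A sketch of what submission looks like, written as a submit description driven from Python; the image name, executable, and filenames are placeholders.

```python
# Sketch: submit an HTCondor docker universe job via condor_submit.
# The image, executable, and filenames are placeholders.
import subprocess

submit_description = """universe     = docker
docker_image = centos:7
executable   = ./my_job.sh
output       = job.out
error        = job.err
log          = job.log
queue
"""

with open("job.sub", "w") as f:
    f.write(submit_description)

subprocess.run(["condor_submit", "job.sub"], check=True)
```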

https://research.cs.wisc.edu/htcondor/HTCondorWeek2015/presentations/ThainG_Docker.pdf

Surprises with Docker Universe

  • Condor_ssh_to_job doesn’t work
  • Condor_chirp doesn’t work
  • Suspend doesn’t work
  • Can’t access NFS/shared filesystems
  • Job not a child of the condor_starter:
      • Request_disk doesn’t work
      • resource usage is funky
  • Networking is only NAT

Importantly, it doesn't support inter-node communication.

MPI Distributed Virtual Machine (DVM)

https://www.open-mpi.org/doc/current/man1/orte-dvm.1.php

orte-dvm will establish a DVM that can be used to execute subsequent applications. Use of orte-dvm can be advantageous, for example, when you want to execute a number of short-lived tasks. In such cases, the time required to start the ORTE DVM can be a significant fraction of the time to execute the overall application. Thus, creating a persistent DVM can speed the overall execution. In addition, a persistent DVM will support executing multiple parallel applications while maintaining separation between their respective cores.

Designed to support faster instantiation of MPI jobs. Does not support Docker. Unix user access control. Is not a queueing system.

Globus Toolkit

http://toolkit.globus.org/toolkit/

The open source Globus® Toolkit is a fundamental enabling technology for the "Grid," letting people share computing power, databases, and other tools securely online across corporate, institutional, and geographic boundaries without sacrificing local autonomy. The toolkit includes software services and libraries for resource monitoring, discovery, and management, plus security and file management. In addition to being a central part of science and engineering projects that total nearly a half-billion dollars internationally, the Globus Toolkit is a substrate on which leading IT companies are building significant commercial Grid products.

Status is unclear. In the CVS repo, the last update was from 2011.

Doesn't support Docker. Don't know much else.

Grid Engine

http://gridscheduler.sourceforge.net/

Open Grid Scheduler/Grid Engine is a commercially supported open-source batch-queuing system for distributed resource management. OGS/GE is based on Sun Grid Engine, and maintained by the same group of external (i.e. non-Sun) developers who started contributing code since 2001.

http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html

Recently, we have provisioned a 10,000-node Grid Engine cluster in Amazon EC2 to test the scalability of Grid Engine. As the official maintainer of open-source Grid Engine, we have the obligation to make sure that Grid Engine continues to scale in the modern datacenters. [...] A few things more findings:

  • We kept sending spot requests to the us-east-1 region until "capacity-not-able" was returned to us. We were expecting the bid price to go sky-high when an instance type ran out but that did not happen.
  • When a certain instance-type gets mostly used up, further spot requests for the instance type get slower and slower.
  • At peak rate, we were able to provision over 2,000 nodes in less than 30 minutes. In total, we spent less than 6 hours constructing, debugging the issue caused by the EBS volume limit, running a small number of Grid Engine tests, and taking down the cluster.
  • Instance boot time was independent of the instance type: EBS-backed c1.xlarge and t1.micro took roughly the same amount of time to boot.

Does not support Docker out of the box. Univa community edition does support Docker, but the licensing is a bit strange.

Another interesting article

PVM

http://www.csm.ornl.gov/pvm/

PVM (Parallel Virtual Machine) is a software package that permits a heterogeneous collection of Unix and/or Windows computers hooked together by a network to be used as a single large parallel computer. Thus large computational problems can be solved more cost effectively by using the aggregate power and memory of many computers. The software is very portable. The source, which is available free thru netlib, has been compiled on everything from laptops to CRAYs.

PVM enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world are using PVM to solve important scientific, industrial, and medical problems in addition to PVM's use as an educational tool to teach parallel programming. With tens of thousands of users, PVM has become the de facto standard for distributed computing world-wide.

Does not support Docker. Is a competitor to MPI so really not applicable.

OpenLava - Teraproc

http://www.openlava.org/ https://github.com/openlava/openlava

http://www.teraproc.com/teraproc-blog/scalability/

Does not dynamically support running in the cloud:

As is typical in very large environments, three of the hosts provisioned by AWS failed to start. Rather than waste time troubleshooting failed hosts for what were likely AWS transient errors, the test was conducted with a total of 99,620 cores:

  • 99,600 simulated cores across (999 – 3 = 996) compute hosts each with 100 slots
  • 20 available cores on the master host

Unix user permission control:

OS level login accounts were created for each of the eight users referenced above (u0 through u7) and accounts were configured to source the OpenLava environment on login.

OpenLava is an early fork of LSF. The people who did this started a company called Teraproc to provide support and cluster-as-a-service to enterprises. The community edition seems poorly supported. There's no mention of AWS at all. The enterprise edition costs money.

They allow you to run Docker images as jobs.

There is no support for non-Unix users, that is, they assume a job user is also a Unix user.

There doesn't seem to be any AWS support in the community edition, and only primitive Docker support. I suspect the real work is in their web app/backend service that spins up instances. You have to give them your AWS credentials so they can start servers for you. They don't operate like Rescale; they are much more transparent about what's happening, and they don't skim off the billing. This seems like it would hurt their business model. I didn't see pricing right off.

OSCAR

http://svn.oscar.openclustergroup.org/trac/oscar

Not maintained; the last release was 6.1.1 on 5/31/2011.
