Skip to content

Latest commit

 

History

History
69 lines (40 loc) · 3.64 KB

sync_controller_en.md

File metadata and controls

69 lines (40 loc) · 3.64 KB

Sync Controller


Overview

As is well-known, in a distributed system, nodes usually vary in progress. So when aggregation for result is needed, Slow nodes will block the entire computing task. And this will greatly waste computation resource.

Considering machine learning's specific characteristic, ML system can actually loosen the synchronization restrictions. It is not a must to wait for all nodes to finish in every iteration. Faster workers can push their computed model delta ahead and proceed with next iteration。In this way, waiting time for each other can be reduced, which make the entire process faster.

Thus in distributed computing system Sync Controller is one of the most important function. Angel provides three levels of sync control: BSP (Bulk Synchronous Parallel)SSP (Stalness Synchronous Parallel) and ASP (Asynchronous Parallel). Among these three protocols, BSP is the most restricted synchronization protocol, whereas ASP is the least restricted. In general, looser synchronization protocol results in better speed.

Introduction to the Sync Protocols

1. BSP

BSP is the default sync protocol in Angel and widely used in other distributed computing systems. It requires waiting of all tasks to complete in every iteration.

  • Advantages: highly applicable; better convergence for each iteration
  • Disadvantages: waiting for the slowest task in each iteration, resulting in long time to complete the application
  • Use BSP in Angel: default setting

2. SSP

SSP allows the tasks to drift apart up to an upper limit, known as staleness, which is the number of iterations that the fastest task is allowed to be ahead of the slowest task.

  • Advantages: reducing waiting time among tasks to some extent, resulting in better speed
  • Disadvantages: compared to BSP, SSP needs more iterations to reach the same level of convergence, also lacks applicability for some algorithms
  • Use SSP in Angel: configure angel.staleness=N, where N must be a positive integer

3. ASP

Tasks can drift apart with no restrictions; once a task finishes, it just continues to the next iteration without waiting.

  • Advantages: eliminating the waiting time among tasks; fast
  • Disadvantages: poor applicability; also, convergence is not guaranteed
  • Use ASP in Angel: configure angel.staleness=-1

Configuration of the sync controller is as simple as setting the staleness value, though one needs to keep in mind the tradeoff between convergence quality and speed, and it is always a good practice to adjust the sync control level, together with other related parameters, to the specific machine-learning algorithm and the metrics.

Implement Principle --- Vector Clock

In Angel, sync controller is implemented using the Vector Clock algorithm.

Steps as below:

  1. Maintain a vector clock for each partition on the server side, which records each worker's clock for its operations to this partition
  2. Maintain a background sync thread on the worker side to synchronize all partitions' clocks
  3. Task decides whether to wait to get (or other retrieving operations) from PSModel based on its local clock and staleness
  4. After each iteration, call the clock method of PSModel to update the vector clock

The default invoke interface for user is simple:

	psModel.increment(update)
	……
	psModel.clock().get()
	ctx.incIteration()

Angel's flexible sync controller gives user convenient control over the synchronization protocols, making it possible to avoid serious performance issues in large-scale machine learning caused by few machines failure.