Updated: Nov 11, 2019
Static vs. Dynamic Data
Many iterative applications we analyzed show a common characteristic of operating on two types of data products. Static data is used in each iteration and remain fixed throughout the computation whereas the variable data is the computed results in each iteration and typically consumed in the next iteration in many expectation maximization (EM) type algorithms.
Although some of the typical MapReduce computations such as distributed sorting and information retrieval consume very large data sets, many iterative applications we encounter operate on moderately sized data sets which can fit into the distributed memory of the computation clusters. This observation led us to explore the idea of using long running map/reduce tasks similar to the long running parallel processes in many MPI applications which last throughout the life of the computation. The long running (cacheable) map/reduce tasks allow map/reduce tasks to be configured with static data and use them without loading again and again in each iteration. Current MapReduce implementations such as Hadoop and DryadLINQ do not support this behavior and hence they initiate new map/reduce tasks and load static data in each iteration incurring considerable performance overheads.
Supports "side-effect-free" Programming
By supporting long running map/reduce tasks Twister does not encourage users to store state information in the map/reduce tasks violating the "side-effect-free" nature of the map/reduce computations rather achieving considerable performance gains by caching the static data across map/reduce tasks. The framework does not guarantee the use of same set of map/reduce tasks (objects) throughout the life of the iterative computation.
Combine step as a Further Reduction
Twister also introduce an optional reduction phase named "combine", which is another reduction phase that can be used to combine the results of the reduce phase into a single value. The user program and the combine operation run on a single process space allowing its output directly accessible to the user program. This enables the user to check conditions based on the output of the MapReduce computations.
Uses Pub/Sub Messaging
Twister uses pub/sub messaging for all the communication/data transfer requirements which eliminates the overhead in transferring data via file systems as in Hadoop or DryadLINQ. The output <Key,Value> pairs produced during the map stage get transferred directly to the reduce stage and the output of the reduce stage get transferred directly to the combined stage via the pub-sub broker network. Currently Twister uses publish-subscribe messaging capabilities of NaradaBrokering messaging infrastructure, but the framework is extensible to support any other publish-subscribe messaging infrastructure such as Active MQ .
Data Access via Local Disks
We provide two mechanisms to access data in Twister; (i) from the local disk of the computer nodes, (ii) directly from the pub-sub infrastructure. For the simplicity of the implementation, we provide a file based data access mechanism for the map/reduce tasks. Unlike Hadoop, twister does not come with the built in file system. Instead it provides a tool to manage the data across these distributed disks. Apart from the above the use of streaming enables Twister to support features such as directly sending input <Key,Value> pairs for the map stage from the user program and configuring map/reduce stages using the data sent from the user program. The above figure shows the programming model of Twister and how iterative MapReduce computations are executed using it.
Fault Tolerance [Working progress]
Providing fault tolerance support for iterative computations with Twister is currently under development.