Optimizing the Throughput of Storm-Based Stream Processing in Clouds
Authors: Huiyan Cao (New Jersey Institute of Technology)
Abstract: There is a rapidly growing need for processing large volumes of streaming data in real time in various big data applications. As one of the most commonly used systems for streaming data processing, Apache Storm provides a workflow-based mechanism to execute directed acyclic graph (DAG)-structured topologies. With the expansion of cloud infrastructures around the globe and the economic benefits of cloud-based computing and storage services, many such Storm workflows have been shifted or are in active transition to clouds. However, modeling the behavior of streaming data processing and improving its performance in clouds remain largely unexplored. We construct rigorous cost models to analyze the throughput dynamics of Storm workflows and formulate a budget-constrained topology mapping problem to maximize Storm workflow throughput in clouds. We show this problem to be NP-complete and design a heuristic solution that considers not only the selection of virtual machine type but also the degree of parallelism for each task (spout/bolt) in the topology. The performance superiority of the proposed mapping solution is illustrated through extensive simulations and further verified by real-life workflow experiments deployed in public clouds in comparison with the default Storm and other existing methods.
Workshop: WORKS 2018: 13th Workshop on Workflows in Support of Large-Scale Science