
Abstract

This thesis presents a comprehensive study on the application of reinforcement learning (RL) to optimize batch job scheduling in high-performance computing (HPC). The research focuses on the development and evaluation of RL-based scheduling strategies, with the aim of improving the efficiency and adaptability of HPC systems.

The first part of the thesis introduces a novel RL-based scheduler, RLScheduler, which dynamically adapts to changing job loads, optimization goals, and system settings. The scheduler employs a kernel-based neural network and trajectory filtering to learn high-quality scheduling policies, demonstrating the potential of RL in HPC scheduling.

The second part of the thesis presents SchedInspector, an RL-based inspector that integrates runtime factors into batch job scheduling. The inspector reviews and potentially rejects the decisions of base schedulers based on current runtime conditions, leading to significant improvements in job execution performance and system utilization.

The third part of the thesis focuses on the application of RL to the backfilling process in HPC scheduling. The proposed RLBackfilling algorithm challenges the common belief that better estimations of job runtime lead to more effective scheduling, demonstrating the potential of RL in optimizing the backfilling process.

The final part of the thesis diverges from the RL-based approaches to focus on the analysis of job traces in HPC environments, particularly under the influence of emerging deep learning tasks. The cross-system analysis provides valuable insights into job geometries, failure patterns, and user behaviors, guiding the design of more efficient job schedulers for future HPC systems.

The findings from this research open a promising path toward easily integrating RL-based intelligent decision-making into existing HPC job scheduling, advancing computational performance in diverse application domains.
The research underscores the transition from conventional to more intelligent and adaptive scheduling methods, emphasizing the role of RL in revolutionizing HPC scheduling strategies.
