Intelligent Batch Job Scheduling In High Performance Computing Environment

Zhang, Di

2024

Abstract

This thesis presents a comprehensive study on the application of reinforcement learning (RL) to optimize batch job scheduling in high-performance computing (HPC). The research focuses on the development and evaluation of RL-based scheduling strategies, with the aim of improving the efficiency and adaptability of HPC systems. The first part of the thesis introduces a novel RL-based scheduler, RLScheduler, which dynamically adapts to changing job loads, optimization goals, and system settings. The scheduler employs a kernel-based neural network and trajectory filtering to learn high-quality scheduling policies, demonstrating the potential of RL in HPC scheduling. The second part of the thesis presents SchedInspector, an RL-based inspector that integrates runtime factors into batch job scheduling. The inspector reviews and potentially rejects the decisions of base schedulers based on current runtime conditions, leading to significant improvements in job execution performance and system utilization. The third part of the thesis focuses on the application of RL to the backfilling process in HPC scheduling. The proposed RLBackfilling algorithm challenges the common belief that better estimations of job runtime lead to more effective scheduling, demonstrating the potential of RL in optimizing the backfilling process. The final part of the thesis diverges from the RL-based approaches to focus on the analysis of job traces in HPC environments, particularly under the influence of emerging deep learning tasks. The cross-system analysis provides valuable insights into job geometries, failure patterns, and user behaviors, guiding the design of more efficient job schedulers for future HPC systems. The findings from this research open a promising way to easily integrate RL-based intelligent decision-making into existing HPC job scheduling, advancing computational performance in diverse application domains.
The research underscores the transition from conventional to more intelligent and adaptive scheduling methods, emphasizing the role of RL in revolutionizing HPC scheduling strategies.

Details

Title

Intelligent Batch Job Scheduling In High Performance Computing Environment

Author

Zhang, Di (Computer Science)

Contributor

ProQuest (Firm) (Contributor)
University of North Carolina at Charlotte (Degree Granting Institution)
Dai, Dong (Thesis Advisor)

Date

2024

Publisher

University of North Carolina at Charlotte

Subjects

Computer science

Link to This Page

Handle: http://hdl.handle.net/20.500.13093/etd:3773

Publication Type

doctoral dissertations

Pagination

1 online resource (175 pages) : PDF

File Format

application/pdf

Degree Type

Ph.D.

Usage Statement

This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). For additional information, see http://rightsstatements.org/page/InC/1.0/.
Copyright is held by the author unless otherwise indicated.

Record Appears in

Departments and Institutes > Computer Science
Types > Doctoral Dissertations
Graduate Theses and Dissertations
