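(1. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073;
2. College of Computer, National University of Defense Technology, Changsha 410073)
Abstract: As system scale keeps growing, the reliability problems of large-scale computer systems are becoming increasingly prominent, and system management and maintenance are becoming more and more complex. This paper proposes a new design for a fault management system in which the computer manages itself, detecting, diagnosing, isolating, and repairing faults. The design reduces the overhead of system faults, provides users with a stable computing environment, and improves the availability of large-scale computer systems.
Keywords: large-scale computer systems; fault management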
The Design of a New Fault Management System
Long Cheng1,2, Kai Lu1,2, XiaoPing Wang1,2
1 Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, 410073
2 College of Computer, National University of Defense Technology, Changsha, 410073
Abstract: As system scale grows, the reliability problems of large-scale computer systems have become increasingly prominent, and system management and maintenance have become more and more complex. To address this problem, this paper proposes a new design for a fault management system in which the computer system detects, diagnoses, isolates, and repairs faults through self-management, thereby reducing the overhead of system faults, providing a stable computing environment for users, and improving the availability of large-scale computer systems.
Key words: large-scale computer systems; fault management
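The four stages named in the abstract (detection, diagnosis, isolation, repair) can be pictured as a simple self-management loop. The following is only a minimal sketch under stated assumptions and is not the design proposed in the paper: the `Node` type, the temperature metric, the 85 °C threshold, the single "overheating" cause, and the repair action are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    ISOLATED = "isolated"


@dataclass
class Node:
    name: str
    temperature_c: float              # hypothetical health metric
    state: State = State.HEALTHY
    log: list = field(default_factory=list)


TEMP_LIMIT_C = 85.0  # assumed threshold, not taken from the paper


def detect(node):
    """Detection: report whether the monitored metric is out of range."""
    return node.temperature_c > TEMP_LIMIT_C


def diagnose(node):
    """Diagnosis: map the symptom to a cause label (only one cause here)."""
    return "overheating"


def isolate(node):
    """Isolation: take the node out of service so the fault cannot spread."""
    node.state = State.ISOLATED
    node.log.append("isolated")


def repair(node, cause):
    """Repair: apply a stand-in fix, then return the node to service."""
    if cause == "overheating":
        node.temperature_c = 60.0     # placeholder for a real repair action
    node.state = State.HEALTHY
    node.log.append(f"repaired:{cause}")


def manage(nodes):
    """Run one detect -> diagnose -> isolate -> repair pass over all nodes."""
    handled = []
    for node in nodes:
        if detect(node):
            cause = diagnose(node)
            isolate(node)
            repair(node, cause)
            handled.append(node.name)
    return handled
```

Calling `manage` on a list of nodes returns the names of the nodes that needed handling in that pass; a real system would presumably run such a pass periodically and draw on far richer monitoring data and diagnosis rules.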
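About the author:
Long Cheng (1985- ), male, Han nationality, Master of Engineering in Computer Science and Technology, College of Computer, National University of Defense Technology. His main research interest is computer software theory.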