'Predicting GPU Failures With High Precision Under Deep Learning Workloads'

October 24, 2023

“Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. In large-scale GPU clusters, GPU failures are inevitable and may cause severe consequences. For example, GPU failures disrupt distributed training, crash inference services, and result in service level agreement violations. In this paper, we study the problem of predicting GPU failures using machine learning (ML) models to mitigate their damages.”

Find the paper and full list of authors in the Proceedings of the 16th ACM International Conference on Systems and Storage.

Cheng Tan

Computer Science, Machine Learning
