4555

Fault Tolerance with High Performance for Matrix Multiplication

Schwartz Oded, HUJI, School of Computer Science and Engineering, Computer Science

Category

Computer Science and Engineering

Keywords

High Performance Computing, Distributed Matrix Multiplication, Communication-Minimizing Algorithms, Fault Tolerance

Current development stage

TRL4 Technology validated in lab

Application

  • The increase in machine size and the decrease in operating voltage cause more hard errors (component failures) and soft errors (bit flips) in high performance computing.
  • Hard-error resiliency solutions such as checkpoint-restart are costly and severely degrade performance. These solutions are based on distributed "2D" algorithms, and hence guarantee optimal performance only for minimal memory size.
  • When more memory is available, a significant increase in the number of processors is required, and the inter-processor communication costs are asymptotically larger than the lower bounds dictate.

Our Innovation

A novel computation model for fault-tolerant matrix multiplication algorithms that reduce resource overhead, minimizing both the number of additional processors required and the communication costs.

  • Enable redundant memory
  • Obtain resiliency for Strassen and Strassen-like algorithms, with small cost overheads.
  • Lower bounds on additional resources
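For readers unfamiliar with the algorithm family named above, the following is a minimal sequential sketch of the classic Strassen recursion, which replaces the eight block products of the standard method with seven. It illustrates only the base algorithm, not the fault-tolerance mechanism or the distributed setting of the invention. The function names (`strassen`, `split`, `add`, `sub`) are illustrative, and the sketch assumes square matrices whose size is a power of two.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Split a square matrix into four equal quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply square matrices A and B (size assumed to be a power of two)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven recursive products instead of the eight used by the standard method.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))

    # Combine the products into the four quadrants of the result.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)

    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Calling `strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns the same product as the standard row-by-column method. In a distributed implementation the seven sub-products would be assigned to different processors, which is where processor failures and communication costs become relevant.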

Opportunity

  • Lower communication costs
  • Better computation and high performance

Patent Status

Published US-2018-0365099-A1

Contact for more information:

Aviv Shoher
SVP BUSINESS DEVELOPMENT
+972-2-6586635