4555

Fault Tolerance with High Performance for Matrix Multiplication

Schwartz Oded, HUJI, School of Computer Science and Engineering, Computer Science

Application

  • The increase in machine size and the decrease in operating voltage cause more hard errors (component failures) and soft errors (bit flips) in high performance computing.
  • Hard-error resiliency solutions such as checkpoint-restart are costly and severely degrade performance. These solutions are based on distributed "2D" algorithms and hence guarantee optimal performance only when memory size is minimal.
  • When more memory is available, a significant increase in the number of processors is required, and the inter-processor communication costs are asymptotically larger than the lower bounds dictate.

Our Innovation

A novel computation model for fault-tolerant matrix multiplication algorithms that reduces resource overhead, minimizing both the number of additional processors required and the communication costs.

  • Enable redundant memory
  • Obtain resiliency for Strassen and Strassen-like algorithms with small cost overheads.
  • Lower bounds on additional resources
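
For readers unfamiliar with the Strassen algorithm named above, the following is a minimal, sequential sketch of the textbook recursion, which multiplies two 2x2 block matrices with seven block multiplications instead of eight. It illustrates only the base algorithm and is not the fault-tolerant, distributed scheme described in this brief. The function names (`strassen`, `mat_add`, `mat_sub`, `split`) are hypothetical, and the sketch assumes square matrices whose dimension is a power of two.

```python
def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Split a square matrix into four quadrants: M11, M12, M21, M22."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])


def strassen(A, B):
    """Multiply square matrices A and B (dimension assumed a power of two)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven recursive block products (classical multiplication needs eight).
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M2 = strassen(mat_add(A21, A22), B11)
    M3 = strassen(A11, mat_sub(B12, B22))
    M4 = strassen(A22, mat_sub(B21, B11))
    M5 = strassen(mat_add(A11, A12), B22)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))

    # Combine the seven products into the four quadrants of C = A * B.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)

    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

For example, `strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`. A practical implementation would switch to classical multiplication below some block size and would distribute the seven sub-products across processors; both are omitted here for brevity.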

Opportunity

  • Lower communication costs
  • Better computation and high performance

Patent Status

Published US-2018-0365099-A1

Contact for more information:

Anna Pellivert
Manager BD
+972-2-6586697