Arithmetic Throughput
The amount of data the IPU can crunch is impressive: 31.1 TFLOPS in single precision and 124.5 TFLOPS in mixed precision.
A key to this speed is the Accumulating Matrix Product (AMP) unit. The paper doesn't go into much detail on it, aside from stating that it accelerates matrix multiplications and convolutions.
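As a quick sanity check, the headline figures decompose cleanly into per-tile numbers. The sketch below assumes the GC2 parameters reported in the paper (1216 tiles clocked at roughly 1.6 GHz); the per-tile, per-cycle FLOP counts for the AMP pipeline are my inference from the totals rather than something the paper spells out, so treat them as illustrative.

```python
# Back-of-the-envelope check of the peak-throughput figures.
# Assumptions: 1216 tiles at ~1.6 GHz (GC2 parameters); the per-tile,
# per-cycle FLOP counts below are inferred from the published totals.
TILES = 1216
CLOCK_HZ = 1.6e9
FP32_FLOPS_PER_TILE_PER_CYCLE = 16   # single precision via the AMP unit (assumed)
MIXED_FLOPS_PER_TILE_PER_CYCLE = 64  # mixed precision via the AMP unit (assumed)

peak_fp32 = TILES * CLOCK_HZ * FP32_FLOPS_PER_TILE_PER_CYCLE
peak_mixed = TILES * CLOCK_HZ * MIXED_FLOPS_PER_TILE_PER_CYCLE

print(f"peak single precision: {peak_fp32 / 1e12:.1f} TFLOPS")   # ~31.1
print(f"peak mixed precision:  {peak_mixed / 1e12:.1f} TFLOPS")  # ~124.5
```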
Actual arithmetic performance depends heavily on the workload, so peak numbers from microbenchmarks should be taken with a grain of salt.
Memory Architecture
The nominal aggregate bandwidth across the entire IPU's memory is 45 TB/s, with a latency of 6 clock cycles. I think that explains why each tile runs 6 hardware threads: it is the smallest thread count that still hides that latency completely, keeping complexity to a minimum.
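To make that argument concrete, here is a minimal sketch assuming the tile issues one instruction per cycle and rotates through its hardware threads round-robin (my reading of the scheduling model, not a statement from the paper), and that the 45 TB/s figure is the sum of the per-tile memory bandwidths across 1216 tiles.

```python
# With round-robin scheduling, a thread gets an issue slot once every
# NUM_THREADS cycles. If that interval covers the memory latency, a load
# issued in one slot has completed by the thread's next slot, so the tile
# never stalls on its local memory.
MEMORY_LATENCY_CYCLES = 6  # tile-local memory latency from the paper

def threads_needed(latency_cycles: int) -> int:
    """Smallest thread count whose round-robin period covers the latency,
    assuming one instruction issued per cycle."""
    return max(1, latency_cycles)

print(threads_needed(MEMORY_LATENCY_CYCLES))  # -> 6, matching the hardware

# The aggregate figure also implies a per-tile share (assuming 1216 tiles):
AGGREGATE_BW_BYTES_PER_S = 45e12
TILES = 1216
print(f"per-tile bandwidth: ~{AGGREGATE_BW_BYTES_PER_S / TILES / 1e9:.0f} GB/s")
```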
The total on-chip memory of an IPU is 304 MiB, which is typically sufficient for contemporary ML applications. If you need more memory, the interconnect architecture future-proofs the IPU.
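For reference, the 304 MiB figure is simply the sum of the tile-local memories, assuming the 256 KiB per tile reported for the GC2:

```python
# Total on-chip memory as the sum of the per-tile SRAMs (assumed GC2 figures).
TILES = 1216
KIB_PER_TILE = 256
print(f"{TILES * KIB_PER_TILE / 1024:.0f} MiB per IPU")  # 304 MiB
```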
Interconnect Architecture
Each IPU board carries two IPUs, giving 608 MiB, but thanks to native inter-chip communication the programmer can effectively expand the available on-chip memory as they add compute. This provides two main benefits: performance and programmability.
Performance comes from the near-linear scaling of compute power as you add IPUs, while programmability comes from a clean abstraction layer: the programmer does not need to put in extra effort to make the additional hardware resources usable.
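A minimal sketch of the scaling argument, using the per-device figures from above (the exact numbers are approximate, and the linear-scaling assumption ignores communication overhead):

```python
# Every IPU added to a system brings its own on-chip memory along with its
# compute, so aggregate memory and aggregate peak throughput grow together.
MIB_PER_IPU = 304
PEAK_FP32_TFLOPS_PER_IPU = 31.1

def aggregate(num_ipus: int) -> dict:
    """Total resources visible to a program spanning num_ipus devices."""
    return {
        "memory_MiB": num_ipus * MIB_PER_IPU,
        "peak_fp32_TFLOPS": round(num_ipus * PEAK_FP32_TFLOPS_PER_IPU, 1),
    }

print(aggregate(2))  # one board: {'memory_MiB': 608, 'peak_fp32_TFLOPS': 62.2}
print(aggregate(8))  # four boards
```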