Abstract
At a high level, this paper takes a deeper look at the Graphcore IPU through microbenchmarking, covering operations such as gather and scatter (see the gather sketch after this list). The authors address:
- memory performance
- latency and bandwidth
- compute power
- actual (achieved vs. peak) performance
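To make the methodology concrete, here is a minimal sketch of what a gather microbenchmark could look like on a CPU host. The buffer size, random index distribution, and timing harness are illustrative choices, not the paper's actual code.

```cpp
// Hedged sketch of a gather microbenchmark: dst[i] = src[idx[i]] over a
// random permutation, timed to estimate effective bandwidth.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t N = 1 << 24;                     // 16M floats (~64 MiB)
    std::vector<float> src(N, 1.0f), dst(N);
    std::vector<std::uint32_t> idx(N);
    std::iota(idx.begin(), idx.end(), 0u);
    std::shuffle(idx.begin(), idx.end(), std::mt19937{42}); // random gather pattern

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i)
        dst[i] = src[idx[i]];                               // the gather itself
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    double sink = std::accumulate(dst.begin(), dst.end(), 0.0); // keep dst live
    // Bytes per element: 4 (idx read) + 4 (src read) + 4 (dst write).
    std::printf("gather: %.2f GB/s (checksum %.0f)\n", 12.0 * N / s / 1e9, sink);
    return 0;
}
```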
Memory Architectures of CPU, GPU, and IPU Compared
CPUs rely on a hierarchy of memory caches, combined with sophisticated branch prediction and prefetching, to guess upcoming instructions and data accesses and fetch them ahead of time, so that, on average, memory latency is hidden.
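A classic way to see where this hiding breaks down is a pointer chase. The sketch below, with arbitrary sizes and step counts, compares a sequential chain (which caches and the prefetcher handle well) against a random chain whose dependent loads expose the full memory latency.

```cpp
// Hedged sketch: dependent loads through a sequential vs. a random chain.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Each load depends on the previous one, so time per hop approximates latency.
static std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t p = 0;
    for (std::size_t i = 0; i < steps; ++i) p = next[p];
    return p;  // returned so the compiler cannot elide the loop
}

int main() {
    constexpr std::size_t N = 1 << 22;      // 32 MiB of pointers, beyond typical caches
    constexpr std::size_t steps = 1 << 24;

    std::vector<std::size_t> seq(N), rnd(N), order(N);
    for (std::size_t i = 0; i < N; ++i) seq[i] = (i + 1) % N;  // prefetch-friendly
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937{1});
    for (std::size_t i = 0; i < N; ++i)      // one random cycle through all nodes
        rnd[order[i]] = order[(i + 1) % N];

    const char* names[] = {"sequential", "random"};
    const std::vector<std::size_t>* chains[] = {&seq, &rnd};
    for (int k = 0; k < 2; ++k) {
        auto t0 = std::chrono::steady_clock::now();
        volatile std::size_t sink = chase(*chains[k], steps);
        auto t1 = std::chrono::steady_clock::now();
        (void)sink;
        std::printf("%s: %.1f ns/load\n", names[k],
                    std::chrono::duration<double, std::nano>(t1 - t0).count() / steps);
    }
    return 0;
}
```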
GPUs typically use smaller, simpler cores than CPUs and do not employ branch prediction of comparable sophistication; instead, their workloads let them keep many threads in flight over a batch of memory, interleaving memory accesses so that each thread's latency is hidden by work from the others.
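As a rough host-side analogy (plain C++, not GPU code), the sketch below runs several independent random pointer chases concurrently. Overlapping their cache misses raises aggregate throughput, which is the same memory-level-parallelism trick GPUs apply in hardware across resident warps; the thread counts and sizes here are arbitrary stand-ins.

```cpp
// Hedged analogy: concurrent independent chains overlap their miss latencies.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

static std::vector<std::size_t> random_cycle(std::size_t n, unsigned seed) {
    std::vector<std::size_t> order(n), next(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937{seed});
    for (std::size_t i = 0; i < n; ++i) next[order[i]] = order[(i + 1) % n];
    return next;
}

int main() {
    constexpr std::size_t N = 1 << 22, steps = 1 << 23;
    for (unsigned nthreads : {1u, 8u}) {
        std::vector<std::vector<std::size_t>> chains;
        for (unsigned t = 0; t < nthreads; ++t) chains.push_back(random_cycle(N, t + 1));

        std::vector<std::thread> pool;
        std::vector<std::size_t> sinks(nthreads);
        auto t0 = std::chrono::steady_clock::now();
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                std::size_t p = 0;
                for (std::size_t i = 0; i < steps; ++i) p = chains[t][p];
                sinks[t] = p;  // keep each walk observable
            });
        for (auto& th : pool) th.join();
        auto t1 = std::chrono::steady_clock::now();

        double s = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%u thread(s): %.1f M loads/s total\n",
                    nthreads, static_cast<double>(nthreads) * steps / s / 1e6);
    }
    return 0;
}
```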
The IPU, by contrast, provides only on-chip memory: 256 KiB per tile, organized as a scratchpad so that each processor has full control over its own memory. That memory is built from SRAM, which is much faster than DRAM, and each tile additionally runs 6 independent hardware threads to hide what latency remains.
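A quick back-of-envelope sketch of what those numbers imply. The 256 KiB per tile and 6 threads per tile come from the text above; the 1,216-tile count is an assumption matching the GC2-generation part such microbenchmarking papers typically target.

```cpp
// Hedged capacity arithmetic for the IPU's per-tile scratchpad.
#include <cstdio>

int main() {
    constexpr double kib_per_tile = 256.0;  // scratchpad per tile (from the text)
    constexpr int tiles = 1216;             // ASSUMED GC2 tile count
    constexpr int threads_per_tile = 6;     // hardware threads per tile (from the text)

    double total_mib = kib_per_tile * tiles / 1024.0;
    double floats_per_tile = kib_per_tile * 1024.0 / sizeof(float);
    std::printf("total on-chip SRAM:  %.0f MiB\n", total_mib);      // 304 MiB
    std::printf("float32 per tile:    %.0f\n", floats_per_tile);    // 65536
    std::printf("float32 per thread:  %.0f (if split evenly)\n",
                floats_per_tile / threads_per_tile);                 // ~10923
    return 0;
}
```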