Wednesday, April 20, 2022

In a system with 100 processors (a multiprocessor), each processor reaches a maximum operating speed of 2 Gflops. If 2% of the code we want to run on such a system is serial and 98% is parallel, what will be the performance of the system in terms of Gflops?

If 2% of the run time is serial, then even with infinite Gflops you can't surpass a 50x speedup (1 / 0.02). If the parallel part scales perfectly to 100 processors, its 98% of the run time shrinks to 0.98%, so the total becomes 2.98% of the original run time, a speedup of about 33.6x. Roughly 2/3 of that time is serial work and 1/3 is parallel work. Running at 2 Gflops for ~2 units of time and at 200 Gflops for ~1 unit of time gives an average of ~67 Gflops.

If only the serial part ran on a core overclocked by 100% (2% of the time reduced to 1%), the average would be ~100 Gflops (1% of the time on 1 core, ~1% of the time on 100 cores, so roughly 50 cores on average). So even a single-core turbo on a 128-core CPU can boost overall performance by 25%-50% for some optimized algorithms.

For example, say a game scene has 40,000 colliding balls and you build an octree to find collisions faster on an old 4-core CPU. Building the octree takes 12 milliseconds in a single thread, and the collision computation, which is parallelizable, takes 24 milliseconds in a single thread. That allows only a 3x speedup (36 ms down to 12 ms) even with an infinite number of cores and perfect scaling. With 4 cores, the collision step takes 6 milliseconds. Added to the 12 ms of octree building, that's 18 milliseconds, or 100% more speed, with only 1/3 of the time spent in parallelized code. If the octree were built on a turbo core with a 100% overclock, the total time would be 6 ms + 6 ms = 12 ms, half of it on 4 cores and half on a single core. That means 2.5 busy cores on average and, because the single core runs at double clock during its half, 3x the single-thread speed (200% more performance).

But nothing stops you from implementing a pipeline that overlaps two frames on the timeline, so that a serial region and a parallel region run at the same time but for different frames. Then you can double the average performance (in the ideal case, i.e. equal timings for octree building and collision) or reach the performance of all cores.
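The arithmetic above can be checked with a few lines of Python. This is only a sketch under the same idealized assumptions as the text (perfect scaling, no synchronization cost). The function names `amdahl_time` and `average_gflops` and the `serial_boost` parameter are made up for illustration, and the pipelined case additionally assumes the serial stage has a core of its own next to the 4 collision cores.

```python
def amdahl_time(serial, parallel, cores, serial_boost=1.0):
    """Run time when the parallel part is split evenly across `cores`.

    `serial_boost` is a hypothetical clock multiplier applied only to
    the serial part (2.0 means a 100% overclock of that one core).
    """
    return serial / serial_boost + parallel / cores


def average_gflops(serial, parallel, cores, per_core_gflops, serial_boost=1.0):
    """Average throughput = single-core rate times the overall speedup."""
    baseline = serial + parallel  # single core at base clock
    speedup = baseline / amdahl_time(serial, parallel, cores, serial_boost)
    return per_core_gflops * speedup


# The 100-processor question: 2% serial, 98% parallel, 2 Gflops per core.
plain = average_gflops(2.0, 98.0, 100, 2.0)
turbo = average_gflops(2.0, 98.0, 100, 2.0, serial_boost=2.0)

# The octree example: 12 ms serial build, 24 ms parallelizable collisions.
frame_4core = amdahl_time(12.0, 24.0, 4)
frame_turbo = amdahl_time(12.0, 24.0, 4, serial_boost=2.0)

# Pipelined frames: the two stages overlap, so a new frame finishes
# every max(stage times). Assumes the serial stage has its own core.
pipelined_interval = max(12.0 / 2.0, 24.0 / 4)

print(f"100 cores: {plain:.1f} Gflops, with serial turbo: {turbo:.1f} Gflops")
print(f"octree frame: {frame_4core} ms, with turbo: {frame_turbo} ms")
print(f"pipelined frame interval: {pipelined_interval} ms")
```

Note that pipelining improves throughput (frames per second), not the latency of a single frame, which still passes through both stages.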
