Hybrid scheduling gets more complicated
Posted Oct 1, 2022 3:57 UTC (Sat) by jhoblitt (subscriber, #77733). In reply to: Hybrid scheduling gets more complicated by roc
Parent article: Hybrid scheduling gets more complicated
Where the latter could retire twice as many instructions at the same thermal envelope (assuming memory bandwidth isn't a bottleneck). Intel hasn't offered such a beast, so I'm guessing the scaling isn't linear.
      Posted Oct 1, 2022 5:44 UTC (Sat)
                               by roc (subscriber, #30627)
                              [Link] (8 responses)
       
In theory there could be a non-uniform workload that performs really well with 64 p-cores and 256 e-cores *all sharing memory*, but that sounds relatively unusual. 
     
    
      Posted Oct 1, 2022 7:40 UTC (Sat)
                               by Sesse (subscriber, #53779)
                              [Link] (5 responses)
       
     
    
      Posted Oct 1, 2022 12:41 UTC (Sat)
                               by jhoblitt (subscriber, #77733)
                              [Link] (1 responses)
       
     
    
      Posted Oct 1, 2022 20:36 UTC (Sat)
                               by flussence (guest, #85566)
                              [Link] 
       
     
      Posted Oct 1, 2022 21:43 UTC (Sat)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (2 responses)
       
It went down like a brick. It was barely better than Xeons and not really any cheaper, while requiring a completely new architecture. 
     
    
      Posted Oct 3, 2022 9:24 UTC (Mon)
                               by paulj (subscriber, #341)
                              [Link] (1 responses)
       
UltraSPARC T1 is probably more like it. A whole bunch of (relatively) classic, simple, in-order, 5-stage RISC SPARC CPUs, behind a thread-selector stage to rapidly switch threads and a huge memory controller. Low instructions-per-thread, but - if you loaded it up with an embarrassingly parallel IO workload (e.g. uncomplicated HTTP serving) - very high aggregate IPC. 
[In the initial versions, they neglected to add FPUs (just 1 FPU in the first T1, shared across all cores) and then - once away from Sun's in-house benchmarks and into the real world - they realised that a lot of real-world applications depend on scripting languages that use FP by default, even for integer math. ;) ] 
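To make the "low per-thread, high aggregate IPC" point concrete, here is a toy model of a single in-order core with a thread-selector stage. It is only a sketch under loud assumptions, not a model of the real T1: every instruction is assumed to stall its thread for a fixed number of cycles (standing in for memory latency), the core issues at most one instruction per cycle, and the names `simulate`, `num_threads` and `stall_cycles` are made up for illustration.

```python
def simulate(num_threads, stall_cycles, cycles=10_000):
    """Toy single-issue core that round-robins between hardware threads.

    Assumption (not from any real chip): after issuing one instruction,
    a thread is unavailable for `stall_cycles` cycles.
    Returns (aggregate IPC, IPC of thread 0).
    """
    ready_at = [0] * num_threads   # first cycle at which each thread may issue
    issued = [0] * num_threads     # instructions issued per thread
    next_thread = 0                # round-robin starting point

    for cycle in range(cycles):
        # Thread-selector stage: pick the next ready thread, if any.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued[t] += 1
                ready_at[t] = cycle + 1 + stall_cycles
                next_thread = (t + 1) % num_threads
                break
        # If no thread was ready, the core simply idles this cycle.

    return sum(issued) / cycles, issued[0] / cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        aggregate, per_thread = simulate(n, stall_cycles=3)
        print(f"{n} thread(s): aggregate IPC {aggregate:.2f}, per-thread IPC {per_thread:.2f}")
```

In this model a lone thread keeps the core busy only a quarter of the time, while four threads fill every issue slot even though each thread individually is no faster. That only holds if there are enough independent threads to run, which is why it suits embarrassingly parallel serving workloads.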
     
    
      Posted Oct 3, 2022 11:15 UTC (Mon)
                               by paulj (subscriber, #341)
                              [Link] 
       
     
      Posted Oct 1, 2022 11:31 UTC (Sat)
                               by tlamp (subscriber, #108540)
                              [Link] (1 responses)
       
For some server workloads it could also be interesting to still have both, for example 128 E-cores and 32 P-cores. 
For example, if I have an HTTP API with various endpoints, some of them quite compute-intensive and others not, a mixed setup may achieve higher throughput at lower total power consumption. 
For very specialized workloads it may be better to handle such things explicitly in the application, or in user space itself via pinning, but I have seen quite a few services where such differing request workloads occur, and it'd be nice if the OS scheduler handled that better in a general way, even if not 100% optimally. 
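As a rough illustration of the "explicit pinning in user space" option, a minimal Python sketch for Linux might look like the following. The CPU numbering (`P_CORES`, `E_CORES`) and the helper names `cores_for` and `pin_current_thread` are hypothetical; on a real machine the core types would have to be discovered rather than hard-coded, and `os.sched_setaffinity` is only available on platforms that support it (e.g. Linux).

```python
import os

# Hypothetical layout matching the 32 P-core / 128 E-core example.
# Real CPU IDs differ per machine and must be discovered, not assumed.
P_CORES = frozenset(range(0, 32))
E_CORES = frozenset(range(32, 160))


def cores_for(compute_heavy):
    """Choose a core set for a request class (illustrative policy only)."""
    return P_CORES if compute_heavy else E_CORES


def pin_current_thread(cpus):
    """Try to restrict the caller to `cpus`; return True if affinity was set.

    Does nothing (returns False) where the affinity API is missing or when
    none of the requested CPUs are available to this process.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    usable = os.sched_getaffinity(0) & set(cpus)
    if not usable:
        return False
    os.sched_setaffinity(0, usable)  # 0 means "the calling thread/process"
    return True


if __name__ == "__main__":
    # A worker handling compute-heavy endpoints would pin itself like this.
    pinned = pin_current_thread(cores_for(compute_heavy=True))
    print("pinned to P-cores" if pinned else "affinity left unchanged")
```

This is exactly the kind of per-application bookkeeping that a hybrid-aware scheduler would make unnecessary for the general case.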
     
    
      Posted Oct 2, 2022 8:40 UTC (Sun)
                               by roc (subscriber, #30627)
                              [Link] 
       
     
      Posted Oct 3, 2022 12:51 UTC (Mon)
                               by rbanffy (guest, #103898)
                              [Link] (3 responses)
       
They did - it was called Xeon Phi and making it run at its theoretical peak performance was difficult - it was better suited for HPC workloads (it borrowed heavily from their Larrabee chip) than general-purpose high-throughput tasks such as web serving. I think it's a shame it ended up abandoned. 
     
    
      Posted Oct 3, 2022 19:13 UTC (Mon)
                               by Paf (subscriber, #91811)
                              [Link] (2 responses)
       
     
    
      Posted Oct 3, 2022 20:45 UTC (Mon)
                               by jhoblitt (subscriber, #77733)
                              [Link] (1 responses)
       
Phi as an add-on accelerator had all of the same problems as GPU development (e.g. network I/O), without mature tooling (CUDA was a monopoly at the time), and without a cost/performance advantage over GPUs (or so I've heard). However, for workloads like batch HPC/HTC, the improved flops/watt could have been attractive, and it would have saved the labor investment required to deal with GPU development. 
     
    
      Posted Oct 3, 2022 23:46 UTC (Mon)
                               by rbanffy (guest, #103898)
                              [Link] 
       
Such a shame. I really wanted one. 
The first generation (it was 4 threads per core, I think, each running at a quarter speed) was very limited and extracting reasonable performance from it was not trivial. Subsequent generations were much better, but, by then, a lot of the heavy lifting was available on GPUs and the "friendly" programming of the Phi was no longer a huge advantage (if you can hide the GPU behind a library, it's not significantly uglier than using intrinsics to directly play with AVX-512). 
Again, a huge shame. I hope Intel makes something with lots of E cores, even if for no other reason than to teach programmers to write more parallel code, because clock x IPC won't grow as quickly as core counts. 
     
s/ SPARC/UltraSPARC/g
Tasks that, e.g., do lots of number crunching (or profit from advanced CPU core capabilities) can be moved to the P-cores, while the relatively large number of E-cores can serve, in parallel, a large volume of the "cheaper" responses using fewer compute resources, potentially avoiding latency spikes compared to homogeneous setups.