Yes, I think the last paragraph is central. Think at a higher, task-based level and let the load balancer do the scheduling (and the assignment to physical threads and cores) for you.
The overhead of the real-time profiling is likely minimal. Of course, in some particular cases this strategy may reduce performance, because the load balancer needs a short amount of time to sample performance and adapt (although it probably starts with default settings that are not that bad). But those problems were probably not very good candidates for parallelism in the first place.
Doing optimal manual scheduling for all present and future hardware is not that easy. For example, should the granularity be the same when vector registers are 512 bits (e.g. Xeon Phi / Skylake / Cannon Lake) versus 256 bits? Not sure (and it probably depends on whether the code is highly vectorized or not). Can we predict whether our code will play nicely with hyperthreading (i.e. depending on memory accesses / external resources), that is, whether we should put two tasks per core? Not sure either.
I like TBB's idea of introducing the higher-level concept of tasks. If you know that your problem does not scale well (or will have negative scalability), just create one big task. If it is expected to scale, decompose it into many small tasks, and batches of them will be automatically mapped (without any magic) onto physical threads.
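To make that concrete, here is a minimal sketch (plain TBB, not ITK code, and the data size and loop body are just placeholders) of what I mean: the work is expressed as a range, the library cuts it into tasks, and the scheduler maps batches of those tasks onto the physical threads.

```cpp
// Sketch: task-based decomposition with TBB. We describe the work as a
// range of independent iterations; the default auto_partitioner splits it
// into tasks and the scheduler balances them across cores, so we never
// assign work to a particular thread ourselves.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cmath>

int main()
{
  std::vector<double> data(1 << 20, 1.0); // placeholder workload

  tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, data.size()),
    [&](const tbb::blocked_range<std::size_t> & r) {
      // Each task processes one chunk of the range.
      for (std::size_t i = r.begin(); i != r.end(); ++i)
      {
        data[i] = std::sqrt(data[i]);
      }
    });

  return 0;
}
```

If you knew the problem would not scale, you could pass a grain size covering the whole range (effectively one big task), but the point is that by default you do not have to make that decision yourself.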
I understand that ITK has a long history of thinking in terms of threads. I think it wouldn't be a bad idea to add this vision of tasks, too. But this is maybe for another thread.