Visual Studio optimization

bucklera · October 11, 2017, 2:24pm

Hello,
We have an application which we deploy on Mac OS, Linux, and Windows. We have recently started to build Release configurations so as to obtain the speed benefits of compiler optimization. Doing so resulted in a 5x or greater improvement in the ITK-based algorithm run times using the Xcode compiler, but only a 2x improvement on Windows using the Visual Studio compiler. My hope is that someone has experience with this that may help us make further improvements on Windows.

Settings I have tried: /Ox vs. /O2, /Ob2, /Oi, /Ot, /Oy, /GT, and /GL all result in the same run times (as stated above, about 2x faster than with none of these options set).

In order to understand whether the Windows configuration was in fact already as fast as the Mac, that is, whether the unoptimized Windows build was faster than the unoptimized Mac build, I made a comparison between the most closely matched hardware I had available. The hardware was the same except for clock speed and core count:
Reference Mac configuration that resulted in a 36 second run time (with optimization) for an example ITK-based algorithm: 2.2GHz i7, 4 cores, 16GB memory
Reference Win10 configuration that resulted in a 38 second run time (with optimization) for an example ITK-based algorithm: 2.6GHz i7, 8 cores, 16GB memory.

No disk activity occurred with my test runs.

I observed that all 8 cores were in fact working on the Win10 computer, and since it runs at 2.6GHz vs. the Mac’s 2.2GHz, I would have expected the Win10 machine to be about twice as fast, which is what would have occurred if the Windows optimization were as effective on a relative basis as the Mac’s. I also checked that no swapping was occurring, that is, the memory was sufficient for all 8 cores.
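The "about twice as fast" expectation can be checked with a little arithmetic. The sketch below is only a rough model: it assumes perfect parallel scaling, equal work per clock cycle on both CPUs, and that all 8 reported cores are physical (hyper-threaded logical cores would lower the ideal figure). The function name is made up for illustration; the numbers are the ones quoted above.

```python
def ideal_throughput_ratio(cores_a, ghz_a, cores_b, ghz_b):
    """Ratio of aggregate core-GHz of machine A to machine B.

    Assumes perfect parallel scaling and identical work per clock
    cycle, which real workloads rarely achieve.
    """
    return (cores_a * ghz_a) / (cores_b * ghz_b)


# Figures quoted in the post (optimized run times in seconds).
win = {"cores": 8, "ghz": 2.6, "seconds": 38.0}
mac = {"cores": 4, "ghz": 2.2, "seconds": 36.0}

# How much faster Windows "should" be under the idealized model.
expected = ideal_throughput_ratio(win["cores"], win["ghz"], mac["cores"], mac["ghz"])

# How much faster Windows actually was (values below 1 mean slower).
observed = mac["seconds"] / win["seconds"]

print(f"ideal Windows/Mac speed ratio:    {expected:.2f}")
print(f"observed Windows/Mac speed ratio: {observed:.2f}")
```

Under these assumptions the ideal ratio comes out a little above 2, while the measured ratio is slightly below 1, which is the gap the question is about.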

Bottom line: is this the best I can do, as something intrinsic to Visual Studio and/or Windows, or is there a way to get another 2x out of it?

Thank you,
Andy

matt.mccormick · October 11, 2017, 3:26pm

Hello Andy,

Following a recent talk with a member of the Visual Studio compiler backend team, the best way to generate optimized code with Visual Studio, and many modern compilers, is to enable link time optimization, which is called LTCG in Visual Studio.

HTH,
Matt

dzenanz · October 11, 2017, 6:19pm

If your code is single-threaded, it does not matter how many cores the computer has. What happens to the run duration if you disable half the cores on your Windows computer (so it has only 4 cores active)?

Also, processors have had a frequency of about 3GHz for about 15 years now, and speed improvements come from better architectures. Which generation of Core i7 is each of the CPUs?

So it is possible you will not squeeze much more speed out of your Windows machine by tweaking compiler optimization flags. Tweaking those flags might give you some ~20% speedup. Profile guided optimization and link time optimization are two big ones.

bucklera · October 11, 2017, 7:20pm

Thank you Matt. I found the /LTCG option and enabled it. There was no measurable change.
In looking carefully at the Optimization options under Linker, I see References, Enable COMDAT Folding, Function Order, Profile Guided Database, and Link Time Code Generation. The first three are blank, and of course I now have “Use Link Time Code Generation” set. Do you have any recommendations for the others? Regarding the Profile Guided Database, do you think this would be profitable? I see there is some kind of plug-in, but it seems to be only for 2013, and I have 2015 (the installation fails).

Andy

matt.mccormick · October 11, 2017, 7:54pm

Interesting that there was no measurable change. Was the build coupled with optimization, e.g. /O2?

COMDAT folding could reduce the size of the generated executable, but it is unlikely to change its speed.

Profile guided optimization may help, but it takes work to apply it. The target application has to be executed under a profiler, and the build needs to be directed to the information the profiler collects.

bucklera · October 11, 2017, 8:11pm

The compiler flags are presently set to /O2 /Ob2 /Oi /Ot /Oy /GT /GL

dzenanz · October 12, 2017, 1:36pm

Here is an article about how to use PGO and why it is helpful.

bucklera · October 12, 2017, 9:15pm

Thank you, this was helpful. I did try PGO and successfully ran the steps. There was less than 5% improvement. But it was instructive to go through this; I suspect it will make a lesser or greater difference depending on the nature of the application. In this case it was not worth the time, but I will keep it in mind for the future.

Also, I did look more carefully at the actual i7s in the two systems I was using for my comparison.
One was an i7-4770, and the other an i7-6770. Aside from the lithography difference, they appear to be essentially the same. Bottom line is that the optimized results on the Mac and on Windows appear to be within 10% of each other. The optimization speedup of 2x on Windows vs. 5x on the Mac appears to be due more to the unoptimized Windows build being faster than the unoptimized Mac build than to the optimized Mac build being faster than the optimized Windows build. I think in fact that I have what I will have, and have an equivalent result, so this issue can be considered closed.
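The reasoning here can be made concrete by working backwards from the optimized timings. This is only a sketch: the thread reports "5x or more" and "about 2x", so the speedup factors below are approximations and the derived unoptimized times are estimates, not measurements. The function name is made up for illustration.

```python
def implied_unoptimized_seconds(optimized_seconds, speedup):
    """Estimate the unoptimized run time from the optimized time and speedup."""
    return optimized_seconds * speedup


# Optimized run times and approximate speedups quoted in the thread.
mac_optimized, mac_speedup = 36.0, 5.0   # "5x or more", so this is a lower bound
win_optimized, win_speedup = 38.0, 2.0   # "about 2x"

mac_unoptimized = implied_unoptimized_seconds(mac_optimized, mac_speedup)
win_unoptimized = implied_unoptimized_seconds(win_optimized, win_speedup)

# Relative gap between the two optimized builds.
optimized_gap = (win_optimized - mac_optimized) / mac_optimized

print(f"estimated unoptimized Mac:     {mac_unoptimized:.0f} s")
print(f"estimated unoptimized Windows: {win_unoptimized:.0f} s")
print(f"optimized gap (Windows vs Mac): {optimized_gap:.1%}")
```

On these estimates the unoptimized Windows build was already more than twice as fast as the unoptimized Mac build, while the optimized builds differ by only a few percent.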

Many thanks to you,
Andy

blowekamp · October 12, 2017, 9:25pm

Have you tried setting the number of threads to just 1, to evaluate the performance difference of the two optimized versions?
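One way to run that experiment without rebuilding is to time the same executable under different thread counts. The sketch below assumes the ITK version in use honors the `ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS` environment variable; if it does not, the thread count would need to be set in the application code instead. The helper names are made up for illustration, and the command shown is a placeholder that should be replaced with the real ITK-based executable and its arguments.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, env_overrides=None, repeats=3):
    """Run cmd `repeats` times and return the best wall-clock time in seconds."""
    env = dict(os.environ)
    env.update(env_overrides or {})
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command exits with a non-zero status.
        subprocess.run(cmd, env=env, check=True)
        best = min(best, time.perf_counter() - start)
    return best


def compare_thread_counts(cmd, thread_counts=(1, 4, 8)):
    """Time cmd once per thread count, passing the count via the ITK env var."""
    results = {}
    for n in thread_counts:
        overrides = {"ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS": str(n)}
        results[n] = time_command(cmd, overrides)
    return results


if __name__ == "__main__":
    # Placeholder workload; substitute the actual executable and arguments.
    placeholder = [sys.executable, "-c", "pass"]
    for n, seconds in compare_thread_counts(placeholder).items():
        print(f"{n} thread(s): {seconds:.3f} s")
```

Running this on both machines and comparing the 1-thread timings separates the effect of compiler optimization from the effect of core count.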