So what, the optimized code goes only once through the loop, and then
bails out?  If so, what is the value of 'size' when the loop ends?

OK, there was one more detail that I forgot to mention: it looks like I was also passing "-funroll-loops". After removing it, "-O3" works fine too (without "volatile"). I still think this is a toolchain bug, because the "-O3 -funroll-loops" combination works fine in the x64 build and, more importantly, it *should* work fine. Anyway, maybe it would be a good idea to prevent people from passing "-funroll-loops", for example by filtering it out of "CFLAGS" in the "Makefile"?
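
Something along these lines might do it. This is only a rough sketch: it assumes GNU make and that the flag reaches the build through "CFLAGS" (from the environment or the make command line), and I have not tried it against the actual Makefile:

```make
# Sketch only, assumes GNU make.
# 'override' is needed so the assignment also takes effect when
# CFLAGS is set on the command line, e.g. make CFLAGS="-O3 -funroll-loops".
override CFLAGS := $(filter-out -funroll-loops,$(CFLAGS))
```

Note that this removes only the exact word "-funroll-loops"; a related flag such as "-funroll-all-loops" would pass through untouched unless it is listed as well.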