* OpenBLAS and performance

From: Pjotr Prins @ 2017-12-19 10:49 UTC (permalink / raw)
To: Federico Beffa; +Cc: Guix-devel

In recent weeks I have been toying with OpenBLAS, and tweaking it
boosts performance magnificently over the standard install we do now.
A configuration for Haswell looks like:

  https://gitlab.com/genenetwork/guix-bioinformatics/blob/master/gn/packages/gemma.scm#L64

It will greatly benefit python-numpy and R users to use
multi-threading in particular.  How do we make a flavour that supports
this?  Or are channels going to solve this for us?

Btw, the latest stable release worked fine too:

  https://gitlab.com/genenetwork/guix-bioinformatics/commit/474524a5a0d57744c1727442b33d8f2889eb0397

Pj.

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: OpenBLAS and performance

From: Ludovic Courtès @ 2017-12-19 17:12 UTC (permalink / raw)
To: Pjotr Prins; +Cc: Guix-devel, Federico Beffa

Pjotr Prins <pjotr.public12@thebird.nl> skribis:

> In recent weeks I have been toying with OpenBLAS, and tweaking it
> boosts performance magnificently over the standard install we do now.
> A configuration for Haswell looks like:
>
> https://gitlab.com/genenetwork/guix-bioinformatics/blob/master/gn/packages/gemma.scm#L64

Nice!  The ‘openblas’ definition has this comment:

    ;; Build the library for all supported CPUs.  This allows
    ;; switching CPU targets at runtime with the environment variable
    ;; OPENBLAS_CORETYPE=<type>, where "type" is a supported CPU type.

Do you achieve similar performance by setting OPENBLAS_CORETYPE=haswell?
It would be nice if OpenBLAS did this automatically.

> Or are channels going to solve this for us?

No, I don’t think channels have much to do with this kind of issue.
But see the discussion we had on this topic:

  https://lists.gnu.org/archive/html/guix-devel/2017-08/msg00155.html
  https://lists.gnu.org/archive/html/guix-devel/2017-09/msg00002.html

> Btw, the latest stable release worked fine too:
>
> https://gitlab.com/genenetwork/guix-bioinformatics/commit/474524a5a0d57744c1727442b33d8f2889eb0397

It was updated in ‘core-updates’, so it’ll hopefully land soon!

Ludo’.
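The OPENBLAS_CORETYPE mechanism described above can be exercised from
Python; this is an editorial sketch, not from the thread, and it assumes
a dynamically dispatched libopenblas is on the loader path (it degrades
gracefully when OpenBLAS is absent):

```python
import ctypes
import ctypes.util
import os

# The core type must be in the environment before OpenBLAS is first
# loaded; it has no effect on an already-loaded library.
os.environ["OPENBLAS_CORETYPE"] = "Haswell"

def openblas_corename():
    """Return the kernel core name OpenBLAS selected via
    openblas_get_corename(), or None if the library (or that symbol)
    cannot be found on this machine."""
    path = ctypes.util.find_library("openblas")
    if path is None:
        return None
    lib = ctypes.CDLL(path)
    fn = getattr(lib, "openblas_get_corename", None)
    if fn is None:
        return None
    fn.restype = ctypes.c_char_p
    return fn().decode()

print(openblas_corename())
```

With a dynamic-arch build this should report the forced core type; with
a build compiled for a single target, the environment variable is simply
ignored.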
* Re: OpenBLAS and performance

From: Dave Love @ 2017-12-20 11:50 UTC (permalink / raw)
To: Pjotr Prins; +Cc: Guix-devel, Federico Beffa

Pjotr Prins <pjotr.public12@thebird.nl> writes:

> In recent weeks I have been toying with OpenBLAS, and tweaking it
> boosts performance magnificently over the standard install we do now.

How so?  I haven't measured it from Guix, but I have with Fedora
packages, and OB is basically equivalent to MKL in the normal
configuration for AVX < 512.

> A configuration for Haswell looks like:
>
> https://gitlab.com/genenetwork/guix-bioinformatics/blob/master/gn/packages/gemma.scm#L64

Why make it Haswell-specific?  The cpuid dispatch is the main reason to
use OB over at least BLIS currently.

> It will greatly benefit python-numpy and R users to use
> multi-threading in particular.  How do we make a flavour that supports
> this?

[I assume/hope it's not intended to default to multithreading.]

Fedora sensibly builds separately-named libraries for different flavours
<https://apps.fedoraproject.org/packages/openblas/sources/>, but I'd
argue also for threaded versions being available with the generic soname
in library sub-directories.  There's some discussion and measurements
(apologies if I've referenced it before) at
<https://loveshack.fedorapeople.org/blas-subversion.html> -- not that
measurements sway people who insist on Microsoft R ☹.

Fedora should sort out support for optimal BLAS/LAPACK, but those sorts
of dynamic loading tricks are important in HPC systems for various
reasons, and seem rather at odds with the Guix approach.  I should write
something about that sometime.

If you do provide some sort of threaded version for Python, then as far
as I remember it must use pthreads, not OpenMP, though you want the
OpenMP version for other purposes, and I hadn't realized there wasn't
one currently.  That's covered in some combination of the OB and Debian
issue trackers.  I don't know if the same applies to R in general.

> Or are channels going to solve this for us?
>
> Btw, the latest stable release worked fine too:
>
> https://gitlab.com/genenetwork/guix-bioinformatics/commit/474524a5a0d57744c1727442b33d8f2889eb0397
>
> Pj.

Beware that 0.2.20 has one or two significant problems that I don't
remember, but could check.
* Re: OpenBLAS and performance

From: Dave Love @ 2017-12-20 14:48 UTC (permalink / raw)
To: guix-devel

I wrote:

> If you do provide some sort of threaded version for Python, then as far
> as I remember it must use pthreads, not OpenMP, though you want the
> OpenMP version for other purposes, and I hadn't realized there wasn't
> one currently.

I was confused.  I see the only version of the library shipped is built
with pthreads.  I think there should be serial, pthreads, and OpenMP
versions, as for Fedora.  It may also be useful to provide the 64-bit
integer versions, like Fedora; that's useful for at least a flagship
chemistry program.

I also remembered the problems with 0.2.20.  I'll send a patch for the
wrong cache size used on some x86_64 when I get a chance.
* Re: OpenBLAS and performance 2017-12-20 14:48 ` Dave Love @ 2017-12-20 15:06 ` Ricardo Wurmus 2017-12-22 12:24 ` Dave Love 2017-12-20 17:22 ` Pjotr Prins 1 sibling, 1 reply; 23+ messages in thread From: Ricardo Wurmus @ 2017-12-20 15:06 UTC (permalink / raw) To: Dave Love; +Cc: guix-devel Hi Dave, > I wrote: > >> If you do provide some sort of threaded version for Python, then as far >> as I remember it must use pthreads, not OpenMP, though you want the >> OpenMP version for other purposes, and I hadn't realized there wasn't >> one currently. > > I was confused. I see the only version of the library shipped is built > with pthreads. I think there should be serial, pthreads, and OpenMP > versions, as for Fedora. Do these library variants have the same binary interface, so that a user could simply preload one of them to override the default variant we use in the input graph of a given package? -- Ricardo GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC https://elephly.net ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: OpenBLAS and performance

From: Dave Love @ 2017-12-22 12:24 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel

Ricardo Wurmus <rekado@elephly.net> writes:

>> I was confused.  I see the only version of the library shipped is built
>> with pthreads.  I think there should be serial, pthreads, and OpenMP
>> versions, as for Fedora.
>
> Do these library variants have the same binary interface, so that a user
> could simply preload one of them to override the default variant we use
> in the input graph of a given package?

Yes.  You can use LD_LIBRARY_PATH as normal if you have variants with
the right soname, like the trivial shims in the example I referenced.
You probably want versions with the implementation-specific names too.
* Re: OpenBLAS and performance

From: Pjotr Prins @ 2017-12-20 17:22 UTC (permalink / raw)
To: Dave Love; +Cc: guix-devel

On Wed, Dec 20, 2017 at 02:48:42PM +0000, Dave Love wrote:
> I was confused.  I see the only version of the library shipped is built
> with pthreads.  I think there should be serial, pthreads, and OpenMP
> versions, as for Fedora.  It may also be useful to provide the 64-bit
> integer versions, like Fedora; that's useful for at least a flagship
> chemistry program.

I was just stating that the default openblas package does not perform
well (it is single-threaded, for one).  If I compile for a target it
makes a large difference.  I don't know how we can make that available
to others apart from having special packages like the one I made.  It
looks to me like python-numpy could benefit, but how we deploy that as a
flavour -- I have no idea.

> I also remembered the problems with 0.2.20.  I'll send a patch for the
> wrong cache size used on some x86_64 when I get a chance.

It is a stable release.  So far no problems on my end using the latest
git checkout of OpenBLAS.

Pj.
* Re: OpenBLAS and performance

From: Ricardo Wurmus @ 2017-12-20 18:15 UTC (permalink / raw)
To: Pjotr Prins; +Cc: guix-devel, Dave Love

Hi Pjotr,

> I was just stating that the default openblas package does not perform
> well (it is single-threaded, for one).

Is it really single-threaded?  I remember having a couple of problems
with OpenBLAS on our cluster when it is used with Numpy, as both would
spawn lots of threads.  The solution was to limit OpenBLAS to at most
two threads.

> If I compile for a target it makes a large difference.

The FAQ document[1] says this:

    The environment variable which control the kernel selection is
    OPENBLAS_CORETYPE (see driver/others/dynamic.c) e.g. export
    OPENBLAS_CORETYPE=Haswell.  And the function char*
    openblas_get_corename() returns the used target.

[1]: https://github.com/xianyi/OpenBLAS/wiki/Faq

Have you tried this and compared the performance?

--
Ricardo
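The oversubscription Ricardo describes (application-level parallelism
multiplying with BLAS-level threads) is normally tamed through
environment variables that OpenBLAS reads at load time; a minimal
editorial sketch, assuming a pthreads or OpenMP build:

```python
import os

# Cap BLAS threading *before* numpy (and thus OpenBLAS) is first
# imported; once the thread pool exists these variables are ignored.
os.environ["OPENBLAS_NUM_THREADS"] = "2"   # read by the pthreads build
os.environ["OMP_NUM_THREADS"] = "2"        # read by an OpenMP build

# Only import numpy after the limits are in place, e.g.:
# import numpy as np
```

With the limits set first, each worker process in a parallel numpy job
spawns at most two BLAS threads instead of one per core.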
* Re: OpenBLAS and performance

From: Pjotr Prins @ 2017-12-20 19:28 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel, Dave Love

On Wed, Dec 20, 2017 at 07:15:16PM +0100, Ricardo Wurmus wrote:
> Is it really single-threaded?  I remember having a couple of problems
> with OpenBLAS on our cluster when it is used with Numpy, as both would
> spawn lots of threads.  The solution was to limit OpenBLAS to at most
> two threads.

Looks like 1 on my system.

> > If I compile for a target it makes a large difference.
>
> The FAQ document[1] says this:
>
>     The environment variable which control the kernel selection is
>     OPENBLAS_CORETYPE (see driver/others/dynamic.c) e.g. export
>     OPENBLAS_CORETYPE=Haswell.  And the function char*
>     openblas_get_corename() returns the used target.
>
> [1]: https://github.com/xianyi/OpenBLAS/wiki/Faq
>
> Have you tried this and compared the performance?

About a 10x difference on 24+ cores for matrix multiplication (my
version vs what comes with Guix).

I do think we need to default to a conservative openblas for general
use.  The question is how we make it fly on dedicated hardware.
Package python-numpy:openblas-haswellp for the parallel version?  Also
for R and others.  The problem is that we blow up the number of package
variants.

Pj.
* Re: OpenBLAS and performance

From: Ricardo Wurmus @ 2017-12-20 20:00 UTC (permalink / raw)
To: Pjotr Prins; +Cc: guix-devel, Dave Love

Pjotr Prins <pjotr.public12@thebird.nl> writes:

> About a 10x difference on 24+ cores for matrix multiplication (my
> version vs what comes with Guix).
>
> I do think we need to default to a conservative openblas for general
> use.  The question is how we make it fly on dedicated hardware.

Have you tried preloading the special library with LD_PRELOAD?

--
Ricardo
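For reference, preloading works by letting the dynamic linker resolve
BLAS symbols from the tuned library before the one the binary was linked
against; an editorial sketch, where the store path is hypothetical (a
missing preload object only produces a loader warning on stderr, so the
snippet runs either way):

```python
import os
import subprocess
import sys

# Hypothetical path to a tuned build; any library exporting the same
# soname/ABI symbols can be substituted without rebuilding the client.
tuned = "/gnu/store/...-openblas-haswell/lib/libopenblas.so.0"

env = dict(os.environ, LD_PRELOAD=tuned)
# The child process resolves dgemm_ and friends from the preloaded
# library first, falling back to its linked BLAS otherwise.
result = subprocess.run(
    [sys.executable, "-c", "print('BLAS client ran')"],
    env=env, capture_output=True, text=True)
print(result.stdout.strip())
```

The same pattern is what a site-wide wrapper script or module file would
do on a cluster.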
* Re: OpenBLAS and performance

From: Pjotr Prins @ 2017-12-20 20:32 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel, Dave Love

On Wed, Dec 20, 2017 at 09:00:46PM +0100, Ricardo Wurmus wrote:
> > I do think we need to default to a conservative openblas for general
> > use.  The question is how we make it fly on dedicated hardware.
>
> Have you tried preloading the special library with LD_PRELOAD?

It is not a question of what I can do.  It is a question of how we give
other people the benefit of optimized libs in an elegant way.

I think channels actually should make a difference if we don't cater
for such use cases in default Guix.  Performance matters in some
circles.

Pj.
* Re: OpenBLAS and performance

From: Eric Bavier @ 2017-12-20 19:02 UTC (permalink / raw)
To: Pjotr Prins; +Cc: guix-devel, Dave Love

On Wed, 20 Dec 2017 21:32:15 +0100 Pjotr Prins <pjotr.public12@thebird.nl> wrote:

> On Wed, Dec 20, 2017 at 09:00:46PM +0100, Ricardo Wurmus wrote:
> > Have you tried preloading the special library with LD_PRELOAD?
>
> It is not a question of what I can do.  It is a question of how we give
> other people the benefit of optimized libs in an elegant way.

Related only to this specific case of BLAS libraries, and not to the
general idea of optimized libraries: I recently discovered "FlexiBLAS"
from the Max Planck Institute
<https://www.mpi-magdeburg.mpg.de/projects/flexiblas>, which I thought
might be useful for Guix.  It lets one choose the desired BLAS backend
at runtime via a configuration file or environment variables.  In its
current state it needs a little configuration before use, but I think
with a little work we could make picking a BLAS implementation as easy
as, e.g.:

  guix package -i python-numpy openblas-haswellp

or

  guix package -i python-numpy atlas-24

where the python-numpy package is the same in both cases, built with a
"flexiblas" input.

> Performance matters in some circles.

This should let people choose the BLAS implementation that is best for
their hardware/application.  It could also let Guix packages use
vendor-supplied BLAS libraries.

Just a thought,
`~Eric
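As a sketch of the FlexiBLAS model Eric describes: the real BLAS is
chosen when the wrapper library loads, from an environment variable or
an rc file, so the application binary never changes.  This is an
editorial illustration based on FlexiBLAS's documented interface; the
backend names are assumptions about a particular site's configuration:

```python
import os

# FlexiBLAS resolves the real BLAS at load time; the FLEXIBLAS variable
# (or a flexiblasrc file) names the backend the wrapper forwards to.
os.environ["FLEXIBLAS"] = "OPENBLAS"   # assumed backend name
# os.environ["FLEXIBLAS"] = "NETLIB"   # e.g. the reference BLAS instead

# A program linked against libflexiblas, launched with this environment,
# would now execute its dgemm calls in the selected backend.
```

This is the same late-binding idea as the LD_PRELOAD trick, but mediated
by a dedicated dispatch library rather than the dynamic linker.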
* Re: OpenBLAS and performance

From: Dave Love @ 2017-12-21 16:38 UTC (permalink / raw)
To: Eric Bavier; +Cc: guix-devel

Eric Bavier <ericbavier@centurylink.net> writes:

> Related only to this specific case of BLAS libraries, and not to the
> general idea of optimized libraries: I recently discovered "FlexiBLAS"
> from the Max Planck Institute
> <https://www.mpi-magdeburg.mpg.de/projects/flexiblas>, which I thought
> might be useful for Guix.

That's a new one on me; I'll see how it works.  (You'd hope you could do
it with weak symbols or other ELFin stuff, but I couldn't see how.)

> It lets one choose the desired BLAS backend at runtime via a
> configuration file or environment variables.

The Fedora package I referenced also does that, makes it easy to have
local defaults on heterogeneous clusters, and has been used in
production.  The same technique allows you to use proprietary BLAS if
necessary.

> In its current state it needs a little configuration before use, but I
> think with a little work we could make picking a BLAS implementation as
> easy as, e.g.:
>
>   guix package -i python-numpy openblas-haswellp

Really, you shouldn't need to do that.

By the way, there's hope for free ~MKL-equivalent L3 BLAS on avx512 from
some work that's promised in the new year.  (BLIS dgemm currently has
~70% of MKL performance.)
* Re: OpenBLAS and performance

From: Ricardo Wurmus @ 2017-12-20 23:02 UTC (permalink / raw)
To: Pjotr Prins; +Cc: guix-devel, Dave Love

Pjotr Prins <pjotr.public12@thebird.nl> writes:

> On Wed, Dec 20, 2017 at 09:00:46PM +0100, Ricardo Wurmus wrote:
>> Have you tried preloading the special library with LD_PRELOAD?
>
> It is not a question of what I can do.  It is a question of how we give
> other people the benefit of optimized libs in an elegant way.

I’m asking because preloading different BLAS libraries is a thing.  If
this works then we can ask people to just pick their favourite BLAS
library variant and preload it.  We don’t need to build all combinations
of library variants and applications.

> I think channels actually should make a difference if we don't cater
> for such use cases in default Guix.

I think channels (and really: alternative build farms) do make sense for
tuned builds.

--
Ricardo
* Re: OpenBLAS and performance

From: Pjotr Prins @ 2017-12-21 10:36 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel, Dave Love

On Thu, Dec 21, 2017 at 12:02:55AM +0100, Ricardo Wurmus wrote:
> I’m asking because preloading different BLAS libraries is a thing.  If
> this works then we can ask people to just pick their favourite BLAS
> library variant and preload it.  We don’t need to build all combinations
> of library variants and applications.

Ah, sorry, I misunderstood.  Let me play with that a little.  Do note
that it is a bit more complex than it looks.  For example, often you
need a CBLAS API.  This comes built into gslcblas and openblas.  So, to
use ATLAS you also need libgslcblas.so; with an optimized openblas you
don't.

> I think channels (and really: alternative build farms) do make sense for
> tuned builds.

+1.  The use case is simply:

  guix channel pjotr-optimized-haswell
  guix package -i python-numpy

which would install my optimized edition.

Pj.
* Re: OpenBLAS and performance

From: Ludovic Courtès @ 2017-12-21 14:43 UTC (permalink / raw)
To: Pjotr Prins; +Cc: guix-devel, Dave Love

Pjotr Prins <pjotr.public12@thebird.nl> skribis:

> On Wed, Dec 20, 2017 at 07:15:16PM +0100, Ricardo Wurmus wrote:
>> The FAQ document[1] says this:
>>
>>     The environment variable which control the kernel selection is
>>     OPENBLAS_CORETYPE (see driver/others/dynamic.c) e.g. export
>>     OPENBLAS_CORETYPE=Haswell.
>>
>> [1]: https://github.com/xianyi/OpenBLAS/wiki/Faq
>>
>> Have you tried this and compared the performance?
>
> About a 10x difference on 24+ cores for matrix multiplication (my
> version vs what comes with Guix).

Even when you use OPENBLAS_CORETYPE=haswell (lower-case?)?  That would
be surprising: it’s the same code after all.  The only difference should
be what happens at load time.

Ludo’.
* Re: OpenBLAS and performance

From: Dave Love @ 2017-12-22 14:35 UTC (permalink / raw)
To: guix-devel

For what it's worth, I get 37000 Mflops from the dgemm.goto benchmark
using the current Guix openblas and OPENBLAS_NUM_THREADS=1 at a size of
7000 on a laptop with an "i5-6200U CPU @ 2.30GHz" (avx2).  That looks
about right, and it should more-or-less plateau at that size.  For
comparison, I get 44000 on a cluster node ("E5-2690 v3 @ 2.60GHz") with
its serial build of 0.2.19.  (I mis-remembered the sandybridge figures,
which should be low 20s, not high 20s.)

If you see something much different, perhaps the performance counters
give a clue, e.g. with Guix's scorep/cube, oprofile, or perf.

I've sent a patch for the correct cache size on haswell, but I don't
think it makes much difference in this case.
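The figures above come from OpenBLAS's own dgemm.goto benchmark; a rough
numpy-based equivalent for sanity-checking a build is sketched below
(editorial, assuming numpy is available; absolute numbers will differ
somewhat from the C benchmark):

```python
import time

import numpy as np

def dgemm_gflops(n=2000, reps=3):
    """Estimate double-precision GEMM throughput in GFLOP/s for
    n-by-n matrices, as a crude stand-in for dgemm.goto."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    a @ b                                # warm-up: thread pool, caches
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    dt = (time.perf_counter() - t0) / reps
    return 2.0 * n ** 3 / dt / 1e9       # ~2*n^3 flops per multiply

print(f"{dgemm_gflops():.0f} GFLOP/s")
```

Running this once with OPENBLAS_NUM_THREADS=1 and once unrestricted
makes the serial-vs-threaded comparisons in this thread easy to
reproduce locally.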
* Re: OpenBLAS and performance

From: Dave Love @ 2017-12-21 16:17 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel

Ricardo Wurmus <rekado@elephly.net> writes:

> Hi Pjotr,
>
>> I was just stating that the default openblas package does not perform
>> well (it is single-threaded, for one).
>
> Is it really single-threaded?  I remember having a couple of problems
> with OpenBLAS on our cluster when it is used with Numpy, as both would
> spawn lots of threads.  The solution was to limit OpenBLAS to at most
> two threads.

Yes, it's symlinked from the libopenblasp variant, which is linked
against libpthread, and I'd expect such problems.  Anyhow, there's
something badly wrong if it doesn't perform roughly equivalently to MKL
on SIMD other than AVX512.  If I recall correctly, the DGEMM
single-threaded performance/core for HPC-type Sandybridge is in the high
20s GFLOPs, and roughly double that for avx2 ({Has,broad}well).

I don't think the bad L2 cache value that is currently used for Haswell
has much effect in that case, but it does in other benchmarks.  I'll
supply a patch for that.

Another point about the OB package is that it excludes LAPACK for some
reason that doesn't seem to be recorded.  I think that should be
included, partly for convenience, and partly because it optimizes some
of LAPACK.
* Re: OpenBLAS and performance

From: Ricardo Wurmus @ 2017-12-21 16:46 UTC (permalink / raw)
To: Dave Love; +Cc: guix-devel

Dave Love <fx@gnu.org> writes:

> Another point about the OB package is that it excludes LAPACK for some
> reason that doesn't seem to be recorded.  I think that should be
> included, partly for convenience, and partly because it optimizes some
> of LAPACK.

That was me, I think.  I did this because I assumed that if users want
LAPACK they’d just install the lapack package.  If this turns out to be
a misguided idea because the OB LAPACK differs, then I’m fine with
enabling LAPACK in the OB package.

(I’m not very knowledgeable about all of this.  I just happened to
package OpenBLAS first.)

--
Ricardo
* Re: OpenBLAS and performance

From: Ludovic Courtès @ 2017-12-21 14:55 UTC (permalink / raw)
To: Dave Love; +Cc: Guix-devel, Eric Bavier, Federico Beffa

Hello,

Dave Love <fx@gnu.org> skribis:

> Fedora sensibly builds separately-named libraries for different flavours
> <https://apps.fedoraproject.org/packages/openblas/sources/>, but I'd
> argue also for threaded versions being available with the generic soname
> in library sub-directories.  There's some discussion and measurements
> (apologies if I've referenced it before) at
> <https://loveshack.fedorapeople.org/blas-subversion.html>

I like the idea of an ‘update-alternatives’ kind of approach for
interchangeable implementations.

Unfortunately my understanding is that implementations aren’t entirely
interchangeable, especially for LAPACK (not sure about BLAS), because
BLIS, OpenBLAS, etc. implement slightly different subsets of netlib
LAPACK, AIUI.  Packages also often check for specific implementations in
their configure/CMakeLists.txt rather than just for “BLAS” or “LAPACK”.

FlexiBLAS, which Eric mentioned, looks interesting because it’s designed
specifically for that purpose.  Perhaps worth giving it a try.

Besides, it would be good to have a BLAS/LAPACK policy in Guix.  We
should at least agree (1) on default BLAS/LAPACK implementations, and
(2) possibly on a naming scheme for variants based on a different
implementation.  For #1 we should probably favor implementations that
support run-time implementation selection, such as OpenBLAS (or the
coming BLIS release).

Thoughts?

Ludo’.
* Re: OpenBLAS and performance

From: Dave Love @ 2017-12-22 12:45 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix-devel, Eric Bavier, Federico Beffa

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> I like the idea of an ‘update-alternatives’ kind of approach for
> interchangeable implementations.

/etc/ld.so.conf.d normally provides a clean way to flip the default, but
that isn't available in Guix as far as I remember.

> Unfortunately my understanding is that implementations aren’t entirely
> interchangeable, especially for LAPACK (not sure about BLAS), because
> BLIS, OpenBLAS, etc. implement slightly different subsets of netlib
> LAPACK, AIUI.

LAPACK may add new routines, but you can always link with the vanilla
Netlib version, and openblas is currently only one release behind.  The
LAPACK release notes I've seen aren't very helpful for following that.
The important requirement is fast GEMM from the optimized BLAS.  I
thought BLIS just provided the BLAS layer, which is quite stable, isn't
it?

> Packages also often check for specific implementations in their
> configure/CMakeLists.txt rather than just for “BLAS” or “LAPACK”.

It doesn't matter what they're built against when you dynamically load a
compatible version.  (You'd hope a build system would be able to find
arbitrary BLAS, but I'm too familiar with cmake pain.)  The openblas
compatibility hack basically just worked on an RHEL6 cluster when I
maintained it.

> FlexiBLAS, which Eric mentioned, looks interesting because it’s designed
> specifically for that purpose.  Perhaps worth giving it a try.

I see it works by wrapping everything, which I wanted to avoid.  Also
it's GPL, which restricts its use.  What's the advantage over just
having implementations which are directly interchangeable at load time?

> Besides, it would be good to have a BLAS/LAPACK policy in Guix.  We
> should at least agree (1) on default BLAS/LAPACK implementations, and
> (2) possibly on a naming scheme for variants based on a different
> implementation.

Yes, but the issue is wider than just linear algebra.  It seems to
reflect tension between Guix's approach (as I understand it) and the
late binding I expect to use.  There are potentially other libraries
with similar micro-architecture-specific issues, and the related one of
profiling/debugging versions.  I don't know how much of a real problem
there really is, and it would be good to know if someone has addressed
this.  It's a reason I'm currently not convinced about the trade-offs
with Guix, and don't go along with the "reproducibility" mantra.
Obviously I'm not writing Guix off, though, and I hope the discussion is
useful.

> For #1 we should probably favor implementations that support run-time
> implementation selection, such as OpenBLAS (or the coming BLIS release).
>
> Thoughts?

Yes, but even with dynamic dispatch you need to account for situations
like we currently have on x86_64, with OB not supporting the latest
micro-architecture, and it only works on x86 with OB.  You may also want
to avoid overhead -- see FFTW's advice for packaging.  Oh for SIMD
hwcaps...
* Re: OpenBLAS and performance 2017-12-22 12:45 ` Dave Love @ 2017-12-22 15:10 ` Ludovic Courtès 2017-12-22 16:08 ` Pjotr Prins 0 siblings, 1 reply; 23+ messages in thread From: Ludovic Courtès @ 2017-12-22 15:10 UTC (permalink / raw) To: Dave Love; +Cc: Guix-devel, Eric Bavier, Federico Beffa Hi, Dave Love <fx@gnu.org> skribis: > Ludovic Courtès <ludovic.courtes@inria.fr> writes: > >> Hello, >> >> Dave Love <fx@gnu.org> skribis: >> >>> Fedora sensibly builds separately-named libraries for different flavours >>> <https://apps.fedoraproject.org/packages/openblas/sources/>, but I'd >>> argue also for threaded versions being available with the generic soname >>> in librray sub-directories. There's some discussion and measurements >>> (apologies if I've referenced it before) at >>> <https://loveshack.fedorapeople.org/blas-subversion.html> >> >> I like the idea of an ‘update-alternative’ kind of approach for >> interchangeable implementations. > > /etc/ld.so.conf.d normally provides a clean way to flip the default, > but that isn't available in Guix as far as I remember. Right. >> Unfortunately my understanding is that implementations aren’t entirely >> interchangeable, especially for LAPACK (not sure about BLAS), because >> BLIS, OpenBLAS, etc. implement slightly different subsets of netlib >> LAPACK, AIUI. > > LAPACK may add new routines, but you can always link with the vanilla > Netlib version, and openblas is currently only one release behind. The > LAPACK release notes I've seen aren't very helpful for following that. > The important requirement is fast GEMM from the optimized BLAS. I > thought BLIS just provided the BLAS layer, which is quite stable, isn't > it? I tried a while back to link PaSTiX (a sparse matrix direct solver developed by colleagues of mine), IIRC, against BLIS, and it would miss a couple of functions that Netlib LAPACK provides. 
>> Packages also often check for specific implementations in
>> their configure/CMakeLists.txt rather than just for “BLAS” or “LAPACK”.
>
> It doesn't matter what they're built against when you dynamically load a
> compatible version.

Right, but they do that precisely because all these implementations
provide different subsets of the Netlib APIs, AIUI.

>> FlexiBLAS, which Eric mentioned, looks interesting because it’s designed
>> specifically for that purpose.  Perhaps worth giving it a try.
>
> I see it works by wrapping everything, which I wanted to avoid.  Also
> it's GPL, which restricts its use.  What's the advantage over just
> having implementations which are directly interchangeable at load time?

Dunno, I haven’t dug into it.

>> Besides, it would be good to have a BLAS/LAPACK policy in Guix.  We
>> should at least agree (1) on default BLAS/LAPACK implementations, (2)
>> possibly on a naming scheme for variants based on a different
>> implementation.
>
> Yes, but the issue is wider than just linear algebra.  It seems to
> reflect tension between Guix' approach (as I understand it) and the late
> binding I expect to use.  There are potentially other libraries with
> similar micro-architecture-specific issues, and the related one of
> profiling/debugging versions.  I don't know how much of a real problem
> there really is, and it would be good to know if someone has addressed
> this.

Guix’s approach is to use static binding a lot, and late binding
sometimes.  For all things plugin-like we use late binding.  For shared
libraries (not dlopened) we use static binding.

Static binding has a cost, as you write, but it gives us control over
the environment, and the ability to capture and replicate the software
environment.  As a user, that’s something I value a lot.
I’d also argue that this is something computational scientists should
value: first because the results they publish should not depend on the
phase of the moon, and second because they should be able to provide
peers with a self-contained recipe to reproduce them.

> Yes, but even with dynamic dispatch you need to account for situations
> like we currently have on x86_64 with OB not supporting the latest
> micro-architecture, and it only works on x86 with OB.  You may also want
> to avoid overhead -- see FFTW's advice for packaging.  Oh for SIMD
> hwcaps...

I’m not sure what you mean.  That OpenBLAS does not support the latest
micro-architecture is not something the package manager can solve.  As
for overhead, it should be limited to load time, as illustrated by
IFUNC and similar designs.

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: OpenBLAS and performance
  2017-12-22 15:10 ` Ludovic Courtès
@ 2017-12-22 16:08 ` Pjotr Prins
  0 siblings, 0 replies; 23+ messages in thread
From: Pjotr Prins @ 2017-12-22 16:08 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix-devel, Eric Bavier, Dave Love, Federico Beffa

On Fri, Dec 22, 2017 at 04:10:39PM +0100, Ludovic Courtès wrote:
> Static binding has a cost, as you write, but it gives us control over
> the environment, and the ability to capture and replicate the software
> environment.  As a user, that’s something I value a lot.
>
> I’d also argue that this is something computational scientists should
> value: first because results they publish should not depend on the phase
> of the moon, second because they should be able to provide peers with a
> self-contained recipe to reproduce them.

As a scientist I value that *more* than a lot.  There is a tension
between 'just getting things done' and making things reproducible.  If
we can do the latter, we should.

Also, as a programmer I value reproducibility a lot.  I want people who
report bugs to use the exact same setup, especially when they are
running on machines I cannot access (quite common in sequencing
centers).  If someone sends me a core dump, a stack trace, or even an
assertion failure in a shared lib, it is incredibly useful when the
full stack is the same.

I am wary of flexible resolution of optimized libraries and kernels.
Look at what ATLAS tried to do and what a mess it became.  I strongly
believe we need explicit statements about what we are running.  It
does imply Guix will have to provide all options, directly or through
channels.

I also work on HPC, and if I know where I am running, I know *what* to
target.  It is a deterministic recipe.

Pj.

^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2017-12-22 19:52 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2017-12-19 10:49 OpenBLAS and performance Pjotr Prins
2017-12-19 17:12 ` Ludovic Courtès
2017-12-20 11:50 ` Dave Love
2017-12-20 14:48 ` Dave Love
2017-12-20 15:06 ` Ricardo Wurmus
2017-12-22 12:24 ` Dave Love
2017-12-20 17:22 ` Pjotr Prins
2017-12-20 18:15 ` Ricardo Wurmus
2017-12-20 19:28 ` Pjotr Prins
2017-12-20 20:00 ` Ricardo Wurmus
2017-12-20 20:32 ` Pjotr Prins
2017-12-20 19:02 ` Eric Bavier
2017-12-21 16:38 ` Dave Love
2017-12-20 23:02 ` Ricardo Wurmus
2017-12-21 10:36 ` Pjotr Prins
2017-12-21 14:43 ` Ludovic Courtès
2017-12-22 14:35 ` Dave Love
2017-12-21 16:17 ` Dave Love
2017-12-21 16:46 ` Ricardo Wurmus
2017-12-21 14:55 ` Ludovic Courtès
2017-12-22 12:45 ` Dave Love
2017-12-22 15:10 ` Ludovic Courtès
2017-12-22 16:08 ` Pjotr Prins