* OpenBLAS and performance
@ 2017-12-19 10:49 Pjotr Prins
  2017-12-19 17:12 ` Ludovic Courtès
  2017-12-20 11:50 ` Dave Love
  0 siblings, 2 replies; 23+ messages in thread
From: Pjotr Prins @ 2017-12-19 10:49 UTC (permalink / raw)
  To: Federico Beffa; +Cc: Guix-devel

Over the last few weeks I have been toying with OpenBLAS, and tweaking
it boosts performance magnificently over the standard install we do now. A
configuration for Haswell looks like:

  https://gitlab.com/genenetwork/guix-bioinformatics/blob/master/gn/packages/gemma.scm#L64

It would greatly benefit python-numpy and R users to use
multi-threading in particular. How do we make a flavour that supports this?
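
Roughly, what I did is a variant package along these lines (a
stripped-down sketch with a hypothetical name; the real definition is
at the gemma.scm link above):

  (define-public openblas-haswell
    (package
      (inherit openblas)
      (name "openblas-haswell")
      (arguments
       (substitute-keyword-arguments (package-arguments openblas)
         ((#:make-flags flags)
          ;; Tune the kernels for one micro-architecture instead of the
          ;; generic DYNAMIC_ARCH build, and keep the pthreads backend.
          `(append '("TARGET=HASWELL" "USE_THREAD=1")
                   (delete "DYNAMIC_ARCH=1" ,flags)))))))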

Or are channels going to solve this for us?

Btw the latest stable release worked fine too:

  https://gitlab.com/genenetwork/guix-bioinformatics/commit/474524a5a0d57744c1727442b33d8f2889eb0397

Pj.


* Re: OpenBLAS and performance
  2017-12-19 10:49 OpenBLAS and performance Pjotr Prins
@ 2017-12-19 17:12 ` Ludovic Courtès
  2017-12-20 11:50 ` Dave Love
  1 sibling, 0 replies; 23+ messages in thread
From: Ludovic Courtès @ 2017-12-19 17:12 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: Guix-devel, Federico Beffa

Pjotr Prins <pjotr.public12@thebird.nl> skribis:

> Over the last few weeks I have been toying with OpenBLAS, and tweaking
> it boosts performance magnificently over the standard install we do now. A
> configuration for Haswell looks like:
>
>   https://gitlab.com/genenetwork/guix-bioinformatics/blob/master/gn/packages/gemma.scm#L64

Nice!

The ‘openblas’ definition has this comment:

          ;; Build the library for all supported CPUs.  This allows
          ;; switching CPU targets at runtime with the environment variable
          ;; OPENBLAS_CORETYPE=<type>, where "type" is a supported CPU type.

Do you achieve similar performance by setting OPENBLAS_CORETYPE=haswell?
It would be nice if OpenBLAS did this automatically.

> Or are channels going to solve this for us?

No, I don’t think channels have much to do with this kind of issue.

But see the discussion we had on this topic:

  https://lists.gnu.org/archive/html/guix-devel/2017-08/msg00155.html
  https://lists.gnu.org/archive/html/guix-devel/2017-09/msg00002.html

> Btw the latest stable release worked fine too:
>
>   https://gitlab.com/genenetwork/guix-bioinformatics/commit/474524a5a0d57744c1727442b33d8f2889eb0397

It was updated in ‘core-updates’, so it’ll hopefully land soon!

Ludo’.


* Re: OpenBLAS and performance
  2017-12-19 10:49 OpenBLAS and performance Pjotr Prins
  2017-12-19 17:12 ` Ludovic Courtès
@ 2017-12-20 11:50 ` Dave Love
  2017-12-20 14:48   ` Dave Love
  2017-12-21 14:55   ` Ludovic Courtès
  1 sibling, 2 replies; 23+ messages in thread
From: Dave Love @ 2017-12-20 11:50 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: Guix-devel, Federico Beffa

Pjotr Prins <pjotr.public12@thebird.nl> writes:

> Over the last few weeks I have been toying with OpenBLAS, and tweaking
> it boosts performance magnificently over the standard install we do now.

How so?  I haven't measured it from Guix, but I have with Fedora
packages, and OB is basically equivalent to MKL in the normal
configuration for AVX < 512.

> A configuration for Haswell looks like:
>
>   https://gitlab.com/genenetwork/guix-bioinformatics/blob/master/gn/packages/gemma.scm#L64

Why make it Haswell-specific?  The cpuid dispatch is the main reason to
use OB over at least BLIS currently.

> It would greatly benefit python-numpy and R users to use
> multi-threading in particular. How do we make a flavour that supports this?

[I assume/hope it's not intended to default to multithreading.]

Fedora sensibly builds separately-named libraries for different flavours
<https://apps.fedoraproject.org/packages/openblas/sources/>, but I'd
argue also for threaded versions being available with the generic soname
in library sub-directories.  There's some discussion and measurements
(apologies if I've referenced it before) at
<https://loveshack.fedorapeople.org/blas-subversion.html> -- not that
measurements sway people who insist on Microsoft R ☹.  Fedora should
sort out support for optimal BLAS/LAPACK, but those sorts of dynamic
loading tricks are important in HPC systems for various reasons, and
seem rather at odds with the Guix approach.  I should write something
about that sometime.

If you do provide some sort of threaded version for Python, then as far
as I remember it must use pthreads, not OpenMP, though you want the
OpenMP version for other purposes, and I hadn't realized there wasn't
one currently.  That's covered in some combination of the OB and Debian
issue trackers.  I don't know if the same applies to R in general.

> Or are channels going to solve this for us?
>
> Btw the latest stable release worked fine too:
>
>   https://gitlab.com/genenetwork/guix-bioinformatics/commit/474524a5a0d57744c1727442b33d8f2889eb0397
>
> Pj.

Beware that 0.2.20 has one or two significant problems that I don't
remember, but could check.


* Re: OpenBLAS and performance
  2017-12-20 11:50 ` Dave Love
@ 2017-12-20 14:48   ` Dave Love
  2017-12-20 15:06     ` Ricardo Wurmus
  2017-12-20 17:22     ` Pjotr Prins
  2017-12-21 14:55   ` Ludovic Courtès
  1 sibling, 2 replies; 23+ messages in thread
From: Dave Love @ 2017-12-20 14:48 UTC (permalink / raw)
  To: guix-devel

I wrote: 

> If you do provide some sort of threaded version for Python, then as far
> as I remember it must use pthreads, not OpenMP, though you want the
> OpenMP version for other purposes, and I hadn't realized there wasn't
> one currently.

I was confused.  I see the only version of the library shipped is built
with pthreads.  I think there should be serial, pthreads, and OpenMP
versions, as for Fedora.  It may also be useful to provide the 64-bit
integer versions, like Fedora; that's useful for at least a flagship
chemistry program.
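
For reference, the Fedora naming is roughly (from memory, so check the
spec file):

  libopenblas.so    serial
  libopenblasp.so   pthreads
  libopenblaso.so   OpenMP

with the 64-bit integer (INTERFACE64) builds named analogously with a
"64" suffix.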

I also remembered the problems with 0.2.20.  I'll send a patch for the
wrong cache size used on some x86_64 when I get a chance.


* Re: OpenBLAS and performance
  2017-12-20 14:48   ` Dave Love
@ 2017-12-20 15:06     ` Ricardo Wurmus
  2017-12-22 12:24       ` Dave Love
  2017-12-20 17:22     ` Pjotr Prins
  1 sibling, 1 reply; 23+ messages in thread
From: Ricardo Wurmus @ 2017-12-20 15:06 UTC (permalink / raw)
  To: Dave Love; +Cc: guix-devel


Hi Dave,

> I wrote:
>
>> If you do provide some sort of threaded version for Python, then as far
>> as I remember it must use pthreads, not OpenMP, though you want the
>> OpenMP version for other purposes, and I hadn't realized there wasn't
>> one currently.
>
> I was confused.  I see the only version of the library shipped is built
> with pthreads.  I think there should be serial, pthreads, and OpenMP
> versions, as for Fedora.

Do these library variants have the same binary interface, so that a user
could simply preload one of them to override the default variant we use
in the input graph of a given package?

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
https://elephly.net


* Re: OpenBLAS and performance
  2017-12-20 14:48   ` Dave Love
  2017-12-20 15:06     ` Ricardo Wurmus
@ 2017-12-20 17:22     ` Pjotr Prins
  2017-12-20 18:15       ` Ricardo Wurmus
  1 sibling, 1 reply; 23+ messages in thread
From: Pjotr Prins @ 2017-12-20 17:22 UTC (permalink / raw)
  To: Dave Love; +Cc: guix-devel

On Wed, Dec 20, 2017 at 02:48:42PM +0000, Dave Love wrote:
> I wrote: 
> 
> > If you do provide some sort of threaded version for Python, then as far
> > as I remember it must use pthreads, not OpenMP, though you want the
> > OpenMP version for other purposes, and I hadn't realized there wasn't
> > one currently.
> 
> I was confused.  I see the only version of the library shipped is built
> with pthreads.  I think there should be serial, pthreads, and OpenMP
> versions, as for Fedora.  It may also be useful to provide the 64-bit
> integer versions, like Fedora; that's useful for at least a flagship
> chemistry program.

I was just stating that the default openblas package does not perform
well (it is single threaded, for one). If I compile for a target it
makes a large difference. I don't know how we can make that available
to others apart from having special packages like the one I made.

Looks to me like python-numpy could benefit, but how we deploy that as
a flavour - I have no idea.

> I also remembered the problems with 0.2.20.  I'll send a patch for the
> wrong cache size used on some x86_64 when I get a chance.

It is a stable release. So far no problems on my end using the latest
git checkout of openblas.

Pj.



* Re: OpenBLAS and performance
  2017-12-20 17:22     ` Pjotr Prins
@ 2017-12-20 18:15       ` Ricardo Wurmus
  2017-12-20 19:28         ` Pjotr Prins
  2017-12-21 16:17         ` Dave Love
  0 siblings, 2 replies; 23+ messages in thread
From: Ricardo Wurmus @ 2017-12-20 18:15 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: guix-devel, Dave Love


Hi Pjotr,

> I was just stating that the default openblas package does not perform
> well (it is single threaded, for one).

Is it really single-threaded?  I remember having a couple of problems
with OpenBLAS on our cluster when it is used with Numpy as both would
spawn lots of threads.  The solution was to limit OpenBLAS to at most
two threads.
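
(The limit itself is just an environment variable, e.g.

  export OPENBLAS_NUM_THREADS=2

in the job environment.)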

> If I compile for a target it
> makes a large difference.

The FAQ document[1] says this:

  The environment variable which control the kernel selection is
  OPENBLAS_CORETYPE (see driver/others/dynamic.c) e.g. export
  OPENBLAS_CORETYPE=Haswell. And the function char*
  openblas_get_corename() returns the used target.

[1]: https://github.com/xianyi/OpenBLAS/wiki/Faq

Have you tried this and compared the performance?
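
If you want to check which kernels were actually selected, here is a
quick sketch from Guile using the FFI (assuming libopenblas can be
found by the dynamic linker):

  (use-modules (system foreign))

  ;; openblas_get_corename() is the function mentioned in the FAQ above.
  (define get-corename
    (pointer->procedure '*
                        (dynamic-func "openblas_get_corename"
                                      (dynamic-link "libopenblas"))
                        '()))

  (display (pointer->string (get-corename)))  ;prints e.g. "Haswell"
  (newline)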

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
https://elephly.net


* Re: OpenBLAS and performance
  2017-12-20 20:32             ` Pjotr Prins
@ 2017-12-20 19:02               ` Eric Bavier
  2017-12-21 16:38                 ` Dave Love
  2017-12-20 23:02               ` Ricardo Wurmus
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Bavier @ 2017-12-20 19:02 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: guix-devel, Dave Love

On Wed, 20 Dec 2017 21:32:15 +0100
Pjotr Prins <pjotr.public12@thebird.nl> wrote:

> On Wed, Dec 20, 2017 at 09:00:46PM +0100, Ricardo Wurmus wrote:
> > > I do think we need to default to a conservative openblas for general
> > > use. Question is how we make it fly on dedicated hardware.  
> > 
> > Have you tried preloading the special library with LD_PRELOAD?  
> 
> It is not a question of what I can do. It is a question of how we give
> other people the benefit of optimized libs in an elegant way.

Related only to this specific case of BLAS libraries, and not to the
general idea of optimized libraries:

I recently discovered "FlexiBLAS" from the Max Planck Institute
https://www.mpi-magdeburg.mpg.de/projects/flexiblas which I thought
might be useful for Guix.  It lets one choose the desired BLAS backend
at runtime via a configuration file or environment variables. In its
current state it needs a little configuration before use, but I think
with a little work we could make picking a BLAS implementation as easy
as, e.g.

  guix package -i python-numpy openblas-haswellp

or

  guix package -i python-numpy atlas-24

where the python-numpy package is the same in both cases, built with
a "flexiblas" input.

> Performance matters in some circles.

This should let people choose the BLAS implementation that is best for
their hardware/application.  It could also let Guix packages use
vendor-supplied BLAS libraries.

Just a thought,
`~Eric


* Re: OpenBLAS and performance
  2017-12-20 18:15       ` Ricardo Wurmus
@ 2017-12-20 19:28         ` Pjotr Prins
  2017-12-20 20:00           ` Ricardo Wurmus
                             ` (2 more replies)
  2017-12-21 16:17         ` Dave Love
  1 sibling, 3 replies; 23+ messages in thread
From: Pjotr Prins @ 2017-12-20 19:28 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel, Dave Love

On Wed, Dec 20, 2017 at 07:15:16PM +0100, Ricardo Wurmus wrote:
> Is it really single-threaded?  I remember having a couple of problems
> with OpenBLAS on our cluster when it is used with Numpy as both would
> spawn lots of threads.  The solution was to limit OpenBLAS to at most
> two threads.

Looks like 1 on my system.

> > If I compile for a target it
> > makes a large difference.
> 
> The FAQ document[1] says this:
> 
>   The environment variable which control the kernel selection is
>   OPENBLAS_CORETYPE (see driver/others/dynamic.c) e.g. export
>   OPENBLAS_CORETYPE=Haswell. And the function char*
>   openblas_get_corename() returns the used target.
> 
> [1]: https://github.com/xianyi/OpenBLAS/wiki/Faq
> 
> Have you tried this and compared the performance?

About 10x difference on 24+ cores for matrix multiplication (my
version vs what comes with Guix).

I do think we need to default to a conservative openblas for general
use. Question is how we make it fly on dedicated hardware.

package python-numpy:openblas-haswellp 

for the parallel version?

also for R and others. Problem is that we blow up the types of
packages.

Pj.


* Re: OpenBLAS and performance
  2017-12-20 19:28         ` Pjotr Prins
@ 2017-12-20 20:00           ` Ricardo Wurmus
  2017-12-20 20:32             ` Pjotr Prins
  2017-12-21 14:43           ` Ludovic Courtès
  2017-12-22 14:35           ` Dave Love
  2 siblings, 1 reply; 23+ messages in thread
From: Ricardo Wurmus @ 2017-12-20 20:00 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: guix-devel, Dave Love


Pjotr Prins <pjotr.public12@thebird.nl> writes:

>> > If I compile for a target it
>> > makes a large difference.
>> 
>> The FAQ document[1] says this:
>> 
>>   The environment variable which control the kernel selection is
>>   OPENBLAS_CORETYPE (see driver/others/dynamic.c) e.g. export
>>   OPENBLAS_CORETYPE=Haswell. And the function char*
>>   openblas_get_corename() returns the used target.
>> 
>> [1]: https://github.com/xianyi/OpenBLAS/wiki/Faq
>> 
>> Have you tried this and compared the performance?
>
> About 10x difference on 24+ cores for matrix multiplication (my
> version vs what comes with Guix).
>
> I do think we need to default to a conservative openblas for general
> use. Question is how we make it fly on dedicated hardware.

Have you tried preloading the special library with LD_PRELOAD?

-- 
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
https://elephly.net


* Re: OpenBLAS and performance
  2017-12-20 20:00           ` Ricardo Wurmus
@ 2017-12-20 20:32             ` Pjotr Prins
  2017-12-20 19:02               ` Eric Bavier
  2017-12-20 23:02               ` Ricardo Wurmus
  0 siblings, 2 replies; 23+ messages in thread
From: Pjotr Prins @ 2017-12-20 20:32 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel, Dave Love

On Wed, Dec 20, 2017 at 09:00:46PM +0100, Ricardo Wurmus wrote:
> > I do think we need to default to a conservative openblas for general
> > use. Question is how we make it fly on dedicated hardware.
> 
> Have you tried preloading the special library with LD_PRELOAD?

It is not a question of what I can do. It is a question of how we give
other people the benefit of optimized libs in an elegant way.

I think channels actually could make a difference if we don't cater
for such use cases in default Guix.

Performance matters in some circles.

Pj.


* Re: OpenBLAS and performance
  2017-12-20 20:32             ` Pjotr Prins
  2017-12-20 19:02               ` Eric Bavier
@ 2017-12-20 23:02               ` Ricardo Wurmus
  2017-12-21 10:36                 ` Pjotr Prins
  1 sibling, 1 reply; 23+ messages in thread
From: Ricardo Wurmus @ 2017-12-20 23:02 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: guix-devel, Dave Love


Pjotr Prins <pjotr.public12@thebird.nl> writes:

> On Wed, Dec 20, 2017 at 09:00:46PM +0100, Ricardo Wurmus wrote:
>> > I do think we need to default to a conservative openblas for general
>> > use. Question is how we make it fly on dedicated hardware.
>>
>> Have you tried preloading the special library with LD_PRELOAD?
>
> It is not a question of what I can do. It is a question of how we give
> other people the benefit of optimized libs in an elegant way.

I’m asking because preloading different BLAS libraries is a thing.  If
this works then we can ask people to just pick their favourite BLAS
library variant and preload it.  We don’t need to build all combinations
of library variants and applications.
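
Concretely, I mean something like this (with an example path):

  LD_PRELOAD=/path/to/tuned/libopenblas.so.0 python3 my-script.py

so that the tuned library is resolved ahead of the one in the package's
input graph.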

> I think channels actually could make a difference if we don't cater
> for such use cases in default Guix.

I think channels (and really: alternative build farms) do make sense for
tuned builds.

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
https://elephly.net


* Re: OpenBLAS and performance
  2017-12-20 23:02               ` Ricardo Wurmus
@ 2017-12-21 10:36                 ` Pjotr Prins
  0 siblings, 0 replies; 23+ messages in thread
From: Pjotr Prins @ 2017-12-21 10:36 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel, Dave Love

On Thu, Dec 21, 2017 at 12:02:55AM +0100, Ricardo Wurmus wrote:
> 
> Pjotr Prins <pjotr.public12@thebird.nl> writes:
> 
> > On Wed, Dec 20, 2017 at 09:00:46PM +0100, Ricardo Wurmus wrote:
> >> > I do think we need to default to a conservative openblas for general
> >> > use. Question is how we make it fly on dedicated hardware.
> >>
> >> Have you tried preloading the special library with LD_PRELOAD?
> >
> > It is not a question of what I can do. It is a question of how we give
> > other people the benefit of optimized libs in an elegant way.
> 
> I’m asking because preloading different BLAS libraries is a thing.  If
> this works then we can ask people to just pick their favourite BLAS
> library variant and preload it.  We don’t need to build all combinations
> of library variants and applications.

Ah, sorry, I misunderstood. Let me play with that a little. Do note
that it is a bit more complex than it looks. For example, you often
need a CBLAS API. This comes built into gslcblas and openblas. So, to
use atlas you also need libgslcblas.so. With an optimized openblas you
don't.
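
Illustrative link lines, with GSL as the consumer:

  gcc app.c -lgsl -lgslcblas -lm   # GSL's own reference CBLAS
  gcc app.c -lgsl -lopenblas -lm   # openblas provides the CBLAS API itself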

> > I think channels actually could make a difference if we don't cater
> > for such use cases in default Guix.
> 
> I think channels (and really: alternative build farms) do make sense for
> tuned builds.

+1. The use case is simply:

  guix channel pjotr-optimized-haswell
  guix package -i python-numpy

which would install my optimized edition.

Pj.


* Re: OpenBLAS and performance
  2017-12-20 19:28         ` Pjotr Prins
  2017-12-20 20:00           ` Ricardo Wurmus
@ 2017-12-21 14:43           ` Ludovic Courtès
  2017-12-22 14:35           ` Dave Love
  2 siblings, 0 replies; 23+ messages in thread
From: Ludovic Courtès @ 2017-12-21 14:43 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: guix-devel, Dave Love

Pjotr Prins <pjotr.public12@thebird.nl> skribis:

> On Wed, Dec 20, 2017 at 07:15:16PM +0100, Ricardo Wurmus wrote:

[...]

>> The FAQ document[1] says this:
>> 
>>   The environment variable which control the kernel selection is
>>   OPENBLAS_CORETYPE (see driver/others/dynamic.c) e.g. export
>>   OPENBLAS_CORETYPE=Haswell. And the function char*
>>   openblas_get_corename() returns the used target.
>> 
>> [1]: https://github.com/xianyi/OpenBLAS/wiki/Faq
>> 
>> Have you tried this and compared the performance?
>
> About 10x difference on 24+ cores for matrix multiplication (my
> version vs what comes with Guix).

Even when you use OPENBLAS_CORETYPE=haswell (lower-case?)?

That would be surprising: it’s the same code after all.  The only
difference should be what happens at load time.

Ludo’.


* Re: OpenBLAS and performance
  2017-12-20 11:50 ` Dave Love
  2017-12-20 14:48   ` Dave Love
@ 2017-12-21 14:55   ` Ludovic Courtès
  2017-12-22 12:45     ` Dave Love
  1 sibling, 1 reply; 23+ messages in thread
From: Ludovic Courtès @ 2017-12-21 14:55 UTC (permalink / raw)
  To: Dave Love; +Cc: Guix-devel, Eric Bavier, Federico Beffa

Hello,

Dave Love <fx@gnu.org> skribis:

> Fedora sensibly builds separately-named libraries for different flavours
> <https://apps.fedoraproject.org/packages/openblas/sources/>, but I'd
> argue also for threaded versions being available with the generic soname
> in library sub-directories.  There's some discussion and measurements
> (apologies if I've referenced it before) at
> <https://loveshack.fedorapeople.org/blas-subversion.html>

I like the idea of an ‘update-alternatives’ kind of approach for
interchangeable implementations.

Unfortunately my understanding is that implementations aren’t entirely
interchangeable, especially for LAPACK (not sure about BLAS), because
BLIS, OpenBLAS, etc. implement slightly different subsets of netlib
LAPACK, AIUI.  Packages also often check for specific implementations in
their configure/CMakeLists.txt rather than just for “BLAS” or “LAPACK”.

FlexiBLAS, which Eric mentioned, looks interesting because it’s designed
specifically for that purpose.  Perhaps worth giving it a try.

Besides, it would be good to have a BLAS/LAPACK policy in Guix.  We
should at least agree (1) on default BLAS/LAPACK implementations, (2)
possibly on a naming scheme for variants based on a different
implementation.

For #1 we should probably favor implementations that support run-time
implementation selection such as OpenBLAS (or the coming BLIS release).

Thoughts?

Ludo’.


* Re: OpenBLAS and performance
  2017-12-20 18:15       ` Ricardo Wurmus
  2017-12-20 19:28         ` Pjotr Prins
@ 2017-12-21 16:17         ` Dave Love
  2017-12-21 16:46           ` Ricardo Wurmus
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Love @ 2017-12-21 16:17 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

Ricardo Wurmus <rekado@elephly.net> writes:

> Hi Pjotr,
>
>> I was just stating that the default openblas package does not perform
>> well (it is single threaded, for one).
>
> Is it really single-threaded?  I remember having a couple of problems
> with OpenBLAS on our cluster when it is used with Numpy as both would
> spawn lots of threads.  The solution was to limit OpenBLAS to at most
> two threads.

Yes, it's symlinked from the libopenblasp variant, which is linked
against libpthread, and I'd expect such problems.

Anyhow, there's something badly wrong if it doesn't perform roughly
equivalently to MKL on SIMD other than AVX512.  If I recall correctly,
the DGEMM single-threaded performance/core for HPC-type Sandybridge is
in the high 20s GFLOPs, and roughly double that for avx2
({Has,broad}well).  I don't think the bad L2 cache value that is
currently used for Haswell has much effect in that case, but does in other
benchmarks.  I'll supply a patch for that.

Another point about the OB package is that it excludes LAPACK for some
reason that doesn't seem to be recorded.  I think that should be
included, partly for convenience, and partly because it optimizes some
of LAPACK.


* Re: OpenBLAS and performance
  2017-12-20 19:02               ` Eric Bavier
@ 2017-12-21 16:38                 ` Dave Love
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Love @ 2017-12-21 16:38 UTC (permalink / raw)
  To: Eric Bavier; +Cc: guix-devel

Eric Bavier <ericbavier@centurylink.net> writes:

> Related only to this specific case of BLAS libraries, and not to the
> general idea of optimized libraries:

> I recently discovered "FlexiBLAS" from the Max Planck Institute
> https://www.mpi-magdeburg.mpg.de/projects/flexiblas which I thought
> might be useful for Guix.

That's a new one on me; I'll see how it works.  (You'd hope you could do
it with weak symbols or other ELFin stuff, but I couldn't see how.)

> It lets one choose the desired BLAS backend
> at runtime via a configuration file or environment variables.

The Fedora package I referenced also does that, makes it easy to have
local defaults on heterogeneous clusters, and has been used in
production.  The same technique allows you to use proprietary BLAS if
necessary.

> In its
> current state it needs a little configuration before use, but I think
> with a little work we could make picking a BLAS implementation as easy
> as, e.g.
>
>   guix package -i python-numpy openblas-haswellp

Really, you shouldn't need to do that.
  
By the way, there's hope for free ~MKL-equivalent L3 BLAS on avx512 from
some work that's promised in the new year.  (BLIS dgemm currently has
~70% of MKL performance.)


* Re: OpenBLAS and performance
  2017-12-21 16:17         ` Dave Love
@ 2017-12-21 16:46           ` Ricardo Wurmus
  0 siblings, 0 replies; 23+ messages in thread
From: Ricardo Wurmus @ 2017-12-21 16:46 UTC (permalink / raw)
  To: Dave Love; +Cc: guix-devel


Dave Love <fx@gnu.org> writes:

> Another point about the OB package is that it excludes LAPACK for some
> reason that doesn't seem to be recorded.  I think that should be
> included, partly for convenience, and partly because it optimizes some
> of LAPACK.

That was me, I think.  I did this because I assumed that if users want
LAPACK they’d just install the lapack package.  If this turns out to be
a misguided idea because the OB LAPACK differs then I’m fine with
enabling LAPACK in the OB package. 

(I’m not very knowledgeable about all of this.  I just happened to
package OpenBLAS first.)

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
https://elephly.net


* Re: OpenBLAS and performance
  2017-12-20 15:06     ` Ricardo Wurmus
@ 2017-12-22 12:24       ` Dave Love
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Love @ 2017-12-22 12:24 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

Ricardo Wurmus <rekado@elephly.net> writes:

>> I was confused.  I see the only version of the library shipped is built
>> with pthreads.  I think there should be serial, pthreads, and OpenMP
>> versions, as for Fedora.
>
> Do these library variants have the same binary interface, so that a user
> could simply preload one of them to override the default variant we use
> in the input graph of a given package?

Yes.  You can use LD_LIBRARY_PATH as normal if you have variants with
the right soname, like the trivial shims in the example I referenced.
You probably want versions with the implementation-specific names too.
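
E.g., with a directory of shims carrying the generic soname (a
hypothetical path):

  export LD_LIBRARY_PATH=/srv/blas/openmp:$LD_LIBRARY_PATH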


* Re: OpenBLAS and performance
  2017-12-21 14:55   ` Ludovic Courtès
@ 2017-12-22 12:45     ` Dave Love
  2017-12-22 15:10       ` Ludovic Courtès
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Love @ 2017-12-22 12:45 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix-devel, Eric Bavier, Federico Beffa

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Hello,
>
> Dave Love <fx@gnu.org> skribis:
>
>> Fedora sensibly builds separately-named libraries for different flavours
>> <https://apps.fedoraproject.org/packages/openblas/sources/>, but I'd
>> argue also for threaded versions being available with the generic soname
>> in library sub-directories.  There's some discussion and measurements
>> (apologies if I've referenced it before) at
>> <https://loveshack.fedorapeople.org/blas-subversion.html>
>
> I like the idea of an ‘update-alternatives’ kind of approach for
> interchangeable implementations.

/etc/ld.so.conf.d normally provides a clean way to flip the default, but
that isn't available in Guix as far as I remember.

> Unfortunately my understanding is that implementations aren’t entirely
> interchangeable, especially for LAPACK (not sure about BLAS), because
> BLIS, OpenBLAS, etc. implement slightly different subsets of netlib
> LAPACK, AIUI.

LAPACK may add new routines, but you can always link with the vanilla
Netlib version, and openblas is currently only one release behind.  The
LAPACK release notes I've seen aren't very helpful for following that.
The important requirement is fast GEMM from the optimized BLAS.  I
thought BLIS just provided the BLAS layer, which is quite stable, isn't
it?

> Packages also often check for specific implementations in
> their configure/CMakeLists.txt rather than just for “BLAS” or “LAPACK”.

It doesn't matter what they're built against when you dynamically load a
compatible version.  (You'd hope a build system would be able to find
arbitrary BLAS but I'm too familiar with cmake pain.)  The openblas
compatibility hack basically just worked on an RHEL6 cluster when I
maintained it.

> FlexiBLAS, which Eric mentioned, looks interesting because it’s designed
> specifically for that purpose.  Perhaps worth giving it a try.

I see it works by wrapping everything, which I wanted to avoid.  Also
it's GPL, which restricts its use.  What's the advantage over just
having implementations which are directly interchangeable at load time?

> Besides, it would be good to have a BLAS/LAPACK policy in Guix.  We
> should at least agree (1) on default BLAS/LAPACK implementations, (2)
> possibly on a naming scheme for variants based on a different
> implementation.

Yes, but the issue is wider than just linear algebra.  It seems to
reflect tension between Guix' approach (as I understand it) and the late
binding I expect to use.  There are potentially other libraries with
similar micro-architecture-specific issues, and the related one of
profiling/debugging versions.  I don't know how much of a real problem
there really is, and it would be good to know if someone has addressed
this.

It's a reason I'm currently not convinced about the trade-offs with
Guix, and don't go along with the "reproducibility" mantra.  Obviously
I'm not writing Guix off, though, and I hope the discussion is useful.

> For #1 we should probably favor implementations that support run-time
> implementation selection such as OpenBLAS (or the coming BLIS release).
>
> Thoughts?
>
> Ludo’.

Yes, but even with dynamic dispatch you need to account for situations
like we currently have on x86_64 with OB not supporting the latest
micro-architecture, and it only works on x86 with OB.  You may also want
to avoid overhead -- see FFTW's advice for packaging.  Oh for SIMD
hwcaps...


* Re: OpenBLAS and performance
  2017-12-20 19:28         ` Pjotr Prins
  2017-12-20 20:00           ` Ricardo Wurmus
  2017-12-21 14:43           ` Ludovic Courtès
@ 2017-12-22 14:35           ` Dave Love
  2 siblings, 0 replies; 23+ messages in thread
From: Dave Love @ 2017-12-22 14:35 UTC (permalink / raw)
  To: guix-devel

For what it's worth, I get 37000 Mflops from the dgemm.goto benchmark
using the current Guix openblas and OPENBLAS_NUM_THREADS=1 at a size of
7000 on a laptop with "i5-6200U CPU @ 2.30GHz" (avx2).  That looks about
right, and it should more-or-less plateau at that size.  For comparison,
I get 44000 on a cluster node "E5-2690 v3 @ 2.60GHz" with its serial
build of 0.2.19.  (I mis-remembered the sandybridge figures, which
should be low 20s, not high 20s.)
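
(To reproduce: the benchmark lives in OpenBLAS's benchmark/ directory
and takes from/to/step sizes, so something like

  make -C benchmark dgemm.goto
  OPENBLAS_NUM_THREADS=1 ./benchmark/dgemm.goto 7000 7000 1

if I remember the invocation right.)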

If you see something much different, perhaps the performance counters
give a clue, e.g. with Guix' scorep/cube, oprofile, or perf.

I've sent a patch for the correct cache size on haswell, but I don't
think it makes much difference in this case.


* Re: OpenBLAS and performance
  2017-12-22 12:45     ` Dave Love
@ 2017-12-22 15:10       ` Ludovic Courtès
  2017-12-22 16:08         ` Pjotr Prins
  0 siblings, 1 reply; 23+ messages in thread
From: Ludovic Courtès @ 2017-12-22 15:10 UTC (permalink / raw)
  To: Dave Love; +Cc: Guix-devel, Eric Bavier, Federico Beffa

Hi,

Dave Love <fx@gnu.org> skribis:

> Ludovic Courtès <ludovic.courtes@inria.fr> writes:
>
>> Hello,
>>
>> Dave Love <fx@gnu.org> skribis:
>>
>>> Fedora sensibly builds separately-named libraries for different flavours
>>> <https://apps.fedoraproject.org/packages/openblas/sources/>, but I'd
>>> argue also for threaded versions being available with the generic soname
>>> in library sub-directories.  There's some discussion and measurements
>>> (apologies if I've referenced it before) at
>>> <https://loveshack.fedorapeople.org/blas-subversion.html>
>>
>> I like the idea of an ‘update-alternatives’ kind of approach for
>> interchangeable implementations.
>
> /etc/ld.so.conf.d normally provides a clean way to flip the default,
> but that isn't available in Guix as far as I remember.

Right.

>> Unfortunately my understanding is that implementations aren’t entirely
>> interchangeable, especially for LAPACK (not sure about BLAS), because
>> BLIS, OpenBLAS, etc. implement slightly different subsets of netlib
>> LAPACK, AIUI.
>
> LAPACK may add new routines, but you can always link with the vanilla
> Netlib version, and openblas is currently only one release behind.  The
> LAPACK release notes I've seen aren't very helpful for following that.
> The important requirement is fast GEMM from the optimized BLAS.  I
> thought BLIS just provided the BLAS layer, which is quite stable, isn't
> it?

I tried a while back to link PaSTiX (a sparse matrix direct solver
developed by colleagues of mine), IIRC, against BLIS, and it would miss
a couple of functions that Netlib LAPACK provides.

>> Packages also often check for specific implementations in
>> their configure/CMakeLists.txt rather than just for “BLAS” or “LAPACK”.
>
> It doesn't matter what they're built against when you dynamically load a
> compatible version.

Right but they do that precisely because all these implementations
provide different subsets of the Netlib APIs, AIUI.

>> FlexiBLAS, which Eric mentioned, looks interesting because it’s designed
>> specifically for that purpose.  Perhaps worth giving it a try.
>
> I see it works by wrapping everything, which I wanted to avoid.  Also
> it's GPL, which restricts its use.  What's the advantage over just
> having implementations which are directly interchangeable at load time?

Dunno, I haven’t dug into it.

>> Besides, it would be good to have a BLAS/LAPACK policy in Guix.  We
>> should at least agree (1) on default BLAS/LAPACK implementations, (2)
>> possibly on a naming scheme for variants based on a different
>> implementation.
>
> Yes, but the issue is wider than just linear algebra.  It seems to
> reflect tension between Guix' approach (as I understand it) and the late
> binding I expect to use.  There are potentially other libraries with
> similar micro-architecture-specific issues, and the related one of
> profiling/debugging versions.  I don't know how much of a real problem
> there really is, and it would be good to know if someone has addressed
> this.

Guix’ approach is to use static binding a lot, and late binding
sometimes.  For all things plugin-like we use late binding.  For shared
libraries (not dlopened) we use static binding.

Static binding has a cost, as you write, but it gives us control over
the environment, and the ability to capture and replicate the software
environment.  As a user, that’s something I value a lot.

I’d also argue that this is something computational scientists should
value: first because results they publish should not depend on the phase
of the moon, second because they should be able to provide peers with a
self-contained recipe to reproduce them.

> Yes, but even with dynamic dispatch you need to account for situations
> like we currently have on x86_64 with OB not supporting the latest
> micro-architecture, and it only works on x86 with OB.  You may also want
> to avoid overhead -- see FFTW's advice for packaging.  Oh for SIMD
> hwcaps...

I’m not sure what you mean.  That OB does not support the latest
micro-architecture is not something the package manager can solve.

As for overhead, it should be limited to load time, as illustrated by
IFUNC and similar designs.

Thanks,
Ludo’.


* Re: OpenBLAS and performance
  2017-12-22 15:10       ` Ludovic Courtès
@ 2017-12-22 16:08         ` Pjotr Prins
  0 siblings, 0 replies; 23+ messages in thread
From: Pjotr Prins @ 2017-12-22 16:08 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix-devel, Eric Bavier, Dave Love, Federico Beffa

On Fri, Dec 22, 2017 at 04:10:39PM +0100, Ludovic Courtès wrote:
> Static binding has a cost, as you write, but it gives us control over
> the environment, and the ability to capture and replicate the software
> environment.  As a user, that’s something I value a lot.

> I’d also argue that this is something computational scientists should
> value: first because results they publish should not depend on the phase
> of the moon, second because they should be able to provide peers with a
> self-contained recipe to reproduce them.

As a scientist I value that *more* than a lot. There is a tension
between 'just getting things done' and making things reproducible. If
we can do the latter, we should.  Also as a programmer I value
reproducibility a lot. I want people who report bugs to use the exact
same setup, especially when they are running on machines I cannot
access (quite common in sequencing centers). If someone sends me a
core dump, a stack trace, or even an assertion failure in a shared
lib, it is incredibly useful if the full stack is the same.

I am wary of flexible resolution of optimized libraries and kernels.
Look at what atlas tried to do and what a mess it became. I strongly
believe we need explicit statements about what we are running. It does
imply Guix will have to provide all options, directly or through
channels.

I also work on HPC and if I know where I am running I know *what* to
target. It is a deterministic recipe. 

Pj.

