From mboxrd@z Thu Jan 1 00:00:00 1970 From: Amirouche Boubekki Subject: Re: [Cuirass] Missing database indexes? Date: Mon, 12 Nov 2018 20:42:52 +0100 Message-ID: References: <87va54yh0c.fsf@gnu.org> <20181110211128.6dc522da@alma-ubu> <87k1ljr1c7.fsf@gnu.org> <20181112195044.6d64f51c@alma-ubu> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000d2c037057a7ce8ab" Return-path: Received: from eggs.gnu.org ([2001:4830:134:3::10]:54868) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gMI7I-0004AE-9I for guix-devel@gnu.org; Mon, 12 Nov 2018 14:43:26 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gMI7E-0006VL-MB for guix-devel@gnu.org; Mon, 12 Nov 2018 14:43:23 -0500 In-Reply-To: <20181112195044.6d64f51c@alma-ubu> List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org Sender: "Guix-devel" To: bjoern.hoefling@bjoernhoefling.de Cc: guix-devel , =?UTF-8?Q?Cl=C3=A9ment_Lassieur?= --000000000000d2c037057a7ce8ab Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hello all, Le lun. 12 nov. 2018 =C3=A0 19:51, Bj=C3=B6rn H=C3=B6fling < bjoern.hoefling@bjoernhoefling.de> a =C3=A9crit : > Hi Ludo, > > On Sun, 11 Nov 2018 18:06:00 +0100 > ludo@gnu.org (Ludovic Court=C3=A8s) wrote: > > > Indeed, that solves the problem for this simple example, thanks! > > > > Now, if I go back to the big query that /api/latestbuilds makes=C2=B9, = the > > result is still pretty bad: > > > > --8<---------------cut here---------------start------------->8--- > > sqlite> EXPLAIN QUERY PLAN SELECT * FROM ( > > ...> SELECT Builds.derivation, Builds.rowid, Builds.timestamp, > > Builds.starttime, ...> Builds.stoptime, Builds.log, Builds.status, > > Builds.job_name, Builds.system, ...> Builds.nix_name, > > Specifications.name ...> FROM Builds > > ...> INNER JOIN Evaluations ON Builds.evaluation =3D Evaluations.id > > ...> INNER JOIN Specifications ON Evaluations.specification =3D > > Specifications.name ...> WHERE (:id IS NULL OR (:id =3D Builds.rowid)) > > ...> AND (:derivation IS NULL OR (:derivation =3D Builds.derivation)= ) > > ...> AND (:jobset IS NULL OR (:jobset =3D Specifications.name)) > > ...> AND (:job IS NULL OR (:job =3D Builds.job_name)) > > ...> AND (:system IS NULL OR (:system =3D Builds.system)) > > ...> AND (:evaluation IS NULL OR (:evaluation =3D Builds.evaluation)= ) > > ...> AND (:status IS NULL OR (:status =3D 'done' AND Builds.status > > >=3D 0) ...> OR (:status =3D 'pending' AND > > >Builds.status < 0) > > ...> OR (:status =3D 'succeeded' AND > > Builds.status =3D 0) ...> OR (:status =3D 'failed' > > AND Builds.status > 0)) ...> AND (:borderlowtime IS NULL > > OR :borderlowid IS NULL ...> OR ((:borderlowtime, :borderlowid) < > > (Builds.stoptime, Builds.rowid))) ...> AND (:borderhightime IS NULL > > OR :borderhighid IS NULL ...> OR ((:borderhightime, :borderhighid) > > > (Builds.stoptime, Builds.rowid))) ...> ORDER BY > > ...> CASE WHEN :borderlowtime IS NULL > > ...> OR :borderlowid IS NULL THEN Builds.stoptime > > ...> ELSE -Builds.stoptime > > ...> END DESC, > > ...> CASE WHEN :borderlowtime IS NULL > > ...> OR :borderlowid IS NULL THEN Builds.rowid > > ...> ELSE -Builds.rowid > > ...> END DESC > > ...> LIMIT :nr) > > ...> ORDER BY stoptime, rowid ASC; > > 1|0|0|SCAN TABLE Builds > > 1|1|1|SEARCH TABLE Evaluations USING INTEGER PRIMARY KEY (rowid=3D?) > > 1|2|2|SEARCH TABLE Specifications USING COVERING INDEX > > sqlite_autoindex_Specifications_1 (name=3D?) 1|0|0|USE TEMP B-TREE FOR > > ORDER BY 0|0|0|SCAN SUBQUERY 1 > > 0|0|0|USE TEMP B-TREE FOR ORDER BY > > --8<---------------cut here---------------end--------------->8--- > > > > I don=E2=80=99t really know what additional index to create (and I=E2= =80=99d rather > > let SQLite do it for me, if it were possible). > > I don't know if there is any automated process to assist you. I have > the feeling that query optimization is more an art than science. > > Hm. This code smells ... It looks too complicated. > Agreed. > > I don't know if this brings you further concerning performance, here > are some thoughts: > > One problematic part is this construct: > > :variable IS NULL OR :variable=3Dmy_column) > > Here is a very simple example: > > sqlite> CREATE TABLE tst ( > ...> id INTEGER PRIMARY KEY AUTOINCREMENT, > ...> name TEXT NOT NULL, > ...> age INTEGER NOT NULL); > sqlite> CREATE INDEX tst_name_age_idx ON tst(name, age); > > sqlite> EXPLAIN QUERY PLAN > ...> SELECT * FROM tst WHERE (23=3D23 OR id=3D:id); > 0|0|0|SCAN TABLE tst > > sqlite> EXPLAIN QUERY PLAN > ...> SELECT * FROM tst WHERE id=3D:id; > 0|0|0|SEARCH TABLE tst USING INTEGER PRIMARY KEY (rowid=3D?) > > sqlite> EXPLAIN QUERY PLAN > ...> SELECT * FROM tst WHERE name=3D:name AND age < 42; > 0|0|0|SEARCH TABLE tst USING COVERING INDEX tst_name_age_idx (name=3D? AN= D > age > So, even when we have a constant part(23=3D23) in the OR clause, this > leads to a full table scan. I think the optimizer cannot detect the > fact that it is a constant boolean value. In the other examples, it is > using the index. > > Even this OR-clause with two variables looks better: > > SELECT * FROM tst WHERE (id=3D:id1 OR id=3D:id2); > 0|0|0|SEARCH TABLE tst USING INTEGER PRIMARY KEY (rowid=3D?) > 0|0|0|EXECUTE LIST SUBQUERY 1 > > I double-checked with Postgresql and it is also performing a full table > scan in the "boolean-constant OR :id=3Did" case. I could not find any > references on the net about it. > > When this would be Java/JPA I would suggest to dynamically create the > query. Can we do something in Scheme-DB too? I.e. pseudo-code > > (string-append sql-prefix > (unless (empty? derivation) "AND :derivation=3DBuilds.derivation") > (unless (empty? jobset) "AND :jobset=3DBuilds.jobset) > ...) > ;;; Should be more some kind of folding, because of the "AND" > > ;;; Parameter-filling needs to be considered too > > > Two more things I noticed that are not directly performance oriented: > > We are directly relying on the rowid here, there is no explicit > id-column. > > This could lead to unpredicted results and reorderings (6th Quirk in > document): > > https://www.sqlite.org/rowidtable.html > > We should add a column: > > id INTEGER PRIMARY KEY AUTOINCREMENT > > Problem is that this concept of AUTOINCREMENT does only work for > Primary Keys in Sqlite. So we need to degrade "derivation" to a > secondary key, i.e. make it non-null and unique: > > derivation TEXT NOT NULL UNIQUE, > > Is there anything speaking against that? > > > Lastly, the query has a limit and an order-by. The question is: Will > the result be first ordered and then the limit taken? The answer (I know > only for Postgresql and MySql, but I think it is the same for Sqlite, > I haven't found any reference): The order is always executed first, but > it has to be stable. In this case it is, because we order by > Builds.rowid, which is a key. Did this happen intentionally or just by > chance? Should we better add a note about that to the SQL code? > > Bj=C3=B6rn > What Bj=C3=B6rn said. I might be wrong but if :derivation :jobset :job :system and :evalutation are exculsive ie. if one is set the other are null. They could all inherit from a "Buildable" class in Object-Oriented terms. Then you'd rather use a Generic Foreign Key pattern where you have two columns 'object_type' and 'object_id' where 'object_type' is one of the "object" type that is buildable and 'object_id' is the identifier of the row in the table named by 'object_type'. Otherwise you can try narrow the search to a given 'stoptime' or 'starttime' slice and index those columns. Something like: SELECT Builds.derivation, Builds.rowid, Builds.timestamp, Builds.starttime, Builds.stoptime, Builds.log, Builds.status, Builds.job_name, Builds.system, Builds.nix_name, Specifications.name FROM Builds INNER JOIN ... INNER JOIN ... WHERE Builds.stoptime > yesterday AND Builds.stoptime < now(); HTH --000000000000d2c037057a7ce8ab Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello all,

Le=C2=A0lun. 12 nov. 2018 =C3=A0=C2=A019:51, Bj=C3=B6r= n H=C3=B6fling <bjo= ern.hoefling@bjoernhoefling.de> a =C3=A9crit=C2=A0:
Hi Ludo,

On Sun, 11 Nov 2018 18:06:00 +0100
ludo@gnu.org (Ludovic= Court=C3=A8s) wrote:

> Indeed, that solves the problem for this simple example, thanks!
>
> Now, if I go back to the big query that /api/latestbuilds makes=C2=B9,= the
> result is still pretty bad:
>
> --8<---------------cut here---------------start------------->8--= -
> sqlite> EXPLAIN QUERY PLAN SELECT * FROM (
>=C2=A0 =C2=A0 ...> SELECT Builds.derivation, Builds.rowid, Builds.ti= mestamp,
> Builds.starttime, ...> Builds.stoptime, Builds.log, Builds.status,<= br> > Builds.job_name, Builds.system, ...> Builds.nix_name,
> Specifications.name ...> FROM Builds
>=C2=A0 =C2=A0 ...> INNER JOIN Evaluations ON Builds.evaluation =3D E= valuations.id
>=C2=A0 =C2=A0 ...> INNER JOIN Specifications ON Evaluations.specific= ation =3D
> Specifications.name ...> WHERE (:id IS NULL OR (:id =3D Builds.rowi= d))
>=C2=A0 =C2=A0 ...> AND (:derivation IS NULL OR (:derivation =3D Buil= ds.derivation))
>=C2=A0 =C2=A0 ...> AND (:jobset IS NULL OR (:jobset =3D Specificatio= ns.name))
>=C2=A0 =C2=A0 ...> AND (:job IS NULL OR (:job =3D Builds.job_name))<= br> >=C2=A0 =C2=A0 ...> AND (:system IS NULL OR (:system =3D Builds.syste= m))
>=C2=A0 =C2=A0 ...> AND (:evaluation IS NULL OR (:evaluation =3D Buil= ds.evaluation))
>=C2=A0 =C2=A0 ...> AND (:status IS NULL OR (:status =3D 'done= 9; AND Builds.status
> >=3D 0) ...>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 OR (:status =3D 'pending' AND
> >Builds.status < 0)
>=C2=A0 =C2=A0 ...>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 OR (:status =3D 'succeeded' AND
> Builds.status =3D 0) ...>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 OR (:status =3D 'failed'
> AND Builds.status > 0)) ...> AND (:borderlowtime IS NULL
> OR :borderlowid IS NULL ...>=C2=A0 OR ((:borderlowtime, :borderlowi= d) <
> (Builds.stoptime, Builds.rowid))) ...> AND (:borderhightime IS NULL=
> OR :borderhighid IS NULL ...>=C2=A0 OR ((:borderhightime, :borderhi= ghid) >
> (Builds.stoptime, Builds.rowid))) ...> ORDER BY
>=C2=A0 =C2=A0 ...> CASE WHEN :borderlowtime IS NULL
>=C2=A0 =C2=A0 ...>=C2=A0 =C2=A0 =C2=A0 =C2=A0 OR :borderlowid IS NUL= L THEN Builds.stoptime
>=C2=A0 =C2=A0 ...>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ELSE -Builds= .stoptime
>=C2=A0 =C2=A0 ...> END DESC,
>=C2=A0 =C2=A0 ...> CASE WHEN :borderlowtime IS NULL
>=C2=A0 =C2=A0 ...>=C2=A0 =C2=A0 =C2=A0 =C2=A0 OR :borderlowid IS NUL= L THEN Builds.rowid
>=C2=A0 =C2=A0 ...>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ELSE -Builds= .rowid
>=C2=A0 =C2=A0 ...> END DESC
>=C2=A0 =C2=A0 ...> LIMIT :nr)
>=C2=A0 =C2=A0 ...> ORDER BY stoptime, rowid ASC;=C2=A0
> 1|0|0|SCAN TABLE Builds
> 1|1|1|SEARCH TABLE Evaluations USING INTEGER PRIMARY KEY (rowid=3D?) > 1|2|2|SEARCH TABLE Specifications USING COVERING INDEX
> sqlite_autoindex_Specifications_1 (name=3D?) 1|0|0|USE TEMP B-TREE FOR=
> ORDER BY 0|0|0|SCAN SUBQUERY 1
> 0|0|0|USE TEMP B-TREE FOR ORDER BY
> --8<---------------cut here---------------end--------------->8--= -
>
> I don=E2=80=99t really know what additional index to create (and I=E2= =80=99d rather
> let SQLite do it for me, if it were possible).

I don't know if there is any automated process to assist you. I have the feeling that query optimization is more an art than science.
=C2=A0
Hm. This code smells ... It looks too complicated.
Agreed.
=C2=A0

I don't know if this brings you further concerning performance, here are some thoughts:

One problematic part is this construct:

:variable IS NULL OR :variable=3Dmy_column)

Here is a very simple example:

sqlite> CREATE TABLE tst (
=C2=A0 =C2=A0...> id INTEGER PRIMARY KEY AUTOINCREMENT,
=C2=A0 =C2=A0...> name TEXT NOT NULL,
=C2=A0 =C2=A0...> age INTEGER NOT NULL);
sqlite> CREATE INDEX tst_name_age_idx ON tst(name, age);

sqlite> EXPLAIN QUERY PLAN
=C2=A0 =C2=A0...> SELECT * FROM tst WHERE (23=3D23 OR id=3D:id);
0|0|0|SCAN TABLE tst

sqlite> EXPLAIN QUERY PLAN
=C2=A0 =C2=A0...> SELECT * FROM tst WHERE id=3D:id;
0|0|0|SEARCH TABLE tst USING INTEGER PRIMARY KEY (rowid=3D?)

sqlite> EXPLAIN QUERY PLAN
=C2=A0 =C2=A0...> SELECT * FROM tst WHERE name=3D:name AND age < 42;<= br> 0|0|0|SEARCH TABLE tst USING COVERING INDEX tst_name_age_idx (name=3D? AND = age<?)

So, even when we have a constant part(23=3D23) in the OR clause, this
leads to a full table scan. I think the optimizer cannot detect the
fact that it is a constant boolean value. In the other examples, it is
using the index.

Even this OR-clause with two variables looks better:

SELECT * FROM tst WHERE (id=3D:id1 OR id=3D:id2);
0|0|0|SEARCH TABLE tst USING INTEGER PRIMARY KEY (rowid=3D?)
0|0|0|EXECUTE LIST SUBQUERY 1

I double-checked with Postgresql and it is also performing a full table
scan in the "boolean-constant OR :id=3Did" case. I could not find= any
references on the net about it.

When this would be Java/JPA I would suggest to dynamically create the
query. Can we do something in Scheme-DB too? I.e. pseudo-code

(string-append sql-prefix
=C2=A0 (unless (empty? derivation) "AND :derivation=3DBuilds.derivatio= n")
=C2=A0 (unless (empty? jobset) "AND :jobset=3DBuilds.jobset)
=C2=A0 ...)
;;; Should be more some kind of folding, because of the "AND"

;;; Parameter-filling needs to be considered too


Two more things I noticed that are not directly performance oriented:

We are directly relying on the rowid here, there is no explicit
id-column.

This could lead to unpredicted results and reorderings (6th Quirk in
document):

https://www.sqlite.org/rowidtable.html

We should add a column:

id INTEGER PRIMARY KEY AUTOINCREMENT

Problem is that this concept of AUTOINCREMENT does only work for
Primary Keys in Sqlite. So we need to degrade "derivation" to a secondary key, i.e. make it non-null and unique:

derivation=C2=A0 =C2=A0 TEXT NOT NULL UNIQUE,

Is there anything speaking against that?


Lastly, the query has a limit and an order-by. The question is: Will
the result be first ordered and then the limit taken? The answer (I know only for Postgresql and MySql, but I think it is the same for Sqlite,
I haven't found any reference): The order is always executed first, but=
it has to be stable. In this case it is, because we order by
Builds.rowid, which is a key. Did this happen intentionally or just by
chance? Should we better add a note about that to the SQL code?

Bj=C3=B6rn

What Bj=C3=B6rn said.

I might be wrong but if :derivation :jobset :job :s= ystem and :evalutation are exculsive
ie. if one is set the other = are null. They could all inherit from a "Buildable" class in Obje= ct-Oriented terms.
Then you'd rather use a Generic Forei= gn Key pattern where you have two columns 'object_type' and 'ob= ject_id'
where 'object_type' is one of the "obje= ct" type that is buildable and 'object_id' is the identifier o= f the row in the table
named by 'object_type'.
<= div>
Otherwise you can try narrow the search to a given 'stopt= ime' or 'starttime' slice and index those columns.
Something like:
SELECT Builds.derivation, Builds.rowid, = Builds.timestamp, Builds.starttime,
Builds.stoptime, Builds.log, Builds.= status, Builds.job_name, Builds.system,
Builds.nix_name, Specifications.= name
FROM Builds
INNER JOIN ...
INNER JOIN ...
WHERE Builds.stoptime > yesterday AND Build= s.stoptime < now();


HTH

--000000000000d2c037057a7ce8ab--