From mboxrd@z Thu Jan 1 00:00:00 1970 From: Maxim Cournoyer Subject: bug#24450: [PATCHv2] Re: pypi importer outputs strange character series in optional dependency case. Date: Mon, 10 Jun 2019 17:32:01 +0900 Message-ID: <87imtd6dtq.fsf@gmail.com> References: <87pnod7ot4.fsf@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Received: from eggs.gnu.org ([2001:470:142:3::10]:58583) by lists.gnu.org with esmtp (Exim 4.86_2) (envelope-from ) id 1haFjl-0007uy-Ax for bug-guix@gnu.org; Mon, 10 Jun 2019 04:33:07 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1haFjj-0000Ag-Fr for bug-guix@gnu.org; Mon, 10 Jun 2019 04:33:05 -0400 Received: from debbugs.gnu.org ([209.51.188.43]:42404) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1haFji-00009m-OV for bug-guix@gnu.org; Mon, 10 Jun 2019 04:33:03 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1haFji-0005XT-Id for bug-guix@gnu.org; Mon, 10 Jun 2019 04:33:02 -0400 Sender: "Debbugs-submit" Resent-Message-ID: In-Reply-To: (Ricardo Wurmus's message of "Mon, 27 May 2019 17:54:39 +0200") List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org Sender: "bug-Guix" To: Ricardo Wurmus Cc: 24450@debbugs.gnu.org Hello! Ricardo Wurmus writes: > Patch number 3! Yay! >> From 0c62b541a3e8925b5ca31fe55dbe7536cf95151f Mon Sep 17 00:00:00 2001 >> From: Maxim Cournoyer >> Date: Thu, 28 Mar 2019 00:26:01 -0400 >> Subject: [PATCH 3/9] import: pypi: Improve parsing of requirement >> specifications. >> >> The previous solution was fragile and could leave unwanted characters in= a >> requirement name, such as '[' or ']'. > > Wouldn=E2=80=99t it be sufficient to add [ and ] to the list of forbidden > characters? The tests pass with this implementation of > clean-requirements: > > (define (clean-requirements s) > (cond > ((string-index s (char-set #\space #\> #\=3D #\< #\[ #\])) =3D> (cut st= ring-take s <>)) > (else s))) Indeed this would be sufficient to make the tests pass, but the tests don't cover all the cases; as an example, consider: --8<---------------cut here---------------start------------->8--- argparse;python_version<"2.7" --8<---------------cut here---------------end--------------->8--- While we could make it work with the current logic by adding more invalid characters (such as ';' here) to the character set, it seems less error prone to use the upstream provided regex to match a package name. [0] >> +(define %requirement-name-regexp >> + ;; Regexp to match the requirement name in a requirement specificatio= n. >> + >> + ;; Some grammar, taken from PEP-0508 (see: >> + ;; https://www.python.org/dev/peps/pep-0508/). >> + >> + ;; The unified rule can be expressed as: >> + ;; specification =3D wsp* ( url_req | name_req ) wsp* >> + >> + ;; where url_req is: >> + ;; url_req =3D name wsp* extras? wsp* urlspec wsp+ quoted_marker? >> + >> + ;; and where name_req is: >> + ;; name_req =3D name wsp* extras? wsp* versionspec? wsp* quoted_marke= r? >> + >> + ;; Thus, we need only matching NAME, which is expressed as: >> + ;; identifer_end =3D letterOrDigit | (('-' | '_' | '.' )* letterOrDig= it) >> + ;; identifier =3D letterOrDigit identifier_end* >> + ;; name =3D identifier >> + (let* ((letter-or-digit "[A-Za-z0-9]") >> + (identifier-end (string-append "(" letter-or-digit "|" >> + "[-_.]*" letter-or-digit ")")) >> + (identifier (string-append "^" letter-or-digit identifier-end = "*")) >> + (name identifier)) >> + (make-regexp name))) > > This seems a little too complicated. Translating a grammar into a > regexp is probably not a good idea in general. Since we don=E2=80=99t ca= re > about anything other than the name it seems easier to just chop off > the string tail as soon as we find an invalid character. While I agree that a regexp is a bigger hammer than basic string manipulation, I see some merit to it here: 1) We can be assured of conformance with upstream, again, per PEP-0508. 2) It is easier to extend; we might want to add parsing for the version spec in order to disregard dependencies specified for Python < 3, for example. The use of the PEP-0508 grammar to define the regexp is useful to detail in a more human-friendly language the components of the regexp. We could have otherwise used the more cryptic regexp for Python distribution names: --8<---------------cut here---------------start------------->8--- ^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$ --8<---------------cut here---------------end--------------->8--- So I guess that what I'm saying is that I prefer this approach to using string-index with invalid characters, for the reasons above. [0] https://www.python.org/dev/peps/pep-0508/ Thanks! Maxim