From mboxrd@z Thu Jan 1 00:00:00 1970
From: Simen Heggestøyl
To: Mattias Engdegård
Cc: 55315@debbugs.gnu.org
Subject: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
Date: Sun, 08 May 2022 21:31:12 +0200
Message-ID: <16182.174386789$1652038338@news.gmane.org>
References: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org>
In-Reply-To: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org>
 ("Mattias Engdegård"'s message of "Sun, 8 May 2022 19:56:00 +0200")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Mattias Engdegård writes:

>> +         (setq csv-separator-chars (mapcar #'string-to-char value))
>> +         (let ((quoted-value (mapcar #'regexp-quote value)))
>> +           (setq csv--skip-chars (apply #'concat "^\n" quoted-value))
>> +           (setq csv-separator-regexp
>> +                 (apply #'concat `("[" ,@quoted-value "]"))))
>
> `regexp-quote` produces a regexp from a string literal, but what goes
> inside the square brackets is not a regexp -- the syntax rules are
> different. More specifically, other characters are special, and
> backslash does not quote anything.
>
> To produce a regexp that matches one in a set of characters, try rx-to-string or regexp-opt. For example,
>
>   (setq csv-separator-regexp (rx-to-string `(or ,@csv-separator-chars) t))
>
> The same applies to csv--skip-chars: this isn't a regexp either, but
> uses yet another syntax so regexp-quote is inappropriate here
> too. Easiest is to precede each char with a backslash since that
> always yields a correctly quoted character: "ABC" -> "\\A\\B\\C".
>
> This is not a judgement on the rest of the patch which may be fine for all I know.

Thanks Mattias.
Does it look better in the updated patch attached?

Note that `csv--skip-chars' and `csv-separator-regexp' are set in two
different places in the patch, the first time from a list of strings
in `csv-separators', and the second time from a single character in
`csv-set-separator'. Am I right in thinking that the use of
`regexp-quote' in the `csv-set-separator' case gives the right result?

-- Simen

--=-=-=
Content-Type: text/x-diff; charset=utf-8
Content-Disposition: attachment;
 filename=0001-Add-CSV-separator-guessing-functionality.patch
Content-Transfer-Encoding: 8bit

>From e498ab88ffe8468d791a10c50b692a926a2341ea Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Simen=20Heggest=C3=B8yl?=
Date: Sun, 8 May 2022 16:01:35 +0200
Subject: [PATCH] Add CSV separator guessing functionality

Add two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.

`csv-guess-set-separator' can be useful to add to the mode hook to have
CSV mode guess and set the separator automatically when visiting a
buffer:

  (add-hook 'csv-mode-hook 'csv-guess-set-separator)

* csv-mode.el (csv-separators): Properly quote regexp values.
(csv--set-separator-history, csv--preferred-separators): New variables.
(csv-set-separator, csv-guess-set-separator)
(csv-guess-separator, csv--separator-candidates)
(csv--separator-score): New functions.

* csv-mode-tests.el (csv-tests--data): New test data.
(csv-tests-guess-separator, csv-tests-separator-candidates)
(csv-tests-separator-score): New tests.
---
 csv-mode-tests.el |  80 ++++++++++++++++++++-------
 csv-mode.el       | 137 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 187 insertions(+), 30 deletions(-)

diff --git a/csv-mode-tests.el b/csv-mode-tests.el
index 316dc4bb93..0caeab7d80 100644
--- a/csv-mode-tests.el
+++ b/csv-mode-tests.el
@@ -1,8 +1,8 @@
 ;;; csv-mode-tests.el --- Tests for CSV mode -*- lexical-binding: t; -*-
 
-;; Copyright (C) 2020 Free Software Foundation, Inc
+;; Copyright (C) 2020-2022 Free Software Foundation, Inc
 
-;; Author: Simen Heggestøyl
+;; Author: Simen Heggestøyl
 ;; Keywords:
 
 ;; This program is free software; you can redistribute it and/or modify
@@ -28,83 +28,121 @@
 (require 'csv-mode)
 (eval-when-compile (require 'subr-x))
 
-(ert-deftest csv-mode-tests-end-of-field ()
+(ert-deftest csv-tests-end-of-field ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,bbb")
     (goto-char (point-min))
     (csv-end-of-field)
-    (should (equal (buffer-substring (point-min) (point))
-                   "aaa"))
+    (should (equal (buffer-substring (point-min) (point)) "aaa"))
     (forward-char)
     (csv-end-of-field)
     (should (equal (buffer-substring (point-min) (point))
                    "aaa,bbb"))))
 
-(ert-deftest csv-mode-tests-end-of-field-with-quotes ()
+(ert-deftest csv-tests-end-of-field-with-quotes ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,\"b,b\"")
     (goto-char (point-min))
     (csv-end-of-field)
-    (should (equal (buffer-substring (point-min) (point))
-                   "aaa"))
+    (should (equal (buffer-substring (point-min) (point)) "aaa"))
     (forward-char)
     (csv-end-of-field)
     (should (equal (buffer-substring (point-min) (point))
                    "aaa,\"b,b\""))))
 
-(ert-deftest csv-mode-tests-beginning-of-field ()
+(ert-deftest csv-tests-beginning-of-field ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,bbb")
     (csv-beginning-of-field)
-    (should (equal (buffer-substring (point) (point-max))
-                   "bbb"))
+    (should (equal (buffer-substring (point) (point-max)) "bbb"))
     (backward-char)
     (csv-beginning-of-field)
     (should (equal (buffer-substring (point) (point-max))
                    "aaa,bbb"))))
 
-(ert-deftest csv-mode-tests-beginning-of-field-with-quotes ()
+(ert-deftest csv-tests-beginning-of-field-with-quotes ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,\"b,b\"")
     (csv-beginning-of-field)
-    (should (equal (buffer-substring (point) (point-max))
-                   "\"b,b\""))
+    (should (equal (buffer-substring (point) (point-max)) "\"b,b\""))
     (backward-char)
     (csv-beginning-of-field)
     (should (equal (buffer-substring (point) (point-max))
                    "aaa,\"b,b\""))))
 
-(defun csv-mode-tests--align-fields (before after)
+(defun csv-tests--align-fields (before after)
   (with-temp-buffer
     (insert (string-join before "\n"))
     (csv-align-fields t (point-min) (point-max))
     (should (equal (buffer-string) (string-join after "\n")))))
 
-(ert-deftest csv-mode-tests-align-fields ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields ()
+  (csv-tests--align-fields
    '("aaa,bbb,ccc"
      "1,2,3")
    '("aaa, bbb, ccc"
      "1 , 2 , 3")))
 
-(ert-deftest csv-mode-tests-align-fields-with-quotes ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-with-quotes ()
+  (csv-tests--align-fields
    '("aaa,\"b,b\",ccc"
      "1,2,3")
    '("aaa, \"b,b\", ccc"
      "1 , 2 , 3")))
 
 ;; Bug#14053
-(ert-deftest csv-mode-tests-align-fields-double-quote-comma ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-double-quote-comma ()
+  (csv-tests--align-fields
    '("1,2,3"
      "a,\"b\"\"c,\",d")
    '("1, 2 , 3"
      "a, \"b\"\"c,\", d")))
 
+(defvar csv-tests--data
+  "1,4;Sun, 2022-04-10;4,12
+8;Mon, 2022-04-11;3,19
+3,2;Tue, 2022-04-12;1,00
+2;Wed, 2022-04-13;0,37
+9;Wed, 2022-04-13;0,37")
+
+(ert-deftest csv-tests-guess-separator ()
+  (should-not (csv-guess-separator ""))
+  (should (= (csv-guess-separator csv-tests--data 3) ?,))
+  (should (= (csv-guess-separator csv-tests--data) ?\;))
+  (should (= (csv-guess-separator csv-tests--data)
+             (csv-guess-separator csv-tests--data
+                                  (length csv-tests--data)))))
+
+(ert-deftest csv-tests-separator-candidates ()
+  (should-not (csv--separator-candidates ""))
+  (should-not (csv--separator-candidates csv-tests--data 0))
+  (should
+   (equal (sort (csv--separator-candidates csv-tests--data 4) #'<)
+          '(?, ?\;)))
+  (should
+   (equal (sort (csv--separator-candidates csv-tests--data) #'<)
+          '(?\s ?, ?- ?\;)))
+  (should
+   (equal
+    (sort (csv--separator-candidates csv-tests--data) #'<)
+    (sort (csv--separator-candidates csv-tests--data
+                                     (length csv-tests--data))
+          #'<))))
+
+(ert-deftest csv-tests-separator-score ()
+  (should (< (csv--separator-score ?, csv-tests--data)
+             (csv--separator-score ?\s csv-tests--data)
+             (csv--separator-score ?- csv-tests--data)))
+  (should (= (csv--separator-score ?- csv-tests--data)
+             (csv--separator-score ?\; csv-tests--data)))
+  (should (= 0 (csv--separator-score ?\; csv-tests--data 0)))
+  (should (= (csv--separator-score ?\; csv-tests--data)
+             (csv--separator-score ?\; csv-tests--data
+                                   (length csv-tests--data)))))
+
 (provide 'csv-mode-tests)
 ;;; csv-mode-tests.el ends here
diff --git a/csv-mode.el b/csv-mode.el
index 10ce166052..9fd5fc8f10 100644
--- a/csv-mode.el
+++ b/csv-mode.el
@@ -1,11 +1,11 @@
 ;;; csv-mode.el --- Major mode for editing comma/char separated values -*- lexical-binding: t -*-
 
-;; Copyright (C) 2003, 2004, 2012-2020 Free Software Foundation, Inc
+;; Copyright (C) 2003, 2004, 2012-2022 Free Software Foundation, Inc
 
 ;; Author: "Francis J. Wright"
 ;; Maintainer: emacs-devel@gnu.org
 ;; Version: 1.19
-;; Package-Requires: ((emacs "24.1") (cl-lib "0.5"))
+;; Package-Requires: ((emacs "27.1") (cl-lib "0.5"))
 ;; Keywords: convenience
 
 ;; This package is free software; you can redistribute it and/or modify
@@ -119,7 +119,9 @@
 
 ;;; Code:
 
-(eval-when-compile (require 'cl-lib))
+(eval-when-compile
+  (require 'cl-lib)
+  (require 'subr-x))
 
 (defgroup CSV nil
   "Major mode for editing files of comma-separated value type."
@@ -163,12 +165,14 @@ session. Use `customize-set-variable' instead if that is required."
                     (error "%S is already a quote" x)))
                 value)
          (custom-set-default variable value)
-         (setq csv-separator-chars (mapcar #'string-to-char value)
-               csv--skip-chars (apply #'concat "^\n" csv-separators)
-               csv-separator-regexp (apply #'concat `("[" ,@value "]"))
-               csv-font-lock-keywords
-               ;; NB: csv-separator-face variable evaluates to itself.
-               `((,csv-separator-regexp (0 'csv-separator-face))))))
+         (setq csv-separator-chars (mapcar #'string-to-char value))
+         (setq csv--skip-chars
+               (apply #'concat "^\n"
+                      (mapcar (lambda (s) (concat "\\" s)) value)))
+         (setq csv-separator-regexp (regexp-opt value))
+         (setq csv-font-lock-keywords
+               ;; NB: csv-separator-face variable evaluates to itself.
+               `((,csv-separator-regexp (0 'csv-separator-face))))))
 
 (defcustom csv-field-quotes '("\"")
   "Field quotes: a list of *single-character* strings.
@@ -368,6 +372,23 @@ It must be either a string or nil."
     (modify-syntax-entry ?\n ">" csv-mode-syntax-table))
   (setq csv-comment-start string))
 
+(defvar csv--set-separator-history nil)
+
+(defun csv-set-separator (sep)
+  "Set the CSV separator in the current buffer to SEP."
+  (interactive (list (read-char-from-minibuffer
+                      "Separator: " nil 'csv--set-separator-history)))
+  (when (and (boundp 'csv-field-quotes)
+             (member (string sep) csv-field-quotes))
+    (error "%c is already a quote" sep))
+  (setq-local csv-separators (list (string sep)))
+  (setq-local csv-separator-chars (list sep))
+  (setq-local csv--skip-chars (format "^\n%c" sep))
+  (setq-local csv-separator-regexp (regexp-quote (string sep)))
+  (setq-local csv-font-lock-keywords
+              `((,csv-separator-regexp (0 'csv-separator-face))))
+  (font-lock-refresh-defaults))
+
 ;;;###autoload
 (add-to-list 'auto-mode-alist '("\\.[Cc][Ss][Vv]\\'" . csv-mode))
 
@@ -1728,6 +1749,104 @@ setting works better)."
       (jit-lock-unregister #'csv--jit-align)
       (csv--jit-unalign (point-min) (point-max))))
   (csv--header-flush))
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;; Separator guessing
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(defvar csv--preferred-separators
+  '(?\t ?\s ?, ?: ?\;)
+  "Preferred separator characters in case of a tied score.")
+
+(defun csv-guess-set-separator ()
+  "Guess and set the CSV separator of the current buffer.
+
+Add it to the mode hook to have CSV mode guess and set the
+separator automatically when visiting a buffer:
+
+  (add-hook \\='csv-mode-hook \\='csv-guess-set-separator)"
+  (interactive)
+  (let ((sep (csv-guess-separator
+              (buffer-substring-no-properties
+               (point-min)
+               ;; We're probably only going to look at the first 2048
+               ;; or so chars, but take more than we probably need to
+               ;; minimize the chance of breaking the input in the
+               ;; middle of a (long) row.
+               (min 8192 (point-max)))
+              2048)))
+    (when sep
+      (csv-set-separator sep))))
+
+(defun csv-guess-separator (text &optional cutoff)
+  "Return a guess of which character is the CSV separator in TEXT."
+  (let ((best-separator nil)
+        (best-score 0))
+    (dolist (candidate (csv--separator-candidates text cutoff))
+      (let ((candidate-score
+             (csv--separator-score candidate text cutoff)))
+        (when (or (> candidate-score best-score)
+                  (and (= candidate-score best-score)
+                       (member candidate csv--preferred-separators)))
+          (setq best-separator candidate)
+          (setq best-score candidate-score))))
+    best-separator))
+
+(defun csv--separator-candidates (text &optional cutoff)
+  "Return a list of candidate CSV separators in TEXT.
+When CUTOFF is passed, look only at the first CUTOFF number of characters."
+  (let ((chars (make-hash-table)))
+    (dolist (c (string-to-list
+                (if cutoff
+                    (substring text 0 (min cutoff (length text)))
+                  text)))
+      (when (and (not (gethash c chars))
+                 (or (= c ?\t)
+                     (and (not (member c '(?. ?/ ?\" ?')))
+                          (not (member (get-char-code-property c 'general-category)
+                                       '(Lu Ll Lt Lm Lo Nd Nl No Ps Pe Cc Co))))))
+        (puthash c t chars)))
+    (hash-table-keys chars)))
+
+(defun csv--separator-score (separator text &optional cutoff)
+  "Return a score on how likely SEPARATOR is a separator in TEXT.
+
+When CUTOFF is passed, stop the calculation at the next whole
+line after having read CUTOFF number of characters.
+
+The scoring is based on the idea that most CSV data is tabular,
+i.e. separators should appear equally often on each line.
+Furthermore, more commonly appearing characters are scored higher
+than those who appear less often.
+
+Adapted from the paper \"Wrangling Messy CSV Files by Detecting
+Row and Type Patterns\" by Gerrit J.J. van den Burg , Alfredo
+Nazábal, and Charles Sutton: https://arxiv.org/abs/1811.11242."
+  (let ((groups
+         (with-temp-buffer
+           (csv-set-separator separator)
+           (save-excursion
+             (insert text))
+           (let ((groups (make-hash-table))
+                 (chars-read 0))
+             (while (and (/= (point) (point-max))
+                         (or (not cutoff)
+                             (< chars-read cutoff)))
+               (let* ((lep (line-end-position))
+                      (nfields (length (csv--collect-fields lep))))
+                 (cl-incf (gethash nfields groups 0))
+                 (cl-incf chars-read (- lep (point)))
+                 (goto-char (+ lep 1))))
+             groups)))
+        (sum 0))
+    (maphash
+     (lambda (length num)
+       (cl-incf sum (* num (/ (- length 1) (float length)))))
+     groups)
+    (let ((unique-groups (hash-table-count groups)))
+      (if (= 0 unique-groups)
+          0
+        (/ sum unique-groups)))))
 
 ;;; TSV support
 
-- 
2.35.1

--=-=-=--
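
A note on the scoring in `csv--separator-score' from the patch above, as
read from the code rather than taken from the cited paper: rows are
grouped by their number of fields, and the per-group contributions are
averaged over the number of distinct groups. The symbols K (the set of
distinct field counts observed) and n_k (the number of rows with k
fields) are names introduced here for illustration only. The worked
values use the five rows of `csv-tests--data' and assume that
`csv--collect-fields' returns one entry per separator-delimited field.

```latex
\[
  \operatorname{score}(s) \;=\; \frac{1}{|K|} \sum_{k \in K} n_k \,\frac{k-1}{k},
  \qquad \operatorname{score}(s) = 0 \ \text{if } K = \emptyset .
\]

% Worked on csv-tests--data (5 rows):
\begin{align*}
  \text{`;' and `-'} &: K=\{3\},\; n_3=5
     &&\Rightarrow\; \tfrac{1}{1}\cdot 5\cdot\tfrac{2}{3} = \tfrac{10}{3} \approx 3.33 \\
  \text{space}       &: K=\{2\},\; n_2=5
     &&\Rightarrow\; \tfrac{1}{1}\cdot 5\cdot\tfrac{1}{2} = 2.5 \\
  \text{`,'}         &: K=\{3,4\},\; n_3=3,\; n_4=2
     &&\Rightarrow\; \tfrac{1}{2}\left(3\cdot\tfrac{2}{3} + 2\cdot\tfrac{3}{4}\right) = 1.75
\end{align*}
```

These values agree with the ordering asserted in
`csv-tests-separator-score' (comma < space < hyphen, and hyphen equal to
semicolon). With the tie between `;' and `-', `csv-guess-separator'
ends up on `;' because only `;' is listed in
`csv--preferred-separators'.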