* bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
[not found] <87h760jeq7.fsf@simenheg@gmail.com>
@ 2022-05-08 17:56 ` Mattias Engdegård
2022-05-08 19:31 ` Simen Heggestøyl
[not found] ` <87mtfryg73.fsf@simenheg@gmail.com>
0 siblings, 2 replies; 7+ messages in thread
From: Mattias Engdegård @ 2022-05-08 17:56 UTC (permalink / raw)
To: Simen Heggestøyl; +Cc: 55315
> + (setq csv-separator-chars (mapcar #'string-to-char value))
> + (let ((quoted-value (mapcar #'regexp-quote value)))
> + (setq csv--skip-chars (apply #'concat "^\n" quoted-value))
> + (setq csv-separator-regexp
> + (apply #'concat `("[" ,@quoted-value "]"))))
`regexp-quote` produces a regexp from a string literal, but what goes inside the square brackets is not a regexp -- the syntax rules are different. More specifically, a different set of characters is special there (`]`, `^` and `-`), and backslash does not quote anything.
To produce a regexp that matches one in a set of characters, try rx-to-string or regexp-opt. For example,
(setq csv-separator-regexp (rx-to-string `(or ,@csv-separator-chars) t))
The same applies to csv--skip-chars: this isn't a regexp either, but uses yet another syntax so regexp-quote is inappropriate here too. Easiest is to precede each char with a backslash since that always yields a correctly quoted character: "ABC" -> "\\A\\B\\C".
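To make the two pitfalls concrete, here is a minimal sketch. The `my-` names and the sample separators are made up for illustration and are not part of csv-mode; it only assumes the standard behaviour of `regexp-quote', `regexp-opt' and `skip-chars-forward'.

```elisp
;; Illustration only: every `my-' name here is hypothetical, not csv-mode code.
(defvar my-seps '("^" ",")
  "Hypothetical separator list containing a regexp-special character.")

;; Quoting each separator with `regexp-quote' and wrapping the result in
;; brackets yields "[\\^,]".  Inside brackets the backslash is an ordinary
;; member of the set, so a lone backslash matches as well.
(defvar my-naive-regexp
  (concat "[" (mapconcat #'regexp-quote my-seps "") "]"))

;; `regexp-opt' builds a regexp that matches exactly the given strings.
(defvar my-safe-regexp (regexp-opt my-seps))

(defun my-skip-distance (skip-set text)
  "Return how far `skip-chars-forward' moves over TEXT using SKIP-SET."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (skip-chars-forward skip-set)))

(list (string-match-p my-naive-regexp "\\")       ; 0: backslash matches by accident
      (string-match-p my-safe-regexp "\\")        ; nil
      (string-match-p my-safe-regexp "a^b")       ; 1
      ;; Unquoted, ",-;" is read as a range that happens to include the digits.
      (my-skip-distance "^\n,-;" "12,34")         ; 0
      ;; With each char preceded by a backslash, only the three chars count.
      (my-skip-distance "^\n\\,\\-\\;" "12,34"))  ; 2
```

The skip-set case goes wrong whenever a `-` ends up between two other separators, which is why quoting every character unconditionally is the simplest rule.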
This is not a judgement on the rest of the patch which may be fine for all I know.
^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
2022-05-08 17:56 ` bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing Mattias Engdegård
@ 2022-05-08 19:31 ` Simen Heggestøyl
[not found] ` <87mtfryg73.fsf@simenheg@gmail.com>
1 sibling, 0 replies; 7+ messages in thread
From: Simen Heggestøyl @ 2022-05-08 19:31 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 55315
[-- Attachment #1: Type: text/plain, Size: 1613 bytes --]
Mattias Engdegård <mattiase@acm.org> writes:
>> + (setq csv-separator-chars (mapcar #'string-to-char value))
>> + (let ((quoted-value (mapcar #'regexp-quote value)))
>> + (setq csv--skip-chars (apply #'concat "^\n" quoted-value))
>> + (setq csv-separator-regexp
>> + (apply #'concat `("[" ,@quoted-value "]"))))
>
> `regexp-quote` produces a regexp from a string literal, but what goes
> inside the square brackets is not a regexp -- the syntax rules are
> different. More specifically, other characters are special, and
> backslash does not quote anything.
>
> To produce a regexp that matches one in a set of characters, try rx-to-string or regexp-opt. For example,
>
> (setq csv-separator-regexp (rx-to-string `(or ,@csv-separator-chars) t))
>
> The same applies to csv--skip-chars: this isn't a regexp either, but
> uses yet another syntax so regexp-quote is inappropriate here
> too. Easiest is to precede each char with a backslash since that
> always yields a correctly quoted character: "ABC" -> "\\A\\B\\C".
>
> This is not a judgement on the rest of the patch which may be fine for all I know.
Thanks Mattias.
Does it look better in the updated patch attached?
Note that `csv--skip-chars' and `csv-separator-regexp' are set in two
different places in the patch, the first time from a list of strings in
`csv-separators', and the second time from a single character in
`csv-set-separator'. Am I right in thinking that the use of
`regexp-quote' in the `csv-set-separator' case gives the right result?
-- Simen
[-- Attachment #2: 0001-Add-CSV-separator-guessing-functionality.patch --]
[-- Type: text/x-diff, Size: 14268 bytes --]
From e498ab88ffe8468d791a10c50b692a926a2341ea Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Simen=20Heggest=C3=B8yl?= <simenheg@runbox.com>
Date: Sun, 8 May 2022 16:01:35 +0200
Subject: [PATCH] Add CSV separator guessing functionality
Add two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.
`csv-guess-set-separator' can be useful to add to the mode hook to
have CSV mode guess and set the separator automatically when visiting
a buffer:
(add-hook 'csv-mode-hook 'csv-guess-set-separator)
* csv-mode.el (csv-separators): Properly quote regexp values.
(csv--set-separator-history, csv--preferred-separators): New
variables.
(csv-set-separator, csv-guess-set-separator)
(csv-guess-separator, csv--separator-candidates)
(csv--separator-score): New functions.
* csv-mode-tests.el (csv-tests--data): New test data.
(csv-tests-guess-separator, csv-tests-separator-candidates)
(csv-tests-separator-score): New tests.
---
csv-mode-tests.el | 80 ++++++++++++++++++++-------
csv-mode.el | 137 +++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 187 insertions(+), 30 deletions(-)
diff --git a/csv-mode-tests.el b/csv-mode-tests.el
index 316dc4bb93..0caeab7d80 100644
--- a/csv-mode-tests.el
+++ b/csv-mode-tests.el
@@ -1,8 +1,8 @@
;;; csv-mode-tests.el --- Tests for CSV mode -*- lexical-binding: t; -*-
-;; Copyright (C) 2020 Free Software Foundation, Inc
+;; Copyright (C) 2020-2022 Free Software Foundation, Inc
-;; Author: Simen Heggestøyl <simenheg@gmail.com>
+;; Author: Simen Heggestøyl <simenheg@runbox.com>
;; Keywords:
;; This program is free software; you can redistribute it and/or modify
@@ -28,83 +28,121 @@
(require 'csv-mode)
(eval-when-compile (require 'subr-x))
-(ert-deftest csv-mode-tests-end-of-field ()
+(ert-deftest csv-tests-end-of-field ()
(with-temp-buffer
(csv-mode)
(insert "aaa,bbb")
(goto-char (point-min))
(csv-end-of-field)
- (should (equal (buffer-substring (point-min) (point))
- "aaa"))
+ (should (equal (buffer-substring (point-min) (point)) "aaa"))
(forward-char)
(csv-end-of-field)
(should (equal (buffer-substring (point-min) (point))
"aaa,bbb"))))
-(ert-deftest csv-mode-tests-end-of-field-with-quotes ()
+(ert-deftest csv-tests-end-of-field-with-quotes ()
(with-temp-buffer
(csv-mode)
(insert "aaa,\"b,b\"")
(goto-char (point-min))
(csv-end-of-field)
- (should (equal (buffer-substring (point-min) (point))
- "aaa"))
+ (should (equal (buffer-substring (point-min) (point)) "aaa"))
(forward-char)
(csv-end-of-field)
(should (equal (buffer-substring (point-min) (point))
"aaa,\"b,b\""))))
-(ert-deftest csv-mode-tests-beginning-of-field ()
+(ert-deftest csv-tests-beginning-of-field ()
(with-temp-buffer
(csv-mode)
(insert "aaa,bbb")
(csv-beginning-of-field)
- (should (equal (buffer-substring (point) (point-max))
- "bbb"))
+ (should (equal (buffer-substring (point) (point-max)) "bbb"))
(backward-char)
(csv-beginning-of-field)
(should (equal (buffer-substring (point) (point-max))
"aaa,bbb"))))
-(ert-deftest csv-mode-tests-beginning-of-field-with-quotes ()
+(ert-deftest csv-tests-beginning-of-field-with-quotes ()
(with-temp-buffer
(csv-mode)
(insert "aaa,\"b,b\"")
(csv-beginning-of-field)
- (should (equal (buffer-substring (point) (point-max))
- "\"b,b\""))
+ (should (equal (buffer-substring (point) (point-max)) "\"b,b\""))
(backward-char)
(csv-beginning-of-field)
(should (equal (buffer-substring (point) (point-max))
"aaa,\"b,b\""))))
-(defun csv-mode-tests--align-fields (before after)
+(defun csv-tests--align-fields (before after)
(with-temp-buffer
(insert (string-join before "\n"))
(csv-align-fields t (point-min) (point-max))
(should (equal (buffer-string) (string-join after "\n")))))
-(ert-deftest csv-mode-tests-align-fields ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields ()
+ (csv-tests--align-fields
'("aaa,bbb,ccc"
"1,2,3")
'("aaa, bbb, ccc"
"1 , 2 , 3")))
-(ert-deftest csv-mode-tests-align-fields-with-quotes ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-with-quotes ()
+ (csv-tests--align-fields
'("aaa,\"b,b\",ccc"
"1,2,3")
'("aaa, \"b,b\", ccc"
"1 , 2 , 3")))
;; Bug#14053
-(ert-deftest csv-mode-tests-align-fields-double-quote-comma ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-double-quote-comma ()
+ (csv-tests--align-fields
'("1,2,3"
"a,\"b\"\"c,\",d")
'("1, 2 , 3"
"a, \"b\"\"c,\", d")))
+(defvar csv-tests--data
+ "1,4;Sun, 2022-04-10;4,12
+8;Mon, 2022-04-11;3,19
+3,2;Tue, 2022-04-12;1,00
+2;Wed, 2022-04-13;0,37
+9;Wed, 2022-04-13;0,37")
+
+(ert-deftest csv-tests-guess-separator ()
+ (should-not (csv-guess-separator ""))
+ (should (= (csv-guess-separator csv-tests--data 3) ?,))
+ (should (= (csv-guess-separator csv-tests--data) ?\;))
+ (should (= (csv-guess-separator csv-tests--data)
+ (csv-guess-separator csv-tests--data
+ (length csv-tests--data)))))
+
+(ert-deftest csv-tests-separator-candidates ()
+ (should-not (csv--separator-candidates ""))
+ (should-not (csv--separator-candidates csv-tests--data 0))
+ (should
+ (equal (sort (csv--separator-candidates csv-tests--data 4) #'<)
+ '(?, ?\;)))
+ (should
+ (equal (sort (csv--separator-candidates csv-tests--data) #'<)
+ '(?\s ?, ?- ?\;)))
+ (should
+ (equal
+ (sort (csv--separator-candidates csv-tests--data) #'<)
+ (sort (csv--separator-candidates csv-tests--data
+ (length csv-tests--data))
+ #'<))))
+
+(ert-deftest csv-tests-separator-score ()
+ (should (< (csv--separator-score ?, csv-tests--data)
+ (csv--separator-score ?\s csv-tests--data)
+ (csv--separator-score ?- csv-tests--data)))
+ (should (= (csv--separator-score ?- csv-tests--data)
+ (csv--separator-score ?\; csv-tests--data)))
+ (should (= 0 (csv--separator-score ?\; csv-tests--data 0)))
+ (should (= (csv--separator-score ?\; csv-tests--data)
+ (csv--separator-score ?\; csv-tests--data
+ (length csv-tests--data)))))
+
(provide 'csv-mode-tests)
;;; csv-mode-tests.el ends here
diff --git a/csv-mode.el b/csv-mode.el
index 10ce166052..9fd5fc8f10 100644
--- a/csv-mode.el
+++ b/csv-mode.el
@@ -1,11 +1,11 @@
;;; csv-mode.el --- Major mode for editing comma/char separated values -*- lexical-binding: t -*-
-;; Copyright (C) 2003, 2004, 2012-2020 Free Software Foundation, Inc
+;; Copyright (C) 2003, 2004, 2012-2022 Free Software Foundation, Inc
;; Author: "Francis J. Wright" <F.J.Wright@qmul.ac.uk>
;; Maintainer: emacs-devel@gnu.org
;; Version: 1.19
-;; Package-Requires: ((emacs "24.1") (cl-lib "0.5"))
+;; Package-Requires: ((emacs "27.1") (cl-lib "0.5"))
;; Keywords: convenience
;; This package is free software; you can redistribute it and/or modify
@@ -119,7 +119,9 @@
;;; Code:
-(eval-when-compile (require 'cl-lib))
+(eval-when-compile
+ (require 'cl-lib)
+ (require 'subr-x))
(defgroup CSV nil
"Major mode for editing files of comma-separated value type."
@@ -163,12 +165,14 @@ session. Use `customize-set-variable' instead if that is required."
(error "%S is already a quote" x)))
value)
(custom-set-default variable value)
- (setq csv-separator-chars (mapcar #'string-to-char value)
- csv--skip-chars (apply #'concat "^\n" csv-separators)
- csv-separator-regexp (apply #'concat `("[" ,@value "]"))
- csv-font-lock-keywords
- ;; NB: csv-separator-face variable evaluates to itself.
- `((,csv-separator-regexp (0 'csv-separator-face))))))
+ (setq csv-separator-chars (mapcar #'string-to-char value))
+ (setq csv--skip-chars
+ (apply #'concat "^\n"
+ (mapcar (lambda (s) (concat "\\" s)) value)))
+ (setq csv-separator-regexp (regexp-opt value))
+ (setq csv-font-lock-keywords
+ ;; NB: csv-separator-face variable evaluates to itself.
+ `((,csv-separator-regexp (0 'csv-separator-face))))))
(defcustom csv-field-quotes '("\"")
"Field quotes: a list of *single-character* strings.
@@ -368,6 +372,23 @@ It must be either a string or nil."
(modify-syntax-entry ?\n ">" csv-mode-syntax-table))
(setq csv-comment-start string))
+(defvar csv--set-separator-history nil)
+
+(defun csv-set-separator (sep)
+ "Set the CSV separator in the current buffer to SEP."
+ (interactive (list (read-char-from-minibuffer
+ "Separator: " nil 'csv--set-separator-history)))
+ (when (and (boundp 'csv-field-quotes)
+ (member (string sep) csv-field-quotes))
+ (error "%c is already a quote" sep))
+ (setq-local csv-separators (list (string sep)))
+ (setq-local csv-separator-chars (list sep))
+ (setq-local csv--skip-chars (format "^\n%c" sep))
+ (setq-local csv-separator-regexp (regexp-quote (string sep)))
+ (setq-local csv-font-lock-keywords
+ `((,csv-separator-regexp (0 'csv-separator-face))))
+ (font-lock-refresh-defaults))
+
;;;###autoload
(add-to-list 'auto-mode-alist '("\\.[Cc][Ss][Vv]\\'" . csv-mode))
@@ -1728,6 +1749,104 @@ setting works better)."
(jit-lock-unregister #'csv--jit-align)
(csv--jit-unalign (point-min) (point-max))))
(csv--header-flush))
+\f
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;; Separator guessing
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(defvar csv--preferred-separators
+ '(?\t ?\s ?, ?: ?\;)
+ "Preferred separator characters in case of a tied score.")
+
+(defun csv-guess-set-separator ()
+ "Guess and set the CSV separator of the current buffer.
+
+Add it to the mode hook to have CSV mode guess and set the
+separator automatically when visiting a buffer:
+
+ (add-hook \\='csv-mode-hook \\='csv-guess-set-separator)"
+ (interactive)
+ (let ((sep (csv-guess-separator
+ (buffer-substring-no-properties
+ (point-min)
+ ;; We're probably only going to look at the first 2048
+ ;; or so chars, but take more than we probably need to
+ ;; minimize the chance of breaking the input in the
+ ;; middle of a (long) row.
+ (min 8192 (point-max)))
+ 2048)))
+ (when sep
+ (csv-set-separator sep))))
+
+(defun csv-guess-separator (text &optional cutoff)
+ "Return a guess of which character is the CSV separator in TEXT."
+ (let ((best-separator nil)
+ (best-score 0))
+ (dolist (candidate (csv--separator-candidates text cutoff))
+ (let ((candidate-score
+ (csv--separator-score candidate text cutoff)))
+ (when (or (> candidate-score best-score)
+ (and (= candidate-score best-score)
+ (member candidate csv--preferred-separators)))
+ (setq best-separator candidate)
+ (setq best-score candidate-score))))
+ best-separator))
+
+(defun csv--separator-candidates (text &optional cutoff)
+ "Return a list of candidate CSV separators in TEXT.
+When CUTOFF is passed, look only at the first CUTOFF number of characters."
+ (let ((chars (make-hash-table)))
+ (dolist (c (string-to-list
+ (if cutoff
+ (substring text 0 (min cutoff (length text)))
+ text)))
+ (when (and (not (gethash c chars))
+ (or (= c ?\t)
+ (and (not (member c '(?. ?/ ?\" ?')))
+ (not (member (get-char-code-property c 'general-category)
+ '(Lu Ll Lt Lm Lo Nd Nl No Ps Pe Cc Co))))))
+ (puthash c t chars)))
+ (hash-table-keys chars)))
+
+(defun csv--separator-score (separator text &optional cutoff)
+ "Return a score on how likely SEPARATOR is a separator in TEXT.
+
+When CUTOFF is passed, stop the calculation at the next whole
+line after having read CUTOFF number of characters.
+
+The scoring is based on the idea that most CSV data is tabular,
+i.e. separators should appear equally often on each line.
+Furthermore, more commonly appearing characters are scored higher
+than those who appear less often.
+
+Adapted from the paper \"Wrangling Messy CSV Files by Detecting
+Row and Type Patterns\" by Gerrit J.J. van den Burg , Alfredo
+Nazábal, and Charles Sutton: https://arxiv.org/abs/1811.11242."
+ (let ((groups
+ (with-temp-buffer
+ (csv-set-separator separator)
+ (save-excursion
+ (insert text))
+ (let ((groups (make-hash-table))
+ (chars-read 0))
+ (while (and (/= (point) (point-max))
+ (or (not cutoff)
+ (< chars-read cutoff)))
+ (let* ((lep (line-end-position))
+ (nfields (length (csv--collect-fields lep))))
+ (cl-incf (gethash nfields groups 0))
+ (cl-incf chars-read (- lep (point)))
+ (goto-char (+ lep 1))))
+ groups)))
+ (sum 0))
+ (maphash
+ (lambda (length num)
+ (cl-incf sum (* num (/ (- length 1) (float length)))))
+ groups)
+ (let ((unique-groups (hash-table-count groups)))
+ (if (= 0 unique-groups)
+ 0
+ (/ sum unique-groups)))))
;;; TSV support
--
2.35.1
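For readers who want the scoring idea from the `csv--separator-score' docstring without the buffer machinery, here is a simplified sketch. It is not the code from the patch: `my-separator-score' is a hypothetical name, it splits lines with `split-string' (so quoted fields are not handled), and it ignores CUTOFF.

```elisp
(require 'cl-lib)

;; Illustration only: a simplified stand-in for the patch's scoring
;; function.  It does not understand quoted fields or a cutoff.
(defun my-separator-score (separator text)
  "Return a score for how likely the char SEPARATOR separates fields in TEXT."
  (let ((groups (make-hash-table))      ; field count -> number of lines
        (sum 0))
    (dolist (line (split-string text "\n" t))
      (let ((nfields (length (split-string
                              line (regexp-quote (string separator))))))
        (cl-incf (gethash nfields groups 0))))
    ;; Lines with more fields contribute more; lines with a single field
    ;; (separator absent) contribute nothing.
    (maphash (lambda (nfields nlines)
               (cl-incf sum (* nlines (/ (- nfields 1) (float nfields)))))
             groups)
    ;; Dividing by the number of distinct field counts penalizes
    ;; separators that appear unevenly across lines.
    (let ((unique-groups (hash-table-count groups)))
      (if (= 0 unique-groups)
          0
        (/ sum unique-groups)))))

(my-separator-score ?\; "a;b,c\n1;2\n3;4,5")  ; 1.5
(my-separator-score ?,  "a;b,c\n1;2\n3;4,5")  ; 0.5
```

In the sample text every line has two `;`-separated fields, so `;` scores higher than `,`, which is missing from the second line.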
[parent not found: <87mtfryg73.fsf@simenheg@gmail.com>]
* bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
[not found] ` <87mtfryg73.fsf@simenheg@gmail.com>
@ 2022-05-09 9:37 ` Mattias Engdegård
2022-05-09 11:03 ` Simen Heggestøyl
0 siblings, 1 reply; 7+ messages in thread
From: Mattias Engdegård @ 2022-05-09 9:37 UTC (permalink / raw)
To: Simen Heggestøyl; +Cc: 55315
8 maj 2022 kl. 21.31 skrev Simen Heggestøyl <simenheg@runbox.com>:
> Am I right in thinking that the use of
> `regexp-quote' in the `csv-set-separator' case gives the right result?
Yes, I think so. `csv-set-separator` should probably escape the character in `csv--skip-chars`, however:
(setq-local csv--skip-chars (format "^\n%c" sep))
should be
(setq-local csv--skip-chars (format "^\n\\%c" sep))
I'm not sure if a separator can be chosen that needs escaping here but better be safe; who knows how the code will be used.
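A small sketch of what the escaped form buys. The helper names are hypothetical and not part of csv-mode. As far as I can tell, backslash is the one separator for which the unescaped form would misbehave, since a trailing backslash in a skip set has nothing left to quote.

```elisp
;; Illustration only: `my-skip-set' and `my-field-length' are hypothetical
;; helpers, not part of csv-mode.
(defun my-skip-set (sep)
  "Return a `skip-chars-forward' set excluding newline and the char SEP."
  (format "^\n\\%c" sep))

(defun my-field-length (sep text)
  "Return the length of the first field of TEXT when fields end at SEP."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (skip-chars-forward (my-skip-set sep))))

(my-field-length ?\; "ab;cd")   ; 2
(my-field-length ?\\ "a\\b")    ; 1: the quoted backslash is part of the set

;; For the regexp side, `regexp-quote' on a one-char string is enough:
(regexp-quote (string ?.))      ; "\\."
```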
* bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
2022-05-09 9:37 ` Mattias Engdegård
@ 2022-05-09 11:03 ` Simen Heggestøyl
2022-05-09 11:28 ` Mattias Engdegård
0 siblings, 1 reply; 7+ messages in thread
From: Simen Heggestøyl @ 2022-05-09 11:03 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 55315
[-- Attachment #1: Type: text/plain, Size: 922 bytes --]
Mattias Engdegård <mattiase@acm.org> writes:
> 8 maj 2022 kl. 21.31 skrev Simen Heggestøyl <simenheg@runbox.com>:
>
>> Am I right in thinking that the use of
>> `regexp-quote' in the `csv-set-separator' case gives the right result?
>
> Yes, I think so. `csv-set-separator` should probably escape the character in `csv--skip-chars`, however:
>
> (setq-local csv--skip-chars (format "^\n%c" sep))
>
> should be
>
> (setq-local csv--skip-chars (format "^\n\\%c" sep))
>
> I'm not sure if a separator can be chosen that needs escaping here but
> better be safe; who knows how the code will be used.
Ah, thanks, I misread the docstring of `skip-chars-forward':
(but not at the end of a range; quoting is never needed there)
I somehow misinterpreted that as quoting not being necessary at the end
of the string fed to `skip-chars-forward'.
Updated patch with your proposed fix attached.
[-- Attachment #2: 0001-Add-CSV-separator-guessing-functionality.patch --]
[-- Type: text/x-diff, Size: 14270 bytes --]
From 872d7f08c47fa382ae18171a0806afa110de8fbe Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Simen=20Heggest=C3=B8yl?= <simenheg@runbox.com>
Date: Sun, 8 May 2022 16:01:35 +0200
Subject: [PATCH] Add CSV separator guessing functionality
Add two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.
`csv-guess-set-separator' can be useful to add to the mode hook to
have CSV mode guess and set the separator automatically when visiting
a buffer:
(add-hook 'csv-mode-hook 'csv-guess-set-separator)
* csv-mode.el (csv-separators): Properly quote regexp values.
(csv--set-separator-history, csv--preferred-separators): New
variables.
(csv-set-separator, csv-guess-set-separator)
(csv-guess-separator, csv--separator-candidates)
(csv--separator-score): New functions.
* csv-mode-tests.el (csv-tests--data): New test data.
(csv-tests-guess-separator, csv-tests-separator-candidates)
(csv-tests-separator-score): New tests.
---
csv-mode-tests.el | 80 ++++++++++++++++++++-------
csv-mode.el | 137 +++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 187 insertions(+), 30 deletions(-)
diff --git a/csv-mode-tests.el b/csv-mode-tests.el
index 316dc4bb93..0caeab7d80 100644
--- a/csv-mode-tests.el
+++ b/csv-mode-tests.el
@@ -1,8 +1,8 @@
;;; csv-mode-tests.el --- Tests for CSV mode -*- lexical-binding: t; -*-
-;; Copyright (C) 2020 Free Software Foundation, Inc
+;; Copyright (C) 2020-2022 Free Software Foundation, Inc
-;; Author: Simen Heggestøyl <simenheg@gmail.com>
+;; Author: Simen Heggestøyl <simenheg@runbox.com>
;; Keywords:
;; This program is free software; you can redistribute it and/or modify
@@ -28,83 +28,121 @@
(require 'csv-mode)
(eval-when-compile (require 'subr-x))
-(ert-deftest csv-mode-tests-end-of-field ()
+(ert-deftest csv-tests-end-of-field ()
(with-temp-buffer
(csv-mode)
(insert "aaa,bbb")
(goto-char (point-min))
(csv-end-of-field)
- (should (equal (buffer-substring (point-min) (point))
- "aaa"))
+ (should (equal (buffer-substring (point-min) (point)) "aaa"))
(forward-char)
(csv-end-of-field)
(should (equal (buffer-substring (point-min) (point))
"aaa,bbb"))))
-(ert-deftest csv-mode-tests-end-of-field-with-quotes ()
+(ert-deftest csv-tests-end-of-field-with-quotes ()
(with-temp-buffer
(csv-mode)
(insert "aaa,\"b,b\"")
(goto-char (point-min))
(csv-end-of-field)
- (should (equal (buffer-substring (point-min) (point))
- "aaa"))
+ (should (equal (buffer-substring (point-min) (point)) "aaa"))
(forward-char)
(csv-end-of-field)
(should (equal (buffer-substring (point-min) (point))
"aaa,\"b,b\""))))
-(ert-deftest csv-mode-tests-beginning-of-field ()
+(ert-deftest csv-tests-beginning-of-field ()
(with-temp-buffer
(csv-mode)
(insert "aaa,bbb")
(csv-beginning-of-field)
- (should (equal (buffer-substring (point) (point-max))
- "bbb"))
+ (should (equal (buffer-substring (point) (point-max)) "bbb"))
(backward-char)
(csv-beginning-of-field)
(should (equal (buffer-substring (point) (point-max))
"aaa,bbb"))))
-(ert-deftest csv-mode-tests-beginning-of-field-with-quotes ()
+(ert-deftest csv-tests-beginning-of-field-with-quotes ()
(with-temp-buffer
(csv-mode)
(insert "aaa,\"b,b\"")
(csv-beginning-of-field)
- (should (equal (buffer-substring (point) (point-max))
- "\"b,b\""))
+ (should (equal (buffer-substring (point) (point-max)) "\"b,b\""))
(backward-char)
(csv-beginning-of-field)
(should (equal (buffer-substring (point) (point-max))
"aaa,\"b,b\""))))
-(defun csv-mode-tests--align-fields (before after)
+(defun csv-tests--align-fields (before after)
(with-temp-buffer
(insert (string-join before "\n"))
(csv-align-fields t (point-min) (point-max))
(should (equal (buffer-string) (string-join after "\n")))))
-(ert-deftest csv-mode-tests-align-fields ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields ()
+ (csv-tests--align-fields
'("aaa,bbb,ccc"
"1,2,3")
'("aaa, bbb, ccc"
"1 , 2 , 3")))
-(ert-deftest csv-mode-tests-align-fields-with-quotes ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-with-quotes ()
+ (csv-tests--align-fields
'("aaa,\"b,b\",ccc"
"1,2,3")
'("aaa, \"b,b\", ccc"
"1 , 2 , 3")))
;; Bug#14053
-(ert-deftest csv-mode-tests-align-fields-double-quote-comma ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-double-quote-comma ()
+ (csv-tests--align-fields
'("1,2,3"
"a,\"b\"\"c,\",d")
'("1, 2 , 3"
"a, \"b\"\"c,\", d")))
+(defvar csv-tests--data
+ "1,4;Sun, 2022-04-10;4,12
+8;Mon, 2022-04-11;3,19
+3,2;Tue, 2022-04-12;1,00
+2;Wed, 2022-04-13;0,37
+9;Wed, 2022-04-13;0,37")
+
+(ert-deftest csv-tests-guess-separator ()
+ (should-not (csv-guess-separator ""))
+ (should (= (csv-guess-separator csv-tests--data 3) ?,))
+ (should (= (csv-guess-separator csv-tests--data) ?\;))
+ (should (= (csv-guess-separator csv-tests--data)
+ (csv-guess-separator csv-tests--data
+ (length csv-tests--data)))))
+
+(ert-deftest csv-tests-separator-candidates ()
+ (should-not (csv--separator-candidates ""))
+ (should-not (csv--separator-candidates csv-tests--data 0))
+ (should
+ (equal (sort (csv--separator-candidates csv-tests--data 4) #'<)
+ '(?, ?\;)))
+ (should
+ (equal (sort (csv--separator-candidates csv-tests--data) #'<)
+ '(?\s ?, ?- ?\;)))
+ (should
+ (equal
+ (sort (csv--separator-candidates csv-tests--data) #'<)
+ (sort (csv--separator-candidates csv-tests--data
+ (length csv-tests--data))
+ #'<))))
+
+(ert-deftest csv-tests-separator-score ()
+ (should (< (csv--separator-score ?, csv-tests--data)
+ (csv--separator-score ?\s csv-tests--data)
+ (csv--separator-score ?- csv-tests--data)))
+ (should (= (csv--separator-score ?- csv-tests--data)
+ (csv--separator-score ?\; csv-tests--data)))
+ (should (= 0 (csv--separator-score ?\; csv-tests--data 0)))
+ (should (= (csv--separator-score ?\; csv-tests--data)
+ (csv--separator-score ?\; csv-tests--data
+ (length csv-tests--data)))))
+
(provide 'csv-mode-tests)
;;; csv-mode-tests.el ends here
diff --git a/csv-mode.el b/csv-mode.el
index 10ce166052..b2a881dde2 100644
--- a/csv-mode.el
+++ b/csv-mode.el
@@ -1,11 +1,11 @@
;;; csv-mode.el --- Major mode for editing comma/char separated values -*- lexical-binding: t -*-
-;; Copyright (C) 2003, 2004, 2012-2020 Free Software Foundation, Inc
+;; Copyright (C) 2003, 2004, 2012-2022 Free Software Foundation, Inc
;; Author: "Francis J. Wright" <F.J.Wright@qmul.ac.uk>
;; Maintainer: emacs-devel@gnu.org
;; Version: 1.19
-;; Package-Requires: ((emacs "24.1") (cl-lib "0.5"))
+;; Package-Requires: ((emacs "27.1") (cl-lib "0.5"))
;; Keywords: convenience
;; This package is free software; you can redistribute it and/or modify
@@ -119,7 +119,9 @@
;;; Code:
-(eval-when-compile (require 'cl-lib))
+(eval-when-compile
+ (require 'cl-lib)
+ (require 'subr-x))
(defgroup CSV nil
"Major mode for editing files of comma-separated value type."
@@ -163,12 +165,14 @@ session. Use `customize-set-variable' instead if that is required."
(error "%S is already a quote" x)))
value)
(custom-set-default variable value)
- (setq csv-separator-chars (mapcar #'string-to-char value)
- csv--skip-chars (apply #'concat "^\n" csv-separators)
- csv-separator-regexp (apply #'concat `("[" ,@value "]"))
- csv-font-lock-keywords
- ;; NB: csv-separator-face variable evaluates to itself.
- `((,csv-separator-regexp (0 'csv-separator-face))))))
+ (setq csv-separator-chars (mapcar #'string-to-char value))
+ (setq csv--skip-chars
+ (apply #'concat "^\n"
+ (mapcar (lambda (s) (concat "\\" s)) value)))
+ (setq csv-separator-regexp (regexp-opt value))
+ (setq csv-font-lock-keywords
+ ;; NB: csv-separator-face variable evaluates to itself.
+ `((,csv-separator-regexp (0 'csv-separator-face))))))
(defcustom csv-field-quotes '("\"")
"Field quotes: a list of *single-character* strings.
@@ -368,6 +372,23 @@ It must be either a string or nil."
(modify-syntax-entry ?\n ">" csv-mode-syntax-table))
(setq csv-comment-start string))
+(defvar csv--set-separator-history nil)
+
+(defun csv-set-separator (sep)
+ "Set the CSV separator in the current buffer to SEP."
+ (interactive (list (read-char-from-minibuffer
+ "Separator: " nil 'csv--set-separator-history)))
+ (when (and (boundp 'csv-field-quotes)
+ (member (string sep) csv-field-quotes))
+ (error "%c is already a quote" sep))
+ (setq-local csv-separators (list (string sep)))
+ (setq-local csv-separator-chars (list sep))
+ (setq-local csv--skip-chars (format "^\n\\%c" sep))
+ (setq-local csv-separator-regexp (regexp-quote (string sep)))
+ (setq-local csv-font-lock-keywords
+ `((,csv-separator-regexp (0 'csv-separator-face))))
+ (font-lock-refresh-defaults))
+
;;;###autoload
(add-to-list 'auto-mode-alist '("\\.[Cc][Ss][Vv]\\'" . csv-mode))
@@ -1728,6 +1749,104 @@ setting works better)."
(jit-lock-unregister #'csv--jit-align)
(csv--jit-unalign (point-min) (point-max))))
(csv--header-flush))
+\f
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;; Separator guessing
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(defvar csv--preferred-separators
+ '(?\t ?\s ?, ?: ?\;)
+ "Preferred separator characters in case of a tied score.")
+
+(defun csv-guess-set-separator ()
+ "Guess and set the CSV separator of the current buffer.
+
+Add it to the mode hook to have CSV mode guess and set the
+separator automatically when visiting a buffer:
+
+ (add-hook \\='csv-mode-hook \\='csv-guess-set-separator)"
+ (interactive)
+ (let ((sep (csv-guess-separator
+ (buffer-substring-no-properties
+ (point-min)
+ ;; We're probably only going to look at the first 2048
+ ;; or so chars, but take more than we probably need to
+ ;; minimize the chance of breaking the input in the
+ ;; middle of a (long) row.
+ (min 8192 (point-max)))
+ 2048)))
+ (when sep
+ (csv-set-separator sep))))
+
+(defun csv-guess-separator (text &optional cutoff)
+ "Return a guess of which character is the CSV separator in TEXT."
+ (let ((best-separator nil)
+ (best-score 0))
+ (dolist (candidate (csv--separator-candidates text cutoff))
+ (let ((candidate-score
+ (csv--separator-score candidate text cutoff)))
+ (when (or (> candidate-score best-score)
+ (and (= candidate-score best-score)
+ (member candidate csv--preferred-separators)))
+ (setq best-separator candidate)
+ (setq best-score candidate-score))))
+ best-separator))
+
+(defun csv--separator-candidates (text &optional cutoff)
+ "Return a list of candidate CSV separators in TEXT.
+When CUTOFF is passed, look only at the first CUTOFF number of characters."
+ (let ((chars (make-hash-table)))
+ (dolist (c (string-to-list
+ (if cutoff
+ (substring text 0 (min cutoff (length text)))
+ text)))
+ (when (and (not (gethash c chars))
+ (or (= c ?\t)
+ (and (not (member c '(?. ?/ ?\" ?')))
+ (not (member (get-char-code-property c 'general-category)
+ '(Lu Ll Lt Lm Lo Nd Nl No Ps Pe Cc Co))))))
+ (puthash c t chars)))
+ (hash-table-keys chars)))
+
+(defun csv--separator-score (separator text &optional cutoff)
+ "Return a score on how likely SEPARATOR is a separator in TEXT.
+
+When CUTOFF is passed, stop the calculation at the next whole
+line after having read CUTOFF number of characters.
+
+The scoring is based on the idea that most CSV data is tabular,
+i.e. separators should appear equally often on each line.
+Furthermore, more commonly appearing characters are scored higher
+than those who appear less often.
+
+Adapted from the paper \"Wrangling Messy CSV Files by Detecting
+Row and Type Patterns\" by Gerrit J.J. van den Burg , Alfredo
+Nazábal, and Charles Sutton: https://arxiv.org/abs/1811.11242."
+ (let ((groups
+ (with-temp-buffer
+ (csv-set-separator separator)
+ (save-excursion
+ (insert text))
+ (let ((groups (make-hash-table))
+ (chars-read 0))
+ (while (and (/= (point) (point-max))
+ (or (not cutoff)
+ (< chars-read cutoff)))
+ (let* ((lep (line-end-position))
+ (nfields (length (csv--collect-fields lep))))
+ (cl-incf (gethash nfields groups 0))
+ (cl-incf chars-read (- lep (point)))
+ (goto-char (+ lep 1))))
+ groups)))
+ (sum 0))
+ (maphash
+ (lambda (length num)
+ (cl-incf sum (* num (/ (- length 1) (float length)))))
+ groups)
+ (let ((unique-groups (hash-table-count groups)))
+ (if (= 0 unique-groups)
+ 0
+ (/ sum unique-groups)))))
;;; TSV support
--
2.35.1
* bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
2022-05-09 11:03 ` Simen Heggestøyl
@ 2022-05-09 11:28 ` Mattias Engdegård
2022-05-12 19:59 ` Simen Heggestøyl
0 siblings, 1 reply; 7+ messages in thread
From: Mattias Engdegård @ 2022-05-09 11:28 UTC (permalink / raw)
To: Simen Heggestøyl; +Cc: 55315
9 maj 2022 kl. 13.03 skrev Simen Heggestøyl <simenheg@runbox.com>:
> Updated patch with your proposed fix attached.
Thanks, looks fine with respect to the regexp and skip-set generation.
For the remainder of the patch (the vast bulk) you are probably more qualified to judge!
By the way, thanks for the reference to the CSV wrangling paper.
* bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
2022-05-09 11:28 ` Mattias Engdegård
@ 2022-05-12 19:59 ` Simen Heggestøyl
0 siblings, 0 replies; 7+ messages in thread
From: Simen Heggestøyl @ 2022-05-12 19:59 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 55315-done
Mattias Engdegård <mattiase@acm.org> writes:
> 9 maj 2022 kl. 13.03 skrev Simen Heggestøyl <simenheg@runbox.com>:
>
>> Updated patch with your proposed fix attached.
>
> Thanks, looks fine with respect to the regexp and skip-set generation.
Good! Thanks for taking another look. I've merged the patch.
-- Simen
* bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
@ 2022-05-08 14:12 Simen Heggestøyl
0 siblings, 0 replies; 7+ messages in thread
From: Simen Heggestøyl @ 2022-05-08 14:12 UTC (permalink / raw)
To: 55315
[-- Attachment #1: Type: text/plain, Size: 598 bytes --]
Hi.
Attached is a proposed patch to csv-mode.el in GNU ELPA which adds CSV
separator guessing functionality to CSV mode.
It adds two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.
The idea is that `csv-guess-set-separator' can be useful to add to the
mode hook to have CSV mode guess and set the separator automatically
when visiting a buffer:
(add-hook 'csv-mode-hook 'csv-guess-set-separator)
I've been using it myself for the past few weeks and have been happy
with it so far.
[-- Attachment #2: 0001-Add-CSV-separator-guessing-functionality.patch --]
[-- Type: text/x-diff, Size: 14370 bytes --]
From 7414f7e17ede47c392ce8d401d28ef17513c10e7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Simen=20Heggest=C3=B8yl?= <simenheg@runbox.com>
Date: Sun, 8 May 2022 16:01:35 +0200
Subject: [PATCH] Add CSV separator guessing functionality
Add two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.
`csv-guess-set-separator' can be useful to add to the mode hook to
have CSV mode guess and set the separator automatically when visiting
a buffer:
(add-hook 'csv-mode-hook 'csv-guess-set-separator)
* csv-mode.el (csv-separators): Properly quote regexp values.
(csv--set-separator-history, csv--preferred-separators): New
variables.
(csv-set-separator, csv-guess-set-separator)
(csv-guess-separator, csv--separator-candidates)
(csv--separator-score): New functions.
* csv-mode-tests.el (csv-tests--data): New test data.
(csv-tests-guess-separator, csv-tests-separator-candidates)
(csv-tests-separator-score): New tests.
---
csv-mode-tests.el | 80 ++++++++++++++++++++-------
csv-mode.el | 138 +++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 188 insertions(+), 30 deletions(-)
diff --git a/csv-mode-tests.el b/csv-mode-tests.el
index 316dc4bb93..0caeab7d80 100644
--- a/csv-mode-tests.el
+++ b/csv-mode-tests.el
@@ -1,8 +1,8 @@
;;; csv-mode-tests.el --- Tests for CSV mode -*- lexical-binding: t; -*-
-;; Copyright (C) 2020 Free Software Foundation, Inc
+;; Copyright (C) 2020-2022 Free Software Foundation, Inc
-;; Author: Simen Heggestøyl <simenheg@gmail.com>
+;; Author: Simen Heggestøyl <simenheg@runbox.com>
;; Keywords:
;; This program is free software; you can redistribute it and/or modify
@@ -28,83 +28,121 @@
(require 'csv-mode)
(eval-when-compile (require 'subr-x))
-(ert-deftest csv-mode-tests-end-of-field ()
+(ert-deftest csv-tests-end-of-field ()
(with-temp-buffer
(csv-mode)
(insert "aaa,bbb")
(goto-char (point-min))
(csv-end-of-field)
- (should (equal (buffer-substring (point-min) (point))
- "aaa"))
+ (should (equal (buffer-substring (point-min) (point)) "aaa"))
(forward-char)
(csv-end-of-field)
(should (equal (buffer-substring (point-min) (point))
"aaa,bbb"))))
-(ert-deftest csv-mode-tests-end-of-field-with-quotes ()
+(ert-deftest csv-tests-end-of-field-with-quotes ()
(with-temp-buffer
(csv-mode)
(insert "aaa,\"b,b\"")
(goto-char (point-min))
(csv-end-of-field)
- (should (equal (buffer-substring (point-min) (point))
- "aaa"))
+ (should (equal (buffer-substring (point-min) (point)) "aaa"))
(forward-char)
(csv-end-of-field)
(should (equal (buffer-substring (point-min) (point))
"aaa,\"b,b\""))))
-(ert-deftest csv-mode-tests-beginning-of-field ()
+(ert-deftest csv-tests-beginning-of-field ()
(with-temp-buffer
(csv-mode)
(insert "aaa,bbb")
(csv-beginning-of-field)
- (should (equal (buffer-substring (point) (point-max))
- "bbb"))
+ (should (equal (buffer-substring (point) (point-max)) "bbb"))
(backward-char)
(csv-beginning-of-field)
(should (equal (buffer-substring (point) (point-max))
"aaa,bbb"))))
-(ert-deftest csv-mode-tests-beginning-of-field-with-quotes ()
+(ert-deftest csv-tests-beginning-of-field-with-quotes ()
(with-temp-buffer
(csv-mode)
(insert "aaa,\"b,b\"")
(csv-beginning-of-field)
- (should (equal (buffer-substring (point) (point-max))
- "\"b,b\""))
+ (should (equal (buffer-substring (point) (point-max)) "\"b,b\""))
(backward-char)
(csv-beginning-of-field)
(should (equal (buffer-substring (point) (point-max))
"aaa,\"b,b\""))))
-(defun csv-mode-tests--align-fields (before after)
+(defun csv-tests--align-fields (before after)
(with-temp-buffer
(insert (string-join before "\n"))
(csv-align-fields t (point-min) (point-max))
(should (equal (buffer-string) (string-join after "\n")))))
-(ert-deftest csv-mode-tests-align-fields ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields ()
+ (csv-tests--align-fields
'("aaa,bbb,ccc"
"1,2,3")
'("aaa, bbb, ccc"
"1 , 2 , 3")))
-(ert-deftest csv-mode-tests-align-fields-with-quotes ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-with-quotes ()
+ (csv-tests--align-fields
'("aaa,\"b,b\",ccc"
"1,2,3")
'("aaa, \"b,b\", ccc"
"1 , 2 , 3")))
;; Bug#14053
-(ert-deftest csv-mode-tests-align-fields-double-quote-comma ()
- (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-double-quote-comma ()
+ (csv-tests--align-fields
'("1,2,3"
"a,\"b\"\"c,\",d")
'("1, 2 , 3"
"a, \"b\"\"c,\", d")))
+(defvar csv-tests--data
+ "1,4;Sun, 2022-04-10;4,12
+8;Mon, 2022-04-11;3,19
+3,2;Tue, 2022-04-12;1,00
+2;Wed, 2022-04-13;0,37
+9;Wed, 2022-04-13;0,37")
+
+(ert-deftest csv-tests-guess-separator ()
+ (should-not (csv-guess-separator ""))
+ (should (= (csv-guess-separator csv-tests--data 3) ?,))
+ (should (= (csv-guess-separator csv-tests--data) ?\;))
+ (should (= (csv-guess-separator csv-tests--data)
+ (csv-guess-separator csv-tests--data
+ (length csv-tests--data)))))
+
+(ert-deftest csv-tests-separator-candidates ()
+ (should-not (csv--separator-candidates ""))
+ (should-not (csv--separator-candidates csv-tests--data 0))
+ (should
+ (equal (sort (csv--separator-candidates csv-tests--data 4) #'<)
+ '(?, ?\;)))
+ (should
+ (equal (sort (csv--separator-candidates csv-tests--data) #'<)
+ '(?\s ?, ?- ?\;)))
+ (should
+ (equal
+ (sort (csv--separator-candidates csv-tests--data) #'<)
+ (sort (csv--separator-candidates csv-tests--data
+ (length csv-tests--data))
+ #'<))))
+
+(ert-deftest csv-tests-separator-score ()
+ (should (< (csv--separator-score ?, csv-tests--data)
+ (csv--separator-score ?\s csv-tests--data)
+ (csv--separator-score ?- csv-tests--data)))
+ (should (= (csv--separator-score ?- csv-tests--data)
+ (csv--separator-score ?\; csv-tests--data)))
+ (should (= 0 (csv--separator-score ?\; csv-tests--data 0)))
+ (should (= (csv--separator-score ?\; csv-tests--data)
+ (csv--separator-score ?\; csv-tests--data
+ (length csv-tests--data)))))
+
(provide 'csv-mode-tests)
;;; csv-mode-tests.el ends here
diff --git a/csv-mode.el b/csv-mode.el
index 10ce166052..f31f0da1f5 100644
--- a/csv-mode.el
+++ b/csv-mode.el
@@ -1,11 +1,11 @@
;;; csv-mode.el --- Major mode for editing comma/char separated values -*- lexical-binding: t -*-
-;; Copyright (C) 2003, 2004, 2012-2020 Free Software Foundation, Inc
+;; Copyright (C) 2003, 2004, 2012-2022 Free Software Foundation, Inc
;; Author: "Francis J. Wright" <F.J.Wright@qmul.ac.uk>
;; Maintainer: emacs-devel@gnu.org
;; Version: 1.19
-;; Package-Requires: ((emacs "24.1") (cl-lib "0.5"))
+;; Package-Requires: ((emacs "27.1") (cl-lib "0.5"))
;; Keywords: convenience
;; This package is free software; you can redistribute it and/or modify
@@ -119,7 +119,9 @@
;;; Code:
-(eval-when-compile (require 'cl-lib))
+(eval-when-compile
+ (require 'cl-lib)
+ (require 'subr-x))
(defgroup CSV nil
"Major mode for editing files of comma-separated value type."
@@ -163,12 +165,14 @@ session. Use `customize-set-variable' instead if that is required."
(error "%S is already a quote" x)))
value)
(custom-set-default variable value)
- (setq csv-separator-chars (mapcar #'string-to-char value)
- csv--skip-chars (apply #'concat "^\n" csv-separators)
- csv-separator-regexp (apply #'concat `("[" ,@value "]"))
- csv-font-lock-keywords
- ;; NB: csv-separator-face variable evaluates to itself.
- `((,csv-separator-regexp (0 'csv-separator-face))))))
+ (setq csv-separator-chars (mapcar #'string-to-char value))
+ (let ((quoted-value (mapcar #'regexp-quote value)))
+ (setq csv--skip-chars (apply #'concat "^\n" quoted-value))
+ (setq csv-separator-regexp
+ (apply #'concat `("[" ,@quoted-value "]"))))
+ (setq csv-font-lock-keywords
+ ;; NB: csv-separator-face variable evaluates to itself.
+ `((,csv-separator-regexp (0 'csv-separator-face))))))
(defcustom csv-field-quotes '("\"")
"Field quotes: a list of *single-character* strings.
@@ -368,6 +372,24 @@ It must be either a string or nil."
(modify-syntax-entry ?\n ">" csv-mode-syntax-table))
(setq csv-comment-start string))
+(defvar csv--set-separator-history nil)
+
+(defun csv-set-separator (sep)
+ "Set the CSV separator in the current buffer to SEP."
+ (interactive (list (read-char-from-minibuffer
+ "Separator: " nil 'csv--set-separator-history)))
+ (when (and (boundp 'csv-field-quotes)
+ (member (string sep) csv-field-quotes))
+ (error "%c is already a quote" sep))
+ (setq-local csv-separators (list (string sep)))
+ (setq-local csv-separator-chars (list sep))
+ (let ((quoted-sep (regexp-quote (string sep))))
+ (setq-local csv--skip-chars (format "^\n%s" quoted-sep))
+ (setq-local csv-separator-regexp (format "[%s]" quoted-sep)))
+ (setq-local csv-font-lock-keywords
+ `((,csv-separator-regexp (0 'csv-separator-face))))
+ (font-lock-refresh-defaults))
+
;;;###autoload
(add-to-list 'auto-mode-alist '("\\.[Cc][Ss][Vv]\\'" . csv-mode))
@@ -1728,6 +1750,104 @@ setting works better)."
(jit-lock-unregister #'csv--jit-align)
(csv--jit-unalign (point-min) (point-max))))
(csv--header-flush))
+\f
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;; Separator guessing
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(defvar csv--preferred-separators
+ '(?\t ?\s ?, ?: ?\;)
+ "Preferred separator characters in case of a tied score.")
+
+(defun csv-guess-set-separator ()
+ "Guess and set the CSV separator of the current buffer.
+
+Add it to the mode hook to have CSV mode guess and set the
+separator automatically when visiting a buffer:
+
+ (add-hook \\='csv-mode-hook \\='csv-guess-set-separator)"
+ (interactive)
+ (let ((sep (csv-guess-separator
+ (buffer-substring-no-properties
+ (point-min)
+ ;; We're probably only going to look at the first 2048
+ ;; or so chars, but take more than we probably need to
+ ;; minimize the chance of breaking the input in the
+ ;; middle of a (long) row.
+ (min 8192 (point-max)))
+ 2048)))
+ (when sep
+ (csv-set-separator sep))))
+
+(defun csv-guess-separator (text &optional cutoff)
+ "Return a guess of which character is the CSV separator in TEXT."
+ (let ((best-separator nil)
+ (best-score 0))
+ (dolist (candidate (csv--separator-candidates text cutoff))
+ (let ((candidate-score
+ (csv--separator-score candidate text cutoff)))
+ (when (or (> candidate-score best-score)
+ (and (= candidate-score best-score)
+ (member candidate csv--preferred-separators)))
+ (setq best-separator candidate)
+ (setq best-score candidate-score))))
+ best-separator))
+
+(defun csv--separator-candidates (text &optional cutoff)
+ "Return a list of candidate CSV separators in TEXT.
+When CUTOFF is passed, look only at the first CUTOFF number of characters."
+ (let ((chars (make-hash-table)))
+ (dolist (c (string-to-list
+ (if cutoff
+ (substring text 0 (min cutoff (length text)))
+ text)))
+ (when (and (not (gethash c chars))
+ (or (= c ?\t)
+ (and (not (member c '(?. ?/ ?\" ?')))
+ (not (member (get-char-code-property c 'general-category)
+ '(Lu Ll Lt Lm Lo Nd Nl No Ps Pe Cc Co))))))
+ (puthash c t chars)))
+ (hash-table-keys chars)))
+
+(defun csv--separator-score (separator text &optional cutoff)
+ "Return a score on how likely SEPARATOR is a separator in TEXT.
+
+When CUTOFF is passed, stop the calculation at the next whole
+line after having read CUTOFF number of characters.
+
+The scoring is based on the idea that most CSV data is tabular,
+i.e. separators should appear equally often on each line.
+Furthermore, more commonly appearing characters are scored higher
+than those that appear less often.
+
+Adapted from the paper \"Wrangling Messy CSV Files by Detecting
+Row and Type Patterns\" by Gerrit J.J. van den Burg, Alfredo
+Nazábal, and Charles Sutton: https://arxiv.org/abs/1811.11242."
+ (let ((groups
+ (with-temp-buffer
+ (csv-set-separator separator)
+ (save-excursion
+ (insert text))
+ (let ((groups (make-hash-table))
+ (chars-read 0))
+ (while (and (/= (point) (point-max))
+ (or (not cutoff)
+ (< chars-read cutoff)))
+ (let* ((lep (line-end-position))
+ (nfields (length (csv--collect-fields lep))))
+ (cl-incf (gethash nfields groups 0))
+ (cl-incf chars-read (- lep (point)))
+ (goto-char (+ lep 1))))
+ groups)))
+ (sum 0))
+ (maphash
+ (lambda (length num)
+ (cl-incf sum (* num (/ (- length 1) (float length)))))
+ groups)
+ (let ((unique-groups (hash-table-count groups)))
+ (if (= 0 unique-groups)
+ 0
+ (/ sum unique-groups)))))
;;; TSV support
--
2.35.1
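
The scoring arithmetic in `csv--separator-score' can be tried in isolation. The sketch below is an illustration, not code from the patch: the function name `my-csv-score-from-field-counts' is hypothetical, and it takes a ready-made list of per-line field counts instead of parsing a buffer with `csv--collect-fields'. Lines are grouped by field count, each group contributes (lines in group) * (n - 1) / n, and the total is divided by the number of distinct groups.

```elisp
(require 'cl-lib)

;; Hypothetical helper, not part of csv-mode.
(defun my-csv-score-from-field-counts (field-counts)
  "Return a tabularity score for FIELD-COUNTS.
FIELD-COUNTS is a list with the number of fields found on each line."
  (let ((groups (make-hash-table))
        (sum 0.0))
    ;; Count how many lines share each field count.
    (dolist (n field-counts)
      (cl-incf (gethash n groups 0)))
    ;; Lines with more fields score higher; a single field scores 0.
    (maphash (lambda (nfields nlines)
               (cl-incf sum (* nlines (/ (- nfields 1) (float nfields)))))
             groups)
    ;; Dividing by the number of distinct field counts penalizes
    ;; separators that split lines unevenly.
    (if (zerop (hash-table-count groups))
        0
      (/ sum (hash-table-count groups)))))

;; Field counts for the five lines of `csv-tests--data', worked out
;; by hand from the number of separators on each line.
(my-csv-score-from-field-counts '(3 3 3 3 3)) ; `;': 5 * 2/3, about 3.33
(my-csv-score-from-field-counts '(2 2 2 2 2)) ; space: 5 * 1/2 = 2.5
(my-csv-score-from-field-counts '(4 3 4 3 3)) ; `,': (1.5 + 2.0) / 2 = 1.75
```

This ordering agrees with `csv-tests-separator-score', where the comma scores below the space, which in turn scores below the hyphen and the semicolon.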