From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <emacs-orgmode-bounces+larch=yhetil.org@gnu.org>
Received: from mp2 ([2001:41d0:2:bcc0::])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	by ms0.migadu.com with LMTPS
	id QC6+FH2SfWD5TQAAgWs5BA
	(envelope-from <emacs-orgmode-bounces+larch=yhetil.org@gnu.org>)
	for <larch@yhetil.org>; Mon, 19 Apr 2021 16:23:57 +0200
Received: from aspmx1.migadu.com ([2001:41d0:2:bcc0::])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	by mp2 with LMTPS
	id gKFLEH2SfWAwQAAAB5/wlQ
	(envelope-from <emacs-orgmode-bounces+larch=yhetil.org@gnu.org>)
	for <larch@yhetil.org>; Mon, 19 Apr 2021 14:23:57 +0000
Received: from lists.gnu.org (lists.gnu.org [209.51.188.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by aspmx1.migadu.com (Postfix) with ESMTPS id 6C57B8ACC
	for <larch@yhetil.org>; Mon, 19 Apr 2021 16:23:56 +0200 (CEST)
Received: from localhost ([::1]:57068 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-orgmode-bounces+larch=yhetil.org@gnu.org>)
	id 1lYUod-0006v9-18
	for larch@yhetil.org; Mon, 19 Apr 2021 10:23:55 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:35872)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <utkarsh190601@gmail.com>)
 id 1lYUoB-0006us-HU
 for emacs-orgmode@gnu.org; Mon, 19 Apr 2021 10:23:27 -0400
Received: from mail-pf1-x42e.google.com ([2607:f8b0:4864:20::42e]:46820)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <utkarsh190601@gmail.com>)
 id 1lYUo9-00052n-Jm
 for emacs-orgmode@gnu.org; Mon, 19 Apr 2021 10:23:27 -0400
Received: by mail-pf1-x42e.google.com with SMTP id d124so23250682pfa.13
 for <emacs-orgmode@gnu.org>; Mon, 19 Apr 2021 07:23:25 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=from:to:subject:references:cc:date:in-reply-to:message-id
 :user-agent:mime-version;
 bh=13IPtaGmOOwaxYsyBhyaECe/uc6zsM4sOyXxKnHgWtc=;
 b=IxswMqqIZMLZ4uE5f/ycRpJsi8wS2MxLhL9LEEHVtkl4elLep2TVIyPHKPfhYizAA3
 yjdBCfszxXE2JGkksxHBWhvSFdq4mc9X5PvtJqZIP37hxrFsi8zwqjbrJXCkyvphKFwo
 u3lb1WU7ZhGLpd7EYe9/fWPcLqMfELgB82X7hLSoPVbe679q46YKw3hF16uv6jXrQBmG
 mCMOJ5zGlo3YrSozN0pAH3eSvbi5JpcYszDqXkuFWif6C+TS6W8W89uKKzWK35hJW+53
 tgmpr+5lm1jQsDaKtKf4kn3+UTTtirH6IjNyoQJY6zO4ituIKKRF7fcpltH9DRprU7gR
 2D2Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:subject:references:cc:date:in-reply-to
 :message-id:user-agent:mime-version;
 bh=13IPtaGmOOwaxYsyBhyaECe/uc6zsM4sOyXxKnHgWtc=;
 b=t1t8ld//hDN2PPs7Evr4Jz4v3RNx2OJJbdKjddEAPCjEbFnCol4r9T0kKwAPv5I7QH
 ouhDdkW8ZQsuUwkDQ3cCUIxaKjYuyrW4WRSuGjC4GEu0qx7Nb9iqLXGixPPU2qZHdcuH
 L4lOLaDFArkji3jrWSVILzsj4xJ+RVEmeOhPGBAv6IP6S6+TzwJSIQYGExF3Ir+CkU2G
 /CPZFwpI/rB7U/3vZXn6V4CAUEh0lEyzqJriA67/d9ekB81Ik/QJxUrZoKjEG7FCHbRT
 JACa6/UYyRdhKmN+/cipu8e3bsQeEbA4UvNzQMT9cmzmUNbgbpMOnwMzmd2Ye7e+8oJS
 CmNw==
X-Gm-Message-State: AOAM5319yNkCts8XlhMiF78AA3JUh54mQDd4PgS/uhukjrupAZNafpdO
 omT49TuGGjwYtTjOKtL8WfjvEBxwEng=
X-Google-Smtp-Source: ABdhPJyQBhJX/SrGLyTgwDPXMwEkzgMBXOZrqQFzRxpPqNGH6iclHZGN664s/1F9gcnSEmkeNcBekw==
X-Received: by 2002:a63:f443:: with SMTP id p3mr12122015pgk.378.1618842203762; 
 Mon, 19 Apr 2021 07:23:23 -0700 (PDT)
Received: from localhost ([45.251.50.123])
 by smtp.gmail.com with ESMTPSA id i66sm10538396pgd.8.2021.04.19.07.23.22
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 19 Apr 2021 07:23:23 -0700 (PDT)
From: Utkarsh Singh <utkarsh190601@gmail.com>
To: Nicolas Goaziou <mail@nicolasgoaziou.fr>
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
References: <87czuq9958.fsf@gmail.com> <8735vmelfs.fsf@nicolasgoaziou.fr>
Date: Mon, 19 Apr 2021 19:53:25 +0530
In-Reply-To: <8735vmelfs.fsf@nicolasgoaziou.fr> (Nicolas Goaziou's message of
 "Mon, 19 Apr 2021 10:19:03 +0200")
Message-ID: <87k0oyfj4y.fsf@gmail.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Received-SPF: pass client-ip=2607:f8b0:4864:20::42e;
 envelope-from=utkarsh190601@gmail.com; helo=mail-pf1-x42e.google.com
X-Spam_score_int: 1
X-Spam_score: 0.1
X-Spam_bar: /
X-Spam_report: (0.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
 DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25,
 FREEMAIL_FROM=0.001, PDS_OTHER_BAD_TLD=1.999, RCVD_IN_DNSWL_NONE=-0.0001,
 SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=no autolearn_force=no
X-Spam_action: no action
X-BeenThere: emacs-orgmode@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "General discussions about Org-mode." <emacs-orgmode.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-orgmode>,
 <mailto:emacs-orgmode-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-orgmode>
List-Post: <mailto:emacs-orgmode@gnu.org>
List-Help: <mailto:emacs-orgmode-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-orgmode>,
 <mailto:emacs-orgmode-request@gnu.org?subject=subscribe>
Cc: 47885@debbugs.gnu.org, emacs-orgmode@gnu.org
Errors-To: emacs-orgmode-bounces+larch=yhetil.org@gnu.org
Sender: "Emacs-orgmode" <emacs-orgmode-bounces+larch=yhetil.org@gnu.org>
X-Migadu-Flow: FLOW_IN
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org;
	s=key1; t=1618842236;
	h=from:from:sender:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:in-reply-to:in-reply-to:
	 references:references:list-id:list-help:list-unsubscribe:
	 list-subscribe:list-post:dkim-signature;
	bh=13IPtaGmOOwaxYsyBhyaECe/uc6zsM4sOyXxKnHgWtc=;
	b=pI9oWsjghYrJrL8pRZbYFqkI1OnqM47l+ovGA7Q1WY8l64y9wUOcfbHuGuiSgERVyx4pzy
	yKSVexmiv39UI8pt/D6kL9Q+KJrniw8Uv7nxx6KRTAlKrElc3EWq3uUpTfmHHw/7LvHmVd
	g4Yzsas31g2kseNeiKEYwIOEwRPdgmZqVL42L/I1OUPwvyuZpQc3h0PKsAZXbhnlM0wUqk
	67wrc+3n4tmFUq7XP1UlzMIgQULnioo41xm7dOuJtcEss4F6CXLR8ZMwXW25FHMbwZYPZZ
	F01xphDQpE5GfwP9YhB6zVrmEvV0PRBVowEVhBnLabvqNcwPgzg6Rw8DMP8Kjg==
ARC-Seal: i=1; s=key1; d=yhetil.org; t=1618842236; a=rsa-sha256; cv=none;
	b=DPGfCA6IdM3kqBGRlUANd0UmuR7EmlWAJx39NvqL2Ff8t/mdBk+rWItqzwcX9hqv2k1w5r
	Mw6Dyo8S3qufyQeLW+/2S3B4BLbiGgcP1Q9aiWbfjleJ2RQI9ezHYwuAxLu20UN++/1pco
	d+SKgxRWoeu8hBkGIrUZhn7dh8iGQZFHeAsnhb0Dw1KSI31z/stzAsup3nfn0KYwDbVfdq
	6vzspNTNBmrC51H5C2igFMDim6tCQwWBJY7JqvfbTCC6AH7yjKDUH4Zj2mD+kdBuMEEz4h
	Ge/093k6MqeeqH9DcuoKGM5iPiwqJE1Ysbso3BtWJS85wFF/apEypi7IlqwEkg==
ARC-Authentication-Results: i=1;
	aspmx1.migadu.com;
	dkim=pass header.d=gmail.com header.s=20161025 header.b=IxswMqqI;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (aspmx1.migadu.com: domain of emacs-orgmode-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=emacs-orgmode-bounces@gnu.org
X-Migadu-Spam-Score: -3.76
Authentication-Results: aspmx1.migadu.com;
	dkim=pass header.d=gmail.com header.s=20161025 header.b=IxswMqqI;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (aspmx1.migadu.com: domain of emacs-orgmode-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=emacs-orgmode-bounces@gnu.org
X-Migadu-Queue-Id: 6C57B8ACC
X-Spam-Score: -3.76
X-Migadu-Scanner: scn0.migadu.com
X-TUID: cCN2KSz4JsZi

On 2021-04-19, 10:19 +0200, Nicolas Goaziou <mail@nicolasgoaziou.fr> wrote:

>> My previous patch proposed to add support for importing file with
>> arbitrary name and building upon that this patch tries to make use of it
>> by making org-table-import smarter by simply adding more separators
>> (delimiters).
>
> Good idea, thank you. Some comments follow.
>
>> +(defun org-table-guess-separator (beg0 end0)
>> +  "Guess separator for `org-table-convert-region' for region BEG0 to END0.
>> +
>> +List of preferred separator:
>> +comma, TAB, ';', ':' or SPACE
>
> I suggest to use full names everywhere: comma, TAB, semicolon, colon, or
> SPACE.
>
>> +If region contains a line which doesn't contain the required
>> +separator then discard the separator and search again using next
>> +separator."
>> +  (let ((beg (save-excursion
>> +	       (goto-char (min beg0 end0))
>> +	       (beginning-of-line 1)
>> +	       (point)))
>
>   (beginning-of-line 1) + (point) -> (line-beginning-position)
>
> since you don't intent to move point.
>
>> +	(end (save-excursion
>> +	       (goto-char (max beg0 end0))
>> +	       (end-of-line 1)
>> +	       (if (bolp) (backward-char 1) (end-of-line 1))
>
> I'm not sure about what you mean above. First, the second call to
> end-of-line is useless, since you're already at the end of the line.
> Second, what is wrong if point is at an empty line? Why do you want to
> move it back?
>
>> +	       (point))))
>
> You may want to use `line-end-position'.
>
>> +    (save-excursion
>> +      (goto-char beg)
>> +      (cond
>> +       ((not (re-search-forward "^[^\n,]+$" end t)) '(4))
>> +       ((not (re-search-forward "^[^\n\t]+$" end t)) '(16))
>> +       ((not (re-search-forward "^[^\n;]+$" end t)) ";")
>> +       ((not (re-search-forward "^[^\n:]+$" end t)) ":")
>> +       ((not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t)) " ")
>> +       (t nil)))))
>
> I think you need to wrap `save-excursion' around each
> `re-search-forward' call. Otherwise each test starts at the first line
> containing the separator previously tested.
>> +
>>  ;;;###autoload
>>  (defun org-table-convert-region (beg0 end0 &optional separator)
>>    "Convert region to a table.
>> @@ -862,10 +891,7 @@ org-table-convert-region
>>  integer  When a number, use that many spaces, or a TAB, as field separator
>>  regexp   When a regular expression, use it to match the separator
>>  nil      When nil, the command tries to be smart and figure out the
>> -         separator in the following way:
>> -         - when each line contains a TAB, assume TAB-separated material
>> -         - when each line contains a comma, assume CSV material
>> -         - else, assume one or more SPACE characters as separator."
>> +         separator using `org-table-guess-seperator'."

Thanks for reviewing the patch!

> I wonder if creating a new function is warranted here. You could add the
> news checks after those already present in the code, couldn't you?

At first I was also reluctant in creating a new function but decided to
do so because:

+ org-table-convert-region is currently doing two thing 'guessing the
separator' and 'converting the region'.  I thought it was a good idea to
separate out function into it's atomic operations.

+ Current guessing technique is quite basic as it assumes that data
(file that has to be imported) has no error/inconsistency in it.  I
would like to show you the doc string of Python's CSV library
implementation to guess separator (region inside """):

"""
Looks for text enclosed between two identical quotes
(the probable quotechar) which are preceded and followed
by the same character (the probable delimiter).
For example:
    ,'some text',
The quote with the most wins, same with the delimiter.
If there is no quotechar the delimiter can't be determined
this way.
"""

And if this functions fails then we have:

"""
The delimiter /should/ occur the same number of times on
each row. However, due to malformed data, it may not. We don't want
an all or nothing approach, so we allow for small variations in this
number.
1) build a table of the frequency of each character on every line.
2) build a table of frequencies of this frequency (meta-frequency?),
e.g.  'x occurred 5 times in 10 rows, 6 times in 1000 rows,
7 times in 2 rows'
3) use the mode of the meta-frequency to determine the /expected/
frequency for that character
4) find out how often the character actually meets that goal
5) the character that best meets its goal is the delimiter
For performance reasons, the data is evaluated in chunks, so it can
try and evaluate the smallest portion of the data possible, evaluating
additional chunks as necessary.
"""

I tried to do similar in Elisp but currently facing some issues due to
my inexperience in functional programming.  Also moving the 'guessing'
part out the function may lead to development of even better algorithm
than Python counterpart.

Modified version of concerned function:

(defun org-table-guess-separator (beg0 end0)
  "Guess separator for `org-table-convert-region' for region BEG0 to END0.

List of preferred separator:
comma, TAB, semicolon, colon or SPACE.

If region contains a line which doesn't contain the required
separator then discard the separator and search again using next
separator."
  (let* ((beg (save-excursion
		(goto-char (min beg0 end0))
		(line-beginning-position)))
	 (end (save-excursion
		(goto-char (max beg0 end0))
		(line-end-position)))
	 (sep-rexp '((","  "^[^\n,]+$")
		     ("\t" "^[^\n\t]+$")
		     (";"  "^[^\n;]+$")
		     (":"  "^[^\n:]+$")
		     (" "  "^\\([^'\"][^\n\s][^'\"]\\)+$")))
	 (tmp (car sep-rexp))
	 sep)
    (save-excursion
      (goto-char beg)
      (while (and (not sep)
		  (if (save-excursion
			(not (re-search-forward (nth 1 tmp) end t)))
		      (setq sep (nth 0 tmp))
		    (setq sep-rexp (cdr sep-rexp))
		    (setq tmp (car sep-rexp)))))
    sep)))

Version without using iteration:

(defun org-table-guess-separator (beg0 end0)
  "Guess separator for `org-table-convert-region' for region BEG0 to END0.

List of preferred separator:
COMMA, TAB, SEMICOLON, COLON or SPACE.

If region contains a line which doesn't contain the required
separator then discard the separator and search again using next
separator."
  (let ((beg (save-excursion
	       (goto-char (min beg0 end0))
	       (line-beginning-position)))
	(end (save-excursion
	       (goto-char (max beg0 end0))
	       (line-end-position))))
    (save-excursion
      (goto-char beg)
      (cond
       ((save-excursion (not (re-search-forward "^[^\n,]+$" end t))) ",")
       ((save-excursion (not (re-search-forward "^[^\n\t]+$" end t))) "\t")
       ((save-excursion (not (re-search-forward "^[^\n;]+$" end t))) ";")
       ((save-excursion (not (re-search-forward "^[^\n:]+$" end t))) ":")
       ((save-excursion (not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t))) " ")
       (t nil)))))

--
Utkarsh Singh
http://utkarshsingh.xyz