From: Utkarsh Singh <utkarsh190601@gmail.com>
To: Nicolas Goaziou <mail@nicolasgoaziou.fr>
Cc: 47885@debbugs.gnu.org, emacs-orgmode@gnu.org
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Mon, 19 Apr 2021 19:53:25 +0530
Message-ID: <87k0oyfj4y.fsf@gmail.com>
In-Reply-To: <8735vmelfs.fsf@nicolasgoaziou.fr> (Nicolas Goaziou's message of "Mon, 19 Apr 2021 10:19:03 +0200")

On 2021-04-19, 10:19 +0200, Nicolas Goaziou <mail@nicolasgoaziou.fr> wrote:

>> My previous patch proposed to add support for importing file with
>> arbitrary name and building upon that this patch tries to make use of it
>> by making org-table-import smarter by simply adding more separators
>> (delimiters).
>
> Good idea, thank you. Some comments follow.
>
>> +(defun org-table-guess-separator (beg0 end0)
>> +  "Guess separator for `org-table-convert-region' for region BEG0 to END0.
>> +
>> +List of preferred separator:
>> +comma, TAB, ';', ':' or SPACE
>
> I suggest to use full names everywhere: comma, TAB, semicolon, colon, or
> SPACE.
>
>> +If region contains a line which doesn't contain the required
>> +separator then discard the separator and search again using next
>> +separator."
>> +  (let ((beg (save-excursion
>> +	       (goto-char (min beg0 end0))
>> +	       (beginning-of-line 1)
>> +	       (point)))
>
>   (beginning-of-line 1) + (point) -> (line-beginning-position)
>
> since you don't intent to move point.
>
>> +	(end (save-excursion
>> +	       (goto-char (max beg0 end0))
>> +	       (end-of-line 1)
>> +	       (if (bolp) (backward-char 1) (end-of-line 1))
>
> I'm not sure about what you mean above. First, the second call to
> end-of-line is useless, since you're already at the end of the line.
> Second, what is wrong if point is at an empty line? Why do you want to
> move it back?
>
>> +	       (point))))
>
> You may want to use `line-end-position'.
>
>> +    (save-excursion
>> +      (goto-char beg)
>> +      (cond
>> +       ((not (re-search-forward "^[^\n,]+$" end t)) '(4))
>> +       ((not (re-search-forward "^[^\n\t]+$" end t)) '(16))
>> +       ((not (re-search-forward "^[^\n;]+$" end t)) ";")
>> +       ((not (re-search-forward "^[^\n:]+$" end t)) ":")
>> +       ((not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t)) " ")
>> +       (t nil)))))
>
> I think you need to wrap `save-excursion' around each
> `re-search-forward' call. Otherwise each test starts at the first line
> containing the separator previously tested.
>> +
>>  ;;;###autoload
>>  (defun org-table-convert-region (beg0 end0 &optional separator)
>>    "Convert region to a table.
>> @@ -862,10 +891,7 @@ org-table-convert-region
>>  integer  When a number, use that many spaces, or a TAB, as field separator
>>  regexp   When a regular expression, use it to match the separator
>>  nil      When nil, the command tries to be smart and figure out the
>> -         separator in the following way:
>> -         - when each line contains a TAB, assume TAB-separated material
>> -         - when each line contains a comma, assume CSV material
>> -         - else, assume one or more SPACE characters as separator."
>> +         separator using `org-table-guess-seperator'."

Thanks for reviewing the patch!

> I wonder if creating a new function is warranted here. You could add the
> news checks after those already present in the code, couldn't you?

At first I was also reluctant to create a new function, but I decided to
do so because:

+ org-table-convert-region currently does two things: 'guessing the
separator' and 'converting the region'.  I thought it was a good idea to
split the function into its atomic operations.

+ The current guessing technique is quite basic, as it assumes that the
data (the file to be imported) has no errors or inconsistencies.  I
would like to show you the docstring of the separator-guessing code in
Python's CSV library (the region inside """):

"""
Looks for text enclosed between two identical quotes
(the probable quotechar) which are preceded and followed
by the same character (the probable delimiter).
For example:
    ,'some text',
The quote with the most wins, same with the delimiter.
If there is no quotechar the delimiter can't be determined
this way.
"""

And if that quote-based approach fails, then we have:

"""
The delimiter /should/ occur the same number of times on
each row. However, due to malformed data, it may not. We don't want
an all or nothing approach, so we allow for small variations in this
number.
1) build a table of the frequency of each character on every line.
2) build a table of frequencies of this frequency (meta-frequency?),
e.g.  'x occurred 5 times in 10 rows, 6 times in 1000 rows,
7 times in 2 rows'
3) use the mode of the meta-frequency to determine the /expected/
frequency for that character
4) find out how often the character actually meets that goal
5) the character that best meets its goal is the delimiter
For performance reasons, the data is evaluated in chunks, so it can
try and evaluate the smallest portion of the data possible, evaluating
additional chunks as necessary.
"""

I tried to do something similar in Elisp, but I am currently facing some
issues due to my inexperience in functional programming.  Also, moving
the 'guessing' part out of the function may lead to the development of
an even better algorithm than its Python counterpart.

Modified version of the function in question:

(defun org-table-guess-separator (beg0 end0)
  "Guess separator for `org-table-convert-region' for region BEG0 to END0.

Preferred separators, in order:
comma, TAB, semicolon, colon or SPACE.

If region contains a line which doesn't contain the required
separator then discard the separator and search again using next
separator."
  (let* ((beg (save-excursion
		(goto-char (min beg0 end0))
		(line-beginning-position)))
	 (end (save-excursion
		(goto-char (max beg0 end0))
		(line-end-position)))
	 (sep-rexp '((","  "^[^\n,]+$")
		     ("\t" "^[^\n\t]+$")
		     (";"  "^[^\n;]+$")
		     (":"  "^[^\n:]+$")
		     (" "  "^\\([^'\"][^\n\s][^'\"]\\)+$")))
	 (tmp (car sep-rexp))
	 sep)
    (save-excursion
      (goto-char beg)
      (while (and (not sep)
		  (if (save-excursion
			(not (re-search-forward (nth 1 tmp) end t)))
		      (setq sep (nth 0 tmp))
		    (setq sep-rexp (cdr sep-rexp))
		    (setq tmp (car sep-rexp)))))
    sep)))

Version without using iteration:

(defun org-table-guess-separator (beg0 end0)
  "Guess separator for `org-table-convert-region' for region BEG0 to END0.

Preferred separators, in order:
comma, TAB, semicolon, colon or SPACE.

If region contains a line which doesn't contain the required
separator then discard the separator and search again using next
separator."
  (let ((beg (save-excursion
	       (goto-char (min beg0 end0))
	       (line-beginning-position)))
	(end (save-excursion
	       (goto-char (max beg0 end0))
	       (line-end-position))))
    (save-excursion
      (goto-char beg)
      (cond
       ((save-excursion (not (re-search-forward "^[^\n,]+$" end t))) ",")
       ((save-excursion (not (re-search-forward "^[^\n\t]+$" end t))) "\t")
       ((save-excursion (not (re-search-forward "^[^\n;]+$" end t))) ";")
       ((save-excursion (not (re-search-forward "^[^\n:]+$" end t))) ":")
       ((save-excursion (not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t))) " ")
       (t nil)))))

--
Utkarsh Singh
http://utkarshsingh.xyz

