unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#20704: info.el bug fix; Interprets Info format wrongly
@ 2015-05-31 14:54 Teddy Hogeborn
  2015-06-01 14:01 ` Stefan Monnier
  0 siblings, 1 reply; 10+ messages in thread
From: Teddy Hogeborn @ 2015-05-31 14:54 UTC (permalink / raw)
  To: 20704

[-- Attachment #1: Type: text/plain, Size: 2657 bytes --]

The Info file format (see (texinfo)Info Format Tag Table.)  is
documented as having the reference position in bytes.  However, the
info.el functions "Info-find-in-tag-table-1", "Info-read-subfile", and
"Info-search" reads the byte value and adds it to (point-min), which is
a character position, not a byte position.  This causes the Emacs Info
reader to jump to the wrong position in Info files with a lot of
non-ascii characters.  Solution: Convert the read value to position
using byte-to-position:

diff --git a/lisp/info.el b/lisp/info.el
index 80428e7..b179510 100644
--- a/lisp/info.el
+++ b/lisp/info.el
@@ -1020,7 +1020,8 @@ which the match was found."
       (beginning-of-line)
       (when (re-search-forward regexp nil t)
 	(list (string-equal "Ref:" (match-string 1))
-	      (+ (point-min) (read (current-buffer)))
+	      (+ (point-min) (byte-to-position
+                              (read (current-buffer))))
 	      major-mode)))))
 
 (defun Info-find-in-tag-table (marker regexp &optional strict-case)
@@ -1523,7 +1524,9 @@ is non-nil)."
 			thisfilepos thisfilename)
 		    (search-forward ": ")
 		    (setq thisfilename  (buffer-substring beg (- (point) 2)))
-		    (setq thisfilepos (+ (point-min) (read (current-buffer))))
+		    (setq thisfilepos (+ (point-min)
+                                         (byte-to-position
+                                          (read (current-buffer)))))
 		    ;; read in version 19 stops at the end of number.
 		    ;; Advance to the next line.
 		    (forward-line 1)
@@ -2013,9 +2016,11 @@ If DIRECTION is `backward', search in the reverse direction."
 		        (re-search-backward "\\(^.*\\): [0-9]+$")
 		      (re-search-forward "\\(^.*\\): [0-9]+$"))
 		    (goto-char (+ (match-end 1) 2))
-		    (setq list (cons (cons (+ (point-min)
-					      (read (current-buffer)))
-					   (match-string-no-properties 1))
+		    (setq list (cons (cons
+                                      (+ (point-min)
+                                         (byte-to-position
+                                          (read (current-buffer))))
+                                      (match-string-no-properties 1))
 				     list))
 		    (goto-char (if backward
                                    (1- (match-beginning 0))

Suggested ChangeLog:

----
Convert reference byte positions from Info file to character position.

* lisp/info.el (Info-find-in-tag-table-1, Info-read-subfile)
(Info-search): Convert position read from Info file from bytes to
character position.  Patch by Teddy Hogeborn <teddy@recompile.se>.
----

/Teddy Hogeborn

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-05-31 14:54 bug#20704: info.el bug fix; Interprets Info format wrongly Teddy Hogeborn
@ 2015-06-01 14:01 ` Stefan Monnier
  2015-06-01 15:12   ` Eli Zaretskii
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2015-06-01 14:01 UTC (permalink / raw)
  To: Teddy Hogeborn; +Cc: 20704

Thanks,

> +	      (+ (point-min) (byte-to-position
> +                              (read (current-buffer))))

Hmm... this only works if the Info file is encoded in UTF-8.
I guess in the case of Info, 99% of the files are just ASCII and there's
a chance that the vast majority of the rest is (or will be) UTF-8,
so maybe this hack works well in practice.

But I think we should define an `Info-bytepos-to-charpos' function for that.
It can be defined as an alias for byte-to-position, but at least it
concentrates this utf-8 assumption at a single place where we can place
a clear comment.


        Stefan





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-06-01 14:01 ` Stefan Monnier
@ 2015-06-01 15:12   ` Eli Zaretskii
  2015-06-09 11:09     ` Teddy Hogeborn
  2015-06-10 17:50     ` Stefan Monnier
  0 siblings, 2 replies; 10+ messages in thread
From: Eli Zaretskii @ 2015-06-01 15:12 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: teddy, 20704

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 01 Jun 2015 10:01:59 -0400
> Cc: 20704@debbugs.gnu.org
> 
> Thanks,
> 
> > +	      (+ (point-min) (byte-to-position
> > +                              (read (current-buffer))))
> 
> Hmm... this only works if the Info file is encoded in UTF-8.
> I guess in the case of Info, 99% of the files are just ASCII and there's
> a chance that the vast majority of the rest is (or will be) UTF-8,
> so maybe this hack works well in practice.

Using byte-to-position would make things worse for Latin-1 and the
likes.

But it shouldn't be hard to add a simple test of
buffer-file-coding-system: if it states fixed-size encoding, like any
of the 8-bit encodings, or UTF-16, the conversion to character
position is trivial.  AFAIR, the only problems will be with ISO-2022
derived encodings, and those are really rare in Info.  So IMO adding
such a simple test would go a long way towards making the solution
almost perfect.

> But I think we should define an `Info-bytepos-to-charpos' function for that.
> It can be defined as an alias for byte-to-position, but at least it
> concentrates this utf-8 assumption at a single place where we can place
> a clear comment.

Right.

Thanks.





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-06-01 15:12   ` Eli Zaretskii
@ 2015-06-09 11:09     ` Teddy Hogeborn
  2015-06-09 14:29       ` Eli Zaretskii
  2015-06-09 16:01       ` Stefan Monnier
  2015-06-10 17:50     ` Stefan Monnier
  1 sibling, 2 replies; 10+ messages in thread
From: Teddy Hogeborn @ 2015-06-09 11:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 20704

[-- Attachment #1: Type: text/plain, Size: 1202 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

> > > +	      (+ (point-min) (byte-to-position
> > > +                              (read (current-buffer))))
> > 
> > Hmm... this only works if the Info file is encoded in UTF-8.  I
> > guess in the case of Info, 99% of the files are just ASCII and
> > there's a chance that the vast majority of the rest is (or will be)
> > UTF-8, so maybe this hack works well in practice.
>
> Using byte-to-position would make things worse for Latin-1 and the
> likes.

No, byte-to-position already checks for that:

---- src/marker.c, line 302
  /* If this buffer has as many characters as bytes,
     each character must be one byte.
     This takes care of the case where enable-multibyte-characters is nil.  */
  if (best_above == best_above_byte)
    return bytepos;
----

Therefore, an Info file in Latin-1 should work just fine.

> But it shouldn't be hard to add a simple test of
> buffer-file-coding-system: if it states fixed-size encoding, like any
> of the 8-bit encodings, or UTF-16,
> the conversion to character position is trivial.

I think you mean UTF-32 instead of UTF-16, since UTF-16 is variable-
length.

/Teddy Hogeborn

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-06-09 11:09     ` Teddy Hogeborn
@ 2015-06-09 14:29       ` Eli Zaretskii
  2015-06-09 16:01       ` Stefan Monnier
  1 sibling, 0 replies; 10+ messages in thread
From: Eli Zaretskii @ 2015-06-09 14:29 UTC (permalink / raw)
  To: Teddy Hogeborn; +Cc: 20704

> From: Teddy Hogeborn <teddy@recompile.se>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,  20704@debbugs.gnu.org
> Date: Tue, 09 Jun 2015 13:09:08 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > > > +	      (+ (point-min) (byte-to-position
> > > > +                              (read (current-buffer))))
> > > 
> > > Hmm... this only works if the Info file is encoded in UTF-8.  I
> > > guess in the case of Info, 99% of the files are just ASCII and
> > > there's a chance that the vast majority of the rest is (or will be)
> > > UTF-8, so maybe this hack works well in practice.
> >
> > Using byte-to-position would make things worse for Latin-1 and the
> > likes.
> 
> No, byte-to-position already checks for that:
> 
> ---- src/marker.c, line 302
>   /* If this buffer has as many characters as bytes,
>      each character must be one byte.
>      This takes care of the case where enable-multibyte-characters is nil.  */
>   if (best_above == best_above_byte)
>     return bytepos;
> ----

I think you are misreading the code: the above snippet is for unibyte
buffers, whereas a Latin-1 encoded Info file will be read into a
multibyte buffer (and decoded into the internal Emacs representation
of characters during the read).  So this optimization is not going to
work in that case.

IOW, what matters for byte-to-position is the encoding used in
representing characters in Emacs buffers, not the one used externally
by the Info file on disk.

> Therefore, an Info file in Latin-1 should work just fine.
> 
> > But it shouldn't be hard to add a simple test of
> > buffer-file-coding-system: if it states fixed-size encoding, like any
> > of the 8-bit encodings, or UTF-16,
> > the conversion to character position is trivial.
> 
> I think you mean UTF-32 instead of UTF-16, since UTF-16 is variable-
> length.

UTF-16 is fixed length for characters in the BMP.





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-06-09 11:09     ` Teddy Hogeborn
  2015-06-09 14:29       ` Eli Zaretskii
@ 2015-06-09 16:01       ` Stefan Monnier
  1 sibling, 0 replies; 10+ messages in thread
From: Stefan Monnier @ 2015-06-09 16:01 UTC (permalink / raw)
  To: Teddy Hogeborn; +Cc: 20704

>> Using byte-to-position would make things worse for Latin-1 and the
>> likes.

> No, byte-to-position already checks for that:

> ---- src/marker.c, line 302
>   /* If this buffer has as many characters as bytes,
>      each character must be one byte.
>      This takes care of the case where enable-multibyte-characters is nil.  */
>   if (best_above == best_above_byte)
>     return bytepos;
> ----

> Therefore, an Info file in Latin-1 should work just fine.

No, because the representation in the buffer will still be a utf-8
derivative, so best_above will generally not be equal to best_above_byte.


        Stefan





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-06-01 15:12   ` Eli Zaretskii
  2015-06-09 11:09     ` Teddy Hogeborn
@ 2015-06-10 17:50     ` Stefan Monnier
  2015-06-10 18:21       ` Eli Zaretskii
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2015-06-10 17:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: teddy, 20704

> Using byte-to-position would make things worse for Latin-1 and the likes.

There's also the problem of EOL encoding, but I'll just ignore it for now.
Could someone test the patch below?


        Stefan


diff --git a/lisp/info.el b/lisp/info.el
index 9602337..0de7f1e 100644
--- a/lisp/info.el
+++ b/lisp/info.el
@@ -1020,7 +1020,7 @@ which the match was found."
       (beginning-of-line)
       (when (re-search-forward regexp nil t)
 	(list (string-equal "Ref:" (match-string 1))
-	      (+ (point-min) (read (current-buffer)))
+              (filepos-to-bufferpos (read (current-buffer)) 'approximate)
 	      major-mode)))))
 
 (defun Info-find-in-tag-table (marker regexp &optional strict-case)
@@ -1187,7 +1187,8 @@ is non-nil)."
 
 		  (when found
 		    ;; FOUND is (ANCHOR POS MODE).
-		    (setq guesspos (nth 1 found))
+		    (setq guesspos (filepos-to-bufferpos (nth 1 found)
+                                                         'approximate))
 
 		    ;; If this is an indirect file, determine which
 		    ;; file really holds this node and read it in.
@@ -1203,8 +1204,7 @@ is non-nil)."
 		      (throw 'foo t)))))
 
 	      ;; Else we may have a node, which we search for:
-	      (goto-char (max (point-min)
-			      (- (byte-to-position guesspos) 1000)))
+	      (goto-char (max (point-min) (- guesspos 1000)))
 
 	      ;; Now search from our advised position (or from beg of
 	      ;; buffer) to find the actual node.  First, check
@@ -1523,7 +1523,9 @@ is non-nil)."
 			thisfilepos thisfilename)
 		    (search-forward ": ")
 		    (setq thisfilename  (buffer-substring beg (- (point) 2)))
-		    (setq thisfilepos (+ (point-min) (read (current-buffer))))
+		    (setq thisfilepos
+                          (filepos-to-bufferpos (read (current-buffer))
+                                                'approximate))
 		    ;; read in version 19 stops at the end of number.
 		    ;; Advance to the next line.
 		    (forward-line 1)
@@ -1554,7 +1556,7 @@ is non-nil)."
     ;; Don't add the length of the skipped summary segment to
     ;; the value returned to `Info-find-node-2'.  (Bug#14125)
     (if (numberp nodepos)
-	(+ (- nodepos lastfilepos) (point-min)))))
+	(- nodepos lastfilepos))))
 
 (defun Info-unescape-quotes (value)
   "Unescape double quotes and backslashes in VALUE."
@@ -2013,8 +2015,9 @@ If DIRECTION is `backward', search in the reverse direction."
 		        (re-search-backward "\\(^.*\\): [0-9]+$")
 		      (re-search-forward "\\(^.*\\): [0-9]+$"))
 		    (goto-char (+ (match-end 1) 2))
-		    (setq list (cons (cons (+ (point-min)
-					      (read (current-buffer)))
+		    (setq list (cons (cons (filepos-to-bufferpos
+                                            (read (current-buffer))
+                                            'approximate)
 					   (match-string-no-properties 1))
 				     list))
 		    (goto-char (if backward
diff --git a/lisp/international/mule-util.el b/lisp/international/mule-util.el
index eae787b..1f7df0b 100644
--- a/lisp/international/mule-util.el
+++ b/lisp/international/mule-util.el
@@ -313,6 +313,35 @@ per-character basis, this may not be accurate."
 				  (throw 'tag3 charset)))
 			  charset-list)
 		    nil)))))))))
+
+;;;###autoload
+(defun filepos-to-bufferpos (byte &optional quality coding-system)
+  "Try to return the buffer position corresponding to a particular file position.
+The file position is given as a BYTE count.
+The function presumes the file is encoded with CODING-SYSTEM, which defaults
+to `buffer-file-coding-system'.
+QUALITY can be:
+  `approximate', in which case we may cut some corners to avoid
+    excessive work.
+  nil, in which case we may return nil rather than an approximation."
+  ;; `exact', in which case we may end up re-(en|de)coding a large
+  ;;   part of the file.
+  (unless coding-system (setq coding-system buffer-file-coding-system))
+  (let ((eol (coding-system-eol-type coding-system))
+        (type (coding-system-type coding-system))
+        (pm (save-restriction (widen) (point-min))))
+    (pcase (cons type eol)
+      (`(utf-8 . ,(or 0 2))
+       (let ((bom-offset (coding-system-get coding-system :bom)))
+         (byte-to-position
+          (+ pm (max 0 (- byte (if bom-offset 3 0)))))))
+      ;; FIXME: What if it's a 2-byte charset?  Are there such beasts?
+      (`(charset . ,(or 0 2)) (+ pm byte))
+      (_
+       (pcase quality
+         (`approximate (+ pm (byte-to-position byte)))
+         ;; (`exact ...)
+         )))))
 \f
 (provide 'mule-util)
 





^ permalink raw reply related	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-06-10 17:50     ` Stefan Monnier
@ 2015-06-10 18:21       ` Eli Zaretskii
  2015-06-11  3:02         ` Stefan Monnier
  0 siblings, 1 reply; 10+ messages in thread
From: Eli Zaretskii @ 2015-06-10 18:21 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: teddy, 20704

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: teddy@recompile.se, 20704@debbugs.gnu.org
> Date: Wed, 10 Jun 2015 13:50:29 -0400
> 
> > Using byte-to-position would make things worse for Latin-1 and the likes.
> 
> There's also the problem of EOL encoding, but I'll just ignore it for now.

That was never a problem before Texinfo 5.x: makeinfo didn't count
the CR characters in the CRLF EOLs, and the Info readers removed the
CR characters when reading the Info files.

But Texinfo 5.x and later does count the CR characters, so the
stand-alone Info reader was recently changed to account for that.
Which means that Emacs will now have a problem, whereby the byte
counts in the tag tables will be inaccurate, and our only hope is the
1000-character tolerance we use to look for the node around the
position stated in the tag table will be large enough.

Read the gory details about that in this thread:

  http://lists.gnu.org/archive/html/bug-texinfo/2014-12/msg00068.html





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-06-10 18:21       ` Eli Zaretskii
@ 2015-06-11  3:02         ` Stefan Monnier
  2015-06-11 13:11           ` Eli Zaretskii
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2015-06-11  3:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: teddy, 20704

> But Texinfo 5.x and later does count the CR characters, so the
> stand-alone Info reader was recently changed to account for that.
> Which means that Emacs will now have a problem, whereby the byte
> counts in the tag tables will be inaccurate, and our only hope is the
> 1000-character tolerance we use to look for the node around the
> position stated in the tag table will be large enough.

If needed, I think we could make it work reasonably cheaply with
something along the lines of (100% guaranteed untested code):

       (let (pos lines (eol-offset 0))
         (while
             (progn
               (setq pos (byte-to-position (+ pm byte (- eol-offset))))
               (setq lines (1- (line-number-at-pos pos)))
               (not (= lines eol-offset)))
           (setq eol-offset (+ eol-offset lines)))
         pos))


-- Stefan





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#20704: info.el bug fix; Interprets Info format wrongly
  2015-06-11  3:02         ` Stefan Monnier
@ 2015-06-11 13:11           ` Eli Zaretskii
  0 siblings, 0 replies; 10+ messages in thread
From: Eli Zaretskii @ 2015-06-11 13:11 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: teddy, 20704

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: teddy@recompile.se, 20704@debbugs.gnu.org
> Date: Wed, 10 Jun 2015 23:02:16 -0400
> 
> > But Texinfo 5.x and later does count the CR characters, so the
> > stand-alone Info reader was recently changed to account for that.
> > Which means that Emacs will now have a problem, whereby the byte
> > counts in the tag tables will be inaccurate, and our only hope is the
> > 1000-character tolerance we use to look for the node around the
> > position stated in the tag table will be large enough.
> 
> If needed, I think we could make it work reasonably cheaply with
> something along the lines of (100% guaranteed untested code):

Sure, but this needs to be conditioned on the EOL encoding we actually
found when we read the file.





^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-06-11 13:11 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-31 14:54 bug#20704: info.el bug fix; Interprets Info format wrongly Teddy Hogeborn
2015-06-01 14:01 ` Stefan Monnier
2015-06-01 15:12   ` Eli Zaretskii
2015-06-09 11:09     ` Teddy Hogeborn
2015-06-09 14:29       ` Eli Zaretskii
2015-06-09 16:01       ` Stefan Monnier
2015-06-10 17:50     ` Stefan Monnier
2015-06-10 18:21       ` Eli Zaretskii
2015-06-11  3:02         ` Stefan Monnier
2015-06-11 13:11           ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).