unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
@ 2024-10-05 11:06 Visuwesh
  2024-10-05 19:56 ` Tassilo Horn
  0 siblings, 1 reply; 18+ messages in thread
From: Visuwesh @ 2024-10-05 11:06 UTC (permalink / raw)
  To: 73638; +Cc: Tassilo Horn

This is a follow up to bug#73530 where a discussion on how to obtain the
outlines for LaTeX PDFs was held.

Currently, if mutool reports the outline as

    % mutool show test.pdf outline
    |	"Text"	#nameddest=section.1
    |	"Annotations"	#nameddest=section.2
    |	"Links"	#nameddest=section.3
    |	"Attachments"	#nameddest=section.4
    +	"Outline"	#nameddest=section.5
    +		"subsection"	#nameddest=subsection.5.1
    |			"subsubsection"	#nameddest=subsubsection.5.1.1

then doc-view cannot build an index from it, since the entries point to
named destinations rather than page numbers.  Looking at the source code
of mutool, the "#..." part appears to be simply a URI.  AFAICT, mutool
offers no command to resolve that URI to a page number.  However, one can
write a JS script instead.  Use the "attached" "outline.js" script and run
mutool as follows with a LaTeX PDF:

    % mutool run outline.js test.pdf
    (
    ((level . 1)
    (title . "Text")
    (page . 0))
    ((level . 1)
    (title . "Annotations")
    (page . 1))
    ((level . 1)
    (title . "Links")
    (page . 2))
    ((level . 1)
    (title . "Attachments")
    (page . 3))
    ((level . 1)
    (title . "Outline")
    (page . 4))
    ((level . 2)
    (title . "subsection")
    (page . 4))
    ((level . 3)
    (title . "subsubsection")
    (page . 4))
    )

Emacs can `read' this output directly, so no parsing is needed.
Evaluating the JS takes about the same time as `mutool show PDF outline':

    % time mutool run outline.js atkins_physical_chemistry.pdf >/dev/null
        0m00.32s real     0m00.29s user     0m00.02s system
    % time mutool run outline.js atkins_physical_chemistry.pdf >/dev/null
        0m00.31s real     0m00.29s user     0m00.02s system
    % time mutool show atkins_physical_chemistry.pdf outline >/dev/null
        0m00.33s real     0m00.29s user     0m00.04s system
    % time mutool show atkins_physical_chemistry.pdf outline >/dev/null
        0m00.30s real     0m00.25s user     0m00.04s system

[ where atkins_physical_chemistry.pdf is the same 90+MB file I was
  testing in the previous bug report.  ]

I don't know JS at all, so the script can probably be improved.  The docs
for the JS interface are at

    https://mupdf.readthedocs.io/en/latest/mutool-run-js-api.html

If this approach is acceptable, we can simply run the JS script instead.
WDYT?

[ I couldn't attach the JS script because Gmail blocked the message.  ]

outline.js:

var document = new Document.openDocument(scriptArgs[0], "application/pdf")
var outline = document.loadOutline()
if(!outline) quit()

print("(")

function pp(outl, level){
    print("((level . " + level + ")")
    print("(title . " + repr(outl.title) + ")")
    print("(page . " + document.resolveLink(outl.uri) + "))")
    if(outl.down){
	for(var i=0; i<outl.down.length; i++){
	    pp(outl.down[i], level+1)
	}
    }
}

for(var i=0; i<outline.length; i++){
    pp(outline[i], 1)
}

print(")")
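
For illustration only, here is a minimal sketch of the same traversal
written as a pure function, so the output format can be checked without
mutool.  Everything in it is an assumption made for the example: plain
objects with title/uri/down fields stand in for MuPDF outline entries,
the hypothetical resolvePage callback stands in for document.resolveLink,
and JSON.stringify stands in for mutool's repr.

```javascript
// Flatten an outline tree into the alist text that Emacs can `read'.
// Plain objects {title, uri, down} stand in for MuPDF outline entries;
// resolvePage is a hypothetical stand-in for document.resolveLink.
function outlineToSexp(outline, resolvePage) {
    var lines = ["("];
    function walk(entry, level) {
        lines.push("((level . " + level + ")");
        // JSON.stringify is only a stand-in for mutool's repr().
        lines.push("(title . " + JSON.stringify(entry.title) + ")");
        lines.push("(page . " + resolvePage(entry.uri) + "))");
        var children = entry.down || [];
        for (var i = 0; i < children.length; i++) {
            walk(children[i], level + 1);
        }
    }
    var top = outline || [];   // a missing outline yields an empty list
    for (var i = 0; i < top.length; i++) {
        walk(top[i], 1);
    }
    lines.push(")");
    return lines.join("\n");
}

// Usage with made-up data shaped like the test.pdf outline above.
var samplePages = { "#nameddest=section.5": 4, "#nameddest=subsection.5.1": 4 };
var sampleOutline = [
    { title: "Outline", uri: "#nameddest=section.5",
      down: [{ title: "subsection", uri: "#nameddest=subsection.5.1" }] }
];
console.log(outlineToSexp(sampleOutline, function (uri) {
    return samplePages[uri];
}));
```

Building the text in one place keeps the level/title/page layout easy to
compare against what doc-view expects to read.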






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-05 11:06 bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs Visuwesh
@ 2024-10-05 19:56 ` Tassilo Horn
  2024-10-06  5:42   ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Tassilo Horn @ 2024-10-05 19:56 UTC (permalink / raw)
  To: Visuwesh; +Cc: 73638, Eli Zaretskii

Visuwesh <visuweshm@gmail.com> writes:

> However, one can write a JS script instead.  Use the "attached"
> "outline.js" script and run mutool as follows with a LaTeX PDF:
>
>     % mutool run outline.js test.pdf
>     (
>     ((level . 1)
>     (title . "Text")
>     (page . 0))
>     ...
>     )
>
> This can be directly `read' from Emacs skipping the parsing entirely.

That's really nice.

> JS evaluation takes the same amount of time as `mutool show PDF outline':
>
>     % time mutool run outline.js atkins_physical_chemistry.pdf >/dev/null
>         0m00.32s real     0m00.29s user     0m00.02s system
>     % time mutool run outline.js atkins_physical_chemistry.pdf >/dev/null
>         0m00.31s real     0m00.29s user     0m00.02s system
>     % time mutool show atkins_physical_chemistry.pdf outline >/dev/null
>         0m00.33s real     0m00.29s user     0m00.04s system
>     % time mutool show atkins_physical_chemistry.pdf outline >/dev/null
>         0m00.30s real     0m00.25s user     0m00.04s system
>
> [ where atkins_physical_chemistry.pdf is the same 90+MB file I was
>   testing in the previous bug report.  ]
>
> I don't know JS at all so the script can probably be improved.  The
> docs for the JS interface is at
>
>     https://mupdf.readthedocs.io/en/latest/mutool-run-js-api.html
>
> If this approach is acceptable, we can simply run the JS script
> instead.  WDYT?

To me it sounds great.  But let's ask Eli as well.

Eli, the executive summary is this.  We can already read a PDF's outline
with mutool and use it for quick access to chapters and sections through
imenu.  However, it turned out that whether the outline is usable for our
purpose depends on the PDF at hand, where "usable" means we get page
references.  And it seems that many PDFs (e.g., those produced by LaTeX)
have no page references but named references, which won't do the trick
for doc-view.
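
To make the distinction concrete, here is a small sketch in plain JS of
how the two kinds of outline URI differ.  The function name is made up
for this example, and the pattern simply mirrors the "#page=N" / "#N"
forms that doc-view's current regexp matches; it is not mutool's own
logic.

```javascript
// Return the page number embedded in an outline URI, or null when the
// URI is a named destination that carries no page number.
// pageFromOutlineUri is a hypothetical helper for this illustration.
function pageFromOutlineUri(uri) {
    var m = /^#(?:page=)?([0-9]+)/.exec(uri);
    return m ? parseInt(m[1], 10) : null;
}

console.log(pageFromOutlineUri("#page=12"));             // page is in the URI
console.log(pageFromOutlineUri("#nameddest=section.1")); // nothing to extract
```

For the LaTeX-style entries the URI itself gives us nothing to work with,
so the destination has to be resolved against the document.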

Visuwesh figured out that one can run a JS script using "mutool run
<script> foo.pdf" to access the PDF's internal structure through the
mupdf JS API, and wrote the simple script below, which prints the outline
with page references as a sexp structure.

Would it be ok to distribute the below JS helper script with Emacs so
that doc-view can use it?  If so, how?  Maybe the simplest way would be
to just put it in some doc-view--mutool-outline-script variable and copy
it to doc-view-cache-directory when invoking imenu on a PDF file the
first time?

Bye,
  Tassilo

> outline.js:
>
> var document = new Document.openDocument(scriptArgs[0], "application/pdf")
> var outline = document.loadOutline()
> if(!outline) quit()
>
> print("(")
>
> function pp(outl, level){
>     print("((level . " + level + ")")
>     print("(title . " + repr(outl.title) + ")")
>     print("(page . " + document.resolveLink(outl.uri) + "))")
>     if(outl.down){
> 	for(var i=0; i<outl.down.length; i++){
> 	    pp(outl.down[i], level+1)
> 	}
>     }
> }
>
> for(var i=0; i<outline.length; i++){
>     pp(outline[i], 1)
> }
>
> print(")")






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-05 19:56 ` Tassilo Horn
@ 2024-10-06  5:42   ` Eli Zaretskii
  2024-10-06  6:28     ` Visuwesh
  2024-10-06  6:39     ` Visuwesh
  0 siblings, 2 replies; 18+ messages in thread
From: Eli Zaretskii @ 2024-10-06  5:42 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: 73638, visuweshm

> From: Tassilo Horn <tsdh@gnu.org>
> Cc: 73638@debbugs.gnu.org, Eli Zaretskii <eliz@gnu.org>
> Date: Sat, 05 Oct 2024 21:56:24 +0200
> 
> Eli, the executive summary is this.  We already can read a PDFs outline
> mutool and use that for quick access to chapters and sections through
> imenu.  However, it turned out that it depends on the PDF at hand if the
> outline is usable for our purpose where "usable" means we get page
> references.  And it seems that many PDFs (e.g., those produced by LaTeX)
> have no page references but named references which won't do the trick
> for doc-view.
> 
> Visuwesh figured out that one can run a JS script using "mutool run
> <script> foo.pdf" for accessing the PDFs internal structure using the JS
> mupdf API and wrote the below simple script which spits out the outline
> with page references as sexp structure.
> 
> Would it be ok to distribute the below JS helper script with Emacs so
> that doc-view can use it?  If so, how?  Maybe the simplest way would be
> to just put it in some doc-view--mutool-outline-script variable and copy
> it to doc-view-cache-directory when invoking imenu on a PDF file the
> first time?

Can't we invoke the JS interpreter with this script as the command, or
invoke it as async subprocess and pipe the script to it via standard
input?






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-06  5:42   ` Eli Zaretskii
@ 2024-10-06  6:28     ` Visuwesh
  2024-10-06  6:39       ` Eli Zaretskii
  2024-10-06  8:16       ` Tassilo Horn
  2024-10-06  6:39     ` Visuwesh
  1 sibling, 2 replies; 18+ messages in thread
From: Visuwesh @ 2024-10-06  6:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73638, Tassilo Horn

[-- Attachment #1: Type: text/plain, Size: 1822 bytes --]

[Sunday October 06, 2024] Eli Zaretskii wrote:

>> From: Tassilo Horn <tsdh@gnu.org>
>> Cc: 73638@debbugs.gnu.org, Eli Zaretskii <eliz@gnu.org>
>> Date: Sat, 05 Oct 2024 21:56:24 +0200
>> 
>> Eli, the executive summary is this.  We already can read a PDFs outline
>> mutool and use that for quick access to chapters and sections through
>> imenu.  However, it turned out that it depends on the PDF at hand if the
>> outline is usable for our purpose where "usable" means we get page
>> references.  And it seems that many PDFs (e.g., those produced by LaTeX)
>> have no page references but named references which won't do the trick
>> for doc-view.
>> 
>> Visuwesh figured out that one can run a JS script using "mutool run
>> <script> foo.pdf" for accessing the PDFs internal structure using the JS
>> mupdf API and wrote the below simple script which spits out the outline
>> with page references as sexp structure.
>> 
>> Would it be ok to distribute the below JS helper script with Emacs so
>> that doc-view can use it?  If so, how?  Maybe the simplest way would be
>> to just put it in some doc-view--mutool-outline-script variable and copy
>> it to doc-view-cache-directory when invoking imenu on a PDF file the
>> first time?
>
> Can't we invoke the JS interpreter with this script as the command, or
> invoke it as async subprocess and pipe the script to it via standard
> input?

When "mutool run" is called without any script argument, it opens a
REPL.  We would need to `process-send-string' to the REPL but the REPL
reports syntax error when I try to do so.  This is after trying to
"minify" the JS script.  Maybe I am doing something wrong.  I've
attached my test lisp at the end.

However, I do not understand what you mean by the former.


[-- Attachment #2: Type: text/plain, Size: 1351 bytes --]

(defvar test
  "var document = new Document.openDocument(\"%s\", \"application/pdf\")
var outline = document.loadOutline()
if(!outline) quit()

print(\"(\")

function pp(outl, level){ \
    print(\"((level . \" + level + \")\"); \
    print(\"(title . \" + repr(outl.title) + \")\"); \
    print(\"(page . \" + document.resolveLink(outl.uri) + \"))\"); \
    if(outl.down){ \
	for(var i=0; i<outl.down.length; i++){ \
	    pp(outl.down[i], level+1); \
	} \
    } \
}

for(var i=0; i<outline.length; i++){ \
    pp(outline[i], 1); \
}

print(\")\")")


(let ((default-directory "~/lib/emacs/straight/repos/pdf-tools/test/")
      proc)
  (with-current-buffer (get-buffer-create " *doc-view-mupdf-js*")
    (erase-buffer)
    (setq proc
          (make-process :name "mupdf-js"
                        :command (list "mutool" "run" )
                        :buffer (current-buffer)
                        :sentinel (lambda (proc _status)
                                    (when (eq (process-status proc) 'exit)
                                      (with-current-buffer (process-buffer proc)
                                        (goto-char (point-min))
                                        (message "%S" (read (current-buffer))))))))
    (process-send-string proc (format test (expand-file-name "test.pdf")))
    (process-send-eof proc)
    ))


* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-06  5:42   ` Eli Zaretskii
  2024-10-06  6:28     ` Visuwesh
@ 2024-10-06  6:39     ` Visuwesh
  1 sibling, 0 replies; 18+ messages in thread
From: Visuwesh @ 2024-10-06  6:39 UTC (permalink / raw)
  To: Eli Zaretskii, Tassilo Horn; +Cc: 73638

Just realised what you meant by the former: no, there's no equivalent of "sh -c".  You need to give the interpreter a script.

On 6 October 2024 11:12:01 GMT+05:30, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: Tassilo Horn <tsdh@gnu.org>
>> Cc: 73638@debbugs.gnu.org, Eli Zaretskii <eliz@gnu.org>
>> Date: Sat, 05 Oct 2024 21:56:24 +0200
>> 
>> Eli, the executive summary is this.  We already can read a PDFs outline
>> mutool and use that for quick access to chapters and sections through
>> imenu.  However, it turned out that it depends on the PDF at hand if the
>> outline is usable for our purpose where "usable" means we get page
>> references.  And it seems that many PDFs (e.g., those produced by LaTeX)
>> have no page references but named references which won't do the trick
>> for doc-view.
>> 
>> Visuwesh figured out that one can run a JS script using "mutool run
>> <script> foo.pdf" for accessing the PDFs internal structure using the JS
>> mupdf API and wrote the below simple script which spits out the outline
>> with page references as sexp structure.
>> 
>> Would it be ok to distribute the below JS helper script with Emacs so
>> that doc-view can use it?  If so, how?  Maybe the simplest way would be
>> to just put it in some doc-view--mutool-outline-script variable and copy
>> it to doc-view-cache-directory when invoking imenu on a PDF file the
>> first time?
>
>Can't we invoke the JS interpreter with this script as the command, or
>invoke it as async subprocess and pipe the script to it via standard
>input?






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-06  6:28     ` Visuwesh
@ 2024-10-06  6:39       ` Eli Zaretskii
  2024-10-06  8:16       ` Tassilo Horn
  1 sibling, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2024-10-06  6:39 UTC (permalink / raw)
  To: Visuwesh; +Cc: 73638, tsdh

> From: Visuwesh <visuweshm@gmail.com>
> Cc: Tassilo Horn <tsdh@gnu.org>,  73638@debbugs.gnu.org
> Date: Sun, 06 Oct 2024 11:58:28 +0530
> 
> > Can't we invoke the JS interpreter with this script as the command, or
> > invoke it as async subprocess and pipe the script to it via standard
> > input?
> 
> When "mutool run" is called without any script argument, it opens a
> REPL.  We would need to `process-send-string' to the REPL but the REPL
> reports syntax error when I try to do so.  This is after trying to
> "minify" the JS script.  Maybe I am doing something wrong.  I've
> attached my test lisp at the end.
> 
> I do not understand what you suggest by the former however.

Many programs that include an interpreter can accept a script either
as a file or as a string passed through the command line, with or
without a special command-line option.  One example is Awk, another is
Python.

In any case, the process-send-string method sounds like the best
alternative to me, so I hope someone will explain how to do it
correctly (does mutool have some forum where you could ask
questions?).  If that is unworkable for some reason, I guess
generating a temporary file with the script is the next thing I would
try.

Thanks.






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-06  6:28     ` Visuwesh
  2024-10-06  6:39       ` Eli Zaretskii
@ 2024-10-06  8:16       ` Tassilo Horn
  2024-10-06 10:32         ` Visuwesh
  1 sibling, 1 reply; 18+ messages in thread
From: Tassilo Horn @ 2024-10-06  8:16 UTC (permalink / raw)
  To: Visuwesh; +Cc: Eli Zaretskii, 73638

Visuwesh <visuweshm@gmail.com> writes:

> When "mutool run" is called without any script argument, it opens a
> REPL.  We would need to `process-send-string' to the REPL but the REPL
> reports syntax error when I try to do so.  This is after trying to
> "minify" the JS script.  Maybe I am doing something wrong.  I've
> attached my test lisp at the end.

I've tried doing the same and also get syntax errors.  The same happens
if I just copy&paste the script with the injected filename into the REPL
on a terminal, both for the original script and for a minified version.
At first I thought that function definitions wouldn't work at all, but
that's not the case, e.g., I can enter this function at the REPL and
it'll work:

--8<---------------cut here---------------start------------->8---
> function bar(i){if(i==0) print("done"); else {print(i);bar(i-1);}}
> bar(10)
10
9
8
7
6
5
4
3
2
1
done
--8<---------------cut here---------------end--------------->8---

Ok, after experimenting a bit more and minimizing by hand, this version
of the script can be copy&pasted into the REPL and should probably also
work with process-send-string.

--8<---------------cut here---------------start------------->8---
var document = new Document.openDocument("/home/horn/sample.pdf", "application/pdf");
var outline = document.loadOutline();
if(!outline) quit();
function pp(outl, level){print("((level . " + level + ")");print("(title . " + repr(outl.title) + ")");print("(page . " + document.resolveLink(outl.uri) + "))");if(outl.down){for(var i=0; i<outl.down.length; i++){pp(outl.down[i], level+1);}}};
function run(){print("(");for(var i=0; i<outline.length; i++){pp(outline[i], 1);}print(")");};
run();
--8<---------------cut here---------------end--------------->8---

It looks like the REPL wants each function definition on a single line
of its own.  Sadly, that's not documented, so I'm not sure how reliably
that formatting works across mutool versions or why the REPL doesn't
accept all valid JS.  The only thing I can find in the man page is that
the = character at the beginning of a line has the special meaning of
"print the result of the following expression".
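
As a rough sketch of the hand-minification described above, the
following plain-JS helper joins the lines of every braced block so that
each top-level statement or function definition ends up on one line.
The name joinBracedLines is made up for this example.  It assumes every
statement already ends with a semicolon and that the source has no
comments or regexp literals containing braces or quotes; whether the
REPL accepts the result is only what was observed here, not documented
behaviour.

```javascript
// Join the lines of each braced block into a single line, so that every
// top-level statement or function definition occupies one line.
// Assumes statements end with semicolons and that there are no comments
// or regexp literals containing braces or quotes.
function joinBracedLines(source) {
    var out = [];
    var current = "";
    var depth = 0;
    var lines = source.split("\n");
    for (var n = 0; n < lines.length; n++) {
        var line = lines[n];
        var quote = null;              // string literals do not span lines
        for (var i = 0; i < line.length; i++) {
            var ch = line.charAt(i);
            if (quote) {
                if (ch === "\\") {
                    i++;               // skip the escaped character
                } else if (ch === quote) {
                    quote = null;
                }
            } else if (ch === '"' || ch === "'") {
                quote = ch;
            } else if (ch === "{") {
                depth++;
            } else if (ch === "}") {
                depth--;
            }
        }
        var trimmed = line.trim();
        if (trimmed) {
            current += (current ? " " : "") + trimmed;
        }
        // Emit once all braces opened so far have been closed again.
        if (depth <= 0 && current) {
            out.push(current);
            current = "";
            depth = 0;
        }
    }
    if (current) {
        out.push(current);             // unbalanced input: keep the rest
    }
    return out.join("\n");
}
```

Braces inside string literals are skipped, which matters for scripts
like this one that print parentheses and other delimiters as text.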

Bye,
  Tassilo






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-06  8:16       ` Tassilo Horn
@ 2024-10-06 10:32         ` Visuwesh
  2024-10-06 11:26           ` Tassilo Horn
  0 siblings, 1 reply; 18+ messages in thread
From: Visuwesh @ 2024-10-06 10:32 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: Eli Zaretskii, 73638

[-- Attachment #1: Type: text/plain, Size: 2350 bytes --]

[Sunday October 06, 2024] Tassilo Horn wrote:

> Visuwesh <visuweshm@gmail.com> writes:
>
>> When "mutool run" is called without any script argument, it opens a
>> REPL.  We would need to `process-send-string' to the REPL but the REPL
>> reports syntax error when I try to do so.  This is after trying to
>> "minify" the JS script.  Maybe I am doing something wrong.  I've
>> attached my test lisp at the end.
>
> I've tried doing the same and also get syntax errors.  Same if I just
> copy&paste the script with injected filename in the REPL on a terminal.
> Both for the original script as well as for a minified version.  First
> I've thought that function definitions won't work at all but that's not
> the case, e.g., I can enter this funcion at the REPL and it'll work:
>
>> function bar(i){if(i==0) print("done"); else {print(i);bar(i-1);}}
>> bar(10)
> 10
> 9
> 8
> 7
> 6
> 5
> 4
> 3
> 2
> 1
> done
>
>
> Ok, after experimenting a bit more and minimizing by hand, this version
> of the script can be copy&pasted into the REPL and should probably also
> work with process-send-string.

Thanks for looking into this, and the solution!

> var document = new Document.openDocument("/home/horn/sample.pdf", "application/pdf");
> var outline = document.loadOutline();
> if(!outline) quit();
> function pp(outl, level){print("((level . " + level + ")");print("(title . " + repr(outl.title) + ")");print("(page . " + document.resolveLink(outl.uri) + "))");if(outl.down){for(var i=0; i<outl.down.length; i++){pp(outl.down[i], level+1);}}};
> function run(){print("(");for(var i=0; i<outline.length; i++){pp(outline[i], 1);}print(")");};
> run();
>
> It looks like the REPL wants function definitions on one single but
> separate line.  Sadly, that's not documented so I'm not sure how
> reliable that formatting works across mutool versions or why the REPL
> doesn't accept all valid JS.  The only thing I can find in the man page
> is that the = character at the beginning of a line has the special
> meaning of "print the result of the following expression".

As you expected, your minified version works fine with
process-send-string.  I don't have much experience working with async
processes like this; what do you think about the approach below?


[-- Attachment #2: Type: text/plain, Size: 1555 bytes --]

(defvar test
  "var document = new Document.openDocument(\"%s\", \"application/pdf\");
var outline = document.loadOutline();
if(!outline) quit();
function pp(outl, level){print(\"((level . \" + level + \")\");print(\"(title . \" + repr(outl.title) + \")\");print(\"(page . \" + document.resolveLink(outl.uri) + \"))\");if(outl.down){for(var i=0; i<outl.down.length; i++){pp(outl.down[i], level+1);}}};
function run(){print(\"BEGIN(\");for(var i=0; i<outline.length; i++){pp(outline[i], 1);}print(\")\");};
run()")


(let ((default-directory "~/lib/emacs/straight/repos/pdf-tools/test/")
      outline proc)
  (with-current-buffer (get-buffer-create " *doc-view-mupdf-js*")
    (erase-buffer)
    (setq proc
          (make-process :name "mupdf-js"
                        :command (list "mutool" "run" )
                        :buffer (current-buffer)
                        :sentinel (lambda (proc _status)
                                    (message "%S" _status)
                                    (when (eq (process-status proc) 'exit)
                                      (with-current-buffer (process-buffer proc)
                                        (goto-char (point-min))
                                        (search-forward "BEGIN")
                                        (setq outline (read (current-buffer))))))))
    (process-send-string proc (format test "test.pdf"))
    ;; Need to send this twice for some reason...
    (process-send-eof proc)
    (process-send-eof proc)
    (while (accept-process-output proc))
    outline))


* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-06 10:32         ` Visuwesh
@ 2024-10-06 11:26           ` Tassilo Horn
  2024-10-06 12:32             ` Visuwesh
  0 siblings, 1 reply; 18+ messages in thread
From: Tassilo Horn @ 2024-10-06 11:26 UTC (permalink / raw)
  To: Visuwesh; +Cc: Eli Zaretskii, 73638

Visuwesh <visuweshm@gmail.com> writes:

> As you expected, your minified version works fine when doing
> process-send-string.  I do have not much experience working with async
> processes like this before, what do you think about the approach
> below?

I don't do that very frequently.  I think it would be simpler if we skip
the sentinel and instead use some :buffer " *mutool-run-result*" with
make-process and just read from there after the accept-process-output
loop.

Bye,
  Tassilo






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-06 11:26           ` Tassilo Horn
@ 2024-10-06 12:32             ` Visuwesh
  2024-10-07  7:02               ` Tassilo Horn
  0 siblings, 1 reply; 18+ messages in thread
From: Visuwesh @ 2024-10-06 12:32 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: Eli Zaretskii, 73638

[-- Attachment #1: Type: text/plain, Size: 683 bytes --]

[Sunday October 06, 2024] Tassilo Horn wrote:

> Visuwesh <visuweshm@gmail.com> writes:
>
>> As you expected, your minified version works fine when doing
>> process-send-string.  I do have not much experience working with async
>> processes like this before, what do you think about the approach
>> below?
>
> I don't do that very frequently.  I think it would be simpler if we skip
> the sentinel and instead use some :buffer " *mutool-run-result*" with
> make-process and just read from there after the accept-process-output
> loop.

That was a serious brainfart, indeed.  I went with your approach in
the attached patch; please review.


[-- Attachment #2: 0001-Make-imenu-index-generation-for-PDFs-more-reliable.patch --]
[-- Type: text/x-diff, Size: 4395 bytes --]

From 6a9de26ac3efbdd9931c74db90e2aeac2bc0dca8 Mon Sep 17 00:00:00 2001
From: Visuwesh <visuweshm@gmail.com>
Date: Sun, 6 Oct 2024 18:02:06 +0530
Subject: [PATCH] Make imenu index generation for PDFs more reliable

Do away with parsing the output of "mutool show FILE outline"
since the URI reported in its output may not include the page
number of the heading, and instead may contain "nameddest"
elements which cannot be resolved using "mutool".  Instead, use
a MuPDF JS script to generate the PDF outline, which allows
resolving such URIs.

* lisp/doc-view.el (doc-view--outline-rx): Remove as no longer
needed.
(doc-view--outline): Reflect that outline can be generated for
non-PDF files too.
(doc-view--mutool-pdf-outline-script): Add new variable to hold
the JS script used to generate the outline.
(doc-view--pdf-outline): Use the script.  (bug#73638)
---
 lisp/doc-view.el | 48 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/lisp/doc-view.el b/lisp/doc-view.el
index 446beeafd9f..a5e84f1e2ab 100644
--- a/lisp/doc-view.el
+++ b/lisp/doc-view.el
@@ -1969,14 +1969,26 @@ doc-view-search-previous-match
 	(doc-view-goto-page (caar (last doc-view--current-search-matches)))))))
 
 ;;;; Imenu support
-(defconst doc-view--outline-rx
-  "[^\t]+\\(\t+\\)\"\\(.+\\)\"\t#\\(?:page=\\)?\\([0-9]+\\)")
-
 (defvar-local doc-view--outline nil
-  "Cached PDF outline, so that it is only computed once per document.
+  "Cached document outline, so that it is only computed once per document.
 It can be the symbol `unavailable' to indicate that outline is
 unavailable for the document.")
 
+(defvar doc-view--mutool-pdf-outline-script
+  "var document = new Document.openDocument(\"%s\", \"application/pdf\");
+var outline = document.loadOutline();
+if(!outline) quit();
+function pp(outl, level){print(\"((level . \" + level + \")\");\
+print(\"(title . \" + repr(outl.title) + \")\");\
+print(\"(page . \" + document.resolveLink(outl.uri) + \"))\");\
+if(outl.down){for(var i=0; i<outl.down.length; i++){pp(outl.down[i], level+1);}}};
+function run(){print(\"BEGIN(\");\
+for(var i=0; i<outline.length; i++){pp(outline[i], 1);}print(\")\");};
+run()"
+  "JS script to extract the PDF's outline using mutool.
+The script has to be minified to pass it to the REPL.  The \"BEGIN\"
+marker is here to skip past the prompt characters.")
+
 (defun doc-view--pdf-outline (&optional file-name)
   "Return a list describing the outline of FILE-NAME.
 Return a list describing the current file if FILE-NAME is nil.
@@ -1986,21 +1998,25 @@ doc-view--pdf-outline
 structure is extracted by `doc-view--imenu-subtree'."
   (let ((fn (or file-name (buffer-file-name))))
     (when fn
-      (let ((outline nil)
-            (fn (expand-file-name fn)))
-        (with-temp-buffer
-          (unless (eql 0 (call-process doc-view-pdfdraw-program nil
-                                       (current-buffer) nil "show" fn "outline"))
+      (with-temp-buffer
+        (let ((proc (make-process
+                     :name "doc-view-pdf-outline"
+                     :command (list "mutool" "run")
+                     :buffer (current-buffer))))
+          (process-send-string proc (format doc-view--mutool-pdf-outline-script
+                                            (expand-file-name fn)))
+          ;; Need to send this twice for some reason...
+          (process-send-eof)
+          (process-send-eof)
+          (while (accept-process-output proc))
+          (unless (eq (process-status proc) 'exit)
             (setq doc-view--outline 'unavailable)
             (imenu-unavailable-error "Unable to create imenu index using `mutool'"))
           (goto-char (point-min))
-          (while (re-search-forward doc-view--outline-rx nil t)
-            (push `((level . ,(length (match-string 1)))
-                    (title . ,(replace-regexp-in-string "\\\\[rt]" " "
-                                                        (match-string 2)))
-                    (page . ,(string-to-number (match-string 3))))
-                  outline)))
-        (nreverse outline)))))
+          (search-forward "BEGIN")
+          (condition-case nil
+              (read (current-buffer))
+            (end-of-file nil)))))))
 
 (defun doc-view--djvu-outline (&optional file-name)
   "Return a list describing the outline of FILE-NAME.
-- 
2.45.2



* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-06 12:32             ` Visuwesh
@ 2024-10-07  7:02               ` Tassilo Horn
  2024-10-07  9:26                 ` Visuwesh
  0 siblings, 1 reply; 18+ messages in thread
From: Tassilo Horn @ 2024-10-07  7:02 UTC (permalink / raw)
  To: Visuwesh; +Cc: Eli Zaretskii, 73638

Visuwesh <visuweshm@gmail.com> writes:

Hi!

>> I don't do that very frequently.  I think it would be simpler if we
>> skip the sentinel and instead use some :buffer " *mutool-run-result*"
>> with make-process and just read from there after the
>> accept-process-output loop.
>
> That was a serious brainfart, indeed.  I've went with your approach in
> the attached, please review.
>
> @@ -1986,21 +1998,25 @@ doc-view--pdf-outline
>  structure is extracted by `doc-view--imenu-subtree'."
>    (let ((fn (or file-name (buffer-file-name))))
>      (when fn
> -      (let ((outline nil)
> -            (fn (expand-file-name fn)))
> -        (with-temp-buffer
> -          (unless (eql 0 (call-process doc-view-pdfdraw-program nil
> -                                       (current-buffer) nil "show" fn "outline"))
> +      (with-temp-buffer
> +        (let ((proc (make-process
> +                     :name "doc-view-pdf-outline"
> +                     :command (list "mutool" "run")
> +                     :buffer (current-buffer))))
> +          (process-send-string proc (format doc-view--mutool-pdf-outline-script
> +                                            (expand-file-name fn)))
> +          ;; Need to send this twice for some reason...
> +          (process-send-eof)
> +          (process-send-eof)
> +          (while (accept-process-output proc))
> +          (unless (eq (process-status proc) 'exit)
>              (setq doc-view--outline 'unavailable)
>              (imenu-unavailable-error "Unable to create imenu index using `mutool'"))
>            (goto-char (point-min))
> -          (while (re-search-forward doc-view--outline-rx nil t)
> -            (push `((level . ,(length (match-string 1)))
> -                    (title . ,(replace-regexp-in-string "\\\\[rt]" " "
> -                                                        (match-string 2)))
> -                    (page . ,(string-to-number (match-string 3))))
> -                  outline)))
> -        (nreverse outline)))))
> +          (search-forward "BEGIN")

If the script fails for some reason, there will be no BEGIN and we let a
search-failed error bubble up.  So I'd put it in the condition-case and
handle it like the end-of-file error.  Or simply provide the NOERROR
search-forward arg.

> +          (condition-case nil
> +              (read (current-buffer))
> +            (end-of-file nil)))))))

Maybe it would also be a good idea to use a :stderr buffer with
make-process and put its contents into the imenu-unavailable-error.
That way, chances are better we get the reason for failure delivered in
bug reports.

Otherwise, it all looks good to me. :-)

Thanks,
  Tassilo






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-07  7:02               ` Tassilo Horn
@ 2024-10-07  9:26                 ` Visuwesh
  2024-10-07  9:55                   ` Visuwesh
  0 siblings, 1 reply; 18+ messages in thread
From: Visuwesh @ 2024-10-07  9:26 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: Eli Zaretskii, 73638

[Monday, October 07, 2024] Tassilo Horn wrote:

> Visuwesh <visuweshm@gmail.com> writes:
>
> Hi!
>
>>> I don't do that very frequently.  I think it would be simpler if we
>>> skip the sentinel and instead use some :buffer " *mutool-run-result*"
>>> with make-process and just read from there after the
>>> accept-process-output loop.
>>
>> That was a serious brainfart, indeed.  I've gone with your approach in
>> the attached, please review.
>>
>> @@ -1986,21 +1998,25 @@ doc-view--pdf-outline
>>  structure is extracted by `doc-view--imenu-subtree'."
>>    (let ((fn (or file-name (buffer-file-name))))
>>      (when fn
>> -      (let ((outline nil)
>> -            (fn (expand-file-name fn)))
>> -        (with-temp-buffer
>> -          (unless (eql 0 (call-process doc-view-pdfdraw-program nil
>> -                                       (current-buffer) nil "show" fn "outline"))
>> +      (with-temp-buffer
>> +        (let ((proc (make-process
>> +                     :name "doc-view-pdf-outline"
>> +                     :command (list "mutool" "run")
>> +                     :buffer (current-buffer))))
>> +          (process-send-string proc (format doc-view--mutool-pdf-outline-script
>> +                                            (expand-file-name fn)))
>> +          ;; Need to send this twice for some reason...
>> +          (process-send-eof)
>> +          (process-send-eof)
>> +          (while (accept-process-output proc))
>> +          (unless (eq (process-status proc) 'exit)
>>              (setq doc-view--outline 'unavailable)
>>              (imenu-unavailable-error "Unable to create imenu index using `mutool'"))
>>            (goto-char (point-min))
>> -          (while (re-search-forward doc-view--outline-rx nil t)
>> -            (push `((level . ,(length (match-string 1)))
>> -                    (title . ,(replace-regexp-in-string "\\\\[rt]" " "
>> -                                                        (match-string 2)))
>> -                    (page . ,(string-to-number (match-string 3))))
>> -                  outline)))
>> -        (nreverse outline)))))
>> +          (search-forward "BEGIN")
>
> If the script fails for some reason, there will be no BEGIN and we let a
> search-failed error bubble up.  So I'd put it in the condition-case and
> handle it like the end-of-file error.  Or simply provide the NOERROR
> search-forward arg.

Ahh, the intention of the condition-case below was to handle this case.
Thanks for catching my mistake, it is a common error of mine to forget
the NOERROR argument.

>> +          (condition-case nil
>> +              (read (current-buffer))
>> +            (end-of-file nil)))))))
>
> Maybe it would also be a good idea to use a :stderr buffer with
> make-process and put its contents into the imenu-unavailable-error.
> That way, chances are better we get the reason for failure delivered in
> bug reports.

I do not think it is worth the trouble, since only syntax errors are
likely to surface in stderr, and those would be very unlikely.  If the
PDF file does not have an outline, there would be nothing printed by our
script, so the end-of-file error should catch that case.

> Otherwise, it all looks good to me. :-)

If you are okay with leaving out the stderr case, I will send a patch
with a non-nil NOERROR argument to the quoted search-forward form.






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-07  9:26                 ` Visuwesh
@ 2024-10-07  9:55                   ` Visuwesh
  2024-10-07 11:03                     ` Tassilo Horn
  0 siblings, 1 reply; 18+ messages in thread
From: Visuwesh @ 2024-10-07  9:55 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: Eli Zaretskii, 73638

[-- Attachment #1: Type: text/plain, Size: 1395 bytes --]

[Monday, October 07, 2024] Visuwesh wrote:

>>> [...]
>>> -        (nreverse outline)))))
>>> +          (search-forward "BEGIN")
>>
>> If the script fails for some reason, there will be no BEGIN and we let a
>> search-failed error bubble up.  So I'd put it in the condition-case and
>> handle it like the end-of-file error.  Or simply provide the NOERROR
>> search-forward arg.
>
> Ahh, the intention of the condition-case below was to handle this case.
> Thanks for catching my mistake, it is a common error of mine to forget
> the NOERROR argument.
>
>>> +          (condition-case nil
>>> +              (read (current-buffer))
>>> +            (end-of-file nil)))))))
>>
>> Maybe it would also be a good idea to use a :stderr buffer with
>> make-process and put its contents into the imenu-unavailable-error.
>> That way, chances are better we get the reason for failure delivered in
>> bug reports.
>
> I do not think it is worth the trouble since only syntax errors are
> likely to surface up in stderr which would be very unlikely.  If the PDF
> file does not have an outline, there would be nothing printed by our
> script so end-of-file error should catch that case.  

Actually, I think this wasn't quite correct.  We would have a stray > in
the buffer, and read would return the symbol >.  I corrected that in the
attached.
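
The failure mode can be sketched outside Emacs.  The JavaScript below is
only an analogy for what `search-forward' with a non-nil NOERROR argument
followed by `read' does in the patch: it looks for the marker and gives
up quietly when the marker is missing, instead of parsing the prompt
characters.  The function name `payloadAfterMarker` and the sample REPL
transcripts are made up for illustration; the real prompt output of
"mutool run" may look different.

```javascript
// Return the text following MARKER in OUTPUT, or null when the marker
// never appears (e.g. the script quit early because there is no outline).
function payloadAfterMarker(output, marker) {
  var start = output.indexOf(marker);
  if (start === -1) {
    return null; // analogous to search-forward returning nil with NOERROR
  }
  return output.slice(start + marker.length).trim();
}

// Hypothetical transcripts: prompt characters precede any printed text.
var withOutline =
  "> > > BEGIN(\n((level . 1)\n(title . \"Text\")\n(page . 1))\n)\n";
var withoutOutline = "> > > ";

console.log(payloadAfterMarker(withOutline, "BEGIN"));
console.log(payloadAfterMarker(withoutOutline, "BEGIN"));
```

Without the marker check, the second transcript would be handed to the
reader as-is, which is how the stray > ends up being read as a symbol.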


[-- Attachment #2: 0001-Make-imenu-index-generation-for-PDFs-more-reliable.patch --]
[-- Type: text/x-diff, Size: 4414 bytes --]

From a5055e18889460b429ccacf2970c7ccaf5f423c7 Mon Sep 17 00:00:00 2001
From: Visuwesh <visuweshm@gmail.com>
Date: Sun, 6 Oct 2024 18:02:06 +0530
Subject: [PATCH] Make imenu index generation for PDFs more reliable

Do away with parsing the output of "mutool show FILE outline"
since the URI reported in its output may not include the page
number of the heading, and instead may contain "nameddest"
elements which cannot be resolved using "mutool".  Instead, use
a MuPDF JS script to generate the PDF outline, which allows
resolving such URIs.

* lisp/doc-view.el (doc-view--outline-rx): Remove as no longer
needed.
(doc-view--outline): Reflect that outline can be generated for
non-PDF files too.
(doc-view--mutool-pdf-outline-script): Add new variable to hold
the JS script used to generate the outline.
(doc-view--pdf-outline): Use the script.  (bug#73638)
---
 lisp/doc-view.el | 48 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/lisp/doc-view.el b/lisp/doc-view.el
index 446beeafd9f..fcfdff18a40 100644
--- a/lisp/doc-view.el
+++ b/lisp/doc-view.el
@@ -1969,14 +1969,26 @@ doc-view-search-previous-match
 	(doc-view-goto-page (caar (last doc-view--current-search-matches)))))))
 
 ;;;; Imenu support
-(defconst doc-view--outline-rx
-  "[^\t]+\\(\t+\\)\"\\(.+\\)\"\t#\\(?:page=\\)?\\([0-9]+\\)")
-
 (defvar-local doc-view--outline nil
-  "Cached PDF outline, so that it is only computed once per document.
+  "Cached document outline, so that it is only computed once per document.
 It can be the symbol `unavailable' to indicate that outline is
 unavailable for the document.")
 
+(defvar doc-view--mutool-pdf-outline-script
+  "var document = new Document.openDocument(\"%s\", \"application/pdf\");
+var outline = document.loadOutline();
+if(!outline) quit();
+function pp(outl, level){print(\"((level . \" + level + \")\");\
+print(\"(title . \" + repr(outl.title) + \")\");\
+print(\"(page . \" + document.resolveLink(outl.uri) + \"))\");\
+if(outl.down){for(var i=0; i<outl.down.length; i++){pp(outl.down[i], level+1);}}};
+function run(){print(\"BEGIN(\");\
+for(var i=0; i<outline.length; i++){pp(outline[i], 1);}print(\")\");};
+run()"
+  "JS script to extract the PDF's outline using mutool.
+The script has to be minified to pass it to the REPL.  The \"BEGIN\"
+marker is here to skip past the prompt characters.")
+
 (defun doc-view--pdf-outline (&optional file-name)
   "Return a list describing the outline of FILE-NAME.
 Return a list describing the current file if FILE-NAME is nil.
@@ -1986,21 +1998,25 @@ doc-view--pdf-outline
 structure is extracted by `doc-view--imenu-subtree'."
   (let ((fn (or file-name (buffer-file-name))))
     (when fn
-      (let ((outline nil)
-            (fn (expand-file-name fn)))
-        (with-temp-buffer
-          (unless (eql 0 (call-process doc-view-pdfdraw-program nil
-                                       (current-buffer) nil "show" fn "outline"))
+      (with-temp-buffer
+        (let ((proc (make-process
+                     :name "doc-view-pdf-outline"
+                     :command (list "mutool" "run")
+                     :buffer (current-buffer))))
+          (process-send-string proc (format doc-view--mutool-pdf-outline-script
+                                            (expand-file-name fn)))
+          ;; Need to send this twice for some reason...
+          (process-send-eof)
+          (process-send-eof)
+          (while (accept-process-output proc))
+          (unless (eq (process-status proc) 'exit)
             (setq doc-view--outline 'unavailable)
             (imenu-unavailable-error "Unable to create imenu index using `mutool'"))
           (goto-char (point-min))
-          (while (re-search-forward doc-view--outline-rx nil t)
-            (push `((level . ,(length (match-string 1)))
-                    (title . ,(replace-regexp-in-string "\\\\[rt]" " "
-                                                        (match-string 2)))
-                    (page . ,(string-to-number (match-string 3))))
-                  outline)))
-        (nreverse outline)))))
+          (when (search-forward "BEGIN" nil t)
+            (condition-case nil
+                (read (current-buffer))
+              (end-of-file nil))))))))
 
 (defun doc-view--djvu-outline (&optional file-name)
   "Return a list describing the outline of FILE-NAME.
-- 
2.45.2


[-- Attachment #3: Type: text/plain, Size: 189 bytes --]


>
>> Otherwise, it all looks good to me. :-)
>
> If you are okay with leaving out the stderr case, I will send a patch
> with a non-nil NOERROR argument to the quoted search-forward form.


* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-07  9:55                   ` Visuwesh
@ 2024-10-07 11:03                     ` Tassilo Horn
  2024-10-07 12:53                       ` Visuwesh
  0 siblings, 1 reply; 18+ messages in thread
From: Tassilo Horn @ 2024-10-07 11:03 UTC (permalink / raw)
  To: Visuwesh; +Cc: Eli Zaretskii, 73638

Visuwesh <visuweshm@gmail.com> writes:

>>> Maybe it would also be a good idea to use a :stderr buffer with
>>> make-process and put its contents into the imenu-unavailable-error.
>>> That way, chances are better we get the reason for failure delivered
>>> in bug reports.
>>
>> I do not think it is worth the trouble since only syntax errors are
>> likely to surface up in stderr which would be very unlikely.  If the
>> PDF file does not have an outline, there would be nothing printed by
>> our script so end-of-file error should catch that case.
>
> Actually, this wasn't quite correct I think.  We would have stray > in
> the buffer and read would return the symbol >.  I corrected that in
> the attached.

The patch looks good.  But during testing, it seems that the index is
always off by one page, i.e., the index for some section brings me to
page 117 but the section heading is actually on page 118.

I have that both with the Peter Atkins et al. book you suggested as well
as with my own papers, which didn't work at all previously due to
#nameddest references.

Bye,
  Tassilo






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-07 11:03                     ` Tassilo Horn
@ 2024-10-07 12:53                       ` Visuwesh
  2024-10-07 15:04                         ` Tassilo Horn
  0 siblings, 1 reply; 18+ messages in thread
From: Visuwesh @ 2024-10-07 12:53 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: Eli Zaretskii, 73638

[-- Attachment #1: Type: text/plain, Size: 1417 bytes --]

[Monday, October 07, 2024] Tassilo Horn wrote:

> Visuwesh <visuweshm@gmail.com> writes:
>
>>>> Maybe it would also be a good idea to use a :stderr buffer with
>>>> make-process and put its contents into the imenu-unavailable-error.
>>>> That way, chances are better we get the reason for failure delivered
>>>> in bug reports.
>>>
>>> I do not think it is worth the trouble since only syntax errors are
>>> likely to surface up in stderr which would be very unlikely.  If the
>>> PDF file does not have an outline, there would be nothing printed by
>>> our script so end-of-file error should catch that case.
>>
>> Actually, this wasn't quite correct I think.  We would have stray > in
>> the buffer and read would return the symbol >.  I corrected that in
>> the attached.
>
> The patch looks good.  But during testing, it seems that the index is
> always off by one page, i.e., the index for some section brings me to
> page 117 but the section heading is actually on page 118.
>
> I have that both with the Peter Atkins et al. book you suggested as well
> as with own papers which didn't work at all previously due to #nameddest
> references.

Ugghhh, looks like the page number returned by the JS function is
zero-indexed.  Thanks for the catch (and sorry for the many mistakes and
hence the back-and-forth).  Should be corrected in the attached patch.
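
A minimal sketch of the adjustment, assuming (as concluded above) that
the page index coming out of the script is zero-based while doc-view
counts pages from 1.  The entries mirror the shape and values of the
script output shown at the start of this report; `toOneBased` is a
hypothetical helper and not part of the patch.

```javascript
// Convert zero-based page indices into the one-based page numbers that
// a viewer displays.  The input entries are left unmodified.
function toOneBased(entries) {
  return entries.map(function (e) {
    return { level: e.level, title: e.title, page: e.page + 1 };
  });
}

var entries = [
  { level: 1, title: "Text", page: 0 },
  { level: 1, title: "Outline", page: 4 },
  { level: 2, title: "subsection", page: 4 }
];

console.log(toOneBased(entries));
```

Here the addition is done on numbers before anything is turned into
text, so no string concatenation is involved.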


[-- Attachment #2: 0001-Make-imenu-index-generation-for-PDFs-more-reliable.patch --]
[-- Type: text/x-diff, Size: 4416 bytes --]

From 84563a74cc2fba7279153f08d442b69c2977f2b4 Mon Sep 17 00:00:00 2001
From: Visuwesh <visuweshm@gmail.com>
Date: Sun, 6 Oct 2024 18:02:06 +0530
Subject: [PATCH] Make imenu index generation for PDFs more reliable

Do away with parsing the output of "mutool show FILE outline"
since the URI reported in its output may not include the page
number of the heading, and instead may contain "nameddest"
elements which cannot be resolved using "mutool".  Instead, use
a MuPDF JS script to generate the PDF outline, which allows
resolving such URIs.

* lisp/doc-view.el (doc-view--outline-rx): Remove as no longer
needed.
(doc-view--outline): Reflect that outline can be generated for
non-PDF files too.
(doc-view--mutool-pdf-outline-script): Add new variable to hold
the JS script used to generate the outline.
(doc-view--pdf-outline): Use the script.  (bug#73638)
---
 lisp/doc-view.el | 48 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/lisp/doc-view.el b/lisp/doc-view.el
index 446beeafd9f..a49cbc69717 100644
--- a/lisp/doc-view.el
+++ b/lisp/doc-view.el
@@ -1969,14 +1969,26 @@ doc-view-search-previous-match
 	(doc-view-goto-page (caar (last doc-view--current-search-matches)))))))
 
 ;;;; Imenu support
-(defconst doc-view--outline-rx
-  "[^\t]+\\(\t+\\)\"\\(.+\\)\"\t#\\(?:page=\\)?\\([0-9]+\\)")
-
 (defvar-local doc-view--outline nil
-  "Cached PDF outline, so that it is only computed once per document.
+  "Cached document outline, so that it is only computed once per document.
 It can be the symbol `unavailable' to indicate that outline is
 unavailable for the document.")
 
+(defvar doc-view--mutool-pdf-outline-script
+  "var document = new Document.openDocument(\"%s\", \"application/pdf\");
+var outline = document.loadOutline();
+if(!outline) quit();
+function pp(outl, level){print(\"((level . \" + level + \")\");\
+print(\"(title . \" + repr(outl.title) + \")\");\
+print(\"(page . \" + document.resolveLink(outl.uri)+1 + \"))\");\
+if(outl.down){for(var i=0; i<outl.down.length; i++){pp(outl.down[i], level+1);}}};
+function run(){print(\"BEGIN(\");\
+for(var i=0; i<outline.length; i++){pp(outline[i], 1);}print(\")\");};
+run()"
+  "JS script to extract the PDF's outline using mutool.
+The script has to be minified to pass it to the REPL.  The \"BEGIN\"
+marker is here to skip past the prompt characters.")
+
 (defun doc-view--pdf-outline (&optional file-name)
   "Return a list describing the outline of FILE-NAME.
 Return a list describing the current file if FILE-NAME is nil.
@@ -1986,21 +1998,25 @@ doc-view--pdf-outline
 structure is extracted by `doc-view--imenu-subtree'."
   (let ((fn (or file-name (buffer-file-name))))
     (when fn
-      (let ((outline nil)
-            (fn (expand-file-name fn)))
-        (with-temp-buffer
-          (unless (eql 0 (call-process doc-view-pdfdraw-program nil
-                                       (current-buffer) nil "show" fn "outline"))
+      (with-temp-buffer
+        (let ((proc (make-process
+                     :name "doc-view-pdf-outline"
+                     :command (list "mutool" "run")
+                     :buffer (current-buffer))))
+          (process-send-string proc (format doc-view--mutool-pdf-outline-script
+                                            (expand-file-name fn)))
+          ;; Need to send this twice for some reason...
+          (process-send-eof)
+          (process-send-eof)
+          (while (accept-process-output proc))
+          (unless (eq (process-status proc) 'exit)
             (setq doc-view--outline 'unavailable)
             (imenu-unavailable-error "Unable to create imenu index using `mutool'"))
           (goto-char (point-min))
-          (while (re-search-forward doc-view--outline-rx nil t)
-            (push `((level . ,(length (match-string 1)))
-                    (title . ,(replace-regexp-in-string "\\\\[rt]" " "
-                                                        (match-string 2)))
-                    (page . ,(string-to-number (match-string 3))))
-                  outline)))
-        (nreverse outline)))))
+          (when (search-forward "BEGIN" nil t)
+            (condition-case nil
+                (read (current-buffer))
+              (end-of-file nil))))))))
 
 (defun doc-view--djvu-outline (&optional file-name)
   "Return a list describing the outline of FILE-NAME.
-- 
2.45.2



* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-07 12:53                       ` Visuwesh
@ 2024-10-07 15:04                         ` Tassilo Horn
  2024-10-08  9:44                           ` Visuwesh
  0 siblings, 1 reply; 18+ messages in thread
From: Tassilo Horn @ 2024-10-07 15:04 UTC (permalink / raw)
  To: Visuwesh; +Cc: Eli Zaretskii, 73638

Visuwesh <visuweshm@gmail.com> writes:

>> The patch looks good.  But during testing, it seems that the index is
>> always off by one page, i.e., the index for some section brings me to
>> page 117 but the section heading is actually on page 118.
>>
>> I have that both with the Peter Atkins et al. book you suggested as
>> well as with own papers which didn't work at all previously due to
>> #nameddest references.
>
> Ugghhh, looks like the page number returned by the JS function is
> zero-indexed.  Thanks for the catch (and sorry for the many mistakes
> and hence the back-and-forth).  Should be corrected in the attached
> patch.

Nope, now I get off-by-many-hundreds errors.  The Imenu entries have the
page number in parens, right?  If so, I have many references to pages
that are thrice as large as the actual number of pages, e.g., here are
some parts of the *Completions* buffer for the Atkins book:

--8<---------------cut here---------------start------------->8---
FOCUS.1.The.properties.of.gases.(341)
FOCUS.10.Molecular.symmetry.(4181)
FOCUS.11.Molecular.spectroscopy.(4481)
FOCUS.12.Magnetic.resonance.(5181)
FOCUS.13.Statistical.thermodynamics.(5621)
FOCUS.14.Molecular.interactions.(6141)
FOCUS.15.Solids.(6701)
FOCUS.16.Molecules.in.motio.(7201)
FOCUS.17.Chemical.kinetics.(7521)
FOCUS.18.Reaction.dynamics.(8101)
FOCUS.19.Processes.at.solid.surfaces.(8541)
--8<---------------cut here---------------end--------------->8---

It's large but doesn't have more than 8000 pages.

Bye,
  Tassilo






* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-07 15:04                         ` Tassilo Horn
@ 2024-10-08  9:44                           ` Visuwesh
  2024-10-08 15:43                             ` Tassilo Horn
  0 siblings, 1 reply; 18+ messages in thread
From: Visuwesh @ 2024-10-08  9:44 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: Eli Zaretskii, 73638

[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]

[Monday, October 07, 2024] Tassilo Horn wrote:

> Visuwesh <visuweshm@gmail.com> writes:
>
>>> The patch looks good.  But during testing, it seems that the index is
>>> always off by one page, i.e., the index for some section brings me to
>>> page 117 but the section heading is actually on page 118.
>>>
>>> I have that both with the Peter Atkins et al. book you suggested as
>>> well as with own papers which didn't work at all previously due to
>>> #nameddest references.
>>
>> Ugghhh, looks like the page number returned by the JS function is
>> zero-indexed.  Thanks for the catch (and sorry for the many mistakes
>> and hence the back-and-forth).  Should be corrected in the attached
>> patch.
>
> Nope, now I get off-by-many-hundreds errors.  The Imenu entries have the
> page number in parens, right?  If so, I have many references to pages
> that are thrice as large as the actual number of pages, e.g., here some
> parts of the *Completions* buffer for the Atkins book:
>
> FOCUS.1.The.properties.of.gases.(341)
> FOCUS.10.Molecular.symmetry.(4181)
> FOCUS.11.Molecular.spectroscopy.(4481)
> FOCUS.12.Magnetic.resonance.(5181)
> FOCUS.13.Statistical.thermodynamics.(5621)
> FOCUS.14.Molecular.interactions.(6141)
> FOCUS.15.Solids.(6701)
> FOCUS.16.Molecules.in.motio.(7201)
> FOCUS.17.Chemical.kinetics.(7521)
> FOCUS.18.Reaction.dynamics.(8101)
> FOCUS.19.Processes.at.solid.surfaces.(8541)
>
> It's large but doesn't have more than 8000 pages.

I messed up by not considering the precedence of operators.  :-( Fixed
in the attached patch.
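
To make the problem concrete, here is a small sketch in plain JavaScript
that runs without mutool.  The helper names `buggyEntry` and `fixedEntry`
are made up for illustration, and the zero-based index 34 is only
inferred from the "(341)" entry quoted above, not taken from actual
mutool output.

```javascript
// `+` is left-associative, so once the left operand is a string, every
// following `+` concatenates instead of adding.
function buggyEntry(pageIndex) {
  return "(page . " + pageIndex + 1 + "))";
}

// Parenthesizing forces the numeric addition to happen first.
function fixedEntry(pageIndex) {
  return "(page . " + (pageIndex + 1) + "))";
}

console.log(buggyEntry(34)); // (page . 341))
console.log(fixedEntry(34)); // (page . 35))
```

This would explain why every page number in the listing above ends in
the digit 1.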


[-- Attachment #2: 0001-Make-imenu-index-generation-for-PDFs-more-reliable.patch --]
[-- Type: text/x-diff, Size: 4418 bytes --]

From 39205fc097c30803c077014defaacef9512af902 Mon Sep 17 00:00:00 2001
From: Visuwesh <visuweshm@gmail.com>
Date: Sun, 6 Oct 2024 18:02:06 +0530
Subject: [PATCH] Make imenu index generation for PDFs more reliable

Do away with parsing the output of "mutool show FILE outline"
since the URI reported in its output may not include the page
number of the heading, and instead may contain "nameddest"
elements which cannot be resolved using "mutool".  Instead, use
a MuPDF JS script to generate the PDF outline, which allows
resolving such URIs.

* lisp/doc-view.el (doc-view--outline-rx): Remove as no longer
needed.
(doc-view--outline): Reflect that outline can be generated for
non-PDF files too.
(doc-view--mutool-pdf-outline-script): Add new variable to hold
the JS script used to generate the outline.
(doc-view--pdf-outline): Use the script.  (bug#73638)
---
 lisp/doc-view.el | 48 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/lisp/doc-view.el b/lisp/doc-view.el
index 446beeafd9f..57a24418616 100644
--- a/lisp/doc-view.el
+++ b/lisp/doc-view.el
@@ -1969,14 +1969,26 @@ doc-view-search-previous-match
 	(doc-view-goto-page (caar (last doc-view--current-search-matches)))))))
 
 ;;;; Imenu support
-(defconst doc-view--outline-rx
-  "[^\t]+\\(\t+\\)\"\\(.+\\)\"\t#\\(?:page=\\)?\\([0-9]+\\)")
-
 (defvar-local doc-view--outline nil
-  "Cached PDF outline, so that it is only computed once per document.
+  "Cached document outline, so that it is only computed once per document.
 It can be the symbol `unavailable' to indicate that outline is
 unavailable for the document.")
 
+(defvar doc-view--mutool-pdf-outline-script
+  "var document = new Document.openDocument(\"%s\", \"application/pdf\");
+var outline = document.loadOutline();
+if(!outline) quit();
+function pp(outl, level){print(\"((level . \" + level + \")\");\
+print(\"(title . \" + repr(outl.title) + \")\");\
+print(\"(page . \" + (document.resolveLink(outl.uri)+1) + \"))\");\
+if(outl.down){for(var i=0; i<outl.down.length; i++){pp(outl.down[i], level+1);}}};
+function run(){print(\"BEGIN(\");\
+for(var i=0; i<outline.length; i++){pp(outline[i], 1);}print(\")\");};
+run()"
+  "JS script to extract the PDF's outline using mutool.
+The script has to be minified to pass it to the REPL.  The \"BEGIN\"
+marker is here to skip past the prompt characters.")
+
 (defun doc-view--pdf-outline (&optional file-name)
   "Return a list describing the outline of FILE-NAME.
 Return a list describing the current file if FILE-NAME is nil.
@@ -1986,21 +1998,25 @@ doc-view--pdf-outline
 structure is extracted by `doc-view--imenu-subtree'."
   (let ((fn (or file-name (buffer-file-name))))
     (when fn
-      (let ((outline nil)
-            (fn (expand-file-name fn)))
-        (with-temp-buffer
-          (unless (eql 0 (call-process doc-view-pdfdraw-program nil
-                                       (current-buffer) nil "show" fn "outline"))
+      (with-temp-buffer
+        (let ((proc (make-process
+                     :name "doc-view-pdf-outline"
+                     :command (list "mutool" "run")
+                     :buffer (current-buffer))))
+          (process-send-string proc (format doc-view--mutool-pdf-outline-script
+                                            (expand-file-name fn)))
+          ;; Need to send this twice for some reason...
+          (process-send-eof)
+          (process-send-eof)
+          (while (accept-process-output proc))
+          (unless (eq (process-status proc) 'exit)
             (setq doc-view--outline 'unavailable)
             (imenu-unavailable-error "Unable to create imenu index using `mutool'"))
           (goto-char (point-min))
-          (while (re-search-forward doc-view--outline-rx nil t)
-            (push `((level . ,(length (match-string 1)))
-                    (title . ,(replace-regexp-in-string "\\\\[rt]" " "
-                                                        (match-string 2)))
-                    (page . ,(string-to-number (match-string 3))))
-                  outline)))
-        (nreverse outline)))))
+          (when (search-forward "BEGIN" nil t)
+            (condition-case nil
+                (read (current-buffer))
+              (end-of-file nil))))))))
 
 (defun doc-view--djvu-outline (&optional file-name)
   "Return a list describing the outline of FILE-NAME.
-- 
2.45.2
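
For readers who find the minified script in the patch hard to follow, the
traversal it performs can be sketched as ordinary JavaScript.  This is
not the code that was committed: it runs without mutool, takes the
outline as a plain array of { title, uri, down } objects, uses
JSON.stringify as a stand-in for mutool's repr(), and receives the link
resolver as a parameter.  The names `formatOutline`, `pages` and
`sample` are hypothetical, and the page values in the lookup table are
taken from the example output at the start of this report.

```javascript
// Walk an outline tree depth-first and collect one alist-style entry per
// heading.  OUTLINE is an array of { title, uri, down } objects; RESOLVE
// maps a URI to a zero-based page index.
function formatOutline(outline, resolve) {
  var lines = ["("];
  function walk(item, level) {
    lines.push("((level . " + level + ")");
    // JSON.stringify stands in for mutool's repr() here (an assumption).
    lines.push("(title . " + JSON.stringify(item.title) + ")");
    // Parentheses make the +1 a numeric addition, not a concatenation.
    lines.push("(page . " + (resolve(item.uri) + 1) + "))");
    if (item.down) {
      for (var i = 0; i < item.down.length; i++) {
        walk(item.down[i], level + 1);
      }
    }
  }
  for (var i = 0; i < outline.length; i++) {
    walk(outline[i], 1);
  }
  lines.push(")");
  return lines.join("\n");
}

// Hypothetical stand-in for document.resolveLink(): a fixed lookup table.
var pages = { "#nameddest=section.5": 4, "#nameddest=subsection.5.1": 4 };
var sample = [
  { title: "Outline", uri: "#nameddest=section.5",
    down: [{ title: "subsection", uri: "#nameddest=subsection.5.1" }] }
];

console.log(formatOutline(sample, function (uri) { return pages[uri]; }));
```

Passing the resolver in as a function keeps the sketch testable; in the
real script that role is played by document.resolveLink.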



* bug#73638: 31.0.50; doc-view: imenu index cannot be made for LaTeX PDFs
  2024-10-08  9:44                           ` Visuwesh
@ 2024-10-08 15:43                             ` Tassilo Horn
  0 siblings, 0 replies; 18+ messages in thread
From: Tassilo Horn @ 2024-10-08 15:43 UTC (permalink / raw)
  To: Visuwesh; +Cc: Eli Zaretskii, 73638-done

Visuwesh <visuweshm@gmail.com> writes:

>> Nope, now I get off-by-many-hundreds errors.  The Imenu entries have
>> the page number in parens, right?  If so, I have many references to
>> pages that are thrice as large as the actual number of pages, e.g.,
>> here some parts of the *Completions* buffer for the Atkins book:
>>
>> FOCUS.1.The.properties.of.gases.(341)
>> FOCUS.10.Molecular.symmetry.(4181)
>> FOCUS.11.Molecular.spectroscopy.(4481)
>> FOCUS.12.Magnetic.resonance.(5181)
>> FOCUS.13.Statistical.thermodynamics.(5621)
>> FOCUS.14.Molecular.interactions.(6141)
>> FOCUS.15.Solids.(6701)
>> FOCUS.16.Molecules.in.motio.(7201)
>> FOCUS.17.Chemical.kinetics.(7521)
>> FOCUS.18.Reaction.dynamics.(8101)
>> FOCUS.19.Processes.at.solid.surfaces.(8541)
>>
>> It's large but doesn't have more than 8000 pages.
>
> I messed up by not considering the precedence of operators.  :-( Fixed
> in the attached patch.

Works!  Applied and pushed.

Thanks a lot,
  Tassilo





