Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point


* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
       [not found] ` <E1WQ7Co-0004c8-Lo@vcs.savannah.gnu.org>
@ 2014-03-19 13:15   ` Stefan
  2014-03-19 14:08     ` Dmitry Gutov
  2014-03-19 16:40     ` Eli Zaretskii
  0 siblings, 2 replies; 17+ messages in thread
From: Stefan @ 2014-03-19 13:15 UTC (permalink / raw)
  To: emacs-devel

> -            (- (point) (line-beginning-position) -1))))
> +            (1+ (string-bytes (buffer-substring
> +                               (line-beginning-position)
> +                               (point)))))))

Instead of buffer-substring composed with string-bytes, you could use
position-bytes.  You might also like to add a comment like "Hack attack:
assume the file's encoding is the same as Emacs's internal encoding".


        Stefan
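
A minimal sketch of the position-bytes variant suggested above.  The
function name `my/clang-column-internal' is hypothetical and is not part
of company-clang; the count is in bytes of Emacs's internal encoding,
which is exactly the assumption the proposed comment warns about.

```elisp
;; Hypothetical helper for illustration; not taken from company-clang.
(defun my/clang-column-internal ()
  "Return the 1-based column of point, counted in internal-encoding bytes.
Hack attack: assumes the file's encoding is the same as Emacs's
internal encoding."
  (1+ (- (position-bytes (point))
         (position-bytes (line-beginning-position)))))
```

This avoids allocating a substring, since `position-bytes' only maps
buffer positions to byte positions.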




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 13:15   ` [elpa] 02/04: company-clang: handle multibyte chars between bol and point Stefan
@ 2014-03-19 14:08     ` Dmitry Gutov
  2014-03-19 16:54       ` Eli Zaretskii
  2014-03-19 16:40     ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-19 14:08 UTC (permalink / raw)
  To: Stefan; +Cc: emacs-devel

Stefan <monnier@iro.umontreal.ca> writes:

> Instead of buffer-substring composed with string-bytes, you could use
> position-bytes.

I figured written this way it looks a tiny bit nicer, and the
performance difference is negligible.  Maybe I shouldn't have tried to
be inventive, though.

> You might also like to add a comment like "Hack attack:
> assume the file's encoding is the same as Emacs's internal encoding".

Hm, yes.  Although it just assumes that the encoding uses multiple bytes
for the same sets of chars as Emacs internals, which is more reasonable.


* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 14:08     ` Dmitry Gutov
@ 2014-03-19 16:54       ` Eli Zaretskii
  2014-03-19 17:56         ` Dmitry Gutov
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-19 16:54 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Wed, 19 Mar 2014 16:08:14 +0200
> Cc: emacs-devel@gnu.org
> 
> > You might also like to add a comment like "Hack attack:
> > assume the file's encoding is the same as Emacs's internal encoding".
> 
> Hm, yes.  Although it just assumes that the encoding uses multiple bytes
> for the same sets of chars as Emacs internals, which is more reasonable.

Sorry, maybe I'm missing something, but I don't see how this could be
a reasonable assumption.  Don't you need to produce the same byte
stream as would be found in the file when saved to disk?  If so, then
you need to produce data about byte counts as they will be in that
encoding, which is defined by buffer-file-coding-system.

Apologies if I missed something.


* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 16:54       ` Eli Zaretskii
@ 2014-03-19 17:56         ` Dmitry Gutov
  2014-03-19 18:33           ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-19 17:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

On 19.03.2014 18:54, Eli Zaretskii wrote:

> Sorry, maybe I'm missing something, but I don't see how this could be
> a reasonable assumption.  Don't you need to produce the same byte
> stream as would be found in the file when saved to disk?

Since we only need to count the bytes between the bol and point (on 
disk, yes), and multibyte chars are relatively rare, we can afford not 
to be very accurate.

But if you could point out an easy way to obtain that byte count more 
correctly, that would be quite welcome.

> If so, then
> you need to produce data about byte counts as they will be in that
> encoding, which is defined by buffer-file-coding-system.

So, um, do I use something like

  (length (encode-coding-string STR buffer-file-coding-system))

?




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 17:56         ` Dmitry Gutov
@ 2014-03-19 18:33           ` Eli Zaretskii
  2014-03-19 21:15             ` Dmitry Gutov
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-19 18:33 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> Date: Wed, 19 Mar 2014 19:56:34 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> On 19.03.2014 18:54, Eli Zaretskii wrote:
> 
> > Sorry, maybe I'm missing something, but I don't see how this could be
> > a reasonable assumption.  Don't you need to produce the same byte
> > stream as would be found in the file when saved to disk?
> 
> Since we only need to count the bytes between the bol and point (on 
> disk, yes), and multibyte chars are relatively rare, we can afford not 
> to be very accurate.

If clang can endure inaccurate counts (I don't know if it can), then
perhaps that's good enough.  But then so will be just length of the
string in characters.

> But if you could point out an easy way to obtain that byte count more 
> correctly, that would be quite welcome.

I did, see below.

> > If so, then
> > you need to produce data about byte counts as they will be in that
> > encoding, which is defined by buffer-file-coding-system.
> 
> So, um, do I use something like
> 
>   (length (encode-coding-string STR buffer-file-coding-system))
> 
> ?

Yes, I think so.




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 18:33           ` Eli Zaretskii
@ 2014-03-19 21:15             ` Dmitry Gutov
  2014-03-20  2:56               ` Dmitry Gutov
  2014-03-20  3:47               ` Eli Zaretskii
  0 siblings, 2 replies; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-19 21:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

On 19.03.2014 20:33, Eli Zaretskii wrote:

> If clang can endure inaccurate counts (I don't know if it can)

Only in the sense that it won't blow up, just return inaccurate results.

>> But if you could point out an easy way to obtain that byte count more
>> correctly, that would be quite welcome.
>
> I did, see below.

Thank you. (Without being Cc'd, I hadn't read your other email until
after my previous reply).

>> So, um, do I use something like
>>
>>    (length (encode-coding-string STR buffer-file-coding-system))
>>
>> ?
>
> Yes, I think so.

Thanks.




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 21:15             ` Dmitry Gutov
@ 2014-03-20  2:56               ` Dmitry Gutov
  2014-03-20  3:58                 ` Eli Zaretskii
  2014-03-20  3:47               ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-20  2:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

>>>    (length (encode-coding-string STR buffer-file-coding-system))

Alas, this doesn't work. If I set the file's encoding to UTF-16, the
current code works (with Clang 3.4), whereas the approach above
doesn't.

So it looks like Clang uses some encoding other than the one the file is
saved to disk with.

Probably UTF-8 or similar, which isn't far from utf-8-emacs.


* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20  2:56               ` Dmitry Gutov
@ 2014-03-20  3:58                 ` Eli Zaretskii
  2014-03-20  4:10                   ` Dmitry Gutov
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-20  3:58 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> From: Dmitry Gutov <dgutov@yandex.ru>
> Cc: monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Thu, 20 Mar 2014 04:56:02 +0200
> 
> >>>    (length (encode-coding-string STR buffer-file-coding-system))
> 
> Alas, this doesn't work. If I set the file's encoding to UTF-16, the
> current code works (with Clang 3.4), whereas the approach above
> doesn't.

Please tell the details: what does "doesn't work" mean?

> So it looks like Clang uses some encoding other than the one the file is
> saved to disk with.
> 
> Probably UTF-8 or similar, which isn't far from utf-8-emacs.

The question is not what Clang uses, the question is how does it
expect the offsets to be supplied for files encoded in different
encodings.  That is something that should be described in the Clang
manuals.  I assumed that it needs offsets in bytes, but that
assumption was not based on anything except looking at your code.




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20  3:58                 ` Eli Zaretskii
@ 2014-03-20  4:10                   ` Dmitry Gutov
  2014-03-20 16:11                     ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-20  4:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

On 20.03.2014 05:58, Eli Zaretskii wrote:

> Please tell the details: what does "doesn't work" mean?

It means that Clang returns a wrong list of completions.

Take this test file, for example:

===
typedef struct test_s {
   int num_a;
   long num_b;
   char c;
} test_t;

int main(int args, char *argv[]) {
   test_t tt;
   /*ыыы*/tt.;
   return 0;
}
===

Put point after `.', type `M-x company-clang'. The list of completions 
should include 3 items, from the struct test_t.

"Doesn't work" usually means that it returns a different, much longer 
list. So, with the above file saved in UTF-8, either approach works. But 
when it's in UTF-16, only the current one succeeds.
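
To make the difference concrete, here is a small sketch that counts the
text before point on the test line (indentation left out) in several
ways.  The helper name `my/prefix-byte-counts' is made up for this
illustration.  If Clang expects UTF-8 byte offsets, which is only a
guess at this point, the numbers would explain why both approaches
agree for a UTF-8 file and diverge for a UTF-16 one.

```elisp
;; Hypothetical helper for illustration; not part of company-clang.
(defun my/prefix-byte-counts (prefix)
  "Return a plist of lengths of PREFIX in characters and in bytes."
  (list :chars    (length prefix)
        ;; Bytes in Emacs's internal representation (what the patch counts).
        :internal (string-bytes prefix)
        ;; Bytes after encoding; the result is a unibyte string.
        :utf-8    (length (encode-coding-string prefix 'utf-8))
        :utf-16le (length (encode-coding-string prefix 'utf-16le))))

(my/prefix-byte-counts "/*ыыы*/tt.")
;; => (:chars 10 :internal 13 :utf-8 13 :utf-16le 20)
```

Each `ы' takes two bytes both in UTF-8 and in the internal encoding,
but every character takes two bytes in UTF-16.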

> The question is not what Clang uses, the question is how does it
> expect the offsets to be supplied for files encoded in different
> encodings.  That is something that should be described in the Clang
> manuals.

Either it isn't, or I don't know what to search for.

> I assumed that it needs offsets in bytes, but that
> assumption was not based on anything except looking at your code.

The docstring for the relevant function 
(http://clang.llvm.org/doxygen/group__CINDEX__CODE__COMPLET.html#ga50fedfa85d8d1517363952f2e10aa3bf) 
says "column", but apparently it has a special notion of columns. For 
example, it considers any tab character as taking only one column.


* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20  4:10                   ` Dmitry Gutov
@ 2014-03-20 16:11                     ` Eli Zaretskii
  2014-03-20 18:58                       ` Richard Stallman
  2014-03-21  3:47                       ` Dmitry Gutov
  0 siblings, 2 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-20 16:11 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> Date: Thu, 20 Mar 2014 06:10:09 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> "Doesn't work" usually means that it returns a different, much longer 
> list. So, with the above file saved in UTF-8, either approach works. But 
> when it's in UTF-16, only the current one succeeds.
> 
> > The question is not what Clang uses, the question is how does it
> > expect the offsets to be supplied for files encoded in different
> > encodings.  That is something that should be described in the Clang
> > manuals.
> 
> Either it isn't, or I don't know what to search for.
> 
> > I assumed that it needs offsets in bytes, but that
> > assumption was not based on anything except looking at your code.
> 
> The docstring for the relevant function 
> (http://clang.llvm.org/doxygen/group__CINDEX__CODE__COMPLET.html#ga50fedfa85d8d1517363952f2e10aa3bf) 
> says "column", but apparently it has a special notion of columns. For 
> example, it considers any tab character as taking only one column.

I needed to look in their sources, but the information there isn't
clear-cut, either (or maybe I didn't understand the code ;-).  Some
functions that convert file offsets to columns count bytes from the
beginning of the line, others count characters, assuming a UTF-8
encoding.  But since you say the attempt to count characters in
non-UTF-8 encoding failed, I guess clang needs byte counts of UTF-8
encoding.

In any case, please note that UTF-8 and the internal encoding used by
Emacs are not exactly identical, so IMO you should encode into UTF-8
and then use 'length' to compute the "column".
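
A minimal sketch of that suggestion, assuming Clang indeed wants UTF-8
byte counts (a guess from the observations above, not something the
Clang documentation confirms).  The function name
`my/clang-column-utf-8' is hypothetical.

```elisp
;; Hypothetical helper for illustration; not taken from company-clang.
(defun my/clang-column-utf-8 ()
  "Return the 1-based column of point, counted in UTF-8 bytes."
  (1+ (length (encode-coding-string
               (buffer-substring-no-properties
                (line-beginning-position) (point))
               ;; Plain `utf-8' encodes without a signature (BOM).
               'utf-8))))
```

The encoded result is a unibyte string, so `length' on it is a byte
count regardless of how the buffer's file is encoded on disk.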




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20 16:11                     ` Eli Zaretskii
@ 2014-03-20 18:58                       ` Richard Stallman
  2014-03-20 19:04                         ` Dmitry Gutov
  2014-03-21  3:47                       ` Dmitry Gutov
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Stallman @ 2014-03-20 18:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, monnier, dgutov

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

The presence of company-clang in our repository is a problem,
independent of whether it has bugs.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20 18:58                       ` Richard Stallman
@ 2014-03-20 19:04                         ` Dmitry Gutov
  2014-03-21 12:15                           ` Richard Stallman
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-20 19:04 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel

On 20.03.2014 20:58, Richard Stallman wrote:

> The presence of company-clang in our repository is a problem,
> independent of whether it has bugs.

To be clear, I don't intend to remove it. Personally, at least.

Richard, have you received my last email? This is the second time you've 
asked what job company-clang does, I've replied again, and there wasn't 
any response back.

I'd like to at least be sure that you can receive my emails.


* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20 19:04                         ` Dmitry Gutov
@ 2014-03-21 12:15                           ` Richard Stallman
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Stallman @ 2014-03-21 12:15 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Richard, have you received my last email? This is the second time you've 
    asked what job company-clang does, I've replied again, and there wasn't 
    any response back.

It takes me a day to respond to any message, if I can write the
response right away on seeing it.  If the message requires work or
thought, it will take longer.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.


* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20 16:11                     ` Eli Zaretskii
  2014-03-20 18:58                       ` Richard Stallman
@ 2014-03-21  3:47                       ` Dmitry Gutov
  2014-03-21  8:04                         ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-21  3:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

On 20.03.2014 18:11, Eli Zaretskii wrote:

> I needed to look in their sources, but the information there isn't
> clear-cut, either (or maybe I didn't understand the code ;-).  Some
> functions that convert file offsets to columns count bytes from the
> beginning of the line, others count characters, assuming a UTF-8
> encoding.  But since you say the attempt to count characters in
> non-UTF-8 encoding failed, I guess clang needs byte counts of UTF-8
> encoding.

Yes. And from what I've read 
(http://stackoverflow.com/a/8259610/615245), non-ANSI encoding support 
was added piecewise, so maybe the relevant code still hasn't settled.

> In any case, please note that UTF-8 and the internal encoding used by
> Emacs are not exactly identical, so IMO you should encode into UTF-8
> and then use 'length' to compute the "column".

This makes sense. I don't think anyone's likely to encounter a source 
file with characters that are encoded differently between utf-8 and 
utf-8-emacs, but I guess the latter is unspecced, so it could change in 
the future.




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-21  3:47                       ` Dmitry Gutov
@ 2014-03-21  8:04                         ` Eli Zaretskii
  0 siblings, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-21  8:04 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> Date: Fri, 21 Mar 2014 05:47:11 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > In any case, please note that UTF-8 and the internal encoding used by
> > Emacs are not exactly identical, so IMO you should encode into UTF-8
> > and then use 'length' to compute the "column".
> 
> This makes sense. I don't think anyone's likely to encounter a source 
> file with characters that are encoded differently between utf-8 and 
> utf-8-emacs, but I guess the latter is unspecced, so it could change in 
> the future.

The most popular use case for the differences between internal
encoding and UTF-8 is when you have raw binary bytes in the source,
for some reason.
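
A small sketch of the raw-byte case, for illustration only.  As far as I
understand the internal representation, a raw byte occupies more than
one byte inside Emacs but is written back as a single byte when encoded,
so the two counts below would be expected to differ.

```elisp
;; A one-character string holding the raw byte #xFF.
(let ((raw (string (unibyte-char-to-multibyte #xff))))
  (list
   ;; Size in Emacs's internal representation.
   :internal (string-bytes raw)
   ;; Size after encoding to UTF-8 (the result is a unibyte string).
   :utf-8    (length (encode-coding-string raw 'utf-8))))
```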




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 21:15             ` Dmitry Gutov
  2014-03-20  2:56               ` Dmitry Gutov
@ 2014-03-20  3:47               ` Eli Zaretskii
  1 sibling, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-20  3:47 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> Date: Wed, 19 Mar 2014 23:15:43 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> >> But if you could point out an easy way to obtain that byte count more
> >> correctly, that would be quite welcome.
> >
> > I did, see below.
> 
> Thank you. (Without being Cc'd, I hadn't read your other email until
> after my previous reply).

Sorry about that.  This happened because the message I was replying to
didn't have you in the addressee list.




* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 13:15   ` [elpa] 02/04: company-clang: handle multibyte chars between bol and point Stefan
  2014-03-19 14:08     ` Dmitry Gutov
@ 2014-03-19 16:40     ` Eli Zaretskii
  1 sibling, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-19 16:40 UTC (permalink / raw)
  To: Stefan; +Cc: emacs-devel

> From: Stefan <monnier@iro.umontreal.ca>
> Date: Wed, 19 Mar 2014 09:15:27 -0400
> 
> > -            (- (point) (line-beginning-position) -1))))
> > +            (1+ (string-bytes (buffer-substring
> > +                               (line-beginning-position)
> > +                               (point)))))))
> 
> Instead of buffer-substring composed with string-bytes, you could use
> position-bytes.  You might also like to add a comment like "Hack attack:
> assume the file's encoding is the same as Emacs's internal encoding".

Why assume such a thing?  It's bound to break some day, for some user.

I would suggest encoding the buffer substring using
buffer-file-coding-system, and then using (length string) on the
result (which will be a unibyte string, so there's no difference
between byte and character counts).  Then this code will be portable,
I think.
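
A minimal sketch of this approach.  The function name
`my/clang-column-file-encoding' is hypothetical and not part of
company-clang.  Note that elsewhere in this thread the approach is
reported not to give Clang the offsets it wants for a UTF-16 file, so
treat it as an illustration of the idea rather than a fix.

```elisp
;; Hypothetical helper for illustration; not taken from company-clang.
(defun my/clang-column-file-encoding ()
  "Return the 1-based column of point, counted in bytes of the file's encoding."
  (1+ (length (encode-coding-string
               (buffer-substring-no-properties
                (line-beginning-position) (point))
               ;; The coding system the buffer would be saved with.
               buffer-file-coding-system))))
```

One caveat: a coding system that writes a signature (BOM) may prepend it
to every encoded string, which would inflate the count for a substring.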




end of thread, other threads:[~2014-03-21 12:15 UTC | newest]

Thread overview: 17+ messages
     [not found] <20140319033013.17542.14344@vcs.savannah.gnu.org>
     [not found] ` <E1WQ7Co-0004c8-Lo@vcs.savannah.gnu.org>
2014-03-19 13:15   ` [elpa] 02/04: company-clang: handle multibyte chars between bol and point Stefan
2014-03-19 14:08     ` Dmitry Gutov
2014-03-19 16:54       ` Eli Zaretskii
2014-03-19 17:56         ` Dmitry Gutov
2014-03-19 18:33           ` Eli Zaretskii
2014-03-19 21:15             ` Dmitry Gutov
2014-03-20  2:56               ` Dmitry Gutov
2014-03-20  3:58                 ` Eli Zaretskii
2014-03-20  4:10                   ` Dmitry Gutov
2014-03-20 16:11                     ` Eli Zaretskii
2014-03-20 18:58                       ` Richard Stallman
2014-03-20 19:04                         ` Dmitry Gutov
2014-03-21 12:15                           ` Richard Stallman
2014-03-21  3:47                       ` Dmitry Gutov
2014-03-21  8:04                         ` Eli Zaretskii
2014-03-20  3:47               ` Eli Zaretskii
2014-03-19 16:40     ` Eli Zaretskii
