* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
       [not found] ` <E1WQ7Co-0004c8-Lo@vcs.savannah.gnu.org>
@ 2014-03-19 13:15   ` Stefan
  2014-03-19 14:08     ` Dmitry Gutov
  2014-03-19 16:40     ` Eli Zaretskii
  0 siblings, 2 replies; 17+ messages in thread
From: Stefan @ 2014-03-19 13:15 UTC
  To: emacs-devel

> -              (- (point) (line-beginning-position) -1))))
> +              (1+ (string-bytes (buffer-substring
> +                                 (line-beginning-position)
> +                                 (point)))))))

Instead of buffer-substring composed with string-bytes, you could use
position-bytes. You might also like to add a comment like "Hack attack:
assume the file's encoding is the same as Emacs's internal encoding".

        Stefan

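[Editorial sketch] Stefan's suggestion could look roughly like the following. This is a minimal illustration, not the code committed to company-clang; the function name `demo-column-bytes-internal' is hypothetical. It relies on `position-bytes', which reports byte positions in Emacs's internal representation, so it carries exactly the assumption Stefan flags: the file's encoding is taken to match the internal one.

```elisp
;; Hypothetical helper (not part of company-clang): 1-based byte column
;; of point, measured in Emacs's internal encoding via `position-bytes'.
(defun demo-column-bytes-internal ()
  "Return the 1-based byte offset of point from the beginning of its line.
Bytes are counted in Emacs's internal encoding.  Whether the consumer
of this number counts bytes the same way is an assumption."
  (1+ (- (position-bytes (point))
         (position-bytes (line-beginning-position)))))
```

In a multibyte buffer whose current line is `/*ыыы*/tt.' with point at the end of the line, this returns 14: seven ASCII characters, three Cyrillic characters at two bytes each, plus one.
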
* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 13:15 ` [elpa] 02/04: company-clang: handle multibyte chars between bol and point Stefan
@ 2014-03-19 14:08   ` Dmitry Gutov
  2014-03-19 16:54     ` Eli Zaretskii
  2014-03-19 16:40   ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-19 14:08 UTC
  To: Stefan; +Cc: emacs-devel

Stefan <monnier@iro.umontreal.ca> writes:

> Instead of buffer-substring composed with string-bytes, you could use
> position-bytes.

I figured written this way it looks a tiny bit nicer, and the
performance difference is negligible. Maybe I shouldn't have tried to be
inventive, though.

> You might also like to add a comment like "Hack attack:
> assume the file's encoding is the same as Emacs's internal encoding".

Hm, yes. Although it just assumes that the encoding used multiple bytes
for the same sets of chars as Emacs internals, which is more reasonable.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 14:08   ` Dmitry Gutov
@ 2014-03-19 16:54     ` Eli Zaretskii
  2014-03-19 17:56       ` Dmitry Gutov
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-19 16:54 UTC
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Wed, 19 Mar 2014 16:08:14 +0200
> Cc: emacs-devel@gnu.org
>
> > You might also like to add a comment like "Hack attack:
> > assume the file's encoding is the same as Emacs's internal encoding".
>
> Hm, yes. Although it just assumes that the encoding used multiple bytes
> for the same sets of chars as Emacs internals, which is more reasonable.

Sorry, maybe I'm missing something, but I don't see how this could be
a reasonable assumption. Don't you need to produce the same byte
stream as would be found in the file when saved to disk? If so, then
you need to produce data about byte counts as they will be in that
encoding, which is defined by buffer-file-coding-system.

Apologies if I missed something.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 16:54     ` Eli Zaretskii
@ 2014-03-19 17:56       ` Dmitry Gutov
  2014-03-19 18:33         ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-19 17:56 UTC
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

On 19.03.2014 18:54, Eli Zaretskii wrote:

> Sorry, maybe I'm missing something, but I don't see how this could be
> a reasonable assumption. Don't you need to produce the same byte
> stream as would be found in the file when saved to disk?

Since we only need to count the bytes between the bol and point (on
disk, yes), and multibyte chars are relatively rare, we can afford not
to be very accurate.

But if you could point out an easy way to obtain that byte count more
correctly, that would be quite welcome.

> If so, then
> you need to produce data about byte counts as they will be in that
> encoding, which is defined by buffer-file-coding-system.

So, um, do I use something like

   (length (encode-coding-string STR buffer-file-coding-system))

?

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 17:56       ` Dmitry Gutov
@ 2014-03-19 18:33         ` Eli Zaretskii
  2014-03-19 21:15           ` Dmitry Gutov
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-19 18:33 UTC
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> Date: Wed, 19 Mar 2014 19:56:34 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> On 19.03.2014 18:54, Eli Zaretskii wrote:
>
> > Sorry, maybe I'm missing something, but I don't see how this could be
> > a reasonable assumption. Don't you need to produce the same byte
> > stream as would be found in the file when saved to disk?
>
> Since we only need to count the bytes between the bol and point (on
> disk, yes), and multibyte chars are relatively rare, we can afford not
> to be very accurate.

If clang can endure inaccurate counts (I don't know if it can), then
perhaps that's good enough. But then so will be just length of the
string in characters.

> But if you could point out an easy way to obtain that byte count more
> correctly, that would be quite welcome.

I did, see below.

> > If so, then
> > you need to produce data about byte counts as they will be in that
> > encoding, which is defined by buffer-file-coding-system.
>
> So, um, do I use something like
>
>    (length (encode-coding-string STR buffer-file-coding-system))
>
> ?

Yes, I think so.

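[Editorial sketch] The expression agreed on in this message can be wrapped into a helper like the one below. The name `demo-column-bytes-on-disk' and the optional argument are hypothetical, added only so the idea can be tried with different coding systems. It assumes the consumer counts bytes in the file's own on-disk encoding; later messages in this thread report that Clang 3.4 did not behave that way for a UTF-16 file.

```elisp
;; Hypothetical helper: byte column of point as it would be on disk,
;; assuming the consumer counts bytes in the file's own encoding.
(defun demo-column-bytes-on-disk (&optional coding-system)
  "Return the 1-based byte column of point after encoding the line prefix.
CODING-SYSTEM defaults to the value of `buffer-file-coding-system'."
  (let ((prefix (buffer-substring-no-properties
                 (line-beginning-position) (point))))
    ;; `encode-coding-string' returns a unibyte string, so `length'
    ;; counts bytes rather than characters.
    (1+ (length (encode-coding-string
                 prefix
                 (or coding-system buffer-file-coding-system))))))
```

With point at the end of the line `/*ыыы*/tt.', `(demo-column-bytes-on-disk 'utf-8)' returns 14 and `(demo-column-bytes-on-disk 'utf-16le)' returns 21, which shows how strongly the result depends on the coding system chosen.
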
* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 18:33         ` Eli Zaretskii
@ 2014-03-19 21:15           ` Dmitry Gutov
  2014-03-20  2:56             ` Dmitry Gutov
  2014-03-20  3:47             ` Eli Zaretskii
  0 siblings, 2 replies; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-19 21:15 UTC
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

On 19.03.2014 20:33, Eli Zaretskii wrote:

> If clang can endure inaccurate counts (I don't know if it can)

Only in the sense that it won't blow up, just return inaccurate results.

>> But if you could point out an easy way to obtain that byte count more
>> correctly, that would be quite welcome.
>
> I did, see below.

Thank you. (Without being Cc'd, I hadn't read your other email until
after my previous reply).

>> So, um, do I use something like
>>
>>    (length (encode-coding-string STR buffer-file-coding-system))
>>
>> ?
>
> Yes, I think so.

Thanks.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 21:15           ` Dmitry Gutov
@ 2014-03-20  2:56             ` Dmitry Gutov
  2014-03-20  3:58               ` Eli Zaretskii
  2014-03-20  3:47             ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-20 2:56 UTC
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

>>> (length (encode-coding-string STR buffer-file-coding-system))

Alas, this doesn't work. If I set the file's encoding to UTF-16, the
current code works (with Clang 3.4), whereas the approach above
doesn't.

So it looks like Clang uses some encoding other than the one the file
is saved to disk with.

Probably UTF-8 or similar, which isn't far from utf-8-emacs.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20  2:56             ` Dmitry Gutov
@ 2014-03-20  3:58               ` Eli Zaretskii
  2014-03-20  4:10                 ` Dmitry Gutov
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-20 3:58 UTC
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> From: Dmitry Gutov <dgutov@yandex.ru>
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> Date: Thu, 20 Mar 2014 04:56:02 +0200
>
> >>> (length (encode-coding-string STR buffer-file-coding-system))
>
> Alas, this doesn't work. If I set the file's encoding to UTF-16, the
> current code works (with Clang 3.4), whereas the approach above
> doesn't.

Please tell the details: what does "don't work" mean?

> So it looks like Clang uses some encoding other than the one the file
> is saved to disk with.
>
> Probably UTF-8 or similar, which isn't far from utf-8-emacs.

The question is not what Clang uses, the question is how does it
expect the offsets to be supplied for files encoded in different
encodings. That is something that should be described in the Clang
manuals. I assumed that it needs offsets in bytes, but that
assumption was not based on anything except looking at your code.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20  3:58               ` Eli Zaretskii
@ 2014-03-20  4:10                 ` Dmitry Gutov
  2014-03-20 16:11                   ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-20 4:10 UTC
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

On 20.03.2014 05:58, Eli Zaretskii wrote:

> Please tell the details: what does "don't work" mean?

It means that Clang returns a wrong list of completions. Take this test
file, for example:

===
typedef struct test_s {
  int num_a;
  long num_b;
  char c;
} test_t;

int main(int args, char *argv[]) {
  test_t tt;
  /*ыыы*/tt.;
  return 0;
}
===

Put point after `.', type `M-x company-clang'. The list of completions
should include 3 items, from the struct test_t.

"Doesn't work" usually means that it returns a different, much longer
list. So, with the above file saved in UTF-8, either approach works. But
when it's in UTF-16, only the current one succeeds.

> The question is not what Clang uses, the question is how does it
> expect the offsets to be supplied for files encoded in different
> encodings. That is something that should be described in the Clang
> manuals.

Either it isn't, or I don't know what to search for.

> I assumed that it needs offsets in bytes, but that
> assumption was not based on anything except looking at your code.

The docstring for the relevant function
(http://clang.llvm.org/doxygen/group__CINDEX__CODE__COMPLET.html#ga50fedfa85d8d1517363952f2e10aa3bf)
says "column", but apparently it has a special notion of columns. For
example, it considers any tab character as taking only one column.

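[Editorial sketch] The difference Dmitry observes can be made concrete by measuring the text before point on the test line in several ways. The helper name `demo-prefix-counts' is hypothetical, the prefix is taken without any leading indentation, and `utf-16le' (no byte order mark) is picked purely for illustration, since the message does not say which UTF-16 variant the file was saved in.

```elisp
;; Hypothetical helper: measure one string in characters and in bytes
;; under several encodings, to see which counts agree.
(defun demo-prefix-counts (prefix)
  "Return an alist of length measurements for the string PREFIX."
  (list (cons 'chars (length prefix))
        ;; Bytes in Emacs's internal representation.
        (cons 'internal-bytes (string-bytes prefix))
        ;; Bytes after explicit encoding; the results are unibyte strings.
        (cons 'utf-8-bytes
              (length (encode-coding-string prefix 'utf-8)))
        (cons 'utf-16le-bytes
              (length (encode-coding-string prefix 'utf-16le)))))
```

For the prefix `/*ыыы*/tt.' this gives 10 characters, 13 internal bytes, 13 UTF-8 bytes and 20 UTF-16LE bytes. The internal and UTF-8 counts coincide, which is consistent with either approach working for the UTF-8 file and only the internal-encoding count working for the UTF-16 one.
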
* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20  4:10                 ` Dmitry Gutov
@ 2014-03-20 16:11                   ` Eli Zaretskii
  2014-03-20 18:58                     ` Richard Stallman
  2014-03-21  3:47                     ` Dmitry Gutov
  0 siblings, 2 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-20 16:11 UTC
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> Date: Thu, 20 Mar 2014 06:10:09 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> "Doesn't work" usually means that it returns a different, much longer
> list. So, with the above file saved in UTF-8, either approach works. But
> when it's in UTF-16, only the current one succeeds.
>
> > The question is not what Clang uses, the question is how does it
> > expect the offsets to be supplied for files encoded in different
> > encodings. That is something that should be described in the Clang
> > manuals.
>
> Either it isn't, or I don't know what to search for.
>
> > I assumed that it needs offsets in bytes, but that
> > assumption was not based on anything except looking at your code.
>
> The docstring for the relevant function
> (http://clang.llvm.org/doxygen/group__CINDEX__CODE__COMPLET.html#ga50fedfa85d8d1517363952f2e10aa3bf)
> says "column", but apparently it has a special notion of columns. For
> example, it considers any tab character as taking only one column.

I needed to look in their sources, but the information there isn't
clear-cut, either (or maybe I didn't understand the code ;-). Some
functions that convert file offsets to columns count bytes from the
beginning of the line, others count characters, assuming a UTF-8
encoding. But since you say the attempt to count characters in
non-UTF-8 encoding failed, I guess clang needs byte counts of UTF-8
encoding.

In any case, please note that UTF-8 and the internal encoding used by
Emacs are not exactly identical, so IMO you should encode into UTF-8
and then use 'length' to compute the "column".

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20 16:11                   ` Eli Zaretskii
@ 2014-03-20 18:58                     ` Richard Stallman
  2014-03-20 19:04                       ` Dmitry Gutov
  2014-03-21  3:47                     ` Dmitry Gutov
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Stallman @ 2014-03-20 18:58 UTC
  To: Eli Zaretskii; +Cc: emacs-devel, monnier, dgutov

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

The presence of company-clang in our repository is a problem,
independent of whether it has bugs.

--
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20 18:58                     ` Richard Stallman
@ 2014-03-20 19:04                       ` Dmitry Gutov
  2014-03-21 12:15                         ` Richard Stallman
  0 siblings, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-20 19:04 UTC
  To: rms; +Cc: emacs-devel

On 20.03.2014 20:58, Richard Stallman wrote:

> The presence of company-clang in our repository is a problem,
> independent of whether it has bugs.

To be clear, I don't intend to remove it. Personally, at least.

Richard, have you received my last email? This is the second time you've
asked what job company-clang does, I've replied again, and there wasn't
any response back.

I'd like to at least be sure that you can receive my emails.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20 19:04                       ` Dmitry Gutov
@ 2014-03-21 12:15                         ` Richard Stallman
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Stallman @ 2014-03-21 12:15 UTC
  To: Dmitry Gutov; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Richard, have you received my last email? This is the second time you've
    asked what job company-clang does, I've replied again, and there wasn't
    any response back.

It takes me a day to respond to any message, if I can write the
response right away on seeing it. If the message requires work or
thought, it will take longer.

--
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-20 16:11                   ` Eli Zaretskii
  2014-03-20 18:58                     ` Richard Stallman
@ 2014-03-21  3:47                     ` Dmitry Gutov
  2014-03-21  8:04                       ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry Gutov @ 2014-03-21 3:47 UTC
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

On 20.03.2014 18:11, Eli Zaretskii wrote:

> I needed to look in their sources, but the information there isn't
> clear-cut, either (or maybe I didn't understand the code ;-). Some
> functions that convert file offsets to columns count bytes from the
> beginning of the line, others count characters, assuming a UTF-8
> encoding. But since you say the attempt to count characters in
> non-UTF-8 encoding failed, I guess clang needs byte counts of UTF-8
> encoding.

Yes. And from what I've read
(http://stackoverflow.com/a/8259610/615245), non-ANSI encoding support
was added piecewise, so maybe the relevant code still hasn't settled.

> In any case, please note that UTF-8 and the internal encoding used by
> Emacs are not exactly identical, so IMO you should encode into UTF-8
> and then use 'length' to compute the "column".

This makes sense. I don't think anyone's likely to encounter a source
file with characters that are encoded differently between utf-8 and
utf-8-emacs, but I guess the latter is unspecced, so it could change in
the future.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-21  3:47                     ` Dmitry Gutov
@ 2014-03-21  8:04                       ` Eli Zaretskii
  0 siblings, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-21 8:04 UTC
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> Date: Fri, 21 Mar 2014 05:47:11 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> > In any case, please note that UTF-8 and the internal encoding used by
> > Emacs are not exactly identical, so IMO you should encode into UTF-8
> > and then use 'length' to compute the "column".
>
> This makes sense. I don't think anyone's likely to encounter a source
> file with characters that are encoded differently between utf-8 and
> utf-8-emacs, but I guess the latter is unspecced, so it could change in
> the future.

The most popular use case for the differences between internal
encoding and UTF-8 is when you have raw binary bytes in the source,
for some reason.

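[Editorial sketch] The raw-byte case Eli mentions can be checked directly by comparing the internal byte count of a string with the length of its UTF-8 encoding. The helper name `demo-internal-vs-utf-8' is hypothetical, and `string-to-multibyte' is used here only to build a multibyte string that holds one raw byte.

```elisp
;; Hypothetical helper: compare the internal byte count of a string
;; with the byte count of its explicit UTF-8 encoding.
(defun demo-internal-vs-utf-8 (string)
  "Return a cons (INTERNAL . UTF-8) of byte counts for STRING."
  (cons (string-bytes string)
        ;; The encoded result is unibyte, so `length' counts bytes.
        (length (encode-coding-string string 'utf-8))))
```

For ordinary text the two numbers agree: `(demo-internal-vs-utf-8 "ы")' returns (2 . 2). For a raw byte they do not: `(demo-internal-vs-utf-8 (string-to-multibyte "\200"))' returns (2 . 1), because the internal representation spends two bytes on a raw byte while encoding writes it back out as a single byte.
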
* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 21:15           ` Dmitry Gutov
  2014-03-20  2:56             ` Dmitry Gutov
@ 2014-03-20  3:47             ` Eli Zaretskii
  1 sibling, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-20 3:47 UTC
  To: Dmitry Gutov; +Cc: monnier, emacs-devel

> Date: Wed, 19 Mar 2014 23:15:43 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> >> But if you could point out an easy way to obtain that byte count more
> >> correctly, that would be quite welcome.
> >
> > I did, see below.
>
> Thank you. (Without being Cc'd, I hadn't read your other email until
> after my previous reply).

Sorry about that. This happened because the message I was replying to
didn't have you in the addressee list.

* Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
  2014-03-19 13:15 ` [elpa] 02/04: company-clang: handle multibyte chars between bol and point Stefan
  2014-03-19 14:08   ` Dmitry Gutov
@ 2014-03-19 16:40   ` Eli Zaretskii
  1 sibling, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-19 16:40 UTC
  To: Stefan; +Cc: emacs-devel

> From: Stefan <monnier@iro.umontreal.ca>
> Date: Wed, 19 Mar 2014 09:15:27 -0400
>
> > -              (- (point) (line-beginning-position) -1))))
> > +              (1+ (string-bytes (buffer-substring
> > +                                 (line-beginning-position)
> > +                                 (point)))))))
>
> Instead of buffer-substring composed with string-bytes, you could use
> position-bytes. You might also like to add a comment like "Hack attack:
> assume the file's encoding is the same as Emacs's internal encoding".

Why assume such a thing? It's bound to break some day, for some user.

I would suggest encoding the buffer substring using
buffer-file-coding-system, and then using (length string) on the
result (which will be a unibyte string, so there's no difference
between byte and character counts). Then this code will be portable,
I think.
