From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: [elpa] 02/04: company-clang: handle multibyte chars between bol and point
Date: Thu, 20 Mar 2014 18:11:16 +0200
Message-ID: <83fvmc97ff.fsf@gnu.org>
In-reply-to: <532A6A21.8040802@yandex.ru>
To: Dmitry Gutov
Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org

> Date: Thu, 20 Mar 2014 06:10:09 +0200
> From: Dmitry Gutov
> CC: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> "Doesn't work" usually means that it returns a different, much longer
> list. So, with the above file saved in UTF-8, either approach works. But
> when it's in UTF-16, only the current one succeeds.
>
> > The question is not what Clang uses, the question is how does it
> > expect the offsets to be supplied for files encoded in different
> > encodings. That is something that should be described in the Clang
> > manuals.
>
> Either it isn't, or I don't know what to search for.
>
> > I assumed that it needs offsets in bytes, but that
> > assumption was not based on anything except looking at your code.
>
> The docstring for the relevant function
> (http://clang.llvm.org/doxygen/group__CINDEX__CODE__COMPLET.html#ga50fedfa85d8d1517363952f2e10aa3bf)
> says "column", but apparently it has a special notion of columns. For
> example, it considers any tab character as taking only one column.

I had to look in their sources, but the information there isn't
clear-cut, either (or maybe I didn't understand the code ;-). Some
functions that convert file offsets to columns count bytes from the
beginning of the line; others count characters, assuming UTF-8
encoding. But since you say the attempt to count characters in a
non-UTF-8 encoding failed, I guess clang needs byte counts in the
UTF-8 encoding.

In any case, please note that UTF-8 and the internal encoding used by
Emacs are not exactly identical, so IMO you should encode the text
into UTF-8 and then use 'length' on the result to compute the "column".
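
To make the difference between the two counts concrete, here is a minimal sketch, in Python rather than Emacs Lisp so it can be run anywhere. The function names are made up for this example, and it only contrasts character counts with UTF-8 byte counts; it says nothing about the Emacs internal encoding, and whether clang wants the value as-is or offset by one is an assumption I have not verified.

```python
def char_column(prefix: str) -> int:
    """Count the characters between the beginning of the line and point."""
    return len(prefix)


def utf8_byte_column(prefix: str) -> int:
    """Count the bytes the same text occupies once encoded as UTF-8."""
    return len(prefix.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical line prefix: a tab, six Cyrillic letters, and a period.
    prefix = "\tстрока."
    # The tab counts as one unit either way; each Cyrillic letter is one
    # character but two bytes in UTF-8, so the two counts diverge.
    print("characters:", char_column(prefix))
    print("UTF-8 bytes:", utf8_byte_column(prefix))
```

For pure-ASCII text the two functions agree, which would explain why the problem only shows up once there are multibyte characters between the beginning of the line and point.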