Terminology concerning strings
Koblinger Egmont
egmont at uhulinux.hu
Wed Apr 6 11:14:08 UTC 2005
On Wed, Apr 06, 2005 at 12:36:26PM +0200, Leonard den Ottolander wrote:
> > > * the _size_ of a string (as well as for other objects) is the number of
> > > bytes that is allocated for it. For arrays, it is the number of
> > > entries of the array. For strings it is at least _length_ + 1.
> > >
> > > * the _length_ of a string is the number of characters in it, excluding
> > > the terminating '\0'.
>
> > It seems to me that this terminology is not yet multibyte-aware. Since UTF-8
> > becomes an everyday issue and AFAIR is planned for mainstream mc 4.7.0, IMHO
> > it is very important to create a clear terminology for this even if it's not
> > yet officially implemented now.
>
> It seems you haven't read Roland's post very well. He clearly
> differentiates between size (raw number of bytes) and length (number of
> characters represented on the screen). From discussions with him I know
> he writes this post explicitly with multibyte charsets in mind. "ecs" in
> ecssup.{c,h} stands for "extended charset".
>
> Or am I missing your point?
No, it seems that I missed Roland's point.
Roland says that size >= length + 1. Just to clarify things: I guess there
are two completely different reasons why size can be greater than (and not
equal to) length + 1.

a) One can allocate a larger buffer than strlen + 1. For example, after
   x = malloc(10); strcpy(x, "asdf");
   the length is 4 and the size is 10. Or is size == 5 in this case?

b) Each multibyte character (e.g. any accented letter in UTF-8) counts as 1
   for length, but as at least 2 for size.

Am I right?
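
To make the two cases concrete, here is a minimal sketch in C. The helper
names utf8_length() and min_size() are made up for this illustration and are
not taken from mc or from ecssup.{c,h}. The sketch assumes valid,
NUL-terminated UTF-8 input and treats "character" as "code point", which is
not necessarily the same as one column on the screen (combining characters
are ignored here).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of mc: count the characters (code points)
 * in a UTF-8 string. Every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new character.
 * Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_length(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper, not part of mc: the smallest number of bytes that
 * can hold s, i.e. the byte count from strlen() plus one for the
 * terminating '\0'. */
size_t min_size(const char *s)
{
    return strlen(s) + 1;
}
```

With these helpers, "asdf" has length 4 and needs at least 5 bytes, whichever
buffer it is copied into; a malloc(10) buffer is simply larger than that
minimum (case a). The string "\xc3\xa1sdf" (an a-acute followed by "sdf")
still has length 4, but strlen() reports 5 bytes, so it needs at least 6
bytes (case b).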
--
Egmont