soft and program: Digest for comp.lang.c++@googlegroups.com

Tuesday, January 11, 2022

Digest for comp.lang.c++@googlegroups.com - 8 updates in 1 topic

Can anyone improve this ? - 8 Updates

Bonita Montero <Bonita.Montero@gmail.com>: Jan 11 09:08AM +0100

I had an idea how to write a fast UTF-8 strlen, here it is:

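#include <bit>      // std::countl_zero (C++20)
#include <cstddef>  // std::size_t

// Returns the number of code points, or (size_t)-1 for malformed input.
std::size_t utf8Strlen( char const *str )
{
    struct encode_t { std::size_t lenIncr, strIncr; };
    // Indexed by the number of leading 1-bits in the lead byte (0 to 8).
    static encode_t const encodes[] =
    {
        { 1, 1 },  // 0xxxxxxx: ASCII
        { 0, 0 },  // 10xxxxxx: continuation byte, invalid as a lead byte
        { 1, 2 },  // 110xxxxx
        { 1, 3 },  // 1110xxxx
        { 1, 4 },  // 11110xxx
        { 0, 0 },  // five or more leading ones: invalid
        { 0, 0 },
        { 0, 0 },
        { 0, 0 }
    };
    std::size_t len = 0;
    for( unsigned char c; (c = *str); )
    {
        encode_t const &enc =
            encodes[std::countl_zero( static_cast<unsigned char>( ~c ) )];
        if( !enc.lenIncr ) [[unlikely]]
            return -1;
        len += enc.lenIncr;
        // Every byte after the lead byte must look like 10xxxxxx.
        for( char const *cpEnd = str + enc.strIncr; ++str != cpEnd; )
            if( ((unsigned char)*str & 0x0C0) != 0x080 ) [[unlikely]]
                return -1;
    }
    return len;
}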

Does anyone have further ideas to improve this?

Juha Nieminen <nospam@thanks.invalid>: Jan 11 09:50AM

> return len;
> }

> Has anyone further ideas to improve this ?

How does it compare to the more straightforward:

std::size_t utf8Strlen(const char *str)
{
    std::size_t length = 0;
    for(std::size_t index = 0; str[index]; ++index, ++length)
        if((str[index] & 0xC0) == 0xC0)
            while((str[index+1] & 0xC0) == 0x80)
                ++index;
    return length;
}

Juha Nieminen <nospam@thanks.invalid>: Jan 11 11:07AM

> ++index;
> return length;
> }

Actually there's an even simpler way:

std::size_t utf8Strlen(const char *str)
{
    std::size_t length = 0;
    for(std::size_t index = 0; str[index]; ++index)
        length += (str[index] & 0xC0) != 0x80;
    return length;
}

Ben Bacarisse <ben.usenet@bsb.me.uk>: Jan 11 11:08AM

> ++index;
> return length;
> }

Or the even more straightforward:

size_t ustrlen(char *s)
{
    size_t len = 0;
    while (*s) len += (*s++ & 0xc0) != 0x80;
    return len;
}

--
Ben.

Bonita Montero <Bonita.Montero@gmail.com>: Jan 11 06:28PM +0100

On 11.01.2022 at 10:50, Juha Nieminen wrote:
> ++index;
> return length;
> }

The issue with that solution is that it doesn't detect any kind of
malformed UTF-8 string. I get the number of bytes that must follow a
lead byte from the table and check that a matching number of
continuation bytes (those of the form 10xxxxxx) actually follows.
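
To make the point about malformed input concrete, here is a minimal sketch of the "count every byte that is not a continuation byte" approach. The name countNonContinuation is hypothetical and the function is marked constexpr only so that the checks can run at compile time; it is not taken from any of the posts.

```cpp
#include <cstddef>

// Counts every byte that is not a continuation byte (10xxxxxx).
// This mirrors the simple loops quoted above; it performs no validation.
constexpr std::size_t countNonContinuation(const char *s)
{
    std::size_t len = 0;
    for (; *s; ++s)
        len += (static_cast<unsigned char>(*s) & 0xC0) != 0x80;
    return len;
}

// Well-formed: 'a', U+00E9 (C3 A9), 'b' gives 3 code points.
static_assert(countNonContinuation("a\xC3\xA9" "b") == 3);
// Malformed: lead byte C3 with its continuation byte missing is
// still counted as one character instead of being rejected.
static_assert(countNonContinuation("a\xC3" "b") == 3);
// Malformed: a stray continuation byte A9 is silently skipped.
static_assert(countNonContinuation("a\xA9" "b") == 2);
```

Both malformed inputs yield a plausible-looking count, which is the behaviour the table-driven version is meant to avoid.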

Christian Gollwitzer <auriocus@gmx.de>: Jan 11 06:41PM +0100

On 11.01.22 at 18:28, Bonita Montero wrote:
> any kind of malformed UTF-8 string. I get the number of bytes that
> must follow a lead byte from the table and check that a matching
> number of continuation bytes (those of the form 10xxxxxx) follows.

I once wrote this code to check whether a string is valid UTF-8. I used
this platform to test-drive it:
https://leetcode.com/problems/utf-8-validation/ and it came out as the
fastest solution submitted so far:

==================================================================
enum utf8token { utf8lowbyte = 1, utf8doublet = 2, utf8triplet = 3,
                 utf8quadruplet = 4, utf8highbyte, utf8fail };

static utf8token utf8classify(unsigned char data) {
    if ((data & 0x80) == 0) { return utf8lowbyte; }
    if ((data & 0xC0) == 0x80) { return utf8highbyte;}
    if ((data & 0xE0) == 0xC0) { return utf8doublet; }
    if ((data & 0xF0) == 0xE0) { return utf8triplet; }
    if ((data & 0xF8) == 0xF0) { return utf8quadruplet; }
    return utf8fail;
}

static bool valid_utf8(const char* data, std::size_t dataSize) {
    for (std::size_t i = 0; i < dataSize; i++) {
        int codelength = utf8classify(static_cast<unsigned char>(data[i]));
        if (codelength == utf8highbyte || codelength == utf8fail)
            return false;

        for (int j = 1; j<codelength; j++) {
            // check for premature end of input
            i++;
            if (i >= dataSize) return false;

            if (utf8classify(static_cast<unsigned char>(data[i])) != utf8highbyte)
                return false;
        }
    }

    return true;
}
==========================================================

It only returns true or false according to the problem definition, but I
think you can easily rework it to count the number of characters. I
believe all you have to do is increase a counter for every turn of the
outer loop.

Christian
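
The rework suggested above, bumping a counter once per turn of the outer loop, might look roughly like the following sketch. The names sequenceLength and utf8Count are hypothetical, and the -1 error return is an assumption made for illustration. Like the validator it is based on, it checks only the byte structure and does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>

// Sequence length implied by a lead byte, or 0 if the byte cannot
// start a sequence (continuation byte or invalid pattern).
constexpr int sequenceLength(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

// Returns the number of code points in data[0..dataSize), or -1 if
// the input is malformed or ends in the middle of a sequence.
constexpr std::ptrdiff_t utf8Count(const char *data, std::size_t dataSize)
{
    std::ptrdiff_t count = 0;
    for (std::size_t i = 0; i < dataSize; ++count) {
        int len = sequenceLength(static_cast<unsigned char>(data[i]));
        if (len == 0)
            return -1;                       // cannot start a sequence
        if (dataSize - i < static_cast<std::size_t>(len))
            return -1;                       // premature end of input
        for (int j = 1; j < len; ++j)
            if ((static_cast<unsigned char>(data[i + j]) & 0xC0) != 0x80)
                return -1;                   // not a continuation byte
        i += static_cast<std::size_t>(len);
    }
    return count;
}
```

Because the length is passed explicitly, this variant also accepts input with embedded NUL bytes, unlike the strlen-style versions earlier in the thread.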

Bonita Montero <Bonita.Montero@gmail.com>: Jan 11 07:36PM +0100

On 11.01.2022 at 18:41, Christian Gollwitzer wrote:
> if ((data & 0xF8) == 0xF0) { return utf8quadruplet; }
> return utf8fail;
> }

Sorry, that produces a lot of unpredictable branches.
That's why I used a table.
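
As an illustration of trading a comparison chain for a table, the sketch below precomputes the sequence length for all 256 byte values at compile time. This is a different table from the one in the original post (which is indexed by the count of leading 1-bits), and the names makeLengthTable, kLengthTable and leadByteLength are hypothetical. Whether a lookup actually beats the branches depends on the compiler, the CPU and the input, so it would need to be measured.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Builds a 256-entry table at compile time: entry b holds the sequence
// length implied by lead byte b, or 0 if b cannot start a sequence.
constexpr std::array<std::uint8_t, 256> makeLengthTable()
{
    std::array<std::uint8_t, 256> table{};
    for (unsigned b = 0; b < 256; ++b) {
        std::uint8_t len = 0;
        if      ((b & 0x80) == 0x00) len = 1;
        else if ((b & 0xE0) == 0xC0) len = 2;
        else if ((b & 0xF0) == 0xE0) len = 3;
        else if ((b & 0xF8) == 0xF0) len = 4;
        table[b] = len;
    }
    return table;
}

inline constexpr auto kLengthTable = makeLengthTable();

// At run time a single indexed load replaces the chain of comparisons.
constexpr std::size_t leadByteLength(unsigned char b)
{
    return kLengthTable[b];
}
```

The comparisons still exist, but they run only while the table is being built, not once per input byte.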

Bonita Montero <Bonita.Montero@gmail.com>: Jan 11 09:51PM +0100

I think this is even simpler:

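#include <bit>      // std::countl_zero (C++20)
#include <cstddef>  // std::size_t

std::size_t utf8Strlen( char const *str )
{
    // Indexed by the number of leading 1-bits in the lead byte. That count
    // is 8 for the byte 0xFF, so both tables need nine entries.
    static bool const isHeader[9] = { true, false, true, true, true, false,
                                      false, false, false };
    static std::size_t const sizes[9] = { 1, 0, 2, 3, 4, 0, 0, 0, 0 };
    std::size_t len = 0;
    for( unsigned char c; (c = *str); )
    {
        std::size_t headerClass =
            std::countl_zero( static_cast<unsigned char>( ~c ) );
        if( !isHeader[headerClass] ) [[unlikely]]
            return -1;
        ++len;
        // Every byte after the lead byte must look like 10xxxxxx.
        for( char const *cpEnd = str + sizes[headerClass]; ++str != cpEnd; )
            if( ((unsigned char)*str & 0x0C0) != 0x080 ) [[unlikely]]
                return -1;
    }
    return len;
}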
