soft and program: Digest for comp.lang.c++@googlegroups.com

Here is my answer - 1 Update
What's computer programming - 2 Updates
What is the best encoding (experiences...) for unicode? - 18 Updates
Why all tutorials/books use non-unicode string? - 1 Update

Ramine <ramine@1.1>: Feb 22 05:36PM -0800

Silvio wrote:

>Please stop using the word "invent" when you haven't actually invented
>anything. It makes you come over as very stupid.

You look like a kind of stupid guy, because you are not using your
intellect to reason better..

You say I didn't invent algorithms and libraries?

Look here at my invention called SemaCondvar and SemaMonitor here:

https://sites.google.com/site/aminer68/semacondvar-semamonitor

Tell me, where do you find such concepts and objects in Java, C++ or
C#? My SemaCondvar and SemaMonitor combine all the characteristics of a
Windows event object, a Windows semaphore, an EventCount and a condition
variable, and they add the following characteristic: when you pass True
as the first parameter of the constructor, signals are not lost even if
no thread is waiting; when you pass False, signals are lost if no thread
is waiting.

You see now ? this is my invention and you will not find it in C++
or C# or Java.
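The constructor flag described above (keep signals when nobody is waiting, or drop them) can be illustrated with standard C++ primitives. The following is only a minimal sketch of that one behaviour, not the SemaCondvar implementation from the linked page; the class name `LatchingSignal` and its members are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical illustration only: a signal object whose constructor flag
// decides whether a signal sent while no thread is waiting is remembered.
class LatchingSignal {
public:
    explicit LatchingSignal(bool keep_signals) : keep_signals_(keep_signals) {}

    // Post one signal. In "drop" mode it is kept only if a blocked
    // waiter has not yet been assigned a signal.
    void signal() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (keep_signals_ || pending_ < waiters_) {
            ++pending_;
            cv_.notify_one();
        }
    }

    // Block until a signal is available, then consume it.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        ++waiters_;
        cv_.wait(lock, [this] { return pending_ > 0; });
        --waiters_;
        --pending_;
    }

    // Consume a stored signal if there is one; never blocks.
    bool try_wait() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_ == 0) return false;
        --pending_;
        return true;
    }

private:
    const bool keep_signals_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t pending_ = 0;  // signals posted but not yet consumed
    std::size_t waiters_ = 0;  // threads currently blocked in wait()
};
```

With `LatchingSignal s(true)`, a `signal()` issued before any `wait()` is remembered and a later `try_wait()` succeeds; with `LatchingSignal s(false)` the same early signal is discarded.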

And look at my other invention here called scalable MLock:

https://sites.google.com/site/aminer68/scalable-mlock

My scalable MLock is wait-free; read more about it on my webpage.

And look at my other invention, a parallel conjugate gradient linear
system solver library that is cache-aware and scalable on NUMA
architectures, here:

https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware

Also look at my other invention, called parallel varfiler. It is very
powerful and you will not find it in C#, Java or C++; read about it
here:

https://sites.google.com/site/aminer68/more-scalable-parallel-varfiler

I have also invented a scalable RWLock with many variants, read about
them here:

https://sites.google.com/site/aminer68/scalable-rwlock

I have also invented other libraries, such as my parallel archiver.
Please look at them here:

https://sites.google.com/site/aminer68/more-scalable-parallel-varfiler

So from now on, don't say that I didn't invent algorithms and libraries.

Thank you,
Amine Moulay Ramdane.

What's computer programming

Ramine <ramine@1.1>: Feb 22 04:05PM -0800

Hello...

I was thinking some more, and I have come to an interesting subject.

I was asking myself: what is computer programming?

Now that I have learned to program and have invented many algorithms
and parallel libraries, I think I am better able to define what
computer programming is.

I will define it as follows:

Computer programming is like "robotics". When you construct a robot,
you have to give it legs, hands, motors and so on, and you have to teach
it how to behave by incorporating "knowledge" into it, so that it can
behave in different ways, such as walking and jumping. Computer
programming is like robotics because constructing a computer program is
like constructing a robot: you have to construct every part of the
program by constructing "objects" and their "properties", and you have
to give your program "knowledge" by implementing the "methods" of the
objects, so that your program can behave in different ways, like a
robot. This is how I define programming: it's like robotics.

Thank you,
Amine Moulay Ramdane.

Ramine <ramine@1.1>: Feb 22 04:46PM -0800

Hello....

On 2/22/2015 1:09 PM, Richard Heathfield wrote:
> No, it's more like creating a list of instructions that a computer can
> is basically a computer with peripherals that can move, and you have to
> program the movements. The definition thus becomes recursive, and not in
> a helpful way.

I don't agree with you..

Because what's a physical robot ? and what's a computer ?

When you construct a computer program, you also move, mix and work with
bits and bytes, and you take measurements on those bits and bytes and on
conceptual and numerical representations of physical objects. From them
you construct other objects, and you have to incorporate knowledge into
the program to give it the ability to behave in different ways. So a
computer program is like a physical microprocessor, a computer or a
robot: to construct those you have to move, mix and work with matter,
and take measurements on that matter, to build physical representations
(also starting from a concept), and you have to incorporate "knowledge"
into them to give them the ability to behave in different ways. So
computer programs are like physical microprocessors, physical computers
or physical robots.

Thank you,
Amine Moulay Ramdane.

What is the best encoding (experiences...) for unicode?

Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 09:50AM -0600

> a separate issue.

> So what encoding you guys use? UTF-8 or UTF-16? What is the
> recommendation and your experiences.

Fully UTF-8, for portability. When the OS interface requires something else
(notably UTF-16 on Windows), the translation is done on the application
border.

Works well, but may become a bit tedious if large parts of the application
are using some UTF-16-only frameworks or libraries (like MFC or its newer
cousins). MFC also attempts to do its own narrow-to-wide conversions, but
using a wrong codepage, so these automatic conversions need to be switched
off for avoiding surprises.

hth
Paavo
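The "translate on the application border" approach described above can be sketched as a single conversion function that is called just before handing text to a UTF-16 interface. This is a minimal, portable sketch under stated assumptions, not code from the thread: the name `utf8_to_utf16` is hypothetical, and a real Windows application might instead call the platform's own conversion API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical boundary helper: decode UTF-8 and re-encode as UTF-16.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_for_len[] = {0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < min_for_len[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The rest of the application keeps `std::string` in UTF-8 and only the thin layer that talks to the UTF-16 API calls this function, so the conversion stays in one place.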

JiiPee <no@notvalid.com>: Feb 22 04:15PM

On 22/02/2015 15:34, Öö Tiib wrote:
> 4) fits into std::string (std::wstring is unspecified if it is UTF-16 or
> UTF-32 or something else entirely)
> 5) is encoding of majority of internet text content

yes. I knew most of these and agree.

> UTF-16 may be more convenient on Windows or with Qt. Still if significant
> part of input or output goes in UTF-8 (I already mentioned internet) then
> I would pick UTF-8 as internal representation for texts in your application.

Yes, with Windows maybe UTF-16. I do Windows programming...
OK, so I think I had better use UTF-8 and switch to UTF-16 when
Windows needs it.

> 'std::string' is continuous container of such bytes (specific encoding
> of possible text in it is not guaranteed by it). When that is accepted
> then everything works.

ok but lets say I use 3 russian letters. Many times I want to know that
there is 3 letters rather than the byte size is like 7.

> what is expected at other side (plus inevitable error handling). C++
> itself offers too few and inconvenient methods for that so we typically
> seek help for converting and checking from outside of C++ standard.

I meant std:: -. typo.

I mean, if I have a Russian word with 3 russian letters, I want to know
how many letters there are rather than how many bytes it is total.
Sometimes we need that, right?

> people expect your program to ignore case of characters or to convert
> to upper-case or to convert to lower case (or even to title case) and
> how such things are done may be specific to local traditions.

ye I understand, read about it ....

JiiPee <no@notvalid.com>: Feb 22 04:20PM

On 22/02/2015 15:50, Paavo Helde wrote:

>> So what encoding you guys use? UTF-8 or UTF-16? What is the
>> recommendation and your experiences.
> Fully UTF-8, for portability.

OK, I take your word for it, and others' as well, since people here
seem to agree with that. I am sure people have good experience with it.
It is good to hear an opinion like this. On the web they did not give
many recommendations, just the pros and cons of each. But I want to
hear recommendations.

> When the OS interface requires something else
> (notably UTF-16 on Windows), the translation is done on the application
> border.

ok. sounds logical

> Works well, but may become a bit tedious if large parts of the application
> are using some UTF-16-only frameworks or libraries (like MFC or its newer
> cousins).

hmmm, I am. But on the other hand I also use a lot there TinyXml C++
library (http://www.grinninglizard.com/tinyxmldocs/index.html) to save
my data which uses UTF-8 format. So I guess that makes it so that I
should use UTF-8.

> MFC also attempts to do its own narrow-to-wide conversions, but
> using a wrong codepage, so these automatic conversions need to be switched
> off for avoiding surprises.

well, I guess I better then check all characters work ok.

Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 22 05:01PM

On Sun, 22 Feb 2015 14:15:13 +0000
> UTF-8? Like how to get its length, find a certain character, how to
> store it (well, i guess just a char [] array does the job, or even
> std::string).

The point you need to realise about unicode is that there is no direct
way to index into unicode code points with either UTF-8 or UTF-16,
because they are variable length encodings. To find a character at a
given character
index, you have to go to the start of the string and work your way
along. Furthermore, any given UTF-8 or UTF-16 string must be treated as
const, because altering a given character may change the length (in 8
or 16 bit code units) of the entire string, so indexing is not
particularly useful.

You might think that would not be true of UTF-32, and you would be half
right. But only half right, because although you can index into a
codepoint with UTF-32, combining characters also come into play. For
example there are two renditions of characters with a diaeresis. Lower
case o with diaeresis could be given as either 'U+006F U+0308' (two code
points) or 'U+00F6' (one code point). Any reasonable equality test
should treat them as the same. So how would you describe the "length"
of that string?

There are libraries available to help with this, such as ICU and
UTF8-CPP. In practice I have rarely found them necessary. For UTF-8 I
use std::string, and if I want to progress by code points along the
string I have an iterator which does that for me, for which operator++
and operator-- iterate by whole unicode code points, and which when
dereferenced returns a 32 bit value (the current unicode code point
value) rather than a char.

Chris
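An iterator of the kind described above (stepping by whole code points over a `std::string` and dereferencing to a 32-bit value) might look roughly like the sketch below. This is not Chris Vine's code; the class name `Utf8Iterator` and the helper `count_code_points` are hypothetical, and the sketch assumes the string is already valid UTF-8 and does no error checking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical code-point iterator over a UTF-8 std::string.
// Assumes valid UTF-8; malformed input is not detected.
class Utf8Iterator {
public:
    Utf8Iterator(const std::string& s, std::size_t pos) : s_(&s), pos_(pos) {}

    // Decode the code point that starts at the current byte offset.
    char32_t operator*() const {
        const unsigned char lead = byte(pos_);
        if (lead < 0x80) return lead;  // ASCII
        const int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
        for (int k = 1; k <= extra; ++k)
            cp = (cp << 6) | (byte(pos_ + k) & 0x3F);
        return cp;
    }

    // Move forward to the start of the next code point.
    Utf8Iterator& operator++() {
        do { ++pos_; } while (pos_ < s_->size() && is_continuation(byte(pos_)));
        return *this;
    }

    // Move back to the start of the previous code point.
    Utf8Iterator& operator--() {
        do { --pos_; } while (pos_ > 0 && is_continuation(byte(pos_)));
        return *this;
    }

    bool operator==(const Utf8Iterator& o) const { return s_ == o.s_ && pos_ == o.pos_; }
    bool operator!=(const Utf8Iterator& o) const { return !(*this == o); }

private:
    static bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }
    unsigned char byte(std::size_t i) const {
        return static_cast<unsigned char>((*s_)[i]);
    }

    const std::string* s_;
    std::size_t pos_;
};

// Number of code points, obtained by walking the whole string.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (Utf8Iterator it(s, 0), end(s, s.size()); it != end; ++it) ++n;
    return n;
}
```

The text stays stored as compact UTF-8 bytes; a 32-bit value only exists for the one code point being inspected. Note that this counts code points, which is not the same as user-perceived characters when combining characters are present.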

JiiPee <no@notvalid.com>: Feb 22 05:21PM

On 22/02/2015 17:01, Chris Vine wrote:
> and operator-- iterate by whole unicode code points, and which when
> dereferenced returns a 32 bit value (the current unicode code point
> value) rather than a char.

OK, it is good to know what people use here. If others also tell what
they use, that is helpful...
OK, you use just string... I guess that might actually work for me too.
So how do you calculate the length of a Russian word, for example? Let's
say you have a 4-letter Russian word... what function do you use to
calculate its length in letters?

JiiPee <no@notvalid.com>: Feb 22 05:25PM

On 22/02/2015 17:01, Chris Vine wrote:
> and operator-- iterate by whole unicode code points, and which when
> dereferenced returns a 32 bit value (the current unicode code point
> value) rather than a char.

interesting . So the data , lets say 10000 characters, is stored as
utf-8 (to save space). Then when you need one character you give it as
32 bit value. hmmm ...

Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 11:59AM -0600

> say you have a 4-letter Russian word... what function do you use to
> calculate its length in letters?

Why would you need that? Sure, there are programs for which this is
relevant, like a program for solving or composing Russian crosswords, but
for the type of software you are writing, why would you need this? Just
curious...

If I would need this (and nothing else Unicode-related) then I probably
would use a little function like that (not tested):

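#include <stddef.h>  // size_t

size_t LengthInCodepoints(const unsigned char* utf8, size_t sizeInBytes)
{
    size_t result = 0;
    for (size_t i=0; i<sizeInBytes; ++i) {
        // Count ASCII bytes (0xxxxxxx) and lead bytes (11xxxxxx);
        // continuation bytes (10xxxxxx) are skipped.
        if ((utf8[i] & 0x80)==0 || (utf8[i] & 0x40)!=0) {
            ++result;
        }
    }
    return result;
}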

If I would need more, I would use the ICU library or something.

Cheers
Paavo

JiiPee <no@notvalid.com>: Feb 22 06:10PM

On 22/02/2015 17:59, Paavo Helde wrote:

> If I would need more, I would use the ICU library or something.

> Cheers
> Paavo

ok, I give better example: say you have a Russian sentence and you want
to change a certain word in side it, say "car" to "vehicle". How would
you do it? So how do you do replace? That is needed surely...

jt@toerring.de (Jens Thoms Toerring): Feb 22 06:42PM

> ok but lets say I use 3 russian letters. Many times I want to know that
> there is 3 letters rather than the byte size is like 7.

Here's a function for counting the number of UTF-8 letters in a
C string (this is from a project written in C, but it should
not be too much trouble converting it to C++ and using std::string
instead).

/***************************************
 * Function for determining the number of (UTF-8) characters in
 * a string. If it's not a valid UTF-8 string -1 is returned.
 ***************************************/

#include <sys/types.h>   /* ssize_t (POSIX) */

ssize_t
utf8_length( const char * str )
{
    const unsigned char * p = ( const unsigned char * ) str;
    ssize_t cnt = 0;

    if ( ! str )
        return -1;

    for ( ; *p; p++, cnt++ )
    {
        if ( *p <= 0x7F )                      // ASCII
            /* empty */ ;
        else if ( ( *p & 0xE0 ) == 0xC0 )      // should be 2 bytes
        {
            if ( ( *++p & 0xC0 ) != 0x80 )
                return -1;
        }
        else if ( ( *p & 0xF0 ) == 0xE0 )      // should be 3 bytes
        {
            if (    ( *++p & 0xC0 ) != 0x80
                 || ( *++p & 0xC0 ) != 0x80 )
                return -1;
        }
        else if ( ( *p & 0xF8 ) == 0xF0 )      // should be 4 bytes
        {
            if (    ( *++p & 0xC0 ) != 0x80
                 || ( *++p & 0xC0 ) != 0x80
                 || ( *++p & 0xC0 ) != 0x80 )
                return -1;
        }
        else                                   // anything else is invalid
            return -1;
    }

    return cnt;
}

You can probably already see the elements of an iterator lurking
in there;-)
Best regards. Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de

Robert Wessel <robertwessel2@yahoo.com>: Feb 22 01:09PM -0600

>projects really. So looking for direction.

>People already gave instructions and I read them, but just asking if
>there is more.

You might want to read this:

http://utf8everywhere.org/

Robert Wessel <robertwessel2@yahoo.com>: Feb 22 01:15PM -0600

On 22 Feb 2015 18:42:54 GMT, jt@toerring.de (Jens Thoms Toerring)
wrote:

>}

>You can probably already see the elements of an iterator lurking
>in there;-)

That will count code points, but not what you'd think of as characters
if you consider things like combining code points.
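The difference between code points and perceived characters can be made concrete with the two spellings of "ö" mentioned earlier in the thread. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be valid UTF-8, and the second function only skips the Combining Diacritical Marks block (U+0300 to U+036F), which is a rough approximation and not full grapheme segmentation as a library such as ICU would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points: every byte that is not a continuation byte
// (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}

// Rough approximation of "letters": like count_code_points, but code
// points in U+0300..U+036F are not counted, because they attach to the
// preceding base character. In UTF-8 that block is exactly the two-byte
// sequences CC 80..CC BF and CD 80..CD AF.
std::size_t count_ignoring_combining_marks(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) == 0x80) continue;  // continuation byte
        if (i + 1 < s.size()) {
            const unsigned char next = static_cast<unsigned char>(s[i + 1]);
            const bool combining =
                (b == 0xCC && (next & 0xC0) == 0x80) ||
                (b == 0xCD && next >= 0x80 && next <= 0xAF);
            if (combining) continue;       // mark, not a new letter
        }
        ++n;
    }
    return n;
}
```

For the precomposed form `"\xC3\xB6"` (U+00F6) both functions return 1; for the decomposed form `"o\xCC\x88"` (U+006F U+0308) the first returns 2 and the second returns 1. The two strings also compare unequal as bytes, which is why normalization matters before comparing or searching.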

Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 01:23PM -0600

> ok, I give better example: say you have a Russian sentence and you
> want to change a certain word in side it, say "car" to "vehicle". How
> would you do it? So how do you do replace? That is needed surely...

Here you go:

std::string sentence = ...;
std::string carInRussian =
    "\xd0\xb0\xd0\xb2\xd1\x82\xd0\xbe\xd0\xbc"
    "\xd0\xbe\xd0\xb1\xd0\xb8\xd0\xbb\xd1\x8c";
std::string vehicleInRussian =
    "\xd0\xbc\xd0\xb0\xd1\x88\xd0\xb8\xd0\xbd\xd0\xb0";

std::string::size_type pos = sentence.find(carInRussian);
if (pos!=std::string::npos) {
    sentence.replace(pos, carInRussian.length(), vehicleInRussian);
}

No Unicode knowledge needed for such replacements. Zilch. Nada.

hth
Paavo
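The snippet above replaces the first occurrence only. A variant that replaces every occurrence can be written the same way, still working purely on bytes. This is a small sketch, with `replace_all` as a hypothetical helper name; like the original, it assumes both the sentence and the search word are UTF-8 and normalized the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace every occurrence of `from` in `text` with `to`, byte-wise.
// Works on UTF-8 because a valid UTF-8 sequence can never match in the
// middle of another character's bytes.
std::string replace_all(std::string text, const std::string& from,
                        const std::string& to)
{
    if (from.empty()) return text;  // avoid an endless loop
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();           // continue after the inserted text
    }
    return text;
}
```

Called as `replace_all(sentence, carInRussian, vehicleInRussian)`, it returns a new string and leaves the grammatical questions (inflection, plural forms) to the caller.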

Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 22 07:26PM

On Sun, 22 Feb 2015 18:10:22 +0000
> ok, I give better example: say you have a Russian sentence and you
> want to change a certain word in side it, say "car" to "vehicle". How
> would you do it? So how do you do replace? That is needed surely...

You would normalize to either use or not use precomposed characters (a
unicode issue), decide how to deal with singular and plural forms and
other grammatical inflection (a language issue bearing on your problem
space) and then partition on the word(s) in question and construct a new
string from the two parts with the substitute word inserted, having
regard to whatever decisions you have reached about the grammatical
forms. (Be pleased that unlike Celtic languages such as Welsh or Irish,
Russian does not as far as I am aware have initial mutations to contend
with as well as suffixed inflexions.)

The unicode question is but one of your issues here.

On your earlier question of how do you find the length of a Russian
word, then no answer can be given until you specify what you are
measuring as your unit of length. Code points? If so, with or without
precomposed characters? Or glyphs? Or graphemes? On the latter,
(and looking more widely than just Russian) how do you treat ligatures?
For example, is eszett (ß) one or two "characters" and if one does it
become two in its upper case form? Is ligatured fi one or two
"characters"? If using Hangul, how do you deal with Hangul jamos? If
using Indic scripts, how are you counting consonant clusters? Asking
the question is usually completely pointless.

Chris

Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 01:43PM -0600

Paavo Helde <myfirstname@osa.pri.ee> wrote in
>> want to change a certain word in side it, say "car" to "vehicle". How
>> would you do it? So how do you do replace? That is needed surely...

> No Unicode knowledge needed for such replacements. Zilch. Nada.

Maybe I should have pointed out that the UTF-8 (and UTF-16) encodings have
been carefully designed to make such things work.

Complications arise if the strings are not normalized in the same way and
are using different ways for representing the same letters. But such things
can happen with ASCII as well, e.g. "$3400" and "$3,400", a tab versus
space, capitalization etc. Anyway, the point is that assuming the texts are
normalized to the needed extent, processing them without any knowledge of
code point borders is often trivial.

Cheers
Paavo

jt@toerring.de (Jens Thoms Toerring): Feb 22 08:00PM

> That will count code points, but not what you'd think of as characters
> if you consider things like combining code points.

Good catch! Hadn't thought of that. Back to the drawing
board;-)
Best regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de

JiiPee <no@notvalid.com>: Feb 22 08:47PM

On 22/02/2015 19:23, Paavo Helde wrote:

> No Unicode knowledge needed for such replacements. Zilch. Nada.

> hth
> Paavo

ok thanks, i ll save this

Geoff <geoff@invalid.invalid>: Feb 22 12:51PM -0800

>I mean, if I have a Russian word with 3 russian letters, I want to know
>how many letters there are rather than how many bytes it is total.
>Sometimes we need that, right?

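std::wstring wstr = L"123";

// length() counts wchar_t units; sizeof(wstr) would only give the size
// of the std::wstring object itself, not of the text it holds.
std::cout << "There are " << wstr.length() << " characters in wstr, "
          << "its size is " << wstr.length() * sizeof(wchar_t)
          << " bytes" << std::endl;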

JiiPee <no@notvalid.com>: Feb 22 09:04PM

On 22/02/2015 19:09, Robert Wessel wrote:
>> there is more.

> You might want to read this:

> http://utf8everywhere.org/

I already read that... but then there was somebody arguing strongly
there that UTF-16 is better, and it seemed like he won the argument and
the author backed down a bit... I don't know... that's why I am still
asking.

Why all tutorials/books use non-unicode string?

David Brown <david.brown@hesbynett.no>: Feb 22 05:42PM +0100

On 21/02/15 20:27, JiiPee wrote:
> Maybe somebody on forum or something said also "Qt s future is not
> certain".

> Ok, let me quickly try to google it again....

Don't bother trying to find old inaccurate information - either accept
that the other posters here have given you accurate information, or go
to the Qt website and read their licensing information.

Qt has changed hands a couple of times - as is usual in such cases, this
led to lots of rumours about its demise, or the demise of the open
source licences for Qt. And since Qt was originally commercial-only,
then commercial+GPL, there is an abundance of old and outdated posts and
websites that do not know about the LGPL version.

Sunday, February 22, 2015

Digest for comp.lang.c++@googlegroups.com - 22 updates in 4 topics
