soft and program: Digest for comp.lang.c++@googlegroups.com

Why all tutorials/books use non-unicode string? - 7 Updates
About programming.... - 1 Update

Why all tutorials/books use non-unicode string?

jt@toerring.de (Jens Thoms Toerring): Feb 21 10:03PM

> all tutorials and C++ books use almost always char as a character type
> (1 byte)? Why use examples which are not used in real world? This I do
> not understand.

As some others have pointed out, it adds an additional burden
because in a tutorial you now also have to explain how UTF-8
works, which distracts from the actual topic.

The std::string method length() doesn't stop working when you
use UTF-8 - depending, of course, on what you call "work". It
still tells you how many bytes the string occupies. What it no
longer tells you is how many "letters" (or "glyphs" or whatever
name you prefer - I try to avoid the word character since it's
too easily confused with the concept of a 'char') are contained
in the string. As you know, in UTF-8 a "letter" can be
represented by anything between 1 and 4 bytes. Thus you can't
equate "number of bytes" with "number of letters" anymore, as
has traditionally been done with ASCII. So you need a new
function for counting those "letters". Fortunately, writing
that isn't too hard: by inspecting the upper bits of the first
byte you can easily determine how many bytes that "letter"
occupies, which makes iterating over a string to count the
number of letters relatively simple.
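
A minimal sketch of such a counting function, assuming the string
already holds valid UTF-8. The names utf8_sequence_length and
count_letters are made up for this illustration, and what gets
counted are code points, which match "letters" only as long as no
combining characters are involved:

```cpp
#include <cstddef>
#include <string>

// Returns how many bytes a UTF-8 sequence occupies, judging only by
// the upper bits of its first (lead) byte. Returns 0 if the byte
// cannot start a sequence (continuation byte or invalid value).
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Counts code points ("letters") instead of bytes.
// Assumption: s is valid UTF-8. A byte that cannot start a sequence
// is simply skipped and counted as one unit.
std::size_t count_letters(const std::string& s)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len == 0)
            len = 1;          // stray byte: step over it
        i += len;
        ++count;
    }
    return count;
}
```

For the 7-byte UTF-8 encoding of "Grüße" this returns 5, while
length() still reports 7.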

There are a few pitfalls, though: not all 1 to 4 byte long
byte sequences are valid UTF-8, so, if you deal with
external input, you must check for that possibility and
design some strategy for dealing with such cases.
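
One possible check for external input could look like the sketch
below. The function name is_valid_utf8 is hypothetical. It rejects
stray or missing continuation bytes, overlong encodings, UTF-16
surrogates and values above U+10FFFF; what to do with rejected
input (drop it, replace it, report an error) is left open:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if s consists only of well-formed UTF-8 sequences.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;      // bytes in this sequence
        std::uint32_t cp;     // decoded code point
        std::uint32_t min;    // smallest value allowed for this length

        if (lead < 0x80)                { len = 1; cp = lead;        min = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;    // continuation byte or invalid lead byte

        if (len > n - i)
            return false;     // sequence is cut off at the end

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                return false; // not a 10xxxxxx continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)                     return false; // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate range
        if (cp > 0x10FFFF)                return false; // beyond Unicode

        i += len;
    }
    return true;
}
```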

An important aspect is dealing with the environment: if, for
example, the user's keyboard is set up to send LATIN1 encoded
characters but you're expecting UTF-8 input, that will end in
grief. Or, when the output medium is set up to use a different
encoding than what your program emits, the output will look
rather strange. So you will have to spend some time giving
more attention to locale settings etc., which in a pure ASCII
world usually are taken to be arcane stuff. Another aspect is
that, if you're serious about it, you have to start thinking
about questions like: how do I enter Chinese or Japanese or
Greek etc. characters using a standard US-English keyboard
(or what tools does my system supply for that purpose).
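
To illustrate the LATIN1 versus UTF-8 mismatch mentioned above:
a LATIN1 (ISO-8859-1) "ä" arrives as the single byte 0xE4, which
is not valid UTF-8, whereas UTF-8 needs the two bytes 0xC3 0xA4.
If you know the input really is LATIN1, converting it is simple,
since each LATIN1 byte value equals its Unicode code point. A
sketch, with the hypothetical name latin1_to_utf8:

```cpp
#include <string>

// Converts ISO-8859-1 (LATIN1) bytes to UTF-8.
// Assumption: the caller knows the input is LATIN1; this cannot be
// detected reliably from the bytes alone.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string out;
    out.reserve(latin1.size());
    for (char ch : latin1) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            // ASCII range: identical in both encodings
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```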

Dealing with UTF-8 in a program actually is relatively trivial
- you have to distinguish between byte count and letter count,
you should check if the input is "legal" UTF-8 and you may have
to write some UTF-8 aware iterator when looping over a string
(so it gives you the next letter, not the next 'char') etc.
And, if this is for an already existing application, you'll
have to check wherever strings are used if what you want is
the "length" in bytes or in "letters".
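
A UTF-8 aware "iterator" does not have to be a full STL iterator.
The sketch below uses a small reader class that hands out one code
point per call. Utf8Reader and decode_utf8 are names invented for
this example, and the code assumes the string has already been
checked to be valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// Steps through a UTF-8 string one code point at a time.
// Assumption: the string is valid UTF-8 and outlives the reader
// (only a reference to it is stored).
class Utf8Reader {
public:
    explicit Utf8Reader(const std::string& s) : s_(s), pos_(0) {}

    // Stores the next code point in cp and returns true,
    // or returns false once the end of the string is reached.
    bool next(char32_t& cp)
    {
        if (pos_ >= s_.size())
            return false;

        const unsigned char lead = static_cast<unsigned char>(s_[pos_]);
        std::size_t len = 1;
        char32_t value = lead;                  // 0xxxxxxx
        if ((lead & 0xE0) == 0xC0)      { len = 2; value = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07; }

        if (len > s_.size() - pos_)
            len = s_.size() - pos_;             // never read past the end

        // Each continuation byte contributes its lower 6 bits.
        for (std::size_t k = 1; k < len; ++k)
            value = (value << 6)
                  | (static_cast<unsigned char>(s_[pos_ + k]) & 0x3F);

        pos_ += len;
        cp = value;
        return true;
    }

private:
    const std::string& s_;
    std::size_t pos_;
};

// Convenience wrapper: decodes a whole string into code points.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    Utf8Reader reader(s);
    char32_t cp;
    while (reader.next(cp))
        out.push_back(cp);
    return out;
}
```

With this, a loop of the form "while (reader.next(cp))" visits
letters rather than 'char's, and decode_utf8(s).size() gives the
letter count where s.size() gives the byte count.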

I've recently done the switch from pure ASCII to UTF-8 for
a legacy library back from about 20 years ago. I've dragged
my feet for a long time doing that since I always thought that
"char-count equals letter count" would be that deep-rooted in
a piece of software that old that it would be nearly impossible
to fix that basic assumption. But, when I finally made the at-
tempt I was positively surprised that it was a lot easier than
I'd ever imagined - in most places dealing with strings it was
immediately clear if this was about the letters or bytes in a
string and, with a few functions for dealing with UTF-8, it took
me a very short time.

From that experience I tend to conclude that most of the "angst"
about UTF-8 comes more from unfamiliarity than anything else.
The actual problems are often more in the environment the user
is working in - if the keyboard is set up to send LATIN1 or CJK
or whatever other legacy encoding, then that's where the real
problems are. So, it's definitely a new world, and one has to
learn a few new things and become aware of new potential
problems (and existing solutions;-).

I can only recommend doing a few experiments with some "toy"
programs. The concept behind UTF-8 is IMHO, while ingenious,
surprisingly simple, so I found it more helpful to write a few
functions for counting "letters" in a string or detecting
invalid byte sequences than trying to understand some rather
complex libraries that do all the work for you. Not that I'd
consider those libraries to be useless, but to understand what
they're doing for you it's good to have spent a bit of time
trying to solve the simpler problems to get a feel for what's
involved - otherwise the documentation can often be hard to
understand;-)
Best regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de

Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 21 10:10PM

On 21/02/2015 22:03, Jens Thoms Toerring wrote:

> actual problems are often more the environment the user is
> working in - if the keyboard is set up to send LATIN1 or CJK or
> whatever other legacy encoding, then that's where the real pro-

In my experience keyboards send scan codes not characters; the OS
translates scan codes into characters.

/Flibble

Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 21 10:12PM

On 21 Feb 2015 20:45:02 GMT
> the LGPL. So, for example, if you use Qt version 4.x for
> your program the user must have the possibility to link it
> instead against Qt version 5.y.

I think that's probably wrong. As it has been explained to me by
those who purport to know, if you use static linking you are probably in
trouble and you may be required to distribute your code under the LGPL
if you distribute outside your organisation. If you use dynamic
linking you are generally OK. The requirement in such cases is that
your code must be relinkable against a different library (such as one
you yourself write) with the same ABI, which is a criterion met by all
current Windows and unix dynamic linking mechanisms. There is no
requirement for portability to different non-ABI compatible versions of
a particular library - that would serve no purpose.

The legal issue behind this is apparently whether you "use" the library
(OK) or if you are a derived work (you might be in trouble). Dynamic
linking seems to be regarded as the former. Static linking, or
extending the particular library itself for your own purposes, seems to
be regarded as the latter.

Chris

jt@toerring.de (Jens Thoms Toerring): Feb 21 10:28PM

> > whatever other legacy encoding, then that's where the real pro-

> In my experience keyboards send scan codes not characters; the OS
> translates scan codes into characters.

Yes, of course, you've got the OS to do the scan code translation.
That was why I first wrote about the user's "environment" before
using the more colloquial "the keyboard is set up to send..." as
a shortcut for "the keyboard sends scan codes, which the OS (and
maybe even some other intermediaries) translates into something
which finally arrives at the application". Getting a grip on that
is, as far as I see it, often the hardest part, not the actual
dealing with UTF-8 encoded strings.

Best regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de

jt@toerring.de (Jens Thoms Toerring): Feb 21 10:54PM

> those who purport to know, if you use static linking you are probably in
> trouble and you may be required to distribute your code under the LGPL
> if you distribute outside your organisation.

I don't see your point here since with static linking you'd
definitely make it impossible to link with a new version of
the LGPL'ed library. Static linking is a no-no in this case.

> linking seems to be regarded as the former. Static linking, or
> extending the particular library itself for your own purposes, seems to
> be regarded as the latter.

Sorry, you've lost me here;-) In my book saying: link it against
whatever version of Qt (or whatever other LGPL library this is
about) by giving the user all the bits s/he needs to make that
possible (including the parts of the sources that rely on wor-
king with the LGPL library) is the most honest way to go. What
that doesn't exclude is having another dynamically linked but
closed source library you actually demand money for, and which
doesn't depend on any services of the LGPL'ed library (so there
is not the least bit of doubt that there's no "deriving" going
on) - but which does the really interesting bits of work that
make it something people may be motivated to pay for.

Best regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de

jt@toerring.de (Jens Thoms Toerring): Feb 21 11:05PM

> > who supported it stopped supporting it....

> Wrong; Qt was sold to Digia and is alive and well and continually
> supported/updated.

And there's also that thingy that Qt is open-source - so
while Trolltech -> Nokia -> Digia stopping supporting fur-
ther development can be a major issue, it's not necessarily
the end of the world as we know it. The KDE guys rely on it
heavily and while they probably don't have tons of money they
still would have a big incentive to keep it alive. In that
(hypothetical) case questions about the future of cross-plat-
form compatibility might arise, but we're not at that point
yet, I hope...
Best regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de

Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 21 11:13PM

On 21 Feb 2015 22:54:12 GMT
jt@toerring.de (Jens Thoms Toerring) wrote:
[snip]
> is not the least bit of doubt that there's no "deriving" going
> on) - but which does the really interesting bits of work that
> make it something people may be motivated to pay for.

I believe you are deranged.

Chris

About programming....

Wouter van Ooijen <wouter@voti.nl>: Feb 21 11:01PM +0100

Ramine wrote on 22-Feb-15 at 1:04 AM:

> You have to understand me more, as you have noticed i have invented
> and implemented many algorithms and libraries, and i have come to a
> conclusion that i want to share with you

If you expect me to follow a link to your wonderful creations I'd first
like a line or two that explains what it will do for me. (That also
gives me a hint of how good you are at expressing yourself efficiently.)

Wouter


Saturday, February 21, 2015
