Friday, September 9, 2016

Digest for comp.lang.c++@googlegroups.com - 14 updates in 3 topics

Victor Bazarov <v.bazarov@comcast.invalid>: Sep 09 08:09AM -0400

On 9/8/2016 3:59 PM, mark wrote:
 
>> ????
 
> Your removeOwner invalidates vector indices. So I don't see you needing
> indexed access.
 
Indices cannot be "invalidated". Iterators are invalidated, that's true,
but they are not used anyway. The code as written is functionally sound,
albeit inefficient.
 
 
> If you need to support duplicate entries, there is multiset /
> unordered_multiset. What you do lose with set / multiset is the
> insertion order.
 
V
--
I do not respond to top-posted replies, please don't ask
mark <mark@invalid.invalid>: Sep 09 05:06PM +0200

On 2016-09-09 14:09, Victor Bazarov wrote:
 
> Indices cannot be "invalidated". Iterators are invalidated, that's true,
> but they are not used anyway. The code as written, functionally sound,
> albeit inefficient.
 
What I mean is that elements shift. Elements after a removed element
don't keep their index. The last index will be completely invalid (going
beyond the end).
Victor Bazarov <v.bazarov@comcast.invalid>: Sep 09 11:22AM -0400

On 9/9/2016 11:06 AM, mark wrote:
 
> What I mean is that elements shift. Elements after a removed element
> don't keep their index. The last index will be completely invalid (going
> beyond the end).
 
Yes, and that's why the OP used '--num' and '--i' in the loop body.
Please take another look at the code.
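 
(For readers who don't have the upthread code: below is a minimal sketch
of the erase-by-index pattern being discussed. The names Owner, owners_
and removeOwner are hypothetical stand-ins; the OP's actual code is not
quoted in this digest.)
 
#include <cstddef>
#include <string>
#include <vector>
 
// Hypothetical stand-in for the OP's element type.
struct Owner { std::string name; };
 
std::vector<Owner> owners_;
 
// Erase-by-index in the style under discussion: after erasing element i,
// every later element shifts down one slot, so both the element count
// and the loop index are decremented to compensate.
void removeOwner(const std::string& name)
{
    for (std::size_t i = 0, num = owners_.size(); i < num; ++i)
    {
        if (owners_[i].name == name)
        {
            owners_.erase(owners_.begin() + i);
            --num;  // one fewer element to visit
            --i;    // re-examine the slot the next element shifted into
        }
    }
}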
 
V
--
I do not respond to top-posted replies, please don't ask
mark <mark@invalid.invalid>: Sep 09 08:33PM +0200

On 2016-09-09 17:22, Victor Bazarov wrote:
>> beyond the end).
 
> Yes, and that's why the OP used '--num' and '--i' in the loop body.
> Please take another look at the code.
 
I never said his code was wrong. I just stated my assumption that he
probably wasn't getting much use out of vector properties (like being
contiguous, random access) and that an alternate data structure might be
better.
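 
(To make that suggestion concrete, a sketch under the assumption that
owners are looked up by name only; owners_ and removeOwner are
hypothetical stand-ins for the unquoted original.)
 
#include <string>
#include <unordered_set>
 
// Keeping owners in a multiset keyed by name makes removal by name a
// single call and still allows duplicates; insertion order is lost, as
// noted earlier in the thread.
std::unordered_multiset<std::string> owners_;
 
void removeOwner(const std::string& name)
{
    owners_.erase(name);  // erases every element equal to 'name'
}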
Victor Bazarov <v.bazarov@comcast.invalid>: Sep 09 04:07PM -0400

On 9/9/2016 2:33 PM, mark wrote:
> probably wasn't getting much use out of vector properties (like being
> contiguous, random access) and that an alternate data structure might be
> better.
 
I think I see now. You're saying that the choice of the container is
not in harmony with 'removeOwner' function's disruptive effects, i.e.
*if* some other part of the program held onto some indices (to elements
of that vector), 'removeOwner' would break the contract of their
[presumed] invariance. Yes? Thank you.
 
The OP didn't explain the reasoning for choosing 'std::vector', and I
usually refrain from speculating on code I can't see. My mistake was to
assume that others do as well.
 
V
--
I do not respond to top-posted replies, please don't ask
Lynn McGuire <lynnmcguire5@gmail.com>: Sep 08 09:02PM -0500

On 9/7/2016 7:03 PM, Alf P. Steinbach wrote:
 
> - Alf
> (who used to read MSDN Magazine at one time, it was all so shiny! and who is only a one-time Visual C++ MVP, as opposed to the
> four-time and nine-times Visual C++ MVP the author used as experts, but hey)
 
Doesn't Microsoft define the wchar_t type as 2 bytes? And isn't that embedded deep into the Win32 API?
 
My understanding is that UTF16 is just like UTF8: when you need those extra byte(s), UTF8 just adds more to the mix as needed. Doesn't
UTF16 do the same?
 
Lynn
"Öö Tiib" <ootiib@hot.ee>: Sep 08 11:43PM -0700

On Friday, 9 September 2016 05:03:07 UTC+3, Lynn McGuire wrote:
> > four-time and nine-times Visual C++ MVP the author used as experts, but hey)
 
> Doesn't Microsoft define wchar_t type as 2 bytes ? And isn't that
> embedded deep into the Win32 API ?
 
Yes, but it is non-conforming. Also, as its name suggests, Win32 is
legacy. Very few people have 32-bit hardware on their desktop.
 
 
> My understanding that UTF16 is just like UTF8. When you need those
> extra byte(s), UTF8 just adds more to the mix as needed. Doesn't
> UTF16 do the same ?
 
UTF-8 does not have endianness issues and 7-bit ASCII is UTF-8. So
UTF-8 has two benefits above UTF-16.
David Brown <david.brown@hesbynett.no>: Sep 09 09:20AM +0200

On 09/09/16 04:02, Lynn McGuire wrote:
>> but hey)
 
> Doesn't Microsoft define wchar_t type as 2 bytes ? And isn't that
> embedded deep into the Win32 API ?
 
Yes. When MS started using unicode, they did not have proper unicode -
they used UCS-2, which is basically UTF-16 except that there are no
multi-unit encodings. It covers all the unicode characters that fit in
a single 16-bit UTF-16 code unit. And since UCS-2 was the execution
character set for C and C++ on Windows at that time, a 16-bit wchar_t
was fine.
 
But as they have moved more and more of the API and system towards full
UTF-16, a 16-bit wchar_t is no longer conforming - you need a 32-bit
wchar_t in order to support all unicode code points in a single code
unit.
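 
(A sketch of the difference, using nothing beyond standard C++11
literals: a code point outside the BMP needs two UTF-16 code units but
only one UTF-32 code unit, and sizeof(wchar_t) varies by platform.)
 
#include <cstdio>
 
int main()
{
    // U+1F600 lies outside the Basic Multilingual Plane.
    const char16_t emoji16[] = u"\U0001F600";
    const char32_t emoji32[] = U"\U0001F600";
 
    // Code units needed for U+1F600 (array length minus the terminator):
    std::printf("UTF-16 code units: %zu\n",
                sizeof(emoji16) / sizeof(emoji16[0]) - 1);    // 2 (a surrogate pair)
    std::printf("UTF-32 code units: %zu\n",
                sizeof(emoji32) / sizeof(emoji32[0]) - 1);    // 1
    std::printf("sizeof(wchar_t):   %zu\n", sizeof(wchar_t)); // 2 on Windows, typically 4 elsewhere
}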
 
 
> My understanding that UTF16 is just like UTF8. When you need those
> extra byte(s), UTF8 just adds more to the mix as needed. Doesn't UTF16
> do the same ?
 
UTF-8 is efficient and convenient, because it can store ASCII in single
bytes, it has no endian issues, and UTF-8 strings can be manipulated
with common string handling routines. UTF-32 is as close as you can get
to "one code unit is one character", though there are still combining
accent marks to consider. UTF-16 is the worst of both worlds - it is
inefficient, it requires multiple code units per character (though many
implementations fail to handle this properly, treating it more like
UCS-2), and it has endian problems.
 
The world has settled firmly on UTF-8 as the encoding for storing and
transferring text - rightly so, because it is the best choice in many ways.
 
And most of the programming world has settled on UTF-32 for internal
encodings in cases where character counting might be convenient and
UTF-8 is therefore not ideal. It is not used much, but since it is an
internal format, at least you don't have endian issues.
 
UTF-16 is used mainly in two legacy situations - the Windows API, and
Java. (QT is doing what it can to move over to UTF-8, restricted by
compatibility with older code.)
David Brown <david.brown@hesbynett.no>: Sep 09 09:24AM +0200

On 09/09/16 08:43, Öö Tiib wrote:
>> embedded deep into the Win32 API ?
 
> Yes, but it is non-conforming. Also, like its name suggests Win32 is
> legacy. Very rare people have 32 bit hardware on their desktop.
 
While Win32 is legacy in that it is an old API that suffers from
accumulated cruft and poor design decisions (though they might have made
sense 20 years ago when they were made), the great majority of /new/
Windows programs are still Win32. Only a few types of program are
written to Win64 - those that need lots of memory, those that reach deep
into the bowels of the system, or those that can benefit from the extra
registers, wider integers or extra instructions of x86-64. For most
Windows development work, it is easier to stick to Win32 (or a toolkit
that uses Win32 underneath), and on Windows, 32-bit programs run faster
than 64-bit programs in most cases.
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Sep 09 10:51AM +0200

On 09.09.2016 09:20, David Brown wrote:
 
> UTF-16 is used mainly in two legacy situations - Windows API, and Java.
 
You forgot the Unicode APIs.
 
ICU FAQ: "How is a Unicode string represented in ICU4C? A Unicode string
is currently represented as UTF-16."
 
Also, what you write about UTF-16 being inefficient is a very local
view. In Japanese, UTF-16 is generally more efficient than UTF-8. Or so
I've heard.
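 
(The per-character arithmetic behind that claim is easy to check; this
sketch just lets the compiler count the bytes in each literal,
terminator excluded, using U+65E5, a common CJK character, as a sample.)
 
#include <cstdio>
 
int main()
{
    std::printf("'A'   : UTF-8 %zu, UTF-16 %zu, UTF-32 %zu bytes\n",
                sizeof(u8"A") - 1, sizeof(u"A") - 2, sizeof(U"A") - 4);  // 1, 2, 4
    std::printf("U+65E5: UTF-8 %zu, UTF-16 %zu, UTF-32 %zu bytes\n",
                sizeof(u8"\u65E5") - 1, sizeof(u"\u65E5") - 2,
                sizeof(U"\u65E5") - 4);                                  // 3, 2, 4
}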
 
Certainly, on a platform using UTF-16 natively, as in Windows, that's
most efficient.
 
But there is much religious belief committed to this. One crowd starts
chanting about holy war and killing the non-believers if one should
happen to mention that a byte order mark lets Windows tools work
correctly with UTF-8. That's because they don't want to seem incompetent
when their tools fail to handle it, as gcc in particular once failed to.
I think all that religion and zealot behavior started with how gcc
botched it. This religion's priests are powerful enough to /change
standards/ to make their botched tools seem technically correct, or at
least not botched.
 
As with most everything else, one should just use a suitable tool for
the job. UTF-8 is very good for external text representation, and is
good enough for Asian languages, even if UTF-16 would probably be more
efficient. UTF-16 is very good for internal text representation for use
of ICU (Unicode processing) and in Windows, and it works also in
Unix-land, and so is IMHO the generally most reasonable choice for that
/if/ one has to decide on a single representation on all platforms.
 
 
Cheers & hth.,
 
- Alf
David Brown <david.brown@hesbynett.no>: Sep 09 12:46PM +0200

On 09/09/16 10:51, Alf P. Steinbach wrote:
 
> Also, what you write about UTF-16 being inefficient is a very local
> view. In Japanese, UTF-16 is generally more efficient than UTF-8. Or so
> I've heard.
 
That is true for a few types of document. But most documents don't
consist of pure text. They are html, xml, or some other format that
mixes the characters with structural or formatting commands. (Most
files that really are pure text are in ASCII.) When these are taken
into account, it turns out (from statistical sampling of web pages and
documents around the internet) that there is very little to be gained by
using UTF-16 even for languages where many of their characters take 2
bytes in UTF-16 and 3 bytes in UTF-8. And if there is some sort of
compression involved, as there often is on big web pages or when the
text is within a pdf, docx or odt file, the difference is negligible.
 
Or so I have heard :-) I haven't made such studies myself.
 
> religion and zealot behavior started with how gcc botched it. This
> religion's priests are powerful enough to /change standards/ to make
> their botched tools seem technically correct, or at least not botched.
 
I haven't heard stories about gcc botching unicode - but I have heard
many stories about how Windows has botched it, and how the C++ (and C)
standards botched it to make it easier to work with Windows' botched API.
 
(gcc has so far failed to make proper unicode identifiers, and the C and
C++ standards have botched the choices of allowed unicode characters in
identifiers, but that's another matter.)
 
 
But for full disclosure here, most of my programming with gcc is on
small embedded systems and I have not had to deal with unicode there at
all. And in my Windows programming, wxPython handles it all without
bothering me with the details.
 
And I understand that when MS chose UCS-2 they got in early and were
trying to make an efficient way of handling international characters
that was a big step up from multiple 8-bit code pages - it's just a
shame they could not change from 16-bit to 32-bit encoding when unicode
changed (with Unicode 2.0 in 1996, several years after Windows NT).
 
 
> of ICU (Unicode processing) and in Windows, and it works also in
> Unix-land, and so is IMHO the generally most reasonable choice for that
> /if/ one has to decide on a single representation on all platforms.
 
Well, these things are partly a matter of opinion, and partly a matter
of fitting with the APIs, libraries, toolkits, translation tools, etc.,
that you are using. If you are using Windows API and Windows tools, you
might find UTF-16 more convenient. But as the worst compromise between
UTF-8 and UTF-32, it only makes sense for code close to the Windows API.
Otherwise you'll find you've got something that /looks/ like it is one
code unit per code point, but fails for CJK unified ideographs or
emoticons. My understanding is that Windows /still/ suffers from bugs
with system code that assumes UCS-2 rather than UTF-16, though the bugs
have been greatly reduced in later versions of Windows.
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Sep 09 02:54PM +0200

On 09.09.2016 12:46, David Brown wrote:
 
> I haven't heard stories about gcc botching unicode
 
It used to be that conforming UTF-8 source code with non-ASCII string
literals could be compiled with either g++ or with MSVC, but not both.
g++ would choke on a BOM at the start, which it should have accepted.
And MSVC lacked an option to tell it the source encoding, and would
interpret the source as Windows ANSI if it was UTF-8 without a BOM.
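 
(For the curious: the UTF-8 BOM in question is the three-byte sequence
EF BB BF at the start of the file. A minimal sketch of tolerating it
when reading text, not tied to any particular compiler's internals:)
 
#include <fstream>
#include <iterator>
#include <string>
 
// Read a whole file as bytes and drop a leading UTF-8 BOM (EF BB BF),
// if present, before handing the text on for further processing.
std::string readWithoutBom(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::string text((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    if (text.size() >= 3 &&
        static_cast<unsigned char>(text[0]) == 0xEF &&
        static_cast<unsigned char>(text[1]) == 0xBB &&
        static_cast<unsigned char>(text[2]) == 0xBF)
    {
        text.erase(0, 3);
    }
    return text;
}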
 
At the time the zealots recommended restricting oneself to a subset of
the language that could be compiled with both g++ and MSVC. This was
viewed as a problem with MSVC, and/or with Windows conventions, rather
than as botched encoding support in the g++ compiler. I.e., one should
use UTF-8-encoded source without a BOM (which didn't challenge g++) and
only pure ASCII narrow literals (ditto), and no wide literals (ditto),
because that could also, as they saw it, be compiled with MSVC.
 
They were lying to the compiler, and in the blame assignment, and as is
usual that gave ungood results.
 
 
> - but I have heard
> many stories about how Windows has botched it
 
Well, you can tell that it's propaganda by the fact that it's about
assigning blame. IMO it assigns blame elsewhere for ported software that
is, or was, of very low quality.
 
But it's not just ported Unix-land software that's ungood: the
propaganda could probably not have worked if Windows itself wasn't full
of the weirdest stupidity and bugs. Up front in modern Windows, for
example, Windows Explorer, the GUI shell, scrolls away what you're
clicking on, so that double-clicks generally have unpredictable effects
or just don't work. It's so stupid that one suspects the internal
sabotage between different units that Microsoft is infamous for. Not to
mention the lack of UTF-8 support in the Windows console subsystem,
which has an API that effectively restricts it to UCS-2. /However/,
while there is indeed plenty wrong in general with Windows, Windows did
get Unicode support a good ten years before Linux, roughly.
 
UTF-8 was the old AT&T geniuses' (Ken Thompson and Rob Pike) idea for
possibly getting the Unix world on a workable path towards Unicode, by
supporting pure ASCII streams without code change. And that worked. But
it was not established in general until well past the year 2000.
 
 
>, and how the C++ (and C)
> standards botched it to make it easier to work with Windows' botched API.
 
That doesn't make sense to me, sorry.
 
 
> that you are using. If you are using Windows API and Windows tools, you
> might find UTF-16 more convenient. But as the worst compromise between
> UTF-8 and UTF-32,
 
Think about e.g. ICU.
 
Do you believe that the Unicode guys themselves would choose the worst
compromise?
 
That does not make technical sense.
 
So, this claim is technically nonsense. It only makes sense socially, as
an in-group (Unix-land) versus out-group (Windows) valuation: "they're
bad, they even smell bad; we're good". I.e. it's propaganda.
 
In order to filter out propaganda, think about whether it makes value
judgements like "worst" or "better", and where it does, whether those
judgements are made with respect to defined goals and full technical
considerations.
 
 
Cheers & hth.,
 
- Alf
David Brown <david.brown@hesbynett.no>: Sep 09 05:13PM +0200

On 09/09/16 14:54, Alf P. Steinbach wrote:
> because that could also, as they saw it, be compiled with MSVC.
 
> They were lying to the compiler, and in the blame assignment, and as is
> usual that gave ungood results.
 
UTF-8 files rarely use a BOM - its only use is to avoid
misinterpretation of the file as Latin-1 or some other encoding. So it
is understandable, but incorrect, for gcc to fail when given UTF-8
source code with a BOM. It is less understandable, and at least as bad,
for MSVC to require a BOM. gcc fixed their code in version 4.4 - has
MSVC fixed their compiler yet?
 
But that seems like a small issue to call "botching unicode".
 
gcc still doesn't allow extended identifiers, however - but I think MSVC
does? Clang certainly does.
 
 
> Well, you can tell that it's propaganda by the fact that it's about
> assigning blame. IMO assigning blame elsewhere for ported software that
> is, or was, of very low quality.
 
Are you referring to gcc as "very low quality ported software"? That's
an unusual viewpoint.
 
> Which has an API that effectively restricts it to UCS-2. /However/,
> while there is indeed plenty wrong in general with Windows, Windows did
> get Unicode support a good ten years before Linux, roughly.
 
From a very quick google search, I see references to Unicode in Linux
from the late 1990s - so ten years is perhaps an exaggeration. It
depends on what you mean by "support", of course - for the most part,
Linux really doesn't care about character sets or encodings. Strings
are a bunch of bytes terminated by a null character, and the OS will
pass them around, use them for file system names, etc., with an almost
total disregard for the contents - barring a few special characters such
as "/". As far as I understand it, the Linux kernel itself isn't much
bothered with unicode at all, and it's only the locale-related stuff,
libraries like iconv, font libraries, and of course things like X that
need to concern themselves with unicode. This was one of the key
reasons for liking UTF-8 - it meant very little had to change.
 
> possibly getting the Unix world on a workable path towards Unicode, by
> supporting pure ASCII streams without code change. And that worked. But
> it was not established in general until well past the year 2000.
 
Maybe MS and Windows simply suffer from being the brave pioneer here,
and the *nix world watched them then saw how to do it better.
 
 
>> , and how the C++ (and C)
>> standards botched it to make it easier to work with Windows' botched API.
 
> That doesn't make sense to me, sorry.
 
The C++ standard currently has char, wchar_t, char16_t and char32_t
(plus signed and unsigned versions of char - I don't know if the other
types have signed and unsigned versions). wchar_t is of
platform-dependent size, making it (and wide strings) pretty much
useless for platform-independent code. But it is the way it is because
MS wanted to
have 16-bit wchar_t to suit their UCS-2 / UTF-16 APIs. The whole thing
would have been much simpler if the language simply stuck to 8-bit code
units throughout, letting the compiler and/or locale settings figure out
the input encoding and the run-time encoding. If you really need a type
for holding a single unicode character, then the only sane choice is a
32-bit type.
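 
(A sketch of what portable code can and cannot assume here, relying
only on the standard's minimum-width guarantees for the fixed-width
character types:)
 
#include <climits>
 
// char16_t and char32_t track uint_least16_t and uint_least32_t, so a
// single char32_t always holds any Unicode code point.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t is at least 16 bits");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t is at least 32 bits");
 
// No such portable statement exists for wchar_t: it is 16 bits on
// Windows and typically 32 bits on Linux/macOS, which is why wide
// strings are awkward in platform-independent code.
constexpr char32_t smiley = U'\U0001F600';  // a non-BMP code point in one code unit
 
int main() {}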
 
 
> Think about e.g. ICU.
 
> Do you believe that the Unicode guys themselves would choose the worst
> compromise?
 
When unicode was started, everyone thought 16 bits would be enough. So
decisions that stretch back to before Unicode 2.0 would use 16-bit types
for encodings that covered all code points. When the ICU was started,
there was no UTF-32 (I don't know if there was a UTF-8 at the time).
The best choice at that time was UCS-2 - it is only as unicode outgrew
16 bits that this became a poor choice.
 
 
> So, this claim is technically nonsense. It only makes sense socially, as
> an in-group (Unix-land) versus out-group (Windows) valuation: "they're
> bad, they even small bad; we're good". I.e. it's propaganda.
 
No, the claim is technically correct - UTF-16 has all the disadvantages
of both UTF-8 and UTF-32, and misses several of the important
(independent) advantages of those two encodings. Except for
compatibility with existing UTF-16 APIs and code, there are few or no
use-cases where UTF-16 is a better choice than UTF-8 and also a better
choice than UTF-32. The fact that the world has pretty much
standardised on UTF-8 for transmission of unicode is a good indication
of this - UTF-16 is /only/ used where compatibility with existing UTF-16
APIs and code is of overriding concern.
 
Historically, ICU and Windows (and Java, and QT) made the sensible
decision of using UCS-2 when they started, but apart from QT they have
failed to make a transition to UTF-8 once it was clear that this was
the way forward.
 
 
> In order to filter out propaganda, think about whether it makes value
> evaluations like "worst", or "better", and where it does that, is it
> with respect to defined goals and full technical considerations?
 
The advantages of UTF-8 over UTF-16 are clear and technical, not just
"feelings" or propaganda.
