- best way to remove std::vector member - 5 Updates
- "C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs" - 8 Updates
- Big data Tutorials - 1 Update
Victor Bazarov <v.bazarov@comcast.invalid>: Sep 09 08:09AM -0400 On 9/8/2016 3:59 PM, mark wrote: >> ???? > Your removeOwner invalidates vector indices. So I don't see you needing > indexed access. Indices cannot be "invalidated". Iterators are invalidated, that's true, but they are not used anyway. The code as written is functionally sound, albeit inefficient. > If you need to support duplicate entries, there is multiset / > unordered_multiset. What you do lose with set / multiset is the > insertion order. V -- I do not respond to top-posted replies, please don't ask |
mark <mark@invalid.invalid>: Sep 09 05:06PM +0200 On 2016-09-09 14:09, Victor Bazarov wrote: > Indices cannot be "invalidated". Iterators are invalidated, that's true, > but they are not used anyway. The code as written, functionally sound, > albeit inefficient. What I mean is that elements shift. Elements after a removed element don't keep their index. The last index will be completely invalid (going beyond the end). |
Victor Bazarov <v.bazarov@comcast.invalid>: Sep 09 11:22AM -0400 On 9/9/2016 11:06 AM, mark wrote: > What I mean is that elements shift. Elements after a removed element > don't keep their index. The last index will be completely invalid (going > beyond the end). Yes, and that's why the OP used '--num' and '--i' in the loop body. Please take another look at the code. V -- I do not respond to top-posted replies, please don't ask |
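Since the OP's code is not quoted in this digest, here is a minimal sketch of the index-adjusting removal loop being discussed; the class, member and function names below are illustrative assumptions, not the OP's code. The '--num' and '--i' adjustments compensate for the leftward shift of the remaining elements after each erase, which is why the loop stays correct even though the indices of later elements change.

#include <cstddef>
#include <string>
#include <vector>

class Registry {
    std::vector<std::string> owners_;   // hypothetical member
public:
    void addOwner(const std::string& name) { owners_.push_back(name); }

    // Erase every element equal to 'name'.  After each erase the elements
    // to the right shift down by one, so both the element count and the
    // current index are decremented to stay in step.
    void removeOwner(const std::string& name) {
        for (std::size_t i = 0, num = owners_.size(); i < num; ++i) {
            if (owners_[i] == name) {
                owners_.erase(owners_.begin() +
                              static_cast<std::ptrdiff_t>(i));
                --num;
                --i;   // wraps to SIZE_MAX at i == 0, but ++i restores 0
            }
        }
    }
};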
mark <mark@invalid.invalid>: Sep 09 08:33PM +0200 On 2016-09-09 17:22, Victor Bazarov wrote: >> beyond the end). > Yes, and that's why the OP used '--num' and '--i' in the loop body. > Please take another look at the code. I never said his code was wrong. I just stated my assumption that he probably wasn't getting much use out of vector properties (like being contiguous, random access) and that an alternate data structure might be better. |
Victor Bazarov <v.bazarov@comcast.invalid>: Sep 09 04:07PM -0400 On 9/9/2016 2:33 PM, mark wrote: > probably wasn't getting much use out of vector properties (like being > contiguous, random access) and that an alternate data structure might be > better. I think I see now. You're saying that the choice of the container is not in harmony with 'removeOwner' function's disruptive effects, i.e. *if* some other part of the program held onto some indices (to elements of that vector), 'removeOwner' would break the contract of their [presumed] invariance. Yes? Thank you. The OP didn't explain the reasoning for choosing 'std::vector', and I usually refrain from speculating on the code I can't see. My mistake is to assume that others do as well. V -- I do not respond to top-posted replies, please don't ask |
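For comparison, the standard erase-remove idiom does the same multi-element removal in a single pass with no manual index bookkeeping. This is a sketch under the same illustrative names as above, not the OP's code; whether it is appropriate depends on the invariants the rest of the program expects from the vector.

#include <algorithm>
#include <string>
#include <vector>

void removeOwner(std::vector<std::string>& owners, const std::string& name) {
    // std::remove shifts the kept elements to the front and returns the
    // new logical end; erase then trims the leftover tail in one call.
    owners.erase(std::remove(owners.begin(), owners.end(), name),
                 owners.end());
}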
Lynn McGuire <lynnmcguire5@gmail.com>: Sep 08 09:02PM -0500 On 9/7/2016 7:03 PM, Alf P. Steinbach wrote: > - Alf > (who used to read MSDN Magazine at one time, it was all so shiny! and who is only a one-time Visual C++ MVP, as opposed to the > four-time and nine-times Visual C++ MVP the author used as experts, but hey) Doesn't Microsoft define the wchar_t type as 2 bytes ? And isn't that embedded deep into the Win32 API ? My understanding is that UTF16 is just like UTF8. When you need those extra byte(s), UTF8 just adds more to the mix as needed. Doesn't UTF16 do the same ? Lynn |
"Öö Tiib" <ootiib@hot.ee>: Sep 08 11:43PM -0700 On Friday, 9 September 2016 05:03:07 UTC+3, Lynn McGuire wrote: > > four-time and nine-times Visual C++ MVP the author used as experts, but hey) > Doesn't Microsoft define wchar_t type as 2 bytes ? And isn't that > embedded deep into the Win32 API ? Yes, but it is non-conforming. Also, like its name suggests Win32 is legacy. Very rare people have 32 bit hardware on their desktop. > My understanding that UTF16 is just like UTF8. When you need those > extra byte(s), UTF8 just adds more to the mix as needed. Doesn't > UTF16 do the same ? UTF-8 does not have endianness issues and 7-bit ASCII is UTF-8. So UTF-8 has two benefits above UTF-16. |
David Brown <david.brown@hesbynett.no>: Sep 09 09:20AM +0200 On 09/09/16 04:02, Lynn McGuire wrote: >> but hey) > Doesn't Microsoft define wchar_t type as 2 bytes ? And isn't that > embedded deep into the Win32 API ? Yes. When MS started using unicode, they did not have proper unicode - they used UCS-2 which is basically UTF-16 except that there are no multi-unit encodings. It covers all the unicode characters that fit in a single 16-bit UTF-16 code unit. And since UCS2 was the execution character set for C and C++ on Windows at that time, 16-bit wchar_t was fine. But as they have moved more and more of the API and system towards full UTF-16, it is no longer conforming - you need 32-bit wchar_t in order to support all unicode code points in a single code unit. > My understanding that UTF16 is just like UTF8. When you need those > extra byte(s), UTF8 just adds more to the mix as needed. Doesn't UTF16 > do the same ? UTF-8 is efficient and convenient, because it can store ASCII in single bytes, it has no endian issues, and UTF-8 strings can be manipulated with common string handling routines. UTF-32 is as close as you can get to "one code unit is one character", though there are still combining accent marks to consider. UTF-16 is the worst of both worlds - it is inefficient, requires multiple code units per character (though many implementations fail to do so properly, treating it more like UCS-2), and has endian problems. The world has settled firmly on UTF-8 as the encoding for storing and transferring text - rightly so, because it is the best choice in many ways. And most of the programming world has settled on UTF-32 for internal encodings in cases where character counting might be convenient and therefore UTF-8 is not ideal. It is not used much, but because it is an internal format then at least you don't have endian issues. UTF-16 is used mainly in two legacy situations - Windows API, and Java. (QT is doing what it can to move over to UTF-8, restricted by compatibility with older code.) |
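To answer the question above directly: yes, UTF-16 also grows when it has to, just less often than UTF-8. Code points above U+FFFF (the part UCS-2 cannot represent) are split into a surrogate pair of two 16-bit code units. A minimal sketch of that split, with an illustrative function name:

#include <cstddef>

// Writes one or two UTF-16 code units for 'cp' into 'out'; returns the count.
// Validation (e.g. rejecting lone surrogates) is omitted for brevity.
std::size_t encode_utf16(char32_t cp, char16_t out[2]) {
    if (cp < 0x10000) {                     // BMP: a single code unit
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    cp -= 0x10000;                          // 20 bits remain
    out[0] = static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}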
David Brown <david.brown@hesbynett.no>: Sep 09 09:24AM +0200 On 09/09/16 08:43, Öö Tiib wrote: >> embedded deep into the Win32 API ? > Yes, but it is non-conforming. Also, like its name suggests Win32 is > legacy. Very rare people have 32 bit hardware on their desktop. While Win32 is legacy in that it is an old API that suffers from accumulated cruft and poor design decisions (though they might have made sense 20 years ago when they were made), the great majority of /new/ Windows programs are still Win32. Only a few types of program are written to Win64 - those that need lots of memory, or with access deep into the bowels of the system, or that can benefit from the extra registers, wider integers or extra instructions of x86-64. For most Windows development work, it is easier to stick to Win32 (or a toolkit that uses Win32 underneath), and on Windows 32-bit programs run faster than 64-bit programs in most cases. |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Sep 09 10:51AM +0200 On 09.09.2016 09:20, David Brown wrote: > UTF-16 is used mainly in two legacy situations - Windows API, and Java. You forgot the Unicode APIs. ICU FAQ: "How is a Unicode string represented in ICU4C? A Unicode string is currently represented as UTF-16." Also, what you write about UTF-16 being inefficient is a very local view. In Japanese, UTF-16 is generally more efficient than UTF-8. Or so I've heard. Certainly, on a platform using UTF-16 natively, as in Windows, that's most efficient. But there is much religious belief committed to this. One crowd starts chanting about holy war and kill the non-believers if one should happen to mention that a byte order mark lets Windows tools work correctly with UTF-8. That's because they want to not seem incompetent when their tools fail to handle it, as in particular gcc once failed. I think all that religion and zealot behavior started with how gcc botched it. This religion's priests are powerful enough to /change standards/ to make their botched tools seem technically correct, or at least not botched. As with most everything else, one should just use a suitable tool for the job. UTF-8 is very good for external text representation, and is good enough for Asian languages, even if UTF-16 would probably be more efficient. UTF-16 is very good for internal text representation for use of ICU (Unicode processing) and in Windows, and it works also in Unix-land, and so is IMHO the generally most reasonable choice for that /if/ one has to decide on a single representation on all platforms. Cheers & hth., - Alf |
David Brown <david.brown@hesbynett.no>: Sep 09 12:46PM +0200 On 09/09/16 10:51, Alf P. Steinbach wrote: > Also, what you write about UTF-16 being inefficient is a very local > view. In Japanese, UTF-16 is generally more efficient than UTF-8. Or so > I've heard. That is true for a few types of document. But most documents don't consist of pure text. They are html, xml, or some other format that mixes the characters with structural or formatting commands. (Most files that really are pure text are in ASCII.) When these are taken into account, it turns out (from statistical sampling of web pages and documents around the internet) that there is very little to be gained by using UTF-16 even for languages where many of their characters take 2 bytes in UTF-16 and 3 bytes in UTF-8. And if there is some sort of compression involved, as there often is on big web pages or when the text is within a pdf, docx or odt file, the difference is negligible. Or so I have heard :-) I haven't made such studies myself. > religion and zealot behavior started with how gcc botched it. This > religion's priests are powerful enough to /change standards/ to make > their botched tools seem technically correct, or at least not botched. I haven't heard stories about gcc botching unicode - but I have heard many stories about how Windows has botched it, and how the C++ (and C) standards botched it to make it easier to work with Windows' botched API. (gcc has so far failed to make proper unicode identifiers, and the C and C++ standards have botched the choices of allowed unicode characters in identifiers, but that's another matter.) But for full disclosure here, most of my programming with gcc is on small embedded systems and I have not had to deal with unicode there at all. And in my Windows programming, wxPython handles it all without bothering me with the details. And I understand that when MS chose UCS-2 they were in early and were trying to make an efficient way of handling international characters that was a big step up from multiple 8-bit code pages - it's just a shame they could not change from 16-bit to 32-bit encoding when unicode changed (with Unicode 2.0 in 1996, several years after Windows NT). > of ICU (Unicode processing) and in Windows, and it works also in > Unix-land, and so is IMHO the generally most reasonable choice for that > /if/ one has to decide on a single representation on all platforms. Well, these things are partly a matter of opinion, and partly a matter of fitting with the APIs, libraries, toolkits, translation tools, etc., that you are using. If you are using Windows API and Windows tools, you might find UTF-16 more convenient. But as the worst compromise between UTF-8 and UTF-32, it only makes sense for code close to the Windows API. Otherwise you'll find you've got something that /looks/ like it is one code unit per code point, but fails for CJK unified ideographs or emoticons. My understanding is that Windows /still/ suffers from bugs with system code that assumes UCS-2 rather than UTF-16, though the bugs have been greatly reduced in later versions of Windows. |
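The size trade-off described above can be seen directly from C++11 string literals (using a \u escape so the example does not depend on the source file's own encoding): a typical CJK ideograph such as U+65E5 takes three bytes in UTF-8 but one 16-bit unit in UTF-16, while the ASCII markup that surrounds it in an HTML or XML document is half the size in UTF-8. A small sketch of the arithmetic:

// u8"" yields UTF-8 code units, u"" yields char16_t (UTF-16) code units.
static_assert(sizeof(u8"\u65E5") / sizeof(char) - 1 == 3,
              "U+65E5: 3 UTF-8 code units = 3 bytes");
static_assert(sizeof(u"\u65E5") / sizeof(char16_t) - 1 == 1,
              "U+65E5: 1 UTF-16 code unit = 2 bytes");
static_assert(sizeof(u8"<p>") - sizeof(char) == 3,
              "ASCII markup <p>: 3 bytes in UTF-8");
static_assert(sizeof(u"<p>") - sizeof(char16_t) == 6,
              "ASCII markup <p>: 6 bytes in UTF-16");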
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Sep 09 02:54PM +0200 On 09.09.2016 12:46, David Brown wrote: > I haven't heard stories about gcc botching unicode It used to be that conforming UTF-8 source code with non-ASCII string literals could be compiled with either g++, or with MSVC, but not both. g++ would choke on a BOM at the start, which it should have accepted. And MSVC lacked an option to tell it the source endcoding, and would interpret it as Windows ANSI if was UTF-8 without BOM. At the time the zealots recommended restricting oneself to a subset of the language that could be compiled with both g++ and MSVC. This was viewed as a problem with MSVC, and/or with Windows conventions, rather than botched encoding support in the g++ compiler. I.e., one should use UTF-encoded source without BOM (which didn't challenge g++) and only pure ASCII narrow literals (ditto), and no wide literals (ditto), because that could also, as they saw it, be compiled with MSVC. They were lying to the compiler, and in the blame assignment, and as is usual that gave ungood results. > - but I have heard > many stories about how Windows has botched it Well, you can tell that it's propaganda by the fact that it's about assigning blame. IMO assigning blame elsewhere for ported software that is, or was, of very low quality. But it's not just ported Unix-land software that's ungood: the propaganda could probably not have worked if Windows itself wasn't full of the weirdest stupidity and bugs, including, up front in modern Windows, that Windows Explorer, the GUI shell, scrolls away what you're clicking on so that double-clicks generally have unpredictable effects, or just don't work. It's so stupid that one suspects the internal sabotage between different units, that Microsoft is infamous for. Not to mention the lack of UTF-8 support in the Windows console subsystem. Which has an API that effectively restricts it to UCS-2. /However/, while there is indeed plenty wrong in general with Windows, Windows did get Unicode support a good ten years before Linux, roughly. UTF-8 was the old AT&T geniuses' (Ken Thompson and Rob Pike) idea for possibly getting the Unix world on a workable path towards Unicode, by supporting pure ASCII streams without code change. And that worked. But it was not established in general until well past the year 2000. >, and how the C++ (and C) > standards botched it to make it easier to work with Windows' botched API. That doesn't make sense to me, sorry. > that you are using. If you are using Windows API and Windows tools, you > might find UTF-16 more convenient. But as the worst compromise between > UTF-8 and UTF-32, Think about e.g. ICU. Do you believe that the Unicode guys themselves would choose the worst compromise? That does not make technical sense. So, this claim is technically nonsense. It only makes sense socially, as an in-group (Unix-land) versus out-group (Windows) valuation: "they're bad, they even small bad; we're good". I.e. it's propaganda. In order to filter out propaganda, think about whether it makes value evaluations like "worst", or "better", and where it does that, is it with respect to defined goals and full technical considerations? Cheers & hth., - Alf |
David Brown <david.brown@hesbynett.no>: Sep 09 05:13PM +0200 On 09/09/16 14:54, Alf P. Steinbach wrote: > because that could also, as they saw it, be compiled with MSVC. > They were lying to the compiler, and in the blame assignment, and as is > usual that gave ungood results. UTF-8 files rarely use a BOM - its only use is to avoid misinterpretation of the file as Latin-1 or some other encoding. So it is understandable, but incorrect, for gcc to fail when given utf-8 source code with a BOM. It is less understandable, and at least as bad, for MSVC to require a BOM. gcc fixed their code in version 4.4 - has MSVC fixed their compiler yet? But that seems like a small issue to call "botching unicode". gcc still doesn't allow extended identifiers, however - but I think MSVC does? Clang certainly does. > Well, you can tell that it's propaganda by the fact that it's about > assigning blame. IMO assigning blame elsewhere for ported software that > is, or was, of very low quality. Are you referring to gcc as "very low quality ported software"? That's an unusual viewpoint. > Which has an API that effectively restricts it to UCS-2. /However/, > while there is indeed plenty wrong in general with Windows, Windows did > get Unicode support a good ten years before Linux, roughly. From a very quick google search, I see references to Unicode in Linux from the late 1990s - so ten years is perhaps an exaggeration. It depends on what you mean by "support", of course - for the most part, Linux really doesn't care about character sets or encodings. Strings are a bunch of bytes terminated by a null character, and the OS will pass them around, use them for file system names, etc., with an almost total disregard for the contents - barring a few special characters such as "/". As far as I understand it, the Linux kernel itself isn't much bothered with unicode at all, and it's only the locale-related stuff, libraries like iconv, font libraries, and of course things like X that need to concern themselves with unicode. This was one of the key reasons for liking UTF-8 - it meant very little had to change. > possibly getting the Unix world on a workable path towards Unicode, by > supporting pure ASCII streams without code change. And that worked. But > it was not established in general until well past the year 2000. Maybe MS and Windows simply suffer from being the brave pioneer here, and the *nix world watched them and then saw how to do it better. >> , and how the C++ (and C) >> standards botched it to make it easier to work with Windows' botched API. > That doesn't make sense to me, sorry. The C++ standard currently has char, wchar_t, char16_t and char32_t (plus signed and unsigned versions of char - I don't know if the other types have signed and unsigned versions). wchar_t is of platform-dependent size, making it (and wide strings) pretty much useless for platform-independent code. But it is the way it is because MS wanted to have 16-bit wchar_t to suit their UCS-2 / UTF-16 APIs. The whole thing would have been much simpler if the language simply stuck to 8-bit code units throughout, letting the compiler and/or locale settings figure out the input encoding and the run-time encoding. If you really need a type for holding a single unicode character, then the only sane choice is a 32-bit type. > Think about e.g. ICU. > Do you believe that the Unicode guys themselves would choose the worst > compromise? When unicode was started, everyone thought 16 bits would be enough.
So decisions that stretch back to before Unicode 2.0 would use 16-bit types for encodings that covered all code points. When the ICU was started, there was no UTF-32 (I don't know if there was a UTF-8 at the time). The best choice at that time was UCS-2 - it is only as unicode outgrew 16 bits that this became a poor choice. > So, this claim is technically nonsense. It only makes sense socially, as > an in-group (Unix-land) versus out-group (Windows) valuation: "they're > bad, they even small bad; we're good". I.e. it's propaganda. No, the claim is technically correct - UTF-16 has all the disadvantages of both UTF-8 and UTF-32, and misses several of the important (independent) advantages of those two encodings. Except for compatibility with existing UTF-16 APIs and code, there are few or no use-cases where UTF-16 is a better choice than UTF-8 and also a better choice than UTF-32. The fact that the world has pretty much standardised on UTF-8 for transmission of unicode is a good indication of this - UTF-16 is /only/ used where compatibility with existing UTF-16 APIs and code is of overriding concern. But historically, ICU and Windows (and Java, and QT) made the sensible decision of using UCS-2 when they started, but apart from QT they have failed to make a transition to UTF-8 when it was clear that this was the way forward. > In order to filter out propaganda, think about whether it makes value > evaluations like "worst", or "better", and where it does that, is it > with respect to defined goals and full technical considerations? The advantages of UTF-8 over UTF-16 are clear and technical, not just "feelings" or propaganda. |
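The platform dependence mentioned above is easy to demonstrate: char16_t and char32_t are fixed-size code-unit types everywhere, while wchar_t is whatever the implementation chooses (2 bytes with MSVC, normally 4 with gcc or clang on Linux), which is what makes wide strings awkward for portable code. A trivial check:

#include <cstdio>

int main() {
    std::printf("wchar_t : %zu bytes (2 on Windows/MSVC, usually 4 elsewhere)\n",
                sizeof(wchar_t));
    std::printf("char16_t: %zu bytes (one UTF-16 code unit)\n", sizeof(char16_t));
    std::printf("char32_t: %zu bytes (one UTF-32 code unit)\n", sizeof(char32_t));
}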
vasuinfo1100@gmail.com: Sep 09 03:13AM -0700 What is big data? It's a phrase used to describe data sets that are so large and complex that they become very hard to exchange, secure, and analyse with typical tools. Big data Tutorials: these courses on big data show you how to solve these problems, and many more, with leading IT tools and practices. This big data tutorial is designed to get data storage managers up to speed on the conversations shaping the decisions many IT managers are making about big data. The course is geared to making students experts. BigData Training Tutorial focuses on leading consulting and training specialists. |