- How to get standard paths?!? - 20 Updates
- How to convert strings? - 4 Updates
- "C++: Size Matters in Platform Compatibility" - 1 Update
Szyk Cech <szykcech@spoko.pl>: Aug 17 01:58PM +0200 Hello! I am looking for the best way to get system paths on Linux and Windows. Of course the best would be a portable way, like this: https://doc.qt.io/qt-5/qstandardpaths.html But I don't want to use Qt. I want to write most of my app in pure C++. I am looking for the following paths:

    std::wstring gSettingsDir();
    std::wstring gLocalDataDir();
    std::wstring gAppDataDir();
    std::wstring gLogDir();
    std::wstring gTempDir();

Can you give me some hints how to get them in Linux and Windows?!? Thanks in advance and best regards! Szyk Cech |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Aug 17 02:02PM +0200 On 17.08.2019 13:58, Szyk Cech wrote: > std::wstring gLogDir(); > std::wstring gTempDir(); > Can you give me some hints how to get them in Linux and Windows?!? In the previous posting you were enquiring about UTF-32 encoded wide strings. Be advised that in Windows wide strings are UTF-16 encoded. Cheers!, - Alf |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 02:02PM +0200 Use the environment variables %USERPROFILE% or $HOME: https://en.wikipedia.org/wiki/Home_directory#Default_home_directory_per_operating_system |
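A minimal sketch of that approach with std::getenv; the gHomeDir name and the fallback order are illustrative assumptions, not a fixed convention:

    #include <cstdlib>
    #include <string>

    // Derive a per-user base directory from the environment.
    // %USERPROFILE% is set on Windows, $HOME on Linux; checking
    // both lets the same code run on either system.
    std::string gHomeDir()
    {
        const char* home = std::getenv("USERPROFILE"); // Windows
        if (!home)
            home = std::getenv("HOME");                // Linux
        return home ? std::string(home) : std::string();
    }

Note that std::getenv returns narrow chars; converting the result to the std::wstring the original post asks for is itself an encoding question (see the next thread).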
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 02:05PM +0200 > Be advised that in Windows wide strings are UTF-16 encoded. And I think that it would be best to use std::u16string because wchar_t is not absolutely guaranteed to be 16 bits wide. |
David Brown <david.brown@hesbynett.no>: Aug 17 03:11PM +0200 On 17/08/2019 14:05, Bonita Montero wrote: > And I think that it would be best to use std::u16string > because wchar_t is not absolutely guaranteed to be 16 bits > wide. It is guaranteed /not/ to be 16-bit on most systems. |
David Brown <david.brown@hesbynett.no>: Aug 17 03:13PM +0200 On 17/08/2019 13:58, Szyk Cech wrote: > Can you give me some hints how to get them in Linux and Windows?!? > Thanks in advance and best regards! > Szyk Cech If you can use C++17, consider the support in <filesystem> : <https://en.cppreference.com/w/cpp/header/filesystem> I haven't tried it myself, but maybe it has what you need. |
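For the temp directory at least, C++17 does give a direct, portable answer; a minimal sketch (the other directories in the original list have no <filesystem> equivalent):

    #include <filesystem>
    #include <iostream>

    int main()
    {
        // temp_directory_path() typically consults TMPDIR/TMP/TEMP on
        // POSIX and GetTempPath() on Windows, so it is portable as-is.
        std::cout << std::filesystem::temp_directory_path() << '\n';
    }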
Sam <sam@email-scan.com>: Aug 17 09:21AM -0400 Szyk Cech writes: > std::wstring gLogDir(); > std::wstring gTempDir(); > Can you give me some hints how to get them in Linux and Windows? There's no such thing, whatsoever, in Linux. This is all MS-Windows flotsam. On Linux, there are some well-known directories, such as /var/tmp for temporary files. There's also /tmp, but on most Linux distributions applications should use /var/tmp. And that's about it. None of these labels resemble anything else on Linux. Now, you do have well-known directories like /var/log, where various log files may be found, but applications normally can't write to it unless special preparations are made in advance. And, of course, all directories and paths on Linux are plain std::string-s. Linux code rarely uses std::wstring, that's mostly an MS-Windows plague. In the age of internationalization, Linux appears to have converged on UTF-8 and plain std::strings, with some occasional usage of std::u32string where it's convenient to handle text in UTF-32. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 03:59PM +0200 >> because wchar_t is not absolutely guaranteed to be 16 bits >> wide. > It is guaranteed /not/ to be 16-bit on most systems. You can be pretty sure that by far most implementations specify wchar_t as 16 bit since it makes no sense to implement it differently. But it's easier to conform to any hypothetical implementation by using char16_t. |
David Brown <david.brown@hesbynett.no>: Aug 17 04:20PM +0200 On 17/08/2019 15:59, Bonita Montero wrote: > whcar_t as 16 bit since it makes no sense to implement it differently. > But it's easier to conform to any hypothetical implementation by using > char16_t. Again, you are assuming Windows is everything. On most systems it is 32-bit. The only exceptions I know are Windows, and a few 8-bit embedded targets. (There may be others that I haven't tested, of course.) On all *nix systems it is 32-bit, and on most embedded systems. MS was early in the "Unicode" game by using UCS-2 in Windows NT. They get credit for trying, IMHO. But it quickly became apparent that 16 bits was not sufficient, and anyone who was not tied to Windows used 32-bit utf-32 when they needed "one object is one character", typically for internal use only, and utf-8 when they wanted strings. If only MS had taken the hit and made the changes while they still could, it would have avoided the hideous mess that resulted with the painful 16-bit types. 16-bit is too short to hold all the Unicode characters, but long enough to have all the disadvantages of wasted space and endianness issues. And now we are left with the jumble that is five different character types in C++ (not including signed and unsigned versions), MS' influence putting utf-16 nonsense into the C++ standards, and of course other platform-independent languages (Java, Python) and libraries (QT) using utf-16 or ucs-16 to be compatible with MS's error. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 04:28PM +0200 > influence putting utf-16 nonsense into the C++ standards, and of course > other platform-independent languages (Java, Python) and libraries (QT) > using utf-16 or ucs-16 to be compatible with MS's error. We were not talking about UTF-whatever. We were talking about wchar_t, or more precisely std::u16string. You can store 16-bit characters in the latter as well as UTF-16 sequences. We were talking about the fact that you can almost rely on wchar_t being 16 bits wide. There might be ancient CPU-architectures where the registers aren't 8, 16, ... bits wide, but you can be certain that no one would implement a conforming C++-compiler for these systems. |
Ralf Goertz <me@myprovider.invalid>: Aug 17 04:37PM +0200 Am Sat, 17 Aug 2019 16:28:41 +0200 > where the registers aren't 8, 16, ... bits wide, but you can be > certain that no one would implement a conforming C++-compiler for > these systems.

    #include <iostream>

    int main()
    {
        std::cout << sizeof(wchar_t) << std::endl;
    }

    > g++ -o wchar wchar.cc
    > ./wchar
    4
    > g++ -v
    Using built-in specs.
    COLLECT_GCC=g++
    COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/9/lto-wrapper
    OFFLOAD_TARGET_NAMES=hsa:nvptx-none
    Target: x86_64-suse-linux
    … |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 04:54PM +0200 > OFFLOAD_TARGET_NAMES=hsa:nvptx-none > Target: x86_64-suse-linux > … Hm, I wouldn't have expected this. I think this isn't a clever decision. For internal processing the representation doesn't matter, but for compatibility reasons it should match the most common Unicode representation. |
Paavo Helde <myfirstname@osa.pri.ee>: Aug 17 06:37PM +0300 On 17.08.2019 17:54, Bonita Montero wrote: > decision. For internal processing the representation doesn't matter, > but for compatibility reasons it should match the most common Unicode > representation. For once, I agree with you. Alas, Microsoft does not listen and still stubbornly clings to its unfortunate 16-bit Unicode representation, which is neither the most common (UTF-8 is) nor the most useful for text manipulation (UTF-32 is). |
David Brown <david.brown@hesbynett.no>: Aug 17 05:40PM +0200 On 17/08/2019 16:54, Bonita Montero wrote: > decision. For internal processing the representation doesn't matter, > but for compatibility reasons it should match the most common Unicode > representation. How about reading what others post, and perhaps /thinking/ a little? In particular, get out of the "the world is Windows" mindset. wchar_t has /never/ been specifically about Unicode. It means "wide character", and it is supposed to be big enough to hold a character of any character set supported by the implementation. For example, a Chinese system might support ASCII and Big5, which is a 16-bit encoding - it could therefore have a 16-bit wchar_t. For systems that support Unicode, wchar_t is required to be at least 32-bit. That is why it /is/ 32-bit on any modern system - except broken Windows, where wchar_t is not as big as the C++ standards require. The common case of 16-bit wchar_t in Windows compilers is in fact not standard C++. As for the "most common Unicode representation", for files or data interchange, that is UTF-8 by many orders of magnitude. UTF-16 and UTF-32 are found in a few niche situations. Internally, within programs, UTF-32 is sometimes used (mostly by people who think it is important to be able to index or count characters really quickly). It is the standard wchar_t type, used by most systems. Within Windows, you can't avoid UTF-16 internally and for API calls - and it is also used by people who mistakenly think one wchar_t represents one code point. And UTF-16 is also used by some libraries, such as QT, that are stuck with it for historical reasons. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 05:52PM +0200 > character set supported by the implementation. For example, a Chinese > system might support ASCII and Big5, which is a 16-bit encoding - it > could therefore have a 16-bit wchar_t. Unrealistic that it will be used for anything other than Unicode. > For systems that support Unicode, wchar_t is required to be at least > 32-bit. No, it's implementation-defined. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 05:54PM +0200 > stubbornly clings to its unfortunate 16-bit Unicode representation, > which is neither the most common (UTF-8 is) nor the most useful for > text manipulation (UTF-32 is). The APIs accepting UTF-16 strings aren't for persistence. And there aren't any functions for string manipulation in the Win32 API. |
David Brown <david.brown@hesbynett.no>: Aug 17 05:56PM +0200 On 17/08/2019 16:28, Bonita Montero wrote: > We were not talking about UTF-whatever. We were talking about wchar_t, > or more precisely std::u16string. You can store 16-bit characters in > the latter as well as UTF-16 sequences. You can store any character encoding supported by the compiler in a wchar_t, unless you are using Windows, which is broken and can't store Unicode characters in a wchar_t because it is too small. You can store any UTF-16 code unit in a char16_t. That does not mean you can store any UTF-16 character in a char16_t - you can only store those that fit in one unit. And in a std::u16string, you can store any UTF-16 string of characters. You can store any UTF-32 code unit in a char32_t. That covers all Unicode code points, but there are Unicode characters that are made of combinations of code points. And in a std::u32string, you can store any UTF-32 string of characters. And you can store any UTF-8 code unit in a char8_t. In a wchar_t, you can (except on Windows) store any character from the "execution wide-character set" for the compiler. That does not need to be Unicode. On most modern systems, wchar_t matches char32_t. But it is not a requirement. > We were talking about the fact that you can almost rely on wchar_t > being 16 bits wide. You are wrong. You can rely on wchar_t being 32-bit, except on Windows and a few small 8-bit systems that often have very limited support for wchar_t at all. > There might be ancient CPU-architectures where the registers > aren't 8, 16, ... bits wide, but you can be certain that no one > would implement a conforming C++-compiler for these systems. People use C++ all the time on 8-bit and 16-bit processors - /new/ designs, not ancient ones. They often don't have a full C++ library, but they are freestanding systems and don't need the full library to be compliant. (They might be non-compliant in other ways.) Windows compilers are invariably non-compliant regarding wchar_t because it is only 16-bit on that platform, when it is required to be 32-bit. |
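The code-unit vs. code-point distinction above is easy to demonstrate; a minimal sketch (assumes a C++11 compiler):

    #include <iostream>
    #include <string>

    int main()
    {
        // U+1F600 (a grinning-face emoji) lies outside the BMP:
        std::u16string s16 = u"\U0001F600"; // two UTF-16 code units (a surrogate pair)
        std::u32string s32 = U"\U0001F600"; // one UTF-32 code unit
        std::cout << s16.size() << ' ' << s32.size() << '\n'; // prints "2 1"
    }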
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 06:00PM +0200 >> For systems that support Unicode, wchar_t is required to be at least >> 32-bit. > No, it's implementation-defined. That is what the standard says about that: "Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying type." |
David Brown <david.brown@hesbynett.no>: Aug 17 06:00PM +0200 On 17/08/2019 17:52, Bonita Montero wrote: >> system might support ASCII and Big5, which is a 16-bit encoding - it >> could therefore have a 16-bit wchar_t. > Unrealistic that it will be used for anything other than Unicode. wchar_t existed long before Unicode became the standard choice. But I agree that for future systems, it is unrealistic for wchar_t to be used with anything other than UTF-32 encodings. >> For systems that support Unicode, wchar_t is required to be at least >> 32-bit. > No, it's implementation-defined. It is required to be big enough for any of the compiler's wide-character execution set. Since Windows supports Unicode (regardless of the encodings used), a Windows compiler must be able to hold any Unicode code point in a wchar_t - i.e., wchar_t must be a minimum of 32 bit. |
David Brown <david.brown@hesbynett.no>: Aug 17 06:01PM +0200 On 17/08/2019 18:00, Bonita Montero wrote: > among the supported locales (22.3.1). Type wchar_t shall have the same > size, signedness, and alignment requirements (3.11) as one of the other > integral types, called its underlying type." Exactly, yes. 16-bit wchar_t can't do that on a system that supports Unicode (regardless of the encoding). |
Szyk Cech <szykcech@spoko.pl>: Aug 17 01:10PM +0200 Hello! I want to write string conversion functions: std::wstring <--> unsigned char where the first is UTF-32, but the second can be in any encoding. I want to have functions like this:

    std::wstring gRawToUnicode(std::vector<unsigned char> aString,
                               std::wstring aEncoding);
    std::vector<unsigned char> gUnicodeToRaw(std::wstring aString,
                                             std::wstring aEncoding);

Important to me is the ability to handle any input encoding (defined as a wstring), because I want to use these functions in future versions of my text editor. I have two questions: 1. Is it possible to make this in pure C++ (STL-based)?!? 2. Do I have to use the ICU library for this?!? ad1. If so, please give me examples: + How to get a list of all supported encodings? + How to convert strings in pure C++ when we know only the input/output format?!? So I don't want an example with hardcoded input/output encodings - I want to handle any input format and any output format (according to my functions). Thanks in advance! And best regards! Szyk Cech |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Aug 17 01:36PM +0200 On 17.08.2019 13:10, Szyk Cech wrote: > I want to write string conversion functions: > std::wstring <--> unsigned char > where first is UTF-32, but second can be with any encoding. As I read it, you want to convert between UTF-32 and any byte-oriented encoding. That's a noble goal. You could look into what functionality is used by e.g. Scintilla component, but I remember from making a Notepad++ extension for UTF-8 as default, that it's rather messy and generally ungood. > std::wstring aEncoding); > std::vector<unsigned char> gUnicodeToRaw(std::wstring aString, > std::wstring aEncoding); Surely not as the bottom foundation. Stringly typed stuff belongs up near the user, invoking strongly typed stuff below. A reasonable approach to bridge the gap between user interface stringly typed (e.g. where an editor has a textual command interface somewhere), and internal strongly typed, can be to use an encoding id string as a key to a repository of converters, which then hands you a converter for that encoding, or fails to find one. I'm sure there's a pattern name for that. Like inversion or some silly name like that. > text editor. > I have two questions: > 1. Is this possible to make it in pure C++ (stl based)?!? Yes, but then you have to implement most all of it by yourself. The standard library supports only two general encoding conversion: between wide text and the locale's multibyte strings, and between the various UTF encodings. The latter set of conversions have been deprecated, and they do anyway not make for very portable code if they're used directly, even though they're still as of C++17 "standard". E.g. g++ and MSVC differ in (1) where they stop on detecting an input error, and (2) in the endianess (!) of the result. > 2. Do I have to use ICU library for this?!? Yes, in practice. > format?!? So I don't want example with hardcoded input/output encoding - > I want to handle any input format and any output format (according to my > functions). I know what I would do: I would just start doing it. But since I haven't done it I can't help you other than just noting that diving into stuff like that, is in general both (1) much easier than you thought, and (2) much more labor intensive, like orders of magnitude more work, than you thought. You have hereby been motivated and warned. Cheers!, - Alf |
Sam <sam@email-scan.com>: Aug 17 09:14AM -0400 Szyk Cech writes: > because I want to use these functions in future versions of my text editor. > I have two questions: > 1. Is it possible to make this in pure C++ (STL-based)?!? Yes. This is done by using the std::codecvt facet, see <URL:https://en.cppreference.com/w/cpp/locale/codecvt>. > 2. Do I have to use the ICU library for this?!? No, but I find the C++ library's implementation of this functionality to be unnecessarily convoluted, and a royal pain. Using a third party library will likely be easier. I use iconv, which seems to come standard as part of glibc, on Linux. I'm sure that MS-Windows has its own interface you can use, if you are unfortunate enough to be using C++ on MS-Windows. You should be able to find some documentation on that in MSDN. > ad1. If so, please give me examples: > + How to get a list of all supported encodings? This is not supported by the C++ library itself. The C++ library expects you to know which encoding you want to use, and then you use its arcane interface to do the conversion. > format?!? So I don't want an example with hardcoded input/output > encodings - I want to handle any input format and any output format > (according to my functions). I've given you some search terms, above. Google should be able to find plenty of examples. |
Paavo Helde <myfirstname@osa.pri.ee>: Aug 17 06:05PM +0300 On 17.08.2019 14:10, Szyk Cech wrote: > I want to write string conversion functions: > std::wstring <--> unsigned char > where the first is UTF-32, but the second can be in any encoding. On Windows, wchar_t is 16 bits, so std::wstring is most probably UTF-16 (the native Windows string encoding), not UTF-32. If you are interested in Linux/POSIX only, then use iconv (man iconv_open et al). Note that this is an extensible interface: the glibc base implementation supports only a few encodings, whereas the glibc-locale package adds support for a wide variety of encodings. One can type iconv --list to see what is supported by the current Linux installation. On Windows one should use its own native SDK functions like MultiByteToWideChar() et al. > Important to me is the ability to handle any input encoding (defined > as a wstring) Most encodings are using bytes, not wchar_t. Wchar_t usually implies a single fixed encoding (UTF-32 or UTF-16, depending on platform). > because I want to use these functions in future versions of my > text editor. If this is a plain/code text editor, I strongly suggest to incorporate the Scintilla text editor control as the main component; this would probably save 90% of the work. Cheers Paavo |
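A minimal sketch of the iconv route on Linux (glibc); the function name follows the original post, but it returns std::u32string rather than std::wstring to keep the encoding unambiguous, error handling is reduced to exceptions, and "UTF-32LE" assumes a little-endian host:

    #include <iconv.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Convert bytes in 'fromEncoding' (an iconv encoding name, e.g.
    // "ISO-8859-2") to UTF-32. A real implementation would loop and
    // distinguish E2BIG/EILSEQ/EINVAL instead of throwing blindly.
    std::u32string gRawToUnicode(std::vector<unsigned char> in,
                                 const std::string& fromEncoding)
    {
        iconv_t cd = iconv_open("UTF-32LE", fromEncoding.c_str());
        if (cd == (iconv_t)-1)
            throw std::runtime_error("unsupported encoding: " + fromEncoding);

        // One input byte yields at most one code point, i.e. 4 output bytes.
        std::vector<char> out(in.size() * 4 + 4);
        char* inp = reinterpret_cast<char*>(in.data());
        char* outp = out.data();
        size_t inLeft = in.size(), outLeft = out.size();

        size_t r = iconv(cd, &inp, &inLeft, &outp, &outLeft);
        iconv_close(cd);
        if (r == (size_t)-1)
            throw std::runtime_error("conversion to UTF-32 failed");

        return std::u32string(reinterpret_cast<char32_t*>(out.data()),
                              (out.size() - outLeft) / sizeof(char32_t));
    }

As for listing encodings: that maps onto the iconv --list command mentioned above; glibc does not, as far as I know, expose a C API for enumerating them.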
David Brown <david.brown@hesbynett.no>: Aug 17 03:08PM +0200 On 16/08/2019 21:04, Lynn McGuire wrote: > "C++: Size Matters in Platform Compatibility" > <https://www.codeproject.com/Tips/5164768/Cplusplus-Size-Matters-in-Platform-Compatibility> Please put < > marks around links. > Interesting. Especially on storing Unicode as UTF-8. If you are storing data in binary formats that need to be portable, you need to define and document the format, making it independent of the platform. The author here glosses over the important bits - endianness and alignment (and consequently padding), which he doesn't mention at all. Time should not be stored in text format - that is very space inefficient in a binary file, and even if you stick to ISO formats there are huge numbers of variants possible. Often the most sensible format is a 64-bit Unix epoch timestamp (happily covering the big bang up to well past the death of our sun). IEEE double-precision floating point Unix epoch seconds will last longer than protons, and give you accurate fractions of a second for more normal timeframes. If these are not suitable, define a format for your particular requirements. For character data where plain ASCII won't do, UTF-8 is the only sane choice IMHO. Other formats - UTF-16, UTF-32, wchar_t, etc. - might be used internally for dealing with historical APIs. The author is right to use fixed-size types for integers; a sketch of writing them in a declared byte order follows below. However, it doesn't make sense to have pointer types at all in a stored file or data interchange format. The other alternative is to switch entirely to a text format such as JSON. It is a lot less efficient to parse in C or C++ (though once we get introspection in C++, JSON generation and parsing will be a good deal nicer). But it is easy to pass around between OSs, languages and applications, easy to handle in most higher level languages, and easy for debugging. |
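A minimal sketch of that advice, assuming the format's author has declared the file little-endian; writing byte by byte makes the result independent of host endianness and of struct padding:

    #include <cstdint>
    #include <cstdio>

    // Serialize a 64-bit Unix timestamp as 8 little-endian bytes.
    // Error handling omitted for brevity.
    void writeU64LE(std::FILE* f, std::uint64_t v)
    {
        for (int i = 0; i < 8; ++i)
            std::fputc(static_cast<unsigned char>(v >> (8 * i)), f);
    }

    std::uint64_t readU64LE(std::FILE* f)
    {
        std::uint64_t v = 0;
        for (int i = 0; i < 8; ++i)
            v |= static_cast<std::uint64_t>(std::fgetc(f)) << (8 * i);
        return v;
    }

The same pattern covers the IEEE-double timestamp mentioned above: std::memcpy the double into a std::uint64_t and write that.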