
Digest for comp.lang.c++@googlegroups.com - 25 updates in 3 topics

Szyk Cech <szykcech@spoko.pl>: Aug 17 01:58PM +0200

Hello!
 
I am looking for the best way to get system paths on Linux and Windows.
Of course, the best would be a portable way, like this:
https://doc.qt.io/qt-5/qstandardpaths.html
But I don't want to use Qt. I want to write most of my app in pure C++.
 
I am looking for the following paths:
std::wstring gSettingsDir();
std::wstring gLocalDataDir();
std::wstring gAppDataDir();
std::wstring gLogDir();
std::wstring gTempDir();
 
Can you give me some hints on how to get them on Linux and Windows?!?
 
 
Thanks in advance and best regards!
Szyk Cech
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Aug 17 02:02PM +0200

On 17.08.2019 13:58, Szyk Cech wrote:
> std::wstring gLogDir();
> std::wstring gTempDir();
 
> Can you give me some hints on how to get them on Linux and Windows?!?
 
In the previous posting you were enquiring about UTF-32 encoded wide
strings.
 
Be advised that in Windows wide strings are UTF-16 encoded.
 
 
Cheers!,
 
- Alf
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 02:02PM +0200

Use the environment variables %USERPROFILE% (Windows) or $HOME (Linux):
https://en.wikipedia.org/wiki/Home_directory#Default_home_directory_per_operating_system
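
For illustration, a minimal sketch of this approach using only
std::getenv; the function name gHomeDir and the fallback order are
assumptions for the example, not something from the post:

#include <cstdlib>   // std::getenv
#include <string>

// Return the user's home directory, or an empty string if neither
// variable is set. %USERPROFILE% is the Windows convention, $HOME the
// POSIX one.
std::string gHomeDir()
{
    if (const char* p = std::getenv("USERPROFILE")) return p;
    if (const char* p = std::getenv("HOME")) return p;
    return std::string();
}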
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 02:05PM +0200

> Be advised that in Windows wide strings are UTF-16 encoded.
 
And I think it would be best to use std::u16string,
because wchar_t is not absolutely guaranteed to be 16 bits
wide.
David Brown <david.brown@hesbynett.no>: Aug 17 03:11PM +0200

On 17/08/2019 14:05, Bonita Montero wrote:
 
> And I think it would be best to use std::u16string,
> because wchar_t is not absolutely guaranteed to be 16 bits
> wide.
 
It is guaranteed /not/ to be 16-bit on most systems.
David Brown <david.brown@hesbynett.no>: Aug 17 03:13PM +0200

On 17/08/2019 13:58, Szyk Cech wrote:
 
> Can you give me some hints on how to get them on Linux and Windows?!?
 
> Thanks in advance and best regards!
> Szyk Cech
 
If you can use C++17, consider the support in <filesystem> :
<https://en.cppreference.com/w/cpp/header/filesystem>
 
I haven't tried it myself, but maybe it has what you need.
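
One concrete piece that is standard C++17:
std::filesystem::temp_directory_path() covers the gTempDir() case
directly (the other directories still need platform-specific lookups).
A minimal, hedged example:

#include <filesystem>
#include <iostream>

int main()
{
    // Standard since C++17; implementations typically consult
    // TMPDIR/TMP/TEMP on POSIX or GetTempPath() on Windows.
    std::cout << std::filesystem::temp_directory_path() << '\n';
}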
Sam <sam@email-scan.com>: Aug 17 09:21AM -0400

Szyk Cech writes:
 
> std::wstring gLogDir();
> std::wstring gTempDir();
 
> Can you give me some hints on how to get them on Linux and Windows?!?
 
There's no such thing, whatsoever, in Linux. This is all MS-Windows flotsam.
On Linux, there are some well-known directories, such as /var/tmp for
temporary files. There's also /tmp, but on most Linux distributions
applications should use /var/tmp. And that's about it. None of the other
labels correspond to anything on Linux. Now, you do have well-known
directories like /var/log, where various log files may be found, but
applications normally can't write to it unless some special preparations
are made in advance.
 
And, of course, all directories and paths on Linux are plain std::string-s.
Linux code rarely uses std::wstring; that's mostly an MS-Windows plague. In
the age of internationalization, Linux appears to have converged on UTF-8
and plain std::strings, with some occasional usage of std::u32string where
it's convenient to handle text in UTF-32.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 03:59PM +0200

>> because wchar_t is not absolutely guaranteed to be 16 bits
>> wide.
 
> It is guaranteed /not/ to be 16-bit on most systems.
 
You can be pretty sure that by far most implementations specify
wchar_t as 16 bits, since it makes no sense to implement it differently.
But it's easier to conform to any hypothetical implementation by using
char16_t.
David Brown <david.brown@hesbynett.no>: Aug 17 04:20PM +0200

On 17/08/2019 15:59, Bonita Montero wrote:
> wchar_t as 16 bits, since it makes no sense to implement it differently.
> But it's easier to conform to any hypothetical implementation by using
> char16_t.
 
Again, you are assuming Windows is everything.
 
On most systems it is 32-bit. The only exceptions I know are Windows,
and a few 8-bit embedded targets. (There may be others that I haven't
tested, of course.) On all *nix systems it is 32-bit, and on most
embedded systems.
 
MS was early in the "Unicode" game by using UCS-2 in Windows NT. They
get credit for trying, IMHO. But it quickly became apparent that 16
bits was not sufficient, and anyone who was not tied to Windows used
UTF-32 when they needed "one object is one character", typically for
internal use only, and UTF-8 when they wanted strings. If only MS had
taken the hit and made the changes while they still could, it would
have avoided the hideous mess that resulted with the painful 16-bit
types. 16 bits is too short to hold all the Unicode characters, but long
enough to have all the disadvantages of wasted space and endianness
issues. And now we are left with the jumble that is five different
character types in C++ (not including signed and unsigned versions), MS's
influence putting UTF-16 nonsense into the C++ standards, and of course
other platform-independent languages (Java, Python) and libraries (Qt)
using UTF-16 or UCS-2 to be compatible with MS's error.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 04:28PM +0200

> influence putting UTF-16 nonsense into the C++ standards, and of course
> other platform-independent languages (Java, Python) and libraries (Qt)
> using UTF-16 or UCS-2 to be compatible with MS's error.
 
We were not talking about UTF-whatever. We were talking about wchar_t,
or more precisely std::u16string. You can store 16-bit characters in the
latter, as well as UTF-16 sequences.
We were talking about the fact that you can almost rely on wchar_t being
16 bits wide. There might be ancient CPU architectures where the
registers aren't 8, 16, ... bits wide, but you can be certain that no
one would implement a conforming C++ compiler for these systems.
Ralf Goertz <me@myprovider.invalid>: Aug 17 04:37PM +0200

On Sat, 17 Aug 2019 16:28:41 +0200
> where the registers aren't 8, 16, ... bits wide, but you can be
> certain that no one would implement a conforming C++ compiler for
> these systems.
 
 
#include <iostream>

int main() {
    std::cout << sizeof(wchar_t) << std::endl;
}
 
> g++ -o wchar wchar.cc
> ./wchar
4
> g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=hsa:nvptx-none
Target: x86_64-suse-linux

Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 04:54PM +0200

> OFFLOAD_TARGET_NAMES=hsa:nvptx-none
> Target: x86_64-suse-linux
> …
 
Wow, I wouldn't have expected this. I think this isn't a clever
decision. For internal processing the representation doesn't matter,
but for compatibility reasons it should match the most common Unicode
representation.
Paavo Helde <myfirstname@osa.pri.ee>: Aug 17 06:37PM +0300

On 17.08.2019 17:54, Bonita Montero wrote:
> decision. For internal processing the representation doesn't matter,
> but for compatibility reasons it should match the most common Unicode
> representation.
 
For once, I agree with you. Alas, Microsoft does not listen and still
stubbornly clings to its unfortunate 16-bit Unicode representation,
which is neither the most common (UTF-8 is) nor useful for easier text
manipulation (UTF-32 is).
David Brown <david.brown@hesbynett.no>: Aug 17 05:40PM +0200

On 17/08/2019 16:54, Bonita Montero wrote:
> decision. For internal processing the representation doesn't matter,
> but for compatibility reasons it should match the most common Unicode
> representation.
 
How about reading what others post, and perhaps /thinking/ a little? In
particular, get out of the "the world is Windows" mindset.
 
wchar_t has /never/ been specifically about Unicode. It means "wide
character"; it is supposed to be big enough to hold a character of any
character set supported by the implementation. For example, a Chinese
system might support ASCII and Big5, which is a 16-bit encoding - it
could therefore have a 16-bit wchar_t.
 
For systems that support Unicode, wchar_t is required to be at least
32-bit. That is why it /is/ 32-bit on any modern system - except broken
Windows where wchar_t is not as big as the C++ standards require. The
common case of 16-bit wchar_t in Windows compilers is in fact not
standard C++.
 
As for the "most common Unicode representation", for files or data
interchange, that is UTF-8 by many orders of magnitude. UTF-16 and
UTF-32 are found in a few niche situations. Internally, within
programs, UTF-32 is sometimes used (mostly by people who think it is
important to be able to index characters or count characters really
quickly). It is the standard wchar_t type, used by most systems.
Within Windows, you can't avoid UTF-16 internally and for API calls -
and it is also used by people who mistakenly think one wchar_t
represents one code point. And UTF-16 is also used by some libraries,
such as Qt, that are stuck with it for historical reasons.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 05:52PM +0200

> character set supported by the implementation.  For example, a Chinese
> system might support ASCII and Big5, which is a 16-bit encoding - it
> could therefore have a 16-bit wchar_t.
 
It's unrealistic that it will be used for anything other than Unicode.
 
> For systems that support Unicode, wchar_t is required to be at least
> 32-bit.
 
No, it's implementation-defined.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 05:54PM +0200

> stubbornly clings to its unfortunate 16-bit Unicode representation,
> which is neither the most common (UTF-8 is) nor useful for easier text
> manipulation (UTF-32 is).
 
The APIs accepting UTF-16 strings aren't for persistence.
And there aren't any functions for string manipulation in
the Win32 API.
David Brown <david.brown@hesbynett.no>: Aug 17 05:56PM +0200

On 17/08/2019 16:28, Bonita Montero wrote:
 
> We were not talking about UTF-whatever. We were talking about wchar_t,
> or more precisely std::u16string. You can store 16-bit characters in the
> latter, as well as UTF-16 sequences.
 
You can store any character encoding supported by the compiler in a
wchar_t, unless you are using Windows, which is broken and can't store
Unicode characters in a wchar_t because it is too small.
 
You can store any UTF-16 code unit in a char16_t. That does not mean
you can store any UTF-16 character in a char16_t - you can only store
those that fit in one unit. And in a std::u16string, you can store any
UTF-16 string of characters.
 
You can store any UTF-32 code unit in a char32_t. That covers all
Unicode code points, but there are Unicode characters that are made of
combinations of code points. And in a std::u32string, you can store any
UTF-32 string of characters.
 
And you can store any UTF-8 code unit in a char8_t.
 
 
In a wchar_t, you can (except on Windows) store any character from the
"execution wide-character set" for the compiler. That does not need to
be Unicode.
 
On most modern systems, wchar_t matches char32_t. But it is not a
requirement.
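
A small illustration of this "one unit" limitation (the code point
chosen is arbitrary): a character outside the Basic Multilingual Plane
takes two char16_t units but only one char32_t unit:

#include <iostream>
#include <string>

int main()
{
    // U+1F600 lies outside the BMP.
    std::u16string s16 = u"\U0001F600";  // stored as a surrogate pair
    std::u32string s32 = U"\U0001F600";  // stored as a single code unit
    std::cout << s16.size() << '\n';     // prints 2
    std::cout << s32.size() << '\n';     // prints 1
}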
 
 
> We were talking about the fact that you can almost rely on wchar_t
> being 16 bits wide.
 
You are wrong. You can rely on wchar_t being 32-bit, except on Windows
and a few small 8-bit systems that often have very limited support for
wchar_t at all.
 
> There might be ancient CPU architectures where the
> registers aren't 8, 16, ... bits wide, but you can be certain that no
> one would implement a conforming C++ compiler for these systems.
 
People use C++ all the time on 8-bit and 16-bit processors - /new/
designs, not ancient ones. They often don't have a full C++ library,
but they are freestanding systems and don't need the full library to be
compliant. (They might be non-compliant in other ways.)
 
Windows compilers are invariably non-compliant regarding wchar_t because
it is only 16-bit on that platform, when it is required to be 32-bit.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 06:00PM +0200

>> For systems that support Unicode, wchar_t is required to be at least
>> 32-bit.
 
> No, it's implementation-defined.
 
This is what the standard says about it:
 
"Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales (22.3.1). Type wchar_t shall have the same
size, signedness, and alignment requirements (3.11) as one of the other
integral types, called its underlying type."
David Brown <david.brown@hesbynett.no>: Aug 17 06:00PM +0200

On 17/08/2019 17:52, Bonita Montero wrote:
>> system might support ASCII and Big5, which is a 16-bit encoding - it
>> could therefore have a 16-bit wchar_t.
 
> It's unrealistic that it will be used for anything other than Unicode.
 
wchar_t existed long before Unicode became the standard choice.
But I agree that for future systems, it is unrealistic for wchar_t to be
used with anything other than UTF-32 encodings.
 
 
>> For systems that support Unicode, wchar_t is required to be at least
>> 32-bit.
 
> No, it's implementation-defined.
 
It is required to be big enough for any of the compiler's wide-character
execution set. Since Windows supports Unicode (regardless of the
encodings used), a Windows compiler must be able to hold any Unicode
code point in a wchar_t - i.e., wchar_t must be a minimum of 32 bits.
David Brown <david.brown@hesbynett.no>: Aug 17 06:01PM +0200

On 17/08/2019 18:00, Bonita Montero wrote:
> among the supported locales (22.3.1). Type wchar_t shall have the same
> size, signedness, and alignment requirements (3.11) as one of the other
> integral types, called its underlying type."
 
Exactly, yes. A 16-bit wchar_t can't do that on a system that supports
Unicode (regardless of the encoding).
Szyk Cech <szykcech@spoko.pl>: Aug 17 01:10PM +0200

Hello!
 
I want to write string conversion functions:
std::wstring <--> unsigned char,
where the first is UTF-32 but the second can be in any encoding.
 
I want to have functions like this:
 
std::wstring gRawToUnicode(std::vector<unsigned char> aString,
                           std::wstring aEncoding);
std::vector<unsigned char> gUnicodeToRaw(std::wstring aString,
                                         std::wstring aEncoding);
 
Important to me is the ability to handle any input encoding (identified
by a wstring), because I want to use these functions in future versions
of my text editor.
 
I have two questions:
1. Is it possible to do this in pure C++ (STL-based)?!?
2. Do I have to use the ICU library for this?!?
 
ad 1. If so, please give me examples:
+ How to get a list of all supported encodings?
+ How to convert strings in pure C++ when the input/output encoding is
known only at runtime?!? So I don't want an example with a hardcoded
input/output encoding - I want to handle any input format and any output
format (matching my functions).
 
Thanks in advance! And best regards!
Szyk Cech
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Aug 17 01:36PM +0200

On 17.08.2019 13:10, Szyk Cech wrote:
 
> I want to write string conversion functions:
> std::wstring <--> unsigned char,
> where the first is UTF-32 but the second can be in any encoding.
 
As I read it, you want to convert between UTF-32 and any byte-oriented
encoding.
 
That's a noble goal.
 
You could look into what functionality is used by e.g. the Scintilla
component, but I remember from making a Notepad++ extension for UTF-8 as
the default that it's rather messy and generally ungood.
 
 
> std::wstring aEncoding);
> std::vector<unsigned char> gUnicodeToRaw(std::wstring aString,
> std::wstring aEncoding);
 
Surely not as the bottom foundation.
 
Stringly typed stuff belongs up near the user, invoking strongly typed
stuff below.
 
A reasonable approach to bridge the gap between the stringly typed user
interface (e.g. where an editor has a textual command interface
somewhere) and the strongly typed internals can be to use an encoding id
string as a key to a repository of converters, which then hands you a
converter for that encoding, or fails to find one.
 
I'm sure there's a pattern name for that.
 
Like inversion or some silly name like that.
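
A hedged sketch of such a repository of converters; the type names and
the std::function signature here are invented for illustration, not
taken from any existing library:

#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Bytes     = std::vector<unsigned char>;
using Converter = std::function<std::u32string(const Bytes&)>;

// Encoding id string -> converter. Strongly typed inside; stringly
// typed only at this lookup boundary.
std::map<std::string, Converter>& converters()
{
    static std::map<std::string, Converter> repo;
    return repo;
}

std::optional<Converter> find_converter(const std::string& encoding)
{
    auto it = converters().find(encoding);
    if (it == converters().end()) return std::nullopt;
    return it->second;  // hands you a converter, or fails to find one
}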
 
 
> text editor.
 
> I have two questions:
> 1. Is it possible to do this in pure C++ (STL-based)?!?
 
Yes, but then you have to implement almost all of it yourself.
 
The standard library supports only two general encoding conversions:
between wide text and the locale's multibyte strings, and between the
various UTF encodings.
 
The latter set of conversions has been deprecated, and it doesn't make
for very portable code if used directly, even though it's still, as of
C++17, "standard". E.g. g++ and MSVC differ in (1) where they stop on
detecting an input error, and (2) in the endianness (!) of the result.
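
For reference, a minimal example of the deprecated-but-still-standard
conversions mentioned above (UTF-8 <-> UTF-32 via <codecvt>); error
handling is omitted, and the input bytes are just an assumed sample:

#include <codecvt>   // deprecated in C++17, but still available
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    // "gr\xc3\xbcn" is the UTF-8 byte sequence for "gruen" with u-umlaut.
    std::u32string u32 = conv.from_bytes("gr\xc3\xbcn");  // UTF-8 -> UTF-32
    std::string    u8  = conv.to_bytes(u32);              // UTF-32 -> UTF-8
}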
 
 
> 2. Do I have to use the ICU library for this?!?
 
Yes, in practice.
 
 
> format?!? So I don't want an example with a hardcoded input/output
> encoding - I want to handle any input format and any output format
> (matching my functions).
 
I know what I would do: I would just start doing it.
 
But since I haven't done it, I can't help you other than by noting that
diving into stuff like that is in general both (1) much easier than you
thought, and (2) much more labor intensive, like orders of magnitude
more work, than you thought.
 
You have hereby been motivated and warned.
 
 
Cheers!,
 
- Alf
Sam <sam@email-scan.com>: Aug 17 09:14AM -0400

Szyk Cech writes:
 
> because I want to use these functions in future versions of my text editor.
 
> I have two questions:
> 1. Is it possible to do this in pure C++ (STL-based)?!?
 
Yes. This is done by using the std::codecvt facet, see
<URL:https://en.cppreference.com/w/cpp/locale/codecvt>.
 
> 2. Do I have to use the ICU library for this?!?
 
No, but I find the C++ library's implementation of this functionality to be
unnecessarily convoluted, and a royal pain. Using a third party library will
likely be easier. I use iconv, which seems to come standard as part of
glibc on Linux. I'm sure that MS-Windows has its own interface you can use,
if you are unfortunate enough to be using C++ on MS-Windows. You should be
able to find some documentation on that in MSDN.
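
A minimal sketch of the iconv usage referred to above (Linux/glibc); the
helper name and the worst-case buffer sizing are assumptions, and error
handling is reduced to the bare minimum:

#include <iconv.h>
#include <string>

// Convert 'in' from the named encoding to UTF-8; returns an empty
// string on failure.
std::string to_utf8(const std::string& in, const char* from_encoding)
{
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1) return {};

    std::string out(in.size() * 4, '\0');      // generous output buffer
    char* inp = const_cast<char*>(in.data());
    char* outp = &out[0];
    size_t inleft = in.size(), outleft = out.size();

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1) return {};
    out.resize(out.size() - outleft);
    return out;
}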
 
> ad 1. If so, please give me examples:
> + How to get a list of all supported encodings?
 
This is not supported by the C++ library itself. The C++ library expects
you to know which encoding you want to use, and then you use its arcane
interface to do the conversion.
 
> format?!? So I don't want an example with a hardcoded input/output
> encoding - I want to handle any input format and any output format
> (matching my functions).
 
I've given you some search terms, above. Google should be able to find
plenty of examples.
Paavo Helde <myfirstname@osa.pri.ee>: Aug 17 06:05PM +0300

On 17.08.2019 14:10, Szyk Cech wrote:
 
> I want to write string conversion functions:
> std::wstring <--> unsigned char,
> where the first is UTF-32 but the second can be in any encoding.
 
On Windows, wchar_t is 16 bits, so std::wstring is most probably UTF-16
(the native Windows string encoding), not UTF-32.
 
If you are interested in Linux/POSIX only, then use iconv (man
iconv_open et al). Note that this is an extensible interface; the glibc
base implementation supports only a few encodings, whereas the
glibc-locale package adds support for a wide variety of encodings. One
can type
 
iconv --list
 
to see what is supported by the current Linux installation.
 
On Windows one should use its own native SDK functions like
MultiByteToWideChar() et al.
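
A hedged sketch of that Windows side (UTF-8 -> UTF-16 via
MultiByteToWideChar); a real version would check the return codes and
flags more carefully:

#include <windows.h>
#include <string>

// Convert UTF-8 to the native Windows UTF-16 (wchar_t) representation.
std::wstring utf8_to_wide(const std::string& in)
{
    // First call asks for the required buffer size in wchar_t units.
    int n = MultiByteToWideChar(CP_UTF8, 0, in.data(), (int)in.size(),
                                nullptr, 0);
    if (n <= 0) return {};
    std::wstring out(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, in.data(), (int)in.size(),
                        &out[0], n);
    return out;
}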
 
 
> Important to me is the ability to handle any input encoding (identified
> by a wstring)
 
Most encodings use bytes, not wchar_t. Wchar_t usually implies a single
fixed encoding (UTF-32 or UTF-16, depending on the platform).
 
> because I want to use these functions in future versions of my
> text editor.
 
If this is a plain/code text editor, I strongly suggest incorporating
the Scintilla text editor control as the main component; this would
probably save 90% of the work.
 
Cheers
Paavo
David Brown <david.brown@hesbynett.no>: Aug 17 03:08PM +0200

On 16/08/2019 21:04, Lynn McGuire wrote:
> "C++: Size Matters in Platform Compatibility"
 
> <https://www.codeproject.com/Tips/5164768/Cplusplus-Size-Matters-in-Platform-Compatibility>
 
Please put < > marks around links.
 
 
> Interesting.  Especially on storing Unicode as UTF-8.
 
If you are storing data in binary formats that need to be portable, you
need to define and document the format, making it independent of the
platform. The author here glosses over the important bits - endianness
and alignment (and consequently padding), which he doesn't mention at all.
 
Time should not be stored in text format - that is very space
inefficient in a binary file, and even if you stick to ISO formats there
are huge numbers of possible variants. Often the most sensible format is
a 64-bit Unix epoch timestamp (happily covering the big bang up to well
past the death of our sun). IEEE double-precision floating point Unix
epoch seconds will last longer than protons, and give you accurate
fractions of a second for more normal timeframes. If these are not
suitable, define a format for your particular requirements.
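
As an illustration of pinning the format down explicitly (the
little-endian convention here is an assumed choice, not from the
article), the byte order is fixed by code rather than by the platform:

#include <cstdint>
#include <cstdio>

// Write a 64-bit Unix timestamp as 8 little-endian bytes, independent
// of the host's endianness, alignment, or struct padding.
void write_timestamp_le(std::int64_t t, std::FILE* f)
{
    std::uint64_t u = (std::uint64_t)t;  // two's complement bit pattern
    unsigned char buf[8];
    for (int i = 0; i < 8; ++i)
        buf[i] = (unsigned char)(u >> (8 * i));
    std::fwrite(buf, 1, sizeof buf, f);
}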
 
For character data where plain ASCII won't do, UTF-8 is the only sane
choice IMHO. Other formats - UTF-16, UTF-32, wchar_t, etc. - might be
used internally for dealing with historical APIs.
 
The author is right to use fixed-size types for integers. However, it
doesn't make sense to have pointer types at all in a stored file or data
interchange format.
 
 
The other alternative is to switch entirely to a text format such as
JSON. It is a lot less efficient to parse in C or C++ (though once we
get introspection in C++, JSON generation and parsing will be a good
deal nicer). But it is easy to pass around between OSes, languages, and
applications, easy to handle in most higher-level languages, and easy
for debugging.