- wstring_convert - 20 Updates
- Simple proof of {Linz, Sipser, Kozen} counter-example decidability - 1 Update
- What's up with "Large Scale C++" - 1 Update
"daniel...@gmail.com" <danielaparker@gmail.com>: Dec 22 07:32PM -0800 On Tuesday, December 22, 2020 at 5:43:19 PM UTC-5, Öö Tiib wrote: > That character is used to make "incorrect unicode" produced in your > product to look ugly in competitor's product that technically validates > it correctly. It's very difficult to understand your point. Are you suggesting that the fstream family should only be used to read bytes, with the decoding of those bytes always left to a higher level? And perhaps that all the fstream variants that perform non binary reads, including wfstream, be deprecated as well? Of course that would break existing code, as has the deprecation of the codecvt header and the the standard conversion facets. Daniel |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Dec 23 06:33AM +0100 On 22.12.2020 22:56, Öö Tiib wrote: > Converting UTF-8 into UTF-16 is simple only if it is correct (in some > manner of "correct") UTF-8. What to do when it is incorrect (in some sense > of "incorrect")? Close the application? But it was "only" text, shame on you. Very true. I remember running into issues with g++ versus Visual C++, for using the standard library's functionality: * MinGW g++ produced big endian 16-bit values while Visual C++ produced little endian. * One of them consumed the next byte on error while the other didn't. As I recall for the first point g++ was wrong while for the last point Visual C++ was wrong, so there was no way to write portable code without at least some checking of results and adaption to the compiler. - Alf |
"Öö Tiib" <ootiib@hot.ee>: Dec 23 01:27AM -0800 > > product to look ugly in competitor's product that technically validates > > it correctly. > It's very difficult to understand your point. Can be as the problem is tricky. I try to elaborate but you may not like my point. It is question about how can we layer our software. For example addition of two signed integers is undefined behavior for quarter of input values. Not implementation-defined. Not even unspecified result. Full whopping bear trap in one of most elementary operations. So to make software that behaves in some sane manner in hands of distressed housewives we are expected to make some kind of layer before signed addition that ensures that the input is not in that range of undefined behavior. That is doable, no problems. With functions that are meant for taking their input from outside of program the committee can not use same pattern and burden such undefined behavior on shoulders of programmers. It is impossible to write input-checking layers before input itself. They can standardize whatever nonsense but clear DOS attack or worse feels too lot even for them. > Are you suggesting that the > fstream family should only be used to read bytes, with the decoding of > those bytes always left to a higher level? That is what I'm doing in practice. Anything that comes from outside is dirty data consisting of full range of possible byte values. > And perhaps that all the fstream > variants that perform non binary reads, including wfstream, be deprecated > as well? Either define fully for all possible input bytes or admit your incapability and deprecate the tools that are useless. Implementation vendors *want* the result to be non-portable. > Of course that would break existing code, as has the deprecation > of the codecvt header and the the standard conversion facets. AFAIK the header is named <locale>. I don't know how to use it in sane manner. It slows everything down but is not useful for i18n or portabilty. 
It exists for a pile of newbie questions in the style of "Why does my garbage parse CSV files wrongly in the hands of my German customers?" Quite a pointless feature, in my experience. |
Bonita Montero <Bonita.Montero@gmail.com>: Dec 23 10:38AM +0100 > Only subset of sequences of bytes is valid UTF-8 or valid UTF-16. > Rest are invalid. ... Examples ? |
"Öö Tiib" <ootiib@hot.ee>: Dec 23 02:03AM -0800 On Wednesday, 23 December 2020 at 11:38:39 UTC+2, Bonita Montero wrote: > > Only subset of sequences of bytes is valid UTF-8 or valid UTF-16. > > Rest are invalid. ... > Examples ? Can you read only 15 first words from posts? I already said that Wikipedia even paints the examples *red* <https://en.wikipedia.org/wiki/UTF-8#Codepage_layout> Full docs are at unicode.org. <https://www.unicode.org/versions/Unicode6.0.0/> |
Richard Damon <Richard@Damon-Family.org>: Dec 23 07:40AM -0500 On 12/23/20 4:38 AM, Bonita Montero wrote: >> Only subset of sequences of bytes is valid UTF-8 or valid UTF-16. >> Rest are invalid. ... > Examples ? The key point is UTF-8 has a distinct syntax for how byte sequences form a valid code-point. Some simple examples: 1) String starts with 0x80 (this is a following byte, without a leading byte) 2) String: 0xC0 0x40 (first byte says 2-byte code-point, next byte isn't a following byte but a single-byte code) 3) String: 0xC0 0x8F 0x8F (first byte says 2-byte code-point, but there are then 2 following bytes, as if it were a 3-byte code-point) There are also some more semantic errors, like the sequence 0xC0 0x81, which would encode the code-point U+0001, but that should be encoded as just 0x01; the Unicode standard says this should be treated as an error. You could also encode a value greater than U+10FFFF, which is an error, and UTF-8 originally was defined to allow encoding to more than 4 bytes until UTF-16 compatibility made them redefine it to limit code points to at most 0x10FFFF. UTF-8 was designed to allow for simple operation with the data. No code-point is a sub-sequence of another code-point; it is easy to start in the middle of a string and find the beginning of the next or previous code-point in the string. It is NOT optimized for minimum data storage (but not that bad). There are enough rules about what is valid that a test whether a string/file is UTF-8 encoded by just checking if it is valid works pretty well (pure ASCII passes, and the odds that 8-bit code-page data using the upper characters will pass are very small). |
Richard Damon <Richard@Damon-Family.org>: Dec 23 07:44AM -0500 On 12/23/20 12:33 AM, Alf P. Steinbach wrote: > Visual C++ was wrong, so there was no way to write portable code without > at least some checking of results and adaption to the compiler. > - Alf Neither big-endian nor little-endian UTF-16 is 'wrong'; it may not be the encoding you want, but it isn't flat-out wrong. (One is actually called UTF-16BE and the other UTF-16LE.) Now, if it claimed that it was generating native-endian UTF-16, then having the wrong endianness would be wrong. |
Bonita Montero <Bonita.Montero@gmail.com>: Dec 23 02:03PM +0100 >> Examples ? > Can you read only 15 first words from posts? UTF-8 is independent of invalid Unicode code-points. |
"daniel...@gmail.com" <danielaparker@gmail.com>: Dec 23 05:31AM -0800 On Tuesday, December 22, 2020 at 4:56:49 PM UTC-5, Öö Tiib wrote: > Converting UTF-8 into UTF-16 is simple only if it is correct (in some > manner of "correct") UTF-8. What to do when it is incorrect (in some sense > of "incorrect")? Close the application? But it was "only" text, shame on you. If the question is, what about if the input you thought was UTF-8 is not in fact UTF-8, well, what about it? What about if the input you thought was a JPEG file is not in fact a JPEG file? With errors, you can attempt to recover, or fail. You can be as lenient or strict as the utilities you're using allow you to be. Daniel |
Juha Nieminen <nospam@thanks.invalid>: Dec 23 01:55PM > For example addition of two signed integers is undefined behavior for > quarter of input values. Not implementation-defined. Not even unspecified > result. Full whopping bear trap in one of most elementary operations. I believe this is a combination of "you don't pay for what you don't need" and wanting to support a wide variety of possible architectures at the same time. There may be some more exotic CPU architecture where an integer overflow or underflow causes a CPU interrupt, for example. The standard doesn't want to force the compiler to add extraneous validity checks to every single arithmetic operation by demanding that e.g. addition always gives some result (rather than, for example, crashing the program). Such checks might be nice in the very rare cases where you really need them, but in the vast majority of cases they would only make the program slower for literally zero benefit. Even if there were a check, what exactly should the compiler do? If it detects an overflow, should it simulate the result of unsigned 2's-complement arithmetic (cast back to signed afterwards)? Return the maximum value? Something else? What exactly should the standard mandate it to do? It's just easier for the standard to say to the compiler developers "do whatever you want in this case". Of course then it's the responsibility of the programmer to be aware that relying on a particular behavior for signed integer overflow is technically speaking non-portable. Not that it matters much in practice. |
Richard Damon <Richard@Damon-Family.org>: Dec 23 12:31PM -0500 On 12/23/20 8:55 AM, Juha Nieminen wrote: > Of course then it's the responsibility of the programmer to be aware that > relying on a particular behavior for signed integer overflow is > technically speaking non-portable. Not that it matters much in practice. Yes, there have been processors on which the result of a signed arithmetic overflow was a processor trap, and others where the result might be something like a clamped value. So defining what should happen was going to be expensive on some platforms. They also didn't go with the phrase "unspecified value" or an implementation-defined trap, which would have been significantly more restrictive than just general undefined behavior. It also turns out that for most reasonable applications it isn't too hard to avoid the problem: you generally know the expected magnitudes of the values and can use a type that can handle that range. It also turned out that some useful optimizations were possible if the compiler could assume that signed overflow did not happen, and this could give a noticeable speed advantage in some cases. There is also the interesting fact that where there is enough demand for a specified behavior for signed overflow, an implementation can provide that behavior in a way that it documents, and in fact many implementations DO provide an option for signed overflow to just generate the wrap-around behavior that a typical 2's-complement processor will generate. |
"Öö Tiib" <ootiib@hot.ee>: Dec 23 10:45AM -0800 On Wednesday, 23 December 2020 at 15:56:14 UTC+2, Juha Nieminen wrote: > I believe this is a combination of "you don't pay for what you don't need" > and wanting to support a wide variety of possible architectures at the > same time. Yes. That was example of solvable problem. I ended the paragraph "That is doable, no problems." > checks might be nice in the very rare cases where you really need them, > but in the vast majority of cases they would only make the program > slower for literally zero benefit. Yes. If there would be such architecture then programmers would pay more attention to "So to make software that behaves in some sane manner in hands of distressed housewives we are expected to make some kind of layer before signed addition that ensures that the input is not in that range of undefined behavior. " that I wrote. > complement arithmetic (cast back to signed afterwards)? Return the > maximum value? Something else? What exactly should the standard > mandate it to do? Oh, easy. On hypothetical case that I can wish something from Joulupukki about that matter I would perhaps ask something like guaranteed SIGFPE there. But I am still in position that "That is doable, no problems." like it is too. > Of course then it's the responsibility of the programmer to be aware that > relying on a particular behavior for signed integer overflow is > technically speaking non-portable. Not that it matters much in practice. No! That is incorrect to rely on any behavior on case of signed overflow! Major compilers optimize assuming that there are no signed overflow. Whole branches that depend on particular behavior of overflow can be erased by optimizer. "So to make software that behaves in some sane manner in hands of distressed housewives we are expected to make some kind of layer before signed addition that ensures that the input is not in that range of undefined behavior." That is the sole way to handle it. 
All other ways of handling it by programmer can result with it outright blowing up in hands of said distressed housewives. |
"Öö Tiib" <ootiib@hot.ee>: Dec 23 10:52AM -0800 On Wednesday, 23 December 2020 at 15:03:23 UTC+2, Bonita Montero wrote: > >> Examples ? > > Can you read only 15 first words from posts? > UTF-8 is independent of invalid Unicode code-points.4 Can't you read even those 15? You replied to: "Only subset of sequences of bytes is valid UTF-8 or valid UTF-16. Rest are invalid." Bytes can fail being valid far before reaching any code points. |
Richard Damon <Richard@Damon-Family.org>: Dec 23 02:04PM -0500 On 12/23/20 1:45 PM, Öö Tiib wrote: > That is the sole way to handle it. All other ways of handling it by > programmer can result with it outright blowing up in hands of said > distressed housewives. No, you do NOT need a 'layer' before signed addition, you need to design your program not to overflow and use right-sized numbers. For instance, if you are working on a game with screen coordinates, you can design it with a limited range of coordinates (say 0 - 8191); then the difference of any two positions will fit into a 16-bit signed number, and the sum of two to get a midpoint will also fit. If you need a 2d distance, and will do it by sum of squares, you know that squaring needs to be done in 32 bits. With this sort of basic analysis, you can make sure you never will generate a signed overflow; you just need to make sure early in the code that your points stay in their allowed range, which is a natural part of the problem. Yes, sometimes you are working with an application that needs to deal with more arbitrarily scaled numbers, but then error handling would likely require doing those tests anyway to avoid giving wrong answers; you just can't depend on known behavior of overflow to detect it, you need to bounds-check before (or do math in a larger type and check if it will fit before downcasting). |
"Öö Tiib" <ootiib@hot.ee>: Dec 23 11:19AM -0800 > > of "incorrect")? Close the application? But it was "only" text, shame on you. > If the question is, what about if the input you thought was UTF-8 is not in > fact UTF-8, well, what about it? That depends what requirements say. > What about if the input you thought was a > JPEG file is not in fact a JPEG file? That also depends what requirements say. > With errors, you can attempt to recover, or fail. You can be as lenient or > strict as the utilities you're using allow you to be. So utilities in <locale> that do not let me to implement what requirements say are useless and should be deprecated. |
"Öö Tiib" <ootiib@hot.ee>: Dec 23 11:40AM -0800 On Wednesday, 23 December 2020 at 21:04:54 UTC+2, Richard Damon wrote: > > distressed housewives. > No, you do NOT need a 'layer' before signed addition, you need to design > your program not to overflow and use right sized numbers. That is logical layer if the numbers can not overflow logically. > number, and the sum of two to get a midpoint will also fit. IF you need > a 2d distance, and will do it by sum of squares, you know that squaring > needs to be done in 32 bits. Good example of logic in real meta-programming (IOW in programmers brain) layer! > can't depend on known behavior of overflow to detect it, you need to > bounds check before (or do math in a larger type and check if it will > fit before downcasting). Yes. We happened to discuss handling of dirty data like something that we read and that is supposed to be UTF-8 but might be is invalid. We can not solve issues with numbers in some JSON file at that logic layer that you so well brought example of. We have to write code that checks those ranges and do what is needed on case the check fails, no way to supply those to addition operator immediately. |
Richard Damon <Richard@Damon-Family.org>: Dec 23 02:41PM -0500 On 12/23/20 2:19 PM, Öö Tiib wrote: >> strict as the utilities you're using allow you to be. > So utilities in <locale> that do not let me implement what the requirements > say are useless and should be deprecated. <locale> and such is reasonably good for OUTPUT. C does not provide a great input method if you need to process defensively (which, if you don't control the input, you should). It isn't that hard to write a small package of input routines to handle the bad cases the way you want; the problem is that what counts as a bad case, and what you want to do with it, vary so much from place to place that making a standard library method to do it is hard. |
"Öö Tiib" <ootiib@hot.ee>: Dec 23 12:35PM -0800 On Wednesday, 23 December 2020 at 21:41:31 UTC+2, Richard Damon wrote: > problem is that what is a bad case, and what you want to do with them > vary so much from place to place, that making a standard library method > to do it is hard. My impression is that <locale> is utterly insufficient for i18n in whatever direction or even as building block of it but lets just disagree there as tastes vary. Unicode is formally well-defined only issue with it is that it has gone from 6.0.0 to 13.0.0 during last 10 years. With so moving target it sometimes makes version indication kind of desirable. My take on any converter/filter is that based on data in input sometimes it can be fully round-trip, sometimes it loses and/or adds something and sometimes fails to convert. Best is when it has some fully defined default behavior that can be configured and also that it indicates whatever it did to caller. How to react to each of those cases is then all up to caller. We have C++ so I would love compile-time configurable but dynamic is fine as bottle-neck is usually speed of channel or media. I do not understand what is so tricky about it as I do it all the time. Filter being "for output" you meant in sense that quality of its input can be blamed on programmer? Also on case of "for input" it can be blamed to some programmer ... just that chances are that the programmer is more anonymous. In both directions it is bad excuse for weak work. |
Manfred <noname@add.invalid>: Dec 23 10:13PM +0100 On 12/23/2020 7:45 PM, Öö Tiib wrote: > On Wednesday, 23 December 2020 at 15:56:14 UTC+2, Juha Nieminen wrote: >> Öö Tiib <oot...@hot.ee> wrote: [...] > guaranteed SIGFPE there. > But I am still in position that "That is doable, no problems." like > it is too. [...] > That is the sole way to handle it. All other ways of handling it by > programmer can result with it outright blowing up in hands of said > distressed housewives. Now, I must /really/ ask, Öö: Is the wife of Joulupukki one distressed housewife ? That is really bugging me. |
Richard Damon <Richard@Damon-Family.org>: Dec 23 05:02PM -0500 On 12/23/20 3:35 PM, Öö Tiib wrote: > My impression is that <locale> is utterly insufficient for i18n in whatever > direction or even as building block of it but lets just disagree there as > tastes vary. locale was NEVER the complete i18n solution. It provides a few key features, like the representation of numbers and currency, and does a fairly decent job at that IF you feed it the information that is needed (which is sometimes tougher to know). Yes, it doesn't handle things like "I want numbers printed in European format except for things intended to be cut and pasted into Excel, which I have set up for a different format." The programmer needs to figure out which locale set to use for what. > Unicode is formally well-defined only issue with it is that it has gone > from 6.0.0 to 13.0.0 during last 10 years. With so moving target > it sometimes makes version indication kind of desirable. The thing to note is that Unicode has been incredibly backwards compatible, and most of the 'changes' have been defining what new code-points represent, which requires updating character classification tables if (and only if) you want to process 'correctly' those new characters. These changes do NOT invalidate any older processing. > My take on any converter/filter is that based on data in input > sometimes it can be fully round-trip, sometimes it loses and/or > adds something and sometimes fails to convert. The 'illegal' Unicode that I have mentioned hasn't changed since the VERY early days (once Unicode became a 21-bit character set). Valid (in that sense) UTF-8 should fully round-trip to UTF-16 or UCS-4 without any changes. Yes, an arbitrary string of bytes is very likely to be marked invalid, and not round-trip. 
There are also some common enough errors that people make when encoding data that won't round-trip if handled strictly (a strict application is supposed to mark these cases with the replacement character, but many will just silently 'fix' them). > dynamic is fine as bottle-neck is usually speed of channel or > media. I do not understand what is so tricky about it as I do it > all the time. The problem here is that the range of possible desired error recovery is so broad that it becomes unwieldy to implement. > input" it can be blamed to some programmer ... just that > chances are that the programmer is more anonymous. In > both directions it is bad excuse for weak work. Output routines can specify their calling conditions, as the programmer using them essentially has control over the data going into them. Yes, if he doesn't meet the published requirements for the routine, he can be 'blamed' for them not working. Input routines take some of their input from something at least potentially outside of the control of the programmer using them. This input potentially even comes from a source that is adversarial to the program. Specifications for processing inputs can sometimes get quite involved, especially if the input is possibly coming from an untrained user: specifying not only the primary expected inputs, but possible variations that users might try, as well as safeguards for dealing with hostile input (sometimes you want to do more than just ignore it). Yes, if the input is securely from a trusted source and known to be free from errors, you can be a bit more lax with parsing, and perhaps some of the simple input-processing routines from the standard library can be used. |
olcott <NoOne@NoWhere.com>: Dec 23 01:07PM -0600 The simplest way to understand that a halt decider could correctly decide the {Linz, Sipser, Kozen} counter-examples (shown here as H_Hat) is to see what happens when we define a halt decider that simply simulates its input and returns True only after its input terminates (if ever).

#define HALT __asm hlt

void H_Hat(u32 P)
{
    u32 Input_Halted = Simulate(P, P);
    if (Input_Halted)
        HERE: goto HERE;
    else
        HALT
}

int main()
{
    u32 Input_Halted = Simulate((u32)H_Hat, (u32)H_Hat);
    Output("Input_Halted =", Input_Halted);
    HALT;
}

_H_Hat()
[000007ca](01) 55 push ebp
[000007cb](02) 8bec mov ebp,esp
[000007cd](01) 51 push ecx
[000007ce](03) 8b4508 mov eax,[ebp+08]
[000007d1](01) 50 push eax
[000007d2](03) 8b4d08 mov ecx,[ebp+08]
[000007d5](01) 51 push ecx
[000007d6](05) e80ffeffff call 000005ea
[000007db](03) 83c408 add esp,+08
[000007de](03) 8945fc mov [ebp-04],eax
[000007e1](04) 837dfc00 cmp dword [ebp-04],+00
[000007e5](02) 7404 jz 000007eb
[000007e7](02) ebfe jmp 000007e7
[000007e9](02) eb01 jmp 000007ec
[000007eb](01) f4 hlt
[000007ec](02) 8be5 mov esp,ebp
[000007ee](01) 5d pop ebp
[000007ef](01) c3 ret

_main()
[000007fa](01) 55 push ebp
[000007fb](02) 8bec mov ebp,esp
[000007fd](01) 51 push ecx
[000007fe](05) 68ca070000 push 000007ca
[00000803](05) 68ca070000 push 000007ca
[00000808](05) e8ddfdffff call 000005ea
[0000080d](03) 83c408 add esp,+08
[00000810](03) 8945fc mov [ebp-04],eax
[00000813](03) 8b45fc mov eax,[ebp-04]
[00000816](01) 50 push eax
[00000817](05) 68d7020000 push 000002d7
[0000081c](05) e8e9faffff call 0000030a
[00000821](03) 83c408 add esp,+08
[00000824](01) f4 hlt
[00000825](02) 8be5 mov esp,ebp
[00000827](01) 5d pop ebp
[00000828](01) c3 ret

Output_Debug_Trace() Trace_List.size(30)
---[000007fa](01) 55 push ebp
---[000007fb](02) 8bec mov ebp,esp
---[000007fd](01) 51 push ecx
---[000007fe](05) 68ca070000 push 000007ca
---[00000803](05) 68ca070000 push 000007ca
---[00000808](05) e8ddfdffff call 000005ea
--CALL [000005ea]
---[000007ca](01) 55 push ebp
---[000007cb](02) 8bec mov ebp,esp
---[000007cd](01) 51 push ecx
---[000007ce](03) 8b4508 mov eax,[ebp+08]
---[000007d1](01) 50 push eax
---[000007d2](03) 8b4d08 mov ecx,[ebp+08]
---[000007d5](01) 51 push ecx
---[000007d6](05) e80ffeffff call 000005ea
--CALL [000005ea]
---[000007ca](01) 55 push ebp
---[000007cb](02) 8bec mov ebp,esp
---[000007cd](01) 51 push ecx
---[000007ce](03) 8b4508 mov eax,[ebp+08]
---[000007d1](01) 50 push eax
---[000007d2](03) 8b4d08 mov ecx,[ebp+08]
---[000007d5](01) 51 push ecx
---[000007d6](05) e80ffeffff call 000005ea
--CALL [000005ea]
... the last 8 lines infinitely repeat.

Every time that the same function is called from the same machine address a second time (within an execution trace) without any control-flow instructions in between, it is a case of infinite recursion. This is shown at execution trace lines 14-22 above. (1) The C code does map to its machine code. (2) The machine code does map to its execution trace (when Simulate() is a simulator). (3) The execution trace does map to infinite recursion. The semantics specified by the syntax of the formal x86 language prove (2) and (3). -- Copyright 2020 Pete Olcott "Great spirits have always encountered violent opposition from mediocre minds." Einstein |
Jorgen Grahn <grahn+nntp@snipabacken.se>: Dec 23 04:49PM On Tue, 2020-12-22, Keith Thompson wrote: > On amazon.com, I see: > Large-Scale C++ Volume I: Process and Architecture Dec 17, 2019 > Large-Scale C++ Volume II: Design and Implementation Mar 14, 2021 And I hear he's been talking at conferences recently. My coworkers who follow such things talked about it -- not because he was a name from the distant past of C++, but because they found him interesting. /Jorgen -- // Jorgen Grahn <grahn@ Oo o. . . \X/ snipabacken.se> O o . |