- Why all tutorials/books use non-unicode string? - 7 Updates
- About programming.... - 1 Update
jt@toerring.de (Jens Thoms Toerring): Feb 21 10:03PM

> all tutorials and C++ books use almost always char as a character type
> (1 byte)? Why use examples which are not used in real world? This I do
> not understand.

As some others have pointed out, it adds an additional burden because in a tutorial you now also have to explain how UTF-8 works, which distracts from the actual topic.

The std::string method length() doesn't stop working when you use UTF-8 - depending, of course, on what you call "work". It still tells you how many bytes the string occupies. What it no longer does is tell you how many "letters" (or "glyphs" or whatever name you prefer - I try to avoid the word "character" since it's too easily confused with the concept of a 'char') are contained in the string. As you know, in UTF-8 a "letter" can be represented by anything from 1 to 4 bytes. Thus you can't equate "number of bytes" with "number of letters" anymore, as has traditionally been done with ASCII. So you need a new function for counting those "letters".

Fortunately, writing that isn't too hard: by inspecting the upper bits of the first byte you can easily determine how many bytes that "letter" occupies, which makes iterating over a string to count the number of letters relatively simple. There are a few pitfalls, though: not all 1 to 4 byte long byte sequences are valid UTF-8 entities, so, if you deal with external input, you must check for that possibility and design some strategy for dealing with such cases.

An important aspect is dealing with the environment: if, for example, the user's keyboard is set up to send LATIN1 encoded characters but you're expecting UTF-8 input, that will end in grief. Or, when the output medium is set up to use a different encoding than what your program emits, the output will look rather strange. So you will have to pay more attention to locale settings etc., which in a pure ASCII world are usually taken to be arcane stuff.
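The lead-byte counting approach described above might look roughly like the following. This is a minimal sketch, not code from the original post: the names utf8_sequence_length and count_utf8_letters are hypothetical, "letter" is taken to mean a Unicode code point (which is not always what a user perceives as one character, e.g. with combining marks), and continuation bytes are not checked here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes the sequence starting with `lead`
// claims to occupy, judged only from the upper bits of that first byte.
// Returns 0 if `lead` cannot start a sequence (e.g. a continuation byte).
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}

// Hypothetical helper: counts the "letters" (code points) in `s` by hopping
// from lead byte to lead byte. Returns std::string::npos if a lead byte is
// malformed or a sequence runs past the end of the string. The continuation
// bytes themselves are NOT inspected, so this is not a full validity check.
inline std::size_t count_utf8_letters(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len == 0 || len > s.size() - i)
            return std::string::npos;
        i += len;
    }
    return count;
}
```

For the 5-byte string "a\xE2\x82\xACz" (an 'a', a euro sign, a 'z'), length() reports 5 while count_utf8_letters() reports 3. Returning npos on bad input is just one possible strategy for the malformed-input case mentioned above.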
Another aspect is that, if you're serious about it, you have to start thinking about questions like: how do I enter Chinese or Japanese or Greek etc. characters using a standard US-English keyboard (or what tools does my system supply for that purpose)?

Dealing with UTF-8 in a program is actually relatively trivial - you have to distinguish between byte count and letter count, you should check whether the input is "legal" UTF-8, and you may have to write some UTF-8 aware iterator for looping over a string (so it gives you the next letter, not the next 'char') etc. And, if this is for an already existing application, you'll have to check, wherever strings are used, whether what you want is the "length" in bytes or in "letters".

I've recently done the switch from pure ASCII to UTF-8 for a legacy library from about 20 years ago. I had dragged my feet for a long time over doing that since I always thought that "char count equals letter count" would be so deep-rooted in a piece of software that old that it would be nearly impossible to fix that basic assumption. But when I finally made the attempt I was positively surprised that it was a lot easier than I'd ever imagined - in most places dealing with strings it was immediately clear whether this was about the letters or the bytes in a string and, with a few functions for dealing with UTF-8, it took me a very short time.

From that experience I tend to conclude that most of the "angst" about UTF-8 comes more from unfamiliarity than anything else. The actual problems are often more in the environment the user is working in - if the keyboard is set up to send LATIN1 or CJK or whatever other legacy encoding, then that's where the real problems are.

So, it's definitely a new world, and one has to learn a few new things and become aware of a few new potential problems (and existing solutions;-). I can only recommend doing a few experiments with some "toy" programs.
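As one such "toy" experiment, the "give me the next letter" step and the legality check might be sketched as below. This is an illustration under stated assumptions, not the code from the legacy library mentioned above: next_utf8_letter and is_valid_utf8 are hypothetical names, a "letter" is again a single code point, and "legal" is taken to mean well-formed per the usual UTF-8 rules (correct continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the code point starting at s[pos].
// On success stores it in `cp`, advances `pos` past the sequence and
// returns true. On malformed input (or pos at the end) returns false
// and leaves `pos` and `cp` unchanged.
inline bool next_utf8_letter(const std::string& s, std::size_t& pos,
                             char32_t& cp)
{
    if (pos >= s.size()) return false;

    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    char32_t value = 0;
    if (lead < 0x80)                { len = 1; value = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; value = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07; }
    else return false;  // stray continuation byte or invalid lead byte

    if (len > s.size() - pos) return false;  // sequence is cut off

    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[pos + k]);
        if ((c & 0xC0) != 0x80) return false;  // must look like 10xxxxxx
        value = (value << 6) | (c & 0x3F);
    }

    // Smallest code point that genuinely needs `len` bytes; anything
    // below it is an "overlong" encoding and therefore not legal.
    static const char32_t min_value[] = {0, 0, 0x80, 0x800, 0x10000};
    if (value < min_value[len]) return false;
    if (value > 0x10FFFF) return false;                     // out of range
    if (value >= 0xD800 && value <= 0xDFFF) return false;   // surrogates

    cp = value;
    pos += len;
    return true;
}

// Hypothetical helper: true if the whole string is well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s)
{
    std::size_t pos = 0;
    char32_t cp = 0;
    while (pos < s.size())
        if (!next_utf8_letter(s, pos, cp)) return false;
    return true;
}
```

Looping with next_utf8_letter() until pos reaches s.size() visits the string letter by letter rather than char by char. What to do when it returns false (reject the input, skip a byte, substitute a replacement character) is the strategy decision left to the application.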
The concept behind UTF-8 is IMHO, while ingenious, surprisingly simple, so I found it more helpful to write a few functions for counting "letters" in a string or detecting invalid byte sequences than trying to understand some rather complex libraries that do all the work for you. Not that I'd consider those libraries useless, but to understand what they're doing for you it's good to have spent a bit of time trying to solve the simpler problems to get a feel for what's involved - otherwise the documentation can often be hard to understand;-)

Best regards, Jens
--
  \   Jens Thoms Toerring  ___      jt@toerring.de
   \__________________________      http://toerring.de

Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 21 10:10PM

On 21/02/2015 22:03, Jens Thoms Toerring wrote:
> actual problems are often more the environment the user is working
> in - if the keyboard is set up to send LATIN1 or CJK or
> whatever other legacy encoding, then that's where the real pro-

In my experience keyboards send scan codes, not characters; the OS translates scan codes into characters.

/Flibble

Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 21 10:12PM

On 21 Feb 2015 20:45:02 GMT
> the LGPL. So, for example, if you use Qt version 4.x for
> your program the user must have the possibility to link it
> instead against Qt version 5.y.

I think that's probably wrong. As it has been explained to me by those who purport to know, if you use static linking you are probably in trouble and you may be required to distribute your code under the LGPL if you distribute outside your organisation. If you use dynamic linking you are generally OK. The requirement in such cases is that your code must be relinkable against a different library (such as one you yourself write) with the same ABI, which is a criterion met by all current Windows and Unix dynamic linking mechanisms. There is no requirement for portability to different, non-ABI-compatible versions of a particular library - that would serve no purpose.

The legal issue behind this is apparently whether you "use" the library (OK) or whether your program is a derived work (you might be in trouble). Dynamic linking seems to be regarded as the former. Static linking, or extending the particular library itself for your own purposes, seems to be regarded as the latter.

Chris

jt@toerring.de (Jens Thoms Toerring): Feb 21 10:28PM

> > whatever other legacy encoding, then that's where the real pro-
> In my experience keyboards send scan codes not characters; the OS
> translates scan codes into characters.

Yes, of course, you've got the OS to do the scan code translation. That was why I first wrote about the user's "environment" before using the more colloquial "the keyboard is set up to send..." as a shortcut for "the keyboard sends scan codes, which the OS (and maybe even some other intermediaries) translates into something which finally arrives at the application". Getting a grip on that is, as far as I can see, often the hardest part, not the actual dealing with UTF-8 encoded strings.

Best regards, Jens

jt@toerring.de (Jens Thoms Toerring): Feb 21 10:54PM

> those who purport to know, if you use static linking you are probably in
> trouble and you may be required to distribute your code under the LGPL
> if you distribute outside your organisation.

I don't see your point here, since with static linking you'd definitely make it impossible to link with a new version of the LGPL'ed library. Static linking is a no-no in this case.

> linking seems to be regarded as the former. Static linking, or
> extending the particular library itself for your own purposes, seems to
> be regarded as the latter.

Sorry, you've lost me here;-) In my book, letting the user link it against whatever version of Qt (or whatever other LGPL library this is about), by giving the user all the bits s/he needs to make that possible (including the parts of the sources that rely on working with the LGPL library), is the most honest way to go. What that doesn't exclude is having another dynamically linked but closed source library you actually demand money for, and which doesn't depend on any services of the LGPL'ed library (so there is not the least bit of doubt that there's no "deriving" going on) - but which does the really interesting bits of work that make it something people may be motivated to pay for.

Best regards, Jens

jt@toerring.de (Jens Thoms Toerring): Feb 21 11:05PM

> > who supported it stopped supporting it....
> Wrong; Qt was sold to Digia and is alive and well and continually
> supported/updated.

And there's also that thingy that Qt is open-source - so while Trolltech -> Nokia -> Digia ceasing to support further development could be a major issue, it's not necessarily the end of the world as we know it. The KDE guys rely on it heavily and, while they probably don't have tons of money, they would still have a big incentive to keep it alive. In that (hypothetical) case questions about the future of cross-platform compatibility might arise, but we're not at that point yet, I hope...

Best regards, Jens

Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 21 11:13PM

On 21 Feb 2015 22:54:12 GMT
jt@toerring.de (Jens Thoms Toerring) wrote:

[snip]
> is not the least bit of doubt that there's no "deriving" going
> on) - but which does the really interesting bits of work that
> make it something people may be motivated to pay for.

I believe you are deranged.

Chris

Wouter van Ooijen <wouter@voti.nl>: Feb 21 11:01PM +0100

Ramine wrote on 22-Feb-15 at 1:04 AM:
> You have to understand me more, as you have noticed i have invented
> and implemented many algorithms and libraries, and i have come to a
> conclusion that i want to share with you

If you expect me to follow a link to your wonderful creations, I'd first like a line or two that explains what it will do for me. (That also gives me a hint of how good you are at expressing yourself efficiently.)

Wouter