- Here is my answer - 1 Update
- What's computer programming - 2 Updates
- What is the best encoding (experiences...) for unicode? - 18 Updates
- Why all tutorials/books use non-unicode string? - 1 Update
| Ramine <ramine@1.1>: Feb 22 05:36PM -0800 Silvio wrote: >Please stop using the word "invent" when you haven't actually invented >anything. It makes you come over as very stupid. You look like a kind of stupid guy, because you are not using your intellect to reason better. You say I didn't invent algorithms and libraries? Look at my invention called SemaCondvar and SemaMonitor here: https://sites.google.com/site/aminer68/semacondvar-semamonitor Where do you have these concepts and objects in Java or C++ or C#, tell me? My SemaCondvar and SemaMonitor combine all the characteristics of a Windows event object, a Windows semaphore, an EventCount and a condition variable, and they also add the following characteristic: when you pass True as the first parameter of the constructor, the signal(s) will not be lost even if there are no waiting threads; if it's False, the signal(s) will be lost if there is no waiting thread. You see now? This is my invention and you will not find it in C++ or C# or Java. And look at my other invention, called scalable MLock, here: https://sites.google.com/site/aminer68/scalable-mlock My scalable MLock is wait-free; read more about it on my webpage... And look at my other invention, a parallel conjugate gradient linear system solver library that is cache-aware and scalable on "NUMA" architectures, here: https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware Also look at my other invention, called parallel varfiler. It is very powerful, you will not find it in C# or Java or C++, and you can read about it here: https://sites.google.com/site/aminer68/more-scalable-parallel-varfiler I have also invented a scalable RWLock with many variants; read about them here: https://sites.google.com/site/aminer68/scalable-rwlock I have also invented other libraries, like my parallel archiver etc.
Please look at them here: https://sites.google.com/site/aminer68/more-scalable-parallel-varfiler So from now on don't say that I didn't invent algorithms and libraries. Thank you, Amine Moulay Ramdane. |
| Ramine <ramine@1.1>: Feb 22 04:05PM -0800 Hello... I was thinking more, and I have come to an interesting subject. I was asking myself: what is computer programming? After I have learned to program and after I have invented many algorithms and parallel libraries, I think I am now able to define better what computer programming is. I will define it as follows: computer programming is like "robotics". When you construct a robot you have to give it legs and hands and motors etc., and you have to teach it how to behave by incorporating "knowledge" in it, to give it the ability to behave in different ways, such as walking and jumping. So computer programming is like robotics, because constructing a computer program is like constructing a robot: you have to construct every part of the program by constructing "objects" and their "properties", and you have to give your program "knowledge" by implementing the "methods" of the objects, to give your program the ability to behave in different ways, like a robot. This is how I can define programming: it's like robotics. Thank you, Amine Moulay Ramdane. |
| Ramine <ramine@1.1>: Feb 22 04:46PM -0800 Hello.... On 2/22/2015 1:09 PM, Richard Heathfield wrote:> No, it's more like creating a list of instructions that a computer can > is basically a computer with peripherals that can move, and you have to > program the movements. The definition thus becomes recursive, and not in > a helpful way. I don't agree with you, because what is a physical robot, and what is a computer? When you construct a computer program, you also move and mix and work with bits and bytes, and you also do measurements on the bits and bytes and on conceptual and numerical representations of physical objects, to construct other objects from bits and bytes, from objects and methods, and from conceptual and numerical representations of physical objects, and you have to incorporate knowledge in it to give it the ability to behave in different ways. So a computer program is like a physical microprocessor or a computer or a robot, because to construct them you have to move and mix matter, work with matter and do measurements on the matter to construct physical representations (also from a concept), and you have to incorporate "knowledge" in them to give them the ability to behave in different ways. So computer programs are like physical microprocessors, physical computers or physical robots. Thank you, Amine Moulay Ramdane. |
| Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 09:50AM -0600 > a separate issue. > So what encoding you guys use? UTF-8 or UTF-16? What is the > recommendation and your experiences. Fully UTF-8, for portability. When the OS interface requires something else (notably UTF-16 on Windows), the translation is done on the application border. Works well, but may become a bit tedious if large parts of the application are using some UTF-16-only frameworks or libraries (like MFC or its newer cousins). MFC also attempts to do its own narrow-to-wide conversions, but using a wrong codepage, so these automatic conversions need to be switched off for avoiding surprises. hth Paavo |
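The "translate at the application border" approach described above can be sketched portably as follows. This is only an illustration under stated assumptions: `utf8_to_utf16` is a hypothetical helper name, not something from the thread, and real Windows code would more likely call the Win32 `MultiByteToWideChar` function with `CP_UTF8` instead. The sketch rejects truncated sequences and bad lead or continuation bytes, but it does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the thread): convert UTF-8 text to UTF-16.
// Throws std::invalid_argument on truncated or malformed sequences.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The idea is that the application keeps `std::string` in UTF-8 everywhere and calls a conversion like this only immediately before handing text to a UTF-16 interface.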
| JiiPee <no@notvalid.com>: Feb 22 04:15PM On 22/02/2015 15:34, Öö Tiib wrote: > 4) fits into std::string (std::wstring is unspecified if it is UTF-16 or > UTF-32 or something else entirely) > 5) is encoding of majority of internet text content Yes. I knew most of these and agree. > UTF-16 may be more convenient on Windows or with Qt. Still if significant > part of input or output goes in UTF-8 (I already mentioned internet) then > I would pick UTF-8 as internal representation for texts in your application. Yes, with Windows maybe UTF-16. I do Windows programming... OK, so I think I had better use UTF-8 and switch to UTF-16 when needed on Windows. > 'std::string' is continuous container of such bytes (specific encoding > of possible text in it is not guaranteed by it). When that is accepted > then everything works. OK, but let's say I use 3 Russian letters. Many times I want to know that there are 3 letters rather than that the byte size is something like 7. > what is expected at other side (plus inevitable error handling). C++ > itself offers too few and inconvenient methods for that so we typically > seek help for converting and checking from outside of C++ standard. I meant std:: -. typo. I mean, if I have a Russian word with 3 Russian letters, I want to know how many letters there are rather than how many bytes it is in total. Sometimes we need that, right? > people expect your program to ignore case of characters or to convert > to upper-case or to convert to lower case (or even to title case) and > how such things are done may be specific to local traditions. Yes, I understand, I have read about it... |
| JiiPee <no@notvalid.com>: Feb 22 04:20PM On 22/02/2015 15:50, Paavo Helde wrote: >> So what encoding you guys use? UTF-8 or UTF-16? What is the >> recommendation and your experiences. > Fully UTF-8, for portability. OK, I take your word for it, and the others' words as well, as people here seem to agree with that. I am sure people have good experience, and it is good to hear an opinion like this. The web did not give many recommendations, just the positive and negative sides of each. But I want to hear recommendations. > When the OS interface requires something else > (notably UTF-16 on Windows), the translation is done on the application > border. OK, sounds logical. > Works well, but may become a bit tedious if large parts of the application > are using some UTF-16-only frameworks or libraries (like MFC or its newer > cousins). Hmmm, I am. But on the other hand I also make heavy use of the TinyXml C++ library (http://www.grinninglizard.com/tinyxmldocs/index.html) to save my data, and it uses the UTF-8 format. So I guess that means I should use UTF-8. > MFC also attempts to do its own narrow-to-wide conversions, but > using a wrong codepage, so these automatic conversions need to be switched > off for avoiding surprises. Well, I guess I had better check that all characters work OK then. |
| Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 22 05:01PM On Sun, 22 Feb 2015 14:15:13 +0000 > UTF-8? Like how to get its length, find a certain character, how to > store it (well, i guess just a char [] array does the job, or even > std::string). The point you need to realise about unicode is that there is no way to index into unicode code points with either UTF-8 or UTF-16, because they are variable length encodings. To find a character at a given character index, you have to go to the start of the string and work your way along. Furthermore, any given UTF-8 or UTF-16 string must be treated as const, because altering a given character may change the length (in 8 or 16 bit code units) of the entire string, so indexing is not particularly useful. You might think that would not be true of UTF-32, and you would be half right. But only half right, because although you can index into a code point with UTF-32, combining characters also come into play. For example there are two renditions of characters with a diaeresis. Lower case o with diaeresis could be given as either 'U+006F U+0308' (two code points) or 'U+00F6' (one code point). Any reasonable equality test should treat them as the same. So how would you describe the "length" of that string? There are libraries available to help with this, such as ICU and UTF8-CPP. In practice I have rarely found them necessary. For UTF-8 I use std::string, and if I want to progress by code points along the string I have an iterator which does that for me, for which operator++ and operator-- iterate by whole unicode code points, and which when dereferenced returns a 32 bit value (the current unicode code point value) rather than a char. Chris |
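Stepping through a UTF-8 `std::string` one code point at a time, as the post above describes, might look roughly like this. It is a minimal sketch rather than Chris Vine's actual iterator: the names `next_code_point` and `code_points` are hypothetical, and the code assumes the input is already valid UTF-8 (it performs no validation).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode the code point that starts at byte offset
// `pos` and advance `pos` past it. Assumes valid UTF-8 and pos < s.size().
char32_t next_code_point(const std::string& s, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80) return lead;  // single-byte (ASCII) case

    // The lead byte tells us how many continuation bytes follow.
    int extra = (lead & 0xE0) == 0xC0 ? 1
              : (lead & 0xF0) == 0xE0 ? 2
              : 3;
    char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0 && pos < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}

// Collect every code point of a UTF-8 string as 32-bit values.
std::vector<char32_t> code_points(const std::string& s) {
    std::vector<char32_t> result;
    std::size_t pos = 0;
    while (pos < s.size()) {
        result.push_back(next_code_point(s, pos));
    }
    return result;
}
```

Under these assumptions, `code_points("\xC3\xB6")` yields the single value U+00F6, while `code_points("o\xCC\x88")` yields U+006F followed by U+0308: the same visible letter, but two different "lengths", which is exactly the ambiguity raised in the post.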
| JiiPee <no@notvalid.com>: Feb 22 05:21PM On 22/02/2015 17:01, Chris Vine wrote: > and operator-- iterate by whole unicode code points, and which when > dereferenced returns a 32 bit value (the current unicode code point > value) rather than a char. OK, it is good to know what people use here. If others also tell what they use, that is helpful... OK, you use just std::string... I guess that might actually work for me as well. So how do you calculate the length of a Russian word, for example? Let's say you have a 4-letter Russian word... what function do you use to calculate that length in letters? |
| JiiPee <no@notvalid.com>: Feb 22 05:25PM On 22/02/2015 17:01, Chris Vine wrote: > and operator-- iterate by whole unicode code points, and which when > dereferenced returns a 32 bit value (the current unicode code point > value) rather than a char. Interesting. So the data, let's say 10000 characters, is stored as UTF-8 (to save space). Then when you need one character you give it as a 32 bit value. Hmmm... |
| Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 11:59AM -0600 > say you have 4 letter Russina word... what function you use to > calculate > that letter lenght? Why would you need that? Sure, there are programs for which this is relevant, like a program for solving or composing Russian crosswords, but for the type of software you are writing, why would you need this? Just curious... If I needed this (and nothing else Unicode-related) then I would probably use a little function like this (not tested):

    size_t LengthInCodepoints(const unsigned char* utf8, size_t sizeInBytes) {
        size_t result = 0;
        for (size_t i = 0; i < sizeInBytes; ++i) {
            // Count every byte that is not a continuation byte (10xxxxxx).
            if ((utf8[i] & 0x80) == 0 || (utf8[i] & 0x40) != 0) {
                ++result;
            }
        }
        return result;
    }

If I needed more, I would use the ICU library or something. Cheers Paavo |
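The same counting idea can be written against `std::string` directly. This is a sketch under the assumption that the input is valid UTF-8, and `length_in_code_points` is a hypothetical name. It relies on the fact that every code point has exactly one byte that is not a continuation byte of the form `10xxxxxx`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical std::string variant of the function in the post above:
// count the code points in a (presumed valid) UTF-8 string.
std::size_t length_in_code_points(const std::string& utf8) {
    std::size_t result = 0;
    for (unsigned char byte : utf8) {
        // Skip continuation bytes; count ASCII bytes and lead bytes.
        if ((byte & 0xC0) != 0x80) {
            ++result;
        }
    }
    return result;
}
```

For a three-letter Cyrillic word such as `"\xd0\xbc\xd0\xb8\xd1\x80"` this returns 3 even though `size()` reports 6 bytes. Note that it counts code points, not user-perceived characters, so a base letter followed by a combining mark counts as two.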
| JiiPee <no@notvalid.com>: Feb 22 06:10PM On 22/02/2015 17:59, Paavo Helde wrote: > If I would need more, I would use the ICU library or something. > Cheers > Paavo OK, I will give a better example: say you have a Russian sentence and you want to change a certain word inside it, say "car" to "vehicle". How would you do it? So how do you do a replace? That is surely needed... |
| jt@toerring.de (Jens Thoms Toerring): Feb 22 06:42PM > ok but lets say I use 3 russian letters. Many times I want to know that > there is 3 letters rather than the byte size is like 7. Here's a function for counting the number of UTF-8 letters in a C string (this is from a project written in C, but it should not be too much trouble converting it to C++ and using std::string instead).

    /***************************************
     * Function for determining the number of (UTF-8) characters in
     * a string. If it's not a valid UTF-8 string -1 is returned.
     ***************************************/
    ssize_t utf8_length( const char * str )
    {
        const unsigned char * p = ( const unsigned char * ) str;
        ssize_t cnt = 0;

        if ( ! str )
            return -1;

        for ( ; *p; p++, cnt++ )
        {
            if ( *p <= 0x7F )                     // ASCII
                /* empty */ ;
            else if ( ( *p & 0xE0 ) == 0xC0 )     // should be 2 bytes
            {
                if ( ( *++p & 0xC0 ) != 0x80 )
                    return -1;
            }
            else if ( ( *p & 0xF0 ) == 0xE0 )     // should be 3 bytes
            {
                if (    ( *++p & 0xC0 ) != 0x80
                     || ( *++p & 0xC0 ) != 0x80 )
                    return -1;
            }
            else if ( ( *p & 0xF8 ) == 0xF0 )     // should be 4 bytes
            {
                if (    ( *++p & 0xC0 ) != 0x80
                     || ( *++p & 0xC0 ) != 0x80
                     || ( *++p & 0xC0 ) != 0x80 )
                    return -1;
            }
            else                                  // anything else is invalid
                return -1;
        }

        return cnt;
    }

You can probably already see the elements of an iterator lurking in there;-) Best regards. Jens -- \ Jens Thoms Toerring ___ jt@toerring.de \__________________________ http://toerring.de |
| Robert Wessel <robertwessel2@yahoo.com>: Feb 22 01:09PM -0600 >projects really. So looking for direction. >People already gave instructions and I read them, but just asking if >there is more. You might want to read this: http://utf8everywhere.org/ |
| Robert Wessel <robertwessel2@yahoo.com>: Feb 22 01:15PM -0600 On 22 Feb 2015 18:42:54 GMT, jt@toerring.de (Jens Thoms Toerring) wrote: >} >You can probably already see the elements of an iterator lurking >in there;-) That will count code points, but not what you'd think of as characters if you consider things like combining code points. |
| Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 01:23PM -0600 > ok, I give better example: say you have a Russian sentence and you > want to change a certain word in side it, say "car" to "vehicle". How > would you do it? So how do you do replace? That is needed surely... Here you go:

    std::string sentence = ...;
    std::string carInRussian =
        "\xd0\xb0\xd0\xb2\xd1\x82\xd0\xbe\xd0\xbc"
        "\xd0\xbe\xd0\xb1\xd0\xb8\xd0\xbb\xd1\x8c";
    std::string vehicleInRussian =
        "\xd0\xbc\xd0\xb0\xd1\x88\xd0\xb8\xd0\xbd\xd0\xb0";
    std::string::size_type pos = sentence.find(carInRussian);
    if (pos != std::string::npos) {
        sentence.replace(pos, carInRussian.length(), vehicleInRussian);
    }

No Unicode knowledge needed for such replacements. Zilch. Nada. hth Paavo |
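The snippet in the post above replaces only the first occurrence. A possible extension to every occurrence is sketched below; `replace_all` is a hypothetical helper name, not part of the standard library. It stays purely byte-oriented, which is safe for UTF-8 provided both the text and the search word are valid UTF-8 and use the same normalization.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace every occurrence of `from` in `text` with
// `to`, working on raw bytes. Returns the number of replacements made.
std::size_t replace_all(std::string& text,
                        const std::string& from,
                        const std::string& to) {
    if (from.empty()) return 0;  // an empty needle would match everywhere
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.length(), to);
        pos += to.length();  // continue after the inserted text
        ++count;
    }
    return count;
}
```

Advancing `pos` by `to.length()` keeps the loop from rescanning freshly inserted text, so it terminates even when `to` happens to contain `from`.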
| Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 22 07:26PM On Sun, 22 Feb 2015 18:10:22 +0000 > ok, I give better example: say you have a Russian sentence and you > want to change a certain word in side it, say "car" to "vehicle". How > would you do it? So how do you do replace? That is needed surely... You would normalize to either use or not use precomposed characters (a unicode issue), decide how to deal with singular and plural forms and other grammatical inflection (a language issue bearing on your problem space) and then partition on the word(s) in question and construct a new string from the two parts with the substitute word inserted, having regard to whatever decisions you have reached about the grammatical forms. (Be pleased that unlike Celtic languages such as Welsh or Irish, Russian does not as far as I am aware have initial morphisms to contend with as well as suffixed inflexions.) The unicode question is but one of your issues here. On your earlier question of how you find the length of a Russian word, no answer can be given until you specify what you are measuring as your unit of length. Code points? If so, with or without precomposed characters? Or glyphs? Or graphemes? On the latter (and looking more widely than just Russian), how do you treat ligatures? For example, is eszett (ß) one or two "characters", and if one, does it become two in its upper case form? Is ligatured fi one or two "characters"? If using Hangul, how do you deal with Hangul jamos? If using Indic scripts, how are you counting consonant clusters? Asking the question is usually completely pointless. Chris |
| Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 01:43PM -0600 Paavo Helde <myfirstname@osa.pri.ee> wrote in >> want to change a certain word in side it, say "car" to "vehicle". How >> would you do it? So how do you do replace? That is needed surely... > No Unicode knowledge needed for such replacements. Zilch. Nada. Maybe I should have pointed out that the UTF-8 (and UTF-16) encodings have been carefully designed to make such things work. Complications arise if the strings are not normalized in the same way and use different ways of representing the same letters. But such things can happen with ASCII as well, e.g. "$3400" and "$3,400", a tab versus a space, capitalization etc. Anyway, the point is that, assuming the texts are normalized to the needed extent, processing them without any knowledge of code point borders is often trivial. Cheers Paavo |
| jt@toerring.de (Jens Thoms Toerring): Feb 22 08:00PM > That will count code points, but not what you'd think of as characters > if you consider things like combining code points. Good catch! Hadn't thought of that. Back to the drawing board;-) Best regards, Jens -- \ Jens Thoms Toerring ___ jt@toerring.de \__________________________ http://toerring.de |
| JiiPee <no@notvalid.com>: Feb 22 08:47PM On 22/02/2015 19:23, Paavo Helde wrote: > No Unicode knowledge needed for such replacements. Zilch. Nada. > hth > Paavo OK thanks, I'll save this. |
| Geoff <geoff@invalid.invalid>: Feb 22 12:51PM -0800 >I mean, if I have a Russian word with 3 russian letters, I want to know >how many letters there are rather than how many bytes it is total. >Sometimes we need that, right?

    std::wstring wstr = L"123";
    // length() counts wchar_t code units; sizeof(wstr) would only give the
    // size of the std::wstring object itself, not of the text it holds.
    std::cout << "There are " << wstr.length() << " characters in wstr, "
              << "its size is " << wstr.length() * sizeof(wchar_t)
              << " bytes" << std::endl;
|
| JiiPee <no@notvalid.com>: Feb 22 09:04PM On 22/02/2015 19:09, Robert Wessel wrote: >> there is more. > You might want to read this: > http://utf8everywhere.org/ I already read that... but then somebody there was arguing strongly that UTF-16 is better, and it seemed like he won the argument and the author backed down a bit... I don't know... that's why I am still asking. |
| David Brown <david.brown@hesbynett.no>: Feb 22 05:42PM +0100 On 21/02/15 20:27, JiiPee wrote: > Maybe somebody on forum or something said also "Qt s future is not > certain". > Ok, let me quickly try to google it again.... Don't bother trying to find old inaccurate information - either accept that the other posters here have given you accurate information, or go to the Qt website and read their licensing information. Qt has changed hands a couple of times - as is usual in such cases, this led to lots of rumours about its demise, or the demise of the open source licences for Qt. And since Qt was originally commercial-only, then commercial+GPL, there is an abundance of old and outdated posts and websites that do not know about the LGPL version. |