- Moons - 1 Update
- readsome() vs. fread() - 19 Updates
Vir Campestris <vir.campestris@invalid.invalid>: Feb 22 09:14PM On 21/02/2018 22:26, mcheung63@gmail.com wrote: Please ignore these people. Most of us don't see their posts - only your replies. Andy |
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 21 10:38PM On Wed, 21 Feb 2018 22:13:19 GMT > >whether that is ever done in fact. > fread will return the number of bytes requested, unless EOF occurs. > Regardless of the size of the stdio buffer. As will std::filebuf::sgetn(). The issue is whether the internal buffers are short-circuited or not. On an optimized block read by std::filebuf::sgetn(), anything in the buffers will first be extracted and then a call to unix read() will be made directly into the buffer passed in to std::filebuf::sgetn() (rather than into the streambuffer's internal buffer). I was hypothesizing that the poorer performance of fread() on large block transfers which was reported may be caused by the fact that it either cannot, or does not, short-circuit in this way. |
scott@slp53.sl.home (Scott Lurndal): Feb 21 11:52PM >overlaying the object." >fgetc will always go via the internal stream buffer, for efficiency >reasons. I'm not sure why anyone would actually do fread for large block transfers, but maybe they're limited to windows. On any POSIX system, pread(2)/pwrite(2) are the easy and efficient[*] way to access and modify fixed sized records. [*] mmap(2) wins on the efficiency metric. |
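For readers who have not used the calls mentioned above, here is a minimal sketch of fixed-size record access with pread(2)/pwrite(2). The `Record` layout and the helper names `read_record` and `write_record` are hypothetical and only serve the illustration; a short transfer is simply treated as failure here, which a real program may want to handle differently.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>      // open() flags, for callers of these helpers
#include <sys/types.h>  // off_t, ssize_t
#include <unistd.h>     // pread(), pwrite()

// Hypothetical fixed-size record, used only for this illustration.
struct Record {
    int id;
    double value;
};

// Read record number `index` without moving the file offset.
// Returns true only if a whole record was obtained.
bool read_record(int fd, std::size_t index, Record& out)
{
    assert(fd >= 0);
    const off_t offset = static_cast<off_t>(index * sizeof(Record));
    return ::pread(fd, &out, sizeof out, offset) == static_cast<ssize_t>(sizeof out);
}

// Overwrite (or append) record number `index` in place.
bool write_record(int fd, std::size_t index, const Record& rec)
{
    assert(fd >= 0);
    const off_t offset = static_cast<off_t>(index * sizeof(Record));
    return ::pwrite(fd, &rec, sizeof rec, offset) == static_cast<ssize_t>(sizeof rec);
}
```

Because the offset is passed explicitly, no seek is needed between accesses and several threads can share one descriptor.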
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 21 10:59PM On Wed, 21 Feb 2018 22:38:33 +0000 > I was hypothesizing that the poorer performance of fread() on large > block transfers which was reported may be caused by the fact that it > either cannot, or does not, short-circuit in this way. So far as I understand the C standard, it looks as if fread() cannot make this optimization. The C standard says about fread(): "For each object, 'size' calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object." fgetc will always go via the internal stream buffer, for efficiency reasons. |
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 21 11:59PM On Wed, 21 Feb 2018 23:52:52 GMT scott@slp53.sl.home (Scott Lurndal) wrote: [snip] > system, pread(2)/pwrite(2) are the easy and efficient[*] way to > access and modify fixed sized records. > [*] mmap(2) wins on the efficiency metric. Quite so. Presumably the authors of the C standard take the view that if you want unix read() you should use unix read(). It is not very different from fread() apart from the fact that read() might do a short read so you need to put it in a do loop until the request is met or end-of-file is encountered, and that you need to account for EINTR (both of which outcomes are handled automatically by fread()). Presumably the C++ standard authors thought that the C++ abstractions were better, so the optimization needs to be catered for. If so, I think I agree with them. |
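The loop described above (repeat until the request is met or end-of-file is seen, and retry on EINTR) might look roughly like this. `read_full` is a hypothetical helper name, not a standard function, and this is only a sketch for a blocking descriptor.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // read()

// Keep calling read() until `count` bytes have arrived, end-of-file is
// reached, or a real error occurs.
// Returns the number of bytes stored, or -1 on error (errno set by read()).
ssize_t read_full(int fd, void* buf, std::size_t count)
{
    assert(buf != nullptr || count == 0);
    char* out = static_cast<char*>(buf);
    std::size_t total = 0;
    while (total < count) {
        const ssize_t n = ::read(fd, out + total, count - total);
        if (n > 0) {
            total += static_cast<std::size_t>(n);  // possibly a short read: go round again
        } else if (n == 0) {
            break;                                 // end-of-file
        } else if (errno != EINTR) {
            return -1;                             // genuine error
        }
        // n < 0 with errno == EINTR: interrupted before any data arrived, so retry
    }
    return static_cast<ssize_t>(total);
}
```

A return value smaller than `count` then means end-of-file, which matches how a short count from fread() is usually interpreted.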
James Kuyper <jameskuyper@verizon.net>: Feb 21 11:29PM -0500 On 02/21/2018 05:59 PM, Chris Vine wrote: >> On Wed, 21 Feb 2018 22:13:19 GMT >> scott@slp53.sl.home (Scott Lurndal) wrote: >>> Chris Vine <chris@cvine--nospam--.freeserve.co.uk> writes: ... > overlaying the object." > fgetc will always go via the internal stream buffer, for efficiency > reasons. Yes, but keep in mind that it's only the observable behavior (5.1.2.3p6) of a program that is constrained by those requirements. The difference between actually calling fgetc() separately for each byte and doing a single large block read doesn't involve anything that qualifies as observable behavior. The speed with which something happens does NOT qualify as "observable behavior" as that term is defined by the C standard (even though it is trivially easy to observe such behavior). |
James Kuyper <jameskuyper@verizon.net>: Feb 21 11:41PM -0500 On 02/21/2018 06:59 PM, Chris Vine wrote: > Presumably the C++ standard authors thought that the C++ abstractions > were better, so the optimization needs to be catered for. If so, I > think I agree with them. The distinction you're making doesn't really exist. Both standards define "observable behavior" almost identically, and both allow any optimization that produces observable behavior that's consistent with the standards' requirements, even if it is not produced by the same mechanism as that described in the requirements. That is sufficient freedom to allow the same optimization for fread() and std::streambuf::xsgetn(). |
Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 08:55AM +0200 On 22.02.2018 1:52, Scott Lurndal wrote: > but maybe they're limited to windows. On any POSIX system, pread(2)/pwrite(2) are > the easy and efficient[*] way to access and modify fixed sized records. > [*] mmap(2) wins on the efficiency metric. +1 for mmap. I'm baffled why anybody should discuss the relative speed of various binary file content copying methods when there is a way to avoid this copying step, at least on more common platforms. If you are not using mmap it means you are not interested in performance, so why discuss this in such great lengths? |
"Öö Tiib" <ootiib@hot.ee>: Feb 21 11:18PM -0800 On Thursday, 22 February 2018 01:53:03 UTC+2, Scott Lurndal wrote: > I'm not sure why anyone would actually do fread for large block transfers, > but maybe they're limited to windows. On any POSIX system, pread(2)/pwrite(2) are > the easy and efficient[*] way to access and modify fixed sized records. Yes, I did bring up an example from 10 years ago where Windows was the target platform. Reading a 1 GB file in 64 kB chunks with default buffers, fread took about 49 sec and ifstream::read took about 24 sec. > [*] mmap(2) wins on the efficiency metric. When I set the buffer size to 2 MB on a current MacBook, both fread and ifstream::read read a 1 GB file in 2 sec, and the chunk size does not seem to affect it. The 2 seconds is likely the limit of the SSD, so it appears that mmap() is overkill for mundane sequential reads on this platform. Since mmap() is more error-prone it should be perhaps only used when it is easier to understand. For example for random access of large file the code using mmap is easier to understand than code that winds the streams back and forth. |
Paavo Helde <myfirstname@osa.pri.ee>: Feb 22 11:46AM +0200 On 22.02.2018 9:18, Öö Tiib wrote: > only used when it is easier to understand. For example for random > access of large file the code using mmap is easier to understand than > code that winds the streams back and forth. What do you mean by mmap being more error-prone? I do not recall having any problems with it ever. About read/fread/pread: any reading of file content into the user-space buffer first reads it into the OS disk cache[*], then copies it over to the user space buffer. The second step is omitted by mmap(), and the first step is performed only for pages you touch. Also, if the file already happens to be in the disk cache, the first step is altogether omitted. [*] On some platforms one can specify some flags like O_DIRECT to bypass the disk cache, but this will likely make the program slower, not faster. |
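A minimal POSIX sketch of the mapping approach described above, assuming a regular file opened read-only. The function name `sum_bytes_mmap` is hypothetical, and summing the bytes is just a stand-in for real processing. The true length is taken from fstat(), since the mapping itself is page-granular, and an empty file is handled separately because mmap() rejects a zero length.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>     // open()
#include <sys/mman.h>  // mmap(), munmap()
#include <sys/stat.h>  // fstat()
#include <unistd.h>    // close()

// Sum all bytes of a file through a read-only mapping.
// Returns -1 if the file cannot be opened or mapped.
long long sum_bytes_mmap(const char* path)
{
    assert(path != nullptr);
    const int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (::fstat(fd, &st) != 0) { ::close(fd); return -1; }
    if (st.st_size == 0) { ::close(fd); return 0; }  // nothing to map

    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping remains valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    // Pages are faulted in from the page cache as they are touched;
    // there is no separate copy into a user-space buffer.
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    long long sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += bytes[i];

    ::munmap(p, len);
    return sum;
}
```

I/O errors on a mapped file surface as signals rather than return codes, which this sketch does not attempt to handle.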
Jorgen Grahn <grahn+nntp@snipabacken.se>: Feb 22 11:17AM On Thu, 2018-02-22, Paavo Helde wrote: > avoid this copying step, at least on more common platforms. If you are > not using mmap it means you are not interested in performance, so why > discuss this in such great lengths? You can only mmap "true" files -- that's often my reason for not using it. (Also, working on Unix, I rarely have to do binary I/O.) /Jorgen -- // Jorgen Grahn <grahn@ Oo o. . . \X/ snipabacken.se> O o . |
Barry Schwarz <schwarzb@dqel.com>: Feb 21 09:44AM -0800 On Wed, 21 Feb 2018 12:05:57 -0500, "James R. Kuyper" >that I don't understand when using readsome(). I created the following >programs to investigate those results (error handling suppressed for the >sake of clarity): <snip code> >-Wstrict-prototypes -Wmissing-prototypes read_test_c.c -o read_test_c >~/testprog(100) ./read_test_c read_test >0: 256 <snip> >-Woverloaded-virtual -Wsign-promo read_test_c++.cpp -o read_test_c++ >~/testprog(103) ./read_test_c++ read_test >0: 256 <snip> >Could someone explain to me why the C++ version apparently read one more >byte than the C version, which is also one more byte than the file size? >Also, why infile.eof() was false at the end? The description for readsome at cplusplus.com says that it will stop reading when there is no data in the stream buffer, even if end of file has not been reached. That may answer your second question. What are the last few characters in the file? What are the last few characters placed in your array when the short record is read? -- Remove del for email |
Jorgen Grahn <grahn+nntp@snipabacken.se>: Feb 21 07:08PM On Wed, 2018-02-21, James R. Kuyper wrote: > In C, I'd use fread() and check the return value for a short read. > Looking over the C++ standard, I came to the conclusion that readsome() > is what I should use for comparable purposes. Why not istream::read(buf, count)? That's the closest equivalent to fread() for iostreams. /Jorgen -- // Jorgen Grahn <grahn@ Oo o. . . \X/ snipabacken.se> O o . |
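A minimal sketch of that suggestion: read() followed by gcount() gives the same "how many bytes did I actually get" information as the return value of fread(). The helper names `read_records` and `demo` are hypothetical, and an in-memory stream stands in for the file. With the default exception mask (goodbit), the failbit | eofbit set by a short read does not throw; an exception is raised only if those bits were enabled with exceptions().

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read `in` in blocks of `record_size` bytes and return the size of each
// block actually obtained. Only the final block can be short.
std::vector<std::streamsize> read_records(std::istream& in, std::size_t record_size)
{
    assert(record_size > 0);
    std::vector<char> buf(record_size);
    std::vector<std::streamsize> sizes;
    for (;;) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();  // bytes stored by this read()
        if (got > 0) sizes.push_back(got);
        if (!in) break;  // failbit|eofbit after a short read, badbit on I/O error
    }
    return sizes;
}

// Stand-in for a 600-byte binary file read in 256-byte records.
std::vector<std::streamsize> demo()
{
    std::istringstream in(std::string(600, 'x'));
    return read_records(in, 256);
}
```

Here `demo()` yields block sizes of 256, 256 and 88, and the stream ends with eof() true.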
"Öö Tiib" <ootiib@hot.ee>: Feb 22 04:45AM -0800 On Thursday, 22 February 2018 11:46:22 UTC+2, Paavo Helde wrote: > > code that winds the streams back and forth. > What do you mean by mmap being more error-prone? I do not recall having > any problems with it ever. mmap() does nothing a good programmer can't handle; it is merely more complex to use, and so the mechanism is more error-prone. You likely already know the details, so I'll just give the first 3 that pop into mind: * memory mapping uses a fixed page length (let's say multiples of 4KB). In the general case that does not match file sizes (let's say 5KB), and the mismatch always provides a niche for fun for the next maintainer. * when the file size exceeds the addressable space (say 3GB on a 32-bit system whose kernel uses 2 GB) then orchestrating the mapped portions can be fun. * I/O errors raise SIGSEGV on Mac and EXCEPTION_IN_PAGE_ERROR on Windows. Handling the user ejecting mapped media during access is fun. > omitted. > [*] On some platforms one can specify some flags like O_DIRECT to bypass > the disk cache, but this will likely make the program slower, not faster. That is copying from memory to memory. Its pace is something like 0.1 sec/GB, so it is not a major fraction of the overall pace of 2 sec/GB. 2 sec/GB roughly matches the pace of the SSD drive from which the data is read. Activity Monitor shows CPU usage of about 7.2% for the ifstream-reading process at that speed. So that 7.2% is the playground where the alternatives can optimize out redundant memory-to-memory copies and the like. It might be worth the effort, but it is unlikely to alter the speed of the media (which seems to be the actual throughput bottleneck). |
"James R. Kuyper" <jameskuyper@verizon.net>: Feb 21 02:37PM -0500 On 02/21/2018 02:08 PM, Jorgen Grahn wrote: >> is what I should use for comparable purposes. > Why not istream::read(buf, count)? That's the closest equivalent to > fread() for iostreams. 1. It doesn't return the count of characters read, though adding ".gcount()" at the end resolves that problem. 2. 30.7.4.3p30 says "Characters are extracted and stored until either of the following occurs: (30.1) - n characters are stored; (30.2) - end-of-file occurs on the input sequence (in which case the function calls setstate(failbit | eofbit), which may throw ios_base::failure)." I'm relying on reaching end-of-file in order to know when I've read all the records, so having read() throw an exception when that happens would be inconvenient. However, my testing shows that no exception is thrown, so I may be misunderstanding something. |
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 21 10:06PM On Wed, 21 Feb 2018 13:15:50 -0800 (PST) > mattered. Setting optimal buffer (with setvbuf() or > streambuf::pubsetbuf() ) solved it but surprisingly few > programmers were aware of those features. std::streambuf::xsgetn(), and so std::ifstream::read() and std::filebuf::sgetn(), are allowed by the C++ standard on a large block read (in effect, when the buffer size passed in is larger than the streambuffer's own buffer size) to bypass the streambuffer's buffer entirely. std::ifstream::read() and std::filebuf::sgetn() are then passed on directly to unix read() or the windows equivalent. I wonder if that was the reason for the difference with fread(). I am not sure that fread() is entitled to do the same; or if it is, whether that is ever done in fact. |
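For reference, a sketch of the two buffer-setting calls mentioned in the quoted text, setvbuf() and streambuf::pubsetbuf(). The function names `count_with_fread` and `count_with_ifstream` are hypothetical. The effect of pubsetbuf() on a file buffer is implementation-defined; calling it before open(), as done here, is what at least some implementations (libstdc++, to my knowledge) require for it to take effect. Whether a given library then bypasses its buffer on large reads is not something this code can show.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

// Count the bytes in `path` using fread() with a caller-chosen stdio buffer.
std::size_t count_with_fread(const char* path, std::size_t buffer_size, std::size_t chunk)
{
    assert(chunk > 0);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return 0;
    std::vector<char> stdio_buf(buffer_size);
    // setvbuf() must be called before any other operation on the stream.
    if (std::setvbuf(f, stdio_buf.data(), _IOFBF, stdio_buf.size()) != 0) {
        std::fclose(f);
        return 0;
    }
    std::vector<char> block(chunk);
    std::size_t total = 0;
    std::size_t n = 0;
    while ((n = std::fread(block.data(), 1, block.size(), f)) > 0) total += n;
    std::fclose(f);  // closed while stdio_buf is still alive
    return total;
}

// Same count using ifstream::read() with a caller-chosen stream buffer.
std::size_t count_with_ifstream(const char* path, std::size_t buffer_size, std::size_t chunk)
{
    assert(chunk > 0);
    std::vector<char> stream_buf(buffer_size);  // declared first, so it outlives `in`
    std::ifstream in;
    in.rdbuf()->pubsetbuf(stream_buf.data(), static_cast<std::streamsize>(stream_buf.size()));
    in.open(path, std::ios::binary);
    std::vector<char> block(chunk);
    std::size_t total = 0;
    // A short final read sets failbit, but gcount() still reports its bytes.
    while (in.read(block.data(), static_cast<std::streamsize>(block.size())) || in.gcount() > 0)
        total += static_cast<std::size_t>(in.gcount());
    return total;
}
```

Timing the two functions against the same large file with different `buffer_size` and `chunk` values is one way to reproduce the kind of comparison discussed in this thread.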
Manfred <noname@invalid.add>: Feb 22 06:08PM +0100 On 2/21/2018 7:50 PM, James R. Kuyper wrote: > On 02/21/2018 12:44 PM, Barry Schwarz wrote: >> On Wed, 21 Feb 2018 12:05:57 -0500, "James R. Kuyper" <snip> >>> Could someone explain to me why the C++ version apparently read one more >>> byte than the C version, which is also one more byte than the file size? >>> Also, why infile.eof() was false at the end? From Bjarne's book, about unformatted input: "If you have a choice, use formatted input instead of these low-level input functions." - the low-level input functions mentioned here include istream::read() > how the stream buffer gets filled, so I assumed that it was either a > mistake on their part, or a misunderstanding of what they were saying on > my part. From the same Bjarne's book, following page: "The following functions depend on the detailed interaction between the stream buffer and the real data source and should only be used if necessary and then very carefully" - the functions mentioned here include istream::readsome() > Still, that's probably the explanation for what I found (see below). > Reviewing the description of readsome() with that in mind, it means that > in_avail() has a different meaning than I thought it did. In my experience, the iostream library was historically one of the most controversial among C++ standard libraries (as opposed to STL which has been doing great since the beginning). C++11 improved a lot on this, but still for unformatted (a.k.a. binary) IO, the good old fread()/fwrite() are hard to beat in terms of code cleanliness and robustness, meaning istream::read() probably reaches a comparable level, but not much more, IMHO. Apart of traps like that indicated by Öö Tiib, I think the preference for istream::read() would be more due to uniformity with the rest of the code. |
"James R. Kuyper" <jameskuyper@verizon.net>: Feb 22 12:43PM -0500 On 02/22/2018 12:08 PM, Manfred wrote: ... > "If you have a choice, use formatted input instead of these low-level > input functions." - the low-level input functions mentioned here include > istream::read() I don't think I have a choice - this file contains binary data. ... > Apart of traps like that indicated by Öö Tiib, I think the preference > for istream::read() would be more due to uniformity with the rest of the > code. Having compared them, I tend to agree with that. The C++ code I posted was no simpler nor any more type safe than the C code. The C++ code is, however, vastly more customizable: could have used basic_ifstream<charT, traits> |
"James R. Kuyper" <jameskuyper@verizon.net>: Feb 22 12:54PM -0500 I accidentally hit "Send" on the wrong message, before it was complete. On 02/22/2018 12:08 PM, Manfred wrote: ... > "If you have a choice, use formatted input instead of these low-level > input functions." - the low-level input functions mentioned here include > istream::read() I don't think I have a choice - this file contains binary data. ... > Apart of traps like that indicated by Öö Tiib, I think the preference > for istream::read() would be more due to uniformity with the rest of the > code. Having compared them, I tend to agree with that. The C++ code I posted was no simpler nor any more type safe than the C code. The only significant advantage that the C++ code has is that it's vastly more customizable: I could have used basic_ifstream<charT, traits> with my own classes for charT or traits, or provided my own class derived from basic_istream<>. |
Manfred <noname@invalid.add>: Feb 22 07:18PM +0100 On 2/22/2018 6:54 PM, James R. Kuyper wrote: > more customizable: I could have used basic_ifstream<charT, traits> with > my own classes for charT or traits, or provided my own class derived > from basic_istream<>. True, but then you would probably not be handling the stream as a pure bytestream, iow it would not be 'unformatted' in strict terms (and then the question would be what is the behavior of read() with charT other than char) A more significant advantage, obviously, would be the use of a common istream interface for different kinds of stream e.g. stringstream and fstream. |