Monday, July 20, 2020

Digest for comp.lang.c++@googlegroups.com - 16 updates in 5 topics

Mike Copeland <mrc2323@cox.net>: Jul 19 04:34PM -0700

I am working on a C++ source file analyzer; I had one that worked for
C sources. That program was written many years ago, and I'm attempting
to update it for C++ code, as well as use C++ structures and features.
The code for parsing C++ code is tedious, and I'm looking for a
library or functional code that will (1) parse non-comment code elements
and (2) return token strings.
Is there something I can link to/use that will help me? I've done
some Google searching and have seen references to Clang, Elsa, Metre and
ANTLR - all of which seem much more than I need. I just want source
code tokens and to know which source code line they're from.
TIA
 
Ian Collins <ian-news@hotmail.com>: Jul 20 11:41AM +1200

On 20/07/2020 11:34, Mike Copeland wrote:
> some Google searching and have seen references to Clang, Elsa, Metre and
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.
 
https://clang.llvm.org/doxygen/classclang_1_1Parser.html
 
--
Ian.
Sam <sam@email-scan.com>: Jul 19 08:41PM -0400

Mike Copeland writes:
 
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.
> TIA
 
Maybe for some subset of C++ grammar one could come up with a "Simple C++
Source Parser".
 
But parsing the full C++ syntax, especially C++11 and higher,
is a project of a lifetime, I'm afraid.
"Öö Tiib" <ootiib@hot.ee>: Jul 19 11:35PM -0700

On Monday, 20 July 2020 02:34:14 UTC+3, Mike Copeland wrote:
> some Google searching and have seen references to Clang, Elsa, Metre and
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.
 
In the general case it is impossible to make a simple parser for logical
analysis of C++ code, since its grammar is large and full of
complications.
 
The simplest parsers are made for syntax highlighting or
automatic reformatting, but they are usually uninterested in the
meaning of the code, so their results are not suitable for substantive
analysis.
Example ... Artistic Style.
 
There are somewhat more aware parsers made for automatic documentation.
Those are more complex, and you will get somewhat better tagged results
from them. Additionally, those parsers tend to pay a
lot of attention to the contents of comments, but you can ignore that
aspect.
Example ... Doxygen.
 
Even those tools are relatively far from trivial, but since
we do not know your goals and you say Clang is more than
you need ... then perhaps try them.
Scott Newman <scott69@gmail.com>: Jul 20 08:42AM +0200

C++ is only a bit more complicated to parse than C.
Try to write a parser on your own.
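
For anyone attempting this by hand, the very first step (ignoring comments
while keeping line numbers intact) might look like the sketch below. It is
only a comment blanker, nowhere near a parser. The name blank_comments is
made up for illustration, and the sketch assumes the source contains no raw
string literals, no digit separators such as 1'000, and no backslash-newline
continuations inside // comments.

```cpp
#include <cstddef>
#include <string>

// Replace every comment character with a space while keeping all newlines,
// so line numbers in the result still match the original source.
std::string blank_comments(const std::string& src) {
    enum class State { Code, LineComment, BlockComment, Str, Chr };
    State st = State::Code;
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        switch (st) {
        case State::Code:
            if (c == '/' && next == '/') {
                st = State::LineComment;
                out += "  ";
                ++i;
            } else if (c == '/' && next == '*') {
                st = State::BlockComment;
                out += "  ";
                ++i;
            } else {
                if (c == '"') st = State::Str;
                else if (c == '\'') st = State::Chr;
                out += c;
            }
            break;
        case State::LineComment:
            if (c == '\n') {
                st = State::Code;
                out += c;
            } else {
                out += ' ';
            }
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') {
                st = State::Code;
                out += "  ";
                ++i;
            } else {
                out += (c == '\n' ? '\n' : ' ');
            }
            break;
        case State::Str:
        case State::Chr:
            out += c;
            if (c == '\\' && next != '\0') {
                out += next;  // copy the escaped character untouched
                ++i;
            } else if ((st == State::Str && c == '"') ||
                       (st == State::Chr && c == '\'')) {
                st = State::Code;
            }
            break;
        }
    }
    return out;
}
```

Because the output has the same length and the same newlines as the input,
any token found later can be mapped back to its original line simply by
counting newlines up to its offset.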
David Brown <david.brown@hesbynett.no>: Jul 20 10:49AM +0200

On 20/07/2020 01:34, Mike Copeland wrote:
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.
> TIA
 
I think it is unlikely that you'll get far without using a big project.
Parsing C++ has got more and more difficult - there are more new
syntaxes, context-dependent keywords, even a new operator in the latest
version. I would recommend you look again at existing parsers, and see
if you can learn to use them.
 
It might take you time to get the hang of clang as a parser, but that's
a job you do once - and then you can take advantage of all the work they
do and you don't have to update or re-write things for each new C++
version. clang is /designed/ to be usable as a library, and as a
parser, for syntax highlighting in IDEs, for making static analysers,
for JIT compilation, and other tools.
 
I don't know the other tools you mentioned, but I personally would
definitely concentrate on clang first. I'd start with the existing
clang analyser, and see where that could take me - that could be a very
good starting point for adding the new analysers that interest you.
 
(gcc might also be worth a look these days. There is an analyser
framework in the latest version, there is support for plugins that can
get access to parsed source information for checking, with existing
plugins for other kinds of static or style checking. There is even a
project underway for making a JIT compiler library of gcc. I don't
think gcc is as far down this path as clang, but maybe it is of use.)
Paavo Helde <eesnimi@osa.pri.ee>: Jul 20 12:55PM +0300

20.07.2020 02:34 Mike Copeland kirjutas:
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.
> TIA
 
If you just want tokens without any knowledge of what they mean, then this
should be pretty straightforward: the C++ preprocessor does exactly
that. It removes comments and outputs token strings. It also helpfully adds
extra spaces between tokens which would otherwise appear glued together,
and it outputs file names and line numbers, so keeping track of line
numbers should be easy. So if I were given this task, I would start by
getting my toolchain to output preprocessed source files instead of
object files.
 
In preprocessed source, extracting tokens is simple in general, except
for string literals and especially raw string literals which are a bit
more tricky.
 
Beware though that tokens without meaning do not give you much. If all
you know is that there is a token 'final' on line 5095, it does not even
tell you if this is a C++ keyword or some other name, not to speak about
in which scope, namespace or class it belongs to.
 
Also, if there are different preprocessor branches, only one of them
survives after the preprocessing step.
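
A minimal sketch of this approach, assuming the preprocessed text comes from
GCC or Clang run with -E, which emit linemarkers of the form
# 42 "file.cpp". Other compilers may use a different form (such as #line),
which this sketch does not handle. The names Token, parse_linemarker,
lex_line and tokenize_preprocessed are made up for illustration. The lexing
is deliberately naive: every punctuation character becomes its own token (so
:: comes out as two tokens and 1.5 as three), and raw string literals are not
handled at all.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Token {
    std::string text;  // token spelling
    std::string file;  // file named by the most recent linemarker
    int line;          // line number within that file
};

// Recognise a linemarker of the form:  # 42 "name.cpp" [flags]
bool parse_linemarker(const std::string& s, int& line, std::string& file) {
    std::istringstream in(s);
    char hash = 0;
    int n = 0;
    if (!(in >> hash) || hash != '#' || !(in >> n)) return false;
    const std::size_t open = s.find('"');
    const std::size_t close = s.rfind('"');
    if (open == std::string::npos || close <= open) return false;
    file = s.substr(open + 1, close - open - 1);
    line = n;
    return true;
}

// Split one line into identifier/number runs, quoted literals, and
// single punctuation characters.
void lex_line(const std::string& s, const std::string& file, int line,
              std::vector<Token>& out) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::isspace(c)) { ++i; continue; }
        const std::size_t start = i;
        if (std::isalnum(c) || c == '_') {
            while (i < s.size() &&
                   (std::isalnum(static_cast<unsigned char>(s[i])) || s[i] == '_'))
                ++i;
        } else if (c == '"' || c == '\'') {
            const char quote = s[i];
            ++i;
            while (i < s.size() && s[i] != quote) {
                if (s[i] == '\\' && i + 1 < s.size()) ++i;  // skip escaped char
                ++i;
            }
            if (i < s.size()) ++i;  // consume the closing quote
        } else {
            ++i;  // one punctuation character
        }
        out.push_back({s.substr(start, i - start), file, line});
    }
}

std::vector<Token> tokenize_preprocessed(std::istream& in) {
    std::vector<Token> out;
    std::string text;
    std::string file = "<unknown>";
    int line = 1;
    while (std::getline(in, text)) {
        int n = 0;
        std::string f;
        if (parse_linemarker(text, n, f)) {  // next line is line n of f
            line = n;
            file = f;
            continue;
        }
        const std::size_t first = text.find_first_not_of(" \t");
        if (first != std::string::npos && text[first] == '#') {
            ++line;  // some other directive left in the output, e.g. #pragma
            continue;
        }
        lex_line(text, file, line, out);
        ++line;
    }
    return out;
}
```

Typical use would be to pipe the compiler's -E output into a program that
calls tokenize_preprocessed(std::cin) and then filters the tokens by file
name, since the output also contains everything pulled in from headers.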
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Jul 20 10:53AM -0700

Paavo Helde <eesnimi@osa.pri.ee> writes:
[...]
> you know is that there is a token 'final' on line 5095, it does not
> even tell you if this is a C++ keyword or some other name, not to
> speak about in which scope, namespace or class it belongs to.
 
If you see 'final' on line 5095, you can be sure it's not a C++ keyword.
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */
James Kuyper <jameskuyper@alumni.caltech.edu>: Jul 20 06:42PM -0400

On 7/20/20 1:53 PM, Keith Thompson wrote:
>> even tell you if this is a C++ keyword or some other name, not to
>> speak about in which scope, namespace or class it belongs to.
 
> If you see 'final' on line 5095, you can be sure it's not a C++ keyword.
 
For the benefit of those who don't already know what Keith is referring
to: the C++ standard does not describe either override or final as
keywords. Instead, what it says about them is:
 
"The identifiers in Table 4 have a special meaning when appearing in a
certain context. When referred to in the grammar, these identifiers are
used explicitly rather than using the identifier grammar production.
Unless otherwise specified, any ambiguity as to whether a given
identifier has a special meaning is resolved to interpret the token as a
regular identifier."
 
Note that this is a meaningful distinction. Keywords can never be used
as regular identifiers - they are always parsed as keywords, and if they
appear in a place where that keyword is not permitted, it's a syntax
error. When used anywhere other than the specific context where they're
referred to in the grammar, override and final can be used as ordinary
identifiers.
In principle, the part about ambiguities being resolved in favor of the
regular identifier is another difference, but after a careful review of
all of the relevant grammar rules, I can't figure out any way to create
such an ambiguity - I may have missed something.
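
A small example of the distinction, with made-up names (Base, Derived,
sum_identifiers). It should compile as C++11 or later: final and override act
as specifiers in the class and member declarations, and as plain variable
names inside the function.

```cpp
struct Base {
    virtual ~Base() = default;
    virtual int value() const { return 1; }
};

// Here `final` and `override` appear in the grammar positions where they
// have their special meaning.
struct Derived final : Base {
    int value() const override { return 2; }
};

// Here the same spellings are ordinary identifiers. Doing the same with a
// true keyword (for example `int class = 40;`) would be a syntax error.
int sum_identifiers() {
    int final = 40;
    int override = 2;
    return final + override;
}
```

A token-level tool therefore cannot classify final or override from the
spelling alone; it has to look at where in the declaration the token appears.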
Frederick Gotham <cauldwell.thomas@gmail.com>: Jul 20 02:31PM -0700

Let me start off with this easy example:
 
#define MONKEY 5
 
#ifdef MONKEY
 
int Func(void) { return MONKEY; }
 
#else
 
int Func(void) { return 0; }

#endif
 
