Wednesday, August 2, 2023

Digest for comp.lang.c++@googlegroups.com - 8 updates in 1 topic

Fonntuggnio <JoeFonntuggnio@libbbero.it>: Aug 02 07:38PM +0200

Sorry for the total OT, but I failed to build a RegEx with
the "help" (rotfl) of three different so-called AIs, getting
nowhere.
 
I am scanning an HTML document (not in javascript, so I do
not have access to DOM nodes from inside) and I need to
match EVERY <p> whole tag.
 
By whole I mean starting from the <p and ending with the
corresponding </p>, but such paragraphs MAY (or may not)
contain:
 
A long list of attributes, possibly separated by valid \n,
\r or \t characters, before the >.
 
An innerText, possibly multiline, which may likewise contain
\n, \r or \t characters.
 
I have tried most suggestions from Bearly, ChatGPT and
You.Com AI, but none worked.
 
(My test bed is the RegEx engine of the Kate editor with the
HTML loaded. It is handy since it highlights the matches in
yellow, and I can verify that the RegExes I tried fail to
detect perfectly valid paragraphs.)
 
If somebody happens to be familiar with RegExes that handle
"invisible" characters ... I'd be very grateful for any hint.
Ciao!
Christian Gollwitzer <auriocus@gmx.de>: Aug 02 08:00PM +0200

On 02.08.23 at 19:38, Fonntuggnio wrote:
> access to DOM nodes from inside) and I need to match EVERY <p> whole tag.
 
> for whole I mean, starting from the <p and ending with the corresponding
> </p>, but such paragraphs MAY (or may not) contain
 
This may not be possible at all. RegExes cannot match nested pairs,
i.e. if your <p></p> contains other <p></p> pairs then you have reached
the limit of what an RE is capable of. Also, due to the way these tags
are structured, you need at least negative lookahead, which not all RE
engines support either.
 
If you do
<p.*</p>
 
then the RE would match from the first <p to the last </p>; hence you
need to guard the .* with a negative lookahead like (?!</p>), or use a
non-greedy quantifier.
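
To illustrate the greedy versus non-greedy behaviour described above, here is a minimal sketch in Python's re module. This is an assumption for demonstration only: Kate's engine may behave differently, and the sample string is made up. It compares the plain greedy pattern, a lazy quantifier, and the lookahead-guarded variant on a single-line input.

```python
import re

# Made-up single-line sample with two separate paragraphs.
html = "<p>first</p><div>x</div><p>second</p>"

# Greedy: .* runs to the LAST </p>, swallowing everything in between.
greedy = re.findall(r"<p.*</p>", html)

# Lazy: .*? stops at the FIRST </p> after each <p.
lazy = re.findall(r"<p.*?</p>", html)

# Guarded: each character is consumed only if </p> does not start there.
tempered = re.findall(r"<p(?:(?!</p>).)*</p>", html)

print(greedy)    # one long match
print(lazy)      # two matches
print(tempered)  # same two matches as the lazy version
```

The lazy and the guarded patterns give the same result here; the guarded form is mainly useful in engines or sub-patterns where a lazy quantifier alone is not enough.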
 
 
> (my test is the RegEx engine from KATE Editor with the loaded HTML.
> If sb happens to be familiar with RegEx supporting "invisible"
> characters ... I'd be very grateful for any hint.
 
This may well be the problem. Some RE engines treat newline
characters specially, i.e. it may be that Kate matches only *within* a
line.
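
The newline issue can be reproduced outside the editor. The following is a small sketch using Python's re module (an assumption; it says nothing definite about Kate). In Python, "." does not match \n by default, so a paragraph spanning several lines is missed unless the DOTALL flag is set or the dot is replaced by a class such as [\s\S] that matches any character.

```python
import re

# Made-up paragraph whose attributes and text span several lines.
html = '<p class="a"\n\tid="x">line one\r\nline two</p>'

# Default mode: "." stops at \n, so the pattern cannot reach </p>.
default = re.findall(r"<p.*?</p>", html)

# DOTALL lets "." match newlines as well.
dotall = re.findall(r"<p.*?</p>", html, flags=re.DOTALL)

# [\s\S] matches any character regardless of flags.
portable = re.findall(r"<p[\s\S]*?</p>", html)

print(default)            # no match at all
print(dotall == [html])   # the whole paragraph is matched
print(portable == dotall)
```

The [\s\S] form is worth trying in engines where it is unclear how to switch on the "dot matches newline" behaviour.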
 
In short, an RE engine may simply not be a good tool for this. In that
case, use an XML parser instead.
 
Christian
Paavo Helde <eesnimi@osa.pri.ee>: Aug 02 09:15PM +0300

On 02.08.2023 at 20:38, Fonntuggnio wrote:
> access to DOM nodes from inside) and I need to match EVERY <p> whole tag.
 
> for whole I mean, starting from the <p and ending with the corresponding
> </p>, but such paragraphs MAY (or may not) contain
 
I'm afraid HTML cannot be parsed with a regex in general. Also, the HTML
rules are very lax; for example, there is no guarantee that a
corresponding terminating </p> tag actually appears.
 
Also, there is no guarantee that the actual content is contained in the
<p> tags; it might well be outside them, and all the <p> tags might
actually be empty <p/>.
 
For extracting the content of unknown pages reliably you would probably
need some kind of a state machine, with a fair knowledge of obscure HTML
rules. Of course, there are libraries for that.
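
As one possible illustration of the state-machine approach, here is a minimal sketch built on Python's standard html.parser module. The class ParagraphCollector and the function extract_paragraphs are hypothetical names invented for this example. It only handles the simplest lax rule mentioned above (a new <p> ends an open one); real HTML has more implicit-close rules, which a full library would cover.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element, tolerating a missing </p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished paragraph texts
        self._buf = None      # text pieces of the open paragraph, or None

    def _close_paragraph(self):
        if self._buf is not None:
            self.paragraphs.append("".join(self._buf).strip())
            self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._close_paragraph()  # a new <p> implicitly ends an open one
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_paragraph()

    def handle_data(self, data):
        if self._buf is not None:    # keep only text found inside a <p>
            self._buf.append(data)

    def close(self):
        super().close()
        self._close_paragraph()      # flush a paragraph left open at the end


def extract_paragraphs(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up input: multi-line attributes, a nested <em>, and a missing </p>.
sample = '<p class="x"\n id="a">one <em>two</em></p><p>three<p>four</p>'
print(extract_paragraphs(sample))
```

The parser deals with attributes, line breaks and tabs inside tags on its own, so none of the "invisible character" handling has to be spelled out by hand.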
scott@slp53.sl.home (Scott Lurndal): Aug 02 06:20PM


>For extracting the content of unknown pages reliably you would probably
>need some kind of a state machine, with a fair knowledge of obscure HTML
>rules. Of course, there are libraries for that.
 
The best way to deal with HTML is to use XSLT processors. You may want
to run the HTML text through a canonicalizer first.
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Aug 02 11:58AM -0700


> I am scanning an HTML document (not in javascript, so I do not have
> access to DOM nodes from inside) and I need to match EVERY <p> whole
> tag.
[...]
 
https://stackoverflow.com/a/1732454/827263
 
"TONY THE PONY HE COMES"
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
MarioCPPP <NoliMihiFrangereMentulam@libero.it>: Aug 03 12:52AM +0200

On 02/08/23 20:00, Christian Gollwitzer wrote:
>> contain
 
> This may not be possible at all. RegExes cannot match
> nesting pairs, i.e. if your <p></p> contains other <p></p>
 
This may be safely excluded. Other types of tags (like <i> or
<em>) may be nested, but not <p> itself. Is it still a problem?
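
For the restricted case described here (no <p> inside <p>, only inline tags such as <i> or <em>), a regex can be sketched as follows. This uses Python's re module as an assumption, not Kate's engine; the name P_TAG and the sample text are made up. The pattern breaks if an attribute value contains a literal ">" and it does not cope with a missing </p>.

```python
import re

# \b stops "<p" from also matching <pre>, <param> and similar tags.
# [^>]* spans the attribute list, including any newlines and tabs.
# [\s\S]*? lazily spans the body up to the first closing </p>.
P_TAG = re.compile(r"<p\b[^>]*>([\s\S]*?)</p\s*>", re.IGNORECASE)

html = (
    "<pre>not a paragraph</pre>\n"
    '<p style="margin: 0"\n\talign="left">Some <i>italic</i>\n'
    "and <em>emphasised</em> text</p>\n"
    "<P>second</P>"
)

# group(1) is the paragraph body without the surrounding <p ...> and </p>.
bodies = [m.group(1) for m in P_TAG.finditer(html)]
for body in bodies:
    print(repr(body))
```

Nested inline tags are harmless in this setting because the lazy body simply runs through them until the first </p>.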
 
 
> then the RE would catch from the first <p to the last </p>,
> hence you need to specify the .* with a lookahead like
> (?!</p>), or use a non-greedy RE.
 
Alas, the .* seems to fail when facing multiline text and tabs.
 
 
> This may as well be the problem. Some RE engines treat
> newline characters as special, i.e. it may be that Kate
> matches only *within* a line.
 
Hmm, interesting.
What other editor would you recommend, then?
 
 
--
1) Resist, resist, resist.
2) If everyone pays taxes, the taxes are paid by everyone
MarioCPPP
MarioCPPP <NoliMihiFrangereMentulam@libero.it>: Aug 03 12:55AM +0200

On 02/08/23 20:15, Paavo Helde wrote:
 
>> I am scanning an HTML document (not in javascript, so I do
>> not have access to DOM nodes from inside) and I need to
>> match EVERY <p> whole tag.
 
It is HTML generated by LibreOffice from an .odt file, so it is
rather well formed (if not elegant).
 
> Also, the HTML rules are very lax, for example there is no
> such guarantee that there actually appears a corresponding
> terminating </p> tag.
 
I have read through a lot of the generated code manually, and I
found no evidence of bad formatting.
 
 
> Also, there is no guarantee that the actual content is
> contained in the <p> tags, it might well be outside and all
> the <p> tags might actually be empty <p/>.
 
True, <td> and some other elements contain rendered text. But
for my purpose paragraphs alone should suffice.
 
 
> For extracting the content of unknown pages
 
They are not unknown: they are .odt files exported as HTML by
LibreOffice.
 
 
MarioCPPP <NoliMihiFrangereMentulam@libero.it>: Aug 03 12:56AM +0200

On 02/08/23 20:20, Scott Lurndal wrote:
>> rules. Of course, there are libraries for that.
 
> Best way to deal with HTML is using xslt processors. You may want
> to run the html text through a canonicalizer first.
 
Both terms were unknown to me, so thank you: now I can do
some searching.
 
 