- [OT] Help for a RegEx - 8 Updates
Fonntuggnio <JoeFonntuggnio@libbbero.it>: Aug 02 07:38PM +0200

Sorry for the total OT, but I failed to build a RegEx with the "help" (rotfl) of three different so-called AIs and got nowhere.

I am scanning an HTML document (not in javascript, so I do not have access to DOM nodes from inside) and I need to match EVERY whole <p> tag. By whole I mean starting from the <p and ending with the corresponding </p>. Such paragraphs MAY (or may not) contain a long list of attributes before the >, with zero or more valid \n \r \t characters among them. The innerText is possibly multiline, also with zero or more \n \r \t characters inside the text.

I have tried most suggestions from Bearly, ChatGPT and You.com AI, but none worked. My test is the RegEx engine of the KATE editor with the HTML loaded. It is handy since it highlights the matches in yellow, and I can verify that the RegExes I tried fail to detect perfectly valid paragraphs.

If somebody happens to be familiar with RegExes supporting "invisible" characters ... I'd be very grateful for any hint.

Ciao !
Christian Gollwitzer <auriocus@gmx.de>: Aug 02 08:00PM +0200

On 02.08.23 at 19:38, Fonntuggnio wrote:
> access to DOM nodes from inside) and I need to match EVERY <p> whole tag.
> for whole I mean, starting from the <p and ending with the corresponding
> </p>, but such paragrapha MAY (and may not) contain

This may not be possible at all. RegExes cannot match nesting pairs, i.e. if your <p></p> contains other <p></p> pairs then you have reached the end of what a RE is capable of. Also, due to the way these tags are structured, a plain greedy pattern is not enough: if you do <p.*</p> then the RE would catch everything from the first <p to the last </p>. Hence you need to guard the .* with a negative lookahead like (?!</p>), which not all RE engines support, or use a non-greedy RE.

> (my test is the RegEx engine from KATE Editor with the loaded HTML.
> If sb happens to be familiar with RegEx supporting "invisible"
> characters ... I'd be very grateful for any hint.

This may as well be the problem. Some RE engines treat newline characters as special, i.e. it may be that Kate matches only *within* a line.

In short: maybe a RE engine is simply not a good tool to do that. In that case, use an XML parser instead.

Christian
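A minimal sketch of the non-greedy approach described above, written with std::regex since this is comp.lang.c++. The function name find_paragraphs is hypothetical, and the pattern assumes what the thread later states about the input: <p> elements are never nested, every <p> has a closing </p>, and no attribute value contains a literal '>'. It is not a general HTML parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every "<p ...> ... </p>" span found in html.
// Assumes no nested <p>, a closing </p> for every <p>, and no '>' inside
// attribute values.
std::vector<std::string> find_paragraphs(const std::string& html) {
    // <p        the tag name
    // \b        word boundary, so <pre> or <param> are not picked up
    // [^>]*     the attribute list; a negated class also matches \n \r \t
    // [\s\S]*?  any character including newlines, as few as possible
    // </p>      the first closing tag that follows
    static const std::regex re(R"(<p\b[^>]*>[\s\S]*?</p>)", std::regex::icase);

    std::vector<std::string> out;
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

The [\s\S] class is used instead of . because, in the ECMAScript grammar that std::regex uses by default, . does not match line terminators. Some standard library implementations of std::regex recurse deeply on long matches, so very large documents may be better served by a parser.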
Paavo Helde <eesnimi@osa.pri.ee>: Aug 02 09:15PM +0300

On 02.08.2023 at 20:38, Fonntuggnio wrote:
> access to DOM nodes from inside) and I need to match EVERY <p> whole tag.
> for whole I mean, starting from the <p and ending with the corresponding
> </p>, but such paragrapha MAY (and may not) contain

I'm afraid HTML cannot be parsed with a regex in general.

Also, the HTML rules are very lax; for example, there is no guarantee that a corresponding terminating </p> tag actually appears.

Also, there is no guarantee that the actual content is contained in the <p> tags. It might well be outside, and all the <p> tags might actually be empty <p/>.

For extracting the content of unknown pages reliably you would probably need some kind of a state machine, with a fair knowledge of obscure HTML rules. Of course, there are libraries for that.
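To make the state-machine idea concrete, here is a deliberately small sketch. It does not implement the HTML rules mentioned above: it assumes lowercase tags, no nested <p>, an explicit </p> for every paragraph, and no '>' inside attribute values. The function name extract_paragraph_bodies and the three states are illustrative only; a real HTML library handles far more cases.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects the text between each <p ...> and its </p>.
// Toy scanner under the assumptions stated above, not an HTML parser.
std::vector<std::string> extract_paragraph_bodies(const std::string& html) {
    enum class State { Outside, InOpenTag, InBody };
    State state = State::Outside;
    std::vector<std::string> bodies;
    std::string current;
    std::size_t i = 0;

    while (i < html.size()) {
        switch (state) {
        case State::Outside:
            // Accept "<p" only when followed by '>' or whitespace,
            // so that <pre>, <param> and similar tags are skipped.
            if (html.compare(i, 2, "<p") == 0 && i + 2 < html.size() &&
                (html[i + 2] == '>' ||
                 std::isspace(static_cast<unsigned char>(html[i + 2])))) {
                state = State::InOpenTag;
                i += 2;
            } else {
                ++i;
            }
            break;
        case State::InOpenTag:
            // Skip the attribute list, newlines and tabs included.
            if (html[i] == '>') state = State::InBody;
            ++i;
            break;
        case State::InBody:
            if (html.compare(i, 4, "</p>") == 0) {
                bodies.push_back(current);
                current.clear();
                state = State::Outside;
                i += 4;
            } else {
                current += html[i];
                ++i;
            }
            break;
        }
    }
    return bodies;  // a paragraph left unterminated at end of input is dropped
}
```

Nested inline tags such as <i> or <em> are kept verbatim in the returned bodies; stripping them would need further states.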
scott@slp53.sl.home (Scott Lurndal): Aug 02 06:20PM

>For extracting the content of unknown pages reliably you would probably
>need some kind of a state machine, with a fair knowledge of obscure HTML
>rules. Of course, there are libraries for that.

Best way to deal with HTML is using xslt processors. You may want to run the html text through a canonicalizer first.
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Aug 02 11:58AM -0700

> I am scanning an HTML document (not in javascript, so I do not have
> access to DOM nodes from inside) and I need to match EVERY <p> whole
> tag.
[...]

https://stackoverflow.com/a/1732454/827263

"TONY THE PONY HE COMES"

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
MarioCPPP <NoliMihiFrangereMentulam@libero.it>: Aug 03 12:52AM +0200

On 02/08/23 20:00, Christian Gollwitzer wrote:
>> contain
> This may not be possible at all. RegExes cannot match
> nesting pairs, i.e. if your <p></p> contains other <p></p>

This may be safely excluded. Other types of tags (like <i> or <em>) may be nested, but not <p> itself. Is it still a problem?

> then the RE would catch from the first <p to the last </p>,
> hence you need to specify the .* with a lookahead like
> (?!</p>), or use a non-greedy RE.

The .* seems to fail when facing multiline text and tabs, alas.

> This may as well be the problem. Some RE engines treat
> newline characters as special, i.e. it may be that Kate
> matches only *within* a line.

Mmmmmm, interesting. What other editor would you recommend then?

--
1) Resist, resist, resist.
2) If everyone pays taxes, taxes are paid by everyone
MarioCPPP
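The multiline failure can be reproduced outside any editor. The sketch below compares a .*? pattern with the negative-lookahead form suggested earlier in the thread, using the default ECMAScript grammar of std::regex, where . matches tabs but not line terminators. Whether Kate's engine behaves the same way is an assumption; the names count_matches, kDotPattern and kLookaheadPattern are hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Non-greedy dot: cannot cross a newline in the ECMAScript grammar.
const std::string kDotPattern = R"(<p\b[^>]*>.*?</p>)";

// Negative lookahead: consume any character, newlines included,
// as long as the text at that position does not start with "</p>".
const std::string kLookaheadPattern = R"(<p\b[^>]*>(?:(?!</p>)[\s\S])*</p>)";

// Hypothetical helper: number of non-overlapping matches of pattern in text.
std::size_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    auto first = std::sregex_iterator(text.begin(), text.end(), re);
    auto last = std::sregex_iterator();
    return static_cast<std::size_t>(std::distance(first, last));
}
```

On input with one paragraph containing a tab and another containing a newline, the dot pattern finds only the first while the lookahead pattern finds both, which suggests that the newline, not the tab, is what breaks .* in engines of this style.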
MarioCPPP <NoliMihiFrangereMentulam@libero.it>: Aug 03 12:55AM +0200

On 02/08/23 20:15, Paavo Helde wrote:
>> I am scanning an HTML document (not in javascript, so I do
>> not have access to DOM nodes from inside) and I need to
>> match EVERY <p> whole tag.

It is HTML generated by LibreOffice from .odt, so rather well formatted (if not elegant).

> Also, the HTML rules are very lax, for example there is no
> such guarantee that there actually appears a corresponding
> terminating </p> tag.

I have read a lot of the generated code manually, and I found no evidence of bad formatting.

> Also, there is no guarantee that the actual content is
> contained in the <p> tags, it might well be outside and all
> the <p> tags might actually be empty <p/>.

True, <td> and some other tags contain some rendered text. But for my purpose just the paragraphs could suffice.

> For extracting the content of unknown pages

They are not unknown: they are .odt files exported as HTML by LibreOffice.

--
1) Resist, resist, resist.
2) If everyone pays taxes, taxes are paid by everyone
MarioCPPP
MarioCPPP <NoliMihiFrangereMentulam@libero.it>: Aug 03 12:56AM +0200

On 02/08/23 20:20, Scott Lurndal wrote:
>> rules. Of course, there are libraries for that.
> Best way to deal with HTML is using xslt processors. You may want
> to run the html text through a canonicalizer first.

Both terms were unknown to me beforehand, so I thank you, since now I can do some searches.

--
1) Resist, resist, resist.
2) If everyone pays taxes, taxes are paid by everyone
MarioCPPP