Conversation
I was getting failures where URLs with underscores were getting <em> blocks thrown in them, creating invalid HTML.
Most don't pass (and are disabled), but should. Unfortunately, adding a word boundary (\b) to the strong regex works for these tests, but somehow the full parser then breaks. The regex that I believed should work is added as a comment, for anyone wishing to make things work going forward.
| // match. | ||
|
|
||
| std::string text = "some ___text testing it__ out"; | ||
| std::string expected = "some <strong>_text testing it</strong> out"; |
There was a problem hiding this comment.
In the CommonMark Spec, there is this example:
markdown "___foo__\n"
html "<p>_<strong>foo</strong></p>\n"
example 456
Also GitHub spec: https://github.github.com/gfm/#example-465
So I guess, the _ in the expected string has to come before the strong tag.
| void Parse(std::string& line) override | ||
| { | ||
| // Modifed from previous version, with help from | ||
| // https://stackoverflow.com/questions/61346949/regex-for-markdown-emphasis |
There was a problem hiding this comment.
I would suggest to put this comment into your commit message instead and remove it from here, because as a comment it could become outdated at some point.
| // it then passes all the 'disabled' tests in the 'strong parser' | ||
| // test, but then it fails general parsing. For some reason, | ||
| // "__text__" translates "<i></i>text<i></i>" even though there | ||
| // are no word boundaries at the correct places. It's weird! |
There was a problem hiding this comment.
The strong parser is usually handled before the emphasized one: https://github.com/progsource/maddy/blob/master/include/maddy/parser.h#L195
This is so, that double _ or * can be easier determined as strong. Therefor I guess, that is why you only get italic tags, if you do not first run it through the strong parser.
|
Thank you for your contribution and sorry for the delay in me checking the PR. |
Description of the change
Don't replace internal underscores with emphasis tags. This was particularly important for URLs, which are often littered with them, and none of them should instead have
<em>in them; that's invalid HTML.The same change should be made for strong tags (like this) but the same fix there broke overall parsing. Stubs provided; maybe someone can figure out what went wrong.
Checklist for contributor
tools/format.py)