Is the link reference definition behavior under-specified?

balpha · June 12, 2015, 9:58am

The spec says:

“A link reference definition does not correspond to a structural element of a document.”

and that is all it currently has to say about the fact that link reference definitions (“LRDs”) are removed.

Looking at the behavior of the reference implementation, this is arguably not true, or at least imprecise.

Consider the following Markdown:

- 
x

which yields

<ul>
<li></li>
</ul>
<p>x</p>

– the x is outside the list.

However, if we add a LRD:

- [foo]: /foo.html
x

we get

<ul>
<li>x</li>
</ul>

with the x inside the list item, because it is now considered paragraph continuation text of the first list item line, which is no longer empty.

So while you could argue that the LRD is not a structural element, its presence in the Markdown source certainly changes the structure of the document (beyond the obvious question of whether or not a link may be created elsewhere in the document).

In fact it is close to impossible to say something like “after the LRDs are collected, the document is treated like they hadn’t been there in the first place”, because you can end up with weird cycles. Imagine this:

x
- [foo]: /foo.html

Because it’s at the beginning of a paragraph in a list item, the LRD is parsed and removed. If you now treat the document like the LRD hadn’t been there, you’re converting this:

x
-

– which is <h2>x</h2> because the dash is now a setext header underline*. However, if the dash is now interpreted as an underline, not a bullet, then the original LRD would no longer have been in a spot where a LRD is legal, and thus shouldn’t have been removed. But if the LRD hadn’t been removed, the dash would go back to being a bullet. GOTO START.

* This is not what the spec currently says, but it is what the reference implementation does, and it is likely that the spec will be changed accordingly. You can construct a similar example with a LRD and a horizontal rule using the current spec as-is.
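To make the cycle concrete, here is a toy Python sketch. It is not the CommonMark algorithm or the reference implementation: it only models the two-line documents above, and the names `classify_second_line`, `strip_lrd`, and the simplified `LRD` pattern are invented for illustration. It assumes, as the footnote describes for the reference implementation, that a lone dash under a paragraph line is read as a setext underline.

```python
import re

# Hypothetical, heavily simplified LRD pattern: "[label]: destination".
LRD = re.compile(r"\[[^\]]+\]:\s*\S+\s*")


def classify_second_line(lines):
    """Say how this toy model reads line 2 of a two-line document."""
    first, second = lines
    if first.strip() and second.strip() == "-":
        # Assumption taken from the post: a lone dash under text is an underline.
        return "setext underline"
    if second.startswith("- "):
        return "bullet"
    return "paragraph text"


def strip_lrd(lines):
    """Remove an LRD that opens the list item on line 2, if line 2 is a bullet."""
    first, second = lines
    if classify_second_line(lines) == "bullet" and LRD.fullmatch(second[2:]):
        return [first, "-"]
    return lines


original = ["x", "- [foo]: /foo.html"]
stripped = strip_lrd(original)

reading_before = classify_second_line(original)  # "bullet"
reading_after = classify_second_line(stripped)   # "setext underline"
```

The two readings disagree: the LRD is only legal under the "bullet" reading, but removing it and reparsing yields the "setext underline" reading, under which it would never have been removed. A rule of the form "parse as if the LRD had not been there" therefore has no stable answer for this input.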

So it seems inevitable that LRDs, even after being removed, have an influence on the document structure, and the reference implementation’s behavior makes some sense to me (at least I could not think of any highly problematic issues).

However, the spec should accurately describe the desired behavior, because this issue seems very prone to edge cases.

Any suggestions for how this could be phrased? Or any problems/inconsistencies with the current behavior that you can think of and that I’ve missed?

jgm · June 12, 2015, 5:13pm

This feels like a bug in the reference implementation.

The reference implementation takes a kind of short-cut in handling link references, piggybacking on paragraph parsing, but as your example shows, this can go wrong.

I’d say this should become an issue on the reference implementation’s tracker, rather than something that needs fixing in the spec.

balpha · June 15, 2015, 10:21am

Fair enough. Can I still suggest that the behavior be clarified in the spec? Upon rereading, I realized that the fact that LRDs are part of the “Leaf blocks” section hints at the desired behavior, but the phrasing

“does not correspond to a structural element of a document”

somewhat confused me. To me, conceptually, a LRD does correspond to a structural element of a document; it’s just an element that causes no output.

I would suggest adding these two examples to the spec (happy to open a PR if this is uncontroversial):

While a [link reference definition] does not directly cause visible
output, it would be incorrect to treat it like it wasn't present at
all. The following constitutes three nested list items, and not a
horizontal rule:

.
- - - [foo]: /url
.
<ul>
<li>
<ul>
<li>
<ul>
<li></li>
</ul>
</li>
</ul>
</li>
</ul>
.

A [link reference definition] is not part of a paragraph. So this
text is *not* [paragraph continuation text], and thus is not part
of the blockquote:

.
> [foo]: /url
text
.
<blockquote>
</blockquote>
<p>text</p>
.

Note that the reference implementation correctly handles the first, but it currently fails the second test.

balpha · June 15, 2015, 10:28am

Yes, and after looking at the code, I realized why you did it this way: if LRDs were parsed as blocks, the current model of deciding the current block type line-by-line would break down. Line 1 here starts a regular paragraph, not a LRD, but we cannot know this until line 6, when we see the “XXX”:

[
foo
bar
]:
/url "tit
le" XXX

My gut feeling is that there should be something like a _mightBeAReferenceDefinition flag on a paragraph node, and if that’s true, the parser has to bite the bullet and run a check (in continue) on every new line to see whether the content so far can constitute part of a legal reference definition.

This sounds pretty ugly; maybe you or others who know the parser much more intimately than I do have better ideas.

Should I move this to a GitHub issue?
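For illustration, the per-line check described above could look roughly like the following Python sketch. It is only a sketch under stated assumptions, not the reference implementation's code: `lrd_status`, `skip_ws`, and `statuses` are hypothetical names, and the grammar is deliberately simplified (no escapes, only double-quoted titles, no angle-bracket destinations, and the accumulated text is judged as a whole, so the real fallback where a definition without a title is kept and a failed title line becomes a paragraph is not modelled).

```python
def skip_ws(text, i):
    """Skip spaces, tabs and newlines; report whether at most one newline was seen."""
    newlines = 0
    while i < len(text) and text[i] in " \t\n":
        newlines += text[i] == "\n"
        i += 1
    return i, newlines <= 1


def lrd_status(text):
    """Return "yes", "maybe" or "no": is the text so far one (simplified) LRD?"""
    n = len(text)
    if n == 0:
        return "maybe"
    if text[0] != "[":
        return "no"
    close = text.find("]", 1)
    label = text[1:close] if close != -1 else text[1:]
    if "[" in label:
        return "no"
    if close == -1:
        return "maybe"              # label still open
    if not label.strip():
        return "no"                 # empty label
    i = close + 1
    if i == n:
        return "maybe"              # waiting for the colon
    if text[i] != ":":
        return "no"
    i, ok = skip_ws(text, i + 1)
    if not ok:
        return "no"                 # blank line before the destination
    if i == n:
        return "maybe"              # waiting for the destination
    while i < n and text[i] not in " \t\n":
        i += 1                      # consume the destination
    j, ok = skip_ws(text, i)
    if j == n:
        return "yes"                # complete without a title
    if not ok or text[j] != '"':
        return "no"
    end = text.find('"', j + 1)
    title = text[j:end] if end != -1 else text[j:]
    if "\n\n" in title:
        return "no"                 # titles may not contain a blank line
    if end == -1:
        return "maybe"              # title still open
    return "yes" if not text[end + 1:].strip() else "no"


def statuses(lines):
    """Status after each line has been appended to the paragraph so far."""
    return [lrd_status("\n".join(lines[:k + 1])) for k in range(len(lines))]


example = ["[", "foo", "bar", "]:", '/url "tit', 'le" XXX']
result = statuses(example)
```

On the six-line example, `result` stays at "maybe" for the first five lines and only turns to "no" on the sixth, which is the point made above: the block type cannot be decided until the trailing “XXX” has been seen.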

raph · June 15, 2015, 6:56pm

I’m not a huge fan of changing the impl. To me, there is a natural ordering: you determine the block boundaries, at which point you know where the link reference definitions are, and then you can do inline parsing. The proposed change would mix and interleave the stages. As I understand it, you’d have a first pass where you detect block boundaries with sufficient precision to identify link reference definitions but not distinguish list items from headers, then do the link reference definitions, then a second pass where list items and headers can be distinguished.

The example given is an edge case that is very unlikely to occur in real documents. I don’t think it’s worth complicating the implementation or the conceptual model of parsing, unless there’s a case that it really is more natural for humans, or where there is a compatibility concern.

balpha · June 15, 2015, 7:15pm

I am not worried about the edge cases much; if the decision is that it should (mostly) work as the reference implementation does right now, that’s fine with me.

What’s important to me is (hence the topic title) that it’s correctly specified, and that should include expected behavior in edge cases like this. If any of my examples cause weird results I’m okay with it as long as it’s the same weird result in every CommonMark implementation.

I’ve worked with edge case inconsistencies between Markdown versions for five years; the exciting thing about CM is that it wants to end these problems.

raph · June 15, 2015, 7:20pm

100% agreement. The spec should be precise. My point is that given a choice between changing the spec and changing the implementation, I lean toward the former. In fact, I think even if the implementation is changed, the spec will need additional wording to make clear that the parse is identical to the one that would result if the link reference definition had been removed from the original document. One (imho reasonable) way to interpret the existing spec language is that the link reference definition does not correspond to a node in the resulting AST.

jgm · June 15, 2015, 7:45pm

@balpha Yes, now I remember why I piggybacked on paragraph parsing!

I’m not sure about the solution. But something needs to be done, both in spec and in implementations.

Why don’t you add a GitHub issue, perhaps on the spec itself, calling for this to be clarified and linking here? That way it won’t be forgotten.