Contradiction/confusion about CM vs. Gruber line breaks

The spec currently uses hard/soft line break terminology in the reverse of Gruber’s use. First, let me note this paragraph from Gruber’s syntax page:

The implication of the “one or more consecutive lines of text” rule is that Markdown supports “hard-wrapped” text paragraphs. This differs significantly from most other text-to-HTML formatters (including Movable Type’s “Convert Line Breaks” option) which translate every line break character in a paragraph into a <br /> tag.

That paragraph has confused me for years. He’s saying that Markdown supports “hard-wrapped” text by not obeying the line breaks the writer inserts (which would require a <br> tag in the HTML output.) It always seemed like someone was playing a game of saying the exact opposite of what they meant.

I now realize that he was time-traveling. There is a temporal conflation in how a lot of people in the Markdown community have been talking about line breaks. Time 1 is the time of writing. Time 2 is the rendering of the output (for reading). When Gruber talks about hard-wrapped text, he means the writer hitting the Enter key and “hard-wrapping” the text for their own fleeting purposes. He means Markdown supports hard-wrapping at Time 1, while obliterating it in the rendered output at Time 2.

This temporal conflation and ambiguity came up in this discussion on this forum. There’s enormous confusion about what hard and soft line breaks mean.

The spec, section 6.9, reads:

A line break (not in a code span or HTML tag) that is preceded by two or more spaces and does not occur at the end of a block is parsed as a hard line break (rendered in HTML as a <br /> tag)

Gruber spoke of hard-wrapped text as that which would not generate a <br> tag. Here we are using “hard line break” to mean an entity that does yield a <br> tag in the HTML. He meant a hard line wrap for the author to look at in their source text (T1), while we mean a hard line break in the output (T2).
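
To make the T1/T2 distinction concrete, here is a rough Python sketch of the rule quoted from section 6.9. It is not the reference implementation: the function name `classify_breaks` is made up for illustration, and it deliberately ignores code spans, HTML tags, and backslash breaks.

```python
def classify_breaks(paragraph: str) -> list[tuple[str, str]]:
    """Label the break that follows each line of a single paragraph.

    Simplified reading of the two-space rule only; code spans, HTML tags,
    and backslash breaks are ignored (an assumption of this sketch).
    """
    lines = paragraph.split("\n")
    result = []
    for index, line in enumerate(lines):
        text = line.rstrip(" ")
        trailing_spaces = len(line) - len(text)
        if index == len(lines) - 1:
            # A break at the end of the block is never a hard line break.
            kind = "end of block"
        elif trailing_spaces >= 2:
            kind = "hard"  # would become <br /> in HTML (T2)
        else:
            kind = "soft"  # the writer pressed Enter (T1), no <br />
        result.append((text, kind))
    return result


source = "Roses are red,  \nviolets are blue,\nand so on.  "
for text, kind in classify_breaks(source):
    print(f"{kind:>12}: {text}")
```

Every newline in `source` is a T1 break typed by the writer; only the one preceded by two spaces survives into T2.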

(I assume wrap and break can be taken as equivalent here.)

Should we strive to match his usage? At the least I think we should clarify the difference between writer-pressing-Enter line breaks vs. breaks in the rendered output.

I also suggest making the spec’s sentences here simpler and more direct by saying “A soft/hard line break is…” as opposed to the long-winded passive voice form where the subject comes at the end of the sentence (as in the quote from section 6.9 above.)

Afterword: From the long discussion/debate here about what the default line-breaking behavior should be (linked above), I think lots of people will be confused by the paleo-Markdown norm of obliterating the user’s line breaks (as learned by GitHub.) Many Markdown norms seem driven by a distinct subculture that may be waning. For example, I have no idea why I would hard-wrap text in an editor if I didn’t want it to be hard wrapped in the end – editors have automatic soft-wrapping, so why hard-wrap? There’s something about the Unix-like OS experience that I missed. Also, I submit that the frequent references saying that Markdown is just like e-mail format probably make sense to far fewer people in 2015/16 than they did in 2004. I’ve never written e-mails in this rumored e-mail format – Gmail, Outlook, Thunderbird et al. have had rich text for decades. Luckily, we only have one “like e-mail” reference in the spec – I’m just suggesting that we not count on such references moving forward.

Joe,

Take a look at the reasoning behind what others have called “semantic line breaks.” They are easier to read and great for Git and other version control. Composition and editing are one thing (keep line breaks). Presentation another matter entirely, which is the whole idea behind writing with a text editor.

http://rhodesmill.org/brandon/2012/one-sentence-per-line/

https://news.ycombinator.com/item?id=4642395
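
A small Python sketch of the version-control point, using only the standard library. The sample sentences and the helper `changed_lines` are invented for illustration; the idea is just to compare how many source lines a line-based diff touches when one sentence is edited, first with one sentence per line and then with the same text re-wrapped at a fixed width.

```python
import difflib
import textwrap


def changed_lines(before: str, after: str) -> int:
    """Count the source lines a line-based diff reports as removed or added."""
    diff = difflib.ndiff(before.split("\n"), after.split("\n"))
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))


old = [
    "Markdown lets authors break lines wherever they like.",
    "The renderer joins those lines back into one paragraph.",
    "Version control, however, compares the source line by line.",
]
# Edit only the first sentence.
new = ["Markdown lets authors break their source lines wherever they like."] + old[1:]

# One sentence per line: the edit stays on its own line.
semantic_count = changed_lines("\n".join(old), "\n".join(new))

# Same text hard-wrapped at 40 columns: the edit can re-flow later lines.
wrapped_count = changed_lines(
    textwrap.fill(" ".join(old), width=40),
    textwrap.fill(" ".join(new), width=40),
)

print("one sentence per line:", semantic_count, "lines touched")
print("re-wrapped at 40 cols:", wrapped_count, "lines touched")
```

Both versions render to the same paragraph, so the difference only shows up in the source history.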

Rick

Seems pretty clear to me.

If there’s really a lot of confusion over that statement, maybe we should just change

A line break (not in a code span or HTML tag) that is preceded by two or more spaces and does not occur at the end of a block is parsed as a hard line break (rendered in HTML as a <br /> tag)

to

A line break (not in a code span or HTML tag) that is preceded by two or more spaces and does not occur at the end of a block is rendered as a hard line break (rendered in HTML as a <br /> tag)

I’m in the camp that line breaks in the source should not translate to line breaks in the presentation unless specified explicitly (two spaces, backslash).

Miguel de Icaza’s style guide points out why you might wrap your text.

I agree that the language is a little confusing.

Hard-wrapped text usually refers to text that has been automatically wrapped to fit a certain line width. It’s quite common in the Linux world. So when Gruber says

…Markdown supports “hard-wrapped” text paragraphs.

what he means is “Markdown will turn your hard-wrapped paragraphs into nice-looking HTML paragraphs by ignoring the line breaks that your text editor automatically added.” <br>s in HTML generally look awful and should be used extremely sparingly in web typography (they look especially bad when their positions were decided by a terminal text editor working with a monospaced font). That’s why CM requires you to explicitly specify each line break that you want preserved in HTML.

If you’re writing text where every line break is intentionally manually specified, then either

  1. You’re writing a lot of poetry, in which case I don’t think the default behavior should be changed for your niche usage
  2. You actually want those line breaks to be new paragraphs, in which case you should be using double newlines
  3. You actually want to specify the line break for every line of text in a normal paragraph, blatantly ignoring the fact that the appearance of the line breaks also depends on the browser’s window size and any number of CSS rules which are beyond Markdown’s control and you’re inserting cosmetic information into HTML which is awful to maintain, in which case you can spend the extra time to insert manual line breaks until you realize you should stop doing that.

Me, too, but I always wondered why Gruber – contrary to saying that Markdown was greatly inspired by and built upon conventions established in plain-text email – invented the double space at the end which is incompatible with format=flowed (RFC 3676).
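
A hedged Python sketch of where the two conventions collide, as I read them. Both functions are simplifications with made-up names: the format=flowed check ignores quoting and space-stuffing and only special-cases the signature separator, and the Markdown check covers only the trailing-spaces rule.

```python
def markdown_break(line: str) -> str:
    """Markdown rule: two or more trailing spaces request a hard break."""
    return "hard" if line.endswith("  ") else "soft"


def flowed_break(line: str) -> str:
    """RFC 3676 rule, simplified: a trailing space marks a flowed line.

    A flowed line is joined with the next one; a fixed line ends there.
    Quote marks and space-stuffing are ignored in this sketch.
    """
    if line == "-- ":  # the signature separator is not treated as flowed
        return "fixed"
    return "flowed" if line.endswith(" ") else "fixed"


for line in ["Roses are red,  ", "Roses are red,"]:
    print(repr(line), "| Markdown:", markdown_break(line),
          "| format=flowed:", flowed_break(line))
```

Under these simplified rules the same trailing whitespace asks Markdown to keep the break and asks a format=flowed reader to remove it.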

I think we could get rid of the potentially confusing word “hard.” In the AST in the C and JS reference implementations, and in CommonMark.dtd, we use the terminology “line break” and “soft break.” A “line break” is supposed to correspond to a line break in the rendered output (e.g. a <br> tag in HTML). A “soft break” just indicates a place where the author had a newline in the source; it can be rendered either as a space or as a newline in HTML. In principle we could have just parsed these “soft breaks” as spaces, but it is nice to be able to reproduce the author’s pattern of line breaks in Markdown in the HTML output, if desired.
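
As a rough illustration of the two node types, here is a toy Python renderer. The tuple-based AST and the `render_paragraph` function are invented for this sketch and are not the C or JS implementation; the `softbreak` parameter stands in for the choice between emitting a space or a newline.

```python
from html import escape

# Hypothetical mini-AST: a paragraph is a list of (node_type, text) pairs.
paragraph = [
    ("text", "Roses are red,"),
    ("linebreak", ""),   # two trailing spaces in the source
    ("text", "violets are blue,"),
    ("softbreak", ""),   # plain newline in the source
    ("text", "and so on."),
]


def render_paragraph(nodes, softbreak="\n"):
    """Render the toy AST to HTML; `softbreak` is what a soft break becomes."""
    parts = []
    for node_type, text in nodes:
        if node_type == "text":
            parts.append(escape(text))
        elif node_type == "linebreak":
            parts.append("<br />\n")
        elif node_type == "softbreak":
            parts.append(softbreak)
        else:
            raise ValueError(f"unknown node type: {node_type}")
    return "<p>" + "".join(parts) + "</p>"


print(render_paragraph(paragraph))                 # keeps the author's newline
print(render_paragraph(paragraph, softbreak=" "))  # collapses it to a space
```

A browser displays both outputs the same way, since a newline in HTML source behaves like a space in most contexts; only the `<br />` forces a visible break.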

I recognize that the use of “soft” above is confusing for the same reason that “hard” was. Open to other terminological suggestions.

Hi all,

First, my preferred line-breaking behavior—honoring the user-entered breaks—is irrelevant as far as the spec is concerned, since CommonMark’s remit is to standardize Markdown, and Markdown’s line break behavior is already defined. So you can treat my comment on that issue as an excursion.

Rick and Mike, I guess I don’t get it. I don’t see the use case. de Icaza is talking about reading prose in a terminal. To me, it looks awful. He’s got Mark Twain in a terminal with a monospaced font, and it’s blurry and ugly, with no anti-aliasing or subpixel optimization, none of the modern font stack. I’d hate to read text like that, but to each his own.

Ajedi, your edit would be an improvement. I wouldn’t use “render” though, since CommonMark is not a renderer. I’d say something like “is encoded as a hard line break.” What’s more, after reading the Unicode spec, I wonder if we should be calling this a paragraph break. That seems to be what a “hard line break” represents, semantically, in many discussions and specs.

Mightymax, I’m confused. You’re saying hard-wrapped text is text that is automatically wrapped to fit a certain line width? I understand that to be the definition of soft-wrapped text. See the Wikipedia description here.

John, I don’t know what a “newline in HTML” would mean, regarding your sentence on soft breaks. HTML doesn’t have a new line character apart from the break <br> tag, which is your hard line break.

Instead of my notion of temporal conflation, where we’re mixing up a line break at time of writing (writer presses Enter to hard-wrap text for his own reasons) vs. a line break in rendered output, we might want to clarify the agents.

From the Wikipedia page, it seems like soft-breaking is always software induced, automatic, bound to the viewport of a text editor or what have you. Hard line breaks are understood as human decisions, like pressing Enter, and are commonly treated as paragraph breaks. HTML muddies the waters because it ignores human-induced line breaks of the normal sort, the newlines in the source.

John, if you want to preserve the newlines in the source while ignoring them for HTML purposes, I just discovered the Unicode Line Separator character: U+2028. It sounds very semantically useful. It’s on page 211 of the Unicode 8.0 standard (PDF, and a big one). I assume HTML renderers would ignore them. Well I suppose you could just keep the source line breaks in the HTML file. Renderers would ignore them, but they’d be there if anyone needed them.

Joe D.

This isn’t the place to rehash this. But remember, one of the central goals of Markdown was to have source that is readable. Hard-wrapping lines aids readability. Yes, if you have your editor set up right, it can soft-wrap the lines for you. But hard-wrapped text is readable without any special equipment. [EDIT: Diffability is also a very important feature.]

No, there’s an important semantic difference between what we’re calling a hard line break (represented by <br> in HTML) and a paragraph division (represented by <p>).

As you observe yourself, HTML source can contain newline characters. Semantically they behave like spaces in most contexts. cmark’s current default behavior is to put a newline in the generated HTML source where a newline occurs in the CommonMark input source. Keeping track of (what we’ve been calling) softbreaks allows that. Of course, it’s not important semantically.

Source readability and diffability. Amen.
