Replace double-hyphen ("-‎-") with a dash ("–") like in TeX?


#1

Hi,

I’m not sure if it’s in the context of this project, but I’d find replacing double-hyphen with a dash like TeX does very useful. I think that especially in blogs, which have some “literate” quality, hyphens should not be used in place of dashes, since they are very different characters – one combines and one separates. I find the TeX solution very convenient, and I think that it is true to the Markdown spirit, in that it keeps the ASCII source readable by humans.

It can be learned from http://en.wikipedia.org/wiki/Dash#Electronic_usage that there’s generally no easy way to enter a unicode dash. I think that making dashes easy would somewhat improve the quality of Markdown documents.

Cheers,
Noam


#2

It’s funny - Discourse already replaces double-hyphens in the title with dashes, but not in the contents. So I added an invisible LTR mark between the hyphens to prevent that…


#3

As a heavy em dash user, I would appreciate the syntax sugar. It’s not impossible to write &mdash;, but it makes the input ugly and unreadable at that point. I don’t think we need to support every HTML character entity—just common ones that have a natural ASCII-glyph equivalent.


#4

This would be great. Proper typography is very important.


#5

I like the idea of replacing -- with – (an en dash), but I also like the idea of replacing --- with — (an em dash).

FWIW these are actually very easy to type on Macs: Option+- for an en dash, Option+Shift+- for an em dash.


#6

I’m strongly in favor of replacing -- by an en dash and --- by an em dash.

However, the em dash syntax of --- will conflict with the horizontal rule syntax.


#7

[quote]I think that it is true to the Markdown spirit, in that it keeps the ASCII source readable by humans.
[/quote]

The spirit is to keep plain text source readable by humans. That would be some sort of unicode encoding nowadays (utf8 hopefully), and the dash character is in unicode so nothing prevents anyone from using it.

Don’t get me wrong, these kinds of tricks could be nice for an extension that lets old-style “pure ASCII” files look good, kind of like SmartyPants (gosh, that guy is a genius). But for basic usage, I feel it is a bit of overkill.


#8

But the source is not very readable when -, –, and — look practically identical in monospaced fonts.

[quote]That would be some sort of unicode encoding nowadays (utf8 hopefully), and the dash character is in unicode so nothing prevents anyone from using it.
[/quote]
Except not knowing the magic key presses. Each OS has a different sequence, and on top of that, Linux’s is not enabled by default on many distributions, and Windows’ is hard to remember. The fallback solution is copying and pasting, which is a pain.


#9

[quote]But the source is not very readable when -, –, and — look practically identical in monospaced fonts.
[/quote]

You have a point there.


#10

The difficulty of typing dashes is also a valid point, I think. For instance, many laptops lack a number pad, making the Windows way quite difficult.


#11

I first thought, good catch! But the horizontal rule syntax is block-level. It will not conflict if the en dash or em dash rule is applied as an inline element. E.g.

Word1 --- word 2

The three hyphens above are not interpreted as a horizontal rule.
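
To make the block-versus-inline distinction concrete, here is a minimal sketch in Python. It is not taken from any existing implementation: the names `HRULE` and `replace_dashes` are made up for illustration, and the horizontal rule check is a simplification of the real rule.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"

# Simplified assumption: a line made only of three or more hyphens
# (optionally separated by spaces) is a block-level horizontal rule.
HRULE = re.compile(r"^ {0,3}(?:- *){3,}$")


def replace_dashes(line: str) -> str:
    """Hypothetical inline pass: --- becomes an em dash, -- an en dash."""
    if HRULE.match(line):
        # Block-level construct: leave the hyphens untouched.
        return line
    # Replace the longer run first so --- is not read as -- followed by -.
    return line.replace("---", EM_DASH).replace("--", EN_DASH)


print(replace_dashes("Word1 --- word 2"))  # hyphens inside a paragraph line
print(replace_dashes("---"))               # a line of hyphens on its own
```

Only the first call changes its input; the line that consists solely of hyphens is returned as-is, so the two syntaxes do not collide in this sketch.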


#12

I guess the only difficulty there is if someone wanted to put an em dash on a line by itself, but I imagine that’s probably rare.


#13

One of my use cases for commonmark is to output manpages so please do not do that…

\--help for --help might not be exactly great.


#14

I think that you should use backticks, that is:

`--help`

As it’s not text, it’s a type of code.
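
A rough sketch of how a processor could leave code spans alone while still replacing dashes elsewhere. This is an assumption about one possible approach, not how any particular implementation works: `CODE_SPAN` and `replace_outside_code` are hypothetical names, and only single-backtick spans are handled.

```python
import re

# Simplification: only spans delimited by single backticks are recognised.
CODE_SPAN = re.compile(r"(`[^`]*`)")


def replace_outside_code(text: str) -> str:
    """Replace --- and -- with dashes, except inside `code spans`."""
    parts = CODE_SPAN.split(text)
    # With one capture group, re.split keeps the code spans at odd indices,
    # so only the even-indexed pieces are ordinary text.
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace("---", "\u2014").replace("--", "\u2013")
    return "".join(parts)


print(replace_outside_code("Run `--help` -- it lists the options"))
```

The `--help` inside backticks stays as typed, while the double hyphen in the running text is replaced.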


#15

It makes sense indeed. Objection retracted =)


#16

I think these kinds of substitutions are good if offered in the editor with immediate feedback (offering as “completion” when you type ---, or auto-correcting it and letting you Ctrl+Z if undesired) but shouldn’t be done in the markdown processor.
(Side-by-side preview, while technically “immediate feedback”, is not good IMO because you still have --- in the source — see bottom paragraph for rationale.)

  • It unnecessarily complicates the spec. It is possible to type unicode, and &mdash; works in a pinch. By complicates I don’t mean implementation but cognitive load for the user: “you get exactly the characters you typed” is the simplest possible rule.

  • It takes you by surprise. The internet is full of blogs talking about unix command lines with --option="foo bar" converted to —option=“foo bar”, so you can’t copy-paste them [1]. The authors should have known better and escaped them but didn’t. I argue that doing it when you click Post is too late.

  • Not everyone wants it (e.g. if you can easily type dashes on your keyboard), and not everyone wants the same replacements (I think smart quotes are culture-dependent).

  • Different implementations (or even versions) will have different behavior. Exhibit A:

    --old-dashes
      Selects the pandoc <= 1.8.2.1 behavior for parsing smart dashes: - before a numeral is an en-dash, and -- is an em-dash. This option is selected automatically for textile input.
    
  • Compatibility issue: Most existing markdown implementations don’t do these substitutions (at least by default), so if CommonMark required such substitutions, it wouldn’t reduce fragmentation, only increase it.

  • The result of all this is that a user can’t just go
    It's markdown --- I know this
    and start typing -- knowing whether it’ll result in a dash or a literal --

So I think it’s inevitable that some platforms will do typographical substitutions (with varying rules), and others won’t. It’s better to carry out this experimentation in the editor because it doesn’t create new dialects — the resulting markdown sources would be portable.
That’s why side-by-side preview of conversion is not good enough.


#17

Given that we now live in a unicode world, the fact that n- and m- dashes are difficult to type on some input platforms is not IMO a problem the markup language should solve.

Users that care about these issues will solve their problem in an appropriate way, either via key-binding tools or app selection.

Or implementation extensions will handle it, as pandoc does.


#18

Not sure if this has been sorted out in the spec (it doesn’t appear that way, though)…

According to The Chicago Manual of Style, em dashes do not have spaces on either side (e.g., My brothers—whom I really don’t care for—are butt heads.). Also, you can have 2-em and 3-em dashes; 2-em (——) to denote missing words, and 3-em (———) for use in bibliographies.

Then there’s the issue of supporting en-dashes, too.

My workaround for these is to set up TextExpander with shortcuts to drop in — (em-dash) and – (en-dash) as I’m moving along.


#19

Nothing in the spec, but cmark and commonmark.js both have an option for “smart typography.” This replaces -- with en-dash, --- with em-dash, and makes straight quotes curly in an intelligent way. (You can try it on http://spec.commonmark.org/dingus/ if you check “smart punctuation.”)

I had originally conceived of this as a rendering option, and not something that really needed to be in the spec. But it turns out that in order to do smart punctuation properly, you really need to do it in the parsing phase. Main reason: you don’t want an escaped \" to produce a curly quote, but once the parsing phase has passed, the AST just contains a " character, and the information that it was escaped has been thrown away. Subsidiary reason: to do it right, you have to match up start and end quotes, and this is a parsing job.
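
To illustrate the escaping point, here is a toy sketch in Python. It is not the algorithm cmark or commonmark.js use: `smart_double_quotes` is a made-up name, and the open/close decision simply alternates, which a real implementation would not do. What it shows is only that the decision has to be made while the backslash is still visible in the source.

```python
def smart_double_quotes(source: str) -> str:
    """Toy scanner: curl unescaped double quotes, keep escaped ones straight."""
    out = []
    open_quote = True  # naive assumption: quotes strictly alternate open/close
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source) and source[i + 1] == '"':
            # Escaped quote: emit a literal straight quote and drop the backslash.
            out.append('"')
            i += 2
            continue
        if ch == '"':
            out.append("\u201c" if open_quote else "\u201d")
            open_quote = not open_quote
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(smart_double_quotes('say "hi" and \\"bye\\"'))
```

Once the scanner has dropped the backslash, the output holds a plain " that looks the same as any unescaped one, so a later rendering pass has no way to know it should stay straight.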

I should write out the parsing rules for this, at least as a supplement to the spec.


#20

This would essentially rule out proposals to add --del-- unless -- were not converted to – when adjacent to a letter on either side. For what it’s worth, ~~del~~ is more common anyway, but doesn’t match ++ins++ as nicely.
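
A minimal sketch of that "not adjacent to a letter" condition, using regex lookarounds in Python. The pattern and the function name are hypothetical, and only ASCII letters are considered.

```python
import re

# Match -- only when neither neighbour is an ASCII letter or another hyphen.
LONE_DOUBLE_HYPHEN = re.compile(r"(?<![A-Za-z-])--(?![A-Za-z-])")


def en_dash_unless_adjacent_to_letter(text: str) -> str:
    """Replace -- with an en dash unless it touches a letter or a hyphen."""
    return LONE_DOUBLE_HYPHEN.sub("\u2013", text)


print(en_dash_unless_adjacent_to_letter("--del-- stays, but 3--5 and a -- b change"))
```

The trade-off is that TeX-style input such as Paris--London would no longer be converted, since both hyphens touch letters.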