Clarify "correct unicodes" for numeric entities?

vitaly · October 2, 2014, 7:52am

The current spec says that any valid Unicode code point is allowed in numeric entities (like this: &#3777;). Do you think it’s a good idea to allow control codes like 0x00-0x1F?

In JS I have no access to libraries like punycode, so I wrote this naive, paranoid check:

function isValidEntityCode(c) {
  // surrogate halves: never valid on their own
  if (c >= 0xD800 && c <= 0xDFFF) { return false; }
  // noncharacters: never assigned
  if (c >= 0xFDD0 && c <= 0xFDEF) { return false; }
  if ((c & 0xFFFF) === 0xFFFF || (c & 0xFFFF) === 0xFFFE) { return false; }
  // C0 control codes
  if (c <= 0x1F) { return false; }
  // DEL and C1 control codes
  if (c >= 0x7F && c <= 0x9F) { return false; }
  // out of Unicode range
  if (c > 0x10FFFF) { return false; }
  return true;
}

Do you think this is OK, or should it be relaxed?
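
To make the question concrete, here is a rough sketch of how such a check might be wired into entity decoding. Everything in it is an assumption for illustration: `isAllowedCodePoint` is a condensed stand-in for the check above, `decodeNumericEntities` is a hypothetical helper, and replacing rejected code points with U+FFFD is just one possible policy, not something the spec text quoted here requires.

```javascript
// Hypothetical, condensed stand-in for the validity check discussed above.
function isAllowedCodePoint(c) {
  if (c > 0x10FFFF) { return false; }                // out of Unicode range
  if (c >= 0xD800 && c <= 0xDFFF) { return false; }  // surrogate halves
  if (c >= 0xFDD0 && c <= 0xFDEF) { return false; }  // noncharacters
  if ((c & 0xFFFE) === 0xFFFE) { return false; }     // U+xxFFFE and U+xxFFFF
  if (c <= 0x1F) { return false; }                   // C0 control codes
  if (c >= 0x7F && c <= 0x9F) { return false; }      // DEL and C1 control codes
  return true;
}

// Decode decimal (&#65;) and hex (&#x41;) entities.
// Assumption: rejected code points are replaced with U+FFFD.
function decodeNumericEntities(text) {
  var pattern = /&#(?:[xX]([0-9a-fA-F]{1,8})|([0-9]{1,8}));/g;
  return text.replace(pattern, function (match, hex, dec) {
    var code = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    return isAllowedCodePoint(code) ? String.fromCodePoint(code) : '\uFFFD';
  });
}
```

With this policy, `decodeNumericEntities('&#x41;')` gives `A`, while `&#0;`, `&#9;` and `&#xD800;` all come out as U+FFFD. Relaxing the check would only mean dropping or narrowing individual lines in `isAllowedCodePoint`.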

jgm · October 2, 2014, 5:15pm

Can you say more about why we might want to restrict these?

vitaly · October 2, 2014, 6:01pm

I’m not 100% sure. It’s just a paranoid desire to protect the output from being broken, for example when someone copy-pastes output from a browser window into a console. In YAML these characters are completely prohibited: /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F\uD800-\uDFFF\uFFFE\uFFFF]/
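
As an illustration only, the YAML pattern above could be used to scan rendered output for characters that might break on copy-paste. The names `YAML_FORBIDDEN` and `hasForbiddenChars` are made up for this sketch, and the `u` flag is my addition: without it, every character outside the BMP is stored as a surrogate pair in a JS string and would match the `\uD800-\uDFFF` range.

```javascript
// Pattern copied from the post. The `u` flag is an addition for this sketch:
// it makes the regex work on code points, so only lone surrogates match.
var YAML_FORBIDDEN =
  /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F\uD800-\uDFFF\uFFFE\uFFFF]/u;

// Hypothetical helper: true if the string contains any prohibited character.
function hasForbiddenChars(output) {
  return YAML_FORBIDDEN.test(output);
}
```

Note that this set is narrower than the entity check: tab (0x09), LF (0x0A), CR (0x0D) and NEL (0x85) pass here, so `hasForbiddenChars('a\tb\n')` is `false` while `hasForbiddenChars('a\x00b')` is `true`.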

0x00 is a special case. Parser authors can use it as an end-of-string marker to simplify bounds checks.
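
A minimal sketch of what I mean by the sentinel trick, assuming the parser appends a NUL to its input before scanning (`skipToLineEnd` is a hypothetical name, not taken from any real parser):

```javascript
// Hypothetical scanner. The caller is assumed to have appended '\u0000'
// to the input, so the loop needs no `pos < src.length` test.
function skipToLineEnd(src, pos) {
  var ch = src.charCodeAt(pos);
  while (ch !== 0x0A && ch !== 0x00) {
    pos++;
    ch = src.charCodeAt(pos);
  }
  return pos; // index of the newline or of the NUL sentinel
}

var input = 'first line\nsecond' + '\u0000';
var firstEnd = skipToLineEnd(input, 0);   // stops at the newline
var secondEnd = skipToLineEnd(input, 11); // stops at the sentinel
```

The catch is that any real U+0000 inside the scanned buffer looks exactly like the end of input, so a loop like this would stop early.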

The opposite question: can you say why we might want to allow these? My knowledge is limited to web usage only.