How to extract first heading, paragraph and image?

matmuchrapna · April 9, 2015, 3:33pm

Hey, I’m not good with AST structures, but I want to try it out. Can you give me some tips on how to walk the AST to find the first node of a given type (e.g. heading, paragraph and image)?

jgm · April 9, 2015, 4:32pm

Are you using cmark or commonmark.js?

matmuchrapna · April 9, 2015, 6:40pm

I’m using commonmark.js https://github.com/jgm/commonmark.js

jgm · April 9, 2015, 8:41pm

Do the samples here help?

Also see the README.md for commonmark.js.

matmuchrapna · April 10, 2015, 1:24am

@jgm, thanks, you are awesome.

I resolved my issue with the image like this:

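// walker comes from calling .walker() on the parsed document
var event, node, firstImage;
while ((event = walker.next())) {
  node = event.node;
  // remember only the first Image node encountered
  if (event.entering && node.type === 'Image' && !firstImage) {
    firstImage = {
      title: node.title,
      destination: node.destination
    };
  }
}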

I’ll do similar processing for the heading and paragraph. Thanks!
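The same loop can pick up the first heading, paragraph and image in a single pass. The following is only a sketch: `findFirstNodes` and `fakeWalker` are hypothetical helpers, not part of commonmark.js, and the type strings ('Header', 'Paragraph', 'Image') are the ones used in this thread, so check which strings your version of the library emits. The fake walker exists only so the example runs without a parser; with the real library you would pass the result of `.walker()` instead.

```javascript
// Collect the first node of each requested type in one pass.
// `walker` is any object whose next() returns {entering, node} or null.
function findFirstNodes(walker, types) {
  var found = {};
  var remaining = types.length;
  var event, type;
  while (remaining > 0 && (event = walker.next())) {
    type = event.node.type;
    if (event.entering &&
        types.indexOf(type) !== -1 &&
        !Object.prototype.hasOwnProperty.call(found, type)) {
      found[type] = event.node;
      remaining -= 1; // stop early once every type has been seen
    }
  }
  return found;
}

// Hypothetical stand-in for a real walker: replays a fixed list of events.
function fakeWalker(events) {
  var i = 0;
  return {
    next: function () {
      return i < events.length ? events[i++] : null;
    }
  };
}

// Hand-built nodes that mimic only the fields used above.
var heading = { type: 'Header', level: 1 };
var para = { type: 'Paragraph' };
var image = { type: 'Image', destination: 'a.png', title: '' };
var laterImage = { type: 'Image', destination: 'b.png', title: '' };

var firstNodes = findFirstNodes(fakeWalker([
  { entering: true, node: heading },
  { entering: false, node: heading },
  { entering: true, node: para },
  { entering: true, node: image },
  { entering: false, node: image },
  { entering: false, node: para },
  { entering: true, node: laterImage }
]), ['Header', 'Paragraph', 'Image']);

console.log(firstNodes.Image.destination);
```

Returning the nodes themselves, rather than copies of selected fields, leaves it to the caller to decide which properties (title, destination, level) matter.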

matmuchrapna · April 22, 2015, 1:27am

I failed at heading extraction; I cannot wrap my head around .walker().

I have a practical use case with non-working code in a gist. Can you help me?

For example, I have this Markdown content:

var content = '\
# heading `code` \
\
_date_\
';

I want to get the plain text of the h1 to put into “html>head>title”, so I wrote these helper functions:

function md2AST(content) {
  var commonmark = require('commonmark');
  var reader = new commonmark.Parser();
  return reader.parse(content);
}
 
// get first "h1" node from AST-tree
function getTitleNode(content) {
  var walker = md2AST(content).walker();
  var event, node;
  while (event = walker.next()) {
    node = event.node;
    if (event.entering && node.type === 'Header' && node.level === 1) {
      return node;
    }
  }
}
 
// get plain text from AST-node
function astNode2text(astNode) {
  var walker = astNode.walker();
  var acc = [];
  var event, node;
  while (event = walker.next()) {
    node = event.node;
    if (node.literal) {
      acc.push(node.literal);
    }
  }
 
  return acc.join(' ');
}

Finally, I can test my solution:

var result = astNode2text(getTitleNode(content));
 
// expectations: "heading code"
// reality: "heading  code   date"
console.log(result);

But it fails. What am I doing wrong?

jgm · April 22, 2015, 4:32am

One problem is that

var content = '\
# heading `code` \
\
_date_\
';

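is equivalent to # heading `code` _date_: a backslash at the end of a line inside a JavaScript string literal continues the line without inserting a newline. So there’s no newline before _date_, and it’s part of the heading.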
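To see the difference concretely, here is a small sketch comparing the backslash-continued literal with one that spells out its line breaks. The variable names are made up for illustration, and joining an array of lines is just one of several ways to build a multi-line string.

```javascript
// A backslash directly before a line break continues the string literal;
// neither the backslash nor the line break ends up in the value.
var continued = '\
# heading `code` \
\
_date_\
';

// Joining an array of lines with '\n' keeps every line break explicit.
var explicit = [
  '# heading `code`',
  '',
  '_date_'
].join('\n');

console.log(JSON.stringify(continued)); // "# heading `code` _date_"
console.log(JSON.stringify(explicit));  // "# heading `code`\n\n_date_"
```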

But there’s actually a bug in the NodeWalker code, so that even if you fix this, there’s still a problem. (The problem is that the walker didn’t stop when it got back to the node it was started from, but kept going up to the document root.) With the fix, which I’ll push soon, and with the corrected content

var content = "# header `code`\n_date_";

it will work.
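For readers who want to picture the walker bug, here is a simplified sketch of a walker that stops once it has finished the node it was started from. It is not the commonmark.js source: the node shape (`parent`, `next`, `firstChild`, `lastChild`) and the helpers `makeNode`, `appendChild` and `subtreeWalker` are all assumptions made for illustration.

```javascript
// Hypothetical node shape, loosely modelled on a linked tree.
function makeNode(type, literal) {
  return { type: type, literal: literal || null,
           parent: null, next: null, firstChild: null, lastChild: null };
}

function appendChild(parent, child) {
  child.parent = parent;
  if (parent.lastChild) {
    parent.lastChild.next = child;
  } else {
    parent.firstChild = child;
  }
  parent.lastChild = child;
  return child;
}

// Walk only the subtree under `root`. Nodes with children produce an
// entering and an exiting event; childless nodes produce one entering event.
function subtreeWalker(root) {
  var current = root;
  var entering = true;
  return {
    next: function () {
      var node = current;
      var wasEntering = entering;
      if (node === null) { return null; }
      if (wasEntering && node.firstChild) {
        current = node.firstChild;       // descend into the first child
      } else if (node === root) {
        current = null;                  // done with the start node: stop here
      } else if (node.next) {
        current = node.next;             // move on to the next sibling
        entering = true;
      } else {
        current = node.parent;           // last child: go back up and exit
        entering = false;
      }
      return { entering: wasEntering, node: node };
    }
  };
}

// Document with a heading followed by a paragraph.
var doc = makeNode('Document');
var h1 = appendChild(doc, makeNode('Header'));
appendChild(h1, makeNode('Text', 'heading '));
appendChild(h1, makeNode('Code', 'code'));
var p = appendChild(doc, makeNode('Paragraph'));
appendChild(p, makeNode('Text', 'date'));

var seen = [];
var headingText = '';
var w = subtreeWalker(h1);
var ev;
while ((ev = w.next())) {
  seen.push(ev.node.type);
  if (ev.node.literal) { headingText += ev.node.literal; }
}
console.log(seen.join(','));  // Header,Text,Code,Header
console.log(headingText);     // heading code
```

Without the `node === root` branch, this walker would leave the heading, continue into the paragraph, and pick up its text as well, which matches the symptom described above.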

matmuchrapna · April 22, 2015, 4:24pm

Indeed, I tested on real Markdown files and the walker had that issue. Thanks for publishing the new version so fast!

matmuchrapna · April 22, 2015, 4:26pm

// get plain text from AST-node
function astNode2text(astNode) {
  var walker = astNode.walker();
  var acc = [];
  var event, node;
  while (event = walker.next()) {
    node = event.node;
    if (node.literal) {
      acc.push(node.literal);
    }
  }
 
  return acc.join(' ');
}

btw, @jgm, can you review this astNode2text function? Maybe I missed something and there is a better way to achieve the goal?

jgm · April 22, 2015, 8:30pm

Your version will insert extra spaces where they aren’t wanted. (Note that spaces are already part of the literal content of Text nodes.)

Also, I’ve read that in most js engines it’s faster just to concatenate strings directly, rather than using this kind of string buffer approach.

So, this might be better:

// get plain text from AST-node
function astNode2text(astNode) {
  var walker = astNode.walker();
  var acc = "";
  var event, node;
  while (event = walker.next()) {
    node = event.node;
    if (node.literal) {
      acc += node.literal;
    }
  }

  return acc;
}

You should probably also check for LineBreak and SoftBreak nodes, which have no literal content. You’ll probably want to add a newline or space to acc in these cases.
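A possible way to handle the break nodes is sketched below. The exact type strings depend on the library version, so the check lowercases the type and accepts a few spellings ('softbreak', 'linebreak', 'hardbreak'); treat those spellings as assumptions to verify against your version. `textWithBreaks` and `replay` are hypothetical names, and `replay` only stands in for a real walker so the example runs on its own.

```javascript
// Concatenate literal text, adding a space for soft breaks and a
// newline for hard line breaks.
// `walker` is any object whose next() returns {entering, node} or null.
function textWithBreaks(walker) {
  var acc = '';
  var event, node, type;
  while ((event = walker.next())) {
    if (!event.entering) { continue; } // count each node only once
    node = event.node;
    type = String(node.type).toLowerCase();
    if (type === 'softbreak') {
      acc += ' ';
    } else if (type === 'linebreak' || type === 'hardbreak') {
      acc += '\n';
    } else if (node.literal) {
      acc += node.literal;
    }
  }
  return acc;
}

// Hypothetical stand-in for a real walker: replays a fixed list of nodes
// as entering events.
function replay(nodes) {
  var i = 0;
  return {
    next: function () {
      return i < nodes.length ? { entering: true, node: nodes[i++] } : null;
    }
  };
}

var plain = textWithBreaks(replay([
  { type: 'Text', literal: 'heading ' },
  { type: 'Code', literal: 'code' },
  { type: 'Softbreak' },
  { type: 'Text', literal: 'date' }
]));

console.log(JSON.stringify(plain)); // "heading code date"
```

Whether a soft break should become a space or a newline depends on where the text ends up; for a page title a space is probably the safer choice.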