Agreed on a common syntax. There should be a sensible default rendering as well so that CommonMark documents are compatible across different websites and apps.
Explicitly listing the file types like this is probably the safest route. Otherwise the parser would have to know in advance (or check at the time of rendering) which files to include. If another file format is added in the future (say, a flac version), the document could be updated manually, or perhaps programmatically if a flac version is added for every audio track in the larger file set.
I’m in agreement (as a CommonMark extension, not as part of the core spec).
If the extension is enabled, the ![]() syntax should render the content based on the specified file extension, e.g. a link to a video file would render the HTML <video> tag. If the extension is not enabled, the syntax would fall back to rendering the HTML <img> tag regardless of the specified file extension.
For the extension to be viable we would need a whitelist of file extensions, each mapped to the content type it should render as: image, audio, video, and perhaps other content types.
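To make the idea concrete, here is a rough sketch of how a renderer might apply such a whitelist. Everything in it is an assumption for illustration: the `MEDIA_TYPES` table, the `render_media` function, the `extension_enabled` flag, and the exact HTML attributes are hypothetical and not taken from any CommonMark implementation. The actual list of file extensions would have to be agreed on by the extension's spec.

```python
from html import escape
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical whitelist: file extension -> content type.
# The real list would be defined by the extension, not by this sketch.
MEDIA_TYPES = {
    ".png": "image", ".jpg": "image", ".gif": "image",
    ".mp3": "audio", ".ogg": "audio", ".flac": "audio",
    ".mp4": "video", ".webm": "video",
}


def render_media(url: str, alt: str = "", extension_enabled: bool = True) -> str:
    """Render the target of ![alt](url) as an HTML media tag."""
    # Look only at the path, so query strings and fragments are ignored.
    ext = splitext(urlparse(url).path)[1].lower()

    # With the extension disabled, or for an unlisted file extension,
    # fall back to the plain image behaviour.
    kind = MEDIA_TYPES.get(ext, "image") if extension_enabled else "image"

    src = escape(url, quote=True)
    text = escape(alt, quote=True)
    if kind == "video":
        return f'<video src="{src}" controls>{text}</video>'
    if kind == "audio":
        return f'<audio src="{src}" controls>{text}</audio>'
    return f'<img src="{src}" alt="{text}" />'
```

With this sketch, `render_media("clip.mp4", "demo")` produces a `<video>` tag, while the same call with `extension_enabled=False` produces an `<img>` tag, which matches the fallback described above. Treating unlisted file extensions as images is one possible default; an implementation could just as well emit a plain link instead.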