#microformats 2019-02-12

2019-02-12 UTC
[adam], strugee, ichoquo0Aigh9ie, [tantek], Garbee, bigbluehat, [xavierroy], KartikPrabhu, pniedzielski[m], nitot, tantek and [kevinmarks] joined the channel
#
@jgmac1106
↩️ I spent day hacking on Blogger adding microformats so it can play with #IndieWeb tools. In terms of syndication POSSE copies with IFTT best I can do for now, but still blogger with webmentions and a social reader is awesome… https://quickthoughts.jgregorymcverry.com/2019/02/12/alexstubenbort-i-spent-day-hacking-on
(twitter.com/_/status/1095159685009891328)
strugee, [jgmac1106], [eddie], [xavierroy], mickael, nitot, barpthewire, KartikPrabhu, [pfefferle], [Vincent], [adam], jgmac1106, [Rose], [kevinmarks] and [tantek] joined the channel
#
Loqi
cjwillcock: sknebel left you a message 1 day, 5 hours ago: congrats on your parser progress! It'd be great if you could put a testing page like https://php.microformats.io up, helps with manual testing and sharing results?
#
Loqi
cjwillcock: sknebel left you a message 16 hours, 53 minutes ago: FYI https://bugzilla.gnome.org/show_bug.cgi?id=769760
#
cjwillcock
thanks sknebel and zegnat!
#
cjwillcock
sknebel that's disappointing about libxml, I'll need to think about that some more; and I'll put a test page on my list
#
sknebel
generally, good HTML parsers have been somewhat of an issue, since HTML5 allows a bunch of stuff older parsers don't get, and e.g. HTML minimizer tools tend to exploit all that's allowed
#
cjwillcock
I was able to get around libxml not understanding the tags from 5 by adding the recover and noerror flags - but I wasn't expecting it to have that open exploit, unfixed for 2+ years :/
#
sknebel
I somehow remembered seeing the maintainer of something else rant about that a few weeks back
#
Loqi
[@tenderlove] FYI, if you use upstream libxml2 you're subject to multiple CVEs https://bugzilla.gnome.org/show_bug.cgi?id=769760
#
Zegnat
Is PHP using libxml2 for its parsing too? I know we had issues with DOMDocument parsing and moved to a userland PHP implementation
#
cjwillcock
zegnat: yep
#
Zegnat
Then there are definitely limitations, cjwillcock. In fact, we already know those limitations to be out there in the wild because that made us look for a userland one…
#
Zegnat
I think we wrote that one based on a live example someone had. Sadly we didn’t include an issue number there so would have to go looking
#
Zegnat
idly wonders how hard it would be to wrap something like https://github.com/servo/html5ever in a PHP ext
#
Loqi
[servo] html5ever: High-performance browser-grade HTML5 parser
#
cjwillcock
I was looking at wrapping up one of: http://xerces.apache.org/xerces-c/ or https://pugixml.org/
#
cjwillcock
I haven't finished going through my options for which one to use
#
cjwillcock
(maybe some other)
#
Zegnat
HTML really isn’t XML, so I would always be a little sceptical of those XML parsers.
#
Zegnat
I know someone was working on bringing Google’s Gumbo parser as a PHP ext, but both the ext and the parser itself have gone stale :(
#
cjwillcock
zegnat: thanks for the pointer to the test (failing)
#
Zegnat
No problem! We’ve been here before. If we can get someone (you? :P) to provide a nice modern HTML5 parser to PHP, we will all rejoice, haha
#
cjwillcock
lol
#
cjwillcock
well, step one, make me want one is done
#
cjwillcock
oh, that test doesn't fail because of html5 tags - that one is because of bad html
#
cjwillcock
I'm inclined to leave that as an exercise for userland code
#
Zegnat
What is bad HTML?
#
cjwillcock
unclosed <p>
#
Zegnat
No, that is 100% valid
#
cjwillcock
oh
#
cjwillcock
yes!
#
cjwillcock
you ARE right
#
Zegnat
Closing tags are optional for P elements. It is one of the things an XML-based browser will get wrong.
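
A minimal sketch of the point about optional closing tags, assuming Python's standard-library `xml.etree.ElementTree` as a stand-in for a strict XML parser (nobody in the channel used it; the sample markup and the helper name `parses_as_xml` are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: valid HTML, because the </p> end tag may be omitted.
UNCLOSED = "<div><p>one<p>two</div>"
CLOSED = "<div><p>one</p><p>two</p></div>"

def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # XML requires every element to be closed explicitly.
        return False
    return True

print(parses_as_xml(UNCLOSED))  # False: </div> arrives while <p> is still open
print(parses_as_xml(CLOSED))    # True
```

Both strings describe the same document to an HTML parser; only the XML parser treats the first one as an error.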
#
cjwillcock
thanks for that
#
cjwillcock
so libxml is out lol
#
Zegnat
Older XML based parsers try to get it right by finding the location they need to close the P (before block elements) but they do not know <article> is a block element because it is an HTML5 element
#
sknebel
and I think the one in HTML knew that, but didn't know that <article> would force the close to happen
#
sknebel
*one in PHP
#
Zegnat
<div><p>Something<p>Something else</div> worked, IIRC. But as soon as HTML5 comes in it is game over
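
The behaviour described above can be sketched roughly as follows. This is an illustration only, not how libxml, PHP's DOMDocument, or any parser named here is implemented. It uses Python's standard-library `html.parser` as a tokenizer. The two tag sets are assumed and abbreviated, the names `Outline` and `outline` are hypothetical, and the check only looks at the innermost open element, which is simpler than the HTML Standard's actual rule. Void elements such as `<br>` are not handled.

```python
from html.parser import HTMLParser

# Assumed, abbreviated tag sets for illustration; the real lists are longer.
HTML4_ERA_CLOSERS = {"p", "div", "ul", "ol", "table", "blockquote", "h1", "h2"}
HTML5_CLOSERS = HTML4_ERA_CLOSERS | {
    "article", "section", "nav", "aside", "header", "footer",
}

class Outline(HTMLParser):
    """Record (depth, tag) for each start tag, closing an open <p> implicitly."""

    def __init__(self, p_closers):
        super().__init__()
        self.p_closers = p_closers
        self.stack = []    # currently open elements
        self.outline = []  # (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        # A start tag from the closer set ends a <p> that is still open.
        if tag in self.p_closers and self.stack and self.stack[-1] == "p":
            self.stack.pop()
        self.outline.append((len(self.stack), tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, dropping unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def outline(markup, p_closers):
    parser = Outline(p_closers)
    parser.feed(markup)
    parser.close()
    return parser.outline

sample = "<div><p>Intro<article>Post</article></div>"
print(outline(sample, HTML4_ERA_CLOSERS))  # <article> ends up nested inside <p>
print(outline(sample, HTML5_CLOSERS))      # <article> becomes a sibling of <p>
```

With only the older tag set, `<p>` followed by `<p>` still works, which matches the `<div><p>Something<p>Something else</div>` case; the two outlines diverge only once an HTML5 element such as `<article>` appears.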
nitot joined the channel
#
cjwillcock
exactly right. Running the html through the tidy extension first resolves it. So I can either internally use the tidy extension - or strip out libxml and replace with a good html5 parser
#
cjwillcock
however, that use of tidy may not work in the case described in the CVE (I'll check it out)
#
Zegnat
Tidy may work too
[Vincent] joined the channel
#
[kevinmarks]
An advantage of working in node or go is that they have actual html5 parsers, not xml hacks?
#
cjwillcock
node's a little slower
#
cjwillcock
go wins
#
Zegnat
Does node really have an actual html5 parser available?
#
Zegnat
Or are you referring to a userland implementation as well? I found https://github.com/inikulin/parse5
#
Loqi
[inikulin] parse5: HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.
#
sknebel
what does "have actual html5 parsers" mean? afaik Go is the only one where it is part of the official language project
#
Zegnat
That’s what I meant. Official language part, or otherwise available as some sort of official extension/plugin. Like how Node offers file system functionality on top of the V8/ECMAScript that powers it
#
sknebel
but the ones we use in php-mf2, mf2py and I think microformats-node are html5 parsers
#
Zegnat
php-mf2 will use the official/native/default DOMDocument parser of the language, unless you provide a userland implementation (which we recommend, because the official isn’t HTML5 safe)
#
[kevinmarks]
Well, mf2py uses beautiful soup which can use html5lib.
#
sknebel
exactly
#
[kevinmarks]
But can also use the very bad default python parser if you're not careful.
#
Zegnat
I am guessing the Python default is also just xmllib? :P
#
Zegnat
That seems to be the case for most places
[kimberlyhirsh] joined the channel
#
Zegnat
Oh interesting
KartikPrabhu, [davidmead], nitot, barpthewire, TallTed, [kevinmarks], KevinMarks, [asuh], [tantek], [adam], gRegorLove_, jgmac1106, [schmarty], [eddie], eduardm and dougbeal|mb1 joined the channel
#
[tantek]
Kevinmarks, 15 years ago last night (!!!) https://twitter.com/t/status/433494367601717248
#
Loqi
[@t] Ten years ago to the hour, @KevinMarks and I introduced #microformats at an #ETech BoF session: http://tantek.com/presentations/2004etech/realworldsemanticspres.html (ttk.me t4UY2)
[adam] joined the channel
#
eduardm
jekyll postcss
#
eduardm
wrong console. my mistake
chrisaldrich and [kevinmarks] joined the channel
#
tantek
edited /invisible-data-considered-harmful (+143) "geourl archive link for map visualization of common lat long errors (enough to show up in data aggregations)"
#
tantek
edited /invisible-data-considered-harmful (+453) "another geourl errors citation, via kevinmarks"
#
tantek
edited /invisible-data-considered-harmful (+0) "/* invisible metadata failures */ -cr"
KartikPrabhu, [jgarber], jgmac1106, [schmarty] and tantek joined the channel
#
@jgmac1106
You know...I really like the minimalist features of Blogger and Classic Theme... always said I just want a blank HTML box with all the plumbing..in a way I am getting this vibe  My stylesheet, my ideas, my HTML  and now with microformats all my metadat… http://bit.ly/2SqCiB0
(twitter.com/_/status/1095461920650547206)
tantek joined the channel