#microformats 2018-03-03
2018-03-03 UTC
[eddie], [kevinmarks], tantek, globbot, barpthewire, [tantek] and twisted` joined the channel
# Zegnat !tell aaronpk that mf ruby bug you filed (https://github.com/indieweb/microformats-ruby/issues/83) isn’t actually a bug per spec, right? <br> is expected to disappear when taking the textContent of an HTML element. Sounds like https://github.com/microformats/microformats2-parsing/issues/15 needs to be resolved first?
# Zegnat goes off to read https://indieweb.org/note#Indieweb_whitespace_thinking as linked by tantek in #indieweb-dev
[kevinmarks] joined the channel
# [kevinmarks] It should turn into space of some kind
# [kevinmarks] Micro<br>formats should be "Micro\nformats" or "Micro formats" not "Microformats"
tantek joined the channel
# Zegnat innerText is an option, but would require CSS capabilities for the mf2 parser as non-CSS user agents just return textContent: https://html.spec.whatwg.org/#the-innertext-idl-attribute
[colinwalker] joined the channel
# Loqi aaronpk: Zegnat left you a message 3 hours, 26 minutes ago: that mf ruby bug you filed (https://github.com/indieweb/microformats-ruby/issues/83) isn’t actually a bug per spec, right? <br> is expected to disappear when taking the textContent of an HTML element. Sounds like https://github.com/microformats/microformats2-parsing/issues/15 needs to be resolved first?
[mifga] joined the channel
# Zegnat You may also run into questions of how many line breaks you want. E.g. if my raw HTML includes a linebreak after the BR element, the PHP parser preserves that linebreak *and* replaces the BR in the content.value property: http://php.microformats.io/?id=20180303151224242
[xavierroy] joined the channel
# KartikPrabhu yeah getting textContent is tricky
# KartikPrabhu but what if some elements are hidden with CSS?
# KartikPrabhu they don't show up on copy-paste
# KartikPrabhu or worse hidden by JS. we surely don't want mf2 parsers running random JS
# KartikPrabhu might be good to have some consensus and test-cases sorted out for this
# KartikPrabhu oh! any other reference would be great too
# aaronpk this is actually a pretty good explanation of what i'm thinking about https://medium.com/@patrickbrosset/when-does-white-space-matter-in-html-b90e8a7cdd33
# KartikPrabhu yeah I am not even sure most of that can be done with any HTML parser
# KartikPrabhu no, I mean I don't think it can be done even over the HTML parser
# KartikPrabhu HTML parsers sometimes just get rid of spacing so "replacing" spaces and tabs would be impossible from the output of an HTML parser
# KartikPrabhu I'll have to look into it a bit more to be sure though
# KartikPrabhu Zegnat: maybe, not sure. But some example test-cases would be great
# KartikPrabhu if you are constructing the DOM then yes. But outside the browser (say in Python or PHP code) I am not sure
# KartikPrabhu again would be great to have some test-case so we can just check
# KartikPrabhu I mean, we don't even have a solid algorithm for what is expected
# KartikPrabhu so not sure how accurate existing test cases are
# KartikPrabhu aaronpk: yes!
# KartikPrabhu Zegnat: lol! we know that means close to nothing :P
# Zegnat textContent is very clearly defined ... https://dom.spec.whatwg.org/#dom-node-textcontent
# KartikPrabhu does not understand that definition at all!
# KartikPrabhu aah
# Zegnat The textContent of the P element in `<p>Hallo<br>Bye</p>` is “HalloBye”. Because the P element contains 3 child nodes: text node “Hallo”, element “br”, text node “Bye”. The BR element is checked to see if it contains child nodes, it does not. Then all found text nodes are concatenated for the final value, thus “HalloBye”.
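A minimal sketch of the behaviour Zegnat describes, using only Python's standard `html.parser`. The class and function names (`TextContent`, `text_content`) are made up for illustration and are not taken from any mf2 parser; the sketch only concatenates text nodes in document order, which is the part of the textContent definition being discussed.

```python
from html.parser import HTMLParser


class TextContent(HTMLParser):
    """Collect text nodes in document order and ignore every element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Only text nodes contribute; <br> has no text children, so it adds nothing.
        self.parts.append(data)


def text_content(html):
    """Hypothetical helper: concatenate all text nodes found in `html`."""
    parser = TextContent()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(text_content("<p>Hallo<br>Bye</p>"))  # prints HalloBye
```

The BR element vanishes without leaving any whitespace behind, which is why "Micro<br>formats" ends up as "Microformats" under plain textContent.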
# KartikPrabhu aah^ there you go. Can't replace <br> using parser output then
# KartikPrabhu one can replace it before just like <img> is replaced by alt or src
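A rough sketch of the "replace it before" idea, assuming a line feed is the wanted replacement for BR (the channel had not agreed on that at this point). It uses Python's standard `html.parser`; `TextWithBreaks` and `text_with_breaks` are hypothetical names, and this is not how mf2py or any other parser actually does it.

```python
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collect text, substituting <br> and <img> while walking the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing forms such as <br/>.
        if tag == "br":
            self.parts.append("\n")  # assumption: BR becomes a line feed
        elif tag == "img":
            attributes = dict(attrs)
            if "alt" in attributes:
                self.parts.append(attributes["alt"] or "")
            else:
                self.parts.append(attributes.get("src") or "")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_breaks(html):
    """Hypothetical helper: text of `html` with BR and IMG replaced."""
    parser = TextWithBreaks()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(repr(text_with_breaks("Micro<br>formats")))  # prints 'Micro\nformats'
```

Because the substitution happens while the markup is being walked, it does not depend on recovering the BR from already-flattened text.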
# aaronpk here's one of the tests I believe is wrong https://github.com/microformats/tests/blob/master/tests/microformats-v2/h-entry/summarycontent.json
# Zegnat Note that the HTML spec only adds \n (line feed characters). And if you use innerText as a setter, it will specifically *skip* the \r in \r\n (https://html.spec.whatwg.org/#the-innertext-idl-attribute:dom-innertext-3)
KartikPrabhu joined the channel
# Zegnat so content.html should match https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments of the element with the e-content class
# KartikPrabhu lol
# KartikPrabhu so yeah we do need examples and consensus on what is expected
# KartikPrabhu aaronpk++ yas!
# KartikPrabhu yeah I'll take anything to start getting this decided
# KartikPrabhu if some parser implementors and users like Bridgy and Monocle or something agree with those then they can be implemented
# KartikPrabhu it does when parsing for properties
# KartikPrabhu yeah so if the element starts with a <br/> which is replaced by a space should it be stripped?
# KartikPrabhu or what Zegnat said
# KartikPrabhu ha!
# KartikPrabhu dang! nice aaronpk
# KartikPrabhu nice
# KartikPrabhu yes!
# KartikPrabhu is figuring out how the hell these tests work in mf2py code
# KartikPrabhu what's LF?
# KartikPrabhu aah
# KartikPrabhu oh i thought both were called CR
# KartikPrabhu checking for surrounding elements is going to get tough
# KartikPrabhu might slow down parsing too (not too sure)
# KartikPrabhu Zegnat: yup I do
# KartikPrabhu that's how I currently do the "replace <img> by alt and src" but that does not include checking surrounding stuff
# KartikPrabhu Zegnat: yeah it is built on a python lib to handle HTML but I think I should be able to do most DOM things with it
# KartikPrabhu does anyone here know how to run the tests in the mf2py code base?
# KartikPrabhu sknebel: thanks!
# KartikPrabhu has no idea how any of that is working
# KartikPrabhu Zegnat: I would not collapse it
# aaronpk so far this has made the most sense to me: https://aaronparecki.com/2018/03/03/3/
nitot joined the channel
# KartikPrabhu sknebel: ok will try
# KartikPrabhu sknebel: ok something did happen :P
# KartikPrabhu lol! mf2py tests are failing on whitespace and LF :P
# KartikPrabhu sknebel++ thanks
barpthewire joined the channel
KartikPrabhu joined the channel
# KartikPrabhu ok none of FF, Chrome, and Opera change <br> to <br/> so it is the html5lib parser being overzealous in mf2py
# KartikPrabhu should that count as a bug though?
[miklb] and KartikPrabhu joined the channel
# KartikPrabhu all HTML parsers modify the HTML, especially if it is malformed. But this is a bit too much, I agree
# KartikPrabhu yeah. But if one is writing tests then maybe both should be included to prevent error/failure?
# KartikPrabhu ok will wait for additional input from tantek about this whole thing
# KartikPrabhu aaronpk: do you also want to think about the "\t" tab character expectations?
# KartikPrabhu even intermediate tabs ? like "this is some \t\t\t text"
# KartikPrabhu interesting.
# Zegnat Also, if we go for a more spec-y implementation, whitespace collapsing per CSS Text 3 says that “Every tab is converted to a space”. (https://www.w3.org/TR/css-text-3/#white-space-phase-1)
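A simplified sketch of the collapsing steps from that part of CSS Text 3, assuming `white-space: normal` and assuming a line feed that survives is turned into a single space (the spec makes that last part language-dependent). `collapse_whitespace` is a hypothetical name, and this is not Zegnat's implementation.

```python
import re


def collapse_whitespace(text):
    """Approximate CSS Text 3 phase-1 collapsing for plain strings."""
    # Spaces and tabs directly before or after a line feed are removed.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Assumption: each remaining run of line feeds becomes one space.
    text = re.sub(r"\n+", " ", text)
    # Every tab is converted to a space.
    text = text.replace("\t", " ")
    # A space following another space is collapsed away.
    return re.sub(r" {2,}", " ", text)


print(collapse_whitespace("this is some \t\t\t text"))  # prints: this is some text
```

Under these rules the intermediate tabs from KartikPrabhu's example do not survive: they turn into spaces and then collapse with their neighbours.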
tantek joined the channel
# KartikPrabhu Zegnat: yeah html5lib does that, but it is also the closest to browser behaviour for malformed HTML
# KartikPrabhu so not sure what to do about that
# KartikPrabhu is sure the same things happen for <hr> things
# KartikPrabhu oh!
# Zegnat No idea where or how, but that’s what I am seeing in https://github.com/html5lib/html5lib-python/blob/master/html5lib/serializer.py
# KartikPrabhu hmm it is supposed to default to False but that is not what's going on
# KartikPrabhu Zegnat: thanks for the link will dig around more
# zegnat edited /microformats2-parsing (+1) "Serialisation algo is moved to the parsing page of the HTML spec"
# gRegorLove catches up on whitespace conversation
# KartikPrabhu gRegorLove: yes! would like PHP input on that whole thing
# gRegorLove I copied the microformat-shiv code for the php-mf2 innerText method, giving better results than PHP's DOMNode::textContent()
# KartikPrabhu "better"
# gRegorLove It was a while ago, I'd have to find the github issue :)
# KartikPrabhu should be here https://node.microformats.io/ as well but that page does not load
# KartikPrabhu aaronpk: https://go.microformats.io/
# KartikPrabhu also
# Zegnat If you want a direct JSON output from node mf2, you can also use https://sturdy-backbone.glitch.me/mf2/?url=https://aaronpk.com/
# Zegnat raw github URLs work fine, of course, and give straight JSON answers: https://sturdy-backbone.glitch.me/mf2/?url=https://raw.githubusercontent.com/aaronpk/microformats-whitespace-tests/master/tests/1.html
# gRegorLove php-mf2 context on its innerText() method: https://github.com/indieweb/php-mf2/pull/82
# KartikPrabhu gRegorLove: great! It would be real nice to have a consistent way of doing this across mf2 parsers
# gRegorLove Agreed
# gRegorLove I don't know all the ins and outs well enough to have a strong opinion
# gRegorLove aaronpk's table looks pretty good at a glance, for the expected values.
# gRegorLove For 7 in the table, you're collapsing multiple spaces into one, aaronpk?
# aaronpk these rules made the most sense to me https://aaronparecki.com/2018/03/03/3/
# gRegorLove Zegnat, Ah, gotcha. I meant textContent and innerText specifically. But yeah, aaronpk's expected output LGTM.
# Zegnat aaronpk, I am trying to “port” the algo as described here into my function: https://www.w3.org/TR/css-text-3/#white-space-phase-1
# aaronpk https://pin13.net/mf2/whitespace.html updated with node and go parsers
# aaronpk found some other good html->plaintext tests https://github.com/mtibben/html2text/blob/master/test/BasicTest.php
[tantek] joined the channel
# aaronpk just updated https://pin13.net/mf2/whitespace.html with treating them as the same, since it gives a better picture of the current state overall
# gRegorLove I think <br> and <br/> should be treated as equal
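One way a test harness could treat the two spellings as equal is to normalise them before comparing expected and actual `html` values. The sketch below is an assumption about how that might look, not code from any existing test suite; `normalize_void_tags` is a made-up name and it only handles attribute-less BR and HR tags.

```python
import re

# Matches <br>, <br/>, <br />, <hr>, <hr/> and <hr /> in any letter case.
VOID_TAG = re.compile(r"<(br|hr)\s*/?>", re.IGNORECASE)


def normalize_void_tags(html):
    """Hypothetical helper: rewrite <br/> and <hr/> to the slash-less form."""
    return VOID_TAG.sub(lambda match: "<" + match.group(1).lower() + ">", html)


expected = "<p>Hallo<br>Bye</p>"
actual = "<p>Hallo<br/>Bye</p>"
print(normalize_void_tags(expected) == normalize_void_tags(actual))  # prints True
```

Normalising both sides keeps the comparison about whitespace handling rather than about which serialiser a given parser happens to use.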
# Zegnat HTML spec says to use https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments for the content of the html property on e- parsing
chrisaldrich joined the channel
# Zegnat Throw in any HTML: https://wiki.zegnat.net/media/textparsing.html
# Zegnat Code may or may not be headache inducing, but mostly just implements https://www.w3.org/TR/css-text-3/#white-space-rules with the addition of inserting \n for BR and Ps.
# gRegorLove Zegnat++
[kevinmarks] joined the channel
# [kevinmarks] I have a feeling that html5lib has an xhtml/html toggle, but it is terribly documented
# [kevinmarks] I base this on memories of Sam Ruby writing chunks of it.
# Zegnat The serialiser seems to have a “use_trailing_solidus” option (https://github.com/html5lib/html5lib-python/blob/master/html5lib/serializer.py)
# [kevinmarks] There's an optional_tags opt on the serializer
chrisaldrich and KartikPrabhu joined the channel
# KartikPrabhu Zegnat: mf2py uses Beautiful Soup to handle the HTML which internally uses html5lib. So I'll have to really track this down
# KartikPrabhu no worries I have a fair idea of what is happening
# KartikPrabhu should be fixed by tomorrow I think
[kevinmarks] and [miklb] joined the channel
# KartikPrabhu so it seems BeautifulSoup takes it upon itself to close the tags https://github.com/waylan/beautifulsoup/blob/master/bs4/element.py#L1106
# [kevinmarks] Also, html5lib has multiple tree builder and walker options, some of which are more xml like.
# KartikPrabhu yeah but that is not causing this <br> to <br/>
# KartikPrabhu BS is doing it explicitly
# [kevinmarks] Ah. So this is beautifulsoup being mid 2000s html fashionable.
# [kevinmarks] Ok, fork and fix and push it to them
# KartikPrabhu the dev code is not on github so will have to learn some new thingie
# KartikPrabhu will see if I can report it as bug or something
# [kevinmarks] Do we want to be the github target for beautiful soup issues?
# KartikPrabhu nope
# KartikPrabhu admin tax and all that
# [kevinmarks] This is interesting. I wonder if "Postelian drift" is a thing.
# [kevinmarks] Where the definition of being conservative changes over time.
# KartikPrabhu Zegnat: it is a nice way to interact with the HTML tree with many nice inbuilt functions
# KartikPrabhu otherwise we'd have to rewrite most of what BS does anyway
# [kevinmarks] Beautiful Soup has very nice abstractions on html. Though modern DOM does too.