#microformats 2018-03-03

2018-03-03 UTC
[eddie], [kevinmarks], tantek, globbot, barpthewire, [tantek] and twisted` joined the channel
# 11:14 
Zegnat !tell aaronpk that mf ruby bug you filed (https://github.com/indieweb/microformats-ruby/issues/83) isn’t actually a bug per spec, right? <br> is expected to disappear when taking the textContent of an HTML element. Sounds like https://github.com/microformats/microformats2-parsing/issues/15 needs to be resolved first?
# 11:14 
Loqi Ok, I'll tell them that when I see them next
# 11:14 
Loqi [aaronpk] #83 <br> tags are not interpreted as whitespace when converting HTML to plaintext
# 11:17 
Zegnat goes off to read https://indieweb.org/note#Indieweb_whitespace_thinking as linked by tantek in #indieweb-dev
[kevinmarks] joined the channel
# 12:07 
[kevinmarks] It should turn into space of some kind
# 12:13 
[kevinmarks] Micro<br>formats should be "Micro\nformats" or "Micro formats" not "Microformats"
tantek joined the channel
# 12:57 
Zegnat I don’t disagree, [kevinmarks]. But that would need a spec update, as I see it, and I haven’t really seen a lot of proposed algorithms/solutions
# 13:02 
Zegnat innerText is an option, but would require CSS capabilities for the mf2 parser as non-CSS user agents just return textContent: https://html.spec.whatwg.org/#the-innertext-idl-attribute
# 13:02 
Zegnat So we would probably need something in between textContent and innerText, and I know no existing algorithms for that.
[colinwalker] joined the channel
# 14:41 
aaronpk I forgot about #15
# 14:41 
Loqi aaronpk: Zegnat left you a message 3 hours, 26 minutes ago: that mf ruby bug you filed (https://github.com/indieweb/microformats-ruby/issues/83) isn’t actually a bug per spec, right? <br> is expected to disappear when taking the textContent of an HTML element. Sounds like https://github.com/microformats/microformats2-parsing/issues/15 needs to be resolved first?
# 14:41 
aaronpk i don't believe that there isn't already an existing dom method that includes the white space properly
[mifga] joined the channel
# 15:04 
Zegnat I haven’t been able to find one
# 15:06 
Zegnat DOM spec has no concept of DOM elements inherently meaning some sort of line break, so you are limited to the HTML spec
# 15:07 
Zegnat And the HTML spec has innerText, which does require specific line breaks for br and p elements, but that algo also includes CSS specific things and is only for user agents that implement CSS
# 15:13 
Zegnat You may also run into questions of how many line breaks you want. E.g. if my raw HTML includes a linebreak after the BR element, the PHP parser preserves this one *and* replaces the BR in the content.value property: http://php.microformats.io/?id=20180303151224242
[xavierroy] joined the channel
# 15:42 
KartikPrabhu yeah getting textContent is tricky
# 15:42 
aaronpk oh gosh
# 15:43 
aaronpk well I still maintain that the plaintext version should match what you get when you copypaste from the browser and paste into a plaintext document
# 15:44 
KartikPrabhu but what if some elements are hidden with CSS?
# 15:44 
KartikPrabhu they don't show up on copy-paste
# 15:44 
aaronpk yeah browser copy-paste takes into account css
# 15:44 
aaronpk which I guess isn't practical for microformats
# 15:44 
KartikPrabhu or worse hidden by JS. we surely don't want mf2 parsers running random JS
# 15:45 
KartikPrabhu related: https://github.com/microformats/microformats2-parsing/issues/20
# 15:45 
Loqi [kartikprabhu] #20 innertext in value-class-pattern needs clarification
# 15:45 
KartikPrabhu might be good to have some consensus and test-cases sorted out for this
# 15:48 
aaronpk i'm trying to remember back to the early HTML books I read because I feel like this was pretty well sorted out
# 15:48 
KartikPrabhu oh! any other reference would be great too
# 15:51 
aaronpk this is actually a pretty good explanation of what i'm thinking about https://medium.com/@patrickbrosset/when-does-white-space-matter-in-html-b90e8a7cdd33
# 15:51 
aaronpk * all spaces and tabs immediately before and after a line break are ignored
# 15:51 
aaronpk * all tab characters are handled as space characters
# 15:51 
aaronpk * line breaks are converted to spaces
# 15:51 
aaronpk * sequences of spaces at the beginning and end of a line are removed
# 15:53 
aaronpk from https://www.w3.org/TR/CSS21/text.html#white-space-model
# 15:53 
aaronpk (paraphrased)
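The four paraphrased rules above can be sketched as a small Python function. This is only an illustration under the assumption that `white-space: normal` applies everywhere; the name `collapse_whitespace` is hypothetical and not taken from any mf2 parser, and the sketch only looks at `\n` line breaks, not `\r\n`.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Apply the paraphrased CSS 2.1 rules, assuming white-space: normal."""
    # Spaces and tabs immediately before and after a line break are ignored.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Tab characters are handled as space characters.
    text = text.replace("\t", " ")
    # Line breaks are converted to spaces.
    text = text.replace("\n", " ")
    # Runs of spaces collapse to a single space (part of the same CSS model).
    text = re.sub(r" {2,}", " ", text)
    # Spaces at the beginning and end of the line are removed.
    return text.strip(" ")


print(collapse_whitespace("Last week\n      the community\tcelebrated  "))
```

Because every line break becomes a space, nothing in this step alone can produce a `\n` in the output; any line breaks in the result would have to be added by a separate step.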
# 15:53 
KartikPrabhu yeah I am not even sure most of that can be done with any HTML parser
# 15:54 
aaronpk yeah I think it's an additional step of processing beyond the HTML parser
# 15:54 
aaronpk in browsers it's the rendering step
# 15:55 
KartikPrabhu no, I mean I don't think it can be done even over the HTML parser
# 15:55 
aaronpk why not?
# 15:55 
KartikPrabhu HTML parsers sometimes just get rid of spacing so "replacing" spaces and tabs would be impossible from the output of a HTML parser
# 15:56 
KartikPrabhu I'll have to look into it a bit more to be sure though
# 15:56 
Zegnat HTML parsers should not be doing that, that sounds like a bug, KartikPrabhu
# 15:56 
KartikPrabhu Zegnat: maybe, not sure. But some example test-cases would be great
# 15:56 
Zegnat Modern HTML parsers should parse text into DOM, and AFAIK this does not destroy whitespace. The whitespace just ends up in text nodes
# 15:57 
aaronpk I don't really have a good mental model of HTML parsers, but that does match my experience with the php mf2 parser since I keep seeing weird leading whitespace in text content
# 15:57 
KartikPrabhu if you are constructing the DOM then yes. But outside the browser (say a Python or PHP code) I am not sure
# 15:58 
Zegnat In PHP DOMDocument is used to convert the HTML into a DOM tree, same there.
# 15:58 
KartikPrabhu again would be great to have some test-case so we can just check
# 15:58 
Zegnat There is no HTML parsing without DOM, at least not in HTML5
# 15:58 
Zegnat Should be in the test cases of your parser lib alread
# 15:58 
Zegnat *already
# 15:58 
KartikPrabhu I mean, we don't even have a solid algorithm for what is expected
# 15:59 
aaronpk I should just start making text files of test cases of how I expect this to work
# 15:59 
KartikPrabhu so not sure how accurate existing test cases are
# 15:59 
Zegnat Well, expected by spec is textContent, KartikPrabhu ;)
# 15:59 
KartikPrabhu aaronpk: yes!
# 15:59 
KartikPrabhu Zegnat: lol! we know that means close to nothing :P
# 15:59 
Zegnat Apart from the old VCP wiki pages, which are a lot more vague on what works
# 16:00 
Zegnat textContent is very clearly defined ... https://dom.spec.whatwg.org/#dom-node-textcontent
# 16:00 
Zegnat It is only innertext for VCP that is ill-defined
# 16:01 
KartikPrabhu does not understand that definition at all!
# 16:02 
Zegnat textContent of an HTML element (Element in the spec) is the concatenation of all the Text nodes it (or its nested elements) contains. Where text nodes are plain text strings between an opening and closing HTML tag.
# 16:03 
Zegnat Which seems to be what several parsers already return. PHP does not return this value, and goes against spec, because it tries to return a more logical value for the end-user.
# 16:03 
KartikPrabhu aah
# 16:07 
Zegnat The textContent of the P element in `<p>Hallo<br>Bye</p>` is “HalloBye”. Because the P element contains 3 child nodes: text node “Hallo”, element “br”, text node “Bye”. The BR element is checked to see if it contains child nodes, it does not. Then all found text nodes are concatenated for the final value, thus “HalloBye”.
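The same result can be reproduced with Python's standard library. This is a rough sketch rather than a real DOM implementation: `html.parser.HTMLParser` only emits a stream of tags and text, and the names `TextContent` and `text_content` are made up for the illustration. Joining every text chunk while ignoring all elements mirrors what textContent does for an element.

```python
from html.parser import HTMLParser


class TextContent(HTMLParser):
    """Collect every piece of text and ignore all elements, like textContent."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Text is kept verbatim; no whitespace handling happens here.
        self.parts.append(data)


def text_content(html: str) -> str:
    parser = TextContent()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(text_content("<p>Hallo<br>Bye</p>"))  # HalloBye
```

The BR contributes nothing because it holds no text, which is why the two words run together.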
# 16:07 
aaronpk so browsers don't render `<p>Hallo<br>Bye</p>` as HalloBye so when is textContent actually used?
# 16:08 
KartikPrabhu aah^ there you go. Can't replace <br> using parser output then
# 16:08 
Zegnat I am not sure if textContent is ever used by browsers aaronpk. They prefer to use innerText, I imagine, which does add the linebreak.
# 16:09 
KartikPrabhu one can replace it before just like <img> is replaced by alt or src
# 16:09 
Zegnat That is one option, KartikPrabhu
# 16:09 
aaronpk here's one of the tests I believe is wrong https://github.com/microformats/tests/blob/master/tests/microformats-v2/h-entry/summarycontent.json
# 16:11 
Zegnat What specifically are you saying is wrong there, aaronpk?
# 16:11 
Zegnat goes to write what he expects the output to be by hand
# 16:11 
aaronpk content.value and summary should be "Last week the microformats.org community celebrated its...", collapsing the newline and following several space characters into a single space
# 16:14 
Zegnat Not according to current spec, but, yes, I do agree
# 16:14 
aaronpk i'm not talking about the spec, i'm talking about what I expect as someone using the parser
# 16:14 
Zegnat Yeah, sorry, I for a second thought you were saying the test case to be wrong :)
# 16:15 
aaronpk well ultimately I am
# 16:15 
Zegnat I agree with you that I would expect normalised whitespace
# 16:15 
aaronpk but that's because the spec is wrong
# 16:17 
Zegnat Hmm, whatwg also added a note to the innerText method that it may also be used for the Selection API. So I guess browsers do use that algo for copy-pasting.
# 16:18 
Zegnat So basically: textContent is the DOM method for getting a string of all text nodes. innerText is the HTML element method for getting a normalised text only rendition of what the user agent renders
# 16:18 
Zegnat But full innerText can’t be supported without CSS support, so we might have to cherry pick some of its steps?
# 16:19 
aaronpk assuming white-space: normal seems reasonable
# 16:20 
aaronpk does mf2 say anything about \n vs \r\n right now? the parsers seem to handle that differently too
# 16:20 
Zegnat Yes, I think KartikPrabhu brought that up before too
# 16:20 
Zegnat The spec should be indifferent to it, that is, return whatever is in the source document
# 16:21 
aaronpk that makes sense
# 16:21 
Zegnat (Mostly because textContent is indifferent and returns verbatim what is in the text node)
# 16:24 
Zegnat Note that the HTML spec only adds \n (line feed characters). And if you use innerText as a setter, it will specifically *skip* the \r in \r\n (https://html.spec.whatwg.org/#the-innertext-idl-attribute:dom-innertext-3)
# 16:25 
aaronpk wow all three parsers have completely different results for <div class="e-content p-name"><p>Hello</p><p>World</p></div>
# 16:26 
aaronpk the ruby one is what I would expect
# 16:26 
aaronpk actually not quite, it shouldn't be adding a newline in the html version 😂
# 16:26 
aaronpk it seems sensible that the parser should preserve the originally authored html in content.html, right?
# 16:27 
Zegnat .. ruby is telling me “The change you wanted was rejected.” What sort of error is that?
# 16:27 
Zegnat yes, content.html should imo be exactly as authored
# 16:28 
Zegnat double checks the spec
# 16:29 
Zegnat spec says to use innerHTML, and then links to a non-existing fragment in the HTML spec. *sigh*
KartikPrabhu joined the channel
# 16:30 
Zegnat so content.html should match https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments of the element with the e-content class
# 16:31 
KartikPrabhu lol
# 16:31 
KartikPrabhu so yeah we do need examples and consensus on what is expected
# 16:31 
aaronpk i'm making a repo with really simple examples
# 16:32 
KartikPrabhu aaronpk++ yas!
# 16:32 
Loqi aaronpk has 12 karma in this channel (1575 overall)
# 16:32 
aaronpk apologies in advance if these seem contrived, but i'm trying to make them really easy to read
# 16:32 
KartikPrabhu yeah I'll take anything to start getting this decided
# 16:32 
Zegnat Just quickly looking through the content.html algo, aaronpk, it does look like any and all whitespace should be kept as originally authored. No normalisation of whitespace or \r\n.
# 16:33 
KartikPrabhu if some parser implementors and users like Bridgy and Monocle or something agree with those then they can be implemented
# 16:33 
Zegnat The only thing mf2 then does is strip any whitespace at the start and end of content.html ... which is OK, maybe? Not sure I see the point in doing that.
# 16:33 
aaronpk oh does it say to do that?
# 16:33 
Zegnat “ html: the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm, with leading/trailing whitespace removed. ”
# 16:34 
KartikPrabhu it does when parsing for properties
# 16:34 
Zegnat http://microformats.org/wiki/microformats2-parsing#parsing_an_e-_property
# 16:34 
aaronpk I just wrote that test, and wasn't sure what I was expecting
# 16:34 
Loqi [Tantek Çelik] microformats2 parsing specification
# 16:34 
aaronpk certainly for non-html values it makes sense to strip
# 16:34 
aaronpk but I was on the fence about it for html values
# 16:35 
Zegnat I am on the fence too. E.g. if I use e-content on a PRE element, I would not like having whitespace stripped from the start.
# 16:35 
KartikPrabhu yeah so if the element starts with a <br/> which is replaced by a space should it be stripped?
# 16:35 
KartikPrabhu or what Zegnat said
# 16:35 
aaronpk well <pre> elements are a different problem I think
# 16:35 
aaronpk because the knowledge of that element doesn't make it through to the parsed result
# 16:36 
aaronpk the leading <br> is interesting tho
# 16:36 
Zegnat No, but if you say the content.html contains stuff as authored, I am not sure I would expect the leading whitespace to be gone, aaronpk. But honestly any reason I can come up with for keeping the whitespace is pretty contrived.
# 16:36 
Zegnat I just don’t see a reason why we should be stripping it either.
# 16:37 
aaronpk since it's written into the spec there must be a reason
# 16:37 
KartikPrabhu ha!
# 16:37 
aaronpk as opposed to a lot of this other stuff which is only implicitly in the spec
# 16:37 
Zegnat Hehe
# 16:37 
aaronpk someone took the time to write that sentence though, so i'm curious what the use case was
# 16:38 
Zegnat I wonder that for a lot. Especially some of those faux-CSS selectors for implied properties ;)
# 16:39 
sknebel Zegnat: if you care about the pre-ness of content, you'd have to include the pre inside the e-content though
# 16:41 
Zegnat Yeah, my example was contrived, sknebel. Just the only case where the leading whitespace is actually important
# 16:41 
aaronpk https://github.com/aaronpk/microformats-whitespace-tests/tree/master/tests
# 16:45 
KartikPrabhu dang! nice aaronpk
# 16:45 
sknebel (also, when quoting from the test suite, obligatory warning that it isn't always correct even to current specified behavior)
# 16:46 
Zegnat aaronpk, I see you decided to keep leading and trailing whitespace for html values? :)
# 16:46 
aaronpk yes that was my initial inclination before you said the spec says otherwise
# 16:46 
Zegnat It would be my preference too, if we are updating the textContent of the spec anyway
# 16:48 
Zegnat These look like a good start to me, aaronpk!
# 16:50 
aaronpk next up i'm writing a script to compare the results of all the parsers
# 16:50 
KartikPrabhu nice
# 16:51 
Zegnat brainstorms a bit on how to codify this HTML-input-to-plain-text-value-output as a generalised DOM based algo
# 16:51 
KartikPrabhu yes!
# 16:51 
KartikPrabhu is figuring out how the hell these tests work in mf2py code
# 16:53 
Zegnat There must be a better way than “replace the BR element with a textNode containing a single LF, unless the next node in the tree is a textNode starting with an LF”...
# 16:54 
KartikPrabhu what's LF?
# 16:54 
Zegnat \n
# 16:54 
KartikPrabhu aah
# 16:54 
Zegnat CR (carriage return) is \r, LF (line feed) is \n
# 16:54 
KartikPrabhu oh i thought both were called CR
# 16:55 
Zegnat Everyday is an opportunity to learn something new :D
# 16:55 
KartikPrabhu checking for surrounding elements is going to get tough
# 16:55 
Zegnat Do you have access to the DOM API from your HTML parser, or not, KartikPrabhu?
# 16:55 
KartikPrabhu might slow down parsing too (not too sure)
# 16:55 
KartikPrabhu Zegnat: yup I do
# 16:56 
KartikPrabhu that's how I currently do the "replace <img> by alt and src" but that does not include checking surrounding stuff
# 16:57 
Zegnat Yeah, I would rather have it go as a look-behind. It is easier to, while walking the node tree, keep track of “did I just replace a BR element with \n?”, than it is to check what is coming next.
# 16:57 
KartikPrabhu Zegnat: see e.g. https://github.com/kartikprabhu/mf2py/blob/master/mf2py/dom_helpers.py#L26
# 17:02 
Zegnat I do not recognise a lot of the DOM api in that function, haha
# 17:02 
Loqi Zegnat: lol
# 17:03 
KartikPrabhu Zegnat: yeah it is built on a python lib to handle HTML but I think I should be ale to do most DOM things with it
# 17:03 
KartikPrabhu able*
# 17:07 
KartikPrabhu does anyone here know how to run the tests in the mf2py code base?
# 17:08 
Zegnat sknebel (ping) maybe? He at least has a lot more Python experience than me.
# 17:08 
sknebel one sec
# 17:08 
KartikPrabhu sknebel: thanks!
# 17:08 
KartikPrabhu has no idea how any of that is working
# 17:08 
Zegnat aaronpk, how would you expect multiple breaks in the raw HTML (\n\n\n) to be represented in the plaintext version? Collapsed into 1?
# 17:09 
KartikPrabhu Zegnat: I would not collapse it
# 17:10 
aaronpk no I don't think so
# 17:10 
aaronpk oh wait
# 17:10 
sknebel KartikPrabhu: "python setup.py test" ?
# 17:10 
aaronpk so far this has made the most sense to me: https://aaronparecki.com/2018/03/03/3/
# 17:10 
Loqi [Aaron Parecki] When does white space matter in HTML? – Patrick Brosset
# 17:10 
aaronpk so: "line breaks are converted to spaces" then "sequences of spaces at the beginning and end of a line are removed"
# 17:10 
sknebel (untested, since I don't seem to have a checkout of it on this machine, but the setup.py seems to have the necessary bits)
# 17:11 
Zegnat But those rules leave no \n in place, aaronpk
nitot joined the channel
# 17:11 
KartikPrabhu sknebel: ok will try
# 17:11 
aaronpk oh yeah, my #5 is wrong! html newlines are not significant in plaintext so they should be removed
# 17:11 
KartikPrabhu sknebel: ok something did happen :P
# 17:12 
aaronpk pushed a fix
# 17:12 
Zegnat Ah, thanks :P
# 17:12 
Zegnat Yes, I was looking at test 5, haha
# 17:13 
KartikPrabhu lol! mf2py tests are failing on whitespace and LF :P
# 17:14 
KartikPrabhu sknebel++ thanks
# 17:14 
Loqi sknebel has 6 karma in this channel (87 overall)
# 17:15 
Zegnat So the proposal basically turns into us applying the CSS white space processing rules, followed by inserting \n’s in specific places depending on the HTML element?
barpthewire joined the channel
# 17:39 
aaronpk alright done
# 17:39 
aaronpk https://rawgit.com/aaronpk/microformats-whitespace-tests/master/results/results.html
# 17:42 
aaronpk oh forgot to add the html in the table
# 17:43 
Zegnat Ugh, locked my entire firefox because of an errant while loop. Wasn’t Firefox supposed to isolate tabs from each other these days?
# 17:43 
aaronpk there https://pin13.net/mf2/whitespace.html
# 17:50 
aaronpk well that was fun
# 17:50 
aaronpk it's a little bit sad how little the parsers agree on
# 17:51 
Zegnat They seem to agree on content.html, mostly. Just Python defaulting to close void elements
# 17:53 
aaronpk except for </p><p>
# 17:55 
Zegnat Ah, right, that’s an odd one. I wonder where that is coming from, in the html no less
# 17:56 
Zegnat tries to figure out if <br> or <br/> is expected
# 17:56 
aaronpk looks forward to tantek catching up on this discussion
KartikPrabhu joined the channel
# 18:41 
KartikPrabhu ok none of FF, Chrome, and Opera change <br> to <br/> so it is the html5lib parser being overzealous in mf2py
# 18:41 
KartikPrabhu should that count as a bug though?
[miklb] and KartikPrabhu joined the channel
# 18:42 
aaronpk I guess it's mostly harmless but still seems like it shouldn't be modifying the html
# 18:42 
KartikPrabhu all HTML parsers modify the HTML, especially if it is malformed. but this is a bit too much I agree
# 18:43 
aaronpk I don't have strong feelings about it though because there is no visible difference between the two
# 18:43 
KartikPrabhu yeah. But if one is writing tests then maybe both should be included to prevent error/failure?
# 18:44 
aaronpk I can accept both in my test chart
# 18:45 
KartikPrabhu ok will wait for additional input from tantek about this whole thing
# 18:47 
KartikPrabhu aaronpk: do you also want to think about the "\t" tab character expectations?
# 18:49 
Zegnat Tabs are collapsed in all the other whitespace collapsing, per the rules aaronpk linked to
# 18:49 
KartikPrabhu even intermediate tabs ? like "this is some \t\t\t text"
# 18:50 
Zegnat Yes
# 18:51 
KartikPrabhu interesting.
# 18:52 
Zegnat Also, if we go for a more spec-y implementation, whitespace collapsing per CSS Text 3 says that “Every tab is converted to a space”. (https://www.w3.org/TR/css-text-3/#white-space-phase-1)
tantek joined the channel
# 18:53 
Zegnat Also, I just checked, void elements should *not* be self closing per the HTML5 spec
# 18:53 
Zegnat So technically, Python is doing it wrong.
# 18:53 
KartikPrabhu Zegnat: yeah html5lib does that, but it is also the closest to browser behaviour for malformed HTML
# 18:53 
KartikPrabhu so not sure what to do about that
# 18:54 
KartikPrabhu is sure the same things happen for <hr> things
# 18:55 
Zegnat It should be a setting in html5lib, KartikPrabhu
# 18:55 
KartikPrabhu oh!
# 18:55 
Zegnat Option `use_trailing_solidus`
# 18:55 
Zegnat No idea where or how, but that’s what I am seeing in https://github.com/html5lib/html5lib-python/blob/master/html5lib/serializer.py
# 18:56 
Zegnat (And I assume that’s what is being used?)
# 18:57 
KartikPrabhu hmm it is supposed to default to False but that is not what's going on
# 18:57 
KartikPrabhu Zegnat: thanks for the link will dig around more
# 18:58 
zegnat edited /microformats2-parsing (+1) "Serialisation algo is moved to the parsing page of the HTML spec" (view diff)
# 18:59 
Zegnat Fixed the HTML string algo link ^^^ That’s the spec that is saying not to use the closing slash, if you are interested in the technicalities, KartikPrabhu
# 19:00 
gRegorLove catches up on whitespace conversation
# 19:04 
KartikPrabhu gRegorLove: yes! would like PHP input on that whole thing
# 19:06 
gRegorLove I copied the microformat-shiv code for the php-mf2 innerText method, giving better results than PHP's DOMNode::textContent()
# 19:07 
KartikPrabhu "better"
# 19:07 
KartikPrabhu ?
# 19:07 
gRegorLove It was a while ago, I'd have to find the github issue :)
# 19:07 
aaronpk oh yeah I forgot to add the node results
# 19:07 
aaronpk is that online somewhere?
# 19:08 
aaronpk see: https://pin13.net/mf2/whitespace.html
# 19:08 
gRegorLove aaronpk: https://glennjones.net/tools/microformats/
# 19:09 
KartikPrabhu should be here https://node.microformats.io/ as well but that page does not load
# 19:09 
KartikPrabhu aaronpk: https://go.microformats.io/
# 19:09 
KartikPrabhu also
# 19:10 
Zegnat If you want a direct JSON output from node mf2, you can also use https://sturdy-backbone.glitch.me/mf2/?url=https://aaronpk.com/
# 19:11 
aaronpk I need direct json output given html
# 19:11 
Zegnat But I don’t think that one takes plain text input, so your tests need to be on public URLs
# 19:12 
Zegnat raw github URLs work fine, of course, and give straight JSON answers: https://sturdy-backbone.glitch.me/mf2/?url=https://raw.githubusercontent.com/aaronpk/microformats-whitespace-tests/master/tests/1.html
# 19:12 
Loqi Hello World
# 19:12 
aaronpk ah good idea
# 19:12 
gRegorLove php-mf2 context on its innerText() method: https://github.com/indieweb/php-mf2/pull/82
# 19:12 
Loqi [gRegorLove] #82 Implemented @glennjones "innerText" parsing for better parsed whitespace
# 19:13 
KartikPrabhu gRegorLove: great! It would be real nice to have a consistent way of doing this across mf2 parsers
# 19:13 
gRegorLove Agreed
# 19:17 
gRegorLove I don't know all the ins and outs well enough to have a strong opinion
# 19:18 
Zegnat The ins and outs are that mf2 really doesn’t define anything itself, just that it uses DOM textContent (all text nodes, no other data). So you only need to have an opinion on what sort of string you personally would expect :)
# 19:19 
aaronpk the point of this exercise for me is to look at it from the plain input and output regardless of what the spec currently says and see if that makes sense
# 19:19 
gRegorLove aaronpk's table looks pretty good at a glance, for the expected values.
# 19:19 
Zegnat mumbles something about how the CSS text spec isn’t written against a DOM tree
# 19:20 
gRegorLove For 7 in the table, you're collapsing multiple spaces into one, aaronpk?
# 19:21 
aaronpk these rules made the most sense to me https://aaronparecki.com/2018/03/03/3/
# 19:21 
Loqi [Aaron Parecki] When does white space matter in HTML? – Patrick Brosset
# 19:23 
gRegorLove Zegnat, Ah, gotcha. I meant textContent and innerText specifically. But yeah, aaronpk's expected output LGTM.
# 19:25 
Zegnat aaronpk, I am trying to “port” the algo as described here into my function: https://www.w3.org/TR/css-text-3/#white-space-phase-1
# 19:25 
Zegnat That seems to match the one you bookmarked, except it is a web spec instead of a Medium article
# 19:25 
Zegnat (assumption that white-space is set to normal, as you stated previously)
# 19:36 
aaronpk https://pin13.net/mf2/whitespace.html updated with node and go parsers
# 19:38 
sknebel (node parser on sturdy-backbone isn't quite stock, but the changes shouldn't impact this comparison)
# 19:38 
aaronpk ah
# 19:39 
aaronpk found some other good html->plaintext tests https://github.com/mtibben/html2text/blob/master/test/BasicTest.php
# 19:39 
sknebel (I had to fix something to make sturdy backbone work, but the PR never got merged)
# 19:40 
aaronpk this library is giving me doubts about whether the mf2 parser's plaintext version of e-content will ever be useful
# 19:41 
Zegnat Hmm, yeah
# 19:41 
aaronpk specifically https://github.com/mtibben/html2text/blob/master/test/ListTest.php
# 19:42 
aaronpk a "good" plaintext representation of <ul><li> should actually convert that to a plaintext bullet list using *
# 19:42 
aaronpk but that has nothing to do with html rules
[tantek] joined the channel
# 19:42 
aaronpk (i'm imagining the use case of rendering plaintext versions of posts in a reader that doesn't render html)
# 19:43 
aaronpk that conversion is definitely not something i'd expect the mf2 parser to do, but I might expect XRay to do it
# 19:43 
sknebel not a fan of specifications that say "do whatever library X does", but I can see why people do that ;)
# 19:44 
aaronpk ?
# 19:44 
Zegnat We need some sort of HTML-to-plain-text in mf2 anyway, for p- parsing. But definitely a valid question whether it makes sense to then provide this plain text for e- as well or not
# 19:44 
aaronpk true, but most of the time the p- rules are used for simple one-line strings
# 19:46 
Zegnat Definitely true for me
# 19:47 
aaronpk did we land anywhere on whether I should treat <br> as equal to <br/> for the purposes of comparing whether the test is successful?
# 19:49 
aaronpk just updated https://pin13.net/mf2/whitespace.html with treating them as the same, since it gives a better picture of the current state overall
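One way a comparison script could treat `<br>` and `<br/>` as the same is to strip the trailing slash from void elements on both sides before comparing. The sketch below is an assumption about how that might look, not the code behind the results page; `normalize_void_tags` and `same_html` are hypothetical names, and the regular expression is deliberately rough (it does not try to cope with `>` inside attribute values).

```python
import re

# Void elements as listed in the HTML spec; a trailing slash on them has no effect.
VOID = "area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr"

# Matches e.g. <br/>, <br />, <img src="a.png" /> and captures name and attributes.
_TRAILING_SOLIDUS = re.compile(rf"<({VOID})(?=[\s/])([^<>]*?)\s*/>", re.IGNORECASE)


def normalize_void_tags(html: str) -> str:
    """Rewrite <br/> style void tags to the slash-less <br> form."""
    return _TRAILING_SOLIDUS.sub(r"<\1\2>", html)


def same_html(expected: str, actual: str) -> bool:
    """Compare two HTML strings while ignoring void-element trailing slashes."""
    return normalize_void_tags(expected) == normalize_void_tags(actual)


print(same_html('<div class="e-content"><br></div>',
                '<div class="e-content"><br/></div>'))  # True
```

Normalising both sides keeps the test matrix focused on whitespace differences rather than on serialiser style.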
# 19:54 
gRegorLove I think <br> and <br/> should be treated as equal
# 19:54 
sknebel yes
# 19:56 
tantek reads some scrollback
# 19:57 
aaronpk wb tantek
# 19:57 
tantek uh yes, HTML5 treats them the same. why would you treat them differently?!?
# 19:57 
aaronpk depends on whether you take the definition of the mf2 parser to mean the parser should not modify the authored html or whether transformations like <br> -> <br/> are okay for the parser to do
# 19:59 
Zegnat According to spec, <br/> is wrong
# 19:59 
Zegnat I think I said that somewhere
# 20:00 
sknebel Zegnat: nope, it is allowed
# 20:00 
Zegnat No, according to mf spec we need to do HTML serialisation per HTML spec, which does not add the /
# 20:00 
tantek Zegnat, which spec? Browsers treat them the same :P
# 20:00 
sknebel but allows it as valid html
# 20:01 
Zegnat The mf2 spec says to use https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments for the content of the html property on e- parsing
# 20:01 
Zegnat Which specifically does not add the /
# 20:01 
tantek oh for *generating* HTML
# 20:01 
tantek yeah that's fine
# 20:01 
aaronpk not even generating, just consuming existing html
# 20:01 
aaronpk consuming and re-generating I guess
# 20:02 
tantek when consuming you treat them the same
# 20:02 
Zegnat Yeah, we are talking about outputting form the mf2 parser
# 20:02 
Zegnat Which, according to the spec, should not output <br/> but always <br>
# 20:02 
Zegnat s/form/from/
# 20:02 
tantek yeah, better to be consistent there
# 20:03 
tantek however if you're consuming HTML from the mf2 JSON, you must treat them the same
# 20:03 
tantek you don't get to "sorta" parse HTML
# 20:03 
aaronpk haha okay so what should my test do? reject the result if the mf2 parser adds a / ?
# 20:03 
Loqi aaronpk: lol
# 20:03 
Zegnat For your purpose however, aaronpk, I would treat them as the same. And we should file a bug on Python to not output slashes on void element
# 20:04 
tantek I for one generate <br/> in my code because I'm using various XML functions on the back end to process my own content
# 20:04 
aaronpk for my purposes the difference isn't significant so i'm tempted to accept both results
# 20:04 
tantek if "accept" means parse/consume then yes
# 20:05 
aaronpk "accept" means "allow the parser to modify the html and still treat as a valid result" in this case
# 20:05 
tantek "modify the html"?
# 20:05 
aaronpk some parsers change the <br> in <div class="e-content"><br></div> to <br/>
# 20:06 
Zegnat Oh, Go and Node do it too? so multiple bug reports :(
# 20:06 
tantek that's both wrong, and something that consumers of mf2 JSON must be able to handle
# 20:07 
tantek this seems not very relevant to anything user-visible
# 20:07 
aaronpk right, which is why I'm okay accepting both for this test matrix
chrisaldrich joined the channel
# 20:17 
Zegnat I think I can now walk the DOM tree of a node and output the exact expected plain text from all your examples aaronpk
# 20:18 
aaronpk whoa
# 20:18 
aaronpk Zegnat++
# 20:18 
Loqi zegnat has 10 karma in this channel (178 overall)
# 20:19 
Zegnat Hmm. I am failing 3 or 7 currently, darn, I thought I covered it all
# 20:20 
Zegnat In my implementation of the algo either they both have `\n ` or both have `\n`, but your test has 2 different outputs
# 20:20 
aaronpk where does the space come from in #3?
# 20:21 
Zegnat The \n behind the BR, as all \n’s are replaced with spaces
# 20:21 
aaronpk oh yeah huh
# 20:21 
aaronpk hm
# 20:22 
aaronpk aha, last step: "sequences of spaces at the beginning and end of a line are removed"
# 20:22 
aaronpk the "\n " should be turned into "\n"
# 20:23 
aaronpk er, the "<br> " should be turned into "<br>" because that line has a trailing space
# 20:25 
Zegnat Alright, so any spaces before or after a \n should be removed... retesting
# 20:25 
Zegnat That means 7 is currently wrong in your test?
# 20:26 
aaronpk reviews
# 20:26 
aaronpk ah yep. the first step "all spaces and tabs immediately before and after a line break are ignored" should end up turning #7 into #3
# 20:26 
Zegnat Alright
# 20:28 
aaronpk updated
# 20:30 
Zegnat Throw in any HTML: https://wiki.zegnat.net/media/textparsing.html
# 20:31 
Zegnat Does a single DOM tree walk, not back and forth traversing of elements
# 20:33 
Zegnat Code may or may not be headache inducing, but mostly just implements https://www.w3.org/TR/css-text-3/#white-space-rules with the addition of inserting \n for BR and Ps.
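For readers who want a feel for the approach, here is a much smaller Python sketch of the same idea: collapse whitespace as if `white-space: normal` applied, and insert `\n` for BR and P. It is not the code from the linked page. It uses the event stream of `html.parser` instead of a DOM tree walk, the names `PlainText` and `plain_text` are hypothetical, and choices such as emitting the `\n` at the end of a P and trimming the final result are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class PlainText(HTMLParser):
    """Single pass over the markup, collecting text plus inserted line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.out.append("\n")  # a BR always yields one line feed

    def handle_endtag(self, tag):
        if tag == "p":
            self.out.append("\n")  # assumption: one line feed after each P

    def handle_data(self, data):
        # Tabs and source line breaks are treated as plain spaces.
        self.out.append(re.sub(r"[ \t\r\n]+", " ", data))


def plain_text(html: str) -> str:
    parser = PlainText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r" +", " ", text)       # collapse spaces across node boundaries
    text = re.sub(r" ?\n ?", "\n", text)  # drop spaces around inserted breaks
    return text.strip()                    # assumption: trim the final value


print(repr(plain_text("Hello<br>\n   World")))
```

Dropping the spaces next to an inserted `\n` after the fact plays the role of the look-behind discussed earlier: a source line break that follows a BR does not turn into a stray leading space on the next line.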
# 20:35 
Zegnat Also only interacts with nodes using methods from the DOM spec, so should be portable to other programming languages
# 20:52 
gRegorLove Zegnat++
# 20:52 
Loqi zegnat has 11 karma in this channel (179 overall)
# 21:25 
Zegnat Now if only PHP had full DOM support, hehe
[kevinmarks] joined the channel
# 21:32 
[kevinmarks] I have a feeling that html5lib has an xhtml/html toggle, but it is terribly documented
# 21:33 
[kevinmarks] I base this on memories of Sam Ruby writing chunks of it.
# 21:34 
Zegnat The serialiser seems to have a “use_trailing_solidus” option (https://github.com/html5lib/html5lib-python/blob/master/html5lib/serializer.py)
# 21:34 
Zegnat So hopefully KartikPrabhu can find where to set that :)
# 21:41 
[kevinmarks] There's an optional_tags opt on the serializer
# 21:44 
Zegnat I am only seeing omit_optional_tags, but that doesn’t touch self-closing tags at all. Is more about having things like empty HEAD elements added to the page.
# 21:44 
Zegnat Am I looking in the wrong file?
chrisaldrich and KartikPrabhu joined the channel
# 22:40 
KartikPrabhu Zegnat: mf2py uses Beautiful Soup to handle the HTML which internally uses html5lib. So I'll have to really track this down
# 22:40 
Zegnat Oh, oof
# 22:41 
KartikPrabhu no worries I have a fair idea of what is happening
# 22:41 
KartikPrabhu should be fixed by tomorrow I think
# 22:47 
Zegnat Woot!
# 22:47 
Zegnat KartikPrabhu++
# 22:47 
Loqi kartikprabhu has 10 karma in this channel (173 overall)
# 22:47 
Loqi does a happy dance!
[kevinmarks] and [miklb] joined the channel
# 23:49 
KartikPrabhu so it seems BeautifulSoup takes it upon itself to close the tags https://github.com/waylan/beautifulsoup/blob/master/bs4/element.py#L1106
# 23:52 
[kevinmarks] Also, html5lib has multiple tree builder and walker options, some of which are more xml like.
# 23:53 
KartikPrabhu yeah but that is not causing this <br> to <br/>
# 23:53 
KartikPrabhu BS is doing it explicitly
# 23:54 
[kevinmarks] Ah. So this is beautifulsoup being mid 2000s html fashionable.
# 23:54 
[kevinmarks] Ok, fork and fix and push it to them
# 23:55 
KartikPrabhu the dev code is not on github so will have to learn some new thingie
# 23:55 
KartikPrabhu will see if I can report it as bug or something
# 23:56 
[kevinmarks] Do we want to be the github target for beautiful soup issues?
# 23:56 
KartikPrabhu nope
# 23:57 
KartikPrabhu admin tax and all that
# 23:57 
[kevinmarks] This is interesting. I wonder if "Postelian drift" is a thing.
# 23:58 
Zegnat Could you drop the bs dependency? What is it used for?
# 23:58 
[kevinmarks] Where the definition of being conservative changes over time.
# 23:59 
KartikPrabhu Zegnat: it is a nice way to interact with the HTML tree with many nice inbuilt functions
# 23:59 
KartikPrabhu otherwise we'd have to rewrite most of what BS does anyway
# 23:59 
[kevinmarks] Beautiful Soup has very nice abstractions on html. Though modern DOM does too.