#microformats 2018-03-03

2018-03-03 UTC
[eddie], [kevinmarks], tantek, globbot, barpthewire, [tantek] and twisted` joined the channel
#
Zegnat
!tell aaronpk that mf ruby bug you filed (https://github.com/indieweb/microformats-ruby/issues/83) isn’t actually a bug per spec, right? <br> is expected to disappear when taking the textContent of an HTML element. Sounds like https://github.com/microformats/microformats2-parsing/issues/15 needs to be resolved first?
#
Loqi
Ok, I'll tell them that when I see them next
#
Loqi
[aaronpk] #83 <br> tags are not interpreted as whitespace when converting HTML to plaintext
#
Zegnat
goes off to read https://indieweb.org/note#Indieweb_whitespace_thinking as linked by tantek in #indieweb-dev
[kevinmarks] joined the channel
#
[kevinmarks]
It should turn into space of some kind
#
[kevinmarks]
Micro<br>formats should be "Micro\nformats" or "Micro formats" not "Microformats"
tantek joined the channel
#
Zegnat
I don’t disagree, [kevinmarks]. But that would need a spec update, as I see it, and I haven’t really seen a lot of proposed algorithms/solutions
#
Zegnat
innerText is an option, but would require CSS capabilities for the mf2 parser as non-CSS user agents just return textContent: https://html.spec.whatwg.org/#the-innertext-idl-attribute
#
Zegnat
So we would probably need something in between textContent and innerText, and I know no existing algorithms for that.
[colinwalker] joined the channel
#
aaronpk
I forgot about #15
#
Loqi
aaronpk: Zegnat left you a message 3 hours, 26 minutes ago: that mf ruby bug you filed (https://github.com/indieweb/microformats-ruby/issues/83) isn’t actually a bug per spec, right? <br> is expected to disappear when taking the textContent of an HTML element. Sounds like https://github.com/microformats/microformats2-parsing/issues/15 needs to be resolved first?
#
aaronpk
i don't believe that there isn't already an existing dom method that includes the white space properly
[mifga] joined the channel
#
Zegnat
I haven’t been able to find one
#
Zegnat
DOM spec has no concept of DOM elements inherently meaning some sort of line break, so you are limited to the HTML spec
#
Zegnat
And the HTML spec has innerText, which does require specific line breaks for br and p elements, but that algo also includes CSS specific things and is only for user agents that implement CSS
#
Zegnat
You may also run into questions of how many line breaks you want. E.g. if my raw HTML includes a linebreak after the BR element, the PHP parser preserves this one *and* replaces the BR in the content.value property: http://php.microformats.io/?id=20180303151224242
[xavierroy] joined the channel
#
KartikPrabhu
yeah getting textContent is tricky
#
aaronpk
oh gosh
#
aaronpk
well I still maintain that the plaintext version should match what you get when you copypaste from the browser and paste into a plaintext document
#
KartikPrabhu
but what if some elements are hidden with CSS?
#
KartikPrabhu
they don't show up on copy-paste
#
aaronpk
yeah browser copy-paste takes into account css
#
aaronpk
which I guess isn't practical for microformats
#
KartikPrabhu
or worse hidden by JS. we surely don't want mf2 parsers running random JS
#
Loqi
[kartikprabhu] #20 innertext in value-class-pattern needs clarification
#
KartikPrabhu
might be good to have some consensus and test-cases sorted out for this
#
aaronpk
i'm trying to remember back to the early HTML books I read because I feel like this was pretty well sorted out
#
KartikPrabhu
oh! any other reference would be great too
#
aaronpk
this is actually a pretty good explanation of what i'm thinking about https://medium.com/@patrickbrosset/when-does-white-space-matter-in-html-b90e8a7cdd33
#
aaronpk
* all spaces and tabs immediately before and after a line break are ignored
#
aaronpk
* all tab characters are handled as space characters
#
aaronpk
* line breaks are converted to spaces
#
aaronpk
* sequences of spaces at the beginning and end of a line are removed
#
aaronpk
(paraphrased)
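
The four paraphrased rules above can be sketched as a small string function. This is only an illustration, not any parser's actual code: the name `collapse_whitespace` is hypothetical, and the extra step that collapses runs of spaces is an assumption taken from the CSS `white-space: normal` model rather than from the quoted list.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Apply the paraphrased whitespace rules to a single run of text."""
    # Rule 1: spaces and tabs immediately before and after a line break are ignored.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Rule 2: all tab characters are handled as space characters.
    text = text.replace("\t", " ")
    # Rule 3: line breaks are converted to spaces.
    text = text.replace("\n", " ")
    # Assumption (not in the quoted list): consecutive spaces collapse into one.
    text = re.sub(r" {2,}", " ", text)
    # Rule 4: sequences of spaces at the beginning and end of a line are removed.
    return text.strip(" ")

print(collapse_whitespace("Last week\n      the   community\t celebrated "))
```

The input here has no `<br>` or block elements; where those should produce a real line break is a separate question from collapsing.
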
#
KartikPrabhu
yeah I am not even sure most of that can be done with any HTML parser
#
aaronpk
yeah I think it's an additional step of processing beyond the HTML parser
#
aaronpk
in browsers it's the rendering step
#
KartikPrabhu
no, I mean I don't think it can be done even over the HTML parser
#
aaronpk
why not?
#
KartikPrabhu
HTML parsers sometimes just get rid of spacing so "replacing" spaces and tabs would be impossible from the output of an HTML parser
#
KartikPrabhu
I'll have to look into it a bit more to be sure though
#
Zegnat
HTML parsers should not be doing that, that sounds like a bug, KartikPrabhu
#
KartikPrabhu
Zegnat: maybe, not sure. But some example test-cases would be great
#
Zegnat
Modern HTML parsers should parse text into DOM, and AFAIK this does not destroy whitespace. The whitespace just ends up in text nodes
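
A quick way to check the claim that whitespace survives into text nodes is to parse a fragment into a DOM and inspect the first text node. The sketch below uses Python's standard `xml.dom.minidom` as a stand-in; it is an XML parser, not an HTML5 parser, so it only works because the sample markup happens to be well-formed XML. The helper name `first_text_node` is hypothetical.

```python
from xml.dom.minidom import parseString

def first_text_node(markup: str) -> str:
    """Return the raw data of the first child of the root element."""
    document = parseString(markup)
    node = document.documentElement.firstChild
    # The first child is expected to be a text node holding the untouched source text.
    assert node.nodeType == node.TEXT_NODE
    return node.data

# The leading newline and indentation are kept verbatim in the text node.
print(repr(first_text_node("<p>\n   Hello\n</p>")))
```
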
#
aaronpk
I don't really have a good mental model of HTML parsers, but that does match my experience with the php mf2 parser since I keep seeing weird leading whitespace in text content
#
KartikPrabhu
if you are constructing the DOM then yes. But outside the browser (say, in Python or PHP code) I am not sure
#
Zegnat
In PHP DOMDocument is used to convert the HTML into a DOM tree, same there.
#
KartikPrabhu
again would be great to have some test-case so we can just check
#
Zegnat
There is no HTML parsing without DOM, at least not in HTML5
#
Zegnat
Should be in the test cases of your parser lib already
#
KartikPrabhu
I mean, we don't even have a solid algorithm for what is expected
#
aaronpk
I should just start making text files of test cases of how I expect this to work
#
KartikPrabhu
so not sure how accurate existing test cases are
#
Zegnat
Well, expected by spec is textContent, KartikPrabhu ;)
#
KartikPrabhu
aaronpk: yes!
#
KartikPrabhu
Zegnat: lol! we know that means close to nothing :P
#
Zegnat
Apart from the old VCP wiki pages, which are a lot more vague on what works
#
Zegnat
textContent is very clearly defined ... https://dom.spec.whatwg.org/#dom-node-textcontent
#
Zegnat
It is only innertext for VCP that is ill-defined
#
KartikPrabhu
does not understand that definition at all!
#
Zegnat
textContent of an HTML element (Element in the spec) is the concatenation of all the Text nodes it (or its nested elements) contains. Where text nodes are plain text strings between an opening and closing HTML tag.
#
Zegnat
Which seems to be what several parsers already return. PHP does not return this value, and goes against spec, because it tries to return a more logical value for the end-user.
#
Zegnat
The textContent of the P element in `<p>Hallo<br>Bye</p>` is “HalloBye”, because the P element contains 3 child nodes: text node “Hallo”, element “br”, text node “Bye”. The BR element is checked to see if it contains child nodes, it does not. Then all found text nodes are concatenated for the final value, thus “HalloBye”.
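
The concatenation described above can be reproduced with Python's standard `html.parser`. This is a minimal sketch, not how any mf2 parser is implemented: `html.parser` is a tokenizer rather than a DOM, and the names `TextCollector` and `text_content` are hypothetical. Collecting every text chunk in document order and ignoring all tags gives the same string as textContent for this example.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect text chunks in document order and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def text_content(html: str) -> str:
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # The BR element contributes nothing because it contains no text.
    return "".join(parser.chunks)

print(text_content("<p>Hallo<br>Bye</p>"))  # HalloBye
```
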
#
aaronpk
so browsers don't render `<p>Hallo<br>Bye</p>` as HalloBye so when is textContent actually used?
#
KartikPrabhu
aah^ there you go. Can't replace <br> using parser output then
#
Zegnat
I am not sure if textContent is ever used by browsers aaronpk. They prefer to use innerText, I imagine, which does add the linebreak.
#
KartikPrabhu
one can replace it before just like <img> is replaced by alt or src
#
Zegnat
That is one option, KartikPrabhu
#
Zegnat
What specifically are you saying is wrong there, aaronpk?
#
Zegnat
goes to write what he expects the output to be by hand
#
aaronpk
content.value and summary should be "Last week the microformats.org community celebrated its...", collapsing the newline and following several space characters into a single space
#
Zegnat
Not according to current spec, but, yes, I do agree
#
aaronpk
i'm not talking about the spec, i'm talking about what I expect as someone using the parser
#
Zegnat
Yeah, sorry, I for a second thought you were saying the test case to be wrong :)
#
aaronpk
well ultimately I am
#
Zegnat
I agree with you that I would expect normalised whitespace
#
aaronpk
but that's because the spec is wrong
#
Zegnat
Hmm, whatwg also added a note to the innerText method that it may also be used for the Selection API. So I guess browsers do use that algo for copy-pasting.
#
Zegnat
So basically: textContent is the DOM method for getting a string of all text nodes. innerText is the HTML element method for getting a normalised text only rendition of what the user agent renders
#
Zegnat
But full innerText can’t be supported without CSS support, so we might have to cherry pick some of its steps?
#
aaronpk
assuming white-space: normal seems reasonable
#
aaronpk
does mf2 say anything about \n vs \r\n right now? the parsers seem to handle that differently too
#
Zegnat
Yes, I think KartikPrabhu brought that up before too
#
Zegnat
The spec should be indifferent to it, that is, return whatever is in the source document
#
aaronpk
that makes sense
#
Zegnat
(Mostly because textContent is indifferent and returns verbatim what is in the text node)
#
Zegnat
Note that the HTML spec only adds \n (line feed characters). And if you use innerText as a setter, it will specifically *skip* the \r in \r\n (https://html.spec.whatwg.org/#the-innertext-idl-attribute:dom-innertext-3)
#
aaronpk
wow all three parsers have completely different results for <div class="e-content p-name"><p>Hello</p><p>World</p></div>
#
aaronpk
the ruby one is what I would expect
#
aaronpk
actually not quite, it shouldn't be adding a newline in the html version 😂
#
aaronpk
it seems sensible that the parser should preserve the originally authored html in content.html, right?
#
Zegnat
.. ruby is telling me “The change you wanted was rejected.” What sort of error is that?
#
Zegnat
yes, content.html should imo be exactly as authored
#
Zegnat
double checks the spec
#
Zegnat
spec says to use innerHTML, and then links to a non-existing fragment in the HTML spec. *sigh*
KartikPrabhu joined the channel
#
Zegnat
so content.html should match https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments of the element with the e-content class
#
KartikPrabhu
so yeah we do need examples and consensus on what is expected
#
aaronpk
i'm making a repo with really simple examples
#
KartikPrabhu
aaronpk++ yas!
#
Loqi
aaronpk has 12 karma in this channel (1575 overall)
#
aaronpk
apologies in advance if these seem contrived, but i'm trying to make them really easy to read
#
KartikPrabhu
yeah I'll take anything to start getting this decided
#
Zegnat
Just quickly looking through the content.html algo, aaronpk, it does look like any and all whitespace should be kept as originally authored. No normalisation of whitespace or \r\n.
#
KartikPrabhu
if some parser implementors and users like Bridgy and Monocle or something agree with those then they can be implemented
#
Zegnat
The only thing mf2 then does is strip any whitespace at the start and end of content.html ... which is OK, maybe? Not sure I see the point in doing that.
#
aaronpk
oh does it say to do that?
#
Zegnat
“ html: the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm, with leading/trailing whitespace removed. ”
#
KartikPrabhu
it does when parsing for properties
#
aaronpk
I just wrote that test, and wasn't sure what I was expecting
#
Loqi
[Tantek Çelik] microformats2 parsing specification
#
aaronpk
certainly for non-html values it makes sense to strip
#
aaronpk
but I was on the fence about it for html values
#
Zegnat
I am on the fence too. E.g. if I use e-content on a PRE element, I would not like having whitespace stripped from the start.
#
KartikPrabhu
yeah so if the element starts with a <br/> which is replaced by a space should it be stripped?
#
KartikPrabhu
or what Zegnat said
#
aaronpk
well <pre> elements are a different problem I think
#
aaronpk
because the knowledge of that element doesn't make it through to the parsed result
#
aaronpk
the leading <br> is interesting tho
#
Zegnat
No, but if you say the content.html contains stuff as authored, I am not sure I would expect the leading whitespace to be gone, aaronpk. But honestly any reason I can come up with for keeping the whitespace is pretty contrived.
#
Zegnat
I just don’t see a reason why we should be stripping it either.
#
aaronpk
since it's written into the spec there must be a reason
#
aaronpk
as opposed to a lot of this other stuff which is only implicitly in the spec
#
aaronpk
someone took the time to write that sentence though, so i'm curious what the use case was
#
Zegnat
I wonder that for a lot. Especially some of those faux-CSS selectors for implied properties ;)
#
sknebel
Zegnat: if you care about the pre-ness of content, you'd have to include the pre inside the e-content though
#
Zegnat
Yeah, my example was contrived, sknebel. Just the only case where the leading whitespace is actually important
#
KartikPrabhu
dang! nice aaronpk
#
sknebel
(also, when quoting from the test suite, obligatory warning that it isn't always correct even to current specified behavior)
#
Zegnat
aaronpk, I see you decided to keep leading and trailing whitespace for html values? :)
#
aaronpk
yes that was my initial inclination before you said the spec says otherwise
#
Zegnat
It would be my preference too, if we are updating the textContent of the spec anyway
#
Zegnat
These look like a good start to me, aaronpk!
#
aaronpk
next up i'm writing a script to compare the results of all the parsers
#
Zegnat
brainstorms a bit on how to codify this HTML-input-to-plain-text-value-output as a generalised DOM based algo
#
KartikPrabhu
is figuring out how the hell these tests work in mf2py code
#
Zegnat
There must be a better way than “replace the BR element with a textNode containing a single LF, unless the next node in the tree is a textNode starting with an LF”...
#
KartikPrabhu
what's LF?
#
Zegnat
CR (carriage return) is \r, LF (line feed) is \n
#
KartikPrabhu
oh i thought both were called CR
#
Zegnat
Every day is an opportunity to learn something new :D
#
KartikPrabhu
checking for surrounding elements is going to get tough
#
Zegnat
Do you have access to the DOM API from your HTML parser, or not, KartikPrabhu?
#
KartikPrabhu
might slow down parsing too (not too sure)
#
KartikPrabhu
Zegnat: yup I do
#
KartikPrabhu
that's how I currently do the "replace <img> by alt and src" but that does not include checking surrounding stuff
#
Zegnat
Yeah, I would rather have it go as a look-behind. It is easier to, while walking the node tree, keep track of “did I just replace a BR element with \n?”, than it is to check what is coming next.
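
The look-behind idea can be sketched as follows, again with Python's standard `html.parser` instead of a real DOM walk. A flag records that a BR was just turned into `\n`; if the next text chunk itself starts with a line feed, that one line feed is dropped so the break is not doubled. The names `BrAwareText` and `br_aware_text` are hypothetical, and dropping exactly one leading LF is an assumption based on the rule quoted earlier in this discussion.

```python
from html.parser import HTMLParser

class BrAwareText(HTMLParser):
    """Emit "\n" for each BR, remembering it so the following text can be adjusted."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.just_broke = False  # look-behind flag: last output was a BR line feed

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.chunks.append("\n")
            self.just_broke = True

    def handle_data(self, data):
        # Assumption: a source LF right after a BR duplicates the break, so skip one.
        if self.just_broke and data.startswith("\n"):
            data = data[1:]
        if data:
            self.just_broke = False
        self.chunks.append(data)

def br_aware_text(html: str) -> str:
    parser = BrAwareText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

print(repr(br_aware_text("one<br>\ntwo")))  # 'one\ntwo'
```
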
#
Zegnat
I do not recognise a lot of the DOM api in that function, haha
#
Loqi
Zegnat: lol
#
KartikPrabhu
Zegnat: yeah it is built on a python lib to handle HTML but I think I should be able to do most DOM things with it
#
KartikPrabhu
does anyone here know how to run the tests in the mf2py code base?
#
Zegnat
sknebel (ping) maybe? He at least has a lot more Python experience than me.
#
sknebel
one sec
#
KartikPrabhu
sknebel: thanks!
#
KartikPrabhu
has no idea how any of that is working
#
Zegnat
aaronpk, how would you expect multiple breaks in the raw HTML (\n\n\n) to be represented in the plaintext version? Collapsed into 1?
#
KartikPrabhu
Zegnat: I would not collapse it
#
aaronpk
no I don't think so
#
aaronpk
oh wait
#
sknebel
KartikPrabhu: "python setup.py test" ?
#
aaronpk
so far this has made the most sense to me: https://aaronparecki.com/2018/03/03/3/
#
Loqi
[Aaron Parecki] When does white space matter in HTML? – Patrick Brosset
#
aaronpk
so: "line breaks are converted to spaces" then "sequences of spaces at the beginning and end of a line are removed"
#
sknebel
(untested, since I don't seem to have a checkout of it on this machine, but the setup.py seems to have the necessary bits)
#
Zegnat
But those rules leave no \n in place, aaronpk
nitot joined the channel
#
KartikPrabhu
sknebel: ok will try
#
aaronpk
oh yeah, my #5 is wrong! html newlines are not significant in plaintext so they should be removed
#
KartikPrabhu
sknebel: ok something did happen :P
#
aaronpk
pushed a fix
#
Zegnat
Ah, thanks :P
#
Zegnat
Yes, I was looking at test 5, haha
#
KartikPrabhu
lol! mf2py tests are failing on whitespace and LF :P
#
KartikPrabhu
sknebel++ thanks
#
Loqi
sknebel has 6 karma in this channel (87 overall)
#
Zegnat
So the proposal basically turns into us applying the CSS white space processing rules, followed by inserting \n’s in specific places depending on the HTML element?
barpthewire joined the channel
#
aaronpk
alright done
#
aaronpk
oh forgot to add the html in the table
#
Zegnat
Ugh, locked my entire firefox because of an errant while loop. Wasn’t Firefox supposed to isolate tabs from each other these days?
#
aaronpk
well that was fun
#
aaronpk
it's a little bit sad how little the parsers agree on
#
Zegnat
They seem to agree on content.html, mostly. Just Python defaulting to close void elements
#
aaronpk
except for </p><p>
#
Zegnat
Ah, right, that’s an odd one. I wonder where that is coming from, in the html no less
#
Zegnat
tries to figure out if <br> or <br/> is expected
#
aaronpk
looks forward to tantek catching up on this discussion
KartikPrabhu joined the channel
#
KartikPrabhu
ok none of FF, Chrome, and Opera change <br> to <br/> so it is the html5lib parser being overzealous in mf2py
#
KartikPrabhu
should that count as a bug though?
[miklb] and KartikPrabhu joined the channel
#
aaronpk
I guess it's mostly harmless but still seems like it shouldn't be modifying the html
#
KartikPrabhu
all HTML parsers modify the HTML, especially if it is malformed. but this is a bit too much I agree
#
aaronpk
I don't have strong feelings about it though because there is no visible difference between the two
#
KartikPrabhu
yeah. But if one is writing tests then maybe both should be included to prevent error/failure?
#
aaronpk
I can accept both in my test chart
#
KartikPrabhu
ok will wait for additional input from tantek about this whole thing
#
KartikPrabhu
aaronpk: do you also want to think about the "\t" tab character expectations?
#
Zegnat
Tabs are collapsed in all the other whitespace collapsing, per the rules aaronpk linked to
#
KartikPrabhu
even intermediate tabs ? like "this is some \t\t\t text"
#
KartikPrabhu
interesting.
#
Zegnat
Also, if we go for a more spec-y implementation, whitespace collapsing per CSS Text 3 says that “Every tab is converted to a space”. (https://www.w3.org/TR/css-text-3/#white-space-phase-1)
tantek joined the channel
#
Zegnat
Also, I just checked, void elements should *not* be self closing per the HTML5 spec
#
Zegnat
So technically, Python is doing it wrong.
#
KartikPrabhu
Zegnat: yeah html5lib does that, but it is also the closest to browser behaviour for malformed HTML
#
KartikPrabhu
so not sure what to do about that
#
KartikPrabhu
is sure the same things happen for <hr> things
#
Zegnat
It should be a setting in html5lib, KartikPrabhu
#
Zegnat
Option `use_trailing_solidus`
#
Zegnat
(And I assume that’s what is being used?)
#
KartikPrabhu
hmm it is supposed to default to False but that is not what's going on
#
KartikPrabhu
Zegnat: thanks for the link will dig around more
#
zegnat
edited /microformats2-parsing (+1) "Serialisation algo is moved to the parsing page of the HTML spec"
#
Zegnat
Fixed the HTML string algo link ^^^ That’s the spec that is saying not to use the closing slash, if you are interested in the technicalities, KartikPrabhu
#
gRegorLove
catches up on whitespace conversation
#
KartikPrabhu
gRegorLove: yes! would like PHP input on that whole thing
#
gRegorLove
I copied the microformat-shiv code for the php-mf2 innerText method, giving better results than PHP's DOMNode::textContent()
#
gRegorLove
It was a while ago, I'd have to find the github issue :)
#
aaronpk
oh yeah I forgot to add the node results
#
aaronpk
is that online somewhere?
#
KartikPrabhu
should be here https://node.microformats.io/ as well but that page does not load
#
Zegnat
If you want a direct JSON output from node mf2, you can also use https://sturdy-backbone.glitch.me/mf2/?url=https://aaronpk.com/
#
aaronpk
I need direct json output given html
#
Zegnat
But I don’t think that one takes plain text input, so your tests need to be on public URLs
#
Loqi
Hello World
#
aaronpk
ah good idea
#
gRegorLove
php-mf2 context on its innerText() method: https://github.com/indieweb/php-mf2/pull/82
#
Loqi
[gRegorLove] #82 Implemented @glennjones "innerText" parsing for better parsed whitespace
#
KartikPrabhu
gRegorLove: great! It would be real nice to have a consistent way of doing this across mf2 parsers
#
gRegorLove
I don't know all the ins and outs well enough to have a strong opinion
#
Zegnat
The ins and outs are that mf2 really doesn’t define anything itself, just that it uses DOM textContent (all text nodes, no other data). So you only need to have an opinion on what sort of string you personally would expect :)
#
aaronpk
the point of this exercise for me is to look at it from the plain input and output regardless of what the spec currently says and see if that makes sense
#
gRegorLove
aaronpk's table looks pretty good at a glance, for the expected values.
#
Zegnat
mumbles something about how the CSS text spec isn’t written against a DOM tree
#
gRegorLove
For 7 in the table, you're collapsing multiple spaces into one, aaronpk?
#
aaronpk
these rules made the most sense to me https://aaronparecki.com/2018/03/03/3/
#
Loqi
[Aaron Parecki] When does white space matter in HTML? – Patrick Brosset
#
gRegorLove
Zegnat, Ah, gotcha. I meant textContent and innerText specifically. But yeah, aaronpk's expected output LGTM.
#
Zegnat
aaronpk, I am trying to “port” the algo as described here into my function: https://www.w3.org/TR/css-text-3/#white-space-phase-1
#
Zegnat
That seems to match the one you bookmarked, except it is a web spec instead of a Medium article
#
Zegnat
(assumption that white-space is set to normal, as you stated previously)
#
aaronpk
https://pin13.net/mf2/whitespace.html updated with node and go parsers
#
sknebel
(node parser on sturdy-backbone isn't quite stock, but the changes shouldn't impact this comparison)
#
sknebel
(I had to fix something to make sturdy backbone work, but the PR never got merged)
#
aaronpk
this library is giving me doubts about whether the mf2 parser's plaintext version of e-content will ever be useful
#
Zegnat
Hmm, yeah
#
aaronpk
a "good" plaintext representation of <ul><li> should actually convert that to a plaintext bullet list using *
#
aaronpk
but that has nothing to do with html rules
[tantek] joined the channel
#
aaronpk
(i'm imagining the use case of rendering plaintext versions of posts in a reader that doesn't render html)
#
aaronpk
that conversion is definitely not something i'd expect the mf2 parser to do, but I might expect XRay to do it
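
For illustration only, the kind of list conversion described above might look like the sketch below. It is not something the mf2 spec or any parser mentioned here does, and it is not XRay's code; the names `BulletList` and `bullets` are hypothetical, and nested lists are deliberately not handled.

```python
from html.parser import HTMLParser

class BulletList(HTMLParser):
    """Collect the text of each LI element; text outside list items is ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items.append("")
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data

def bullets(html: str) -> str:
    parser = BulletList()
    parser.feed(html)
    parser.close()
    # Collapse whitespace inside each item and prefix it with "* ".
    return "\n".join("* " + " ".join(item.split()) for item in parser.items)

print(bullets("<ul><li>One</li>\n<li> Two  items</li></ul>"))
```
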
#
sknebel
not a fan of specifications that say "do whatever library X does", but I can see why people do that ;)
#
Zegnat
We need some sort of HTML-to-plain-text in mf2 anyway, for p- parsing. But definitely a valid question whether it makes sense to then provide this plain text for e- as well or not
#
aaronpk
true, but most of the time the p- rules are used for simple one-line strings
#
Zegnat
Definitely true for me
#
aaronpk
did we land anywhere on whether I should treat <br> as equal to <br/> for the purposes of comparing whether the test is successful?
#
aaronpk
just updated https://pin13.net/mf2/whitespace.html with treating them as the same, since it gives a better picture of the current state overall
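
Treating `<br>` and `<br/>` as the same when comparing results could be done by stripping the trailing slash from void elements before the comparison. The sketch below is an assumption about how such a check might be written, not the code behind the test page; `normalize_void` and `same_html` are hypothetical names, and the regex is only meant for short test fixtures, not arbitrary HTML.

```python
import re

# Void elements that some serialisers write as <br/> instead of <br>.
VOID_SOLIDUS = re.compile(
    r"<(area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr)\b([^<>]*?)\s*/>",
    re.IGNORECASE,
)

def normalize_void(html: str) -> str:
    """Rewrite <br/> and <br /> style tags to the slash-free form."""
    return VOID_SOLIDUS.sub(r"<\1\2>", html)

def same_html(expected: str, actual: str) -> bool:
    """Compare two HTML strings, ignoring trailing slashes on void elements."""
    return normalize_void(expected) == normalize_void(actual)

print(same_html("<p>a<br>b</p>", "<p>a<br/>b</p>"))  # True
```
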
#
gRegorLove
I think <br> and <br/> should be treated as equal
#
tantek
reads some scrollback
#
aaronpk
wb tantek
#
tantek
uh yes, HTML5 treats them the same. why would you treat them differently?!?
#
aaronpk
depends on whether you take the definition of the mf2 parser to mean the parser should not modify the authored html or whether transformations like <br> -> <br/> are okay for the parser to do
#
Zegnat
According to spec, <br/> is wrong
#
Zegnat
I think I said that somewhere
#
sknebel
Zegnat: nope, it is allowed
#
Zegnat
No, according to mf spec we need to do HTML serialisation per HTML spec, which does not add the /
#
tantek
Zegnat, which spec? Browsers treat them the same :P
#
sknebel
but allows it as valid html
#
Zegnat
The mf2 spec says to use https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments for the content of the html property on e- parsing
#
Zegnat
Which specifically does not add the /
#
tantek
oh for *generating* HTML
#
tantek
yeah that's fine
#
aaronpk
not even generating, just consuming existing html
#
aaronpk
consuming and re-generating I guess
#
tantek
when consuming you treat them the same
#
Zegnat
Yeah, we are talking about outputting from the mf2 parser
#
Zegnat
Which, according to the spec, should not output <br/> but always <br>
#
tantek
yeah, better to be consistent there
#
tantek
however if you're consuming HTML from the mf2 JSON, you must treat them the same
#
tantek
you don't get to "sorta" parse HTML
#
aaronpk
haha okay so what should my test do? reject the result if the mf2 parser adds a / ?
#
Loqi
aaronpk: lol
#
Zegnat
For your purpose however, aaronpk, I would treat them as the same. And we should file a bug on Python to not output slashes on void elements
#
tantek
I for one generate <br/> in my code because I'm using various XML functions on the back end to process my own content
#
aaronpk
for my purposes the difference isn't significant so i'm tempted to accept both results
#
tantek
if "accept" means parse/consume then yes
#
aaronpk
"accept" means "allow the parser to modify the html and still treat as a valid result" in this case
#
tantek
"modify the html"?
#
aaronpk
some parsers change the <br> in <div class="e-content"><br></div> to <br/>
#
Zegnat
Oh, Go and Node do it too? so multiple bug reports :(
#
tantek
that's both wrong, and something that consumers of mf2 JSON must be able to handle
#
tantek
this seems not very relevant to anything user-visible
#
aaronpk
right, which is why I'm okay accepting both for this test matrix
chrisaldrich joined the channel
#
Zegnat
I think I can now walk the DOM tree of a node and output the exact expected plain text from all your examples aaronpk
#
aaronpk
Zegnat++
#
Loqi
zegnat has 10 karma in this channel (178 overall)
#
Zegnat
Hmm. I am failing 3 or 7 currently, darn, I thought I covered it all
#
Zegnat
In my implementation of the algo either they both have `\n ` or both have `\n`, but your test has 2 different outputs
#
aaronpk
where does the space come from in #3?
#
Zegnat
The \n behind the BR, as all \n’s are replaced with spaces
#
aaronpk
oh yeah huh
#
aaronpk
aha, last step: "sequences of spaces at the beginning and end of a line are removed"
#
aaronpk
the "\n " should be turned into "\n"
#
aaronpk
er, the "<br> " should be turned into "<br>" because that line has a trailing space
#
Zegnat
Alright, so any spaces before or after a \n should be removed... retesting
#
Zegnat
That means 7 is currently wrong in your test?
#
aaronpk
reviews
#
aaronpk
ah yep. the first step "all spaces and tabs immediately before and after a line break are ignored" should end up turning #7 into #3
#
Zegnat
Alright
#
aaronpk
updated
#
Zegnat
Does a single DOM tree walk, not back and forth traversing of elements
#
Zegnat
Code may or may not be headache inducing, but mostly just implements https://www.w3.org/TR/css-text-3/#white-space-rules with the addition of inserting \n for BR and Ps.
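
The combination described above (collapse whitespace as under `white-space: normal`, then add `\n` for BR and P) can be approximated in a few lines. This is a rough sketch, not the code being discussed: it uses Python's standard `html.parser` instead of DOM methods, the names are hypothetical, and three choices are assumptions: a P boundary yields a single `\n`, adjacent P boundaries do not stack, and line breaks at the very start and end of the result are dropped.

```python
import re
from html.parser import HTMLParser

BR = object()    # marker for a hard line break
PARA = object()  # marker for a paragraph boundary

class PartCollector(HTMLParser):
    """Record text chunks and break markers in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append(BR)
        elif tag == "p":
            self.parts.append(PARA)

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append(PARA)

    def handle_data(self, data):
        self.parts.append(data)

def plain_text(html: str) -> str:
    parser = PartCollector()
    parser.feed(html)
    parser.close()
    lines = [""]
    for part in parser.parts:
        if part is BR:
            lines.append("")  # a BR always starts a new line
        elif part is PARA:
            if lines[-1].strip():  # assumption: boundaries only break after content
                lines.append("")
        else:
            lines[-1] += part
    # Collapse source whitespace within each line, then trim the line's edges.
    collapsed = [re.sub(r"[ \t\r\n]+", " ", line).strip(" ") for line in lines]
    # Assumption: leading and trailing line breaks are not kept.
    return "\n".join(collapsed).strip("\n")

print(repr(plain_text("<p>Hello</p><p>World</p>")))
```

Whether a leading `<br>` should survive, and whether paragraphs deserve one line break or two, are left open in the discussion; the sketch picks the simplest option for each.
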
#
Zegnat
Also only interacts with nodes using methods from the DOM spec, so should be portable to other programming languages
#
gRegorLove
Zegnat++
#
Loqi
zegnat has 11 karma in this channel (179 overall)
#
Zegnat
Now if only PHP had full DOM support, hehe
[kevinmarks] joined the channel
#
[kevinmarks]
I have a feeling that html5lib has an xhtml/html toggle, but it is terribly documented
#
[kevinmarks]
I base this on memories of Sam Ruby writing chunks of it.
#
Zegnat
The serialiser seems to have a “use_trailing_solidus” option (https://github.com/html5lib/html5lib-python/blob/master/html5lib/serializer.py)
#
Zegnat
So hopefully KartikPrabhu can find where to set that :)
#
[kevinmarks]
There's an optional_tags opt on the serializer
#
Zegnat
I am only seeing omit_optional_tags, but that doesn’t touch self-closing tags at all. Is more about having things like empty HEAD elements added to the page.
#
Zegnat
Am I looking in the wrong file?
chrisaldrich and KartikPrabhu joined the channel
#
KartikPrabhu
Zegnat: mf2py uses Beautiful Soup to handle the HTML which internally uses html5lib. So I'll really have to track this down
#
Zegnat
Oh, oof
#
KartikPrabhu
no worries I have a fair idea of what is happening
#
KartikPrabhu
should be fixed by tomorrow I think
#
Zegnat
Woot!
#
Zegnat
KartikPrabhu++
#
Loqi
kartikprabhu has 10 karma in this channel (173 overall)
#
Loqi
does a happy dance!
[kevinmarks] and [miklb] joined the channel
#
KartikPrabhu
so it seems BeautifulSoup takes it upon itself to close the tags https://github.com/waylan/beautifulsoup/blob/master/bs4/element.py#L1106
#
[kevinmarks]
Also, html5lib has multiple tree builder and walker options, some of which are more xml like.
#
KartikPrabhu
yeah but that is not causing this <br> to <br/>
#
KartikPrabhu
BS is doing it explicitly
#
[kevinmarks]
Ah. So this is beautifulsoup being mid 2000s html fashionable.
#
[kevinmarks]
Ok, fork and fix and push it to them
#
KartikPrabhu
the dev code is not on github so will have to learn some new thingie
#
KartikPrabhu
will see if I can report it as bug or something
#
[kevinmarks]
Do we want to be the github target for beautiful soup issues?
#
KartikPrabhu
admin tax and all that
#
[kevinmarks]
This is interesting. I wonder if "Postelian drift" is a thing.
#
Zegnat
Could you drop the bs dependency? What is it used for?
#
[kevinmarks]
Where the definition of being conservative changes over time.
#
KartikPrabhu
Zegnat: it is a nice way to interact with the HTML tree with many nice inbuilt functions
#
KartikPrabhu
otherwise we'd have to rewrite most of what BS does anyway
#
[kevinmarks]
Beautiful Soup has very nice abstractions on html. Though modern DOM does too.