#microformats 2018-04-02

2018-04-02 UTC
j12t, KevinMarks_, KartikPrabhu and tantek joined the channel
#
Zegnat
Hmm. I should make PHPUnit pull aaronpk's whitespace tests repo. So when tests for things like PRE get added they are immediately testable.
#
Zegnat
<pre> is annoying because then the walker needs to keep track of when it exits the element. Currently it doesn't need to know that. Going to become stateful I guess.
#
KartikPrabhu
i'm having a lot of trouble with <pre>
tantek joined the channel
#
Zegnat
Today is my travelling-home day. Will have a good long think in the car.
#
KartikPrabhu
drive safe
#
Zegnat
Planning to pull https://github.com/aaronpk/microformats-whitespace-tests for automatic testing. So if anyone has time to write some tests with <pre> there, that would help a lot!
#
Loqi
[aaronpk] microformats-whitespace-tests
#
Zegnat
I don't drive. Being driven. So all the more time to think about algorithms
tantek, KevinMarks, j12t, barpthewire, KevinMarks_, davy__, sebsel and webchat254 joined the channel
#
aaronpk
I can't figure out how I ran the ruby parser locally
#
aaronpk
spoke too soon, apparently I forgot I made a little test server and left it not where I expected it
#
aaronpk
i'm gonna have some new whitespace tests for you Zegnat
#
Zegnat
Nice aaronpk!
Garbee joined the channel
#
Zegnat
I am still mulling over how to do <pre>. It feels like the only way to do it is literally implement the entire WHATWG innerText, and I really do not want that.
#
aaronpk
for the name value
#
aaronpk
KartikPrabhu: is there a python parser more up to date than https://python.microformats.io/ ?
#
aaronpk
same api?
#
Zegnat
Oh, that I don’t know.
#
aaronpk
looks like it expects "content" instead of "docc"
#
KartikPrabhu
aaronpk: my parser only does HTML input
#
aaronpk
that's fine
#
aaronpk
I just need it to return json
#
aaronpk
or the json inside of <pre><code>
#
Zegnat
I think the real problem with <pre> will be when people do things like <pre>one<br>\n<p>two</p>\n\nthree</pre>, or other nested elements.
#
KartikPrabhu
hmm it will inside the page.
#
Zegnat
The algorithm can probably be pretty simple if I didn’t need to think about whether nested elements were also adding linebreaks to the plain text output.
#
aaronpk
oh yeah I should add some tests for that too
#
aaronpk
KartikPrabhu: argh you have a csrf check on the form!
#
KartikPrabhu
aah you want to automate it :P
#
aaronpk
yes and I don't really want to make a get request before each post
#
KartikPrabhu
now doesn't even remember how he set this up
#
aaronpk
oh gosh, I just realized that the browser adds \n\n for <p> tags
#
aaronpk
which I guess makes sense
#
Zegnat
Yes. They also do linebreak collapsing between P tags.
#
aaronpk
I had written the expected result with just one \n
#
aaronpk
should I update it?
#
Zegnat
It depends what the goal of the mf2 parser is to be.
#
aaronpk
I assumed the goal was to return an accurate plaintext representation of the content
#
Zegnat
Browsers do a much more visual representation than what we currently do. I care more about getting a somewhat normalised text for use in p-names and what not, rather than visually matching.
#
Zegnat
Browsers also have special ways for lists and tables. That’s all in the WHATWG innerText algo. But that felt like a stretch to get people to implement.
#
aaronpk
i'm thinking of a situation like rendering text content in an ios app, where you have to give it plain text, if the app can't rely on the mf2 parser's plaintext representation then it will have to do that html->text conversion itself
#
aaronpk
for most properties it doesn't matter, since most of the time the p- values are expected to have only one line of text anyway (author name, post title, etc)
#
Zegnat
https://html.spec.whatwg.org/#dom-innertext - is not fun... I am not even sure I understand all of the steps, because I am not well acquainted with CSS specs.
#
aaronpk
but for the h-entry content it could definitely have multiple lines with significant whitespace, and I could even see a table in there
#
Zegnat
Sounds like you would want the mf2 parsers to support innerText then.
#
aaronpk
I think so, because if they don't, then the h-entry content.value is most likely not useful
#
aaronpk
okay in the meantime here are some new ones https://pin13.net/mf2/whitespace.html
#
Zegnat
Hmm, 11 strips significant whitespace from the end of the <pre>?
#
aaronpk
oh yeah it shouldn't
#
aaronpk
oh right, it should for p-name
#
aaronpk
but not for the content.value
#
Zegnat
Why would you treat those differently?
#
aaronpk
nevermind, it should for e-content too: "removing all leading/trailing whitespace"
#
Zegnat
Yes, but if you decide that whitespace inside PRE is significant, it really shouldn’t be stripping it, right?
#
Zegnat
I should strip the \n and 2 spaces after the </pre>, but not the \n and 4 spaces after “three”, surely?
#
aaronpk
for the html version yeah
#
Loqi
yea!
#
aaronpk
keep the "\n " inside the </pre>
#
Zegnat
I might have a go at actual WHATWG innerText later tonight. Just going to need to keep very careful track of all the CSS assumptions I make.
#
KartikPrabhu
yeah <pre> is a pain! I tried a lot yesterday to make some sane function to handle <pre> but all failed :(
#
Zegnat
All of that text handling is handled by a single step in the WHATWG spec, which is mind-boggling: https://html.spec.whatwg.org/#the-innertext-idl-attribute:text
#
Zegnat
Basically, if you do not have a renderer and a concept of a “CSS text box” you need to start making assumptions about text right from the start.
j12t and KartikPrabhu joined the channel
#
aaronpk
https://pin13.net/mf2/whitespace.html is updated with the new python results!
#
KartikPrabhu
aaronpk: might want to keep the experimental python separate
#
KartikPrabhu
the new one gives you version number in the debug so should be 1.1.1
#
aaronpk
eh, this is for tracking progress on the new handling so it's fine
#
aaronpk
it's using the 0.4.4-alpha php one too
#
KartikPrabhu
wow! how is Ruby getting the pre correct!?
j12t joined the channel
#
KartikPrabhu
the expected 11 looks wrong since every whitespace in <pre> should be preserved
#
aaronpk
that's what Zegnat was saying, but the last step of p- and e- parsing is to trim whitespace
#
KartikPrabhu
yes but in 11 the e-content and p-name are not on the <pre>
#
Zegnat
The algo replaces all the steps anyway, so it would simply have to include not-trimming <pre>.
#
KartikPrabhu
so Ruby seems to be doing correct on 11 too
#
KartikPrabhu
including final trim
[kevinmarks] joined the channel
#
aaronpk
ok i'll add the whitespace in the <pre> to the expected result
#
aaronpk
refresh results
#
KartikPrabhu
aaronpk: I actually was thinking of the \n before the one
#
KartikPrabhu
because it really is "<pre>\n one [...]"
#
aaronpk
that's in there
#
KartikPrabhu
aah I see yes. Ruby is putting two \ns
#
KartikPrabhu
hmmm mf2py removes that \n from HTML too
#
aaronpk
heh that's the one php gets right
#
KartikPrabhu
yeah. I am using a secret hidden method for that from BeautifulSoup so might not be reliable
j12t, [chrisaldrich], tantek, [cleverdevil] and [miklb] joined the channel
#
KartikPrabhu
<sigh> BeautifulSoup converts <pre>\n\ntext\n</pre> to <pre>\ntext\n</pre>
#
KartikPrabhu
in fact always removes leading \n
#
KartikPrabhu
*one* leading \n
#
KartikPrabhu
looks like mf2py is never going to pass the <pre> whitespace tests
#
sknebel
KartikPrabhu: that is correct behavior per html5 spec
#
KartikPrabhu
oh! it is
#
KartikPrabhu
sknebel: so #11 here is incorrect https://pin13.net/mf2/whitespace.html ?
#
KartikPrabhu
for the content.html property inside the<pre>
KevinMarks and [snarfed] joined the channel
#
KartikPrabhu
sknebel: if you can confirm then python will win over php again ;)
kaushalmodi joined the channel
#
sknebel
KartikPrabhu: "In the HTML syntax, a leading newline character immediately following the pre element start tag is stripped." (https://html.spec.whatwg.org/multipage/grouping-content.html#the-pre-element)
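The rule quoted above can be sketched in a few lines of Python. This is only an illustration: the function name `strip_pre_leading_newline` is hypothetical, and it assumes newlines in the input have already been normalised to "\n", as an HTML parser would do before tree construction.

```python
def strip_pre_leading_newline(inner: str) -> str:
    """Drop at most one newline directly after the <pre> start tag.

    Assumes `inner` is the raw source text between <pre> and </pre>
    and that line endings are already normalised to "\n".
    """
    if inner.startswith("\n"):
        return inner[1:]  # only the first newline is removed
    return inner


# Mirrors the BeautifulSoup behaviour reported earlier in the log:
# <pre>\n\ntext\n</pre> becomes <pre>\ntext\n</pre>
print(repr(strip_pre_leading_newline("\n\ntext\n")))
```

Only one leading newline is removed; any further newlines and all trailing whitespace inside the element are left alone.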
#
KartikPrabhu
sknebel: nice!
#
KartikPrabhu
aaronpk: whitespace test 11 the expected content.html is incorrect due to https://chat.indieweb.org/microformats/2018-04-02#t1522692871794700
#
Loqi
[sknebel] KartikPrabhu: "In the HTML syntax, a leading newline character immediately following the pre element start tag is stripped." (https://html.spec.whatwg.org/multipage/grouping-content.html#the-pre-element)
#
KartikPrabhu
sknebel++ for finding citation
#
Loqi
sknebel has 8 karma in this channel (93 overall)
#
KartikPrabhu
<phew> that is another thing I don't have to fix
[kaushal_modi] joined the channel
#
gRegorLove
Glad you guys are tackling the whitespace stuff and not me :)
#
KartikPrabhu
.... and so say all of us
#
Zegnat
I see this as a chance to learn even more crazy html edge-cases!
KevinMarks joined the channel
#
Zegnat
I actually thought of an interesting <template> element edge-case. But decided not to start issues on that one.
#
KartikPrabhu
<template> is always dropped no?
#
Zegnat
It is in the PHP parser, but that isn’t technically correct, I believe.
#
Zegnat
“microformats2 parsers are expected to follow HTML parsing rules” - this means that the html property of an e-* should include the full <template> element, as per HTML serialisation rules
#
Loqi
[Tantek Çelik] microformats2 parsing specification
#
KartikPrabhu
which is why mf2py and phpmf2 drop <template> altogether
#
Zegnat
Yes, and I am saying dropping it completely is breaking from the HTML spec. Also, the <template> element itself most definitely is part of the DOM so <template class="p-name"></template> should result in an empty string name property.
#
Zegnat
If the official policy is following the HTML spec, dropping entire <template> elements is wrong.
#
Zegnat
A lot of mf2 parsing is deceptively easy in browsers, where all the crazy HTML with rendering info is already handled. If you are purely working with a DOM tree (as we often are in our respective programming languages), the parsing rules become a lot more convoluted.
#
Zegnat
As we see now with trying to get a browser-like innerText :P
#
Zegnat
It just isn’t worth opening issues about it, because anyone putting mf2 classes on their template elements is probably one of two things: 1) making a mistake, 2) me trying to prove a point.
#
KartikPrabhu
yup. leave that for now
chrisaldrich, symon1, tantek, [chrisaldrich] and [kevinmarks] joined the channel
#
[kevinmarks]
Examples from the wild should always lead this work, and. Asking test examples should follow.
#
[kevinmarks]
Making, not asking.
#
KartikPrabhu
[kevinmarks]: for whitespace there are parser compatibility issues https://pin13.net/mf2/whitespace.html Don't really care about <template> as much
#
[kevinmarks]
Yes, I think what we have done with those is solid, I'm wary of template if we have never seen it in posts, whereas code in posts is something we do.
#
KartikPrabhu
yes I agree
#
Zegnat
We have seen it, and the result of seeing template elements has been that parsers throw the element out of the parsed DOM wholesale, breaking spec as written. If the parser changes (that were done to reflect in-the-wild examples) had triggered a spec update, that would have been fine. As it stands it specifically diverged parsers from the spec.
#
Zegnat
But I agree that it isn’t actually an important issue.
#
[kevinmarks]
The "using css to make whitespace significant" aspect is even harder, but again I am not sure if we have examples from the wild
#
KartikPrabhu
[kevinmarks]: I use CSS to make whitespaces in my notes
#
Loqi
[Kartik Prabhu] B: We need a better social network. A: Do you like ads? B: No! A: Can I sell your data? B: No! A: Can I have your data anyway? B: No! A: Do you want to host it yourself? B: No! A: Do you want to pay for this better social network? B: No! A: OK, bye!
#
KartikPrabhu
Loqi gets that fine though
#
Zegnat
Most of the whitespace use-cases came from the wild, I think. E.g. aaronpk was having troubles rendering certain posts in his reader.
#
KartikPrabhu
yeah whitespace we all agree on
#
Zegnat
<template> is not important, I just would like for the spec to match parsers or the other way around, and not have a divergence.
webchat254_, [eddie], vivus and webchat254__ joined the channel
#
KartikPrabhu
Zegnat: the whatwg innerText has 2 "\n" for <p>; is that what we want?
#
Zegnat
Not sure. I am probably going to test with both.
#
Zegnat
As far as the code for innerText is concerned, it is just a single fixed integer somewhere. Easily swapped out. It is the rest of the code I am worried about.
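A rough Python sketch of that "single fixed integer" idea follows. It loosely follows the innerText step that turns runs of "required line break count" items into newlines: runs at the start or end are dropped, and each remaining run becomes as many "\n" as its largest count. The names `P_LINE_BREAKS`, `join_items` and `paragraphs` are hypothetical, and nothing here comes from either parser's actual code.

```python
from itertools import groupby

# WHATWG innerText uses 2 for <p>; the whitespace tests discussed here expect 1.
P_LINE_BREAKS = 2


def join_items(items):
    """Join text (str) and required line break counts (int) into one string."""
    # Empty strings are dropped first so they cannot split a run of counts.
    items = [item for item in items if item != ""]
    groups = [
        (is_break, list(run))
        for is_break, run in groupby(items, key=lambda item: isinstance(item, int))
    ]
    out = []
    for index, (is_break, run) in enumerate(groups):
        if not is_break:
            out.extend(run)
        elif 0 < index < len(groups) - 1:
            # A run of counts in the middle collapses to its maximum.
            out.append("\n" * max(run))
        # Runs of counts at the very start or end are discarded.
    return "".join(out)


def paragraphs(texts, breaks=P_LINE_BREAKS):
    """Model each <p> as: break count, text, break count."""
    items = []
    for text in texts:
        items += [breaks, text, breaks]
    return join_items(items)


print(repr(paragraphs(["one", "two"])))            # two newlines between
print(repr(paragraphs(["one", "two"], breaks=1)))  # one newline between
```

Swapping between one and two newlines per paragraph is then a change to one constant, while the collapsing logic stays the same.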
#
Zegnat
The original lynx (?) example aaronpk screenshotted in the parsing issue featured 2 "\n", IIRC
#
KartikPrabhu
I am writing an implementation in python now
#
KartikPrabhu
but the tests have 1. I can swap it too
#
KartikPrabhu
not sure if <pre> should have leading \n though
#
Zegnat
I’ll get to that when I get to actually parsing <pre> elements at all.
#
Zegnat
Currently reading up on the CSS Display spec to see if “used value” vs “computed value” is going to make a difference.
#
KartikPrabhu
I am ignoring that :P
#
KartikPrabhu
steps 2-4 of the Plain text of element are now not needed right?
#
Zegnat
Aah, implementing my algo. Cool.
#
KartikPrabhu
no I am implementing the innerText from whatwg
#
KartikPrabhu
also are you still doing step 2 for element to string > text node ?
#
KartikPrabhu
replace \t\n\r by spaces?
#
Zegnat
Outside of white-space: pre; it should be doing that, I believe
#
Zegnat
Under normal whitespace rules, all ASCII whitespace is collapsed to a single space character.
#
Zegnat
That’s where state is going to come in if you need to keep track of being inside or outside a <pre> element
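A minimal sketch of that state tracking, using Python's standard `html.parser` rather than either parser's real code. The class name `PreAwareText` and the helper `text_of` are hypothetical. It only shows a depth counter for `<pre>` plus whitespace collapsing outside it. It is not the WHATWG innerText algorithm: it ignores CSS, line breaks from nested block elements, the leading-newline rule for `<pre>`, and any final trimming.

```python
import re
from html.parser import HTMLParser

# ASCII whitespace as defined for HTML: space, tab, LF, FF, CR.
ASCII_WS = re.compile(r"[ \t\n\f\r]+")


class PreAwareText(HTMLParser):
    """Collect text, collapsing whitespace except inside <pre>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pre_depth = 0  # how many <pre> elements we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth:
            self.parts.append(data)  # whitespace is significant here
        else:
            self.parts.append(ASCII_WS.sub(" ", data))


def text_of(html: str) -> str:
    parser = PreAwareText()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(repr(text_of("a  \n b <pre>x\n  y</pre> c\n")))
```

The counter is the only state the walker needs for this part. Whether whitespace at the edges of the `<pre>` survives the final trim is the separate question debated earlier in the log.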
#
KartikPrabhu
maybe not. Let me write this up and test it :P
#
Zegnat
Oooh, looking forward to it!
KevinMarks and [eddie] joined the channel
#
KartikPrabhu
ok failing some tests at the moment need to debug
#
KartikPrabhu
should be preparing slides for a presentation, but here we are...
#
Zegnat
well, I am off for bed. Looking forward to whatever you cook up! Wish I was able to move the algo to the mf wiki for iterating, but I can’t create new pages there.
tantek, j12t, KevinMarks, webchat254_, chrisaldrich and KevinMarks_ joined the channel; webchat254_ left the channel