#microformats 2018-04-02

2018-04-02 UTC
j12t, KevinMarks_, KartikPrabhu and tantek joined the channel
# 05:49 
Zegnat Hmm. I should make PHPUnit pull aaronpk's whitespace tests repo. So when tests for things like PRE get added they are immediately testable.
# 05:50 
Zegnat <pre> is annoying because then the walker needs to keep track of when it exits the element. Currently it doesn't need to know that. Going to become stateful I guess.
# 05:51 
KartikPrabhu yup
# 05:51 
KartikPrabhu i'm having a lot of trouble with <pre>
tantek joined the channel
# 06:02 
Zegnat Today is my travelling-home day. Will have a good long think in the car.
# 06:02 
KartikPrabhu drive safe
# 06:03 
Zegnat Planning to pull https://github.com/aaronpk/microformats-whitespace-tests for automatic testing. So if anyone has time to write some tests with <pre> there, that would help a lot!
# 06:03 
Loqi [aaronpk] microformats-whitespace-tests
# 06:03 
Zegnat I don't drive. Being driven. So all the more time to think about algorithms
tantek, KevinMarks, j12t, barpthewire, KevinMarks_, davy__, sebsel and webchat254 joined the channel
# 12:58 
aaronpk I can't figure out how I ran the ruby parser locally
# 12:58 
aaronpk spoke too soon, apparently I forgot I made a little test server and left it not where I expected it
# 12:59 
aaronpk i'm gonna have some new whitespace tests for you Zegnat
# 12:59 
Zegnat Nice aaronpk!
Garbee joined the channel
# 13:00 
Zegnat I am still mulling over how to do <pre>. It feels like the only way to do it is literally implement the entire WHATWG innerText, and I really do not want that.
# 13:00 
aaronpk heh
# 13:02 
aaronpk wow, all except the PHP parser get this one right https://github.com/aaronpk/microformats-whitespace-tests/blob/master/tests/10.html
# 13:02 
aaronpk for the name value
# 13:02 
aaronpk KartikPrabhu: is there a python parser more up to date than https://python.microformats.io/ ?
# 13:03 
Zegnat aaronpk, try https://kartikprabhu.com/connection/mfparser
# 13:03 
aaronpk same api?
# 13:03 
Zegnat Oh, that I don’t know.
# 13:03 
aaronpk "api"
# 13:04 
aaronpk looks like it expects "content" instead of "docc"
# 13:04 
aaronpk "doc"
# 13:04 
KartikPrabhu aaronpk: my parser only does HTML input
# 13:04 
aaronpk that's fine
# 13:04 
aaronpk I just need it to return json
# 13:04 
aaronpk or the json inside of <pre><code>
# 13:04 
Zegnat I think the real problem with <pre> will be when people do things like <pre>one<br>\n<p>two</p>\n\nthree</pre>, or other nested elements.
# 13:04 
KartikPrabhu hmm it will inside the page.
# 13:05 
Zegnat The algorithm can probably be pretty simple if I didn’t need to think about whether nested elements were also adding linebreaks to the plain text output.
# 13:06 
aaronpk oh yeah I should add some tests for that too
# 13:07 
aaronpk KartikPrabhu: argh you have a csrf check on the form!
# 13:07 
KartikPrabhu yes
# 13:08 
KartikPrabhu aah you want to automate it :P
# 13:08 
aaronpk yes and I don't really want to make a get request before each post
# 13:11 
KartikPrabhu now doesn't even remember how he set this up
# 13:16 
aaronpk oh gosh, I just realized that the browser adds \n\n for <p> tags
# 13:16 
aaronpk which I guess makes sense
# 13:16 
Zegnat Yes. They also do linebreak collapsing between P tags.
# 13:17 
aaronpk so these are separated by \n\n in the plaintext https://github.com/aaronpk/microformats-whitespace-tests/blob/master/tests/9.html
# 13:17 
aaronpk I had written the expected result with just one \n
# 13:17 
aaronpk should I update it?
# 13:17 
Zegnat It depends what the goal of the mf2 parser is to be.
# 13:18 
aaronpk I assumed the goal was to return an accurate plaintext representation of the content
# 13:18 
Zegnat Browsers do a much more visual representation than what we currently do. I care more about getting a somewhat normalised text for use in p-names and what not, rather than visually matching.
# 13:18 
Zegnat Browsers also have special ways for lists and tables. That’s all in the WHATWG innerText algo. But that felt like a stretch to get people to implement.
# 13:19 
aaronpk hm
# 13:20 
aaronpk i'm thinking of a situation like rendering text content in an ios app, where you have to give it plain text, if the app can't rely on the mf2 parser's plaintext representation then it will have to do that html->text conversion itself
# 13:21 
aaronpk for most properties it doesn't matter, since most of the time the p- values are expected to have only one line of text anyway (author name, post title, etc)
# 13:21 
Zegnat https://html.spec.whatwg.org/#dom-innertext - is not fun... I am not even sure I understand all of the steps, because I am not well acquainted with CSS specs.
# 13:21 
aaronpk but for the h-entry content it could definitely have multiple lines with significant whitespace, and I could even see a table in there
# 13:22 
Zegnat Sounds like you would want the mf2 parsers to support innerText then.
# 13:22 
aaronpk I think so, because if they don't, then the h-entry content.value is most likely not useful
# 13:24 
aaronpk okay in the mean time here are some new ones https://pin13.net/mf2/whitespace.html
# 13:25 
Zegnat Hmm, 11 strips significant whitespace from the end of the <pre>?
# 13:26 
aaronpk oh yeah it shouldn't
# 13:26 
aaronpk oh right, it should for p-name
# 13:26 
aaronpk but not for the content.value
# 13:27 
Zegnat Why would you treat those differently?
# 13:27 
aaronpk nevermind, it should for e-content too: "removing all leading/trailing whitespace"
# 13:27 
Zegnat Yes, but if you decide that whitespace inside PRE is significant, it really shouldn’t be stripping it, right?
# 13:28 
Zegnat I should strip the \n and 2 spaces after the </pre>, but not the \n and 4 spaces after “three”, surely?
# 13:28 
aaronpk for the html version yeah
# 13:28 
Loqi yea!
# 13:29 
aaronpk keep the "\n    " inside the </pre>
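The trimming rule discussed above (strip whitespace outside the <pre>, keep it inside) can be sketched roughly as follows. This is only an illustration, not code from any of the parsers: the `(text, in_pre)` segment representation and the name `trim_outside_pre` are assumptions made up for this example.

```python
ASCII_WHITESPACE = " \t\n\r\f"

def trim_outside_pre(segments):
    """Trim leading/trailing whitespace, but never inside a <pre> segment.

    `segments` is a hypothetical list of (text, in_pre) pairs in document order.
    """
    segs = [[text, in_pre] for text, in_pre in segments]
    # Walk forward, trimming until we reach <pre> content or real text.
    for seg in segs:
        if seg[1]:
            break
        seg[0] = seg[0].lstrip(ASCII_WHITESPACE)
        if seg[0]:
            break
    # Walk backward with the same rule for trailing whitespace.
    for seg in reversed(segs):
        if seg[1]:
            break
        seg[0] = seg[0].rstrip(ASCII_WHITESPACE)
        if seg[0]:
            break
    return "".join(text for text, _ in segs)

# Shaped like the test 11 discussion: whitespace after "three" stays,
# whitespace after the closing </pre> goes.
result = trim_outside_pre([
    ("\n  ", False),
    ("one\ntwo\nthree\n    ", True),
    ("\n  ", False),
])
```

Here `result` keeps the "\n    " that sits inside the <pre> and drops the "\n  " around it.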
# 13:32 
Zegnat I might have a go at actual WHATWG innerText later tonight. Just going to need to keep very careful track of all the CSS assumptions I make.
# 13:33 
KartikPrabhu yeah <pre> is a pain! I tried a lot yesterday to make some sane function to handle <pre> but all failed :(
# 13:37 
Zegnat All of that text handling is handled by a single step in the WHATWG spec, which is mind-boggling: https://html.spec.whatwg.org/#the-innertext-idl-attribute:text
# 13:37 
Zegnat Basically, if you do not have a renderer and a concept of a “CSS text box” you need to start making assumptions about text right from the start.
j12t and KartikPrabhu joined the channel
# 13:54 
aaronpk https://pin13.net/mf2/whitespace.html is updated with the new python results!
# 13:55 
KartikPrabhu aaronpk: might want to keep the experimental python separate
# 13:56 
KartikPrabhu the new one gives you version number in the debug so should be 1.1.1
# 13:56 
aaronpk eh, this is for tracking progress on the new handling so it's fine
# 13:56 
KartikPrabhu ok
# 13:56 
aaronpk it's using the 0.4.4-alpha php one too
# 13:56 
KartikPrabhu wow! how is Ruby getting the pre correct!?
j12t joined the channel
# 13:57 
KartikPrabhu the expected 11 looks wrong since every whitespace in <pre> should be preserved
# 13:58 
aaronpk that's what Zegnat was saying, but the last step of p- and e- parsing is to trim whitespace
# 13:58 
KartikPrabhu yes but in 11 the e-content and p-name are not on the <pre>
# 13:59 
Zegnat The algo replaces all the steps anyway, so it would simply have to include not-trimming <pre>.
# 13:59 
KartikPrabhu so Ruby seems to be doing correct on 11 too
# 13:59 
KartikPrabhu including final trim
[kevinmarks] joined the channel
# 14:00 
aaronpk ok i'll add the whitespace in the <pre> to the expected result
# 14:01 
aaronpk refresh results
# 14:02 
KartikPrabhu aaronpk: I actually was thinking of the \n before the one
# 14:02 
KartikPrabhu "one"
# 14:02 
KartikPrabhu because it really is "<pre>\n      one [...]"
# 14:03 
aaronpk that's in there
# 14:03 
KartikPrabhu aah I see yes. Ruby is putting two \ns
# 14:04 
KartikPrabhu hmmm mf2py removes that \n from HTML too
# 14:04 
aaronpk heh that's the one php gets right
# 14:05 
KartikPrabhu yeah. I am using a secret hidden method for that from BeautifulSoup so might not be reliable
j12t, [chrisaldrich], tantek, [cleverdevil] and [miklb] joined the channel
# 17:41 
KartikPrabhu <sigh> BeautifulSoup converts <pre>\n\ntext\n</pre> to <pre>\ntext\n</pre>
# 17:43 
KartikPrabhu in fact always removes leading \n
# 17:44 
KartikPrabhu *one* leading \n
# 17:44 
KartikPrabhu looks like mf2py is never going to pass the <pre> whitespace tests
# 17:45 
sknebel KartikPrabhu: that is correct behavior per html5 spec
# 17:45 
KartikPrabhu oh! it is
# 17:45 
KartikPrabhu ?
# 17:46 
KartikPrabhu sknebel: so #11 here is incorrect https://pin13.net/mf2/whitespace.html ?
# 17:47 
KartikPrabhu for the content.html property inside the <pre>
KevinMarks and [snarfed] joined the channel
# 17:53 
KartikPrabhu sknebel: if you can confirm then python will win over php again ;)
kaushalmodi joined the channel
# 18:14 
sknebel KartikPrabhu: "In the HTML syntax, a leading newline character immediately following the pre element start tag is stripped." (https://html.spec.whatwg.org/multipage/grouping-content.html#the-pre-element)
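The quoted rule is small enough to sketch for parsers working on raw text. The helper name `strip_pre_leading_newline` is hypothetical, and the sketch assumes newlines were already normalised to "\n" before this step.

```python
def strip_pre_leading_newline(raw):
    """Drop exactly one newline directly after the <pre> start tag, if present.

    `raw` is the text between <pre> and </pre> as written in the source,
    assumed to use "\n" line endings.
    """
    if raw.startswith("\n"):
        return raw[1:]
    return raw

# Mirrors the BeautifulSoup observation: only *one* leading \n is removed.
stripped = strip_pre_leading_newline("\n\ntext\n")
```

Any further leading newlines, and everything else inside the <pre>, are left untouched.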
# 18:14 
KartikPrabhu sknebel: nice!
# 18:15 
KartikPrabhu aaronpk: whitespace test 11 the expected content.html is incorrect due to https://chat.indieweb.org/microformats/2018-04-02#t1522692871794700
# 18:15 
Loqi [sknebel] KartikPrabhu: "In the HTML syntax, a leading newline character immediately following the pre element start tag is stripped." (https://html.spec.whatwg.org/multipage/grouping-content.html#the-pre-element)
# 18:15 
KartikPrabhu sknebel++ for finding citation
# 18:15 
Loqi sknebel has 8 karma in this channel (93 overall)
# 18:15 
KartikPrabhu <phew> that is another thing I don't have to fix
[kaushal_modi] joined the channel
# 18:21 
gRegorLove Glad you guys are tackling the whitespace stuff and not me :)
# 18:21 
KartikPrabhu .... and so say all of us
# 18:22 
Zegnat I see this as a chance to learn even more crazy html edge-cases!
KevinMarks joined the channel
# 18:26 
Zegnat I actually thought of an interesting <template> element edge-case. But decided not to start issues on that one.
# 18:27 
KartikPrabhu <template> is always dropped no?
# 18:28 
Zegnat It is in the PHP parser, but that isn’t technically correct, I believe.
# 18:30 
Zegnat “microformats2 parsers are expected to follow HTML parsing rules” - this means that the html property of an e-* should include the full <template> element, as per HTML serialisation rules
# 18:30 
KartikPrabhu Zegnat: http://microformats.org/wiki/microformats2-parsing#note_HTML_parsing_rules
# 18:31 
Loqi [Tantek Çelik] microformats2 parsing specification
# 18:31 
KartikPrabhu which is why mf2py and phpmf2 drop <template> altogether
# 18:32 
Zegnat Yes, and I am saying dropping it completely is breaking from the HTML spec. Also, the <template> element itself most definitely is part of the DOM so <template class="p-name"></template> should result in an empty string name property.
# 18:32 
Zegnat If the official policy is following the HTML spec, dropping entire <template> elements is wrong.
# 18:33 
Zegnat A lot of mf2 parsing is deceptively easy in browsers, where all the crazy HTML with rendering info is already handled. If you are purely working with a DOM tree (as we often are in our respective programming languages), the parsing rules become a lot more convoluted.
# 18:34 
Zegnat As we see now with trying to get a browser-like innerText :P
# 18:35 
KartikPrabhu right
# 18:37 
Zegnat It just isn’t worth opening issues about it, because anyone putting mf2 classes on their template elements is probably one of two things: 1) making a mistake, 2) me trying to prove a point.
# 18:38 
KartikPrabhu yup. leave that for now
chrisaldrich, symon1, tantek, [chrisaldrich] and [kevinmarks] joined the channel
# 19:07 
[kevinmarks] Examples from the wild should always lead this work, and. Asking test examples should follow.
# 19:07 
[kevinmarks] Making, not asking.
# 19:08 
KartikPrabhu [kevinmarks]: for whitespace there are parser compatibility issues https://pin13.net/mf2/whitespace.html Don't really care of <template> as much
# 19:09 
[kevinmarks] Yes, I think what we have done with those is solid, I'm wary of template if we have never seen it in posts, whereas code in posts is something we do.
# 19:10 
KartikPrabhu yes I agree
# 19:11 
Zegnat We have seen it, and the result of seeing template elements has been that parsers throw the element out of the parsed DOM wholesale, breaking spec as written. If the parser changes (that were done to reflect in-the-wild examples) had triggered a spec update, that would have been fine. As it stands it specifically diverged parsers from the spec.
# 19:11 
Zegnat But I agree that it isn’t actually an important issue.
# 19:12 
[kevinmarks] The "using css to make whitespace significant" aspect is even harder, but again I am not sure if we have examples from the wild
# 19:12 
KartikPrabhu [kevinmarks]: I use CSS to make whitespaces in my notes
# 19:13 
KartikPrabhu see e-content in https://kartikprabhu.com/notes/better-social-networks
# 19:13 
Loqi [Kartik Prabhu] B: We need a better social network.
A: Do you like ads?
B: No!
A: Can I sell your data?
B: No!
A: Can I have your data anyway?
B: No!
A: Do you want to host it yourself?
B: No!
A: Do you want to pay for this better social network?
B: No!
A: OK, bye!
# 19:13 
KartikPrabhu Loqi gets that fine though
# 19:15 
Zegnat Most of the whitespace use-cases came from the wild, I think. E.g. aaronpk was having troubles rendering certain posts in his reader.
# 19:17 
KartikPrabhu yeah whitespace we all agree on
# 19:18 
Zegnat <template> is not important, I just would like for the spec to match parsers or the other way around, and not have a divergence.
webchat254_, [eddie], vivus and webchat254__ joined the channel
# 20:38 
KartikPrabhu Zegnat: the whatwg innerText has 2 "\n" for <p>; is that what we want?
# 20:39 
Zegnat Not sure. I am probably going to test with both.
# 20:39 
Zegnat As far as the code for innerText is concerned, it is just a single fixed integer somewhere. Easily swapped out. It is the rest of the code I am worried about.
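The "single fixed integer" point can be illustrated with the way WHATWG innerText joins its results: text strings are interleaved with required line break counts, and each run of counts becomes as many "\n" as the largest count in the run. The sketch below assumes that model; the names `P_LINE_BREAKS`, `paragraph` and `join_with_line_breaks` are made up for this example and are not taken from any parser.

```python
P_LINE_BREAKS = 2  # WHATWG innerText value for <p>; the tests at the time expected 1

def paragraph(text, breaks=P_LINE_BREAKS):
    """Wrap text in the required line break counts a <p> would contribute."""
    return [breaks, text, breaks]

def join_with_line_breaks(items):
    """Join a list mixing str (text) and int (required line break counts)."""
    items = [item for item in items if item != ""]
    # Counts at the very start and end produce no output.
    while items and isinstance(items[0], int):
        items.pop(0)
    while items and isinstance(items[-1], int):
        items.pop()
    out = []
    run = 0  # largest count seen in the current run of integers
    for item in items:
        if isinstance(item, int):
            run = max(run, item)
        else:
            out.append("\n" * run)
            out.append(item)
            run = 0
    return "".join(out)

browser_like = join_with_line_breaks(paragraph("one") + paragraph("two"))
single_break = join_with_line_breaks(paragraph("one", 1) + paragraph("two", 1))
```

Swapping between the two behaviours only changes the count passed to `paragraph`; the joining logic stays the same.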
# 20:39 
Zegnat The original lynx (?) example aaronpk screenshotted in the parsing issue featured 2 "\n", IIRC
# 20:39 
KartikPrabhu I am writing an implementation in python now
# 20:40 
KartikPrabhu but the tests have 1. I can swap it too
# 20:40 
KartikPrabhu not sure if <pre> should have leading \n though
# 20:40 
Zegnat I’ll get to that when I get to actually parsing <pre> elements at all.
# 20:41 
Zegnat Currently reading up on the CSS Display spec to see if “used value” vs “computed value” is going to make a difference.
# 20:41 
KartikPrabhu I am ignoring that :P
# 20:41 
KartikPrabhu btw referring now to https://wiki.zegnat.net/media/textparsing.html
# 20:41 
KartikPrabhu steps 2-4 of the Plain text of element are now not needed right?
# 20:41 
Zegnat Aah, implementing my algo. Cool.
# 20:42 
KartikPrabhu no I am implementing the innerText from whatwg
# 20:42 
KartikPrabhu also are you still doing step 2 for element to string > text node ?
# 20:42 
KartikPrabhu replace \t\n\r by spaces?
# 20:43 
Zegnat Outside of white-space: pre; it should be doing that, I believe
# 20:43 
KartikPrabhu ok
# 20:43 
Zegnat Under normal whitespace rules, all ASCII whitespace is collapsed to a single space character.
# 20:44 
Zegnat That’s where state is going to come in if you need to keep track of being inside or outside a <pre> element
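A minimal sketch of that state tracking, using Python's standard `html.parser`. It only looks at the <pre> element itself (no CSS `white-space` handling, no line breaks for block elements, no leading-newline stripping), and the class and function names are assumptions for illustration, not mf2py or php-mf2 code.

```python
import re
from html.parser import HTMLParser

class PreAwareText(HTMLParser):
    """Collect text, collapsing ASCII whitespace except inside <pre>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pre_depth = 0  # the state: how many <pre> elements we are inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth:
            # Inside <pre>: whitespace is significant, keep it verbatim.
            self.parts.append(data)
        else:
            # Normal rules: each run of ASCII whitespace becomes one space.
            self.parts.append(re.sub(r"[ \t\n\r\f]+", " ", data))

def to_text(html):
    parser = PreAwareText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

text = to_text("a \n  b<pre>x\n  y</pre>c\t\td")
```

The depth counter is the only state needed to know when the walker has left the <pre> again; collapsing across element boundaries is left out of this sketch.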
# 20:49 
KartikPrabhu maybe not. Let me write this up and test it :P
# 20:55 
Zegnat Oooh, looking forward to it!
KevinMarks and [eddie] joined the channel
# 21:04 
KartikPrabhu ok failing some tests at the moment need to debug
# 21:05 
KartikPrabhu should be preparing slides for a presentation, but here we are...
# 21:06 
Zegnat well, I am off for bed. Looking forward to whatever you cook up! Wish I was able to move the algo to the mf wiki for iterating, but I can’t create new pages there.
tantek, j12t, KevinMarks, webchat254_, chrisaldrich and KevinMarks_ joined the channel; webchat254_ left the channel