2018-01-12 UTC
tantek and vivus joined the channel
# 01:08 Loqi Ok, I'll tell them that when I see them next
tantek joined the channel
# 01:46 tantek gRegorLove: well, they were real examples at some point, so if you want to unlink them, verify they don't have the markup first
[chrisaldrich], tantek, KartikPrabhu, sebsel, [xavierroy], nitot, [sebsel], Loqi, [mlopatka], vivus, Phae, webchat54 and [eddie] joined the channel
# 15:13 aaronpk uhoh, the PHP parser seems to be including a "photo" property when it shouldn't be
# 15:17 aaronpk I didn't expect that img tag to be used as an implied photo
nitot joined the channel
# 15:19 aaronpk is that this rule? "else if .h-x>img[src]:only-of-type:not[.h-*] then use that img src for photo"
nitot and [colinwalker] joined the channel
# 15:53 Zegnat Let me recheck implied rules … those are a little messy
# 15:54 Zegnat else if .h-x>:only-child:not[.h-*]>img[src]:only-of-type:not[.h-*], then use that img’s src for photo
# 15:55 Zegnat If there is a single IMG inside a wrapper element inside the root mf, it is an implied photo.
# 15:55 aaronpk that should catch things like <a href=""><img></a> right?
# 15:56 Zegnat <span class="h-card"><a href=""><img> Name</a></span>
# 15:56 aaronpk the extra <div class="e-content"> I threw in there should not be matched under that rule I thought
# 15:56 Zegnat As soon as you add anything next to the e-content DIV (e.g. a p-name H1) the image is no longer implied.
# 15:57 aaronpk I don't see anything in that rule that mentions other mf2 properties like my example
# 15:57 Zegnat For implied stuff, whether the elements have other properties is completely ignored.
# 15:58 aaronpk so <span class="h-card"><a href=""><img></a></span> should imply the name, photo and url properties, that makes sense.
# 15:58 Zegnat The rule is: [root-h-with-only-one-child] > [that-one-child-that-is-not-allowed-to-be-an-h] -> [the-only-img-inside-the-root-h-here-is-implied]
# 15:58 aaronpk so was my example for my xray test too contrived?
# 15:59 Zegnat Possibly. I think the argument is that the real use case will not have only 1 element as child of the root h-entry.
# 15:59 aaronpk k let me check some of the actual cases this was erroring on
# 15:59 Zegnat And therefore the implied rule will not apply.
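To make the two selectors quoted above concrete, here is a minimal sketch in Python. It covers only the two implied-photo rules mentioned in this conversation, not the full list in the parsing spec, and it assumes the markup is well-formed XML so the standard library's ElementTree can read it. The function names and the sample markup are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_mf_root(el):
    """True if the element carries any h-* class (i.e. is itself a microformat root)."""
    return any(c.startswith("h-") for c in el.get("class", "").split())


def only_of_type(parent, tag):
    """Return the single direct child with this tag, or None if there are zero or several."""
    matches = [child for child in parent if child.tag == tag]
    return matches[0] if len(matches) == 1 else None


def usable_img(parent):
    """Return the src of parent>img[src]:only-of-type:not[.h-*], or None."""
    img = only_of_type(parent, "img")
    if img is not None and img.get("src") and not is_mf_root(img):
        return img.get("src")
    return None


def implied_photo(root):
    # Rule quoted at 15:19: .h-x>img[src]:only-of-type:not[.h-*]
    src = usable_img(root)
    if src:
        return src
    # Rule quoted at 15:54: .h-x>:only-child:not[.h-*]>img[src]:only-of-type:not[.h-*]
    children = list(root)
    if len(children) == 1 and not is_mf_root(children[0]):
        return usable_img(children[0])
    return None


# Hypothetical test markup (XML-style self-closing img so ElementTree accepts it).
card = ET.fromstring('<span class="h-card"><a href=""><img src="me.jpg"/> Name</a></span>')
lone_content = ET.fromstring(
    '<div class="h-entry"><div class="e-content"><img src="p.jpg"/></div></div>'
)
with_name = ET.fromstring(
    '<div class="h-entry"><h1 class="p-name">T</h1>'
    '<div class="e-content"><img src="p.jpg"/></div></div>'
)

print(implied_photo(card))          # me.jpg
print(implied_photo(lone_content))  # p.jpg: the e-content class on the wrapper is ignored
print(implied_photo(with_name))     # None: the wrapper is no longer the only child
```

The second and third cases mirror the point made above: explicit property classes on the wrapper do not matter, only whether the wrapper is the root's single child.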
# 16:00 Zegnat Also, there might be a case to be made to not imply any properties because other properties are explicitly marked. I think there was an implied-name discussion about that.
# 16:01 Loqi [tantek] From the end of the wiki discussion, one straw proposal was:
"any explicit p-* property on an element stops implied p-name"
(this sounds a bit ambiguous and could be reworded, but I think the general intent / principle is workable)
# 16:04 aaronpk that's one that XRay is parsing wrong that i'm trying to fix
# 16:05 Zegnat Ah, right. I was thinking we were still talking about implied images, and cleverdevil’s is explicit ;)
# 16:07 aaronpk but yes, the parser does not incorrectly imply that photo property
# 16:07 Zegnat Hmm, that post is interesting. That image definitely should not be removed from the e-content, otherwise there is no content left within the a element that’s in the content.
[kevinmarks] joined the channel
# 16:07 Zegnat Or well. “definitely” as in my gut feeling, haha
# 16:08 aaronpk so consumers will render the photo from the `photo` property
# 16:08 Zegnat Actually, I would argue the markup there is just plain wrong and the u-photo class should be on the a element. Then consumers get the full photo URL.
# 16:09 Zegnat The img with class="u-photo" links to a thumbnail for me.
# 16:10 aaronpk with my new XRay parsing, the HTML of that post turns into an empty <a> tag, heh
# 16:10 Zegnat The problem I am seeing here with deduping is that the content will be an <a> element linking to the actual full photo but nothing in it so probably not shown in a feed reader
# 16:10 Zegnat yes, exactly. That’s why I said my gut feeling was that it should not be removed from e-content ;)
# 16:11 aaronpk microblog readers which are aware of photo posts are the use case
# 16:11 aaronpk as soon as the reader is aware of the photo property, there is something to show
# 16:11 aaronpk that post doesn't in fact have any content other than the image
# 16:11 aaronpk so it's correct to remove the image from the content property
# 16:12 Zegnat Actually the content is DIV>A>IMG, since it uses e-*, saying that the HTML is important here.
# 16:14 aaronpk if there were not a u-photo class on the img tag then I would agree
# 16:14 Zegnat Even then the content is still DIV>A right? The A might have important rel and href values.
# 16:15 Zegnat That’s why I think deduping *in this specific case* is hard. I don’t think leaving the A element empty is more correct than the parser giving back 2 images (one in content and one in photo property)
# 16:16 aaronpk I do think it's more correct because it looks super broken to show two images, and doesn't look super broken to show one image that happens to not link to the full-size image
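The trade-off being argued here can be seen with a rough sketch of this kind of deduplication. This is not XRay's code; it is an illustration using Python's standard `html.parser`, with a hypothetical class name, function name, and sample URLs. It re-emits the content HTML while skipping any `img` whose `src` is already listed in the photo property.

```python
from html.parser import HTMLParser


class PhotoDeduper(HTMLParser):
    """Re-serialise HTML, dropping <img> tags whose src is in `photos`.

    Simplification: comments and doctype declarations are not re-emitted.
    """

    def __init__(self, photos):
        super().__init__(convert_charrefs=False)
        self.photos = set(photos)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and dict(attrs).get("src") in self.photos:
            return  # already represented by the photo property
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: the original "<... />" text is emitted as-is unless skipped.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def dedupe_photos(content_html, photos):
    parser = PhotoDeduper(photos)
    parser.feed(content_html)
    parser.close()
    return "".join(parser.out)


# Hypothetical post: the thumbnail carries u-photo, the link points at the full image.
content = '<a href="full.jpg"><img class="u-photo" src="thumb.jpg"></a>'
print(dedupe_photos(content, ["thumb.jpg"]))  # <a href="full.jpg"></a>
```

The result is the empty `<a>` element described above: a photo-aware reader still has the photo property to show, while a reader that only renders the content HTML has nothing visible left.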
# 16:21 Zegnat Yeah, I honestly don’t know how to solve that from a mf2 spec perspective though :(
# 16:22 aaronpk this bit of the discussion probably should have been in #indieweb-dev
# 16:22 Zegnat Yes, but I do somewhat feel that if consuming becomes this much of a struggle, that should be filed as a spec issue
# 16:23 aaronpk possibly yes, especially if it means consumers of the mf2 data can't use the plaintext value that the parser returns
# 16:24 Zegnat Also after reading the discussion in indieweb-dev, I now feel I need to compare PHP’s DOMDocument textContent to the DOM spec.
# 16:25 Zegnat I wonder if PHP’s textContent is broken, or if we collectively just need different output than the DOM textContent property gives us. The latter case would require the spec to be updated to define what textContent is.
# 16:25 aaronpk in XML, <br /> has no text content, but in HTML <br /> has text content of "\n"
# 16:26 aaronpk the question is when PHP says DOM, what do they mean
# 16:26 aaronpk and when microformats says DOM, what do they mean
# 16:26 aaronpk tantek is going to have a field day catching up with these logs
# 16:27 aaronpk i'm going to try to capture as much of this as possible in my github thread
# 16:28 aaronpk actually let me move some of this to that microformats implied parsing thread
# 16:28 Zegnat Where is it defined that <br> has a text content of "\n"?
# 16:30 aaronpk "The br element represents a line break." doesn't that mean \n?
# 16:33 Zegnat Or well, it doesn’t mean that it has any special textContent, is what I should say
# 16:34 Zegnat It represents a line break for HTML rendering engines (those ignore \n), but it does not add a line break to the actual DOM in any way that I can find.
nitot joined the channel
# 16:38 Zegnat Go and Python parsers both rely on DOM textContent. PHP adds magic \n and spaces. Ruby doesn’t parse my test at all (on ruby.microformats.io) and the node one is still down.
# 16:38 Zegnat Test: <div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>
# 16:38 Zegnat Expected name value on entry: WowThisIs Interesting
# 16:39 aaronpk no, I would definitely expect newlines in the plaintext
# 16:39 aaronpk and if you copypaste the text from the browser it will have newlines
# 16:39 Zegnat I’ll post this. I actually have no preference either way. I just always assumed mf2’s textContent was DOM’s textContent.
# 16:40 Zegnat That’s actually a very good argument, aaronpk. Matching browser copy-paste.
# 16:40 Zegnat Or even just matching a text browser like lynx/w3m.
# 16:40 Zegnat The PHP parser doesn’t add a line break for the P elements though ;)
# 16:41 aaronpk looks like a good addition to that fancy plaintext function
# 16:41 Zegnat I guess the definition will be something like “add a linebreak for every block element”
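The difference between the two expectations for the test case above can be sketched in Python with the standard `html.parser`. The "fancy" mode below is only a guess at the kind of rule floated here (a newline for `br` and after every block element); it is not the PHP parser's actual algorithm, and the block-element list, class name, and function name are assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumed, deliberately short list of block-level elements for this sketch.
BLOCK_ELEMENTS = {"p", "div", "h1", "h2", "h3", "li", "blockquote"}


class TextExtractor(HTMLParser):
    def __init__(self, fancy=False):
        super().__init__()
        self.fancy = fancy
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.fancy and tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if self.fancy and tag in BLOCK_ELEMENTS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def text_content(html, fancy=False):
    """fancy=False concatenates text nodes like DOM textContent; fancy=True adds line breaks."""
    parser = TextExtractor(fancy)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    return text.strip() if fancy else text


test = '<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>'
print(repr(text_content(test)))              # 'WowThisIs Interesting'
print(repr(text_content(test, fancy=True)))  # 'Wow\nThis\nIs Interesting'
```

The first result matches the expected DOM-style name value given above; the second is closer to what copy-pasting from a browser would produce.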
gRegorLove joined the channel
tantek joined the channel
[xavierroy] joined the channel
tantek joined the channel
# 17:00 aaronpk Yeah that's what happens when I start working on XRay at 6am
# 17:01 aaronpk the img alt text gets included in the plaintext name property
# 17:02 tantek he shouldn't be putting alt text that duplicates visible text
# 17:02 Loqi [Zegnat] #15 Define the value of textContent.
# 17:03 tantek adactio's example is bad for screen readers which would sound like they are repeating themselves
# 17:03 aaronpk okay I will try to find a different example with alt text that is screwing me up
# 17:04 Zegnat Hmm, that issue title doesn’t make much sense by itself. textContent isn’t a value. Oh well, better titles are welcome.
# 17:05 aaronpk I think his compact list is actually pretty nice as is
# 17:06 tantek it doesn't really show much information, and abstract linked names are less compelling than icons of people
# 17:06 Zegnat tantek, if we are using DOM spec’s textContent (as I assumed, and as I write in the issue) that is fine but should be called out.
# 17:06 Zegnat And if that is decided, a bug should be filed on the PHP parser which isn’t doing so at present. Again because of reasons captured in the issue.
# 17:07 aaronpk adactio's lack of facepiles is really not the problem here
# 17:09 tantek Zegnat: last time I checked, I thought I referenced the HTML spec in particular, for parsing textContent
# 17:09 aaronpk oh this is good. tantek's post causes this problem with alt text :)
# 17:10 aaronpk oh hey Zegnat tantek does the thing you wanted which is to have the `u-photo` class on the <a> instead of the <img> thumbnail
# 17:11 aaronpk oh yeah this might be #indieweb-dev because i'm talking about consuming the microformat data
# 17:12 Zegnat HTML spec builds on DOM spec, but yes. So the PHP parser is wrong per-spec, but the PHP parser does what at least 2 users (the one who opened the issue, and aaronpk) want. (And probably what more people want, since other people like glenn and gRegorLove went and implemented it.)
# 17:13 Zegnat s/want/expect/ ... maybe more accurate. Don’t want to state what people “want”, but they did have a proclaimed expectancy.
# 17:21 Zegnat HTML defines parsing into a node tree, and DOM defines how to handle said node tree and its default attributes. Is how I saw it. Let’s just say they are both needed, haha.
# 17:23 Zegnat Either way, DOM spec defines the textContent getter, and we have users on record saying DOM textContent is not a useful plain-text version of HTML. (Because it isn’t meant to be.) This user feedback has triggered the PHP parser to change its behaviour away from the mf2 spec.
# 17:23 tantek ok that sounds non-trivial and needing some work to resolve
# 17:24 tantek have parsers converged on their own "textContent"?
# 17:25 Zegnat No. PHP uses their own, which is based on (maybe the same as?) the one included in the JS microformats-shiv.
# 17:25 Zegnat Test case and parser results all in the GitHub issue
# 17:29 Zegnat I do not currently run node locally and node.microformats.io has been down for a while so I couldn’t include its output :(
# 17:29 Zegnat Anyone know if we can switch that to a different instance?
# 17:31 aaronpk I try not to run node.js stuff if I can help it so I won't volunteer for that
# 17:33 sknebel some of the others are on Heroku afaik, so packing it for that with a webinterface might be an option?
# 17:33 sknebel (sturdy-backbone has a slightly modified one exposed somewhere)
# 17:47 sknebel so you have to host your testfile somewhere, but at least can look at it
# 17:48 Loqi [aaronpk] #139 trim whitespace from HTML value as well
# 17:51 Zegnat Thankfully, finally a parser change I have no comments on, aaronpk :P
# 17:51 Zegnat Thanks sknebel. Can Kaja be made to use that?
KartikPrabhu and sebsel joined the channel
# 18:09 Loqi [aaronpk] #16 consider not including img alt text as part of surrounding text properties
# 18:14 Zegnat Side note: putting alt in the plain-text value properties is an explicit extension by the mf2 spec to DOM’s textContent ;)
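A small sketch of that extension, again using Python's standard `html.parser`: text nodes are concatenated as with textContent, but each `img` is replaced by its `alt` if present, otherwise its `src`. This is simplified (no URL resolution or whitespace handling), and the class name, function name, and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ImgAwareText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if "alt" in attributes:
            self.parts.append(attributes["alt"] or "")
        elif "src" in attributes:
            self.parts.append(attributes["src"] or "")

    def handle_data(self, data):
        self.parts.append(data)


def plain_text_with_img(html):
    parser = ImgAwareText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical h-card where the alt text repeats the visible name.
card = '<a class="h-card" href="/"><img src="me.jpg" alt="Jane"> Jane Doe</a>'
print(plain_text_with_img(card))  # Jane Jane Doe
```

The doubled name in the output is the kind of result that comes from alt text duplicating visible text, as raised earlier in the discussion.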
# 18:17 aaronpk i'm trying to capture all this so I can resume my XRay work later
tantek joined the channel
[miklb], [colinwalker], KartikPrabhu and reidab joined the channel
# 18:45 gRegorLove Oh, that's the preview, haha. I thought Loqi was commenting on the conversation.
# 18:50 Loqi [aaronpk] #139 trim whitespace from HTML value as well
# 18:51 aaronpk hm might be a glitch, looks like it failed while trying to fetch packages to install
tantek joined the channel
[cleverdevil] joined the channel
# 19:08 Zegnat Thanks for adding the node parser output gRegorLove. I guess glennjones’ innerText implementation sits behind a flag while in the PHP parser it is the default?
# 19:09 Zegnat That might be worth noting too, if that is the case.
# 19:10 sknebel has a checkbox on the webinterface, doesn't seem to change the output here
# 19:11 gRegorLove I didn't think so, but it's been quite a while since I looked at Node's innerText
# 19:14 Zegnat Hmm, I would have thought my test case should have triggered it. But I didn’t spend too much time looking at the node parser.
# 19:14 gRegorLove Oops, wrong method name. innerText() in php-mf2, links to mf-shiv github
# 19:26 gRegorLove aaronpk, Want me to close those question issues or confirm with a comment?
# 19:27 aaronpk you can close them if you think they're resolved. as long as they're closed by someone other than me we'll be able to tell.
[kevinmarks] joined the channel
# 20:06 Loqi [aaronpk] #2 image alt text is lost during parsing
# 20:12 Zegnat (Shows the difference between plain text value parsing from e- and value parsing from p-.)
# 20:14 aaronpk and I remember that removing the contents of the <script> was correct for the plaintext version of e-content
# 20:16 Zegnat I double checked spec when writing this one. I’ll open a quick issue. And then I am retiring from mf2 for the night. Probably.
# 20:16 Loqi [Tantek Çelik] microformats2 parsing specification
# 20:16 Zegnat Look at the zegnat-special again, gRegorLove ;)
# 20:17 Zegnat p- says you replace img and drop style/script. e- only says you replace img.
[keithjgrant] joined the channel
# 20:19 Zegnat Why? That’s what you get when you do e-name. Every parser supports that.
# 20:19 Zegnat I could have used p-contentone and e-contenttwo to separate the properties, but this gave a more fun parser output.
# 20:20 Zegnat Besides, all parsers I just tested handled that without problem. It is just that most drop the SCRIPT element on e-, which is likely what the spec intended.
# 20:20 Zegnat Only Python seems broken, it doesn’t drop SCRIPT at all, not in p- either.
# 20:24 gRegorLove I get the spec bug, yeah. I don't think I'd seen or contrived the scenario before where a property gets a string value and an object (where the object isn't nested with 'children')
# 20:25 sknebel that was just to show the difference in handling I guess
# 20:27 Zegnat Theoretically the test case is just <div class="h-x">Hello <script>beautiful </script>person</div>, which should return an implied name property "Hello beautiful person", since SCRIPT should not be removed on implied name either.
# 20:27 Zegnat But mine is just more fun, and clearly shows the specific handling difference between e- and p- when they have exactly the same content.
# 20:28 Zegnat (So much the same that they are on the same element.)
# 20:29 Loqi [Zegnat] #17 Define removal of SCRIPT and STYLE elements everywhere textContent is requested.
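The behaviour difference behind that issue can be shown with a short Python sketch using the standard `html.parser`. It extracts text either with or without dropping `script` and `style` contents. The class and function names are hypothetical; the input is the test case given above.

```python
from html.parser import HTMLParser


class ScriptAwareText(HTMLParser):
    def __init__(self, drop=()):
        super().__init__()
        self.drop = set(drop)   # element names whose contents are discarded
        self.skip_depth = 0     # > 0 while inside a dropped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.drop and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html, drop=()):
    parser = ScriptAwareText(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


test = '<div class="h-x">Hello <script>beautiful </script>person</div>'
print(extract_text(test))                            # Hello beautiful person
print(extract_text(test, drop=("script", "style")))  # Hello person
```

The first call keeps the script text, as the spec wording discussed here implies for cases where removal is not mentioned; the second is what most parsers were reported to return.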
# 20:40 Zegnat Ah, good to know. I have a post-it on my desk saying to reread that issues page and finish it once and for all, but I have exams and had to postpone :(
[mrkrndvs] joined the channel
[kevinmarks] and tantek joined the channel
# 22:18 tantek thanks for being thorough about the parsing differences between p- and e-
iwaim___, [colinwalker], strugee, [kevinmarks] and vivus joined the channel