#microformats 2018-01-12

2018-01-12 UTC
#
noblecashforcars
edited /hlisting (+74) "/* past examples */"
tantek and vivus joined the channel
#
gRegorLove
!tell tantek spammer on /hlisting
#
Loqi
Ok, I'll tell them that when I see them next
#
gRegorLove
maybe we should unlink these past examples?
tantek joined the channel
#
tantek
edited /Special:Log/block () "blocked [[User:Noblecashforcars]] with an expiry time of infinite (account creation disabled): Spamming links to external sites"
#
tantek
gRegorLove: well, they were real examples at some point, so if you want to unlink them, verify they don't have the markup first
#
Loqi
tantek: gRegorLove left you a message 38 minutes ago: spammer on /hlisting
[chrisaldrich], tantek, KartikPrabhu, sebsel, [xavierroy], nitot, [sebsel], Loqi, [mlopatka], vivus, Phae, webchat54 and [eddie] joined the channel
#
aaronpk
uhoh, the PHP parser seems to be including a "photo" property when it shouldn't be
#
aaronpk
python does it too
#
aaronpk
and ruby
#
aaronpk
am I doing this wrong?
#
aaronpk
I didn't expect that img tag to be used as an implied photo
nitot joined the channel
#
aaronpk
is that this rule? "else if .h-x>img[src]:only-of-type:not[.h-*] then use that img src for photo"
nitot and [colinwalker] joined the channel
#
Zegnat
Let me recheck implied rules … those are a little messy
#
Zegnat
It is this rule:
#
Zegnat
else if .h-x>:only-child:not[.h-*]>img[src]:only-of-type:not[.h-*], then use that img’s src for photo
#
Zegnat
If there is a single IMG inside a wrapper element inside the root mf, it is an implied photo.
#
aaronpk
that should catch things like <a href=""><img></a> right?
#
Zegnat
<span class="h-card"><a href=""><img> Name</a></span>
#
aaronpk
the extra <div class="e-content"> I threw in there should not be matched under that rule I thought
#
Zegnat
As soon as you add anything next to the e-content DIV (e.g. a p-name H1) the image is no longer implied.
#
aaronpk
oh hm your span->a->img example is interesting
#
aaronpk
I don't see anything in that rule that mentions other mf2 properties like my example
#
Zegnat
For implied stuff, whether the elements have other properties is completely ignored.
#
aaronpk
so <span class="h-card"><a href=""><img></a></span> should imply the name, photo and url properties, that makes sense.
#
Zegnat
The rule is: [root-h-with-only-one-child] > [that-one-child-that-is-not-allowed-to-be-an-h] -> [the-only-img-inside-the-root-h-here-is-implied]
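
A minimal sketch of the two implied-photo rules quoted above, assuming only Python's standard `html.parser`. The names `Node`, `TreeBuilder`, `parse_first_h` and `implied_photo` are hypothetical helpers for illustration and are not taken from any of the parsers discussed. The other implied-photo rules in the spec are left out, and relative URLs are not resolved.

```python
from html.parser import HTMLParser

# Elements that never have children; a partial list, enough for this sketch.
VOID = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []  # element children only; text nodes are ignored
        self.parent = parent


class TreeBuilder(HTMLParser):
    """Builds a very small element tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent


def is_h(node):
    classes = (node.attrs.get("class") or "").split()
    return any(c.startswith("h-") for c in classes)


def only_img(node):
    """img[src]:only-of-type:not[.h-*] among the element children of node."""
    imgs = [c for c in node.children if c.tag == "img"]
    if len(imgs) == 1 and imgs[0].attrs.get("src") and not is_h(imgs[0]):
        return imgs[0].attrs["src"]
    return None


def implied_photo(h):
    # Rule: .h-x>img[src]:only-of-type:not[.h-*]
    src = only_img(h)
    if src:
        return src
    # Rule: .h-x>:only-child:not[.h-*]>img[src]:only-of-type:not[.h-*]
    if len(h.children) == 1 and not is_h(h.children[0]):
        return only_img(h.children[0])
    return None


def parse_first_h(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root.children[0]
```

Under these assumptions, a root h-entry whose only child is a `div.e-content` wrapping a single image gets an implied photo, while adding a sibling such as a `p-name` H1 next to the wrapper stops the rule from matching.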
#
aaronpk
so was my example for my xray test too contrived?
#
Zegnat
Possibly. I think the argument is that the real usecase will not have only 1 element as child of the root h-entry.
#
aaronpk
k let me check some of the actual cases this was erroring on
#
Zegnat
And therefore the implied rule will not apply.
#
Zegnat
If you have real usecases with this, please!
#
Zegnat
Also, there might be a case to be made to not imply any properties because other properties are explicitly marked. I think there was an implied-name discussion about that.
#
aaronpk
that's how I thought it worked
#
Zegnat
https://github.com/microformats/microformats2-parsing/issues/6#issuecomment-286834311 - maybe should be expanded to include all implied parsing including photo ?
#
Loqi
[tantek] From the end of the wiki discussion, one straw proposal was: "any explicit p-* property on an element stops implied p-name" (this sounds a bit ambiguous and could be reworded, but I think the general intent / principle is workable)
#
Zegnat
What’s wrong with that cleverdevil post?
#
aaronpk
that's one that XRay is parsing wrong that i'm trying to fix
#
aaronpk
the img.u-photo is inside the e-content
#
aaronpk
so the image appears twice
#
Zegnat
Ah, right. I was thinking we were still talking about implied images, and cleverdevil’s is explicit ;)
#
aaronpk
they are very much related
#
aaronpk
but yes, the parser does not incorrectly imply that photo property
#
Zegnat
Hmm, that post is interesting. That image definitely should not be removed from the e-content, otherwise there is no content left within the a element that’s in the content.
[kevinmarks] joined the channel
#
Zegnat
Or well. “definitely” as in my gut feeling, haha
#
aaronpk
that's correct tho, because the post is a photo
#
Loqi
ahaha
#
aaronpk
so consumers will render the photo from the `photo` property
#
Zegnat
Actually, I would argue the markup there is just plain wrong and the u-photo class should be on the a element. Then consumers get the full photo URL.
#
aaronpk
full photo url?
#
aaronpk
oh instead of the resized one?
#
Zegnat
The img with class="u-photo" links to a thumbnail for me.
#
aaronpk
that's debatable but sure
#
aaronpk
with my new XRay parsing, the HTML of that post turns into an empty <a> tag, heh
#
Zegnat
The problem I am seeing here with deduping is that the content will be an <a> element linking to the actual full photo but nothing in it so probably not shown in a feed reader
#
Zegnat
yes, exactly. That’s why I said my gut feeling was that it should not be removed from e-content ;)
#
aaronpk
"feed readers" aren't my use case here
#
aaronpk
microblog readers which are aware of photo posts are the use case
#
Zegnat
s/feed reader/any reader/
#
aaronpk
my point stands
#
aaronpk
as soon as the reader is aware of the photo property, there is something to show
#
aaronpk
that post doesn't in fact have any content other than the image
#
aaronpk
so it's correct to remove the image from the content property
#
Zegnat
It has a link and an image.
#
aaronpk
and leave the content property blank
#
Zegnat
Actually the content is DIV>A>IMG, since it uses e-*, saying that the HTML is important here.
#
aaronpk
if there were not a u-photo class on the img tag then I would agree
#
Zegnat
Even then the content is still DIV>A right? The A might have important rel and href values.
#
Zegnat
That’s why I think deduping *in this specific case* is hard. I don’t think leaving the A element empty is more correct than the parser giving back 2 images (one in content and one in photo property)
#
aaronpk
I do think it's more correct because it looks super broken to show two images, and doesn't look super broken to show one image that happens to not link to the full-size image
#
Zegnat
Yeah, I honestly don’t know how to solve that from a mf2 spec perspective though :(
#
aaronpk
well this is not an mf2 thing specifically
#
aaronpk
not an mf2 parsing thing
#
aaronpk
this is consuming mf2 data
#
aaronpk
this bit of the discussion probably should have been in #indieweb-dev
#
Zegnat
Yes, but I do somewhat feel that is consuming becomes this much of a struggle, that should be filed as a spec issue
#
Zegnat
s/is/if/
#
aaronpk
possibly yes, especially if it means consumers of the mf2 data can't use the plaintext value that the parser returns
#
Zegnat
Also after reading the discussion in indieweb-dev, I now feel I need to compare PHP’s DOMDocument textContent to the DOM spec.
#
Zegnat
I wonder if PHP’s textContent is broken, or if we collectively just need different output than the DOM textContent property gives us. The latter case would require the spec to be updated to define what textContent is.
#
aaronpk
XML DOM or HTML DOM?
#
aaronpk
in XML, <br /> has no text content, but in HTML <br /> has text content of "\n"
#
Zegnat
When I say DOM I mean DOM: https://dom.spec.whatwg.org/
#
aaronpk
the question is when PHP says DOM, what do they mean
#
aaronpk
and when microformats says DOM, what do they mean
#
Zegnat
I actually thought there was only 1 DOM.
#
aaronpk
tantek is going to have a field day catching up with these logs
#
aaronpk
i'm going to try to capture as much of this as possible in my github thread
#
aaronpk
actually let me move some of this to that microformats implied parsing thread
#
Zegnat
Where is it defined that <br> has a text content of "\n"?
#
aaronpk
"The br element represents a line break." doesn't that mean \n?
#
Zegnat
Or well, it doesn’t mean that it has any special textContent, is what I should say
#
Zegnat
It represents a line break for HTML rendering engines (those ignore \n), but it does not add a line break to the actual DOM in any way that I can find.
nitot joined the channel
#
Zegnat
Go and Python parsers both rely on DOM textContent. PHP adds magic \n and spaces. Ruby doesn’t parse my test at all (on ruby.microformats.io) and the node one is still down.
#
Zegnat
Test: <div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>
#
Zegnat
Expected name value on entry: WowThisIs Interesting
#
aaronpk
no, I would definitely expect newlines in the plaintext
#
aaronpk
given that's how a browser will render it
#
aaronpk
and if you copypaste the text from the browser it will have newlines
#
Zegnat
I’ll post this. I actually have no preference either way. I just always assumed mf2’s textContent was DOM’s textContent.
#
Loqi
Wow This Is Interesting
#
Zegnat
That’s actually a very good argument, aaronpk. Matching browser copy-paste.
#
aaronpk
that's what I think of when I think plaintext
#
Zegnat
Or even just matching a text browser like lynx/w3m.
#
Zegnat
The PHP parser doesn’t add a line break for the P elements though ;)
#
aaronpk
looks like a good addition to that fancy plaintext function
#
Zegnat
I guess the definition will be something like “add a linebreak for every block element”
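
A rough sketch of the two behaviours being compared, assuming only Python's standard `html.parser`. `dom_text` concatenates text nodes the way DOM textContent does, and `plain_text` follows the "linebreak for BR and for every block element" idea floated above. Both function names, the `TextExtractor` class and the `BLOCK` set are hypothetical, and the set of block elements is an assumption, not something the spec defines.

```python
from html.parser import HTMLParser

# Assumed list of block-level elements; purely illustrative.
BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
         "ul", "ol", "li", "blockquote", "pre", "section", "article"}


class TextExtractor(HTMLParser):
    def __init__(self, breaks=False):
        super().__init__()
        self.breaks = breaks
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # BR becomes a newline only in the "plain text" mode.
        if self.breaks and tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Closing a block element ends the current line in "plain text" mode.
        if self.breaks and tag in BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def _extract(html, breaks):
    parser = TextExtractor(breaks=breaks)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def dom_text(html):
    """Concatenation of all text nodes, like DOM textContent."""
    return _extract(html, breaks=False)


def plain_text(html):
    """Text with newlines for BR and block ends, trimmed at both ends."""
    return _extract(html, breaks=True).strip()


TEST = ('<div class="h-entry"><p>Wow<br><span>This</span></p>'
        '<p>Is Interesting</p></div>')
print(repr(dom_text(TEST)))    # 'WowThisIs Interesting'
print(repr(plain_text(TEST)))  # 'Wow\nThis\nIs Interesting'
```

The first result matches the "expected" value from the test case above; the second is closer to what copy-pasting from a browser would give, though real browsers differ in how many newlines they emit.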
#
Zegnat
But yes, writing spec issue now
#
aaronpk
thanks
gRegorLove joined the channel
#
aaronpk
okay next issue: alt text in images
#
aaronpk
this is messy
tantek joined the channel
#
aaronpk
good morning tantek :D
[xavierroy] joined the channel
#
[kevinmarks]
Browser copy paste was variable too. Not sure if that got resolved
tantek joined the channel
#
tantek
wow logs
#
aaronpk
Yeah that's what happens when I start working on XRay at 6am
#
aaronpk
adactio's posts are throwing me off
#
Loqi
[Jeremy Keith] Preparing to podcast. Preparing to podcast. https://adactio.com/images/uploaded/13301/large.jpg
#
aaronpk
the img alt text gets included in the plaintext name property
#
tantek
he shouldn't be putting alt text that duplicates visible text
#
tantek
that's an authoring error
#
Loqi
[Zegnat] #15 Define the value of textContent.
#
tantek
y tho
#
aaronpk
why what?
#
tantek
either
#
tantek
adactio's example is bad for screen readers which would sound like they are repeating themselves
#
aaronpk
okay I will try to find a different example with alt text that is screwing me up
#
Zegnat
Hmm, that issue title doesn’t make much sense by itself. textContent isn’t a value. Oh well, better titles are welcome.
#
tantek
pretty sure DOM spec defines textContent
#
tantek
adactio needs facepiles for his likes
#
aaronpk
"needs" is a strong word
#
aaronpk
I think his compact list is actually pretty nice as is
#
tantek
it doesn't really show much information, and abstract linked names are less compelling than icons of people
#
Zegnat
tantek, if we are using DOM spec’s textContent (as I assumed, and as I write in the issue) that is fine but should be called out.
#
Zegnat
And if that is decided, a bug should be filed on the PHP parser which isn’t doing so at present. Again because of reasons captured in the issue.
#
aaronpk
adactio's lack of facepiles is really not the problem here
#
tantek
Zegnat: last time I checked, I thought I referenced the HTML spec in particular, for parsing textContent
#
tantek
the DOM spec builds on that AFAIK
#
aaronpk
oh this is good. tantek's post causes this problem with alt text :)
#
aaronpk
oh hey Zegnat tantek does the thing you wanted which is to have the `u-photo` class on the <a> instead of the <img> thumbnail
#
tantek
but that's deliberate
#
tantek
wait now we're #indieweb-dev talk
#
aaronpk
oh yeah this might be #indieweb-dev because i'm talking about consuming the microformat data
#
Zegnat
HTML spec builds on DOM spec, but yes. So the PHP parser is wrong per-spec, but the PHP parser does what at least 2 users (the one who opened the issue, and aaronpk) want. (And probably what more people want, since other people like glenn and gRegorLove went and implemented it.)
#
Zegnat
s/want/expect/ ... maybe more accurate. Don’t want to state what people “want”, but they did have a proclaimed expectancy.
#
tantek
Zegnat: sorta. HTML still defines parsing
#
tantek
DOM uses the output of that parsing
#
Zegnat
HTML defines parsing into a node tree, and DOM defines how to handle said node tree and its default attributes. Is how I saw it. Let's just say they are both needed, haha.
#
Loqi
Zegnat: lol
#
Zegnat
Either way, DOM spec defines the textContent getter, and we have users on record saying DOM textContent is not a useful plain-text version of HTML. (Because it isn’t meant to be.) This user feedback has triggered the PHP parser to change its behaviour away from the mf2 spec.
#
tantek
ok that sounds non-trivial and needing some work to resolve
#
tantek
have parsers converged on their own "textContent"?
#
Zegnat
No. PHP uses their own, which is based on (maybe the same as?) the one included in the JS microformats-shiv.
#
Zegnat
Python and Go seem to use DOM’s textContent.
#
Zegnat
Test case and parser results all in the GitHub issue
#
Zegnat
changes issue title to be less vague
#
aaronpk
the PHP one was ported from node
#
Zegnat
I do not currently run node locally and node.microformats.io has been down for a while so I couldn’t include its output :(
#
Zegnat
Anyone know if we can switch that to a different instance?
#
aaronpk
I try not to run node.js stuff if I can help it so I won't volunteer for that
#
sknebel
some of the others are on Heroku afaik, so packing it for that with a webinterface might be an option?
#
sknebel
or host one on Glitch
#
sknebel
(sturdy-backbone has a slightly modified one exposed somewhere)
#
sknebel
(found it: https://sturdy-backbone.glitch.me/mf2/?url= - wouldn't add that officially, but a testing option for now)
#
sknebel
so you have to host your testfile somewhere, but at least can look at it
#
aaronpk
gRegorLove: merged your dt- PR!
#
aaronpk
got my own small one for you to review if you can: https://github.com/indieweb/php-mf2/pull/139
#
Loqi
[aaronpk] #139 trim whitespace from HTML value as well
#
Zegnat
Thankfully, finally a parser change I have no comments on, aaronpk :P
#
Zegnat
Thanks sknebel. Can Kaja be made to use that?
KartikPrabhu and sebsel joined the channel
#
gRegorLove
Thanks! Will take a look.
#
Loqi
[aaronpk] #16 consider not including img alt text as part of surrounding text properties
#
Zegnat
Side note: putting alt in the plain-text value properties is an explicit extension by the mf2 spec to DOM’s textContent ;)
#
aaronpk
wow this opened up quite a rabbit hole
#
aaronpk
i'm trying to capture all this so I can resume my XRay work later
tantek joined the channel
#
Zegnat
Capturing is good, aaronpk :D
[miklb], [colinwalker], KartikPrabhu and reidab joined the channel
#
gRegorLove
reads the logs
#
Loqi
[Loqi] Wow This Is Interesting
#
gRegorLove
Oh, that's the preview, haha. I thought Loqi was commenting on the conversation.
#
Loqi
hehe
#
gRegorLove
aaronpk: https://github.com/indieweb/php-mf2/pull/139 looks good. CI errored on PHP5.5
#
Loqi
[aaronpk] #139 trim whitespace from HTML value as well
#
aaronpk
hm might be a glitch, looks like it failed while trying to fetch packages to install
tantek joined the channel
#
aaronpk
clicks restart
#
aaronpk
oh gRegorLove, your PR I think solved these issues, would appreciate if you could confirm https://github.com/indieweb/php-mf2/issues?q=is%3Aissue+is%3Aopen+label%3Aquestion
#
Loqi
[gRegorLove] ### Node via https://glennjones.net/tools/microformats/ ``` { "type": ["h-entry"], "properties": { "name": ["WowThisIs Interesting"] } } ```
[cleverdevil] joined the channel
#
Zegnat
Thanks for adding the node parser output gRegorLove. I guess glennjones’ innerText implementation sits behind a flag while in the PHP parser it is the default?
#
Zegnat
That might be worth noting too, if that is the case.
#
sknebel
has a checkbox on the webinterface, doesn't seem to change the output here
#
gRegorLove
I didn't think so, but it's been quite a while since I looked at Node's innerText
#
Zegnat
Hmm, I would have thought my test case should have triggered it. But I didn’t spend too much time looking at the node parser.
#
gRegorLove
Oops, wrong method name. innerText() in php-mf2, links to mf-shiv github
#
gRegorLove
aaronpk, Want me to close those question issues or confirm with a comment?
#
aaronpk
you can close them if you think they're resolved. as long as they're closed by someone other than me we'll be able to tell.
[kevinmarks] joined the channel
#
gRegorLove
Related to the img alt discussion, did this come up? https://github.com/microformats/microformats2-parsing/issues/2
#
Loqi
[aaronpk] #2 image alt text is lost during parsing
#
gRegorLove
Only skimmed the img alt conversation so far
#
aaronpk
no, it's kind of orthogonal
#
Zegnat
Surprise, I think no parser gets the new zegnat-special right: https://gist.github.com/Zegnat/ae738ea4322909b0ce952cb1346deb9c
#
Zegnat
(Shows the difference between plain text value parsing from e- and value parsing from p-.)
#
aaronpk
I thought <script> tags were removed completely
#
Zegnat
Not in e- value parsing per spec.
#
Zegnat
Probably an omission in spec, but yeah…
#
aaronpk
we've had this discussion before
#
Zegnat
Could be.
#
aaronpk
and I remember that removing the contents of the <script> was correct for the plaintext version of e-content
#
aaronpk
otherwise you end up with junk
#
sknebel
makes sense, is not in the spec
#
gRegorLove
<aaronpk> we've had this discussion before
#
gRegorLove
<Zegnat> Probably an omission in spec, but yeah…
#
Zegnat
I double checked spec when writing this one. I’ll open a quick issue. And then I am retiring from mf2 for the night. Probably.
#
Loqi
[Tantek Çelik] microformats2 parsing specification
#
sknebel
not on e-
#
gRegorLove
It is in a p?
#
Zegnat
Look at the zegnat-special again, gRegorLove ;)
#
sknebel
and an e-
#
Zegnat
p- says you replace img and drop style/script. e- only says you replace img.
[keithjgrant] joined the channel
#
gRegorLove
Hm, the nested object in name is really weird there too
#
Zegnat
Why? That’s what you get when you do e-name. Every parser supports that.
#
gRegorLove
It's weird in combination with the p-name
#
Zegnat
I could have used p-contentone and e-contenttwo to separate the properties, but this gave a more fun parser output.
#
Zegnat
Besides, all parsers I just tested handled that without problem. It is just that most drop the SCRIPT element on e-, which is likely what the spec intended.
#
Zegnat
Only Python seems broken, it doesn’t drop SCRIPT at all, not in p- either.
#
gRegorLove
I get the spec bug, yeah. I don't think I'd seen or contrived the scenario before where a property gets a string value and an object (where the object isn't nested with 'children')
#
sknebel
that was just to show the difference in handling I guess
#
sknebel
not something you'd do
#
Zegnat
Theoretically the test case is just <div class="h-x">Hello <script>beautiful </script>person</div>, which should return an implied name property "Hello beautiful person", since SCRIPT should not be removed on implied name either.
#
Zegnat
But mine is just more fun, and clearly shows the specific handling difference between e- and p- when they have exactly the same content.
#
Zegnat
(So much the same that they are on the same element.)
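
A small sketch of the p- versus e- difference described above, assuming only Python's standard `html.parser`. The `drop_script_style` flag stands in for "p- drops style/script, e- (as written in the spec at the time of this log) does not". `ValueText` and `value_text` are hypothetical names, and the IMG handling (alt, else src, with no URL resolution) is a simplification.

```python
from html.parser import HTMLParser


class ValueText(HTMLParser):
    def __init__(self, drop_script_style):
        super().__init__()
        self.drop = drop_script_style
        self.skip_depth = 0  # > 0 while inside a dropped SCRIPT/STYLE
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.drop and tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "img" and not self.skip_depth:
            # Replace IMG with its alt, else its src (left unresolved here).
            a = dict(attrs)
            self.parts.append(a.get("alt") or a.get("src") or "")

    def handle_endtag(self, tag):
        if self.drop and tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def value_text(html, drop_script_style):
    parser = ValueText(drop_script_style)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


CASE = '<div class="h-x">Hello <script>beautiful </script>person</div>'
print(value_text(CASE, drop_script_style=False))  # Hello beautiful person
print(value_text(CASE, drop_script_style=True))   # Hello person
```

With the flag off, the script contents leak into the value, which is the "junk" outcome mentioned earlier; with it on, the value is what most of the tested parsers returned.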
#
Loqi
[Zegnat] #17 Define removal of SCRIPT and STYLE elements everywhere textContent is requested.
#
Loqi
microformats2-parsing-issues
#
Zegnat
Ah, good to know. I have a post-it on my desk saying to reread that issues page and finish it once and for all, but I have exams and had to postpone :(
#
Loqi
it'll be okay
[mrkrndvs] joined the channel
#
Zegnat
He, thanks Loqi
#
Zegnat
gives Loqi a cookie
#
Loqi
peers at the cookie
[kevinmarks] and tantek joined the channel
#
tantek
thanks for being thorough about the parsing differences between p- and e-
iwaim___, [colinwalker], strugee, [kevinmarks] and vivus joined the channel
#
[kevinmarks]
hm. file a bug on the python parser?