#microformats 2023-06-25

2023-06-25 UTC
joeyg, [aciccarello], sebbu, btrem and JKing joined the channel
#
JKing
I am writing a test case for handling of <area>, <img>, and <data> in Value Class Pattern processing (seemingly not covered by any extant test cases) and noticed that three of the five implementations I tried (PHP, JS, and Python, as opposed to Go and Ruby) don't behave as I'd expect. Could all three be wrong, or might the parsing documentation need updating?
#
sebbu
what is wrong exactly, not behaving as you'd expect ?
#
sebbu
i remember DOM being implemented differently in various parsers, aka having "empty" spacing nodes in some implementations, ignoring those spacing nodes in some other implementations, and in others they're merged with text nodes
#
sebbu
(i once had to parse a <table>, de-merging cells, and there was a difference if the html was minified or pretty-printed)
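
[Editor's note] The whitespace difference sebbu describes can be sketched with Python's standard-library `html.parser`. This is only an illustration: `html.parser` is a tokenizer, not a DOM builder, and the names `TextNodeCollector`, `text_nodes`, `MINIFIED`, and `PRETTY` are hypothetical, not taken from any parser discussed here.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Records every run of text found between tags, whitespace included."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        self.text_nodes.append(data)


def text_nodes(html):
    """Return the text runs found in the given markup, in document order."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.text_nodes


# The same table, once minified and once pretty-printed (hypothetical input).
MINIFIED = "<table><tr><td>1</td><td>2</td></tr></table>"
PRETTY = "<table>\n  <tr>\n    <td>1</td>\n    <td>2</td>\n  </tr>\n</table>"

minified_nodes = text_nodes(MINIFIED)
pretty_nodes = text_nodes(PRETTY)

# Dropping whitespace-only runs makes the two inputs agree again.
significant = [node for node in pretty_nodes if node.strip()]
```

The pretty-printed input yields extra whitespace-only runs between the tags, so code that counts or indexes child text nodes sees different results for the two inputs unless it filters them out.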
#
JKing
It seems from the output like the three don't implement the special cases of those elements at all.
#
JKing
Given:
#
JKing
<div class="h-test"><div class="p-name">
#
JKing
<img class="value" alt="A">
#
JKing
<area class="value" alt="B">
#
JKing
<data class="value" value="C">c</data>
#
JKing
<data class="value">D</data>
#
JKing
</div></div>
#
JKing
The value of 'name' should be "ABCD" per the VCP docs, but is "cD" in PHP, JS, and Python.
#
sebbu
that's because you didn't ask for VCP
#
sebbu
you only ask for text value once all html nodes are removed, but not the text content inside them
#
sebbu
besides, it'll be \n\t\n\t\n\tc\n\tD\n
#
JKing
How can that be? The very first step at http://microformats.org/wiki/microformats2-parsing#parsing_a_p-_property is to look for a VCP value.
#
sebbu
i'm not saying that algorithm is wrong
#
sebbu
i'm saying you didn't use that algorithm or method
#
JKing
I don't understand what you mean.
#
sebbu
are you using a microformat library to get what you want, or just DOM or xml manipulation ?
#
JKing
I'm debugging a processor I just wrote by following the documentation and available tests. Given that there was no test I could find for this part, I wrote the above. My implementation agrees with Ruby and Go, which use "ABCD" as the value for the p-name.
#
JKing
(and by PHP, JS, Python, Ruby, and Go, I mean the implementations referenced from microformats.io)
#
sebbu
besides, it gives you the value of the h- node, not the p- node ;)
[pfefferle] joined the channel
#
JKing
sebbu: I think you may be misreading the test case and/or misunderstanding what I'm saying. There is indeed a p-name property. All five of the (what I assume are) mature implementations are doing VCP processing, hence no whitespace in the result. They only differ in whether they take information from alt= and value= attributes as stated at http://microformats.org/wiki/value-class-pattern#Basic_Parsing
[schmarty], btrem and [tw2113_Slack_] joined the channel
#
Zegnat
JKing: that looks to be a bug at least in the PHP parser. The problem is it is using the inner text for all elements with a value class and does not have the check for alt/value attributes as mentioned in the VCP spec. Feel free to open an issue on https://github.com/microformats/php-mf2
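
[Editor's note] The difference under discussion can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of php-mf2 or any other parser: it uses Python's standard-library `html.parser`, it looks at every element with class `value` in the input rather than only those inside one property, and it ignores `value-title`, nested properties, and the date/time rules. The names `ValueClassCollector`, `vcp_value`, `honor_attributes`, and `SAMPLE` are hypothetical. Setting `honor_attributes=False` mimics the inner-text-only behaviour described above.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they can have no inner text.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class ValueClassCollector(HTMLParser):
    """Collects one string per element whose class list contains 'value'."""

    def __init__(self, honor_attributes=True):
        super().__init__()
        self.honor_attributes = honor_attributes
        self.values = []
        self._depth = 0       # > 0 while inside a 'value' element
        self._buffer = None   # list of text pieces, or None when skipping

    def handle_starttag(self, tag, attrs):
        if self._depth > 0:
            # Nested element inside a value element: just track nesting.
            if tag not in VOID_ELEMENTS:
                self._depth += 1
            return
        attrs = dict(attrs)
        if "value" not in (attrs.get("class") or "").split():
            return
        if self.honor_attributes:
            if tag in ("img", "area"):
                self.values.append(attrs.get("alt") or "")
                return
            special = {"data": "value", "abbr": "title"}.get(tag)
            if special and attrs.get(special) is not None:
                self.values.append(attrs[special])
                self._depth, self._buffer = 1, None  # skip the inner text
                return
        if tag in VOID_ELEMENTS:
            self.values.append("")  # no inner text to take
            return
        self._depth, self._buffer = 1, []  # fall back to inner text

    def handle_endtag(self, tag):
        if self._depth == 0 or tag in VOID_ELEMENTS:
            return
        self._depth -= 1
        if self._depth == 0 and self._buffer is not None:
            self.values.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._depth > 0 and self._buffer is not None:
            self._buffer.append(data)


def vcp_value(html, honor_attributes=True):
    """Concatenate all collected values with no inserted whitespace."""
    collector = ValueClassCollector(honor_attributes)
    collector.feed(html)
    collector.close()
    return "".join(collector.values)


# JKing's test case from earlier in the log.
SAMPLE = """<div class="h-test"><div class="p-name">
<img class="value" alt="A">
<area class="value" alt="B">
<data class="value" value="C">c</data>
<data class="value">D</data>
</div></div>"""

with_attributes = vcp_value(SAMPLE)
inner_text_only = vcp_value(SAMPLE, honor_attributes=False)
```

With the attribute checks enabled the sample gives "ABCD"; with them disabled it gives "cD", matching the two results reported in the channel.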
#
Loqi
[preview] [microformats] php-mf2: php-mf2 is a pure, generic microformats-2 parser for PHP. It makes HTML as easy to consume as JSON.
#
JKing
Zegnat: Okay, thanks for the confirmation.
#
Zegnat
The fact that multiple parsers are doing this wrong might also be because there is no good test for it in https://github.com/microformats/tests . In the perfect world, if you are implementing a parser yourself, you should be able to just go against the input and outputs from that repo ... so if you have created tests yourself that are not covered there, feel free to PR!
#
Loqi
[preview] [microformats] tests: Microformats test suite
#
Zegnat
The PHP parser runs that test suite as part of its tests as well
#
JKing
Yes, I noticed that.
#
JKing
I'll clean up my test cases and post a patch, then. I have a few.
#
Zegnat
I am a little surprised the JS parser also got it wrong, as it felt a lot more up to date than the old PHP one. Mmm...
#
JKing
That surprised me as well, which is why I wondered if it was intentional.
#
Zegnat
I am actually not 100% sure what is going on with the JS parser. I expected it to at least pick up on the image, as alt values do seem to be part of its walker for textContent
#
Zegnat
Probably easy to debug, but on my way to bed. Nice catch! As I said, getting this in the test suite so it can trickle down to all implementations (including PHP and JS) would be awesome!
#
JKing
Happy to do it. Even with the omissions and vagaries in the documentation and holes in the test suite, it was pretty easy to write a processor from scratch. Two weeks to write and one week to test in my spare time, so not too bad.
#
JKing
The hardest part of testing was figuring out that multiple implementations were using a different text trimming algorithm (one of your devising, I understand) than what the parsing spec actually mandates. I did eventually find the documentation on the wiki, but you first have to know to look for it.
[tantek] and eitilt joined the channel
#
[KevinMarks]
Clarifying that would be a community benefit we'd all learn from