#microformats 2018-08-07

2018-08-07 UTC
[kiai], CGML2, [jgmac1106], [matpacker], mnemonic and [jgarber] joined the channel
#
gRegorLove
PR to fix the implied photo issue today: https://github.com/microformats/php-mf2/pull/191
#
Loqi
[gRegorLove] #191 Fix implied photo parsing
#
gRegorLove
sknebel++ for catching that
#
Loqi
sknebel has 18 karma in this channel over the last year (75 in all channels)
[jgarber] joined the channel
#
[jgarber]
gRegorLove: Are there any changes related to that that should be made to the microformats test suite? Or does the recent (15 days ago) change to the implied photo test cover this?
#
gRegorLove
That was a php-mf2 bug not a spec change, but yeah I am working on adding some tests for implied properties -- specifically negative tests, to cover some instances when they shouldn't be implied.
#
[jgarber]
Ah, perfect! Thanks for the clarification.
#
[jgarber]
I’m familiarizing myself with the test suite. Glad to see it’s getting some updates.
#
[jgarber]
I do have a couple of newbie questions about https://github.com/microformats/tests:
#
[jgarber]
1. Is it useful as an npm package? It’s hard to tell what it really does, how it’s used, etc. `npm install` kinda works, but it’s not obvious it does anything meaningful.
#
[jgarber]
2. There’s also a `composer.json` file. That’s a PHP thing, yeah? Is that useful…?
#
[jgarber]
4. I’m thinking whether or not this repo would be helpful pulled into other projects as a Git submodule. Not really a question here, but curious how others are incorporating it into their projects.
#
[jgarber]
3. The `change-log.html` files are historically useful but… maybe better suited as folder-specific `changelog.md` so as not to be conflated with the actual test HTML files…?
#
Loqi
[microformats] tests: Microformats test suite
[kiai] joined the channel
#
[jgarber]
☝ I don’t have a lot of historical familiarity with the repo other than that Glenn Jones worked on it for a long time. Beyond that, those questions are asked without much context.
#
gRegorLove
I'm new to it as well. I've wanted to get php-mf2 using it for a while but kept working on other issues :)
#
gRegorLove
I don't know node very well, so not sure about 1.
#
gRegorLove
2, yeah composer.json is PHP package control, so you can include it in other libs. php-mf2 uses that to load in the tests
#
gRegorLove
Not sure / no strong opinion on 3
#
gRegorLove
Submodule could work. With composer it's nice that we can specify microformats for development only: https://github.com/microformats/php-mf2/blob/master/composer.json#L19
#
gRegorLove
So if you just want to put php-mf2 in a project and don't need all the tests, you don't have to get them.
#
[jgarber]
What I’m trying to sort out in late night brain space is something like: “Should this test suite repo include language-specific things (e.g. `composer.json`, `package.json`, etc.) or should it simply be the test files…” ?
#
[jgarber]
Right, that approach makes sense. No need to distribute the test files with the package.
#
[jgarber]
(or whatever PHP calls its include-able things. 🙃 )
#
[jgarber]
Same principle applies to Ruby gems.
#
gRegorLove
Ohh. Yeah, composer.json and package.json make pulling them into other projects easier. Composer lets you specify a git repo, but we can just use "mf2/tests" instead.
#
gRegorLove
So I think they're fine there
#
[jgarber]
Packaging the test suite as a gem is _possible_, but not as cut and dry as adding a file like `package.json` or `composer.json`.
#
gRegorLove
knows very little Ruby :)
#
[jgarber]
Oh, quite alright. I’m thinking aloud. Currently the microformats Ruby gem copy/pastes the test suite files into a vendor folder. I’m thinking about strategies for avoiding that.
[xavierroy], Sove, ketralnis, MylesBraithwaite, [Serena] and [kevinmarks] joined the channel
#
Loqi
Zegnat rebase and rerun https://github.com/microformats/php-mf2/pull/163 . see how we’re doing
#
Loqi
Countdown set by Zegnat on 2018-08-06 at 9:29pm UTC
samouy15, Davnit18, Tux10, disi14, rolig and [pfefferle] joined the channel
#
Zegnat
gRegorLove, should I hold off on updating my tests PR until you have worked on the tests themselves a little more?
[jgmac1106], adactio, Guest29805, [pfefferle], aaronpk and [kevinmarks] joined the channel
#
[kevinmarks]
I'm thinking that we should refactor the tests repo, as there are a lot of odd synthetic tests there, and not many spot tests for the parsing algorithm.
#
[kevinmarks]
The basic idea of parallel html and json files is good and should be usable by multiple parsers; the actual content is a bit annoying.
#
[kevinmarks]
The original idea of sending the same html through multiple parsers and comparing the json is a good one, but that bit isn't really maintained.
#
[kevinmarks]
Would it make sense to use some kind of existing CI tool for this?
[jgarber] joined the channel
#
[jgarber]
Agreed 100%. “Here’s a possible input, here’s the expected output” is super, super handy so long as the test suite is reliable, stable, and as complete as possible.
#
[kevinmarks]
That each parser is spawning parallel spec tests instead is a marker of failure.
#
[jgarber]
[kevinmarks] I saw your CI issue on the tests repo. I wasn’t quite sure what you had in mind, though. CI at first blush seems more appropriate for the parsers that rely on the test suite. Not on the test suite itself.
#
[jgarber]
Unless the intent is to enforce some kind of style compliance on the test files themselves.
#
[kevinmarks]
What a CI is good at is running code in multiple configurations
#
[kevinmarks]
So showing the responses of all known parsers is good.
#
[kevinmarks]
Doing it with the current tests may not be helpful, as they have a lot of cruft there, but starting with a simple subset and then merging them over could be good.
#
[kevinmarks]
Each time I sit down to look at them with mf2py I get bogged down in whitespace and awkward json comparisons.
#
[jgarber]
Would that CI effort be duplicative assuming that each of the parsers implement a CI pipeline in their own repositories?
#
sknebel
yes, the structure of the current tests isn't great
#
sknebel
a while back there was consensus that it's fine to stop using the current organisation structure by microformat (which only makes sense for mf1 IMHO) and start structuring them by features
#
[kevinmarks]
It would be, but that would be helpful in spotting if the test or the parsers are wrong.
#
[jgarber]
sknebel: “structuring them by features” <= What would constitute a “feature” ?
#
sknebel
e.g. exercising the different paths for p-properties. another set for the different paths for e-, ...
#
[jgarber]
Ah, gotcha.
#
sknebel
another thing is that the JSON comparison means that you always test everything in a file, so small tests are important to not break tons of tests on known differences
#
[jgarber]
So a test suite organized by “Here are all the likely permutations of `p-*` properties and the expected parser results.”
#
sknebel
e.g. html from e- properties gets serialized differently, and we are not going to fix that
#
sknebel
so complex e-properties shouldn't be in tests not about them
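(Editorial sketch, not part of the original conversation.) The point about whole-file JSON comparison can be illustrated roughly as follows. This assumes a hypothetical `parse` callable standing in for any mf2 parser; `normalize` and `check_case` are illustrative names, not part of microformats/tests or any parser.

```python
import json


def normalize(value):
    """Round-trip through JSON so only JSON-visible differences remain."""
    return json.loads(json.dumps(value, sort_keys=True))


def check_case(html, expected_json_text, parse):
    """Return True when parse(html) equals the expected JSON document.

    `parse` is a hypothetical stand-in for a real mf2 parser: it takes
    an HTML string and returns a JSON-serializable dict.
    """
    expected = json.loads(expected_json_text)
    actual = normalize(parse(html))
    # The whole document is compared at once, so one known difference
    # anywhere in the file (for example e-* HTML serialization) fails
    # the entire case. That is why small, focused cases matter.
    return actual == expected
```

Because the comparison is all-or-nothing per file, a case meant to exercise `p-*` parsing is safest when it contains no complex `e-*` content.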
#
[jgarber]
More unit-style tests than what exists now.
#
sknebel
think so, yes. a set of "real-life" examples would be cool too maybe, but clearly separated
#
[jgarber]
Agreed. A combination of comprehensive unit-style tests and a handful of high-level “complete” examples would be fantastic.
#
Zegnat
Maybe we can pull a variety of real-world examples from indiemap? Maybe store them in a folder structure like https://indieweb.org/IndieArchive#Storage ?
#
sknebel
right now I can think of 3 areas where parsers might differ somewhat intentionally: a) e-* serialisation (although one could test if the resulting HTML parses the same) b) url resolving (depending on what steps specific libs take/not take) c) dt-parsing (there's a long-standing proposal to imply the date part of a datetime always, not just in VCP, and some parsers already implement that)
thk127 joined the channel
#
[jgarber]
What’s the status of URL normalization / absolute-izing in parsers?
#
[kevinmarks]
So compile examples of these from the parsers' existing unit tests, and put the current ones in a historic directory?
#
[jgarber]
e.g. HTML includes a relative URL, what do parsers do with that when constructing the resulting data structure?
#
Zegnat
They should resolve the URL, [jgarber]. And I think most will do just that.
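(Editorial sketch, not part of the original conversation.) Resolving a relative URL against the page URL can be illustrated with Python's standard library. The `resolve` helper and the example URLs are assumptions for illustration; this is not the code of any particular parser.

```python
from urllib.parse import urljoin


def resolve(base_url, value):
    """Resolve a possibly relative URL against the page's base URL."""
    return urljoin(base_url, value)


base = "https://example.com/blog/post.html"
print(resolve(base, "photo.jpg"))                # relative to the page's directory
print(resolve(base, "/img/a.png"))               # relative to the site root
print(resolve(base, "https://other.example/x"))  # already absolute, left unchanged
```

How much extra normalization happens beyond this (dot segments, percent-encoding) depends on the URL library, which is where parsers may differ.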
#
sknebel
sure, but I wouldn't be surprised if there's differences in how much they do
#
[kevinmarks]
The current structure makes some sense for mf1, where you do need to know all the properties possible, but very little sense for mf2, agreed.
#
[jgarber]
Zegnat: Roger that. :thumbsup::skin-tone-2:
#
sknebel
e.g. mf2py will happily keep /../ segments in some urls
#
Zegnat
There might be slight encoding differences or other oddities, I think, depending on what URL lib is used for the parsing and resolving.
#
sknebel
and not sure if all turn urls into percent-encoding
#
[jgarber]
I _think_ I saw some encoding tests in the existing microformats/tests repo last night…
#
[jgarber]
This may be a tangent or a dead end, but… Would http://json-schema.org be at all helpful with the test suite?
#
sknebel
(in mf2py it *might* even depend on the used html parser, not sure)
#
[jgarber]
Full disclosure: I know very little about JSON Schema.
#
Zegnat
[jgarber], http://microformats.org/wiki/microformats2-json actually links to two independent JSON Schema implementations for validation
#
Zegnat
But for the most part, and to keep the tests simple, it is probably easier to just test against the full expected JSON output
#
[jgarber]
Zegnat: I should’ve known that’d be in the wiki somewhere. 😄
#
Zegnat
That page is relatively new, so no shame in not knowing it :P Probably not very much linked to yet either
#
[kevinmarks]
Json schema is overkill perhaps. There are other json issues - unicode vs UTF-8 for example
#
Loqi
[kevinmarks] #65 encoding unicode values - best practice?
#
[kevinmarks]
(I dislike \u in utf8)
beaky10 joined the channel
#
Zegnat
[kevinmarks], if we decide to go with RFC 7493 JSON (https://github.com/microformats/microformats2-parsing/issues/23) we would be saying all mf2 JSON is always utf-8 encoded. Else it breaks spec.
#
Loqi
[Zegnat] #23 Should the spec define what JSON spec we adhere to?
#
Zegnat
Then it would be pretty easy to say that we do not require \u escaping
#
Zegnat
Currently we make no claim to the JSON encoding at all
simon_-_28 joined the channel
#
[kevinmarks]
Yes, and I think we should require not using \u escaping too (it can be hard to test as language json parsers may convert both to native unicode strings)
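(Editorial sketch, not part of the original conversation.) The testing difficulty mentioned here can be seen with Python's `json` module: the escaped and unescaped forms decode to the same string, so a check for `\u` escaping has to look at the serialized text rather than the parsed value. The sample string is an arbitrary assumption.

```python
import json

name = "Zoë"

escaped = json.dumps(name)                      # ASCII-only output, uses a \u escape
literal = json.dumps(name, ensure_ascii=False)  # keeps the character itself

# Both serializations decode to the same native string, so comparing
# parsed values cannot tell them apart.
same_after_parsing = json.loads(escaped) == json.loads(literal) == name

# Only the raw serialized text reveals the difference.
uses_escape = "\\u" in escaped
literal_is_clean = "\\u" not in literal

# Encoding the unescaped form as UTF-8 gives the bytes that would go on the wire.
utf8_bytes = literal.encode("utf-8")
```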
#
[kevinmarks]
We should also maybe have some html in other encodings to turn into utf8 mf2.
#
Loqi
yea!
jalcine, barpthewire, tantek, K0HAX4, bradenslen, tantek__ and snarfed joined the channel
#
snarfed
FYI all mf2py users! candidate for the next release (1.1.2) is available, including whitespace bug fixes and performance improvements. please try it out and let us know if you hit any problems! you can install with: pip install -e git+https://github.com/microformats/mf2py.git#egg=mf2py
#
snarfed
sknebel++ for his hard work on this release!
#
Loqi
sknebel has 19 karma in this channel over the last year (77 in all channels)
[Serena], beaver24, ForexTrader, [jgmac1106], tantek, TallTed, Demp10, [chrisaldrich], adactio_, Monkeh13, cooled, jackjamieson, [johnjohnston], KartikPrabhu, [grantcodes] and [cleverdevil] joined the channel
#
@metbril
I was wondering if your #indieweb #microformats2 #wordpress mf2_s theme repo is still under active development? It has not been updated for a year. https://github.com/dshanske/mf2_s
(twitter.com/_/status/1026873802519662593)
snarfed and kloeri24 joined the channel
#
snarfed
hey sknebel, thanks again for the mf2py optimization PR, https://github.com/microformats/mf2py/pull/123 . mind rebasing it on current master now that the revert is merged?
#
Loqi
[sknebel] #123 Performance improvements: reduce and optimize regular expressions
#
sknebel
doesn't merge cleanly? sure, can do
#
snarfed
the urljoin ValueError PR looks great too. mind adding a test? looks like you have a test case in https://github.com/microformats/mf2py/issues/79
#
Loqi
[snarfed] #79 ValueError: Invalid IPv6 URL
[jgmac1106] joined the channel
#
Zegnat
Just opened a PR for more implied property fixes: https://github.com/microformats/php-mf2/pull/192 - feels like we also need more creative tests for all of these ...
#
Loqi
[Zegnat] #192 Fix implied URL parsing
#
sknebel
snarfed: rebase pushed
#
snarfed
thanks! merging now
#
sknebel
then I'll rebase the url thing, add a test case and push that
#
sknebel
and I think then I have yet another performance thing, assuming I get the data I want out of indiemap in time
#
snarfed
ooh ok
loppy2 joined the channel
#
snarfed
hope indiemap is making sense...?
#
sknebel
it does, but I fight every time with the dialect of SQL and stuff
#
sknebel
is the "links" view really limited in which mf2-classes are in it?
#
gRegorLove
Thanks for the xpath cleanup Zegnat! I was trying to simplify that but didn't think to try traversal. Appears to work in local testing.
#
Zegnat
Have a look at my PR for URL, I use it there, gRegorLove. Seems to pass everything just fine.
#
gRegorLove
There's a couple other locations using that sibling-counting method we can probably simplify.
#
gRegorLove
Cool, will do
#
Zegnat
I did notice that you check for the existence of h-* classes after the XPath, and I made it part of the XPath. Not sure what is best.
snarfed joined the channel
#
snarfed
sknebel: the links table isn't limited by mf2 class, no
#
sknebel
ok, cause mf2_class string Possible values: u-in-reply-to u-repost-of u-like-of u-favorite-of u-invitee u-quotation-of u-bookmark-of NULL made it sound like only some are included
#
snarfed
oh the *values* in mf2_class, yes, just those
#
snarfed
you can use the pages table's links field to see all classes in all links, unfiltered
#
sknebel
right now I can't even get your query from github to work (grumble)
#
sknebel
> SELECT list expression references column to_url which is neither grouped nor aggregated
#
sknebel
was curious if the ratio of relative/absolute link is different when only looking at properties
have joined the channel
#
snarfed
sknebel: huh, odd. are you on the new UI? ie https://console.cloud.google.com/bigquery
#
snarfed
check in query options that you're using standard SQL, not legacy
#
sknebel
yes, and I am on standard
#
snarfed
and you have the last line `GROUP BY absolute`?
#
sknebel
bad error message, but that was it
#
sknebel
so I change that to use page.links directly instead, and filter the classes for mf2 classes...
#
snarfed
page.links is a repeated field, so it's a bit trickier, but roughly yes
#
snarfed
start with eg `select count(*) from indiemap.pages p, p.links l`, and then work with l
#
snarfed
(the comma does an implicit UNNEST, which is the key)
#
snarfed
bbi a bit
#
snarfed
(sknebel also https://github.com/microformats/mf2py/pull/124 needs a resolve. thanks in advance!)
#
Loqi
[sknebel] #124 Catch ValueError from urljoin, pass through broken value instead
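(Editorial sketch, not part of the original conversation.) The behaviour described in the PR title, catching `ValueError` from `urljoin` and passing the broken value through, could look roughly like this. `safe_urljoin` is a hypothetical name, and this is not claimed to be the code merged into mf2py.

```python
from urllib.parse import urljoin


def safe_urljoin(base, value):
    """Join value onto base, returning value unchanged if it cannot be parsed."""
    try:
        return urljoin(base, value)
    except ValueError:
        # urljoin raises ValueError for some malformed input, such as an
        # unbalanced "[" in the host ("Invalid IPv6 URL"). Passing the
        # original value through keeps one bad link from aborting the parse.
        return value
```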
#
gRegorLove
Zegnat: Ok, moved those into the xpath. I'm thinking we should add a check for `false` returned from xpath->query() as well
#
Zegnat
That would probably be wise, gRegorLove
#
Zegnat
I was trying ->evaluate() to immediately get the href out instead of a nodelist, but couldn't get that to work.
#
sknebel
snarfed++ for indiemap
#
Loqi
snarfed has 2 karma in this channel over the last year (114 in all channels)
#
sknebel
(apparently in mf2-properties or at least things with classes that look like it, absolute urls are 22690412:24135988)
#
sknebel
(that's missing implied properties)
#
gRegorLove
Oh, I should use resolveUrl in there as well
israfel11 and tantek joined the channel
#
gRegorLove
Ok, final changes pushed
#
aaronpk
btw for any ops here, if you deop yourself you'll stop seeing the spam
snarfed joined the channel
#
KartikPrabhu
sknebel: snarfed: maybe before you guys release new mf2py it might be good to do some clean up e.g. https://lgtm.com/projects/g/tommorris/mf2py/alerts/?mode=list
Cisien27 joined the channel
#
sknebel
snarfed: rebased and new PR on top of it submitted
VoidWhisperer joined the channel
#
gRegorLove
aaronpk, mf2/tests is showing "abandoned" on packagist https://packagist.org/packages/mf2/tests
#
aaronpk
why would it do that
#
aaronpk
I clicked the "un-abandon" button
#
aaronpk
no idea how that happened
#
gRegorLove
haha :shrug:
#
snarfed
sknebel++ thanks! merged!
#
Loqi
sknebel has 20 karma in this channel over the last year (79 in all channels)
#
Zegnat
gRegorLove, should I go through the implied name code too, and see if I can refactor it to match what we’re doing for photo and url?
#
gRegorLove
Sounds good
#
gRegorLove
Any objection to parseImpliedPhoto() always returning an absolute URL (or false, if none)?
#
Zegnat
Not from me.
#
Zegnat
What was the reason to put it in its own method anyway? I just inlined the implied url parsing
iamtakingiteasy joined the channel
#
gRegorLove
I don't remember exactly. Mostly aesthetics, probably, since parseH is so long
#
sknebel
now I'm curious to see what happens to bridgy performance with the new mf2py
#
KartikPrabhu
sknebel: can we collect a large repo of in-the-wild examples to do this stress test before live deployment to bridgy?
[cleverdevil] joined the channel
#
sknebel
I asked about that too recently, we probably could collect a sample set of pages from bridgy activity or indiemap or ?, since I've measured performance against a single large example for now
#
snarfed
yeah indiemap has those examples if we really want them, 5.8M pages worth :P https://indiemap.org/docs.html#crawl
#
KartikPrabhu
we probably could pick about 100/1000 at random of different page sizes
#
snarfed
sknebel has already done good research and benchmarking for these optimizations though, which is more than we've done in general in mf2py since i complained last, so i don't think we need to block on a release
#
snarfed
KartikPrabhu: are you hoping to find bugs? or measure speed more broadly?
#
KartikPrabhu
snarfed: both
#
KartikPrabhu
it would be useful for future testing
#
snarfed
sure. go for it!
#
KartikPrabhu
I have no idea how to set such things
#
snarfed
gotta love learning :P
#
snarfed
after a day or so on bridgy, it will have parsed 10-50k pages, so on the bugs front, if those all work ok, i'm ok with it
#
snarfed
more measurements never hurt, but again i don't think we need to block the release on them, since sknebel's measurements so far are clearly positive
#
sknebel
I see a few things: a) performance - we can repeat measurements against different versions, and a representative sample means profiling gives more representative results for various cases (e.g. I have a commit here that I think should help in some cases, but in my test example it doesn't do much since the example markup is "too nice")
#
sknebel
b) bugs: outside crash-bugs and bad violations of output rules that seem relatively unlikely, running a specific version over random in-the-wild data tells us little without looking at each example individually and figuring out what the output actually should be
#
sknebel
it could maybe be useful to track changes between versions and compare to other parsers
#
sknebel
e.g. take 1000 pages, parse with mf2py and php-mf2, see where they disagree
#
sknebel
similarly, if a change that's not supposed to change output changes output for one of those pages, that's something to investigate and find things that aren't properly covered by tests
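(Editorial sketch, not part of the original conversation.) Finding where two parsers, or two versions of one parser, disagree on the same page could be done by walking both JSON outputs and reporting the paths that differ. `diff_paths` is a hypothetical helper; the inputs are assumed to be the already-parsed mf2 JSON from each parser.

```python
def diff_paths(a, b, path=""):
    """Yield slash-separated paths at which two JSON-like structures differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            sub = f"{path}/{key}"
            if key not in a or key not in b:
                yield sub  # key present on one side only
            else:
                yield from diff_paths(a[key], b[key], sub)
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            yield path or "/"  # report the list itself when lengths differ
        else:
            for i, (x, y) in enumerate(zip(a, b)):
                yield from diff_paths(x, y, f"{path}/{i}")
    elif a != b:
        yield path or "/"  # differing scalars or mismatched types
```

Run over a sample of pages, the collected paths give a quick view of which properties account for most disagreements, for example URL values versus `e-*` HTML.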
#
@FusedMind
↩️ @jenchan_atl @apply_imagine How does structured markup (RDFa, JSON-LD, and microformats) affect SERP's?
(twitter.com/_/status/1026909466292310016)
#
Zegnat
gRegorLove: https://github.com/microformats/php-mf2/pull/193 :) That should clear up that part of the parser I suppose
#
Loqi
[Zegnat] #193 Refactor implied name to match photo and url
#
gRegorLove
No more exceptions, woo!
#
gRegorLove
I'll take a closer look later. I'm sure it's good though.
#
gRegorLove
Zegnat++
#
Loqi
zegnat has 16 karma in this channel over the last year (139 in all channels)
#
Zegnat
I think the other two are definitely up for merge. Would appreciate if you give this a once over first. But this should make all of it a lot closer to the spec-as-written :)
yawkat1, barpthewire and Contessa joined the channel
#
sknebel
php-mf2 having two functions called resolveUrl in the same file is not confusing at all... at least they have a different number of parameters
bradenslen joined the channel
#
Zegnat
One function, one class method?
#
sknebel
(btw, those could be an interesting starting point for performance work, in mf2py the equivalent functions were)
#
Zegnat
The outer resolveUrl function is part of a collection of functions at the bottom of the file that are our URL parsing implementation. There is no standard PHP lib that could be depended upon.
#
Zegnat
The fact it has the same name as the parser-internal method that triggers it might be a bit confusing I admit. But I am not sure how much performance there is to win there without sitting down with the specs/RFCs and refactoring url normalisation in general :(
snarfed, [eddie], Brace18, [jgmac1106], [kevinmarks], mist25 and [pfefferle] joined the channel
#
sknebel
you'd have to measure if url_parse is slow or not, please don't cargo-cult this, but you don't need much thinking for something like that :D
#
Zegnat
That’s actually an if we could add, ha
nOgAnOo joined the channel
#
Zegnat
I should look into some PHP perf tools. No idea how to see what parts may be slow.
#
Loqi
I agree
jackjamieson, KartikPrabhu, barpthewire1, snarfed, snarfed1, SailorHaumea24, Guest29805, Shanmugamp723, tantek__ and LooCfur joined the channel
#
snarfed
no bridgy crashes after ~9h on the latest mf2py. good sign, even if it's a low bar :P
KindOne7, [shaners], [kevinmarks], KartikPrabhu, [jgmac1106] and [matpacker] joined the channel