Went a little wild with the wiki leaderboard scripting and am now grabbing _all_ wiki users and attempting to parse their homepages for microformats2 data:
~2750 users, `n` of which are spam-ish/not URLs/don’t resolve properly. Turns out I’m overloading my own service (https://micromicro.cc) by blasting it with requests to parse that many URLs. 🙃
Nice, that'd explain the intermittent 500s I'd been seeing ☺ good fix, aaronpk!
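(Editor's sketch of one way to avoid overloading a parser with ~2750 URLs: cap the number of requests in flight. This is not the fix that was actually deployed. `parse_all`, `parse_one`, and `max_concurrent` are hypothetical names, and `parse_one` stands in for whatever coroutine actually calls the parsing service.)

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def parse_all(
    urls: Iterable[str],
    parse_one: Callable[[str], Awaitable[dict]],  # hypothetical: calls the parser
    max_concurrent: int = 5,                      # assumed limit, tune to taste
) -> dict:
    """Run parse_one on every URL with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results: dict = {}

    async def worker(url: str) -> None:
        async with semaphore:  # wait here until a slot is free
            try:
                results[url] = await parse_one(url)
            except Exception as exc:
                # Spam-ish, non-URL, or unresolvable entries are recorded, not fatal.
                results[url] = {"error": str(exc)}

    await asyncio.gather(*(worker(url) for url in urls))
    return results
```

Something like `asyncio.run(parse_all(user_urls, parse_one, max_concurrent=5))` would then work through the whole list without ever sending more than five requests at once.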
[snarfed] Did you have any reservations open-sourcing the IndieMap data?
I'm thinking about doing this for IndieWeb Search.
capjamesg no big reservations, no
I didn't think about it as open sourcing as much as just republishing, and that they explicitly retain their original copyright etc. https://github.com/snarfed/indie-map/#indie-map
I have gotten one removal request so far, which I complied with. https://github.com/snarfed/indie-map/issues/2
[csarven] #2 Request to omit all statements on csarven.ca
I love that title. 🙂 Skimming through, nice to see they’re using JSON Feed extensions for some extra metadata.
A great site and blog in general!
Having a discussion elsewhere around whether using microformats as CSS hooks is a good/bad idea. I vaguely remember this coming up here a few times, but can't find any resources. Does anyone know any Wiki pages or blog posts that sum up arguments for/against?
i believe the consensus is it's a bad idea
that is what I felt, but I'm struggling to explain _why_ that is the case 😅
the reason i stopped doing it is so that i can change the structure of the microformats classes without affecting the presentation
i guess that hasn't mattered much lately but was very helpful as i was developing the site especially when adding new microformats markup to it
that makes sense
Same here
[Murray], aaronpk, pretty sure it's an FAQ (why not to use microformats class names for CSS rules)
hm not seeing anything specific on https://microformats.org/wiki/faq
hmm, it looks like either it's not or there's possibly some out of date thinking on this 😬
^ I just updated this FAQ. aaronpk, [Murray] can you review ?
Somewhat related, Adrian Roselli recommends using CSS selectors to try and enforce proper use of ARIA https://adrianroselli.com/2021/06/using-css-to-enforce-accessibility.html
Fascinating. Will have to read that to see if the reasoning is any different from when we originally thought it was good to use microformats classes as styling hooks as well
Ah that's very different reasoning and use-cases! The stuff about using ARIA as hooks for the state of an element makes complete sense, though I feel there should've been at least a mention of the respective CSS pseudo-classes, their similarities, and when to use one or the other or both
ah, interesting point. if i understand this advice correctly, wherever a CSS pseudo-class exists typically you wouldn't be using ARIA anyway, since the preference is to use HTML-native elements and semantics first, with ARIA as a fallback.
Yes that is how it's supposed to work. As with anything there's always the danger of overuse (remember when people used to put u-photo on every img tag?)
[tantek]4: I think I need to fix some code that still overscopes on that
A CSV with a few million links from IndieWeb Search is available now: https://github.com/capjamesg/indieweb-search-links/blob/main/data/2022-09-12.csv.gz
nice, i see 1.4 million lines.. i'm going to feed unique domains back into indieweb.rocks
I crawled your people.txt list plus a few others. I wouldn't recommend crawling the "target" links.
angelo I can send you raw HTML if it's helpful too.
you have 1.4 million pages of raw HTML?
I have 250k.
Each page can have multiple outgoing links which is why there are 1.4 million lines of links.
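(Editor's sketch of pulling unique domains out of a gzipped links CSV like the one above. The column name `source` is an assumption, since the file's header isn't shown here; only "target" is mentioned in the chat. `read_links` and `unique_domains` are hypothetical helper names.)

```python
import csv
import gzip
from urllib.parse import urlsplit


def read_links(path):
    """Yield one dict per CSV row from a gzip-compressed file with a header row."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def unique_domains(rows, column="source"):  # "source" is an assumed column name
    """Collect the distinct hostnames found in one column of the link rows."""
    domains = set()
    for row in rows:
        try:
            host = urlsplit(row.get(column) or "").hostname
        except ValueError:
            continue  # malformed URL, skip it
        if host:
            domains.add(host)  # urlsplit already lowercases the hostname
    return domains
```

With 250k pages and 1.4 million link rows, many rows share a source domain, so the resulting set should be far smaller than the line count.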
if there's a reasonable way to transfer them i could put them to use.. what size are we talking about?
if you could limit it to the domain's homepage only, that'd be ideal
4.3gb according to Elasticsearch.
(if there are any stats people are interested in, let me know so I can add them to the crawler - crawls run weekly now)
what are the h_card and is_homepage records?
is_homepage tracks if a page is a root.
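(Editor's sketch of what an "is this page a root" check could look like. The crawler's real rule isn't shown in the chat, so the definition here, an empty or `/` path with no query string, is an assumption.)

```python
from urllib.parse import urlsplit


def is_homepage(url):
    """Return True when the URL points at a site root (assumed definition)."""
    parts = urlsplit(url)
    # A root has a hostname, a path of "" or "/", and no query string.
    return bool(parts.hostname) and parts.path in ("", "/") and not parts.query
```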
h-card is the h-card found on the page.
I use that to show profile pictures in search results.
is the h-card representative?
Let me check.
I'm sorry but it's not.
It looks for an h-card on the page and if it can't find one it does authorship discovery.
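(Editor's sketch of that lookup order over already-parsed microformats2 JSON: take the first h-card anywhere on the page, otherwise fall back to authorship discovery. As noted above, this is not the representative h-card algorithm. `find_h_card` and `profile_card` are hypothetical names, and `discover_author` is a placeholder callable for whatever authorship discovery the crawler really runs.)

```python
def find_h_card(items):
    """Depth-first search for the first item whose type includes h-card."""
    for item in items:
        if not isinstance(item, dict):
            continue
        if "h-card" in item.get("type", []):
            return item
        # Look inside nested items: children plus any embedded property values.
        nested = list(item.get("children", []))
        for values in item.get("properties", {}).values():
            nested.extend(v for v in values if isinstance(v, dict))
        found = find_h_card(nested)
        if found is not None:
            return found
    return None


def profile_card(parsed, discover_author):
    """Use the first h-card on the page, else defer to authorship discovery."""
    card = find_h_card(parsed.get("items", []))
    return card if card is not None else discover_author(parsed)
```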
that's cool. if you could give me a list of the domains with h-cards on them that'd be exactly what i'm looking for. how difficult would that be to pull out?
Not too difficult. I can share it tomorrow.
