#dev 2022-09-16

2022-09-16 UTC
AramZS joined the channel
#
angelo
i'm most interested in the <lastmod> tag in sitemaps; seems the most efficient way to communicate site-wide changes to a bot
#
[tantek]4
and then you can block bots which disobey it
#
[schmarty]1
(many folks complain about Google ignoring `lastmod`: https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap?hl=en)
geoffo and AramZS_ joined the channel
geoffo joined the channel
#
aaronpk
hmm new twitter char counting failed in the same way
#
corlaez
429s are, I think, impossible to implement for static websites. What if we take a more aggressive stand and just block Google altogether until they change their crawling decisions (maybe never, but I feel everyone can implement this and it could create more buzz)
#
corlaez
Well, for some static websites, the ones that do not control the actual server hosting them
#
corlaez
(the cheapest way to go for static sites)
geoffo, jacky, Seirdy, angelo and gRegor joined the channel
#
gRegor
Do crawlers mostly ignore ETag and Last-Modified? Otherwise seems like those would be preferable over a sitemap with <lastmod>
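A rough sketch of the conditional-request flow gRegor is asking about, from the crawler's side. The function names are made up for illustration, and the validators are assumed to have been saved from the `ETag` and `Last-Modified` response headers of the previous crawl; nothing here reflects how any particular crawler actually behaves.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build revalidation headers from validators saved on the previous crawl."""
    headers = {}
    if etag:
        # The server compares this against the current ETag of the resource.
        headers["If-None-Match"] = etag
    if last_modified:
        # Sent back verbatim as an HTTP-date string.
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_reindex(status: int) -> bool:
    """Only a 200 carries a new body; 304 Not Modified means the stored copy is current."""
    return status == 200
```

Every document still costs one request, but a 304 response has no body, so the saving is in bandwidth rather than in request count.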
geoffo, gRegorLove_ and mro joined the channel
#
@ton_zylstra
↩️ that was also the point: make WP compliant with microformats2, and the classes for post kinds. So that themes and blocks can work with them by default. If you then turn on webmention, suddenly 40% of the web is an open social platform, outside the silos.
(twitter.com/_/status/1570685363835863045)
jjuran, gRegorLove_, mro_ and mro joined the channel
#
capjamesg
angelo I have to start processing 429s in IndieWeb Search.
#
capjamesg
angelo IndieWeb Search crawls your home page and spiders out from there if you don't have a sitemap.
#
capjamesg
But... URLs in a sitemap do get priority in IndieWeb Search.
#
capjamesg
That's only because they are discovered first and URLs are crawled in order of discovery.
#
capjamesg
Unless your site has over 30,000 pages, this is inconsequential.
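For the 429 processing mentioned a few messages up, here is a minimal sketch of how a crawler might turn a response into a wait time. It is not IndieWeb Search's code; `retry_delay` and the 60-second fallback are assumptions for illustration. `Retry-After` can be either a number of seconds or an HTTP-date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(status: int, retry_after: Optional[str], now: datetime,
                default: float = 60.0) -> float:
    """Seconds to wait before re-requesting; 0.0 if the response was not a 429."""
    if status != 429:
        return 0.0
    if retry_after is None:
        return default  # hypothetical fallback when the server gives no hint
    value = retry_after.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back rather than guess
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

`now` is expected to be a timezone-aware datetime, for example `datetime.now(timezone.utc)`.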
mro and tetov-irc joined the channel
#
[tonz]
[snarfed] wrt ‘it may add up to significant bandwidth for big sites’, an example given was a single-page WordPress site with 4 files, getting single-digit visitors per week, content not changing, still getting 600K hits from bots and crawlers per month. In part because of all those unnecessary URLs (and crawlers being wasteful themselves). A big site saw Google's crawler hit their site every 2 seconds to index the whole thing. So it’s
#
[tonz]
probably significant for every site (until crawlers learn / are taught to tone it down)
jonnybarnes, angelo_, jacky, jordemort and jacky__ joined the channel
#
jordemort
(moving from #indieweb) i'm working on a client-side search engine for my static site using sql.js and had the thought: what if there was something like <link rel="sqlsite" href="/path/to/sqllite.db" /> that served up a sqlite database with all of a site's posts indexed in some sort of agreed-upon schema? (prolly based on h-entry)
#
jordemort
then it'd be easy to build some "standard" client-side search javascript, or CLI tools to search sites that implemented it, or metasearch engines where you could pick sets of sites to search together
#
[schmarty]1
reminds me of this hack to use http as a virtual filesystem for sql.js: https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/
#
jordemort
yeah, i'm building on that
#
jordemort
so clients wouldn't even have to download the whole index, just range-request what they need
#
[schmarty]1
this would be relevant for capjamesg who has been building his own indieweb search crawler/index/etc https://indieweb-search.jamesg.blog/
#
jordemort
regular sql.js works fine over http too, it just has to fetch the whole database; the range requests are the real magic in httpvfs
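To make the range-request idea concrete, this sketch computes the `Range` header a client would send to fetch a single database page instead of the whole file. It is not taken from httpvfs; `page_range_header` is a hypothetical helper, and the 4096-byte page size is an assumption (the real value depends on how the database was built). The server also has to support range requests and answer with 206 Partial Content.

```python
from typing import Dict


def page_range_header(page_number: int, page_size: int = 4096) -> Dict[str, str]:
    """Range header covering one database page (pages are numbered from 1)."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    start = (page_number - 1) * page_size
    end = start + page_size - 1  # HTTP byte ranges are inclusive at both ends
    return {"Range": f"bytes={start}-{end}"}
```

A lookup that touches a handful of index and table pages then costs a handful of small requests, no matter how large the database file is.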
#
[schmarty]1
federated and cross-site search, where each site hosts their own search and some tool makes multiple requests and aggregates results, has come up a few times but it has a lot of variables and i don't know that anyone has made a real go of it.
#
[schmarty]1
jordemort: personally i'd be willing to try it out once you have something working for your site that is replicable!
[Will_Monroe] and jacky joined the channel
#
jordemort
unrelated, except that i'm doing my indexing by parsing my mf2 metadata: did i miss it or is there no standard way to mark up tags/categories in h-entry?
#
Loqi
misses it or is there no standard way to mark up tags too
#
jordemort
i've started using `p-tag` which all the parsers seem to pick up
#
[schmarty]1
parsers are vocabulary-agnostic, so they'll pick up `p-anything`
#
[schmarty]1
what are tags?
#
Loqi
tags or tagging refers to categorizing or labeling content, your own or others (tag-reply), with words, phrases, names, or other information, optionally linked to specific people, events, locations, such as the practice of tagging posts being about certain people (person-tag), like tagging people or other items where (area-tag) they're depicted in a photo https://indieweb.org/tags
#
[schmarty]1
there's a how to markup section there. `p-category` is most common, i think
#
jordemort
ah p-category
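For an indexer like the one jordemort describes, pulling `p-category` values out of a page could look roughly like this. It is a deliberately simplified stand-in that only collects the text of elements carrying the class; a real microformats2 parser applies many more rules. `CategoryExtractor` and `extract_categories` are names invented for this sketch.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class CategoryExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.categories = []
        self._depth = 0  # greater than 0 while inside a p-category element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._depth:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "p-category" in classes:
            self._depth = 1
            self._text = []

    def handle_endtag(self, tag):
        if tag in VOID or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.categories.append("".join(self._text).strip())

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def extract_categories(html):
    """Return the text of every element with class p-category, in document order."""
    parser = CategoryExtractor()
    parser.feed(html)
    parser.close()
    return parser.categories
```

Fed a post whose tags are marked up as `<a class="p-category" href="/tags/search">search</a>`, it returns the tag text as a list of strings.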
jacky, AramZS and geoffo joined the channel
#
aaronpk
oh jeez, sorry in advance, telegraph was down for the last couple days so it's catching up now
#
aaronpk
apparently it didn't start back up when i rebooted the server
AramZS joined the channel
#
jacky
TIL about https://boardgamegeek.com/, might be a /POSSE location for some?
#
[schmarty]1
heh. any idea how many stuck mentions there were?
#
[schmarty]1
jacky: POSSE for board game reviews?
#
jacky
yeah! and possibly even collections and gameplay
jacky joined the channel
#
jacky
they allow for a lot of interesting info
#
jacky
might document the site if I eventually do some manual POSSEing
#
jacky
it even goes as far as having acquisition info and MSRP versus purchase price
jacky joined the channel
#
angelo_
re: <lastmod> vs ETag/Last-Modified; one request to a sufficiently marked up sitemap will allow a bot to pinpoint the four documents out of thousands that need to be re-crawled. conditionally requesting still requires every document to be hit.
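angelo's point can be sketched as a single pass over one sitemap response. This is only an illustration, not anyone's actual crawler: `urls_to_recrawl` and the `last_crawled` mapping (URL to timezone-aware datetime of the previous crawl) are assumptions, and `parse_lastmod` only handles full dates and datetimes, not the year-only or year-month forms that sitemaps also allow.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse values such as 2022-09-16 or 2022-09-16T08:30:00Z into aware datetimes."""
    parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    return parsed


def urls_to_recrawl(sitemap_xml, last_crawled):
    """Return URLs that are new, lack <lastmod>, or changed since the last crawl."""
    stale = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        seen = last_crawled.get(loc)
        # Without a previous crawl or a <lastmod> we cannot rule out a change.
        if seen is None or lastmod is None or parse_lastmod(lastmod) > seen:
            stale.append(loc)
    return stale
```

The trade-off is trust: this only works as well as the site's `<lastmod>` values are accurate, whereas a conditional request asks the server about each document directly.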
angelo joined the channel
#
[snarfed]
jordemort adoption is usually the biggest challenge with any idea like this. there's a big established ecosystem around the existing adopted standards (HTML etc) and big established search engines. they may not be perfect, but they're fully adopted
#
[snarfed]
new ideas like this will struggle to get more than a few sites to adopt them at the beginning, so the resulting search engines' indexes will be unusably incomplete, so people won't use them much, so other publishers won't be incentivized to adopt
#
[snarfed]
chicken and egg
#
[snarfed]
one way to handle that is to supplement results with existing search engines until adoption hits critical mass
#
[snarfed]
(and this is all still just considering centralized search. federated search, ie send the query to all/many nodes and compile the results, I don't even know how to begin thinking about, so much of it seems so intractable. I'm honestly curious how the fediverse does it, if at all)
jeremycherfas joined the channel
#
jordemort
i don't know if i necessarily care about massive adoption / uptake 😉
#
@wikipediachain
↩️ XSL > Web Ontology Language > Semantic HTML > RDF/XML > DOAP > Rule-based system > Simple Knowledge Organization System > Agora (web browser) > IndieAuth > XPointer > XSL > WebXR > Oculus Rift S > Hack (programming language) > List of mergers and acquisitions by Meta Platforms
(twitter.com/_/status/1570825778585010178)
#
jordemort
but i'll prototype it on my blog, maybe turn a couple books into other example sites for a demo of "federated" search
#
angelo
so i just began consuming opensearch files; see https://indieweb.rocks/adactio.com left column below his card; if you try a search you'll see that you wind up on his site and that the results are marked up
#
jacky
[snarfed]: just asked about masto search and got this link! https://docs.joinmastodon.org/admin/optional/elasticsearch/
#
jacky
tl;dr: it only searches stuff you've interacted with
#
[snarfed]
yeaahhhh
jacky, geoffo, jeremycherfas and gRegor joined the channel
angelo joined the channel
#
angelo
whoa i did not know that about mastodon.. time to expand representative h-card parsing beyond just domains
CrowderSoup joined the channel
#
capjamesg
I'm not sure IndieWeb Search could handle mastodon.
#
capjamesg
[snarfed] What did you crawl for IndieMap?
#
capjamesg
Just HTML or more?
#
[snarfed]
tried hard to limit to HTML
jacky, ben_thatmustbeme, AramZS and wagle joined the channel
#
[KevinMarks]
Angelo see also tumblr
#
capjamesg
[KevinMarks] Did Tumblr get back to you?
#
[KevinMarks]
No, I owe them a ping still, though Matt saying that they want to solve comments sounds promising
#
capjamesg
I know right?
#
capjamesg
I still have that interview open.
barnaby joined the channel
#
angelo
KevinMarks definitely looking forward to expanding my search radius once i work out all of the kinks with this first corpus
jacky joined the channel
#
angelo
the thing that excites me about mastodon is its consistency; i feel like i'll get near 100% mf support across https://www.fediversesearch.com/api/v1/search/?keyword=&software=mastodon
#
angelo
whereas my immediate experience with tumblr is that many/most people stray from the default theme
geoffo, barnaby, [Jamie_Tanna] and jacky joined the channel
#
angelo
that said, they do have sitemaps with <lastmod>: https://indieweb-test.tumblr.com/sitemap1.xml ; while i expect tumblr users don't do much updating of old posts, i feel confident that finding the few that do would be doable just by watching the sitemaps
#
angelo
ETag on the sitemap.xml would be *chef's kiss*
AramZS, gRegor and Ruxton joined the channel
#
[KevinMarks]
Also I need to write tests for the different post and composite cases as I think I may have some of them wrong. Then we can evangelize other theme authors once we show value
jacky, tetov-irc, nertzy, gRegor and geoffo joined the channel