#dev 2024-06-05

2024-06-05 UTC
to2ds, geoffo, [snarfed], btrem, sivoais, gRegorLove_, Tiffany, gRegor, [Murray] and GuestZero joined the channel
#
capjamesg[d]
[snarfed] Does granary have a function to identify the type of an arbitrary feed?
#
capjamesg[d]
(i.e. example.com/a.xml is Atom, etc.)
#
[snarfed]
capjamesg beyond the Content-Type header?
#
capjamesg[d]
I wondered if you had a single function for it.
#
[snarfed]
again though, do you just want Content-Type? or do you want sniffing for when that header is missing? or something else?
#
capjamesg[d]
The content type header is fine.
#
[snarfed]
granary doesn't do arbitrary sniffing, but Content-Type should give you the answer 99% of the time
#
capjamesg[d]
I have a project where I want to serialize feeds in different formats into JSON feed.
#
capjamesg[d]
I wondered if there was a granary.to_jsonfeed(text) function that would take my text and return it as JSON feed, without my having to specify feed type.
#
[snarfed]
no, it doesn't sniff out the type of arbitrary text
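A minimal sketch of the Content-Type approach discussed above; the mapping and the `detect_feed_type` helper are illustrative, not part of granary, and real servers often send loose or wrong types.

```python
# Sketch: guess a feed's format from its Content-Type header alone.
# The mapping is illustrative; treat it as a best-effort first pass.
import requests

FEED_TYPES = {
    "application/atom+xml": "atom",
    "application/rss+xml": "rss",
    "application/feed+json": "jsonfeed",
    "application/json": "jsonfeed",   # many JSON Feeds are served as plain JSON
    "application/mf2+json": "mf2json",
    "text/html": "html",              # page with embedded microformats2
}

def detect_feed_type(url):
    """Return a rough feed type for url based only on Content-Type."""
    resp = requests.get(url, timeout=30)
    content_type = resp.headers.get("Content-Type", "").split(";")[0].strip().lower()
    return FEED_TYPES.get(content_type, "unknown"), resp.text
```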
#
capjamesg[d]
Do we have a list of standard content types somewhere on the wiki?
#
capjamesg[d]
This is all so exciting by the way [snarfed]!
#
capjamesg[d]
I have written feed polling code before and it's no fun.
#
[snarfed]
hah true. what's so exciting?
#
capjamesg[d]
That I don't have to write the conversion code 😄
#
[snarfed]
hah good!
#
[snarfed]
you could look at IANA's mime type registry, but that's probably more than you want
#
capjamesg[d]
What is the one for mf2 JSON and mf2 text?
#
Loqi
It looks like we don't have a page for "one for mf2 JSON and mf2 text" yet. Would you like to create it? (Or just say "one for mf2 JSON and mf2 text is ____", a sentence describing the term)
#
capjamesg[d]
I assume mf2 text is application/html.
#
capjamesg[d]
Since you're retrieving a web page.
#
capjamesg[d]
But what about a JSON file that is formatted as mf2 json? I feel like I have heard that is possible before?
#
[snarfed]
application/mf2+json ?
#
capjamesg[d]
That's it!
#
capjamesg[d]
And for that I'd use `microformats2.json_to_activities`, it looks like.
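Roughly how those pieces fit together, as a sketch: the granary function names (`atom.atom_to_activities`, `microformats2.json_to_activities`, `jsonfeed.activities_to_jsonfeed`) are taken from this discussion and granary's docs, so check the exact signatures against the version you install.

```python
# Sketch: normalise fetched feed text to JSON Feed via granary's AS1 layer.
# Verify the function signatures against the installed granary version.
import json

from granary import atom, jsonfeed, microformats2

def to_jsonfeed(feed_type, text):
    """Convert raw feed text (Atom or mf2 JSON) into a JSON Feed dict."""
    if feed_type == "atom":
        activities = atom.atom_to_activities(text)
    elif feed_type == "mf2json":
        # json_to_activities is assumed here to take parsed mf2 JSON
        activities = microformats2.json_to_activities(json.loads(text))
    else:
        raise ValueError(f"unsupported feed type: {feed_type}")
    return jsonfeed.activities_to_jsonfeed(activities)
```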
#
capjamesg[d]
Blog post incoming 😄
#
[snarfed]
https://feedparser.readthedocs.io/ (venerable, legendary) is the other Python lib worth looking at
#
[KevinMarks]
I've made bookmarklets for ShareOpenly:
#
[KevinMarks]
or in a new window:
#
capjamesg[d]
I wanted a universal feed -> JSON feed convertor.
#
[tantek]
^ [manton]
#
capjamesg[d]
JSON Feed is so easy to work with.
#
capjamesg[d]
It's my preferred choice for parsing feed data.
barnabywalters joined the channel
#
[KevinMarks]
hm, FeedParser is kind of that, but it's not JSON Feed, it's its own format
#
[KevinMarks]
also we'd need to add mf2 and json feed support to it
#
[snarfed]
capjamesg++ nice!
#
Loqi
capjamesg has 39 karma in this channel over the last year (197 in all channels)
#
[KevinMarks]
If you're polling, you will want to implement this stuff https://pythonhosted.org/feedparser/http-etag.html
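A minimal example of the conditional-GET pattern from that feedparser doc: pass the `etag` and `modified` values from the previous poll back in, and a 304 means nothing changed.

```python
# Sketch: polite polling with feedparser's ETag / Last-Modified support.
# Storing the validators between polls makes unchanged feeds nearly free
# for the server.
import feedparser

def poll(url, etag=None, modified=None):
    d = feedparser.parse(url, etag=etag, modified=modified)
    if getattr(d, "status", None) == 304:
        return [], etag, modified          # nothing new since the last poll
    return d.entries, getattr(d, "etag", None), getattr(d, "modified", None)

entries, etag, modified = poll("https://example.com/feed.xml")
# ...later, pass etag/modified back in to make a conditional request:
entries, etag, modified = poll("https://example.com/feed.xml", etag, modified)
```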
#
[snarfed]
minor point, you might consider passing through source JSON feeds unchanged, since granary round tripping through AS1 may be lossy
GuestZero and jacky joined the channel
#
jacky
https://transistor.fm/changelog/podroll/ looks like a linklog for podcasts? doesn't huffduffer do something like this?
[benatwork] joined the channel
#
[benatwork]
[KevinMarks] NICE!!!
#
[KevinMarks]
the window size stuff should likely be tweaked, I just stole the huffduffer one.
#
[KevinMarks]
Is there a param I can pass to say which URL to share to? I was thinking of a variant that spawns 3 bsky, mastodon and x ones as a POSSE shortcut
#
capjamesg[d]
[tantek] The complexity of running a search engine and the worry that you may take someone's site down or break something were two big reasons I didn't have any motivation to keep running IndieWeb search.
#
[KevinMarks]
It is a lot to take on, agreed
#
capjamesg[d]
Another problem was getting the rankings right. Having a link graph helped _a lot_ with this, but there were still many searches where you would have one post then a dozen archive pages.
#
capjamesg[d]
Given feeds are already an established pattern, it feels like the best place to start.
#
[KevinMarks]
yeah, a lot of the technorati effort was distinguishing post links from other kinds, and also at that time post permalinks could be IDs on an archive page too.
#
capjamesg[d]
Another advantage to the feed approach is that, so long as people put contents / summaries in their feeds, the number of web requests you have to make is substantially reduced.
#
[KevinMarks]
that was another technorati fun bit of coding, comparing the contents of feeds and post permalinks to work out which were summaries and which weren't
#
[snarfed]
heuristics, whee
#
[tantek]
capjamesg[d] the flipside is that feed files tend to be less reliable (less visible), have less of the content, and are useless for historical search
#
[tantek]
aside from indexing Mastodon, I'd avoid depending on feed files
#
[snarfed]
(um also avoid indexing Mastodon 😆)
#
[tantek]
capjamesg[d], strongly agree with the worry that you may take someone's site down or break something. that's worth summarizing in a brainstorm itself about writing considerate crawlers, perhaps on /search
#
[tantek]
[snarfed] of course I meant only with 100% opt-in
#
[tantek]
which tbh is how to scale a high-quality blog/social search engine. start with only your own blog. then allow people to use IndieAuth to sign-in to opt-in to having their own blog indexed and go from there. will scale nice and slowly so you can keep up with the necessary engineering
#
[tantek]
if someone opts-in with their site/blog, and they have a confirmed rel=me to a separate Mastodon account, then prompt them with a [ ] yes please also index my Mastodon profile @foo@bar
#
[tantek]
so it's double-opt-in for someone's Masto/fedi account
#
[tantek]
and only their direct posts IMO. I would not index reposts
#
[snarfed]
or probably just use the `indexable` field (I think) that Mastodon already exposes for that opt in
#
[tantek]
nah, I'd gate it this way just for deliberately slower adoption to catch problems sooner
#
capjamesg[d]
I will definitely not be exploring Mastodon search haha.
#
[tantek]
capjamesg[d], as someone who has a separate Mastodon profile/posts, I'm surprised as I expect you would want it at least for yourself!
#
capjamesg[d]
I use Mastodon more for distributing my blog posts and having nice conversations. If I wanted an archive, I'd put it on my blog 😄
#
capjamesg[d]
#blogfirst 😛
#
[tantek]
capjamesg[d] people used to say that about Twitter too
#
capjamesg[d]
To the point about taking down someone's site, one of my biggest worries was that the URL canonicalisation logic would miss an edge case and cause an exploding, but technically valid, URL sequence.
#
[tantek]
and then years later realized actually I do want to search my Twitter
#
[tantek]
capjamesg[d] a limiter on number of URLs from a particular domain per day would be a good thing to build into any crawler
#
[tantek]
that way you know you can only be causing max N hits per day to their site
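A sketch of that per-domain limiter; the budget number and the in-memory counter are illustrative (a real crawler would persist this).

```python
# Sketch: cap how many URLs the crawler fetches from any one host per UTC day.
import datetime
from collections import defaultdict
from urllib.parse import urlparse

MAX_URLS_PER_DOMAIN_PER_DAY = 1000   # illustrative budget

_counts = defaultdict(int)           # (domain, date) -> fetches so far today

def may_fetch(url):
    """Return True if url is still within its domain's daily budget."""
    today = datetime.datetime.now(datetime.timezone.utc).date()
    key = (urlparse(url).netloc, today)
    if _counts[key] >= MAX_URLS_PER_DOMAIN_PER_DAY:
        return False
    _counts[key] += 1
    return True
```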
#
capjamesg[d]
I can't remember exactly what I did, but I have mentally noted things like exponential back-off, strong testing for canonicalisation, respecting 429s / higher incidence of 500s, respecting Retry-After, and crawling multiple sites at once rather than crawling each one sequentially (which concentrates all your crawl capacity on one site at a time).
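For the back-off and Retry-After parts, a sketch using requests with urllib3's `Retry`, which retries 429s/5xxs with exponential back-off and honours Retry-After by default; the retry counts and status list here are illustrative.

```python
# Sketch: exponential back-off that also honours Retry-After on 429/503.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,                               # give up after 5 attempts
    backoff_factor=2,                      # exponential delay between attempts
    status_forcelist=[429, 500, 502, 503], # retry these status codes
    respect_retry_after_header=True,       # sleep for Retry-After when present
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.mount("http://", HTTPAdapter(max_retries=retry))

resp = session.get("https://example.com/feed.xml", timeout=30)
```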
#
capjamesg[d]
[tantek] That's another good point, for sure.
#
[snarfed]
and I assume robots.txt?
#
capjamesg[d]
I also implemented a window that checked crawl speed, with the intent that if a site started to respond slower over time you could schedule part of a crawl for later.
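A sketch of that crawl-speed window; the window size and slow threshold are made-up values for illustration.

```python
# Sketch: track recent response times per site and defer the rest of a
# crawl when the site starts responding more slowly.
from collections import deque

class SpeedWindow:
    def __init__(self, size=20, slow_threshold_s=2.0):
        self.times = deque(maxlen=size)
        self.slow_threshold_s = slow_threshold_s

    def record(self, elapsed_s):
        self.times.append(elapsed_s)

    def should_defer(self):
        """True when the recent average response time looks unhealthy."""
        if len(self.times) < self.times.maxlen:
            return False                   # not enough samples yet
        return sum(self.times) / len(self.times) > self.slow_threshold_s
```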
#
capjamesg[d]
Yes, robots.txt.
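A minimal robots.txt check using Python's standard-library `urllib.robotparser`; a real crawler would cache the parsed file per host, and the user-agent string here is a placeholder.

```python
# Sketch: honour robots.txt before fetching anything from a site.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def allowed(url, user_agent="indieweb-feed-crawler"):
    """Check whether robots.txt permits user_agent to fetch url."""
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                              # fetches and parses robots.txt
    return rp.can_fetch(user_agent, url)
```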
#
[snarfed]
sounds like you're being extremely responsible
#
capjamesg[d]
I didn't have exponential back-off and robust canonicalisation in the original search engine.
#
[snarfed]
beyond that, if you happen to hit a URL that crashes a site, to a large degree that's their problem, not yours
#
[snarfed]
I wouldn't worry about that too much
#
capjamesg[d]
[tantek] looking back, every site had a crawl budget.
#
capjamesg[d]
It was 30k URLs at peak.
#
capjamesg[d]
But that's because I wanted to discover as many URLs as possible.
#
[tantek]
[snarfed] robots.txt is very coarse, and another invisible thing that many users may have no idea about, so I'd deliberately look for it and prompt the user to explicitly confirm that they want to override it if necessary
#
[tantek]
another opt-in I'd add: [ ] index pages/posts even if they contain a meta noindex tag, again because users may not know how to turn some random setting off in their overly complex CMS etc.
#
[tantek]
lots of bad metadata out there, and poor tools to surface/update it, so we have to treat it as suspect (though err on the side of safety and respect it by default)
barnaby joined the channel
barnaby joined the channel
#
sebbu
capjamesg[d], there's also https://www.sitemaps.org/ that should help with indexing
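A minimal sketch of reading a sitemaps.org sitemap with the standard library; it only handles a plain `<urlset>`, not sitemap indexes or gzipped files.

```python
# Sketch: pull the URLs out of a simple sitemap.xml.
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    """Return every <loc> URL listed in a plain <urlset> sitemap."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]
```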
[Paul_Robert_Ll] and barnaby joined the channel
#
[tantek]
[KevinMarks] did we find sitemaps ever help with any indexing at Technorati? My memory says no
#
[KevinMarks]
No, we were feeds and homepage by default, maybe individual permalinks
#
[tantek]
we definitely crawled individual permalinks and I think we had some capacity to crawl archives for folks that signed up to use our personal blog search embed
GuestZero, Yummers and [lcs] joined the channel
#
[KevinMarks]
I think we used follow-your-nose for that rather than sitemaps - the patterns in blogging tools had next/prev and overview links.
GuestZero joined the channel
#
AramZS
Surely fewer blogging tools have that now, and sitemaps have much higher pick-up because of Google's emphasis on them?
jacky, zicklepop0, GuestZero, afrodidact and gRegorLove_ joined the channel
#
capjamesg[d]
[KevinMarks] Let me know if there are any more data patterns I should learn 😄
#
[KevinMarks]
You know the basic mapreduce pattern, right?
#
capjamesg[d]
I came across it when doing IndieWeb search, but only at a cursory level.
#
[KevinMarks]
simply put, you write a thing that processes files looking for structure, dump out the structure and create a simple database from the structure. The magic is that you touch each input file once, and there's some orchestration stuff to let you read them in parallel
#
[KevinMarks]
but you can apply the principle with a single-threaded thing that reads data files and outputs a simple database of what you want from them. Then you can add more input files, or tweak what you're looking for, and still have a repeatable process
#
[KevinMarks]
as opposed to the classic database pattern of designing the database structure really carefully and policing hard what can be inserted and deleted from it.
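A single-threaded sketch of that pattern: one pass over each input file, extract the structure you care about, and write it into a throwaway SQLite database you can rebuild at any time. The JSON Feed-shaped input files are an assumption for illustration.

```python
# Sketch: read each input file once, extract structure, write a simple DB.
import json
import sqlite3
from pathlib import Path

def build_index(input_dir, db_path="index.db"):
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS posts (url TEXT PRIMARY KEY, title TEXT, published TEXT)"
    )
    for path in Path(input_dir).glob("*.json"):       # touch each input file once
        feed = json.loads(path.read_text())
        for item in feed.get("items", []):            # assumes JSON Feed-shaped input
            db.execute(
                "INSERT OR REPLACE INTO posts VALUES (?, ?, ?)",
                (item.get("url"), item.get("title"), item.get("date_published")),
            )
    db.commit()
    db.close()
```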
#
[KevinMarks]
So what Simon calls Baked Data is a variation of this
#
[KevinMarks]
This is like the opposite pattern
[Jo] joined the channel
#
[KevinMarks]
it's a bit like Tantek's bims but dispersed differently
#
[KevinMarks]
honeycomb is great at profiling live systems rather than a specific piece of code
#
[snarfed]
honeycomb++
#
Loqi
honeycomb has 1 karma over the last year
barnaby and jacky joined the channel
#
Loqi
[snarfed] has 55 karma in this channel over the last year (100 in all channels)
#
[tantek]
BridgyFed << 2024-06-05 TechCrunch: [https://techcrunch.com/2024/06/05/bluesky-and-mastodon-users-can-now-talk-to-each-other-with-bridgy-fed/ Bluesky and Mastodon users can now talk to each other with Bridgy Fed]
#
Loqi
ok, I added "2024-06-05 TechCrunch: [https://techcrunch.com/2024/06/05/bluesky-and-mastodon-users-can-now-talk-to-each-other-with-bridgy-fed/ Bluesky and Mastodon users can now talk to each other with Bridgy Fed]" to the "See Also" section of /Bridgy_Fed https://indieweb.org/wiki/index.php?diff=95713&oldid=95374
#
[snarfed]
yes! she and I talk occasionally. this seemed like good exposure overall
#
[snarfed]
second article she's written about BF, she says she's paying more attention to the space now
#
[KevinMarks]
Sarah's always been thorough
barnabywalters joined the channel