#dev 2024-06-05
2024-06-05 UTC
to2ds, geoffo, [snarfed], btrem, sivoais, gRegorLove_, Tiffany, gRegor, [Murray] and GuestZero joined the channel
#
capjamesg[d] [snarfed] Does granary have a function to identify the type of an arbitrary feed?

#
capjamesg[d] (i.e. example.com/a.xml is Atom, etc.)

#
capjamesg[d] I wondered if you had a single function for it.

#
capjamesg[d] The content type header is fine.

#
capjamesg[d] ++

#
capjamesg[d] I have a project where I want to serialize feeds in different formats into JSON feed.

#
capjamesg[d] I wondered if there was a granary.to_jsonfeed(text) function that would take my text and return it as JSON feed, without my having to specify feed type.

#
capjamesg[d] Do we have a list of standard content types somewhere on the wiki?

#
[snarfed] there's https://oauth-dropins.readthedocs.io/en/latest/source/oauth_dropins.webutil.html#oauth_dropins.webutil.util.sniff_json_or_form_encoded , but only for JSON vs form-encoded
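(A minimal, hypothetical sketch of feed-type sniffing along the lines being discussed — `sniff_feed_type` and its return values are invented for illustration and are not granary's API. It checks the Content-Type header first, then falls back to peeking at the leading bytes of the body:)

```python
# Hypothetical sketch: guess a feed's format from its Content-Type
# header, falling back to sniffing the first bytes of the body.
def sniff_feed_type(content_type, body):
    """Return 'atom', 'rss', 'jsonfeed', 'html', or None."""
    ct = (content_type or '').split(';')[0].strip().lower()
    if ct == 'application/atom+xml':
        return 'atom'
    if ct == 'application/rss+xml':
        return 'rss'
    if ct in ('application/feed+json', 'application/json'):
        return 'jsonfeed' if 'jsonfeed.org/version' in body else None
    head = body.lstrip()[:500].lower()
    if head.startswith('{'):
        return 'jsonfeed' if 'jsonfeed.org/version' in body else None
    if '<feed' in head:
        return 'atom'
    if '<rss' in head or '<rdf' in head:
        return 'rss'
    if '<!doctype html' in head or '<html' in head:
        return 'html'
    return None
```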
#
capjamesg[d] This is all so exciting by the way [snarfed]!

#
capjamesg[d] I have written feed polling code before and it's no fun.

#
capjamesg[d] It's great that I don't have to write the conversion code 😄

#
capjamesg[d] What is the one for mf2 JSON and mf2 text?

#
Loqi It looks like we don't have a page for "one for mf2 JSON and mf2 text" yet. Would you like to create it? (Or just say "one for mf2 JSON and mf2 text is ____", a sentence describing the term)

#
capjamesg[d] I assume mf2 text is text/html.

#
capjamesg[d] Since you're retrieving a web page.

#
capjamesg[d] But what about a JSON file that is formatted as mf2 json? I feel like I have heard that is possible before?

#
capjamesg[d] That's it!

#
capjamesg[d] And for that I'd use `microformats2.json_to_activities`, it looks like.

#
capjamesg[d] Blog post incoming 😄

#
[snarfed] https://feedparser.readthedocs.io/ (venerable, legendary) is the other Python lib worth looking at
#
[KevinMarks] I've made bookmarklets for ShareOpenly:

#
[KevinMarks] ```javascript:var w=window.open('https://shareopenly.org/share/?url='+encodeURIComponent(location.href)+'&text='+encodeURIComponent(document.getSelection().toString()||document.title),'shareopenly','scrollbars=1,status=0,resizable=1,location=0,toolbar=0,width=360,height=480');```

#
[KevinMarks] or in a new window:

#
capjamesg[d] [snarfed] This is what I was working on: https://gist.github.com/capjamesg/d8ab9f3e00e8d2709217e481f94ee8fb

#
capjamesg[d] I wanted a universal feed -> JSON Feed converter.

#
capjamesg[d] JSON Feed is so easy to work with.

#
capjamesg[d] It's my preferred choice for parsing feed data.

barnabywalters joined the channel
#
[KevinMarks] hm, FeedParser is kind of that, but it's not JSON Feed, it's its own format

#
[KevinMarks] also we'd need to add mf2 and json feed support to it

#
[KevinMarks] If you're polling, you will want to implement this stuff https://pythonhosted.org/feedparser/http-etag.html
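(The page linked above covers conditional GET. A stdlib-only sketch of the request side — `conditional_headers` is a hypothetical helper, shown only to illustrate the mechanism: echo back the server's ETag and Last-Modified so it can answer 304 Not Modified instead of sending the full feed:)

```python
# Hypothetical sketch of the conditional-GET side of polite polling:
# send back the ETag and Last-Modified values the server gave us last
# time, so it can reply 304 Not Modified instead of the full body.
def conditional_headers(etag=None, last_modified=None):
    headers = {}
    if etag:
        headers['If-None-Match'] = etag
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    return headers
```

feedparser itself supports this directly via its `etag=` and `modified=` arguments to `feedparser.parse()`, which is what the linked page documents.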

#
capjamesg[d] ++

#
capjamesg[d] Yeah.

GuestZero and jacky joined the channel
#
jacky https://transistor.fm/changelog/podroll/ looks like a linklog for podcasts? doesn't huffduffer do something like this?

[benatwork] joined the channel
#
[benatwork] [KevinMarks] NICE!!!
#
[KevinMarks] the window size stuff should likely be tweaked, I just stole the huffduffer one.

#
[KevinMarks] Is there a param I can pass to say which URL to share to? I was thinking of a variant that spawns three windows, one each for Bluesky, Mastodon and X, as a POSSE shortcut

#
capjamesg[d] [tantek] The complexity of running a search engine and the worry that you may take someone's site down or break something were two big reasons I didn't have any motivation to keep running IndieWeb search.

#
[KevinMarks] It is a lot to take on, agreed

#
capjamesg[d] Another problem was getting the rankings right. Having a link graph helped _a lot_ with this, but there were still many searches where you would have one post then a dozen archive pages.

#
capjamesg[d] Given feeds are already an established pattern, it feels like the best place to start.

#
[KevinMarks] yeah, a lot of the technorati effort was distinguishing post links from other kinds, and also at that time post permalinks could be ids on an archive page too.

#
capjamesg[d] Another advantage to the feed approach is that, so long as people put contents / summaries in their feeds, the number of web requests you have to make is substantially reduced.

#
[KevinMarks] that was another fun bit of technorati coding, comparing the contents of feeds and post permalinks to work out which were summaries and which weren't

#
capjamesg[d] I will definitely not be exploring Mastodon search haha.

#
capjamesg[d] I use Mastodon more for distributing my blog posts and having nice conversations. If I wanted an archive, I'd put it on my blog 😄

#
capjamesg[d] #blogfirst 😛

#
capjamesg[d] To the point about taking down someone's site, one of my biggest worries was that the URL canonicalisation logic would miss an edge case and cause an exploding, technically valid, URL sequence.

#
capjamesg[d] I can't remember exactly what I did, but I have mentally noted things like exponential back-off, strong testing for canonicalisation, respecting 429s and higher incidences of 500s, respecting Retry-After, and crawling multiple sites at once rather than each one sequentially (which would focus all your crawl capacity on one site at a time).
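(A minimal sketch of the back-off part of that list — `next_delay` is a hypothetical helper, not code from the search engine: honour an explicit Retry-After if the server sent one, otherwise back off exponentially with jitter:)

```python
import random

# Hypothetical sketch: pick the next crawl delay after an error
# response. Honour an explicit Retry-After value if present,
# otherwise back off exponentially with jitter, up to a cap.
def next_delay(attempt, retry_after=None, base=1.0, cap=300.0):
    if retry_after is not None:
        return float(retry_after)
    delay = min(cap, base * (2 ** attempt))
    return delay * (0.5 + random.random() / 2)  # jitter: 50-100% of delay
```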

#
capjamesg[d] [tantek] That's another good point, for sure.

#
capjamesg[d] I also implemented a window that checked crawl speed, with the intent that if a site started to respond slower over time you could schedule part of a crawl for later.
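(A stdlib-only sketch of that response-time window — `CrawlSpeedWindow` and its thresholds are invented for illustration, not the original implementation: record a baseline from the first responses, then defer when the site gets markedly slower:)

```python
from collections import deque

# Hypothetical sketch of the response-time window described above:
# keep the last N response times and defer the rest of the crawl
# when the site is responding much slower than it was at the start.
class CrawlSpeedWindow:
    def __init__(self, size=20, slowdown_factor=3.0):
        self.times = deque(maxlen=size)
        self.baseline = None
        self.slowdown_factor = slowdown_factor

    def record(self, seconds):
        self.times.append(seconds)
        # Freeze the baseline once the window first fills up.
        if self.baseline is None and len(self.times) == self.times.maxlen:
            self.baseline = sum(self.times) / len(self.times)

    def should_defer(self):
        if self.baseline is None or not self.times:
            return False
        recent = sum(self.times) / len(self.times)
        return recent > self.baseline * self.slowdown_factor
```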

#
capjamesg[d] Yes, robots.txt.

#
capjamesg[d] I didn't have exponential back-off and robust canonicalisation in the original search engine.
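(For the canonicalisation half, a minimal hypothetical sketch using only the stdlib — real crawlers normalise much more than this, but it shows the idea of collapsing equivalent spellings of a URL so the queue can't explode:)

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical canonicalisation sketch: lowercase scheme and host,
# drop fragments and default ports, and trim a trailing slash, so
# the same page can't be queued under many different spellings.
def canonicalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ''
    port = parts.port
    if port and not ((scheme == 'http' and port == 80) or
                     (scheme == 'https' and port == 443)):
        host = f'{host}:{port}'
    path = parts.path or '/'
    if path != '/' and path.endswith('/'):
        path = path.rstrip('/')
    return urlunsplit((scheme, host, path, parts.query, ''))
```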

#
capjamesg[d] [tantek] looking back, every site had a crawl budget.

#
capjamesg[d] It was 30k URLs at peak.

#
capjamesg[d] But that's because I wanted to discover as many URLs as possible.

barnaby joined the channel
#
sebbu capjamesg[d], there's also https://www.sitemaps.org/ that should help with indexing
[Paul_Robert_Ll] and barnaby joined the channel
#
[KevinMarks] No, we were feeds and homepage by default, maybe individual permalinks

GuestZero, Yummers and [lcs] joined the channel
#
[KevinMarks] I think we used follow-your-nose for that rather than sitemaps - the patterns in blogging tools had next/prev and overview links.

GuestZero joined the channel
#
AramZS Surely fewer blogging tools have that now, and sitemaps have much higher pickup because of Google's emphasis on them?
jacky, zicklepop0, GuestZero, afrodidact and gRegorLove_ joined the channel
#
capjamesg[d] [KevinMarks] Let me know if there are any more data patterns I should learn 😄

#
[KevinMarks] You know the basic mapreduce pattern, right?

#
capjamesg[d] I came across it when doing IndieWeb search, but only at a cursory level.

#
[KevinMarks] simply put, you write a thing that processes files looking for structure, dumps out the structure, and creates a simple database from it. The magic is that you touch each input file only once, and there's some orchestration to let you read them in parallel

#
[KevinMarks] but in principle you can do it with a single-threaded thing that reads data files and outputs a simple database of what you want from them. Then you can add more input files or tweak what you're looking for, but still have a repeatable process
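(The single-threaded version described here can be sketched in a few lines — `map_reduce` is a hypothetical illustration, not any particular framework: map each record to (key, value) pairs, then reduce by key into a plain dict standing in for the "simple database":)

```python
from collections import defaultdict

# Hypothetical sketch of the single-threaded map-reduce pattern:
# map each record to (key, value) pairs, group by key, then reduce
# each group into a simple "database" (here, just a dict).
def map_reduce(records, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}
```

For example, word counting over two lines of text: `map_reduce(['a b', 'b c'], lambda line: [(w, 1) for w in line.split()], sum)` groups the `(word, 1)` pairs and sums each group. Adding more input records is the same repeatable process, which is the point being made above.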

#
[KevinMarks] as opposed to the classic database pattern of designing the database structure really carefully and policing hard what can be inserted into and deleted from it.

#
[KevinMarks] So what Simon calls Baked Data is a variation of this

#
[KevinMarks] Another data pattern post: https://www.honeycomb.io/blog/why-observability-requires-distributed-column-store

#
[KevinMarks] This is like the opposite pattern

[Jo] joined the channel
#
[KevinMarks] it's a bit like Tantek's bims but dispersed differently

#
capjamesg[d] Genius.

#
[KevinMarks] honeycomb is great at profiling live systems rather than a specific piece of code

barnaby and jacky joined the channel
#
[tantek] [snarfed]++ whoa BridgyFed in TechCrunch(!!!) https://techcrunch.com/2024/06/05/bluesky-and-mastodon-users-can-now-talk-to-each-other-with-bridgy-fed/

#
[tantek] BridgyFed << 2024-06-05 TechCrunch: [https://techcrunch.com/2024/06/05/bluesky-and-mastodon-users-can-now-talk-to-each-other-with-bridgy-fed/ Bluesky and Mastodon users can now talk to each other with Bridgy Fed]

#
Loqi ok, I added "2024-06-05 TechCrunch: [https://techcrunch.com/2024/06/05/bluesky-and-mastodon-users-can-now-talk-to-each-other-with-bridgy-fed/ Bluesky and Mastodon users can now talk to each other with Bridgy Fed]" to the "See Also" section of /Bridgy_Fed https://indieweb.org/wiki/index.php?diff=95713&oldid=95374

#
[KevinMarks] Sarah's always been thorough

barnabywalters joined the channel