#dev 2024-06-05
2024-06-05 UTC
to2ds, geoffo, [snarfed], btrem, sivoais, gRegorLove_, Tiffany, gRegor, [Murray] and GuestZero joined the channel
# capjamesg[d] [snarfed] Does granary have a function to identify the type of an arbitrary feed?
# capjamesg[d] (i.e. example.com/a.xml is Atom, etc.)
# capjamesg[d] I wondered if you had a single function for it.
# capjamesg[d] The content type header is fine.
# capjamesg[d] ++
# capjamesg[d] I have a project where I want to serialize feeds in different formats into JSON feed.
# capjamesg[d] I wondered if there was a granary.to_jsonfeed(text) function that would take my text and return it as JSON feed, without my having to specify feed type.
# capjamesg[d] Do we have a list of standard content types somewhere on the wiki?
# [snarfed] there's https://oauth-dropins.readthedocs.io/en/latest/source/oauth_dropins.webutil.html#oauth_dropins.webutil.util.sniff_json_or_form_encoded , but only for JSON vs form-encoded
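(A rough sketch of the detection step being discussed here: sniff the Content-Type header and fall back to peeking at the body. granary doesn't expose a single detect-anything function, so the heuristics and type lists below are purely illustrative.)
```python
# Guess a feed's format from its Content-Type header, with a crude fallback
# to sniffing the start of the body. Illustrative only; not exhaustive.
import requests

def sniff_feed_type(url):
    resp = requests.get(url, timeout=10)
    ctype = resp.headers.get("Content-Type", "").split(";")[0].strip().lower()
    head = resp.text.lstrip()[:500]

    if ctype == "application/atom+xml" or "<feed" in head:
        return "atom"
    if ctype in ("application/rss+xml", "application/xml", "text/xml") or "<rss" in head:
        return "rss"
    if "jsonfeed.org/version" in head:
        return "jsonfeed"
    if ctype == "application/json" or head.startswith("{"):
        return "mf2-json"   # could also be JSON Feed without the version hint above
    if ctype == "text/html":
        return "html"       # parse with mf2py, then convert the mf2 JSON
    return "unknown"
```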
# capjamesg[d] This is all so exciting by the way [snarfed]!
# capjamesg[d] I have written feed polling code before and it's no fun.
# capjamesg[d] That I don't have to write the conversion code 😄
# capjamesg[d] What is the one for mf2 JSON and mf2 text?
# Loqi It looks like we don't have a page for "one for mf2 JSON and mf2 text" yet. Would you like to create it? (Or just say "one for mf2 JSON and mf2 text is ____", a sentence describing the term)
# capjamesg[d] I assume mf2 text is application/html.
# capjamesg[d] Since you're retrieving a web page.
# capjamesg[d] But what about a JSON file that is formatted as mf2 json? I feel like I have heard that is possible before?
# capjamesg[d] That's it!
# capjamesg[d] And for that I'd use `microformats2.json_to_activities`, it looks like.
# capjamesg[d] Blog post incoming 😄
# [snarfed] https://feedparser.readthedocs.io/ (venerable, legendary) is the other Python lib worth looking at
# [KevinMarks] I've made bookmarklets for ShareOpenly:
# [KevinMarks] ```javascript:var w=window.open('https://shareopenly.org/share/?url='+encodeURIComponent(location.href)+'&text='+encodeURIComponent(document.getSelection().toString()||document.title),'shareopenly','scrollbars=1,status=0,resizable=1,location=0,toolbar=0,width=360,height=480');```
# [KevinMarks] or in a new window:
# capjamesg[d] [snarfed] This is what I was working on: https://gist.github.com/capjamesg/d8ab9f3e00e8d2709217e481f94ee8fb
# capjamesg[d] I wanted a universal feed -> JSON feed convertor.
# capjamesg[d] JSON Feed is so easy to work with.
# capjamesg[d] It's my preferred choice for parsing feed data.
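(A sketch of the conversion half of such a convertor, going through granary's ActivityStreams objects and out to JSON Feed. `microformats2.json_to_activities` is named above; the other functions used here, `atom.atom_to_activities`, `rss.to_activities` and `jsonfeed.activities_to_jsonfeed`, are my reading of granary's docs and should be checked before relying on them.)
```python
# Route a fetched feed to the matching granary converter and emit JSON Feed.
# Function names other than microformats2.json_to_activities are assumptions
# drawn from granary's documentation, not a verified single entry point.
import mf2py
from granary import atom, jsonfeed, microformats2, rss

def to_jsonfeed(feed_type, text, url=None):
    if feed_type == "atom":
        activities = atom.atom_to_activities(text)
    elif feed_type == "rss":
        activities = rss.to_activities(text)
    elif feed_type == "html":
        mf2 = mf2py.parse(doc=text, url=url)
        activities = microformats2.json_to_activities(mf2)
    elif feed_type == "jsonfeed":
        return text  # already JSON Feed
    else:
        raise ValueError(f"unsupported feed type: {feed_type}")
    return jsonfeed.activities_to_jsonfeed(activities)
```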
barnabywalters joined the channel
# [KevinMarks] hm, FeedParser is kind of that, but it's not JSONfeed, it's its own format
# [KevinMarks] also we'd need to add mf2 and json feed support to it
# [KevinMarks] If you're polling, you will want to implement this stuff https://pythonhosted.org/feedparser/http-etag.html
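(A minimal sketch of the conditional-GET pattern that page describes, using feedparser's documented etag/modified support; the feed URL is a placeholder.)
```python
# Conditional GET with feedparser: store the etag/modified values from the
# first fetch and pass them back on the next poll; an unchanged feed comes
# back as HTTP 304 with no entries, saving bandwidth on both ends.
import feedparser

url = "https://example.com/feed.xml"   # placeholder URL

first = feedparser.parse(url)
etag = getattr(first, "etag", None)
modified = getattr(first, "modified", None)

# ...later, on the next poll:
later = feedparser.parse(url, etag=etag, modified=modified)
if getattr(later, "status", None) == 304:
    print("Feed unchanged, nothing to do")
else:
    print(f"{len(later.entries)} entries to process")
```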
# capjamesg[d] ++
# capjamesg[d] Yeah.
GuestZero and jacky joined the channel
# jacky https://transistor.fm/changelog/podroll/ looks like a linklog for podcasts? doesn't huffduffer do something like this?
[benatwork] joined the channel
# [benatwork] [KevinMarks] NICE!!!
# [KevinMarks] the window size stuff should likely be tweaked, I just stole the huffduffer one.
# [KevinMarks] Is there a param I can pass to say which URL to share to? I was thinking of a variant that spawns three of them, for bsky, mastodon and x, as a POSSE shortcut
# capjamesg[d] [tantek] The complexity of running a search engine and the worry that you may take someone's site down or break something were two big reasons I didn't have any motivation to keep running IndieWeb search.
# [KevinMarks] It is a lot to take on, agreed
# capjamesg[d] Another problem was getting the rankings right. Having a link graph helped _a lot_ with this, but there were still many searches where you would have one post then a dozen archive pages.
# capjamesg[d] Given feeds are already an established pattern, it feels like the best place to start.
# [KevinMarks] yeah, a lot of the technorati effort was distinguishing post links from other kinds, and also at that time post permalinks could be IDs on an archive page too.
# capjamesg[d] Another advantage to the feed approach is that, so long as people put contents / summaries in their feeds, the number of web requests you have to make is substantially reduced.
# [KevinMarks] that was another fun bit of technorati coding, comparing the contents of feeds and post permalinks to work out which were summaries and which weren't
# capjamesg[d] I will definitely not be exploring Mastodon search haha.
# capjamesg[d] I use Mastodon more for distributing my blog posts and having nice conversations. If I wanted an archive, I'd put it on my blog 😄
# capjamesg[d] #blogfirst 😛
# capjamesg[d] To the point about taking down someone's site, one of my biggest worries was that the URL canonicalisation logic would miss an edge case and cause an exploding, technically valid, URL sequence.
# capjamesg[d] I can't remember exactly what I did, but I have mentally noted things like exponential back-off, strong testing for canonicalisation, respecting 429s and a rising rate of 500s, respecting Retry-After, and crawling multiple sites at once rather than each one sequentially (which would otherwise concentrate all your crawl capacity on one site at a time).
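(A sketch of what the back-off and Retry-After handling can look like; the status codes, limits, and numeric-only Retry-After parsing are simplifications.)
```python
# Polite retrying: exponential back-off with a cap, honouring a numeric
# Retry-After header on 429/5xx responses. Names and limits are illustrative.
import time
import requests

def polite_get(url, max_attempts=5, base_delay=1.0, max_delay=300.0):
    delay = base_delay
    for attempt in range(max_attempts):
        resp = requests.get(url, timeout=30)
        if resp.status_code not in (429, 500, 502, 503, 504):
            return resp
        retry_after = resp.headers.get("Retry-After")
        # Retry-After can also be an HTTP date; only the seconds form is handled here.
        wait = float(retry_after) if retry_after and retry_after.isdigit() else delay
        time.sleep(min(wait, max_delay))
        delay = min(delay * 2, max_delay)   # exponential back-off
    return resp
```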
# capjamesg[d] [tantek] That's another good point, for sure.
# capjamesg[d] I also implemented a window that checked crawl speed, with the intent that if a site started to respond more slowly over time you could schedule part of the crawl for later.
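(One way to implement that kind of window: a rolling buffer of recent response times per site, with an arbitrary threshold for deferring the rest of the crawl. The window size and threshold here are made up.)
```python
# Rolling window of recent response times; defer the rest of a crawl when
# the average climbs past a threshold. Parameters are illustrative.
from collections import deque

class ResponseWindow:
    def __init__(self, size=20, slow_threshold=2.0):
        self.times = deque(maxlen=size)
        self.slow_threshold = slow_threshold  # seconds

    def record(self, elapsed):
        self.times.append(elapsed)

    def should_defer(self):
        if len(self.times) < self.times.maxlen:
            return False  # not enough samples yet
        return sum(self.times) / len(self.times) > self.slow_threshold
```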
# capjamesg[d] Yes, robots.txt.
# capjamesg[d] I didn't have exponential back-off and robust canonicalisation in the original search engine.
# capjamesg[d] [tantek] looking back, every site had a crawl budget.
# capjamesg[d] It was 30k URLs at peak.
# capjamesg[d] But that's because I wanted to discover as many URLs as possible.
barnaby joined the channel
# sebbu capjamesg[d], there's also https://www.sitemaps.org/ that should help with indexing
[Paul_Robert_Ll] and barnaby joined the channel
# [KevinMarks] No, we crawled feeds and homepages by default, maybe individual permalinks
GuestZero, Yummers and [lcs] joined the channel
# [KevinMarks] I think we used follow-your-nose for that rather than sitemaps - the patterns in blogging tools had next/prev and overview links.
GuestZero joined the channel
# AramZS Surely fewer blogging tools have that now, and sitemaps have much higher pickup because of Google's emphasis on them?
jacky, zicklepop0, GuestZero, afrodidact and gRegorLove_ joined the channel
# capjamesg[d] [KevinMarks] Let me know if there are any more data patterns I should learn 😄
# [KevinMarks] You know the basic mapreduce pattern, right?
# capjamesg[d] I came across it when doing IndieWeb search, but only at a cursory level.
# [KevinMarks] simply put, you write a thing that processes files looking for structure, dumps out that structure, and creates a simple database from it. The magic is that you touch each input file once, and there's some orchestration stuff to let you read them in parallel
# [KevinMarks] but in principle you can do it with a single-threaded thing that reads data files and outputs a simple database of what you want from them. Then you can add more input files, tweak what you're looking for, but still have a repeatable process
# [KevinMarks] as opposed to the classic database pattern of designing the database structure really carefully and policing hard what can be inserted and deleted from it.
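(A single-threaded sketch of that pattern: scan each input file once for the structure you want and bake it into a small SQLite database you can rebuild from scratch at any time. The file layout and field names are made up for illustration.)
```python
# "Map" over the input files once, extract the bits you care about, then
# "reduce" them into one simple, rebuildable SQLite database.
import glob
import json
import sqlite3

def build_index(pattern="data/*.json", db_path="index.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS posts (url TEXT PRIMARY KEY, title TEXT)")
    for path in glob.glob(pattern):           # touch each input file once
        with open(path) as f:
            record = json.load(f)
        url = record.get("url")
        title = record.get("name", "")
        if url:
            conn.execute("INSERT OR REPLACE INTO posts VALUES (?, ?)", (url, title))
    conn.commit()                              # one queryable artifact at the end
    conn.close()
```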
# [KevinMarks] So what Simon calls Baked Data is a variation of this
# [KevinMarks] Another data pattern post: https://www.honeycomb.io/blog/why-observability-requires-distributed-column-store
# [KevinMarks] This is like the opposite pattern
[Jo] joined the channel
# [KevinMarks] it's a bit like Tantek's bims but dispersed differently
# capjamesg[d] Genius.
# [KevinMarks] honeycomb is great at profiling live systems rather than a specific piece of code
barnaby and jacky joined the channel
# [tantek] [snarfed]++ whoa BridgyFed in TechCrunch(!!!) https://techcrunch.com/2024/06/05/bluesky-and-mastodon-users-can-now-talk-to-each-other-with-bridgy-fed/
# [tantek] BridgyFed << 2024-06-05 TechCrunch: [https://techcrunch.com/2024/06/05/bluesky-and-mastodon-users-can-now-talk-to-each-other-with-bridgy-fed/ Bluesky and Mastodon users can now talk to each other with Bridgy Fed]
# Loqi ok, I added "2024-06-05 TechCrunch: [https://techcrunch.com/2024/06/05/bluesky-and-mastodon-users-can-now-talk-to-each-other-with-bridgy-fed/ Bluesky and Mastodon users can now talk to each other with Bridgy Fed]" to the "See Also" section of /Bridgy_Fed https://indieweb.org/wiki/index.php?diff=95713&oldid=95374
# [KevinMarks] Sarah's always been thorough
barnabywalters joined the channel