#dev 2023-08-21
2023-08-21 UTC
Nuve, eitilt, btrem, oodani, IWSlackGateway, Hlo, tei_, dustinm` and [fluffy] joined the channel
# [fluffy] Does anyone have any thoughts on this windmill-tilt that this person keeps on suggesting to me? https://github.com/PlaidWeb/Publ/issues/544
[snarfed] joined the channel
# Soni sorry for going on a *slightly* unhinged rant about how web crawlers are evil and a waste of bandwidth, energy, CPU time, open connections, [etc] >.<
# Soni well, publ does its own indexing mostly while avoiding crawlers. it has a lot more data about the files and their contents than a crawler has access to. basically it's cheaper for publ to figure out what changed than for an external system to do so.
# Soni wouldn't that lose the efficiency benefits?
# [fluffy] Also privacy is definitely a concern. Just exporting raw search index stuff from, say, http://beesbuzz.biz to somewhere else means that someone with access to the raw search data would also have access to private content.
# Soni does it not use inotify(?)?
# Soni oh
# Soni oh :(
# Soni yes, that is what we'd expect...
# Soni we don't really see the point of SEO for something like this tbh
# Soni but a webcrawler isn't very indieweb
# Soni like, in theory an index export could be hosted on static pages and consumed by other static pages
# Soni but you can't exactly run a crawler with static pages
# Soni RFC 3229
# [snarfed] https://indieweb.org/FAQ#Do_I_need_to_self-host_my_website (thanks aaronpk)
# Soni hmm
# [fluffy] if I were to build an indieweb search engine I’d probably start out by crawling sites, and on any page with h-entry, use an mf2 parser to split out the data and feed that into some sort of search index. which may or may not be built on whoosh, for that matter, since for all its problems, whoosh is pretty nice.
# Soni a lot of our friends have given up on doing anything that might require interacting with a search engine, so like, search engines are dead...
# Soni honestly we switched to firefox keywords recently
# Soni or uh what are they called
# Soni yes, they are called search keywords
# Soni because a lot of websites have local search: wikipedia, rustdoc, docs.rs, a bunch of other stuff
# Soni we've definitely searched programming blogs before
# Soni but not often enough for us to add a whole keyword for it
# Soni but that doesn't mean we don't want those results
# Soni keyword search is a workaround for the death of global search engines
# Soni but global and local should not be the only options
# Soni (we should have some sort of nonlocal search)
# Soni okay
# Soni sorry
# Soni [snarfed]: sadly web search (or anything search) is way outside our realm of experience(? we forget the right phrase)... (grep doesn't count. we mean it kinda does but... not really tbh. it's not an appropriate substitute for real search.)
[aciccarello] joined the channel
# [aciccarello] fluffy, congrats on the release. I am curious what the W3C OGP validator you were using is.
# [aciccarello] Ah I see
# [fluffy] https://github.com/PlaidWeb/Publ/commit/c9d98fde5ac37c415632d8edac45cd18d7c989e6 is the opengraph change. Literally just changing `<meta ... />` to `<meta ...>` 🙂
# Soni XHTML is fun :v (except how basically nothing supports XHTML output...)
# Soni there is such a thing as an XHTML polyglot
# Soni but yeah
[jacky], gerben and [schmarty] joined the channel
# Soni our best lead for search is a dead link btw
# Soni did we mention this before? https://oddmuse.org/wiki/Indexed_Search
# Soni but we guess it has the same "rebuilds everything" problem as publ
# capjamesg What's happening re: search?
# capjamesg > if I were to build an indieweb search engine I’d probably start out by crawling sites, and on any page with h-entry, use an mf2 parser to split out the data and feed that into some sort of search index. which may or may not be built on whoosh, for that matter, since for all its problems, whoosh is pretty nice.
# capjamesg +1 [fluffy]
# capjamesg I used IndieWeb Search as a playground for learning about the web; an enjoyable project indeed, but I made things unnecessarily complicated.
# capjamesg I wrote my own crawler 😅
# capjamesg The index was around 500k pages at peak.
# capjamesg I had _so much fun_ building that project!
# capjamesg [snarfed]++ for all the assistance when I worked on IndieWeb Search.
# capjamesg [snarfed]++
# [snarfed] capjamesg https://github.com/PlaidWeb/Publ/issues/544 , and decentralized search in general, Soni might be interested in learning search stuff
# capjamesg I see! My implementation wasn't decentralized, but that idea is interesting to me!
# capjamesg I thought about a system where people could submit their own records.
# capjamesg Then IndieWeb Search (herein IWS) could validate N% of records as a trust exercise.
# capjamesg I didn't build this, however.
# vonexplaino I like the reverse idea. Being able to self curate search endpoints of trusted resources. Then my search checks those endpoints
# capjamesg Interesting!
# capjamesg You could have a system where parties index their own sites then submit indexes to a standard endpoint to each other.
# capjamesg Then you could use any one party's site to search the whole network.
# capjamesg This requires trusting the party on whose site you are performing the search.
# capjamesg (just as one trusts Google)
# vonexplaino Yup. Indie web search had that .well-known point for publicising endpoints
# capjamesg You could have nodes index N% of the network, so there is some overlap, pull from each other, do some trust exercise (compare samples of each site between different nodes' indexes), then show results whose integrity has been preserved.
# capjamesg The downside there is compute and the not-so-scalable approach baked into that system.
# capjamesg Just brainstorming! I haven't thought this much about this particular problem.
# vonexplaino Indeed. I was going for a personal approach of self-curating, as scaling sounded problematic
# vonexplaino I trust x, y, z so I include their search endpoints in my aggregator
# capjamesg You could definitely have a trusted list.
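(Editor's sketch) The trust exercise capjamesg describes above, comparing samples of each site between different nodes' indexes, might look roughly like this. Nobody in the channel built this; the index shape (`{site: {url: content_hash}}`) and the function name `sample_agreement` are assumptions made up for illustration.

```python
import random

def sample_agreement(index_a, index_b, site, sample_size=5, seed=0):
    """Compare a random sample of one site's records between two nodes' indexes.

    Both indexes are assumed (hypothetically) to look like
    {site: {url: content_hash}}. Returns the fraction of sampled URLs whose
    hashes agree, or None if the two nodes share no URLs for that site.
    """
    records_a = index_a.get(site, {})
    records_b = index_b.get(site, {})
    # Only URLs indexed by both nodes can be compared.
    shared = sorted(url for url in records_a if url in records_b)
    if not shared:
        return None
    rng = random.Random(seed)
    sample = rng.sample(shared, min(sample_size, len(shared)))
    matches = sum(records_a[url] == records_b[url] for url in sample)
    return matches / len(sample)
```

A node could then hide or down-rank results from any peer whose agreement score for a site falls below some threshold. Choosing that threshold, and the sample size, is exactly the compute-versus-confidence trade-off mentioned above.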
# aaronpk This is the most inscrutable thing I have ever read https://twittercommunity.com/t/x-api-v2-migration/203391/1
gRegor joined the channel
# Soni see, we want to rely on trust as a safeguard against enshittification... tho we must also acknowledge how trust-based systems inherently tend to surface racism (because racism is a pervasive problem)
# Soni and publishing indexes for other ppl feels icky
# Soni indexes should be origin-locked tbh
# Soni (as in, when retrieving the index for foo.example, only the entries that refer to foo.example are included)
# Soni (when running search queries you use all the indexes you know of tho)
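(Editor's sketch) Soni's "origin-locked" idea above could be sketched as follows: when an index is retrieved from foo.example, any entry that does not refer to foo.example is dropped, and a query then runs over every index the searcher knows of. The entry shape (`{"url": ..., "text": ...}`) and the function names are hypothetical, and matching on hostname alone is a simplification.

```python
from urllib.parse import urlsplit

def origin_locked(origin_host, entries):
    """Keep only entries whose URL host matches the host the index came from."""
    return [e for e in entries if urlsplit(e["url"]).hostname == origin_host]

def search(indexes, term):
    """Query every known index, origin-locking each one first.

    `indexes` maps the host an index was fetched from to its list of entries.
    Matching is a plain case-insensitive substring test, standing in for a
    real search backend.
    """
    term = term.lower()
    results = []
    for host, entries in indexes.items():
        for entry in origin_locked(host, entries):
            if term in entry["text"].lower():
                results.append(entry["url"])
    return results
```

With this filter, a site cannot publish index entries on behalf of anyone else: an entry for bar.example found inside foo.example's index is simply ignored.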
bterry joined the channel