#dev 2023-08-21

2023-08-21 UTC
Nuve, eitilt, btrem, oodani, IWSlackGateway, Hlo, tei_, dustinm` and [fluffy] joined the channel
#
[fluffy]
Does anyone have any thoughts on this windmill-tilt that this person keeps on suggesting to me? https://github.com/PlaidWeb/Publ/issues/544
[snarfed] joined the channel
#
[snarfed]
Oof, sounds like a boondoggle. See if they'll build it themselves as a PR and run it and see how the experience is first
#
[snarfed]
Oh that's Soni, they've been active here occasionally on and off, often advocating for this kind of decentralized search
#
[snarfed]
It's interesting, but very niche, lots of technical obstacles, not many proven implementations that are actually usable (if any)
#
Soni
sorry for going on a *slightly* unhinged rant about how web crawlers are evil and a waste of bandwidth, energy, CPU time, open connections, [etc] >.<
#
[snarfed]
Np! Fun to talk about interesting ideas!
#
[fluffy]
Ah, hi, didn’t realize you were here
#
[fluffy]
I’m just like. Not clear on why this would belong in Publ.
#
[snarfed]
Definitely seems like the right next step is to build a prototype yourself Soni, get experience running it, see how it goes
#
[fluffy]
I do think that the idea of an indieweb/social search thing where people can submit their processed content to the index directly has some merit. But Publ’s not really in a position to consume that.
#
Soni
well, publ does its own indexing mostly while avoiding crawlers. it has a lot more data about the files and their contents than a crawler has access to. basically it's cheaper for publ to figure out what changed than for an external system to do so.
#
[fluffy]
Yeah but it’s very specific to the site that it’s running on, and there’s a lot of assumptions made in the actual search process.
#
[fluffy]
Having a more generic search thing where it’s like, “here’s a library/tool/whatever to process your raw data into search data for this other search engine” would be much more feasible.
#
Soni
wouldn't that lose the efficiency benefits?
#
[fluffy]
Also privacy is definitely a concern. Just exporting raw search index stuff from, say, http://beesbuzz.biz to somewhere else means that someone with access to the raw search data would also have access to private content.
#
[fluffy]
I think you’re vastly overestimating the efficiency of how Publ’s search indexing works 🙂
#
Soni
does it not use inotify(?)?
#
[fluffy]
It uses watchdog (which uses inotify et al) to determine which files have changed, yeah, but the actual parsing of content into the whoosh tables is… not great.
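The change-detection side of that is simple to sketch: watchdog delivers inotify/FSEvents callbacks, but a dependency-free approximation just compares mtime snapshots. A minimal sketch (function names are illustrative, not Publ's actual code):

```python
import os

def snapshot(root):
    """Map each file under root to its last-modified time."""
    mtimes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtimes[path] = os.path.getmtime(path)
    return mtimes

def changed_files(before, after):
    """Paths that are new, or whose mtime moved, between two snapshots."""
    return [path for path, mtime in after.items()
            if before.get(path) != mtime]
```

watchdog's Observer replaces the polling loop with kernel notifications, so only the paths that actually changed need to be re-parsed into the search tables; the expensive part is the parsing, not the detection.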
#
Soni
oh
#
[fluffy]
It’s actually the slowest part of Publ’s site indexing right now.
#
Soni
oh :(
#
[fluffy]
Also it can’t do things like figure out links between items or whatever, it just builds the index on the raw Markdown.
#
Soni
yes, that is what we'd expect...
#
Soni
we don't really see the point of SEO for something like this tbh
#
[fluffy]
A webcrawler can do a lot more stuff and probably be faster overall.
#
Soni
but a webcrawler isn't very indieweb
#
[fluffy]
Why not? and how is it any less indieweb than, say, a feed reader that parses mf2?
#
Soni
like, in theory an index export could be hosted on static pages and consumed by other static pages
#
[fluffy]
(and one which uses archive links to consume an entire site)
#
Soni
but you can't exactly run a crawler with static pages
#
[fluffy]
sure, but something has to be able to process the data at some point
#
[fluffy]
would it be better to overwhelm a search engine with a constant flood of new data every time something changes, or to have that search engine reach out for new data as it has the capacity to?
#
[fluffy]
Also you might be surprised at how few resources it takes to actually run a webcrawler.
#
Soni
RFC 3229
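RFC 3229 ("Delta Encoding in HTTP") is the mechanism Soni is pointing at: a client that holds an old copy of a resource asks the server for just the difference. A hypothetical exchange for an exported search index (the path, ETags, and choice of vcdiff here are invented for illustration) would look like:

```http
GET /search-index.json HTTP/1.1
Host: foo.example
If-None-Match: "v41"
A-IM: vcdiff

HTTP/1.1 226 IM Used
ETag: "v42"
IM: vcdiff
Delta-Base: "v41"

(vcdiff delta transforming the "v41" index into "v42")
```

In principle that lets a consumer pull incremental index updates instead of re-fetching everything, though RFC 3229 has seen very little real-world deployment.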
#
[snarfed]
Yeah Soni this is a common indieweb FAQ. Indieweb is self-hosting friendly, but it's definitely not _about_ self-hosting, nor does it require it or discourage more centralized solutions
#
[snarfed]
that extends to technical architecture in general, including search
#
Soni
hmm
#
[snarfed]
We'd definitely be very interested in decentralized search here! But it's not necessarily "more" indieweb. Just interesting to try!
#
[fluffy]
if I were to build an indieweb search engine I’d probably start out by crawling sites, and on any page with h-entry, use an mf2 parser to split out the data and feed that into some sort of search index. which may or may not be built on whoosh, for that matter, since for all its problems, whoosh is pretty nice.
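That pipeline can be sketched end to end. A real implementation would use mf2py for the microformats parsing and whoosh (or similar) for the index; the stdlib-only sketch below just pulls the text out of h-entry elements and fills an inverted index (all names here are illustrative):

```python
from html.parser import HTMLParser

class HEntryExtractor(HTMLParser):
    """Collect the text content of every top-level h-entry element."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside an h-entry
        self.entries = []   # accumulated entry texts

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "h-entry" in classes and self.depth == 0:
            self.entries.append("")
        if self.depth or "h-entry" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and self.entries:
            self.entries[-1] += data

def index_page(url, html, index):
    """Add each h-entry's words to an inverted index: word -> set of URLs."""
    parser = HEntryExtractor()
    parser.feed(html)
    for entry in parser.entries:
        for word in entry.lower().split():
            index.setdefault(word, set()).add(url)
```

A crawler would call `index_page` for every fetched page; querying is then just a lookup in `index`. The hard parts fluffy alludes to (ranking, link relevance) live entirely outside this sketch.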
#
[snarfed]
Site local search though, definitely indieweb
#
[fluffy]
although it’s only good for full-text search and not things like link relevance, which is really important
#
Soni
a lot of our friends have given up on doing anything that might require interacting with a search engine, so like, search engines are dead...
#
[fluffy]
(and also where things get difficult and expensive)
#
[fluffy]
I switched to duckduckgo ages ago. It seems to be less-bad than Google these days.
#
Soni
honestly we switched to firefox keywords recently
#
Soni
or uh what are they called
#
Soni
yes, they are called search keywords
#
[fluffy]
that’s just UI on top of other search engines, though, right?
#
[fluffy]
and duckduckgo provides that too, like !wiki to search wikipedia and so on
#
Soni
because a lot of websites have local search: wikipedia, rustdoc, docs.rs, a bunch of other stuff
#
[fluffy]
sure, but I don’t see my site ever appearing in a firefox keywords thing
#
Soni
we've definitely searched programming blogs before
#
Soni
but not often enough for us to add a whole keyword for it
#
[fluffy]
exactly
#
Soni
but that doesn't mean we don't want those results
#
[fluffy]
It sounds to me like you’re trying to have something both ways though?
#
Soni
keyword search is a workaround for the death of global search engines
#
[fluffy]
Like I totally agree that public search right now is completely broken and there needs to be a better way to do something. But I don’t think Publ is in the position to do that.
#
Soni
but global and local should not be the only options
#
Soni
(we should have some sort of nonlocal search)
#
[fluffy]
This seems like a good project to write up with high-level goals and then do some design work to see how those goals would be met.
#
Soni
okay
#
[snarfed]
fluffy++ agreed, and prototype
#
Loqi
fluffy has 5 karma in this channel over the last year (17 in all channels)
#
[snarfed]
Soni design it and build it! We'd be excited to see!
#
Soni
sorry
#
[fluffy]
It was just their HTML validator I was fixing.
#
Soni
[snarfed]: sadly web search (or anything search) is way outside our realm of experience(? we forget the right phrase)... (grep doesn't count. we mean it kinda does but... not really tbh. it's not an appropriate substitute for real search.)
#
[fluffy]
It didn’t like that I was using selfclosed tags in an HTML5 context.
[aciccarello] joined the channel
#
[aciccarello]
fluffy, congrats on the release. I am curious what the W3C OGP validator you were using is.
#
[fluffy]
(one of those little turds that a lot of people decided to inherit from XHTML for some reason)
#
[snarfed]
Soni you could learn! Eg talk to capjamesg, he's learned and then built a ton of stuff here that he didn't already know, including deep web search (see link above)
#
[fluffy]
Unfortunately his search engine doesn’t seem to be working right now, either.
#
[fluffy]
https://github.com/PlaidWeb/Publ/commit/c9d98fde5ac37c415632d8edac45cd18d7c989e6 is the opengraph change. Literally just changing `<meta ... />` to `<meta ...>` 🙂
#
Soni
XHTML is fun :v (except how basically nothing supports XHTML output...)
#
[fluffy]
The whole “make HTML tags selfclose like XHTML” thing is like. a thing.
#
[fluffy]
I don’t have the spoons to get into it 🙂
#
[fluffy]
basically one of those telephone-game things where a couple of optimistic misinterpretations led to some really awful things, and it’s the HTML equivalent of how streaming providers add pillarboxes to 4:3 content instead of adjusting it in the player.
#
Soni
there is such thing as an XHTML polyglot
#
Soni
but yeah
#
[fluffy]
anyway nobody gets XHTML right either, so
[jacky], gerben and [schmarty] joined the channel
#
Soni
our best lead for search is a dead link btw
#
Soni
did we mention this before? https://oddmuse.org/wiki/Indexed_Search
#
Soni
but we guess it has the same "rebuilds everything" problem as publ
#
capjamesg
What's happening re: search?
#
capjamesg
> if I were to build an indieweb search engine I’d probably start out by crawling sites, and on any page with h-entry, use an mf2 parser to split out the data and feed that into some sort of search index. which may or may not be built on whoosh, for that matter, since for all its problems, whoosh is pretty nice.
#
capjamesg
+1 [fluffy]
#
capjamesg
I used IndieWeb Search as a playground for learning about the web; an enjoyable project indeed, but I made things unnecessarily complicated.
#
capjamesg
I wrote my own crawler 😅
#
capjamesg
The index was around 500k pages at peak.
#
capjamesg
I had _so much fun_ building that project!
#
capjamesg
[snarfed]++ for all the assistance when I worked on IndieWeb Search.
#
capjamesg
[snarfed]++
#
Loqi
[snarfed] has 102 karma in this channel over the last year (152 in all channels)
#
[snarfed]
capjamesg https://github.com/PlaidWeb/Publ/issues/544 , and decentralized search in general, Soni might be interested in learning search stuff
#
Loqi
[preview] [SoniEx2] #544 external search index
#
capjamesg
I see! My implementation wasn't decentralized, but that idea is interesting to me!
#
capjamesg
I thought about a system where people could submit their own records.
#
capjamesg
Then IndieWeb Search (hereafter IWS) could validate N% of records as a trust exercise.
#
capjamesg
I didn't build this, however.
#
vonexplaino
I like the reverse idea. Being able to self curate search endpoints of trusted resources. Then my search checks those endpoints
#
capjamesg
Interesting!
#
capjamesg
You could have a system where parties index their own sites, then submit their indexes to a standard endpoint on each other's sites.
#
capjamesg
Then you could use any one party's site to search the whole network.
#
capjamesg
This requires trusting the party on whose site you are performing the search.
#
capjamesg
(just as one trusts Google)
#
vonexplaino
Yup. Indie web search had that .well-known point for publicising endpoints
#
capjamesg
You could have nodes index N% of the network, so there is some overlap, pull from each other, do some trust exercise (compare samples of each site between different nodes' indexes), then show results whose integrity has been preserved.
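That spot-checking step is easy to sketch. Assuming each node keeps a map of URL to content hash (a made-up record format, purely for illustration), one node could audit a peer like this:

```python
import random

def validate_overlap(local_index, peer_index, sample_size=5, rng=None):
    """Spot-check a peer: sample URLs both nodes have indexed and
    flag any whose records disagree. Returns (trusted, mismatched)."""
    rng = rng or random.Random()
    shared = sorted(set(local_index) & set(peer_index))
    sample = rng.sample(shared, min(sample_size, len(shared)))
    mismatched = [u for u in sample if local_index[u] != peer_index[u]]
    return (not mismatched, mismatched)
```

A node that fails the check gets its results excluded; the compute cost capjamesg mentions is exactly this re-fetching and re-hashing of sampled pages, which this sketch leaves out.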
#
capjamesg
The downside there is compute and the not-so-scalable approach baked into that system.
#
capjamesg
Just brainstorming! I haven't thought this much about this particular problem.
#
vonexplaino
Indeed. I was going with a personal approach to self-curate, as scaling sounded problematic
#
vonexplaino
I trust x, y, z so I include their search endpoints in my aggregator
#
capjamesg
You could definitely have a trusted list.
#
aaronpk
This is the most inscrutable thing I have ever read https://twittercommunity.com/t/x-api-v2-migration/203391/1
gRegor joined the channel
#
[snarfed]
do you have to care?
#
Soni
see, we want to rely on trust as a safeguard against enshittification... tho we must also acknowledge how trust-based systems inherently tend to surface racism (because racism is a pervasive problem)
#
Soni
and publishing indexes for other ppl feels icky
#
Soni
indexes should be origin-locked tbh
#
Soni
(as in, when retrieving the index for foo.example, only the entries that refer to foo.example are included)
#
Soni
(when running search queries you use all the indexes you know of tho)
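Soni's origin-locking rule is simple to state in code. Assuming an index shaped word -> set of URLs (an illustrative format, not any existing software's), publishing and querying might look like:

```python
from urllib.parse import urlparse

def origin_locked(index, origin):
    """Filter an index (word -> set of URLs) down to entries whose
    URL lives on the given origin, before publishing it."""
    out = {}
    for word, urls in index.items():
        kept = {u for u in urls if urlparse(u).hostname == origin}
        if kept:
            out[word] = kept
    return out

def query(indexes, word):
    """Search every origin-locked index we know of and merge results."""
    results = set()
    for idx in indexes:
        results |= idx.get(word, set())
    return results
```

So foo.example only ever vouches for its own pages, while a searcher still gets network-wide results by unioning across all the indexes it has fetched.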
bterry joined the channel