#dev 2021-08-30

2021-08-30 UTC
Seirdy, nolith, voxpelli, capjamesg[d], bneil, wrmilling, vilhalmer, strugee, GWG and [tantek] joined the channel
#
capjamesg[d]
I thought I'd get a bit more serious about noting down domain names to crawl. If you want to be included in my search crawl of some IndieWeb sites while I experiment with building a multi-site search engine, you can add your name here: https://gist.github.com/capjamesg/5323cf383f9669c6fa7247c738490287
#
capjamesg[d]
It might take a while to crawl your site though πŸ™‚
hendursa1 joined the channel
tetov-irc joined the channel
#
jamietanna
thanks [snarfed] :)
sebbu and nertzy__ joined the channel
#
capjamesg[d]
Roughly how many pages are on the wiki?
#
Murray[d]
So just over 12,500 it seems
#
Murray[d]
(I assume content pages are counted as well within generic pages, otherwise add 4,500 to that :D)
#
capjamesg[d]
So I will crawl 1/12th of the wiki haha.
#
capjamesg[d]
I might need to make an exception in my crawler.
Ramon[d], chenghiz_, pufferfish, hendursaga, Zegnat[d] and Ruxton joined the channel
#
[fluffy]
I have many thousands of pages on my site as well so I’m expecting your crawler to just scratch the surface. My site map is ordered newest first at least.
#
[fluffy]
Well, it starts with category pages, then newest first on entry pages. I should probably remove the category pages from it though.
nertzy__ joined the channel
#
capjamesg[d]
The crawl went through this morning [fluffy].
#
capjamesg[d]
I have indexed 25753 pages so far.
#
capjamesg[d]
[edit] The project will live here: https://indieweb-search.jamesg.blog/
#
capjamesg[d]
I will change the name, etc. when I get the chance.
#
[tantek]
crawling and searching have always seemed like a natural fit for distributed computing, so I have to admit I'm always perplexed when the default is to build one thing that crawls everything (instead of coming up with a protocol to do distributed (perhaps self-?) crawling and maybe even results retrieval, leaving aggregation/ranking up to specific UIs)
#
capjamesg[d]
I thought about this very thing.
#
capjamesg[d]
Because if people wanted to crawl their site and send the data to me to be included in the main search index that would be great.
#
capjamesg[d]
I need to look into how this would work. Still really just an idea.
#
[tantek]
like ideally everyone's server indexes all their own content, then supports some sort of "distributed index aggregation" protocol with others that do the same to form pools of search results
#
capjamesg[d]
The advantage to the centralised approach is consistency in terms of how results are indexed.
#
[tantek]
this feels like the thing that would already be solved by some PhD person in distributed computing
#
capjamesg[d]
Which is helpful at scale because not everyone may be honest about the integrity of the content they index.
#
capjamesg[d]
So safeguards are needed.
#
[tantek]
you can solve consistency by providing a canonical algorithm from text content -> indexing summary
#
[tantek]
like there are canonical algorithms for hashing content -> hashes
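A rough illustration of that analogy: a deterministic text → index-summary function that any site could run over its own content and get identical output, the way everyone computing a SHA-256 of the same bytes gets the same hash. The function name, the summary shape, and the tokenization rule are all my own invention, not anything agreed in this chat:

```python
import hashlib
import re
from collections import Counter

def index_summary(text: str, top_n: int = 20) -> dict:
    """Deterministically map page text to an index summary.

    Any site running this exact function over the same content
    produces the same summary, so peers can verify each other's
    self-reported indexes the way they would verify a hash.
    """
    # Normalize: lowercase, keep alphanumeric runs only
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tokens)
    top_terms = dict(counts.most_common(top_n))
    # A digest over the sorted terms lets peers spot-check summaries cheaply
    digest = hashlib.sha256(
        repr(sorted(top_terms.items())).encode()
    ).hexdigest()
    return {"terms": top_terms, "digest": digest}
```

The point of the digest is the spot-verification tantek mentions later: a peer can re-crawl one random page and confirm the advertised summary matches.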
#
capjamesg[d]
It would just need a lot more thought than I have put in πŸ™‚
#
capjamesg[d]
(for distributed working)
#
capjamesg[d]
And relying on people to crawl their site isn't a problem.
#
[tantek]
point being, *someone* must have thought about this already because it seems like a big "plant a flag" "grab the prize" kind of CS problem
#
capjamesg[d]
Because it would be everyone's choice to be included, and when, etc.
#
capjamesg[d]
"According to the FAQ about Nutch, an open-source search engine website, the savings in bandwidth by distributed web crawling are not significant, since "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages...". "
#
capjamesg[d]
Just looking at the wikipedia page for this πŸ˜„
#
[snarfed]
also search crawling and indexing are massively antagonistic arms races. the major search engines are very unlikely to trust crawl data from anyone else
#
[tantek]
yes I realize "trust" is a challenge, however seemingly dealt with with some amount of distributed verification also
#
[snarfed]
that's...a big hand wave
#
capjamesg[d]
I'm not big into decentralized networking 😄
#
[tantek]
the big search engines have already lost that "crawl data from anyone else"
#
capjamesg[d]
So I would need to read in on what existing trust architectures look like.
#
[tantek]
Google, etc. results are already so polluted to be nearly useless
#
[tantek]
except for specialty multi-word phrases
#
[tantek]
anything generic is lost (c.f. recipes)
#
[snarfed]
maybe so, but that's not really an argument to take yet another massive trust loss
#
[tantek]
point being, the myth that "centralization can solve this" in terms of arms-race has been pretty much disproven IMO
#
capjamesg[d]
I haven't open sourced my search engine because the code is a mess and I also broke the repository a bit today (I will need help on that haha).
#
capjamesg[d]
But I think transparency into the "algorithm" is crucial.
#
[snarfed]
having spent time with the search org at Google, crawl is absolutely considered core competence that they would never give up ownership of to arbitrary third parties
#
[tantek]
so you might as well have smaller sets of folks who share some amount of trust aggregating search results amongst themselves
#
[snarfed]
the costs (bandwidth, eng, etc) are not big concerns
#
[tantek]
don't care about Google changing their approach
#
capjamesg[d]
Oh yeah snarfed. Google are big enough to do it in house.
#
[tantek]
convincing Google of anything is a non-goal
#
capjamesg[d]
But that doesn't mean independent sites like ours couldn't do it differently πŸ™‚
#
[snarfed]
s/Google/major search engines/
#
[snarfed]
capjamesg maybe, but search is already hard enough; adding in another even harder problem doesn't seem like a winning solution
#
capjamesg[d]
Yeah haha.
#
[snarfed]
i got into this argument a fair amount in my indiemap talk
#
capjamesg[d]
I liked your keep it simple approach snarfed.
#
[tantek]
point here is that major search engines have crap results, especially for indieweb content, so there's incentive among indieweb sites to collaborate to produce higher quality more thorough search results amongst ourselves
#
[snarfed]
decentralization is good in many ways, but not everything benefits from decentralization
#
capjamesg[d]
There's an acronym for that that's at least popular in the UK: KISS (keep it simple, stupid).
#
[snarfed]
esp since it's generally _harder_ than centralization
#
[tantek]
snarfed, see start of thread, search indexing / aggregation seems like a natural fit for distributed computing
#
[snarfed]
seems is the key word
#
capjamesg[d]
In practice this would be hard.
#
[snarfed]
(also decentralized, not distributed. it's almost all distributed as in on more than one server)
#
capjamesg[d]
Yeah. Decentralized is a better word for it.
#
capjamesg[d]
I'd love to see this happen but there would be a lot to work out.
#
[tantek]
capjamesg[d], no, in *theory* it is hard because AFAIK there's no algorithms/protocols developed to do it yet
#
[tantek]
we're not even at the "in practice" point yet πŸ™‚
#
capjamesg[d]
I don't have any experience developing protocols πŸ˜„
#
[snarfed]
the nice thing is, our community is small enough that we can do both independently. 1) experiment with new IR/search techniques with centralized crawlers and search engines, eg cweiske's, indiemap, capjamesg's
#
[tantek]
you need a canonical text -> index algorithm, and you need a protocol for sharing / aggregating indexes
#
[snarfed]
...and 2) decentralized crawl and indexing. (/me runs away)
#
sknebel
Zegnat made that thing once which hooked into peoples site searches
#
capjamesg[d]
As a first step I can open source what I have. As long as people understand I am not a professional developer and I'm self dogfooding search.
#
[snarfed]
my two cents is, the ROI for decentralized search is massively bad and a very poor use of our time, but I'm all for people experimenting!
#
sknebel
(afaik through OpenSearch? whatever wordpress offers by default)
#
GWG
capjamesg[d]: I am not one either if it helps
#
[tantek]
depending on how antagonistic a set of actors you want to aggregate among, you also need a distributed spot verification protocol along with a protocol for downgrading (dropping) bad actors
#
[snarfed]
first I'd like to hear a crisp definition of the problem that we're trying to solve with decentralized search, and that we can't more easily already solve with centralized
#
[snarfed]
not sure I've heard that yet, other than, "existing search engines are bad, can we do better"
#
[tantek]
if you want to try experimenting, I'd start with two sites, each maintaining their own index, and then also presenting a "shared search" UI that combines their indices and see how that works. that should be enough to shake out the first two steps (1. text->index algo, 2. protocol for sharing/aggregating indexes)
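As a toy version of that two-site experiment — each site maintaining its own index, with a shared search UI merging them — one possible sketch (the site names, URLs, and index shape here are all made up for illustration):

```python
# Toy per-site inverted indexes: term -> set of URLs on that site.
# In the real experiment each site would publish its own index.
site_a = {
    "coffee": {"https://a.example/coffee"},
    "search": {"https://a.example/search"},
}
site_b = {
    "coffee": {"https://b.example/brew"},
}

def shared_search(term: str, *indexes: dict) -> set:
    """Aggregate results for one term across independently maintained indexes."""
    results = set()
    for index in indexes:
        results |= index.get(term, set())
    return results
```

Even this trivial version surfaces the two open questions from earlier in the thread: the indexes only merge cleanly if both sites used the same text → index algorithm, and the sharing step needs an agreed transport format.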
#
[snarfed]
which is crisp, but orthogonal to decentralized
#
rattroupe[d]
Do you remember when google used to give great results all the time?
#
[tantek]
snarfed, I shared already. existing search engines are bad in two very obvious ways
#
capjamesg[d]
If anyone wants to do that, they can peek into my codebase when I open source it.
#
[tantek]
1 existing search engines have TONS of noise in their results, enough to make their results often next to useless
#
capjamesg[d]
(By the way, SQLite virtual tables (read: full text search tables) are great but not for modifications. You cannot run an ALTER TABLE on SQLite virtual tables!)
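Concretely, the FTS5 limitation capjamesg hits is that you can't `ALTER TABLE ... ADD COLUMN` on a full-text virtual table, so a schema change means creating a new table and copying rows across. A minimal sketch (table and column names are arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(title, content)")
conn.execute("INSERT INTO pages VALUES ('Coffee', 'how I brew coffee')")

# ALTER TABLE ... ADD COLUMN is not supported on an FTS5 virtual table,
# so to add a column you build a new table and migrate the rows.
conn.execute("CREATE VIRTUAL TABLE pages2 USING fts5(title, content, url)")
conn.execute("INSERT INTO pages2 (title, content) SELECT title, content FROM pages")
conn.execute("DROP TABLE pages")
```

(FTS5 does allow `ALTER TABLE ... RENAME TO`, which helps when swapping the new table into the old name after the copy.)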
#
[tantek]
2 existing search engines lose TONS of signal (read fail to actually index) from indieweb sites
#
[tantek]
^ problem description
#
capjamesg[d]
[tantek] I know πŸ˜„
#
capjamesg[d]
I looked at this when I started developing mine.
#
capjamesg[d]
Which is why I built my own for my blog anyway.
#
capjamesg[d]
Because Google was not surfacing content in a way I liked and I had no control over index frequency.
#
[snarfed]
sure! I heard both of those. I just don't see how decentralizing is an obvious way to address quality
#
[tantek]
for example, Google sucks at indexing my own site, and has gotten noticeably worse (dropped posts from their index) over time
#
capjamesg[d]
I think decentralized aggregation is interesting but you'd need to figure out how to own representation.
#
[snarfed]
more like the opposite, it seems like a huge technical windmill to tilt at that _doesn't_ obviously improve quality, and other easier approaches would be better
#
[tantek]
it's why I switched to using DDG for my site search UI
#
[tantek]
and yet Google is happy to surface recipe spam sites πŸ™„
#
[snarfed]
as an example, I'd love to redo indiemap with current data, but there's no way in hell I'd want to solicit, collect, clean up, normalize, and debug a bunch of other people's partial crawls
#
[snarfed]
whether manually or via code
#
[snarfed]
that would be way more work, more painful, and way less consistent in quality, completeness, etc, than redoing the crawl entirely myself
#
[snarfed]
and that crawl task is basically the same if you want to make a search engine out of it
#
[tantek]
sure. hence why I suggested anyone interested in this start with two sites, not 1000s
#
[tantek]
nah, that's a big handwave "basically the same"
#
[snarfed]
but the direction is the same. still not seeing how decentralized crawl improves quality
#
[tantek]
depends on your quality bar
#
[snarfed]
not a handwave, I have some experience here. crawling for the search use case is not meaningfully different
#
[tantek]
if the bar is beat google, the bar is pretty low πŸ˜›
#
[snarfed]
ask capjamesg, he's done both ways very recently
Seirdy joined the channel
#
rattroupe[d]
Here’s an argument for how search quality could be improved through decentralization:
#
[snarfed]
sure! regardless of bar though, decentralizing crawl doesn't improve quality
#
[snarfed]
(all else equal)
#
rattroupe[d]
Google used to be good, but now its results are filled with spam. Bad actors try to game SEO to get to the top of Google results
#
capjamesg[d]
Crawling is just the means by which data is acquired.
#
capjamesg[d]
And it's pretty standard in terms of techniques at a high level.
#
rattroupe[d]
In fact the #1 reason google is so bad is because of bad actors
#
[tantek]
snarfed, point being, "still not seeing" is not due to lack of implementations trying, but rather lack of standards for (1) and (2) I noted above
#
rattroupe[d]
Spammers know if they can beat google, they win
#
capjamesg[d]
(i.e. following links, obeying robots.txt directives, reading structured data, then at a low-level things get harder)
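The high-level steps capjamesg lists can be sketched with just the standard library: fetch robots.txt, check each URL against it, and follow same-site links breadth-first. This is an assumption-laden toy, not capjamesg's actual crawler (which would also parse structured data, handle errors, and rate-limit):

```python
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url: str, max_pages: int = 50) -> dict:
    """Breadth-first crawl of one site, obeying its robots.txt."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(start_url))
    robots = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    robots.read()
    queue, seen, pages = [start_url], {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if not robots.can_fetch("*", url):
            continue  # honor robots.txt directives
        html = urlopen(url).read().decode("utf-8", "replace")
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            # stay on the same site; skip already-queued URLs
            if absolute.startswith(root) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

As capjamesg says, the low-level details (canonicalization, duplicate detection, politeness delays) are where it gets hard.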
#
[tantek]
yeah Google has lost that battle
#
rattroupe[d]
Decentralization means bad actors no longer have a single point of attack
#
capjamesg[d]
I love being able to customize my own search results.
#
capjamesg[d]
So I'd say that if you want to build your own search engine just for yourself because you don't like Google / DDG's results, great. Do it!
#
rattroupe[d]
Someone trying to get to the top of search results can’t do it if there are a thousand different search engines all crawling with slightly different algorithms
#
capjamesg[d]
I built some direct answers and use NLP to surface info that means the most to my visitors.
#
capjamesg[d]
E.g. I worked on answering questions about what equipment I use for coffee making.
#
capjamesg[d]
But everyone has a different need.
#
rattroupe[d]
It’s a basic robustness through numbers
#
capjamesg[d]
(I realise I'm chatting about personal search, not for multiple sites :P)
#
[snarfed]
definitely! experimenting is great. I like to advocate for us to experiment in ways that are more likely to be beneficial, which I don't believe decentralized crawl is (and decentralized indexing/ranking would be even worse), but everyone gets to choose their own experiments!
#
[tantek]
capjamesg[d]++ yes absolutely for personal site search
#
Loqi
capjamesg[d] has 1 karma in this channel over the last year (2 in all channels)
#
capjamesg[d]
The truth is that I made my site search more complex than it needed to be because I wanted to explore search. You can get by with much less code / backend.
#
[snarfed]
totally! and fwiw personal site searches are small centralized search engines, but perfect places to experiment with quality signals!
#
capjamesg[d]
And make something really cool too!
#
[snarfed]
capjamesg++
#
Loqi
capjamesg has 5 karma in this channel over the last year (8 in all channels)
#
capjamesg[d]
Example of custom SERP I built ^
#
capjamesg[d]
(it works for other queries but this is very much experimental)
#
rattroupe[d]
A decentralized search network would thus greatly reduce the effect that bad actors (spammers/SEO gamers) have on search results, which is the best way to improve results
#
capjamesg[d]
Would those benefits be substantially better than just having an open source system?
#
rattroupe[d]
Point is, you don’t want there to be one single algorithm that spammers could attack
#
capjamesg[d]
Where you could read the exact ranking factors going into a search result.
#
[tantek]
capjamesg[d], it's not either or
#
[tantek]
you could use OSS to implement distributed search protocols
#
[tantek]
rattroupe[d] depends on the algorithm, and where/how you decide to defend against spammers
#
[snarfed]
rattroupe I like the "plurality reduces attack vectors" idea! you'd actually end up with _worse_ quality, not better though. bad actors do affect results, but it's far from the single biggest quality problem
#
[tantek]
e.g. keep algorithms simple when possible. like the text-> index algorithm would *NOT* be the place to try to defend against spam
#
[snarfed]
definitely still interesting though
#
[snarfed]
(also there are already thousands of quality algorithms in big search engines, not just one. same with facebook's feed algorithm, etc. it's just easy for us to assume there's "the one" from the outside)
#
capjamesg[d]
Quick sidebar: if anyone knows how to accept all merge conflicts in VSCode, I need your help. I accidentally committed a big file to GitHub 🤦
#
rattroupe[d]
Google has spent millions, maybe billions of dollars, and thousands of man hours, trying to come up with the one best spam proof ranking algorithm, and they’ve failed. So I don’t think the answer lies in finding a single better algorithm
#
capjamesg[d]
Here are my personal search engine ranking factors: title, description, url, category, published, keywords, page_content, h1, h2, h3, h4, h5, h6.
#
capjamesg[d]
pagerank is also in the works but it requires separate calculations and I'm not ready to implement.
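One way those fields could feed a score is a simple weighted sum of term hits per field. The weights below are entirely hypothetical — the chat only lists the field names, not how capjamesg's engine actually combines them:

```python
# Hypothetical per-field weights over the factors listed above.
FIELD_WEIGHTS = {
    "title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.5, "h4": 2.0,
    "h5": 1.5, "h6": 1.2, "description": 2.0, "url": 1.5,
    "category": 1.0, "keywords": 2.0, "page_content": 1.0,
    "published": 0.0,  # recency would likely be a separate signal
}

def score(page: dict, query: str) -> float:
    """Sum weighted term occurrences across a page's indexed fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = str(page.get(field, "")).lower()
        total += weight * sum(text.count(term) for term in terms)
    return total
```

A link-based signal like the PageRank capjamesg mentions would then be multiplied or added in after this per-document score.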
#
capjamesg[d]
No. I don't think there's a spam-proof algorithm.
#
[tantek]
exactly
#
[tantek]
you handle it at a different layer, hence thoughts like distributed verification
#
[tantek]
and demoting / dropping of bad actors
#
[tantek]
I mean heck, that can even be done on a per-site-UI basis
#
[tantek]
if spammers want to get together and provide their own UI/results, let them, no one will bother to use it
#
[tantek]
if a bunch of indieweb peers get together and provide their own results, then there's high quality amongst them
#
[tantek]
that's how you "solve" spam, you avoid including them in the first place
#
capjamesg[d]
And vouching would be interesting too.
#
[tantek]
exactly
#
capjamesg[d]
(where domains with good authority can lend credence to a new / up and coming domain)
#
[tantek]
first you solve spam at the webmention level, then you use that solution and apply to search indexing πŸ™‚
#
capjamesg[d]
Haha πŸ™‚
#
capjamesg[d]
I haven't read the vouch spec yet.
#
capjamesg[d]
Need to do that.
#
capjamesg[d]
Whereas right now I am manually crawling sites for the IndieWeb search engine. Any good list of all IndieWeb domain names? There is one on the wiki right?
#
[snarfed]
agreed with most of this. spam maaaayyybe seems like a more reasonable target for decentralization
#
[snarfed]
(like we've already prototyped with vouch etc)
#
[tantek]
search results are just displaying webmentions for words instead of links
#
capjamesg[d]
Yeah. And you would also have control over the extent to which "SEO Optimization" is a thing.
#
[tantek]
πŸ˜›
#
[snarfed]
old but also includes methodology for updating
#
[tantek]
do folks doing SEO Optimization also use ATM Machines? πŸ˜‰
#
capjamesg[d]
I knew that list existed somewhere.
#
[tantek]
what is list of all IndieWeb domain names
#
Loqi
It looks like we don't have a page for "list of all IndieWeb domain names" yet. Would you like to create it? (Or just say "list of all IndieWeb domain names is ____", a sentence describing the term)
#
[snarfed]
2020 is only partial though, includes wiki but not wm.io or bridgy (which was the big one iirc)
#
capjamesg[d]
I'll let you all know when I have open sourced my search engine πŸ™‚
#
capjamesg[d]
Just crawling the wiki. Will have 5,000 pages in the first index.
#
[tantek]
list of all IndieWeb domain names is /chat-names#Nicknames
#
[tantek]
that'll do for now πŸ˜„
#
capjamesg[d]
Anyone know how to resolve all merge conflicts in VSCode? 🙂
#
capjamesg[d]
But for now it's television / sleep time!
Seirdy, KartikPrabhu, [chrisaldrich], Jordy[d], nertzy__, [manton] and [KevinMarks] joined the channel
#
[KevinMarks]
Google isn't one algorithm, it's a lot of separate algorithms with a unified scoring scheme so full search can pick between them. That would be harder to decentralize as the between algorithm parity part is hard.
#
[KevinMarks]
Mind you, Google is now trying to delegate that to a machine learning thing rather than a lot of internal argument and testing
#
[tantek]
no need to copy Google's implementation details like that. they bend over backwards with complexity like that due to fighting spam. start simple, not by copying the complexity of established players.
#
[tantek]
on another topic
#
[tantek]
[fluffy]++ for that long series of thoughtful notes regarding social media, federation, timelines, RSS etc. would definitely (re-)read it as a blog post! (including any/all critical perspectives of IndieWeb technologies, comments as default public etc.)
#
Loqi
[fluffy] has 7 karma in this channel over the last year (30 in all channels)
[schmarty] joined the channel
#
[fluffy]
Yeah I’ve been meaning to write something like that as not an extemporaneous rant.
nertzy__ joined the channel
#
[KevinMarks]
That was a very good start to a discussion indeed
#
[KevinMarks]
Hm. Maybe the "build your own search engine from distributed oneboxes" could work. If you're not trying to use them all, then the scoring arms race is less of an issue. If you can define a focused search algorithm that's good at one thing, then others can decide to add it to their search results. Achieving the speed gate that Google applied may be trickier
tetov-irc joined the channel
#
[tantek]
other way around KevinMarks, let's design for realtime distributed search from the start. who needs Google's delayed crawl & index dependency
#
[tantek]
instead of a crawl/polling, a local websub / pubsub architecture could incrementally update local indexes. a lot less processing/IO
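A sketch of what such a push-updated local index might look like, assuming change notifications arrive per URL with the new content. The class and method names are illustrative; this is not a real WebSub subscriber, just the index-maintenance side:

```python
from collections import defaultdict

class LiveIndex:
    """Inverted index patched by per-URL change notifications, no re-crawl."""

    def __init__(self):
        self.terms = defaultdict(set)  # term -> set of URLs
        self.by_url = {}               # url -> its terms, for cheap removal

    def on_change(self, url: str, text: str) -> None:
        """Handle a (WebSub-style) content-change notification for one URL."""
        # Drop the URL's old postings, then index the new text.
        for term in self.by_url.pop(url, set()):
            self.terms[term].discard(url)
        tokens = set(text.lower().split())
        for term in tokens:
            self.terms[term].add(url)
        self.by_url[url] = tokens

    def search(self, term: str) -> set:
        return set(self.terms.get(term.lower(), set()))
```

Updating only the changed URL's postings is what makes this cheaper than polling: the work is proportional to the edit, not to the size of the site.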
#
[schmarty]
Personal indexes and agents yeessss
#
[tantek]
updating local indexes based on changes to content is not that different from resending webmentions based on changes to content
#
[schmarty]
I don't have a better link at the moment but I find myself thinking about what an IndieWeb remembrance agent would look like https://alumni.media.mit.edu/~rhodes/Papers/remembrance.html
#
[schmarty]
Indexing all the things I follow and bookmark, checkins, likes, watches, listens, notes I make for myself, etc
#
aaronpk
i kind of use my website for this already, by bookmarking and liking things. my search engine searches across all the content
Seirdy joined the channel
#
[tantek]
I kinda want that and some sort of "2nd degree index" that also indexes the contents of webmentions received, posts replied to, things bookmarked, etc.
#
aaronpk
oh yeah, the contents of the webmentions i receive aren't part of my index... that'd be interesting, also interesting how i would display those in the search results
#
[tantek]
similarly the (full) contents of things in your reply-contexts
[tw2113_Slack_] joined the channel