#dev 2021-08-30
2021-08-30 UTC
Seirdy, nolith, voxpelli, capjamesg[d], bneil, wrmilling, vilhalmer, strugee, GWG and [tantek] joined the channel
# capjamesg[d] I thought I'd get a bit more serious about noting down domain names to crawl. If you want to be included in my search crawl of some IndieWeb sites while I experiment with building a multi-site search engine, you can add your name here: https://gist.github.com/capjamesg/5323cf383f9669c6fa7247c738490287
# capjamesg[d] It might take a while to crawl your site though
hendursa1 joined the channel
tetov-irc joined the channel
# jamietanna thanks [snarfed] :)
sebbu and nertzy__ joined the channel
# capjamesg[d] Roughly how many pages are on the wiki?
# Murray[d] capjamesg: https://indieweb.org/Special:Statistics
# capjamesg[d] So I will crawl 1/12th of the wiki haha.
# capjamesg[d] I might need to make an exception in my crawler.
Ramon[d], chenghiz_, pufferfish, hendursaga, Zegnat[d] and Ruxton joined the channel
nertzy__ joined the channel
# capjamesg[d] The crawl went through this morning [fluffy].
# capjamesg[d] I have indexed 25753 pages so far.
# capjamesg[d] The project will live here: https://indieweb-search.jamesg.blog/
# capjamesg[d] I will change the name, etc. when I get the chance.
# [tantek] crawling and searching has always seemed like a natural fit for distributed computing, so I have to admit I'm always perplexed when the default is to build one thing that crawls everything (instead of coming up with a protocol to do distributed (perhaps self-?) crawling and maybe even results retrieval, leaving aggregation/ranking up to specific UIs
# capjamesg[d] Well...
# capjamesg[d] I thought about this very thing.
# capjamesg[d] Because if people wanted to crawl their site and send the data to me to be included in the main search index that would be great.
# capjamesg[d] I need to look into how this would work. Still really just an idea.
# capjamesg[d] The advantage to the centralised approach is consistency in terms of how results are indexed.
# capjamesg[d] Which is helpful at scale because not everyone may be honest about the integrity of the content they index.
# capjamesg[d] So safeguards are needed.
# capjamesg[d] Yep.
# capjamesg[d] It would just need a lot more thought than I have put in
# capjamesg[d] (for distributed working)
# capjamesg[d] And relying on people to crawl their site isn't a problem.
# capjamesg[d] Because it would be everyone's choice to be included, and when, etc.
# capjamesg[d] "According to the FAQ about Nutch, an open-source search engine website, the savings in bandwidth by distributed web crawling are not significant, since 'a successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages...'"
# capjamesg[d] Just looking at the Wikipedia page for this
# capjamesg[d] I'm not big into decentralized networking
# capjamesg[d] So I would need to read in on what existing trust architectures look like.
# capjamesg[d] Yeah.
# capjamesg[d] Now...
# capjamesg[d] I haven't open sourced my search engine because the code is a mess and I also broke the repository a bit today (I will need help on that haha).
# capjamesg[d] But I think transparency into the "algorithm" is crucial.
# capjamesg[d] Oh yeah snarfed. Google are big enough to do it in house.
# capjamesg[d] But that doesn't mean independent sites like ours couldn't do it differently
# capjamesg[d] Yeah haha.
# capjamesg[d] I liked your keep it simple approach snarfed.
# capjamesg[d] There's an acronym for that that's at least popular in the UK: KISS (keep it simple, stupid).
# capjamesg[d] In practice this would be hard.
# capjamesg[d] Yeah. Decentralized is a better word for it.
# capjamesg[d] I'd love to see this happen but there would be a lot to work out.
# capjamesg[d] +1
# capjamesg[d] I don't have any experience developing protocols
# capjamesg[d] As a first step I can open source what I have. As long as people understand I am not a professional developer and I'm self dogfooding search.
# capjamesg[d] Yeah.
# [tantek] if you want to try experimenting, I'd start with two sites, each maintaining their own index, and then also presenting a "shared search" UI that combines their indices and see how that works. that should be enough to shake out the first two steps (1. text->index algo, 2. protocol for sharing/aggregating indexes)
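A rough sketch of what step 2 of that experiment might look like, assuming each site exposes its index as a simple term-to-pages mapping (the index shape, site names, and scores below are all invented for illustration, not part of any existing protocol):

```python
# Sketch of the two-site experiment: each site keeps its own index,
# and a shared search UI merges them at query time.

def search_shared(indexes, query):
    """Merge per-site indexes by summing scores for pages matching the term."""
    merged = {}
    for site_index in indexes:
        for url, score in site_index.get(query, {}).items():
            merged[url] = merged.get(url, 0.0) + score
    # Rank by combined score, highest first.
    return sorted(merged, key=merged.get, reverse=True)

# Two hypothetical per-site indexes: term -> {url: score}.
site_a = {"indieweb": {"https://a.example/start": 2.0}}
site_b = {"indieweb": {"https://b.example/wiki": 3.0,
                       "https://a.example/start": 2.0}}
results = search_shared([site_a, site_b], "indieweb")
```

The open question this leaves is exactly the one raised above: both sites have to agree on what the scores mean before summing them is fair.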
# rattroupe[d] Do you remember when google used to give great results all the time?
# capjamesg[d] If anyone wants to do that, they can peek into my codebase when I open source it.
# capjamesg[d] (By the way, SQLite virtual tables (read: full text search tables) are great but not for modifications. You cannot run an ALTER TABLE on SQLite virtual tables!)
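The limitation mentioned above can be demonstrated in a few lines. This assumes a Python build whose SQLite includes the FTS5 extension (true for most distributions); the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Full-text search in SQLite is implemented as a virtual table.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(title, content)")
conn.execute("INSERT INTO pages VALUES ('IndieWeb search', 'crawling the wiki')")

# MATCH queries work as expected:
hits = conn.execute(
    "SELECT title FROM pages WHERE pages MATCH 'crawling'").fetchall()

# ...but adding a column does not. SQLite rejects ALTER TABLE ... ADD COLUMN
# on virtual tables; the workaround is to create a new virtual table with the
# extra column and copy the rows across.
try:
    conn.execute("ALTER TABLE pages ADD COLUMN author")
    altered = True
except sqlite3.Error:
    altered = False
```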
# capjamesg[d] [tantek] I know
# capjamesg[d] I looked at this when I started developing mine.
# capjamesg[d] Which is why I built my own for my blog anyway.
# capjamesg[d] Because Google was not surfacing content in a way I liked and I had no control over index frequency.
# capjamesg[d] I think decentralized aggregation is interesting but you'd need to figure out how to own representation.
Seirdy joined the channel
# rattroupe[d] Here's an argument for how search quality could be improved through decentralization:
# rattroupe[d] Google used to be good, but now its results are filled with spam. Bad actors try to game SEO to get to the top of Google results
# capjamesg[d] Crawling is just the means by which data is acquired.
# capjamesg[d] And it's pretty standard in terms of techniques at a high level.
# rattroupe[d] In fact the #1 reason google is so bad is because of bad actors
# rattroupe[d] Spammers know if they can beat google, they win
# capjamesg[d] (i.e. following links, obeying robots.txt directives, reading structured data, then at a low-level things get harder)
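The high-level steps named here (follow links, obey robots.txt) can be sketched offline with the standard library; the robots.txt body and HTML snippet are invented examples, and a real crawler would also need fetching, politeness delays, deduplication, and URL canonicalisation:

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

# Parse a robots.txt body (normally fetched from the site).
robots = RobotFileParser()
robots.parse("User-agent: *\nDisallow: /private/".splitlines())

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags: the 'following links' step."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<a href="/notes/">notes</a> <a href="/private/x">secret</a>')

# Keep only links a polite crawler is allowed to fetch.
allowed = [link for link in extractor.links if robots.can_fetch("*", link)]
```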
# rattroupe[d] Decentralization means bad actors no longer have a single point of attack
# capjamesg[d] I love being able to customize my own search results.
# capjamesg[d] So I'd say that if you want to build your own search engine just for yourself because you don't like Google / DDG's results, great. Do it!
# rattroupe[d] Someone trying to get to the top of search results can't do it if there are a thousand different search engines all crawling with slightly different algorithms
# capjamesg[d] I built some direct answers and use NLP to surface info that means the most to my visitors.
# capjamesg[d] E.g. I worked on answering questions about what equipment I use for coffee making.
# capjamesg[d] But everyone has a different need.
# rattroupe[d] It's a basic robustness through numbers
# capjamesg[d] (I realise I'm chatting about personal search, not for multiple sites :P)
# capjamesg[d] The truth is that I made my site search more complex than it needed to be because I wanted to explore search. You can get by with much less code / backend.
# capjamesg[d] And make something really cool too!
# capjamesg[d] Example of custom SERP I built ^
# capjamesg[d] (it works for other queries but this is very much experimental)
# rattroupe[d] A decentralized search network would thus greatly reduce the effect bad actors (spammers/SEO gamers) have on search results, which is the best way to improve results
# capjamesg[d] Would those benefits be substantially better than just having an open source system?
# rattroupe[d] Point is, you don't want there to be one single algorithm that spammers could attack
# capjamesg[d] Where you could read the exact ranking factors going into a search result.
# capjamesg[d] Indeed.
# capjamesg[d] Quick sidebar: if anyone knows how to accept all merge requests in VSCode, I need your help. I accidentally committed a big file to GitHub
# rattroupe[d] Google has spent millions, maybe billions of dollars, and thousands of man hours, trying to come up with the one best spam-proof ranking algorithm, and they've failed. So I don't think the answer lies in finding a single better algorithm
# capjamesg[d] Here are my personal search engine ranking factors: title, description, url, category, published, keywords, page_content, h1, h2, h3, h4, h5, h6.
# capjamesg[d] pagerank is also in the works but it requires separate calculations and I'm not ready to implement.
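A toy version of field-weighted ranking like the factor list above: a query term found in the title counts for more than one found in the body. The weights here are invented purely for illustration; only the field names follow the list:

```python
# Hypothetical per-field weights (a subset of the factors listed above).
FIELD_WEIGHTS = {"title": 5.0, "description": 3.0, "h1": 4.0,
                 "keywords": 2.0, "page_content": 1.0}

def score(document, term):
    """Sum weight * occurrences of the term in each indexed field."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = document.get(field, "").lower()
        total += weight * text.count(term.lower())
    return total

doc = {"title": "Coffee gear", "page_content": "The coffee gear I use daily."}
# "coffee" appears once in the title (weight 5) and once in the body (weight 1).
```

A link-based signal like PageRank would then be a separate per-document multiplier computed ahead of query time, which is why it needs the extra calculations mentioned above.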
# capjamesg[d] No. I don't think there's a spam-proof algorithm.
# capjamesg[d] Yep.
# capjamesg[d] Yeah.
# capjamesg[d] And vouching would be interesting too.
# capjamesg[d] (where domains with good authority can lend credence to a new / up and coming domain)
# capjamesg[d] Haha
# capjamesg[d] I haven't read the vouch spec yet.
# capjamesg[d] Need to do that.
# capjamesg[d] Whereas right now I am manually crawling sites for the IndieWeb search engine. Is there a good list of all IndieWeb domain names? There is one on the wiki, right?
# capjamesg[d] Yeah. And you would also have control over the extent to which "SEO Optimization" is a thing.
# [snarfed] capjamesg: https://indiemap.org/docs.html#sites
# capjamesg[d] I knew that list existed somewhere.
# [snarfed] oh there's also https://github.com/snarfed/indie-map/tree/main/crawl/2020
# Loqi It looks like we don't have a page for "list of all IndieWeb domain names" yet. Would you like to create it? (Or just say "list of all IndieWeb domain names is ____", a sentence describing the term)
# capjamesg[d] I'll let you all know when I have open sourced my search engine
# capjamesg[d] Just crawling the wiki. Will have 5,000 pages in the first index.
# [tantek] list of all IndieWeb domain names is /chat-names#Nicknames
# capjamesg[d] Anyone know how to resolve all merges in VSCode?
# capjamesg[d] But for now it's television / sleep time!
Seirdy, KartikPrabhu, [chrisaldrich], Jordy[d], nertzy__, [manton] and [KevinMarks] joined the channel
# [KevinMarks] Google isn't one algorithm, it's a lot of separate algorithms with a unified scoring scheme so full search can pick between them. That would be harder to decentralize, as the between-algorithm parity part is hard.
# [KevinMarks] Mind you, Google is now trying to delegate that to a machine learning thing rather than a lot of internal argument and testing
[schmarty] joined the channel
nertzy__ joined the channel
# [KevinMarks] That was a very good start to a discussion indeed
# [KevinMarks] Hm. Maybe the "build your own search engine from distributed oneboxes" could work. If you're not trying to use them all, then the scoring arms race is less of an issue. If you can define a focused search algorithm that's good at one thing, then others can decide to add it to their search results. Achieving the speed gate that Google applied may be trickier
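One way a "distributed oneboxes" UI could dodge the score-parity problem is to never compare scores across engines at all: let each focused engine rank its own results, then interleave them round-robin. A minimal sketch, with invented engine result lists:

```python
def interleave(engine_results, limit):
    """Round-robin merge: take each engine's next-best result in turn."""
    merged, seen = [], set()
    for rank in range(limit):
        for results in engine_results:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
            if len(merged) == limit:
                return merged
    return merged

# Two hypothetical focused engines, each already ranked internally.
recipes = ["coffee-brew-guide", "espresso-basics"]
wiki = ["indieweb-search", "coffee-brew-guide"]
top = interleave([recipes, wiki], 3)
```

The trade-off is that interleaving trusts each engine's internal ranking but makes no claim about relative quality between engines, which is exactly the part that is hard to standardise.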
tetov-irc joined the channel
# [schmarty] Personal indexes and agents yeessss
# [schmarty] I don't have a better link at the moment but I find myself thinking about what an IndieWeb remembrance agent would look like https://alumni.media.mit.edu/~rhodes/Papers/remembrance.html
# [schmarty] Indexing all the things I follow and bookmark, checkins, likes, watches, listens, notes I make for myself, etc
Seirdy joined the channel
[tw2113_Slack_] joined the channel