#dev 2021-09-23
2021-09-23 UTC
oodani, edburns, kevinc[d], kaeldra[d], edburns[d], neocow and hendursa1 joined the channel
# capjamesg[d] Murray Google lets you request a crawl from Search Console.
# capjamesg[d] I haven't added that feature yet because I didn't know if people would want it.
# capjamesg[d] I would probably have an open API, rate limited, and only for people whose sites are already in the index.
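A minimal sketch of the gate described above: a recrawl request is accepted only if the site is already in the index and has not asked too recently. The names `INDEXED_DOMAINS` and `RecrawlGate`, and the one-hour interval, are assumptions for illustration and are not part of the project.

```python
import time
from urllib.parse import urlparse

# Hypothetical stand-in for "sites that are already in the index".
INDEXED_DOMAINS = {"jamesg.blog", "example.com"}


class RecrawlGate:
    """Accept a recrawl request only for indexed domains, at most once per interval."""

    def __init__(self, min_interval_seconds=3600.0, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.last_request = {}  # domain -> time of the last accepted request

    def allow(self, url):
        domain = urlparse(url).hostname
        if domain is None or domain not in INDEXED_DOMAINS:
            return False  # the site is not in the index
        now = self.clock()
        last = self.last_request.get(domain)
        if last is not None and now - last < self.min_interval:
            return False  # rate limited
        self.last_request[domain] = now
        return True
```

Keying the limit on the domain rather than the caller means one busy site cannot use up the crawl capacity of the others.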
# capjamesg[d] I couldn't rely solely on feed polling though because: (i) not everyone uses feeds; (ii) feeds will not let me discover other pages.
# Murray[d] That makes sense. I was thinking it would be nice to have a search engine that encouraged/utilised IndieWeb tech directly; I don't really touch Micropub/sub so not sure of their real capabilities, but it occurred to me that social readers must use those tools to refresh feeds, and building that in as a way to trigger a recrawl (and thereby potentially boost ranking) could provide an additional benefit to their implementation
# capjamesg[d] It's also funny you say that...
# capjamesg[d] I have thought about weighting links for various microformats "social" interactions like "likes" and "bookmarks"
# capjamesg[d] The search engine isn't really promotional of any standards.
# capjamesg[d] I'd like https:// support to be a tiebreaker.
# capjamesg[d] But that's about as far as I have delved into this world
# capjamesg[d] With that said, I do want to make additional features available to the amazing folks who implement IndieWeb tech.
# capjamesg[d] "who is jamesg.blog" should show my h-card in the engine.
# capjamesg[d] But...
# capjamesg[d] I may as well make the most of IndieWeb tech since lots of sites in the index support it.
# capjamesg[d] A feed subscription would be cool.
# capjamesg[d] Maybe spotting rel=alternate feeds.
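Spotting `rel=alternate` feeds could look roughly like this, using only the standard library. The function name `discover_feeds` and the choice to accept only RSS and Atom MIME types are assumptions for the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds in this sketch (an assumption, not an exhaustive list).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point at feeds."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rels and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```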
# capjamesg[d] I need to work on the "who is" a bit more
# capjamesg[d] It's a tad broken.
# capjamesg[d] To be honest I could add microformats2 support as a ranking factor but as another "tiebreaker".
# capjamesg[d] If a page supports mf2, they could get an additional point in rankings.
# capjamesg[d] So all else being equal, the site with mf2 would win.
# capjamesg[d] But the factor would be weighted so little that there would be no chance of mf2 content displacing more relevant content.
# capjamesg[d] Relevance first, then weights.
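One way to get "relevance first, then weights" is a lexicographic sort, where mf2 and HTTPS support are only consulted when relevance scores are equal. This is a sketch under that assumption; the field names `relevance`, `has_mf2` and `is_https` are hypothetical, and the engine itself may add a small point value instead.

```python
def rank(results):
    """Sort by relevance; mf2 and HTTPS support only break exact ties."""
    return sorted(
        results,
        # Tuples compare element by element, so the later fields matter
        # only when every earlier field is equal.
        key=lambda r: (r["relevance"], r["has_mf2"], r["is_https"]),
        reverse=True,
    )
```

With a strict tiebreak like this, a page with mf2 can never displace a more relevant page. Exact ties are rare with floating-point scores, so in practice the relevance score might be rounded or bucketed first.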
# Murray[d] RE: mf2, I guess with any kind of rank weighting you want to avoid gameable systems that encourage negative behaviour. In that case, might it be tricky to distinguish between those with valid mf2 and people who have just spammed a bunch of random mf2 classes onto a page? (Though, for a smaller engine like this, perhaps those issues of scale are best left for now)
# capjamesg[d] That's one reason why I have held off
# capjamesg[d] Most links in the engine are curated right now so I know they are good.
# capjamesg[d] I have a feeling I'll just keep the engine for personal sites though.
# capjamesg[d] I love the "phonebook" idea. How can IndieWeb Search assist?
# Murray[d] As I say, this is way out of my league, so it's just sat in my "ideas" folder/as a brain worm. But effectively, the way I've considered it, is an API that you can fire a URL to and it would return details about the person, derived from their h-card. I haven't really got much beyond that
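The core of such a "phonebook" lookup might be sketched as below. It assumes the page has already been fetched and parsed into standard microformats2 JSON (for example by a parser such as mf2py); the function name `find_h_card` and the choice of returned fields are hypothetical.

```python
def find_h_card(mf2_json):
    """Return name, url and photo from the first top-level h-card, or None."""
    for item in mf2_json.get("items", []):
        if "h-card" in item.get("type", []):
            props = item.get("properties", {})
            # mf2 property values are lists; take the first value if present.
            return {
                key: props[key][0] if props.get(key) else None
                for key in ("name", "url", "photo")
            }
    return None
```

This only looks at top-level items; an h-card nested inside another item (such as an author inside an h-entry) would need a recursive walk.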
# capjamesg[d] Have you seen XRay from aaronpk?
# capjamesg[d] No worries voxpelli. If you have any tips yourself, let me know
# capjamesg[d] Love it! I'll note down how to expand "who is" as an idea on the GitHub repo for the project and think more about it.
# capjamesg[d] I have over 200,000 pages of raw HTML easily accessible in the index so if there's a use for it I'm more than down to explore!
# [KevinMarks] Websub can be a recrawl signal
# [KevinMarks] We used to have ping servers for this https://codex.wordpress.org/Update_Services
# capjamesg[d] [KevinMarks] I'm going to reuse my microsub code with a few modifications for parsing feeds.
# capjamesg[d] I found that Google has a ping API you can use to tell them your site has a new sitemap.
# capjamesg[d] It's just one endpoint though.
# capjamesg[d] google.com/ping
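For illustration, the ping URL can be built as follows. The `sitemap` query parameter reflects how the endpoint was documented at the time of this conversation, as far as I know; Google is understood to have retired the endpoint since, so treat this as a historical sketch rather than a working integration. The function name is hypothetical and no request is sent.

```python
from urllib.parse import urlencode


def sitemap_ping_url(sitemap_url):
    """Build the ping URL for a sitemap; the caller decides whether to fetch it."""
    return "https://www.google.com/ping?" + urlencode({"sitemap": sitemap_url})
```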
# capjamesg[d] [KevinMarks] How does a ping server prevent abuse?
# capjamesg[d] Authentication would be one part. Maybe rate limiting too?
# capjamesg[d] And a "maximum recrawl" budget per day?
# capjamesg[d] I think I just answered my own question 🤦‍♂️
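The "maximum recrawl" budget per day could be tracked with a simple counter keyed by domain and date. This is a sketch only: the class name `DailyRecrawlBudget` and the default of five recrawls are assumptions, and a real service would persist the counts and sit behind authentication.

```python
import datetime


class DailyRecrawlBudget:
    """Allow each domain a fixed number of recrawls per calendar day."""

    def __init__(self, max_per_day=5):
        self.max_per_day = max_per_day
        self.used = {}  # (domain, date) -> recrawls spent that day

    def try_spend(self, domain, today=None):
        today = today or datetime.date.today()
        key = (domain, today)
        if self.used.get(key, 0) >= self.max_per_day:
            return False  # budget exhausted for today
        self.used[key] = self.used.get(key, 0) + 1
        return True
```

As [KevinMarks] notes in the reply that follows, limits like this cap the cost of abuse but do not tell you which pings are worth acting on.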
tetov-irc joined the channel
# [KevinMarks] It doesn't really - weblogs.com was over 90% spam in the end. You need another metric to decide. At technorati we had our own ping server and used others and applied a range of metrics to decide which ones we recrawled rapidly and which we deferred. I think at one point we used 'from weblogs.com ' as a spam signal in itself.
# petermolnar > I think I just answered my own question - https://en.wikipedia.org/wiki/Rubber_duck_debugging
wagle, hendursaga, akevinhuang and akevinhuang2 joined the channel
# capjamesg[d] voxpelli To recrawl, I am going to: (i) check all active RSS / Atom / mf2 h-feeds and crawl new articles; (ii) look at pages that were recently added to a sitemap using the lastmod attribute (if none is present, pages will be crawled anyway); (iii) check if the homepage was crawled in the last three weeks and if it was not, order a full recrawl.
# capjamesg[d] This will all be controlled by a cron job
# capjamesg[d] Point (iii) needs a bit of work because the sitemap change might impact last crawl date.
# capjamesg[d] I can just use the "last crawl" attribute I have in a file instead.
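The three-step plan above can be sketched as a single function that a cron job would call per site. All names here are hypothetical, and two choices are assumptions: "recently added" is read as "lastmod later than the stored last crawl", and the three-week check runs first because a full recrawl makes the other steps redundant.

```python
from datetime import datetime, timedelta

FULL_RECRAWL_AFTER = timedelta(weeks=3)


def plan_recrawl(new_feed_urls, sitemap_entries, last_crawl, now):
    """Decide what to crawl for one site.

    new_feed_urls: URLs of new articles found in RSS / Atom / h-feed polling.
    sitemap_entries: list of (url, lastmod) pairs; lastmod may be None.
    last_crawl: datetime from the stored "last crawl" record, or None.
    """
    # (iii) Homepage not crawled in the last three weeks: order a full recrawl.
    if last_crawl is None or now - last_crawl > FULL_RECRAWL_AFTER:
        return {"full_recrawl": True, "urls": []}

    # (i) Start with new articles discovered through feeds.
    urls = list(new_feed_urls)

    # (ii) Add sitemap pages changed since the last crawl; pages without
    # a lastmod value are crawled anyway.
    for url, lastmod in sitemap_entries:
        if (lastmod is None or lastmod > last_crawl) and url not in urls:
            urls.append(url)

    return {"full_recrawl": False, "urls": urls}
```

Reading `last_crawl` from a separately stored record, rather than inferring it from the sitemap, avoids the problem raised above where a sitemap change could affect the last crawl date.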
rommudoh[m] joined the channel
nekr0z, reed, hala-bala[m], Abhas[m], benatkin, SamWilson[m], LaBcasse[m], astralbijection[, vikanezrimaya, Lohn, diegov, npd[m], mackeveli_, ChrisHarris[m] and unrelentingtech joined the channel
hendursaga, Ramon[d], Seirdy, Loqi, [benatwork], hans1963[d], kimberlyhirsh[d], j12t, KartikPrabhu, akevinhuang, akevinhuang2 and tetov-irc joined the channel