#dev 2021-09-23

2021-09-23 UTC
oodani, edburns, kevinc[d], kaeldra[d], edburns[d], neocow and hendursa1 joined the channel
#
capjamesg[d]
Murray Google lets you request a crawl from Search Console.
#
capjamesg[d]
I haven't added that feature yet because I didn't know if people would want it.
#
capjamesg[d]
I would probably have an open API, rate limited, and only for people whose sites are already in the index.
#
capjamesg[d]
I couldn't rely solely on feed polling though because: (i) not everyone uses feeds; (ii) feeds will not let me discover other pages.
#
Murray[d]
That makes sense. I was thinking it would be nice to have a search engine that encouraged/utilised Indieweb tech directly; I don't really touch Micropub/sub so not sure of their real capabilities, but it occurred that social readers must use those tools to refresh feeds, and building that in as a way to trigger a recrawl (and thereby potentially boost ranking) could provide an additional benefit to their implementation 🙂
#
Murray[d]
Effectively giving a search weighting to Indieweb sites
#
capjamesg[d]
It's also funny you say that...
#
Murray[d]
It does not surprise me that you have had these thoughts already 😄
#
capjamesg[d]
I have thought about weighing links for various microformats "social" interactions like "likes" and "bookmarks"
#
capjamesg[d]
The search engine isn't really promotional of any standards.
#
capjamesg[d]
I'd like https:// support to be a tiebreaker.
#
capjamesg[d]
But that's about as far as I have delved into this world 😄
#
capjamesg[d]
With that said, I do want to make additional features available to the amazing folks who implement IndieWeb tech.
#
capjamesg[d]
"who is jamesg.blog" should show my h-card in the engine.
#
Murray[d]
Sure, and I understand you're building a search engine first and foremost, not an index of Indieweb sites, so that all makes sense
#
capjamesg[d]
I may as well make the most of IndieWeb tech since lots of sites in the index support it.
#
capjamesg[d]
A feed subscription would be cool.
#
Murray[d]
That WhoIs search is pretty cool
#
capjamesg[d]
Maybe spotting rel=alternate feeds.
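[Editor's sketch] One way to spot rel=alternate feeds, assuming the page HTML has already been fetched. The names FeedLinkParser and find_feeds and the list of feed MIME types are illustrative assumptions, not the engine's actual code; mf2 h-feeds are marked up differently and are not covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of feed MIME types to look for (not taken from the chat).
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/feed+json",
}


class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate"> elements that point at feeds."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        mime = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rels and mime in FEED_TYPES and href:
            # Resolve relative feed URLs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feeds(html, base_url):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```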
#
capjamesg[d]
I need to work on the who is a bit more 😄
#
capjamesg[d]
It's a tad broken.
#
Murray[d]
Just trying it out now, seems to work really well for me 😄 👍
#
capjamesg[d]
To be honest I could add microformats2 support as a ranking factor but as another "tiebreaker".
#
capjamesg[d]
If a page supports mf2, they could get an additional point in rankings.
#
capjamesg[d]
So all else being equal, the site with mf2 would win.
#
capjamesg[d]
But the factor would be weighed so little that there would be no chance of mf2 content displacing more relevant content.
#
capjamesg[d]
Relevance first, then weights.
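[Editor's sketch] A minimal illustration of "relevance first, then weights", assuming each result already carries a relevance score from the main ranker. The Result fields and the rank function are hypothetical. Sorting on a tuple is one way to guarantee that mf2 or https support only ever breaks exact ties; an additive "extra point" would need the bonus kept smaller than any real difference in relevance.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    relevance: float        # hypothetical score from the main ranker
    has_mf2: bool = False   # page exposes microformats2
    is_https: bool = False  # page is served over https


def rank(results):
    # Tuples compare left to right, so has_mf2 and is_https are only
    # consulted when two results have exactly the same relevance.
    return sorted(
        results,
        key=lambda r: (r.relevance, r.has_mf2, r.is_https),
        reverse=True,
    )
```

With this ordering a less relevant page with mf2 can never displace a more relevant page without it.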
#
Murray[d]
I've long wondered about an Indieweb "phonebook" type thing, some way to make making rich attributions on bookmarks, social interactions, person tagging etc. easier and this feels like it could almost fit that, which is cool 🙂
#
Murray[d]
RE: mf2, I guess with any kind of rank weighting you want to avoid gameable systems that encourage negative behaviour. In that case, might it be tricky to distinguish between those with valid mf2 and people who have just spammed a bunch of random mf2 classes onto a page? (Though, for a smaller engine like this, perhaps those issues of scale are best left for now 😄 )
#
capjamesg[d]
That's one reason why I have held off 😄
#
capjamesg[d]
Most links in the engine are curated right now so I know they are good.
#
capjamesg[d]
I have a feeling I'll just keep the engine for personal sites though.
#
capjamesg[d]
I love the "phonebook" idea. How can IndieWeb Search assist?
#
voxpelli
capjamesg[d]: I don’t seem to have the links to the recrawl papers anywhere close by, it’s probably in some pile of papers
#
Murray[d]
As I say, this is way out of my league, so it's just sat in my "ideas" folder/as a brain worm 😄 But effectively, the way I've considered it, is an API that you can fire a URL to and it would return details about the person, derived from their h-card. I haven't really got much beyond that 🙂
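[Editor's sketch] A rough idea of the lookup step behind such an API, assuming the page has already been fetched and parsed into canonical microformats2 JSON (a dict with an "items" list) by some mf2 parser. The functions find_h_card and summarise and the choice of returned fields are illustrative assumptions; fetching and the HTTP layer are left out.

```python
def find_h_card(items):
    """Return the first h-card in a list of parsed mf2 items, searching nested items too."""
    for item in items:
        if "h-card" in item.get("type", []):
            return item
        # Nested microformats can sit in "children" or inside property values.
        nested = list(item.get("children", []))
        for values in item.get("properties", {}).values():
            nested.extend(v for v in values if isinstance(v, dict))
        found = find_h_card(nested)
        if found is not None:
            return found
    return None


def summarise(card):
    """Reduce an h-card to a few fields a contact book might want (assumed selection)."""
    props = card.get("properties", {})

    def first(name):
        values = props.get(name, [])
        if not values:
            return None
        value = values[0]
        # Some parsers return e.g. photo as {"value": ..., "alt": ...}.
        return value.get("value") if isinstance(value, dict) else value

    return {"name": first("name"), "url": first("url"), "photo": first("photo")}
```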
#
capjamesg[d]
Have you seen XRay from aaronpk?
#
capjamesg[d]
No worries voxpelli. If you have any tips yourself, let me know 🙂
#
Murray[d]
I'm aware, but I'd personally prefer something directly catering towards a specific goal
#
Murray[d]
A second consideration would be a way to "seed" a personal contacts book, so it would be useful to effectively be able to look up individuals via the API en masse
#
Murray[d]
Basically, I'm slowly building a personal contact book in my CMS and it feels like it would make sense to centralise that information somehow within the community 🤷‍♂️
#
capjamesg[d]
Love it! I'll note down how to expand "who is" as an idea on the GitHub repo for the project and think more about it.
#
capjamesg[d]
I have over 200,000 pages of raw HTML easily accessible in the index so if there's a use for it I'm more than down to explore!
#
Murray[d]
As I say, it's a nice feature, will be cool to see how/if it evolves 👍 capjamesg++
#
Loqi
capjamesg has 9 karma in this channel over the last year (18 in all channels)
#
[KevinMarks]
Websub can be a recrawl signal
#
[KevinMarks]
We used to have ping servers for this https://codex.wordpress.org/Update_Services
#
capjamesg[d]
[KevinMarks] I'm going to reuse my microsub code with a few modifications for parsing feeds.
#
capjamesg[d]
I found that Google has a ping API you can use to tell them your site has a new sitemap.
#
capjamesg[d]
It's just one endpoint though.
#
capjamesg[d]
google.com/ping
#
capjamesg[d]
[KevinMarks]
#
capjamesg[d]
How does a ping server prevent abuse?
#
capjamesg[d]
Authentication would be one part. Maybe rate limiting too?
#
capjamesg[d]
And a "maximum recrawl" budget per day?
#
capjamesg[d]
I think I just answered my own question 🤦‍♂️
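[Editor's sketch] How the ideas above (sites already in the index, rate limiting, a daily recrawl budget) might combine into a gate in front of a recrawl endpoint. The class name RecrawlGate and the default limits are assumptions for illustration; authentication is not shown, and this alone would not address the spam problem that needs separate quality signals.

```python
import time
from collections import defaultdict

SECONDS_PER_DAY = 86400


class RecrawlGate:
    def __init__(self, indexed_domains, max_per_day=5, min_interval=60.0):
        self.indexed = set(indexed_domains)
        self.max_per_day = max_per_day    # assumed daily recrawl budget per domain
        self.min_interval = min_interval  # assumed minimum seconds between pings
        self.history = defaultdict(list)  # domain -> timestamps of accepted pings

    def allow(self, domain, now=None):
        now = time.time() if now is None else now
        # Only sites that are already in the index may request a recrawl.
        if domain not in self.indexed:
            return False
        # Forget pings older than one day, then check the daily budget.
        recent = [t for t in self.history[domain] if now - t < SECONDS_PER_DAY]
        self.history[domain] = recent
        if len(recent) >= self.max_per_day:
            return False
        # Simple rate limit: reject pings that arrive too soon after the last one.
        if recent and now - recent[-1] < self.min_interval:
            return False
        recent.append(now)
        return True
```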
tetov-irc joined the channel
#
[KevinMarks]
It doesn't really - weblogs.com was over 90% spam in the end. You need another metric to decide. At technorati we had our own ping server and used others and applied a range of metrics to decide which ones we recrawled rapidly and which we deferred. I think at one point we used 'from weblogs.com ' as a spam signal in itself.
#
petermolnar
> I think I just answered my own question - https://en.wikipedia.org/wiki/Rubber_duck_debugging
wagle, hendursaga, akevinhuang and akevinhuang2 joined the channel
#
GWG
aaronpk: Any chance you might have a few minutes to comment on the IndieAuth spec PRs? I wanted to do an update with plenty of time before 10/16?
#
capjamesg[d]
voxpelli To recrawl, I am going to: (i) check all active RSS / Atom / mf2 h-feeds and crawl new articles; (ii) look at pages that were recently added to a sitemap using the lastmod attribute (if none is present, pages will be crawled anyway); (iii) check if the homepage was crawled in the last three weeks and if it was not, order a full recrawl.
#
capjamesg[d]
This will all be controlled by a cron job
#
capjamesg[d]
Point (iii) needs a bit of work because the sitemap change might impact last crawl date.
#
capjamesg[d]
I can just use the "last crawl" attribute I have in a file instead.
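[Editor's sketch] The three recrawl rules described above, written as one planning function that a cron job could call per site. Everything here is an assumption for illustration: the function name plan_recrawl, the shape of its inputs (feed entry URLs, a set of already indexed URLs, sitemap entries as (url, lastmod) pairs) and the use of a stored "last crawl" timestamp. Fetching and parsing of feeds and sitemaps is not shown.

```python
from datetime import datetime, timedelta

FULL_RECRAWL_AFTER = timedelta(weeks=3)


def plan_recrawl(feed_urls, known_urls, sitemap_entries, last_crawl, now):
    """Return (urls_to_crawl, needs_full_recrawl) for one site."""
    to_crawl = []

    # (i) Articles that appear in a feed but are not in the index yet.
    for url in feed_urls:
        if url not in known_urls and url not in to_crawl:
            to_crawl.append(url)

    # (ii) Sitemap pages modified since the last crawl; pages without
    # a lastmod value are crawled anyway.
    for url, lastmod in sitemap_entries:
        changed = lastmod is None or last_crawl is None or lastmod > last_crawl
        if changed and url not in to_crawl:
            to_crawl.append(url)

    # (iii) Order a full recrawl if the stored "last crawl" date is
    # missing or more than three weeks old.
    needs_full = last_crawl is None or now - last_crawl > FULL_RECRAWL_AFTER
    return to_crawl, needs_full
```

Reading the date from a separately stored "last crawl" value, rather than from the sitemap, keeps rule (iii) independent of sitemap changes.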
rommudoh[m] joined the channel
#
aaronpk
GWG: yes i will try to review them today
nekr0z, reed, hala-bala[m], Abhas[m], benatkin, SamWilson[m], LaBcasse[m], astralbijection[, vikanezrimaya, Lohn, diegov, npd[m], mackeveli_, ChrisHarris[m] and unrelentingtech joined the channel
#
GWG
aaronpk+0.5
hendursaga, Ramon[d], Seirdy, Loqi, [benatwork], hans1963[d], kimberlyhirsh[d], j12t, KartikPrabhu, akevinhuang, akevinhuang2 and tetov-irc joined the channel