#indieweb 2021-10-06
2021-10-06 UTC
angelo, gRegor, gRegorLove_, MatrixTravelerb4, benchi[m], Nuve, grantcodes[d], timdream, feoh, hans1963[d], hendursa1 and _inky joined the channel
# Loqi [indienews] New post: https://publ.plaidweb.site/blog/63-Publ-v0-7-3
# capjamesg[d] [KevinMarks]
# capjamesg[d] Was this what you were thinking about re: context?
# capjamesg[d] aaronpk your site happens to have a rich h-card on each page which makes it perfect for testing 🙂
# capjamesg[d] Em. I didn't use either of those words?
# capjamesg[d] Ah, I see.
# capjamesg[d] I am on Discord which is why I didn't see that.
# capjamesg[d] In any case, we can move to #dev.
macaw, tetov-irc, n8chz1, n8chz and _inky joined the channel
# [KevinMarks] that is looking nicer, [capjamesg]
# [KevinMarks] For some background thinking on crawling and scraping, this is well worth reading https://twitter.com/Abebab/status/1445723482231173120?s=20
# @Abebab 3 weeks ago LAION-400M dataset (now a billion+), first Image-Alt-text pair dataset of this scale was released. @vinayprabhu, @MannyKayy & I dug into it https://arxiv.org/abs/2110.01963 Long tread 1/ Warning: paper contains NSFW content that may be disturbing, distressing &/or offensive https://pbs.twimg.com/media/FBA9JQvUYAoUeHe.png (twitter.com/_/status/1445723482231173120)
# capjamesg[d] [KevinMarks] Thanks for sharing!
# capjamesg[d] I'm happy with the social layout. It will take a few weeks to roll out because I will have to index all h-cards in a separate field (for performance reasons).
# capjamesg[d] (but open to feedback 🙂 )
# capjamesg[d] In terms of that paper, I'll dig deeper.
# capjamesg[d] The example of captioning images is outstanding.
# capjamesg[d] I don't understand AI / ML nearly enough to do anything with IndieWeb Search. You can get very far without it on a project where only certain domains are indexed.
# capjamesg[d] (outstanding is maybe not the right word, rather stark and concerning)
# capjamesg[d] I haven't given much thought to content filtering because it is a big grey area in terms of ethics and the approach one uses (naive approaches are not up to the task in many cases). For now I rely on the integrity of the content this community publishes 🙂
# [KevinMarks] You're relying on the content coming from a smaller known group, which is better than the kind of deep grab discussed, but you are going to need a way to easily blocklist contributors from the results (and maybe individual pages)
# capjamesg[d] I have a blocklist.txt file to which I can manually add domains. This gives protection against reindexing a website that has been removed. Individual pages are harder to trace but I can imagine a feasible, non-performance-drag way to implement a system to prevent blocklisted URLs from being reindexed.
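[editor's sketch] The blocklist idea above (whole domains plus individual pages, checked before reindexing) could look roughly like the following. This is a minimal sketch under assumptions, not capjamesg's actual implementation: the one-entry-per-line file format, the rule that an entry containing "://" is a single page, and the names load_blocklist, normalize_url and is_blocked are all hypothetical.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to 'host/path' so scheme and trailing slash do not matter.

    Assumption: query strings and fragments are ignored for matching.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return f"{host}{path}"


def load_blocklist(lines):
    """Split blocklist entries into a set of domains and a set of page URLs.

    Assumed format: one entry per line, '#' starts a comment line,
    entries containing '://' are single pages, anything else is a domain.
    """
    domains, urls = set(), set()
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue
        if "://" in entry:
            urls.add(normalize_url(entry))
        else:
            domains.add(entry.lower())
    return domains, urls


def is_blocked(url, domains, urls):
    """Return True if the URL's domain (or a parent domain) or the page is blocklisted."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check the host itself and every parent domain, e.g. a.b.example -> b.example -> example
    for i in range(len(labels)):
        if ".".join(labels[i:]) in domains:
            return True
    return normalize_url(url) in urls
```

Both checks are set lookups, so calling is_blocked for every URL in the crawl queue should add very little overhead. In practice the entries would come from something like open("blocklist.txt") rather than an in-memory list.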
# capjamesg[d] The concern I have is site deaths.
# capjamesg[d] If a site in the index goes offline and the domain is bought by someone else, that causes an issue. I have not encountered this yet but then again the project has only been going on for a few months.
# capjamesg[d] I'm not sure what an automated blocklist system would look like. That's a tough problem to solve.
# [KevinMarks] you may also have the problem of a site being fine, but it including webmentions (or PESO'd bridgy tweet replies) that are offensive or explicit - I'm sure mine include some of those
# capjamesg[d] I'll take this to #dev.
# [snarfed] [tantek] I'm trying to understand https://indieweb.org/domain-deaths . it's sites that currently still exist and serve, but seem abandoned and likely to not get renewed and then get hijacked?
KartikPrabhu joined the channel
# Loqi zombie is in the context of the web a website that had died (site-deaths), perhaps due to domain registration neglect, and has been brought back by some other looking sorta like it did before, but oddly broken, often with spam pages/links added, and eats a lot of CPU likely due to abusive scripts https://indieweb.org/zombie
# [snarfed] eg https://indieweb.org/lost_sites could maybe be collapsed into site deaths
# petermolnar I'd classify zombie as abandoned with spammy comments, lost domain as redirects/has new owner/has completely different content
# petermolnar site death is 410
# petermolnar actually, that's another category
# petermolnar wiped clean, but it's still with the old owner
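[editor's sketch] The categories above could feed a crawler check along these lines. This is only an illustration under assumptions: the function name classify_response and the labels it returns are hypothetical, it takes an already-fetched status code and final URL rather than doing any network I/O, and it cannot detect zombies or wiped-clean sites, which need a look at the content itself.

```python
from urllib.parse import urlsplit


def _host(url):
    """Lower-cased hostname with a leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_response(requested_url, status, final_url=None):
    """Roughly sort a crawl result into the categories discussed above.

    status    -- HTTP status code of the final response
    final_url -- URL after following redirects, if the crawler tracked it
    """
    if status == 410:
        # 410 Gone: the server says the resource was removed on purpose.
        return "site-death"
    if final_url is not None:
        a, b = _host(requested_url), _host(final_url)
        if a and b and a != b:
            # Redirect to a different host: worth a manual look,
            # since the domain may have a new owner.
            return "possible-lost-domain"
    return "ok" if 200 <= status < 300 else "unknown"
```

A "possible-lost-domain" result is only a hint for manual review; plenty of legitimate sites redirect to a new domain.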
joec_ joined the channel
# capjamesg[d] Oh, 410. I forgot about those.
# [tantek] lol I created both /lost_sites and /domain-deaths like 7 months apart 🤦
_inky joined the channel
# Loqi ok, I added "Use-case: don't bother indexing/crawling links to these domains" to the "See Also" section of /lost_sites https://indieweb.org/wiki/index.php?diff=77252&oldid=77251
hendursaga joined the channel
# capjamesg[d] Thank you!
# capjamesg[d] I will add these to my blocklist file.
# capjamesg[d] No worries at all!
# capjamesg[d] All added to my list.
# capjamesg[d] I appreciate all of the thought that has gone into producing these lists.
# capjamesg[d] I'll add anything if I find any issues.
# capjamesg[d] (where "issues" is another site that could be flagged as zombie or lost)
# capjamesg[d] [tantek] Your suggestions yesterday were helpful. I have added logic to add icons to the rel=me featured snippet.
# capjamesg[d] I have also implemented a simple "discover" keyword that lets you find only homepages that mention a keyword.
# capjamesg[d] I think you mentioned something about only looking in h-cards too. I could use that behavior instead.
# capjamesg[d] (Whole site searches feel out of scope for this since one can do a site search anyway)
# capjamesg[d] Yep. Let's move to dev.
shoesNsocks, [fluffy], [jeremycherfas], Moosadee, _inky, joj, marksuth, hendursaga, macaw, tetov-irc, [schmarty], gRegor and maxwelljoslyn[d] joined the channel
# maxwelljoslyn[d] GWG - I must relinquish hosting of HWC to you tonight - feeling pretty slammed by work.
# maxwelljoslyn[d] [edit] GWG - I must relinquish hosting of HWC to you tonight - feeling pretty slammed by work, can't attend tonight.
Seirdy joined the channel