#indieweb 2021-10-06
2021-10-06 UTC
angelo, gRegor, gRegorLove_, MatrixTravelerb4, benchi[m], Nuve, grantcodes[d], timdream, feoh, hans1963[d], hendursa1 and _inky joined the channel
Loqi [indienews] New post: https://publ.plaidweb.site/blog/63-Publ-v0-7-3

capjamesg[d] [KevinMarks]

capjamesg[d] Was this what you were thinking about re: context?

capjamesg[d] aaronpk your site happens to have a rich h-card on each page which makes it perfect for testing 🙂

capjamesg[d] Em. I didn't use either of those words?

capjamesg[d] Ah, I see.

capjamesg[d] I am on Discord which is why I didn't see that.

capjamesg[d] In any case, we can move to #dev.

macaw, tetov-irc, n8chz1, n8chz and _inky joined the channel
[KevinMarks] that is looking nicer, [capjamesg]

[KevinMarks] For some background thinking on crawling and scraping, this is well worth reading https://twitter.com/Abebab/status/1445723482231173120?s=20

@Abebab 3 weeks ago LAION-400M dataset (now a billion+), first Image-Alt-text pair dataset of this scale was released. @vinayprabhu, @MannyKayy & I dug into it https://arxiv.org/abs/2110.01963 Long tread 1/ Warning: paper contains NSFW content that may be disturbing, distressing &/or offensive https://pbs.twimg.com/media/FBA9JQvUYAoUeHe.png (twitter.com/_/status/1445723482231173120)
capjamesg[d] [KevinMarks] Thanks for sharing!

capjamesg[d] I'm happy with the social layout. It will take a few weeks to roll out because I will have to index all h-cards in a separate field (for performance reasons).

capjamesg[d] (but open to feedback 🙂 )

capjamesg[d] In terms of that paper, I'll dig deeper.

capjamesg[d] The example of captioning images is outstanding.

capjamesg[d] I don't understand AI / ML nearly enough to do anything with IndieWeb Search. You can get very far without it on a project where only certain domains are indexed.

capjamesg[d] (outstanding is maybe not the right word, rather stark and concerning)

capjamesg[d] I haven't given much thought to content filtering because it is a big grey area in terms of ethics and the approach one uses (naive approaches are not up to the task in many cases). For now I rely on the integrity of the content this community publishes 🙂

[KevinMarks] You're relying on the content coming from a smaller known group, which is better than the kind of deep grab discussed, but you are going to need a way to easily blocklist contributors from the results (and maybe individual pages)

capjamesg[d] I have a blocklist.txt file to which I can manually add domains. This gives protection against reindexing a website that has been removed. Individual pages are harder to trace but I can imagine a feasible, non-performance-drag way to implement a system to prevent blocklisted URLs from being reindexed.
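
(Editor's sketch, not capjamesg's actual code: one way a blocklist file could cover both whole domains and individual pages without slowing the crawler. The file format, one bare domain or one full URL per line with `#` comments, and the helper names `load_blocklist`, `normalize_url` and `is_blocked` are all assumptions made for illustration. Lookups are set membership tests, so the per-URL cost stays small.)

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path so http/https and trailing slashes compare equal.

    Query strings and fragments are ignored in this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return f"{host}{path}"


def load_blocklist(lines):
    """Split blocklist lines into a set of domains and a set of normalized URLs."""
    domains, urls = set(), set()
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        if "://" in entry:
            urls.add(normalize_url(entry))  # an individual page
        else:
            domains.add(entry.lower())  # a whole domain
    return domains, urls


def is_blocked(url, domains, urls):
    """Return True if the URL's domain (or a parent domain) or the page itself is listed."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check the host and every parent suffix, so www.spam.example matches spam.example.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in domains:
            return True
    return normalize_url(url) in urls
```

(In a crawler this check would run before a URL is queued, e.g. `domains, urls = load_blocklist(open("blocklist.txt"))` once at startup, then `is_blocked(candidate, domains, urls)` per candidate.)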

capjamesg[d] The concern I have is site deaths.

capjamesg[d] If a site in the index goes offline and the domain is bought by someone else, that causes an issue. I have not encountered this yet but then again the project has only been going on for a few months.

capjamesg[d] I'm not sure what an automated blocklist system would look like. That's a tough problem to solve.

[KevinMarks] you may also have the problem of a site being fine, but it including webmentions (or PESO'd bridgy tweet replies) that are offensive or explicit - I'm sure mine include some of those

capjamesg[d] I'll take this to #dev.

[snarfed] [tantek] I'm trying to understand https://indieweb.org/domain-deaths . it's sites that currently still exist and serve, but seem abandoned and likely to not get renewed and then get hijacked?
KartikPrabhu joined the channel
Loqi zombie is in the context of the web a website that had died (site-deaths), perhaps due to domain registration neglect, and has been brought back by some other looking sorta like it did before, but oddly broken, often with spam pages/links added, and eats a lot of CPU likely due to abusive scripts https://indieweb.org/zombie

[snarfed] eg https://indieweb.org/lost_sites could maybe be collapsed into site deaths
petermolnar I'd classify zombie as abandoned with spammy comments, lost domain as redirects/has new owner/has completely different content

petermolnar site death is 410

petermolnar actually, that's another category

petermolnar wiped clean, but it's still with the old owner

joec_ joined the channel
capjamesg[d] Oh, 410. I forgot about those.
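
(Editor's sketch of how the categories above might be told apart at crawl time. This is an illustration under assumptions, not the IndieWeb Search implementation: the function name `classify_response` and the labels it returns are hypothetical, and it only looks at the HTTP status and the final URL after redirects. A domain that was re-registered and serves new content under the same hostname would not be caught by this check.)

```python
from urllib.parse import urlsplit


def classify_response(requested_url, status, final_url=None):
    """Label a crawl result as 'gone', 'moved-domain', 'ok' or 'retry'.

    requested_url: the URL stored in the index
    status: HTTP status code of the final response
    final_url: URL after following redirects (defaults to requested_url)
    """
    if status == 410:
        # 410 Gone: the owner deliberately removed the resource.
        return "gone"

    req_host = (urlsplit(requested_url).hostname or "").lower()
    final_host = (urlsplit(final_url or requested_url).hostname or "").lower()

    # Treat example.com <-> www.example.com as the same site.
    same_site = (
        final_host == req_host
        or final_host == "www." + req_host
        or req_host == "www." + final_host
    )
    if not same_site:
        # Redirected to another domain: flag for manual review.
        return "moved-domain"

    if 200 <= status < 300:
        return "ok"
    return "retry"
```

(A "gone" result could drop the page from the index, while "moved-domain" could feed a review queue rather than an automatic blocklist.)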

[tantek] lol I created both /lost_sites and /domain-deaths like 7 months apart 🤦

_inky joined the channel
Loqi ok, I added "Use-case: don't bother indexing/crawling links to these domains" to the "See Also" section of /lost_sites https://indieweb.org/wiki/index.php?diff=77252&oldid=77251

hendursaga joined the channel
capjamesg[d] Thank you!

capjamesg[d] I will add these to my blocklist file.

capjamesg[d] No worries at all!

capjamesg[d] All added to my list.

capjamesg[d] I appreciate all of the thought that has gone into producing these lists.

capjamesg[d] I'll add anything if I find any issues.

capjamesg[d] (where "issues" is another site that could be flagged as zombie or lost)

capjamesg[d] [tantek] Your suggestions yesterday were helpful. I have added logic to add icons to the rel=me featured snippet.

capjamesg[d] I have also implemented a simple "discover" keyword that lets you find only homepages that mention a keyword.

capjamesg[d] I think you mentioned something about only looking in h-cards too. I could use that behavior instead.

capjamesg[d] (Whole site searches feel out of scope for this since one can do a site search anyway)

capjamesg[d] Yep. Let's move to dev.

shoesNsocks, [fluffy], [jeremycherfas], Moosadee, _inky, joj, marksuth, hendursaga, macaw, tetov-irc, [schmarty], gRegor and maxwelljoslyn[d] joined the channel
maxwelljoslyn[d] [edit] GWG - I must relinquish hosting of HWC to you tonight - feeling pretty slammed by work, can't attend tonight.
Seirdy joined the channel