#indieweb 2021-10-06

2021-10-06 UTC
angelo, gRegor, gRegorLove_, MatrixTravelerb4, benchi[m], Nuve, grantcodes[d], timdream, feoh, hans1963[d], hendursa1 and _inky joined the channel
#
capjamesg[d]
[KevinMarks]
#
capjamesg[d]
Was this what you were thinking about re: context?
#
capjamesg[d]
aaronpk your site happens to have a rich h-card on each page which makes it perfect for testing 🙂
#
Loqi
hey capjamesg[d], would you mind moving this conversation (CDN, Microformats) to #indieweb-dev? thanks!
#
capjamesg[d]
Em. I didn't use either of those words?
#
sknebel
well you said the first, but in a URL, that also shouldn't count :D
#
capjamesg[d]
Ah, I see.
#
capjamesg[d]
I am on Discord which is why I didn't see that.
#
capjamesg[d]
In any case, we can move to #dev.
macaw, tetov-irc, n8chz1, n8chz and _inky joined the channel
#
aaronpk
Oops haha I need to make Loqi ignore URLs for that
#
[KevinMarks]
that is looking nicer, [capjamesg]
#
[KevinMarks]
For some background thinking on crawling and scraping, this is well worth reading https://twitter.com/Abebab/status/1445723482231173120?s=20
#
@Abebab
3 weeks ago LAION-400M dataset (now a billion+), first Image-Alt-text pair dataset of this scale was released. @vinayprabhu, @MannyKayy & I dug into it https://arxiv.org/abs/2110.01963 Long tread 1/ Warning: paper contains NSFW content that may be disturbing, distressing &/or offensive https://pbs.twimg.com/media/FBA9JQvUYAoUeHe.png
(twitter.com/_/status/1445723482231173120)
#
capjamesg[d]
[KevinMarks] Thanks for sharing!
#
capjamesg[d]
I'm happy with the social layout. It will take a few weeks to roll out because I will have to index all h-cards in a separate field (for performance reasons).
#
capjamesg[d]
(but open to feedback 🙂 )
#
capjamesg[d]
In terms of that paper, I'll dig deeper.
#
capjamesg[d]
The example of captioning images is outstanding.
#
capjamesg[d]
I don't understand AI / ML nearly enough to do anything with IndieWeb Search. You can get very far without it on a project where only certain domains are indexed.
#
capjamesg[d]
(outstanding is maybe not the right word, rather stark and concerning)
#
capjamesg[d]
I haven't given much thought to content filtering because it is a big grey area in terms of ethics and the approach one uses (naive approaches are not up to the task in many cases). For now I rely on the integrity of the content this community publishes 🙂
#
[KevinMarks]
You're relying on the content coming from a smaller known group, which is better than the kind of deep grab discussed, but you are going to need a way to easily blocklist contributors from the results (and maybe individual pages)
#
capjamesg[d]
I have a blocklist.txt file to which I can manually add domains. This gives protection against reindexing a website that has been removed. Individual pages are harder to trace but I can imagine a feasible, non-performance-drag way to implement a system to prevent blocklisted URLs from being reindexed.
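A minimal sketch of the blocklist check described above, assuming blocklist.txt holds one domain per line. The function names, the `#` comment syntax, and the separate set of blocked page URLs are hypothetical illustrations, not taken from IndieWeb Search itself.

```python
from urllib.parse import urlsplit


def parse_blocklist(text):
    """Return the set of lowercase domains listed one per line in `text`."""
    domains = set()
    for line in text.splitlines():
        entry = line.strip().lower()
        # Skip blank lines and comments (the comment syntax is an assumption).
        if entry and not entry.startswith("#"):
            domains.add(entry)
    return domains


def is_blocked(url, blocked_domains, blocked_urls=frozenset()):
    """True if the URL's host, or the exact URL, is blocklisted."""
    host = (urlsplit(url).hostname or "").lower()
    return host in blocked_domains or url in blocked_urls


# Hypothetical file contents; in practice this would be read from blocklist.txt.
blocked = parse_blocklist("# removed sites\nspam.example\n\nOld-Blog.example\n")
```

Both lookups are set membership tests, so checking each URL before it is queued for reindexing should add very little overhead, even if individual pages are tracked alongside domains.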
#
capjamesg[d]
The concern I have is site deaths.
#
capjamesg[d]
If a site in the index goes offline and the domain is bought by someone else, that causes an issue. I have not encountered this yet but then again the project has only been going on for a few months.
#
capjamesg[d]
I'm not sure what an automated blocklist system would look like. That's a tough problem to solve.
#
[KevinMarks]
you may also have the problem of a site being fine, but it including webmentions (or PESO'd bridgy tweet replies) that are offensive or explicit - I'm sure mine include some of those
#
capjamesg[d]
I'll take this to #dev.
#
[snarfed]
[tantek] I'm trying to understand https://indieweb.org/domain-deaths . it's sites that currently still exist and serve, but seem abandoned and likely to not get renewed and then get hijacked?
#
[snarfed]
as opposed to site deaths where that's already happened?
KartikPrabhu joined the channel
#
[tantek]
domain deaths is about control over the domain, site-deaths is about loss of content
#
[snarfed]
ok. so if a site switched domains...would the old one go on domain-deaths?
#
[tantek]
if they lost control of it yes. if they kept the old domain and put redirects in place, then it's not dead
#
[snarfed]
thanks!
#
[tantek]
I mean I suppose we could call them "lost domains" instead
#
[snarfed]
phrase seems fine, I'll just update the description
#
aaronpk
actually "lost domains" sounds better IMO
#
[snarfed]
do we have an example of one in the wild though? ie the domain was lost, but the site survived, presumably serving somewhere else?
#
aaronpk
portablecontacts is one. unfortunately it's not a good idea to actually link to these tho :)
#
[tantek]
what is a zombie
#
Loqi
zombie is in the context of the web a website that had died (site-deaths), perhaps due to domain registration neglect, and has been brought back by some other looking sorta like it did before, but oddly broken, often with spam pages/links added, and eats a lot of CPU likely due to abusive scripts https://indieweb.org/zombie
#
[tantek]
snarfed, a few examples there ^
#
[snarfed]
oh man two new ones, zombie and lost site
#
[snarfed]
I kinda wonder if we really have enough meaningful distinctions for four separate things here
#
[snarfed]
aaronpk so poco has a new current active site (ie not just a backup or internet archive snapshot), but we don't want to link to it here
#
aaronpk
do they??
#
[snarfed]
afaik no. which seems then like it's a site death, not a domain death
#
aaronpk
i think the distinction is that there is now other content on that domain
#
[snarfed]
hmm that's not really what tantek said ^
#
[snarfed]
that sounds more like "zombie"
#
aaronpk
ok yea i don't understand the difference between "zombie" and "domain-deaths" here
#
[snarfed]
anyway, this is low priority, just curious. I still wonder if we know of any domain death examples in the wild, ie where the site lost its domain and moved to a new one
#
[snarfed]
domain death I guess doesn't imply that there's new content (or owner) on the old domain
#
[snarfed]
zombie does
#
aaronpk
the domain my blog used to be on was re-registered by someone and now redirects to something random
#
[snarfed]
right, so then is that a zombie or lost domain 🤷
#
[snarfed]
we may not need four separate ideas/terms here
#
[snarfed]
eg https://indieweb.org/lost_sites could maybe be collapsed into site deaths
#
petermolnar
I'd classify zombie as abandoned with spammy comments, lost domain as redirects/has new owner/has completely different content
#
aaronpk
yeah these all seem close enough
#
petermolnar
site death is 410
#
petermolnar
actually, that's another category
#
petermolnar
wiped clean, but it's still with the old owner
joec_ joined the channel
#
capjamesg[d]
Oh, 410. I forgot about those.
#
[tantek]
zombie is a subset of dead, is that not obvious?
#
[tantek]
lol I created both /lost_sites and /domain-deaths like 7 months apart 🤦
#
Murray[d]
(feels like a fitting topic for Halloween season 👻 )
#
[tantek]
lol ok I'll merge / redirect
#
Murray[d]
on topic: wouldn't "zombie" be a subset of "domain death"? Could those pages be collapsed?
#
[tantek]
it is a subset, though zombie is distinct (dangerous) enough to be worth its own page to warn people (and have a place to reference)
#
[tantek]
domain deaths are more like, pour one out
#
Murray[d]
fair 🙂
#
[tantek]
good q tho
#
[snarfed]
we might also want at least one concrete example in the wild for each of these terms that we decide to keep
#
[tantek]
yes we have examples for each
#
[snarfed]
^ see backscroll, I'm not sure we do for lost/dead domain specifically
#
[snarfed]
(seems like the portablecontacts.net site only exists in backups or IA, which doesn't seem to match the domain death case where the site is still actively serving, just on a different domain)
#
aaronpk
oh i see, zombie is specifically when it is acting sort of like the original rather than replaced by something entirely different?
#
aaronpk
portablecontacts is still serving similar content on that same domain
#
aaronpk
(once you click past the cert error)
#
[snarfed]
heh ok, so then it's a zombie?
#
aaronpk
it's a "lost site" that has been turned into a "zombie" yes :D
_inky joined the channel
#
[tantek]
whereas kylewm's site is "merely" lost
#
[tantek]
ok, separated examples accordingly. see #indieweb-meta
#
[snarfed]
kylewm's isn't a site death?
#
[snarfed]
it's not serving on any domain now, right?
#
[snarfed]
oh I see. lost sites are unintentional, site deaths are intentional?
#
[tantek]
it's a severity thing
#
[tantek]
all you have to do to earn your way on to site deaths is destroy/lose content/permalinks
#
[snarfed]
ah, so the main difference in severity is maintaining control over the domain?
#
[tantek]
losing your domain earns you a spot on lost sites
#
[tantek]
lost sites << Use-case: don't bother indexing/crawling links to these domains
#
Loqi
ok, I added "Use-case: don't bother indexing/crawling links to these domains" to the "See Also" section of /lost_sites https://indieweb.org/wiki/index.php?diff=77252&oldid=77251
#
[tantek]
that was for capjamesg[d] in particular, so he could have a blocklist for his IndieWeb Search engine
#
[snarfed]
ok! thx
#
[tantek]
could also be useful for an IndieMap recrawl 🙂
hendursaga joined the channel
#
capjamesg[d]
Thank you!
#
capjamesg[d]
I will add these to my blocklist file.
#
[tantek]
capjamesg[d], you'll have to do the merge of lists of domains from the two pages, hope that's ok!
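One way the merge of the two pages' domain lists could look, as a rough sketch. The function name and the example domains are placeholders, and the normalization rules (lowercasing, trimming whitespace and a trailing dot) are assumptions rather than anything the wiki pages specify.

```python
def merge_domain_lists(*lists):
    """Merge several domain lists into one sorted, de-duplicated list."""
    merged = set()
    for domains in lists:
        for domain in domains:
            # Normalize so the same domain written two ways is counted once.
            cleaned = domain.strip().lower().rstrip(".")
            if cleaned:
                merged.add(cleaned)
    return sorted(merged)


# Placeholder entries standing in for the two wiki pages' lists.
lost = ["lost-one.example", "Shared.example "]
zombies = ["shared.example", "zombie-one.example"]
merged = merge_domain_lists(lost, zombies)
```

The sorted output can be written straight into a one-domain-per-line blocklist file, which keeps later diffs against the wiki pages easy to read.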
#
capjamesg[d]
No worries at all!
#
[tantek]
zombies seemed particularly dangerous and worth isolating / warning about on their own page
#
capjamesg[d]
All added to my list.
#
capjamesg[d]
I appreciate all of the thought that has gone into producing these lists.
#
capjamesg[d]
I'll add anything if I find any issues.
#
capjamesg[d]
(where "issues" is another site that could be flagged as zombie or lost)
#
capjamesg[d]
[tantek] Your suggestions yesterday were helpful. I have added logic to add icons to the rel=me featured snippet.
#
capjamesg[d]
I have also implemented a simple "discover" keyword that lets you find only homepages that mention a keyword.
#
capjamesg[d]
I think you mentioned something about only looking in h-cards too. I could use that behavior instead.
#
capjamesg[d]
(Whole site searches feel out of scope for this since one can do a site search anyway)
#
[tantek]
right, two steps:
#
[tantek]
1 look to see if a site has a top level h-card that is representative
#
Loqi
friendly reminder capjamesg[d], [tantek], we try to keep dev talk (Microformats) out of this channel, can you move to #indieweb-dev?
#
[tantek]
haha oops
#
capjamesg[d]
Yep. Let's move to dev.
shoesNsocks, [fluffy], [jeremycherfas], Moosadee, _inky, joj, marksuth, hendursaga, macaw, tetov-irc, [schmarty], gRegor and maxwelljoslyn[d] joined the channel
#
maxwelljoslyn[d]
GWG - I must relinquish hosting of HWC to you tonight - feeling pretty slammed by work.
#
maxwelljoslyn[d]
[edit] GWG - I must relinquish hosting of HWC to you tonight - feeling pretty slammed by work, can't attend tonight.
#
GWG
I'll miss you, but will hold down the fort
Seirdy joined the channel