#indieweb 2021-09-22

2021-09-22 UTC
#
[tantek]
The Goodreads story is a sad one, where a site initially quite helpful / useful for both readers and book authors, perhaps even a "community" of sorts, sometime after being sold (to Amazon) was so neglected (in terms of community management / harassment) that it's now become a *burden* for authors
#
[tantek]
passionately make a thing that connects people & improves their lives, sell it, watch that thing become a net harm on people (worse than if you'd never made it in the first place)
[schmarty] and nertzy joined the channel
#
hendursaga
[tantek]: I really disliked them discontinuing their API.. it was a valuable resource..
#
[tantek]
ugh yeah. with that and the existing points in /Goodreads#See_Also, it's probably worth collecting those into a "Criticism" section to summarize / make it explicit
aranjedeath, Ruxton and KartikPrabhu1 joined the channel
#
Loqi
[indienews] New post: "Building my own webmention receiver" https://jamesg.blog/2021/09/22/webmention-receiver
hendursa1, _inky, timdream, benji and Bitweasil joined the channel
#
[KevinMarks]
[capjamesg] did you test with webmention.rocks ?
#
capjamesg[d]
[KevinMarks] yeah.
#
capjamesg[d]
That blog post is actually a bit out of date but I’d rather publish than not.
#
capjamesg[d]
I have written one blog post in the last few weeks.
#
Loqi
[capjamesg] webmention-receiver: A webmention receiver written in Python Flask with sqlite3.
tetov-irc joined the channel
#
@parismartineau
just thinking about this field on my vet's new patient intake form https://pbs.twimg.com/media/E_zz3gcUUAAOtAh.png
(twitter.com/_/status/1440294126474252289)
#
capjamesg[d]
Well... IndieWeb Search needs a spam filter.
#
capjamesg[d]
Any tips on suggested spam filtering techniques that do not require a very good understanding of statistics?
#
[KevinMarks]
that is the hard problem of search
#
petermolnar
there's a wordpress plugin named antispam-bee; I believe that is a local-only plugin. Maybe looking into how it works could help.
Moosadee and sennomo joined the channel
#
[tantek]
capjamesg[d], rather than filtering, are you able to track down where the spam is coming from?
#
capjamesg[d]
[tantek] I am.
#
capjamesg[d]
I am not going to take a filtering approach.
#
capjamesg[d]
In this case the issue was one domain that linked out to many, many sites.
#
[tantek]
next question: how did that domain get into the list of sites to crawl?
#
voxpelli
what is indieweb search?
#
Loqi
IndieWeb Search is a search engine that lets you find web pages and websites created by IndieWeb community members https://indieweb.org/IndieWeb_Search
hendursaga joined the channel
#
capjamesg[d]
I made that page 🙂
#
voxpelli
capjamesg[d]++
#
Loqi
capjamesg[d] has 3 karma in this channel over the last year (17 in all channels)
#
capjamesg[d]
Feel free to leave notes if you have tried out the search engine. Today was index rebuilding day ):
#
capjamesg[d]
* 🙂
#
capjamesg[d]
[tantek] Good question. I need to investigate. It was either in my manual search at the start of my putting together the list or it was potentially a redirect from a dead domain that was pulled from the wiki or the domains list.
#
voxpelli
what is relspider?
#
Loqi
relspider is a web crawler that indexes the identity/social graph of profiles https://indieweb.org/relspider
#
voxpelli
capjamesg[d]: it reminded me of that old experiment of mine ;)
#
capjamesg[d]
Fun fact: I now have a CSV of almost 3M links. But I suspect at least tens of thousands are from this spam site and invalid.
#
capjamesg[d]
I'm just about to eat my dinner voxpelli 🙂
#
capjamesg[d]
Will check out later!
#
voxpelli
The main blocker I had in that implementation: I never got a good recrawl mechanism implemented
#
capjamesg[d]
I am stuck on that issue too.
#
voxpelli
I did some research on the subject, found lots of university papers and ultimately never got past that phase
#
capjamesg[d]
I have some almost working recrawler logic in my crawler.
#
capjamesg[d]
But I need to decide: how much to crawl, when to crawl, and how often I should rebuild the full index.
#
capjamesg[d]
Any other considerations?
#
[tantek]
oooh: "redirect from a dead domain" this is a likely thing
#
[tantek]
or even without a redirect
#
[tantek]
what is a zombie site?
#
Loqi
It looks like we don't have a page for "zombie site" yet. Would you like to create it? (Or just say "zombie site is ____", a sentence describing the term)
#
[tantek]
what is a zombie
#
Loqi
zombie is in the context of the web a website that had died (site-deaths), perhaps due to domain registration neglect, and has been brought back by some other looking sorta like it did before, but oddly broken, often with spam pages/links added, and eats a lot of CPU likely due to abusive scripts https://indieweb.org/zombie
#
[tantek]
zombie site is /zombie
#
[tantek]
capjamesg[d], does your crawler track when it does redirects?
#
[tantek]
if a spam site did come from the wiki, it's worth investigating how it got in there, and if it was an old site that died / got redirected / became a zombie, worth documenting it explicitly as such on the wiki, and changing links to it to use Internet Archive versions instead
#
[tantek]
I think we've done this with a few
#
[tantek]
what is Portable Contacts
#
Loqi
Portable Contacts (often abbreviated as PoCo) was a proposed 2008 specification for exchange of vCard-compatible contact info using a one-off JSON format, before microformats2, its parsed canonical JSON format, and h-card was specified https://indieweb.org/Portable_Contacts
#
[tantek]
^ that page has a bunch of examples of how to capture / document when relevant sites/domains die / get taken over
#
voxpelli
would be nice to have a mechanism to validate whether the owner of the domain is still the same
#
voxpelli
I guess something like an XFN friend list should have timestamps so one can know which owner of the domain one refers to
#
[tantek]
like their h-card on the homepage? but that could be a zombie too
#
Loqi
hey voxpelli, [tantek], we try to keep dev talk (XFN, Microformats) out of this channel, can you move to #indieweb-dev?
#
[tantek]
oops 🙂
#
voxpelli
Yeah, I guess the only actually super safe variant is to do what Keybase does
#
voxpelli
But even that is probably vulnerable to “replay” attacks (copy and paste the important identity info from the previous site to be able to impersonate that person and abuse that person’s credibility in their social circles to e.g. spam)
#
aaronpk
yeah you'd need an active challenge to detect if the site changed owner
#
aaronpk
rather than an artifact that is published statically that could be copied
#
voxpelli
One could probably make it so that it only needs to be updated when a domain is renewed
#
aaronpk
There's precedent in this with SSH showing a warning if the server fingerprint changed from the last time you logged in
#
voxpelli
Very true
#
[tantek]
there may be something you can do by checking things before/after domain expiration
#
[tantek]
things like: domain registrar info, IP address (and if it changed, is it with the same webhost?), other WHOIS info
[fluffy] joined the channel
#
[tantek]
some of these naturally change, however, maybe there could be a threshold. if a certain number of these things all change at the same time, it's evidence that a new entity controls the domain
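The threshold idea above can be sketched roughly as follows. This is a minimal illustration, not anything from the channel: the `DomainSnapshot` record, its field names, the example values, and the default threshold of 3 are all assumptions, and how the snapshots are collected (WHOIS lookups, DNS queries) is left out entirely.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DomainSnapshot:
    """Hypothetical record of the signals observed for a domain at one point in time."""
    registrar: str
    ip_address: str
    web_host: str
    whois_contact: str


def changed_fields(before, after):
    """Return the names of the fields whose values differ between two snapshots."""
    return [f.name for f in fields(before)
            if getattr(before, f.name) != getattr(after, f.name)]


def likely_new_owner(before, after, threshold=3):
    """Flag the domain when at least `threshold` signals changed at the same time.

    The threshold is an assumed value; a single change (for example a new IP
    address) happens naturally and should not trigger the flag on its own.
    """
    return len(changed_fields(before, after)) >= threshold


# Made-up example data using documentation-reserved addresses and domains.
old = DomainSnapshot("Registrar A", "192.0.2.10", "Host A", "alice@example.com")
moved_host = DomainSnapshot("Registrar A", "198.51.100.7", "Host B", "alice@example.com")
taken_over = DomainSnapshot("Registrar B", "203.0.113.5", "Host C", "unknown@example.net")

print(likely_new_owner(old, moved_host))
print(likely_new_owner(old, taken_over))
```

A flagged domain would still need a human look; the count only says that several signals moved together, not who controls the domain now.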
#
voxpelli
Especially important for those who have that domain in their XFN lists to be made aware that it may no longer represent the same
#
[KevinMarks]
I've had it where old blogs I was subscribed to came back to life and my feedreader was full of spam
#
voxpelli
I think a friend of mine crawled some top blogs, looking at when they would expire, to see if people were going to snatch them if they did expire
#
capjamesg[d]
This is an interesting discussion.
#
capjamesg[d]
[tantek] There are a few interesting resources I have come across.
#
capjamesg[d]
I have seen a service that lets you do an API call and find out if information like IP address / WHOIS has changed recently.
#
capjamesg[d]
I also found that Google has an open API that you can use to determine if a URL might be malicious.
#
capjamesg[d]
Which is really kind of them.
#
capjamesg[d]
I thought about incorporating it into IndieWeb Search but the crawler does not follow links to other sites right now.
#
capjamesg[d]
voxpelli My crawler is public on GitHub capjamesg[d]/indieweb-search.
#
capjamesg[d]
voxpelli Have you seen snarfed's IndieMap?
#
capjamesg[d]
He did something similar with social graphs to what you did with relspider.
#
capjamesg[d]
I'm open to any extensions to the IndieWeb Search crawler that do not slow things down too much, so if there's data that might be interesting to the community I can pull it.
#
[tantek]
mostly I think some of this will be useful for after the fact repair
#
[tantek]
rather than polling all the information for all the domains regularly
#
capjamesg[d]
[tantek] My logs are a bit off. Logs are marked with a timestamp in their file name so it's hard to find exactly when a site was crawled unless you use a bash command. The Python logging module doesn't let me specify different logging files per thread I don't think.
#
capjamesg[d]
I think the requests library adds some logging. I don't explicitly log redirects. One second.
#
[tantek]
I'm wondering if we need to keep a briefer "domain deaths" page similar to site deaths but for domains that are known to have been abandoned by the original owner, and thus either actually unresponsive, or (vulnerable to being) taken over by bad actors
#
capjamesg[d]
If there's a way I can contribute I'd be happy to. But I'm not crawling all domains on the wiki just yet.
#
[tantek]
that way anyone using links from the wiki could also use that as a blocklist of sorts
#
[tantek]
just as we can use /chat-names as a startlist of sorts
#
capjamesg[d]
[tantek] My logs track redirects by status code.
#
[tantek]
capjamesg[d] even without crawling all domains, if you had a blocklist, you could ignore links to those domains
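Ignoring links to blocklisted domains could look something like the sketch below. It is only an illustration under assumptions: the `BLOCKLIST` contents and the `is_blocked` / `filter_links` helpers are hypothetical and are not taken from the IndieWeb Search code. Subdomains of a blocked domain are treated as blocked too, which is a design choice rather than something stated in the discussion.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; in practice this might be loaded from a wiki page or a file.
BLOCKLIST = {"spam.example", "zombie.example"}


def is_blocked(url, blocklist=BLOCKLIST):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocklist)


def filter_links(links, blocklist=BLOCKLIST):
    """Keep only the links whose host is not on the blocklist."""
    return [link for link in links if not is_blocked(link, blocklist)]


links = [
    "https://indieweb.org/IndieWeb_Search",
    "https://spam.example/cheap-pills",
    "https://www.zombie.example/old-post",
    "https://notspam.example/hello",
]
print(filter_links(links))
```

Matching on the whole host (or a dot-separated suffix) rather than a substring keeps `notspam.example` from being caught by an entry for `spam.example`.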
#
capjamesg[d]
I like that idea [tantek].
#
capjamesg[d]
I have a blocklist logic already. But only three domains.
#
capjamesg[d]
Somehow LinkedIn got into my dataset. It might have been an early mix up on my part.
#
[tantek]
capjamesg[d] see #indieweb-meta 😂 🤦‍♂️
#
capjamesg[d]
voxpelli Curious to hear if you have any feedback on the search engine if you tried it out 🙂
#
capjamesg[d]
[tantek] Your comment on logging reminded me of how much I track. This is so useful!
#
voxpelli
capjamesg[d]: absolutely, I’m quite swamped at the moment so it may take a while though
#
capjamesg[d]
Back to your original point, I'm not going to go too deep into the spam world right now since it's easier to track and trace versus implement a machine learning algorithm or something like that.
#
capjamesg[d]
No worries voxpelli 🙂
dotslashroot joined the channel
#
capjamesg[d]
To be honest I have ticked off a lot of my to-dos for this project. I am just using it myself now.
#
capjamesg[d]
I need to fix duplicate results.
#
capjamesg[d]
Apparently the wiki /webmentions and /webmention are indexed as separate documents because I follow redirects. I don't know why this is not picked up as a duplicate though.
#
capjamesg[d]
(I am using md5 hashing for duplicate identification. I know I need to use something different.)
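One possible reason an exact hash of the raw response misses pairs like these is that the two bodies differ by a few bytes (whitespace, markup details), which changes the digest completely. That is a guess, not a diagnosis of the actual crawler. The sketch below hashes a normalized version of the text instead; the helper names, the crude tag stripping, and the switch to SHA-256 are all assumptions made for illustration.

```python
import hashlib
import re


def content_fingerprint(html):
    """Hash the page text after stripping tags, collapsing whitespace and lowercasing."""
    text = re.sub(r"<[^>]+>", " ", html)          # crude tag removal, not a real HTML parser
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Given (url, html) pairs, keep the first URL seen for each fingerprint."""
    first_url_by_fingerprint = {}
    for url, html in pages:
        first_url_by_fingerprint.setdefault(content_fingerprint(html), url)
    return list(first_url_by_fingerprint.values())


# Made-up page bodies: the first two differ only in whitespace.
pages = [
    ("https://wiki.example/webmention", "<h1>Webmention</h1>\n<p>Spec</p>"),
    ("https://wiki.example/webmentions", "<h1>Webmention</h1> <p>Spec</p>  "),
    ("https://wiki.example/micropub", "<h1>Micropub</h1><p>Spec</p>"),
]
print(deduplicate(pages))
```

This still only catches pages whose visible text is identical after normalization; pages that differ in a banner or a timestamp would need a near-duplicate technique instead.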
#
dotslashroot
Hello
#
capjamesg[d]
Hello dotslashroot!
shoesNsocks, MatrixTravelerbo, ShinyCyril and zblesk[m] joined the channel
#
capjamesg[d]
aaronpk How does the wiki handle redirects?
#
capjamesg[d]
I am seeing a 200 code from curl.
#
[KevinMarks]
There will be a 301 happening before that. You can track them if you look in more detail at requests
#
capjamesg[d]
I'm probably being silly but I can't see one.
#
capjamesg[d]
request_obj.history is not showing a redirect chain.
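For reference, the requests library records each redirect it followed as an earlier response in `Response.history`, so an empty list means no HTTP redirect was followed for that request. A small helper for printing the chain might look like the sketch below; `redirect_chain` is a hypothetical name, and the stand-in objects only mimic the `status_code`, `url`, and `history` attributes so the example runs without network access.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) for each hop, ending with the final response."""
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]


# Stand-ins for requests.Response objects (made-up URLs, no network needed).
first_hop = SimpleNamespace(status_code=301, url="https://site.example/old", history=[])
final = SimpleNamespace(status_code=200, url="https://site.example/new", history=[first_hop])
no_redirect = SimpleNamespace(status_code=200, url="https://site.example/page", history=[])

print(redirect_chain(final))
print(redirect_chain(no_redirect))
```

With a real response, the same function could be called on the object returned by `requests.get(url)`.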
#
aaronpk
->meta
pmn joined the channel
#
pmn
can <link> or rss's <item> be relative?
#
pmn
*of rss
#
Loqi
hey pmn: that's a lot of dev jargon! RSS... can you move to #indieweb-dev?
#
pmn
ok
hendursa1, hendursaga, _inky, maxwelljoslyn[d] and [snarfed] joined the channel; pmn left the channel
#
[snarfed]
I definitely detected and special cased a bunch of paths on specific sites for IndieMap, somewhat similar to tantek's idea. lots but still probably less work than any kind of smart automated filtering. https://indiemap.org/docs.html#exceptions
tetov-irc joined the channel
#
[snarfed]
(sorry, probably for #indieweb-dev)
angelo, ShinyCyril, Seirdy and [cleverdevil] joined the channel