#indieweb 2021-09-22
2021-09-22 UTC
# [tantek] The Goodreads story is a sad one: a site that was initially quite helpful / useful for both readers and book authors, perhaps even a "community" of sorts, was so neglected (in terms of community management / harassment) sometime after being sold (to Amazon) that it's now become a *burden* for authors
[schmarty] and nertzy joined the channel
# hendursaga [tantek]: I really disliked them discontinuing their API.. it was a valuable resource..
# [tantek] ugh yeah. with that and the existing points in /Goodreads#See_Also, it's probably worth collecting those into a "Criticism" section to summarize / make it explicit
aranjedeath, Ruxton and KartikPrabhu1 joined the channel
# Loqi [indienews] New post: "Building my own webmention receiver" https://jamesg.blog/2021/09/22/webmention-receiver
hendursa1, _inky, timdream, benji and Bitweasil joined the channel
# [KevinMarks] [capjamesg] did you test with webmention.rocks ?
# capjamesg[d] [KevinMarks] yeah.
# capjamesg[d] That blog post is actually a bit out of date but I'd rather publish than not.
# capjamesg[d] I have written one blog post in the last few weeks.
tetov-irc joined the channel
# @parismartineau just thinking about this field on my vet's new patient intake form https://pbs.twimg.com/media/E_zz3gcUUAAOtAh.png (twitter.com/_/status/1440294126474252289)
# capjamesg[d] Well... IndieWeb Search needs a spam filter.
# capjamesg[d] Any tips on suggested spam filtering techniques that do not require a very good understanding of statistics?
# [KevinMarks] that is the hard problem of search
# petermolnar there's a wordpress plugin named antispam-bee; I believe that is a local-only plugin. Maybe looking into how it works could help.
Moosadee and sennomo joined the channel
# capjamesg[d] [tantek] I am.
# capjamesg[d] I am not going to take a filtering approach.
# capjamesg[d] In this case the issue was one domain that linked out to many, many sites.
# Loqi IndieWeb Search is a search engine that lets you find web pages and websites created by IndieWeb community members https://indieweb.org/IndieWeb_Search
hendursaga joined the channel
# capjamesg[d] I made that page
# capjamesg[d] Feel free to leave notes if you have tried out the search engine. Today was index rebuilding day ):
# capjamesg[d] [tantek] Good question. I need to investigate. It was either in my manual search at the start of my putting together the list or it was potentially a redirect from a dead domain that was pulled from the wiki or the domains list.
# Loqi relspider is a web crawler that indexes the identity/social graph of profiles https://indieweb.org/relspider
# capjamesg[d] Fun fact: I now have a CSV of almost 3M links. But I suspect at least tens of thousands are from this spam site and invalid.
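A minimal sketch of the "track and trace" approach to a link CSV like the one mentioned above: count how many distinct external domains each source domain links to, and flag the outliers for manual review. The two-column `source,target` layout, the function names, and the threshold are all assumptions for illustration, not the actual IndieWeb Search schema.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit


def outbound_domain_counts(rows):
    """Map each source domain to the number of distinct external domains it links to."""
    targets = defaultdict(set)
    for source_url, target_url in rows:
        src = urlsplit(source_url).hostname
        dst = urlsplit(target_url).hostname
        # Ignore malformed URLs and links that stay on the same domain.
        if src and dst and src != dst:
            targets[src].add(dst)
    return {src: len(dsts) for src, dsts in targets.items()}


def flag_link_farms(rows, threshold):
    """Return source domains linking to at least `threshold` distinct external domains."""
    counts = outbound_domain_counts(rows)
    return sorted(domain for domain, n in counts.items() if n >= threshold)


# Hypothetical sample data standing in for the real CSV.
sample_csv = """source,target
https://spam.example/a,https://one.example/
https://spam.example/b,https://two.example/
https://spam.example/c,https://three.example/
https://blog.example/post,https://one.example/
"""

reader = csv.reader(io.StringIO(sample_csv))
next(reader)  # skip the header row
suspects = flag_link_farms(reader, threshold=3)
print(suspects)
```

Flagged domains would still need a human look before going on a blocklist, since legitimate directories and blogrolls also link out widely.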
# capjamesg[d] I'm just about to eat my dinner voxpelli
# capjamesg[d] Will check out later!
# capjamesg[d] I am stuck on that issue too.
# capjamesg[d] I have some almost working recrawler logic in my crawler.
# capjamesg[d] But I need to decide: how much to crawl, when to crawl, and how often I should rebuild the full index.
# capjamesg[d] Any other considerations?
# Loqi It looks like we don't have a page for "zombie site" yet. Would you like to create it? (Or just say "zombie site is ____", a sentence describing the term)
# Loqi zombie is in the context of the web a website that had died (site-deaths), perhaps due to domain registration neglect, and has been brought back by some other looking sorta like it did before, but oddly broken, often with spam pages/links added, and eats a lot of CPU likely due to abusive scripts https://indieweb.org/zombie
# Loqi Portable Contacts (often abbreviated as PoCo) was a proposed 2008 specification for exchange of vCard-compatible contact info using a one-off JSON format, before microformats2, its parsed canonical JSON format, and h-card was specified https://indieweb.org/Portable_Contacts
[fluffy] joined the channel
# [KevinMarks] I've had it where old blogs I was subscribed to came back to life and my feedreader was full of spam
# capjamesg[d] This is an interesting discussion.
# capjamesg[d] [tantek] There are a few interesting resources I have come across.
# capjamesg[d] I have seen a service that lets you do an API call and find out if information like IP address / WHOIS has changed recently.
# capjamesg[d] I also found that Google has an open API that you can use to determine if a URL might be malicious.
# capjamesg[d] Which is really kind of them.
# capjamesg[d] I thought about incorporating it into IndieWeb Search but the crawler does not follow links to other sites right now.
# capjamesg[d] voxpelli My crawler is public on GitHub capjamesg[d]/indieweb-search.
# capjamesg[d] voxpelli Have you seen snarfed's IndieMap?
# capjamesg[d] He did something similar with social graphs to what you did with relspider.
# capjamesg[d] I'm open to any extensions on the IndieWeb Search crawler that do not slow things down too much, so if there's data that might be interesting to the community I can pull it.
# capjamesg[d] [tantek] My logs are a bit off. Logs are marked with a timestamp in their file name so it's hard to find exactly when a site was crawled unless you use a bash command. The Python logging module doesn't let me specify different logging files per thread I don't think.
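On the per-thread log file question: one approach that appears to work with the standard `logging` module is to give each worker its own named logger with its own `FileHandler`, and to put the timestamp in each record rather than only in the file name. This is a sketch under assumptions; the `crawler.<name>` logger names, the one-file-per-domain layout, and `crawl_worker` are hypothetical and not taken from the indieweb-search code.

```python
import logging
import os
import tempfile
import threading


def make_thread_logger(log_dir, name):
    """Return a logger that writes only to <log_dir>/<name>.log."""
    logger = logging.getLogger(f"crawler.{name}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger's handlers
    if not logger.handlers:  # avoid attaching a second handler on reuse
        handler = logging.FileHandler(
            os.path.join(log_dir, f"{name}.log"), encoding="utf-8"
        )
        # Each record carries its own timestamp and the emitting thread's name.
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(threadName)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def crawl_worker(log_dir, domain):
    """Hypothetical worker: logs one line per crawled domain."""
    logger = make_thread_logger(log_dir, domain)
    logger.info("crawled %s", domain)


log_dir = tempfile.mkdtemp()
threads = [
    threading.Thread(target=crawl_worker, args=(log_dir, d), name=f"worker-{d}")
    for d in ("a.example", "b.example")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(os.listdir(log_dir)))
```

With timestamps inside the records, finding when a site was crawled becomes a search within one file instead of a match on file names.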
# capjamesg[d] I think the requests library adds some logging. I don't explicitly log redirects. One second.
# capjamesg[d] If there's a way I can contribute I'd be happy to. But I'm not crawling all domains on the wiki just yet.
# [tantek] just as we can use /chat-names as a startlist of sorts
# capjamesg[d] [tantek] My logs track redirects by status code.
# capjamesg[d] I like that idea [tantek].
# capjamesg[d] I have a blocklist logic already. But only three domains.
# capjamesg[d] Somehow LinkedIn got into my dataset. It might have been an early mix up on my part.
# capjamesg[d] voxpelli Curious to hear if you have any feedback on the search engine if you tried it out
# capjamesg[d] [tantek] Your comment on logging reminded me of how much I track. This is so useful!
# capjamesg[d] Back to your original point, I'm not going to go too deep into the spam world right now since it's easier to track and trace versus implement a machine learning algorithm or something like that.
# capjamesg[d] No worries voxpelli
dotslashroot joined the channel
# capjamesg[d] To be honest I have ticked off a lot of my to-dos for this project. I am just using it myself now.
# capjamesg[d] I need to fix duplicate results.
# capjamesg[d] Apparently the wiki /webmentions and /webmention are indexed as separate documents because I follow redirects. I don't know why this is not picked up as a duplicate though.
# capjamesg[d] (I am using md5 hashing for duplicate identification. I know I need to use something different.)
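A hedged sketch of one way to catch duplicates like the two wiki URLs above: key each document on the final URL after redirects, and also on a hash of whitespace-normalized, lowercased text. Note that swapping MD5 for another exact hash would not by itself change which pages match; any exact hash misses pages that differ by a single byte, so the normalization step is what matters here. The `(requested_url, final_url, text)` tuple shape, the function names, and the `wiki.example` URLs are assumptions for illustration.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep the first page seen for each final URL or content fingerprint.

    `pages` is an iterable of (requested_url, final_url, text) tuples.
    Returns the list of final URLs that were kept, in input order.
    """
    seen_urls = set()
    seen_fingerprints = set()
    kept = []
    for _requested_url, final_url, text in pages:
        fingerprint = content_fingerprint(text)
        if final_url in seen_urls or fingerprint in seen_fingerprints:
            continue  # same destination or same normalized content
        seen_urls.add(final_url)
        seen_fingerprints.add(fingerprint)
        kept.append(final_url)
    return kept


# Hypothetical crawl results.
pages = [
    ("https://wiki.example/webmentions", "https://wiki.example/Webmention", "Webmention is a standard."),
    ("https://wiki.example/webmention", "https://wiki.example/Webmention", "Webmention is a standard."),
    ("https://mirror.example/wm", "https://mirror.example/wm", "Webmention   is a\nstandard. "),
    ("https://wiki.example/Micropub", "https://wiki.example/Micropub", "Micropub is a standard."),
]
kept = deduplicate(pages)
print(kept)
```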
# dotslashroot Hello
# capjamesg[d] Hello dotslashroot!
shoesNsocks, MatrixTravelerbo, ShinyCyril and zblesk[m] joined the channel
# capjamesg[d] aaronpk How does the wiki handle redirects?
# capjamesg[d] i.e. from /webmentions to /Webmention
# capjamesg[d] I am seeing a 200 code from curl.
# [KevinMarks] There will be a 301 happening before that. You can track them if you look in more detail at requests
# capjamesg[d] I'm probably being silly but I can't see one.
# capjamesg[d] request_obj.history is not showing a redirect chain.
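If no redirect shows up in the response history, one possibility (an assumption, not something confirmed in this discussion) is that the server answers the alias URL directly with a 200 and names the preferred URL in a `rel="canonical"` link element. A crawler could read that element and index the page under the canonical URL. A minimal sketch using only the standard library; `CanonicalLinkParser`, `canonical_url`, and the `wiki.example` URL are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Record the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(page_url, html):
    """Return the page's canonical URL, or page_url if none is declared."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical:
        # Resolve relative hrefs against the URL that was fetched.
        return urljoin(page_url, parser.canonical)
    return page_url


html = '<html><head><link rel="canonical" href="/Webmention"></head><body></body></html>'
resolved = canonical_url("https://wiki.example/webmentions", html)
print(resolved)
```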
# capjamesg[d] +1
pmn joined the channel
# pmn can <link> or rss's <item> be relative?
# pmn *of rss
# pmn ok
hendursa1, hendursaga, _inky, maxwelljoslyn[d] and [snarfed] joined the channel; pmn left the channel
# [snarfed] I definitely detected and special cased a bunch of paths on specific sites for IndieMap, somewhat similar to tantek's idea. lots but still probably less work than any kind of smart automated filtering. https://indiemap.org/docs.html#exceptions
tetov-irc joined the channel
angelo, ShinyCyril, Seirdy and [cleverdevil] joined the channel