#dev 2021-10-06
2021-10-06 UTC
angelo, gRegor, gRegorLove_, superkuh and Nuve joined the channel
# @n8wachT IndieAuth | An #identity layer on top of #OAuth 2.0 https://indieauth.spec.indieweb.org/ (twitter.com/_/status/1445606376269516808)
grantcodes[d], Nuve and feoh joined the channel
# @RobbiNespu Hello #indieweb #webmention world! I implemented checkin on my ssg website. Check my testing post https://robbinespu.gitlab.io/indieweb/2021-10-06-checkin-pasar-dungun/ which implement map (using #mapbox #openstreetmap), sadly foursquare not available on http://brid.gy to test if my code parse correctly. (twitter.com/_/status/1445646931619823619)
kogepan, hans1963[d] and hendursa1 joined the channel
# capjamesg[d] Moving discussion on contexts in search to here.
tetov-irc and akevinhuang joined the channel
# petermolnar capjamesg[d]: re main channel and site deaths: maybe check a text diff/distance when you update a content? If it's too different, there might be a problem.
# capjamesg[d] [KevinMarks] I think Google actually has a meta tag to mark adult content. I could easily add that to help prevent some issues. Not a solution, but an idea.
# capjamesg[d] petermolnar good idea. Do you know how to do such a diff or distance check, algorithmically? I had md5 hashing before learning about hash collisions with md5. And it wasn't working quite right either.
# capjamesg[d] [KevinMarks] I hadn't considered that vector of creating spam either.
# capjamesg[d] The commonality between webmention feeds is likely to be markup. Those feeds could be excluded from indexing altogether since they are visitor-generated and most likely do not themselves address the intent of the visitor.
# [KevinMarks] hashes are good for checking if it's exactly the same - sha256 is a reasonable choice though in practice an accidental rather than malicious collision in md5 is unlikely. "edit distance" is probably a good search term - there are higher order ways to decide like word vector spaces, but that may be overkill.
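A minimal sketch of the two checks mentioned above, assuming the crawler keeps the previously indexed text of each page. The function names and the 0.5 threshold are hypothetical, and `difflib`'s similarity ratio stands in here for a true edit distance:

```python
import difflib
import hashlib


def content_fingerprint(text: str) -> str:
    """SHA-256 hex digest; only answers "is this exactly the same text?"."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity(old: str, new: str) -> float:
    """Word-level similarity ratio in [0, 1]; 1.0 means identical word sequences."""
    return difflib.SequenceMatcher(None, old.split(), new.split()).ratio()


def looks_suspicious(old: str, new: str, threshold: float = 0.5) -> bool:
    """Flag a re-crawl whose text changed a lot (threshold is an arbitrary assumption)."""
    if content_fingerprint(old) == content_fingerprint(new):
        return False  # unchanged, nothing to review
    return similarity(old, new) < threshold
```

A page that comes back with almost none of its previous words would be flagged, while a small edit would not. The threshold would need tuning against real pages.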
# [KevinMarks] If you have well marked-up webmention feeds in h-cite etc then you can attribute them to the source site, though again you may want to crawl them to check.
# [KevinMarks] you're likely going to need a blocklist, an allowlist and some kinds of pending list where you use signals to hold for review
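The three lists described above could be sketched roughly as follows. The domains, the `triage` name, and the idea of counting signals against a threshold are all assumptions for illustration, not how any existing crawler does it:

```python
from urllib.parse import urlsplit

# Hypothetical, manually curated lists of hostnames.
BLOCKLIST = {"spam.example"}
ALLOWLIST = {"trusted.example"}


def triage(url: str, signals: int, review_threshold: int = 2) -> str:
    """Return 'block', 'index', or 'pending' for a crawled URL.

    `signals` is the number of spam heuristics the page tripped.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host in BLOCKLIST:
        return "block"
    if host in ALLOWLIST:
        return "index"
    # Unknown site: hold for human review only if enough signals fired.
    return "pending" if signals >= review_threshold else "index"
```

The pending result is the point of the design: doubtful pages wait for a person instead of being silently dropped or indexed.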
# capjamesg[d] +1
# capjamesg[d] p-comment would indicate a comments section on a well marked-up page. Technically, I could delete those from the raw HTML before the page is indexed.
# capjamesg[d] The trouble is that not all pages would fall under that category.
# capjamesg[d] A page with h-cards that point to more than 2 authors could also be indicative of a comments section?
# capjamesg[d] What "signals" would you recommend?
# capjamesg[d] I wonder though if that could open up some kind of manipulation.
# capjamesg[d] If the site owner doesn't add rel=nofollow to Webmention links, those links would have weight in ranking.
# capjamesg[d] [KevinMarks] is it worth having some kind of keyword blocklist that would surface any obvious cases of content that might be problematic? or is this too naive?
# [KevinMarks] That can work, but again it can easily go wrong - if you manually curate it, sure, but it can be easy to accidentally overblock. (a famous one was posts mentioning 'socialism' being blocked because of Cialis being added to the spam words blocklist) - this is where a pending queue can be useful
KartikPrabhu joined the channel
# capjamesg[d] [tantek] +1
# capjamesg[d] I don't want to introduce a keyword filter. It's too risky and likely to be biased in one of many ways.
# capjamesg[d] "adult content" <meta> tags are added by site owners so they can be considered more reliable.
# Loqi content warning is a feature of a post create UI where an author can hide by default some or all of the primary content of a post due to some concern about the nature of the content https://indieweb.org/content_warning
# capjamesg[d] A few potential metrics for spam prevention / flagging: too many external links, too many referring domains, very long URL, very long title.
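A rough sketch of three of these metrics using only the standard library. All limits below are made-up placeholders, and "too many referring domains" is left out because it needs link-graph data rather than a single page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hypothetical limits; real values would need tuning.
MAX_EXTERNAL_LINKS = 100
MAX_URL_LENGTH = 200
MAX_TITLE_LENGTH = 150


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def spam_signals(page_url: str, title: str, html: str) -> list:
    """Return the names of the heuristics this page trips."""
    collector = _LinkCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    external = 0
    for href in collector.hrefs:
        host = urlsplit(urljoin(page_url, href)).hostname
        if host and host != page_host:
            external += 1

    flags = []
    if external > MAX_EXTERNAL_LINKS:
        flags.append("too many external links")
    if len(page_url) > MAX_URL_LENGTH:
        flags.append("very long URL")
    if len(title) > MAX_TITLE_LENGTH:
        flags.append("very long title")
    return flags
```

The returned list could feed a review queue rather than trigger an outright block.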
# petermolnar "also beware that "adult content" has been historically used to discriminate against / censor LGBTQI content and publishers in general" - may I ask for the source of this, please? Most "adult content" I have encountered was... adult content, and had nothing to do with gender identity.
# petermolnar or sexual identity
# petermolnar I'll rephrase the question: I'd like to understand who used "adult content" to censor.
# petermolnar *censor content that is not actually classified as 18+, but based on other criteria as well
# petermolnar but... sex ed can't be adult content, given the people you'd want to educate are not adults, no?
# capjamesg[d] Would you say this applies to the meta tag content rating?
# capjamesg[d] [tantek] aaronpk
# capjamesg[d] I know this is a super nuanced topic.
# Murray[d] petermolnar: YouTube is a pretty major example of a service that heavily censored/censors LGBTQIA+ content (and sex ed content more broadly) under the umbrella of "adult content". The censoring can be literal deletion of accounts/videos, but normally more subtle instances like age gating, suppressing videos, etc.
# [tantek] petermolnar it's a pretty well established pattern, examples above, and both online and offline. more details/history here: https://en.wikipedia.org/wiki/LGBT_movements_in_the_United_States#Opposition
# [tantek] Tumblr << Criticism: 2013-07-24 [https://www.baltimoresun.com/features/bal-tumblr-bans-gay-lesbian-searches-20130724-story.html Tumblr bans in-app searches of 'gay' and 'lesbian,' but not gay slurs] <blockquote>… plenty of content posted under those tags contributes to a productive discussion of LGBT issues</blockquote>
# Loqi ok, I added "Criticism: 2013-07-24 [https://www.baltimoresun.com/features/bal-tumblr-bans-gay-lesbian-searches-20130724-story.html Tumblr bans in-app searches of 'gay' and 'lesbian,' but not gay slurs] <blockquote>… plenty of content posted under those tags contributes to a productive discussion of LGBT issues</blockquote>" to the "See Also" section of /Tumblr https://indieweb.org/wiki/index.php?diff=77245&oldid=68805
# jeremycherfas As @cleverdevil is busy buying a house and moving, can anyone help me understand what this part of his Overcast scraping code might do: https://gist.github.com/cleverdevil/a8215850420493c1ee06364161e281c0#file-overcast-recently-played-py-L43-L44
# jeremycherfas Or rather, is `SESSION_PATH` a special variable in Python? I can't find anything that suggests it is.
# jeremycherfas Maybe he has a style of ALLCAPS for variables defined in a config file.
# capjamesg[d] Uppercase variable names are typically used to denote constants in Python, which are values that do not change.
# capjamesg[d] I use them in my config files for things like URLs that I know will not change.
# capjamesg[d] In that code, I don't know if conf is a module that cleverdevil has created, but I assume it is.
# capjamesg[d] And in a file like conf.py he would store his config variables.
# capjamesg[d] SESSION_PATH in this context looks to be the path where you want to save all of your files.
# capjamesg[d] At a high level, the code is: authenticating, getting an OPML file to parse, reading the file, and then extracting some information.
# capjamesg[d] Let me know if you have any questions.
# capjamesg[d] There are some good comments on cleverdevil's script too 🙂
hendursaga joined the channel
# jeremycherfas So it is a convention to ALLCAP variables that won't change, as you describe? I can handle that!
# capjamesg[d] Yeah 🙂
# capjamesg[d] It's a convention, not a rule though.
# capjamesg[d] That's just something to bear in mind. I haven't seen many cases of confusing constants.
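To illustrate the convention discussed above, here is a small sketch. The names and values are hypothetical and not taken from cleverdevil's gist. In a real project the two assignments would sit in a separate `conf.py` and be read elsewhere as `conf.SESSION_PATH`:

```python
from pathlib import Path

# ALLCAPS signals "treat this as a constant". Python does not enforce it:
# reassigning SESSION_PATH later would be legal, just bad manners.
SESSION_PATH = Path("data/session")            # hypothetical storage location
FEED_URL = "https://example.com/export.opml"   # hypothetical URL


def session_file(name: str) -> Path:
    """Build the path of a file stored under SESSION_PATH."""
    return SESSION_PATH / name
```

Hard-coding the values at the top of the script, as in this sketch, behaves the same as importing them from a config module.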
# capjamesg[d] Thanks for the content filtering discussion everyone! I made one change: sites marked as "adult" via a meta tag will not be indexed. As far as I am aware, Google uses this tag in SafeSearch. Code for reference: https://github.com/capjamesg/indieweb-search/blob/main/crawler/url_handling.py#L252
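For readers who do not want to open the link, a check like this can be sketched with the standard library. This is not the linked code. It assumes the `<meta name="rating">` tag with the values `adult` or `RTA-5042-1996-1400-1577-RTA`, which Google documents for SafeSearch, and the function name is hypothetical:

```python
from html.parser import HTMLParser

# Rating values treated as "adult", compared case-insensitively.
ADULT_RATINGS = {"adult", "rta-5042-1996-1400-1577-rta"}


class _RatingParser(HTMLParser):
    """Sets is_adult when a <meta name="rating"> tag carries an adult value."""

    def __init__(self):
        super().__init__()
        self.is_adult = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        content = attr.get("content", "").strip().lower()
        if name == "rating" and content in ADULT_RATINGS:
            self.is_adult = True


def is_marked_adult(html: str) -> bool:
    """True if the page's own markup declares it as adult content."""
    parser = _RatingParser()
    parser.feed(html)
    return parser.is_adult
```

A crawler could call `is_marked_adult(page_html)` before indexing and skip the page when it returns `True`.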
# capjamesg[d] So the entire h-card?
# capjamesg[d] That is actually very easy to do now because I am going to show results like the screenshot I shared in #indieweb earlier, with profile photos, etc.
# capjamesg[d] [tantek] I will use that logic instead. Good idea! I think that is likely to surface results with a greater intent than a word that appears on a home page.
shoesNsocks joined the channel
# capjamesg[d] Got it!
# capjamesg[d] This will be in a future release.
hs0ucy, [fluffy] and [jeremycherfas] joined the channel
# [jeremycherfas] I’ve hard-coded some of the variables for now. Understanding how Python does config files is a task for another day.
# [jeremycherfas] [capjamesg]++
hs0ucy joined the channel
# capjamesg[d] Let me know if you need anything else 🙂
hs0ucy joined the channel
# capjamesg[d] [tantek] I just added the h-card logic.
# capjamesg[d] I’ll maybe release tomorrow.
# capjamesg[d] Your site must be really big!
# capjamesg[d] I assume you store posts in flat files?
# capjamesg[d] Or is it a DB that is using up that much space?
# capjamesg[d] Ah, got it.
akevinhuang2 joined the channel
marksuth joined the channel
akevinhuang joined the channel
# vilhalmer huh, apparently you can do this: https://stackoverflow.com/a/13738951
hendursaga joined the channel
tetov-irc, [schmarty], gRegor, maxwelljoslyn[d] and Seirdy joined the channel