#dev 2021-10-06

2021-10-06 UTC
angelo, gRegor, gRegorLove_, superkuh and Nuve joined the channel
grantcodes[d], Nuve and feoh joined the channel
#
@RobbiNespu
Hello #indieweb #webmention world! I implemented checkins on my SSG website. Check my test post https://robbinespu.gitlab.io/indieweb/2021-10-06-checkin-pasar-dungun/ which implements a map (using #mapbox #openstreetmap). Sadly, Foursquare is not available on http://brid.gy, so I can't test whether my code parses correctly.
(twitter.com/_/status/1445646931619823619)
kogepan, hans1963[d] and hendursa1 joined the channel
#
capjamesg[d]
Moving discussion on contexts in search to here.
tetov-irc and akevinhuang joined the channel
#
petermolnar
capjamesg[d]: re main channel and site deaths: maybe check a text diff/distance when you update content? If it's too different, there might be a problem.
#
capjamesg[d]
[KevinMarks] I think Google actually has a meta tag to mark adult content. I could easily add that to help prevent some issues. Not a solution, but an idea.
#
capjamesg[d]
petermolnar good idea. Do you know how to do such a diff or distance check, algorithmically? I had md5 hashing before learning about hash collisions with md5. And it wasn't working quite right either.
#
capjamesg[d]
[KevinMarks] I hadn't considered that vector of creating spam either.
#
capjamesg[d]
The commonality between webmention feeds is likely to be markup. Those feeds could be excluded from indexing altogether since they are visitor-generated and most likely do not themselves address the intent of the visitor.
#
[KevinMarks]
hashes are good for checking if it's exactly the same - sha256 is a reasonable choice though in practice an accidental rather than malicious collision in md5 is unlikely. "edit distance" is probably a good search term - there are higher order ways to decide like word vector spaces, but that may be overkill.
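
A minimal sketch of the two checks mentioned above, assuming Python and only the standard library. The function names and the 0.5 threshold are illustrative assumptions, and `difflib.SequenceMatcher` gives a similarity ratio rather than a true edit distance, so it stands in for one here.

```python
import difflib
import hashlib


def content_fingerprint(text: str) -> str:
    """SHA-256 hex digest: only tells you whether two texts are identical."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity(old: str, new: str) -> float:
    """Similarity ratio in [0.0, 1.0]; 1.0 means the texts are identical."""
    return difflib.SequenceMatcher(None, old, new).ratio()


def looks_suspicious(old: str, new: str, threshold: float = 0.5) -> bool:
    """Flag an update whose text differs too much from the stored copy.

    The threshold is an assumed value and would need tuning on real pages.
    """
    if content_fingerprint(old) == content_fingerprint(new):
        return False  # unchanged content, nothing to check
    return similarity(old, new) < threshold
```

A crawler could call `looks_suspicious(stored_text, fetched_text)` before overwriting an indexed page and hold the update for review when it returns `True`.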
#
[KevinMarks]
If you have well marked-up webmention feeds in h-cite etc then you can attribute them to the source site, though again you may want to crawl them to check.
#
[KevinMarks]
you're likely going to need a blocklist, an allowlist and some kinds of pending list where you use signals to hold for review
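
One way the three lists could fit together, as a rough sketch. The `Decision` enum, the `triage` function, and the idea of passing in a precomputed `signal_count` are all assumptions made for illustration, not part of any existing crawler.

```python
from enum import Enum
from typing import Set
from urllib.parse import urlsplit


class Decision(Enum):
    INDEX = "index"
    BLOCK = "block"
    PENDING = "pending"  # held for manual review


def triage(url: str, allowlist: Set[str], blocklist: Set[str],
           signal_count: int, max_signals: int = 0) -> Decision:
    """Decide what to do with a URL based on its domain and spam signals."""
    domain = (urlsplit(url).hostname or "").lower()
    if domain in blocklist:
        return Decision.BLOCK
    if domain in allowlist:
        return Decision.INDEX
    # Unknown domain: index it unless some signal asks for a human look.
    if signal_count > max_signals:
        return Decision.PENDING
    return Decision.INDEX
```

Checking the blocklist first means a blocked domain stays blocked even if it was also allowlisted by mistake; reversing that order is an equally valid design choice.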
#
capjamesg[d]
p-comment would indicate a comments section on a well marked-up page. Technically, I could delete those from the raw HTML before the page is indexed.
#
capjamesg[d]
The trouble is that not all pages would fall under that category.
#
capjamesg[d]
A page with h-cards that point to more than 2 authors could also be indicative of a comments section?
#
capjamesg[d]
What "signals" would you recommend?
#
sknebel
in a section with webmentions I would expect external links that point to the source page. not sure if a normal comments section really needs excluding
#
capjamesg[d]
I wonder though if that could open up some kind of manipulation.
#
capjamesg[d]
If the site owner doesn't add rel=nofollow to Webmention links, those links would have weight in ranking.
#
capjamesg[d]
[KevinMarks] is it worth having some kind of keyword blocklist that would surface any obvious cases of content that might be problematic? or is this too naive?
#
[KevinMarks]
That can work, but again it can easily go wrong - if you manually curate it, sure, but it can be easy to accidentally overblock. (a famous one was posts mentioning 'socialism' being blocked because of Cialis being added to the spam words blocklist) - this is where a pending queue can be useful
KartikPrabhu joined the channel
#
[tantek]
also beware that "adult content" has been historically used to discriminate against / censor LGBTQI content and publishers in general
#
capjamesg[d]
[tantek] +1
#
capjamesg[d]
I don't want to introduce a keyword filter. It's too risky and likely to be biased in one of many ways.
#
capjamesg[d]
"adult content" <meta> tags are added by site owners so they can be considered more reliable.
#
[tantek]
what is content warning?
#
Loqi
content warning is a feature of a post create UI where an author can hide by default some or all of the primary content of a post due to some concern about the nature of the content https://indieweb.org/content_warning
#
capjamesg[d]
A few potential metrics for spam prevention / flagging: too many external links, too many referring domains, very long URL, very long title.
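
The metrics listed above could be turned into named flags along these lines. This is only a sketch: the `PageStats` fields and every threshold are assumed values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    url: str
    title: str
    external_link_count: int
    referring_domain_count: int


def spam_signals(page: PageStats,
                 max_external_links: int = 100,
                 max_referring_domains: int = 1000,
                 max_url_length: int = 200,
                 max_title_length: int = 150) -> List[str]:
    """Return the names of the heuristics a page trips (empty list if none)."""
    signals = []
    if page.external_link_count > max_external_links:
        signals.append("too_many_external_links")
    if page.referring_domain_count > max_referring_domains:
        signals.append("too_many_referring_domains")
    if len(page.url) > max_url_length:
        signals.append("very_long_url")
    if len(page.title) > max_title_length:
        signals.append("very_long_title")
    return signals
```

Returning the flag names rather than a single score keeps the reason visible to whoever reviews a held page.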
#
[tantek]
that kind of self-labeling is far beyond "adult content", and framing it that way, or even "NSFW" is too limiting. it causes solutions that "work" for a particularly socially conservative privileged population, but not for people / web users as a whole
#
[tantek]
hence "content warning" instead
#
petermolnar
"also beware that "adult content" has been historically used to discriminate against / censor LGBTQI content and publishers in general" - may I ask for the source of this, please? Most "adult content" I have encountered was... adult content, and had nothing to do with gender identity.
#
petermolnar
or sexual identity
#
aaronpk
oh gosh there are plenty of examples
#
aaronpk
i don't even know where to start
#
petermolnar
I'll rephrase the question: I'd like to understand who used "adult content" to censor.
#
petermolnar
*censor content that is not actually classified as 18+, but based on other criteria as well
#
aaronpk
a broad category of content that falls under that category is sex ed
#
petermolnar
but... sex ed can't be adult content, given the people you'd want to educate are not adults, no?
#
capjamesg[d]
Would you say this applies to the meta tag content rating?
#
capjamesg[d]
[tantek] aaronpk
#
capjamesg[d]
I know this is a super nuanced topic.
#
sknebel
petermolnar: the trouble starts when *you* want to educate non-adults and other people think you are "perverting the youth" or something by mentioning the wrong things
#
Murray[d]
petermolnar: YouTube is a pretty major example of a service that heavily censored/censors LGBTQIA+ content (and sex ed content more broadly) under the umbrella of "adult content". The censoring can be literal deletion of accounts/videos, but is normally more subtle, like age gating, suppressing videos, etc.
#
sknebel
(+ lots of filtering just being generally crap, so it doesn't matter in what context you mention things once some rule got the idea it's adult)
#
[tantek]
petermolnar it's a pretty well established pattern, examples above, and both online and offline. more details/history here: https://en.wikipedia.org/wiki/LGBT_movements_in_the_United_States#Opposition
#
[tantek]
Tumblr << Criticism: 2013-07-24 [https://www.baltimoresun.com/features/bal-tumblr-bans-gay-lesbian-searches-20130724-story.html Tumblr bans in-app searches of 'gay' and 'lesbian,' but not gay slurs] <blockquote>… plenty of content posted under those tags contributes to a productive discussion of LGBT issues</blockquote>
#
Loqi
ok, I added "Criticism: 2013-07-24 [https://www.baltimoresun.com/features/bal-tumblr-bans-gay-lesbian-searches-20130724-story.html Tumblr bans in-app searches of 'gay' and 'lesbian,' but not gay slurs] <blockquote>… plenty of content posted under those tags contributes to a productive discussion of LGBT issues</blockquote>" to the "See Also" section of /Tumblr https://indieweb.org/wiki/index.php?diff=77245&oldid=68805
#
jeremycherfas
As @cleverdevil is busy buying a house and moving, can anyone help me understand what this part of his Overcast scraping code might do: https://gist.github.com/cleverdevil/a8215850420493c1ee06364161e281c0#file-overcast-recently-played-py-L43-L44
#
jeremycherfas
Or rather, is `SESSION_PATH` a special variable in Python? I can't find anything that suggests it is.
#
jeremycherfas
Maybe he has a style of ALLCAPS for variables defined in a config file.
#
capjamesg[d]
Uppercase variable names are typically used to denote constants in Python, i.e. values that do not change.
#
capjamesg[d]
I use them in my config files for things like URLs that I know will not change.
#
capjamesg[d]
In that code, I don't know if conf is a module that cleverdevil has created, but I assume it is.
#
capjamesg[d]
And in a file like conf.py he would store his config variables.
#
capjamesg[d]
SESSION_PATH in this context looks to be the path where you want to save all of your files.
#
capjamesg[d]
At a high level, the code is: authenticating, getting an OPML file to parse, reading the file, and then extracting some information.
#
capjamesg[d]
Let me know if you have any questions.
#
capjamesg[d]
There are some good comments on cleverdevil's script too 🙂
hendursaga joined the channel
#
jeremycherfas
So it is a convention to ALLCAP variables that won't change, as you describe? I can handle that!
#
capjamesg[d]
Yeah 🙂
#
capjamesg[d]
It's a convention, not a rule though.
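
A small illustration of the naming convention discussed above. The `conf` object here is a stand-in built with `types.SimpleNamespace` so the example runs as a single file; in a real project the same names would more likely sit in a separate `conf.py`. The values are made up and are not taken from cleverdevil's gist.

```python
import types

# Stand-in for a hypothetical conf.py module. ALL_CAPS names signal
# "treat this as a constant"; the values below are assumptions.
conf = types.SimpleNamespace(
    SESSION_PATH="/tmp/example-session",
    FEED_URL="https://example.com/feed.opml",
)


def describe_session_path() -> str:
    """Read the setting the same way the script reads conf.SESSION_PATH."""
    return f"session data is kept at {conf.SESSION_PATH}"

# Nothing in the language enforces the convention: assigning a new value
# to conf.SESSION_PATH would work, it would just surprise other readers.
```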
#
capjamesg[d]
That's just something to bear in mind. I haven't seen many cases of confusing constants.
#
[tantek]
capjamesg[d] use representative h-cards from homepages of sites as the source of information/tags about that site
#
capjamesg[d]
Thanks for the content filtering discussion everyone! I made one change: sites marked as "adult" via a meta tag will not be indexed. As far as I am aware, Google uses this tag for SafeSearch. Code for reference: https://github.com/capjamesg/indieweb-search/blob/main/crawler/url_handling.py#L252
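
For readers who want a feel for such a check without opening the linked file, here is a standalone sketch using only the standard library. It is not the code from the linked repository; the class name, function name, and the set of accepted values (`adult` and the RTA label, as described in Google's SafeSearch documentation) are assumptions of this example.

```python
from html.parser import HTMLParser

# Values of <meta name="rating" content="..."> treated as "adult" here.
ADULT_RATINGS = {"adult", "rta-5042-1996-1400-1577-rta"}


class RatingMetaParser(HTMLParser):
    """Collects the content of every <meta name="rating"> tag."""

    def __init__(self):
        super().__init__()
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "rating":
            self.ratings.append(attributes.get("content", "").strip().lower())


def is_marked_adult(html: str) -> bool:
    """True if the page labels itself as adult content via a rating meta tag."""
    parser = RatingMetaParser()
    parser.feed(html)
    return any(rating in ADULT_RATINGS for rating in parser.ratings)
```

Because the label is set by the site owner, this check only reflects how a site describes itself and makes no judgement about the content.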
#
capjamesg[d]
So the entire h-card?
#
capjamesg[d]
That is actually very easy to do now because I am going to show results like the screenshot I shared in #indieweb earlier, with profile photos, etc.
#
capjamesg[d]
[tantek] I will use that logic instead. Good idea! I think that is likely to surface results with a greater intent than a word that appears on a home page.
#
[tantek]
precisely, "word on a home page" is just general search, not topic search
shoesNsocks joined the channel
#
capjamesg[d]
This will be in a future release.
hs0ucy, [fluffy] and [jeremycherfas] joined the channel
#
[jeremycherfas]
I’ve hard-coded some of the variables for now. Understanding how Python does config files is a task for another day.
#
[jeremycherfas]
[capjamesg]++
#
Loqi
[capjamesg] has 13 karma in this channel over the last year (25 in all channels)
hs0ucy joined the channel
#
capjamesg[d]
Let me know if you need anything else 🙂
hs0ucy joined the channel
#
capjamesg[d]
[tantek] I just added the h-card logic.
#
capjamesg[d]
I’ll maybe release tomorrow.
#
aaronpk
gosh my site is too big for my poor laptop's 256gb hard drive
#
aaronpk
guess i am stuck using a remote VM for development
#
capjamesg[d]
Your site must be really big!
#
capjamesg[d]
I assume you store posts in flat files?
#
capjamesg[d]
Or is it a DB that is using up that much space?
#
aaronpk
it's mostly the photos and videos on disk... 32gb
#
aaronpk
the database index is about 2gb
#
capjamesg[d]
Ah, got it.
akevinhuang2 joined the channel
#
[snarfed]
aaronpk do you store your site (data) in git? if so you may be interested in git partial clones
#
aaronpk
i do have a separate repo for just the files
#
aaronpk
i could clone only the code part of my site onto my laptop for development and create a bunch of placeholder posts for it, but that sounds like more work and also then i'm not testing it with real data
marksuth joined the channel
#
[snarfed]
oh I was thinking cloning some but not all of the data
#
aaronpk
maybe if i could download like only the last 12 months of posts that would be cool... can git filter filenames like that?
akevinhuang joined the channel
#
[snarfed]
maybe? dynamic like that sounds complicated, but you could just switch it once a year, probably pretty easy
#
aaronpk
if i could match folder names that would be close cause i could ignore /2008 - /2019 for example
#
vilhalmer
I don't think you can do that without doing like a git-filter-branch
#
vilhalmer
and you can't do that as you clone afaik, or push anything back afterwards without more convoluted steps
#
aaronpk
well i don't need to commit anything in this repo from my laptop since it's only the data part of the site
#
vilhalmer
huh, apparently you can do this: https://stackoverflow.com/a/13738951
#
vilhalmer
I've never seen this before, only shallow
#
aaronpk
well that's neat
#
vilhalmer
I'm gonna have to play with this
hendursaga joined the channel
#
[tantek]
interesting, sparse-checkout sounds promising
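
A rough sketch of how a sparse, partial clone could be scripted, assuming Git 2.25 or newer (which added `git clone --sparse` and the `git sparse-checkout` command). The repository URL, target directory, and year folders are placeholders, and the helper names are made up for this example.

```python
import subprocess
from typing import List


def sparse_clone_commands(repo_url: str, target_dir: str,
                          folders: List[str]) -> List[List[str]]:
    """Build the git commands for cloning only some top-level folders."""
    return [
        # Partial clone: skip file contents until they are needed, and
        # start with only the files in the repository root checked out.
        ["git", "clone", "--filter=blob:none", "--sparse", repo_url, target_dir],
        # Add the wanted folders to the working tree.
        ["git", "-C", target_dir, "sparse-checkout", "set", *folders],
    ]


def run_sparse_clone(repo_url: str, target_dir: str, folders: List[str]) -> None:
    """Run the commands in order, stopping at the first failure."""
    for command in sparse_clone_commands(repo_url, target_dir, folders):
        subprocess.run(command, check=True)
```

For example, `run_sparse_clone("https://example.com/site-data.git", "site-data", ["2020", "2021"])` would leave the older year folders out of the working tree; switching years later only needs another `git sparse-checkout set`.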
tetov-irc, [schmarty], gRegor, maxwelljoslyn[d] and Seirdy joined the channel