#dev 2024-04-13

2024-04-13 UTC
scottishstoater joined the channel
#
[tantek]
capjamesg++ whoa very cool!
#
Loqi
capjamesg has 47 karma in this channel over the last year (185 in all channels)
#
[tantek]
capjamesg is Python code readily runnable in a Browser Add-on? Or would the code need to be translated to JavaScript?
#
capjamesg
[tantek] I'd need to translate it. I'm more comfortable in Python.
#
capjamesg
I just added more of the Wikipedia disinformation lists, too.
#
capjamesg
Regarding the classification, I think the best approach would be to have a pre-classified list of Wikipedia categories so the browser can do a lookup.
#
capjamesg
You can technically do classification on the fly with a library like transformers.js but it is wasteful to load a model and compute vectors in the browser when all you need is to know if a Wikipedia category is positive or negative.
#
[tantek]
totally agree - no need for a model / vectors and possibly would give inaccurate results too
#
[tantek]
With Wikipedia you're at least providing some degree of consensus human review results
#
capjamesg
I think a classification model is ideal for a first pass of the Wikipedia categories at a low threshold, depending on how many there are. Then human review of anything below a certain threshold.
#
capjamesg
Using the classification model I'm using right now -- a public model fine-tuned on DistilBERT -- I am getting good results with an 80% confidence threshold.
#
[tantek]
oh do you mean classifying the categories themselves? rather than the domains/articles?
#
capjamesg
I'd envision a browser extension to have a Red, Green, or Grey icon that shows depending on whether a flag has or has not been raised, or if no information is available about a page.
#
capjamesg
[tantek] Yes.
#
capjamesg
My code classifies that Pseudoscience is negative so knows to flag the source as potentially untrustworthy.
#
capjamesg
But Privately held companies of the United States is neutral, so no flag needs to be raised.
#
capjamesg
(both categories on the Goop domain page)
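A minimal Python sketch of the lookup idea discussed above: a pre-classified table of Wikipedia categories drives a Red, Green, or Grey icon. The table contents, the `Icon` enum, and `icon_for` are hypothetical names for illustration only; the two labels are the ones mentioned in the chat, and this is not the author's actual code.

```python
from enum import Enum


class Icon(Enum):
    RED = "red"      # at least one category is pre-classified as negative
    GREEN = "green"  # known categories found, none negative
    GREY = "grey"    # no classified information available for the page


# Hypothetical pre-classified table (assumption: in practice this would be
# generated by a model pass and then reviewed by humans).
CATEGORY_LABELS = {
    "Pseudoscience": "negative",
    "Privately held companies of the United States": "neutral",
}


def icon_for(categories):
    """Pick an icon from a page's Wikipedia categories using a plain lookup."""
    # Keep only the categories that appear in the pre-classified table.
    labels = [CATEGORY_LABELS[c] for c in categories if c in CATEGORY_LABELS]
    if not labels:
        return Icon.GREY
    if "negative" in labels:
        return Icon.RED
    return Icon.GREEN


goop_categories = [
    "Pseudoscience",
    "Privately held companies of the United States",
]
print(icon_for(goop_categories))
```

Because the browser only consults a dictionary, no model or vectors need to be loaded on the client.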
#
[tantek]
ahhh interesting, I was expecting that the lists of "red" (stop) or "yellow" (caution) categories would itself be human curated
#
[tantek]
I didn't think there would be enough of them to be worth automating in any way
#
capjamesg
There are!
#
[tantek]
I thought the real work would be making the domain name redirects manually
#
capjamesg
Where the site is marked with the Shadow Library. We'd need to account for cases like that.
#
[tantek]
why? what problem is that solving?
#
capjamesg
Fair; disregard. That isn't relevant here.
#
[tantek]
the point is that most sites would not need any labeling/warning
#
capjamesg
But... note that the domain works.
#
capjamesg
Wikipedia has the domain listed as a URL of SciHub and redirects back.
#
[tantek]
yes! that's my point, there's a lot of those! when I found them (and created a few more) I realized it was possible to use it as an API
#
capjamesg
Exactly!
#
capjamesg
I think we're on the same page with that.
#
[tantek]
from what I can tell, no one else had previously proposed using Wikipedia look up of domain names in an automated manner like that
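A rough Python sketch of what "using Wikipedia domain redirects as an API" could look like. It builds a MediaWiki `action=query` URL that follows redirects and asks for categories, then parses a response of that shape. The function names are hypothetical, the sketch does not perform the network request itself, and it is not taken from the code discussed in the channel.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_query_url(domain):
    """Build a MediaWiki query URL that treats a domain name as a page title."""
    params = {
        "action": "query",
        "format": "json",
        "titles": domain,      # e.g. a domain that redirects to an article
        "redirects": 1,        # follow the redirect to the target article
        "prop": "categories",  # return the target article's categories
        "cllimit": "max",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def parse_categories(response):
    """Return (redirect_target, [category names]) from a decoded JSON response."""
    query = response.get("query", {})
    redirects = query.get("redirects", [])
    target = redirects[0]["to"] if redirects else None
    categories = []
    for page in query.get("pages", {}).values():
        for category in page.get("categories", []):
            title = category["title"]
            # Category titles come back prefixed with "Category:".
            if title.startswith("Category:"):
                title = title[len("Category:"):]
            categories.append(title)
    return target, categories


print(category_query_url("example.com"))
```

A page with no redirect and no categories yields `(None, [])`, which maps naturally onto "no information available" rather than onto any statement about the site.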
#
capjamesg
This implements our algorithm -- minus the n-day consensus -- with lookups on Wikipedia's reputable lists, and, potentially, third-party, authoritative lists of fake news websites.
#
capjamesg
I found some site that was charging for an API for this 😦
#
[tantek]
it's that plus the human-curated "bad list" of Categories to look for that make this work IMO
#
capjamesg
Exactly.
#
[tantek]
I don't trust "third-party, authoritative lists of fake news websites" to be "authoritative" — because the process by which it is done is not open
#
capjamesg
Actually, correction: "We are very selective in who we allow access to our API. Please fill out the form below to receive permission or a quote to use our API."
#
[tantek]
I wouldn't use any "closed" list for this. Too easy for a private bad actor to silently mess with it
#
[tantek]
It's *very* hard for persistent vandalism on Wikipedia to stick
#
[tantek]
this is why I would *only* use Wikipedia lookup for this
#
capjamesg
Yeah. And in any case the n-day consensus would account for that.
#
capjamesg
Next step: make it a browser extension!
#
capjamesg
I can probably help with this at some point, but it's 2am so I will get some sleep now 😄
#
[tantek]
yes sleep!
#
[tantek]
I should go make dinner
#
capjamesg
I also learned Google has some kind of fact check system.
#
[tantek]
yes that was why I suggested looking at the page, then prev day, prev two days, and picking 2 out of 3 if they disagree on categories
#
Loqi
I agree
#
[tantek]
there's lots of private "fact check" systems.
#
[tantek]
it's not that interesting, also doesn't scale, and also not responsive
#
capjamesg
It doesn't look well developed.
#
capjamesg
As in, it is not rich with data.
#
[tantek]
e.g. new disinfo sites can come up all the time
#
capjamesg
Right. The ones of which I have heard are those that challenge the claims made by politicians, etc.
#
capjamesg
BBC Verify being one such example.
#
capjamesg
Go make dinner. I'll sleep. But I'd love to explore this further! I may turn my Python code into a package so it could be integrated into web app back-ends. That's a task for another day!
ttybitnik and geoffo joined the channel
#
[Al_Abut]
[aciccarello] and @btrem - I’m late to the discussion but yes, you can skip the <picture> element and just use <img> while serving up different sizes. Just use the <img> tag like usual and use srcset with different widths.
[tw2113] joined the channel
#
[Al_Abut]
The best part is that you don’t even have to deal with media queries, now that we have browser-level lazy loading. Just use the snippet here: https://ericportis.com/posts/2023/auto-sizes-pretty-much-requires-width-and-height/
geoffo, Tiffany and scottishstoater joined the channel
#
capjamesg
[tantek] I have added consensus logic to the Python implementation. You can choose between four strategies: percentage, majority, unanimous, or in one or more days.
#
capjamesg
Categories are retrieved once per day, cached, then passed through the consensus logic.
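The four strategies named above could be sketched in Python roughly as follows, given one boolean per day saying whether a flagged category was present. The function name, the strategy strings, and the `threshold` parameter are assumptions for illustration and may not match the published package's interface.

```python
def flag_by_consensus(daily_flags, strategy="majority", threshold=0.5):
    """Decide whether to raise a flag from per-day observations.

    daily_flags: list of booleans, one per day checked (True = flagged that day).
    strategy: "percentage", "majority", "unanimous", or "any" (one or more days).
    threshold: minimum flagged share, used only by the "percentage" strategy.
    """
    if not daily_flags:
        return False  # nothing observed, so nothing to flag
    share = sum(daily_flags) / len(daily_flags)
    if strategy == "percentage":
        return share >= threshold
    if strategy == "majority":
        return share > 0.5
    if strategy == "unanimous":
        return all(daily_flags)
    if strategy == "any":
        return any(daily_flags)
    raise ValueError(f"unknown strategy: {strategy}")


# Today, the previous day, and two days ago: pick 2 out of 3 if they disagree.
print(flag_by_consensus([True, True, False], strategy="majority"))
```

With three daily snapshots, the "majority" strategy corresponds to the 2-out-of-3 rule suggested earlier in the discussion.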
scottishstoater and [Jo] joined the channel
scottishstoater joined the channel
#
capjamesg
The package is now on PyPI and available for download. The docs aren't live on a website yet, but I'll get around to that at some point.
scottishstoater, teder_[d] and chimo joined the channel
#
capjamesg
[tantek] I don’t think extensions can make requests to the Wikipedia API because of CORS?
ttybitnik, scottishstoater, [Ros], gRegor and [dmitshur] joined the channel
#
[dmitshur]
I've realized recently that git itself doesn't seem to enforce the `user.email` value, it being an email is a convention but otherwise it can be set to a URL. I'm curious if anyone has considered taking advantage of that and actually using a URL (instead of email) as their git author identity, for the same reasons that are motivated at https://indieweb.org/Why_web_sign-in#Why_not_email.
#
[dmitshur]
The reason I'm thinking about this now is because I'm working on letting people send changes to the git server on my site, and I authenticate and identify them via a URL. I don't track user's emails (and don't want to track them), so if git commits use email addresses, it's harder to associate it with a user. If git commits happened to use URLs, it'd be trivial for me to automatically verify that a person who authenticated as
#
[dmitshur]
http://example.com sending a patch must use a git author with the "email" http://example.com and not someoneelse.example.
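The check described here could look something like this Python sketch, assuming the server already knows the URL the person authenticated with and has read the commit's author "email" field (for example via `git log -1 --format=%ae`). The helper names and the normalization rules are assumptions, not part of git or of the server being described.

```python
from urllib.parse import urlsplit


def normalize_identity(url):
    """Normalize a URL identity so trivial differences do not cause mismatches."""
    parts = urlsplit(url.strip())
    # Scheme and host are case-insensitive; a trailing slash is ignored.
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def author_matches(authenticated_url, commit_author_email):
    """True if the git author 'email' is the same URL the sender signed in with."""
    return normalize_identity(authenticated_url) == normalize_identity(commit_author_email)


print(author_matches("http://example.com", "http://example.com/"))
print(author_matches("http://example.com", "http://someoneelse.example"))
```

A stricter server might skip normalization and require an exact string match; which is appropriate depends on how identities are stored.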
scottishstoater joined the channel
#
[dmitshur]
I guess most people would not want to use a different git author identity only in some contexts, and email has the benefit of being the widely accepted default that'll work everywhere without any surprises (GitHub, Gerrit, GitLab, random people's personal websites, etc.). So someone would have to be very very motivated to consider switching away from email as their preferred git author identity.
scottishstoater and ttybitnik joined the channel
#
[Ros]
Hey pals, does anyone have any recommendation for YouTube episodes/series/creators who discuss the history of HTML, CSS, and Javascript?
#
[Ros]
E.g. I just watched https://youtu.be/NzzGt7EmXVw?si=W1DZY1KH3n-mTcB0 that covers why and how HTML is formatted as it is today ([Jeremy_Keith]’s book is mentioned in episode 2), and I’m finding it so helpful to understand the background. So many “like duhhhh” moments hitting for why we do things as we do. I’m looking for more good episodes on HTML, CSS + Javascript if anyone can recommend 🙏
#
[KevinMarks]
Jeremy has given a lot of good talks on HTML - search for those
#
capjamesg
[Ros] https://thehistoryoftheweb.com/archives/?archive_type=posts has a lot of good posts about various parts of the web. https://thehistoryoftheweb.com/timeline/ is in a linear order; the red links denote articles that the author wrote about the topic.
#
[Ros]
↩️ Thanks [KevinMarks]! I actually have *6* talks of his lined up to binge-watch this weekend 😬
#
[Ros]
↩️ If someone asks me in the future about starting programming, I would recommend them the same to learn about the history adjacent to learning the skills. It’s making everything so much clearer
#
[Ros]
↩️ [capjamesg] Brilliant!
#
[Ros]
↩️ This looks ideal
#
[KevinMarks]
↩️ We have a list on http://indieweb.org at https://indieweb.org/videos_about_the_indieweb too, though they are a bit variable on how coding specific they are
#
[Ros]
↩️ [capjamesg] There’s an https://thehistoryoftheweb.com/book/ 😍 Goodbye friends for the next week 😆
#
[Ros]
↩️ [KevinMarks] Thank you so much!
#
[Ros]
↩️ [KevinMarks] Oh this is _amazing_ New bookmark!
#
capjamesg
I have learned a lot about the web from thehistoryoftheweb.com. I'm glad there are people like Jay who are writing such intuitive historical summaries.
#
[Ros]
↩️ This site looks so special. I’m glad it exists. I’ll write to him to say thank you after I read this whole book
#
[Ros]
↩️ Or [Jay_Hoffmann] I should say thanks to you and [Jeremy_Keith] in advance. I’m so glad you’ve made this site. My friends are not going to see me the next wee while as I gorge this
[Al_Abut], [aciccarello], gRegor, rrix, [tantek] and barnaby joined the channel
#
[tantek]
I need to figure out what I can blog publicly about the W3C AC Member meeting that just happened this past week in Hiroshima
#
[tantek]
Most of it is W3C Members-only but I think a bunch of it was recorded and there is some intent to make some of it public
#
[tantek]
Would anyone here care to learn about matters of organization, governance, process, and how new work is considered at W3C?
#
capjamesg
I would!
#
capjamesg
And probably many more people!
#
capjamesg
Also, I’d love your thoughts on the web extension / trust thing I shared earlier [tantek]!
#
[tantek]
Two things: 1) pretty sure Browser Add-ons can make arbitrary https requests (e.g. to update their lists of block sites, for ad blockers etc.),
#
capjamesg
Ah, I had this made as a web page for testing. I didn’t realise web extensions had different permissions.
#
[tantek]
and 2) naming, I think "trust" is the wrong framing and frankly misleading. the absence of a warning does NOT (must NOT) imply any degree of "trust" of the contents of a site — it is far more likely that there is an absence of information about the site. Better to name exactly what it does, which is identify *some* sources of misinformation. A browser add-on would likely also provide a warning
#
capjamesg
Right. What would you call it?
#
[tantek]
first, do you get why "source-trust" is a bad (actually misleading) name?
#
[tantek]
which is ironic on a meta level
#
capjamesg
Yeah. And it’s why the intro paragraph is not tight copy.
#
capjamesg
You can’t trust something in absence of information saying it is untrustworthy.
#
[tantek]
correct, to do so (or imply so) is a logic error
#
capjamesg
That’s like trusting a paper is complete because nobody refuted it.
#
[tantek]
and you really don't want a logic error in anything labeled "trust" 🙂
#
[tantek]
so I said: "name exactly what it does, which is identify *some* sources of misinformation" which you could easily shorten to something simple / boring like
#
[tantek]
misinfo-sites
#
[tantek]
or even more literal if you prefer
#
[tantek]
misinfo-domains
#
capjamesg
I have honestly been a bit hesitant to say disinformation because I know it’s a controversial thing. But that’s more of a personal barrier since I don’t usually deal with things of such controversy.
#
[tantek]
especially for programming libraries it is much better to ALWAYS name exactly what it does, and nothing more. no exaggerations, no flowery words
#
[tantek]
no it is not controversial, you are looking up consensus Categorization on Wikipedia
#
[tantek]
the assertion that disinformation is "controversial" is itself IMO a light form of misinformation
#
capjamesg
++ regarding naming.
#
[tantek]
people who lie have marketed the idea that "disinformation is controversial" as an attempt to get more consideration of their lies
#
[tantek]
controversy implies equivalency, which there is none in disinformation "debates"
#
capjamesg
But yeah, I’ll make the requisite naming updates. If you can come up with more concise copy for the intro in the README, feel free to submit a PR; otherwise, I’ll give it another go over the coming days.
#
[tantek]
capjamesg that article is tangentially about misinformation at best. it's more about whistleblowing, claims of funding contorting academic "honesty", and inconsistencies of apparent levels of academic freedom offered to employees of different positions (staff vs faculty)
#
[tantek]
capjamesg, sounds good. I may look at making a PR for the README on Monday if that's ok (remind me of the URL for the README?)
#
capjamesg
GitHub will handle redirects after renaming etc: https://github.com/capjamesg/source-trust
#
Loqi
[preview] [capjamesg] source-trust: Analyze the reliability of a source.
#
[tantek]
thanks capjamesg
#
[tantek]
I would say immediately strike the first sentence instead of waiting to rewrite it
#
[tantek]
you can always add an accurate intro sentence later
#
[tantek]
leaving it with a misleading intro sentence is worse than no intro sentence
[KevinMarks] joined the channel
#
[KevinMarks]
You could call it "dodgy links"
#
[KevinMarks]
If you want a word that means untrustworthy but is short
#
[tantek]
dodgy I feel implies potential security threats/problems, which this does not do any detection of at all
#
[tantek]
though does have the advantage of alliteration if you use "dodgy-domains"
#
[tantek]
lol that might actually make a good Browser Add-on name
#
[KevinMarks]
You could have dubious-domains if dodgy is too UK
#
[tantek]
I might go with disinfo though to be deliberately harsh (because the Wikipedia categories are harsh)
#
[tantek]
disinfo-domain-detector
scottishstoater, JadedBlueEyes and barnaby joined the channel