#dev 2021-08-27

2021-08-27 UTC
rockorager, Seirdy, jeremy, gerben, jeremycherfas and hendursa1 joined the channel
# 08:19 
capjamesg[d] Someone started a discussion about h-resumes somewhere.
# 08:19 
capjamesg[d] GWG maybe?
# 08:19 
capjamesg[d] Anyway...
# 08:19 
capjamesg[d] I think there's something interesting about assuming data marked up with mf2 is fine to use without permission.
# 08:20 
capjamesg[d] My initial thought was this is okay because the data has been deliberately structured in a way that makes it readable by external applications.
# 08:20 
capjamesg[d] Then again, I thought to myself that reading mf2 might be treated like web scraping if you did it across multiple sites without permission.
# 08:21 
capjamesg[d] I wonder: can anyone read structured data if it is for a search engine?
# 08:21 
capjamesg[d] I actually asked myself this question about a month ago. If businesses mark up their coffee products with JSON-LD (which they often do because of Shopify's SEO features, etc.) then could that be used to make a search engine?
# 08:21 
capjamesg[d] (I don't want to do this. It's a rather interesting thought experiment.)
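(Illustration, not from the channel: a minimal sketch of the thought experiment above, assuming a shop page embeds schema.org Product data in a `<script type="application/ld+json">` block. The sample HTML, the product and brand names, and the `JsonLdExtractor` class are all hypothetical; only Python's standard library is used.)

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script block on a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than failing the whole page


# Hypothetical page; a real crawler would fetch this over HTTP.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Espresso Blend",
 "brand": {"@type": "Brand", "name": "Example Roasters"}}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
products = [
    item for item in extractor.items
    if isinstance(item, dict) and item.get("@type") == "Product"
]
print([p["name"] for p in products])
```

Whether it is acceptable to collect such data across many shops is the open question in the discussion; the sketch only shows that the markup is straightforward to read.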
# 08:27 
sknebel making a search engine is IMHO fine, certainly as long as you behave (respect robots.txt, reasonable rate-limit, ...)
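(Illustration, not from the channel: one way to "behave" as described above, sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `example-indexer` user agent, and the helper names are assumptions for the example; the rules are parsed from a string so the sketch runs offline.)

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would load it with set_url() and read().
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

USER_AGENT = "example-indexer"  # hypothetical crawler name

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed_urls(urls):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [u for u in urls if robots.can_fetch(USER_AGENT, u)]


def polite_delay(default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = robots.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch permitted URLs one at a time, pausing between requests."""
    results = []
    for i, url in enumerate(allowed_urls(urls)):
        if i > 0:
            sleep(polite_delay())
        results.append(fetch(url))
    return results
```

Passing `fetch` and `sleep` in as parameters keeps the sketch testable without network access or real waiting.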
# 08:55 
capjamesg[d] I am not in the mood to build *another* search engine 😄
# 08:55 
capjamesg[d] Just improving mine.
# 08:56 
capjamesg[d] The next feature coming up is named entity recognition. So if you type in "cairngorm coffee founder" you'll get the exact answer in bold before any search results.
# 09:16 
sknebel ok, that is kind of advanced :D
# 09:16 
capjamesg[d] Technically the logic could be applied to any index. But it's just for my blog right now.
# 09:18 
capjamesg[d] I am now starting to run into performance bottlenecks though. Trimming my code will help but I may need to look into loading models faster.
# 09:18 
capjamesg[d] And I need to train some models on my blog rather than default datasets.
hendursa1, tetov-irc and [KevinMarks] joined the channel
# 11:28 
[KevinMarks] reading data from the web is OK, but combining it with other information to create profile on an individual is something that GDPR explicitly calls out - see the discussion of profiling here https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general[…]ights-related-to-automated-decision-making-including-profiling/
# 11:37 
capjamesg[d] That makes sense.
# 11:38 
capjamesg[d] Because aggregated data doesn't have the same integrity as primary data.
# 11:38 
capjamesg[d] Whereas one source paints a picture at a moment in time, if you aggregate two sources from different moments in time without being clear on that, you can paint a picture that the person who created the data would not think is representative.
# 11:39 
capjamesg[d] Well, it's not that aggregated data doesn't have the same integrity. Bad choice of word. I hope people get my point 🙂
# 12:39 
capjamesg[d] What is search?
# 12:39 
Loqi search in the context of the IndieWeb refers to being able to search your personal site for your own content https://indieweb.org/search
chenghiz_, doosboox8, hendursaga and [fluffy] joined the channel
# 17:14 
[fluffy] Hm, I wonder if what Authl does with user profiles counts as a potential GDPR violation. It only collects data that’s publicly-visible on the user profile that someone logs in as, and in Publ it only holds onto it for the purpose of displaying the user profile for making admin decisions and to make the site slightly friendlier to the person who logged in, but I’ve gotten a couple of folks sending me GDPR disclosure requests with tone that implies that they think I’ve potentially overstepped.
# 17:16 
[fluffy] that ‘decision making’ thing doesn’t apply at all at least.
# 17:17 
[snarfed] [fluffy] in practice, pretty much all indieweb sites and services are exempt from GDPR (and similar laws elsewhere like CCPA and LGPD) because we're small and generally non-commercial. details: https://brid.gy/about#gdpr
# 17:18 
[fluffy] Yeah, that’s what I tell folks when I respond, that 1) beesbuzz.biz is a personal site and 2) the sum total of what I collect can be seen at https://beesbuzz.biz/profile
# 17:19 
[fluffy] In principle someone could use Authl in a way that violates the GDPR but Authl itself doesn’t do any profile storage, it just forwards along to the app that uses it.
# 17:19 
sknebel "non-commercial" is not the same as "household activity", still ;) (but still of course little to worry about, especially if you are outside EU)
[aciccarello] joined the channel
# 17:19 
sknebel what's legal vs what people feel is acceptable is another can of worms either way
# 17:19 
[snarfed] totally!
# 17:20 
sknebel (+ vs what people think the law says...)
# 17:20 
[fluffy] right
# 17:21 
[snarfed] I love that we think about data sovereignty and custody, privacy, ethics etc. it can also be fun to think about indieweb's specific legal compliance w/GDPR etc, but that's not important or useful in practice, so we should probably spend a lot less time on that part than we currently do
# 17:21 
[snarfed] but laws are kind of catnip to engineers. they're (theoretically) rules you can parse and evaluate, vs ethics which is fuzzier
# 17:21 
[fluffy] I think it’s important to consider just because of how certain privacy wonks will have questions they want answered
# 17:22 
[snarfed] eh. no obligation to feed the trolls (or anyone) every time they ask a question
# 17:22 
sknebel not seen that much time spent on it by people that aren't affected by it :shrug: (i.e. most came from commercial devs in the EU who very much are at least at risk of falling under it)
# 17:22 
[snarfed] obligatory: https://xkcd.com/386/
# 17:22 
Loqi [XKCD] Duty Calls https://imgs.xkcd.com/comics/duty_calls.png
# 17:23 
sknebel yeah, reaching for the GDPR bat if you want to know what some small site is doing is annoying, just ask first
# 17:23 
[fluffy] yeah I mean you don’t have to spend a lot of time responding to each one individually but it’s good to have a place to point them to
# 17:24 
[snarfed] sure. https://indieweb.org/GDPR looks pretty good. might deserve a clearer "The Indieweb is pretty much entirely exempt!" warning at the top, but otherwise seems ok
# 17:29 
capjamesg[d] snarfed How much mf2 do you use?
# 17:29 
capjamesg[d] I only read a bit for search right now but I might need to add more support.
# 17:29 
[snarfed] how...much?
# 17:29 
capjamesg[d] Sorry... I meant how many different mf2 formats do you use?
# 17:29 
capjamesg[d] (i.e. h-entry, h-card)
# 17:30 
[snarfed] got me. maybe 10?
# 17:30 
capjamesg[d] 👍
# 17:31 
capjamesg[d] This is the time I wish I bought an Apple Mac.
# 17:31 
[snarfed] on the plus side, one of my primary use cases for indiemap was to answer exactly these kinds of questions, across the community as a whole
# 17:32 
capjamesg[d] I loved your charts in the talk. I had read them on your IndieMap site a little while ago though so they weren't new to me.
# 17:32 
capjamesg[d] I loved the distribution of rel=me link chart. That made me laugh.
# 17:32 
capjamesg[d] As did your 15 or so line crawler.
# 17:32 
[snarfed] eg in the BigQuery UI, https://console.cloud.google.com/bigquery?project=indie-map&ws=!1m9!1m3!3m2!1sindie-map!2sindiemap!1m4!4m3!1sindie-map!2sindiemap!3spages!1m0&j=bq:us:bquxjob_7e75f57_17b88a9736d&page=queryresults
# 17:32 
[snarfed] this query counts the pages on my site, `SELECT count(*) FROM `indie-map.indiemap.pages` where domain = 'snarfed.org'`
# 17:32 
capjamesg[d] My crawler / indexer is roughly 831 lines of code.
# 17:32 
[snarfed] hah nice!
# 17:32 
capjamesg[d] Probably add another 400.
# 17:33 
capjamesg[d] But I'm processing web pages, reading markup, etc.
# 17:33 
[snarfed] let's see if I can write a query for distinct mf2 classes on my site
# 17:33 
capjamesg[d] I am at 2013 now on your site haha.
# 17:34 
[snarfed] yup, looks like 10 mf2 classes on my site. SQL: SELECT class, COUNT(*) AS num FROM indiemap.pages p, p.mf2_classes class WHERE domain = 'snarfed.org' GROUP BY class ORDER BY num DESC
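(Illustration, not from the channel: a rough local counterpart to the query above, for pages already held as HTML strings. It counts how many pages contain each mf2 root class, meaning class names starting with `h-`. How Indie Map fills its `mf2_classes` column is not stated in the chat, so treating it as one entry per root class per page is an assumption, and the sample pages and helper names are hypothetical.)

```python
import re
from collections import Counter
from html.parser import HTMLParser

# mf2 root classes look like h-entry or h-card.
ROOT_CLASS = re.compile(r"^h-[a-z]+(-[a-z]+)*$")


class RootClassCollector(HTMLParser):
    """Collect the distinct mf2 root classes used anywhere on one page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(
                    c for c in value.split() if ROOT_CLASS.match(c)
                )


def count_root_classes(pages):
    """Return (class, page_count) pairs, most common first."""
    counts = Counter()
    for html in pages:
        collector = RootClassCollector()
        collector.feed(html)
        counts.update(collector.classes)
    return counts.most_common()


# Hypothetical pages standing in for a crawled site.
PAGES = [
    '<article class="h-entry"><a class="p-author h-card" href="/">A</a></article>',
    '<div class="h-card">B</div>',
]

print(count_root_classes(PAGES))
```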
# 17:34 
capjamesg[d] Amazing!
# 17:35 
capjamesg[d] How is the data in the WARC file ordered?
# 17:36 
capjamesg[d] Do you keep your /<list> lists up to date like /chocolate and /beer?
# 17:40 
capjamesg[d] Oh I really need a faster computer...
# 17:40 
[snarfed] requests in the WARCs are probably unordered
# 17:41 
[snarfed] yeah most of the lists are evergreen and updated
# 17:41 
capjamesg[d] I have a few similar lists for coffee but I have never made a list of all the roasters from which I have ordered coffee. I should do that.
# 17:42 
[snarfed] all pages should have both dt-published and dt-updated if updated later
# 17:44 
[snarfed] definitely! i'm kinda anti quantified self (for myself personally), but i like keeping those lists of things as a way to remember what i liked, didn't like, etc
# 17:44 
[snarfed] i tried a few variations and found that good/meh/bad was the right granularity for rating, but to each their own
# 17:44 
capjamesg[d] Me too. I don't update my lists very often but I do reference them every now and again in conversation.
# 17:44 
[snarfed] (also my coffee page is very incomplete, ah well)
# 17:44 
capjamesg[d] So they have a use case.
# 18:01 
capjamesg[d] snarfed It's still going.
# 18:03 
[snarfed] 😆
# 18:03 
capjamesg[d] About that...
# 18:04 
capjamesg[d] 15 seconds later the program failed.
# 18:04 
[snarfed] aww
# 18:04 
capjamesg[d] I'm just going to index 1000 right now to make sure everything is okay before continuing.
# 18:04 
[snarfed] unrelated, your next project is to find some human raters and start doing a baseline comparison against my existing stock wordpress search 😆
# 18:04 
capjamesg[d] Funnily enough I was thinking about this recently.
# 18:07 
capjamesg[d] Well...
# 18:07 
capjamesg[d] "poutine" returns a result 😄
# 18:07 
capjamesg[d] https://snarfed.org/2015-12-05_16411
# 18:07 
capjamesg[d] inurl: filter working fine.
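(Illustration, not from the channel: a minimal sketch of how an `inurl:` operator could be separated from ordinary search terms and applied to results. The actual implementation is not shown in the chat; the `parse_query` and `search` helpers and the tiny in-memory index below are hypothetical.)

```python
def parse_query(query):
    """Split a query into plain terms and inurl: filters (all lowercased)."""
    terms, url_filters = [], []
    prefix = "inurl:"
    for token in query.split():
        if token.lower().startswith(prefix) and len(token) > len(prefix):
            url_filters.append(token[len(prefix):].lower())
        else:
            terms.append(token.lower())
    return terms, url_filters


def search(index, query):
    """Return URLs whose text contains every term and whose URL matches every filter."""
    terms, url_filters = parse_query(query)
    hits = []
    for url, text in index.items():
        words = text.lower().split()
        if all(t in words for t in terms) and all(
            f in url.lower() for f in url_filters
        ):
            hits.append(url)
    return hits


# Hypothetical index mapping URL -> page text.
INDEX = {
    "https://example.com/2015/poutine": "poutine at the pub",
    "https://example.com/2020/fries": "fries and poutine",
}

print(search(INDEX, "poutine inurl:2015"))
```

A real index would use an inverted index or a search library rather than scanning every page, but the query-splitting step would look similar.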
# 18:08 
Loqi [Ryan Barrett] poutine!
At The Monk’s Kettle with Haruki Oh.
# 18:08 
[snarfed] it's alive!
# 18:08 
capjamesg[d] n=1000
jjuran, KartikPrabhu, [jgmac1106], alex11, tetov-irc, Seirdy and wackycity[d] joined the channel