#dev 2021-08-28

2021-08-28 UTC
jeremycherfas, nertzy_ and gerben joined the channel
hendursa1 joined the channel
#
capjamesg[d]
Odd question: does anyone know anything about good cloud computing options for compute-heavy operations (preferably ones that will not break the bank)?
#
capjamesg[d]
DO seems a bit pricey.
tetov-irc, rockorager and jamietanna joined the channel
#
jamietanna
capjamesg[d] give hetzner.cloud or Scaleway a go for potentially cheaper - but often to get compute heavy, it's gonna be fairly expensive :(
rockorager and christin joined the channel
#
capjamesg[d]
Hoping the Discord-IRC bridge sends a link ^
#
capjamesg[d]
!tell snarfed ^
#
Loqi
Ok, I'll tell them that when I see them next
hendursaga joined the channel
#
sknebel
capjamesg[d]: can you be more specific? do you need something to run 24/7, what limit are you hitting right now, ...
#
capjamesg[d]
sknebel I am just thinking about scale.
#
capjamesg[d]
The way things are going I could feasibly index a few thousand pages per hour with the PythonAnywhere server I am using.
#
capjamesg[d]
(But that server is limited by CPU usage so I'm only using it to relieve my computer of some work)
#
capjamesg[d]
This is indexing from IndieMap WARC files, not the web itself.
#
capjamesg[d]
Say I wanted to index 100 IndieWeb sites. That would take at least 10 days 😄 And that's from a file, not the web 🙂
#
sknebel
ok, but python anywhere probably is fairly limited, so I'd just start with a $5-$10 small-ish VPS somewhere and see how fast that goes
#
sknebel
(when I hear "compute-heavy" my mind went a few levels larger ;))
#
sknebel
ingesting WARCs is potentially something you could do more optimized when going deep into the options of some cloud offering, but tbh that sounds like more hassle to me
#
sknebel
so hetzner cloud or something sounds like a good starting point
#
sknebel
(if someone knows how they'd do that kind of thing with a more cloud-focussed setup I'd be curious though, I have no real sense how those compare at small scale)
#
capjamesg[d]
So would I.
#
capjamesg[d]
Ingesting from WARC is fast(ish) compared to actually crawling documents.
#
capjamesg[d]
Because then request time / processing needs to be taken into account.
#
capjamesg[d]
With a bit of Python wizardry I think I can get it down to about 15 mins for 5,000 documents.
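
To put the figures above on one scale, here is a rough back-of-the-envelope sketch. The only number taken from the conversation is the target of 5,000 documents in 15 minutes. The function names and the one-million-document corpus are hypothetical and only illustrate the arithmetic.

```python
def docs_per_hour(docs, minutes):
    """Convert a batch measurement (docs processed in N minutes) to an hourly rate."""
    return docs * 60 / minutes


def hours_to_index(total_docs, rate_per_hour):
    """Estimate wall-clock hours needed to index a corpus at a steady rate."""
    return total_docs / rate_per_hour


# Target mentioned in the chat: 5,000 documents in 15 minutes.
target_rate = docs_per_hour(5000, 15)

# Hypothetical corpus size, chosen only to make the estimate concrete.
hypothetical_corpus = 1_000_000
estimate = hours_to_index(hypothetical_corpus, target_rate)

print(f"{target_rate:.0f} docs/hour, about {estimate:.0f} hours for the corpus")
```

The estimate assumes a constant rate, which ignores request latency and per-document processing differences, so it only applies to ingesting from files rather than crawling.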
rockorager joined the channel
#
capjamesg[d]
And my image search engine would take everything to another level.
#
capjamesg[d]
Because then images need to be downloaded, checked, compared, and optimized before being indexed. But I'm not even going to attempt indexing images.
#
rockorager
capjamesg[d]: Linode has $100 credits so you could test that out for free for a few months (I think you have 60 days to use the $100)
#
capjamesg[d]
Good call rockorager. Thanks!
#
aaronpk
If you're working with warc files do you even need this to be in the cloud? Why not use a desktop computer where it's cheaper to get a fast processor?
#
capjamesg[d]
Good question. It's more preference and about not wanting to leave my desktop running almost maxed out while these operations go on.
#
sknebel
(I'm also planning a small crawler project, but will just run that on one of my VPSes for a bit and see how that goes)
#
[jeremycherfas]
I’ve got myself in a bit of a muddle since GitHub deprecated passwords. I have a PAT working fine on my desktop, but not on my laptop. And when I use the PAT on the laptop I cannot authenticate. Is there a simple option? Am I best off to create a new PAT for the laptop only?
chenghiz_ joined the channel
#
sknebel
token per device sounds good
#
sknebel
(assuming they are set up to have that, but I think they do)
#
[jeremycherfas]
I guess I had better try that next. Tomorrow. I’ve had enough excitement for one day.
[chrisaldrich] and maxwelljoslyn[d] joined the channel
#
capjamesg[d]
Good example sknebel. I have tried not to use too many "luxurious" resources.
#
capjamesg[d]
(i.e. using smaller language models vs. larger ones with only slightly more accuracy)
jamietan1a joined the channel
#
capjamesg[d]
jeena.net is now in the index.
#
capjamesg[d]
There are at least 7,000 records collectively now.
#
doosboox
capjamesg[d]: how big is your db vs the amount of data you've parsed?
#
capjamesg[d]
Let's see.
#
capjamesg[d]
I need to index HTML for featured snippet support.
#
capjamesg[d]
That doesn't work for multiple sites quite yet because the logic was oriented around my site and its structure. But I will index HTML anyway for when that support comes (i.e. for "who is" queries to show h-cards).
#
capjamesg[d]
So I expect to have quite a big DB.
#
capjamesg[d]
With 8329 records.
#
@JamieTanna
Right now I'm attending the #IndieWeb #IndieAuth pop-up to look at further improving the specifications, and making it easier for folks to implement and integrate (https://www.jvt.me/mf2/2021/08/1jpqv/)
(twitter.com/_/status/1431682508140302337)
#
jamietanna
aaronpk rack-oauth2 (a common Ruby OAuth2 client) checks errors based on status code https://github.com/nov/rack-oauth2/blob/a8f1d2acd698cc13b6e102ba6fbf5a202d3b9d57/lib/rack/oauth2/client.rb#L150-L158
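
For readers unfamiliar with the pattern, here is a minimal Python sketch of a client that branches on the HTTP status code before looking at the response body. It is an illustration of the general idea only, not a port of rack-oauth2. The names `TokenError` and `parse_token_response` are hypothetical, and the `error` and `error_description` fields are the ones defined for OAuth 2.0 error responses.

```python
import json


class TokenError(Exception):
    """Raised when the token endpoint answers with a non-success status."""

    def __init__(self, status, error, description=None):
        super().__init__(f"{status}: {error}")
        self.status = status
        self.error = error
        self.description = description


def parse_token_response(status, body):
    """Return the parsed token response, or raise TokenError.

    The decision between success and failure is made purely on the
    status code; the body is only inspected afterwards.
    """
    if status in (200, 201):
        return json.loads(body)

    # Error path: the body may or may not be a JSON object.
    try:
        payload = json.loads(body)
    except ValueError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    raise TokenError(
        status,
        payload.get("error", "unknown_error"),
        payload.get("error_description"),
    )
```

A client written this way treats an error body sent with a 200 status as a success, which is why the status code a server returns matters for interoperability.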
[dmitshur] joined the channel
#
GWG
The spec says 'expires_in'; where did we get 'expires_at'? I'd prefer it. I convert to it now. https://github.com/indieweb/indieauth/issues/81
#
Loqi
[dshanske] #81 Adopt Expiration and Refresh Tokens into the Spec
#
aaronpk
I typo'd my comment, should have been expires_in
#
aaronpk
expires_at isn't a thing
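
The conversion GWG mentions, from the relative `expires_in` in a token response to an absolute expiry time kept locally, might look something like this. It is a sketch under assumptions: `expires_at` here is only a local storage convention, not a response field, and the function names and the 30-second leeway are hypothetical.

```python
import time


def expires_at_from_response(token_response, received_at=None):
    """Turn a relative expires_in (seconds) into an absolute Unix timestamp.

    Returns None when the response carries no expires_in.
    """
    if received_at is None:
        received_at = time.time()
    expires_in = token_response.get("expires_in")
    if expires_in is None:
        return None
    return received_at + int(expires_in)


def is_expired(expires_at, now=None, leeway=30):
    """True if the token is past (or within `leeway` seconds of) its expiry."""
    if expires_at is None:
        return False  # no expiry information stored
    if now is None:
        now = time.time()
    return now >= expires_at - leeway
```

Storing the absolute time means the client does not also need to remember when the response was received.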
#
GWG
aaronpk Do you just manually edit the html for the spec? Or is there a trick to it?
#
Zegnat
I actually mess up spec edits constantly. You change public/source/index.php (and maybe some other file in /source), then you have to get the output HTML of that file and put it in public/spec/index.html
#
Zegnat
So for every single change you have to edit at least 2 files
#
Zegnat
(For an example of me messing up, see https://github.com/indieweb/indieauth/pull/80 where I had to open a secondary PR because I only changed the public file)
#
Loqi
[Zegnat] #80 Add & to Example 6
KartikPrabhu joined the channel
#
aaronpk
yes, edit the one in the "source" folder
#
aaronpk
don't worry about updating the other files until we're ready to publish a version
#
GWG
Thanks, that probably saved me some confusion
tetov-irc and Rattroupe joined the channel
#
[fluffy]
Are there any codified standards for what an IndieAuth profile should include? I see that @rattroupe has added support for outgoing profiles but I don’t see any authoritative/canonical list of what fields are expected to be presented, just a vague-ish example in the IndieAuth spec itself.
#
[fluffy]
Authl has been requesting the profile scope for quite some time now but doesn’t actually make use of any fields returned because it just gets it from the h-card on the profile page.