#dev 2021-10-29

2021-10-29 UTC
[manton], hs0ucy and Agnessa[d] joined the channel
hendursaga joined the channel
#
capjamesg[d]
"I’m half joking, but if we can have HTTP 418 I’m a Teapot then there is enough room in the HTTP standard for the more useful HTTP 419 Never Gonna Give You Up error code."
#
[tantek]
I thought 419 was the small payment needed in order to retrieve a larger sum of money
#
Ruxton
that'd be a funny response for payment gateway stuff :P
[calumryan] joined the channel
#
capjamesg[d]
Ironically, that page returns a 200.
hendursa1 and kogepan joined the channel
#
jamietanna
I see it returning a 418 to me? `curl` and Firefox's browser inspection
#
capjamesg[d]
I used curl too. How odd.
#
capjamesg[d]
Firefox is good.
#
capjamesg[d]
Owning a coffee blog, I feel like I have to do something with 418 / the coffee protocol.
#
petermolnar
TIL the WordPress JSON API v2 returns the total number of pages for paginated results as an HTTP header.
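(For reference, a minimal sketch of reading those headers with `requests`; the site URL is a placeholder, and the headers in question are `X-WP-Total` and `X-WP-TotalPages`.)

```python
import requests

# Query a collection endpoint on a (placeholder) WordPress site.
resp = requests.get(
    "https://example.com/wp-json/wp/v2/posts",
    params={"per_page": 20, "page": 1},
    timeout=10,
)
resp.raise_for_status()

# WordPress reports the result size in response headers rather than the body.
total_items = int(resp.headers["X-WP-Total"])
total_pages = int(resp.headers["X-WP-TotalPages"])
print(f"{total_items} posts across {total_pages} pages")
```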
schmudde and tetov-irc joined the channel
#
GWG
jamietanna: Going to make the updates suggested over the weekend, then see if aaronpk as the editor can give the final approval for merge
KartikPrabhu and hs0ucy joined the channel
#
@fastreadInfo
https://www.info.fastread.in/what-is-web-mention/ What is Web Mention ? #Webmention use in #Blog #Fastread #Blogging
(twitter.com/_/status/1454060129553313792)
jaylamo joined the channel
#
jamietanna
Sounds good GWG!
hendursa1 and neocow joined the channel
#
GWG
jamietanna: After that, I can rewrite the token introspection endpoint PR and add a revocation endpoint PR
#
GWG
It just needs this first
chenghiz_ and jjuran joined the channel
#
capjamesg[d]
I have a new challenge to think about: how IndieWeb Search will scale.
#
capjamesg[d]
I went to run the link calculation program, usually just a background thing, but the server hit its limits 🤦
#
capjamesg[d]
I will have to think about what I can do to speed up the link calculation process at some point.
#
capjamesg[d]
Not an engineering / infra task right now but just something I am thinking about.
#
[KevinMarks]
haha you're going to invent mapreduce
#
capjamesg[d]
Using MapReduce I could run a calculation across multiple servers?
#
capjamesg[d]
I figure that nodes > having one big computer to do everything.
#
capjamesg[d]
The link building process is, roughly: scroll through Elasticsearch and find every link in every document (this takes about 1h30m with the 370k pages right now), calculate how many links point to each page (which involves a dictionary and iterating over every link), and then update Elasticsearch so that the new link values are all recorded.
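(A rough sketch of that pipeline with the `elasticsearch` Python client; the index and field names here, "pages", "outgoing_links" and "incoming_links", are invented for illustration, since the real schema isn't shown in the channel.)

```python
from collections import Counter

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk, scan

es = Elasticsearch("http://localhost:9200")

# Scroll through every document and tally inbound links per target URL
# in one pass, keeping the counts in a dictionary (Counter).
inbound = Counter()
for hit in scan(es, index="pages", _source=["outgoing_links"]):
    for link in hit["_source"].get("outgoing_links", []):
        inbound[link] += 1

# Write the counts back with a bulk update; this assumes the document _id
# is the page URL, which may not match the real index layout.
actions = (
    {
        "_op_type": "update",
        "_index": "pages",
        "_id": url,
        "doc": {"incoming_links": count},
    }
    for url, count in inbound.items()
)
bulk(es, actions, raise_on_error=False)
```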
#
capjamesg[d]
This runs in a few hours right now, so it's not as if I need a solution right away. But I figure that with 100-200k more documents, I'll run into trouble again.
#
aaronpk
i thought graph databases were supposed to do this for you
#
[KevinMarks]
I did it with SQL tables and updated it incrementally, but Google's PageRank did it recursively so needed more computation, and they came up with MapReduce to distribute it (and also to do it in O(n) by reading each link once).
#
capjamesg[d]
Would it be more efficient to log each link as it is found in crawling?
#
capjamesg[d]
If I did that and rebuilt the index, I'd take 1:30 off the link calculation.
mikeputnam joined the channel
#
capjamesg[d]
Can you give an example of a graph database aaronpk?
#
capjamesg[d]
I see how graph theory would come into this.
#
capjamesg[d]
I wonder if I could get this working on my Raspberry Pi 😄
akevinhuang joined the channel
#
capjamesg[d]
When I think about it, the current method is quite inefficient.
#
[jacky]
make it work then make it fast!
#
capjamesg[d]
Indeed [jacky]!
#
capjamesg[d]
I have something that works, it's just not scalable.
#
capjamesg[d]
The current program uses up most of the computational power available on the cloud server between Elasticsearch and the link calculation.
#
capjamesg[d]
In theory, I could store all links in a database as they are found.
#
capjamesg[d]
When pages are recrawled, all links would be deleted and replaced to ensure that any removed links were not retained in the database.
#
capjamesg[d]
But that would only defer the ingestion phase.
#
capjamesg[d]
A graph would be more efficient in that I could calculate the number of edges to a node pretty easily.
#
capjamesg[d]
Thus saving on that stage.
#
capjamesg[d]
That stage is O(n) anyway.
#
capjamesg[d]
Ingestion back into the DB could be made more efficient if only links to known sources were added.
#
capjamesg[d]
Saving a query.
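(A hypothetical sketch of the "store links as they are found" idea using SQLite; the table and column names are made up here, and the delete-then-insert step is what keeps removed links from being retained when a page is recrawled.)

```python
import sqlite3

con = sqlite3.connect("links.db")
con.execute("CREATE TABLE IF NOT EXISTS links (source TEXT, target TEXT)")
con.execute("CREATE INDEX IF NOT EXISTS idx_target ON links (target)")


def record_links(source_url, found_links):
    """Replace a page's outgoing links each time it is recrawled, so links
    that were removed from the page do not linger in the database."""
    with con:
        con.execute("DELETE FROM links WHERE source = ?", (source_url,))
        con.executemany(
            "INSERT INTO links (source, target) VALUES (?, ?)",
            [(source_url, link) for link in found_links],
        )


def incoming_link_count(target_url):
    """Counting the edges pointing at a page is a single indexed COUNT query."""
    row = con.execute(
        "SELECT COUNT(*) FROM links WHERE target = ?", (target_url,)
    ).fetchone()
    return row[0]
```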
hendursaga, aaronpk_ and jaylamo joined the channel
#
capjamesg[d]
So clever.
[jacky] joined the channel
#
@erikkroes
↩️ Is that because of webmentions, or how it's presented?
(twitter.com/_/status/1454145455412162561)
hs0ucy and [chrisaldrich] joined the channel
#
capjamesg[d]
Quick question: what is the most efficient way to write three million records to a CSV?
#
capjamesg[d]
Should I batch rows in 1,000s and then save them?
#
capjamesg[d]
Or save each row individually?
#
capjamesg[d]
The file would be open the whole time.
#
sknebel
I doubt it matters much
#
sknebel
I suspect batching does help a bit though
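(A small sketch of the batched variant, assuming the rows come from a generator; `writerows` handles each batch in one call and the file stays open the whole time, as described above.)

```python
import csv
from itertools import islice


def write_in_batches(rows, path, batch_size=1000):
    """Write an iterable of rows to CSV, flushing them in fixed-size batches."""
    rows = iter(rows)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            writer.writerows(batch)
```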
akevinhuang2 joined the channel
#
@bnijenhuis
↩️ You can get the data, how you present it is up to you. I wrote a blog about implementing webmentions, if you're interested.
(twitter.com/_/status/1454156816846295043)
#
capjamesg[d]
Thanks sknebel.
#
[KevinMarks]
Where are the rows coming from? That may be more of a block than appending to a file, which is pretty optimised everywhere.
KartikPrabhu joined the channel
#
capjamesg[d]
The rows are being generated.
#
capjamesg[d]
I retrieve and then iterate over every object in my Elasticsearch instance. In each iteration I find all links, do a bit of qualification, and then append them to a file.
#
capjamesg[d]
links.csv held over 3 million links as of the last run.
#
capjamesg[d]
Which is why I’m looking to optimize it.
#
[KevinMarks]
when I'm writing Python mungers like this I use Dabeaz's coroutine pattern, which makes it easier to buffer things without rewriting too much - read this up to the end of part 2 https://www.dabeaz.com/coroutines/Coroutines.pdf
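(Roughly what that pattern looks like applied to buffering CSV rows; this is a sketch based on the linked slides, not anyone's actual code. The decorator just primes the generator so it is ready to receive `send()` calls.)

```python
import csv


def coroutine(func):
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first yield so send() works immediately
        return gen
    return start


@coroutine
def csv_sink(path, batch_size=1000):
    """Accept rows via send(), buffer them, and flush in batches."""
    buffer = []
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        try:
            while True:
                buffer.append((yield))
                if len(buffer) >= batch_size:
                    writer.writerows(buffer)
                    buffer.clear()
        except GeneratorExit:
            writer.writerows(buffer)  # flush whatever is left on close()


# Usage: the producer just calls sink.send(row) for each row and sink.close()
# at the end, without knowing anything about the buffering.
```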
#
GWG
That reminds me, I need a good algorithm to figure out polling frequencies for h-feeds
#
capjamesg[d]
I need one too 😅
#
capjamesg[d]
Good share [KevinMarks]. I can’t claim to understand it much right now but I’ll give it a go!
#
GWG
capjamesg[d]: I'll tell you what I decide
#
capjamesg[d]
Thanks! For Microsub I just poll the feed every hour.
#
capjamesg[d]
For IndieWeb search I do the same thing every day.
#
GWG
capjamesg[d]: There are things you can do with that
#
GWG
Websub
#
GWG
Last Modified
#
GWG
Etags
#
GWG
Etc
#
capjamesg[d]
I do all of that for IndieWeb search.
#
capjamesg[d]
And in Microsub.
#
capjamesg[d]
But WebSub isn’t ready yet for Microsub.
#
GWG
I don't yet, so I need to come up with the logic
#
capjamesg[d]
My implementation is in Python but I can share anyway: https://github.com/capjamesg/microsub/blob/main/poll_feeds.py
#
capjamesg[d]
I know you like PHP 🙂
#
GWG
capjamesg[d]: I also want some logic for backing off or speeding up polling
#
GWG
For example, let's say you post once a day, I should be able to determine that and only poll daily
#
GWG
If you start posting more often, I should be able to change it
#
capjamesg[d]
That’s something I would like to do too.
#
GWG
I may not do it all for the first iteration, but I want to plan it out
#
capjamesg[d]
Maybe a rolling average of the last month of updates?
#
capjamesg[d]
And maybe have a function to poll all feeds again every so often in case a site suddenly posts again.
#
capjamesg[d]
For instance, I could take a one-month break from posting. I wouldn’t want that to mean my Microsub reader would not pick up on my posts for weeks.
#
GWG
Exactly.
hs0ucy joined the channel
#
GWG
capjamesg[d]: I would probably set the maximum at daily
sebbu joined the channel
#
capjamesg[d]
For Microsub that seems prudent.
#
capjamesg[d]
I get Garfield every day so there is no need to poll 24 times.
#
capjamesg[d]
The same with Dilbert.
#
capjamesg[d]
But I also get my GitHub feed which changes more frequently.
#
capjamesg[d]
I’d go hourly minimum - daily maximum.
#
[KevinMarks]
so, you should be able to close in on a rough posting frequency by doing exponential backoff, and when you get a post stop backing off, or divide the interval by the # of posts seen
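(A sketch of that backoff idea, clamped to the hourly/daily bounds mentioned earlier; the doubling factor and the per-feed state handling are guesses, not anyone's actual implementation.)

```python
MIN_INTERVAL = 60 * 60        # never poll more often than hourly
MAX_INTERVAL = 24 * 60 * 60   # never poll less often than daily


def next_interval(current_interval, new_posts_seen):
    """Return the number of seconds to wait before the next poll."""
    if new_posts_seen:
        # Activity: stop backing off, and poll sooner the more posts appeared.
        proposed = current_interval / new_posts_seen
    else:
        # Quiet feed: back off exponentially.
        proposed = current_interval * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```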
#
GWG
capjamesg[d]: That's the idea... I just don't want to hammer anyone's site.
#
GWG
[KevinMarks]: That's roughly my plan
#
GWG
But first I want to support 304 and such.
#
[KevinMarks]
an approach that might fit would be to use a WebSub model, and have a polling proxy that sends websub to the core indexer
#
[KevinMarks]
yes, 304, ETag, Last-Modified is good
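(A sketch of a conditional fetch with `requests`, replaying the validators from the previous poll and treating 304 as "nothing new"; how the ETag and Last-Modified values are stored per feed is left out here.)

```python
import requests


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch a feed only if it changed since the last poll."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        # Unchanged: skip re-parsing and keep the old validators.
        return None, etag, last_modified

    return (
        resp.text,
        resp.headers.get("ETag"),
        resp.headers.get("Last-Modified"),
    )
```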
#
GWG
I don't think my site supports it, I should enable it
tetov-irc joined the channel
#
[tantek]
Backing off to weekly is probably fine too
jaylamo joined the channel