capjamesg[d]"I’m half joking, but if we can have HTTP 418 I’m a Teapot then there is enough room in the HTTP standard for the more useful HTTP 419 Never Gonna Give You Up error code."
capjamesg[d]The link building process is, roughly: scroll through Elasticsearch and find every link in every document (this takes about 1h 30m with the 370k pages right now), calculate how many links point to each page (which involves a dictionary and iterating over every link), and then update Elasticsearch so that the new link values are all recorded.
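A minimal sketch of that count-and-update step, assuming the Python `elasticsearch` client, a hypothetical `pages` index with a `links_out` field, the page URL as the document `_id`, and an `incoming_links` field to hold the result (all of these names are placeholders, not the actual schema):

```python
from collections import Counter

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch("http://localhost:9200")

# Pass 1: scroll through every document and tally inbound links.
# "links_out" is an assumed field holding each page's outgoing URLs.
inbound = Counter()
for hit in scan(es, index="pages", query={"query": {"match_all": {}}}):
    for url in hit["_source"].get("links_out", []):
        inbound[url] += 1

# Pass 2: write the new counts back in one bulk partial-update pass.
def updates():
    for url, count in inbound.items():
        yield {
            "_op_type": "update",
            "_index": "pages",
            "_id": url,  # assumes the page URL is the document _id
            "doc": {"incoming_links": count},
        }

bulk(es, updates())
```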
capjamesg[d]This runs in a few hours right now, so it's not as if I need a solution right away. But I figure that with 100-200k more documents, I'll run into trouble again.
[KevinMarks]I did it with SQL tables and updated it incrementally, but Google's PageRank did it recursively so it needed more computation, and they came up with MapReduce to distribute it (and also to do it in O(n) by reading each link once).
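For contrast, a toy sketch of why PageRank needs repeated passes where a plain link count needs only one: each iteration redistributes every page's score across its outgoing links until the scores settle (simple power iteration, not Google's actual implementation):

```python
def pagerank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Everyone starts each round with the "random jump" baseline.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue  # ignore dangling pages in this toy version
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                if target in new_rank:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```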
capjamesg[d]I retrieve and then iterate over every object in my Elasticsearch instance. In each iteration I find all links, do a bit of qualification, and then append them to a file.
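That extract-and-qualify pass might look something like this, assuming each document stores its own address in a `url` field and its HTML in a `content` field, and using BeautifulSoup for parsing; the qualification rules shown (skip nofollow, drop fragments) are illustrative guesses:

```python
from urllib.parse import urljoin, urldefrag

from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

with open("links.txt", "w") as out:
    for hit in scan(es, index="pages", query={"query": {"match_all": {}}}):
        source = hit["_source"]
        soup = BeautifulSoup(source.get("content", ""), "html.parser")
        for a in soup.find_all("a", href=True):
            # Example qualification: skip nofollow links, drop fragments.
            if "nofollow" in a.get("rel", []):
                continue
            url, _ = urldefrag(urljoin(source["url"], a["href"]))
            out.write(url + "\n")
```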
capjamesg[d]For instance, I could take a one-month break from posting. I wouldn't want that to mean my Microsub reader wouldn't pick up on my posts for weeks once I started again.
[KevinMarks]So you should be able to close in on a rough posting frequency by doing exponential backoff: when you get a post, stop backing off, or divide the interval by the number of posts seen.
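A rough sketch of that polling policy; the interval bounds and the doubling factor are arbitrary choices for illustration, not anything from the Microsub spec:

```python
MIN_INTERVAL = 15 * 60        # 15 minutes, in seconds
MAX_INTERVAL = 7 * 24 * 3600  # one week

def next_interval(current, new_posts):
    """Return the next polling delay for a feed, in seconds."""
    if new_posts == 0:
        # Quiet feed: back off exponentially, up to a ceiling.
        return min(current * 2, MAX_INTERVAL)
    # Active feed: divide the interval by the number of new posts seen,
    # which closes in on the feed's actual posting frequency.
    return max(current / new_posts, MIN_INTERVAL)
```

Each time the reader polls a feed, it feeds the number of new entries back in to get the next delay, so a long posting break stretches the interval gradually instead of freezing it.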