#capjamesg[d]"I’m half joking, but if we can have HTTP 418 I’m a Teapot then there is enough room in the HTTP standard for the more useful HTTP 419 Never Gonna Give You Up error code."
#capjamesg[d]Using MapReduce I could run a calculation across multiple servers?
#capjamesg[d]I figure that nodes > having one big computer to do everything.
#capjamesg[d]The link building process is, roughly: scroll through Elasticsearch and find every link in every document (this takes about 1h:30m with the 370k pages right now), calculate how many links point to each page (which involves a dictionary and iterating over every link), and then update Elasticsearch so that the new link values are all recorded.
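The counting step described here can be sketched as a single-pass dictionary tally. This is a minimal illustration, not capjamesg's actual code; the document shape and the `"links"` field name are assumptions:

```python
from collections import defaultdict

def count_inbound_links(documents):
    """Tally how many links point at each page.

    `documents` is any iterable of dicts carrying a "links" list,
    e.g. the results of scrolling an Elasticsearch index.
    """
    inbound = defaultdict(int)
    for doc in documents:
        for target in doc.get("links", []):
            inbound[target] += 1  # one more inbound link for this page
    return dict(inbound)

docs = [
    {"url": "a", "links": ["b", "c"]},
    {"url": "b", "links": ["c"]},
]
print(count_inbound_links(docs))  # {'b': 1, 'c': 2}
```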
#capjamesg[d]This runs in a few hours right now, so it's not as if I need a solution right away. But I figure that with 100-200k more documents, I'll run into trouble again.
#aaronpki thought graph databases were supposed to do this for you
#[KevinMarks]I did it with SQL tables and updated it incrementally, but Google's PageRank did it recursively so needed more computation, and they came up with MapReduce to distribute it (and also to do it in O(n) by reading each link once).
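The MapReduce idea mentioned here can be illustrated with a single-process simulation: each mapper reads a document once and emits a `(target, 1)` pair per link, the framework groups the pairs by key (across many servers in a real run), and reducers sum each group. Everything below (function names, document shape) is illustrative, not a real distributed implementation:

```python
from itertools import groupby

# Map: emit (target_url, 1) for every link found in one document.
def map_links(doc):
    return [(target, 1) for target in doc["links"]]

# Reduce: sum all the counts emitted for one target URL.
def reduce_counts(target, counts):
    return (target, sum(counts))

def mapreduce(docs):
    # Shuffle: sort and group the mapped pairs by key. A real
    # MapReduce framework does this step between machines.
    pairs = sorted(p for doc in docs for p in map_links(doc))
    return dict(
        reduce_counts(key, [count for _, count in group])
        for key, group in groupby(pairs, key=lambda p: p[0])
    )

docs = [
    {"url": "a", "links": ["b", "c"]},
    {"url": "b", "links": ["c"]},
]
print(mapreduce(docs))  # {'b': 1, 'c': 2}
```

Because each document's links are read exactly once, the whole pass stays O(n) in the number of links, and the map step parallelizes trivially across nodes.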
#capjamesg[d]Would it be more efficient to log each link as it is found in crawling?
#capjamesg[d]If I did that and rebuilt the index, I'd take 1:30 off the link calculation.
mikeputnam joined the channel
#capjamesg[d]Can you give an example of a graph database aaronpk?
#capjamesg[d]I retrieve then iterate over every object in my Elasticsearch instance. In each iteration I find all links, do a bit of qualification, and then append them to a file.
#capjamesg[d]links.csv held over 3 million links as of the last run.
#capjamesg[d]Which is why I’m looking to optimize it.
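The incremental idea floated above (log each link at crawl time instead of re-scrolling the whole index afterwards) could look like the following. This is a hypothetical helper, assuming the `links.csv` file mentioned earlier holds `source,target` rows:

```python
import csv

def log_links(csv_path, source_url, target_urls):
    """Append one (source, target) row per discovered link.

    Called as each page is crawled, so the link file is built up
    incrementally rather than by a separate 1:30 scroll pass.
    """
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for target in target_urls:
            writer.writerow([source_url, target])
```

The counting step then only has to read the CSV, not Elasticsearch.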
#GWGcapjamesg[d]: I also want some logic for backing off or speeding up polling
#GWGFor example, let's say you post once a day, I should be able to determine that and only poll daily
#GWGIf you start posting more often, I should be able to change it
#capjamesg[d]That’s something I would like to do too.
#GWGI may not do it all for the first iteration, but I want to plan it out
#capjamesg[d]Maybe a rolling average of the last month of updates?
#capjamesg[d]And maybe have a function to poll all feeds again every so often in case a site suddenly posts again.
#capjamesg[d]for instance, I could take a one month break from posting. I wouldn’t want that to mean my Microsub reader would not pick up on my posts for weeks.
#[KevinMarks]so, you should be able to close in on a rough posting frequency by doing exponential backoff, and when you get a post stop backing off, or divide the interval by the # of posts seen
#GWGcapjamesg[d]: That's the idea. I just don't want to hammer anyone's site.
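The backoff scheme sketched in this exchange (double the polling interval while a feed is quiet, shrink it when posts appear, and keep a ceiling so a dormant site is still rechecked eventually) might look like this. The bounds and the divide-by-posts rule are assumptions for illustration:

```python
MIN_INTERVAL = 3600          # poll at most hourly (seconds)
MAX_INTERVAL = 7 * 86400     # poll at least weekly, so a feed that
                             # goes quiet for a month is still found

def next_interval(current, new_posts):
    """Return the next polling interval for a feed, in seconds."""
    if new_posts == 0:
        # Exponential backoff while the feed is quiet.
        return min(current * 2, MAX_INTERVAL)
    # Posts appeared: divide the interval by the number seen,
    # clamped so we never hammer the site.
    return max(current // (new_posts + 1), MIN_INTERVAL)
```

For example, a daily poll that finds one new post drops to every 12 hours, while an empty hourly poll backs off to every two hours.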