capjamesg The use case was to crawl all links on a page of my blog for my search engine. If a link is discovered that is not in the sitemap, it should be added to the to-crawl list.
[KevinMarks] a common pattern is reference counting, so you do `d[url] = d.get(url, 0) + 1`; then you have a dict mapping each URL to the number of references to it, which you can then sort by count to decide which to crawl first
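(A quick sketch of the pattern KevinMarks describes, in Python; the `links` list here is a hypothetical input of discovered URLs, not from the chat:)

```python
# Count how many times each URL is referenced.
links = ["/a", "/b", "/a", "/c", "/a"]  # hypothetical discovered links
d = {}
for url in links:
    d[url] = d.get(url, 0) + 1

# Sort URLs by reference count, most-referenced first, to prioritize crawling.
crawl_order = sorted(d, key=d.get, reverse=True)
print(crawl_order)  # ['/a', '/b', '/c']
```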
[Will_Monroe] and [tw2113_Slack_] joined the channel
capjamesg Since this is just for my blog -- and the crawl time is only a few mins -- I don't think reference counting is necessary. But that is very, very interesting. I'll keep that in mind!
[snarfed] anyone good at working with images? I’m looking for a way to automatically generate an alpha channel for a given image based on a background color, e.g. based on white in http://localhost/bridgy_logo.jpg, and get a full alpha channel (e.g. in a PNG version), not just on/off. any ideas?
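(One possible approach, sketched with Pillow and numpy; neither library nor the white-background default comes from the chat. The idea is to map each pixel's color distance from the background to a graded alpha value rather than an on/off mask:)

```python
from PIL import Image
import numpy as np

def alpha_from_background(path, background=(255, 255, 255)):
    """Return an RGBA image whose alpha grades with distance from `background`."""
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=float)
    # Euclidean distance of each pixel from the background color (0..~441).
    dist = np.linalg.norm(pixels - np.array(background, dtype=float), axis=2)
    # Pixels matching the background become fully transparent; pixels more
    # than 255 units away become fully opaque, with a smooth ramp in between.
    alpha = np.clip(dist, 0, 255).astype(np.uint8)
    rgba = np.dstack([pixels.astype(np.uint8), alpha])
    return Image.fromarray(rgba, mode="RGBA")

alpha_from_background("bridgy_logo.jpg").save("bridgy_logo.png")
```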
capjamesg Any good reads on how search engines decide when to crawl content? I'm using recursion to immediately crawl any URL that is discovered, but that doesn't seem elegant or like the most practical solution.
capjamesg Basic logic is this: find URLs in the sitemap, crawl each page, add any newly discovered URL to the "to crawl" list, add information about the page to the DB, and go on to the next. And so on.
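(A minimal sketch of that loop in Python, replacing the recursion with a work queue; requests and BeautifulSoup are assumptions here, not capjamesg's actual stack:)

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_urls, domain):
    """Crawl start_urls breadth-first, queueing newly discovered same-domain links."""
    to_crawl = deque(start_urls)   # the "to crawl" list, seeded from the sitemap
    seen = set(start_urls)
    while to_crawl:
        url = to_crawl.popleft()
        html = requests.get(url, timeout=10).text
        # ... add information about the page to the DB here ...
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            # Queue links we haven't seen yet, staying on the same domain.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                to_crawl.append(link)

crawl(["https://example.com/"], "example.com")  # hypothetical starting point
```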
capjamesg Well, thanks to my tendency to overengineer, my blog search engine now supports reading robots.txt directives and link discovery within articles (even though all of my links are always in my sitemap).
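(For the robots.txt part, Python's standard library covers the basics; a sketch, not necessarily what capjamesg built, with the user agent name and URLs as placeholders:)

```python
from urllib import robotparser

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Check the directives before crawling a URL.
if rp.can_fetch("my-search-crawler", "https://example.com/some-page/"):
    ...  # safe to fetch this page
```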
Loqi It looks like we don't have a page for "SingleFileZ" yet. Would you like to create it? (Or just say "SingleFileZ is ____", a sentence describing the term)