[capjamesg] The use case was to crawl all links on a page on my blog for my search engine. If one is discovered that is not in the sitemap, that link should be added to the to-crawl list.
[capjamesg] But doing so involved first checking whether a link had already been crawled before adding it to the to-crawl list.
[capjamesg] Which I could achieve by using a dictionary.
[capjamesg] So that would definitely be more efficient.
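As a quick sketch of the membership check capjamesg is describing (the names here are illustrative, not from his actual code): a dict or set gives constant-time lookups, where a list would need a linear scan.

```python
crawled = {}  # hypothetical: maps each crawled URL to metadata about the page

def needs_crawl(url):
    # Hash-based membership is O(1) on average; scanning a list is O(n).
    return url not in crawled
```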
[KevinMarks] same pattern works in javascript as well, but js objects can be a bit more complicated than python dicts
[KevinMarks] a common pattern is reference counting, so you do `d[url]=d.get(url,0)+1` and then you have a dict with each url and the number of references to it, which you can then maybe sort by count to decide which to crawl first
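A minimal runnable sketch of the reference-counting pattern KevinMarks describes, with a hypothetical input list:

```python
# Hypothetical links discovered while crawling pages on the site.
discovered_links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
]

ref_counts = {}
for url in discovered_links:
    # The pattern above: default to 0 for unseen URLs, then increment.
    ref_counts[url] = ref_counts.get(url, 0) + 1

# Most-referenced URLs first, as a rough crawl priority.
crawl_order = sorted(ref_counts, key=ref_counts.get, reverse=True)
```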
[Will_Monroe] and [tw2113_Slack_] joined the channel
[capjamesg] Since this is just for my blog -- and the crawl time is only a few mins -- I don't think reference counting is necessary. But that is very, very interesting. I'll keep that in mind!
[snarfed] joined the channel
[snarfed] anyone good at working with images? I’m looking for a way to automatically generate an alpha channel for a given image based on a background color, eg based on white in http://localhost/bridgy_logo.jpg, and get a full alpha channel (eg in a PNG version), not just on/off. any ideas?
[snarfed] (I don’t have Photoshop etc, so ideally command line, eg ImageMagick, or web based)
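snarfed asked for a command-line or web-based tool; as one possible starting point, here is a rough Pillow (Python) sketch that derives a graded alpha channel from luminance. It only approximates "based on a background color" for the white case, and the file names are placeholders:

```python
from PIL import Image, ImageOps

img = Image.open("bridgy_logo.jpg").convert("RGB")
# Invert a grayscale copy so white (the background) maps to alpha 0 and
# dark pixels map to alpha 255, giving graded rather than on/off transparency.
alpha = ImageOps.invert(img.convert("L"))
img.putalpha(alpha)
img.save("bridgy_logo.png")  # PNG preserves the alpha channel
```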
[Loqi] aaronpk has 50 karma in this channel over the last year (129 in all channels)
[capjamesg] Any good reads on how search engines decide when to crawl content? I'm implementing recursion to immediately crawl any URL that is discovered. But this doesn't seem elegant or like the most practical solution.
[capjamesg] Before this, I stored all URLs in a dict but I couldn't change it because you can't change a dict in a for loop in Python.
[capjamesg] (you can't change a dict you are iterating over already)
[capjamesg] Basic logic is this: find URLs in sitemap, crawl page, if new URL is discovered I want to add it to the "to crawl" list, add information about page to DB, go on to next. And so on and so on.
[capjamesg] The "to crawl" dict is already being iterated over
[snarfed] ah. you can, it just may change the iteration
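A common way to sidestep the mutation-during-iteration problem is to iterate over a separate work queue rather than the dict itself. A rough sketch of the loop capjamesg outlines, with the fetching and DB write stubbed out as a hypothetical helper (this is not his actual code):

```python
from collections import deque

def fetch_and_index(url):
    # Hypothetical stand-in: fetch the page, store its details in the DB,
    # and return the links discovered on it.
    return []

def crawl(sitemap_urls):
    seen = set(sitemap_urls)        # URLs already queued or crawled
    to_crawl = deque(sitemap_urls)  # a queue is safe to append to mid-loop

    while to_crawl:
        url = to_crawl.popleft()
        for link in fetch_and_index(url):
            if link not in seen:    # the constant-time membership check again
                seen.add(link)
                to_crawl.append(link)

crawl(["https://example.com/"])
```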
[capjamesg] Well, thanks to my tendency to overengineer, my blog search engine now supports reading robots.txt directives and link discovery within articles (even though all of my links are always in my sitemap).
[capjamesg] And now my code is quite messy so I need to tidy it.
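For the robots.txt part, Python's standard library includes a parser; a minimal sketch (the user agent and URLs are placeholders, and this is not necessarily how capjamesg's crawler does it):

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetches and parses the file

# Consult the site's directives before crawling a URL.
if robots.can_fetch("my-search-crawler", "https://example.com/some-page/"):
    print("allowed to crawl")
```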
[KevinMarks], [aciccarello], [schmarty] and alex11 joined the channel
[capjamesg] I have a whole lot more respect for all the work that goes into Google now.
[capjamesg] And more respect for Copilot by GitHub, which, to my surprise, has been a bit helpful.
[Loqi] It looks like we don't have a page for "SingleFileZ" yet. Would you like to create it? (Or just say "SingleFileZ is ____", a sentence describing the term)
[angelo_] SingleFileZ is a tool that allows you to save a webpage as a self-extracting HTML file.