#[Jamie_Tanna]Ah interesting - I'd assumed Aperture and Monocle would both parse the same, could be library versions out of sync maybe?
#[Jamie_Tanna]Is anyone doing any reddit webmention backfill or something? I've seen a few of my old submissions (that I manually webmentioned back to my site) getting updates this afternoon
#sknebelsnarfed said he'd work on that at some point, so possible
#[snarfed]one wm per reddit submission, source is the reddit submission page
#[snarfed](separate from Bridgy, which sends wms for comments)
#[snarfed]also excited about the wm discovery dataset, it's run discovery on over 1M domains. I plan to publish that, and the per-wm results, on Indie Map
[campegg], jacky, gRegor and jamietanna joined the channel
#capjamesgHave you thought about how to speed up your crawler?
#angeloi forgot to mention earlier at HWC: it also opens the website in a headless browser for a screenshot. i'm going to try to break the different parts of the process into small background jobs so that i can better profile what's taking the most time.
#capjamesgI found cProfile really helpful in understanding performance bottlenecks in my Python code for IndieWeb Search.
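A minimal sketch of the kind of cProfile session capjamesg describes; the `fetch_pages` function here is a hypothetical stand-in for crawler work, not anything from IndieWeb Search:

```python
import cProfile
import io
import pstats

def fetch_pages(n):
    # hypothetical stand-in for crawler work; a real crawler
    # would spend most of its time on network I/O and parsing
    return [f"<html>page {i}</html>" for i in range(n)]

profiler = cProfile.Profile()
profiler.enable()
fetch_pages(100_000)
profiler.disable()

# print the 10 most expensive calls by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
print(stream.getvalue())
```

Sorting by `"cumulative"` surfaces the functions that dominate wall-clock time even when the cost is in their callees, which is usually what you want when hunting crawler bottlenecks.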
#angeloi'll rerun the crawler now and have it skip downloading photos/favicons/screenshots, but still with no caching/conditional GET; beyond that, i can open /multiple/ browsers, and i currently have the "worker count" set to 5. i think i can ratchet that up much higher.
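A hedged sketch of what a tunable worker count can look like with Python's standard library; `crawl_site` and the domain list are illustrative, not angelo's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

# the knob angelo mentions: more workers means more concurrent fetches
WORKER_COUNT = 5

def crawl_site(domain):
    # placeholder for fetching and parsing a page; real crawl work is
    # network-bound, which is why extra threads can raise throughput
    return f"crawled {domain}"

domains = [f"example{i}.org" for i in range(20)]
with ThreadPoolExecutor(max_workers=WORKER_COUNT) as pool:
    results = list(pool.map(crawl_site, domains))
print(len(results))
```

Because the work is I/O-bound, raising `max_workers` well past the CPU count is reasonable; the practical ceiling is usually remote-server politeness and memory (especially if each worker holds a headless browser).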
#capjamesgWhat are you using to take the screenshot / open the headless browser?
#angeloyou can peek at https://indieweb.rocks/jobs to see two functions being repeatedly called in the background: get_media() and get_page(); if i decompose them further, there'll be a finer-grained lens on the crawler's hot spots.
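One way to get that finer-grained lens is a timing decorator on each sub-step. This is a sketch under assumptions: `fetch_html` and `parse_microformats` are hypothetical pieces that a decomposed get_page() might contain, not functions from indieweb.rocks:

```python
import time
from functools import wraps

timings = {}

def timed(fn):
    # accumulate wall-clock time per function so hot spots show up in `timings`
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            timings[fn.__name__] = timings.get(fn.__name__, 0.0) + elapsed
    return wrapper

@timed
def fetch_html(url):
    time.sleep(0.01)  # stand-in for the network fetch inside get_page()
    return "<html></html>"

@timed
def parse_microformats(html):
    return {}  # stand-in for the parsing step inside get_page()

html = fetch_html("https://indieweb.rocks")
parse_microformats(html)
print(sorted(timings, key=timings.get, reverse=True))
```

Exposing `timings` on a page like /jobs would show per-step totals instead of just two coarse function names.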