jackyit kinda depends on how the parser does it; for mf2-rust, it's a straight-up tree after traversing the DOM, so I implemented some helper methods to help 'walk' the tree when looking for specific things
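As a rough illustration of that "walk the parsed tree" idea, here is a sketch in Python using mf2py rather than mf2-rust, so none of the names below are the mf2-rust API:

```python
# Sketch: recursively walk a parsed microformats2 tree looking for a
# specific item type. Uses mf2py as a stand-in; this is NOT the mf2-rust API.
import mf2py


def walk(items, target_type):
    """Yield every mf2 item (including nested ones) of the given type."""
    for item in items:
        if target_type in item.get("type", []):
            yield item
        # nested items can live under 'children'...
        yield from walk(item.get("children", []), target_type)
        # ...or inside property values as embedded microformats
        for values in item.get("properties", {}).values():
            nested = [v for v in values if isinstance(v, dict) and "type" in v]
            yield from walk(nested, target_type)


parsed = mf2py.parse(url="https://example.com/")  # placeholder URL
for entry in walk(parsed["items"], "h-entry"):
    print(entry["properties"].get("name"))
```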
[manton]Apple has been going more into charging developers for things, presumably as part of increasing their services revenue. I think we’ll continue to see more web services APIs that have free and paid tiers.
[manton]Usually free/paid tiers make a lot of sense, but the Apple situation is kind of unique when you consider the basically mandatory $99/year dev program and the 30% tax on app purchases. I don’t love using developers as a revenue source for a company that has traditionally been product-focused.
@lordmatt↩️ No, but there is a module system that would enable someone to write a #webmention add-on. It might be bigger than the framework and would need a DB or read access to a folder. (twitter.com/_/status/1570468760217067522)
[manton]Full disclosure, I had a bug earlier today that sent webmentions in an infinite loop… 🙄 I think mostly internal to Micro.blog, and hopefully didn’t hit anyone else’s servers much.
[tonz]Today at Netherlands WordCamp a speaker (Joost, of the Yoast WP plugin) called attention to reducing a website’s footprint by reducing hits from crawlers and bots, highlighting how WordPress by default exposes all kinds of active URLs that no site owner actually needs but that get actively crawled all the time. E.g. a single-author site like my personal WP blog has an author archive. His slides are here:
[snarfed]granted, it feels messy and wasteful, and it may actually add up to noticeable bandwidth etc cost for big sites, but for most smaller sites I expect it's negligible
[tantek]4snarfed, all the arguments about minimizing (attack) surface. also makes sense from a URL maintenance perspective (e.g. less work to switch /URL_design or to a different CMS etc.)
LoqiIt looks like we don't have a page for "URL footprint" yet. Would you like to create it? (Or just say "URL footprint is ____", a sentence describing the term)
angeloi suppose it depends on how crawlers use what they find in the sitemap.. i've always thought of it as a list of recommended URLs to make /sure/ the robot will find, but the bot will still merrily crawl all of the other things it finds that aren't blocked by robots.txt
[tantek]4there's got to be some prior terminology for this, as web IAs (information architects) have been thinking about it (and writing about it) deliberately since the late 1990s
capjamesgA bigger site may see its robots.txt crawled multiple times per day by search engines, presumably to ensure the crawler is adhering to its directives as best it can.
capjamesgangelo Sitemaps are one method of discovery for URLs. I put all my blog URLs (even page URLs) in my sitemap just to make the job of the crawler easier.
capjamesgangelo IndieWeb Search does this: 1. crawl robots.txt, 2. crawl any sitemaps found in robots.txt, 3. try to crawl sitemap.xml (I think) if it exists, 4. compare each URL to the robots.txt directives, 5. if the URL can be crawled, crawl it, do URL discovery, and continue this for every URL.
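A minimal sketch of those steps with Python's stdlib urllib.robotparser — not IndieWeb Search's actual code; the user agent, site URL, and link extraction are placeholders:

```python
# Sketch of the crawl steps above: robots.txt -> sitemaps -> allowed URLs.
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

USER_AGENT = "example-crawler"          # placeholder user agent
site = "https://example.com/"           # placeholder site

# 1. fetch and parse robots.txt
robots = RobotFileParser(urljoin(site, "/robots.txt"))
robots.read()

# 2./3. use sitemaps listed in robots.txt, else fall back to /sitemap.xml
sitemaps = robots.site_maps() or [urljoin(site, "/sitemap.xml")]

# collect <loc> URLs from each sitemap (sitemap indexes not handled here)
ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
frontier = []
for sitemap in sitemaps:
    with urlopen(sitemap) as resp:
        frontier += [loc.text for loc in ET.parse(resp).iter(f"{ns}loc")]

# 4./5. only fetch URLs robots.txt allows; newly discovered links would be
# checked with can_fetch() and appended to the frontier the same way
for url in frontier:
    if robots.can_fetch(USER_AGENT, url):
        with urlopen(url) as resp:
            html = resp.read()
        # ... parse html, extract links, continue crawling
```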
capjamesgangelo "can you have a catch-all `Disallow: /` and then hand-curate your urlscape via sitemaps.xml?" If you say disallow /, the crawler to which that applies will not crawl anything.