#capjamesgDoes anyone know how graphs are represented in code?
#capjamesgSay I have a graph with "microformats is structured data" and "structured data is a way to represent information"
#capjamesgIf I said "define microformats" and wanted to traverse the graph, is this something I could do with a dictionary?
#jackyit kinda depends on how the parser does it; for mf2-rust, it's a straight up tree after traversing the DOM, so I implemented some helper methods to help 'walk' the tree if looking for specific things
#jackya dictionary _could_ work but you'd be doing a bit of manual searching
#jackyhas a plan to refactor this so it's a giant flat list but restitching relationships would be annoying
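A minimal sketch of the dictionary approach discussed above, assuming each term maps to the terms its definition points at; the node names and the `define` helper are made up for illustration:

```python
# Adjacency-list graph as a plain dictionary: each term maps to the
# terms its definition points at (edge labels omitted for brevity).
graph = {
    "microformats": ["structured data"],
    "structured data": ["a way to represent information"],
}

def define(term, graph, depth=2):
    """Walk outward from `term`, collecting everything reachable
    within `depth` hops (a simple breadth-first traversal)."""
    seen, frontier, results = {term}, [term], []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    results.append(f"{node} is {neighbor}")
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return results

print(define("microformats", graph))
# ['microformats is structured data',
#  'structured data is a way to represent information']
```

This is the "manual searching" jacky mentions: the dictionary only gives cheap lookup per node; the walking logic is still yours to write.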
#[manton]Apple has been going more into charging developers for things, presumably as part of increasing their services revenue. I think we’ll continue to see more web services APIs that have free and paid tiers.
#GWG[manton]: I don't mind a free/paid tier philosophy
#[manton]Usually free/paid tiers make a lot of sense, but the Apple situation is kind of unique when you consider the basically mandatory $99/year dev program and the 30% tax on app purchases. I don’t love using developers as a revenue source for a company that has traditionally been product-focused.
#@lordmatt↩️ No, but there is a module system that would enable someone to write a #webmention add-on. It might be bigger than the framework and would need a DB or read access to a folder. (twitter.com/_/status/1570468760217067522)
#[manton]Full disclosure, I had a bug earlier today that sent webmentions in an infinite loop… 🙄 I think mostly internal to Micro.blog, and hopefully didn’t hit anyone else’s servers much.
gRegor, [tonz], darth_mall, jacky and [jeremycherfas] joined the channel
#[tonz]Today at Netherlands WordCamp a speaker (Joost, of the Yoast WP plugin) called attention to reducing a website’s footprint by reducing hits from crawlers and bots, highlighting how WordPress by default has all kinds of active URLs that no site owner actually needs but that get actively crawled all the time. E.g. a single-author site like my personal WP blog has an author archive. His slides are here:
#[snarfed]granted, it feels messy and wasteful, and it may actually add up to noticeable bandwidth etc cost for big sites, but for most smaller sites I expect it's negligible
#GWGI don't want to have pages I don't care about indexed
jjuran joined the channel
#GWGFor example, the WordPress author page....I have a single author site... I'd like to just disable it
#[tantek]4snarfed, all the arguments about minimizing (attack) surface. also makes sense from a URL maintenance perspective (e.g. less work to switch /URL_design or to a different CMS etc.)
#[tantek]4less work to migrate = more freedom to migrate
#[snarfed]sure. still probably low priority for the average personal site, but understood
#[tantek]4also good advice from a setup perspective, before you unintentionally create a (maintenance) mess
#LoqiIt looks like we don't have a page for "URL footprint" yet. Would you like to create it? (Or just say "URL footprint is ____", a sentence describing the term)
#angelowhat would you call the collection of URLs served at your domain?
#GWGI'd just write a simple few lines of code in plugin form to disable what I don't want
#angeloi suppose it depends on how crawlers use what they find in the sitemap.. i've always thought of it as a list of recommended URLs to make /sure/ the robot will find them, but the bot will still merrily crawl all of the other things it finds that aren't blocked by robots.txt
#angeloin other words, keeping a small sitemap wouldn't necessarily reduce your overall "URL footprint"
#[tantek]4there's got to be some prior terminology for this, as web IAs (information architects) have been thinking about it (and writing about it) deliberately since the late 1990s
#angelocan you have a catch-all `Disallow: /` and then hand-curate your urlscape via sitemaps.xml?
#[schmarty]angelo: a behaving bot should check all URLs against robots.txt even if it learned them from a sitemap
#capjamesgangelo You should crawl robots.txt before anything else.
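For the point about checking every URL against robots.txt, even URLs learned from a sitemap, here is a minimal sketch using Python's standard-library robots.txt parser; the domain and URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse robots.txt once, then test every candidate URL
# against it -- including URLs discovered via a sitemap.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

candidates = [
    "https://example.com/posts/hello-world",
    "https://example.com/author/admin/",   # e.g. a disallowed author archive
]

for url in candidates:
    if robots.can_fetch("my-crawler", url):
        print("ok to crawl:", url)
    else:
        print("blocked by robots.txt:", url)
```

On Python 3.8+, `RobotFileParser.site_maps()` also returns any `Sitemap:` lines from robots.txt, which ties into the question above about curating URLs via sitemaps.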
[jgarber]1 and [schmarty]1 joined the channel
#capjamesgA bigger site may see its robots.txt fetched multiple times per day by search engines, presumably to ensure the crawler adheres to the directives as best it can.
#capjamesg[tantek] Do you have any resources re: what to include in a sitemap?
#capjamesgangelo Sitemaps are one method of discovery for URLs. I put all my blog URLs (even page URLs) in my sitemap just to make the job of the crawler easier.
#capjamesgBut my site is a few thousand pages. I don't know how massive sites do it.
#capjamesgangelo IndieWeb Search does this: 1. crawl robots.txt, 2. crawl any sitemaps found in robots.txt, 3. try to crawl sitemap.xml (I think) if it exists, 4. compare each URL to the robots.txt directives, 5. if a URL can be crawled, crawl it, do URL discovery, and continue this for every URL.
#capjamesgangelo "can you have a catch-all `Disallow: /` and then hand-curate your urlscape via sitemaps.xml?" If you say disallow /, the crawler to which that applies will not crawl anything.
#capjamesgActually, I should have a provision for that in IndieWeb Search.
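A rough sketch of that crawl sequence in Python, not the actual IndieWeb Search code; `extract_links`, the user agent, and the fetch details are simplified placeholders:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
import requests

USER_AGENT = "example-crawler"

def extract_links(text):
    """Placeholder: parse URLs out of a sitemap or HTML page."""
    return []

def crawl_site(base):
    # 1. crawl robots.txt
    robots = RobotFileParser(urljoin(base, "/robots.txt"))
    robots.read()

    # 2. sitemaps listed in robots.txt, 3. fall back to /sitemap.xml
    sitemaps = robots.site_maps() or [urljoin(base, "/sitemap.xml")]
    queue, seen = [], set()
    for sitemap in sitemaps:
        queue.extend(extract_links(requests.get(sitemap).text))

    while queue:
        url = queue.pop(0)
        # 4. compare each URL to the robots.txt directives
        if url in seen or not robots.can_fetch(USER_AGENT, url):
            continue
        seen.add(url)
        # 5. crawl it, do URL discovery, and repeat
        page = requests.get(url, headers={"User-Agent": USER_AGENT})
        queue.extend(extract_links(page.text))
    return seen
```

Note that with a catch-all `Disallow: /`, step 4 rejects every URL the sitemaps offer up, which matches the behaviour described above.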