#dev 2023-08-15
2023-08-15 UTC
eitilt, btrem, [schmarty], win0err, gerben, [tantek], geoffo, [campegg] and [snarfed] joined the channel
# [snarfed] [KevinMarks] [sknebel] you all have some background with Python HTML parsers...we've been reconsidering lxml a bit recently because it doesn't support HTML5 🤦♂️ discussion in https://github.com/snarfed/bridgy/issues/1535#issuecomment-1678032141 . got any thoughts?
# Loqi [preview] [snarfed] Evidently the root cause is that libxml2 only supports HTML4, not HTML5, even 15y later 🤦. They have a 2y old issue tracking adding HTML5 support, with some discussion, but no obvious progress. Sigh. https://gitlab.gnome.org/GNOME/libxml2/-/issues...
# [schmarty] chants "up-stream fix. up-stream fix. Up-stream Fix. UP-STREAM FIX!"
# [schmarty] libxml2 😂
# [schmarty] hahaha, yep. i don't actually think it's a good idea for folks in this community to try and take on that gargantuan (and stuck for important reasons) project.
gerben joined the channel
# [KevinMarks] the real answer may be to refactor without BeautifulSoup, but that would be a big change
# [KevinMarks] html5lib was always a bit under-documented; is https://html5-parser.readthedocs.io/en/latest/ any better, or is it a lot of weird tree walking as well?
eitilt1 joined the channel
# [KevinMarks] I've done a little bit with it, but I admit that the last time I needed an html parser I used `html.parser` (in my defence, it was for HTML emails, so using very retro parsing makes some sense)
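For reference, the stdlib `html.parser` approach KevinMarks mentions is event-driven rather than tree-based; a minimal sketch (the `LinkCollector` class and sample HTML are illustrative):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags using the stdlib parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<p><a href="https://indieweb.org">IndieWeb</a></p>')
print(parser.links)  # -> ['https://indieweb.org']
```

It tolerates messy markup without building a full tree, which is why it suits "very retro" inputs like HTML email.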
# [KevinMarks] html5lib will also emit lxml trees, and BeautifulSoup will use either or both of them, so it gets very confusing. Another profiling exercise might be worth it, and we can make mf2py pick the fast and correct answer rather than the mix and match we get at the moment from all these libs negotiating each other's boundaries
# [snarfed] granary Bridgy etc's current config: https://github.com/snarfed/webutil/blob/main/util.py#L1917-L1938 ; they currently ask for lxml explicitly, https://github.com/snarfed/webutil/blob/main/appengine_config.py#L7-L9 , but only due to BS4's decade-old cargo-cult "it's faster" claim. I'm open to switching to html5lib in the short term
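For anyone following along, the parser BS4 uses is just its second argument, so the switch under discussion is a one-line change; a sketch (assumes `bs4` is installed, and `lxml`/`html5lib` only if you name them):

```python
from bs4 import BeautifulSoup

html = "<article class='h-entry'><p class='p-name'>Hello</p></article>"

# Name the tree builder explicitly instead of letting BS4 pick
# "the best available" parser behind your back.
soup = BeautifulSoup(html, "html.parser")   # stdlib, always available
# soup = BeautifulSoup(html, "lxml")        # fast, but HTML4-era libxml2
# soup = BeautifulSoup(html, "html5lib")    # spec-compliant HTML5 parsing

print(soup.find(class_="p-name").get_text())  # -> Hello
```

The three builders can produce different trees for invalid markup, which is why pinning one matters for mf2py's output stability.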
# [KevinMarks] Before Python had dependency management, there was a tradition of 'deal with what's installed' in these libs, and mf2py inherited that from BeautifulSoup.
btrem joined the channel
# [KevinMarks] html5-parser claims to emit soup format, so there may be a good fast path there, with a different tree walker approach as a later refactor. Would still need both testing and profiling. I don't think we have a checked-in profiling harness.
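On the missing profiling harness: a throwaway sketch using stdlib `timeit`, written so alternative parse functions for lxml or html5lib could be dropped in alongside the stdlib one (the corpus and function names here are made up):

```python
import timeit
from html.parser import HTMLParser

SAMPLE = "<div><p>hello</p></div>" * 500  # stand-in for a real test corpus

class NullParser(HTMLParser):
    """Parse and discard; we only care about parse time."""

def parse_stdlib(doc: str) -> None:
    NullParser().feed(doc)

def bench(fn, doc: str, number: int = 50) -> float:
    """Seconds per parse, averaged over `number` runs."""
    return timeit.timeit(lambda: fn(doc), number=number) / number

print(f"html.parser: {bench(parse_stdlib, SAMPLE):.6f} s/parse")
```

Checking something like this in would let the "lxml is faster" claim be re-tested against real mf2py inputs rather than repeated.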
# [snarfed] some background: https://github.com/microformats/mf2py/issues/63
gerben joined the channel
# Loqi It looks like we don't have a page for "lxml" yet. Would you like to create it? (Or just say "lxml is ____", a sentence describing the term)
# Loqi BeautifulSoup is an HTML parsing library for Python https://indieweb.org/BeautifulSoup
# Loqi It looks like we don't have a page for "libxml2" yet. Would you like to create it? (Or just say "libxml2 is ____", a sentence describing the term)
# havenmatt Hi All! I'm working on my Micropub server. Would it make sense for the spec to include an option for a Micropub server to advertise which types/properties it supports? That way you could have a very minimal server that only accepts twitter-style notes, but can still speak Micropub. Micropub clients would be able to limit which fields they display based on the advertised support from the server. Any thoughts?
# [snarfed] but also https://micropub.spec.indieweb.org/#unrecognized-properties "If the request includes properties that the server does not recognize, it MUST ignore unrecognized properties and create the post with the values that are recognized. This allows clients to post rich types of content to servers that support it, while also posting fallback content to servers that don't."
# havenmatt [snarfed]: That is exactly what I was looking for!! Thanks!
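The MUST-ignore rule snarfed quoted boils down to filtering the incoming properties dict against what the server knows; a hypothetical sketch of a minimal server's handling (the `RECOGNIZED` set and sample request are made up):

```python
# Per Micropub's "unrecognized properties" rule: keep the properties
# this minimal server recognizes, silently drop the rest.
RECOGNIZED = {"content", "published", "category"}  # illustrative only

def filter_properties(properties: dict) -> dict:
    return {k: v for k, v in properties.items() if k in RECOGNIZED}

incoming = {
    "content": ["hello world"],
    "published": ["2023-08-15T00:00:00Z"],
    "photo": ["https://example.com/a.jpg"],  # not supported here
}
print(filter_properties(incoming))
# -> {'content': ['hello world'], 'published': ['2023-08-15T00:00:00Z']}
```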
# [schmarty] yah, using your own estimates and setting your own cap seems reasonable.
# [schmarty] memorycapjamesg
# [snarfed] capjamesg https://cachetools.readthedocs.io/ is a decent example of a lib that does this kind of thing
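The same capped-cache idea is also in the stdlib; a sketch with `functools.lru_cache`, which bounds entry count and evicts least-recently-used items (the `fetch_profile` function is a made-up stand-in):

```python
from functools import lru_cache

# Cap the cache at 128 entries; least-recently-used entries are
# evicted once the cap is hit. cachetools adds TTL, LFU, etc. on
# top of this basic pattern.
@lru_cache(maxsize=128)
def fetch_profile(url: str) -> str:
    # Stand-in for an expensive fetch/parse.
    return f"profile for {url}"

fetch_profile("https://example.com/alice")
fetch_profile("https://example.com/alice")  # served from cache
print(fetch_profile.cache_info())  # hits=1, misses=1, maxsize=128
```

`lru_cache` counts entries rather than bytes, so for a true memory cap you'd still estimate per-entry size and set `maxsize` accordingly, as discussed above.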
# [tantek] capjamesg, longer original article, which I'm still skeptical about, however someone obviously did do a lot of work: https://blog.redplanetlabs.com/2023/08/15/how-we-reduced-the-cost-of-building-twitter-at-twitter-scale-by-100x/
tbbrown and [benatwork] joined the channel
# [KevinMarks] Depends how much feed forward there is. You could chunk video at the gop level for processing.
tei_ joined the channel
# capjamesg Now you can use 7 lines of code to count an object in every frame of a video, display the count, and show the video: https://gist.github.com/capjamesg/80790c104a02d7cfa06e88971a6e3709
tei_1 and Nuve joined the channel
# [KevinMarks] Video is encoded in groups of frames, called GOPs, that have dependencies between them. That's the smallest chunk you should chop them into, which is dependent on the first stage of the decoding pipeline. Otherwise you're turning them back into image buffers and moving those around, which are much bigger and may well mess up your concurrency by eating lots of memory and bus throughput.
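A sketch of the chunking KevinMarks describes: split on keyframe indices so each chunk is a self-contained GOP that can be decoded independently, rather than cutting at arbitrary frame counts (the keyframe positions here are made up):

```python
def chunk_by_gop(num_frames: int, keyframes: list[int]) -> list[range]:
    """Split frame indices into GOPs: each chunk starts at a keyframe,
    so it decodes without reference to its neighbours."""
    bounds = sorted(set(keyframes)) + [num_frames]
    return [range(start, end) for start, end in zip(bounds, bounds[1:])]

# Illustrative keyframe positions for a 300-frame clip.
chunks = chunk_by_gop(300, keyframes=[0, 90, 180, 240])
print([(c.start, c.stop) for c in chunks])
# -> [(0, 90), (90, 180), (180, 240), (240, 300)]
```

Each chunk can then go to a separate worker while still compressed, avoiding the decoded image-buffer shuffling mentioned above.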
[KevinMarks]1, [campegg]1, [Jo], strugee and tei_ joined the channel