#dev 2023-08-15

2023-08-15 UTC
eitilt, btrem, [schmarty], win0err, gerben, [tantek], geoffo, [campegg] and [snarfed] joined the channel
#
[snarfed]
[KevinMarks] [sknebel] you all have some background with Python HTML parsers... we've been reconsidering lxml a bit recently because it doesn't support HTML5 🤦‍♂️. Discussion in https://github.com/snarfed/bridgy/issues/1535#issuecomment-1678032141 . Got any thoughts?
#
Loqi
[preview] [snarfed] Evidently the root cause is that libxml2 only supports HTML4, not HTML5, even 15y later 🤦. They have a 2y old issue tracking adding HTML5 support, with some discussion, but no obvious progress. Sigh. https://gitlab.gnome.org/GNOME/libxml2/-/issues...
#
[schmarty]
chants "up-stream fix. up-stream fix. Up-stream Fix. UP-STREAM FIX!"
#
[snarfed]
to what?
#
[schmarty]
libxml2 😂
#
[snarfed]
yeaaahhh "fix" is an understatement
#
[schmarty]
hahaha, yep. i don't actually think it's a good idea for folks in this community to try and take on that gargantuan (and stuck for important reasons) project.
#
Loqi
rofl
gerben joined the channel
#
[KevinMarks]
the real answer may be to refactor without BeautifulSoup, but that would be a big change
#
[KevinMarks]
html5lib was always a bit under-documented; is https://html5-parser.readthedocs.io/en/latest/ any better, or is it a lot of weird tree walking as well?
eitilt1 joined the channel
#
sknebel
[snarfed]: I've always said that you should use html5lib where you can
#
sknebel
Exactly because lxml doesn't do HTML5 properly
#
[KevinMarks]
I've done a little bit with it, but I admit that the last time I needed an html parser I used `html.parser` (in my defence, it was for HTML emails, so using very retro parsing makes some sense)
#
[snarfed]
thanks guys! and yeah sknebel fair, thank you
#
[KevinMarks]
html5lib will also emit lxml trees, and BeautifulSoup will use either or both of them, so it gets very confusing. Another profiling exercise might be worth it, and we can make mf2py pick the fast and correct answer rather than the mix and match we get at the moment from all these libs negotiating each other's boundaries
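For context, a minimal sketch of how those layers interact; the HTML fragment here is illustrative:

```python
import html5lib
from bs4 import BeautifulSoup

fragment = "<p>unclosed paragraph"

# html5lib can emit an lxml tree directly via its treebuilder option
lxml_tree = html5lib.parse(fragment, treebuilder="lxml")

# BeautifulSoup delegates to whichever parser you name
soup_correct = BeautifulSoup(fragment, "html5lib")  # spec-compliant HTML5
soup_fast = BeautifulSoup(fragment, "lxml")         # faster, but HTML4-era libxml2
```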
#
[snarfed]
granary, Bridgy, etc.'s current config: https://github.com/snarfed/webutil/blob/main/util.py#L1917-L1938 ; they currently ask for lxml explicitly, https://github.com/snarfed/webutil/blob/main/appengine_config.py#L7-L9 , but only due to BS4's decade-old cargo cult "it's faster" claim. I'm open to switching to html5lib in the short term
#
[KevinMarks]
Before Python had dependency management, there was a tradition of 'deal with what's installed' in these libs, and mf2py inherited that from BeautifulSoup.
#
[snarfed]
yeah lots of this feels like 10-15y Python cargo culting, which was fine then, but the world is different now
btrem joined the channel
#
[KevinMarks]
html5-parser claims to emit soup format, so there may be a good fast path there, with a different tree walker approach as a later refactor. Would still need both testing and profiling. I don't think we have a checked-in profiling harness.
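A minimal sketch of that claimed fast path, assuming html5-parser's documented soup treebuilder:

```python
from html5_parser import parse

# html5-parser parses with a C HTML5 parser (a gumbo fork) and can hand
# back a BeautifulSoup tree, so existing soup-based code could stay unchanged
soup = parse("<p>unclosed paragraph", treebuilder="soup")
print(soup.p)
```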
#
sknebel
It was certainly faster when I measured a few years back
#
sknebel
Afaik that was why snarfed stuck with it at the time
#
Loqi
[preview] [kylewm] #63 performance: bs4.find_all is slow
#
[snarfed]
I remember an issue with a lot more background and discussion, lots from sknebel, haven't found it yet
gerben joined the channel
#
[tantek]
what is lxml
#
Loqi
It looks like we don't have a page for "lxml" yet. Would you like to create it? (Or just say "lxml is ____", a sentence describing the term)
#
[tantek]
what is BeautifulSoup
#
Loqi
BeautifulSoup is an HTML parsing library for Python https://indieweb.org/BeautifulSoup
#
[tantek]
^ used by/for what IndieWeb purposes? Could someone who knows add IndieWeb relevance to that dfn?
#
[tantek]
what is libxml2
#
Loqi
It looks like we don't have a page for "libxml2" yet. Would you like to create it? (Or just say "libxml2 is ____", a sentence describing the term)
#
[snarfed]
the BeautifulSoup wiki page mentions its IndieWeb context briefly, that part looks accurate
#
[snarfed]
we're probably ok without lxml and libxml2 pages
#
h​avenmatt
Hi All! I'm working on my Micropub server. Would it make sense for the spec to include an option for a Micropub server to advertise which types/properties it supports? That way you could have a very minimal server that only accepts Twitter-style notes but can still speak Micropub. Micropub clients would be able to limit which fields they display based on the advertised support from the server. Any thoughts?
#
[snarfed]
but also https://micropub.spec.indieweb.org/#unrecognized-properties "If the request includes properties that the server does not recognize, it MUST ignore unrecognized properties and create the post with the values that are recognized. This allows clients to post rich types of content to servers that support it, while also posting fallback content to servers that don't."
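A minimal sketch of that requirement on the server side; the RECOGNIZED set and store_post are hypothetical names, not from the spec:

```python
# hypothetical handler for a Micropub JSON create request
RECOGNIZED = {"content", "name", "category", "published"}

def create_post(request_json):
    props = request_json.get("properties", {})
    # per the spec, unrecognized properties MUST be ignored and the post
    # created from the values the server does recognize
    known = {k: v for k, v in props.items() if k in RECOGNIZED}
    return store_post(known)  # store_post is a placeholder
```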
#
capjamesg
I have a programming question that isn't IndieWeb related but I thought someone may be able to help.
#
capjamesg
I am writing a programming language that is primarily dealing with images. It is implemented in Python. A history of images is maintained. This history is capped at 100 images, with FIFO enforced.
#
capjamesg
Is there a smarter way to make sure the app doesn't use too much memory? Or will Python handle that for me?
#
[snarfed]
you're...making...a new programming language?
#
[snarfed]
that's ambitious!
#
capjamesg
Inspired by Wolfram and Scratch, made to specify *what* rather than *how*. My ideal goal is something that a 10-year-old would find fun and engaging.
#
[snarfed]
the short answer is, Python garbage collects unused memory, but if you actively try to use more memory than available, it will happily try, and eventually raise a MemoryError
#
[snarfed]
same with most programming languages. given arbitrary code, they can't know how to reduce the actively used memory (resident set) and still have the code run ok
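A tiny illustration of that behavior (it will genuinely exhaust memory if run):

```python
chunks = []
while True:
    # Python keeps allocating as long as references are held; once the
    # OS can't satisfy a request, this raises MemoryError (or the OS
    # kills the process first)
    chunks.append(bytearray(100 * 1024 * 1024))  # 100 MB per iteration
```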
#
[snarfed]
unrelated, you have so many projects! how do you maintain them all?!
#
[snarfed]
(I'm scared to ask about your domain bill 😆)
#
capjamesg
Right. In theory, I could say Load["folder_with_200_images"] and it would load all of those into the language.
#
capjamesg
This ease of loading is by design, but I just realized the potential memory issues.
#
[snarfed]
if you're having them specify what not how, one approach is to be lazy. collect their requested operations, but don't actually perform any operations until you need to, and when you do, do them in batches small enough to fit in memory
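A minimal sketch of that lazy, batched approach; LazyPipeline and its method names are illustrative, and Pillow is assumed for image loading:

```python
import os
from PIL import Image  # assuming Pillow for image loading

class LazyPipeline:
    """Illustrative sketch: record requested operations, run them later in batches."""

    def __init__(self, folder):
        self.folder = folder
        self.ops = []  # deferred operations, nothing executed yet

    def apply(self, op):
        self.ops.append(op)
        return self

    def run(self, batch_size=10):
        paths = sorted(os.path.join(self.folder, f) for f in os.listdir(self.folder))
        for i in range(0, len(paths), batch_size):
            # only batch_size images are resident at any moment
            batch = [Image.open(p) for p in paths[i:i + batch_size]]
            for op in self.ops:
                batch = [op(img) for img in batch]
            yield from batch
```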
#
capjamesg
[snarfed] I'm taking a break from new ones. This project is something I have wanted to do for years.
#
[snarfed]
taking a break from new projects by doing a new project! 😆
#
capjamesg
This one has been going on since early July.
#
capjamesg
-> #chat
#
capjamesg
Lazy loading! Nice idea!
#
capjamesg
snarfed++
#
Loqi
snarfed has 103 karma in this channel over the last year (154 in all channels)
#
capjamesg
Are there best practices for ensuring buffers don't get too large (not in item count, but in memory used)?
#
[tantek]
mo buffers mo problems
#
h​avenmatt
[snarfed]: That is exactly what I was looking for!! Thanks!
#
[snarfed]
capjamesg limit size instead of items, stop accepting new items once you hit it
#
capjamesg
Right. So I should do a memory check on each list insertion.
#
[snarfed]
well, a buffer size check
#
[snarfed]
"does the system have enough memory available" is a surprisingly complicated question to answer
#
[schmarty]
yah, using your own estimates and setting your own cap seems reasonable.
#
[snarfed]
so don't try; check the buffer and new item's size instead
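One way to sketch that check, capping a FIFO buffer by total bytes rather than item count; ByteCappedBuffer is an illustrative name, and items are assumed to be bytes-like:

```python
from collections import deque

class ByteCappedBuffer:
    """FIFO buffer capped by total bytes rather than item count (sketch)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.items = deque()
        self.total = 0

    def append(self, item):
        size = len(item)  # assumes bytes-like items with a meaningful len()
        # evict oldest items first until the new one fits
        while self.items and self.total + size > self.max_bytes:
            self.total -= len(self.items.popleft())
        self.items.append(item)
        self.total += size
```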
#
[schmarty]
memorycapjamesg
#
[snarfed]
capjamesg https://cachetools.readthedocs.io/ is a decent example of a lib that does this kind of thing
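For example, cachetools can measure a cache in whatever units its getsizeof callable returns, bytes here:

```python
from cachetools import LRUCache

# a cache capped at ~500 MB of stored bytes; least-recently-used
# entries are evicted automatically when the cap would be exceeded
cache = LRUCache(maxsize=500 * 1024 * 1024, getsizeof=len)
cache["photo.jpg"] = b"\x89PNG..." * 1000  # illustrative payload
```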
#
capjamesg
Right. So I should keep a global tally of bytes used by the array, and remove the last used item on the list when MAX_SIZE is exceeded?
#
[tantek]
capjamesg, longer original article, which I'm still skeptical about; however, someone obviously did do a lot of work: https://blog.redplanetlabs.com/2023/08/15/how-we-reduced-the-cost-of-building-twitter-at-twitter-scale-by-100x/
#
[snarfed]
capjamesg how you handle exceeding the limit is up to you, but garbage collecting like that is dangerous if you're discarding user data. again I recommend lazy evaluation and sizing batches empirically at runtime to fit in a given amount of memory
#
capjamesg
I have just implemented lazy evaluation.
#
capjamesg
There is a case in which lazy evaluation is not enough. Someone can build a search engine with two lines of code. Load["folder"], Search["coffee"] finds all coffee cups in all images in the image buffer.
#
capjamesg
So I need both!
#
[snarfed]
what do you do if loading all the images runs out of memory?
#
[snarfed]
I guess my point is, "image buffer" is a how, not a what. you can incrementally stream images through memory to search them and handle that case
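A minimal sketch of that streaming approach; contains_coffee is a hypothetical detector, and Pillow is assumed:

```python
import os
from PIL import Image  # assuming Pillow

def stream_images(folder):
    # yield one image at a time so only a single image is resident
    for name in sorted(os.listdir(folder)):
        with Image.open(os.path.join(folder, name)) as img:
            yield name, img

matches = [
    name for name, img in stream_images("folder")
    if contains_coffee(img)  # contains_coffee is a hypothetical detector
]
```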
#
capjamesg
Even better.
#
capjamesg
I have implemented lazy evaluation and a max buffer size check on all buffers.
#
[snarfed]
capjamesg++
#
Loqi
capjamesg has 41 karma in this channel over the last year (125 in all channels)
tbbrown and [benatwork] joined the channel
#
capjamesg
Now I wonder how I can make the language support concurrency.
#
[snarfed]
very how not what. that way lie dragons. consider punting
#
capjamesg
The use case: suppose I have a 30-minute video I want to process. Processing it concurrently would be faster.
#
capjamesg
Okay yeah I just mocked this out and this is _way_ too much work.
#
[KevinMarks]
Depends how much feed-forward there is. You could chunk the video at the GOP level for processing.
#
capjamesg
Say more?
tei_ joined the channel
#
capjamesg
Now you can use 7 lines of code to count an object in every frame of a video, display the count, and show the video: https://gist.github.com/capjamesg/80790c104a02d7cfa06e88971a6e3709
tei_1 and Nuve joined the channel
#
[KevinMarks]
Video is encoded in groups of frames, called GOPs; frames within a GOP depend on each other, so a GOP is the smallest chunk you should chop the video into, and it lines up with the first stage of the decoding pipeline. Otherwise you're turning frames back into image buffers and moving those around, and those are much bigger and may well mess up your concurrency by eating lots of memory and bus throughput.
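One way to get GOP-aligned chunks without re-encoding is ffmpeg's segment muxer: with stream copy it can only cut at keyframes. The file names and segment length here are illustrative:

```python
import subprocess

# stream-copy into ~10-second segments; with -c copy, ffmpeg can only
# cut at keyframes, so each chunk starts on a GOP boundary
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-c", "copy", "-f", "segment", "-segment_time", "10",
    "chunk%03d.mp4",
], check=True)
# each chunk can then be decoded and processed by a separate worker
```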
[KevinMarks]1, [campegg]1, [Jo], strugee and tei_ joined the channel