#dev 2024-08-19

2024-08-19 UTC
oodani, AramZS, bterry, thegreekgeek_, srijan, ttybitnik, [Joe_Crawford] and [aciccarello] joined the channel
# 16:55 
[aciccarello] The number of times I've plugged localhost URLs into microformats parsers expecting a result is too damn high 😆
beanbrain, gRegor and [tantek] joined the channel
# 19:11 
capjamesg[d] [KevinMarks] How much do you know about NoSQL databases?
# 19:12 
capjamesg[d] I have been implementing one and I'm a bit stuck on one piece.
# 19:12 
capjamesg[d] I learned about the concept of a Global Secondary Index that indexes a particular key in all your documents, so you can search them faster.
# 19:12 
capjamesg[d] If lots of documents have the same value for that key, checking to find all documents with that key is super fast.
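A minimal sketch of the idea described above, assuming documents are plain Python dicts keyed by a document id. The function name `build_secondary_index` and the sample documents are hypothetical, not taken from capjamesg's implementation.

```python
from collections import defaultdict


def build_secondary_index(documents, key):
    """Map each value of `key` to the set of ids of documents holding that value."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        if key in doc:
            index[doc[key]].add(doc_id)
    return index


# Hypothetical sample data for illustration only.
documents = {
    1: {"title": "tea", "category": "drinks"},
    2: {"title": "coffee", "category": "drinks"},
    3: {"title": "cats", "category": "animals"},
}

category_index = build_secondary_index(documents, "category")

# An equality lookup is one dictionary access, however many documents exist.
drinks = category_index.get("drinks", set())
```

The cost moves to write time: every insert, update, or delete has to keep the index in step with the documents.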
# 19:13 
capjamesg[d] Then to check if documents start with something, you could construct a prefix-tree (trie) and use that for lookup.
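A rough sketch of the prefix-tree lookup mentioned above. It assumes each indexed value is a string and that every node keeps the ids of all documents whose value passes through it, which trades memory for a fast lookup. The class and method names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.doc_ids = set()  # ids of documents whose value passes through this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, value, doc_id):
        """Record doc_id on the root and on every node along the path for value."""
        node = self.root
        node.doc_ids.add(doc_id)
        for ch in value:
            node = node.children.setdefault(ch, TrieNode())
            node.doc_ids.add(doc_id)

    def starts_with(self, prefix):
        """Return the ids of documents whose value begins with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.doc_ids)


# Hypothetical sample data for illustration only.
trie = Trie()
trie.insert("coffee", 1)
trie.insert("cocoa", 2)
trie.insert("cats", 3)

co_matches = trie.starts_with("co")
```

The lookup walks one node per character of the prefix, so its cost depends on the prefix length rather than on the number of documents.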
# 19:13 
capjamesg[d] But I'm not sure about how to check if a key _contains_ something.
# 19:14 
capjamesg[d] Suppose I have a NoSQL database with five million documents. I want to check if `text`, which contains 1,000 words in each document, contains `coffee cats`. What would be the best way to do that search efficiently?
# 19:14 
capjamesg[d] My hunch is that for short strings, an efficient string search algorithm is sufficient.
# 19:14 
capjamesg[d] And for long strings, you should build a reverse index of the words in `text`?
# 19:15 
capjamesg[d] Ah, maybe you'd need a global index that maps every word in `text` to its doc id. Then you can look that up once!
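A small sketch of that word-to-document index, assuming a simple lowercase alphanumeric tokenizer. The helpers `tokenize`, `build_inverted_index`, and `search_all_words` are hypothetical names. Note that this version finds documents containing every query word anywhere in `text`; it does not check that the words are adjacent.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map every word to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all_words(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data for illustration only.
documents = {
    1: "Coffee and cats",
    2: "coffee only",
    3: "cats drink coffee",
}

index = build_inverted_index(documents)
hits = search_all_words(index, "coffee cats")
```

The query touches one index entry per query word and intersects the resulting sets, so the five million document bodies are never scanned at query time.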
# 19:15 
capjamesg[d] Ahhhh!
# 19:15 
capjamesg[d] I think I get it now.
# 19:15 
capjamesg[d] (But if any of this doesn't make sense, let me know!)
# 19:15 
capjamesg[d] I just wrote a long blog post on all of this and I was really stuck on this point.
[snarfed] joined the channel
# 19:40 
[snarfed] I'm pretty familiar with NoSQL datastores 😁
# 19:41 
[snarfed] but it sounds like your use cases are more search/document oriented. in that case you may still be better served by unstructured text search indices like Elastic etc, or maybe document dbs
# 19:45 
capjamesg[d] [snarfed] That's what I'm building!
# 19:45 
capjamesg[d] I wanted to know how tools like Elasticsearch work behind the scenes, so I'm trying to build one for myself 😄
# 19:45 
capjamesg[d] I don't plan to use it for production, but it has been fun to tinker.
# 19:45 
[snarfed] hah ok!
# 19:46 
capjamesg[d] Across 200,000 documents, a contains and starts_with query now takes 0.1s on my Mac with what I have built so far.
# 19:46 
ptramo[d] capjamesg[d] total payload bytes?
# 19:47 
[snarfed] yeah the "global index..." that you mentioned is a posting list aka a reverse index, the fundamental workhorse of most traditional IR and search
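One common way to make a posting list answer an exact-phrase query such as `coffee cats` is to store word positions alongside document ids. The sketch below assumes that positional layout and a simple tokenizer. It is an illustration of the general technique, not a description of how Elasticsearch or capjamesg's project does it, and all names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(documents):
    """Map word -> {doc_id: [positions of that word in the document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index


def search_phrase(index, phrase):
    """Return the ids of documents where the phrase words appear consecutively."""
    words = tokenize(phrase)
    if not words or any(word not in index for word in words):
        return set()
    matches = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Each later word must sit exactly `offset` positions after the first.
            if all(
                (start + offset) in index[word].get(doc_id, ())
                for offset, word in enumerate(words[1:], start=1)
            ):
                matches.add(doc_id)
                break
    return matches


# Hypothetical sample data for illustration only.
documents = {
    1: "I like coffee cats a lot",
    2: "cats like coffee",
    3: "coffee for cats",
}

index = build_positional_index(documents)
phrase_hits = search_phrase(index, "coffee cats")
```

Only document 1 has the two words side by side, even though all three documents contain both of them.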
# 19:48 
capjamesg[d] Yeah.
# 19:48 
capjamesg[d] It seems like document stores are lots of indices 😄
# 19:48 
ptramo[d] capjamesg[d] a fun problem is stemming, ie returning documents containing "searched" when looking up "searching"
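A deliberately naive sketch of stemming, assuming plain English suffix stripping. Real stemmers such as Porter's apply many more rules; the suffix list and the function name `naive_stem` here are assumptions for illustration only.

```python
# Checked in order, so "es" is tried before "s".
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Both forms reduce to the same stem, so they share one index entry.
same_stem = naive_stem("searched") == naive_stem("searching")
```

Applying the same function to words at index time and to query words at search time is what lets a lookup for "searching" return documents containing "searched".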
# 19:49 
capjamesg[d] Ahhh I'm not that far yet 😄
# 19:51 
[snarfed] ptramo++ and stop words etc
# 19:51 
Loqi ptramo has 2 karma over the last year
# 19:54 
capjamesg[d] ptramo[d]
# 19:54 
capjamesg[d] Oops!
# 19:55 
capjamesg[d] @π++
# 19:55 
capjamesg[d] ptramo[d] ++
# 19:55 
Loqi ptramo[d] has 3 karma over the last year
# 19:55 
capjamesg[d] There we go!
# 19:58 
ptramo[d] So my new job involves building automated web agents. Been having lots of fun building an obstacle course (https://challenge.xmit.dev). I'd be curious to get a list of problems around lack of accessibility to add test cases for them
jonnybarnes joined the channel
# 20:09 
capjamesg[d] ptramo[d] A link on the page where you get an incorrect answer back to the home page would be ideal.
# 20:09 
capjamesg[d] So I could keep using my keyboard to navigate the page.
# 20:10 
capjamesg[d] This is very cool by the way!
# 20:10 
capjamesg[d] https://cdn.discordapp.com/attachments/866577430886350869/1275185188397383690/Screenshot_2024-08-19_at_21.10.15.png?ex=66c4f840&is=66c3a6c0&hm=7bdcc044dee8cd4abf2102388d94170124f66efb177ef129f8ee4b79bddcfe16&
# 20:10 
capjamesg[d] It would be great if these items were spaced further apart:
# 20:10 
capjamesg[d] It is hard for me to read and understand them.
# 20:11 
capjamesg[d] Also, the white and black text is jarring in dark mode. Using an off-white or off-black colour makes text easier to read.
# 20:32 
ptramo[d] capjamesg[d] I don't set colors, blame your browser 😦
# 20:45 
[Murray] Just spent an hour trying to work out why API calls working locally weren't working on the server. Any guesses? Yeah, forgot to add the API key to the server's environment variables 😅 🤦‍♂️
# 20:49 
capjamesg[d] [Murray] At least you figured it out before going to bed!
# 21:06 
[Murray] This is true!
gRegorLove_ and bterry joined the channel
# 23:02 
ptramo[d] capjamesg[d]: Ah so agents that don't know how to navigate the history get penalized but not stuck… why not. Do you know have cmd-back to go back with your keyboard though?
# 23:03 
ptramo[d] * not have. Ctrl-left on Linux/windows and cmd-left on Mac?
# 23:04 
ptramo[d] Hmmm for spacing I don't know what would be most appropriate. a { padding: 0.25em } maybe?
bret joined the channel