#dev 2021-09-01
2021-09-01 UTC
# capjamesg[d] [snarfed] I have had to port over to Elasticsearch for document storage. 🙂 But it's so good.
hendursaga joined the channel
# capjamesg[d] Ruxton unfortunately it was the best option.
# capjamesg[d] PostgreSQL's full text search and BM25 support were not up to the job.
# capjamesg[d] MySQL wasn't quite right either, although it was close.
# capjamesg[d] I thought I'd just go all the way with elasticsearch.
# capjamesg[d] The search engine is now live and working with a (very) small subset of docs from my crawl: https://indieweb-search.jamesg.blog/results?query=blogroll
# capjamesg[d] +1
# capjamesg[d] I'm just dipping my toe into elasticsearch. If you know any good docs for relevance / fuzziness let me know.
# capjamesg[d] Are you using the whole ELK stack or just elasticsearch?
# capjamesg[d] (I'm just using elasticsearch because it's all I need for this project)
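For the relevance/fuzziness question above, a minimal sketch of an Elasticsearch `match` query with fuzziness enabled, expressed as the query body the official client accepts. The index name `indieweb-search` and the field name `content` are assumptions about the project's schema, not confirmed by the log.

```python
# Hypothetical query body for Elasticsearch's Query DSL.
# "fuzziness": "AUTO" picks an edit distance based on term length
# (0 for very short terms, up to 2 for longer ones).
query = {
    "query": {
        "match": {
            "content": {               # assumed field name
                "query": "blogroll",
                "fuzziness": "AUTO",
                "operator": "and",
            }
        }
    }
}

# With the official Python client this would be sent as:
#   es.search(index="indieweb-search", body=query)
print(query["query"]["match"]["content"]["fuzziness"])
```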
tetov-irc, rockorager, nertzy__, hendursaga, chenghiz_ and joshproehl joined the channel
# [snarfed] capjamesg nice! afaik https://indiechat.search.cweiske.de/ is elastic too
# capjamesg[d] Yep, it is!
# capjamesg[d] The prod version is using Elasticsearch now but it will be down for a bit while I import data. It's taking sooooo long.
# [snarfed] esoteric HTTP status code of the day: 506 Variant Also Negotiates. https://httpstatuses.com/506
# sknebel (e.g. google says about 5xx " Because the server couldn't give a definite response to Google's robots.txt request, Google temporarily interprets server errors as if the site is fully disallowed [...] If the robots.txt is unreachable for more than 30 days, Google will use the last cached copy of the robots.txt. If unavailable, Google assumes that there are no crawl restrictions. ")
KartikPrabhu and angelo joined the channel
# Loqi robots are automated scripts that crawl, search, or perform requests for information https://indieweb.org/robots
# [tantek] ok so there is no current robots.txt "standard" nor anyone trying to make one per https://www.robotstxt.org/robotstxt.html
# [fluffy] Our boilerplate cites https://developers.google.com/search/docs/advanced/robots/robots_txt
# [tantek] robots_txt << Google crawler’s implementation of robots.txt: https://developers.google.com/search/docs/advanced/robots/robots_txt
# Loqi ok, I added "Google crawler’s implementation of robots.txt: https://developers.google.com/search/docs/advanced/robots/robots_txt" to the "See Also" section of /robots_txt https://indieweb.org/wiki/index.php?diff=76747&oldid=76746
# capjamesg[d] [tantek] Google has some of their own rules.
# capjamesg[d] I had to read into this when I was writing my robots.txt parser.
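A robots.txt check like the one discussed here can be done with the Python standard library alone; this is a minimal sketch, and the user agent string `indieweb-search` is hypothetical. Note that the stdlib parser only implements the basic rules; Google's extra behavior quoted above (treating a 5xx on robots.txt as temporarily disallow-all) would have to be handled by the crawler itself.

```python
import urllib.robotparser

# Example robots.txt content; in a crawler this would be fetched per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Paths under /private/ are disallowed; everything else defaults to allowed.
print(parser.can_fetch("indieweb-search", "https://example.com/private/page"))  # False
print(parser.can_fetch("indieweb-search", "https://example.com/index.html"))    # True
```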
# capjamesg[d] I think Google's robots.txt parser is live on GitHub for anyone to read.
# capjamesg[d] In C++ I think.
# capjamesg[d] Yep!
# [tantek] please add to /robots_txt !
# capjamesg[d] robots_txt << Google's C++ robots.txt parser: https://github.com/google/robotstxt
# Loqi ok, I added "Google's C++ robots.txt parser: https://github.com/google/robotstxt" to the "See Also" section of /robots_txt https://indieweb.org/wiki/index.php?diff=76748&oldid=76747
# capjamesg[d] [tantek] Oh how I wish I could write up everything!
# capjamesg[d] I made a start for my personal search engine but this IndieWeb one has taken up all of my mental bandwidth that I would use for writing.
# capjamesg[d] I struggle to write and code in the same weeks.
# capjamesg[d] [tantek] 9,000 documents have been indexed in Elasticsearch. I have 50,000 pages indexed but not in elasticsearch right now.
# capjamesg[d] I wish there was an easy way to import from sqlite to elasticsearch.
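The sqlite-to-Elasticsearch import wished for here can be sketched with the standard library: read each row and emit the newline-delimited JSON that Elasticsearch's `_bulk` endpoint expects (one action line, then one document line, per row). The `pages` table schema and the `indieweb-search` index name are assumptions for illustration.

```python
import json
import sqlite3

def rows_to_bulk_ndjson(conn, table, index_name):
    """Turn every row of a SQLite table into Elasticsearch _bulk NDJSON:
    an action line followed by a document line for each row."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    lines = []
    for row in cur:
        doc = dict(zip(cols, row))
        # Use the page URL as _id so re-running the import overwrites
        # rather than duplicates documents.
        lines.append(json.dumps({"index": {"_index": index_name,
                                           "_id": doc.get("url")}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Hypothetical crawl schema, in memory for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT, content TEXT)")
conn.execute("INSERT INTO pages VALUES "
             "('https://example.com/', 'Example', 'a blogroll page')")
payload = rows_to_bulk_ndjson(conn, "pages", "indieweb-search")
# POST the payload to http://localhost:9200/_bulk
# with the header Content-Type: application/x-ndjson
```

For large imports the payload should be chunked (a few thousand rows per `_bulk` request) rather than sent in one POST.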
BinarySavior, Seirdy, [Rose] and tetov-irc joined the channel
# [tantek] Thinking of you snarfed & aaronpk: https://jj.isgeek.net/2021/09/01-214655/