#dev 2021-09-01

2021-09-01 UTC
#
capjamesg[d]
[snarfed] I have had to port over to Elasticsearch for document storage. 🙂 But it's so good.
hendursaga joined the channel
#
Ruxton
elasticsearch is great, but very heavy
#
capjamesg[d]
Ruxton unfortunately it was the best option.
#
capjamesg[d]
PostgreSQL's full text search and BM25 support were not up to the job.
#
capjamesg[d]
MySQL wasn't quite right either, although it was close.
#
capjamesg[d]
I thought I'd just go all the way with elasticsearch.
#
capjamesg[d]
The search engine is now live and working with a (very) small subset of docs from my crawl: https://indieweb-search.jamesg.blog/results?query=blogroll
#
Ruxton
Yeah it's super good, you just need some decent resources to keep it running
#
capjamesg[d]
I'm just dipping my toe into elasticsearch. If you know any good docs for relevance / fuzziness let me know.
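(For reference: a minimal sketch of the kind of fuzzy relevance query being discussed, using the official elasticsearch-py client's 7.x API. The index name `indieweb-search` and the `content` field are hypothetical examples, not taken from the actual project.)

```python
# Minimal fuzzy full-text query with the official Python client (7.x API).
# Index and field names below are hypothetical examples.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="indieweb-search",  # hypothetical index name
    body={
        "query": {
            "match": {
                "content": {
                    "query": "blogroll",
                    "fuzziness": "AUTO",  # edit-distance tolerance scaled by term length
                }
            }
        }
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("url"))
```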
#
Ruxton
I've just got a toe in too, I use it to store logs
#
capjamesg[d]
Are you using the whole ELK stack or just elasticsearch?
#
capjamesg[d]
(I'm just using elasticsearch because it's all I need for this project)
tetov-irc, rockorager, nertzy__, hendursaga, chenghiz_ and joshproehl joined the channel
#
[snarfed]
capjamesg nice! afaik https://indiechat.search.cweiske.de/ is elastic too
#
capjamesg[d]
Yep, it is!
#
capjamesg[d]
The prod version is using Elasticsearch now but it will be down for a bit while I import data. It's taking sooooo long.
#
[snarfed]
esoteric HTTP status code of the day: 506 Variant Also Negotiates. https://httpstatuses.com/506
#
[snarfed]
conneg--, conneg--, my kingdom for conneg--
#
Loqi
conneg has -4 karma over the last year
#
Loqi
conneg has -4 karma over the last year
#
[snarfed]
not low enough
#
sknebel
a very 2021 status code
#
[manton]
Sometimes I wonder what percentage of apps out there recognize any error codes except 200, 301, and 404.
#
[tantek]
410 for Webmention deletions!
#
[manton]
Excuse me, off to check if Micro.blog actually handles 410s… 🙂
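(A minimal sketch of the deletion flow tantek is referring to: per the Webmention spec, if re-fetching the source URL returns 410 Gone, the receiver should remove the stored mention. The `requests` library and the dict standing in for real storage are assumptions here.)

```python
# Sketch of Webmention deletion handling: a 410 Gone from the source URL
# signals the mention was deleted and should be removed from storage.
# `stored_mentions` is a plain dict standing in for a real database.
import requests

def reverify_mention(source_url: str, stored_mentions: dict) -> None:
    resp = requests.get(source_url, timeout=10)
    if resp.status_code == 410:
        # Source explicitly gone: delete our copy of the webmention.
        stored_mentions.pop(source_url, None)
    elif resp.ok:
        # Source still exists: re-run normal verification here.
        pass
```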
#
[snarfed]
tantek++
#
Loqi
tantek has 19 karma in this channel over the last year (52 in all channels)
#
sknebel
I mean, as a consumer you can generally treat the 5xx codes the same: "something has gone wrong and it isn't your fault"
#
[snarfed]
no no no the _key_ takeaway is that content negotiation is bad and should be abolished
#
[snarfed]
gotta stay on message
#
[manton]
[snarfed] 👍
#
[tantek]
😂 snarfed++
#
Loqi
snarfed has 25 karma in this channel over the last year (52 in all channels)
#
[fluffy]
i am not a fan of how robots.txt abuses 5xx
#
[fluffy]
per the robots spec, all 4xx errors on retrieving robots.txt should be treated as 404 (i.e. “there’s no robots.txt”), and it’s a 5xx that should be treated as “you are being disallowed globally”
#
[fluffy]
this is extremely confusing to a lot of people who think that we (at Moz) are ignoring their robots.txt because they’ve banned our IP addresses in a returns-a-403 way, and then they get very hostile at us
#
[fluffy]
and our response is always the same, “we need to be able to see the robots.txt to be able to process it, and the robots spec says to return a 500 error if you want us to treat it as a global deny”
#
[fluffy]
except that’s not what a 500 error means
#
[fluffy]
really the only codes that should mean “there is no robots.txt to process” are 404 and 410.
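(A sketch, not Moz's actual code, of the status-code policy fluffy describes: any 4xx is treated as "no robots.txt, crawling allowed", while a 5xx fails closed as a global disallow. `requests` is a third-party library and the return values are illustrative.)

```python
# Status-code policy for robots.txt fetches as described above.
import urllib.robotparser
import requests

def fetch_robots(robots_url: str):
    resp = requests.get(robots_url, timeout=10)
    if 400 <= resp.status_code < 500:
        return "ALLOW_ALL"      # 4xx (incl. 403): as if there is no robots.txt
    if resp.status_code >= 500:
        return "DISALLOW_ALL"   # 5xx: fail closed, treat the whole site as off-limits
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(resp.text.splitlines())
    return parser               # 2xx: parse and apply the rules normally
```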
#
sknebel
that's spec?
#
sknebel
afaik Google only does that temporarily
#
sknebel
(although, isn't part of the issue with robots.txt that it isn't really a spec?)
#
sknebel
(e.g. google says about 5xx " Because the server couldn't give a definite response to Google's robots.txt request, Google temporarily interprets server errors as if the site is fully disallowed [...] If the robots.txt is unreachable for more than 30 days, Google will use the last cached copy of the robots.txt. If unavailable, Google assumes that there are no crawl restrictions. ")
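(The quoted Google behavior, sketched as pseudologic; the thirty-day threshold comes from the quote above, while the helper name and rule objects are purely illustrative.)

```python
# Sketch of Google's documented 5xx handling for robots.txt, per the quote above.
THIRTY_DAYS = 30 * 24 * 3600

def rules_after_server_error(first_error_at: float, now: float, cached_rules):
    if now - first_error_at <= THIRTY_DAYS:
        return "DISALLOW_ALL"   # temporarily interpret 5xx as fully disallowed
    if cached_rules is not None:
        return cached_rules     # unreachable >30 days: use the last cached copy
    return "ALLOW_ALL"          # no cached copy either: assume no restrictions
```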
KartikPrabhu and angelo joined the channel
#
sknebel
heh, fun search bug: looked for something on the city's website, and the search found some API endpoint with the right info.
#
sknebel
got what I wanted, but JSON is not the ideal output format for a generic city website to show to a random visitor :D
#
[tantek]
at least it wasn't turtle?
#
[tantek]
what is robots
#
Loqi
robots are automated scripts that crawl, search, or perform requests for information https://indieweb.org/robots
#
[tantek]
ok so there is no current robots.txt "standard" nor anyone trying to make one per https://www.robotstxt.org/robotstxt.html
#
[tantek]
[fluffy] which robots.txt "spec" were you looking at? 🙂
#
[fluffy]
whichever one our customer service folks use whenever we get an abuse report
#
[fluffy]
and I think they cite Google’s implementation
#
[tantek]
robots_txt << Google crawler’s implementation of robots.txt: https://developers.google.com/search/docs/advanced/robots/robots_txt
#
capjamesg[d]
[tantek] Google has some of their own rules.
#
capjamesg[d]
I had to read into this when I was writing my robots.txt parser.
#
capjamesg[d]
I think Google's robots.txt parser is live on GitHub for anyone to read.
#
capjamesg[d]
In C++ I think.
#
[tantek]
capjamesg[d] any interest in writing up what you learned? It sounds like something worth documenting for other implementers
#
Loqi
[google] robotstxt: The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).
#
[tantek]
please add to /robots_txt !
#
capjamesg[d]
robots_txt << Google's C++ robots.txt parser: https://github.com/google/robotstxt
#
Loqi
ok, I added "Google's C++ robots.txt parser: https://github.com/google/robotstxt" to the "See Also" section of /robots_txt https://indieweb.org/wiki/index.php?diff=76748&oldid=76747
#
capjamesg[d]
[tantek] Oh how I wish I could write up everything!
#
capjamesg[d]
I made a start for my personal search engine but this IndieWeb one has taken up all of my mental bandwidth that I would use for writing.
#
capjamesg[d]
I struggle to write and code in the same weeks.
#
capjamesg[d]
[tantek] 9,000 documents have been indexed in Elasticsearch. I have 50,000 pages crawled but not in Elasticsearch right now.
#
capjamesg[d]
I wish there was an easy way to import from sqlite to elasticsearch.
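(For what it's worth, the bulk helper in elasticsearch-py makes this fairly direct. A minimal sketch, assuming a hypothetical `pages` table with url/title/content columns and an `indieweb-search` index; the database file name is also made up.)

```python
# Sketch of a sqlite-to-Elasticsearch import using the bulk helper.
# The database file, table schema, and index name are hypothetical.
import sqlite3
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = sqlite3.connect("crawl.db")  # hypothetical crawl database
conn.row_factory = sqlite3.Row

def actions():
    for row in conn.execute("SELECT url, title, content FROM pages"):
        yield {
            "_index": "indieweb-search",
            "_id": row["url"],  # dedupe re-imports by URL
            "_source": {
                "url": row["url"],
                "title": row["title"],
                "content": row["content"],
            },
        }

helpers.bulk(es, actions())  # batches documents into _bulk API requests
```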
#
[tantek]
I can relate to the writing (prose) vs code in the same day / week
BinarySavior, Seirdy, [Rose] and tetov-irc joined the channel
#
[tantek]
Thinking of you snarfed & aaronpk: https://jj.isgeek.net/2021/09/01-214655/
#
Loqi
[Avatar of Jj] Wed 01 September 2021
#
[tantek]
^ could use better markup for that note!
#
aaronpk
speaking of, i forgot to PESOS the last post i did on instagram
#
aaronpk
oh good mine is still working
#
[snarfed]
scraping--
#
Loqi
scraping has -3 karma in this channel over the last year (-4 in all channels)
#
[snarfed]
they did change a username or two to user IDs recently, or vice versa; I did have to handle that