2023-06-15 UTC
# capjamesg The captioning models I have seen take up a fair bit of RAM (in the traditional computing sense), but they can identify a vast range of things. Object detection models, by contrast, can generally identify fewer things, unless you use a large, general model like Grounding DINO or Facebook's Detic.