Fine-tuning multi2vec-clip with Fashionpedia

Hello, I would like to know whether multi2vec-clip was trained on Fashionpedia. If not, is it possible to fine-tune the model so it can better distinguish between clothes? I was able to run Building Multimodal AI in TypeScript | Weaviate - Vector Database and it works for detecting several garments, but my understanding is that fine-tuning would make it more "powerful".
I also have a few other questions:
What GPU do I need for fine-tuning? Is an RTX 3060M enough?
Is fine-tuning a multimodal model the same as fine-tuning a text-only model like llama3-8B?
If anyone has experience with this, I would really appreciate some advice, or even a prompt I could send to ChatGPT to get an explanation and a guide to get started.
Thanks and greetings :smiley:

hi @Jjen_95 !!

Welcome to our community :hugs:

You can definitely train your own CLIP model and swap it in as the model used by this container image:
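To give a feel for what "training your own CLIP model" optimizes, here is a minimal sketch of the symmetric contrastive (InfoNCE) objective used in CLIP-style fine-tuning. This is an illustration only: real fine-tuning (e.g. with `open_clip` or Hugging Face `transformers`) would produce these embeddings from the image and text encoders on Fashionpedia image–caption pairs; here random vectors stand in for them, and all names are mine, not from any Weaviate code.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over image-text similarity logits.

    Matching image/caption pairs sit on the diagonal of the
    similarity matrix; the loss pulls those together and pushes
    mismatched pairs apart.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    labels = np.arange(len(logits))  # correct pair for row i is column i

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 512))  # stand-in image embeddings
txt = rng.normal(size=(4, 512))  # stand-in caption embeddings
print(clip_contrastive_loss(img, txt))
```

Conceptually, fine-tuning a multimodal model like CLIP differs from fine-tuning a text-only LLM like llama3-8B: CLIP training minimizes this contrastive loss over paired images and captions, while LLM fine-tuning minimizes next-token prediction loss over text alone.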

According to that repo's docs, this is the model used by default:
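For reference, the pattern Weaviate's docs describe for baking a different (e.g. fine-tuned) model into the inference container is a small custom Dockerfile. The model names below are placeholders, not recommendations; point them at your own fine-tuned sentence-transformers-compatible model:

```dockerfile
FROM semitechnologies/multi2vec-clip:custom
# Placeholder names - replace with your fine-tuned model
ENV CLIP_MODEL_NAME clip-ViT-B-32
ENV TEXT_MODEL_NAME clip-ViT-B-32
RUN ./download.py
```

You then build this image and reference it in your Weaviate setup in place of the stock multi2vec-clip image.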

Let me know if this helps!