Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]

The post asks whether the vision-language models (VLMs) in production still use fixed-patch Vision Transformers (ViTs), a popular architecture for vision tasks. The author suggests that, with more efficient and effective tokenization schemes now available, it seems unlikely that these models still rely on fixed patches.

The discussion raises several possible reasons why dynamic tokenization has not been adopted: only marginal performance gains over fixed-patch versions, pipeline constraints that require a predetermined number of tokens per image for efficiency, or an insufficient understanding of scaling laws for input-adaptive patching. The author also asks whether these limitations are specific to the big players.

- Dynamic tokenization may not be widely adopted by large models because its gains over fixed-patch versions could be marginal.
- Pipelines may require a fixed number of tokens per image, which limits the adoption of input-adaptive patching.
- The scaling laws for input-adaptive patching are not yet well understood, which may make big players cautious about adopting it.
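
The fixed-token-count constraint mentioned above can be illustrated with a minimal Python sketch. The only established fact it relies on is that a fixed-patch ViT splits an H×W image into (H / P) × (W / P) patches of size P. Everything else is an assumption for illustration: the helper names are hypothetical, the adaptive token counts are made-up numbers, and padding each batch to its longest sequence is just one way a pipeline might handle variable lengths, not something the post describes.

```python
def fixed_patch_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Number of patch tokens a fixed-patch ViT produces for one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def padded_batch_cost(token_counts: list[int]) -> tuple[int, int]:
    """Return (useful_tokens, padded_tokens) for a batch padded to its longest sequence.

    Assumption: the pipeline pads every sequence to the longest one in the batch.
    """
    longest = max(token_counts)
    useful = sum(token_counts)
    padded = longest * len(token_counts)
    return useful, padded


# Fixed patching: every 224x224 image with 16x16 patches yields the same count.
fixed_counts = [fixed_patch_token_count(224, 224, 16) for _ in range(4)]

# Hypothetical input-adaptive tokenizer: counts vary per image (made-up values).
adaptive_counts = [64, 196, 120, 310]

fixed_useful, fixed_padded = padded_batch_cost(fixed_counts)
adaptive_useful, adaptive_padded = padded_batch_cost(adaptive_counts)

print("fixed:", fixed_useful, "useful of", fixed_padded, "padded tokens")
print("adaptive:", adaptive_useful, "useful of", adaptive_padded, "padded tokens")
```

With fixed patches the sequence length is known before the image is seen, so batches need no padding. Under the padding assumption above, the adaptive batch uses fewer useful tokens but pays for the longest image in every slot, which is one way the efficiency argument in the discussion could play out.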

