OpenGVLab/InternVideo
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
GitHub repository with 2,292 stars and 154 forks.
Language: Python
Topics: foundation-models, video-understanding, vision-transformer, action-recognition, masked-autoencoder, multimodal, open-set-recognition, spatio-temporal-action-localization, temporal-action-localization, video-question-answering