The widespread adoption of mobile Internet and the Internet of Things (IoT) has led to a significant increase in the amount of video data. Although video data are increasingly important, language and text remain the primary means of interaction in everyday communication, so text-based cross-modal retrieval has become a crucial requirement in many applications. Most previous text-video retrieval works rely on the implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationships present in the data, and it cannot help the model understand specific words or scenes. Explicit knowledge, another type of out-of-domain knowledge that usually takes the form of a knowledge graph, can play an auxiliary role in understanding the content of different modalities. Therefore, we study, for the first time, the application of an external knowledge base to text-video retrieval and propose KnowER, a knowledge-enhanced model for efficient text-video retrieval. The knowledge-enhanced model achieves state-of-the-art performance on three widely used text-video retrieval datasets, i.e., MSRVTT, DiDeMo, and MSVD.
Funding: Supported by the National Key Research and Development Program of China (No. 2020YFB1406800).