To address the poor performance of VQA models on tasks that require external knowledge, this paper constructs an external knowledge-based VQA framework that integrates a cross-modal Transformer. By introducing an external knowledge base alongside the VQA model, the framework improves the model's reasoning ability on knowledge-based tasks. The model further employs a bidirectional cross-attention mechanism to strengthen the semantic interaction and fusion of textual questions, images, and external knowledge, thereby mitigating the insufficient reasoning ability that VQA models commonly exhibit when external knowledge is required. Experiments show that, compared with the baseline model LXMERT, our model improves the Overall metric by 15.01% on the OK-VQA dataset; compared with the current state-of-the-art model, it improves the Overall metric by 4.46% on OK-VQA. These results indicate that the proposed model improves performance on external knowledge-based VQA tasks.
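To make the fusion mechanism concrete, the sketch below illustrates one common way to realize bidirectional cross-attention in PyTorch: one attention module lets question tokens query image features, and a second lets image features query question tokens, with residual connections and layer normalization as in a standard Transformer layer. This is an illustrative sketch under assumed dimensions (768-dimensional features, 12 heads), not the paper's actual implementation; the class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Minimal sketch of bidirectional cross-attention between two modalities.

    Hypothetical illustration, not the paper's implementation: dimensions,
    head count, and naming are assumptions chosen for clarity.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Text-to-image attention: question tokens act as queries over visual features.
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Image-to-text attention: visual features act as queries over question tokens.
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor):
        # text:   (batch, n_tokens, dim)  question (and knowledge) token embeddings
        # visual: (batch, n_regions, dim) image region features
        t_attn, _ = self.txt2img(query=text, key=visual, value=visual)
        v_attn, _ = self.img2txt(query=visual, key=text, value=text)
        # Residual connection plus layer norm, as in a standard Transformer block.
        text = self.norm_t(text + t_attn)
        visual = self.norm_v(visual + v_attn)
        return text, visual

# Usage: fuse a batch of 32 questions (20 tokens each) with 36 image regions.
if __name__ == "__main__":
    layer = BidirectionalCrossAttention()
    q = torch.randn(32, 20, 768)
    v = torch.randn(32, 36, 768)
    q_fused, v_fused = layer(q, v)
    print(q_fused.shape, v_fused.shape)  # both torch.Size([32, 20/36, 768])
```

In this sketch both modalities are updated in a single pass, which is what distinguishes bidirectional cross-attention from the one-directional variant where only the text stream attends to the image.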