Fan Xu

Doctoral researcher at the FSTM

Fan works on the project ‘MULTIMODAL DEEP LEARNING FOR HISTORICAL DATASETS’ under the supervision of Luis Leiva.

Traditionally, Machine Learning focused on a single format of data, extracting useful information and learning patterns with probabilistic models; in recent years, Artificial Neural Networks have come to dominate the Machine Learning community and achieved huge success across domains and data types. Since our world is full of diverse information in varying formats, analysing a single format in isolation is no longer enough: combining different formats lets them supplement each other and jointly provide more information, improving a model's capabilities such as prediction and generation. This is what Multimodal Machine Learning does.

Multimodal machine learning models are usually composed of multiple modules, each handling one format of data: they first extract embeddings from each format (representation), then fuse the embeddings from the different formats together (alignment), and finally perform downstream tasks on the aligned representation (application). Vision-language models and text-to-speech models are the best-studied multimodal models, but models that deal with motion data, tables, and time series have also made much progress. Nonetheless, multimodal models are not yet widely applied in the domain of digital history, which makes this direction both challenging and intriguing. Historical datasets usually contain blurry images, noisy audio, and unformatted text, which calls for new adaptations of multimodal models so that all the data can be processed jointly while maintaining model performance. To summarise, we focus on images, texts, and audio collected chronologically, and align all the data in order to perform downstream tasks such as finding correspondences among modalities, answering questions based on images or audio, and generating images from text.
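The representation, alignment, and application stages described above can be sketched in a few lines of toy Python. The encoders, fusion rule, and weights here are illustrative assumptions, not part of the actual project: each "encoder" just summarises one modality as a small vector, fusion is plain concatenation, and the "application" is a linear scoring head.

```python
# Toy sketch of a multimodal pipeline: representation -> alignment -> application.
# All functions are illustrative stand-ins, not a real model.

def encode_image(pixels):
    # Representation: summarise an image (flat list of pixel intensities)
    # as a 2-d embedding: [mean intensity, intensity range].
    mean = sum(pixels) / len(pixels)
    return [mean, max(pixels) - min(pixels)]

def encode_text(tokens):
    # Representation: summarise a token list as a 2-d embedding:
    # [number of tokens, average token length].
    lengths = [len(t) for t in tokens]
    return [float(len(tokens)), sum(lengths) / len(lengths)]

def fuse(img_emb, txt_emb):
    # Alignment: map both modalities into one joint vector
    # (here, simple concatenation; real models learn a shared space).
    return img_emb + txt_emb

def predict(joint, weights):
    # Application: a linear head over the fused embedding,
    # standing in for a downstream task such as matching or QA.
    return sum(w * x for w, x in zip(weights, joint))

image = [0.1, 0.5, 0.9, 0.3]              # a "blurry historical image"
text = ["old", "map", "of", "luxembourg"]  # an "unformatted caption"

joint = fuse(encode_image(image), encode_text(text))
score = predict(joint, [0.5, 0.5, 0.1, 0.1])
print(len(joint), score)  # 4-d joint embedding and its task score
```

In a real system each encoder would be a neural network and the fused space would be learned (e.g. with a contrastive objective), but the three-stage structure stays the same.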