A Model Comparison between Neural Architectures of Human Bilingual Sentence Processing
This work investigates phenomena related to human bilingual sentence processing in neural language models, asking whether and how the emergence of these phenomena depends on the model architecture. To this end, we train SRNs, LSTMs, and Transformers with different hidden layer sizes as bilingual and monolingual language models. We test these models on three phenomena that the literature has shown to emerge in at least one of the architectures: reading time prediction (agreement between monolingual vs. bilingual human reading and models trained on monolingual vs. bilingual data, respectively), the cognate facilitation effect (faster processing of words that are similar in form and meaning across languages), and the grammaticality illusion (a preference for the ungrammatical version of a certain class of sentences that is reversed in some languages). Surprisingly, we found reading time prediction to depend not only on architecture and layer size, but also on the specific random initialization. The author of the original study confirmed our failure to reproduce the effect, suggesting that the original finding may have been coincidental. The cognate facilitation effect was present in the SRN and LSTM, providing further evidence that its emergence in humans is due to the cumulative frequency of cognates. In the LSTM, the effect decreased in magnitude for large layer sizes, which can be linked to the LSTM relying less on corpus frequency. Surprisingly, however, the effect increased for small layer sizes in the SRN, a trend for which we have no adequate explanation. The effect was absent in the Transformer, suggesting that Transformers exhibit less cross-linguistic transfer than the other architectures. The grammaticality illusion was present in the SRN, but not in the LSTM or Transformer.
This provides further evidence that the effect arises from short-distance language statistics rather than universal working memory constraints. The effect remained fairly constant across layer sizes, and syntactic cross-linguistic transfer was found to be small. Furthermore, the Transformer displayed a consistent preference for grammatical sentences, suggesting super-human syntactic proficiency on this particular task.