John Bateman continued on SYSFLING 28 Nov 2024, at 21:35:
…This relates to a concern that has been discussed for a couple of years now: the degradation of training data through the inclusion of material created by large language models rather than produced by humans. Since language models are often trained on (English) texts collected from the web, if the web contains language-model-produced nonsense (e.g. incorrect, ideologically unsavoury, wildly enthusiastic, etc.), then one can expect more of the same. So even circulating these produced 'texts' is potentially contributing to the garbage heap.

And, just to be clear, I like large language models a lot; we use them all the time in our research and even for some text production. But one does not find out much about them by asking their 'opinion', despite how warm and cuddly the sequences of tokens produced appear!
Blogger Comments:
[1] To be clear, language that is 'incorrect', ideologically unsavoury, or wildly enthusiastic is not thereby 'nonsense'. These terms, like 'nonsense' itself, express attitudes to language, and they apply just as much to the language of humans as to the language produced by LLMs, since the latter mimics the former.
[2] To be clear, linguists, as the name implies, are interested in finding out about language. LLMs provide one means of doing so.