
Can Language Data Help Speak Truth to NLP’s Creations?

Like language AI, language data can be the basis for quite a range of things. Within localization, NMT (neural machine translation) and MTPE (machine translation post-editing) are the vanguard technologies for maximizing the automation of translation and other key processes involved in bringing content from one language to another. In this realm, language data-based AI provides a vastly accelerated service aimed at empowering human linguists to focus on the vital quality assurance step of human linguistic review. Beyond solutions for the language services industry, however, a very different notion of language AI is now making headlines on an almost weekly basis, as we learn what natural language processing (NLP) based AI agents have managed to achieve in their brief and highly experimental time among us. The latest? According to Wired, an AI dispatched to a comments board on Medicaid.gov managed to post half of the 1,000 comments that accumulated in response to a request for public feedback on a proposed change to Idaho’s Medicaid program. More importantly, though, the people reviewing the comments were unable to distinguish those posts from real people’s concerns.

In general, it is easy to dismiss bots that can do astonishing things in language as something of a stunt designed to show off the power of what developers can create. AIs that can litter the internet with highly persuasive spam mark a new addition to the world of human concerns, though, and one that we have only started to grapple with. Whether it arises from intentional disinformation and the manipulation of images and voices in deepfakes, or simply accumulates in our midst as all kinds of non-human linguistic agents take to the web, inauthentic information will need sorting through, and that task is bound to be among the most pressing challenges ahead for…you guessed it: artificial intelligence. With AI that can write well enough to render us gullible comes the need for AI that can read well enough to call its bluff.

Related:  How LSPs Can Keep on Truckin’ with Green Tech

All of this may only be to say that the perceived distinction between the kind of language AI valued by language service providers and the kind of language AI promenading at large does face a reckoning. NLP bots have the potential to obfuscate communications at an intractable scale, but the alternatives to grappling with that challenge are just as unattractive. As noted in Wired, the case of the impostor who visited Medicaid’s site is particularly challenging to almost any assumption about how a government resource can or should be run in the digital age. If not a fraud detection mechanism, what means is there to insulate the government’s openness to the public from abuse apart from the removal of such portals from public access?

All of that, however, may only be to validate the aforementioned distinction and the work of language service providers in the realm of MT. Thanks to the deliberate training of machine learning models to support the development of multi-market communications, a great deal of language data already exists to reflect what authentic human communications can be – and, potentially, to reflect what content touched by a machine is likely to resemble. While models that generate convincing but baseless claims may be trained on annotated language data, that does not mean models to counter them will not also rely on the same basic input – language data – to compete in detection. With a growing range of concerns that hinge on fundamentally linguistic problems just now coming into existence, the linguistic data that enterprises have accumulated while working toward good-faith outcomes is only set to become all the more valuable, even if it lacks the show-stopping character of algorithms like GPT-3 and its ilk.


CSOFT’s extensive work with MT and AI-driven localization solutions extends across industries, sectors, and global markets in more than 250 languages. Learn more about our linguistic resources at csoftintl.com!
