The Linguist in the Machine

Last week we posted a piece about a dolphin research team in the Bahamas that is using an underwater microphone in an attempt to communicate with dolphins. Though the project is still in its early stages, there is already wild speculation that a Dr. Dolittle-esque world is coming soon, one where you can chat with the birds in the trees and the fish in the seas. Seeking an in-the-know analysis, I sat down with Robert Derbyshire, a linguist and expert in the field of machine translation (M.T. for short) and the Senior Manager of Product Development at CSOFT International.

My own notes are in parentheses.

Maxwell: Why is machine translation better at some languages than others?

Derbyshire: If the languages are fairly similar in structure, it’s easier for a machine to translate. Some languages are very difficult, for example East Asian languages like Korean or Japanese, which don’t even have spaces between words. That adds a whole extra layer of complexity, where the computer may not even know what constitutes a word. That’s the main reason that Asian languages are so difficult. It’s also related to the amount of data that you can feed into the engine. The main reason that Google is so good at English to French, Italian, German, and Spanish is that so much data is getting fed into the system, whereas if you wanted to translate from Burmese into Tamil, machine translation just wouldn’t be adequate.
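To make the "what constitutes a word" problem concrete, here is a minimal sketch of greedy longest-match segmentation, one simple way a program might split unspaced text. It is only an illustration: the function name `max_match`, the tiny vocabulary, and the unspaced English sample are all made up for this example, and real segmenters for Japanese or Chinese are far more sophisticated.

```python
def max_match(text, vocab):
    """Greedy longest-match segmentation.

    At each position, take the longest vocabulary word that fits;
    if nothing matches, emit a single character and move on.
    """
    longest = max(len(word) for word in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; every entry is a real English word.
VOCAB = {"the", "theta", "table", "bled", "down", "own", "there"}

# Intended reading: "the table down there"
print(max_match("thetabledownthere", VOCAB))
# prints ['theta', 'bled', 'own', 'there']
```

Every token the program returns is a legitimate word, yet the split is not the one a human reader would choose. That ambiguity is the extra layer of complexity the engine has to resolve before translation even begins.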

Maxwell: If a machine translation is starting with a new language, how long would it typically take before you get a decent translation like you would from a human linguist?

Derbyshire: In terms of starting an engine, using it, finding and fixing problems until you finally get a good translation…(he stops, lips pursed and brow wrinkled)…at least a couple of years.

Maxwell: Could machine translation be used for a dead language?

Derbyshire: Yes, if you have enough to feed your engine with. If you could find texts to use as the base corpus, yes. That’s the advantage of statistical machine translation; you don’t have to understand the language like a linguist would to create an M.T. engine. With rule-based engines, the computer really has to understand the language.

Maxwell: Which is more popular right now, rule-based or statistical translation?

Derbyshire: Statistical seems to have won the battle.

Maxwell: What’s the reason for that? Is it somehow better?

Derbyshire: It just requires less effort to get an acceptable output.

Maxwell: Could M.T. be used for languages that have never been spoken, like invented languages?

Derbyshire: There’s no reason why not.

Maxwell: Could M.T. be used for non-human language?

Derbyshire: Yes (laughs). Like I said, the only thing that you need is a corpus, with at least 10,000 segments already translated.

Maxwell: Let’s say that you were doing a non-human language where presumably the structure would be radically different from anything we’re familiar with; how long do you think it would take to create a translation engine?

Derbyshire: (Pauses, thinking) To be honest, there are human languages that we still don’t have decent machine translators for. Things are improving but for languages like Japanese, machine translation should be avoided. So I’d say the prospect isn’t particularly hopeful.

Maxwell: Does that depend on the amount of information inputted into the system or is it dependent on something else?

Derbyshire: The amount of information you put into the system will improve the translation but the way that M.T. algorithms currently work, it depends on many, many other factors.

Maxwell: How long before East Asian or obscure African languages become machine translatable?

Derbyshire: You know the quotes I sent you yesterday? (He’s referring to two quotes from a presentation he sent prior to the interview. The first, from 1955, is “Thoroughly literate translations [created by machine] of literary works as good as published, run of the mill translations will be commonplace in 10 years.” The second, from Google in 2009, said, “Even today’s most sophisticated software, however, doesn’t approach the fluency” of a native linguist, “or the skill of a professional translator. Automatic translation is very difficult, as the meaning of words depends on the context in which they’re used. While we’re working on the problem, it may be some time before someone can offer human quality translations.”) What I love about those quotes is that they were far more confident in the 1950s than they were in 2009 about machine translation. It’s a very complicated area, but that’s not to say we haven’t made huge advances.

Robert Derbyshire on the Dolphin Translator
There are some prerequisites to machine translation. The main prerequisite is being able to tell the computer “A-B-C in this language means X-Y-Z in that language,” and you need a bunch of those equations, around 10,000. So for the dolphin linguist, you’d have to be pretty sure what a click or a whistle meant so you could input “click equals ‘A’ and whistle equals ‘B’” into your engine. If you did have that sort of data, then feasibly you could develop an M.T. engine using audio files translating from dolphin into English.
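As a rough illustration of how a statistical engine can work from such "equations" without understanding either language, the sketch below learns word-to-word correspondences from three paired segments using the expectation-maximization loop of IBM Model 1, a classic statistical alignment model. It is not CSOFT's engine or any production system: the names `train_word_table` and `best_match` and the symbol data are invented for the example, and a usable engine would need something closer to the 10,000 segments mentioned above.

```python
from collections import defaultdict


def train_word_table(pairs, iterations=20):
    """Estimate t[(tgt, src)] = P(tgt | src) with IBM Model 1 style EM."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference at all
    for _ in range(iterations):
        count = defaultdict(float)  # expected (tgt, src) co-translations
        total = defaultdict(float)  # expected translations per src word
        for src, tgt in pairs:
            for w in tgt:
                # Share the credit for w among the source words in this segment.
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    share = t[(w, s)] / norm
                    count[(w, s)] += share
                    total[s] += share
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


def best_match(t, src_word):
    """Return the target word with the highest probability for src_word."""
    candidates = {w: p for (w, s), p in t.items() if s == src_word}
    return max(candidates, key=candidates.get)


# Made-up parallel corpus: uppercase symbols are the "unknown" language,
# lowercase symbols are the known one. No meanings are given to the program.
CORPUS = [
    (["A", "B"], ["x", "y"]),
    (["A", "C"], ["x", "z"]),
    (["D", "C"], ["w", "z"]),
]

table = train_word_table(CORPUS)
for src_word in "ABCD":
    print(src_word, "->", best_match(table, src_word))
```

On this toy data the loop settles on A with x, B with y, C with z, and D with w purely from which symbols keep appearing together. The catch for the dolphin case is the right-hand side of each pair: someone still has to supply the known-language half of every segment.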

But the true meaning of dolphins’ sonorous language has yet to be revealed, even with the massive recording databases already amassed. If Mr. Derbyshire’s down-to-earth analysis is right, it’ll be a long time before Flipper can tell us how he’s feeling.

……………………………………………………………………………………………………………

If you’re interested in learning more about CSOFT’s globalization and localization solutions, don’t forget to visit us at csoftintl.com!

CSOFT International in Language & Culture, Language Technology | April 24, 2014

Published: April 24, 2014

Updated: March 19, 2021
