Friday, February 17, 2006

Elements of a Great Dictionary: Spelling

Once we have decided that a word belongs in a great dictionary, we must add it by selecting a character or a series of characters to represent that word. That is, we must select a spelling, or more specifically, an orthography for that word. In most cases, this is not a difficult task. However, orthographies vary. In English, there is frequently a difference between US and Commonwealth spellings, for instance, colorize/colourise. Spelling may also vary with time. It is no longer common to add a hyphen in "to-day." Since a great dictionary does not lack space, we can treat these simply as separate entries (marked with "archaic", "dated", "UK", "US", "rare," etc. as appropriate), and link them as translations or synonyms, as we like (though this has already been the subject of some debate).

Orthography also encompasses such entities as diacritics and ligatures. Diacritics can alter the pronunciation and even the meaning of a word. In Spanish, sí means yes, but si means if. English employs few diacritics and ligatures, and those that it does employ tend to vanish over time, as in rôle, coöperate, and encyclopædia. When diacritics create a spelling variation, we should handle them simply as variant spellings.

This issue, though, may be more technological than linguistic. Computers need character sets installed in order to display or enter special characters, ranging from diacritics to the entire scripts of some languages. Many users have difficulty displaying and entering these special characters, resulting in mojibake. I'm sure it's no coincidence, for example, that the Tamil Wiktionary has a link in English right at the top for help with character display.
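One common way mojibake arises is when text saved in one encoding is read back as another. Here is a minimal Python sketch of that effect, using UTF-8 and Latin-1 purely as an illustrative pair of encodings (the variable names are mine):

```python
# Illustrative only: the same bytes interpreted under two different encodings.
text = "encyclopædia"

utf8_bytes = text.encode("utf-8")        # æ is stored as the two bytes C3 A6
garbled = utf8_bytes.decode("latin-1")   # each byte is misread as its own character

# Reversing the wrong step recovers the original, as long as nothing was lost.
restored = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(restored)
```

The single ligature turns into two unrelated characters, which is the sort of thing a reader sees when a page and a browser disagree about the encoding.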

For those who don't happen to have a button for æ or û on their keyboard, a great dictionary should have a palette of special characters in software, such as the examples here or here. It should be accessible from the search interface, as well as the edit screen. It should be easy to use, and flexible, so that one doesn't have to hunt past all of the Greek and Cyrillic alphabets just to find á. In fact, the set(s) of characters to display by default should be configurable in the user interface.

A great dictionary should also recognize that users do not always spell a word correctly. Since a free, digital dictionary may be used for spell-checking, we do not want entries for misspellings, unless we have some very clear way to filter them out. Recognizing, though, that one of the functions of a dictionary is to provide correct spelling or orthography for those who don't know it, the search function should be broad enough to retrieve "definitely" if a user types "definately", or to retrieve "crème" for "creme" or "créme". Do you spell perfectly, or would you like a little help?
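As a rough sketch of what such a forgiving search might look like, here is one possible approach in Python: ignore diacritics first, then fall back to approximate matching. The headword list, the function names, and the 0.8 similarity cutoff are all assumptions made for illustration, not a description of how any existing dictionary does it.

```python
import difflib
import unicodedata

# Hypothetical headword list; a real dictionary would have far more entries.
HEADWORDS = ["definitely", "crème", "role", "rôle", "cooperate", "si", "sí"]


def fold(word):
    """Lowercase a word and strip combining marks (crème -> creme)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest(query, headwords=HEADWORDS, limit=3, cutoff=0.8):
    """Return headwords that plausibly match a possibly misspelled query."""
    folded_query = fold(query)

    # First pass: entries identical to the query once diacritics are ignored.
    exact = [w for w in headwords if fold(w) == folded_query]
    if exact:
        return exact

    # Second pass: approximate matching against the folded headwords.
    folded_index = {}
    for w in headwords:
        folded_index.setdefault(fold(w), []).append(w)
    close = difflib.get_close_matches(
        folded_query, list(folded_index), n=limit, cutoff=cutoff
    )
    return [w for key in close for w in folded_index[key]]


print(suggest("definately"))
print(suggest("créme"))
print(suggest("si"))
```

With this small list, "definately" leads to "definitely", both "creme" and "créme" lead to "crème", and "si" returns both si and sí so the user can choose. Note that the misspellings never become entries themselves; they only exist as queries.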

3 Comments:

Blogger Minh Nguyễn said...

This comment has been removed by a blog administrator.

7:10 am  
Blogger Minh Nguyễn said...

I saw the diaereses in a New Yorker article a while back, and although I quickly understood the intent, it kind of puzzled me that this particular author was alone in using that accent mark – apparently it was the publisher, not the author. So thanks for the clarification.

Regarding character palettes: some languages have writing systems far too complex for simple palettes, although the one provided at the Hungarian Wiktionary looks very nice. At the Vietnamese Wikipedia, we had a user develop a JavaScript-based IME for the site, supporting the three main “standard” input methods for Vietnamese: Telex, VNI, and VIQR. We still have a character palette, but the multitude of Vietnamese characters that we used to include there are now hidden from JavaScript-supporting browsers, and the only characters left are things like dashes, the đồng sign, arrows, and math operators.

A Vietnamese-language IME for our Wikipedia edition was inevitable. Vietnamese web surfers expect built-in IMEs on webpages, because web cafés can’t be depended on to install the appropriate software. Until we hooked up our insanely useful IME, we got constant complaints from users about the lack of such a feature.

I’m still waiting for the Chinese and Japanese Wikipedias to come up with something similar, since they’d seem to have more of a need for a built-in IME. But maybe Chinese and Japanese users have better access to IMEs already.

7:11 am  
Blogger Unknown said...

For colorize/colourise, I'm not sure, but here is my understanding (if somebody could confirm):

In WZ, we will have a bag of words/spellings on one side, and a bag of definitions/meanings (called DefinedMeanings) on the other side, and what we'll do is link the spellings to the meanings. So, we should be able to link colorize and colourise to the same definition(s). Therefore,
1. We'll have the advantage of having an entry for each spelling
2. If we modify the DefinedMeaning, it is modified for the 2 spellings.

For now, in Wiktionary, we have 1 (if we copy the definition in 2 articles) or 2 (if we do a redirect), but never both for a given meaning. So, if what I understood is correct, this is a real improvement, not having to always wonder "do I have to create a separate entry or a redirection?".

Also, 2 spellings will share the same meaning, but they should have a different "usage note" field where we can put USA, GB, archaic or whatever.
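To make the idea concrete, here is a tiny Python sketch of two spellings pointing at one shared meaning. The class layout, field names, and definition text are placeholders of my own, not WZ's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class DefinedMeaning:
    definition: str  # placeholder wording, for illustration only


@dataclass
class Spelling:
    text: str
    usage_note: str = ""                          # per-spelling: USA, GB, archaic...
    meanings: list = field(default_factory=list)  # shared DefinedMeaning objects


shared = DefinedMeaning("to add color to a black-and-white image")
colorize = Spelling("colorize", usage_note="USA", meanings=[shared])
colourise = Spelling("colourise", usage_note="GB", meanings=[shared])

# Editing the shared meaning once changes it for both spellings.
shared.definition = "to add colour to a monochrome image or film"

print(colorize.meanings[0].definition)
print(colourise.meanings[0].definition)
```

Each spelling keeps its own entry and usage note, while the definition lives in exactly one place.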

3:28 pm  
