Monday, February 20, 2006

Admins or Trusted users

Given what WiktionaryZ is about, we should appreciate that when you use a word to describe something, existing usage implies all kinds of things. Within the MediaWiki software we have things that are typically called "bureaucrats" and "admins". On the Dutch project the word "moderator" is used instead of "admin".

The function of an admin is quite specific; an admin is trusted with extra functionality that helps to manage the project. This functionality allows for the rollback of changes, for the deletion of content and for the blocking of users who have vandalised the project.

The function of a bureaucrat is also very specific; a bureaucrat can make a user an admin or a bureaucrat and it will also become possible to mark a user as a bot.

Technically, both these sets of tools should be given to people who are trusted: trusted to use the tools for what they are intended to do. It does not mean at all that any special expertise about subject matter comes with the status that these titles seem to imply. I am therefore of the opinion that the names of these functions are misnomers; they do not adequately reflect what the functions are. The name "trusted user" would better reflect what we call an admin. A bureaucrat is nothing but an admin; when consensus has been reached that someone is to be trusted, the bureaucrat acts upon it.

Within the Wiktionaries, and certainly within WiktionaryZ, there is another group that will be more relevant: the people who have expertise. This expertise is multifaceted; it does not take much expertise to add many translations to a phrase like "Olympische Winterspelen". I have added some 20 translations, but I do not have the expertise to say that they are correct. I have written a lot about WiktionaryZ, so I do have a certain type of expertise about the subject of terminology, lexicology and thesauri.

Most of the time when people work on the content of a wiki project, the admins have a good feel for distinguishing the good from the bad. Typically it is rather obvious when something is bad. You do not have to be much of an expert to do this; you have to be trusted to give a consistent best effort. That is all.

My proposal is therefore to rename "admin" to "trusted user" and "bureaucrat" to "admin" for WiktionaryZ. When people have a minimum of say 500 or maybe 1000 good edits, a language community can be asked if they have objections. When there is no objection, the admin for that language community just does the deed.

When someone proves that he is not to be trusted as a vandal fighter, the person is first warned and when this proves not to be enough the trust is revoked and the person has to win back the trust.

This does not mean at all that the person is less of an expert in his subject matter; it is just that he is not trusted as a vandal fighter. I am sure that in time we will create functionality that is best trusted in the hands of people who know a language well. These tools are different from the current vandal-fighting tools; there may be some overlap, however.

So what do you think?


Sunday, February 19, 2006

Elements of a Great Dictionary: Definitions

The definition is what many would consider the heart of a great dictionary.

It is best to write definitions for words that you know, especially at first. To write a definition, I prefer to begin with an example of how the word is used. I may spot this example in the "wild", by reading it somewhere or hearing it in a conversation. I may seek out an example, an easy task in an age of search engines. If I'm fairly familiar with a word, I may just make up a sentence using the word. Regardless of the source, this example gives me some context.

Let's take the word "cakewalk", which has no definition yet, as I write. This word has a few different senses in English. The one that's familiar to me would go in a sentence like, "Since I have studied English for years, that English quiz should be a cakewalk." (If it's a good sentence, I'll use it later, for an example in my article.)

Now, I try to imagine that I am explaining this word to somebody who hasn't heard of it before, especially somebody who is trying to learn English as a second language. I look back at the example sentence. What did I mean by this word, in simple terms? I meant that the quiz would be easy. So we have that definition: something easy.

At this point, I start thinking of some synonyms for this word. Easy and simple both fit this sense. Basic, insulting, and silly are all a bit outside of the sense of this word, but it's okay to feel around for meanings at this point. It's easy to cross off some extras that don't fit later, and the process of thinking of them might lead to other good ideas. You can also think in terms of antonyms, or phrases, such as "not difficult" or "not presenting a challenge."

One thing that also seems to help beginners is to use the word in a sentence: "A cakewalk is something that is not difficult or does not present a challenge." If it helps to think in these terms, please do. When writing the definition itself, though, leave out the part "A cakewalk is..." or "This is a word meaning...". This is a dictionary, so there's no need to reiterate that the meaning is what's to come.

So the definition goes like this, then:
cakewalk - 1. Something that is easy or simple, or does not present a challenge.

Now, the term "cakewalk" has a history, including other, earlier meanings. If you didn't happen to know that, just put in the definition(s) that you can. The beauty of a wiki is that somebody else can expand an article later. I happen to recall that a cakewalk was originally a contest, and is also a form of music, but I never learned the details of those contexts, at least not well enough to write a proper definition. This is where some research might come in handy.

Wikipedia has an article about cakewalk. (Many other sources do, too, but it's important to be careful of copyright status, so I prefer to check free and open references first.) It turns out that it was a dance contest, and the dance and the music associated with it. The prize was a cake. Part of this information goes in additional definitions, and part goes in the etymology.

I will talk about the other parts of this article as I continue this series, but you can see the finished product of this thought process here.

Friday, February 17, 2006

Glossaries from WiktionaryZ ... and well ... some features that help to improve the contents of WZ

At this moment I am translating the article about the XX Olympic Games from Italian to Neapolitan. A thing that seems to be easy, but is not ... many sports terms do not exist in Neapolitan dictionaries, and very often I need to search quite a long time to find the term that fits. For translation I use OmegaT, since it makes sense to re-use the previous translation when I later update the text online. For small Wikipedias it makes sense to work this way, since writing your own articles often takes longer. Being a translator makes things easier for me as well, since I simply use the tools I always use.

OmegaT has a glossary function that is already very helpful, but it is not what it could be. At this moment I am adding new terminology for my glossary here: http://nap.wikipedia.org/wiki/Utente:SabineCretella/lista_di_parole_IT-nap since terms must be approved in some way and be available for all to use within OmegaT. To integrate them into the actual Wiktionary you need to create the pages manually, one by one, and that takes time. Then you need to update your offline glossary, the OmegaT glossary I mean, in order to get proposals for the terminology to use, which means converting the above page into OmegaT glossary format each time you update. If there are corrections to be made, you need to make them both in the glossary list and on Wiktionary (if the terms have been created there).

WiktionaryZ together with OmegaT can resolve this problem, which costs a lot of valuable time that could be used for creating new content instead of repeating work over and over again. Therefore we thought about a reference implementation for a translation glossary - that would be THE solution. Instead of adding the words manually, I just click on source + target and add the pair to WiktionaryZ, and when I go ahead with my translation I can either update the glossary on my local machine or work directly with WiktionaryZ. That would make a huge difference ... loads of time saved, higher quality and more coherent terminology ... it would be simply great.

Then there is another feature that would ease the work a lot: the "assemble from portions" option that DéjàVu, a commercial CAT tool, has. This would mean that while you translate the Wikipedia article, OmegaT does not only propose the term to be used, but already inserts it into the translation. If assemble from portions is used effectively it can help you a lot, in particular when translating pages like those of the calendar, where you often have parts of phrases like "politician and sociologist" or "musician and composer" and so on.
Hmmm ... I'd love to have all that now ... it would help so much with creating contents.
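
Until something like that exists, the conversion step can at least be scripted. Here is a minimal sketch in Python. It assumes the word list is kept as plain lines of the form "source = target" (that layout is an assumption for illustration, not necessarily how the page above is formatted), and it writes the tab-separated "source<TAB>target" lines that OmegaT reads as a glossary file. The function names and the sample terms are made up.

```python
def parse_word_list(text, separator="="):
    """Extract (source, target) pairs from lines like 'source = target'.

    The 'source = target' layout is an assumption for this sketch;
    lines without the separator, or with an empty side, are skipped.
    """
    pairs = []
    for line in text.splitlines():
        if separator not in line:
            continue
        source, target = line.split(separator, 1)
        source, target = source.strip(), target.strip()
        if source and target:
            pairs.append((source, target))
    return pairs


def to_glossary(pairs):
    """Render pairs as tab-separated lines, one glossary entry per line."""
    return "".join(f"{source}\t{target}\n" for source, target in pairs)


# Placeholder terms only; real entries would come from the word list page.
sample = "term_it_1 = term_nap_1\n\nthis line has no separator\nterm_it_2 = term_nap_2"
print(to_glossary(parse_word_list(sample)), end="")
```

The output would be saved as a UTF-8 text file in the project's glossary folder. It still has to be re-run after every change to the list, which is exactly the repeated work a direct link to WiktionaryZ would remove.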


Elements of a Great Dictionary: Spelling

Once we have decided that a word belongs in a great dictionary, we must add it by selecting a character or a series of characters that represent that word. That is, we must select a spelling, or more specifically, an orthography for that word. In most cases, this is not a difficult task. However, orthographies vary. In English, there is frequently a difference between US and Commonwealth spellings, for instance colorize/colourise. Spelling may also vary with time. It is no longer common to add a hyphen in "to-day." Since a great dictionary does not lack space, we can treat these simply as separate entries (marked with "archaic", "dated", "UK", "US", "rare," etc. as appropriate), and link them as translations or synonyms, as we like (though this has been the subject of some debate already).

Orthography also encompasses such entities as diacritics and ligatures. Diacritics can alter the pronunciation and even the meaning of a word. In Spanish, sí means yes, but si means if. English employs few diacritics and ligatures, and those that it does employ tend to vanish over time, as in rôle, coöperate, and encyclopædia. When diacritics create a spelling variation, we should handle them simply as variant spellings.

This issue, though, may be more technological than linguistic. Computers need character sets installed in order to display or enter special characters, ranging from diacritics to the entire scripts of some languages. Many users have difficulty displaying and entering these special characters, resulting in mojibake. I'm sure it's no coincidence, for example, that the Tamil Wiktionary has a link in English right at the top for help with character display.

For those who don't happen to have a button for æ or û on their keyboard, a great dictionary should have a palette of special characters in software, such as the examples here or here. It should be accessible from the search interface, as well as the edit screen. It should be easy to use, and flexible, so that one doesn't have to hunt past all of the Greek and Cyrillic alphabets just to find á. In fact, the set(s) of characters to display by default should be configurable in the user interface.

A great dictionary should also recognize that users do not always spell a word correctly. Since a free, digital dictionary may be used for spell-checking, we do not want entries for misspellings, unless we have some very clear way to filter them out. Recognizing, though, that one of the functions of a dictionary is to provide correct spelling or orthography for those who don't know it, the search function should be broad enough to retrieve "definitely" if a user types "definately", or to retrieve "crème" for "creme" or "créme". Do you spell perfectly, or would you like a little help?
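
To make the idea of a forgiving search a little more concrete, here is a minimal sketch in Python. It is only one possible approach, not a description of how any Wiktionary search works: it folds away diacritics with the standard unicodedata module and then looks for near matches with difflib.get_close_matches. The names fold and suggest, the cutoff of 0.8 and the tiny headword list are all assumptions made for illustration.

```python
import difflib
import unicodedata


def fold(text):
    """Lowercase the text and strip combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest(query, headwords, limit=3, cutoff=0.8):
    """Return headwords whose folded spelling is close to the folded query.

    The best match comes first. The cutoff is an arbitrary choice for
    this sketch; a real dictionary would need to tune it.
    """
    by_folded = {}
    for word in headwords:
        by_folded.setdefault(fold(word), []).append(word)
    close = difflib.get_close_matches(
        fold(query), list(by_folded), n=limit, cutoff=cutoff
    )
    return [word for key in close for word in by_folded[key]]


# A made-up headword list, just for the example.
HEADWORDS = ["definitely", "crème", "define", "cream"]

for typed in ("definately", "creme", "créme"):
    print(typed, "->", suggest(typed, HEADWORDS))
```

Note that the misspellings never become entries themselves; they are only matched against correct headwords at search time, which keeps the word list usable for spell-checking.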

Thursday, February 16, 2006

Elements of a Great Dictionary: Words

Now that we have identified a few items that don't belong in a dictionary, let's start with the basics of what does belong: words. A great dictionary, seeking to catalog all words, must begin with words themselves, or more precisely, with lexemes.

Linguists and writers of dictionaries distinguish between words and lexemes because they are not the same. Most words are lexemes, though a few (such as lickety, hightail and handbasket) have little or no existence outside larger phrases. More commonly, a phrase may comprise a lexeme, when it has a single, idiomatic meaning. When we say in English that something is "old hat," we mean not that it is headgear of some great age, but that it is familiar and well-practiced. Phrases of this sort also need special attention to translation. By contrast, a phrase like "Greek history" merely describes history of or relating to Greece, so we need not define more than the component words, in such a case. Of course, language can be subjective, and the distinction is not always clear.

What shall we do with words hovering on the edge of a language? Most people, I think, would agree that paper, archaic, and horsefeathers are English words that belong in a great dictionary. Some words, though, are not so clear. How widely must the term ginormous, for instance, be used before it is deemed a word? What about l337? In formal writing, the term humongous is rejected as a non-word, yet most native English speakers have heard it and used it colloquially, and would agree that it means "very large". How long must words like metrosexual or astroturfing exist before they merit inclusion?

In some languages, a government-appointed academy decides what words exist, and how to spell them. A great dictionary should have some means of flagging words that are accepted in this manner. In other languages, including English, publishers of dictionaries set standards of word-worthiness by choosing what words to include in the limited space in their dictionaries. Either way, they employ some arbitrary standard, such as how long or how frequently a word has been in use.

Whether academies, publishers, or nobody prescribes standards and proper usage, no language is static. People reuse words for novel meanings, or invent new or combined words, often as slang, at first. New technologies and concepts need names. A great dictionary is inclusive, and it should reflect the shifting language, though it should caution users about questionable words. At the same time, a dictionary should not seek to introduce new words, nor words that are used only by three friends at a certain school.

There must be some basic standard, or at least guideline, by which to judge the existence of a word. Search engines offer an excellent (though by no means certain) tool for determining how popular or widespread a word is, at least in print. The existing Wiktionaries use a combination of guidelines and consensus. In WiktionaryZ, the community surrounding each language should decide the answers for itself.

Wednesday, February 15, 2006

Elements of a Great Dictionary

As a writer of dictionaries, I do a lot of work on definitions. Let's start with this question, then: what is a dictionary? Particularly, what makes a dictionary great, and what should a great dictionary be? We're trying to develop the world's greatest dictionary in WiktionaryZ, so I think these are reasonable questions to ask.

It's one of those questions we all think we know how to answer. Obviously, a dictionary is one of those big, thick books you haul out when children aren't quite tall enough to sit at the dinner table!

On closer inspection, though, the answer is not that obvious, so I'd like to take a closer look at what a dictionary is, and what a dictionary is not. I'll start with a few words about what a dictionary is not. A dictionary is not an encyclopedia. In an encyclopedia article, you should find information about a particular subject, for instance, paper. An encyclopedia might have a history of paper and paper-making, and its influence in world history. It might have information about different types of paper, or standard paper sizes. A dictionary, by contrast, will have information about the word paper, itself: etymology, pronunciation, meanings, and so on. There may very well be overlap. An encyclopedia article may begin with a brief description of what something is, or a word's history. Similarly, a dictionary may contain a few sentences about a subject to clarify a definition.

A great dictionary, furthermore, is not made of paper. Not these days. In an electronic age, a great dictionary should not be limited by the size of a bookshelf or a binding. Rare words should have their place. Complete details about words should have their place. Size and weight are not the only reasons to build a non-paper dictionary. A non-paper dictionary need not contain words in some linear order, and it can organize words by their relationships to one another. It can also retrieve information by an electronic search, either on headwords or on content. The information in it, if correctly structured, can be used for other purposes, such as spell-checking, language exercises, and machine translation. A non-paper dictionary can include multimedia content, too: audio pronunciations, images, and video, to name a few. I'll revisit these topics in subsequent entries.

Finally, a great dictionary is not exclusive. It should not exclude words, even rare or "objectionable" ones. It should not exclude any willing, sincere editors. Perhaps most importantly, it should not exclude users. The Oxford English Dictionary, the largest and most comprehensive dictionary currently existing for the English language, is exclusive in several ways. At $1500 or £850 for the print edition, or $295 for an annual online subscription, the cost is prohibitive to most users. It excludes words that are not English. It is copyrighted, so it is not free for derivative uses. Although scholars from throughout the world have helped to build the OED, it does exclude editors.

Over the next few entries, I'll explore more of what makes a great dictionary. I welcome your comments along the way.

Tuesday, February 07, 2006

Translations in WiktionaryZ will seem odd to people. Adding translations will in many ways go against what people usually do: people translate a word from one language to another and, yes, they take into account the meaning of the word in both languages. In WiktionaryZ this is not exactly how it is done.

In WiktionaryZ, a translation is added to a DefinedMeaning. The best way of understanding what the DefinedMeaning means is by looking at its definition. When the Italian word "cavallo" is added as a translation, the person adding the translation may know the word "horse" and "Pferd" and as a consequence put it in with the right DefinedMeaning. When this is associated with the Dutch word "paard", it is relevant to know that paard has multiple meanings and only one is this animal that you can sit on, another is this family of animals that all look like this animal that you can sit on ...

The point that I am making is that it is really relevant to translate the definition of the DefinedMeaning, because that is what allows people to link new translations to the right DefinedMeaning. When definitions are newly formulated, there will be discrepancies between the definitions; these will sometimes be enough to warrant multiple DefinedMeanings.
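
For those who think more easily in code, the idea can be sketched roughly as follows. This is not the WiktionaryZ data model, only an illustration in Python with invented names (the class layout, add_translation, meanings_for) and with English definitions worded just for this example: each DefinedMeaning carries its definitions per language and the words attached to it per language, so one word can be attached to more than one DefinedMeaning.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DefinedMeaning:
    """One meaning, with definitions and attached words per language.

    Hypothetical structure for illustration only.
    """
    definitions: Dict[str, str] = field(default_factory=dict)
    expressions: Dict[str, Set[str]] = field(default_factory=dict)

    def add_translation(self, language: str, word: str) -> None:
        """Attach a word in the given language to this meaning."""
        self.expressions.setdefault(language, set()).add(word)


def meanings_for(word: str, language: str,
                 meanings: List[DefinedMeaning]) -> List[DefinedMeaning]:
    """Return every DefinedMeaning that the word is attached to."""
    return [m for m in meanings if word in m.expressions.get(language, set())]


# Definitions here are illustrative wording, not taken from WiktionaryZ.
animal = DefinedMeaning(definitions={"en": "a large animal that you can sit on and ride"})
family = DefinedMeaning(definitions={"en": "the family of animals that look like this animal"})

for language, word in [("en", "horse"), ("de", "Pferd"), ("it", "cavallo"), ("nl", "paard")]:
    animal.add_translation(language, word)
family.add_translation("nl", "paard")

# "paard" is attached to two meanings; only the definitions tell them apart.
for meaning in meanings_for("paard", "nl", [animal, family]):
    print(meaning.definitions["en"])
```

Someone who only reads Dutch cannot choose between the two meanings of "paard" unless a Dutch definition is present on each, which is why translating the definition matters as much as translating the word.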


Wednesday, February 01, 2006

WiktionaryZ will not be the same as any Wiktionary. There are many Wiktionaries and they all are different. They all have their own strong community, policies, content and history. The great thing about the Wiktionaries is that they all evolve. In a way this evolution follows paths similar to the ones known from the Wikipedia projects; it starts off with a group of like-minded people, and over time this group grows and friction grows until it becomes unpleasant.

WiktionaryZ will be structured and this structure is rather rigid. It is not something that one busy beaver can do or undo with a thousand edits; information must fit in the structure that is the same for everyone. This does not negate that every language is different, or that information of a type specific to a language needs to find its place. It does however mean that conventions used in paper dictionaries are indeed conventions for paper dictionaries. It is more relevant to learn from the TST-Centrale, the treasure trove of the INL, the Institute for Dutch Lexicology. They have multiple resources on-line and enrich the content by linking the diverse sources. We will be "stupid" and integrate diverse content. We will probably regret this tendency at times, but this integration has its own rewards.

WiktionaryZ is intended for all languages. The paradox is that it will take fewer resources in a less well-known language for WiktionaryZ to become relevant than it will for a language like Dutch or English. These languages have their great lexicons; WiktionaryZ will get its relevance for these languages because it will be a digital resource and not so much dead wood.