Tuesday, October 31, 2006

What to do when a word is not found

One of Ronald Beelaard's favourite stories is that many people cannot spell and consequently do not find what they are looking for. The example he has used often is the word "papegaai" (Dutch for parrot), a word many people do not know how to spell.

What Ronald did for the library he was associated with was to write what amounts to an "alternative spelling" function. There are plenty of examples of how this can be done; Google is one example that comes to mind.
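Such an "alternative spelling" function is typically built on edit distance: when an exact lookup fails, you suggest the known words that are only a few edits away. A minimal sketch (the word list and the threshold of two edits are my own illustration, not how Ronald's function works):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(query: str, vocabulary: list[str], max_distance: int = 2) -> list[str]:
    """Known words within max_distance edits of the query, closest first."""
    scored = [(edit_distance(query, word), word) for word in vocabulary]
    return [word for d, word in sorted(scored) if d <= max_distance]

# The common misspelling "papagaai" is one substitution away from "papegaai".
print(suggest("papagaai", ["papegaai", "papier", "paard"]))  # → ['papegaai']
```

Real spell checkers add phonetic matching and frequency weighting on top of this, but the edit-distance core is the same.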

For WiktionaryZ, it will not be as straightforward as it is for a monolingual resource. When a word is not found, it is important to know what language the word should be in. Also, WiktionaryZ still lacks many words that you would expect in a resource that wants to include all words. This is best demonstrated by the fact that of the 1000 most popular words in English, several are still missing.

When a word is not found, it would be cool to have functionality that helps us understand what the problem is. The first thing would be to aggregate the number of misses. It would be equally cool to have the number of hits, because it is in the percentages that we will see how well we do.
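Aggregating hits and misses per language could be as simple as a pair of counters; a sketch of the idea (all names here are mine, not anything in WiktionaryZ):

```python
from collections import Counter

lookups = Counter()  # tallies of (language, outcome) pairs

def record_lookup(language: str, found: bool) -> None:
    """Count every search, split into hits and misses per language."""
    lookups[(language, "hit" if found else "miss")] += 1

def miss_rate(language: str) -> float:
    """Fraction of lookups in this language that found nothing."""
    hits = lookups[(language, "hit")]
    misses = lookups[(language, "miss")]
    total = hits + misses
    return misses / total if total else 0.0

record_lookup("nld", True)
record_lookup("nld", False)
record_lookup("nld", False)
print(f"Dutch miss rate: {miss_rate('nld'):.0%}")  # → Dutch miss rate: 67%
```

The miss rate, not the raw miss count, is what tells you how well the resource is doing, exactly as the post argues.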

The words most often not found are the words that are the most relevant to add to WiktionaryZ. This will improve its function as a resource.


Sunday, October 29, 2006

Some more on non-standard orthography

Many languages do not have a standardised orthography. The variation that can be found is often quite specific to a particular location, and at the same time variations are often shared between different localities.

When there is no single authoritative resource for what is considered standard across all these variations, we cannot use the strategy we use in WiktionaryZ for English, French or German. We need another strategy to make this work.

I had a long talk with Purodha about just this. The idea we came upon is to accept all spelling variations for a word and have the "locations" subscribe to one variation. Our expectation is that there will be some regularity to be found: rules for writing that are always just so, even when such a rule is limited in its scope.
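The "locations subscribe to one variation" idea can be pictured as two small lookup tables; a sketch under my own assumptions (the identifiers and the English colour/color example are illustrative, not data from WiktionaryZ):

```python
# Every attested spelling variant of a word is stored once;
# each location then subscribes to the variant it actually uses.
variants = {
    "colour-dm": {"colour", "color"},   # all accepted spellings of one word
}

subscriptions = {
    ("colour-dm", "UK"): "colour",
    ("colour-dm", "US"): "color",
}

def spelling_for(word_id: str, location: str):
    """The variant a given location subscribes to, or None if it has none."""
    return subscriptions.get((word_id, location))

print(spelling_for("colour-dm", "UK"))  # → colour
```

The hoped-for regularities would then show up as patterns across many subscriptions, which could later be folded into rules.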

The question we have to ask ourselves is how applicable this is, and whether it will tempt people to come up with how they think "their" language is written. One thing I have learned is that accepting this is not always the most reasonable thing to do. Yes, people may come from a certain area, but that does not necessarily qualify them to pronounce on the subject of their language. It often goes wrong because these people cannot and will not take the needed distance from the subject, resulting in either a wish for the past, a wish for a new orthography, or plain politics.


Monday, October 23, 2006

Dialects any one ?

When there is a need for new languages in WiktionaryZ, the current procedure is that one of the bureaucrats enables the language for editing content. So far we have added languages and had special considerations for scripts and locales. Scripts are "easy", as ISO 15924 gives us a list of what are considered scripts; examples of their use are in how we deal with, for instance, Hausa, Mandarin or Serbian. Locales are more problematic, but so far we have restricted ourselves to using country codes. Country codes are easy too; they can be found in ISO 3166.
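The three standards mentioned compose naturally into a single identifier: an ISO 639 language code, optionally an ISO 15924 script code, optionally an ISO 3166 country code. A sketch of that composition (the `compose` helper is mine; the codes themselves are real, and the casing conventions match how such tags are conventionally written):

```python
def compose(language: str, script: str = None, region: str = None) -> str:
    """Join language, script and region codes into one hyphenated tag."""
    parts = [language]
    if script:
        parts.append(script.title())   # script subtags are Title-case, e.g. Cyrl
    if region:
        parts.append(region.upper())   # region subtags are UPPER-case, e.g. RS
    return "-".join(parts)

print(compose("sr", "Cyrl"))        # → sr-Cyrl   (Serbian, Cyrillic script)
print(compose("ha", "Arab", "NE"))  # → ha-Arab-NE (Hausa, Arabic script, Niger)
```

Dialects are exactly the piece missing from this picture: there is no standard list to draw a fourth subtag from.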

Now what to do with dialects? Let me be really practical: I have received a request for several languages from Mark Williamson. His request seems reasonable to me; he indicates clearly what he wants, the languages with their dialects and their scripts. When I try to do some research, I find nothing that makes his request unreasonable; it is just that I cannot really judge it.

The biggest stumbling block, however, is how to express these dialects as a code. It is one thing to insist on standards, but another to find that there is no apparent standard for these dialects. I have been looking around and I have subscribed to a few mailing lists to do with linguistics, and I hope that this will get me an answer.

With WiktionaryZ still in "pre-alpha" mode, I have an excellent excuse not to rush into the creation of these languages, but I think it is also fair to let it be known what the issue is. It is a practical one; it is not that we do not want dialects to be part of WiktionaryZ.



Sunday, October 15, 2006


Persian is a right-to-left language. It also uses a different script from what we are used to. The implication is that the user interface does not work well: to enter Persian, you have to select the language name in English and then add the word in Farsi.

At this moment you have to change keyboard mappings from one to the other. Really inconvenient. We are extremely grateful that we are getting people who contribute to our codebase. One of them is working on showing the language names in the local language.

I can imagine that for caching purposes we may need to cache versions per language. The work on Multilingual MediaWiki has started afresh. It is great to see that these things are coming together at last. :)


Thursday, October 12, 2006

A great new tool

Today Sabine told me about the Locale generator. This represents a great opportunity. It allows people to define the many settings that are part of the CLDR. The values defined here are relevant to things like how you show a date, how you quote, etcetera.

Even though it is a great start, some things are inherently problematic. The way languages are referred to is not necessarily unambiguous: the language names refer to ISO 639-1, while the documentation calls them ISO 639-2. ISO 639-2 itself exists in two flavours ...

The generated XML content is targeted for use with OpenOffice, and they do a good job by including references to what Microsoft does, sorting out its non-standard ways. Initiatives like this make me really happy. The next step they could take is to provide the information created this way to the Unicode people, so that they can integrate it into the CLDR itself. When this trickles back into tools like PHP, Python and Java, things will really be looking up. :)



Monday, October 09, 2006

Expressions without semantic content

A DefinedMeaning is the union between an Expression - a specific set of signs (a Spelling) in a given Language - and a Definition - a paraphrase of the semantic content of this Expression. This is a fundamental pillar of WiktionaryZ, and without it WiktionaryZ might well prove to be a house of cards. For the vast majority of Expressions in most languages this pillar is a sound base.

In all languages, however, there exist a few Expressions (in some languages quite a few) that are intrinsically empty of semantic content; their sole function is to serve as glue between the more common, semantically heavy Expressions, relating these either to one another or to the context in some way.

There are many examples. To name a few: the adverb of negation in most languages (not); the evidential particles of, for example, Akha, which serve to ground any declarative sentence in a conceptual framework where the speaker indicates how (s)he obtained the information given (with the same ease as a speaker of English indicates the temporal framework within which, for example, a declarative sentence is staged); the aspectual particle systems of many East Asian languages; and so on.

While there will in due time be space to define the Usage of a given Expression, the question of how to treat Expressions that have no useful Definition with which to unite into a DefinedMeaning has, in my opinion, not yet been thoroughly solved.

Some possible ways of dealing with this might be:
  • Allowing the Definition of a limited set of DefinedMeanings to not express semantic content, but grammatical (in the widest sense of the word) content.
  • Allowing the Definition of a limited set of DefinedMeanings to be empty, deferring to a Usage note.
  • Referring these Expressions to a grammar.
I am not too happy with any of these solutions, most especially not with the last one.
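The second option above - an empty Definition deferring to a Usage note - is the easiest to picture in the data model. A sketch of the entities named in this post (the field names and the optional-Definition design are my reading, not the actual WiktionaryZ schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Expression:
    spelling: str    # the specific set of signs
    language: str    # the Language the Spelling belongs to

@dataclass
class DefinedMeaning:
    expression: Expression
    definition: Optional[str]          # None for semantically "empty" Expressions
    usage_note: Optional[str] = None   # the deferred Usage note of option two

# The adverb of negation: glue with no standalone semantic content.
not_en = DefinedMeaning(
    Expression("not", "English"),
    definition=None,
    usage_note="adverb of negation; relates other Expressions, carries no "
               "semantic content of its own",
)
print(not_en.definition is None)  # → True
```

Making the Definition optional is precisely what makes this option uncomfortable: the Definition is the pillar the DefinedMeaning rests on.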

Saturday, October 07, 2006

WiktionaryZ and RFC 4646

RFC 4646 provides many things that I will gladly follow, particularly in the way it allows for extensions. I have discussed the issues that I have with Felix Sasaki (W3C) and Gerhard Budin (ISO), among others.
  • Many recognized languages have to "fit" as a subtag of another language. For many people who think "politically" about languages, this is not acceptable. The way it is implemented is also a travesty for people who know about languages. The motivation is to provide backwards compatibility, even though ISO 639-2 was dismissed as useless and ISO 639-3 had to be created really quickly.
  • Some of the things not yet standardized in the ISO codes are being worked on as new standards (e.g. dialects); the proposed codes are not known at this time, and this creates a mess of its own.
  • Orthographies are not supported in the planned extensions to ISO 639. This is acknowledged as an omission.
Consequently, it is much better to make a clean break while still conforming to standards. The standards compliance existing in WiktionaryZ is better than that of the current crop of Wikimedia language codes, which do not conform to a standard and break standards in many places. Some of the language codes used were voted in with a total disregard of what would be valid vis-à-vis the terms of usage of the ISO 639 codes.

Let me repeat that WiktionaryZ does include a code for Wikimedia language codes; it makes sense to have backwards compatibility. The basis for inclusion of languages at WiktionaryZ at this time is the ISO 639-3 codes. When need be, the ISO 639-3 codes are extended using RFC 4646 as a guideline on how to do this, and we will and do ask people in standards organizations how to solve issues that fall outside what RFC 4646 provides answers for.
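For readers unfamiliar with RFC 4646, the tags it defines are built from ordered subtags: a language, then optionally a script, a region, and private-use extensions after `-x-`. A deliberately rough sketch of that shape (real validation against the subtag registry is far stricter; this regex only illustrates the structure, and the example tags are my own):

```python
import re

# language[-Script][-REGION][-x-privateuse]: a simplified subset of RFC 4646.
TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"                 # ISO 639-1/-3 code
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"           # ISO 15924 script, e.g. Cyrl
    r"(?:-(?P<region>[A-Z]{2}))?"                # ISO 3166 region, e.g. NL
    r"(?:-x-(?P<private>[a-z0-9]{1,8}(?:-[a-z0-9]{1,8})*))?$"
)

def parse_tag(tag: str) -> dict:
    """Split a tag into its named subtags, dropping the absent ones."""
    m = TAG.match(tag)
    if not m:
        raise ValueError(f"not a tag this sketch understands: {tag}")
    return {k: v for k, v in m.groupdict().items() if v}

print(parse_tag("nds-NL"))       # → {'language': 'nds', 'region': 'NL'}
print(parse_tag("de-x-dialect"))  # → {'language': 'de', 'private': 'dialect'}
```

The `-x-` private-use mechanism is the extension point the post refers to: it is where things the ISO codes do not yet cover, such as dialects and orthographies, can live without breaking the standard.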

We do have a code currently indicated as "ISO-639-2" in our database. We could enter something there that would be compatible with RFC 4646 once it is figured out what would be a valid code under this regime. Problematic is that many of the ISO 639-2 codes are deprecated in ISO 639-3, and while it would be a nice academic exercise to create these codes, I do not want to deal with the political fallout that this creates.



Thursday, October 05, 2006

This user is a contributor to the OLPC Children's Dictionary.

WiktionaryZ is proud to support the OLPC, or One Laptop Per Child, project. This project intends to innovate education by bringing laptops to kids in countries like Nigeria and Brazil. Huge numbers of computers will be distributed to kids. These computers are exquisite: they are rugged, they are innovative, and they will change things one way or another.

In education, dictionaries are one of the resources that kids use when they are available. For many languages it is hard to find content. We are currently looking for people who can help us with Igbo, Yoruba, Hausa .. We already have people who can help us with other languages; at WiktionaryZ we have started with the 1000 basic words in English. These we want to expand with as many translations as possible, both for the expressions and the definitions.

Sabine is doing the sterling work that we know her for by organizing byte-sized packages; this way people help by adding small stitches, making this a blanket of communal effort.


PS: Yes, I contribute too.