Wednesday, March 22, 2006

Babel templates for 255 languages and counting

In WiktionaryZ we are preparing for when we have editable (relational) content. The Wiki is doing well; we are creating the environment for when we are technically able. When we are, we will slowly but surely invite more and more people to edit.

You can indicate your interest by creating a user and adding the Babel templates for the languages that you are interested in. We use the "Babel" templates that are also used by the English Wikipedia. We do however always allow for "professional users" of a language (think writers, translators and teachers).

We want to use the ISO-639 codes, and even though we allow for ISO-639-3, we already have situations where we would like to have more languages or good codes for dialects. From this perspective, it would be great to have the "Wiki for Standards" now; as it is, we have to use our personal contacts, and that is much less convenient for many.


Monday, March 13, 2006

Using Babel for WiktionaryZ for the coming functionality

WiktionaryZ is a work in progress. Slowly but surely we will add functionality and content. We have started preparations for the next phase. The next phase is for us to be able to edit the relational content of WiktionaryZ.

Functionally WiktionaryZ is at a pre-alpha phase; this means that we do not have the full functionality available that is in the specifications. It means that when we add functionality it will not be available to everyone. In order to have some control of what happens, a limited group of people will be invited to edit the relational content. We will invite from the people that created a user on the wiki and provided their babel information.

The babel information allows you to give six levels of proficiency:
  • xx mother tongue
  • xx-1 basic ability
  • xx-2 intermediate ability
  • xx-3 advanced level
  • xx-4 'near-native' level
  • xx-5 'professional' proficiency
The xx has to be replaced with the codes that you can find on the babel page. At this moment we have the codes for more than 100 languages. This equates to something like 1% of the languages that exist. We do invite you to help us with templates for the missing languages. You can help by checking and/or translating the following text:

* This user is able to contribute with a basic level of English.
* This user is able to contribute with an intermediate level of English.
* This user is able to contribute with an advanced level of English.
* This user speaks English at a near native level.
* This user is able to contribute with a professional level of English.
* This user is a native speaker of English.
* These users speak English.

You can help by creating the templates or by providing us with translations. Translations are best given to us here.
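As a rough illustration of the proficiency codes described above, here is a minimal Python sketch that splits a Babel code such as nl-3 into a language code and a level. The names `parse_babel`, `LEVELS` and `BABEL_CODE` are our own invention for this example and are not part of the WiktionaryZ software; the assumption that a language code is two or three lowercase letters is also ours.

```python
import re

# Level descriptions taken from the list above; a bare code means mother tongue.
LEVELS = {
    None: "mother tongue",
    1: "basic ability",
    2: "intermediate ability",
    3: "advanced level",
    4: "near-native level",
    5: "professional proficiency",
}

# Assumed shape: a 2- or 3-letter language code, optionally followed by -1 .. -5.
BABEL_CODE = re.compile(r"^([a-z]{2,3})(?:-([1-5]))?$")


def parse_babel(code):
    """Return (language, level, description) for a Babel code like 'en-3'."""
    match = BABEL_CODE.match(code.strip().lower())
    if match is None:
        raise ValueError(f"not a Babel code: {code!r}")
    language, level = match.group(1), match.group(2)
    level = int(level) if level else None
    return language, level, LEVELS[level]
```

Calling `parse_babel("nl")` gives the mother tongue entry, while `parse_babel("en-3")` gives the advanced level; anything outside the six levels is rejected.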


Thursday, March 09, 2006

Names and Etymology ... and what about old languages?

I am in Germany with my parents right now - this evening my mom watched "Deutschland Deine Namen" (Germany: your names - I know it is a very literal translation ... maybe it should be "Names of Germany"). That brought me to the "what to include" question ... and to translations. We will have names of well-known people in Wiktionary - just think of the old Romans ... their names have been translated. And today we transliterate names from countries that use scripts other than ours. So yes, names must be in WiktionaryZ, because names have been translated in the past, still are today, and for sure will be in the future.

But what about the etymology? Many names came up about 600 to 700 years ago and refer to old forms of languages; most of all they refer to people's professions, the places where they lived, the habits they had, etc. How do we do that? Do we just add the etymology and that's it, or do we then also add the real word these names come from and give it a language name? To me it makes sense to add these languages, as I also recall the times when I was at school and we had texts in Middle High German and Old High German ... if you think that knowing German means you would be able to understand such texts: hardly, or no way. So I would say that integrating such words with definitions makes sense ... but in which language should the definition then be? hmmmm .... not easy, right?

Well, these are just two very quick thoughts. I am really quite under time pressure, but I thought it would be good to just write down some lines to remember that these things need some consideration in the future. Please do not consider this to be a 100% error free posting ;-)

Thanks, Sabine.

Wednesday, March 08, 2006

About sorting or is it collating?

When you have a dictionary, you expect words to be sorted in such a way that you can find them. Sorting is obvious, right? You use the alphabet and that is the end of it. According to the article about the character IJ, the character is to be found in between the X and the Z, together with the Y. The best Dutch dictionary, Van Dale, has the ij sorted in the i range.

The consequence is that how to sort is not obvious. It does not follow from knowing the alphabet for a language and yes, the Dutch alphabet is different from the French or the English alphabet in the same way as the Farsi alphabet is different from the Arabic, the ....
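To make the two competing orders concrete, here is a small Python sketch. It is only an illustration under a strong simplification: every "ij" in a word is treated as the single Dutch letter, which a real collation could not assume. The function names and the word list are made up for this example.

```python
def key_ij_in_i_range(word):
    # "ij" is simply the letters i + j, so the word files under i
    # (the order described above for Van Dale).
    return word.lower()


def key_ij_with_y(word):
    # Treat every "ij" as one letter filed together with y, between x and z.
    # Simplification: a real collation would have to skip words where an i
    # and a j merely happen to be adjacent.
    return word.lower().replace("ij", "y")


words = ["zon", "ijs", "ik", "xylofoon", "jas"]
in_i_range = sorted(words, key=key_ij_in_i_range)
with_y = sorted(words, key=key_ij_with_y)
```

With the first key "ijs" comes before "ik"; with the second it moves to the end of the alphabet, after "xylofoon" and before "zon". Same words, same alphabet, two defensible orders.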

In my personal blog, I wrote about what I would have a programmer do if I had money to spare. I mentioned this to Gangleri and jested that I would expect his choice to be working on right-to-left issues and sorting issues. I am really happy that he took it as a challenge, and he made me really happy with the template wikivar. I do not understand all the ins and outs of it, but what I do understand is that he is asking people to help with defining the sorting order that makes sense for that resource, for that language.

One thing is, this sorting order only makes sense when you know the language the articles are in. For WiktionaryZ this will be obvious; WiktionaryZ will be language aware. For the current Wiktionary projects this is not possible; Dutch, German, English and Farsi words all have to be sorted together. The current sort routine is not that great either: it does not allow for case-insensitive sorting, a wish that many would really like to see realized.
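What case-insensitive sorting would mean can be shown in a few lines of Python. This is only a sketch of the idea with a made-up word list; it says nothing about how MediaWiki sorts or would have to be changed, and it still ignores the per-language rules discussed above.

```python
words = ["zebra", "Apfel", "water", "Zon", "appel"]

# Plain code point order: every capital letter sorts before every small letter.
case_sensitive = sorted(words)

# Case-insensitive order: compare the case-folded form, keep the original spelling.
case_insensitive = sorted(words, key=str.casefold)
```

In the first list "Zon" ends up in front of "appel"; in the second it is found where a reader would look for it, after "zebra".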

With [[Multilingual MediaWiki]] it will be possible to make the traditional wiktionaries language aware. This in turn will allow for the sorting of words according to how it is done for a language, for a locale. The question that I have is: do people have the stomach to go for this? The good news is that most wiktionaries are becoming more and more structured. This makes it feasible for bots to do a good job.

The irony is that, when it happens, the technology will be courtesy of the WiktionaryZ project. Then again, given its definition of success, it would be considered a success :)


Sunday, March 05, 2006

About characters ...

Everyone that has learned the Dutch language knows the alphabet. The Dutch alphabet ends with: ..x, ij, z. That means that for Dutch it is not ..x, y, z. The "Y" or "y" is called the "Griekse ij". This is what you need to know to understand this post.

The problem is that, when people type, the "ij" as a single character is not on the keyboard. Typically people use the US-International keyboard. And when you type "ij" as an i followed by a j, it looks the same, so it must be the same, except for the fact that it is NOT the same. When you use a word processor, a sentence is spell checked to read "Ijs drijft in de sloot." and not "IJs drijft in de sloot."
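The difference can be made visible in Python. The sketch below assumes that the single character meant here is the Unicode ligature U+0133 (with capital U+0132); `dutch_capitalize` and `split_ligature` are hypothetical helpers written for this post, not functions from any existing software.

```python
import unicodedata

IJ_LOWER = "\u0133"  # LATIN SMALL LIGATURE IJ, one character
IJ_UPPER = "\u0132"  # LATIN CAPITAL LIGATURE IJ, one character


def dutch_capitalize(word):
    # Capitalise the first letter, treating a leading i + j as one letter.
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]


def split_ligature(text):
    # Compatibility normalisation (NFKC) replaces the ligature by i + j.
    return unicodedata.normalize("NFKC", text)


naive = "ijs".capitalize()        # what generic software does with i + j
dutch = dutch_capitalize("ijs")   # what a Dutch reader expects
```

The generic routine only sees an i and capitalises that; the Dutch-aware helper capitalises both letters. Going the other way, `split_ligature` shows that the single character can be turned into i + j on the fly, which is one possible way to accept what people actually type.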

In WiktionaryZ we do want to do things RIGHT, except everybody DOES use the i and the j, not the ij. How should we deal with issues like this? I tend to agree that it is a problem and then ignore it. At some stage we will create the software that does these things on the fly.

I can have this opinion because it is my language; what if the same question is asked and it is not my language? How will we resolve such questions? To what extent will we be dogmatic, to what extent pragmatic?


Thursday, March 02, 2006


This week I had the pleasure of seeing a number of question marks where I should have seen characters: the name of a language in its own script. It seems so obvious that when you have a computer, you can see all the information that is available. It is however much less obvious than it seems.

When you use the English Wikipedia, it is normal to include the names of countries and languages in their own script. Many people do not have the necessary fonts to see the characters that are used in many languages. They get, like I did, question marks. For WiktionaryZ we will have a similar situation but on a completely different scale. We want all words in all languages, and a word like water, air, fish or bird can be expected to be translated into any language. This will make it even more relevant for WiktionaryZ to consider fonts.

WiktionaryZ will be in UTF-8, but that only helps in so far as fonts exist in the first place. Also, Unicode, the standard behind UTF-8, is a living standard; a new version, 5.0 beta 2, represents the latest developments. As we want to have both user interfaces and content in all languages, we sure are going to have our share of issues. With 5.0 we will have better bidi support; that will be a real boon.
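A short Python sketch shows what the encoding does and does not give us. Storing and reading back a word works for any script, whether or not a font exists to display it; the bidi property that tells software a character runs right to left is likewise part of the character data, not of the font. The helper names are invented for this example.

```python
import unicodedata


def utf8_length(text):
    # Number of bytes the text occupies once encoded as UTF-8.
    return len(text.encode("utf-8"))


def is_right_to_left(char):
    # Bidi classes "R" (e.g. Hebrew letters) and "AL" (Arabic letters)
    # mark characters that are written right to left.
    return unicodedata.bidirectional(char) in ("R", "AL")


# "water" in English, Russian and Chinese.
lengths = [utf8_length(w) for w in ["water", "вода", "水"]]

# Encoding and decoding gives the word back unchanged, font or no font.
round_trip = "水".encode("utf-8").decode("utf-8")
```

The Latin word takes one byte per letter, the Cyrillic one two bytes per letter and the single Chinese character three bytes. Whether a reader then sees the word or a row of question marks is entirely a matter of fonts.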

It is easy to recognize that people do not have the fonts to see all Wikipedia and Wiktionary content. To solve this, we could provide the best information on the fonts needed for particular languages, scripts. When we start doing this, we will help people. It may also open several cans of worms. Then again, we could go whole hog and offer fonts to download.

PS: this article had me start looking into fonts again ... a good and interesting read.