Wednesday, September 13, 2006

Problems with MediaWiki .. Incompatibilities

MediaWiki is great software: it works well, it scales like it should, and it is probably the software that is localised most often. It is also incompatible with the orthographies of several languages.

We already knew about the problem with Neapolitan: it has words like d''a, and the problem is that '' means that the next bit is to be italic .. or bold. Now a new problem has come up. The N|u language is spoken in South Africa, and because of the | it is not possible to create a wiki link to the name of this language.
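To make the two conflicts concrete, here is a minimal sketch in Python. It is not MediaWiki's actual parser; the two regular expressions and the function names `render_italics` and `parse_link` are simplified assumptions that only imitate the rules "'' toggles italics" and "| separates a link target from its label".

```python
import re

# Toy approximations of two wikitext rules; NOT MediaWiki's real parser.
ITALIC = re.compile(r"''(.+?)''")                       # ''text'' -> italic
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")  # [[target|label]]


def render_italics(text):
    """Replace each ''...'' pair with <i>...</i>, as a naive parser would."""
    return ITALIC.sub(r"<i>\1</i>", text)


def parse_link(text):
    """Return (target, label) for the first [[...]] link, or None."""
    match = LINK.search(text)
    if match is None:
        return None
    target = match.group(1)
    label = match.group(2) if match.group(2) is not None else target
    return target, label


# The apostrophes inside the Neapolitan word are read as markup.
print(render_italics("d''a ... d''a"))  # d<i>a ... d</i>a

# The pipe in the language name is read as the target/label separator.
print(parse_link("[[N|u]]"))            # ('N', 'u')
print(parse_link("[[Neapolitan]]"))     # ('Neapolitan', 'Neapolitan')
```

In both cases characters that belong to the word itself are consumed as syntax, which is why the text cannot be written as-is.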

There is a big push to get WYSIWYG editing for MediaWiki. I hope people will also take the next step and kill off the current Wikisyntax that prevents us from doing what needs to be done: support all languages.


Tuesday, September 12, 2006

New languages ..

So far, adding new languages to WiktionaryZ has been a major pain. Erik had to do this directly in the database AND he is the only one who is allowed to do this. When you combine this with the fact that his time can be better spent in other ways, it was not great.

The good news is that Erik has created a function that allows bureaucrats to add languages. It means that we can slowly but surely add languages to WiktionaryZ. The first thing we have done is fix the Chinese problem: what was Chinese is now called "Mandarin (simplified)" and we have added "Mandarin (traditional)". This means that we can now add all the languages that were included in the ISO-639-2 zho code and consider script and other issues. Another thing will be adding English (United Kingdom).

My idea is to add the languages for which we have people on WiktionaryZ. The good news is that people just need to ask, and after some consideration (scripts etc.) we can honour such a request.


Monday, September 11, 2006

Alternative spellings

WiktionaryZ will provide for alternative spellings (though the name for them is yet to be finalized) of Expressions. Each type of alternative spelling will be tagged with a comment explaining its nature.

This could be used to enter currently correct alternative spellings, previously correct spellings, common misspellings, spellings that vary in capitalization etc. (Cf. How to deal with conventions)

Could it be used for more? What about the Swedish verb plural forms that went out of use in the 1950s? Today all persons and numbers use the same verb form, so when looking up a plural verb form it makes sense to get the currently used form - that is, the form that was previously only used for the singular. Is this stretching the usage of an alternative spelling too far?

More new functionality ..

We are really happy to announce new functionality. These are the first examples of software that helps editors find work that needs to be done. The first one helps to find words in Italian that do not exist in Neapolitan .. (you can change it to other languages by manipulating the URL) .. The second one finds all words that include "casa".

The great thing about open source is that you can publish often; even though the new functionality is still rough around the edges, it does show where we are going. Not only that, it will provide input by having the functionality tested at the different stages of the development of the new software.



WiktionaryZ and Machine Translation

Most of you probably know that Jeffrey V. Merkey is using machine translation to translate the complete English Wikipedia to Cherokee and that the output is quite good, so that only 5% of it needs real corrections.

He agreed to give us the wordlist he used under the GFDL and so we added it to WiktionaryZ.

Yesterday I tried to work with it and immediately I found some limits to its contents. This is not because his wordlist is wrong, but just because it is so personalized that for other people it becomes somewhat difficult to understand which meaning has been translated to Cherokee and which meaning then accordingly needs to be translated to other languages, in my case Neapolitan.

I did an estimate of how many words I have to translate a day to get the work done in a month: 233 words ... that is really a lot ... and I suppose that doing things correctly - I mean creating the English word, adding the defined meanings and translating them into Neapolitan - will take me quite a long time, and probably I will not be able to do more than an average of 10 words a day.

Now you are asking yourself why it takes me such a long time ... well: let's say an English word has 3 meanings on average (well, that's a wish ... there will be words with many more meanings). Then: I am not used to translating from English to Neapolitan - even Italian to Neapolitan is quite difficult. The only dictionary I was able to find here is Neapolitan to Italian ... I have not managed to go to Naples yet and therefore could not try to get an Italian to Neapolitan one there. This means that I have to at least translate from English to Italian - well, then it makes sense to add German as well, since I know most of the translations in German - and then, if I don't know the Neapolitan word for the Italian one, I have to guess how it is written in Neapolitan and find proof of it in the dictionary :-) funny, right? Now people say: well, you speak Neapolitan for quite a good part of your day, so why do you need to look it up ... speaking is different from translating ... also writing is different from translating.

The next thing is: Wikipedia articles belong to specific domains ... well: we do not differentiate between specific domains yet - but when translating into some languages we will need that, because the same English word will have different translations according to its use in a specific domain.

Considering these points I suppose that we will have approx. 21,000 words in the foreign language starting with the approx. 7,000 words in English.
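The numbers in this post can be checked with a few lines of Python. This is only a rough sketch: the 30-day month is an assumption, as is the reading that the 233 figure comes from dividing the roughly 7,000 English words over that month; the function names are made up for illustration.

```python
ENGLISH_WORDS = 7_000          # approx. size of the wordlist, from the post
MEANINGS_PER_WORD = 3          # the hoped-for average, from the post
DAYS_IN_MONTH = 30             # assumption: a 30-day month
REALISTIC_WORDS_PER_DAY = 10   # the realistic pace, from the post


def words_per_day(total_words, days):
    """Whole words per day needed to finish total_words in the given days."""
    return total_words // days


def days_needed(total_words, per_day):
    """Days needed at a fixed daily pace, rounded up."""
    return -(-total_words // per_day)


def total_entries(words, meanings_per_word):
    """Entries in the target language if every meaning gets one translation."""
    return words * meanings_per_word


print(words_per_day(ENGLISH_WORDS, DAYS_IN_MONTH))          # 233
print(days_needed(ENGLISH_WORDS, REALISTIC_WORDS_PER_DAY))  # 700
print(total_entries(ENGLISH_WORDS, MEANINGS_PER_WORD))      # 21000
```

At 10 words a day the same list would take about 700 days rather than one month.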

It will be fun :-)

Sunday, September 10, 2006


WiktionaryZ is to include all words of all languages. There are many ways to get a relevant amount of information. For WiktionaryZ, it will be important to have the information that is going to be part of the user interface. In order to include all languages, we will need to have the names of all languages in WiktionaryZ. Lists of these names exist, and it is easiest to start with an import of all these language names.

The same could have been done for countries; however, there are fewer of them, and as we have started creating portals for countries, it made sense to start creating the common names for countries. Having these portals is important because often people do not appreciate how many languages are spoken in a country. Australia demonstrates this really well.

When more languages are enabled for editing, the list of translations for the countries becomes less complete. In many ways it is an uphill battle to get the information that we need. It is a good thing that we decided on fallback scenarios; it is relatively easy to provide information in English.


Friday, September 08, 2006

WiktionaryZ commission

WiktionaryZ has a commission. This commission consists of people who are well known in the community. They also represent different constituencies. As time goes by, the people that make up this commission change. When people are not really active they may be replaced. When new people become important in the community they may be added, provided they also represent an important community.

We have asked Sannab to join us in the commission. I am really happy she accepted. Sanna is Scandinavian and we have a lot of Scandinavians at the moment, but just as important is that she studied both Chinese and Yi. This helps us reach out to the Chinese languages as well.

We have asked Barend Mons to join us in the commission. I am really happy he accepted. Barend is a scientist in the field of biosemantics, and he has been asked to join in anticipation of the large number of scientists that are likely to join the project once WiktionaryZ starts to include scientifically important data like the UMLS and others. Barend has been instrumental in securing the collaboration necessary to make this happen.


Tuesday, September 05, 2006

Some more numbers

WiktionaryZ does collect some statistics. We are collecting statistics for the fourth month running and we have two complete months' worth of data. Today is the fifth of September and we have already had more traffic than in the whole month of July.

Now there are "lies, damn lies and statistics" and there is some truth to this as well. The googlebot really likes us. At the moment it accounts for some 60.29% of the traffic versus 7.26% in July. At this moment it does not hurt us. I wonder what it is doing and how it likes our data.. Maybe it has its knickers in a twist.


Monday, September 04, 2006

Helping editors

Many people have been working on WiktionaryZ, and the number of new words is staggering. The quality improvements to our first content were profound. But as we now have some 131,000 words and phrases, it becomes more difficult to find what needs doing.

We therefore need queries that help us find what we have and help us find what needs doing. There are a few things that come to mind:
  • All words in a language - with a possibility to start at a given word - with a possibility to select those words where we do not have a translation in another specified language.
  • All DefinedMeanings in a collection - with a possibility to start at a given word - with a possibility to select those DefinedMeanings where we do not have a translation in another specified language.
Obviously these two would benefit our editors a lot. They would also benefit our users ..
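What the first of these queries might look like can be sketched in Python over in-memory data. This is not the WiktionaryZ schema or code: the `DefinedMeaning` class below, its fields, the function name and the sample entries are all hypothetical and only illustrate the filtering idea.

```python
from dataclasses import dataclass, field


@dataclass
class DefinedMeaning:
    """Hypothetical stand-in: one meaning plus its expression per language."""
    definition: str
    expressions: dict = field(default_factory=dict)  # language -> expression


def words_missing_translation(meanings, source, target, start_at=""):
    """Words in `source` that have no translation in `target`.

    The result is sorted alphabetically and begins at `start_at`.
    """
    words = sorted(
        dm.expressions[source]
        for dm in meanings
        if source in dm.expressions and target not in dm.expressions
    )
    return [word for word in words if word >= start_at]


# Made-up sample data, for illustration only.
sample = [
    DefinedMeaning("a building to live in", {"en": "house", "de": "Haus"}),
    DefinedMeaning("a small domesticated feline", {"en": "cat"}),
    DefinedMeaning("bound pages with text", {"en": "book"}),
]

print(words_missing_translation(sample, "en", "de"))       # ['book', 'cat']
print(words_missing_translation(sample, "en", "de", "c"))  # ['cat']
```

The second query would be the same filter applied to the DefinedMeanings of one collection instead of to all words of a language.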


Sunday, September 03, 2006

131,000 records in WiktionaryZ

WiktionaryZ has some 131,000 records, I was told. This is really astonishing. It demonstrates how effective it is to add a translation this way: the new articles are created at the same time as the new translation is added. Compare this to the English and French wiktionaries, which are already 170K and 200K big. Much of this content was created by bots. Much of the WiktionaryZ content was created by the import of the GEMET content. This was some 70,000 articles. Given our current 131,000, that means a growth of 61,000 .. WAAUW..