Sunday, November 26, 2006

WiktionaryZ is a wiki and not perfect

One of the contributors of WiktionaryZ who has been with us from the start is really into the "minor" languages. The language in question has no newspapers and it has many dialects. And now we need a user interface in this language too.

For this stellar worker, this is a problem. As the orthography is not set in stone, what is the correct translation for words like English, German and Italian? As people will see it all the time, it has to be correct. There is also some reputation involved; when you are seen to be perfect in a language, how can you err when everyone will start using what is provided?

I have been doing many translations in the last few days as well. I have been working hard to add this content to the Dutch Wiktionary. Now I have learned that the names of languages in Russian and Danish are not capitalised, so I will change these when I copy them in. I am not sure about languages like Slovak and Slovene. I have the choice not to copy them in and leave it for someone else, or I can make mistakes and hope that someone cleans up after me.

I am sure that some of the stuff I did needs correction. In the meantime it is an honest effort, and it shows that by making the correction ONCE, we truly benefit from the way WiktionaryZ works.


Monday, November 20, 2006


Well, sometimes advertising the good stuff makes you feel really good. Today, the data that allows us to show language names in the language of the User Interface has gone live. For WiktionaryZ this is really important new functionality. We want to have a user interface that brings WiktionaryZ to users in their own language. You should not have to know English to get your information about an Urdu word.

The one problem is that we can only show you the languages in your language when we know their names. That is why we would like you to help us build this part of the User Interface. This is truly stuff that we will be using all the time.



Wednesday, November 15, 2006

Why Multilingual MediaWiki matters

You might think that MediaWiki, the software that runs Wikipedia (and WiktionaryZ), is fully multilingual. After all, everyone knows that Wikipedia exists in more than 200 languages -- all of them intricately connected through a network of "interlanguage links" from one edition to the other.

In actual fact, each language edition of Wikipedia runs on a separate database. Wikimedia uses a special setup to share code and configuration files, but that setup is not trivially reproducible. If you want to set up a wiki in multiple languages, your best bet is to set up multiple instances of MediaWiki. As a result, most MediaWiki installations around the world are monolingual. They only accept content in one language.

Of course, users are free to put, say, French language content into an English language wiki. They can even change their user interface language preference to French. But the problems begin quickly:
  • If multiple languages use the same title to refer to a particular page, you need to disambiguate it. In single database setups, this is typically done by appending the language code manually to the page title. These titles can quickly become messy and inconsistent.
  • As soon as activity picks up in multiple languages, the list of changes to the wiki quickly becomes cluttered with information that is useless to readers who do not speak the particular languages in which edits are made.
  • It is impossible to systematically search for pages in a particular language.
  • Pages about the same content in different languages have to be manually connected to each other, which is often done using templates. The wiki does not facilitate the process of interlanguage linking in a single installation. The interlanguage links which are used on Wikipedia are horribly inefficient, as a separate set of language links has to be maintained for each language.
  • The first experience for a user who does not speak the default language of the wiki is often negative. Unless the wiki has been specifically built (with policies and interface messages) to encourage multilingual contributions, such contributions are unlikely even if they are theoretically possible.
Take a look at a big wiki hosting site like Wikia -- even though it sets up multilingual wikis on request (by setting up multiple databases), most of its wikis are monolingual by default. English is of course predominant. With hundreds of millions of Internet users who do not speak English but would be happy to contribute to these wikis, this is a tremendous loss of opportunity. Outside a framework like Wikia, with users setting up their own wikis, it gets even worse: very few people go to the effort of structuring their wikis to accept content in multiple languages.

Wikis have become as ubiquitous as forums or blogs. Whether we are talking about documentation, knowledge bases, directories, discussions, experiments in democracy, media archives -- there are millions of potential participants out there, waiting to be invited to contribute. Waiting to feel welcome. We need to reach out to them. It is not just the community that needs to make the decision to "go multilingual". It is the software that should support this decision as much as possible.

Fortunately, there is an answer: Multilingual MediaWiki. This set of specifications describes the changes to the MediaWiki software needed to accept content in multiple languages, to network it effectively, and to build truly multilingual communities. And fortunately, this is more than just a paper: it is being implemented by a very capable programmer, with financial support from the University of Bamberg and another sponsor who shall remain unnamed for now. You can view the first prototype (still very messy :-), which showcases the functionality to a) store content in language "meta-namespaces" and keep it separate, and b) connect pages in different languages to each other.
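
To make the two prototype features a bit more concrete, here is a minimal sketch in Python. It is not the MLMW design or MediaWiki's schema; the class name `MultilingualWiki`, its methods, and the two-letter language codes are all assumptions made up for illustration. Pages are keyed by a (language, title) pair, which plays the role of a language "meta-namespace", and linked pages share one translation group instead of each page keeping its own list of links.

```python
from collections import defaultdict


class MultilingualWiki:
    """Toy in-memory model (hypothetical, not MediaWiki's actual schema)."""

    def __init__(self):
        self.pages = {}                    # (language, title) -> page text
        self.group_of = {}                 # (language, title) -> group id
        self.members = defaultdict(set)    # group id -> set of (language, title)
        self._next_group = 0

    def save(self, language, title, text):
        # The same title can exist once per language without disambiguation.
        self.pages[(language, title)] = text

    def titles_in(self, language):
        # Searching "pages in a particular language" becomes a simple filter.
        return sorted(t for (lang, t) in self.pages if lang == language)

    def link(self, page_a, page_b):
        """Put two pages in the same translation group, merging groups if needed."""
        ga = self.group_of.get(page_a)
        gb = self.group_of.get(page_b)
        if ga is None and gb is None:
            ga = self._next_group
            self._next_group += 1
        elif ga is None:
            ga = gb
        elif gb is not None and gb != ga:
            # Both pages already belong to different groups: fold gb into ga.
            for page in self.members.pop(gb):
                self.group_of[page] = ga
                self.members[ga].add(page)
        for page in (page_a, page_b):
            self.group_of[page] = ga
            self.members[ga].add(page)

    def translations(self, page):
        """All other pages in the same translation group."""
        group = self.group_of.get(page)
        if group is None:
            return []
        return sorted(p for p in self.members[group] if p != page)


wiki = MultilingualWiki()
wiki.save("en", "AIDS", "English text")
wiki.save("de", "AIDS", "German text")
wiki.save("fr", "SIDA", "French text")
wiki.link(("en", "AIDS"), ("fr", "SIDA"))
wiki.link(("fr", "SIDA"), ("de", "AIDS"))
print(wiki.translations(("en", "AIDS")))
```

Note that two links were enough to connect three pages; with per-page interlanguage links, every page would need its own list of all the others.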

There's still quite some way to go until this becomes part of MediaWiki proper, but we are making steady progress. When the project is completed, thousands of MediaWiki installations across the planet will gradually become fully capable of accepting content in all languages of the world (if their owners want them to). It will be another step in opening up the world of wikis to the global community.

Beyond its very direct impact on wiki users, MLMW is a requirement for WiktionaryZ. Right now, pages about expressions such as AIDS, which exist in many languages, can become very messy. Ideally, the user would, when looking up an expression, always specify which language it is in -- and then only see the DefinedMeanings in that language. However, for this to work effectively, MediaWiki must support looking up pages in a particular language, exactly the functionality that MLMW will provide.

This is one of many examples in which WiktionaryZ development benefits MediaWiki as a whole. Moreover, it reflects our philosophy to structure our work so that milestones can be reached independently whenever possible. We had some initial problems with the MLMW project -- a funding source ran out, and a developer team became unavailable. Fortunately, this did not impact the main WZ development, and work could continue as soon as we found a new source of funding and a new developer.

I'm not aware of any other wiki engine which handles content in multiple languages well. Hopefully, MediaWiki will become the first.

Tuesday, November 14, 2006

About Definitions

There is an art to writing lexical definitions. The problem for WiktionaryZ is that this art has to be relearned. Definitions need to be concise. There is, however, no reason not to use full sentences. Abbreviations are not needed as there is plenty of space on the hard drives of modern computers.

One accepted practice in the dead wood dictionaries is to define by using synonyms. For WiktionaryZ this leads to circular definitions, because the same definition applies to the synonym just as much.

Another thing is the use of bad constructs like "To do something (to someone)". This looks bad; it is better to just say "To do something to someone". There is no reason why we cannot produce full sentences without constructs that are inherited from existing dictionaries. If they were from existing dictionaries, they might be in line for improvement.

Yet another thing that we should not do is be overly specific and correct. The definitions define; they show what it is about. When this is not defined in "sufficient" detail, it often proves that there is a need for an encyclopaedic article, and that is when we need to refer to the Wikipedia article about the topic.


Friday, November 10, 2006

Success can kill

I am really grateful that Erik was able to reboot WiktionaryZ from Heathrow. It saved us from another few hours of downtime. When he finally came home and we talked, I learned that it is the popularity of WiktionaryZ that brought the system to its knees. When you look at the statistics of WiktionaryZ, you will see an upward trend and, as Erik indicates, it is getting close to the maximum that the current hardware can handle.

It means that we have to upgrade our servers in the very near future.

The development of the software is going ahead strongly. I was told that we are now at the stage where we can start developing attributes on the level of the SynTrans records. This is techno babble for: we are ready to work on things like parts of speech functionality.

Parts of speech are language specific. This means that two languages will not necessarily have the same parts of speech, inflections or conjugations. It also means that the development will happen in phases:
  • Indicate that a specific part of speech exists for a language.
  • Indicate that a word that is among the translations or synonyms is a particular part of speech.
  • Indicate what conjugations / inflections are possible in a language for a specific part of speech.
  • Allow for scripts that propose what the conjugations / inflections could look like.
When you read this, it is rather glib. There are many issues that need resolving and all these issues are not mentioned. There are also several opportunities that will make for better functionality that are not indicated. But hey, this is a blog not a specification.
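
For those who prefer code to prose, the four phases could be sketched roughly as follows. This is only an illustration in Python, not how WiktionaryZ stores anything: the class `LanguageGrammar`, its field names, and the toy English plural rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageGrammar:
    """Hypothetical per-language grammar data; one instance per language."""
    language: str
    # Phases 1 and 3: part of speech -> names of its possible inflections
    parts_of_speech: dict = field(default_factory=dict)
    # Phase 2: expression -> part of speech
    tagged: dict = field(default_factory=dict)
    # Phase 4: (part of speech, inflection) -> function proposing a form
    proposers: dict = field(default_factory=dict)

    def add_part_of_speech(self, pos, inflections=()):
        self.parts_of_speech[pos] = list(inflections)

    def tag(self, expression, pos):
        # A word can only get a part of speech that exists for this language.
        if pos not in self.parts_of_speech:
            raise ValueError(f"{pos!r} is not defined for {self.language}")
        self.tagged[expression] = pos

    def propose(self, expression):
        """Return proposed forms only; a person still has to confirm them."""
        pos = self.tagged[expression]
        proposals = {}
        for inflection in self.parts_of_speech[pos]:
            rule = self.proposers.get((pos, inflection))
            if rule is not None:
                proposals[inflection] = rule(expression)
        return proposals


english = LanguageGrammar("eng")
english.add_part_of_speech("noun", ["plural"])
english.proposers[("noun", "plural")] = lambda word: word + "s"  # naive toy rule
english.tag("word", "noun")
print(english.propose("word"))
```

The naive rule would happily propose "childs" for "child", which is exactly why the scripts should only propose forms and not decide them.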

What IS important is that this is a good time to talk about this. My point is that we are about to design this functionality and that we do welcome comments.


Wednesday, November 08, 2006

Language names

WiktionaryZ is not far off from the moment when we will be able to have the labels of the languages shown with the translations and synonyms in the language of the user interface (UI). The way I understand it will work is that a DefinedMeaning that is tagged with the corresponding ISO-639-3 code will have its available translations copied to the Language table.

This will be fun to observe.

The codes that we use do not allow for specific versions of, for instance, English; "English (American)" is coded like eng-US. This is, however, not an ISO-639-3 code. We will find a solution for that one. For other languages there is utter confusion about what is meant by a language name; Schwyzerdütsch, for instance, is also called Alemannisch and it is certainly not the Swiss version of High German. Then there are languages like "Western Mari": Wikipedia considers it a dialect, whereas Ethnologue recognises it as a language.
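
One possible way to deal with codes like eng-US, sketched in Python purely as an illustration: split off the part after the hyphen, look up the name for the ISO-639-3 part in the language of the user interface, and fall back to English or to the bare code when no name is known yet. The function names, the `LANGUAGE_NAMES` table and its layout are assumptions of mine, not the actual Language table.

```python
def split_code(code):
    """Split 'eng-US' into ('eng', 'US'); a plain 'eng' gives ('eng', None)."""
    base, _, variant = code.partition("-")
    return base, variant or None


# Hypothetical stand-in for the Language table:
# (language being named, UI language) -> label
LANGUAGE_NAMES = {
    ("eng", "eng"): "English",
    ("eng", "nld"): "Engels",
    ("nld", "eng"): "Dutch",
    ("nld", "nld"): "Nederlands",
    ("urd", "eng"): "Urdu",
}


def language_label(code, ui_language, fallback="eng"):
    """Label for a language code, shown in the UI language where known."""
    base, variant = split_code(code)
    name = (LANGUAGE_NAMES.get((base, ui_language))
            or LANGUAGE_NAMES.get((base, fallback))
            or base)  # last resort: show the code itself
    return f"{name} ({variant})" if variant else name


print(language_label("eng-US", "nld"))
print(language_label("urd", "nld"))
```

The second call falls back to the English name because no Dutch name for Urdu is in the table, which is the kind of gap contributors would be asked to fill.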

I hope and expect that people will want to have all the languages in their language on WiktionaryZ. For many languages it will be hard to find a name that fits the orthographic rules of their language. Haiǁom for instance ...


PS It would be nice if these entries were sorted. Well, this is something for another blog post.


Tuesday, November 07, 2006

When two people do the same ...

... and you are not aware of it, you can get in a situation where you would like to .... hmmm ... don't know how to translate this into English .... but the sense is: you get somewhat angry with yourself.

While writing the post from some minutes ago I learnt that what Webboy programmed had already been programmed by Leftmost and also submitted. We just were not aware of it.

The positive side of this is: we now know that both are great programmers.

The negative side is: we were not able to manage the programming in such a way that this double effort was avoided. That is really not nice for either of them. So: sorry, we will try to find a way to prevent this from happening again.

Well yes, we need an open, accessible list where we can write down our wishes and where programmers can add a note saying "is on the way". So: time to think about a solution ... and we shall do that now.

Sorry again Leftmost and Webboy!

Queries ... they are such great ...

Today I saw one of those queries applied to WiktionaryZ. Imagine you want to know for which expressions of a certain collection there is a defined meaning in a certain language. So you choose the language you want to translate from, choose the one you want to translate into and, if desired, you choose a collection. Then you click on that neat button that says "search". What you get is a list of terms where you can work on adding the translations of DefinedMeanings. That is great, right? Well, at this stage a big "thank you" to Webboy, who is the programmer of that bit.
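
Just to illustrate the idea behind such a query (this is not Webboy's code, and the data layout is made up by me): take the DefinedMeanings, keep the ones that have an expression in the source language but none in the target language, and optionally restrict them to one collection. The function name `missing_translations`, the dictionary keys and the sample data are all hypothetical.

```python
def missing_translations(defined_meanings, source, target, collection=None):
    """Source-language expressions whose DefinedMeaning lacks a target expression."""
    results = []
    for dm in defined_meanings:
        if collection is not None and collection not in dm["collections"]:
            continue
        expressions = dm["expressions"]
        if source in expressions and target not in expressions:
            results.append(expressions[source])
    return sorted(results)


# Hypothetical sample data: language code -> expression, plus collection names
sample = [
    {"expressions": {"eng": "house", "nld": "huis"}, "collections": {"children"}},
    {"expressions": {"eng": "tree"}, "collections": {"children"}},
    {"expressions": {"eng": "water"}, "collections": set()},
]

print(missing_translations(sample, "eng", "nld"))
print(missing_translations(sample, "eng", "nld", collection="children"))
```

In the real system this would of course be a database query rather than a loop over a list, but the selection logic is the same.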

Now imagine our OLPC Children's Dictionary ... the same feature only searching on DefinedMeaning level ... that would help us to save really a lot of time - it would not be necessary anymore to say: please take care of these 5 words ... you could say: go there, insert English + your target language + select the OLPC Children's Dictionary as a collection and please do some words (even only one word). That would be really great ... I hope we will get this feature soon.

Well there is another thing I would like to mention when it comes to the OLPC Children's Dictionary ... and that is about a dump that was needed to show how we can do things - in a period when the server was down and difficult to access ... Thanks to Leftmost for helping out with that.


Friday, November 03, 2006

Hosting as a showstopper

WiktionaryZ is down. It is probably an index in the database or some such that prevents the system from coming up. To make matters worse, Erik, the only person with access to the server, is on his way to Jamaica from Berlin. This means extended downtime. Extended downtime is not good for a project; it leads to disgruntled users and it prevents the further adoption of the project.

WiktionaryZ will in the near future (i.e. this year) host what some call "professional wikis". They are extensions to specific domains. The people that have this need require certain standards. One of these is 24/7 support, another is back-up support.

There are two points to this. The first is that hosting will not be on Erik's server when the new content is merged into WiktionaryZ. The second is that a known single point of failure is only acceptable as long as you are willing to pay the price of failure. It is good that we are increasingly unwilling to pay this price, and it is gratifying to realise that our hosting WILL be elsewhere at the end of the year.

For those who do not appreciate it: having WZ hosted by Erik was a good thing. It was the option that we had.