Wednesday, May 31, 2006

Some progress

It is one thing to say that an effort is important; it only becomes important when you are willing to act on it. In my previous post on this blog, I mentioned the internationalization and localization effort that is going on for the MediaWiki software.

We have now linked the information about languages and their system files to both Ethnologue and WiktionaryZ. We have also included a category with the languages that are recognized as languages under ISO 639-3. The next bit will be to link back from WiktionaryZ to this development wiki.

One of the things still to do is to deal with the existing localizations for languages like Chinese, Arabic, Kurdish and Farsi. The codes that are in use in MediaWiki for these are now designated as macrolanguages; this means that they represent more than one language. Theoretically the fix is easy, just change the codes, but practically it is more problematic.
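To make the macrolanguage problem concrete, here is a minimal sketch. The mapping below is illustrative and partial, not the actual migration table; the individual-language codes shown (cmn, yue, arb, kmr, ckb, pes, prs) are real ISO 639-3 codes, but which ones a migration would actually use is an assumption on my part.

```python
# Illustrative sketch: a MediaWiki macrolanguage code can stand for
# several ISO 639-3 individual-language codes. Partial, hypothetical mapping.
MACROLANGUAGE_SPLITS = {
    "zh": ["cmn", "yue", "wuu"],  # Chinese -> Mandarin, Cantonese, Wu
    "ar": ["arb", "arz"],         # Arabic -> Standard Arabic, Egyptian Arabic
    "ku": ["kmr", "ckb"],         # Kurdish -> Northern, Central Kurdish
    "fa": ["pes", "prs"],         # Farsi -> Iranian Persian, Dari
}

def individual_codes(code: str) -> list[str]:
    """Return the ISO 639-3 individual-language codes behind a MediaWiki code."""
    return MACROLANGUAGE_SPLITS.get(code, [code])
```

This shows why "just change the codes" is not enough: an existing localization filed under "zh" has no single obvious successor code.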


Tuesday, May 30, 2006

The User Interface and project codes

The user interface (UI) of WiktionaryZ will consist of two parts. One part is specific to WiktionaryZ itself; this part will have its data as part of the application itself... we eat our own dogfood. The second part is the MediaWiki UI. Its maintenance has been a big pain, but in the last months a lot of great work has been done by Gangleri, Nikerabbit and many others. They have created a great resource where people work on the internationalization and the localization of the MediaWiki software.

For those who do not know the difference: internationalization is making software usable in a multilingual environment, while localization is translating and adapting it to a specific language and locale. Given the number of Wikipedia projects, MediaWiki is certainly a project that is on its way to becoming outstanding in this regard.

There is one problem with respect to the localization: the codes that are used to indicate languages and projects are not the same. Wikipedia uses many ISO 639 codes but also adds codes as and when it sees fit. This has resulted in incompatibilities with ISO 639 itself.

Correct internationalization and localization is a very important part of creating a multilingual UI for WiktionaryZ. It is therefore gratifying that WiktionaryZ and Ethnologue are now part of the metadata on the implemented languages. The first job will be to link the easy connections; the second will be the problematic ones. The most important job will be to urge people to help with the localization of the MediaWiki messages for their language(s). This will improve not only WiktionaryZ but all MediaWiki projects, so it is truly in the interest of everyone who uses MediaWiki.


Sunday, May 28, 2006

About splitters and lumpers

When you are working on the creation of a resource like WiktionaryZ, one of the biggest issues is the theoretically large number of different concepts, or better, DefinedMeanings. There are people (the splitters) who are of the opinion that "clear" distinctions have to be made between the different ways an Expression (word or phrase) is to be understood; another group (the lumpers) is of the opinion that it is "clear" that such a fine-grained set of definitions is more confusing than helpful, and that a reduced set of DefinedMeanings is therefore in order.

This is a recipe for many quarrels; there is an obvious need for an objective way out.

At the LREC 2006 conference, Ed Hovy gave a keynote speech that I really enjoyed and that, in my mind, has implications for how we can resolve these questions. The key idea is that when a word or phrase has different senses, it should be possible to have a group identify these senses in a corpus and achieve at least 90% agreement. One really nice side effect is that it was shown empirically that fewer definitions typically achieve better results.
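The 90% criterion can be sketched as simple pairwise percent agreement between two annotators tagging the same corpus occurrences; this toy example is my own illustration of the idea, not a method from the talk, and the sense labels are made up.

```python
# Sketch: pairwise percent agreement on sense annotation.
# A sense inventory passes the criterion when annotators tagging the
# same occurrences agree at least 90% of the time.
def percent_agreement(ann_a: list[str], ann_b: list[str]) -> float:
    """Fraction of occurrences where two annotators chose the same sense."""
    assert len(ann_a) == len(ann_b)
    same = sum(a == b for a, b in zip(ann_a, ann_b))
    return same / len(ann_a)

a = ["bank/river", "bank/money", "bank/money", "bank/river"]
b = ["bank/river", "bank/money", "bank/river", "bank/river"]
# agreement here is 3 out of 4, i.e. 0.75: below the 90% threshold,
# which would argue for lumping the hard-to-distinguish senses.
```

The splitter/lumper dispute then becomes empirical: keep splitting senses only as long as annotators can still reliably tell them apart.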

When you have a more limited set of DefinedMeanings, there are also Definitions that are part of either a domain or a resource that has its place in WiktionaryZ. My solution would be to allow these as secondary definitions. They are there to show the different ways a concept is understood, and they also provide a link to the resources that are imported into WiktionaryZ.

One thing we will have to see is how this works out when we consider the multilingual aspect of WiktionaryZ. I am fairly confident that adopting this approach will help us reduce the number of instances where the "identical meaning" flag is turned off.


Friday, May 12, 2006

WiktionaryZ Licensing

GerardM asked me to share my thoughts about the problem of choosing the best licensing option for WiktionaryZ. I have posted them on the wiki and encourage you to add your own feedback.

Thursday, May 11, 2006

ISO 639-3 instead of a mixture ....

Two days ago Gerard asked people on the Wikimedia mailing lists if there were objections to a change from ISO 639-2 combined with ISO 639-3 to the use of only ISO 639-3 codes. Well ... we have not received any objections so far, so I suppose people agree with us that using only one list of codes makes much more sense and is less confusing. We also talked about this with other people, who agreed that the change makes sense. Since doing double work should be avoided, it is best to move over as soon as possible; this means that we are now starting to move all WiktionaryZ portals over to ISO 639-3 codes.
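A small sketch of why one list is less confusing: ISO 639-2 has bibliographic/terminological code pairs for some languages, while ISO 639-3 keeps a single code per language. The mapping table below is a tiny illustrative excerpt I put together, not the actual portal migration script.

```python
# Illustrative excerpt: ISO 639-2 allows two codes for some languages
# (bibliographic and terminological); ISO 639-3 has exactly one.
ISO639_2_TO_3 = {
    "dut": "nld", "nld": "nld",  # Dutch
    "ger": "deu", "deu": "deu",  # German
    "fre": "fra", "fra": "fra",  # French
}

def portal_code(old: str) -> str:
    """Map an ISO 639-2 code to its ISO 639-3 equivalent (identity if unknown)."""
    return ISO639_2_TO_3.get(old, old)
```

After such a migration, "dut" and "nld" portals both resolve to a single "nld" portal, which is exactly the deduplication the post argues for.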

Ciao, Sabine

Friday, May 05, 2006


When trolls become a pest, you do not wish them on the Scandinavians (where they are said to live). You do not feed them. You make it more difficult for them to feed on the emotional annoyance they create.

What you do is create plenty of sysops on WiktionaryZ who can act on the first signs of trolling, and, sadly, change the reply functionality of this blog to moderated. This means that someone, me, has to approve the reactions to the blog. I do this gladly, because I do not suffer trolls gladly.


Wednesday, May 03, 2006

The trouble with GFDL

The GFDL includes a clause that says that attribution must be given in order to copy content. The same is true of CC-by. The GFDL, though, requires that the individual contributors be named. For documentation, the original purpose of the GFDL license, that makes sense. For projects with more content and more contributors, the requirement to name individual contributors becomes a burden.

One interpretation of GFDL suggests that the use of a particular Wiktionary entry requires attribution of all its contributors. This approach would paralyze the free re-use of the data in applications such as spell-checkers. Another approach would be to handle the attribution en masse, such as by including a single list of contributors to imported data (perhaps with edit counts) without tracking who contributed what.
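The "en masse" approach could be as simple as collapsing a per-edit history into one contributor list with edit counts. This is only a sketch of the idea; the record fields are made up and nothing here reflects how MediaWiki actually stores history.

```python
# Sketch of en-masse attribution: reduce a per-edit log to a single
# contributor list with edit counts. Field names are hypothetical.
from collections import Counter

edits = [
    {"entry": "water", "user": "Alice"},
    {"entry": "water", "user": "Bob"},
    {"entry": "fire",  "user": "Alice"},
]

# One aggregate credit list instead of per-entry attribution.
contributors = Counter(e["user"] for e in edits)
```

The point is the trade-off: a re-user of the data ships one short credits list rather than tracking which contributor touched which entry.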

The GFDL license also does not consider the attribution of organizations. We do not know, for instance, the names of all the individuals who contributed the GEMET data. WiktionaryZ has partners already, and will have many more as it grows.

In order to proceed smoothly, we propose the use of the CC-by license. At this stage, we welcome discussion. If you agree with this direction, please state your agreement to license your own contributions under CC-by on your user pages on WiktionaryZ and your home Wiktionary.

Monday, May 01, 2006

Editing the relational data of WiktionaryZ

The first edits on WiktionaryZ have been done. This is an exciting time; not only will we learn how the (pre-alpha) software works, we will also have to make sure that the concept of the DefinedMeaning is ingrained in the minds of the editors. This means that the flag indicating that an "identical meaning" is present in a synonym or translation is turned off when the translation is problematic...
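As a rough mental model of what editors are working with, the DefinedMeaning idea can be sketched as a small data structure: one language-neutral concept carrying a definition, with synonyms and translations attached, each with its identical-meaning flag. The field and class names here are my own invention; the real WiktionaryZ schema differs.

```python
# Minimal sketch of the DefinedMeaning concept. Names are hypothetical,
# not the actual WiktionaryZ schema.
from dataclasses import dataclass, field

@dataclass
class SynTrans:
    expression: str                  # the word or phrase
    language: str                    # ISO 639-3 code
    identical_meaning: bool = True   # turned off when the match is approximate

@dataclass
class DefinedMeaning:
    definition: str
    syntrans: list[SynTrans] = field(default_factory=list)

dm = DefinedMeaning("A domesticated canine.")
dm.syntrans.append(SynTrans("dog", "eng"))
dm.syntrans.append(SynTrans("hond", "nld"))
# A problematic translation gets the flag turned off:
dm.syntrans.append(SynTrans("Hund", "deu", identical_meaning=False))
```

The flag is what keeps the relational model honest: synonyms and translations hang off one concept, and imperfect matches are marked rather than silently merged.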

Well, actually, at this moment my overriding feeling is one of joy; all these problems will sort themselves out. Now we can think in terms of solutions; we can add and edit Definitions and add translations and synonyms. Now we can demonstrate that the relational approach makes sense in a wiki environment. Now we can make a difference ... :)

The first thing we "solved" was to expand the list of languages that can be chosen from. To the list that comes with GEMET, I asked to have Chinese, Russian and Neapolitan added. For these three languages we have people who will do it right. They are just the first few new languages; as you know, we want them all.