Tuesday, January 17, 2006

As Gerard has pointed out, Wikidata and WiktionaryZ development is structured so that we can feed as much functionality as possible directly into the software which runs the Wikimedia projects, MediaWiki. The namespace manager, which is going to be part of MediaWiki 1.6 if I can help it, was the first example of this strategy. Namespaces are essential to structuring data -- they allow us to distinguish different types of content in a single wiki installation.


In WiktionaryZ, for example, we have expressions (which are connected to meanings, synonyms, translations, and so on), but we also have collections (allowing us to identify sets of concepts as being part of a single "body of work", such as the GEMET thesaurus), relation types, attributes, and quite importantly, regular wiki pages: there will be portals, policy and help pages in WiktionaryZ, as well as discussion pages. Namespaces allow us to keep it all neat and separate. These namespaces used to be hardcoded, hackish and inflexible. This has now changed.


The namespace manager is, however, not only beneficial to WiktionaryZ. It is also beneficial to other Wikimedia projects, such as Wikibooks, which can use namespaces to structure large, related sets of pages. It is beneficial, in fact, to any MediaWiki user. And as developers of WiktionaryZ, we have the benefit that our architecture is sound, that our work is visible, that it is integrated and maintained as part of the official MediaWiki codebase, and that it is tested and reviewed by many users.


I have now specified a new component milestone that will hopefully be similarly beneficial: Multilingual MediaWiki. Currently, MediaWiki has no awareness of languages beyond the user interface level. We have multilingual wikis, such as Wikimedia Commons and Meta-Wiki. However, in these wikis, all languages share a single namespace; if there are title conflicts across languages, they can only be resolved manually. There is no support for linking languages together or creating translations easily.


You may say: "But I thought Wikipedia exists in over 100 languages?" And you're right - it does. But from a technical standpoint, each of these languages is a separate database. Pages in different languages are linked together using something called "Interlanguage Links", a very ugly hack that requires us to have a list of links to the page in all other languages at the bottom of every single language version. In effect, this means that if you have 10 versions of a page, you need to maintain 10*9 = 90 interlanguage links (each of the pages linking to its 9 corresponding pages in other languages). There is no way to get a list of recent changes from a set of languages either -- MediaWiki knows only about one content language, the one it's set up with.
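To make the arithmetic concrete, here is a small Python sketch. It is not MediaWiki code, and the function name is invented purely for illustration. It counts the links that have to be maintained when every language version of a page lists all the others:

```python
def interlanguage_link_count(versions: int) -> int:
    """Links to maintain when each of `versions` pages links to all the others."""
    return versions * (versions - 1)


# Print the number of links for a few page-version counts.
for n in (2, 10, 100):
    print(n, interlanguage_link_count(n))
```

For 10 versions this gives the 90 links mentioned above, and the count grows roughly with the square of the number of languages.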


Still, things are working reasonably well for Wikimedia, and the database split has its advantages, too. But for truly multilingual projects like Meta, Commons and, incidentally, WiktionaryZ, which rely on the presence of a single database, the status quo is badly broken. This is what these specifications seek to address. It is absolutely essential to solve this problem for WiktionaryZ. You can see why by browsing the page index of the GEMET read-only milestone of WiktionaryZ. Here you will notice that pages like "AIDS" have multiple records in the page title index, because they exist in multiple languages. But MediaWiki knows nothing about that -- complicating the situation especially where you have an expression that has different meanings in different languages.
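The "AIDS" situation can be illustrated with a toy index. This is a plain Python sketch, not MediaWiki's actual schema: the record layout, the helper name and the language codes are all assumptions made for the example. It contrasts an index that stores only titles with one that stores a language alongside each title:

```python
from collections import defaultdict

# Language-unaware index: records carry only a title, so the
# two "AIDS" entries cannot be told apart.
unaware_index = ["AIDS", "AIDS", "water"]

# Language-aware index (hypothetical layout): each record is a
# (language code, title) pair. The codes are illustrative only.
aware_index = [("en", "AIDS"), ("de", "AIDS"), ("en", "water")]


def languages_by_title(index):
    """Group language codes by title for a language-aware index."""
    grouped = defaultdict(list)
    for language, title in index:
        grouped[title].append(language)
    return dict(grouped)


# The unaware index only reveals that the title occurs twice.
print(unaware_index.count("AIDS"))

# The aware index can say which languages the title exists in.
print(languages_by_title(aware_index)["AIDS"])
```

With the language stored next to the title, the software could tell the records apart instead of presenting indistinguishable duplicates.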


We could be hackish and try to support multiple languages only within WiktionaryZ. But this would bite us in the long run. It makes much more sense to make MediaWiki a truly multilingual application -- and this will benefit its hundreds of users world-wide. Imagine that, thanks to these changes, every single installation in the world will only have to upgrade MediaWiki to be able to start accepting content in multiple languages, if its operators so desire. Browse the list of sites using MediaWiki. How many of them are multilingual today? How many will be a year from now?


Multilingual MediaWiki also offers some "blue sky" potential. One of those blue sky ideas is the management of translations. When you need a translation of a document quickly, you want to notify all people who are able to provide it to you, and manage the assignment. It would not be too hard to add this functionality: Allow users to specify which languages they are willing and able to translate from, and notify them when a document to be translated appears in one of these languages.
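A minimal sketch of that matching step might look like the following. It is written in Python for illustration only. The `Translator` class, the `translators_to_notify` function and the sample users are all hypothetical and not part of MediaWiki or of the specifications:

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Translator:
    """A user and the languages they are willing to translate from."""
    name: str
    source_languages: Set[str] = field(default_factory=set)


def translators_to_notify(document_language: str,
                          translators: List[Translator]) -> List[str]:
    """Return the names of users who translate from the document's language."""
    return [t.name for t in translators
            if document_language in t.source_languages]


# Hypothetical users and language codes, for illustration only.
translators = [
    Translator("Alice", {"en", "de"}),
    Translator("Bob", {"fr"}),
    Translator("Carol", {"de", "it"}),
]

# A document needing translation appears in German.
print(translators_to_notify("de", translators))
```

Assigning the work and tracking its progress would need more than this, but the notification itself reduces to a simple lookup once the software knows which language a document is in.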


This example shows how important it is to come up with a sound, scalable architecture for Wikidata and WiktionaryZ: it creates synergies. This is what we're trying to do, and this is why things aren't always moving at a rapid pace. Watch this space: One of the next major documents I will publish is a fairly complete set of specifications for Wikidata's versioning engine, multi-language management and schema management.

1 Comment:

SabineWanner said...

I love the idea of multilingual wikis :-) and I immediately have a use for it: for example, the Neapolitan Wikipedia. Neapolitan has no standardised way of writing, and different "Neapolitans" coexist. Other languages are also attributed to Neapolitan even if they are so different that a Neapolitan speaker would not understand them. Multilinguality would help us in that the actual main page would become something like "Welcome to the Wikipedia of the Neapolitan language group", making clear that this is only a regional attribution and not really a linguistic one. From there, each language of the "Neapolitan language territory" would have its own namespace, and that makes a lot of sense since there will be less admin work (we can help each other). It will not matter if there is only one person writing articles in one of the languages, since no additional installation of the software is needed. So the community can be created by offering people content to work on, instead of having to first search for people who would co-operate and then maybe having to disappoint them because others were against the creation of a new Wikipedia.

Ciao, Sabine

11:23 am  
