WiktionaryZ

Tuesday, December 12, 2006

Last post for this blog ..

The OmegaWiki project aims to provide information about all words of all languages with a user interface in all languages. It achieves this by using relational database technology extending the wiki paradigm. This project attracted attention from organizations that made the development of relational database technology from inside the MediaWiki software possible.

OmegaWiki originated from the 171 Wiktionary projects of the Wikimedia Foundation. Originally the project was called "Ultimate Wiktionary"; it was then renamed "WiktionaryZ", and it will now be known as OmegaWiki to prevent confusion with its Wiktionary sister projects.

The OmegaWiki project will include specialist terminology and will have an ontological component that will allow for the inclusion of specialist thesauri. Because of the project's experimental nature and some of its requirements, the Wikimedia Foundation is not in a position to host the OmegaWiki project.

To allow for the further development and hosting of the OmegaWiki project, the Stichting Open Progress, a foundation established under Dutch law, will assume responsibility for the project. The "WiktionaryZ commission" will continue to take care of the day-to-day issues for the OmegaWiki project, and it will be renamed the "OmegaWiki commission".

The content of OmegaWiki will continue to be licensed under a combined GFDL and CC-by license. This is to enable what the project has defined as success: "success is when someone finds an application for our data that we did not think of".

The Wikimedia Foundation and Stichting Open Progress will continue to work together to achieve their shared objective: to bring information to all people of this planet.
We hope and expect that the Wikimedia Foundation and Stichting Open Progress will be able to work together to achieve their shared objective: to bring information to all people of this planet.

For the Stichting Open Progress and the OmegaWiki committee,
Gerard Meijssen

Wednesday, December 06, 2006

Ahead of the curve

WiktionaryZ is very much a project in its infancy. Many parts that are necessary to fulfil its promise are lacking. Nothing new here; WiktionaryZ is still pre-alpha software. One of the great things about Open Source is that when a project is sufficiently attractive, people come along who are willing to collaborate.

WiktionaryZ and the Kamusi project have many similarities as well as differences. The most striking differences are in scope and maturity: Kamusi is about the Swahili language, and it has been going strong for quite some time. As Kamusi is so far ahead of the curve compared with WiktionaryZ, it is really interesting to look at its strengths.

I was reminded of Kamusi by its recent newsletter; it offers a rich mix of facts about culture and updates on the project. Merchandise is offered too, with the Swahili clock being imho the most spectacular item. All this to fund a project that runs as much on passion as it does on money.

Here too is a project that is live while continuously being developed. It is a project that depends on its community: it currently honours the people who contributed the most content with statistics and, like WiktionaryZ, it has supporters who are part of what makes Kamusi a reality.

I met Martin Benjamin, the Kamusi editor, at Wikimania 2006 in Boston. It was too short a pleasure. We have talked since on several occasions using Google Talk, and as Kamusi is ahead of the curve, I have been privileged to learn much from the Kamusi experience.

My hope is that at some stage we will have the maturity of Kamusi. It is likely that we will have this maturity first for some languages and slowly but surely for others as well. WiktionaryZ is very much a journey; enjoy the ride :)

Thanks,
GerardM

Sunday, November 26, 2006

WiktionaryZ is a wiki and not perfect

One of the contributors to WiktionaryZ, who has been with us from the start, is really into the "minor" languages. For the language this contributor works on there are no newspapers, and there are many dialects. And now we need a user interface in this language too.

For this stellar worker, this is a problem. As the orthography is not set in stone, what is the correct word to translate "English", "German" or "Italian"? As people will see it all the time, it has to be correct. There is also some reputation involved; when you are seen to be perfect in a language, how can you err when everyone will start using what you provide?

I have been adding many translations in the last few days as well. I have been working hard to add this content on the Dutch Wiktionary. Now I have learned that the names of languages are not capitalised in Russian and Danish, so I will change these when I copy them in. I am not sure about languages like Slovak and Slovene. I have the choice not to copy them in and leave it for someone else, or I can make mistakes and hope that someone cleans up after me.

I am sure that some of the stuff I did needs correction. In the meantime it is an honest effort, and it shows that by making a correction ONCE, we truly benefit from the way WiktionaryZ works.

Thanks,
GerardM

Monday, November 20, 2006

*NOW WITH LANGUAGES IN YOUR LANGUAGE*

Well, sometimes advertising the good stuff makes you feel really good. Today, the data that allows us to show language names in the language of the User Interface has gone live. For WiktionaryZ this is really important new functionality. We want to have a user interface that brings WiktionaryZ to users in their own language. You should not have to know English to get your information about an Urdu word.

The one problem is that we can only show you the languages in your language when we know their names. That is why we would like you to help us build this part of the User Interface. This is truly stuff that we will be using all the time.
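To make the idea concrete, here is a minimal sketch of such a lookup. It is not the WiktionaryZ code: the table, the function names and the choice to fall back to English (and then to the bare language code) are all assumptions made for illustration.

```python
# Hypothetical sample data: (UI language code, language code) -> language name.
LANGUAGE_NAMES = {
    ("en", "ur"): "Urdu",
    ("en", "nl"): "Dutch",
    ("en", "de"): "German",
    ("nl", "ur"): "Urdu",
    ("nl", "nl"): "Nederlands",
    ("nl", "en"): "Engels",
}


def language_name(ui_language, code, fallback_language="en"):
    """Return the name of language `code` as shown in `ui_language`.

    Assumption: when nobody has contributed the name yet, fall back to
    the fallback language, and finally to the bare language code.
    """
    name = LANGUAGE_NAMES.get((ui_language, code))
    if name is not None:
        return name
    return LANGUAGE_NAMES.get((fallback_language, code), code)


def missing_names(ui_language):
    """List the language codes that still lack a name in `ui_language`."""
    known_codes = {code for (_, code) in LANGUAGE_NAMES}
    translated = {code for (lang, code) in LANGUAGE_NAMES if lang == ui_language}
    return sorted(known_codes - translated)


print(language_name("nl", "en"))  # a name that was contributed in Dutch
print(language_name("nl", "de"))  # not contributed yet, falls back to English
print(missing_names("nl"))        # the names that still need a volunteer
```

A list like the one from `missing_names` is exactly where contributors can help: every name that is filled in removes one fallback.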

Thanks,
GerardM

Wednesday, November 15, 2006

Why Multilingual MediaWiki matters

You might think that MediaWiki, the software that runs Wikipedia (and WiktionaryZ), is fully multilingual. After all, everyone knows that Wikipedia exists in more than 200 languages -- all of them intricately connected through a network of "interlanguage links" from one edition to the other.

In actual fact, each language edition of Wikipedia runs on a separate database. Wikimedia uses a special setup to share code and configuration files, but that setup is not trivially reproducible. If you want to set up a wiki in multiple languages, your best bet is to set up multiple instances of MediaWiki. As a result, most MediaWiki installations around the world are monolingual. They only accept content in one language.

Of course, users are free to put, say, French language content into an English language wiki. They can even change their user interface language preference to French. But the problems begin quickly:
  • If multiple languages use the same title to refer to a particular page, you need to disambiguate it. In single database setups, this is typically done by appending the language code manually to the page title. These titles can quickly become messy and inconsistent.
  • As soon as activity picks up in multiple languages, the list of changes to the wiki quickly becomes cluttered with information that is useless to readers who do not speak the particular languages in which edits are made.
  • It is impossible to systematically search for pages in a particular language.
  • Pages about the same content in different languages have to be manually connected to each other, which is often done using templates. The wiki does not facilitate the process of interlanguage linking in a single installation. The interlanguage links which are used on Wikipedia are horribly inefficient, as a separate set of language links has to be maintained for each language.
  • The first experience for a user who does not speak the default language of the wiki is often negative. Unless the wiki has been specifically built (with policies and interface messages) to encourage multilingual contributions, such contributions are unlikely even if they are theoretically possible.
Take a look at a big wiki hosting site like Wikia -- even though it sets up multilingual wikis on request (by setting up multiple databases), most of its wikis are monolingual by default. English is of course predominant. With hundreds of millions of Internet users who do not speak English but would be happy to contribute to these wikis, this is a tremendous loss of opportunity. Outside a framework like Wikia, with users setting up their own wikis, it gets even worse: very few people go to the effort of structuring their wikis to accept content in multiple languages.

Wikis have become as ubiquitous as forums or blogs. Whether we are talking about documentation, knowledge bases, directories, discussions, experiments in democracy, media archives -- there are millions of potential participants out there, waiting to be invited to contribute. Waiting to feel welcome. We need to reach out to them. It is not just the community that needs to make the decision to "go multilingual". It is the software that should support this decision as much as possible.

Fortunately, there is an answer: Multilingual MediaWiki. This set of specifications describes the changes to the MediaWiki software needed to accept content in multiple languages, to network it effectively, and to build truly multilingual communities. And fortunately, this is more than just a paper: it is being implemented by a very capable programmer, with financial support from the University of Bamberg and another sponsor who shall remain unnamed for now. You can view the first prototype (still very messy :-), which showcases the functionality to a) store content in language "meta-namespaces" and keep it separate, and b) connect pages in different languages to each other.
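The two prototype features can be pictured with a toy model. This is only a sketch under assumptions, not the Multilingual MediaWiki design: the class, its method names and the idea of a shared "concept" identifier are invented here for illustration.

```python
from collections import defaultdict


class MultilingualWiki:
    """Toy model: pages are stored per language and linked through a shared concept."""

    def __init__(self):
        self.pages = {}                    # (language, title) -> page text
        self.concepts = defaultdict(dict)  # concept id -> {language: title}

    def save(self, language, title, text, concept=None):
        # The language is part of the key, so identical titles never collide.
        self.pages[(language, title)] = text
        if concept is not None:
            self.concepts[concept][language] = title

    def get(self, language, title):
        return self.pages.get((language, title))

    def translations(self, language, title):
        # One shared set of links serves every language edition.
        for links in self.concepts.values():
            if links.get(language) == title:
                return {lang: t for lang, t in links.items() if lang != language}
        return {}

    def titles_in(self, language):
        # Listing or searching by language becomes a simple filter.
        return sorted(t for (lang, t) in self.pages if lang == language)


wiki = MultilingualWiki()
wiki.save("en", "AIDS", "English text", concept="aids")
wiki.save("nl", "AIDS", "Nederlandse tekst", concept="aids")
wiki.save("fr", "SIDA", "Texte français", concept="aids")
print(wiki.translations("en", "AIDS"))
print(wiki.titles_in("nl"))
```

Keying pages by language removes the need to append language codes to titles by hand, and keeping a single set of links per concept avoids maintaining a separate list of interlanguage links in every language.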

There's still quite some way to go until this becomes part of MediaWiki proper, but we are making steady progress. When the project is completed, thousands of MediaWiki installations across the planet will gradually become fully capable of accepting content in all languages of the world (if their owners want them to). It will be another step in opening up the world of wikis to the global community.

Beyond its very direct impact on wiki users, MLMW is a requirement for WiktionaryZ. Right now, pages about expressions such as AIDS, which exist in many languages, can become very messy. Ideally, the user would, when looking up an expression, always specify which language it is in -- and then only see the DefinedMeanings in that language. However, for this to work effectively, MediaWiki must support looking up pages in a particular language, exactly the functionality that MLMW will provide.

This is one of many examples in which WiktionaryZ development benefits MediaWiki as a whole. Moreover, it reflects our philosophy to structure our work so that milestones can be reached independently whenever possible. We had some initial problems with the MLMW project -- a funding source ran out, and a developer team became unavailable. Fortunately, this did not impact the main WZ development, and work could continue as soon as we found a new source of funding and a new developer.

I'm not aware of any other wiki engine which handles content in multiple languages well. Hopefully, MediaWiki will become the first.

Tuesday, November 14, 2006

About Definitions

There is an art to writing lexical definitions. The problem for WiktionaryZ is that this art has to be relearned. Definitions need to be concise. There is, however, no reason not to use full sentences. Abbreviations are not needed, as there is plenty of space on the hard drives of modern computers.

One accepted practice in the dead wood dictionaries is to define by using synonyms. For WiktionaryZ this leads to circular definitions, because the definition defines the synonym just as much.

Another thing to avoid is bad constructs like "To do something (to someone)". This looks bad; it is better to just say "To do something to someone". There is no reason why we cannot produce full sentences without constructs that are inherited from existing dictionaries. If definitions did come from existing dictionaries, they might be in line for improvement.

Yet another thing that we should not do is be overly specific and correct. The definitions define; they show what it is about. When this is not defined in "sufficient" detail, it often proves that there is a need for an encyclopaedic article, and that is when we need to refer to the Wikipedia article about the topic.

Thanks,
GerardM

Friday, November 10, 2006

Success can kill

I am really grateful that Erik was able to reboot WiktionaryZ from Heathrow. It saved us from another few hours of downtime. When he finally came home and we talked, I learned that it is the popularity of WiktionaryZ that brought the system to its knees. When you look at the statistics of WiktionaryZ, you will see an upward trend and, as Erik indicates, it is approaching the maximum that the current hardware can handle.

It means that we have to upgrade our servers in the very near future.

The development of the software is going ahead strongly. I was told that we are now at the stage where we can start developing attributes on the level of the SynTrans records. This is technobabble for: we are ready to work on things like part-of-speech functionality.

Parts of speech are language specific. This means that two languages will not necessarily have the same parts of speech, inflections or conjugations. It also means that the development will happen in phases:
  • Indicate that a specific part of speech exists for a language.
  • Indicate that a word that is among the translations or synonyms is a particular part of speech.
  • Indicate what conjugations / inflections are possible in a language for a specific part of speech.
  • Allow for scripts that propose what the conjugations / inflections could look like.
When you read this, it is rather glib. There are many issues that need resolving, and none of them are mentioned. There are also several opportunities for better functionality that are not indicated. But hey, this is a blog, not a specification.
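Still, the four phases can be sketched in a few lines. This is purely illustrative and not the WiktionaryZ data model: the `Language` class, its field names and the naive English plural rule are assumptions invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Language:
    name: str
    # Phase 1: the parts of speech that exist in this language.
    parts_of_speech: set = field(default_factory=set)
    # Phase 3: possible inflections per part of speech.
    inflections: dict = field(default_factory=dict)
    # Phase 4: scripts that propose a form, keyed by (part of speech, inflection).
    proposers: dict = field(default_factory=dict)

    def tag(self, word, part_of_speech):
        # Phase 2: a word can only get a part of speech its language knows.
        if part_of_speech not in self.parts_of_speech:
            raise ValueError(f"{self.name} has no part of speech {part_of_speech!r}")
        return (word, part_of_speech)

    def propose_forms(self, word, part_of_speech):
        # Phase 4: proposals only; a human still has to confirm them.
        forms = {}
        for inflection in self.inflections.get(part_of_speech, []):
            script = self.proposers.get((part_of_speech, inflection))
            if script is not None:
                forms[inflection] = script(word)
        return forms


english = Language("English")
english.parts_of_speech.add("noun")
english.inflections["noun"] = ["plural"]
# Deliberately naive rule, wrong for words like "child" or "sheep".
english.proposers[("noun", "plural")] = lambda word: word + "s"

print(english.tag("dog", "noun"))
print(english.propose_forms("dog", "noun"))
```

Because everything hangs off the `Language` object, a second language can define entirely different parts of speech and inflections without touching the first.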

What IS important is that this is a good time to talk about this. My point is that we are about to design this functionality and that we welcome comments.

Thanks,
GerardM