Tuesday, January 31, 2006

Microsoft Glossaries and similar g...s

Well, today I again read a forum posting about the Microsoft Glossaries that used to be available at ftp://ftp.microsoft.com/developr/msdn/newup/Glossary. Judging by the answers from colleagues, it really seems that Microsoft has blocked access to its glossaries. Translators needed them when localising Windows-based software (and they still do), since consistent terminology across applications is important.

Here is what would be interesting to know. I, for example, localise software from Italian to German and from English to German, and sometimes even from German to Italian or English to Italian in co-operation with a colleague. We of course maintain our own glossaries. Should we need to translate software for Windows, we will very likely exchange words, asking each other "what is this option called on your system?" and so on, and create our own Windows-related glossaries.

What would be very important for me to know is this: would there be problems if we assigned a category called "Microsoft" to one of our privately made glossaries that we make available to all under the GFDL on WiktionaryZ? I mean, Microsoft is a registered trademark, but all translators localising software for Microsoft operating systems need and create such terminology. Or would we need to use another name that does not say "Microsoft based terminology" but something like "M-based terminology for software"? What is the legal status of such material if we really create the word lists ourselves?

We will have similar questions over and over again, I suppose ... for machinery (SCADA), Apple, etc.

Comments and hints on how to treat such terminology would be very much appreciated.



Saturday, January 28, 2006


WiktionaryZ aims to allow collaboration with other organizations. Even at this early stage, we have made contact with universities, translation organizations, and lexicographic companies. All of these organizations have seen the value of free information and an expanded, worldwide staff of editors.

On the surface, these collaborations are great news. Many of these organizations can offer a wealth of data along with money and manpower to assist with importing it, hosting it, maintaining it, and making necessary modifications to the architecture.

Collaborations will raise some questions which the Wiki community has not yet had occasion to address. Of course, we will make it clear that organizations must contribute, like anyone else, under a free license, and must play by the same rules.

For their contributions, organizations will want attribution. Careful observers may note the "partners" link in the left margin of the prototype database. To prevent it from growing too large, the idea is that this section will display links to a handful of recently active partners, perhaps five, at any one time.

Many of these organizations also have professional data to offer. Without going into detail on what constitutes "professional"*, the idea is first to give credit to organizations that contribute data, often data that would require knowledge beyond that of most volunteers, such as unusual languages and dialects. This would probably take the form of a decal with a logo and a link to the organization.

The question this raises is twofold. A logo and link will smack of commercialism to some contributors in the wiki community, but done cautiously, it could be a simple and appropriate concession to organizations who contribute. The other question is a bit trickier, and may require a change to the architecture. By taking responsibility for a particular version of an entry, an organization would give its "stamp of approval" to a particular version of a piece of information. On the one hand, this lends authority and credence to that information. Critics have raised the question of accuracy in a body of data that anyone can edit, a valid concern. The stamp of an organization or of a trusted contributor could serve as a mechanism for review of content. On the other hand, it could intimidate or alienate volunteers, who may feel that an amateur or enthusiast does not have the authority to correct or improve upon a "professional" entry. The perception will persist, despite the fact that many organizations see the distributed editing of WiktionaryZ as a way to improve and expand volumes of data that have grown too large to maintain in other ways.

In my opinion, a healthy WiktionaryZ would include a volunteer community alongside organizations for mutual benefit and greatest volume and accuracy of data. What do you think?

— Dvortygirl

*In a nutshell, "professional" data may be just as incorrect as any other, and Wikipedia has already demonstrated that volunteer efforts can be as conscientious as traditional publications.

Wednesday, January 25, 2006

The relation between [[WiktionaryZ]] and the [[Wiktionaries]] is one of those thorny issues. What is the relation? Do the Wiktionaries have to be ended and merged into WiktionaryZ?

This is a question that comes back time and time again. It is a question that can be answered in many ways and it will probably be answered many times.

The answer from WiktionaryZ is: it is not for us to decide. It is for the Wiktionary communities to decide what they want to do once the content of their project is merged into WiktionaryZ. It is for the board to decide whether to allow new Wiktionary projects once WiktionaryZ has reached its full functionality.

WiktionaryZ will attract people who are currently active in the Wiktionaries. It is each individual person's choice what he or she spends his or her valuable time on. For the WiktionaryZ project it is extremely important that we attract Wiktionarians. Wiktionarians are important because they understand why it makes sense to have words and more in a wiki. They understand how to deal with cranks who decide that a word like [[exicornt]] should exist. They are people interested in languages like Bambara, interested in phonetic descriptions, interested in thesauri. There are so many things the current Wiktionarians are interested in that it will keep us on our toes to make sure that we have interesting content.


Saturday, January 21, 2006

Quality in a dictionary is expensive. This is something a Chinese dictionary found out the hard way. The Xinhua dictionary is facing a suit because one of its readers found more than 3,000 errors. Mr Chen Dingxiang has studied the dictionary since 1998; he is seeking some money for the time that he spent on his research and an apology for what he considers a substandard work.

The sad thing to me is that a person with the inclination to work on dictionaries, who did so much research, did all this work without it leading to improvements in the dictionary. WiktionaryZ will have its 3,000 issues too, but everyone who recognises one will have the EDIT button at the top left-hand side and will be able to make it 2,999 issues.


Thursday, January 19, 2006

In all the Wikimedia Foundation projects there is one big bias. It is not a bias that we want, but it is there nonetheless. The WMF projects are Anglocentric. In many ways this is a good thing; English is effectively the lingua franca of our times. In Uganda, for instance, much if not all university education is given in English; English is the official language of Uganda!

In the WiktionaryZ project we want to have all words of all languages. This is a great idea and it comes with many challenges. In this blog entry I want to focus on Africa.

If WiktionaryZ is to be relevant to Africa, its content has to be relevant. Relevance can be achieved in many ways: because there is nothing else, because we include glossaries and thesauri for specific topics, or because we have a complete dictionary for a language.

The fun thing is that with much content in African languages, we get translations in a Western language, typically English. The consequence is that the African languages are likely to have perpetually less content than many of the Western languages.

One problem with terminological projects is funding. Many projects do not happen or are abandoned before completion because of lack of funding. The reason for this is obvious: it is very expensive to create quality terminological, lexicological or thesaurus information. One benefit that we hope to bring into the equation is the community and consortium effort; we hope and expect that this will bring the cost down.

There will also be a difference in approach; often people do not publish content because it is incomplete. This is against the Wiki idea; here you publish what you have and together you improve the content. When people learn that WiktionaryZ is there, and that it has some relevant content, they will want more relevant content. There are plenty of people in Africa that CAN help.


Tuesday, January 17, 2006

As Gerard has pointed out, Wikidata and WiktionaryZ development is structured so that we can feed as much functionality as possible directly into the software which runs the Wikimedia projects, MediaWiki. The namespace manager, which is going to be part of MediaWiki 1.6 if I can help it, was the first example of this strategy. Namespaces are essential to structuring data -- they allow us to distinguish different types of content in a single wiki installation.

In WiktionaryZ, for example, we have expressions (which are connected to meanings, synonyms, translations, and so on), but we also have collections (allowing us to identify sets of concepts as being part of a single "body of work", such as the GEMET thesaurus), relation types, attributes, and quite importantly, regular wiki pages: there will be portals, policy and help pages in WiktionaryZ, as well as discussion pages. Namespaces allow us to keep it all neat and separate. These namespaces used to be hardcoded, hackish and inflexible. This has now changed.

The namespace manager is, however, not only beneficial to WiktionaryZ. It is also beneficial to other Wikimedia projects, such as Wikibooks, which can use namespaces to structure large, related sets of pages. It is beneficial, in fact, to any MediaWiki user. And as developers of WiktionaryZ, we have the benefit that our architecture is sound, our work is visible, it is integrated and maintained as part of the official MediaWiki codebase, and it is tested and reviewed by many users.

I have now specified a new component milestone that will hopefully be similarly beneficial: Multilingual MediaWiki. Currently, MediaWiki has no awareness of languages beyond the user interface level. We have multilingual wikis, such as Wikimedia Commons and Meta-Wiki. However, in these wikis, all languages share a single namespace; if there are title conflicts across languages, they can only be resolved manually. There is no support for linking languages together or creating translations easily.

You may say: "But I thought Wikipedia exists in over 100 languages?" And you're right - it does. But from a technical standpoint, each of these languages is a separate database. Pages in different languages are linked together using something called "Interlanguage Links", a very ugly hack that requires us to have a list of links to the page in all other languages at the bottom of every single language version. In effect, this means that if you have 10 versions of a page, you need to maintain 10*9 = 90 interlanguage links (each of the 10 pages linking to its 9 corresponding pages in other languages). There is no way to get a list of recent changes from a set of languages either -- MediaWiki knows only about one content language, the one it's set up with.
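To make the arithmetic concrete, here is a minimal Python sketch of the bookkeeping described above. It is not MediaWiki code; the function names and the "code:Title" link format are assumptions chosen for illustration only.

```python
def interlanguage_link_count(n_versions: int) -> int:
    """Links to maintain when every version lists all of its siblings."""
    return n_versions * (n_versions - 1)


def build_interlanguage_links(titles: dict[str, str]) -> dict[str, list[str]]:
    """For each language, list hypothetical 'code:Title' links to every other language.

    `titles` maps a language code to the page title in that language.
    """
    return {
        lang: [f"{other}:{title}" for other, title in titles.items() if other != lang]
        for lang in titles
    }


if __name__ == "__main__":
    # Illustrative titles only; any set of language versions works the same way.
    titles = {"en": "Dictionary", "de": "Wörterbuch", "nl": "Woordenboek"}
    links = build_interlanguage_links(titles)
    print(links["en"])
    # The total number of stored links matches n * (n - 1).
    print(sum(len(v) for v in links.values()), interlanguage_link_count(len(titles)))
```

Every new language version adds one link to each existing page plus a full list on the new page, which is why the maintenance burden grows quadratically.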

Still, things are working reasonably well for Wikimedia, and the database split has its advantages, too. But for truly multilingual projects like Meta, Commons and, incidentally, WiktionaryZ, which rely on the presence of a single database, the status quo is badly broken. This is what these specifications seek to address. It is absolutely essential to solve this problem for WiktionaryZ. You can see why by browsing the page index of the GEMET read-only milestone of WiktionaryZ. Here you will notice that pages like "AIDS" have multiple records in the page title index, because they exist in multiple languages. But MediaWiki knows nothing about that -- complicating the situation especially where you have an expression that has different meanings in different languages.
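The title problem can be illustrated with a small Python sketch. This is not how MediaWiki or WiktionaryZ stores pages; the records and function names below are hypothetical and only show why an index keyed by title alone cannot tell languages apart, while one keyed by language and title can.

```python
from collections import defaultdict

# Hypothetical records: (language code, expression, short gloss).
RECORDS = [
    ("en", "AIDS", "acquired immune deficiency syndrome"),
    ("de", "AIDS", "erworbenes Immunschwächesyndrom"),
    ("en", "die", "to stop living"),
    ("de", "die", "feminine definite article"),
]


def index_by_title(records):
    """Language-unaware index: same-titled entries pile up under one key."""
    index = defaultdict(list)
    for _language, title, gloss in records:
        index[title].append(gloss)
    return dict(index)


def index_by_language_and_title(records):
    """Language-aware index: each (language, title) pair is its own entry."""
    return {(language, title): gloss for language, title, gloss in records}


if __name__ == "__main__":
    # Two unrelated meanings share one title and cannot be told apart.
    print(index_by_title(RECORDS)["die"])
    # With the language in the key, each meaning is addressable.
    print(index_by_language_and_title(RECORDS)[("de", "die")])
```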

We could be hackish and try to support multiple languages only within WiktionaryZ. But this would bite us in the long run. It makes much more sense to make MediaWiki a truly multilingual application -- and this will benefit its hundreds of users world-wide. Imagine that, thanks to these changes, every single installation in the world will only have to upgrade MediaWiki, and be able to start accepting content in multiple languages if they so desire. Browse the list of sites using MediaWiki. How many of them are multilingual today? How many will be a year from now?

Multilingual MediaWiki also offers some "blue sky" potential. One of those blue sky ideas is the management of translations. When you need a translation of a document quickly, you want to notify all people who are able to provide it to you, and manage the assignment. It would not be too hard to add this functionality: Allow users to specify which languages they are willing and able to translate from, and notify them when a document to be translated appears in one of these languages.
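As a rough sketch of that idea, assuming nothing about how it would actually be built into MediaWiki: users register the languages they can translate from, and a lookup returns who should be notified when a document appears in a given language. The class and method names are hypothetical.

```python
from collections import defaultdict


class TranslationBoard:
    """Hypothetical registry of which users translate from which languages."""

    def __init__(self) -> None:
        # Maps a source language code to the set of users who translate from it.
        self._translators_by_source: dict[str, set[str]] = defaultdict(set)

    def register(self, user: str, source_languages: list[str]) -> None:
        """Record the languages a user is willing and able to translate from."""
        for language in source_languages:
            self._translators_by_source[language].add(user)

    def translators_for(self, document_language: str) -> list[str]:
        """Return the users to notify for a new document in this language."""
        return sorted(self._translators_by_source.get(document_language, set()))


if __name__ == "__main__":
    board = TranslationBoard()
    board.register("alice", ["it", "en"])
    board.register("bob", ["en"])
    # A new English document appears: find everyone who offered to translate from English.
    print(board.translators_for("en"))
```

Managing the assignment itself (who accepts the job, and into which target language) would need more state than this sketch holds.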

This example shows how important it is to come up with a sound, scalable architecture for Wikidata and WiktionaryZ: it creates synergies. This is what we're trying to do, and this is why things aren't always moving at a rapid pace. Watch this space: One of the next major documents I will publish is a fairly complete set of specifications for Wikidata's versioning engine, multi-language management and schema management.

Monday, January 16, 2006

Van Dale is the publisher of choice for the Dutch language. They contacted me last year through Kennisnet. Van Dale is interested in exploring what WiktionaryZ will mean to them. They are interested in collaboration and they are interested in business models of commercial organizations that collaborate with the Wikimedia Foundation.

One of the great examples of such collaboration manifests itself in the books that are published in Germany by Zenodot Verlagsgesellschaft in Berlin. Together with the German chapter they produce books under the label WikiPress. If you read German, I do recommend these books.


Sunday, January 15, 2006

Who is going to write the first blog .... that was the question.

Well ... I just thought I could write some notes on WiktionaryZ here.

Almost a year and a half ago there was that Dutch Wiktionarian, GerardM, who posted loads of templates, incomprehensible at the time, on the Italian Wiktionary. Hmmm ... and what does a good Wiktionarian do? He/she asks for clarification. Well, so I learnt that those templates were there just to reduce "workload", and that was something I had been thinking about myself as well (not knowing templates, my thoughts went in completely different directions). And that day, the 30th of August 2004, was the starting point for this very particular project, known by its working name as Ultimate Wiktionary.

WiktionaryZ will provide a unique place for lexicological data, but not only that. It is impossible to describe it in just a few words. Imagine all you can imagine in an online dictionary and you will only be thinking about a small percentage of it.

Gerard and I dreamt that dream for some time and started to work on simplifying things ... then Ultimate Wiktionary definitely became a project, and Gerard had a meeting with Kennisnet, who were ready to pay 5,000 EUR for the programming of UW ... well, Erik Moeller accepted :-) and from that moment on there were three of us.

Now you will ask yourself where we are today ... well, we have a nice name, WiktionaryZ, and we have a subset of the software with the GEMET data online (http://epov.org/wd-gemet/index.php/Main_Page) - it is a read-only version for now, but it is the first "visible" result of Erik's hard work. The database design ... well, I leave this to Gerard ... it is somewhat toooooooo huge to explain :-)

Being only three was not enough to assure what needs to be assured (I like this sentence :-), and so there is that commission, including some more people who will hopefully introduce themselves here and write about their views.

Well ... now it is up to you, dear reader, to ask questions ... and it is up to my colleagues to write ... that's why writing the first entry of a blog sometimes is quite a favourable thingie ... you do not need to tell everything since you "must" leave something to others as well.

Thanks for taking the time!