[Mediawiki-i18n] Please view and comment CAPTCHA images in 154 languages

Federico Leva (Nemo) nemowiki at gmail.com
Tue Apr 1 22:30:42 UTC 2014

Today I made a couple of patches that should address most of the
problems reported, as well as handle RTL languages and a multilingual
blacklist. I'm mostly using some Unicode magic which is quite well
hidden in some obscure libraries; we'll see if it works. :)

In case it's not clear, for now I'm focusing on the *MediaWiki* side of 
the matter; the Wikimedia side, i.e. where to use what and how, is 
something we'll worry about when we actually have this option (or 
others) available in the codebase.

A couple of questions below.

P. Blissenbach, 31/03/2014 17:13:
> captchas having two lines
> of identical text [...] and accept either input.

This would need to be filed as a separate enhancement request.

Shimmin, 31/03/2014 20:02:
> If you actually want the captchas to make any sense in terms of word
> combination and construction, that would be a whole different issue.
> There's inflection, rules on what happens when words are run together
> (spelling changes for one), and so on.

I suppose you're only talking about the morphological side here, right?
The current patch contains a couple of lines to handle hyphenation for
Finnish, because it was originally provided by Nikerabbit, but we're
definitely not going to build a universal grammar of univerbation in a
MediaWiki script. Unless someone comes up with a general solution, I
think we'll drop that part.

If this turns out to be confusing, I'd rather just show the two (or N)
words as separate words. What do you think? This can be done in a
separate patch; once we introduce some other security improvements, I
think the challenge of identifying where one word ends and the next
starts may be redundant.

> Quite a few of the l look like i in this font, which seems problematic.

This is indeed a problem with sans serif fonts but the broad majority 
thinks they are better. We can try to pick clearer fonts but most help 
will come from words being familiar to humans. I may upload more tests 
with this font, though: https://commons.wikimedia.org/wiki/File:AndBasR.pdf

> Should this be "leigh"?

Yes. If incorrect, please edit: https://en.wiktionary.org/?oldid=23059687

> Looks like "neuscanshoil" with a random -y added, a hangover from
> English behaviour?

Same problem as with Malayalam and others; the latest version will avoid
combining single letters with other words.
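
To illustrate the idea (a minimal Python sketch, not the actual patch;
the function name pick_pair and the word list are made up for the
example), single-letter entries can simply be skipped before two words
are picked. Note that len() counts code points, so scripts with
combining marks would probably need grapheme-aware counting instead:

```python
import random

def pick_pair(words, rng=random):
    """Pick two distinct words to combine, ignoring single-letter entries."""
    # Assumption: "single letter" means one code point; this is too naive
    # for scripts where one visible letter spans several code points.
    candidates = [w for w in words if len(w) > 1]
    if len(candidates) < 2:
        raise ValueError("need at least two multi-letter words")
    return rng.sample(candidates, 2)
```

With a list such as ["y", "neu", "scanshoil"], only "neu" and
"scanshoil" can ever be combined.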

> [...]
> (though Aaue is a proper name) [...]
> Perick is also a proper name  [...]

Do others think proper names are a problem? If so, they might be easy
enough to remove, since they're usually tagged as such on Wiktionary.
If not, they add some cheap variety to our dictionaries.

> The form "vaayl" is a rare grammar-induced form of an unusual word

In this case it's again a proper noun, no idea how correct or how 
current: <https://en.wiktionary.org/?oldid=21902154>

> Hard to read, could be "hiu shee" or "niu shee"

It was "hiu": no "niu" in our dictionary. If the latter is a valid word, 
you should add it to Wiktionary and then we can try to figure out 
something to exclude confusable words.
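
One possible way to exclude confusable words, as a rough Python sketch
(nothing like this is in the patch yet; the look-alike table and the
function names are assumptions for illustration): fold letters that are
hard to tell apart into one representative, and drop every word whose
folded form collides with that of a different word.

```python
from collections import defaultdict

# Hypothetical look-alike table: each key is folded to its value.
CONFUSABLE = str.maketrans({"i": "l", "1": "l", "n": "h"})

def skeleton(word):
    """Lowercase the word and fold look-alike letters together."""
    return word.lower().translate(CONFUSABLE)

def drop_confusable(words):
    """Keep only words whose skeleton is shared with no other distinct word."""
    groups = defaultdict(set)
    for w in words:
        groups[skeleton(w)].add(w)
    return [w for w in words if len(groups[skeleton(w)]) == 1]
```

If "niu" were ever added to a dictionary that already contains "hiu",
both would share the skeleton "hlu" and both would be left out.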

Once again, the proposed approach is to rely on a mix of Unicode magic
and a self-healing (wiki) dictionary. Neither is enough alone.

> This one means "arctic castration" (spoiy = castration).  Not obscene,
> but maybe not for everyone?

Well, it could fall under "obscene" for some definition of the word. I'm
now also blacklisting "pejorative" and "offensive" words; those who care
can try and see if their label edits survive on the wiki.
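
For the record, label-based blacklisting boils down to something like
the following Python sketch (the data layout, a list of (word, labels)
pairs, and the names are assumptions for the example, not how the patch
actually reads Wiktionary):

```python
BLACKLISTED_LABELS = {"obscene", "pejorative", "offensive"}

def allowed_words(entries):
    """Return the words that carry none of the blacklisted labels.

    entries: iterable of (word, labels) pairs, labels being any iterable
    of label strings taken from the dictionary entry (hypothetical layout).
    """
    return [word for word, labels in entries
            if not BLACKLISTED_LABELS & set(labels)]
```

So a word stays out only as long as one of those labels stays on its
entry on the wiki.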

