[Mediawiki-i18n] Fwd: Language detection for Special:Search queries

Federico Leva (Nemo) nemowiki at gmail.com
Wed Jul 20 05:47:27 UTC 2016




-------- Forwarded Message --------
Subject: 	[discovery] Better search results on wiki via TextCat
Date: 	Tue, 19 Jul 2016 19:42:27 -0600
From: 	Deborah Tankersley
To: 	A public mailing list about Wikimedia Search and Discovery projects 
<discovery at lists.wikimedia.org>



We're happy to announce that after numerous tests and analyses[1] and a 
fully operational demo[2], the Discovery Team is ready to release 
TextCat[3] into production on wiki.

What is TextCat? It is a language detection library based on n-grams[4]. 
It detects the language a search query was written in, which allows us 
to look for results on the wiki in that language. During a search, 
TextCat will only kick in when all three of the following conditions are 
met:
      1. fewer than 3 results are returned from the query on the current
         wiki
      2. language detection is successful (meaning that TextCat is
         reasonably certain what language the query is in, and that it is
         different from the language of the current wiki)
      3. the other wiki (in the detected language) has results
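
As a rough illustration only (this is not the actual production code), 
the three conditions above could be sketched in Python as follows. The 
names search_with_fallback, SearchOutcome, search and detect_language 
are hypothetical; the two callables stand in for the real search backend 
and for TextCat, and detect_language is assumed to return None when it 
is not reasonably certain.

```python
from typing import Callable, List, NamedTuple, Optional

MIN_LOCAL_RESULTS = 3  # threshold stated in the announcement


class SearchOutcome(NamedTuple):
    local_results: List[str]
    fallback_lang: Optional[str]  # None when no other wiki was used
    fallback_results: List[str]


def search_with_fallback(
    query: str,
    wiki_lang: str,
    search: Callable[[str, str], List[str]],  # hypothetical: (lang, query) -> titles
    detect_language: Callable[[str], Optional[str]],  # hypothetical: None if unsure
) -> SearchOutcome:
    local = search(wiki_lang, query)

    # Condition 1: fewer than 3 results on the current wiki.
    if len(local) >= MIN_LOCAL_RESULTS:
        return SearchOutcome(local, None, [])

    # Condition 2: detection succeeded and the detected language
    # differs from the language of the current wiki.
    detected = detect_language(query)
    if detected is None or detected == wiki_lang:
        return SearchOutcome(local, None, [])

    # Condition 3: the wiki in the detected language has results.
    other = search(detected, query)
    if not other:
        return SearchOutcome(local, None, [])

    return SearchOutcome(local, detected, other)
```

If any of the three checks fails, the sketch simply returns the results 
from the current wiki unchanged.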

Our analysis of the A/B test[5] (for English, French, Spanish, Italian 
and German Wikipedias) showed that:

     "...The test groups not only had a substantially lower zero results
     rate (57% in control group vs 46% in the two test groups), but they
     had a higher clickthrough rate (44% in the control group vs 49-50%
     in the two test groups), indicating that we may be providing users
     with relevant results that they would not have gotten otherwise."

This update will be scheduled for production release during the week of 
July 25, 2016 on the following Wikipedias:

   * English [6]
   * German [7]
   * Spanish [8]
   * Italian [9]
   * French [10]

TextCat will then be added to this next group of Wikipedias at a later 
date:

   * Portuguese [11]
   * Russian [12]
   * Japanese [13]

This is a huge step forward in creating a search mechanism that is able 
to detect - with a high level of accuracy - the language that was used 
and produce results in that language. Looking ahead, we are also 
investigating a confidence-measuring algorithm[14] to ensure that the 
language detection results are the best they can be.

We will also be doing more[15] A/B tests using TextCat on non-Wikipedia 
sites, such as Wikibooks and Wikivoyage. These new tests will give us 
insight into whether applying the same language detection configuration 
across projects would be helpful.

Please let us know if you have any questions or concerns on the TextCat 
discussion page[16]. For screenshots of what this update will look like, 
see this one[17], showing an existing search on enwiki for the Russian 
query "первым экспериментом", and this one[18], showing what the search 
will look like once TextCat is in production on enwiki.


Thanks!


[1] https://phabricator.wikimedia.org/T118278
[2] https://tools.wmflabs.org/textcatdemo/
[3] https://www.mediawiki.org/wiki/TextCat
[4] https://en.wikipedia.org/wiki/N-gram
[5] 
https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_Test_-_Language_Detection_on_English,_French,_Spanish,_Italian,_and_German_Wikipedias.pdf
[6] https://en.wikipedia.org/
[7] https://de.wikipedia.org/
[8] https://es.wikipedia.org/
[9] https://it.wikipedia.org/
[10] https://fr.wikipedia.org/
[11] https://pt.wikipedia.org/
[12] https://ru.wikipedia.org/
[13] https://ja.wikipedia.org/
[14] https://phabricator.wikimedia.org/T140289
[15] https://phabricator.wikimedia.org/T140292
[16] https://www.mediawiki.org/wiki/Talk:TextCat
[17] https://commons.wikimedia.org/wiki/File:Existing-search_no-textcat.png
[18] https://commons.wikimedia.org/wiki/File:New-search_with-textcat.png

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation


