Friday, September 30, 2011

Mapping Arabic Wikipedia

As part of an IDRC-funded multi-year project to understand local knowledge production on Wikipedia in the Middle East and North Africa, we plan to release our initial results on this blog.

In the project, we ask three key questions about Wikipedia in the region:

1) What is the geography of articles in the Middle East and North Africa, and how does this compare to the rest of the world? (We are also asking similar questions about East Africa, so we occasionally mix data from the two regions, as we do in the maps below.)

2) Do local authors in the region account for a disproportionately small share of the contributions to articles about the region?

3) Are the contributions of local contributors undervalued?

The two maps below are the first in our series. They depict the total number of Arabic articles in Wikipedia throughout the region, and the density of Arabic articles (the number of articles per 1,000 km²).

The data were derived from the Wikimedia Foundation's regular XML dumps of the Arabic, Egyptian Arabic, English, French and Hebrew Wikipedias in March 2011. The article source was analysed for coordinate templates or recognisable coordinate parameters in other templates, such as "Infobox settlement." In cases where this method didn't reveal any coordinates, we then used interwiki links to obtain coordinates from other language versions of the same article. This gave us a much more useful set of points, particularly for the smaller wikis.
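To make the extraction step concrete, here is a minimal sketch in Python. It is not the code we used for the project. The function names, the `pages` and `interwiki` dictionaries, and the choice of `latd`/`longd` as the infobox parameter names are all assumptions made for illustration. It looks for a `{{Coord|...}}` template first, then for infobox-style parameters, and finally falls back to other language versions of the same article.

```python
import re

# A {{Coord|...}} template with no nested templates inside it.
COORD_TEMPLATE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)
# Infobox-style parameters such as "| latd = 30.05" (parameter names assumed).
INFOBOX_PARAM = re.compile(
    r"\|\s*(latd|longd)\s*=\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE
)


def find_raw_coords(wikitext):
    """Return the raw coordinate fields as a list of strings, or None."""
    match = COORD_TEMPLATE.search(wikitext)
    if match:
        return [field.strip() for field in match.group(1).split("|")]
    params = {name.lower(): value for name, value in INFOBOX_PARAM.findall(wikitext)}
    if "latd" in params and "longd" in params:
        return [params["latd"], params["longd"]]
    return None


def coords_with_fallback(key, pages, interwiki):
    """Try the article itself, then its interwiki-linked versions.

    key       -- (language, title) of the article we want coordinates for
    pages     -- hypothetical dict mapping (language, title) to wikitext
    interwiki -- hypothetical dict mapping (language, title) to a list of
                 (language, title) keys for the same article in other wikis
    """
    for candidate in [key] + interwiki.get(key, []):
        raw = find_raw_coords(pages.get(candidate, ""))
        if raw is not None:
            return raw
    return None
```

The sketch returns the raw template fields rather than numbers, because the different templates and language versions write coordinates in different ways.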

Once this was done, all parameter values were converted to a common format.  Our dataset still contained some coordinates that didn't make much sense for us to keep, notably coordinates of features on the moon and other planets, so we then had to make sure all non-Earthly articles were deleted from the dataset.
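The conversion and clean-up could look roughly like the sketch below. Again, this is an illustration under stated assumptions rather than our actual pipeline: it assumes the raw fields arrive as a list of strings, that the common format is signed decimal degrees, and that off-Earth features can be recognised by a `globe:` parameter in the template. The names `to_decimal` and `normalise` are hypothetical.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
GLOBE = re.compile(r"globe:([a-z]+)", re.IGNORECASE)


def to_decimal(parts, hemisphere):
    """Combine degrees, minutes and seconds into signed decimal degrees."""
    value = sum(part / 60 ** i for i, part in enumerate(parts))
    return -value if hemisphere in ("S", "W") else value


def normalise(fields):
    """Return (lat, lon) in decimal degrees, or None if off-Earth or unparseable."""
    nums, lat, lon = [], None, None
    for field in fields:
        token = field.strip()
        globe = GLOBE.search(token)
        if globe and globe.group(1).lower() != "earth":
            return None  # e.g. features on the moon or other planets
        if NUMBER.fullmatch(token):
            nums.append(float(token))
        elif token.upper() in ("N", "S") and nums:
            lat, nums = to_decimal(nums, token.upper()), []
        elif token.upper() in ("E", "W") and nums:
            lon, nums = to_decimal(nums, token.upper()), []
    # Plain decimal form: exactly two numbers and no hemisphere letters.
    if lat is None and lon is None and len(nums) == 2:
        lat, lon = nums
    if lat is None or lon is None:
        return None
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon
```

For example, `normalise(["31", "46", "N", "35", "13", "E"])` and `normalise(["31.78", "35.22"])` both yield a latitude and longitude pair in decimal degrees, while a field list containing `globe:moon` is dropped.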

The maps are then the result of counting the number of articles in each top-level subdivision within our areas of interest.
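As a rough sketch of the counting step, the Python below assigns each point to the first region whose boundary contains it, using a simple ray-casting test. It assumes that each subdivision is a single polygon given as a list of (longitude, latitude) vertices and that treating longitude and latitude as flat x and y is good enough. Real boundaries are more complex, and a GIS package would normally do this job. The function names are hypothetical.

```python
from collections import Counter


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does the edge between vertices j and i straddle the point's latitude?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def count_by_region(points, regions):
    """Count (lat, lon) points per region; regions maps name -> polygon."""
    counts = Counter({name: 0 for name in regions})
    for lat, lon in points:
        for name, polygon in regions.items():
            if point_in_polygon(lon, lat, polygon):
                counts[name] += 1
                break  # assign each article to at most one region
    return counts
```

Points that fall outside every polygon are simply left uncounted.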

When looking at total counts (the top map), you can see that it is Israel/Palestine and parts of the Arabian Peninsula that tend to have the highest counts. However, to get a better sense of the density of layers of information over any given place, it is more useful to look at the number of articles per unit area (here, per 1,000 km²). This is what the second map does.
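The density figure itself is simple arithmetic: the article count divided by the area, scaled to 1,000 km². A small sketch, with a hypothetical function name and made-up numbers in the usage note:

```python
def articles_per_1000_km2(article_count, area_km2):
    """Article density expressed per 1,000 square kilometres."""
    if area_km2 <= 0:
        raise ValueError("area_km2 must be positive")
    return 1000 * article_count / area_km2
```

A purely illustrative region with 50 geotagged articles and an area of 25,000 km² would have a density of 2 articles per 1,000 km².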

Here you see that the densest layers of information in Arabic are again over Israel and Palestine. Much of the Mediterranean coast in Morocco, Tunisia, and Algeria as well as the Nile valley and parts of the UAE also have relatively dense clouds of content about them.

Obviously not all of these places are home to native Arabic speakers, and one of the stories we want to tell in future posts is how the geolinguistic contours of Wikipedia differ over different parts of the region.

We also aim to more closely examine the factors that might explain these uneven geographies of content. Is it internet access? GDP? Education levels? These data will be supplemented by in-depth focus groups that we aim to hold in Egypt and Jordan next year.

These initial mappings provide us with many more questions than answers, but this only means we have much to do over the next few months.

Feel free to comment with any questions or observations.


Anonymous said...

I recently did a similar evaluation for Germany, Austria, Switzerland:

Mark Graham said...

Great work! What method did you use for parsing out the lat/long coordinates? Please feel free to get in touch via email (my address is at:

Jessie said...

This is so interesting! Is this under a free license? Can the images be posted to Wikimedia Commons?

Mark Graham said...

Thanks Jessie. Yes I think so. I'm going to start putting the CC BY-NC-SA license ( on all the project outputs.

Dror Kamir said...

This is so interesting. Thank you for this great work. Could you also produce a map showing the relation between the number of articles and the population density of a certain region? On the Hebrew Wikipedia, for example, you would find an article about nearly every neighborhood of Tel Aviv or Jerusalem, while I doubt if there was an article about any village in the Negev desert area. I think population density plays a role here.
Also, can you add the length of articles or the number of edits in their "history" as a parameter? This would also shed some more light.
Last thing - Do you make a distinction between existing localities and historical ones? For example, on the English-language Wikipedia I saw a long list of articles about 1948 depopulated Arab-Palestinian localities. Some of these villages had fewer than 100 inhabitants when they were depopulated. I am not sure whether editors would bother writing about an existing village of this size; the very fact that it was depopulated is what makes it interesting. So, maybe this is another parameter to take into account.

Mark Graham said...

Thanks Dror! We're actually working on some of those things you suggest. The population density maps will be coming up soon.

The only thing we aren't doing is your suggestion on distinguishing between existing localities and historical ones - as I'm not sure how we'd get the data in a structured way without looking through thousands of places individually. Any ideas?

John Vandenberg said...

The images resulting from this research would be very useful on Wikipedia. So far only one image about English Wikipedia has been uploaded to Wikimedia Commons. While that is useful and appreciated, 'free' versions of the images of smaller wikipedias would be wonderful to use in mashups comparing the wikis. Unfortunately CC BY-NC-SA isn't acceptable on Wikimedia Commons. Could you upload more? Or make a statement somewhere that they are all released under CC BY-SA and we'll upload them for you ;-)

Mark Graham said...

Hi John,

Thanks for asking. This shouldn't be a problem. Could you send me an email to chat about the details? (mark.graham at Thanks...