Thursday, November 10, 2011

Mapping Wikipedia's augmentations of our planet

We all know that Wikipedia is an immense project. It is an incredibly impressive coming-together of human labour on a scale that the world rarely sees. Over the last few years, we've also seen a few maps of the encyclopedia (including my own) which have shown that the project is far from complete (whatever that might mean). 

That doesn't mean we should stop mapping the project though, and as part of our efforts to answer the first of our research questions looking at Wikipedia in the Middle East, North Africa, and East Africa, I'll present these global-scale maps of every article in the November 2011 versions of the Arabic, Egyptian Arabic, English, French, Hebrew, Persian, and Swahili Wikipedias. 

First, the English project. This encyclopedia is by far the largest, and currently hosts almost 700,000 geotagged articles (click on the image for a larger and more detailed version):


Each one of these yellow dots represents human effort that has gone into describing some aspect of a place. The density of this layer of information over some parts of the world is astounding. Some of our future posts will look more closely at measures of inequality in Wikipedia, but it is still hard not to be awed by this cloud of information about hundreds of thousands of events and places around the globe. 

The French Wikipedia, with almost a quarter of a million geotagged articles, is much smaller than the English version, but is nonetheless another impressive collection of human labour. Augmentations of information are much denser over some parts of the planet than others, but it remains that there is simply a lot of content about a lot of the world.



However, when looking at some of the smaller Wikipedias like Arabic (and Egyptian Arabic), Hebrew, and Persian, we don't see that same glowing cloud of information over much of the world. 







In the maps above we instead see a limited global focus. This is perhaps not that unexpected given the relatively small size of these encyclopedias (in terms of total numbers of geotagged articles, Arabic has 24,000, Hebrew has 15,000, Persian has 21,000, and Egyptian Arabic has only slightly more than 1,000 [these are all approximate figures]).

But it remains that if your primary free source of information about the world is the Persian or Arabic or Hebrew Wikipedia, then the world inevitably looks very different to you than if you were accessing knowledge through the English Wikipedia. There are far more absences and many parts of the world simply don't exist in the representations that are available to you.

However, some strange patterns on parts of these maps should be pointed out. If you look closely at the Arabic or Persian maps you might see some of them (for instance, look closely at the US). You see a similar sort of unexpected spatial distribution of articles in the map of Swahili Wikipedia below (e.g. why are there so many articles in Turkey?). The answer is simply a few dedicated editors creating stub articles about relatively structured topics such as cities in Turkey (in the Swahili Wikipedia) or every county in the US state of Georgia (in the Arabic Wikipedia).




What is perhaps most interesting about the Arabic, Hebrew, Persian, and Swahili Wikipedias is that it isn't the Global North that vanishes from the map. It is rather other parts of the South that become absent: an observation that seems to imply an entrenchment and a reproduction of the visibility of the already highly visible.

14 comments:

Anonymous said...

So is it possible to work out the bit of sea and land furthest from a place with a wikipedia article?

Stéphane Roche said...

Did you think about the idea of mixing wikipedia and wikimapia data? That might provide other (probably complementary) understanding of certain phenomena, at bigger cartographic scale for instance.

Anonymous said...

Why does Japan have "a tail of articles" going thousands of kilometers to The Pacific?

metasonix said...

Sir:

I am working on an extensive book about Wikipedia, and noticed your
study on geotagged articles in Wikipedia.

This is a deeply flawed method, and only shows locations that were
written up in articles and geotagged.

Because as I've already found, the geographical articles in various
language WPs are often generated with cross-translation bots.

English WP has massive numbers of articles about locations in Poland
as a result of a bot called Kotbot, for example.

Also, maniac WP addict Dr. Blofeld is using a bot to generate bogus
articles about placenames in Turkey.

There are also some Swiss editors using bots to put all municipality
names in Switzerland into English WP.

Such articles are usually "stubs", and rarely ever read by human beings,
much less improved. These methods are used on other language
Wikipedias. No one knows how much of it occurs, because the bot
operators commonly do not tell anyone what they're doing.

Just thought you should know.

Mark Graham said...

@metasonix

I think you are perhaps misunderstanding the point of these maps. Each map has a title which states "Geotagged articles..." I think it is pretty clear then, that we are only mapping geotagged articles :)

You don't appear to have read the text, but I also do expand on the stub issue. As I point out, many of the Arabic articles in the US and the Swahili articles in Turkey are undoubtedly stubs.

@Stéphane I'd love to map Wikimapia! If you have any tips on how to access their database, let me know.

@anonymous - some of these seem to be small islands, some battles, some shipwrecks etc.

Anonymous said...

What tools did you use to plot the coordinates to the map? Can you provide a step-by-step tutorial?

Mark Graham said...

@anonymous - quite a lot of steps - so I'll try to write a full blog post detailing our methods soon. But the main tool for making the map was software called ArcGIS.

Anonymous said...

I prepared a similar visual for the German Wikipedia.

data source is the geotag dump from here:
https://toolserver.org/~dispenser/dumps/

For the plot itself I used the perl-script "plot-latlon".

Result:
https://twitpic.com/7dhwxw (world map)
https://twitpic.com/7dhx1h (germany map)

Greg said...

Hi Mark, very interesting maps. Would you be so kind as to provide me the list of geotagged Wikipedia articles per language? I want to answer some research questions myself, basically by mashing up the data with some population data. Somehow I hope that I do not have to scrape those Wikipedia dumps (again).

Thanks

K/Hill said...

It's possible that geotagging in the US may vary from state to state. I notice an unusually large number of tags in Southern Missouri on the Arabic map. Quite likely a group of university students (likely at Rolla) have made a dedicated effort to put those together. Could that be looked further into?

Johannes said...

For the large Wikipedias, I would be interested in a map that relates these data to population density. More articles can be expected for places where more people live; maybe this boring effect could be factored out.
Doing the same for the density of English speakers (foreign language at least) might be even more interesting, but probably also more difficult.

Mark Graham said...

@anonymous Thanks for that!! I'll explore soon. Looks like a great resource.

@greg We'll be releasing the geo-parser soon. And we'll also try to make some cleaned-up data dumps available. I'll post on this blog when that happens. In the meantime, check out the link in the post above yours. Might be exactly what you are looking for?

@k/hill It definitely varies from state to state. But a lot of these are just stub articles. One of the things we'll be doing is accounting for this. We already have the data processed and I just need to map it.

@Johannes. Factoring in population is actually one of the posts that will be out soon. As you note, language is harder. If you have any ideas on where we can get good data on # of x language speakers by country, I'd be very interested.

Daniel Lox said...

"Why does Japan have "a tail of articles" going thousands of kilometers to The Pacific?"

A logical guess would be that there are WWII battles fought in the waters between Midway and Japan, and that possibly there are also many oceanographic phenomena out there, presumably trenches, species of fish or whale.

Open questions like this are why I find the idea of geomapping a fascinating study.
