Tuesday, March 19, 2013

Who edits Wikipedia? A map of edits to articles about Egypt



This map displays where edits to English Wikipedia articles about Egypt come from. Details about the method are included at the end of this post, but first a quick discussion of the results.

One of the first things we notice in the map is that there is broad geographic interest in Egypt-related topics. Editors from all over the world have played some part in writing about Egypt. In fact, there are only a handful of countries that have never hosted an editor who wanted to write about Egypt.

However, this isn't to say that there aren't large differences in the origins of those edits. Whilst we might expect most edits about Egypt-related topics to come from Egypt, we see that only 13% of all edits actually originate in the country. The US, in contrast, is the source of 38% of all edits about Egypt (and the UK of 15%).

Because we're only looking at the English version of Wikipedia, the heavy presence of US-based editors is not that surprising, but it does give us important empirical insight into just where geographic knowledge is created.

Methods

The simplicity of the map belies the complexity of data collection that went into it.

First, we collected a list of all geotagged articles in Wikipedia. In other words, we collected a list of every single article about a place, event, or anything else that has a location, and then calculated which country each one of those articles is in. This process has already been described here and some of our results are available here.

Second, for every article, we constructed a list of every editor who made an edit to the article. Some of these editors were logged in and are identifiable by username; others edited anonymously and left behind only an IP address. We wanted to estimate, to the best of our abilities, the rough location (at the national scale) of each of these editors. While we already know where edits come from at the national level, until now we have known little about what those editors are writing about.

The IP addresses were fairly straightforward to geolocate, and we placed 99.68% of them at the national scale (accounting for about 52.5% of the total of 24,087,257 geolocated edits to geotagged articles).
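As a rough illustration of national-scale IP placement, the sketch below looks up an address in a table of network ranges. The table `IP_RANGES`, the helper names, and the country assignments are all hypothetical (the ranges are reserved documentation blocks); the post does not say which geolocation database was used.

```python
import ipaddress

# Hypothetical lookup table: CIDR block -> ISO 3166-1 alpha-2 country code.
# These are reserved documentation ranges; the country assignments are
# invented purely for illustration.
IP_RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "EG"),
    (ipaddress.ip_network("198.51.100.0/24"), "US"),
    (ipaddress.ip_network("203.0.113.0/24"), "GB"),
]


def country_for_ip(ip_string):
    """Return the country code for an IP address, or None if it cannot be placed."""
    try:
        address = ipaddress.ip_address(ip_string)
    except ValueError:
        return None  # not a valid IP address
    for network, country in IP_RANGES:
        if address in network:
            return country
    return None


def placement_rate(ip_strings):
    """Share of addresses that could be placed in some country (0.0 to 1.0)."""
    if not ip_strings:
        return 0.0
    placed = sum(1 for ip in ip_strings if country_for_ip(ip) is not None)
    return placed / len(ip_strings)
```

A real lookup would use a full geolocation database and a faster search than a linear scan; the point here is only the shape of the mapping from address to country.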

The logged-in edits are more challenging to place because user-profiles are mostly unstructured and contain few consistent types of geographic data.

First, we gathered the GEOnet Names Server geographic gazetteer, which included approximately 2.7 million place names, and used the place names signified by geocoded Wikipedia articles to refine that gazetteer further in terms of the number of distinct locations mapped to any place name.

Second, we adopted the following two approaches to extract locations from Wikipedia user pages.

I. We parsed the Wikipedia meta-current dump to extract userboxes that signify user locations and mapped each to a particular ISO code. For example, the userbox {{User Georgia}} signifies that the user comes from Georgia, United States. This allowed us to directly map a user with any such userbox to a particular location with near-perfect accuracy.
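A minimal sketch of the userbox approach, assuming a hand-built table from userbox names to ISO country codes. The table `USERBOX_TO_ISO`, the regular expression, and the function name are illustrative assumptions; only the {{User Georgia}} example comes from the description above.

```python
import re

# Hypothetical mapping from userbox names to ISO 3166-1 alpha-2 codes.
USERBOX_TO_ISO = {
    "User Georgia": "US",  # from the post: Georgia, United States
    "User Egypt": "EG",    # assumed entry, for illustration only
}

# Matches templates such as {{User Georgia}} or {{User Egypt|some parameter}}.
USERBOX_PATTERN = re.compile(r"\{\{\s*(User [^|}]+?)\s*(?:\|[^}]*)?\}\}")


def userbox_countries(wikitext):
    """Return the ISO codes of all recognised location userboxes in a user page."""
    found = []
    for match in USERBOX_PATTERN.finditer(wikitext):
        code = USERBOX_TO_ISO.get(match.group(1))
        if code is not None:
            found.append(code)
    return found
```

Userboxes that are not in the table are simply ignored, so a user page with no recognised location userbox would fall through to the free-text approach.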

II. To parse the user's unstructured text, we started off by generating a list of common preceding and succeeding patterns to locations in sentences. To generate that list, the following steps were carried out:
  • (A) Scan all user pages for matches of any place name in the gazetteer, after removing the location userboxes that were already detected.
  • (B) For each place name match, increment the counts of the preceding and succeeding unigrams, bigrams and trigrams in two separate dictionaries, one for preceding and the other for succeeding statements.
  • (C) Sort the counts for each combination in descending order and keep the frequently occurring predecessors and successors as a filtered-down subset to use when tagging locations, resulting in patterns such as {I|Username} * live in {placename} or Wikipedians in {placename}.
  • (D) Since the common preceding and succeeding statements were manageable in number, manually tag each statement with a location relationship signifier. For example, 'resides in' was tagged as a 'lives in' relation, whereas 'wikipedians from' was tagged as a 'born/from' relation; these were the two relationship types that statements were mapped to.
  • (E) Map every user page containing a place name within a pattern that satisfies the listed predecessors and successors to the location corresponding to that place name.
  • (F) Some place names were ambiguous (they could be mapped to two or more distinct locations), for example Alexandria, Egypt and Alexandria, Virginia. If the place's parent location was mentioned right after it, the disambiguation was straightforward. In other cases, we combined two signals to decide which parent location to assign to a user: a PageRank score computed for each ISO location (by generating a network of place names in the article and connecting them to their ISO location through their subnational locations), and the probability of the parent location given the place name, as inferred from their co-occurrences in all user pages.
From that procedure we were able to assign 122,888 users to countries, which allowed us to geolocate 11,437,436 edits by registered users to geotagged articles, or 33.65% of the total of 33,991,052 registered-user edits to all geotagged articles.
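Steps (A) to (C) of the free-text approach can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code used for the project: it tokenises on whitespace, handles only single-word place names, and the function names are hypothetical.

```python
from collections import Counter


def context_ngrams(tokens, start, end, max_n=3):
    """Return (preceding, succeeding) n-grams around tokens[start:end]."""
    preceding, succeeding = [], []
    for n in range(1, max_n + 1):
        if start - n >= 0:
            preceding.append(" ".join(tokens[start - n:start]))
        if end + n <= len(tokens):
            succeeding.append(" ".join(tokens[end:end + n]))
    return preceding, succeeding


def count_contexts(pages, place_names, max_n=3):
    """Count the n-grams before and after each place-name match.

    Simplification: only single-token place names are matched.
    """
    before, after = Counter(), Counter()
    for text in pages:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token.strip(".,;:!?") in place_names:
                pre, suc = context_ngrams(tokens, i, i + 1, max_n)
                before.update(pre)
                after.update(suc)
    return before, after


pages = ["I live in Cairo and work there", "She does live in Alexandria"]
before, after = count_contexts(pages, {"cairo", "alexandria"})
# Step (C): the most frequent predecessors become candidate patterns.
candidates = [ngram for ngram, count in before.most_common() if count >= 2]
```

With these two toy pages, both "in" and "live in" occur twice before a place name, so they survive the frequency filter; the manual tagging in step (D) would then decide which relation, if any, each candidate signifies.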

So by combining geolocations from IP addresses and from geotagged registered users, we were able to geolocate 51.6% (24,087,257 edits) of all edits to geotagged articles in the English Wikipedia, of which there are 46,681,386 in total.
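The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that every edit is either registered or anonymous, so the anonymous and IP-geolocated counts are derived by subtraction rather than taken from the text; the variable names are illustrative.

```python
# Counts stated in the post.
TOTAL_EDITS = 46_681_386            # all edits to geotagged articles
REGISTERED_EDITS = 33_991_052       # edits by registered users
GEOLOCATED_REGISTERED = 11_437_436  # registered-user edits we could place
GEOLOCATED_TOTAL = 24_087_257       # all edits we could place

# Derived by subtraction (an assumption: each edit is registered or anonymous).
anonymous_edits = TOTAL_EDITS - REGISTERED_EDITS
geolocated_ip = GEOLOCATED_TOTAL - GEOLOCATED_REGISTERED


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


ip_placement_rate = pct(geolocated_ip, anonymous_edits)
registered_placement_rate = pct(GEOLOCATED_REGISTERED, REGISTERED_EDITS)
overall_placement_rate = pct(GEOLOCATED_TOTAL, TOTAL_EDITS)
ip_share_of_geolocated = pct(geolocated_ip, GEOLOCATED_TOTAL)
```

Under that assumption the derived rates agree with the percentages reported above: roughly 99.68% of anonymous edits placed, 33.65% of registered-user edits placed, 51.6% of all edits placed, and IP edits making up about 52.5% of the geolocated total.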

We should stress a few things about the results. Most importantly, we are not publishing any individual user locations, and are instead focusing entirely on aggregate data. We are also aware of the significant limitations of this method. In some ways, we are simply reproducing existing geographic inequalities. Absences in the gazetteer and in Wikipedia's coverage can be reproduced in our tagging method. The method might therefore somewhat underestimate the number of edits coming from places in the world's informational peripheries.

It will be interesting to see how well geographic patterns in the IP edit data correspond with the parsed user data, as the IP data are less likely to suffer from the embedded biases mentioned above.

Over the next few weeks and months, I'll be sifting through our data and publishing some of the more interesting findings on the blog.

Read more from the project:

Graham, M. 2011. Wiki Space: Palimpsests and the Politics of Exclusion. In Critical Point of View: A Wikipedia Reader. Eds. Lovink, G. and Tkacz, N. Amsterdam: Institute of Network Cultures, 269-282.

Graham, M., M. Zook, and A. Boulton. 2012. Augmented Reality in the Urban Environment: contested content and the duplicity of code. Transactions of the Institute of British Geographers. DOI: 10.1111/j.1475-5661.2012.00539.x

Graham, M. 2013. The Knowledge Based Economy and Digital Divisions of Labour. In Companion to Development Studies, 3rd edition, eds V. Desai, and R. Potter. Hodder (in press).
