Saturday, 19 January 2013

Wikipedia Facts


What can we find if we download and generate some statistics about Wikipedia?

1) Overview: On January 2nd, 2013, Wikipedia had 13 057 082 entries (the Encyclopaedia Britannica has 228 274 entries, according to Wikipedia itself). There are almost as many redirections as actual articles:

Wikipedia Articles (blue: articles, green: redirections)
2) Articles: Let's look only at the content of real articles (no redirections). That is more than 7.1 million articles:

The average article belongs to 2.6 categories and contains 2.31 links to pages outside the site and 33.86 links internal to Wikipedia. Thousands of articles contain thousands of links. In total, there are 16 413 888 external links and 240 751 315 internal links. Enough to get lost for a while!

The average size is 4 634 characters (roughly 110 words per article), and the total size of the article text is 32 GB. And this is raw text only; images are not included.
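As a quick sanity check, the averages and totals quoted above can be cross-checked against each other. The snippet below is only a back-of-the-envelope sketch: the variable names are made up for illustration, and it assumes one byte per character and exactly 7.1 million articles, neither of which is stated in the figures themselves.

```python
# Figures quoted in the post.
external_links = 16_413_888
internal_links = 240_751_315
avg_external = 2.31    # external links per article
avg_internal = 33.86   # internal links per article
avg_chars = 4_634      # characters per article

# Dividing each total by its average should give the article count.
articles_from_external = external_links / avg_external
articles_from_internal = internal_links / avg_internal

# Assumption: 7.1 million articles, one byte per character.
approx_total_bytes = avg_chars * 7_100_000

print(f"articles (from external links): {articles_from_external:,.0f}")
print(f"articles (from internal links): {articles_from_internal:,.0f}")
print(f"approximate text size: {approx_total_bytes / 1e9:.1f} GB")
```

Both divisions land slightly above 7.1 million, and the size estimate is in the same range as the 32 GB total, so the numbers are at least consistent with one another.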


3) Content

This is one of the most striking results of all. I searched for certain words within the text of the articles and scored each article by how often each word occurs. The following diagram shows how many articles are defined by a particular word (the word with the most occurrences in the article). This has been shown before, but the results still seem astonishing to me:
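The idea of an article being "defined" by its most frequent word can be sketched roughly as follows. This is not the code used for the analysis: the function name `dominant_word`, the tokenization, and the example word list are all hypothetical, since the post does not say which words were searched for or how the score was computed.

```python
import re
from collections import Counter

def dominant_word(text, words):
    """Return the word from `words` that occurs most often in `text`,
    or None if none of them occurs (hypothetical scoring rule)."""
    # Crude tokenization: lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in words)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Illustrative word list only; the real list is not given in the post.
candidate_words = {"war", "peace", "love"}
print(dominant_word("War and peace; the war ended.", candidate_words))
```

Counting the return values over all articles would then give the number of articles per word.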




Perhaps we should start thinking of how airily we use the term "war".


4) Geography

I can only report on articles last updated by anonymous users, whose edits are recorded under an IP address. Still, for the sake of it, this is how those updates to real articles were distributed among the continents. The sample includes 490 080 articles:




5) Updates: This result surprised me. The following graph shows the year and quarter in which each entry was last updated (redirections in yellow, articles in blue). Apparently, a huge percentage of articles were updated during the last quarter of 2012, which could mean that Wikipedia is very lively and frequently updated. The value seems too high to me, though, so I wonder whether some automated process is updating Wikipedia articles.
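Grouping last-update times by year and quarter could look something like the sketch below. It assumes timestamps in the `2012-11-03T10:15:00Z` form used by the dump's `<timestamp>` elements; the helper name `quarter_of` is made up for illustration.

```python
from collections import Counter
from datetime import datetime

def quarter_of(timestamp):
    """Map an ISO-style UTC timestamp to a 'YYYY-Qn' label."""
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    quarter = (dt.month - 1) // 3 + 1
    return f"{dt.year}-Q{quarter}"

# Illustrative timestamps, not real data from the dump.
stamps = ["2012-11-03T10:15:00Z", "2012-12-30T23:59:59Z", "2011-02-14T08:00:00Z"]
per_quarter = Counter(quarter_of(s) for s in stamps)
print(per_quarter)
```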



Unfortunately, I can't get the "creation date" of articles as the normal Wikipedia dump doesn't include that information.

6) Titles

The average entry title is 26.9 characters long. There are entries starting with every character you can think of: Ɣ, ¢, £, § ... We can also see how entries are distributed across letters. Surprisingly, the digits 1 and 2 have more entries than the letters Q, X, Y or Z, and even more than U and V.
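A distribution of entries by first character can be computed with a simple counter. The sketch below is illustrative only (the function name and the sample titles are invented), and it folds case so that "a" and "A" land in the same bucket.

```python
from collections import Counter

def first_char_distribution(titles):
    """Count titles by their first character, ignoring case."""
    counts = Counter()
    for title in titles:
        if title:  # skip empty titles
            counts[title[0].upper()] += 1
    return counts

# Made-up sample titles.
sample_titles = ["Apple", "apricot", "1984", "£sd", ""]
print(first_char_distribution(sample_titles))
```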


7) Processing

I have done the analysis and charts with CubesViewer OLAP Data Explorer, an open source data explorer that I published a couple of weeks ago.

Processing the Wikipedia export file (a 39.6 GB XML file) took my computer more than 18 hours, although it used only one core and I didn't spend any time optimizing it.

source: 8,95GB 18:35:54 [ 140kB/s]
bzcat: 39,6GB 18:35:57 [ 621kB/s] 
articles: 13,1M 18:35:57 [ 195/s]    
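For readers who want to try something similar, a single-pass, streaming read of the compressed dump might look like the sketch below. This is not the script that produced the numbers above. It assumes the dump marks redirections with a `<redirect>` child element inside `<page>`, and the function names and the file path passed to `count_pages_in_dump` are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace from a tag like '{ns}page'."""
    return tag.rsplit("}", 1)[-1]

def count_pages(stream):
    """Stream through export XML and count (articles, redirections)."""
    articles = redirects = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) == "page":
            # Assumption: redirections carry a <redirect> child element.
            if any(local(child.tag) == "redirect" for child in elem):
                redirects += 1
            else:
                articles += 1
            elem.clear()  # drop the page's children to limit memory use
    return articles, redirects

def count_pages_in_dump(path):
    """Decompress a .bz2 dump on the fly and count its pages."""
    with bz2.open(path, "rb") as stream:
        return count_pages(stream)
```

Calling `count_pages_in_dump` with the path to a local `.xml.bz2` dump returns the two counts. Decompressing and parsing in one pass avoids writing the uncompressed XML to disk, but it is still single-threaded, so it would not by itself address the 18-hour runtime.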
