
The Collaborative Organization of Knowledge

[Figure: a diagram showing a tiny part of Wikipedia's graph structure]

Wikipedia is an ongoing endeavor to create a free encyclopedia through an open, computer-mediated collaborative effort. How does Wikipedia grow and maintain its coverage? This page contains supporting material relevant to a publication that examines this question.

In that paper, a longitudinal study of Wikipedia's evolution shows that although Wikipedia's scope is increasing, its coverage is not deteriorating. This can be explained by the fact that a reference to a non-existing entry typically leads to the creation of an article for it. Wikipedia's evolution also demonstrates the creation of a large real-world scale-free graph through a combination of incremental growth and preferential attachment.
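The combination of incremental growth and preferential attachment can be illustrated with a toy simulation. The following is a minimal sketch and not the model or code used in the paper: it assumes that each new node adds exactly one link, and that the link's target is picked with probability proportional to the target's current degree. The names `grow_graph` and `degree_counts` are hypothetical.

```python
import random
from collections import Counter


def grow_graph(n_nodes, seed=0):
    """Grow a graph node by node (incremental growth). Each new node links
    to one existing node chosen with probability proportional to its degree
    (preferential attachment)."""
    rng = random.Random(seed)
    edges = [(0, 1)]        # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]      # each node appears once per incident edge
    for new in range(2, n_nodes):
        # Picking uniformly from the endpoint list is a degree-weighted pick.
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


def degree_counts(edges):
    """Map each node to the number of edges touching it."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


if __name__ == "__main__":
    degrees = degree_counts(grow_graph(10_000))
    histogram = Counter(degrees.values())
    # Print how many nodes have each of the five smallest degrees.
    for degree in sorted(histogram)[:5]:
        print(degree, histogram[degree])
```

In such a simulation most nodes end up with very few links while a handful collect many, which is the kind of skewed degree distribution associated with scale-free graphs.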

See also the comments appearing in my corresponding blog entry and an update containing an analysis of two more years (2006-2007).

Source Code

The source code used to process Wikipedia's XML dump is distributed as open source software. It can be downloaded as a compressed tar file (suitable for Unix and similar systems) or as a zip file (suitable for Windows systems).

Processing Results

You can download the processed results by following this link. The file starts with a header giving various attributes of the processed data set.

% Number of bins: 72
% Total revisions: 28247658
% Maximum revisions: 28273 (George W. Bush)
% Maximum reverts: 9218 (George W. Bush)
% Number of moves: 81380
% Total pages: 1898139
% Revisions from IP addresses: 8518913
% Total contributors: 230130
% Maximum different contributors: 2539 (George W. Bush)
% Redirected pages: 631567
% Restricted pages: 2441
% Maximum number of contained references: 17577 (List of all three letter acronyms)
% Pages with at least one revert: 211704
% Total number of reverts across all pages: 1147151
% Total time between reverts: 54524346346
% Moved pages: 80332
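The header can be read mechanically. The following is a minimal sketch, assuming that every attribute occupies a single line of the form `% name: value` and that the header ends at the first line not starting with `%`. The function names `parse_header` and `split_count` are hypothetical and are not part of the distributed source code.

```python
def parse_header(lines):
    """Collect leading '% name: value' lines into a dict of strings."""
    attributes = {}
    for line in lines:
        if not line.startswith("%"):
            break                       # assumed end of the header
        name, _, value = line[1:].partition(":")
        attributes[name.strip()] = value.strip()
    return attributes


def split_count(value):
    """Split a value such as '28273 (George W. Bush)' into a number and
    an optional label; the label is None when no parenthesis follows."""
    number, _, rest = value.partition(" ")
    label = rest.strip()
    if label.startswith("(") and label.endswith(")"):
        label = label[1:-1]
    return int(number), (label or None)


if __name__ == "__main__":
    sample = [
        "% Number of bins: 72",
        "% Maximum revisions: 28273 (George W. Bush)",
        "% Moved pages: 80332",
    ]
    header = parse_header(sample)
    print(split_count(header["Maximum revisions"]))
```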

Next comes one line of data for each one of Wikipedia's entries. Here is an example.

A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:0:0:0:0:0:0:0:
0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:
0:0:0:0:0:0:0:0:0:0:0:1:1:1:2:2:2:2:2:2:2:2:2:2:2:2:E

Each line contains the following fields.

The fields are colon-separated. Colons in the input data are converted to underscores.
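Because colons in the input data are replaced by underscores, a record can be split on `:` without any quoting rules. The following is a minimal sketch: the function name `parse_entry` is hypothetical, and treating a final `E` as an end-of-record marker is an assumption based only on the example above. The fields are returned as strings, without guessing their meaning.

```python
def parse_entry(line):
    """Split one colon-separated record into its fields (as strings)."""
    fields = line.rstrip("\n").split(":")
    if fields[-1] == "E":   # assumed end-of-record marker, as in the example
        fields.pop()
    return fields


if __name__ == "__main__":
    # The example record from this page, rebuilt as a single line.
    sample = (
        "A (musical note):1128386876:Mailer diablo:1130566991:MrD9:10:7:18:"
        + "0:" * 58 + "1:1:1:" + "2:" * 12 + "E"
    )
    fields = parse_entry(sample)
    print(fields[0])        # first field of the record
    print(len(fields))      # number of fields, excluding the trailing marker
```

Numeric fields can then be converted with `int()` as needed.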

Finally come lines summarizing the data set's characteristics for each time period. Here is an example.

2001-07-01 4851 0  27106   15129        13458   531

Each line contains the following fields: period, entries, stubs, references, ref-entries, undefined, and contributors.

The following table lists the corresponding values.

Period      Entries   Stubs  References  Ref-entries  Undefined  Contributors
2001-01-01      191       0          24            8          8            67
2001-02-01      649       0        1129          875        829           150
2001-03-01     1383       0        6079         4341       3997           340
2001-04-01     2061       0        9317         6089       5589           336
2001-05-01     3366       0       15162         9350       8393           591
2001-06-01     3917       0       20004        11786      10571           319
2001-07-01     4851       0       27106        15129      13458           531
2001-08-01     6558       0       41435        20939      18329           854
2001-09-01     9417       0       70291        32115      27638          1328
2001-10-01    13570       0      132881        48013      40050          1643
2001-11-01    16932       0      191689        62177      51183          1526
2001-12-01    19872       0      244351        73851      60045          1425
2002-01-01    21812       0      272510        80798      65538          1073
2002-02-01    34462       0      377197       100305      76209          3564
2002-03-01    37967       0      423217       110167      83442          1591
2002-04-01    40838       0      456976       116741      87709          1377
2002-05-01    43637       0      492485       123454      92313          1334
2002-06-01    48487       0      565621       145637     110676          1804
2002-07-01    53321       0      622380       160447     122053          1910
2002-08-01    61490       0      713408       179330     134491          2623
2002-09-01    75725       0      903439       227666     171068          3238
2002-10-01   116785       0     1387562       266541     177499          4864
2002-11-01   124350       0     1504547       291546     191280          2803
2002-12-01   131174       0     1663516       310134     204556          2715
2003-01-01   140465       2     1819565       338908     225524          3539
2003-02-01   148472       2     1955687       361725     242027          3158
2003-03-01   157765       2     2104886       387001     260217          3310
2003-04-01   167263       2     2253090       412667     278496          3385
2003-05-01   178705       2     2439948       445249     302387          3968
2003-06-01   189266       2     2620334       473535     322668          4110
2003-07-01   204189       2     2795842       503321     341921          4758
2003-08-01   218772       2     3009186       542546     370581          5192
2003-09-01   233611       2     3217035       578985     396015          5251
2003-10-01   246612       2     3418864       616122     423345          5269
2003-11-01   265020       2     3664985       653679     446917          6036
2003-12-01   285492    3457     3952143       697408     475219          6462
2004-01-01   305725    8272     4264850       746770     509092          6747
2004-02-01   333160   16795     4623248       803977     545215          8450
2004-03-01   376374   26851     5120355       883743     596899         10408
2004-04-01   411564   36813     5603280       961421     650038         10321
2004-05-01   447336   44772     6082770      1042873     708159         10821
2004-06-01   484584   55322     6583722      1118082     759101         11735
2004-07-01   553553   64083     7138009      1198936     811517         12333
2004-08-01   593555   73360     7776177      1275662     860261         13403
2004-09-01   635161   82176     8343211      1357139     911179         14362
2004-10-01   679893   91208     8970944      1435349     957193         16233
2004-11-01   730224  100748     9900725      1521155    1006877         18022
2004-12-01   787038  113040    10653563      1628640    1075760         18916
2005-01-01   832525  125671    11308207      1712119    1126042         19034
2005-02-01   873786  136111    11914572      1799221    1183623         19396
2005-03-01   929868  149655    12672702      1900088    1245234         23274
2005-04-01   999950  165580    13554541      2012596    1311514         26556
2005-05-01  1068129  178884    14451378      2109421    1361963         28206
2005-06-01  1154561  195872    15487347      2239533    1432537         29776
2005-07-01  1248213  216818    16692005      2378759    1504212         35191
2005-08-01  1351281  239123    18006518      2558351    1612255         39075
2005-09-01  1433290  260220    19116054      2683004    1678355         38920
2005-10-01  1529265  284788    20345839      2831703    1758413         44158
2005-11-01  1623388  306871    21539613      2978014    1837493         46626
2005-12-01  1720297  338459    23049383      3207092    1993010         62458
2006-01-01  1829517  371762    24594616      3396856    2104518         70794
2006-02-01  1898442  390959    25579208      3512682    2170887         60939
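The per-period summary lines are whitespace-separated, so they can be loaded with a plain split. The following is a minimal sketch: the field names follow the table's column headings, while the `Period` type and the `parse_period` and `undefined_share` helpers are hypothetical. The ratio computed here rests on an assumed reading of the columns (Ref-entries as the number of distinct entries referenced, Undefined as those that have no article yet) and is not necessarily the measure used in the paper.

```python
from typing import NamedTuple


class Period(NamedTuple):
    period: str
    entries: int
    stubs: int
    references: int
    ref_entries: int
    undefined: int
    contributors: int


def parse_period(line):
    """Parse one whitespace-separated summary line into a Period."""
    date, *numbers = line.split()
    return Period(date, *map(int, numbers))


def undefined_share(p):
    """Share of referenced entries that are undefined (assumed reading)."""
    return p.undefined / p.ref_entries


if __name__ == "__main__":
    # The example summary line shown earlier on this page.
    p = parse_period("2001-07-01 4851 0  27106   15129        13458   531")
    print(p.period, p.entries, round(undefined_share(p), 3))
```

Tracking such a ratio across periods is one simple way to see whether undefined references keep pace with the growth in entries.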

Diomidis Spinellis home page



Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-Share Alike 3.0 Greece License.
Last modified: 2008-08-09