US Military Removes Word Documents from the Web?
On August 25th 2004 the comp.risks forum ran an article I submitted regarding the large number of Microsoft Word documents available on US military sites (sites in the .mil domain) through Google searches (issue 23.50, "U.S. military sites offer a quarter million Microsoft Word documents"). The article documented how such documents could lead to the leakage of confidential data. A week later I set up a script to watch the number of Word documents available through Google searches, to see if and when the military would recognise the threat those documents posed and remove them.
According to the data I gathered, the number of .mil Word documents returned by Google peaked at 1,180,000 on September 20th 2005, and then started gradually declining. Currently there are 941,000 documents online. No such decline was visible on the other domains I monitored, so the change is probably not an artefact of Google's collection or query mechanisms, but rather an organized move by the US military. The following charts illustrate the changes in the number of Word documents available in each group of domains (red), compared to the total number of documents available through all monitored domains (green).
.mil Domain
Other non-country TLDs
Country TLDs
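To put the .mil decline in proportion, here is a minimal Python sketch that computes the relative drop from the peak to the latest sample. It uses only the two figures quoted above. The names `Observation` and `decline_from_peak` are illustrative assumptions and are not taken from the monitoring script, and the label "current" is a placeholder because the date of the latest sample is not stated.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    label: str  # date or free-form label of the sample
    count: int  # number of Word documents Google reported


def decline_from_peak(observations: list[Observation]) -> tuple[Observation, float]:
    """Return the peak observation and the fractional drop from it to the last sample."""
    if not observations:
        raise ValueError("need at least one observation")
    peak = max(observations, key=lambda o: o.count)
    latest = observations[-1]
    return peak, (peak.count - latest.count) / peak.count


# The only two figures quoted in the text; "current" is a placeholder label.
samples = [
    Observation("2005-09-20", 1_180_000),
    Observation("current", 941_000),
]
peak, drop = decline_from_peak(samples)
print(f"peak {peak.count:,} on {peak.label}; decline since then: {drop:.1%}")
```

With these two samples the drop works out to roughly a fifth of the peak count.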
Updates
2005.11.12 Jim Horning correctly noted that .mil might now be excluding robots on more sites.
2005.11.13 George Gousios notes that the large September spike may be Google's answer to Yahoo overtaking them in the number of indexed pages.