Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, or database queries (such as for Wikipedia:Maintenance). All text content is licensed under the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

Where do I get...
The http://download.wikimedia.org/ directory holds the latest SQL dumps for all projects, not just English. For example (others exist; just select the appropriate two-letter language code and the appropriate project):
Some other directories (e.g. simple, nostalgia) exist, with the same structure.

Images and uploaded files

Unlike most article text, images are not necessarily licensed under the GFDL. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from download.wikimedia.org. In conclusion, download these images at your own risk (Legal).

Currently Wikipedia does not allow or provide facilities to download all Images

As of May 17, 2007, Wikipedia disabled or neglected all viable bulk downloads of images including torrent trackers. Therefore, there is no way to download image dumps other than scraping Wikipedia pages or using Wikix, which converts a database dump into a series of scripts to fetch the images.

Dealing with large files

You may run into problems downloading files of unusual size. Some older operating systems, file systems, and web clients have a hard limit of 2GB on file size. If you seem to be hitting this limit, try using wget version 1.10 or greater, cURL version 7.11.1-1 or greater, or a recent version of lynx (using -dump). Users have experienced problems with Mozilla and Firefox, but recent versions are more likely to be fixed.

It is recommended that you check the MD5 sums (provided in a file in the download directory) to make sure your download was complete and accurate. You can check this by running the "md5sum" command on the files you downloaded. Given how large the files are, this may take some time to calculate.
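Where the md5sum command is not available, the same check can be sketched in a few lines of Python. This is only an illustration, not part of the official tooling: the function names md5_of_file and matches_published_sum are hypothetical, and the expected sum is assumed to have been copied by hand from the checksum file in the download directory. The file is read in fixed-size chunks so that a multi-gigabyte dump never has to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    The file is read `chunk_size` bytes at a time, so memory use stays
    constant regardless of how large the dump is.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sum(path, expected_hex):
    """Compare a file's MD5 with a published sum (hypothetical helper).

    `expected_hex` is assumed to be the hex string copied from the
    checksum file; surrounding whitespace and letter case are ignored.
    """
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call such as matches_published_sum("dump.sql.bz2", "<sum from the checksum file>") would then return True only for a complete, uncorrupted download; the file name here is a placeholder.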
Due to the technical details of how files are stored, file sizes may be reported differently on different filesystems, and so are not necessarily reliable. Also, you may have experienced corruption during the download, though this is unlikely. The file size limits for the various file systems are as follows:
Many standard programming libraries and functions may also cause problems when accessing large files. For example, files opened with the standard C function fopen are limited to 2GB on 32-bit systems. This is because file offsets are stored in signed 32-bit integers, which limits them to 2^31 bytes (2GB).

Why not just retrieve data from wikipedia.org at runtime?

Suppose you are building a piece of software that at certain points displays information that came from Wikipedia. If you want your program to display the information in a different way than can be seen in the live version, you'll probably need the wikicode that is used to enter it, instead of the finished HTML.

Also, if you want to get all of the data, you'll probably want to transfer it in the most efficient way possible. The wikipedia.org servers need to do quite a bit of work to convert the wikicode into HTML. That is time-consuming both for you and for the wikipedia.org servers, so simply spidering all pages is not the way to go.

To access any article in XML, one at a time, access: http://en.wikipedia.org/wiki/Special:Export/Title_of_the_article
Read more about this at Special:Export.

To access any article via an RSS feed, one at a time, access: http://www.blinkbits.com/en_wikifeeds_rss/Title_of_the_article
Read more about this at User:Blinklmc.

Please be aware that live mirrors of Wikipedia that are dynamically loaded from the Wikimedia servers are prohibited. Please see Wikipedia:Mirrors and forks.

Please do not use a web crawler

Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia. Our robots.txt blocks many ill-behaved bots.

Sample blocked crawler email
Doing SQL queries on the current database dump

[ Correction Needed? The wikisign page is not active at this time (11/09/2006). ]

You can do SQL queries on the current database dump (as a replacement for the disabled Special:Asksql page). For more information about this service, see de:Benutzer:Filzstift/wikisign.org (in German only).

Dealing with compressed files

Approximate file sizes are given for the compressed dumps; uncompressed they'll be significantly larger.

Some older archives are compressed with gzip, which uses the same DEFLATE compression as PKZIP (the most common Windows format). Newer archives are available in both bzip2 and 7zip compressed formats.

Windows users may not have a bzip2 decompressor on hand; a command-line Windows version of bzip2 is available for free under a BSD license. The LGPL'd GUI file archiver, 7-zip, is also able to open bz2 compressed files, and is available for free. MacOS X ships with the command-line bzip2 tool.

Please note that older versions of bzip2 may not be able to handle files larger than 2GB, so make sure you have the latest version if you experience any problems.

Database schema

SQL schema

See also: mw:Manual:Database layout

The database schema is explained here. The cur tables contain the current revisions of all pages; the old tables contain the prior edit history.

XML schema

The XML schema for each dump is defined at the top of the file. Wikipedia uncompressed XML can be converted to SQL using this tool.

Help parsing dumps for use in scripts
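A script can read a bzip2-compressed XML dump as a stream, without first decompressing it to disk. The following is a minimal sketch under stated assumptions rather than an official parser: it assumes the dump uses page, title and text elements (check the schema declared at the top of your file), and the function names iter_pages and iter_dump_pages are hypothetical. XML namespace prefixes are stripped so that the code does not depend on a particular export schema version.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix: '{uri}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each page element read from `stream`.

    Assumes the element names 'page', 'title' and 'text'. If a page
    holds several revisions, the text of the last one is returned.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Discard the finished page's children to limit memory use.
            elem.clear()


def iter_dump_pages(path):
    """Stream pages from a .bz2 dump, decompressing on the fly."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Iterating with "for title, text in iter_dump_pages(path)" then processes one page at a time. The raw wikicode is returned unchanged; rendering it to HTML is a separate problem.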
Help importing dumps into MySQL

See:

Importing sections of a dump

This section is out of date.

The following Perl script is a parser for extracting the Help sections from the SQL dump:
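# Run with "perl -n": each input line of the dump arrives in $_.
# Strip the leading INSERT statement so that only the value tuples remain.
s/^INSERT INTO cur VALUES //gi;
# Drop the trailing newline on every other input line.
s/\n// if (($j++ % 2) == 0);
# Put each "(...)" record on its own line.
s/(\'\d+\',\'\d+\'\)),(\(\d+,\d+,)/$1\;\n$2/gs;
foreach (split /\n/) {
    # Keep only records in namespace 12 (Help).
    next unless (/^\(\d+,12,\'/);
    # Replace the leading id and namespace with an explicit column list.
    s/^\(\d+,\d+,/INSERT INTO cur \(cur_namespace,cur_title,cur_text,cur_comment,cur_user,
        cur_user_text,cur_timestamp,cur_restrictions,cur_counter,cur_is_redirect,cur_minor_edit,
        cur_is_new,cur_random,cur_touched,inverse_timestamp\) VALUES \(12,/;
    # Remove the line breaks and indentation of the wrapped column list above.
    s/\n\s+//g;
    s/$/\n/;
    print;
}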
NOTE: (as of 2005-05-16) the order of the fields in the cur table has changed: inverse_timestamp now comes BEFORE cur_touched. This may cause Windows users no end of grief, because your MediaWiki suddenly starts producing PHP errors about dates that are negative or occur before 1 January 1970 being passed to the gmdate and gmmktime functions in GlobalFunctions.php. The reason is that the two fields are swapped, so each receives rubbish data. Maybe the Unix versions of these functions are smarter or do not cause PHP to write a warning message into the HTML output, or else people have php.ini configured not to display these. In other words, check that the field order in the script matches the field order in the dump. Better still, the script should be changed to retain whatever field order the dump uses.

You can run the script and get a resulting help.sql file with this command:

bzip2 -dc <Date>_cur_table.sql.bz2 | perl -n <Script Name> > help.sql

The script can be easily modified to acquire any section you need with a few minor changes. Currently, it is set to get all records from namespace 12, the Help namespace. You can change the two 12's to grab a different namespace, or slightly change a couple of regular expressions to get, say, all articles that begin with Q:

next unless (/^\(\d+,\d+,\'[qQ]/);
s/^\(\d+,/INSERT INTO cur \(cur_namespace,cur_title,cur_text,cur_comment,cur_user,
    cur_user_text,cur_timestamp,cur_restrictions,cur_counter,cur_is_redirect,cur_minor_edit,
    cur_is_new,cur_random,cur_touched,inverse_timestamp\) VALUES \(/;

Or you can use a more generic version of this script from User:Msm/extract.pl.

NOTE: While this sounds really straightforward as a way to grab the Help namespace (#12) for use on your newly implemented MediaWiki site, you need more than just that. You also need the Template namespace (#10), since many of the Help: pages rely on templates in some form or another.
Of course, you then also end up with hundreds of templates that are NOT used by the Help: pages. Has anyone got a better idea for a script to do this? Armistej

Static HTML tree dumps for mirroring or CD distribution

MediaWiki 1.5 includes routines to dump a wiki to HTML, rendering the HTML with the same parser used on a live wiki. As the following page states, putting one of these dumps on the web unmodified will constitute a trademark violation. They are intended for private viewing in an intranet or desktop installation.

The static version of Wikipedia created by Wikimedia
See also:
Dynamic HTML generation from a local XML database dump (WikiFilter)

Instead of converting a database dump file to many pieces of static HTML, one can also use a dynamic HTML generator. Browsing a wiki page is then just like browsing a wiki site, but the content is fetched and converted from a local dump file upon request from the browser.

WikiFilter

WikiFilter is a program which allows you to browse over 100 dump files without visiting a Wiki site.

WikiFilter system requirements
How to set up WikiFilter
WikiTaxi

WikiTaxi is a portable offline reader for wikis in MediaWiki format. It enables users to search and browse popular wikis like Wikipedia, Wikiquote, or WikiNews, without being connected to the Internet. WikiTaxi works well with different languages like English, German, Turkish, and others.

WikiTaxi system requirements
WikiTaxi usage
For WikiTaxi reading, only two files are required: WikiTaxi.exe and the .taxi database. Copy them to any storage device (memory stick or memory card) or burn them to a CD or DVD and take your Wikipedia with you wherever you go!

Rsync

This section is out of date.

You can use rsync to download the database. For example, this command will download the current English database:

rsync rsync://download.wikimedia.org/dumps/wikipedia/en/cur_table.sql.bz2 . --partial --progress

The "--partial" switch prevents rsync from deleting the file in the event the download is interrupted. You may then issue the very same command again to resume the download. The "--progress" switch will show the download progress; for less verbose output, do not use this switch.

The rsync utility is designed to synchronize files in a manner such that only the differences between the files are transferred. This provides a considerable performance enhancement, especially when synchronizing large files that have relatively few changes. However, if a file is compressed or encrypted, rsync will not perform well; in fact, it may perform worse than downloading a fresh copy of the file.

Many of the database files are only available compressed. Therefore, there is little, if anything, to be gained by attempting to use rsync as a means of expediting an update of an older SQL dump. If the SQL dumps were available uncompressed, this process should work extremely well, especially if rsync is invoked with the on-the-fly compression switch (-z). It is uncertain whether uncompressed database dumps will become available.

However, rsync does remain a useful and expedient tool for resuming downloads that have been interrupted, repairing downloads that have become corrupted, or updating any files that are not compressed (e.g. upload.tar).

For more information, see rsync.

Technical notes
Outdated content

You may be interested in older en dumps in XML at http://download.wikimedia.org/enwiki/, though using the newest dumps is strongly recommended. These dumps contain:
See also
This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License.