Environment: Ubuntu 14.04, Gensim (the script below is Python 2)
The processing script, process_wiki.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Usage: python process_wiki.py <wiki-dump.xml.bz2> <output-text-file>"""

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    # Stream articles straight from the compressed dump; get_texts()
    # yields one article at a time as a list of tokens.
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # One article per output line, tokens separated by spaces
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished saving " + str(i) + " articles")
Download the Chinese and English Wikipedia dumps:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
Method 1:
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
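To sanity-check the result, peek at the first article in the output file; a minimal sketch, assuming the run above produced wiki.zh.text in the current directory:

import io

# Each line of wiki.zh.text is one article: tokens separated by spaces
with io.open('wiki.zh.text', encoding='utf-8') as f:
    first = f.readline()
print('%d tokens in the first article' % len(first.split()))
print(first[:200])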
Method 2:
Wikipedia Extractor is a Wikipedia dump extractor written in Python and is very convenient to use.
wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -cb1000M -o extracted zhwiki-latest-pages-articles.xml.bz2
The -b 1000M option splits the output into files of 1000 MB each (the default is 500K), and -c compresses each output file with bzip2.
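WikiExtractor writes its output as <doc id=... url=... title=...> ... </doc> blocks spread across many files under the output directory (bzip2-compressed when -c is given). Below is a minimal sketch for merging them into a single plain-text file while stripping the doc tags; the extracted/*/wiki_*.bz2 layout is an assumption based on the -o extracted run above:

import bz2
import glob
import io

with io.open('wiki.zh.extracted.txt', 'w', encoding='utf-8') as out:
    # Assumed layout: extracted/AA/wiki_00.bz2, extracted/AA/wiki_01.bz2, ...
    for path in sorted(glob.glob('extracted/*/wiki_*.bz2')):
        with bz2.BZ2File(path) as f:
            for line in f:
                line = line.decode('utf-8')
                # Drop the <doc ...> and </doc> wrapper lines, keep article text
                if line.startswith('<doc') or line.startswith('</doc>'):
                    continue
                out.write(line)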
Convert the Traditional Chinese characters in wiki.zh.text to Simplified Chinese:
sudo apt-get install opencc
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
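Newer OpenCC releases replace the .ini configs with JSON ones (e.g. t2s.json). If you prefer to stay in Python, the opencc pip package (opencc-python-reimplemented) exposes the same converters; a minimal sketch, assuming that package is installed:

import io

from opencc import OpenCC

# 't2s' is the Traditional-to-Simplified converter, the counterpart of zht2zhs.ini
cc = OpenCC('t2s')
with io.open('wiki.zh.text', encoding='utf-8') as src, \
        io.open('wiki.zh.text.jian', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(cc.convert(line))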
Strip characters that are not valid UTF-8:
iconv -c -t UTF-8 < wiki.zh.text.jian > wiki.zh.text.jian.utf-8
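The -c flag makes iconv silently discard byte sequences that are not valid in the target encoding. The same cleanup can be done in Python with errors='ignore'; a minimal sketch:

import io

# errors='ignore' plays the role of iconv -c: invalid byte sequences are dropped
with io.open('wiki.zh.text.jian', encoding='utf-8', errors='ignore') as src, \
        io.open('wiki.zh.text.jian.utf-8', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)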