Environment: Ubuntu 14.04, Gensim
Processing script process_wiki.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print("Usage: python process_wiki.py <wiki-dump.xml.bz2> <output.txt>")
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    space = " "
    i = 0
    output = open(outp, 'w')
    # Stream articles straight from the compressed dump; dictionary={}
    # skips building a vocabulary, which is not needed here.
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # One article per line, tokens separated by single spaces.
        output.write(space.join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished: saved " + str(i) + " articles")
Download the Chinese and English Wikipedia dumps:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
Method 1: run the process_wiki.py script above on the Chinese dump:
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
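Each line of the resulting wiki.zh.text is one article, with tokens separated by spaces. As a quick sanity check, a minimal sketch (file name assumed to match the command above) that prints the token count and opening tokens of the first article:

import io

# Peek at the first article of the extracted corpus: its token count
# and first few tokens. Purely a sanity check on wiki.zh.text.
with io.open('wiki.zh.text', encoding='utf-8') as f:
    first = f.readline().split()
print(len(first), first[:10])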
Method 2:
Wikipedia Extractor is a Wikipedia extraction tool written in Python and is very convenient to use.
wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -cb1000M -o extracted zhwiki-latest-pages-articles.xml.bz2
The -b 1000M option splits the output into files of 1000 MB each (the default is 500K); the -c in -cb1000M additionally compresses each output file with bzip2.
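WikiExtractor writes <doc id="..." url="..." title="..."> ... </doc> blocks spread across many part files under extracted/. A minimal sketch (paths assumed; run WikiExtractor without -c here, otherwise the parts are bzip2 files and would need bz2.open()) that merges them into a single article-per-line file comparable to Method 1's output:

import io
import os

# Walk the WikiExtractor output tree (e.g. extracted/AA/wiki_00, ...),
# strip the <doc ...> / </doc> markers, and write one article per line.
with io.open('wiki.zh.extracted.text', 'w', encoding='utf-8') as out:
    for root, _, files in os.walk('extracted'):
        for name in sorted(files):
            with io.open(os.path.join(root, name), encoding='utf-8') as part:
                article = []
                for line in part:
                    line = line.strip()
                    if line.startswith('<doc'):
                        article = []          # a new article begins
                    elif line == '</doc>':
                        out.write(' '.join(article) + '\n')
                    elif line:
                        article.append(line)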
Convert the Traditional Chinese characters in wiki.zh.text to Simplified Chinese:
sudo apt-get install opencc
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
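If the command-line tool is not available, the conversion can also be done from Python. A minimal sketch assuming the opencc Python package, whose OpenCC('t2s') converter is the counterpart of the zht2zhs.ini config used above (some package versions spell the config 't2s.json'):

import io

from opencc import OpenCC  # assumes the `opencc` Python package is installed

cc = OpenCC('t2s')  # Traditional -> Simplified

with io.open('wiki.zh.text', encoding='utf-8') as fin, \
        io.open('wiki.zh.text.jian', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))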
Remove characters that are not valid UTF-8:
iconv -c -t UTF-8 < wiki.zh.text.jian > wiki.zh.text.jian.utf-8
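The same cleanup can be done in Python for reference; errors='ignore' mirrors iconv's -c flag, which silently drops bytes that do not decode:

import io

# Decode leniently, dropping any byte sequences that are not valid UTF-8,
# then write the cleaned text back out (equivalent to `iconv -c -t UTF-8`).
with open('wiki.zh.text.jian', 'rb') as fin, \
        io.open('wiki.zh.text.jian.utf-8', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(line.decode('utf-8', 'ignore'))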