Environment: Windows 7, Ubuntu 12, RStudio Desktop
Problem: Using RStudio Desktop installed on Windows 7, I read <table> data from a web page with readHTMLTable from the XML package. For example:
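library(XML)
u = 'http://tech.163.com/special/00094IGJ/top1000.html'
# Parse the page, declaring its source encoding as GB2312
url = htmlParse(u, encoding = "GB2312")
# Extract every <table> on the page into a list of data frames
tables = readHTMLTable(url)
# The table of interest is the sixth one
raw = tables[[6]]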
Inspecting raw shows the Chinese text as garbled characters. sessionInfo() reports:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] XML_3.95-0.1

loaded via a namespace (and not attached):
[1] tools_2.15.1
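This appears to be related to the operating system. I tried changing the locale with Sys.setlocale("LC_CTYPE", "UTF-8"), but it fails with the message "OS reports request to set locale to "UTF-8" cannot be honored".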
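Before trying to change the locale, it may help to check what the session is actually using. The following is a minimal sketch that relies only on base R's Sys.getlocale() and l10n_info(); the variable names ctype and info are illustrative.

```r
# Query the character-type locale of the current session.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports whether the locale is multibyte and whether it is UTF-8.
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("Multibyte locale:", info$MBCS, "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")
```

Given the locales listed in the two sessionInfo() outputs, one would expect the UTF-8 flag to be FALSE on the Windows machine (code page 936) and TRUE on Ubuntu (en_US.UTF-8).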
In RStudio on Ubuntu, however, the Chinese text displays correctly. sessionInfo() there reports:
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

loaded via a namespace (and not attached):
[1] tools_2.14.1
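My guess is that the cause lies in how the XML package's encoding handling interacts with the operating system's character encoding. If anyone knows the specific cause, please help explain.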
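One explanation often suggested for this symptom, offered here as an assumption and not something confirmed in this post, is that the parsed strings come back as UTF-8 bytes without being marked as UTF-8, so a session running in code page 936 misreads them. If that is the case, marking the strings with base R's Encoding() might be enough. The sketch below uses a hypothetical helper, mark_utf8, and a small hand-built data frame in place of the real table so that it runs without network access.

```r
# Hypothetical helper: declare every string in a data frame to be UTF-8.
# Assumption: the bytes really are UTF-8 and only the encoding mark is missing.
mark_utf8 <- function(df) {
  for (j in seq_along(df)) {
    col <- df[[j]]
    if (is.factor(col)) {
      # Factors store their text in the levels, so mark those.
      lv <- levels(col)
      Encoding(lv) <- "UTF-8"
      levels(df[[j]]) <- lv
    } else if (is.character(col)) {
      Encoding(col) <- "UTF-8"
      df[[j]] <- col
    }
  }
  # Column names can contain Chinese text as well.
  Encoding(names(df)) <- "UTF-8"
  df
}

# Stand-in for a parsed table: the UTF-8 bytes of a two-character Chinese
# string, built from raw bytes so that no encoding mark is attached.
bytes <- as.raw(c(0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87))
demo <- data.frame(name = rawToChar(bytes), stringsAsFactors = FALSE)

fixed <- mark_utf8(demo)
print(Encoding(demo$name))   # encoding mark before the fix
print(Encoding(fixed$name))  # encoding mark after the fix
```

With the real data this would be applied as raw = mark_utf8(tables[[6]]). If the text is still garbled afterwards, the assumption about unmarked UTF-8 probably does not hold for this page.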