数据导入

Reads: 2121 Edit

1 读取Excel中的数据

1.1 需要在cmd中安装openpyxl库

pip install openpyxl

1.2 读取第一行包含列名的Excel数据

>>> import pandas as pd
>>> data_excel = pd.read_excel('D:\\Desktop\\EconomicData.xlsx', sheet_name='Sheet2', na_values='n/a')
>>> data_excel
    prov  year      pgdp  eduyear      pfdi      open  area  educost  college
0     安徽  2011  2.563782  8.24831  0.071742  0.328254     2   817.20      116
1     安徽  2012  2.874424  8.51635  0.091082  0.347451     2   991.76      118
2     安徽  2013  3.189051  8.52448  0.109776  0.399829     2  1041.30      117
3     安徽  2014  3.427436  8.72823  0.124615  0.436284     2  1045.78      118
4     安徽  2015  3.581879  8.79599  0.138070  0.430795     2  1157.85      119
..   ...   ...       ...      ...       ...       ...   ...      ...      ...
235   重庆  2014  4.767868  8.95551  0.086924  1.695338     3   698.00       63
236   重庆  2015  5.210346  8.94135  0.077882  1.212252     3   797.10       64
237   重庆  2016  5.820404  9.07319  0.060801  1.129954     3   886.32       65
238   重庆  2017  6.316657  9.23159  0.049620  1.242047     3   948.35       65
239   重庆  2018  6.960110  9.30048  0.069336  1.454839     3  1021.63       65

[240 rows x 9 columns]

1.3 读取第一行不包含列名的Excel数据

如果第一行不是列名,则需要加参数header = None,同时用names指定列名。

>>> data_excel = pd.read_excel('D:\\Desktop\\EconomicData.xlsx', sheet_name='Sheet3', na_values='n/a',header = None,names=['prov','year','pgdp','eduyear','pfdi','open','area','educost','college'])
>>> data_excel
    prov  year      pgdp  eduyear      pfdi      open  area  educost  college
0     安徽  2011  2.563782  8.24831  0.071742  0.328254     2   817.20      116
1     安徽  2012  2.874424  8.51635  0.091082  0.347451     2   991.76      118
2     安徽  2013  3.189051  8.52448  0.109776  0.399829     2  1041.30      117
3     安徽  2014  3.427436  8.72823  0.124615  0.436284     2  1045.78      118
4     安徽  2015  3.581879  8.79599  0.138070  0.430795     2  1157.85      119
..   ...   ...       ...      ...       ...       ...   ...      ...      ...
235   重庆  2014  4.767868  8.95551  0.086924  1.695338     3   698.00       63
236   重庆  2015  5.210346  8.94135  0.077882  1.212252     3   797.10       64
237   重庆  2016  5.820404  9.07319  0.060801  1.129954     3   886.32       65
238   重庆  2017  6.316657  9.23159  0.049620  1.242047     3   948.35       65
239   重庆  2018  6.960110  9.30048  0.069336  1.454839     3  1021.63       65

[240 rows x 9 columns]

1.4 将一列作为行索引

>>> data_excel = pd.read_excel('D:\\Desktop\\EconomicData.xlsx', sheet_name='Sheet2', na_values='n/a',index_col='year')
>>> data_excel
     prov      pgdp  eduyear      pfdi      open  area  educost  college
year                                                                    
2011   安徽  2.563782  8.24831  0.071742  0.328254     2   817.20      116
2012   安徽  2.874424  8.51635  0.091082  0.347451     2   991.76      118
2013   安徽  3.189051  8.52448  0.109776  0.399829     2  1041.30      117
2014   安徽  3.427436  8.72823  0.124615  0.436284     2  1045.78      118
2015   安徽  3.581879  8.79599  0.138070  0.430795     2  1157.85      119
...   ...       ...      ...       ...       ...   ...      ...      ...
2014   重庆  4.767868  8.95551  0.086924  1.695338     3   698.00       63
2015   重庆  5.210346  8.94135  0.077882  1.212252     3   797.10       64
2016   重庆  5.820404  9.07319  0.060801  1.129954     3   886.32       65
2017   重庆  6.316657  9.23159  0.049620  1.242047     3   948.35       65
2018   重庆  6.960110  9.30048  0.069336  1.454839     3  1021.63       65

[240 rows x 8 columns]

1.5 将多列作为行索引,跳过n行,读取n行

>>> data_excel = pd.read_excel('D:\\Desktop\\EconomicData.xlsx', sheet_name='Sheet3', na_values='n/a',header = None,names=['prov','year','pgdp','eduyear','pfdi','open','area','educost','college'],index_col=[0,1],skiprows=8,nrows=8) 
>>> data_excel
                pgdp  eduyear      pfdi      open  area  educost  college
prov year                                                                
北京   2011   8.051090  11.5550  0.225703  4.137171     1   737.38       87
     2012   8.640313  11.8360  0.245325  3.925019     1   871.75       89
     2013   9.362970  12.0280  0.249626  3.852769     1   999.84       89
     2014   9.913939  11.8540  0.258120  4.085807     1  1093.74       89
     2015  10.603360  12.0807  0.372929  3.752679     1  1117.12       91
     2016  11.813310  12.3037  0.398281  3.739240     1  1193.47       91
     2017  12.905950  12.5024  0.756735  3.782837     1  1251.27       92
     2018  15.368120  12.5550  0.531770  3.914133     1  1352.54       92

2 读取CSV中的数据

>>> data_csv = pd.read_csv('D:\\Desktop\\EconomicData.csv',encoding='gbk')
>>> data_csv
    prov  year      pgdp  eduyear      pfdi      open  area  educost  college
0     安徽  2011  2.563782  8.24831  0.071742  0.328254     2   817.20      116
1     安徽  2012  2.874424  8.51635  0.091082  0.347451     2   991.76      118
2     安徽  2013  3.189051  8.52448  0.109776  0.399829     2  1041.30      117
3     安徽  2014  3.427436  8.72823  0.124615  0.436284     2  1045.78      118
4     安徽  2015  3.581879  8.79599  0.138070  0.430795     2  1157.85      119
..   ...   ...       ...      ...       ...       ...   ...      ...      ...
235   重庆  2014  4.767868  8.95551  0.086924  1.695338     3   698.00       63
236   重庆  2015  5.210346  8.94135  0.077882  1.212252     3   797.10       64
237   重庆  2016  5.820404  9.07319  0.060801  1.129954     3   886.32       65
238   重庆  2017  6.316657  9.23159  0.049620  1.242047     3   948.35       65
239   重庆  2018  6.960110  9.30048  0.069336  1.454839     3  1021.63       65

[240 rows x 9 columns]

说明:在读取中文时经常会遇到编码错误,可以尝试加入encoding='gbk'参数!

>>> data_csv = pd.read_csv('D:\\Desktop\\EconomicData.csv',encoding='gbk',header=0,names=['index1','index2','x1','x2','x3','x4','x5','x6','x7'],index_col=[0,1])
>>> data_csv
                     x1       x2        x3        x4  x5       x6   x7
index1 index2                                                         
安徽     2011    2.563782  8.24831  0.071742  0.328254   2   817.20  116
       2012    2.874424  8.51635  0.091082  0.347451   2   991.76  118
       2013    3.189051  8.52448  0.109776  0.399829   2  1041.30  117
       2014    3.427436  8.72823  0.124615  0.436284   2  1045.78  118
       2015    3.581879  8.79599  0.138070  0.430795   2  1157.85  119
...                 ...      ...       ...       ...  ..      ...  ...
重庆     2014    4.767868  8.95551  0.086924  1.695338   3   698.00   63
       2015    5.210346  8.94135  0.077882  1.212252   3   797.10   64
       2016    5.820404  9.07319  0.060801  1.129954   3   886.32   65
       2017    6.316657  9.23159  0.049620  1.242047   3   948.35   65
       2018    6.960110  9.30048  0.069336  1.454839   3  1021.63   65

[240 rows x 7 columns]

3 读取txt数据

>>> pd.read_csv('D:\Desktop\DRUG.txt')
     Age Sex      BP Cholesterol        Na         K   Drug
0     23   F    HIGH        HIGH  0.792535  0.031258  drugY
1     47   M     LOW        HIGH  0.739309  0.056468  drugC
2     47   M     LOW        HIGH  0.697269  0.068944  drugC
3     28   F  NORMAL        HIGH  0.563682  0.072289  drugX
4     61   F     LOW        HIGH  0.559294  0.030998  drugY
..   ...  ..     ...         ...       ...       ...    ...
195   56   F     LOW        HIGH  0.848774  0.073380  drugC
196   16   M     LOW        HIGH  0.743021  0.061886  drugC
197   52   M  NORMAL        HIGH  0.549945  0.055581  drugX
198   23   M  NORMAL      NORMAL  0.784520  0.055959  drugX
199   40   F     LOW      NORMAL  0.683503  0.060226  drugX

[200 rows x 7 columns]

>>> data_txt=pd.read_csv('D:\\Desktop\\news.txt',encoding='gbk')
>>> data_txt.head(5)
                              10万预算的新款小型SUV 这4款很快就登场
0  一条[蜗牛车志]后台最新的昨天留言提问:“我在规划人生的第一台车,预算10万元左右。今年底能...
1    10万元预算,要SUV、要新车、还要有科技感…。你的要求有点多哈。不过没关系,有问题咱们...
2                             10万预算的新款小型SUV 这4款很快就登场
3    说来也巧,就在节目前,我还正好看到一则新车的官方消息,很符合今天的话题,所以不妨作为我们...
4    当然它其实是一台标准的燃油动力车,配备1.5T涡轮增压发动机加上湿式7速双离合双速器的动...


获取案例数据和源代码,请关注微信公众号并回复:Python_dt5


Comments

Make a comment