Data Analysis in Python with Pandas

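Python keeps gaining ground in data science. Its relatively readable style and simple, easy-to-learn syntax have spared many people in the field some of the burden of learning a programming language. Together, Pandas, NumPy, and Matplotlib can handle most simple data analysis and visualization tasks; more complex work may also call for SciPy.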

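Purpose of This Article

In this long post I want to record how I use Pandas for data analysis. There are plenty of introductory Pandas tutorials online, so I won't explain every basic operation in detail; I would rather spend the space on ideas about data analysis and how to implement them.

In the broad sense, data analysis covers the whole pipeline from start to finish: importing data -> cleaning data -> analyzing data -> presenting data. In the narrow sense, it refers only to the analysis step in the middle. This post follows the broad definition and walks through the process step by step.

Let's get started.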


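Main Text

The code below mainly imports packages and configures the environment. Seaborn is another plotting package that can produce better-looking figures than plain Matplotlib; it is built on top of Matplotlib and works well with both NumPy and Pandas. I won't go into more detail here; if you are interested in Seaborn, leave a comment or explore it on your own.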

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

sns.set(rc={"figure.figsize": (10, 6.25)})
sns.set_style("darkgrid")
colors = ["windows blue", "amber", "faded green", "greyish", "dusty purple", "red violet", "marine", "jungle green", "chocolate brown", "dull pink", "reddish orange"]
sns.set_palette(sns.xkcd_palette(colors))


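Importing the Data

The data I use this time is IGN's reviews of games on all kinds of platforms over the past 20 years or so; the source is here.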

reviews = pd.read_csv("ign.csv")

With the data loaded, let's see what it contains.

reviews.head()

Unnamed: 0 score_phrase title url platform score genre editors_choice release_year release_month release_day
0 Amazing LittleBigPlanet PS Vita /games/littlebigplanet-vita/vita-98907 PlayStation Vita 9.0 Platformer Y 2012 9 12
1 Amazing LittleBigPlanet PS Vita – Marvel Super Hero E… /games/littlebigplanet-ps-vita-marvel-super-he… PlayStation Vita 9.0 Platformer Y 2012 9 12
2 Great Splice: Tree of Life /games/splice/ipad-141070 iPad 8.5 Puzzle N 2012 9 12
3 Great NHL 13 /games/nhl-13/xbox-360-128182 Xbox 360 8.5 Sports N 2012 9 11
4 Great NHL 13 /games/nhl-13/ps3-128181 PlayStation 3 8.5 Sports N 2012 9 11

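Here is a brief description of what each column means:

  • score_phrase – the single word IGN uses to rate the game; tied directly to the score;
  • title – the name of the game;
  • url – the address of the full review;
  • platform – the platform the game runs on (PS4, PC, Xbox, etc.);
  • score – the numeric score, from 1.0 to 10.0;
  • genre – the genre of the game;
  • editors_choice – whether the game was an IGN Editors' Choice; related to the score;
  • release_year – the year of release;
  • release_month – the month of release;
  • release_day – the day of release.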
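The claim that score_phrase is tied directly to score can be checked by grouping on the phrase and looking at the range of scores each phrase covers. The sketch below is only an illustration: `sample` is a hypothetical five-row frame whose values are copied from the preview rows above, and on the real data the same `groupby` would be run on `reviews`.

```python
import pandas as pd

# Hypothetical mini-sample; the values are copied from the preview rows above.
sample = pd.DataFrame({
    "score_phrase": ["Amazing", "Amazing", "Great", "Great", "Great"],
    "score": [9.0, 9.0, 8.5, 8.5, 8.5],
})

# For each phrase, report the lowest and highest score it covers and the row count.
score_ranges = sample.groupby("score_phrase")["score"].agg(["min", "max", "count"])
print(score_ranges)
```

If the phrases really are a function of the score, the min/max ranges of different phrases should not overlap.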

Let's see how many records there are in total.

reviews.shape

(18625, 11)

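So this dataset has 18,625 rows and 11 columns.


Cleaning the Data

Freshly imported data usually can't be used as is and needs some cleaning first. This dataset needs very little, because Eric Grinstein, who collected it, has already cleaned it. We still do one small cleanup step here: dropping what we don't need, which in this case is the leftover index column (Unnamed: 0) at the front.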

reviews = reviews.iloc[:, 1:]
reviews.head()
score_phrase title url platform score genre editors_choice release_year release_month release_day
0 Amazing LittleBigPlanet PS Vita /games/littlebigplanet-vita/vita-98907 PlayStation Vita 9.0 Platformer Y 2012 9 12
1 Amazing LittleBigPlanet PS Vita – Marvel Super Hero E… /games/littlebigplanet-ps-vita-marvel-super-he… PlayStation Vita 9.0 Platformer Y 2012 9 12
2 Great Splice: Tree of Life /games/splice/ipad-141070 iPad 8.5 Puzzle N 2012 9 12
3 Great NHL 13 /games/nhl-13/xbox-360-128182 Xbox 360 8.5 Sports N 2012 9 11
4 Great NHL 13 /games/nhl-13/ps3-128181 PlayStation 3 8.5 Sports N 2012 9 11
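Slicing with iloc works, but it depends on the unwanted column being first. As a sketch of two alternatives, the column can be dropped by name, or read_csv can be told to treat the first column as the index. The CSV text below is a hypothetical two-row stand-in for ign.csv, used only so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with the same unnamed leading index column as ign.csv.
csv_text = """,score_phrase,title,score
0,Amazing,LittleBigPlanet PS Vita,9.0
1,Great,NHL 13,8.5
"""

# Option 1: read normally, then drop the leftover column by name.
raw = pd.read_csv(io.StringIO(csv_text))
by_name = raw.drop(columns=["Unnamed: 0"])

# Option 2: tell read_csv that the first column is the index.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(by_name.columns))
print(list(as_index.columns))
```

Both options leave the same data columns; the second avoids creating the extra column in the first place.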

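Analyzing the Data

With cleaning done, the analysis proper begins. Before that, a short digression. I have always loved games. Growing up I owned quite a few platforms: consoles, handhelds, and PCs. There was the Famicom and the Subor learning machine, for which I fought a battle of wits with my parents over every bit of play time; then the Game Boy, which I could play hidden under the covers, though I still had to poke my head out now and then in case my parents walked in and discovered my little secret; and after that a Sega console, which meant wheedling my parents for game time all over again. Once the first PC arrived at home I hardly touched any other platform, except for the PSP later on, the first handheld I picked up in the many years since the Game Boy. I could talk about games, platforms, and their history for three days and nights without running out, which is why I chose this IGN dataset as the source for this analysis. I also want to see how the video game industry has developed and where it has been heading over these 20 years.

Back to the topic. Faced with this much data, my first reaction was: across how many platforms are all these games spread? I have personally used only about 10. So how many platforms does this dataset actually contain?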

all_platforms = reviews["platform"].unique()
all_platforms
array(['PlayStation Vita', 'iPad', 'Xbox 360', 'PlayStation 3',
       'Macintosh', 'PC', 'iPhone', 'Nintendo DS', 'Nintendo 3DS',
       'Android', 'Wii', 'PlayStation 4', 'Wii U', 'Linux',
       'PlayStation Portable', 'PlayStation', 'Nintendo 64', 'Saturn',
       'Lynx', 'Game Boy', 'Game Boy Color', 'NeoGeo Pocket Color',
       'Game.Com', 'Dreamcast', 'Dreamcast VMU', 'WonderSwan', 'Arcade',
       'Nintendo 64DD', 'PlayStation 2', 'WonderSwan Color',
       'Game Boy Advance', 'Xbox', 'GameCube', 'DVD / HD Video Game',
       'Wireless', 'Pocket PC', 'N-Gage', 'NES', 'iPod', 'Genesis',
       'TurboGrafx-16', 'Super NES', 'NeoGeo', 'Master System',
       'Atari 5200', 'TurboGrafx-CD', 'Atari 2600', 'Sega 32X', 'Vectrex',
       'Commodore 64/128', 'Sega CD', 'Nintendo DSi', 'Windows Phone',
       'Web Games', 'Xbox One', 'Windows Surface', 'Ouya',
       'New Nintendo 3DS', 'SteamOS'], dtype=object)

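That is a lot of platforms... Honestly, I had never even heard of many of them, such as Dreamcast, Atari 2600, and Vectrex. The game industry has clearly been quite diverse over these 20 years; the range of platforms alone hints at that.

With the list of platforms in hand, the natural next question is: roughly how many games does each platform have?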

reviews["platform"].value_counts(dropna=False)

PC                      3370
PlayStation 2           1686
Xbox 360                1631
Wii                     1366
PlayStation 3           1356
Nintendo DS             1045
PlayStation              952
Wireless                 910
iPhone                   842
Xbox                     821
PlayStation Portable     633
Game Boy Advance         623
GameCube                 509
Game Boy Color           356
Nintendo 64              302
Dreamcast                286
PlayStation 4            277
Nintendo DSi             254
Nintendo 3DS             225
Xbox One                 208
PlayStation Vita         155
Wii U                    114
iPad                      99
Lynx                      82
Macintosh                 81
Genesis                   58
NES                       49
TurboGrafx-16             40
Android                   39
Super NES                 33
NeoGeo Pocket Color       31
N-Gage                    30
Game Boy                  22
iPod                      17
Sega 32X                  16
Windows Phone             14
Master System             13
Arcade                    11
Linux                     10
NeoGeo                    10
Nintendo 64DD              7
Commodore 64/128           6
Saturn                     6
Atari 2600                 5
WonderSwan                 4
TurboGrafx-CD              3
Game.Com                   3
Atari 5200                 2
New Nintendo 3DS           2
Vectrex                    2
Pocket PC                  1
WonderSwan Color           1
Ouya                       1
Web Games                  1
SteamOS                    1
Dreamcast VMU              1
Windows Surface            1
DVD / HD Video Game        1
Sega CD                    1
Name: platform, dtype: int64
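Raw counts are easier to compare as shares of the whole. The sketch below converts a few of the counts printed above into percentages of the 18,625 rows, and then shows, on a toy column that is not the real data, that `value_counts(normalize=True)` returns such fractions directly.

```python
import pandas as pd

# Counts copied from the output above; 18625 is the row count reported by reviews.shape.
counts = pd.Series({"PC": 3370, "PlayStation 2": 1686, "Xbox 360": 1631, "Dreamcast": 286})
share = (counts / 18625 * 100).round(1)
print(share)

# On a raw column, normalize=True returns fractions directly (toy column, not the real data).
toy_platforms = pd.Series(["PC", "PC", "PC", "Wii"])
toy_share = toy_platforms.value_counts(normalize=True)
print(toy_share)
```

On the real data, `reviews["platform"].value_counts(normalize=True)` would give the share of every platform in one call.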

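By these counts the PC is clearly the biggest contributor. That makes sense: personal computers boomed from the end of the last century, and even though shipments later declined, they have remained an indispensable part of how people study, live, and play, and the vast majority of early personal computers ran Windows. What I did not expect was that Dreamcast has 286 games; clearly I was poorly informed...

Next, let's see which platforms make up the top ten.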

platforms = reviews["platform"].value_counts()[:10].index.tolist()
platforms
['PC',
'PlayStation 2',
'Xbox 360',
'Wii',
'PlayStation 3',
'Nintendo DS',
'PlayStation',
'Wireless',
'iPhone',
'Xbox']

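Now that I know the top ten platforms, let's look at the quality of the games on each. PC has the most games, but that does not necessarily mean it has the largest share of good ones, right?

To assess quality per platform, I first need to extract the games that belong to the top ten platforms from the full dataset. Here I create a filter (a boolean mask) to select by platform.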

fil = reviews["platform"] == platforms[0]   # create a filter
fil

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7         True
8        False
9         True
          ...
18615    False
18616     True
18617    False
18618     True
18619     True
18620    False
18621    False
18622    False
18623    False
18624     True
Name: platform, Length: 18625, dtype: bool
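
# OR the masks for the remaining nine platforms into the filter
for platform in platforms[1:]:
    fil |= reviews["platform"] == platform

# .copy() yields an independent DataFrame, so adding columns to it later is safe
filtered_reviews = reviews[fil].copy()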
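The loop above ORs together one equality mask per platform. As a sketch of a shorter route, `Series.isin` builds the same mask in a single call. `reviews_demo` and `top_platforms` below are hypothetical stand-ins for `reviews` and `platforms`.

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
reviews_demo = pd.DataFrame({
    "title": ["NHL 13", "Splice: Tree of Life", "Guild Wars 2", "LittleBigPlanet PS Vita"],
    "platform": ["Xbox 360", "iPad", "PC", "PlayStation Vita"],
})
top_platforms = ["PC", "Xbox 360"]

# Loop version: OR together one equality mask per platform.
mask_loop = reviews_demo["platform"] == top_platforms[0]
for name in top_platforms[1:]:
    mask_loop |= reviews_demo["platform"] == name

# isin builds the same boolean mask in a single call.
mask_isin = reviews_demo["platform"].isin(top_platforms)

filtered_demo = reviews_demo[mask_isin]
print(filtered_demo)
```

On the real data this would read `reviews[reviews["platform"].isin(platforms)]`.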

Here is all of the extracted data:

filtered_reviews

score_phrase title url platform score genre editors_choice release_year release_month release_day
3 Great NHL 13 /games/nhl-13/xbox-360-128182 Xbox 360 8.5 Sports N 2012 9 11
4 Great NHL 13 /games/nhl-13/ps3-128181 PlayStation 3 8.5 Sports N 2012 9 11
6 Awful Double Dragon: Neon /games/double-dragon-neon/xbox-360-131320 Xbox 360 3.0 Fighting N 2012 9 11
7 Amazing Guild Wars 2 /games/guild-wars-2/pc-896298 PC 9.0 RPG Y 2012 9 11
8 Awful Double Dragon: Neon /games/double-dragon-neon/ps3-131321 PlayStation 3 3.0 Fighting N 2012 9 11
9 Good Total War Battles: Shogun /games/total-war-battles-shogun/pc-142564 PC 7.0 Strategy N 2012 9 11
10 Good Tekken Tag Tournament 2 /games/tekken-tag-tournament-2/ps3-124584 PlayStation 3 7.5 Fighting N 2012 9 11
11 Good Tekken Tag Tournament 2 /games/tekken-tag-tournament-2/xbox-360-124581 Xbox 360 7.5 Fighting N 2012 9 11
12 Good Wild Blood /games/wild-blood/iphone-139363 iPhone 7.0 NaN N 2012 9 10
13 Amazing Mark of the Ninja /games/mark-of-the-ninja-135615/xbox-360-129276 Xbox 360 9.0 Action, Adventure Y 2012 9 7
14 Amazing Mark of the Ninja /games/mark-of-the-ninja-135615/pc-143761 PC 9.0 Action, Adventure Y 2012 9 7
16 Okay Home: A Unique Horror Adventure /games/home-a-unique-horror-adventure/pc-137135 PC 6.5 Adventure N 2012 9 6
17 Great Avengers Initiative /games/avengers-initiative/iphone-141579 iPhone 8.0 Action N 2012 9 5
18 Mediocre Way of the Samurai 4 /games/way-of-the-samurai-4/ps3-23516 PlayStation 3 5.5 Action, Adventure N 2012 9 3
19 Good JoJo’s Bizarre Adventure HD /games/jojos-bizarre-adventure/xbox-360-137717 Xbox 360 7.0 Fighting N 2012 9 3
20 Good JoJo’s Bizarre Adventure HD /games/jojos-bizarre-adventure/ps3-137896 PlayStation 3 7.0 Fighting N 2012 9 3
21 Good Mass Effect 3: Leviathan /games/mass-effect-3-leviathan/xbox-360-138918 Xbox 360 7.5 RPG N 2012 8 31
22 Good Mass Effect 3: Leviathan /games/mass-effect-3-leviathan/ps3-138915 PlayStation 3 7.5 RPG N 2012 8 31
23 Good Mass Effect 3: Leviathan /games/mass-effect-3-leviathan/pc-138919 PC 7.5 RPG N 2012 8 31
24 Amazing Dark Souls (Prepare to Die Edition) /games/dark-souls-prepare-to-die-edition/pc-13… PC 9.0 Action, RPG Y 2012 8 31
25 Good Symphony /games/symphony/pc-136470 PC 7.0 Shooter N 2012 8 30
27 Good Tom Clancy’s Ghost Recon Phantoms /games/tom-clancys-ghost-recon-online/pc-109114 PC 7.5 Shooter N 2012 8 29
28 Great Thirty Flights of Loving /games/thirty-flights-of-loving/pc-138374 PC 8.0 Adventure N 2012 8 29
29 Okay Legasista /games/legasista/ps3-127147 PlayStation 3 6.5 Action, RPG N 2012 8 28
31 Great World of Warcraft: Mists of Pandaria /games/world-of-warcraft-mists-of-pandaria/pc-… PC 8.7 RPG Y 2012 10 4
32 Bad Hell Yeah! Wrath of the Dead Rabbit /games/hell-yeah-wrath-of-the-dead-rabbit/ps3-… PlayStation 3 4.9 Platformer N 2012 10 4
33 Amazing Pokemon White Version 2 /games/pokemon-white-version-2/nds-129228 Nintendo DS 9.6 RPG Y 2012 10 3
34 Good War of the Roses /games/war-of-the-roses-140577/pc-115849 PC 7.3 Action N 2012 10 3
35 Amazing Pokemon Black Version 2 /games/pokemon-black-version-2/nds-129224 Nintendo DS 9.6 RPG Y 2012 10 3
36 Okay Drakerider /games/drakerider/iphone-135745 iPhone 6.5 RPG N 2012 10 3
18546 Great Devil Daggers /games/devil-daggers/pc-20049771 PC 8.5 Shooter N 2016 2 27
18547 Good Superhot /games/superhot/pc-20018899 PC 7.5 Action N 2016 2 25
18549 Good Battleborn /games/battleborn/pc-20021225 PC 7.1 Shooter N 2016 5 6
18554 Good The Park /games/the-park/pc-20042102 PC 7.0 Adventure N 2016 5 4
18555 Great Hitman: Episode 2 /games/hitman-episode-2/pc-20051629 PC 8.5 Shooter N 2016 4 29
18557 Amazing Hearts of Iron IV /games/hearts-of-iron-iv/pc-20012080 PC 9.0 Strategy Y 2016 6 6
18559 Okay Dangerous Golf /games/dangerous-golf/pc-20048436 PC 6.0 Sports, Action N 2016 6 3
18567 Great Offworld Trading Company /games/offworld-trading-company/pc-20018639 PC 8.0 Strategy N 2016 4 28
18568 Okay The Walking Dead: Michonne – Episode 3: What … /games/the-walking-dead-michonne-episode-3/pc-… PC 6.3 Adventure N 2016 4 27
18570 Good Battlefleet Gothic: Armada /games/battlefleet-gothic-armada/pc-20030300 PC 7.1 Strategy N 2016 4 22
18572 Amazing Overwatch /games/overwatch/pc-20027413 PC 9.4 Shooter Y 2016 5 28
18575 Good Fallout 4: Nuka World /games/fallout-4-nuka-world/pc-20054761 PC 7.9 RPG N 2016 8 30
18578 Good Master of Orion /games/master-of-orion-wargaming/pc-20038452 PC 7.1 Strategy N 2016 8 26
18580 Great Quadrilateral Cowboy /games/quadrilateral-cowboy/pc-159788 PC 8.5 Puzzle N 2016 7 28
18581 Great Fallout 4: Vault-Tec Workshop /games/fallout-4-vault-tec-workshop/pc-20054769 PC 8.2 RPG N 2016 7 27
18583 Great Kentucky Route Zero: Act 4 /games/kentucky-route-zero-act-4/pc-20046280 PC 8.5 Adventure N 2016 7 22
18586 Great F1 2016 /games/f1-2016/pc-20054151 PC 8.8 Racing N 2016 8 24
18589 Amazing Deus Ex: Mankind Divided /games/deus-ex-mankind-divided/pc-20013794 PC 9.2 Action, RPG Y 2016 8 19
18595 Bad Ghostbusters /games/ghostbusters-the-movie/pc-20052317 PC 4.4 Action N 2016 7 16
18596 Okay Necropolis /games/necropolis/pc-20030346 PC 6.5 Action, Adventure N 2016 7 14
18598 Okay Furi /games/furi/pc-20044439 PC 6.8 Action N 2016 7 13
18600 Good Hitman: Episode 4 /games/hitman-episode-4/pc-20051637 PC 7.4 Shooter N 2016 8 19
18603 Good Grow Up /games/grow-up/pc-20054824 PC 7.8 Platformer N 2016 8 18
18606 Okay Starcraft II: Nova Covert Ops – Mission Pack 2 /games/starcraft-ii-nova-covert-ops-mission-pa… PC 6.4 Strategy N 2016 8 4
18607 Good Pokemon Go /games/pokemon-go/iphone-20042699 iPhone 7.0 Battle N 2016 7 13
18613 Great XCOM 2: Shen’s Last Gift /games/xcom-2-shens-last-gift/pc-20055520 PC 8.0 Strategy N 2016 7 1
18616 Good Batman: The Telltale Series – Episode 1: Real… /games/batman-the-telltale-series-episode-1-re… PC 7.5 Adventure N 2016 8 2
18618 Amazing Starbound /games/starbound-2016/pc-128879 PC 9.1 Action Y 2016 7 28
18619 Good Human Fall Flat /games/human-fall-flat/pc-20051928 PC 7.9 Puzzle, Action N 2016 7 28
18624 Masterpiece Inside /games/inside-playdead/pc-20055740 PC 10.0 Adventure Y 2016 6 28

13979 rows × 10 columns
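A quick sanity check on the filter: the number of filtered rows should equal the sum of the review counts of the ten largest platforms. The counts below are copied from the value_counts output earlier in the article.

```python
# Review counts of the ten largest platforms, copied from the value_counts output.
top_ten_counts = {
    "PC": 3370, "PlayStation 2": 1686, "Xbox 360": 1631, "Wii": 1366,
    "PlayStation 3": 1356, "Nintendo DS": 1045, "PlayStation": 952,
    "Wireless": 910, "iPhone": 842, "Xbox": 821,
}

expected_rows = sum(top_ten_counts.values())
print(expected_rows)  # 13979, the row count reported for filtered_reviews

share_of_all = expected_rows / 18625  # 18625 is the total from reviews.shape
print(f"{share_of_all:.1%}")  # 75.1%
```

So the top ten platforms account for roughly three quarters of all reviews in the dataset.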


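Presenting the Data

Now that we have the data for the top ten platforms, the question is how to present the quality of the games on each. Comparing the mean score per platform would work, but it is a bit thin. One of the columns is score_phrase, which describes each game's quality in a word and is tied directly to score; using it should make the picture easier to read and analyze.

This can be drawn with bar from matplotlib.pyplot or with countplot from Seaborn; the latter is easier to use.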

sns.countplot(x="platform", hue="score_phrase", data=filtered_reviews, palette=sns.xkcd_palette(colors));

[Figure: number of reviews per platform for the top ten platforms, split by score_phrase]

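As the figure above shows, on the PC the Great and Good bars alone account for more than half of the games. Even so, I can't claim that PC games are a cut above those on other platforms, because we still can't tell what share of each platform's games are excellent. The figure only shows, at a glance, how the scores are distributed on each platform.

So next I will refine both the analysis and the presentation.


Further Analysis and Presentation

The original score_phrase has too many levels, so I decided to regroup them into three: better than Good, worse than Okay, and everything in between. My standard may be strict: to my mind only games scoring 8.0 or higher, that is, those rated above Good, count as excellent, while calling games that score below 6.0, that is, those that don't even reach Okay, poor is hardly unfair.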

all_score_phrases = set(reviews["score_phrase"].unique())
bt_good = {"Great", "Amazing", "Masterpiece"}    # better than Good
average = {"Good", "Okay"}                       # the middle tier
wt_okay = all_score_phrases - bt_good - average  # everything worse than Okay

def category_score_phrase(value):
    """Map an original score_phrase to one of three coarse categories."""
    if value in bt_good:
        return "Better than Good"
    elif value in wt_okay:
        return "Worse than Okay"
    else:
        return "Average"

sizes = filtered_reviews["score_phrase"].apply(category_score_phrase).value_counts()
explode = (0, 0.1, 0)  # pull the second wedge slightly out of the pie
plt.pie(sizes, labels=sizes.index, explode=explode, autopct='%1.2f%%', shadow=True, startangle=90);

[Figure: pie chart of the three score categories across the top ten platforms]

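Here I first use a pie chart to show the overall distribution of game quality across the top ten platforms.

Next, I create a new column called score_phrase_new, to distinguish it from the original score_phrase.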
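The three categories are defined above through score_phrase, but the same split can be sketched directly on the numeric score with `pd.cut`. This assumes that the phrase boundaries sit exactly at 6.0 and 8.0, which the rows shown in this article suggest (6.0 is Okay, 7.9 is Good, 8.0 is Great) but which the article does not state as a rule. The scores below are taken from rows shown earlier.

```python
import pandas as pd

# Scores taken from rows shown earlier in the article.
scores = pd.Series([9.0, 8.0, 7.9, 6.0, 5.5, 3.0])

# Left-closed bins: [-inf, 6.0), [6.0, 8.0), [8.0, inf).
categories = pd.cut(
    scores,
    bins=[float("-inf"), 6.0, 8.0, float("inf")],
    labels=["Worse than Okay", "Average", "Better than Good"],
    right=False,
)
print(categories.tolist())
```

Setting `right=False` is what puts a score of exactly 8.0 into "Better than Good" and exactly 6.0 into "Average".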

filtered_reviews["score_phrase_new"] = filtered_reviews["score_phrase"].apply(category_score_phrase)
filtered_reviews.head()
score_phrase title url platform score genre editors_choice release_year release_month release_day score_phrase_new
3 Great NHL 13 /games/nhl-13/xbox-360-128182 Xbox 360 8.5 Sports N 2012 9 11 Better than Good
4 Great NHL 13 /games/nhl-13/ps3-128181 PlayStation 3 8.5 Sports N 2012 9 11 Better than Good
6 Awful Double Dragon: Neon /games/double-dragon-neon/xbox-360-131320 Xbox 360 3.0 Fighting N 2012 9 11 Worse than Okay
7 Amazing Guild Wars 2 /games/guild-wars-2/pc-896298 PC 9.0 RPG Y 2012 9 11 Better than Good
8 Awful Double Dragon: Neon /games/double-dragon-neon/ps3-131321 PlayStation 3 3.0 Fighting N 2012 9 11 Worse than Okay
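One caveat when adding a column to a filtered DataFrame: depending on the pandas version, assigning to a frame that was produced by boolean indexing may emit a SettingWithCopyWarning. A minimal sketch of the usual way around it is to take an explicit copy first. `reviews_demo` and the `is_top` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
reviews_demo = pd.DataFrame({
    "platform": ["PC", "iPad", "Xbox 360"],
    "score_phrase": ["Amazing", "Great", "Awful"],
})

# Copy the filtered rows so the new column is written to an independent DataFrame.
subset = reviews_demo[reviews_demo["platform"] != "iPad"].copy()
subset["is_top"] = subset["score_phrase"].isin(["Great", "Amazing", "Masterpiece"])

print(subset)
print("is_top" in reviews_demo.columns)  # False: the original frame is untouched
```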

First, let's look at the raw numbers: the count of each score category on each platform.

filtered_reviews.groupby(["platform", "score_phrase_new"]).count()

score_phrase title url score genre editors_choice release_year release_month release_day
platform score_phrase_new
Nintendo DS Average 462 462 462 462 462 462 462 462 462
Better than Good 207 207 207 207 207 207 207 207 207
Worse than Okay 376 376 376 376 375 376 376 376 376
PC Average 1394 1394 1394 1394 1393 1394 1394 1394 1394
Better than Good 1323 1323 1323 1323 1322 1323 1323 1323 1323
Worse than Okay 653 653 653 653 652 653 653 653 653
PlayStation Average 362 362 362 362 362 362 362 362 362
Better than Good 313 313 313 313 313 313 313 313 313
Worse than Okay 277 277 277 277 277 277 277 277 277
PlayStation 2 Average 716 716 716 716 716 716 716 716 716
Better than Good 542 542 542 542 542 542 542 542 542
Worse than Okay 428 428 428 428 426 428 428 428 428
PlayStation 3 Average 516 516 516 516 515 516 516 516 516
Better than Good 569 569 569 569 569 569 569 569 569
Worse than Okay 271 271 271 271 271 271 271 271 271
Wii Average 551 551 551 551 547 551 551 551 551
Better than Good 321 321 321 321 321 321 321 321 321
Worse than Okay 494 494 494 494 494 494 494 494 494
Wireless Average 473 473 473 473 471 473 473 473 473
Better than Good 308 308 308 308 306 308 308 308 308
Worse than Okay 129 129 129 129 129 129 129 129 129
Xbox Average 307 307 307 307 307 307 307 307 307
Better than Good 354 354 354 354 354 354 354 354 354
Worse than Okay 160 160 160 160 160 160 160 160 160
Xbox 360 Average 631 631 631 631 631 631 631 631 631
Better than Good 646 646 646 646 646 646 646 646 646
Worse than Okay 354 354 354 354 354 354 354 354 354
iPhone Average 412 412 412 412 405 412 412 412 412
Better than Good 321 321 321 321 315 321 321 321 321
Worse than Okay 109 109 109 109 108 109 109 109 109

In fact, most of that table is of no use. We need only three columns: platform, score category, and count. So I condense the table into the form below.

count_df = filtered_reviews.groupby(["platform", "score_phrase_new"]).count().reset_index().iloc[:, :3]
count_df.rename(columns={"score_phrase": "count"}, inplace=True)
count_df

platform score_phrase_new count
0 Nintendo DS Average 462
1 Nintendo DS Better than Good 207
2 Nintendo DS Worse than Okay 376
3 PC Average 1394
4 PC Better than Good 1323
5 PC Worse than Okay 653
6 PlayStation Average 362
7 PlayStation Better than Good 313
8 PlayStation Worse than Okay 277
9 PlayStation 2 Average 716
10 PlayStation 2 Better than Good 542
11 PlayStation 2 Worse than Okay 428
12 PlayStation 3 Average 516
13 PlayStation 3 Better than Good 569
14 PlayStation 3 Worse than Okay 271
15 Wii Average 551
16 Wii Better than Good 321
17 Wii Worse than Okay 494
18 Wireless Average 473
19 Wireless Better than Good 308
20 Wireless Worse than Okay 129
21 Xbox Average 307
22 Xbox Better than Good 354
23 Xbox Worse than Okay 160
24 Xbox 360 Average 631
25 Xbox 360 Better than Good 646
26 Xbox 360 Worse than Okay 354
27 iPhone Average 412
28 iPhone Better than Good 321
29 iPhone Worse than Okay 109
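The count() call counts non-null values per column, which is why the genre column in the wider table above is occasionally one or two short of the others. As a sketch of an alternative, `size()` counts rows regardless of missing values and yields the three-column table in one step. `demo` below is a hypothetical frame with a deliberately missing genre.

```python
import pandas as pd

# Hypothetical frame; two genre values are missing on purpose.
demo = pd.DataFrame({
    "platform": ["PC", "PC", "PC", "Wii", "Wii"],
    "score_phrase_new": ["Average", "Average", "Better than Good", "Average", "Worse than Okay"],
    "genre": ["RPG", None, "Shooter", "Sports", None],
})

# size() counts rows per group, independent of missing values in other columns.
count_demo = (
    demo.groupby(["platform", "score_phrase_new"])
    .size()
    .reset_index(name="count")
)
print(count_demo)

# count() on a column with gaps undercounts the group.
genre_counts = demo.groupby(["platform", "score_phrase_new"])["genre"].count()
print(genre_counts)
```

Here the PC/Average group has two rows, but only one non-null genre.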

With the data in hand, it is time to visualize again. This time let's look at what share of each platform's games falls into each score category.

bar_width = 1
bar_left = [i for i in range(len(count_df) // 3)]
tick_pos = [i + (bar_width / 2) for i in bar_left]
totals = [i + j + k for i, j, k in zip(count_df["count"][::3], count_df["count"][1::3], count_df["count"][2::3])]
ave_perc = [i / j * 100 for i, j in zip(count_df["count"][::3], totals)]
bt_good_perc = [i / j * 100 for i, j in zip(count_df["count"][1::3], totals)]
wt_okay_perc = [i / j * 100 for i, j in zip(count_df["count"][2::3], totals)]
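The list comprehensions above rely on count_df holding exactly three rows per platform in a fixed order. As a sketch of a version that does not depend on row order, the table can be pivoted so each platform is one row, then divided by its row total. The counts below are copied from the count_df table for three of the ten platforms.

```python
import pandas as pd

# Counts copied from the count_df table above for three of the ten platforms.
count_demo = pd.DataFrame({
    "platform": ["Nintendo DS"] * 3 + ["PC"] * 3 + ["iPhone"] * 3,
    "score_phrase_new": ["Average", "Better than Good", "Worse than Okay"] * 3,
    "count": [462, 207, 376, 1394, 1323, 653, 412, 321, 109],
})

# One row per platform, one column per score category.
wide = count_demo.pivot(index="platform", columns="score_phrase_new", values="count")

# Divide each row by its total to get percentages.
percent = wide.div(wide.sum(axis=1), axis=0) * 100
print(percent.round(1))
```

For these three platforms the "Better than Good" shares come out at about 19.8% (Nintendo DS), 39.3% (PC), and 38.1% (iPhone).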
f, ax = plt.subplots(1)

# align="edge" starts each bar at bar_left, so tick_pos (bar_left + bar_width / 2) marks its center
ax.bar(bar_left,
       wt_okay_perc,
       label="Worse than Okay",
       alpha=0.9,
       width=bar_width,
       align="edge",
       edgecolor="white")

ax.bar(bar_left,
       ave_perc,
       bottom=wt_okay_perc,
       label="Average",
       alpha=0.9,
       width=bar_width,
       align="edge",
       edgecolor="white")

ax.bar(bar_left,
       bt_good_perc,
       bottom=[i + j for i, j in zip(wt_okay_perc, ave_perc)],
       label="Better than Good",
       alpha=0.9,
       width=bar_width,
       align="edge",
       edgecolor="white")

# unique() keeps the platforms in count_df order; a set does not guarantee any order
plt.xticks(tick_pos, count_df["platform"].unique())
ax.set_ylabel("Percentage")
ax.set_xlabel("")
plt.legend(bbox_to_anchor=(1., 1.))
plt.setp(plt.gca().get_xticklabels(), rotation=45, horizontalalignment="right")
plt.show()

[Figure: stacked percentage bar chart of the three score categories for each of the top ten platforms]
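The stacked chart above is assembled from three separate ax.bar calls. As a sketch of a shorter route, a DataFrame of percentages indexed by platform can be drawn with `DataFrame.plot(kind="bar", stacked=True)`, which takes the tick labels from the index so labels and bars stay aligned. The numbers below are rounded percentages derived from the counts table for two platforms, and the Agg backend is chosen only so the sketch runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
from matplotlib import pyplot as plt

# Rounded percentages derived from the counts table, for two platforms only.
percent = pd.DataFrame(
    {
        "Worse than Okay": [36.0, 19.4],
        "Average": [44.2, 41.4],
        "Better than Good": [19.8, 39.3],
    },
    index=["Nintendo DS", "PC"],
)

# Columns are stacked in order; the index supplies the x-axis labels.
ax = percent.plot(kind="bar", stacked=True, width=1, edgecolor="white", alpha=0.9)
ax.set_ylabel("Percentage")
ax.legend(bbox_to_anchor=(1.0, 1.0))
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment="right")

tick_labels = [t.get_text() for t in ax.get_xticklabels()]
print(tick_labels)
```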

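I did not annotate the exact percentages on the chart, but the picture alone still tells us something. Relatively speaking, PlayStation, PlayStation 2, Wii, PC, and Nintendo DS look very good in it: a high share of high-quality games and a low share of low-quality ones. PlayStation 3 has a small share of low-quality games but not that many high-quality ones either. iPhone and Xbox come off worst, with the two highest shares of low-quality games and the two lowest shares of high-quality ones. The iPhone result is not much of a surprise, since its barrier to entry is far lower than on other platforms: games are often made by three to five people, or even by a single person, which makes it hard for a game to be fun while also delivering on story or other aspects, and post-release maintenance is bound to lag well behind games from big studios. What I regret, and find hard to understand, is that Xbox performs so poorly too.

This reading depends on each x-axis label sitting under the right bar, so it is worth checking against the counts table above. By those counts, the two platforms with the largest share of "Worse than Okay" games and the smallest share of "Better than Good" games are Wii (about 36% and 23%) and Nintendo DS (about 36% and 20%), whereas iPhone (about 13% and 38%) has the smallest share of low-rated games and Xbox (about 19% and 43%) the largest share of high-rated ones.


Summary

That is everything I set out to analyze. It started as a simple idea I had on first seeing the data, which I then tried to express through data analysis for my own understanding. I plan to analyze this dataset further later on, for example by exploring how scores relate to year and to genre. I hope this post serves as a starting point that leaves you with your own ideas about how to begin analyzing a dataset.

If anything in this post is unclear, questions are welcome, as are corrections and suggestions.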


Related Links

  1. Pandas Tutorial: Data analysis with Python: Part 1
  2. Pandas Tutorial: Data analysis with Python: Part 2
  3. Stacked Percentage Bar Plot In MatPlotLib