
How to build a daily data report with Python

Published: 2023-06-03 10:34:18

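⑴ How to do data analysis with Python

To do data analysis, you first need to know which analysis methods exist; Python is then the way to call those methods.
Python has many libraries for data analysis, such as pandas and sklearn (scikit-learn).
So start by installing the Anaconda distribution, which bundles almost all of the Python data analysis tools,
and then learn how to do the analysis itself.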
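
A quick way to confirm that an installation is usable is to check that the libraries can be found. The snippet below is a minimal sketch, assuming the import names numpy, pandas, scipy and sklearn (the import name of scikit-learn); it only reports what is present and installs nothing.

```python
import importlib.util

# Import names of the libraries used in this article
# (sklearn is the import name of scikit-learn).
libraries = ["numpy", "pandas", "scipy", "sklearn"]

# find_spec returns None when a top-level package cannot be found.
status = {name: importlib.util.find_spec(name) is not None for name in libraries}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```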

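⑵ How to write a data analysis tool in Python

Importing data
Importing a local or web-hosted CSV file;
Data transformation;
Descriptive statistics;
Hypothesis testing
One-sample t-test;
Visualization;
Creating custom functions.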

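Importing data

This is a crucial step: before any analysis can happen, we first need to import the data. Data usually comes in CSV format, and when it does not, it can at least be converted to CSV. In Python, we do the following: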

Python

import pandas as pd

# Reading data locally
df = pd.read_csv('/Users/al-ahmadgaidasaad/Documents/d.csv')

# Reading data from web
data_url = "t/Analysis-with-Programming/master/2014/Python/Numerical-Descriptions-of-the-Data/data.csv"
df = pd.read_csv(data_url)

To read a local CSV file, we need the corresponding module of the pandas data analysis library. Its read_csv function can read both local and web data.
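
The import step can be tried without the original file. The sketch below is only an illustration: it reads CSV text from an in-memory buffer, using three rows copied from the data shown in this article, and the names csv_text and sample are made up here. A local path or a URL can be passed to read_csv in place of the buffer.

```python
import io

import pandas as pd

# Illustrative CSV text; column names and values follow the article's data.
csv_text = """Abra,Apayao,Benguet,Ifugao,Kalinga
1243,2934,148,3300,10553
4158,9235,4287,8063,35257
1787,1922,1955,1074,4544
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```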

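Data transformation

Now that the data is in the workspace, the next step is transformation. Statisticians and scientists usually use this step to remove data that the analysis does not need. Let's first look at the data: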

Python

# Head of the data
print(df.head())

# OUTPUT
    Abra  Apayao  Benguet  Ifugao  Kalinga
0   1243    2934      148    3300    10553
1   4158    9235     4287    8063    35257
2   1787    1922     1955    1074     4544
3  17152   14501     3536   19607    31687
4   1266    2385     2530    3315     8520

# Tail of the data
print(df.tail())

# OUTPUT
     Abra  Apayao  Benguet  Ifugao  Kalinga
74   2505   20878     3519   19737    16513
75  60303   40065     7062   19422    61808
76   6311    6756     3561   15910    23349
77  13345   38902     2583   11096    68663
78   2623   18264     3745   16787    16900

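For R programmers, the operations above are equivalent to printing the first rows of the data with print(head(df)) and the last rows with print(tail(df)). Note that Python prints 5 rows by default, whereas R prints 6. So the R code head(df, n = 10) becomes df.head(n = 10) in Python, and the same applies to printing the tail of the data.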
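
The row counts can be checked on a small frame. This is a minimal sketch; the 20-row frame demo is hypothetical and exists only to show how many rows head and tail return.

```python
import pandas as pd

# Hypothetical 20-row frame, used only to count the rows that come back.
demo = pd.DataFrame({"x": range(20), "y": range(100, 120)})

first_default = demo.head()     # 5 rows by default
first_ten = demo.head(n=10)     # counterpart of head(df, n = 10) in R
last_three = demo.tail(3)       # last 3 rows

print(len(first_default), len(first_ten), len(last_three))
print(last_three.index.tolist())
```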

In R, column and row names are extracted with colnames and rownames respectively. In Python, we use the columns and index attributes instead, as follows:

Python

# Extracting column names

print(df.columns)

# OUTPUT

Index([u'Abra', u'Apayao', u'Benguet', u'Ifugao', u'Kalinga'], dtype='object')

# Extracting row names or the index

print(df.index)

# OUTPUT

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78], dtype='int64')

The data is transposed with the T attribute:

Python

# Transpose data
print(df.T)

# OUTPUT
            0     1     2      3     4     5     6      7     8      9
Abra     1243  4158  1787  17152  1266  5576   927  21540  1039   5424
Apayao   2934  9235  1922  14501  2385  7452  1099  17038  1382  10588
Benguet   148  4287  1955   3536  2530   771  2796   2463  2592   1064
Ifugao   3300

         ...     69     70     71     72     73     74     75     76     77
Abra     ...  12763   2470  59094   6209  13316   2505  60303   6311  13345
Apayao   ...  37625  19532  35126   6335  38613  20878  40065   6756  38902
Benguet  ...   2354   4045   5987   3530   2585   3519   7062   3561   2583
Ifugao   ...   9838  17125  18940  15560   7746  19737  19422  15910  11096
Kalinga  ...

            78
Abra      2623
Apayao   18264
Benguet   3745
Ifugao   16787
Kalinga  16900

Other transformations, such as sorting, are available as DataFrame methods (sort in the pandas version the article was written for, sort_values in current releases). Now let's extract a specific column. The article originally offered the iloc and ix attributes for this and preferred ix as the more robust of the two; ix has since been removed from pandas (as of version 1.0), so the code here uses iloc, which selects by position. Assuming we want the first 5 rows of the first column of the data, we have:

Python

print(df.iloc[:, 0].head())

# OUTPUT

0     1243
1     4158
2     1787
3    17152
4     1266
Name: Abra, dtype: int64

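By the way, Python indexing starts at 0, not 1. To take the first 3 columns for the rows labeled 10 through 20 (that is, the 11th to the 21st rows), we have:

Python

# .ix was removed in pandas 1.0; .loc slices rows by label, both ends included
print(df.loc[10:20, df.columns[0:3]])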

# OUTPUT

     Abra  Apayao  Benguet
10    981    1311     2560
11  27366   15093     3039
12   1100    1701     2382
13   7212   11001     1088
14   1048    1427     2847
15  25679   15661     2942
16   1055    2191     2119
17   5437    6461      734
18   1029    1183     2302
19  23710   12222     2598
20   1091    2343     2654

The command above is equivalent to df.loc[10:20, ['Abra', 'Apayao', 'Benguet']].
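
Label-based and position-based selection differ in how they treat the end of a slice, which is easy to get wrong. The sketch below uses a hypothetical 30-row frame named demo (only the column names are borrowed from the article) to show that loc includes the end label while iloc excludes the end position.

```python
import pandas as pd

# Hypothetical frame with a default 0..29 integer index.
demo = pd.DataFrame(
    {
        "Abra": range(30),
        "Apayao": range(30, 60),
        "Benguet": range(60, 90),
        "Ifugao": range(90, 120),
    }
)

# loc slices by label and includes both ends: labels 10, 11, ..., 20.
by_label = demo.loc[10:20, ["Abra", "Apayao", "Benguet"]]

# iloc slices by position and excludes the end: positions 10, ..., 20.
by_position = demo.iloc[10:21, 0:3]

print(by_label.shape)
print(by_label.equals(by_position))
```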

To drop columns from the data, here column 1 (Apayao) and column 2 (Benguet), we use the drop method, as follows:

Python

print(df.drop(df.columns[[1, 2]], axis=1).head())

# OUTPUT

    Abra  Ifugao  Kalinga
0   1243    3300    10553
1   4158    8063    35257
2   1787    1074     4544
3  17152   19607    31687
4   1266    3315     8520

The axis parameter tells the function whether to drop columns or rows: axis=1 drops columns, and axis=0 drops rows.
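
The effect of axis can be seen on a tiny frame. This is a minimal sketch with a made-up 3x3 frame named demo; note that drop returns a new frame and leaves the original unchanged.

```python
import pandas as pd

# Made-up frame with rows labeled 0, 1, 2 and columns a, b, c.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_cols = demo.drop(demo.columns[[1, 2]], axis=1)  # drops columns b and c
without_rows = demo.drop([0, 1], axis=0)                # drops rows labeled 0 and 1
same_as_cols = demo.drop(columns=["b", "c"])            # keyword form of axis=1

print(without_cols.columns.tolist())
print(without_rows.index.tolist())
print(demo.shape)  # the original frame is not modified
```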

Descriptive statistics

The next step is to describe the statistical properties of the data with the describe method:

Python

print(df.describe())

# OUTPUT

               Abra        Apayao      Benguet        Ifugao       Kalinga
count     79.000000     79.000000    79.000000     79.000000     79.000000
mean   12874.379747  16860.645570  3237.392405  12414.620253  30446.417722
std    16746.466945  15448.153794  1588.536429   5034.282019  22245.707692
min      927.000000    401.000000   148.000000   1074.000000   2346.000000
25%     1524.000000   3435.500000  2328.000000   8205.000000   8601.500000
50%     5790.000000  10588.000000  3202.000000  13044.000000  24494.000000
75%    13330.500000  33289.000000  3918.500000  16099.500000  52510.500000
max    60303.000000  54625.000000  8813.000000  21031.000000  68663.000000
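
Each row of the describe output corresponds to an ordinary Series method, which helps when only one statistic is needed. The sketch below compares the two on made-up values; the frame demo and its column v are hypothetical. The std row is the sample standard deviation (divisor n - 1).

```python
import pandas as pd

# Made-up values, only to compare describe() with the individual methods.
demo = pd.DataFrame({"v": [1, 2, 3, 4, 5, 6, 7, 8]})
summary = demo.describe()

print(summary.loc["mean", "v"], demo["v"].mean())
print(summary.loc["std", "v"], demo["v"].std())           # sample std, ddof=1
print(summary.loc["25%", "v"], demo["v"].quantile(0.25))
print(summary.loc["50%", "v"], demo["v"].median())
```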

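Hypothesis testing

Python has a good package for statistical inference: stats, which ships with scipy. Its ttest_1samp function implements the one-sample t-test. So, to test the mean rice yield in the Abra column against the null hypothesis that the population mean yield is 15000, we have: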

Python

from scipy import stats as ss

# Perform one sample t-test using 15000 as the true mean
# (.ix was removed in pandas 1.0; .loc selects the column by label)
print(ss.ttest_1samp(a=df.loc[:, 'Abra'], popmean=15000))

# OUTPUT

(-1.1281738488299586, 0.26270472069109496)

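It returns a tuple of the following values:

t : float or array
The t-statistic
prob : float or array
The two-tailed p-value

From the output above, the p-value is 0.263, far greater than α = 0.05, so there is not sufficient evidence to say that the mean rice yield differs from 15000. Applying the same test to all the variables, again assuming a mean of 15000, we have: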

Python

print(ss.ttest_1samp(a=df, popmean=15000))

# OUTPUT

(array([ -1.12817385, 1.07053437, -65.81425599, -4.564575, 6.17156198]),
 array([2.62704721e-01, 2.87680340e-01, 4.15643528e-70,
        1.83764399e-05, 2.82461897e-08]))

The first array holds the t-statistics, and the second array holds the corresponding p-values.
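
The statistic reported by ttest_1samp can be reproduced by hand from the usual one-sample formula, t = (x̄ - μ0) / (s / √n), with a two-tailed p-value taken from the t distribution with n - 1 degrees of freedom. The sketch below checks this on made-up numbers; the sample x and the value mu0 are assumptions for illustration and are not the article's data.

```python
import numpy as np
from scipy import stats

# Made-up sample and hypothesized mean, for illustration only.
x = np.array([12.1, 14.3, 15.2, 13.8, 16.0, 14.9, 15.5, 13.2])
mu0 = 15.0

n = x.size
# Sample standard deviation uses ddof=1 (divisor n - 1).
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
# Two-tailed p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(x, popmean=mu0)

print(t_manual, p_manual)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```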

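Visualization

Python has many visualization modules, and the most popular is the matplotlib library. The bokeh and seaborn modules are also worth a brief mention. In an earlier post I already demonstrated the box-and-whisker plot functionality of matplotlib.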
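
For readers without the earlier post at hand, a box-and-whisker plot can be sketched as follows. This is an illustration under assumptions: the data is randomly generated rather than the article's dataset, the column names A, B and C are made up, and the figure is written to an in-memory buffer with the non-interactive Agg backend so the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data: 50 rows, 3 hypothetical columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(loc=10, scale=2, size=(50, 3)), columns=["A", "B", "C"])

fig, ax = plt.subplots()
ax.boxplot(demo.to_numpy())          # one box per column of the 2-D array
ax.set_xticks([1, 2, 3])             # boxes are drawn at positions 1..3
ax.set_xticklabels(demo.columns)
ax.set_ylabel("value")

# Save to memory instead of a file; use fig.savefig("boxplot.png") to keep it.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
print(buffer.getbuffer().nbytes > 0)
```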

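Creating custom functions

As an example of a custom function, the code below simulates confidence intervals: it draws a sample from a normal distribution and computes the sample mean with its confidence interval;
repeats this 100 times; and then
computes the percentage of confidence intervals that contain the true mean.

In Python, the program looks like this: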

Python

import numpy as np
import scipy.stats as ss

def case(n = 10, mu = 3, sigma = np.sqrt(5), p = 0.025, rep = 100):
    # One row per replicate: sample mean, CI lower bound, CI upper bound, hit flag
    m = np.zeros((rep, 4))
    for i in range(rep):
        norm = np.random.normal(loc = mu, scale = sigma, size = n)
        xbar = np.mean(norm)
        # z-interval with known sigma; p is the tail probability on each side
        low = xbar - ss.norm.ppf(q = 1 - p) * (sigma / np.sqrt(n))
        up = xbar + ss.norm.ppf(q = 1 - p) * (sigma / np.sqrt(n))
        if (mu > low) & (mu < up):
            rem = 1
        else:
            rem = 0
        m[i, :] = [xbar, low, up, rem]
    inside = np.sum(m[:, 3])
    per = inside / rep
    # Parentheses keep the two-line message a single expression
    desc = ("There are " + str(inside) + " confidence intervals that contain "
            "the true mean (" + str(mu) + "), that is " + str(per) + " percent of the total CIs")
    return {"Matrix": m, "Decision": desc}

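The code above is easy to read, but the explicit loop makes it slow. Below is an improved version, thanks to the Python experts; see the 15 comments on my previous post.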

Python

import numpy as np
import scipy.stats as ss

def case2(n = 10, mu = 3, sigma = np.sqrt(5), p = 0.025, rep = 100):
    # Half-width of the interval is the same for every replicate
    scaled_crit = ss.norm.ppf(q = 1 - p) * (sigma / np.sqrt(n))
    # Draw all replicates at once: one row per replicate, n values per row
    norm = np.random.normal(loc = mu, scale = sigma, size = (rep, n))
    xbar = norm.mean(1)
    low = xbar - scaled_crit
    up = xbar + scaled_crit
    rem = (mu > low) & (mu < up)
    m = np.c_[xbar, low, up, rem]
    inside = np.sum(m[:, 3])
    per = inside / rep
    # Parentheses keep the two-line message a single expression
    desc = ("There are " + str(inside) + " confidence intervals that contain "
            "the true mean (" + str(mu) + "), that is " + str(per) + " percent of the total CIs")
    return {"Matrix": m, "Decision": desc}
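
Because the simulation is random, a seeded generator makes the result repeatable, and a larger number of replicates shows the coverage settling near the nominal 95%. The function below is a compact sketch of the same vectorized idea, not the author's code: the name coverage, the seed parameter and the use of np.random.default_rng are assumptions added for illustration.

```python
import numpy as np
from scipy import stats


def coverage(n=10, mu=3.0, sigma=np.sqrt(5), p=0.025, rep=10_000, seed=0):
    """Fraction of known-sigma z-intervals that contain the true mean mu."""
    rng = np.random.default_rng(seed)  # seeded so repeated calls agree
    half_width = stats.norm.ppf(1 - p) * sigma / np.sqrt(n)
    # One row per replicate, n draws per row; mean over each row.
    xbar = rng.normal(loc=mu, scale=sigma, size=(rep, n)).mean(axis=1)
    covered = (xbar - half_width < mu) & (mu < xbar + half_width)
    return covered.mean()


cov = coverage()
print(cov)
```

With the default p = 0.025 on each side, the intervals are 95% intervals, so the printed fraction should be close to 0.95.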

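Update

Those interested in the IPython notebook version of this article, please click here. The conversion of this article to an IPython notebook was done by Nuttens Claude.

⑶ How to do data analysis with Python

The popular and easy-to-use libraries for data analysis in Python are numpy and pandas; if you are interested, they are worth studying in depth.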
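
As a starting point for the daily report in the title, pandas can aggregate timestamped records by day. The sketch below is only an illustration under assumptions: the orders table, its timestamp and amount columns, and all values are made up.

```python
import pandas as pd

# Made-up records; a real report would load these from a file or database.
orders = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2023-06-01 09:15",
                "2023-06-01 17:40",
                "2023-06-02 08:05",
                "2023-06-02 12:30",
                "2023-06-02 21:10",
            ]
        ),
        "amount": [120.0, 80.0, 50.0, 200.0, 30.0],
    }
)

# One row per calendar day: number of records, total and average amount.
daily = (
    orders.set_index("timestamp")
    .resample("D")["amount"]
    .agg(["count", "sum", "mean"])
)

print(daily)
```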
