泰克Tech posted on 2024-10-18 16:07:28

泰涨知识 | Pandas Data Preprocessing Techniques

Last edited by 泰克Tech on 2024-10-18 16:10

Pandas Data Preprocessing Techniques

I. Merging Data

1 Merging data with merge

https://pic1.zhimg.com/80/v2-1234e52f9469fa1c3923c179faddd9fd_720w.jpg

Output:


https://picx.zhimg.com/80/v2-7a29ab054866a6e72aca609dccf95693_720w.webp



1.1 Inner join: inner

https://picx.zhimg.com/80/v2-6fb734adbec64f373287277c211b9947_720w.webp



Output:

https://pica.zhimg.com/80/v2-f63c8b8230678d6abdc915fe640b692a_720w.webp
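
The post's code is only available as a screenshot, so here is a minimal sketch of an inner join. The tables `left` and `right` and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample tables; names and values are illustrative only.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# how="inner" (the default) keeps only keys present in both tables.
inner = pd.merge(left, right, on="key", how="inner")
print(inner)
```

Keys "a" and "d" each appear in only one table, so they are dropped from the result.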



1.2 Outer join: outer

https://picx.zhimg.com/80/v2-05ecdbdfc1f3000eff481e01d79825cf_720w.webp



Output:

https://pica.zhimg.com/80/v2-397f4f2fa1bf4dabc927752ea2e8936f_720w.webp
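
As a rough sketch with made-up sample data (the frames and columns below are hypothetical), an outer join keeps the union of keys and fills the gaps with NaN:

```python
import pandas as pd

# Hypothetical sample tables; names and values are illustrative only.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# how="outer" keeps every key from either table; missing cells become NaN.
outer = pd.merge(left, right, on="key", how="outer")
print(outer)
```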



1.3 Left join: left

https://pic1.zhimg.com/80/v2-8a0b4009848d8fdae3808b8884dc9444_720w.webp



Output:

https://picx.zhimg.com/80/v2-cb3d164d24a4135358b4e3cf86ffc6bb_720w.webp



1.4 Right join: right

https://pic1.zhimg.com/80/v2-3a9f7d1aa246312a71bcb472210288d7_720w.webp



Output:

https://picx.zhimg.com/80/v2-0a74bc4f587ba1b0b3410e9f1a9991ad_720w.webp
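
Left and right joins are easiest to compare side by side. The sketch below uses made-up sample tables (all names are hypothetical) and is not the code from the screenshots.

```python
import pandas as pd

# Hypothetical sample tables; names and values are illustrative only.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# how="left" keeps every row of the left table; unmatched y values are NaN.
left_join = pd.merge(left, right, on="key", how="left")

# how="right" keeps every row of the right table; unmatched x values are NaN.
right_join = pd.merge(left, right, on="key", how="right")

print(left_join)
print(right_join)
```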



1.5 The suffixes parameter: renaming duplicate column names

https://picx.zhimg.com/80/v2-49006c14619d2f00e7b87743cdbbf576_720w.webp



Output:

https://pic1.zhimg.com/80/v2-b0f90908060b77a605e089109d203619_720w.webp
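
A small sketch of `suffixes`, assuming two hypothetical tables that share a non-key column named `value`:

```python
import pandas as pd

# Both hypothetical tables have a "value" column in addition to the join key.
left = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "value": [3, 4]})

# suffixes is appended to overlapping column names (the default is ("_x", "_y")).
merged = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(merged.columns.tolist())
```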



2 Concatenating data with concat


2.1 By default, concat stacks data row-wise.

https://picx.zhimg.com/80/v2-50e5c78f5006641d6e1108b4290dd2df_720w.webp




Output:

https://picx.zhimg.com/80/v2-d2707b3f4fe482115df77eb7e5f946bb_720w.webp
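
A minimal sketch of the default behaviour, using two made-up Series (the data is hypothetical):

```python
import pandas as pd

# Two hypothetical Series with different index labels.
s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])

# With the default axis=0, concat stacks the inputs one after another.
stacked = pd.concat([s1, s2])
print(stacked)
```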




2.2 Concatenating two DataFrames

https://pic1.zhimg.com/80/v2-2442354721d2a401ecd494c004b537b8_720w.webp




Output:

https://picx.zhimg.com/80/v2-d88f039ce8c093d7ba56bd732ac41f00_720w.webp




2.3 Concatenating two DataFrames

https://picx.zhimg.com/80/v2-161909dc5c92c9b80a396c6ecdda7d7d_720w.webp




Output:

https://pic1.zhimg.com/80/v2-f0acc6366d2ef5fec2a34114e82375fa_720w.webp




axis=1 concatenates along columns (side by side); axis=0 concatenates along rows (stacked):

https://pic1.zhimg.com/80/v2-26a780db7ca9cd9ee0f1c02bc244efd0_720w.webp



Output:

https://pic1.zhimg.com/80/v2-04123959549a4027a01af380e33d012a_720w.webp
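
To make the difference between the two axes concrete, here is a sketch with two tiny made-up DataFrames (the column names are hypothetical):

```python
import pandas as pd

# Two hypothetical frames with the same row index but different columns.
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})

# axis=0: rows are stacked; columns missing from one frame are filled with NaN.
by_rows = pd.concat([df1, df2], axis=0)

# axis=1: frames are placed side by side, aligned on the row index.
by_cols = pd.concat([df1, df2], axis=1)

print(by_rows)
print(by_cols)
```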




1 Detecting and counting missing values

1.1 Detecting missing values: isnull()

https://pica.zhimg.com/80/v2-88fb8d5f64919810cd4d0144f27a9866_720w.webp




Output:

https://picx.zhimg.com/80/v2-bf9a748e13da1268fdc74232d50d4b6c_720w.webp
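
A short sketch of `isnull()` on made-up data (the frame below is hypothetical). It returns a boolean mask of the same shape as the input:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both a numeric and a text column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

# True marks a missing cell (NaN or None), False marks a present value.
mask = df.isnull()
print(mask)
```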




1.2 Counting missing values: isnull().sum()

https://picx.zhimg.com/80/v2-fe01aaded57dbaa41005409ab984caf0_720w.webp





https://picx.zhimg.com/80/v2-e6ff60dcfe574bec3d3c0ece4d80f63e_720w.webp
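
As a sketch with made-up data, summing the boolean mask counts missing values per column, because True counts as 1:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" has two missing values, column "b" has one.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [np.nan, 2.0, 3.0]})

# Per-column count of missing values.
counts = df.isnull().sum()

# Summing again gives the total for the whole frame.
total = counts.sum()

print(counts)
print(total)
```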




1.3 Checking for missing values with info() (it reports the non-null count of each column)

https://pica.zhimg.com/80/v2-4486f5dbcc466cf3bbcbcb6bcf5ec008_720w.webp



Output:

https://picx.zhimg.com/80/v2-7abb93e7b773ba6b2d2b737c68b104ca_720w.webp
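
A possible sketch, using a made-up frame: `info()` prints a summary that includes a non-null count for each column, and `count()` returns the same counts as a Series.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# info() prints to stdout by default; a buffer lets us capture the text.
buf = io.StringIO()
df.info(buf=buf)
print(buf.getvalue())

# count() returns the non-null count per column in programmatic form.
non_null = df.count()
print(non_null)
```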




2 Dropping missing values: dropna()

Signature of the dropna method:

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

https://picx.zhimg.com/80/v2-c82157b096f7b88bb8800a6a65b75300_720w.webp




2.1 Dropping missing values in a Series

https://picx.zhimg.com/80/v2-7ab6edfb6d29f1d25c81e89136b7519d_720w.webp





https://picx.zhimg.com/80/v2-3fd1d93481caba2efe7329683dc4d9c3_720w.webp
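
For a Series, a minimal sketch with made-up values looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value in the middle.
s = pd.Series([1.0, np.nan, 3.0])

# dropna() returns a new Series without the missing entries;
# the original index labels of the surviving entries are preserved.
cleaned = s.dropna()
print(cleaned)
```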




2.2 Dropping missing values in a DataFrame

By default, dropna() drops every row that contains at least one missing value.

https://picx.zhimg.com/80/v2-b9f3eafd4e6f133707df99d37b499d2f_720w.webp





https://picx.zhimg.com/80/v2-5825ce440be6aff9a05891662ef47cf4_720w.webp




2.3 The how parameter of dropna: 'any' (drop if any value is missing) or 'all' (drop only if all values are missing)

https://pica.zhimg.com/80/v2-04bcbaa541bd09016eb1c3e27163a572_720w.webp
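
The two settings of `how` can be contrasted on a made-up frame (hypothetical data) with one complete row, one partially missing row, and one fully missing row:

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is partly missing, row 2 is entirely missing.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan]})

# how="any" (the default): drop a row if it has at least one missing value.
any_dropped = df.dropna(how="any")

# how="all": drop a row only if every value in it is missing.
all_dropped = df.dropna(how="all")

print(any_dropped)
print(all_dropped)
```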




2.4 The axis parameter of dropna: axis=0 (rows) or axis=1 (columns); the default is rows

https://pic1.zhimg.com/80/v2-5fc862269eccc5b9f108dd29a3a6a3e9_720w.webp




Output:

https://picx.zhimg.com/80/v2-efd10e6c5251b0dcee9aac4e6ffeebf6_720w.webp
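
A sketch of column-wise dropping, using a hypothetical frame in which only column `b` has a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" is complete, column "b" has one NaN.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 3.0]})

# axis=1 drops columns (rather than rows) that contain missing values.
cols_dropped = df.dropna(axis=1)
print(cols_dropped)
```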




2.5 The thresh parameter of dropna: a row is kept only if it has at least N non-NaN values

https://picx.zhimg.com/80/v2-ea22de2f30f672a80cb5e73c0949e97c_720w.webp




Output:

https://picx.zhimg.com/80/v2-7688af9a8c1147f35c87966fe6caaf21_720w.webp
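
To illustrate `thresh`, the made-up frame below (hypothetical data) has rows with 3, 1, and 0 non-NaN values respectively:

```python
import numpy as np
import pandas as pd

# Non-NaN counts per row: row 0 has 3, row 1 has 1, row 2 has 0.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
})

# thresh=2 keeps only rows with at least two non-NaN values.
kept = df.dropna(thresh=2)
print(kept)
```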




3.1 Filling missing values with given values: df.fillna({1: 0.88, 2: 0.99})

https://picx.zhimg.com/80/v2-8fad1a8779e143ea81e44243c450c83e_720w.webp




Output:

https://picx.zhimg.com/80/v2-87d07d82d9284785f14ac8788e2ae358_720w.webp
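
In the dict form, each key is a column label and each value is the fill value for that column. A sketch, assuming a made-up frame whose columns are labelled 0, 1, 2:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with default integer column labels 0, 1, 2.
df = pd.DataFrame([[1.0, np.nan, 3.0], [4.0, 5.0, np.nan]])

# Column 1 is filled with 0.88 and column 2 with 0.99; column 0 is untouched.
filled = df.fillna({1: 0.88, 2: 0.99})
print(filled)
```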




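3.2 Forward fill: method='ffill' propagates the last valid value downward (newer pandas versions deprecate this argument in favor of ffill())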

https://picx.zhimg.com/80/v2-a5e1718011725ce491eadba10b283edc_720w.webp




Output:

https://picx.zhimg.com/80/v2-d92f70cc9eb8a04f69aef32a83da30d3_720w.webp
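
A minimal sketch of forward filling on made-up data. It uses the `ffill()` method rather than `fillna(method='ffill')`, since the latter is deprecated in recent pandas releases; the screenshot may use the older spelling.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a gap of two missing values.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0]})

# Each NaN takes the most recent valid value above it.
filled = df.ffill()
print(filled)
```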




3.3 Filling with the mean of a Series: mean()

https://pic1.zhimg.com/80/v2-07d76c3e72321182eaa75884d5735f3a_720w.webp




Filling with column means in a DataFrame:

https://picx.zhimg.com/80/v2-a0bec912fa612fd27376e502fc92a789_720w.webp




Output:

https://pic1.zhimg.com/80/v2-9a129a1bfb5b40546d81bc4dfc4c92f3_720w.webp
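
Mean filling for both a Series and a DataFrame can be sketched as follows; the data is made up, and `mean()` skips NaN by default.

```python
import numpy as np
import pandas as pd

# Series: the mean of the valid values 1.0 and 3.0 is 2.0.
s = pd.Series([1.0, np.nan, 3.0])
s_filled = s.fillna(s.mean())

# DataFrame: df.mean() is a Series of column means, so each column
# is filled with its own mean (2.0 for "a", 6.0 for "b").
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})
df_filled = df.fillna(df.mean())

print(s_filled)
print(df_filled)
```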




4.1 Detecting duplicates: duplicated()

Use the DataFrame's duplicated() method to check whether each row repeats an earlier row. It returns a boolean Series with one value per row:

https://picx.zhimg.com/80/v2-83e93d19fdcac70b7e8bc39deb8a9104_720w.webp




Output:

https://pic1.zhimg.com/80/v2-c936e9aa631068f3976a8fec1a430fc1_720w.webp
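
A sketch of `duplicated()` on a made-up frame (column names `k1` and `k2` are hypothetical) in which the third row repeats the first:

```python
import pandas as pd

# Row 2 ("one", 1) is an exact copy of row 0.
df = pd.DataFrame({"k1": ["one", "two", "one", "two"], "k2": [1, 1, 1, 2]})

# True marks a row that is identical to some earlier row.
is_dup = df.duplicated()
print(is_dup)
```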




4.2 Dropping duplicate rows: drop_duplicates()

https://picx.zhimg.com/80/v2-7892eed4b60556773cddeec359dd01db_720w.webp





https://pic1.zhimg.com/80/v2-c2dbbe6c4ba065689e0f1d122b764c0b_720w.webp




Checking for duplicates on specified columns only:

https://picx.zhimg.com/80/v2-4bbe0927a3f40f3e31ad046be2b404ae_720w.webp




By default the first occurrence is kept; pass keep='last' to keep the last occurrence instead:

https://picx.zhimg.com/80/v2-1efa27018252c965cc963b2ac56d323f_720w.webp
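
The three variants of `drop_duplicates()` discussed in this section can be sketched together on a made-up frame (the column names are hypothetical):

```python
import pandas as pd

# Row 2 repeats row 0 exactly; "k1" alone repeats in rows 2 and 3.
df = pd.DataFrame({"k1": ["one", "two", "one", "two"], "k2": [1, 1, 1, 2]})

# Compare whole rows, keep the first occurrence.
deduped = df.drop_duplicates()

# Compare only the "k1" column, keep the first occurrence of each value.
first_by_k1 = df.drop_duplicates(subset=["k1"])

# Same comparison, but keep the last occurrence of each value.
last_by_k1 = df.drop_duplicates(subset=["k1"], keep="last")

print(deduped)
print(first_by_k1)
print(last_by_k1)
```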




More great content is available from 泰克教育.

Stay tuned.

