泰涨知识 | Pandas Data Preprocessing Techniques
This post was last edited by 泰克Tech on 2024-10-18 16:10
I. Data Merging
1 Merging data with merge
https://pic1.zhimg.com/80/v2-1234e52f9469fa1c3923c179faddd9fd_720w.jpg
Output:
https://picx.zhimg.com/80/v2-7a29ab054866a6e72aca609dccf95693_720w.webp
1.1 Inner join: inner
https://picx.zhimg.com/80/v2-6fb734adbec64f373287277c211b9947_720w.webp
Output:
https://pica.zhimg.com/80/v2-f63c8b8230678d6abdc915fe640b692a_720w.webp
1.2 Outer join: outer
https://picx.zhimg.com/80/v2-05ecdbdfc1f3000eff481e01d79825cf_720w.webp
Output:
https://pica.zhimg.com/80/v2-397f4f2fa1bf4dabc927752ea2e8936f_720w.webp
1.3 Left join: left
https://pic1.zhimg.com/80/v2-8a0b4009848d8fdae3808b8884dc9444_720w.webp
Output:
https://picx.zhimg.com/80/v2-cb3d164d24a4135358b4e3cf86ffc6bb_720w.webp
1.4 Right join: right
https://pic1.zhimg.com/80/v2-3a9f7d1aa246312a71bcb472210288d7_720w.webp
Output:
https://picx.zhimg.com/80/v2-0a74bc4f587ba1b0b3410e9f1a9991ad_720w.webp
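Since the code in this section is only available as screenshots, here is a minimal text sketch of the four join types. The frames `left` and `right` and their column names are hypothetical sample data, not the data from the images.

```python
import pandas as pd

# Hypothetical sample data (not the frames shown in the screenshots).
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")       # keys present in both frames
outer = pd.merge(left, right, on="key", how="outer")       # keys present in either frame
left_join = pd.merge(left, right, on="key", how="left")    # every key of `left`
right_join = pd.merge(left, right, on="key", how="right")  # every key of `right`

for name, frame in [("inner", inner), ("outer", outer),
                    ("left", left_join), ("right", right_join)]:
    print(name)
    print(frame)
```

Rows without a match on the other side are kept by outer, left, and right joins, with NaN in the columns that come from the missing side.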
1.5 The suffixes parameter: renaming duplicate column names
https://picx.zhimg.com/80/v2-49006c14619d2f00e7b87743cdbbf576_720w.webp
Output:
https://pic1.zhimg.com/80/v2-b0f90908060b77a605e089109d203619_720w.webp
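As a rough text version of the suffixes idea, the sketch below merges two hypothetical frames that share a non-key column named `value`. The suffix strings `_left` and `_right` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical frames: both carry a "value" column besides the join key.
df1 = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
df2 = pd.DataFrame({"key": ["a", "b"], "value": [10, 20]})

# Without suffixes, pandas appends "_x" and "_y" to the overlapping names.
default = pd.merge(df1, df2, on="key")

# With suffixes, the overlapping names get the labels we choose.
named = pd.merge(df1, df2, on="key", suffixes=("_left", "_right"))

print(list(default.columns))
print(list(named.columns))
```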
2 Concatenating data with concat
2.1 By default, concat stacks data row-wise.
https://picx.zhimg.com/80/v2-50e5c78f5006641d6e1108b4290dd2df_720w.webp
Output:
https://picx.zhimg.com/80/v2-d2707b3f4fe482115df77eb7e5f946bb_720w.webp
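A minimal sketch of the default row-wise stacking, assuming two small Series with made-up values and index labels:

```python
import pandas as pd

# Hypothetical Series; the index labels are arbitrary.
s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])

# axis=0 is the default, so the second Series is appended below the first.
stacked = pd.concat([s1, s2])
print(stacked)
```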
2.2 Concatenating two DataFrames
https://pic1.zhimg.com/80/v2-2442354721d2a401ecd494c004b537b8_720w.webp
Output:
https://picx.zhimg.com/80/v2-d88f039ce8c093d7ba56bd732ac41f00_720w.webp
2.3 Concatenating two DataFrames
https://picx.zhimg.com/80/v2-161909dc5c92c9b80a396c6ecdda7d7d_720w.webp
Output:
https://pic1.zhimg.com/80/v2-f0acc6366d2ef5fec2a34114e82375fa_720w.webp
axis=1 concatenates along columns (side by side); axis=0 concatenates along rows (stacked):
https://pic1.zhimg.com/80/v2-26a780db7ca9cd9ee0f1c02bc244efd0_720w.webp
Output:
https://pic1.zhimg.com/80/v2-04123959549a4027a01af380e33d012a_720w.webp
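To make the difference between the two axes concrete, here is a small sketch with two hypothetical DataFrames that share only column `b`. The use of `ignore_index=True` is an optional choice added here to renumber the stacked rows.

```python
import pandas as pd

# Hypothetical frames that overlap only on column "b".
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# axis=0: rows are stacked; columns missing from one frame are filled with NaN.
by_rows = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1: frames are placed side by side, aligned on the row index.
by_cols = pd.concat([df1, df2], axis=1)

print(by_rows)
print(by_cols)
```

Note that axis=1 keeps both `b` columns under the same name, so a later `by_cols["b"]` returns a two-column DataFrame rather than a Series.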
1 Detecting and counting missing values
1.1 Detecting missing values: isnull()
https://pica.zhimg.com/80/v2-88fb8d5f64919810cd4d0144f27a9866_720w.webp
Output:
https://picx.zhimg.com/80/v2-bf9a748e13da1268fdc74232d50d4b6c_720w.webp
1.2 Counting missing values with isnull().sum()
https://picx.zhimg.com/80/v2-fe01aaded57dbaa41005409ab984caf0_720w.webp
https://picx.zhimg.com/80/v2-e6ff60dcfe574bec3d3c0ece4d80f63e_720w.webp
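The two steps above can be sketched as follows. The frame `df` is hypothetical sample data; `None` in the string column is included to show that it also counts as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with NaN in numeric columns and None in a string column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": ["x", "y", None],
})

mask = df.isnull()             # boolean frame: True where a value is missing
per_column = mask.sum()        # True counts as 1, giving missing values per column
total = int(per_column.sum())  # missing values in the whole frame

print(mask)
print(per_column)
print(total)
```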
1.3 Using info() to inspect missing values
https://pica.zhimg.com/80/v2-4486f5dbcc466cf3bbcbcb6bcf5ec008_720w.webp
Output:
https://picx.zhimg.com/80/v2-7abb93e7b773ba6b2d2b737c68b104ca_720w.webp
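info() reports the non-null count of each column, so missing values show up as a count smaller than the number of rows. A minimal sketch on hypothetical data follows; capturing the report in a `StringIO` buffer is just a convenience added here, and calling `df.info()` alone prints the same text.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: 3 rows, with 1 and 2 missing values per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

buffer = io.StringIO()
df.info(buf=buffer)         # by default info() writes to stdout; buf redirects it
report = buffer.getvalue()
print(report)

# The same information as numbers: rows minus non-null count per column.
missing = len(df) - df.count()
print(missing)
```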
2 Removing missing values: dropna()
Signature of the dropna method:
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
https://picx.zhimg.com/80/v2-c82157b096f7b88bb8800a6a65b75300_720w.webp
2.1 dropna on a Series
https://picx.zhimg.com/80/v2-7ab6edfb6d29f1d25c81e89136b7519d_720w.webp
https://picx.zhimg.com/80/v2-3fd1d93481caba2efe7329683dc4d9c3_720w.webp
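A small sketch of dropping missing values from a Series, using made-up numbers. The boolean-mask form is shown only for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries.
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

cleaned = s.dropna()    # returns a new Series; `s` itself is unchanged
same = s[s.notnull()]   # equivalent result via a boolean mask

print(cleaned)
```

The original index labels are preserved, so `cleaned` keeps labels 0, 2, and 4.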
2.2 dropna on a DataFrame
By default, dropna() drops every row that contains at least one missing value.
https://picx.zhimg.com/80/v2-b9f3eafd4e6f133707df99d37b499d2f_720w.webp
https://picx.zhimg.com/80/v2-5825ce440be6aff9a05891662ef47cf4_720w.webp
2.3 The how parameter of dropna: 'any' drops a row if it contains any missing value; 'all' drops a row only when all of its values are missing
https://pica.zhimg.com/80/v2-04bcbaa541bd09016eb1c3e27163a572_720w.webp
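The difference between the default behaviour and how='all' can be sketched on a hypothetical frame with one complete row, one partially missing row, and one fully missing row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 has one NaN, row 2 is all NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

any_dropped = df.dropna()           # default how='any': only row 0 survives
all_dropped = df.dropna(how="all")  # only the fully missing row 2 is removed

print(any_dropped)
print(all_dropped)
```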
2.4 The axis parameter of dropna: axis=0 (by row) or axis=1 (by column); the default is by row
https://pic1.zhimg.com/80/v2-5fc862269eccc5b9f108dd29a3a6a3e9_720w.webp
Output:
https://picx.zhimg.com/80/v2-efd10e6c5251b0dcee9aac4e6ffeebf6_720w.webp
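A minimal sketch of the axis parameter, assuming a hypothetical frame in which column `c` is entirely missing. Combining axis=1 with how='all' is an extra illustration, not something taken from the screenshots.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is complete, "b" has one NaN, "c" is all NaN.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

rows_kept = df.dropna(axis=0)            # every row has a NaN in "c", so none survive
cols_kept = df.dropna(axis=1)            # drops any column containing a NaN
cols_all = df.dropna(axis=1, how="all")  # drops only columns that are entirely NaN

print(rows_kept)
print(cols_kept)
print(cols_all)
```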
2.5 The thresh parameter of dropna: a row is kept only if it has at least N non-NaN values
https://picx.zhimg.com/80/v2-ea22de2f30f672a80cb5e73c0949e97c_720w.webp
Output:
https://picx.zhimg.com/80/v2-7688af9a8c1147f35c87966fe6caaf21_720w.webp
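As a sketch of thresh, the hypothetical frame below has rows with 3, 2, and 1 non-NaN values, and the threshold of 2 is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0, 1, 2 hold 3, 2, and 1 non-NaN values respectively.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, 9.0],
})

# Keep rows with at least 2 non-NaN values; row 2 has only one and is dropped.
kept = df.dropna(thresh=2)
print(kept)
```

Recent pandas releases do not accept how and thresh in the same call, so pass only one of them.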
3.1 Filling missing values with given values: df.fillna({1: 0.88, 2: 0.99})
https://picx.zhimg.com/80/v2-8fad1a8779e143ea81e44243c450c83e_720w.webp
Output:
https://picx.zhimg.com/80/v2-87d07d82d9284785f14ac8788e2ae358_720w.webp
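The dict passed to fillna maps column labels to fill values. The sketch below assumes a hypothetical frame with the default integer column labels 0, 1, 2, so that the keys 1 and 2 from the call above refer to real columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data; without explicit names the columns are labelled 0, 1, 2.
df = pd.DataFrame([
    [1.0, np.nan, 3.0],
    [np.nan, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])

# Column 1 is filled with 0.88 and column 2 with 0.99; column 0 is not listed.
filled = df.fillna({1: 0.88, 2: 0.99})
print(filled)
```

Columns that are not keys of the dict are left untouched, so the NaN in column 0 remains.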
Note: method='ffill' fills downward, carrying the last valid value forward
https://picx.zhimg.com/80/v2-a5e1718011725ce491eadba10b283edc_720w.webp
Output:
https://picx.zhimg.com/80/v2-d92f70cc9eb8a04f69aef32a83da30d3_720w.webp
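A minimal sketch of forward filling on hypothetical data. It uses `df.ffill()`, which gives the same result as `fillna(method='ffill')`; the `method` argument of fillna is deprecated as of pandas 2.1, so the dedicated method is the safer spelling on current versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in the middle and at the top of a column.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

# Each NaN takes the last valid value above it in the same column.
filled = df.ffill()
print(filled)
```

A NaN at the top of a column has no earlier value to copy, so it stays missing.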
3.3 Filling with the mean of a Series: mean()
https://pic1.zhimg.com/80/v2-07d76c3e72321182eaa75884d5735f3a_720w.webp
Filling with the mean in a DataFrame:
https://picx.zhimg.com/80/v2-a0bec912fa612fd27376e502fc92a789_720w.webp
Output:
https://pic1.zhimg.com/80/v2-9a129a1bfb5b40546d81bc4dfc4c92f3_720w.webp
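Both mean-filling variants can be sketched on hypothetical data as follows. mean() skips NaN by default, so the fill value is the mean of the observed values only.

```python
import numpy as np
import pandas as pd

# Series: the mean of the observed values 1.0 and 3.0 is 2.0.
s = pd.Series([1.0, np.nan, 3.0])
s_filled = s.fillna(s.mean())

# DataFrame: df.mean() returns one mean per column, and fillna aligns
# that Series on column labels, so each column gets its own mean.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})
df_filled = df.fillna(df.mean())

print(s_filled)
print(df_filled)
```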
4.1 Detecting duplicates: duplicated()
In a DataFrame, the duplicated() method checks whether each row repeats an earlier row. It returns a boolean Series with one value per row:
https://picx.zhimg.com/80/v2-83e93d19fdcac70b7e8bc39deb8a9104_720w.webp
Output:
https://pic1.zhimg.com/80/v2-c936e9aa631068f3976a8fec1a430fc1_720w.webp
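A short sketch of duplicated() on hypothetical data, where row 2 repeats row 0 exactly and row 3 shares only the name with row 1:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "name": ["ann", "bob", "ann", "bob"],
    "score": [90, 80, 90, 85],
})

# True marks a row whose values all match an earlier row.
flags = df.duplicated()
print(flags)
```

Only full-row matches are flagged by default, so row 3 is not a duplicate even though its name appeared before.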
4.2 Removing duplicate rows: drop_duplicates()
https://picx.zhimg.com/80/v2-7892eed4b60556773cddeec359dd01db_720w.webp
https://pic1.zhimg.com/80/v2-c2dbbe6c4ba065689e0f1d122b764c0b_720w.webp
Checking for duplicates on specified columns only:
https://picx.zhimg.com/80/v2-4bbe0927a3f40f3e31ad046be2b404ae_720w.webp
By default the first occurrence is kept; pass keep='last' to keep the last occurrence instead:
https://picx.zhimg.com/80/v2-1efa27018252c965cc963b2ac56d323f_720w.webp
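The three variants of drop_duplicates() described above can be sketched together. The frame and the choice of `name` as the subset column are hypothetical.

```python
import pandas as pd

# Hypothetical data: row 2 copies row 0; rows 1 and 3 share only the name.
df = pd.DataFrame({
    "name": ["ann", "bob", "ann", "bob"],
    "score": [90, 80, 90, 85],
})

unique_rows = df.drop_duplicates()                              # compare whole rows
by_name_first = df.drop_duplicates(subset=["name"])             # compare "name" only, keep first
by_name_last = df.drop_duplicates(subset=["name"], keep="last") # compare "name" only, keep last

print(unique_rows)
print(by_name_first)
print(by_name_last)
```

All three calls return new frames and keep the original index labels; `df` itself is not modified.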