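I. Experiment Description
Experiment topic: Classification prediction with SVM
Experiment requirements: Using the given data, implement classification prediction with the support vector machine (SVM) algorithm. Specifically: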
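Select variables (e.g., trip distance, fare, and time) and preprocess the data (e.g., handle missing values and outliers, and normalize or standardize the data). Because the dataset is very large, a subset may be used, but it must contain at least 100,000 records, and the basis for selecting the subset must be explained. Use the SVM algorithm to predict the payment type (cash or credit card) of Chicago taxi trips. (Note: the given dataset must be used, and the data must be processed in Python.) Split the data into training and test sets at a ratio of 80%:20%, and give a clear procedure for validating the accuracy. In addition, analyze other variables in the dataset to explore how strongly different factors influence the payment type; new findings are encouraged. Finally, complete the report.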
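The workflow required above (standardize, split 80%:20%, train an SVM, report test accuracy) can be sketched as follows. This is only a minimal illustration under stated assumptions: the data are invented toy (miles, fare) pairs rather than records from the dataset, the helper names are hypothetical, and the classifier is a linear soft-margin SVM trained by plain batch sub-gradient descent on the hinge loss so that the example needs only the standard library. For the real experiment with 100,000 or more records, a library implementation such as scikit-learn's `SVC` or `LinearSVC` would normally be used instead.

```python
import random


def make_toy_trips(n, seed=0):
    """Invented stand-in data (NOT from the dataset): [miles, fare], label +1 card / -1 cash."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        if i % 2 == 0:
            X.append([rng.gauss(8.0, 2.0), rng.gauss(30.0, 5.0)])
            y.append(1)
        else:
            X.append([rng.gauss(2.0, 1.0), rng.gauss(10.0, 3.0)])
            y.append(-1)
    return X, y


def split_80_20(X, y, seed=0):
    """Shuffle the indices, then keep 80% for training and 20% for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) * 8 // 10
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in te], [y[i] for i in te])


def standardize(train, test):
    """Z-score both splits with mean/std estimated on the training split only."""
    d = len(train[0])
    means = [sum(x[j] for x in train) / len(train) for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((x[j] - means[j]) ** 2 for x in train) / len(train)
        stds.append(var ** 0.5 or 1.0)  # guard against constant columns

    def scale(data):
        return [[(x[j] - means[j]) / stds[j] for j in range(d)] for x in data]

    return scale(train), scale(test)


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by batch sub-gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge sub-gradient
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, X):
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]


X, y = make_toy_trips(500)
X_tr, y_tr, X_te, y_te = split_80_20(X, y)
X_tr, X_te = standardize(X_tr, X_te)
w, b = train_linear_svm(X_tr, y_tr)
pred = predict(w, b, X_te)
accuracy = sum(p == t for p, t in zip(pred, y_te)) / len(y_te)
print(f"train={len(y_tr)} test={len(y_te)} accuracy={accuracy:.3f}")
```

The scaling statistics are computed on the training split only, so that no information from the 20% test split leaks into training; the accuracy printed at the end is measured on that held-out split.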
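Report contents: experiment problem; experiment objectives; data introduction (a written description supported by charts showing temporal, spatial, and multi-factor distributions); experiment method (the support vector machine (SVM) algorithm is required for everyone, and the methods chapter must fully describe the algorithm flow and formulas, the data processing flow, and the validation flow); experiment results (state the parameter settings, prediction accuracy, etc.); and analysis of the results (results supported by multiple charts are encouraged, along with analysis of how important different factors are to the results; interesting findings earn extra credit).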
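As a reference point for the formulas the methods chapter must contain, the standard soft-margin SVM can be written as below. The notation is an assumption, not taken from the assignment: $x_i$ is the feature vector of trip $i$, $y_i \in \{-1, +1\}$ encodes the payment type, $\xi_i$ are slack variables, and $C > 0$ is the penalty parameter to be reported with the results.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1-\xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n

\hat{y}(x) = \operatorname{sign}\!\left(w^{\top}x + b\right),
\qquad
\text{accuracy} = \frac{\#\{\, i \in \text{test set} : \hat{y}(x_i) = y_i \,\}}{\#\{\text{test set}\}}
```

A kernel SVM replaces the inner product $w^{\top}x$ with a kernel expansion over the training points; which variant is used should be stated alongside the parameter settings.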
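Dataset description:
Dataset: Chicago Taxi Trips Dataset (2023), containing Chicago taxi trip records
Trip ID: trip identifier  Taxi ID: taxi identifier
Trip Start/End Timestamp: pickup/drop-off time  Trip Seconds: trip duration (seconds)  Trip Miles: trip distance (miles)
Pickup/Dropoff Census Tract: pickup/drop-off census tract number, i.e., the number of the U.S. census tract containing the pickup (or drop-off) location
Pickup/Dropoff Community Area: pickup/drop-off community area number; Chicago is divided into 77 community areas for city planning
Fare: base fare  Tips: tip amount  Tolls: road and bridge tolls  Extras: extra charges  Trip Total: total cost
Payment Type: payment method  Company: taxi company
Pickup Centroid Latitude/Longitude: latitude/longitude of the pickup point
Pickup Centroid Location: geographic coordinates of the pickup point (formatted as POINT (longitude latitude), as in the sample rows)
Dropoff Centroid Latitude/Longitude: latitude/longitude of the drop-off point
Dropoff Centroid Location: geographic coordinates of the drop-off point (formatted as POINT (longitude latitude), as in the sample rows)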
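To make the field descriptions concrete, the following sketch parses a few rows and turns them into model-ready features. It is a minimal illustration, not a prescribed procedure: the four rows are copied from the sample data but trimmed to six columns, the helper names `parse_row` and `load_rows` are hypothetical, and the cleaning rules (drop rows with missing fields or with non-positive duration, distance, or fare; keep only Cash and Credit Card) are assumptions that the report would need to justify.

```python
import csv
import io
from datetime import datetime

# Four rows copied from the sample data, trimmed to the columns used here.
SAMPLE = """Trip Start Timestamp,Trip Seconds,Trip Miles,Pickup Community Area,Fare,Payment Type
12/31/2023 11:45:00 PM,982,0.81,8,8.75,Credit Card
12/31/2023 11:45:00 PM,390,0.09,,34.86,Cash
12/31/2023 11:45:00 PM,1271,4.18,8,15.5,Cash
12/31/2023 11:45:00 PM,0,0,8,3.25,Cash
"""

LABELS = {"Cash": 0, "Credit Card": 1}      # binary target: payment type
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"        # 12-hour clock with AM/PM, as in the data


def parse_row(row):
    """Return a cleaned feature dict, or None if the row should be dropped."""
    label = LABELS.get(row["Payment Type"])
    if label is None:
        return None  # payment types other than cash / credit card are out of scope
    try:
        seconds = float(row["Trip Seconds"])
        miles = float(row["Trip Miles"])
        fare = float(row["Fare"])
        area = int(float(row["Pickup Community Area"]))
        start = datetime.strptime(row["Trip Start Timestamp"], TIME_FORMAT)
    except ValueError:
        return None  # missing or malformed field
    if seconds <= 0 or miles <= 0 or fare <= 0:
        return None  # assumed outlier rule: zero-length trips carry no usable signal
    return {
        "seconds": seconds,
        "miles": miles,
        "fare": fare,
        "pickup_area": area,
        "hour": start.hour,          # hour of day, 0-23
        "weekday": start.weekday(),  # Monday = 0
        "label": label,
    }


def load_rows(text):
    reader = csv.DictReader(io.StringIO(text))
    return [parsed for parsed in map(parse_row, reader) if parsed is not None]


rows = load_rows(SAMPLE)
for r in rows:
    print(r)
```

Here the second row is dropped because its pickup community area is missing, and the fourth because its duration and distance are zero. Using `csv.DictReader` rather than splitting on commas matters for this dataset, since some company names in the data are quoted and contain commas.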
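The dataset is available at https://download.csdn.net/download/2401_84149564/90962954?spm=1011.2124.3001.6210 (because the dataset is very large, only the first 108 rows are listed here; they can all be placed in the files "linear.csv" and "spiral.csv").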
Trip ID,Taxi ID,Trip Start Timestamp,Trip End Timestamp,Trip Seconds,Trip Miles,Pickup Census Tract,Dropoff Census Tract,Pickup Community Area,Dropoff Community Area,Fare,Tips,Tolls,Extras,Trip Total,Payment Type,Company,Pickup Centroid Latitude,Pickup Centroid Longitude,Pickup Centroid Location,Dropoff Centroid Latitude,Dropoff Centroid Longitude,Dropoff Centroid Location
011106b6114f83af0c17aace3867a464a7fc742b,4628ef9dfa973bdfe877c5aa9d9738f9dc1204e54f2f1a4cc18141f37e2e66d080533f82510a96d1525b28eee833696f7e1337e9999a38f2fd5babf71585a344,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,982,0.81,17031081800,17031081500,8,8,8.75,3,0,1,13.25,Credit Card,Chicago Independents,41.89321636,-87.63784421,POINT (-87.6378442095 41.8932163595),41.89250778,-87.62621491,POINT (-87.6262149064 41.8925077809)
e9a66ddcc78cfd79f419165314cbe5ee380f16c3,8efe74ab61de459003dcedd85c637ce11bba19bac633cde9559a4895c98d7185ce0f7742dbd8b1938151fc3cdb89a3b4234bf80bf6368654831c79cf9685b3a9,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,390,0.09,,,,,34.86,0,0,0,34.86,Cash,Flash Cab,,,,,,
e765192268db3480b5d9bd0443f7ce7fd5ba047d,6f45c05aa231c9dd389ebdb65ca751cd82ef7634766017a1240d6554bf91840a924cf3cd16564a1ca643c9b1880db706d977ab5164f4b0bd030e0fff21cb3934,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1271,4.18,,,8,6,15.5,0,0,0,15.5,Cash,Flash Cab,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014)
c6510d4f82541cfacf8c20cab44fbb7c0b2c5efe,89fc6b1f0628f328ccd1021fcf4e7318bb2f9962da9259b522bde63ca44f9f201a016291bbd2801fad04845cfd5b30b954afedccb22f6c49fadda05821804a06,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,1280,8.19,,,,,22.5,3,0,0,26,Credit Card,Flash Cab,,,,,,
f9445eed26da9a0eff247350df942616cb51e764,14275cab8bb64007379de40be92944817231f63047033992a1964ce85a4a5405085b87a804736c24ecd8a334baa315255c066e271d7f73dfb756a19a24a44d25,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,585,5.02,,,28,6,35,2,0,0,37.5,Credit Card,Sun Taxi,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014)
f74fa03a6cf8cdcf668d0726efa1671d398b4450,2fea69c8a6e08471bc4339a05e9ee7955bef68d791f77a202bd54f3ae41c805907d7ac13a89f86fac4494c976ca87883157baa32ea41f59056661884135f6bba,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1473,9.94,,,56,,26.75,0,0,18,44.75,Cash,Sun Taxi,41.79259236,-87.76961545,POINT (-87.7696154528 41.7925923603),,,
f40c2cda1cea33c2265a34b2ce1eb454067ad8d2,3618045f9110d4d88482266ade23659c1a50d32ac37f205c15614b1ada9d4ca14b171329afed5dfd81c7525bd5a05fe614cb63b2aa48d920626b519e20d9e146,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,0,0,,,24,24,3.25,0,0,11.25,14.5,Credit Card,Taxi Affiliation Services,41.90120699,-87.67635599,POINT (-87.6763559892 41.9012069941),41.90120699,-87.67635599,POINT (-87.6763559892 41.9012069941)
f3a139c0df3513324ff3f699bf40db2e84291e3a,9b48ad5744e86450fb4db78e7095a6827bafc43a6a9d9a8f656aac46cc0e429d129471cdad31f8a5a97b3a45c8af5fcbc80d003c1c4839075733900786e1a5a9,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1228,4.29,,,6,21,15.25,0,0,0,15.25,Cash,Sun Taxi,41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014),41.9386662,-87.71121059,POINT (-87.7112105933 41.9386661962)
ee7cce18a4b24e080366930ab5ec72d1aeb6556c,adb1cb74113851b651b474182fbe95a9663783779db8cca0fdc3ff7ac82cc8fe5d864b7f86a995c64e405fa6c89cd87d00b458108ebe0c62bc23c4c79e61da46,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1310,18.23,,,28,76,60,0,0,0,60.5,Credit Card,5 Star Taxi,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146)
ed445ada05f17c5f359892eda3c329e1445b5e7b,4b034948aceedd53262ae713f864b0364953a1852b6b24669f192cad26c5014f1af0b6c87b941abb1fa93e1abbe09f70d7f02d48e5371d2c55534b68565a3060,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,4,0,17031320100,17031320100,32,32,20,5.12,0,0,25.62,Credit Card,Sun Taxi,41.88498719,-87.62099291,POINT (-87.6209929134 41.8849871918),41.88498719,-87.62099291,POINT (-87.6209929134 41.8849871918)
ec183abaa7ff142f17ebcdafa1f3d4e611a9f494,f6d1b6c930d62f6d8cbbd8f86a593ff057408c82f764744a7a38ee63957a74b84eaeea80224ea3a0021ba1572323a282530b0659b5e1ce48d04939eacb504060,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,786,2.02,17031081500,17031330100,8,33,10,0,0,0,10,Cash,Chicago Independents,41.89250778,-87.62621491,POINT (-87.6262149064 41.8925077809),41.85934972,-87.61735801,POINT (-87.6173580061 41.859349715)
e5c03bc6d864518431ce24706a4a9055221dc333,99ec13d5d806f5f5fa7a57910f8e38d84f90630529f2f8766d65b47caae8cb7cacf3d4bb9ca6576dfa49bc45b9f0f615e79577ace618514c9c59dc52ffbf40b6,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1064,9.58,,,76,12,25,0,0,5,30,Cash,5 Star Taxi,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.99393013,-87.75835359,POINT (-87.7583535876 41.9939301285)
e2b8bea5dbc60464ff88ba8dc8b66836513101e8,884655d853cbe41e1cdf747969f0dc5b55ed2d5f76c09ae207083297c948a813b2dd57912bfb4a6f4230556ab61363257e586827846cf2a89ff35c7c3b1c08e5,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,279,1.51,,,49,49,6.5,0,0,0,6.5,Cash,Flash Cab,41.70658788,-87.62336651,POINT (-87.6233665115 41.7065878819),41.70658788,-87.62336651,POINT (-87.6233665115 41.7065878819)
dcfbaa5d01e81e18637185fc5e822d6a08456f59,2659a61c08f91c6efd9e7d7947a00006a7bc26aa518241786d51cb05c853cdd86fdd1adc4010867706a2daa9f0da856cb2d7a705c111d3c89e53f5499741247e,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,300,1.8,,,76,7,7.5,4,0,1,13,Credit Card,Globe Taxi,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.92268628,-87.64948873,POINT (-87.6494887289 41.9226862843)
d51430f93404726b121d82d42efc29a2062895a8,24d4c5e51d147aecbb7c4a1ad70c38dbc05c7b4485f6de4b41fee5ea270228859c19c4160b101df87b889acc5cc7ab49b5e0491676026518677d34ea81f76591,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,379,1.25,,,14,13,6.75,3,0,0,10.25,Credit Card,Taxicab Insurance Agency Llc,41.968069,-87.72155906,POINT (-87.7215590627 41.968069),41.98363631,-87.72358319,POINT (-87.7235831853 41.9836363072)
ce61e0f21271970f7bf3006d489638d1c320a62f,f7782da531b08c6ce5a1e16a8c2998f6f4f7943f29ab53713949ec17f3b4d7f8b4cd8a84da2fe7c4b0a17f1fc3439e376d3bfec25d83652dfa4468342e25f6b1,12/31/2023 11:45:00 PM,01/01/2024 12:45:00 AM,3845,4.98,,,32,6,31,0,0,0,31,Cash,Taxicab Insurance Agency Llc,41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841),41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014)
cde7d22932829a7b19fb43bfd9a1d635c1e3f04e,52e8915b8a7b8851b341adb6797c3652a198b772561e9e9888d9963de61b796f60653ebdd06f44aa2fad304efb0a53160d6210b2a26fcc64b21932db0d658e32,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,1852,22.97,17031980000,,76,,55.5,10,0,35,101,Credit Card,Flash Cab,41.97907082,-87.90303966,POINT (-87.9030396611 41.9790708201),,,
cb50d6951086242beccb8fe7d248cfad3fab3dd7,179f1a051e9e6d3fc0726628962faff68506086ee8df14091e91399452fada055453e421df026fcc4449a43ba54a357c364482e62a20420f836389bd2269b5ee,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1383,17.95,,,28,,43.75,11.06,0,0,55.31,Credit Card,Blue Ribbon Taxi Association,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),,,
c6cb3aad561e0c407239333d535a4922540f9adc,bca79085da78d157007711d04c6e06f655ee5eafb1e5b654033c2f34fcea1d1fd230b48cff6ab67c598d1baa406d0689fd207404711267f2d56fdd93e3a0e6ec,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,839,7.17,,,76,10,19.75,5.05,0,5,30.3,Credit Card,Flash Cab,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.9850151,-87.80453201,POINT (-87.8045320063 41.9850151008)
c5ad9b572bff9d5cb9a8cf0da173789f7910a835,c7a8a53874bbcdb11e70a488485e8bdd0bb8cc0de8f5d98d3ef4d9c3223d7b4024a6d8fda9c00e8e55f3985e5a728ddce0175359c31e0bc5cd3c008d89b23ed1,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,7,0,,,28,28,30,0,0,0,30.5,Credit Card,Flash Cab,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
c23cbb41a952defb103a40ca767a32c387532614,32f1eebe57165cc17acba84eb8bb85d69a063ed0e5e15e108a68bc4548403834ab9c888ed3c3d474d10a0f95d498a5a9a0ef801331a61db60adec230c48e61b0,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1163,4.5,,,8,6,16.25,0,0,1,17.25,Cash,Chicago Independents,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014)
b3f31e1c4e3673813abc423c9b2e415bdbd1b3a8,acad560bbc140c4015f4685c6559c93b61ccaf0f7d80143fa408d25169652c7b861a3be42e8fbcb91cac8da7e691489626884bc2bab1ae6015710cde1eab4e3f,12/31/2023 11:45:00 PM,01/01/2024 12:30:00 AM,2305,12.63,,,8,47,37,0,0,0,37,Cash,Taxicab Insurance Agency Llc,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.72818206,-87.5964756,POINT (-87.5964755956 41.728182061)
ae20c529b8423608f3f0bfcfa243100219f1241e,d3e38cf4471f5b65aa0c41a155252c395b7c8593ac6fb5741e0cac5f68831f4f418eebc0f19ea0acd5b398d6b376ec7cbb2dfa9b846af33a9d27146e2a009b4b,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,1,0.04,,,27,27,40,0,0,0,40,Cash,5 Star Taxi,41.8789145,-87.70589713,POINT (-87.7058971305 41.8789144956),41.8789145,-87.70589713,POINT (-87.7058971305 41.8789144956)
ac9c9cd082dbdbfb67fc062dcb74ed713820e47f,75cf3a53aae5e5858361a7ca64f75d3407dc0a44d7bc42843fd566a614cf1adcb57d543a15db44103c4801f879ceb236261b336079807dbfd2cd7a7775f166dc,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,439,0.98,,,32,32,6.5,4,0,1,12,Credit Card,City Service,41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841),41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841)
a5a598958fc186bb7c09f6eb61fc57ce7c11f898,a31d2ea87ea4f5a4793c30f84f000b0c6aad4cd956f6ea73b5628ebd509d47a7d051a0b7ccd201a550c236c42ed6e19ce5efbd4caa4bf2d1fcd77827b64b39f9,12/31/2023 11:45:00 PM,01/01/2024 12:30:00 AM,3121,7.63,,,8,39,32,0,0,1,33,Cash,Flash Cab,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.80891628,-87.59618334,POINT (-87.5961833442 41.8089162826)
a55c9af7b91b2239e4e432131062cc342f3cd2fe,dd16496faf01009b70959e7c0d5b86f9bb7f432a1771c5737f75542f886f1c16d56afc454b02a96737372836e47c77ffc172b0bdf80e13640ff9203b7d0d6dbc,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,502,0.38,,,8,8,6.25,2,0,0,8.75,Credit Card,Medallion Leasin,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
9f2747ed96c9f1465b93ce1f5114c907464c5d76,c26dee3edb5d4bce731d586ef40b399162a1c3a05cb5bb035e148b89a986a90612bb81bfe1745453c85fe7ae4a859e566611067e2fb730da1c1e5ce92674dd2d,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,0,0,,,8,8,3.25,0,0,1,4.25,Cash,"Taxicab Insurance Agency, LLC",41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
9937207e717533bf0a3e76621f06857138d6c2df,171ec426eaf8f54c5acbb7e3fde8e0683bfa6042af0b00e428e650cd9bc909011a2517de5b32358b9cb9d9ad3d5a5b26bbb14a09f8b7f5c1c9a37fa22f26f7a7,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,12,0.04,,,33,33,3.25,0,0,0,3.25,Cash,Flash Cab,41.85718386,-87.62033462,POINT (-87.6203346241 41.8571838585),41.85718386,-87.62033462,POINT (-87.6203346241 41.8571838585)
8bcad727d56e9761517e7129cc94ede7274f60fa,eed4cbab8d3be11fc5fcff8f92b3ba140f63602f2760446756572fc4262d89af90bd665c89604918253358364f34ae32113be46209e794e28390e6cca1a87768,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,361,1.16,,,28,28,20,4.1,0,0,24.6,Credit Card,Taxicab Insurance Agency Llc,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
8142f4b1547f4e4a683a80b5a6c7d0325ce09559,f75191fdf728d7ed7f4277ee1e39372c16658b87abc26a057a7e74b79dd5457cb375f859ea318a2aa47f19d24142bc3563cd5b8c0bfa633161570ec9b3686897,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,600,1.1,17031320100,17031320400,32,32,7.5,0,0,2,9.5,Cash,Chicago Independents,41.88498719,-87.62099291,POINT (-87.6209929134 41.8849871918),41.87740612,-87.62197165,POINT (-87.6219716519 41.8774061234)
7de7d6b1667cea33735670f88c50e9631e719f04,756721b3418247472431e2bd1022cc8ce0806af1b6b7dfeb3927318f86819fc67bf385b5829a0f13e006d05aed02020e7c1801204414d5eadf17b3e2ce71ce14,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1138,12.08,,,56,24,31.25,5,0,7,43.75,Credit Card,Medallion Leasin,41.79259236,-87.76961545,POINT (-87.7696154528 41.7925923603),41.90120699,-87.67635599,POINT (-87.6763559892 41.9012069941)
7d27d382cf9f0c93d7d0bf60c14cf7ec523624a0,be7e1462a37397809dadade8e174ef3ccbc3073294df4a0c1786610d3fbbc2cd18543a46938764af03553b89897d23ee9f88307fd485162626822cd306c2fed5,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1236,2.83,,,28,8,13,0,0,0,13,Cash,Blue Ribbon Taxi Association,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
716d7a0a2a097facc3f0f63e326830ecdf923d0a,2d72c5e6313ad93f663008a55045cad0c76164b057dcb756f23448dcfc082f616d8626020794704f296e6ad06f65837ff799930aa961729096167c5ef8612663,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,649,2.57,17031833000,17031330100,28,33,10,1,0,1,12.5,Credit Card,City Service,41.88528132,-87.6572332,POINT (-87.6572331997 41.8852813201),41.85934972,-87.61735801,POINT (-87.6173580061 41.859349715)
6f9899fa6b248a960572d5442018da559c192adb,90a7cf3946c408e70e8d64b08f2bc6819ae5de6159ecef3460c5287031148a66c4c0d4b6b6c53f13919fcb785db502dcd99c94fb60daa9ac6b338f01ed8c3a2b,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,940,5.02,,,1,14,15.75,0,0,0,15.75,Cash,Flash Cab,42.00962288,-87.67016686,POINT (-87.6701668569 42.0096228806),41.968069,-87.72155906,POINT (-87.7215590627 41.968069)
6b1b21ca32da77c68ee5d8816194ac27d9206082,38f6145c9a2b848dc1baa16fd91087e606b12bcb8757a9eb003dfab2c031fcaeb931c1ae6b486fab5f1c21037f33a187d1cb97080f4334a63f7ce0713d0f47b4,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1260,2.8,,,8,28,13,0,0,0,13,Credit Card,Taxi Affiliation Services,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
6722a6b9bd13d58f95c3914e12e1ef8b6b48a507,fc9af5a263f70826b274b29067232130b35f23b91479bb66a0655224a22b586ae2c4f88090c3de82a4f428726dd5018b74b0c84627b9e2cf57ee329c5d794575,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1143,1.12,17031320400,17031081500,32,8,10,0,0,0,10,Cash,City Service,41.87740612,-87.62197165,POINT (-87.6219716519 41.8774061234),41.89250778,-87.62621491,POINT (-87.6262149064 41.8925077809)
6138f0b3dff7fc33cd748727eb6714535a747657,35467b44491f6f51eaa0f4fb1cd65e4c23117aa268d9dd52d88a484194323088fc5a0d30455d37c6bced24218c2ed42b421d7260c119161f29b8381bd11b1784,12/31/2023 11:45:00 PM,01/01/2024 12:45:00 AM,3259,1.97,,,28,8,16,3.7,0,2,22.2,Credit Card,Sun Taxi,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
6132f6fb53c7329ed2e12fef54749d2ffc3d4d2f,b71c6761efe32829e7e453b0c6fcb78a456a7d83c720c746ac0575025dd8c5e3cd6b554288cf71419c89931c34166201ab5f47fe928d5d18e377bad66b8750fc,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1308,4.87,,,8,23,16.25,0,0,0,16.25,Cash,Flash Cab,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.9000696,-87.72091824,POINT (-87.7209182385 41.9000696026)
5f54dc81353c871c63b217a7d117c478dadc3a4b,083b7260314e48be5e10a9191da36fb2c0974b91499a5445d8a895ce901d4458b2a95e4fda48ae1ab55dfac3302268fbf967709c30ef58135041ce5d7d844065,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,1440,4,,,28,7,15,5,0,0,20,Credit Card,Taxi Affiliation Services,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.92268628,-87.64948873,POINT (-87.6494887289 41.9226862843)
5e3f05fac03791828973a9d5e273d756478e76a5,ae61536025042a43c682f2450eaa073da8c7a7f736aec5de1dde1d7e0e2c6be21402ea0d779c9b079b91c58fdccc9091ce99dbf01dbf8de1a81648a34c1f267b,12/31/2023 11:45:00 PM,01/01/2024 12:30:00 AM,1980,0,,,8,,43.75,2,0,7,52.75,Credit Card,Choice Taxi Association,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),,,
5d78f62496278dfb2b96a1c2c6ac428f09ea1ffc,6c6606251e8d2b1609f34d755bf884c4d972ab44b47bd75faa7e533a102e1cc2eed88f9d8d25cda28aaf89c678a3b2d17640c226cbb20375fd3f79685b719945,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,65,0.04,,,32,32,3.5,0,0,0,3.5,Cash,Flash Cab,41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841),41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841)
5bf80daf02fc3a3bab75740a1f72ebc09d4b0fe2,4cebb9edbffeb3a0eace8cccef967730a62f5a978869e216e7855270c48891c6ba7575d0ffe0fc7e5347a411f4b4149a76bb8d65812d6c0607ff975eb7c7f566,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,1323,12.94,,,76,,33.25,5.89,0,5.5,45.14,Credit Card,5 Star Taxi,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),,,
5beed53a8f8bf37104223411fef93b1aa2df46e8,644680ecf5bbb5af6329b0c9d4595c39344cd6c50fababc6e2e17811d9cfe0d67ac4a8b828340bf260428913ffd4b8b82dfaa8e1e83da0e939723ee47ae034c9,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,552,0.56,,,32,32,6.5,2,0,1,10,Credit Card,Sun Taxi,41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841),41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841)
5bcabb19b28a7d07c1e114244476cba232dbfe78,171ec426eaf8f54c5acbb7e3fde8e0683bfa6042af0b00e428e650cd9bc909011a2517de5b32358b9cb9d9ad3d5a5b26bbb14a09f8b7f5c1c9a37fa22f26f7a7,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,463,3.29,,,32,35,11,3,0,0.5,15,Credit Card,Flash Cab,41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841),41.83511799,-87.61867777,POINT (-87.6186777673 41.8351179863)
54b2e6aa52ea342d65be8a7ac93a82650e781319,4ae32e2eb244ce143800e0c40055e537cc50e3358a07ce1e877bf9f91aa6c10db986c727b9d4674705f8d124a18b05a68d07d1bc8d70e95e173f77c2c0437c22,12/31/2023 11:45:00 PM,01/01/2024 12:45:00 AM,4044,1.66,,,32,8,26.25,0,0,0,26.25,Cash,City Service,41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
4f2165874756524b46ba42d39db0f5c59e1c159d,f9f12d79733b1fa7934f8d9bd17ca1927f3c99ded1640bbd1c77ef4f0e8a5992897a445315545bedf550405a9c5bc5a9f4b8b03c7d06a1f36050a32c9733164d,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,9,0,,,33,33,3.25,0,0,47,50.75,Credit Card,Medallion Leasin,41.85718386,-87.62033462,POINT (-87.6203346241 41.8571838585),41.85718386,-87.62033462,POINT (-87.6203346241 41.8571838585)
4e366fa290c59b3d3c6ced770bc8b6b1d3519a0c,071d031c64f608418d27905c9ffe95bf52695615683d5f4e7072ed77fe2757fe623e369ce677a96e4535360841f5f1ad3f1d6de25ecb0e47e8848ec83bce4da3,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,709,1.21,17031081202,17031081700,8,8,8,0,0,0,8,Cash,5 Star Taxi,41.90278805,-87.62614559,POINT (-87.6261455896 41.9027880476),41.89204214,-87.63186395,POINT (-87.6318639497 41.8920421365)
45b165d46f064d1c685e5fa0ff222437970114f8,c1ffe6edab518145aedcfc816682cbfdcab6ecab156dc3d5b230407ef441db82ba5ad37bcea8436642d6a67a8f92e482dc4005c529ce8d2814d31a9681001bac,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,2,0,17031833000,17031833000,28,28,20,0,0,0,20.5,Credit Card,Flash Cab,41.88528132,-87.6572332,POINT (-87.6572331997 41.8852813201),41.88528132,-87.6572332,POINT (-87.6572331997 41.8852813201)
43dd2fec7bbaa6808d3f6ada656f5969b517c9ae,42560393a9c9b9ae28339f4b5aec77fd89bd49916ad54175d9ee679d69939f973c177065f2816d7990a6663a07270335a4a852190c3258497ba7978edced68c8,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,639,1.41,,,24,8,8.25,4,0,1,13.75,Credit Card,Patriot Taxi Dba Peace Taxi Associat,41.90120699,-87.67635599,POINT (-87.6763559892 41.9012069941),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
41bb85ba82698b51f96cebd8915b62767fd0698d,f9448164dcb56f4f31c2b2ad562f31443a01885c2bda20d6325a0747c9857007a024d358c98ed634fba5791a9dcbaf7252302b7dad34fa6bea7e625d2d185d4e,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,1183,11.51,,,56,38,30.25,0,0,4,34.25,Cash,Patriot Taxi Dba Peace Taxi Associat,41.79259236,-87.76961545,POINT (-87.7696154528 41.7925923603),41.81294894,-87.61785968,POINT (-87.6178596758 41.8129489392)
408dbbbbb5efe8825b9802e9e47b73bde2cad640,f29ed34900f8b339ab279eda0189ecae3312801dab967e2c71b537bcc8c744c8a8691d428541cc969b2eceae6fc36a8c6bfde2f469eba49c78ac96fa96665d9d,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,1303,10.65,,,56,28,28.75,0,0,5.5,34.25,Cash,Sun Taxi,41.79259236,-87.76961545,POINT (-87.7696154528 41.7925923603),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
3e119576753a4a807e8b8702c2caa589a92c153c,f78d14baa2d1f80febaa17d73381c2eadb406cf4537522e111615ca2ccc9854f515cf1bb9dc9a9f48c4fdbf3e4e3adac120d6c6cc00dcd2aa715276b28bed0ad,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,960,2.7,,,7,8,11.5,3,0,1,15.5,Credit Card,Choice Taxi Association,41.92268628,-87.64948873,POINT (-87.6494887289 41.9226862843),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
3d4ad7f2659a6f86fecfee4d4f3a8559716ca894,083b7260314e48be5e10a9191da36fb2c0974b91499a5445d8a895ce901d4458b2a95e4fda48ae1ab55dfac3302268fbf967709c30ef58135041ce5d7d844065,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,0,0,,,28,28,3.25,0,0,0,3.25,Cash,Taxi Affiliation Services,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
36fbb628f333cf2a39d450485ea41df93d5b2554,2eda36427e0a5394e90d77488294cd75e2fd87f04acb02c2db58dcfcf473ee221e5404b47fad3df4874a934c12a36244bbadb66f265cbab0c3ff00aa25ac3ed0,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,1698,14.76,,,76,6,37.75,8.45,0,4,50.7,Credit Card,Taxicab Insurance Agency Llc,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014)
363810b6cfd667eace3ef3266ec553a546729ff5,847cf962bd6f62040673e6c24c24940aeb2d7fdaa54677eed6a0aaa4aeef61984916b32d763b4baa6c32476531543bb77e2346cd64f505618f6b9d562243f950,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,420,0.9,17031320100,17031320400,32,32,6.25,4,0,3,13.25,Credit Card,Taxi Affiliation Services,41.88498719,-87.62099291,POINT (-87.6209929134 41.8849871918),41.87740612,-87.62197165,POINT (-87.6219716519 41.8774061234)
36323a8a14400312e7cee05020326b7bf8dc301e,624e8f2a6af3b7f032d3c40d6f925f6fc5f0bf6a358ecc7d01503a55553689cf6b36732e7d972a664d2d55e928baadb5d5d387884a6b0cf2701453f45c14a7cf,12/31/2023 11:45:00 PM,01/01/2024 12:15:00 AM,1440,15.4,,,76,8,39,10.85,0,4,53.85,Credit Card,"Taxicab Insurance Agency, LLC",41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
327fa02e9cb1cc29e7898cf98830f6ade295f9e9,15ddbeeb791d41c7683b885617281c0b548544f189ee3630ea6205078abf793173f13acb37440222ebe2a7a3b701fdfca26b4a2d5d75921a3218ad63ab23aa3a,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1080,8.1,,,56,,22,7.8,0,16.5,46.3,Credit Card,Taxi Affiliation Services,41.79259236,-87.76961545,POINT (-87.7696154528 41.7925923603),,,
32747746b9fd2ac09476eabaac05c588f4f4cb83,e0e1f19080d131afa810280c286bc1f57e78b48fe55e992ac816f33602973ff890905d82df0bec55f791ff66f6f3d7d366281cdb4271de2f2d2acb047b743f32,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,358,0.61,17031081202,17031081500,8,8,5.75,0,0,1,6.75,Cash,Star North Taxi Management Llc,41.90278805,-87.62614559,POINT (-87.6261455896 41.9027880476),41.89250778,-87.62621491,POINT (-87.6262149064 41.8925077809)
2ab8133db10a059ff43e2abddaa7e19c20352451,c19109878e8ba25e09c0e464f8972f146c9d07502de920483fdbf2ef6686a35003a397903f3c330b3ee8f14feb74f29cb8a294bd950e9af6fed94b9dbc267aae,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1003,10.53,,,76,21,27.5,0,0,5.5,33,Cash,Flash Cab,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.9386662,-87.71121059,POINT (-87.7112105933 41.9386661962)
23a321e48c465182b749d4e3d6fb901b39a28c36,c09f5ee2dc22a2a3c342dd27432eb0fe98506ef3698f5b2e066d6c56fd7da673e58d85db53894ac092f9eeae70bac42cbb470ffff8cedadee60cf639c49fc711,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,6,0,,,3,3,40,0,0,0,40.5,Credit Card,Sun Taxi,41.96581197,-87.65587879,POINT (-87.6558787862 41.96581197),41.96581197,-87.65587879,POINT (-87.6558787862 41.96581197)
21ca9b2d87a053138fe98c5ce8a3152ae752c945,e7f8c9242fc38babca76de5c34b1e59b9b7ae3ff40812cb34a7374980b9cfb20213b8cce120d9bf339e7974754eec9bd823adab5f83852410c0af0b1c0a7b6ec,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,11,0,,,8,8,10,2,0,0,12.5,Credit Card,Taxicab Insurance Agency Llc,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
1f99d4a620dd942bfe2e98dc274214751258bc9c,8eca35a570101ad24c638f1f43eecce9d0cb7843e13a75f0af0c911c3e31ddec549c4808e216bcf31634542025c1e7de2442b92d5d7d73463c4e05fd959e47b4,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1004,5.31,,,8,5,17,4.4,0,0.1,22,Credit Card,Sun Taxi,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.94779159,-87.68383494,POINT (-87.6838349425 41.9477915865)
1d518052b3bcea69bdfec3508886d7551406202d,ea8e6df913a36562d8eddf662abe7722f4c0dc08527e9819364aed7a595eb61abb3728e49185db0616deb070840f410f6672275961076b6fcacc6bdfcf9edcf1,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,1080,8.9,,,8,40,24.5,0,0,0,24.5,Cash,Taxi Affiliation Services,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.79235722,-87.61793138,POINT (-87.6179313803 41.7923572233)
1adedef3b9733f6a1859137ce37d8c685ad36cea,7d2e7cdd59237335e96b9b1a897a5e48cb4df467e6c09242b1e9461256f36aa4f9ab9649279f19ab5c8ad32ad7eb683800a0fd566f3f4fbd48323bab81908955,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,121,0.45,,,8,8,4.25,2,0,0,6.75,Credit Card,City Service,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
1686a96446c079079e6b53574c3d4f78da6fcfdc,64b71bd4e488e9c5571cdfcca045e7cb7a4abb0931f17b9689b4049c2a71df3161c216d73fd2cb53ba148afc4bb00df5bcdc9790cd679e9000319c12e37217e3,12/31/2023 11:45:00 PM,01/01/2024 12:00:00 AM,954,0.41,17031081800,17031081700,8,8,8.5,0,0,1,9.5,Cash,Flash Cab,41.89321636,-87.63784421,POINT (-87.6378442095 41.8932163595),41.89204214,-87.63186395,POINT (-87.6318639497 41.8920421365)
0353da5e93e2f5f973dc685d76fcd15f6bc0256e,ad4b1730fcbfdb84e41313179a688924012db322823f487d70ffcdbf1fa0e9ec11c35045af7e7cf561db41f5a46939ab7ea0565dc6fa26a0d14f68f6f568b92e,12/31/2023 11:45:00 PM,12/31/2023 11:45:00 PM,12,0.57,,,24,24,4.5,0,0,17,21.5,Cash,5 Star Taxi,41.90120699,-87.67635599,POINT (-87.6763559892 41.9012069941),41.90120699,-87.67635599,POINT (-87.6763559892 41.9012069941)
ef9aabfd57aa87f78421e37dcc6225790e777cb6,093e9e4c05ea53bf75c51763839d5f5bd5d1785c11ee5ec5e805c14bcb833c9fbfd81ab2b7874a85cba14046e54062335b2221738a0bb0bf1ddcfe83a7efa382,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1920,3.76,17031081201,17031330100,8,33,18,0,0,0,18,Cash,City Service,41.89915561,-87.62621053,POINT (-87.6262105324 41.8991556134),41.85934972,-87.61735801,POINT (-87.6173580061 41.859349715)
6ede890339b9db28cce204e37f36a312c2f073d1,3f46ef398d3308fb9794b8c5de450a88439d16c47b77b79398f0e84b804e7aad4789cb5ee08f74c8b7f89a444653706802a31cfdc8b99d3867a16794641fb759,12/31/2023 11:30:00 PM,12/31/2023 11:30:00 PM,56,0.03,17031980000,,76,,3.5,0,0,0,3.5,Cash,Sun Taxi,41.97907082,-87.90303966,POINT (-87.9030396611 41.9790708201),,,
6be0efd956926d6600d23f45470d638f3c5c01c3,c797f1560410b9db343567ea7c8e4095f66ceb65800fa466623d4695efdf3151679fb9bfe88ee18d47096e518c23d9c517e741de11df233e4c4bc11da8c3d8b1,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,578,2.63,,,4,77,9.75,4,0,0,14.25,Credit Card,Flash Cab,41.97517094,-87.68751552,POINT (-87.6875155152 41.9751709433),41.9867118,-87.66341641,POINT (-87.6634164054 41.9867117999)
fc955fb2be6161f771c45ab35fd37b08e13dc1f6,d461dc72b7a599bfba3f33fae867f5530e0c5aa5c200d89b4a5cbd270da1eba6488b76e3ce70a8371b8242f3529dd35a1230f16dc7e7bf2626243840f6261b97,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,1046,0.61,17031320100,17031839100,32,32,9.25,2,0,0,11.75,Credit Card,Taxicab Insurance Agency Llc,41.88498719,-87.62099291,POINT (-87.6209929134 41.8849871918),41.88099447,-87.63274649,POINT (-87.6327464887 41.8809944707)
ed59a2b85933b8086d71aa55c04f85bbfa3f37c6,698ec513d27602fcd211bb62440a555a3f23bebbe3b2a1ec9ba6466a63bab46628c6dc1622de7c3dcbe4b0f98a7048c9fb07ae9c8a1572f432db4df68b1a4803,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1504,13.51,,,76,,34.5,8.4,0,7,50.4,Credit Card,Taxicab Insurance Agency Llc,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),,,
e9dfed1215cf9b95d528aabfdd3cab775b255913,0bea3de3c36237d68b009b24ee3db86c78e9e618a73a3b5776e5f4bba06775f91b3520db910d24b97d577e57c4372f5d9d2eb58d338f3a0add0e37c0f71f6701,12/31/2023 11:30:00 PM,12/31/2023 11:30:00 PM,660,0.9,17031081700,17031081201,8,8,7.25,0,0,0,7.25,Cash,Taxi Affiliation Services,41.89204214,-87.63186395,POINT (-87.6318639497 41.8920421365),41.89915561,-87.62621053,POINT (-87.6262105324 41.8991556134)
e9b1cfd8bc49629663f84f697badf17a88b7ab1f,bb4e75d3065311c33024a434640731c43fd2cf9e4482eb9e17cbf9f0ff0ed005455ffe22797df66b7467489a738e7be52c5983e16615b31c7c1d6af3ee0eb965,12/31/2023 11:30:00 PM,01/01/2024 12:15:00 AM,1920,1.2,,,24,74,50,0,0,0,50,Cash,Taxi Affiliation Services,41.90120699,-87.67635599,POINT (-87.6763559892 41.9012069941),41.69487897,-87.7131925,POINT (-87.7131924966 41.6948789661)
e92cbef122dee337d7502c7177916016d36e964b,847cf962bd6f62040673e6c24c24940aeb2d7fdaa54677eed6a0aaa4aeef61984916b32d763b4baa6c32476531543bb77e2346cd64f505618f6b9d562243f950,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,720,1.4,17031081700,17031320100,8,32,8.25,4,0,1,13.25,Credit Card,Taxi Affiliation Services,41.89204214,-87.63186395,POINT (-87.6318639497 41.8920421365),41.88498719,-87.62099291,POINT (-87.6209929134 41.8849871918)
e8473ad9fa9148bc0feaccfe68caaeef0ca1f648,259d38cfdbc9ac6f9bb01f0df740e0ddf4a631a70bbdd6525862b20b7ed0e0554dbbde64c2955b8b6f41468c8970e5507490db36348f884461783472621bda08,12/31/2023 11:30:00 PM,12/31/2023 11:30:00 PM,480,1.5,,,8,8,7.5,0,0,1,8.5,Cash,"Taxicab Insurance Agency, LLC",41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
e1968da25611d8cee47d5088502f9ab1c76877c8,56a1119c6ca57e39525cf06829f9ecff553cf4b5ac24821259d086c8ab30406ec45ae77335c646417897d2f4916479c3ed8b6313c2ccb9fb3fc248a4c3387800,12/31/2023 11:30:00 PM,12/31/2023 11:30:00 PM,4,0,17031081403,17031081403,8,8,15,0,0,0,15.5,Credit Card,Medallion Leasin,41.89092203,-87.61886836,POINT (-87.6188683546 41.8909220259),41.89092203,-87.61886836,POINT (-87.6188683546 41.8909220259)
e1739faf183448c03ab821871c96de984ace8697,552720f76dd5338d0cf254f8eb4045839a5501e095a0d34fee849df1633dce909ed9b7e001b6e904f64f5b235fc56377ef450dd8c29f16fcbd7a7c2116386654,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,1048,0.88,17031081202,17031081500,8,8,9.5,0,0,1,10.5,Cash,Choice Taxi Association Inc,41.90278805,-87.62614559,POINT (-87.6261455896 41.9027880476),41.89250778,-87.62621491,POINT (-87.6262149064 41.8925077809)
e151cf39ae70e33ac5df78ac76ca2c3706216321,913c95ba782fa447b7c55fbfc38d040907d13e7ddf7282a75fe448d2a25082dcdab927cd930805ea14e62bc534d5288669b15f73751bd43a746cc9e3bbddb2f4,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,830,2.66,,,7,28,10.5,0,0,0,11,Credit Card,Flash Cab,41.92268628,-87.64948873,POINT (-87.6494887289 41.9226862843),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
de5e1fc2ebc09d6c56a83685e61b15d582059d2b,8be2c5887fd81a4918e0464359436d6fc5ed1dbbe4f5b0317403a4ead72c95b37979d3534c5ede73a29ceace53bdc820f692587ba4a88f651de93b329e4cf2f8,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1320,16.3,,,76,28,40,8.9,0,4,52.9,Credit Card,Taxi Affiliation Services,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
dd01d2cbf95e044799a49c5988de327fd0b4ed2f,b875e9e053d893ee490e723c96773ed5f81c0a2339545f941b006167253d2bd537c68266a6c87ecda89948c234e7ae93ae51869767a4e8345492b407ab62e424,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,637,3.96,,,8,6,13,4.35,0,1,18.85,Credit Card,Taxicab Insurance Agency Llc,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014)
daed3f1e3c0cd866cca1f50fe0513d250ea80eff,f9bc93a0ba6b1f18c9709a96c99bb9c5a99054b1711f80ddfe986a0f78a02470f146c48fd7d66bf74fe374f53d032676cb2fb7871afce4314dfdac97d4f22d32,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1674,8.3,,,28,77,25.25,6.44,0,0,32.19,Credit Card,City Service,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.9867118,-87.66341641,POINT (-87.6634164054 41.9867117999)
d949547c3f9f0bbce64bce18b50cca6df60f88f2,b52493d43f7de565ab5eaaa0b1238709ac2073a9cdd626a411f99151188aa290435bada1d7c0119f6423891cbc9c3ce5c9ddd4ed068bca8e8aaf75cedacb9f0d,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,900,7.9,,,76,12,22,5.3,0,4,31.3,Credit Card,Taxi Affiliation Services,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.99393013,-87.75835359,POINT (-87.7583535876 41.9939301285)
d45e012bbd6fffa5ffa8aba12b6d61961c89e9e0,73052f4ccaf4e0fa9178722e491f8e5eda869f56e08aa4d659ef38139d36bd69df925ac00c96564af9fc30db0c616fb4c9312cb3ba36fe8c8cdd470750d4681d,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1320,0,,,56,28,25.5,5.02,0,4,34.52,Credit Card,Taxi Affiliation Services,41.79259236,-87.76961545,POINT (-87.7696154528 41.7925923603),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
d12be01c208e273389ccc1f306cbd9cc98bfc73d,f6aac57dbd69c58200d6fb22bbffe1343ca6ea5eece073d452f97009803408626e6357e405c35bffde1495f078c83419ca0378e26d8dca04c6b81644638cefc9,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,1260,0.2,,,6,8,16.5,0,0,0,16.5,Cash,Taxi Affiliation Services,41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
d115aa399492b5b2bd4faed5d6c4fd36122918aa,24d4c5e51d147aecbb7c4a1ad70c38dbc05c7b4485f6de4b41fee5ea270228859c19c4160b101df87b889acc5cc7ab49b5e0491676026518677d34ea81f76591,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,1038,10.24,,,76,14,27,9.75,0,5,42.25,Credit Card,Taxicab Insurance Agency Llc,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.968069,-87.72155906,POINT (-87.7215590627 41.968069)
cee99899eff1d446626df83a97b5b5f0571c7ec4,7d8179131ea9952793af4cda8635e94b56c2b92d3c376cd92517f7319ec4a3031207af4d7b8165367e1f8a185275814ab89c26ace551ac3bf96a04ea174371c1,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,696,1.22,17031081500,17031833000,8,28,8,2,0,0,10.5,Credit Card,5 Star Taxi,41.89250778,-87.62621491,POINT (-87.6262149064 41.8925077809),41.88528132,-87.6572332,POINT (-87.6572331997 41.8852813201)
ce3d8fe7f1f0906a2502c728358705f6f547d872,6898e40854937399e0ef25dad63740d21b20593439090721b2f747b039ab24c5e74ebda726b98515a4b4c6b7dd9f87717cfc7c1b52e3a8b4d52245214f09d229,12/31/2023 11:30:00 PM,12/31/2023 11:30:00 PM,16,0.02,,,28,28,35,5.32,0,0,40.82,Credit Card,Flash Cab,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383)
ce0f3501cd03c9e1795241db4da2a1285559f906,1de191ccc486f8d0e0e6b25a03d592e58ed4511cfed79e912c895802cb808ea9a2d609cc77a536d7ad6431160a92d39dc6b79ec88381059d540f95215364582e,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1260,11.2,,,76,2,29,5,0,4,38,Credit Card,Taxi Affiliation Services,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),42.00157103,-87.69501259,POINT (-87.6950125892 42.001571027)
cc3f3f6214a8b4ad9f15f472e0ea734b441728fa,599e7935e8f7321862152296420d8552c36d7fe97517f0bec1048c18ed7f2a434e06c55f5e16e5ec5d1da15f02eb49079258d610b196bd9eb0cf4183166878e4,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,776,4.3,,,56,30,13.75,4.81,0,5,24.06,Credit Card,Flash Cab,41.79259236,-87.76961545,POINT (-87.7696154528 41.7925923603),41.83908691,-87.71400381,POINT (-87.714003807 41.8390869059)
cb026ca7cca9c89c7bf96c3efcd12376b2800fc3,4477f5eda3c0c9379d7526db1b5029184a7d75a2adcad3b338b20c83f351865360b02546bc50125c663edc0ed86b206261dc50f7f002199e4d0880802c51311d,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,1028,5.45,,,8,6,17,4.38,0,0,21.88,Credit Card,Medallion Leasin,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014)
cab8410b2d60210e11fa09bc929a1a6ba0696084,e533bfdc483206f9c02c1c879a118d88f0a3ca1cd2703f3cf88e318716bbbb0c71d5f1c5f86b042b4ee1a06dbc750fa840acec0ebaf5fc1d90edbdc215114a1d,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1736,17.38,17031980000,17031081201,76,8,53.94,0,0,0,53.94,Cash,Flash Cab,41.97907082,-87.90303966,POINT (-87.9030396611 41.9790708201),41.89915561,-87.62621053,POINT (-87.6262105324 41.8991556134)
c38ed8ec5467492783aed71363709b87a31ac8d9,a62df4e9bfec5f3babb7922b1346263cce5c3116fa5fa3465e4845b94774ef86b30bb243f80f396cf2211d7e3309028193f159567ee341b44d39dc3a1f5495f9,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1204,4.79,,,6,15,15.75,0,0,0,15.75,Cash,Medallion Leasin,41.9442266,-87.65599818,POINT (-87.6559981815 41.9442266014),41.95402765,-87.76339903,POINT (-87.7633990316 41.9540276487)
bfbc39a914248481c0b5c4899d1d0ca4f54f9851,d2a9362483decbe7b2d28d38ff371f05fafd542a60e8c9e4d5e3150e2c0b41b9e2b69e561807859b9ccb8c0f27922c183460b9f858df03f53b9656b8b829ceb1,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,2168,23.1,17031980000,17031839100,76,32,57.5,0,0,5,62.5,Cash,Flash Cab,41.97907082,-87.90303966,POINT (-87.9030396611 41.9790708201),41.88099447,-87.63274649,POINT (-87.6327464887 41.8809944707)
bdb3a185915415969458eeaf805d9a0012252754,3b95cedc13d4a99243e1974616a6a25267c25878336faa586d4372370c847c3618753718dca887e3713efb476a5af11bc5f5a86c9785c9f749ca45f3f8be4764,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,480,0,17031839100,17031833000,32,28,7.5,0,0,0,7.5,Cash,Taxi Affiliation Services,41.88099447,-87.63274649,POINT (-87.6327464887 41.8809944707),41.88528132,-87.6572332,POINT (-87.6572331997 41.8852813201)
bcab4490118535b7135fe1394c37507df7f6ad90,3665a72ee495b03f4dae72307dc6e5e58e21518f77d8e67dcd386c3b9daa1a0db86555cef4a877234542af8d1c0da6fa7a28a4e0e643e382236470d569d78668,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1800,14.6,,,76,,37,8.7,0,6,51.7,Credit Card,Taxi Affiliation Services,41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),,,
b8f99264a52fe95e5160333451d51cd83f3b34c8,924ad289d7377302678c3954095a96778a3a5b2a9a2a69d5335f59ff00e672ea71a0d95fe058ecfa5d53fb86dc1eba63a1af3e51ab33ec04e8b5a679c91f564b,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,1380,0,,,76,24,41,0,0,5,46,Cash,"Taxicab Insurance Agency, LLC",41.98026432,-87.9136246,POINT (-87.913624596 41.9802643146),41.90120699,-87.67635599,POINT (-87.6763559892 41.9012069941)
b819f56e53b38f9d75067716f5701a3bdbae8761,422aa525858cddde977f39fa4e58947555918726746ebd72be48d2a2d09af86e2b5e5318fea36ecc84de5b6af8354307064e73672f67e4bc907dcabc21e61c09,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,1020,0.2,,,28,7,12.75,0,0,0,12.75,Cash,Taxi Affiliation Services,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.92268628,-87.64948873,POINT (-87.6494887289 41.9226862843)
b1a91a0cbb11e273aa8fe96eaf32cb26389570cf,c0d525ee45b1b77f1fcc69c7c56ff91661795d15482cc46a75ca8164ea25736a32169b1ba73fb5eee3ee98e629942c90ed23a5998f29bfe050afd3a08608a9a9,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,1140,0.1,,,32,8,11.5,0,0,2,13.5,Cash,Taxi Affiliation Services,41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
aa31ae6e712e0a21598087799a7439a280301056,82bc059c3b13e97341f941d60f772ae9f83687498e91f7c399644ec42449cced734834174cb0a29955229b910c3c9810dc67997226ce38a8600bf7f24d149423,12/31/2023 11:30:00 PM,12/31/2023 11:30:00 PM,5,0,,,32,32,25,5.1,0,0,30.6,Credit Card,5 Star Taxi,41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841),41.87886558,-87.62519214,POINT (-87.6251921424 41.8788655841)
a96c2ef996a9458f48eecfe4f62e2fcb0790cb9d,e8d374b4e7bc344add5893f1a1ae3b611823439ac1caf06087c4bf2cb6fe114201a38c49646f2851d86f9e73c21c1e99f29903cceb84612fd1118e3224528b2b,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,720,0.4,,,,,7.5,0,0,0,7.5,Cash,Taxi Affiliation Services,,,,,,
a4d44dd31babbf742d35571b483e2f8ab7f5256a,0cae7ec64456b1830bd58df1991f046410f5506cf28b3aa16b6d5c4940b44ff0ca069324233093161b43d212c2c5eac61536cfa6e3284117bb62ee4105e945b1,12/31/2023 11:30:00 PM,12/31/2023 11:30:00 PM,0,0,,,77,,3.25,0,0,0,3.25,Cash,Taxi Affiliation Services,41.9867118,-87.66341641,POINT (-87.6634164054 41.9867117999),,,
a2136304c06eeb1897684e0402905d1d2b528cc8,42560393a9c9b9ae28339f4b5aec77fd89bd49916ad54175d9ee679d69939f973c177065f2816d7990a6663a07270335a4a852190c3258497ba7978edced68c8,12/31/2023 11:30:00 PM,12/31/2023 11:30:00 PM,239,0.7,,,28,8,5.25,2,0,1,8.75,Credit Card,Patriot Taxi Dba Peace Taxi Associat,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
a1383cbd5fab084a75d9a0c6302d33e3cb6104d3,b4ac2893286a7c3a55df851a3732ea65d7fb82e1da7a19f728a71651761babfd88544152301073319650be263fd4e1aabc072601e6f452daf22ceb764f8d70d5,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,475,0.45,,,8,8,6,0,0,1,7,Cash,City Service,41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111),41.89960211,-87.63330804,POINT (-87.6333080367 41.899602111)
9f072b4e70e16ebb84b0a2f6ff718150ecc3e345,00f4b381570486f8575cbaa57ed41f116ed2e1f9d85f73bb2f6dba13a72541761d2ad4cb1727990d97795a2b0bdd99f0e4a8826245c81dac443cebb1c19b26fb,12/31/2023 11:30:00 PM,01/01/2024 12:15:00 AM,2460,1.1,,,56,3,48.25,16.4,0,6,70.65,Credit Card,Taxi Affiliation Services,41.79259236,-87.76961545,POINT (-87.7696154528 41.7925923603),41.96581197,-87.65587879,POINT (-87.6558787862 41.96581197)
9ca870a06a41bcf8a0472890df4d404d7d592d5d,42e3ec7750e4be6e56c47bcdefe5cb86ddb0d0c65bcf4d09773512b3e854ed08adeacdad835a4e92a8ca871021858984bb70a72c1dc17d22b49d2f664a6e0fd2,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1399,16.2,17031980000,17031839100,76,32,40.25,0,0,29,69.25,Cash,Taxicab Insurance Agency Llc,41.97907082,-87.90303966,POINT (-87.9030396611 41.9790708201),41.88099447,-87.63274649,POINT (-87.6327464887 41.8809944707)
97d0f9bb2bc7aed4e8c84da3444743f9f1256d32,f1c4fb891f4812fb2865e801d2185b401283b34401b71f25cafc8b108f48241363276826a3fe8f4830d1979de0179f5850a26e115de686d6af99b79e66218656,12/31/2023 11:30:00 PM,01/01/2024 12:00:00 AM,1680,0.6,,,28,77,30,5,0,1,36,Credit Card,Taxi Affiliation Services,41.87400538,-87.66351755,POINT (-87.6635175498 41.874005383),41.9867118,-87.66341641,POINT (-87.6634164054 41.9867117999)
97059ca7943e828e9b3b5da926d6f27d6ddc9f30,f81c929ea7d9107e6de8bd7ee335f42563b3413e967e98288480648a66455138dcdcde8b46b353ca4d6c287be49cd3087636ba13de6b7db6db3854c2ac8a157f,12/31/2023 11:30:00 PM,12/31/2023 11:45:00 PM,240,0.9,,,7,7,5.25,0,0,1,6.25,Cash,Taxi Affiliation Services,41.92268628,-87.64948873,POINT (-87.6494887289 41.9226862843),41.92268628,-87.64948873,POINT (-87.6494887289 41.9226862843)
II. Experimental Steps
(1) Experiment title: classification prediction with SVM
Program output:
============================================================
基于SVM进行分类预测
============================================================
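(2) Loading the CSV file
Mathematical model: the input is a data matrix $X$ and a label vector $y$.
1. Filtering the data: the CSV file used here has 2002 lines (including the header) and 23 columns. This does not fully meet the experiment requirement, because training on a 100,000-record data set on a CPU was too slow to finish in a reasonable time. The loader prints the shape, a preview and the data types as follows: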
步骤1: 数据加载
文件 linear.csv 原始形状: (2001, 23)
前几行数据:
Trip ID ... Dropoff Centroid Location
0 011106b6114f83af0c17aace3867a464a7fc742b ... POINT (-87.6262149064 41.8925077809)
1 e9a66ddcc78cfd79f419165314cbe5ee380f16c3 ... NaN
2 e765192268db3480b5d9bd0443f7ce7fd5ba047d ... POINT (-87.6559981815 41.9442266014)
3 c6510d4f82541cfacf8c20cab44fbb7c0b2c5efe ... NaN
4 f9445eed26da9a0eff247350df942616cb51e764 ... POINT (-87.6559981815 41.9442266014)
[5 rows x 23 columns]
数据类型:
Trip ID object
Taxi ID object
Trip Start Timestamp object
Trip End Timestamp object
Trip Seconds float64
Trip Miles float64
Pickup Census Tract float64
Dropoff Census Tract float64
Pickup Community Area float64
Dropoff Community Area float64
Fare float64
Tips float64
Tolls float64
Extras float64
Trip Total float64
Payment Type object
Company object
Pickup Centroid Latitude float64
Pickup Centroid Longitude float64
Pickup Centroid Location object
Dropoff Centroid Latitude float64
Dropoff Centroid Longitude float64
Dropoff Centroid Location object
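The data types above suggest the following selection. Latitude and longitude vary over a very small range, so they are not used. Among the non-numeric (object) columns, only column 16 (Payment Type) is usable; it is mapped so that Cash becomes -1 and Credit Card becomes 1. Columns 5 to 15 are numeric (float64) and can be processed directly, so columns 5-16 are selected.
2. Processing the data: feature-column processing function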
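Process model: every column $x_j$ of the feature matrix $X$ is converted to numeric values and then has its missing values filled:
If the conversion succeeds, $x_j \leftarrow \mathrm{numeric}(x_j)$.
If $x_j$ contains missing values (NaN), they are replaced by the column mean: $x_{ij} \leftarrow \bar{x}_j$.
Numeric conversion and missing-value handling
Mathematical model: for a vector $x$, numeric conversion and missing-value filling proceed as follows:
Try to convert $x$ to a numeric vector; entries that cannot be parsed become NaN.
For the missing entries, compute the mean of $x$ over the observed entries: $\bar{x} = \frac{1}{|O|} \sum_{i \in O} x_i$, where $O = \{ i : x_i \text{ is not NaN} \}$.
Fill the missing entries: $x_i \leftarrow \bar{x}$ for every $i \notin O$.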
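Label processing function (the label column is handled separately)
Mathematical model: for the label vector $y$, define a mapping $f: \mathcal{L} \to \{-1, +1\}$:
For string labels, $f(\text{Cash}) = -1$ and $f(\text{Credit Card}) = +1$.
For a numeric label $\ell$, $f(\ell) = -1$ if $\ell \le 0$ and $f(\ell) = +1$ otherwise.
For a missing value (NaN), the sample is skipped.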
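The conversion, mean-filling and label-mapping rules above can be sketched on a toy frame. This is a minimal illustration rather than the loader used in this report: the three-row DataFrame and the helper name `map_label` are assumptions made for the example, and the helper only distinguishes Cash from everything else.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one unparseable fare and one missing label.
df = pd.DataFrame({
    "Fare": ["8.75", "bad", "10.25"],
    "Payment Type": ["Credit Card", "Cash", None],
})

# Numeric conversion: entries that cannot be parsed become NaN.
fare = pd.to_numeric(df["Fare"], errors="coerce")
# Mean filling: replace NaN with the mean of the observed entries.
fare = fare.fillna(fare.mean())

def map_label(label):
    """Hypothetical helper: Cash -> -1, anything else -> +1, missing -> None."""
    if pd.isna(label):
        return None
    return -1 if "cash" in str(label).lower() else 1

labels = [map_label(v) for v in df["Payment Type"]]
keep = [lab is not None for lab in labels]      # rows with a usable label
X = fare[keep].to_numpy().reshape(-1, 1)        # features of the kept rows
y = np.array([lab for lab in labels if lab is not None])
```

Dropping the row with the missing label from both `X` and `y` in the same step keeps features and labels aligned.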
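Because columns 5-16 may contain missing values and outliers, the data needs to be cleaned and standardized. The processing output is as follows: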
dtype: object
处理列 '特征列 Trip Seconds', 原始类型: int64
处理列 '特征列 Trip Miles', 原始类型: float64
处理列 '特征列 Pickup Census Tract', 原始类型: float64
列 '特征列 Pickup Census Tract' 中有 1287 个值无法转换为数字,将使用均值填充
处理列 '特征列 Dropoff Census Tract', 原始类型: float64
列 '特征列 Dropoff Census Tract' 中有 1337 个值无法转换为数字,将使用均值填充
处理列 '特征列 Pickup Community Area', 原始类型: float64
列 '特征列 Pickup Community Area' 中有 66 个值无法转换为数字,将使用均值填充
处理列 '特征列 Dropoff Community Area', 原始类型: float64
列 '特征列 Dropoff Community Area' 中有 317 个值无法转换为数字,将使用均值填充
处理列 '特征列 Fare', 原始类型: float64
列 '特征列 Fare' 中有 2 个值无法转换为数字,将使用均值填充
处理列 '特征列 Tips', 原始类型: float64
列 '特征列 Tips' 中有 2 个值无法转换为数字,将使用均值填充
处理列 '特征列 Tolls', 原始类型: float64
列 '特征列 Tolls' 中有 2 个值无法转换为数字,将使用均值填充
处理列 '特征列 Extras', 原始类型: float64
列 '特征列 Extras' 中有 2 个值无法转换为数字,将使用均值填充
处理列 '特征列 Trip Total', 原始类型: float64
列 '特征列 Trip Total' 中有 2 个值无法转换为数字,将使用均值填充
成功加载文件: linear.csv, 特征数据形状: (2000, 11), 标签数量: 2000
处理标签数据,类型: <class 'numpy.ndarray'>, 形状: (2000,)
标签的唯一值: ['Cash' 'Credit Card']
处理后的标签分布: -1 (Cash): 969, 1 (Credit Card): 1031
数据处理完成 - 特征维度: (2000, 11), 标签分布: 负类(-1): 969, 正类(1): 1031
成功加载CSV文件
(3) Data inspection and preprocessing
The data dimensions and the class distribution are checked; the output is as follows:
步骤2: 数据检查与预处理
数据维度: X=(2000, 11), y=(2000,)
类别分布 - 类别(-1): 969, 类别(1): 1031
(4) Data standardization
The data is split into a training set and a test set (80% / 20%); the output is as follows:
步骤3: 数据标准化
训练集大小: (1600, 11)
测试集大小: (400, 11)
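A hedged sketch of an 80% / 20% split followed by standardization. The synthetic matrix only stands in for the 2000 × 11 feature matrix, and `random_state=42` and `stratify=y` are assumptions of the example. Fitting the scaler on the training part only is also a choice made here to keep test statistics out of training; the `preprocess_data` function in Section III fits the scaler on all samples instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 11))              # synthetic stand-in for the features
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # synthetic labels in {-1, +1}

# 80% training data, 20% test data, class proportions preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.shape, X_test_s.shape)
```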
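(5) Training with a hand-written SMO algorithm
1. Support vector machine principle
The basic idea of a support vector machine (SVM) is to find an optimal hyperplane in feature space such that samples of different classes lie on opposite sides of it and the margin between them is as large as possible.
Primal optimization problem:
$$\min_{w, b} \ \frac{1}{2} \|w\|^2$$
Constraints:
$$y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, m$$
To handle data that is not linearly separable, slack variables $\xi_i$ and a penalty parameter $C$ are introduced:
$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
Constraints:
$$y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m$$
2. Lagrangian dual problem
Introducing Lagrange multipliers $\alpha_i$ turns the primal problem into its dual:
$$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
Constraints:
$$0 \le \alpha_i \le C, \quad \sum_{i=1}^{m} \alpha_i y_i = 0$$
3. Kernel function definitions
A kernel function computes an inner product in a high-dimensional feature space. Commonly used kernels include:
Linear kernel: $K(x, z) = x^\top z$
Polynomial kernel: $K(x, z) = (x^\top z)^d$
RBF kernel: $K(x, z) = \exp(-\gamma \|x - z\|^2)$, where $\gamma = \frac{1}{2\sigma^2}$
Sigmoid kernel: $K(x, z) = \tanh(\beta\, x^\top z + c)$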
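As a quick numerical check of these kernel definitions, the sketch below evaluates the four kernels on two small vectors with NumPy. The vectors and the parameter values (d = 2, σ = 1, β = 0.5, c = 0) are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

d, sigma, beta, c = 2, 1.0, 0.5, 0.0    # illustrative parameters
gamma = 1.0 / (2.0 * sigma ** 2)        # RBF width expressed as gamma

k_linear = float(x @ z)                               # x^T z
k_poly = float((x @ z) ** d)                          # (x^T z)^d
k_rbf = float(np.exp(-gamma * np.sum((x - z) ** 2)))  # exp(-gamma * ||x - z||^2)
k_sigmoid = float(np.tanh(beta * (x @ z) + c))        # tanh(beta * x^T z + c)
```

Here $x^\top z = 4$ and $\|x - z\|^2 = 6.25$, so the four values are 4, 16, $e^{-3.125}$ and $\tanh(2)$.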
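4. Sequential minimal optimization (SMO)
SMO iteratively picks two Lagrange multipliers and optimizes them while the others stay fixed. Writing $g(x) = \sum_i \alpha_i y_i K(x_i, x) + b$, $E_k = g(x_k) - y_k$ and $K_{ij} = K(x_i, x_j)$, the key steps are:
(1) Select the multipliers: choose two variables $\alpha_i$ and $\alpha_j$ that violate the KKT conditions.
(2) Compute the bounds $L$ and $H$ from the constraints $0 \le \alpha_i, \alpha_j \le C$ and $\sum_i \alpha_i y_i = 0$:
When $y_i \ne y_j$:
$L = \max(0, \alpha_j - \alpha_i)$, $H = \min(C, C + \alpha_j - \alpha_i)$
When $y_i = y_j$:
$L = \max(0, \alpha_i + \alpha_j - C)$, $H = \min(C, \alpha_i + \alpha_j)$
(3) Update $\alpha_j$: $\alpha_j^{new} = \alpha_j - \frac{y_j (E_i - E_j)}{\eta}$
where $\eta = 2K_{ij} - K_{ii} - K_{jj}$
(4) Clip $\alpha_j$: $\alpha_j^{new} \leftarrow \min(H, \max(L, \alpha_j^{new}))$
(5) Update $\alpha_i$: $\alpha_i^{new} = \alpha_i + y_i y_j (\alpha_j - \alpha_j^{new})$
(6) Compute the intercept $b$ from $b_1 = b - E_i - y_i (\alpha_i^{new} - \alpha_i) K_{ii} - y_j (\alpha_j^{new} - \alpha_j) K_{ij}$ and $b_2 = b - E_j - y_i (\alpha_i^{new} - \alpha_i) K_{ij} - y_j (\alpha_j^{new} - \alpha_j) K_{jj}$:
If $0 < \alpha_i^{new} < C$, then $b = b_1$
If $0 < \alpha_j^{new} < C$, then $b = b_2$
Otherwise $b = (b_1 + b_2) / 2$
(7) Decision function
After optimization, the decision function is:
$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right)$$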
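One pair update, steps (2) to (6), can be traced on a tiny example. The sketch below is illustrative only: the function name `smo_pair_update`, the two-point data set and C = 1 are assumptions made for the example, and the pair is passed in directly instead of being chosen by a KKT check. The full training loop is the `SMO` function in Section III.

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, y, K, C):
    """Apply one SMO update to the pair (alpha_i, alpha_j); return new (alpha, b)."""
    alpha = alpha.copy()
    g = (alpha * y) @ K + b                  # raw scores g(x_k) for all points
    Ei, Ej = g[i] - y[i], g[j] - y[j]
    ai_old, aj_old = alpha[i], alpha[j]
    # Step (2): bounds that keep 0 <= alpha <= C and sum(alpha * y) unchanged
    if y[i] != y[j]:
        L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
    else:
        L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
    eta = 2 * K[i, j] - K[i, i] - K[j, j]
    if L == H or eta >= 0:
        return alpha, b                      # this pair cannot make progress
    # Steps (3)-(5): update alpha_j, clip it, then update alpha_i
    alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
    alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
    # Step (6): intercept
    b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] - y[j] * (alpha[j] - aj_old) * K[i, j]
    b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] - y[j] * (alpha[j] - aj_old) * K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = (b1 + b2) / 2
    return alpha, b

X = np.array([[1.0, 0.0], [-1.0, 0.0]])      # two points, one per class
y = np.array([1.0, -1.0])
K = X @ X.T                                  # linear kernel matrix
alpha, b = smo_pair_update(0, 1, np.zeros(2), 0.0, y, K, C=1.0)
```

On this example a single update already gives $\alpha = (0.5, 0.5)$ and $b = 0$, which corresponds to $w = (1, 0)$ and places both points exactly on the margin.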
The number of support vectors, the weight vector and the bias term are printed as follows:
步骤4: 手动SMO算法训练
支持向量个数: 1384
权重向量 w = [-0.1751, 0.9524]
偏置项 b = -0.2822
Decision boundary visualization
Mathematical model: the decision boundary is drawn from the SVM decision function.
手动SMO SVM - 准确率: 0.5975, 精确率: 0.7068, 召回率: 0.4352, F1: 0.5387
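(6) Comparison of different kernel functions
Confusion matrix and evaluation metrics
Mathematical model: classification performance metrics, with true label $y_i$ and predicted label $\hat{y}_i$
True positives (TP): $TP = |\{ i : y_i = 1,\ \hat{y}_i = 1 \}|$
True negatives (TN): $TN = |\{ i : y_i = -1,\ \hat{y}_i = -1 \}|$
False positives (FP): $FP = |\{ i : y_i = -1,\ \hat{y}_i = 1 \}|$
False negatives (FN): $FN = |\{ i : y_i = 1,\ \hat{y}_i = -1 \}|$
Metric definitions:
Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
Precision: $\frac{TP}{TP + FP}$
Recall: $\frac{TP}{TP + FN}$
F1 score: $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$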
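The formulas can be checked against the linear-kernel line in the comparison output. The confusion-matrix counts used here (TP = 187, TN = 184, FP = 0, FN = 29) do not appear in the source; they are an assumption, inferred from the reported metrics and the 400-sample test set, and serve only as a worked example. The function name `classification_metrics` is likewise hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Assumed counts: inferred from the reported metrics, not taken from the source.
acc, prec, rec, f1 = classification_metrics(tp=187, tn=184, fp=0, fn=29)
print(f"{acc:.4f} {prec:.4f} {rec:.4f} {f1:.4f}")   # 0.9275 1.0000 0.8657 0.9280
```

With a precision of 1.0 every error is a false negative, so the 29 misclassified test samples are all credit-card trips predicted as cash.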
SVMs with a linear, RBF, polynomial and sigmoid kernel are compared by computing accuracy, precision, recall and F1 for each; the output is as follows:
步骤5: 不同核函数比较
线性核 SVM - 准确率: 0.9275, 精确率: 1.0000, 召回率: 0.8657, F1: 0.9280
RBF核 SVM - 准确率: 0.9300, 精确率: 0.9896, 召回率: 0.8796, F1: 0.9314
多项式核 SVM - 准确率: 0.8875, 精确率: 0.9476, 召回率: 0.8380, F1: 0.8894
Sigmoid核 SVM - 准确率: 0.7600, 精确率: 0.7857, 召回率: 0.7639, F1: 0.7746
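(7) Automatic parameter tuning
Automatic tuning by grid search
Mathematical model: grid search combined with cross-validation looks for the best hyperparameters.
Cross-validation procedure:
1. Split the data into $k$ folds.
2. For each parameter combination $\theta$, compute the cross-validation score $\mathrm{CV}(\theta) = \frac{1}{k} \sum_{i=1}^{k} s_i(\theta)$, where $s_i(\theta)$ is the score on fold $i$ of a model trained on the other $k - 1$ folds.
3. Select the best combination: $\theta^* = \arg\max_{\theta} \mathrm{CV}(\theta)$.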
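The three steps map directly onto scikit-learn's `GridSearchCV`. The sketch below is a minimal example under stated assumptions: the data is synthetic (`make_classification`), and the parameter grid and `cv=5` are illustrative choices, not the settings used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 11 features
X, y = make_classification(n_samples=300, n_features=11, random_state=42)

# Scaling inside the pipeline is refit on the training folds of every split
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

`search.best_score_` is the mean cross-validated accuracy of the best combination, that is, $\mathrm{CV}(\theta^*)$ in the notation above.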
The output is as follows:
步骤6: 自动超参数调优
对通用数据集进行调参...
开始自动调参...
自动调参失败: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)
(8) Effect of parameters on performance
步骤7: 参数对性能的影响
(9) Learning-curve analysis
步骤8: 学习曲线分析
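(10) Generating the 3D visualization (web version)
3D visualization
Mathematical model: PCA or t-SNE reduces the data to three dimensions so that the data distribution and the decision boundary can be shown in 3D space.
PCA dimensionality reduction:
1. Compute the covariance matrix of the centered data: $\Sigma = \frac{1}{n-1} (X - \bar{X})^\top (X - \bar{X})$
2. Eigendecompose the covariance matrix: $\Sigma v_k = \lambda_k v_k$
3. Keep the eigenvectors of the three largest eigenvalues: $W = [v_1, v_2, v_3]$
4. Project onto them: $Z = (X - \bar{X}) W$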
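The four PCA steps can be written out directly with NumPy. This is a sketch on synthetic data (a random 200 × 11 matrix standing in for the features); the report's own code uses `sklearn.decomposition.PCA` instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))            # synthetic stand-in for the features

Xc = X - X.mean(axis=0)                   # center every feature
cov = Xc.T @ Xc / (len(X) - 1)            # step 1: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # step 2: eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1][:3]     # step 3: indices of the three largest eigenvalues
W = eigvecs[:, order]
Z = Xc @ W                                # step 4: project to three dimensions
explained = eigvals[order] / eigvals.sum()  # share of variance kept per component
```

The projected coordinates are uncorrelated, and the variance of each equals the corresponding eigenvalue.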
The output is as follows:
步骤9: 生成3D可视化
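(11) Creating the decision-boundary animation (web version)
Animated visualization
Mathematical model: a smooth transition animation between the decision boundaries of different kernel functions
Procedure:
1. Compute the decision function for each kernel.
2. Blend between the decision functions with a weight function to obtain a smooth transition.
3. Visualize the time sequence as a sequence of frames.
The output is as follows: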
步骤10: 创建决策边界动画
(12) Generating the comprehensive performance report (web version)
步骤11: 生成综合性能报告
=== 所有演示完成 ===
最佳模型参数: {'kernel': 'linear', 'C': 1.0}
数据处理、模型训练、可视化和性能评估已全部完成!
III. Python implementation of SVM-based classification prediction
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score
from sklearn.impute import SimpleImputer  # missing-value imputation
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')
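# Use a font with Chinese glyphs so that Chinese plot labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly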
# ==================== CSV data access ====================
def load_csv_with_specific_columns(file_paths, selected_columns=None, skip_header=True):
    """
    Load selected columns from one or two CSV files.
    Parameters:
        file_paths: str or list - CSV file path(s)
        selected_columns: list - column indices to select; defaults to indices 4-15
            (columns 5-16). The last index is the label column.
        skip_header: bool - whether to drop the first row of the loaded frame
    Returns:
        X: np.array - feature data
        y: np.array - label data (Cash=-1, Credit Card=1)
    """
    if selected_columns is None:
        selected_columns = list(range(4, 16))
    all_data = []
    all_labels = []
    # Accept a single path as well as a list of paths
    if isinstance(file_paths, str):
        file_paths = [file_paths]
    for file_path in file_paths:
        try:
            # Read the CSV file
            df = pd.read_csv(file_path)
            print(f"文件 {file_path} 原始形状: {df.shape}")
            print("前几行数据:")
            print(df.head())
            print("数据类型:")
            print(df.dtypes)
            # NOTE: pd.read_csv has already consumed the header line, so this drops
            # the first data record (2001 loaded rows become 2000 samples).
            # Pass skip_header=False to keep that record.
            if skip_header:
                df = df.iloc[1:]
            # Check that the requested column indices exist
            max_col = max(selected_columns) if selected_columns else df.shape[1] - 1
            if max_col >= df.shape[1]:
                print(f"警告: 文件 {file_path} 列数不足,最大列索引: {df.shape[1] - 1}")
                # Keep only the valid column indices
                valid_columns = [col for col in selected_columns if col < df.shape[1]]
            else:
                valid_columns = selected_columns
            # Need at least one feature column plus the label column
            if len(valid_columns) < 2:
                print("警告: 有效列数不足,至少需要一个特征列和一个标签列")
                continue
            # Split into feature columns and the label column
            feature_columns = valid_columns[:-1]
            label_column = valid_columns[-1]
            # Drop rows with a missing label here so that X and y stay aligned
            # (process_labels skips missing labels)
            df = df[df.iloc[:, label_column].notna()]
            # Process the feature columns (numeric)
            X_data = process_feature_columns(df, feature_columns)
            # Keep the label column as strings
            labels_data = df.iloc[:, label_column].values
            all_data.append(X_data)
            all_labels.extend(labels_data)
            print(f"成功加载文件: {file_path}, 特征数据形状: {X_data.shape}, 标签数量: {len(labels_data)}")
        except Exception as e:
            print(f"加载文件 {file_path} 失败: {e}")
            continue
    if not all_data:
        print("未能加载任何数据,生成模拟数据...")
        return generate_sample_data()
    # Merge the feature data
    X = np.vstack(all_data) if len(all_data) > 1 else all_data[0]
    # Convert the labels to an array
    labels = np.array(all_labels)
    # Map the labels (no mean filling for labels)
    y = process_labels(labels)
    print(f"数据处理完成 - 特征维度: {X.shape}, 标签分布: 负类(-1): {np.sum(y == -1)}, 正类(1): {np.sum(y == 1)}")
    return X, y
def process_feature_columns(df, feature_columns):
    """
    Process the feature columns (numeric handling).
    Parameters:
        df: DataFrame - input data
        feature_columns: list - feature column indices
    Returns:
        X: np.array - processed feature data
    """
    # Extract the feature columns
    features_df = df.iloc[:, feature_columns]
    # Process each column
    processed_features = []
    for col in features_df.columns:
        series = features_df[col]
        # Convert to numeric values and fill missing values
        numeric_series, _ = convert_to_numeric(series, f"特征列 {col}")
        processed_features.append(numeric_series)
    # Stack the processed columns side by side
    X = np.column_stack(processed_features)
    return X
def convert_to_numeric(series, col_name):
    """
    Convert a pandas Series to numeric values.
    Parameters:
        series: pandas Series
        col_name: column name (used in log messages)
    Returns:
        numeric_array: numeric array
        conversion_info: conversion status message
    """
    print(f"处理列 '{col_name}', 原始类型: {series.dtype}")
    try:
        # Values that cannot be parsed become NaN
        numeric_series = pd.to_numeric(series, errors='coerce')
        # Count missing values after conversion
        nan_count = numeric_series.isna().sum()
        if nan_count > 0:
            print(f"列 '{col_name}' 中有 {nan_count} 个值无法转换为数字,将使用均值填充")
            if not numeric_series.isna().all():
                # Fill NaN with the mean of the observed values
                mean_value = numeric_series.mean()
                numeric_series = numeric_series.fillna(mean_value)
            else:
                # Every value is missing, so there is no mean to use
                print(f"列 '{col_name}' 全部为非数值,使用0填充")
                numeric_series = numeric_series.fillna(0)
        return numeric_series.values, "数值转换成功"
    except Exception as e:
        print(f"列 '{col_name}' 数值转换失败: {e}")
        # For string columns, fall back to the special string handling
        if series.dtype == 'object':
            return process_numeric_object_column(series, col_name)
        else:
            # Last resort: set everything to 0
            print(f"对列 '{col_name}' 使用默认值0")
            return np.zeros(len(series)), "使用默认值"
def process_numeric_object_column(series, col_name):
    """
    Process a feature column of object dtype (usually strings) and try to convert it to numbers.
    """
    print(f"处理object类型特征列 '{col_name}'")
    # Inspect the unique values
    unique_values = series.unique()
    if len(unique_values) < 10:
        print(f"列 '{col_name}' 的唯一值: {unique_values}")
    else:
        print(f"列 '{col_name}' 有 {len(unique_values)} 个唯一值")
    # Map each value to a number
    result = []
    for value in series:
        if pd.isna(value) or value is None:
            result.append(0)  # NaN is replaced by 0
        elif isinstance(value, str):
            # Try to extract a number from the string
            numeric_value = extract_number_from_string(value)
            result.append(numeric_value)
        else:
            try:
                result.append(float(value))
            except (TypeError, ValueError):
                result.append(0)
    return np.array(result), "字符串处理完成"
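def extract_number_from_string(s):
    """
    Extract a number from a string.
    """
    import re
    import zlib
    if not isinstance(s, str):
        return 0
    # Strip surrounding whitespace
    s = s.strip()
    # Common string-to-number mappings
    string_to_number = {
        'cash': -1,
        'credit': 1,
        'credit card': 1,
        'debit': 0,
        'yes': 1,
        'no': 0,
        'true': 1,
        'false': 0,
        'male': 1,
        'female': 0,
        'high': 1,
        'low': -1,
        'medium': 0
    }
    # Check the mapping table first
    s_lower = s.lower()
    if s_lower in string_to_number:
        return string_to_number[s_lower]
    # Otherwise take the first number that appears in the string
    numbers = re.findall(r'-?\d+\.?\d*', s)
    if numbers:
        try:
            return float(numbers[0])
        except ValueError:
            pass
    # Fall back to a checksum scaled into [0, 1). The built-in hash() of a str
    # changes between interpreter runs, so a CRC32 is used to keep results reproducible.
    return zlib.crc32(s.encode('utf-8')) % 1000 / 1000.0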
def process_labels(labels):
    """
    Process the label data: labels are mapped from their string form, never mean-filled.
    """
    print(f"处理标签数据,类型: {type(labels)}, 形状: {labels.shape if hasattr(labels, 'shape') else len(labels)}")
    # Collect the unique label values
    if isinstance(labels, np.ndarray):
        unique_labels = np.unique(labels)
    else:
        unique_labels = pd.Series(labels).unique()
    # Show the unique label values
    if len(unique_labels) < 10:
        print(f"标签的唯一值: {unique_labels}")
    else:
        print(f"标签有 {len(unique_labels)} 个唯一值")
    # Map the labels
    y = []
    for label in labels:
        # Samples with a missing label are skipped
        if pd.isna(label) or label is None:
            continue
        mapped_label = map_payment_label(label)
        y.append(mapped_label)
    # Print the label distribution after mapping
    y_array = np.array(y)
    print(f"处理后的标签分布: -1 (Cash): {np.sum(y_array == -1)}, 1 (Credit Card): {np.sum(y_array == 1)}")
    return y_array
def map_payment_label(label):
    """
    Map a payment-type label to -1 / 1.
    Cash/cash -> -1
    Credit Card/credit/credit card -> 1
    """
    if isinstance(label, str):
        # String labels: case-insensitive substring match
        label_lower = label.strip().lower()
        if 'cash' in label_lower:
            return -1
        elif 'credit' in label_lower:
            return 1
        # Other common string values
        elif label_lower in ['0', 'false', 'no', 'negative', 'failure', 'fail', 'n']:
            return -1
        elif label_lower in ['1', 'true', 'yes', 'positive', 'success', 'pass', 'y']:
            return 1
    else:
        # Numeric labels
        try:
            num_label = float(label)
            # -1 and 1 map to themselves; other values follow the sign rule
            return -1 if num_label <= 0 else 1
        except (TypeError, ValueError):
            pass
    # Default for labels that are not recognized.
    # NOTE: any other payment type present in the data is counted as Credit Card here.
    return 1
def generate_linear_data(n_samples=200):
    """Generate a linearly separable data set (data1)."""
    np.random.seed(42)
    X = np.random.randn(n_samples, 2) * 2
    # Linear decision boundary: x + y > 0
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    # Flip a small share of the labels as noise
    noise_idx = np.random.choice(n_samples, size=int(0.05 * n_samples), replace=False)
    y[noise_idx] = -y[noise_idx]
    print("已生成线性可分模拟数据")
    return X, y
def generate_spiral_data(n_samples=200):
    """Generate a two-spiral data set (data2)."""
    np.random.seed(42)
    def spiral_xy(i, spiral_num):
        """Coordinates of point i on spiral spiral_num."""
        angle = i * np.pi / 16
        radius = 2 * i / n_samples
        if spiral_num == 0:
            return [radius * np.cos(angle), radius * np.sin(angle)]
        else:
            return [-radius * np.cos(angle), -radius * np.sin(angle)]
    half_samples = n_samples // 2
    X = np.zeros((n_samples, 2))
    y = np.zeros(n_samples)
    # First spiral (class 1)
    for i in range(half_samples):
        X[i] = spiral_xy(i, 0)
        y[i] = 1
    # Second spiral (class -1)
    for i in range(half_samples):
        X[i + half_samples] = spiral_xy(i, 1)
        y[i + half_samples] = -1
    # Add noise
    X += np.random.randn(n_samples, 2) * 0.1
    print("已生成螺旋形模拟数据")
    return X, y
def generate_sample_data(n_samples=200, n_features=8):
    """Generate generic simulated data."""
    np.random.seed(42)
    X = np.random.randn(n_samples, n_features)
    y = np.where(X[:, 0] + X[:, 1] + 0.3 * X[:, 2] > 0, 1, -1)
    print("已生成常规模拟数据")
    return X, y
# ==================== Data preprocessing ====================
def preprocess_data(X, y):
    """
    Data preprocessing: handle missing values and scale the features.
    Parameters:
        X: feature data
        y: label data
    Returns:
        X_scaled: preprocessed feature data
        y: preprocessed label data
    """
    # 1. Fill missing feature values
    if np.isnan(X).any():
        imputer = SimpleImputer(strategy='mean')
        X = imputer.fit_transform(X)
        print("已使用均值填充特征中的缺失值")
    # 2. Remove samples whose label is missing
    valid_indices = ~np.isnan(y)
    if not valid_indices.all():
        X = X[valid_indices]
        y = y[valid_indices]
        print(f"已移除 {np.sum(~valid_indices)} 个标签缺失的样本")
    # 3. Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return X_scaled, y
# ==================== Evaluation metrics ====================
def calculate_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {-1, 1}."""
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    true_negatives = np.sum((y_true == -1) & (y_pred == -1))
    false_positives = np.sum((y_true == -1) & (y_pred == 1))
    false_negatives = np.sum((y_true == 1) & (y_pred == -1))
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    accuracy = np.sum(y_true == y_pred) / len(y_true)
    return accuracy, precision, recall, f1
# ==================== SMO algorithm ====================
def SMO(x, y, ker, C, max_iter, tol=1e-3):
    """Train an SVM with the (simplified) SMO algorithm.

    max_iter is the number of consecutive passes without any update
    after which training stops.
    """
    m = x.shape[0]
    alpha = np.zeros(m)
    b = 0
    passes = 0
    # Precompute the kernel matrix
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = ker(x[i], x[j])
    # Main SMO loop
    while passes < max_iter:
        num_changed_alphas = 0
        for i in range(m):
            Ei = np.sum(alpha * y * K[:, i]) + b - y[i]
            # Only update alpha_i if it violates the KKT conditions
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                # Pick the second multiplier at random (j != i)
                j = np.random.choice([l for l in range(m) if l != i])
                Ej = np.sum(alpha * y * K[:, j]) + b - y[j]
                alpha_i_old = alpha[i]
                alpha_j_old = alpha[j]
                # Bounds L and H for alpha_j
                if y[i] != y[j]:
                    L = max(0, alpha[j] - alpha[i])
                    H = min(C, C + alpha[j] - alpha[i])
                else:
                    L = max(0, alpha[i] + alpha[j] - C)
                    H = min(C, alpha[i] + alpha[j])
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                # Update and clip alpha_j
                alpha[j] = alpha[j] - (y[j] * (Ei - Ej)) / eta
                alpha[j] = np.clip(alpha[j], L, H)
                if abs(alpha[j] - alpha_j_old) < tol:
                    continue
                # Update alpha_i so that sum(alpha * y) stays unchanged
                alpha[i] = alpha[i] + y[i] * y[j] * (alpha_j_old - alpha[j])
                # Update the intercept
                b1 = b - Ei - y[i] * (alpha[i] - alpha_i_old) * K[i, i] - y[j] * (alpha[j] - alpha_j_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - alpha_i_old) * K[i, j] - y[j] * (alpha[j] - alpha_j_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                num_changed_alphas += 1
        if num_changed_alphas == 0:
            passes += 1
        else:
            passes = 0
    return alpha, b
# ==================== Kernel functions ====================
def linear_kernel(x, y):
    """Linear kernel."""
    return np.inner(x, y)
def polynomial_kernel(d):
    """Polynomial kernel of degree d."""
    def kernel(x, y):
        return np.inner(x, y) ** d
    return kernel
def rbf_kernel(sigma):
    """RBF kernel with width sigma."""
    def kernel(x, y):
        return np.exp(-np.inner(x - y, x - y) / (2.0 * sigma ** 2))
    return kernel
def cosine_kernel(x, y):
    """Cosine-similarity kernel."""
    return np.inner(x, y) / (np.linalg.norm(x, 2) * np.linalg.norm(y, 2) + 1e-10)
def sigmoid_kernel(beta, c):
    """Sigmoid kernel."""
    def kernel(x, y):
        return np.tanh(beta * np.inner(x, y) + c)
    return kernel
# ==================== Enhanced visualization ====================
def plot_decision_boundary_enhanced(X, y, model, title=None, ax=None, alpha=0.8,
                                    show_support_vectors=True, confidence=True,
                                    show_margin=True, point_size=60):
    """
    Draw an enhanced decision-boundary plot.
    Parameters:
        X: feature data
        y: label data
        model: SVM model
        title: plot title
        ax: matplotlib axes object
        alpha: transparency
        show_support_vectors: whether to mark the support vectors
        confidence: whether to shade regions by confidence
        show_margin: whether to draw the margin lines
        point_size: size of the data points
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 8))
    # Use the first two features
    X_2d = X[:, :2] if X.shape[1] > 2 else X
    # Build the grid
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    # Grid points to predict on
    if X.shape[1] > 2:
        # Grid points with the same dimensionality as the original features
        grid = np.zeros((xx.size, X.shape[1]))
        grid[:, 0] = xx.ravel()
        grid[:, 1] = yy.ravel()
        # The remaining features are fixed at their mean
        for i in range(2, X.shape[1]):
            grid[:, i] = X[:, i].mean()
    else:
        grid = np.c_[xx.ravel(), yy.ravel()]
    try:
        # Decision-function values (signed distance to the hyperplane)
        Z = model.decision_function(grid).reshape(xx.shape)
        # Predicted classes
        Z_pred = model.predict(grid).reshape(xx.shape)
        if confidence:
            # Shade the decision regions by absolute distance
            abs_Z = np.abs(Z)
            max_abs_Z = abs_Z.max()
            # Normalized confidence in the range 0-1
            conf = abs_Z / max_abs_Z
            # One colormap per class
            cmap_blue = plt.cm.Blues
            cmap_red = plt.cm.Reds
            # Separate the two class regions
            region_a = np.copy(conf)
            region_b = np.copy(conf)
            region_a[Z_pred != 1] = 0
            region_b[Z_pred != -1] = 0
            # Draw the regions with a confidence gradient
            ax.imshow(region_a, cmap=cmap_blue, alpha=alpha,
                      extent=(x_min, x_max, y_min, y_max), origin='lower')
            ax.imshow(region_b, cmap=cmap_red, alpha=alpha,
                      extent=(x_min, x_max, y_min, y_max), origin='lower')
        else:
            # Plain two-class regions
            ax.contourf(xx, yy, Z_pred, alpha=alpha, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
        # Draw the decision boundary and the margin lines
        if show_margin:
            ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['red', 'black', 'blue'],
                       linestyles=['--', '-', '--'], linewidths=[1, 2, 1])
        else:
            ax.contour(xx, yy, Z, levels=[0], colors=['black'],
                       linestyles=['-'], linewidths=[2])
    except Exception as e:
        print(f"绘制决策边界时出错: {e}")
        # Only the data points are drawn when the boundary cannot be plotted
        ax.text(0.5, 0.5, "绘制决策边界失败",
                ha='center', va='center', transform=ax.transAxes,
                bbox=dict(facecolor='red', alpha=0.1))
    # Draw the data points
    scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap=ListedColormap(['red', 'blue']),
                         s=point_size, edgecolors='k', alpha=0.8)
    # Mark the support vectors
    if show_support_vectors and hasattr(model, 'support_vectors_'):
        sv = model.support_vectors_
        if sv.shape[1] > 2:
            sv = sv[:, :2]  # keep the first two dimensions only
        ax.scatter(sv[:, 0], sv[:, 1],
                   s=point_size * 2, linewidth=1, facecolors='none', edgecolors='green')
    # Title
    if title:
        ax.set_title(title, fontsize=14)
    else:
        kernel_type = model.kernel if hasattr(model, 'kernel') else 'unknown'
        ax.set_title(f'SVM (kernel={kernel_type})', fontsize=14)
    ax.set_xlabel('特征 1', fontsize=12)
    ax.set_ylabel('特征 2', fontsize=12)
    # Axis limits
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    # Class legend
    handles, labels = scatter.legend_elements()
    class_labels = ['类别 -1', '类别 1']
    legend1 = ax.legend(handles, class_labels, loc="upper right")
    ax.add_artist(legend1)
    # Support-vector legend
    if show_support_vectors and hasattr(model, 'support_vectors_'):
        sv_handle = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='none',
                               markeredgecolor='green', markersize=10, linewidth=0)
        ax.legend([sv_handle], ['支持向量'], loc='upper left')
    # Grid lines
    ax.grid(True, linestyle='--', alpha=0.3)
    return ax
def create_3d_visualization_advanced(X, y, method='pca', model=None, title_suffix="",
show_decision_surface=True):
"""增强版3D可视化,支持显示决策边界和支持向量"""
# 确保没有NaN值
if np.isnan(X).any():
print("警告:3D可视化数据中包含NaN值,将使用均值填充")
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
# 降维到3D
if method == 'pca':
# PCA降维到3D
pca = PCA(n_components=min(3, X.shape[1]))
X_3d = pca.fit_transform(X)
title = f"PCA 3D可视化 {title_suffix}"
explained_var = pca.explained_variance_ratio_
axis_labels = [f'PC{i + 1} ({explained_var[i]:.1%})' for i in range(min(3, X.shape[1]))]
else:
# t-SNE降维到3D
n_components = min(3, X.shape[1])
perplexity = min(30, len(X) // 4) if len(X) > 12 else 3
tsne = TSNE(n_components=n_components, random_state=42, perplexity=perplexity)
X_3d = tsne.fit_transform(X)
title = f"t-SNE 3D可视化 {title_suffix}"
axis_labels = [f't-SNE {i + 1}' for i in range(n_components)]
# 如果维度不足3,填充零向量
if X_3d.shape[1] < 3:
pad = np.zeros((X_3d.shape[0], 3 - X_3d.shape[1]))
X_3d = np.hstack((X_3d, pad))
for i in range(X_3d.shape[1] - len(axis_labels)):
axis_labels.append(f'填充维度 {i + 1}')
# 绘制3D散点图
fig = go.Figure()
# 添加决策曲面(如果需要且模型可用)
if show_decision_surface and model is not None and X.shape[1] >= 3:
try:
# 创建3D网格
x_min, x_max = X_3d[:, 0].min() - 0.5, X_3d[:, 0].max() + 0.5
y_min, y_max = X_3d[:, 1].min() - 0.5, X_3d[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 30),
np.linspace(y_min, y_max, 30))
# 网格点在原始空间中的坐标
if method == 'pca':
grid = np.c_[xx.ravel(), yy.ravel(), np.zeros(xx.size)]
# 计算第三维的值,使得点在决策边界上
# 这里简化了计算,实际应用可能需要更复杂的方法
z_vals = []
for i in range(grid.shape[0]):
# 尝试找到在决策边界上的z值
z_test = np.linspace(X_3d[:, 2].min(), X_3d[:, 2].max(), 5)
decision_vals = []
for z in z_test:
point_3d = np.array([grid[i, 0], grid[i, 1], z])
try:
# 将3D点投影回原始空间
point_orig = pca.inverse_transform(point_3d)
decision_vals.append(model.decision_function([point_orig])[0])
except Exception:
decision_vals.append(float('inf'))
# 找到最接近决策边界的z值
idx = np.argmin(np.abs(decision_vals))
z_vals.append(z_test[idx])
grid[:, 2] = np.array(z_vals)
# 重塑网格
z = grid[:, 2].reshape(xx.shape)
# 添加决策曲面
fig.add_trace(go.Surface(
x=xx, y=yy, z=z,
colorscale='RdBu',
opacity=0.7,
showscale=False,
name='决策曲面'
))
except Exception as e:
print(f"3D决策曲面创建失败: {e}")
# 添加数据点
for class_val in np.unique(y):
mask = y == class_val
name = "负类" if class_val == -1 else "正类"
color = 'red' if class_val == -1 else 'blue'
fig.add_trace(go.Scatter3d(
x=X_3d[mask, 0],
y=X_3d[mask, 1],
z=X_3d[mask, 2],
mode='markers',
marker=dict(
size=5,
color=color,
opacity=0.8
),
name=name,
text=[f'样本 {i}, 类别: {"负类" if label == -1 else "正类"}' for i, label in zip(np.flatnonzero(mask), y[mask])],
hovertemplate='%{text}<br>x: %{x:.2f}<br>y: %{y:.2f}<br>z: %{z:.2f}<extra></extra>'
))
# 如果提供了模型,添加支持向量
if model is not None and hasattr(model, 'support_vectors_'):
try:
# 将支持向量映射到降维空间
if method == 'pca':
sv_3d = pca.transform(model.support_vectors_)
# 如果维度不足3,填充零向量
if sv_3d.shape[1] < 3:
pad = np.zeros((sv_3d.shape[0], 3 - sv_3d.shape[1]))
sv_3d = np.hstack((sv_3d, pad))
else:
# t-SNE不支持transform,简单方案是寻找最接近支持向量的训练样本
sv_3d = np.zeros((len(model.support_vectors_), 3))
for i, sv in enumerate(model.support_vectors_):
# 找到最近的原始样本
distances = np.sum((X - sv) ** 2, axis=1)
nearest_idx = np.argmin(distances)
sv_3d[i] = X_3d[nearest_idx]
# 添加支持向量
fig.add_trace(go.Scatter3d(
x=sv_3d[:, 0],
y=sv_3d[:, 1],
z=sv_3d[:, 2],
mode='markers',
marker=dict(
size=8,
color='green',
symbol='circle',
line=dict(color='green', width=2),
opacity=0.9
),
name='支持向量'
))
except Exception as e:
print(f"添加支持向量时出错: {e}")
# 更新布局
fig.update_layout(
title=title,
scene=dict(
xaxis_title=axis_labels[0],
yaxis_title=axis_labels[1],
zaxis_title=axis_labels[2]
),
width=900,
height=700,
margin=dict(l=0, r=0, b=0, t=40)
)
return fig
def create_animated_decision_boundary(X, y, models, model_names, steps=50):
"""创建动画展示不同核函数的决策边界"""
try:
# 确保使用前两个特征
X_2d = X[:, :2] if X.shape[1] > 2 else X
# 创建网格
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
# 对于高维数据,创建一个满足维度的网格
if X.shape[1] > 2:
grid = np.zeros((xx.size, X.shape[1]))
grid[:, 0] = xx.ravel()
grid[:, 1] = yy.ravel()
# 使用平均值填充其余维度
for i in range(2, X.shape[1]):
grid[:, i] = X[:, i].mean()
else:
grid = np.c_[xx.ravel(), yy.ravel()]
# 计算每个模型的决策函数
Z_values = []
Z_pred_values = []
valid_models = []
valid_model_names = []
for i, (model, name) in enumerate(zip(models, model_names)):
try:
Z = model.decision_function(grid).reshape(xx.shape)
Z_pred = model.predict(grid).reshape(xx.shape)
Z_values.append(Z)
Z_pred_values.append(Z_pred)
valid_models.append(model)
valid_model_names.append(name)
except Exception as e:
print(f"模型 {name} 无法计算决策边界: {e}")
# 如果没有有效模型,返回None
if not valid_models:
print("没有可用的模型来创建动画")
return None
# 创建动画帧
frames = []
for step in range(steps):
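# Interpolation weights: cos^2 peaks for model i at step = i * steps / n_models,
# which is (up to rounding) the frame that the slider label for model i points to.
n_models = len(valid_models)
weights = np.array([np.cos(np.pi * (step / steps - i / n_models)) ** 2 for i in range(n_models)])
# Normalise the weights; with a single model there is nothing to blend (and the sum can be 0)
weights = weights / weights.sum() if n_models > 1 else np.ones(1)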
# 混合决策函数
Z_mix = np.zeros_like(Z_values[0])
for i, Z in enumerate(Z_values):
Z_mix += weights[i] * Z
# 预测结果基于最高权重
max_weight_idx = np.argmax(weights)
Z_pred_mix = Z_pred_values[max_weight_idx]
# 创建帧
frame = go.Frame(
data=[
# 数据点
go.Scatter(
x=X_2d[:, 0],
y=X_2d[:, 1],
mode='markers',
marker=dict(
size=8,
color=['red' if label == -1 else 'blue' for label in y],
line=dict(width=1, color='black')
),
showlegend=False,
),
# 决策函数热图
go.Contour(
z=Z_mix,
x=np.linspace(x_min, x_max, 100),
y=np.linspace(y_min, y_max, 100),
colorscale='RdBu',
showscale=False,
contours=dict(
start=-2,
end=2,
size=0.5,
showlabels=False
),
line=dict(width=1),
opacity=0.8
),
# 决策边界线
go.Contour(
z=Z_mix,
x=np.linspace(x_min, x_max, 100),
y=np.linspace(y_min, y_max, 100),
colorscale=[[0, 'black'], [1, 'black']],
showscale=False,
contours=dict(
start=0,
end=0,
size=1,
showlabels=False
),
line=dict(width=2),
opacity=1
)
],
name=f"frame{step}"
)
frames.append(frame)
# 创建基础图形
fig = go.Figure(
data=[
# 数据点
go.Scatter(
x=X_2d[:, 0],
y=X_2d[:, 1],
mode='markers',
marker=dict(
size=8,
color=['red' if label == -1 else 'blue' for label in y],
line=dict(width=1, color='black')
),
name='数据点'
),
# 初始决策边界
go.Contour(
z=Z_values[0],
x=np.linspace(x_min, x_max, 100),
y=np.linspace(y_min, y_max, 100),
colorscale='RdBu',
showscale=False,
contours=dict(
start=-2,
end=2,
size=0.5,
showlabels=False
),
line=dict(width=1),
opacity=0.8,
name='决策函数'
),
# 初始决策边界线
go.Contour(
z=Z_values[0],
x=np.linspace(x_min, x_max, 100),
y=np.linspace(y_min, y_max, 100),
colorscale=[[0, 'black'], [1, 'black']],
showscale=False,
contours=dict(
start=0,
end=0,
size=1,
showlabels=False
),
line=dict(width=2),
opacity=1,
name='决策边界'
)
],
frames=frames,
layout=go.Layout(
title="SVM决策边界动画",
xaxis=dict(range=[x_min, x_max], title="特征1"),
yaxis=dict(range=[y_min, y_max], title="特征2"),
updatemenus=[{
"type": "buttons",
"buttons": [
{
"label": "播放",
"method": "animate",
"args": [None, {"frame": {"duration": 100, "redraw": True}}]
},
{
"label": "暂停",
"method": "animate",
"args": [[None], {"frame": {"duration": 0, "redraw": True}, "mode": "immediate", "transition": {"duration": 0}}]
}
],
"direction": "left",
"pad": {"r": 10, "t": 10},
"x": 0.1,
"y": 0,
"xanchor": "right",
"yanchor": "top"
}],
sliders=[{
"steps": [
{
"args": [
[f"frame{k}"],
{"frame": {"duration": 100, "redraw": True}}
],
"label": str(valid_model_names[i % len(valid_model_names)]),
"method": "animate"
}
for k, i in zip(range(0, steps, max(1, steps // len(valid_model_names))), range(len(valid_model_names)))
],
"x": 0.1,
"y": 0,
"currentvalue": {
"font": {"size": 12},
"prefix": "模型: ",
"visible": True,
"xanchor": "center"
},
"len": 0.9,
"pad": {"b": 10, "t": 50},
"transition": {"duration": 300}
}]
)
)
return fig
except Exception as e:
print(f"创建动画时出错: {e}")
return None
def visualize_metrics_over_C_gamma(X_train, y_train, X_test, y_test, kernel='rbf'):
"""
可视化C和gamma参数对模型指标的影响
参数:
X_train, y_train: 训练数据
X_test, y_test: 测试数据
kernel: 核函数类型
"""
try:
# C参数网格
C_range = np.logspace(-3, 3, 7)
# gamma参数网格(仅用于非线性核)
if kernel != 'linear':
gamma_range = np.logspace(-3, 2, 6)
else:
gamma_range = [0.01] # 线性核不需要gamma,但为了代码一致性,设置一个默认值
# 记录不同参数的性能指标
results = []
# 训练和评估不同参数组合的模型
for C in C_range:
for gamma in gamma_range:
try:
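# probability=True is dropped here: only predict() is used below, and enabling it
# adds an internal cross-validated calibration step to every fit in the grid.
if kernel == 'linear':
model = SVC(kernel=kernel, C=C)
else:
model = SVC(kernel=kernel, C=C, gamma=gamma)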
# 训练模型
model.fit(X_train, y_train)
# 在测试集上评估
y_pred = model.predict(X_test)
accuracy, precision, recall, f1 = calculate_metrics(y_test, y_pred)
# 记录结果
results.append({
'C': C,
'gamma': gamma,
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1': f1
})
except Exception as e:
print(f"训练参数 C={C}, gamma={gamma} 失败: {e}")
# 添加一个无效结果
results.append({
'C': C,
'gamma': gamma,
'accuracy': 0,
'precision': 0,
'recall': 0,
'f1': 0
})
# 转换为DataFrame
results_df = pd.DataFrame(results)
# 创建图形
if kernel != 'linear' and len(gamma_range) > 1:
# 3D曲面图:C,gamma vs 准确率
fig = plt.figure(figsize=(18, 10))
metrics = ['accuracy', 'precision', 'recall', 'f1']
titles = ['准确率', '精确率', '召回率', 'F1分数']
for i, (metric, title) in enumerate(zip(metrics, titles)):
ax = fig.add_subplot(2, 2, i + 1, projection='3d')
try:
# 重塑数据以适应3D曲面图
pivoted = results_df.pivot_table(
values=metric,
index='C',
columns='gamma'
)
X, Y = np.meshgrid(np.log10(gamma_range), np.log10(C_range))
Z = pivoted.values
# 绘制曲面
surf = ax.plot_surface(X, Y, Z, cmap='viridis',
linewidth=0, antialiased=True, alpha=0.8)
# 添加标题和标签
ax.set_title(f'{kernel}核函数: {title} vs C,gamma')
ax.set_xlabel('log10(gamma)')
ax.set_ylabel('log10(C)')
ax.set_zlabel(title)
# 添加颜色条
fig.colorbar(surf, ax=ax, shrink=0.5, aspect=5)
except Exception as e:
print(f"绘制3D曲面图失败: {e}")
ax.text2D(0.5, 0.5, "绘图失败",
ha='center', transform=ax.transAxes,
bbox=dict(facecolor='red', alpha=0.1))
plt.tight_layout()
plt.show()
else:
# 对于线性核,仅展示C的影响
plt.figure(figsize=(15, 5))
metrics = ['accuracy', 'precision', 'recall', 'f1']
titles = ['准确率', '精确率', '召回率', 'F1分数']
for i, (metric, title) in enumerate(zip(metrics, titles)):
plt.subplot(1, 4, i + 1)
plt.semilogx(results_df['C'], results_df[metric], marker='o', linewidth=2)
plt.title(f'线性核: {title} vs C')
plt.xlabel('C值 (log scale)')
plt.ylabel(title)
plt.grid(True)
plt.tight_layout()
plt.show()
return results_df
except Exception as e:
print(f"参数可视化失败: {e}")
return pd.DataFrame()
def plot_learning_curve(X, y, model_type='svm', kernels=('linear', 'rbf', 'poly'),
                        train_sizes=np.linspace(0.1, 1.0, 10)):
    """
    Plot learning curves showing how the training-set size affects accuracy.

    Parameters:
        X, y: features and labels (NumPy arrays)
        model_type: model type (not used in the body; kept so existing calls still work)
        kernels: kernel names, one subplot per kernel
        train_sizes: fractions of the data used for training
    """
    try:
        plt.figure(figsize=(15, 5))
        for i, kernel in enumerate(kernels):
            # One subplot per kernel
            plt.subplot(1, len(kernels), i + 1)
            used_sizes = []  # sizes that actually produced a score, so x and y stay aligned
            train_acc = []
            test_acc = []
            # Shuffle the data
            indices = np.random.permutation(len(X))
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            for size in train_sizes:
                try:
                    # Split into a training part and a small test part
                    train_size = max(10, int(len(X) * size))  # at least 10 training samples
                    if train_size >= len(X) - 10:
                        train_size = len(X) - 10  # leave at least 10 samples for testing
                    X_train, X_test = X_shuffled[:train_size], X_shuffled[train_size:train_size + 10]
                    y_train, y_test = y_shuffled[:train_size], y_shuffled[train_size:train_size + 10]
                    # Skip sizes where either part is missing one of the classes
                    if len(np.unique(y_train)) < 2 or len(np.unique(y_test)) < 2:
                        continue
                    # Train the model
                    if kernel == 'linear':
                        model = SVC(kernel=kernel, C=1.0)
                    elif kernel == 'rbf':
                        model = SVC(kernel=kernel, C=10.0, gamma=0.1)
                    else:  # poly
                        model = SVC(kernel=kernel, C=1.0, degree=3)
                    model.fit(X_train, y_train)
                    # Score first and append afterwards, so a failure cannot leave the lists unequal
                    train_score = model.score(X_train, y_train)
                    test_score = model.score(X_test, y_test)
                    used_sizes.append(size)
                    train_acc.append(train_score)
                    test_acc.append(test_score)
                except Exception as e:
                    print(f"学习曲线计算失败 (kernel={kernel}, size={size}): {e}")
            # Plot against the sizes that were actually evaluated
            # (the original sliced train_sizes by length, which misaligns after a skipped size)
            if train_acc:
                plt.plot(used_sizes, train_acc, 'o-', label='训练集准确率')
                plt.plot(used_sizes, test_acc, 's-', label='测试集准确率')
                plt.legend(loc='best')
            else:
                plt.text(0.5, 0.5, "数据不足以绘制学习曲线",
                         ha='center', va='center', transform=plt.gca().transAxes)
            plt.title(f'{kernel}核函数的学习曲线')
            plt.xlabel('训练集比例')
            plt.ylabel('准确率')
            plt.grid(True)
        plt.tight_layout()
        plt.show()
    except Exception as e:
        print(f"绘制学习曲线失败: {e}")
def create_comprehensive_performance_report(models, X_test, y_test, model_names):
"""创建综合性能报告"""
try:
# 创建子图
fig = make_subplots(
rows=2, cols=2,
subplot_titles=('模型性能对比', '混淆矩阵热图', 'ROC曲线对比', '特征重要性'),
specs=[[{"type": "bar"}, {"type": "heatmap"}],
[{"type": "scatter"}, {"type": "bar"}]]
)
# 收集所有模型的性能指标
results = {}
colors = ['blue', 'red', 'green', 'orange', 'purple']
# 筛选有效的模型
valid_models = []
valid_model_names = []
for model, name in zip(models, model_names):
try:
# 测试模型是否可用
y_pred = model.predict(X_test)
valid_models.append(model)
valid_model_names.append(name)
except Exception as e:
print(f"模型 {name} 不可用: {e}")
if not valid_models:
print("没有有效的模型可供评估")
return None, {}
for i, (model, name) in enumerate(zip(valid_models, valid_model_names)):
# 预测
y_pred = model.predict(X_test)
try:
y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
except Exception:
y_pred_proba = None
# 计算指标
accuracy, precision, recall, f1 = calculate_metrics(y_test, y_pred)
results[name] = {
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1_score': f1
}
# 混淆矩阵
if i == 0: # 只为第一个模型添加
cm = confusion_matrix(y_test, y_pred)
# 归一化混淆矩阵
cm_sum = cm.sum(axis=1)
cm_norm = np.zeros_like(cm, dtype=float)
for j in range(len(cm_sum)):
if cm_sum[j] > 0:
cm_norm[j] = cm[j] / cm_sum[j]
# 添加混淆矩阵热图
fig.add_trace(
go.Heatmap(
z=cm_norm,
x=['预测-1', '预测1'],
y=['实际-1', '实际1'],
colorscale='Blues',
showscale=True,
text=[[f'{cm[i, j]}<br>({cm_norm[i, j]:.1%})' for j in range(2)] for i in range(2)],
hoverinfo='text'
),
row=1, col=2
)
# ROC曲线
if y_pred_proba is not None:
try:
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc_score = auc(fpr, tpr)
results[name]['auc'] = auc_score
fig.add_trace(
go.Scatter(x=fpr, y=tpr, mode='lines',
name=f'{name} (AUC={auc_score:.3f})',
line=dict(color=colors[i % len(colors)])),
row=2, col=1
)
except Exception as e:
print(f"计算ROC曲线时出错: {e}")
# 添加随机分类器线
fig.add_trace(
go.Scatter(x=[0, 1], y=[0, 1], mode='lines',
name='随机分类器', line=dict(dash='dash', color='black')),
row=2, col=1
)
# 性能指标对比柱状图
metrics = ['accuracy', 'precision', 'recall', 'f1_score']
metric_names = ['准确率', '精确率', '召回率', 'F1分数']
for i, (metric, metric_name) in enumerate(zip(metrics, metric_names)):
values = [results[name].get(metric, 0) for name in valid_model_names]
fig.add_trace(
go.Bar(x=valid_model_names, y=values, name=metric_name,
marker_color=colors[i % len(colors)]),
row=1, col=1
)
# 特征重要性(如果有线性模型)
has_linear = False
for model, name in zip(valid_models, valid_model_names):
if hasattr(model, 'coef_') and len(model.coef_) > 0:
has_linear = True
importance = np.abs(model.coef_[0])
fig.add_trace(
go.Bar(
x=importance,
y=[f'特征 {i + 1}' for i in range(len(importance))],
orientation='h',
name=name
),
row=2, col=2
)
break # 只显示一个线性模型的特征重要性
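if not has_linear:
# Subplot (2, 2) owns the fourth axis pair; "x3"/"y3" would place the note on the ROC panel at (2, 1)
fig.add_annotation(
text="非线性模型<br>无法显示特征重要性",
x=0.5, y=0.5,
xref="x4 domain", yref="y4 domain",
showarrow=False,
font=dict(size=14)
)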
fig.update_layout(height=800, showlegend=True, title_text="SVM模型综合性能报告")
fig.update_xaxes(title_text="模型", row=1, col=1)
fig.update_yaxes(title_text="分数", row=1, col=1)
fig.update_xaxes(title_text="假正例率", row=2, col=1)
fig.update_yaxes(title_text="真正例率", row=2, col=1)
fig.update_xaxes(title_text="重要性", row=2, col=2)
fig.update_yaxes(title_text="特征", row=2, col=2)
return fig, results
except Exception as e:
print(f"创建性能报告失败: {e}")
return None, {}
# ==================== 自动调参功能 ====================
def auto_hyperparameter_tuning(X_train, y_train, cv=5, dataset_type=None):
"""SVM自动调参,针对不同数据集类型优化"""
try:
# 根据数据集类型调整参数网格
if dataset_type == 'data1' or dataset_type == 'linear':
# 线性数据集偏好线性核
param_grid = [
{'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
{'kernel': ['rbf'], 'C': [1, 10, 100], 'gamma': [0.1, 1, 'scale']}
]
print("对线性可分数据集进行调参...")
elif dataset_type == 'data2' or dataset_type == 'spiral':
# 螺旋数据集偏好RBF和多项式核
param_grid = [
{'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]},
{'kernel': ['poly'], 'C': [0.1, 1, 10], 'gamma': [0.1, 1], 'degree': [2, 3, 4]}
]
print("对螺旋数据集进行调参...")
else:
# 通用参数网格
param_grid = [
{'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
{'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 'scale']},
]
print("对通用数据集进行调参...")
# 如果数据集较小,简化参数网格
if len(X_train) < 50:
print("数据集较小,使用简化调参...")
param_grid = [
{'kernel': ['linear'], 'C': [1, 10]},
{'kernel': ['rbf'], 'C': [1, 10], 'gamma': ['scale']}
]
cv = min(cv, 3) # 减少交叉验证折数
# 网格搜索
grid_search = GridSearchCV(
SVC(probability=True),
param_grid,
cv=cv,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
print("开始自动调参...")
grid_search.fit(X_train, y_train)
print(f"最佳参数: {grid_search.best_params_}")
print(f"最佳交叉验证分数: {grid_search.best_score_:.4f}")
return grid_search.best_estimator_, grid_search.best_params_
except Exception as e:
print(f"自动调参失败: {e}")
# 返回一个默认模型
default_model = SVC(kernel='linear', C=1.0, probability=True)
default_model.fit(X_train, y_train)
return default_model, {'kernel': 'linear', 'C': 1.0}
# ==================== 主函数 ====================
def main():
"""主函数:展示所有功能"""
print('=' * 60)
print('基于SVM进行分类预测')
print('=' * 60)
# 1. 数据加载 - 尝试加载CSV文件
print("\n步骤1: 数据加载")
# 尝试加载linear.csv和spiral.csv
csv_files = ['linear.csv', 'spiral.csv'] # 可以替换为实际文件路径
try:
# 先尝试加载linear.csv
X, y = load_csv_with_specific_columns(csv_files[0])
print("成功加载CSV文件")
except Exception as e:
print(f"CSV加载失败: {e}")
print("使用模拟数据...")
X, y = generate_linear_data() # 生成线性可分数据作为默认
# 2. 数据预处理及可视化
print("\n步骤2: 数据检查与预处理")
# 检查是否有NaN值
if np.isnan(X).any() or np.isnan(y).any():
print("数据中包含NaN值,进行预处理...")
X, y = preprocess_data(X, y)
# 基本统计信息
print(f"数据维度: X={X.shape}, y={y.shape}")
print(f"类别分布 - 类别(-1): {np.sum(y == -1)}, 类别(1): {np.sum(y == 1)}")
# 可视化数据
plt.figure(figsize=(10, 8))
if X.shape[1] >= 2: # 至少有2个特征才能2D可视化
plt.scatter(X[y == -1, 0], X[y == -1, 1],
color='red', marker='o', label='类别 -1')
plt.scatter(X[y == 1, 0], X[y == 1, 1],
color='blue', marker='x', label='类别 1')
else: # 1维特征,用y=0作为第二维
plt.scatter(X[y == -1], np.zeros_like(X[y == -1]),
color='red', marker='o', label='类别 -1')
plt.scatter(X[y == 1], np.zeros_like(X[y == 1]),
color='blue', marker='x', label='类别 1')
plt.title('数据集可视化', fontsize=14)
plt.xlabel('特征 1', fontsize=12)
plt.ylabel('特征 2' if X.shape[1] >= 2 else 'Y = 0', fontsize=12)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# 3. 数据预处理
print("\n步骤3: 数据标准化")
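# Split first, then fit the scaler on the training rows only, so that
# test-set statistics do not leak into the 80%/20% evaluation.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)
X_scaled = scaler.transform(X) # all rows on the training scale; used for the plots and demos below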
print(f"训练集大小: {X_train.shape}")
print(f"测试集大小: {X_test.shape}")
# 4. 手动SMO算法训练
print("\n步骤4: 手动SMO算法训练")
if X.shape[1] >= 2:
# 使用前两个特征进行线性SVM演示
X_demo = X_scaled[:, :2]
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
X_demo, y, test_size=0.2, random_state=42)
try:
# 训练线性SVM
alpha, b = SMO(X_train_demo, y_train_demo, ker=linear_kernel, C=1.0, max_iter=100)
# 计算权重和支持向量
sup_idx = alpha > 1e-5
if np.sum(sup_idx) > 0: # 确保有支持向量
w = np.sum((alpha[sup_idx] * y_train_demo[sup_idx]).reshape(-1, 1) * X_train_demo[sup_idx], axis=0)
print(f'支持向量个数: {np.sum(sup_idx)}')
print(f'权重向量 w = [{w[0]:.4f}, {w[1]:.4f}]')
print(f'偏置项 b = {b:.4f}')
# 绘制手动SMO的决策边界
plt.figure(figsize=(10, 8))
# 创建网格
x_min, x_max = X_train_demo[:, 0].min() - 1, X_train_demo[:, 0].max() + 1
y_min, y_max = X_train_demo[:, 1].min() - 1, X_train_demo[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
# 计算网格点的预测值
Z = np.sign(xx * w[0] + yy * w[1] + b)
# 绘制决策边界
plt.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
# 绘制数据点
plt.scatter(X_train_demo[y_train_demo == -1, 0], X_train_demo[y_train_demo == -1, 1],
color='red', marker='o', label='训练集 - 类别 -1')
plt.scatter(X_train_demo[y_train_demo == 1, 0], X_train_demo[y_train_demo == 1, 1],
color='blue', marker='x', label='训练集 - 类别 1')
# 绘制支持向量
plt.scatter(X_train_demo[sup_idx, 0], X_train_demo[sup_idx, 1],
s=100, facecolors='none', edgecolors='green', linewidth=2,
label='支持向量')
# 绘制测试点
plt.scatter(X_test_demo[:, 0], X_test_demo[:, 1],
marker='s', c=y_test_demo, cmap=ListedColormap(['red', 'blue']),
alpha=0.3, s=50, label='测试集')
# 绘制超平面
plt.plot([x_min, x_max], [(-b - w[0] * x_min) / w[1], (-b - w[0] * x_max) / w[1]],
'k-', linewidth=2)
# 绘制间隔
plt.plot([x_min, x_max], [(-b - w[0] * x_min - 1) / w[1], (-b - w[0] * x_max - 1) / w[1]],
'k--', linewidth=1)
plt.plot([x_min, x_max], [(-b - w[0] * x_min + 1) / w[1], (-b - w[0] * x_max + 1) / w[1]],
'k--', linewidth=1)
plt.title('手动SMO算法实现的SVM决策边界', fontsize=14)
plt.xlabel('特征 1', fontsize=12)
plt.ylabel('特征 2', fontsize=12)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.3)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.tight_layout()
plt.show()
# 预测和评估
y_pred_demo = np.sign(X_test_demo @ w.reshape(-1, 1) + b).flatten()
accuracy, precision, recall, f1 = calculate_metrics(y_test_demo, y_pred_demo)
print(
f'手动SMO SVM - 准确率: {accuracy:.4f}, 精确率: {precision:.4f}, 召回率: {recall:.4f}, F1: {f1:.4f}')
else:
print("SMO算法没有找到支持向量,跳过手动SVM演示")
except Exception as e:
print(f"手动SMO算法训练失败: {e}")
print("跳过手动SVM演示")
else:
print("特征维度不足,跳过手动SVM演示")
# 5. 不同核函数比较
print("\n步骤5: 不同核函数比较")
# 训练不同核函数的sklearn SVM模型
kernels = ['linear', 'rbf', 'poly', 'sigmoid']
kernel_names = ['线性核', 'RBF核', '多项式核', 'Sigmoid核']
fig, axs = plt.subplots(2, 2, figsize=(18, 14))
axs = axs.flatten()
models = []
names = []
for i, (kernel, name) in enumerate(zip(kernels, kernel_names)):
try:
# 调整参数
if kernel == 'linear':
model = SVC(kernel=kernel, C=1.0, probability=True)
elif kernel == 'rbf':
model = SVC(kernel=kernel, C=10.0, gamma=0.1, probability=True)
elif kernel == 'poly':
model = SVC(kernel=kernel, C=1.0, degree=3, gamma=0.1, probability=True)
else: # sigmoid
model = SVC(kernel=kernel, C=1.0, gamma=0.1, probability=True)
# 训练模型
model.fit(X_train, y_train)
models.append(model)
names.append(name)
# 评估模型
y_pred = model.predict(X_test)
accuracy, precision, recall, f1 = calculate_metrics(y_test, y_pred)
print(f'{name} SVM - 准确率: {accuracy:.4f}, 精确率: {precision:.4f}, 召回率: {recall:.4f}, F1: {f1:.4f}')
# 绘制决策边界
try:
plot_decision_boundary_enhanced(
X_scaled, y, model,
title=f'{name} SVM',
ax=axs[i],
confidence=True
)
except Exception as e:
print(f"绘制决策边界失败 ({name}): {e}")
axs[i].set_title(f"{name} SVM (绘制失败)")
axs[i].text(0.5, 0.5, "绘制决策边界失败",
ha='center', va='center', transform=axs[i].transAxes,
bbox=dict(facecolor='red', alpha=0.1))
except Exception as e:
print(f"模型训练失败 ({name}): {e}")
axs[i].text(0.5, 0.5, f"模型训练失败: {name}",
ha='center', va='center', transform=axs[i].transAxes,
bbox=dict(facecolor='red', alpha=0.1))
plt.tight_layout()
plt.show()
# 6. 自动调参
print("\n步骤6: 自动超参数调优")
# 确定数据集类型 - 如果是简单的线性可分数据,使用'data1'类型
if X.shape[1] <= 2: # 如果特征数小于等于2,可能是线性或螺旋数据
# 这里简化处理,假设线性数据
dataset_type = 'data1'
else:
# 对于高维数据,使用通用调参
dataset_type = None
best_model, best_params = auto_hyperparameter_tuning(X_train, y_train, dataset_type=dataset_type)
# 展示最佳模型的决策边界
if X.shape[1] >= 2:
plt.figure(figsize=(10, 8))
try:
plot_decision_boundary_enhanced(
X_scaled, y, best_model,
title=f"最佳SVM模型 ({best_model.kernel})",
confidence=True,
show_margin=True
)
plt.show()
except Exception as e:
print(f"绘制最佳模型决策边界失败: {e}")
# 7. 参数对性能的影响
print("\n步骤7: 参数对性能的影响")
# 可视化C和gamma参数对性能的影响
if X_train.shape[0] > 20 and not np.isnan(X_train).any(): # 数据点足够多且无缺失值才展示
results_df = visualize_metrics_over_C_gamma(X_train, y_train, X_test, y_test, kernel=best_model.kernel)
else:
print("数据量不足或有缺失值,跳过参数影响可视化")
# 8. 学习曲线
print("\n步骤8: 学习曲线分析")
# 绘制学习曲线 - 数据足够多时才展示
if X.shape[0] > 50 and not np.isnan(X).any():
plot_learning_curve(X_scaled, y, kernels=['linear', 'rbf'])
else:
print("数据量不足或有缺失值,跳过学习曲线分析")
# 9. 3D可视化
print("\n步骤9: 生成3D可视化")
# 检查数据是否适合3D可视化
if not np.isnan(X_scaled).any() and len(X) > 10:
try:
# 创建3D可视化
fig_3d = create_3d_visualization_advanced(X_scaled, y,
method='pca',
model=best_model,
title_suffix="(PCA降维)")
fig_3d.show()
except Exception as e:
print(f"3D可视化创建失败: {e}")
else:
print("数据不适合3D可视化,跳过此步骤")
# 10. 动画可视化
print("\n步骤10: 创建决策边界动画")
# 检查数据是否适合创建动画
if not np.isnan(X_scaled).any() and len(X) > 10 and X.shape[1] >= 2:
try:
# 过滤有效的模型
valid_models = []
valid_names = []
for model, name in zip(models, names):
try:
# 测试模型是否可用
model.predict(X_test[:1])
valid_models.append(model)
valid_names.append(name)
except Exception:
continue
if valid_models:
# 添加最佳模型
if best_model not in valid_models:
valid_models.append(best_model)
valid_names.append('最佳模型')
# 创建动画
anim_fig = create_animated_decision_boundary(X_scaled, y, valid_models, valid_names)
if anim_fig:
anim_fig.show()
else:
print("没有有效的模型可创建动画")
except Exception as e:
print(f"动画创建失败: {e}")
else:
print("数据不适合创建动画,跳过此步骤")
# 11. 综合性能报告
print("\n步骤11: 生成综合性能报告")
# 汇总所有模型
all_models = models.copy()
all_names = names.copy()
# 添加最佳模型
if best_model not in all_models:
all_models.append(best_model)
all_names.append('最佳模型')
# 生成性能报告
performance_fig, performance_results = create_comprehensive_performance_report(
all_models, X_test, y_test, all_names
)
if performance_fig:
performance_fig.show()
# 12. 总结
print("\n=== 所有演示完成 ===")
print(f"最佳模型参数: {best_params}")
print("数据处理、模型训练、可视化和性能评估已全部完成!")
if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print(f"程序运行出错: {e}")
        print("使用简化版本演示基本功能...")
        # Simplified fallback demo
        print("\n简化版本演示:")
        try:
            # Try to load the CSV file
            X, y = load_csv_with_specific_columns('linear.csv')
            # Drop samples whose label is missing, then impute the remaining feature gaps
            valid_indices = ~np.isnan(y)
            X, y = X[valid_indices], y[valid_indices]
            if np.isnan(X).any():
                X = SimpleImputer(strategy='mean').fit_transform(X)
        except Exception:
            X, y = generate_linear_data()
        # Simple 2D scatter plot: one scatter call per class so each legend entry matches its points
        plt.figure(figsize=(8, 6))
        second_axis = X[:, 1] if X.shape[1] >= 2 else np.zeros(len(X))
        for class_val, color, label in [(-1, 'red', '负类(-1)'), (1, 'blue', '正类(1)')]:
            mask = y == class_val
            plt.scatter(X[mask, 0], second_axis[mask], c=color, alpha=0.7, label=label)
        plt.title("数据可视化")
        plt.xlabel("特征1")
        plt.ylabel("特征2" if X.shape[1] >= 2 else "")
        plt.legend()
        plt.grid(True, linestyle='--', alpha=0.5)
        plt.show()
        # Simple SVM model
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        try:
            model = SVC(kernel='linear')
            model.fit(X_train, y_train)
            print(f"模型准确率: {model.score(X_test, y_test):.4f}")
        except Exception as e:
            print(f"模型训练失败: {e}")
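IV. What I Learned
Support vector machines are a powerful and elegant family of algorithms that turn optimization theory, convex analysis, and kernel methods into a practical classifier. Through this experiment I not only learned the theory and implementation of SVMs but, more importantly, connected theory with practice and built up my ability to analyze problems and implement complex systems. In particular, the special handling required for the label column made me realize that, in real applications, an algorithm often has to be customized to the specific business need. My main takeaways from this experiment are as follows: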
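(1) Combining theory and practice
SVM theory looks very abstract in a textbook, especially the parts involving Lagrange multipliers, the dual problem, and the KKT conditions. Implementing the SMO algorithm by hand, however, made the practical meaning of these ideas clear:
1. An intuitive feel for the maximum margin: plotting the decision boundary showed me how an SVM maximizes the margin while still classifying the training points correctly, which made the abstract optimization objective concrete.
2. The point of the dual problem: I used to know only that SVMs are solved in dual form, not why. While coding it I found that the dual form can be more efficient to solve (its size depends on the number of samples rather than the number of features) and that, because it touches the data only through inner products, it is what makes the kernel trick possible.
3. The role of support vectors: most training points end up with a Lagrange multiplier of zero, and only a small number of support vectors actually determine the decision boundary. This sparsity keeps prediction cheap and is closely tied to the model's generalization ability.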
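Points 2 and 3 can be checked numerically. The sketch below fits scikit-learn's `SVC` on synthetic two-cluster data (the data and all variable names are illustrative assumptions; it uses neither the taxi dataset nor the hand-written SMO from this report). It counts the support vectors and rebuilds the weight vector from the dual coefficients, w = Σ αᵢ yᵢ xᵢ.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data with labels in {-1, +1} (illustration only).
X, y = make_blobs(n_samples=200, centers=[[-2.0, -2.0], [2.0, 2.0]],
                  cluster_std=1.0, random_state=0)
y = np.where(y == 0, -1, 1)

model = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only;
# all other training points have alpha_i = 0 and drop out of the sum.
n_sv = model.support_vectors_.shape[0]
w_from_dual = (model.dual_coef_ @ model.support_vectors_).ravel()

print(f"support vectors: {n_sv} of {len(X)} training points")
print("w rebuilt from the dual:", w_from_dual)
print("w reported by sklearn: ", model.coef_.ravel())
print("margin width 2/||w||:  ", 2 / np.linalg.norm(w_from_dual))
```

Only the support vectors enter the sum, which is why prediction cost depends on their number rather than on the full training-set size.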
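(2) Choosing a kernel and its effect
The experiment compared four kernels (linear, RBF, polynomial, Sigmoid) across different datasets:
1. Linear kernel: performs very well on linearly separable data; the model is simple and fast, but it cannot find a useful decision boundary on data with complex structure.
2. RBF kernel: the most adaptable; it can fit a wide range of complex patterns but is harder to tune. The γ parameter has a particularly strong effect: too small a value underfits, too large a value tends to overfit.
3. Polynomial kernel: does well on some specific problems, but is computationally expensive and numerically less stable. The degree has to be chosen carefully.
4. Sigmoid kernel: interesting in theory, but in practice it often underperforms the other kernels and is also harder to tune.
The 3D visualizations and animations showed clearly how different kernels build their decision boundaries in feature space, which greatly deepened my understanding of what kernel methods actually do.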
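The effect of γ described for the RBF kernel can be reproduced in a few lines. The following is a minimal sketch on scikit-learn's synthetic `make_moons` data (an assumption for illustration, not the taxi data); the three γ values are arbitrary picks meant to land in the too-small, moderate, and too-large regimes.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Noisy, non-linearly-separable toy data.
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

gamma_scores = {}
for gamma in (0.001, 1.0, 1000.0):
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    gamma_scores[gamma] = (
        model.score(X_train, y_train),   # training accuracy
        model.score(X_test, y_test),     # held-out accuracy
        len(model.support_),             # number of support vectors
    )
    train_acc, test_acc, n_sv = gamma_scores[gamma]
    print(f"gamma={gamma:<8} train={train_acc:.3f} test={test_acc:.3f} SVs={n_sv}")
```

A large gap between training and held-out accuracy at the largest γ is the overfitting signature; similar, mediocre scores on both at the smallest γ point to underfitting.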
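(3) The importance of data processing
This project paid special attention to CSV handling, especially the label column, which showed me how much preprocessing matters for a machine learning model:
1. Missing values: mean imputation is a common choice for feature columns, but the label column needs a more careful strategy; in this project, samples with a missing label are removed rather than imputed.
2. Mapping strings to numbers: designing a mapping that preserves the meaning of the original data while meeting the algorithm's input requirements is a key challenge in practice.
3. The need for standardization: without standardization, features with large numeric ranges can dominate the model's decisions; the experiments showed a clear effect of standardization on SVM performance.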
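The three preprocessing points above can be combined into one leak-free pipeline. The sketch below is an assumption-laden illustration: the column names follow the dataset description, but the rows are randomly generated (so the resulting accuracy is meaningless), and `label_map` is a hypothetical mapping, not necessarily the one used by `load_csv_with_specific_columns`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Randomly generated stand-in for a few taxi-trip columns (values are invented).
rng = np.random.default_rng(0)
n = 300
miles = rng.gamma(2.0, 2.0, n)
df = pd.DataFrame({
    "Trip Miles": miles,
    "Trip Seconds": miles * 180 + rng.normal(0, 60, n),
    "Fare": 3.25 + 2.25 * miles + rng.normal(0, 1, n),
    "Payment Type": rng.choice(["Cash", "Credit Card", "Unknown"], n, p=[0.45, 0.45, 0.10]),
})
df.loc[rng.choice(n, 15, replace=False), "Trip Miles"] = np.nan  # simulate missing features

# 1) Map string labels to -1/+1; anything outside the map becomes NaN.
label_map = {"Cash": -1, "Credit Card": 1}  # hypothetical mapping
df["label"] = df["Payment Type"].map(label_map)
# 2) Remove rows without a usable label instead of imputing the target.
df = df.dropna(subset=["label"])

X = df[["Trip Miles", "Trip Seconds", "Fare"]].to_numpy()
y = df["label"].astype(int).to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3) Imputer and scaler live inside the pipeline, so their statistics
#    are learned from the training rows only.
clf = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(f"held-out accuracy on random labels: {clf.score(X_test, y_test):.3f}")
```

Keeping the imputer and scaler inside the pipeline means the same fitted transformations are reused automatically at prediction time.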
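(4) The value of visualization
Interactive visualization is not just attractive; it is a powerful tool for understanding and debugging a model:
1. Decision boundary visualization: by plotting the decision boundary together with the support vectors, I can judge at a glance whether the model is overfitting or underfitting.
2. Parameter influence analysis: the 3D plots show how C and γ affect model performance, which helped me tune the parameters in a more targeted way.
3. Dimensionality reduction: using PCA and t-SNE for 3D visualization helped me understand the structure of high-dimensional data and how the model behaves on it.
4. Animation: showing how the decision boundary changes from kernel to kernel conveys more information than a static chart.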
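For the parameter influence analysis in point 2, one alternative to scoring each (C, γ) pair on the test set is to use cross-validated scores from `GridSearchCV` and arrange them as a table. The sketch below assumes synthetic `make_moons` data and a 5 × 5 logarithmic grid chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Scaling sits inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": np.logspace(-2, 2, 5),
    "svc__gamma": np.logspace(-2, 2, 5),
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy").fit(X, y)

# Reshape the flat cv_results_ into a C-by-gamma table of mean CV accuracy.
results = pd.DataFrame(search.cv_results_)
results["C"] = results["param_svc__C"].astype(float)
results["gamma"] = results["param_svc__gamma"].astype(float)
cv_table = results.pivot_table(index="C", columns="gamma", values="mean_test_score")

print(cv_table.round(3))
print("best parameters:", search.best_params_)
```

The resulting table can be passed to a heatmap or surface plot; because the scores come from cross-validation, the test set stays untouched for the final accuracy report.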
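(5) What I learned from writing the Python code
From a coding perspective, the project also taught me a lot:
1. Modular design: splitting a complex system into modules for data access, algorithm implementation, visualization, and automatic tuning greatly improved the readability and maintainability of the code.
2. Error handling: real data produces far more exceptional cases than expected; thorough error handling and fallback strategies kept the system running stably.
3. Algorithm efficiency: implementing SMO showed me the importance of algorithmic optimization, especially techniques such as heuristic variable selection and matrix precomputation.
4. Interactive design: building an interactive interface is much more complex than plain data processing, but the improvement in user experience is also significant.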
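As an illustration of the matrix precomputation mentioned in point 3, and assuming the matrix in question is the kernel (Gram) matrix, the sketch below computes an RBF Gram matrix in one vectorized pass and checks it against scikit-learn's `rbf_kernel`. The helper name `rbf_gram` is hypothetical and is not part of the SMO code in this report.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf_gram(X, gamma):
    """Return K[i, j] = exp(-gamma * ||x_i - x_j||^2) for all pairs of rows."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for every pair at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-gamma * sq_dists)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
K = rbf_gram(X, gamma=0.1)

print("matches sklearn:", np.allclose(K, rbf_kernel(X, gamma=0.1)))
print(f"memory for K: {K.nbytes / 1e6:.2f} MB")
```

An SMO loop can then look up `K[i, j]` instead of re-evaluating the kernel for each pair. The trade-off is memory: the matrix has n² entries, so at 100,000 samples a float64 Gram matrix would need about 80 GB, which is one reason to cache only part of it or to train on a subsample.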
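(6) Directions for future improvement
1. More kernels: implement more specialized kernels, such as the Chi-Square kernel and the wave kernel, and explore how they perform on specific problems.
2. Optimize SMO: the current implementation is a simplified SMO; a full heuristic variable-selection strategy could be added to speed up convergence further.
3. Extend to multi-class problems: use a one-vs-one or one-vs-all strategy to extend the SVM to multi-class classification.
4. Ensemble learning: use SVMs as base learners and explore ensemble methods such as SVM-Bagging or multiple-kernel fusion.
5. Online learning: explore incremental SVM algorithms so that the model can handle streaming data.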
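For the online-learning direction, one readily available approximation is a linear SVM trained by stochastic gradient descent: scikit-learn's `SGDClassifier` with hinge loss supports `partial_fit`, so it can consume data batch by batch. This is a swapped-in technique, not an incremental version of the kernel SVM or SMO used above, and the synthetic data and batch size are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary data with labels in {-1, +1}, split into a "stream" and a hold-out part.
X, y = make_classification(n_samples=5000, n_features=6, n_informative=4, random_state=0)
y = np.where(y == 0, -1, 1)
X_stream, y_stream = X[:4000], y[:4000]
X_holdout, y_holdout = X[4000:], y[4000:]

scaler = StandardScaler()
online_svm = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)  # hinge loss = linear SVM objective
classes = np.array([-1, 1])  # all labels must be declared on the first partial_fit call

batch_size = 500
for start in range(0, len(X_stream), batch_size):
    X_batch = X_stream[start:start + batch_size]
    y_batch = y_stream[start:start + batch_size]
    scaler.partial_fit(X_batch)  # update running mean and variance
    online_svm.partial_fit(scaler.transform(X_batch), y_batch, classes=classes)

holdout_acc = online_svm.score(scaler.transform(X_holdout), y_holdout)
print(f"hold-out accuracy after streaming {len(X_stream)} samples: {holdout_acc:.3f}")
```

Because each batch is seen once and then discarded, memory use stays bounded by the batch size, at the cost of giving up non-linear kernels.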
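V. Reflections
Although SVMs have been overshadowed in recent years by the deep learning boom, they remain a cornerstone of machine learning and are still irreplaceable in many scenarios. This experiment deepened my understanding of machine learning, strengthened my ability to solve practical problems, and increased my interest in AI algorithms. It was a very valuable learning experience.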