张泽鹏先生手搓的纯ANSI处理UTF-8与美团龙猫调用expat库读取Excel xml对比测试

发布于:2025-09-06 ⋅ 阅读:(13) ⋅ 点赞:(0)

昨日预告张泽鹏先生要和AI PK,一夜之间形势变化了,龙猫自编的解析程序已经退出了赛场,现在是调用expat库的程序和张先生的程序PK。
1.100万行16列lineitem大表性能测试,以下是执行时间对比

#张先生挑战程序
time ./aich2 lineitem/xl/worksheets/sheet1.xml A100000:Z200000 out.csv

real	0m0.926s
user	0m0.836s
sys	0m0.024s
time ./aich2 lineitem/xl/worksheets/sheet1.xml A200000:Z300000 out.csv

real	0m0.907s
user	0m0.888s
sys	0m0.020s
time ./aich2 lineitem/xl/worksheets/sheet1.xml A600000:Z700000 out.csv

real	0m1.148s
user	0m1.128s
sys	0m0.020s
time ./aich2 lineitem/xl/worksheets/sheet1.xml A200000:Z700000 out.csv

real	0m4.054s
user	0m3.968s
sys	0m0.080s
time ./aich2 lineitem/xl/worksheets/sheet1.xml A1:Z1000000 out.csv

real	0m8.648s
user	0m7.672s
sys	0m0.268s
#expat
time ./expatxml3 lineitem/xl/worksheets/sheet1.xml A100000:Z200000
解析范围: A100000:Z200000

real	0m17.757s
user	0m16.984s
sys	0m0.644s
time ./expatxml3 lineitem/xl/worksheets/sheet1.xml A200000:Z300000
解析范围: A200000:Z300000

real	0m17.457s
user	0m16.992s
sys	0m0.464s
time ./expatxml3 lineitem/xl/worksheets/sheet1.xml A600000:Z700000
解析范围: A600000:Z700000

real	0m17.501s
user	0m17.020s
sys	0m0.476s
time ./expatxml3 lineitem/xl/worksheets/sheet1.xml A200000:Z700000
解析范围: A200000:Z700000

real	0m22.048s
user	0m18.092s
sys	0m3.632s
time ./expatxml3 lineitem/xl/worksheets/sheet1.xml A1:Z1000000
解析范围: A1:Z1000000

real	0m25.136s
user	0m19.432s
sys	0m4.960s

在arm64 linux上张泽鹏先生的程序快得难以置信,基本和range行数成正比。expat是减去常数后成正比,这个常数大约是16秒。观察代码发现,它先全表扫描解析了一遍,于是让龙猫优化了一下,但是有很多BUG,看来AI还需要磨炼。
2.功能测试
具有换行符和转义符的测试用例如下

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

	<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
	<sheetData>
	<row r="1"><c r="A1" t="b" s="5"><v>1</v></c><c r="B1"><v>123.0</v></c><c r="C1"><v>123456.0</v></c><c r="D1"><v>123.456</v></c><c r="E1"><v>123456.789</v></c><c r="F1" t="inlineStr"><is><t>Hello &lt;world&gt;
Six spaces: `      `
&amp;&apos;&quot;&lt;&gt;€©♥</t></is></c><c r="G1" s="4"><v>45877.9904050926</v></c><c r="H1" s="1"><v>1.0</v></c><c r="I1" s="3"><v>0.9905787037037037</v></c></row><row r="2"><c r="B2"><v>-33.0</v></c><c r="C2"><v>-32.0</v></c><c r="D2"><v>-123.456</v></c><c r="E2"><v>-73.9273028</v></c><c r="F2" t="inlineStr"><is><t>HTML
Strong
中文</t></is></c><c r="H2" s="1"><v>59.0</v></c><c r="I2" s="3"><v>0.0</v></c></row><row r="3"><c r="F3" t="inlineStr"><is><t>Hello World</t></is></c><c r="H3" s="1"><v>61.0</v></c><c r="I3" s="3"><v>0.5</v></c></row><row r="4"><c r="A4" t="b" s="5"><v>0</v></c></row></sheetData></worksheet>

张先生挑战程序输出的csv

1,123.0,123456.0,123.456,123456.789,"Hello <world>
Six spaces: `      `
&';"";<>€©♥",45877.9904050926,1.0,0.9905787037037037
,-33.0,-32.0,-123.456,-73.9273028,"HTML
Strong
中文",,59.0,0.0
,,,,,Hello World,,61.0,0.5
0,,,,,,,,

expat程序./expatxml3 format.xml A1:Z100输出的csv

Row,A,B,C,D,E,F,G,H,I
1,1,123.0,123456.0,123.456,123456.789,Hello <world>
Six spaces: `      `
&'"<>€©♥,45877.9904050926,1.0,0.9905787037037037
2,,-33.0,-32.0,-123.456,-73.9273028,HTML
Strong
中文,,59.0,0.0
3,,,,,,Hello World,,61.0,0.5

expat输出少了1行,换行符单元格没有用双引号包裹,用Excel打开会导致错误,单引号和双引号转义符张先生多输出了;字符,综合性能和功能比赛张泽鹏先生完胜。


网站公告

今日签到

点亮在社区的每一天
去签到