Pyserini: Installation and Usage

Published: 2024-10-15

Contents

Code

Installation

Usage

msmarco-passage bm25


Code

git clone https://github.com/castorini/pyserini.git --recurse-submodules 
pyserini/tools is a git submodule (the anserini-tools repository), which is why the clone command uses --recurse-submodules.

Installation

https://github.com/castorini/pyserini/blob/master/docs/installation.md 
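Pyserini depends on a Java runtime, which can be installed directly with conda. After installing, check the Java version with "java --version".
If you only want to test against the prebuilt open-source indexes, the PyPI Installation is enough.
For development work you need the Development Installation. Its last step copies the fatjar into pyserini/resources/jars/. There are two ways to get the fatjar:

  1. Build it in the anserini project with "mvn clean package"; the output is saved as anserini/target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar. See https://github.com/castorini/anserini?tab=readme-ov-file#-installation
  2. Download it directly: https://repo1.maven.org/maven2/io/anserini/anserini/0.38.0/anserini-0.38.0-fatjar.jar (see https://github.com/castorini/anserini/blob/master/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md)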

Usage

Default download path: ~/.cache/pyserini/
To change the download path: export PYSERINI_CACHE=/path/to/cache
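
The lookup order described above (environment variable first, default path otherwise) can be sketched in a few lines of Python. This is only an illustration of the behavior, not Pyserini's own code; resolve_cache_dir is a hypothetical helper name.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the download directory: PYSERINI_CACHE if set, else ~/.cache/pyserini.

    Hypothetical helper for illustration; not part of the Pyserini API.
    """
    env = os.environ if env is None else env
    override = env.get("PYSERINI_CACHE")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "pyserini"


# Show where downloads would go in the current shell environment.
print(resolve_cache_dir())
```

Running it before and after the export is a quick way to confirm that the variable is visible to the Python process.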

msmarco-passage bm25

https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md 

Download the dataset

mkdir -p collections/msmarco-passage

wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

Convert the collection to JSONL

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl
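
To make the conversion step less opaque, here is a minimal sketch of what such a conversion amounts to: each "id<TAB>text" line of collection.tsv becomes one JSON object with "id" and "contents" fields, the layout JsonCollection reads. The function name tsv_to_jsonl, the shard size, and the docsNN.json file naming are assumptions made for this sketch; the script above is the one to use in practice.

```python
import json
import os
import tempfile


def tsv_to_jsonl(collection_path, output_folder, max_docs_per_file=1_000_000):
    """Write one {"id", "contents"} JSON object per line, split into shards.

    Hypothetical re-implementation for illustration only.
    """
    os.makedirs(output_folder, exist_ok=True)
    written = 0
    out = None
    with open(collection_path, encoding="utf-8") as src:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue
            doc_id, text = line.split("\t", 1)
            # Start a new shard every max_docs_per_file documents.
            if written % max_docs_per_file == 0:
                if out is not None:
                    out.close()
                shard = written // max_docs_per_file
                shard_path = os.path.join(output_folder, f"docs{shard:02d}.json")
                out = open(shard_path, "w", encoding="utf-8")
            record = {"id": doc_id, "contents": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    if out is not None:
        out.close()
    return written


# Tiny demo on a throwaway collection (made-up passages, not MS MARCO data).
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = os.path.join(tmp, "collection.tsv")
    with open(tsv_path, "w", encoding="utf-8") as f:
        f.write("0\tfirst passage\n1\tsecond passage\n2\tthird passage\n")
    out_dir = os.path.join(tmp, "collection_jsonl")
    num_docs = tsv_to_jsonl(tsv_path, out_dir, max_docs_per_file=2)
    shard_names = sorted(os.listdir(out_dir))
    with open(os.path.join(out_dir, shard_names[0]), encoding="utf-8") as f:
        first_doc = json.loads(f.readline())

print(num_docs, shard_names, first_doc)
```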

Build the index

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input collections/msmarco-passage/collection_jsonl \
  --index indexes/lucene-index-msmarco-passage \
  --generator DefaultLuceneDocumentGenerator \
  --threads 9 \
  --storePositions --storeDocvectors --storeRaw
# --index is the path where the index is saved

Search

python -m pyserini.search.lucene \
  --index indexes/lucene-index-msmarco-passage \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.txt \
  --output-format msmarco \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68 \
  --threads 4 --batch-size 16
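
For intuition about the --k1 and --b flags, the sketch below scores a single query term with a textbook BM25 formula. It is an assumption-laden illustration: Lucene's implementation differs in details (for example, how document length is stored), and bm25_term_score and the numbers in the demo are made up for this example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=0.82, b=0.68):
    """Contribution of one query term to one document's score (textbook BM25).

    k1 controls how quickly repeated occurrences of the term saturate;
    b controls how strongly long documents are penalized (b=0 disables it).
    """
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Same term frequency, different passage lengths (made-up statistics).
short_doc = bm25_term_score(tf=2, doc_len=40, avg_doc_len=60, num_docs=1000, doc_freq=50)
long_doc = bm25_term_score(tf=2, doc_len=120, avg_doc_len=60, num_docs=1000, doc_freq=50)
print(short_doc > long_doc)
```

With b=0.68 the shorter passage scores higher for the same term frequency; setting b=0 would make the two scores equal.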

Compute metrics

python -m pyserini.eval.msmarco_passage_eval \
   tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
   runs/run.msmarco-passage.bm25tuned.txt

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
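
As a reading aid for the number above, MRR@10 is the reciprocal rank of the first relevant passage within the top 10, averaged over queries. The sketch below assumes the average is taken over the queries in the qrels; mrr_at_k and the toy data are hypothetical, and the evaluation script above remains the reference.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank at cutoff k.

    qrels: {qid: set of relevant docids}
    run:   {qid: list of docids, best first}
    Queries with no relevant passage in the top k contribute 0.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


# Toy example with made-up ids: ranks 2, miss, 1 give (0.5 + 0 + 1) / 3.
toy_qrels = {"q1": {"d3"}, "q2": {"d9"}, "q3": {"d1"}}
toy_run = {"q1": ["d1", "d3"], "q2": ["d2"], "q3": ["d1"]}
toy_mrr = mrr_at_k(toy_qrels, toy_run)
print(toy_mrr)
```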

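To compute other metrics, the run file (not the index) must first be converted to TREC format, and the qrels must be in TREC format as well: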

https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md#evaluation
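
To show what the format change involves, the sketch below turns MS MARCO run lines ("qid<TAB>docid<TAB>rank") into TREC run lines ("qid Q0 docid rank score tag"). The MS MARCO format carries no score, so the sketch fills in 1/rank as a placeholder; that choice, the function name msmarco_run_to_trec, the tag, and the ids in the demo are all assumptions. The linked guide describes the official tooling.

```python
def msmarco_run_to_trec(lines, tag="bm25"):
    """Convert 'qid<TAB>docid<TAB>rank' lines into TREC run lines.

    Hypothetical helper; the score column is a 1/rank placeholder because
    the MS MARCO run format does not store retrieval scores.
    """
    converted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qid, docid, rank = line.split("\t")
        score = 1.0 / int(rank)
        converted.append(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}")
    return converted


# Made-up ids, for illustration only.
trec_lines = msmarco_run_to_trec(["101\t7001\t1", "101\t7002\t2"])
for trec_line in trec_lines:
    print(trec_line)
```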

Custom datasets

https://github.com/castorini/pyserini/blob/master/docs/usage-index.md

