Dataset Notes: SeekWorld

Published: 2025-06-01
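  • Proposes a new task: Geolocation Reasoning
    • The task requires a model to perceive visual information while also reasoning about the high-level logical relationships implied by the image's visual semantics, in order to determine where the image was taken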
    • TheEighthDay/SeekWorld at main
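  • Builds SeekWorld, a rule-based reinforcement learning dataset for geolocation
    • It contains two training sets
      • Train-Clue-Tracking
        • Detailed reasoning traces for visual clue tracking, collected from o3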
        • {"response": "<think>\n\n
              <image clue>1: crop_images/20250317_190614-norway_rogaland/image_1.jpg [coordinates: 0.0000, 0.0000, 0.9990, 0.9981]\n\n
              <image clue>2: crop_images/20250317_190614-norway_rogaland/image_2.jpg [coordinates: 0.2004, 0.3430, 0.5559, 0.6496]\n\nI am considering the architectural origin of this picture.\n\nThe image resolution is low, perhaps due to a screenshot or downsampling issue, making the text difficult to read. It seems that the vertical sign says \"SKAGEN\"? However, the vertically arranged letters on the red building look like \"S N...\". I plan to continue zooming in on the image to better examine these letters. ...

      • Train-No-Process
        • Ordinary samples without a reasoning process
        • {"message": 
              "[{\"role\": \"system\", 
              \"content\": \"You are a helpful assistant good at solving problems with step-by-step reasoning. 
                          You should first thinks about the reasoning process in the mind and then provides the user with the answer. 
                          The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.\"}, 
              {\"role\": \"user\", 
              \"content\": 
                  [{\"type\": \"image\", 
                  \"image\": \"/data/phd/tiankaibin/dataset/data/streetview_images_first_tier_cities/bourbon_street_french_quarter_new_orleans_la_usa_h45_r100_20250317_184521.jpg\"},
                   {\"type\": \"text\", 
                  \"text\": \"In which country and within which first-level administrative region of that country was this picture taken?
                      Please answer in the format of <answer>$country,administrative_area_level_1$</answer>?\"}]}]", 
                  "answer": "$united states,louisiana/state of louisiana/la/pelican state$"}

  • Using the Train-No-Process data, with Qwen2.5-VL-7B-Instruct as the base model, reinforcement learning training yields SeekWorld-7B, a dedicated visual geolocation model
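
The "rule-based" part of the dataset suggests that a model response can be scored by simple string matching against the `answer` field. The following is a minimal sketch of such a check, not the authors' actual reward function. It assumes that `/` in the label separates accepted aliases (as in `louisiana/state of louisiana/la/pelican state`), that matching is case-insensitive, and that the reward is binary. The names `parse_label` and `geolocation_reward` are hypothetical.

```python
import re

# Captures the text between <answer> tags, tolerating optional "$" delimiters.
ANSWER_RE = re.compile(r"<answer>\s*\$?(.*?)\$?\s*</answer>", re.DOTALL)


def _aliases(field: str) -> set:
    """Split 'a/b/c' into a set of lowercase aliases (assumed alias separator: '/')."""
    return {alias.strip().lower() for alias in field.split("/") if alias.strip()}


def parse_label(label: str):
    """Turn '$country,region/alias/...$' into (country aliases, region aliases)."""
    country, _, region = label.strip().strip("$").partition(",")
    return _aliases(country), _aliases(region)


def geolocation_reward(response: str, label: str) -> float:
    """Return 1.0 if both the country and the region match an alias, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # no well-formed <answer> block
    pred_country, _, pred_region = match.group(1).partition(",")
    countries, regions = parse_label(label)
    country_ok = pred_country.strip().lower() in countries
    region_ok = pred_region.strip().lower() in regions
    return 1.0 if (country_ok and region_ok) else 0.0


label = "$united states,louisiana/state of louisiana/la/pelican state$"
print(geolocation_reward("<think>...</think><answer>$United States,LA$</answer>", label))
```

A real training setup might also give partial credit for a correct country or add a separate format reward. The source does not say, so the sketch leaves both out.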

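The `<image clue>` lines in the Train-Clue-Tracking sample above pair a cropped image path with four numbers in `[coordinates: ...]`. The sketch below shows one way to extract them from a `response` string. It assumes the four values are a normalized bounding box in `(x0, y0, x1, y1)` order, which the sample values suggest but the source does not state. The names `ImageClue`, `parse_image_clues`, and `to_pixels` are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class ImageClue(NamedTuple):
    index: int                 # clue number as written in the trace
    path: str                  # relative path of the cropped image
    box: Tuple[float, ...]     # assumed normalized (x0, y0, x1, y1)


CLUE_RE = re.compile(
    r"<image clue>(\d+):\s*(\S+)\s*\[coordinates:\s*([^\]]+)\]"
)


def parse_image_clues(response: str) -> List[ImageClue]:
    """Collect every '<image clue>' entry found in a reasoning trace."""
    clues = []
    for index, path, coords in CLUE_RE.findall(response):
        box = tuple(float(value) for value in coords.split(","))
        clues.append(ImageClue(int(index), path, box))
    return clues


def to_pixels(box, width: int, height: int):
    """Scale a normalized box to pixel coordinates (assumes x0, y0, x1, y1 order)."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


trace = (
    "<think>\n\n<image clue>1: crop_images/20250317_190614-norway_rogaland/image_1.jpg "
    "[coordinates: 0.0000, 0.0000, 0.9990, 0.9981]\n\n"
    "<image clue>2: crop_images/20250317_190614-norway_rogaland/image_2.jpg "
    "[coordinates: 0.2004, 0.3430, 0.5559, 0.6496]\n\nI am considering ..."
)
for clue in parse_image_clues(trace):
    print(clue.index, clue.path, clue.box)
```

Under that assumption, the first clue covers nearly the whole image and the second is a zoomed-in region, which matches the trace's description of zooming in to read the sign.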