MIT XV6 - 1.5 Lab: Xv6 and Unix utilities - xargs

Published: 2025-05-11

Continued from the previous post, MIT XV6 - 1.4 Lab: Xv6 and Unix utilities - find

xargs

Continuing with the labs. The lab description and requirements are as follows (original, translation):

Write a simple version of the UNIX xargs program for xv6: its arguments describe a command to run, it reads lines from the standard input, and it runs the command for each line, appending the line to the command’s arguments. Your solution should be in the file user/xargs.c.

The following example illustrates xarg’s behavior:

$ echo hello too | xargs echo bye
bye hello too
$

Note that the command here is “echo bye” and the additional arguments are “hello too”, making the command “echo bye hello too”, which outputs “bye hello too”.

Please note that xargs on UNIX makes an optimization where it will feed more than argument to the command at a time. We don’t expect you to make this optimization. To make xargs on UNIX behave the way we want it to for this lab, please run it with the -n option set to 1. For instance

$ (echo 1 ; echo 2) | xargs -n 1 echo
1
2
$

Some hints:

  • Use fork and exec to invoke the command on each line of input. Use wait in the parent to wait for the child to complete the command.
  • To read individual lines of input, read a character at a time until a newline (‘\n’) appears.
  • kernel/param.h declares MAXARG, which may be useful if you need to declare an argv array.
  • Add the program to UPROGS in Makefile.
  • Changes to the file system persist across runs of qemu; to get a clean file system run make clean and then make qemu.

xargs, find, and grep combine well:

  $ find . b | xargs grep hello

will run “grep hello” on each file named b in the directories below “.”.
To test your solution for xargs, run the shell script xargstest.sh. Your solution is correct if it produces the following output:

 $ make qemu
 ...
 init: starting sh
 $ sh < xargstest.sh
 $ $ $ $ $ $ hello
 hello
 hello
 $ $

You may have to go back and fix bugs in your find program. The output has many $ because the xv6 shell doesn’t realize it is processing commands from a file instead of from the console, and prints a $ for each command in the file.

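The sentence above 👆 turns out to leave a trap.

This lab asks us to implement a simplified xargs, which uses pipes and multiple processes to compose commands and get around limits on the number of arguments. Typical uses are filtering a log file so that every record containing some string is written to a new file, or post-processing the output of find. It is the "toolbox" philosophy taken all the way.

Preparation

To do this lab well, it is worth rereading the textbook, especially sections 1.2 I/O and File descriptors and 1.3 Pipes. The key points we will need are: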

  • A file descriptor is a small integer representing a kernel-managed object that a process may read from or write to.
  • Internally, the xv6 kernel uses the file descriptor as an index into a per-process table, so that
    every process has a private space of file descriptors starting at zero.
  • By convention, a process reads from file descriptor 0 (standard input), writes output to file descriptor 1 (standard output), and writes error messages to file descriptor 2 (standard error).
  • The close system call releases a file descriptor, making it free for reuse by a future open, pipe, or dup system call (see below). A newly allocated file descriptor is always the lowest-numbered unused descriptor of the current process.

The shell contains an interesting (tricky) piece of code involving cat, which relies on the descriptor allocation rule quoted above:

char *argv[2];
argv[0] = "cat";
argv[1] = 0;
if(fork() == 0) {
    close(0);
    open("input.txt", O_RDONLY);
    exec("cat", argv);
}

After the child closes file descriptor 0, open is guaranteed to use that file descriptor for the newly opened input.txt: 0 will be the smallest available file descriptor. cat then executes with file descriptor 0 (standard input) referring to input.txt. The parent process’s file descriptors are not changed by this sequence, since it modifies only the child’s descriptors.
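The same trick can be tried on an ordinary POSIX system. The block below is a minimal sketch under that assumption, not xv6 code; the helper name `redirect_stdin_from` is hypothetical and does not appear in sh.c.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: make `path` the standard input of the calling
 * process. Closing descriptor 0 frees the lowest-numbered slot, so the
 * open() that follows is handed descriptor 0 again.
 * Returns the new descriptor (0 when descriptors below it are not free)
 * or -1 if open() fails, in which case stdin stays closed. */
int redirect_stdin_from(const char *path)
{
    close(0);
    return open(path, O_RDONLY);
}
```

A shell would call something like this in the child, between fork and exec, so that the parent's own standard input is left untouched.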

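What does this mean? When we run cat < input.txt in the shell, the shell releases standard input (file descriptor 0) in the child process it creates, so the next call to open is guaranteed to return descriptor 0. When cat then actually runs, it does not need to care where its input comes from, whether a file or a pipe, and none of this affects the parent process. Let's first look at the implementation in cat.c: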

void cat(int fd)
{
  int n;

  while((n = read(fd, buf, sizeof(buf))) > 0) {
    if (write(1, buf, n) != n) {
      fprintf(2, "cat: write error\n");
      exit(1);
    }
  }
  if(n < 0){
    fprintf(2, "cat: read error\n");
    exit(1);
  }
}

int main(int argc, char *argv[])
{
  int fd, i;

  if(argc <= 1){
    cat(0);
    exit(0);
  }

  for(i = 1; i < argc; i++){
    if((fd = open(argv[i], O_RDONLY)) < 0){
      fprintf(2, "cat: cannot open %s\n", argv[i]);
      exit(1);
    }
    cat(fd);
    close(fd);
  }
  exit(0);
}

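As you can see, when there is at most one argument (argc <= 1), main calls cat(0) directly, and the cat function simply keeps reading from the given file descriptor and writing to standard output as it goes, like a dutiful parrot. Only when argc > 1 does it open each argument in turn, whether it is a device or a file, and call cat on each one.

With the prerequisites in place, let's trace the actual implementation of the xv6 shell in user/sh.c. Suppose we have just typed cat < input.txt into the shell, and follow step by step how cat ends up being invoked:

  • Get the current command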

    int main(void)
    {
        ...
    
        // Read and run input commands.
        while(getcmd(buf, sizeof(buf)) >= 0){
            if(buf[0] == 'c' && buf[1] == 'd' && buf[2] == ' '){
                // Chdir must be called by the parent, not the child.
                buf[strlen(buf)-1] = 0;  // chop \n
                if(chdir(buf+3) < 0)
                    fprintf(2, "cannot cd %s\n", buf+3);
                continue;
            }
            if(fork1() == 0)
                runcmd(parsecmd(buf));
            wait(0);
        }
        exit(0);
    }
    

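    If the command is not cd, the shell forks and then parses and runs the command in the child process.

  • parsecmd -> parseline -> parsepipe -> parseexec -> parseredirs: the call chain is dizzyingly deep. Let's look at the implementation of parseredirs, together with redircmd, which it calls: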

    struct cmd* redircmd(struct cmd *subcmd, char *file, char *efile, int mode, int fd)
    {
        struct redircmd *cmd;
    
        cmd = malloc(sizeof(*cmd));
        memset(cmd, 0, sizeof(*cmd));
        cmd->type = REDIR;
        cmd->cmd = subcmd;
        cmd->file = file;
        cmd->efile = efile;
        cmd->mode = mode;
        cmd->fd = fd;
        return (struct cmd*)cmd;
    }
    
    struct cmd* parseredirs(struct cmd *cmd, char **ps, char *es)
    {
        int tok;
        char *q, *eq;
    
        while(peek(ps, es, "<>")){
            tok = gettoken(ps, es, 0, 0);
            if(gettoken(ps, es, &q, &eq) != 'a')
                panic("missing file for redirection");
            switch(tok){
                case '<':
                    cmd = redircmd(cmd, q, eq, O_RDONLY, 0);
                    break;
                case '>':
                    cmd = redircmd(cmd, q, eq, O_WRONLY|O_CREATE|O_TRUNC, 1);
                    break;
                case '+':  // >>
                    cmd = redircmd(cmd, q, eq, O_WRONLY|O_CREATE, 1);
                    break;
            }
        }
        return cmd;
    }
    

    Notice that when tok is < or >, the fd argument passed when building the cmd is exactly stdin (0) or stdout (1), respectively.

  • Next comes runcmd's handling of REDIR commands

    void runcmd(struct cmd *cmd)
    {
        ...
        switch(cmd->type){
            ...
            case EXEC:
                ecmd = (struct execcmd*)cmd;
                if(ecmd->argv[0] == 0)
                    exit(1);
                exec(ecmd->argv[0], ecmd->argv);
                fprintf(2, "exec %s failed\n", ecmd->argv[0]);
                break;
            case REDIR:
                rcmd = (struct redircmd*)cmd;
                close(rcmd->fd);
                if(open(rcmd->file, rcmd->mode) < 0){
                    fprintf(2, "open %s failed\n", rcmd->file);
                    exit(1);
                }
                runcmd(rcmd->cmd);
                break;
        }
        exit(0);
    }
    

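    What happens here? First close(rcmd->fd) runs, closing the stdin descriptor 0 that was passed in earlier. Then open is called on the input.txt from cat < input.txt, so the file lands on descriptor 0. Then runcmd(rcmd->cmd) is called, and this time the command is of type EXEC, so it goes straight to exec(ecmd->argv[0], ecmd->argv). The error message right after that call is there because exec does not return when it succeeds, so that line only runs on failure.

The lab

Our xargs goes through the PIPE type of cmd: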
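The "exec only returns on failure" rule is easy to check on a host system. The following is a minimal POSIX sketch, not xv6 code: `spawn_and_wait` is a hypothetical helper, and it uses `execvp` (which searches PATH) where xv6 uses `exec`.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with argv in a child process.
 * Returns the child's exit status, or -1 if fork/wait fails or the
 * child was terminated by a signal. */
int spawn_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        /* Reached only if execvp failed: on success the child's
         * program image has been replaced and this code is gone. */
        perror("execvp");
        _exit(127);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The explicit `_exit(127)` matters: without it, a child whose exec failed would carry on running the parent's code. In sh.c the trailing exit(0) in runcmd plays that role.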

void runcmd(struct cmd *cmd)
{
    ...
    switch(cmd->type){
        case EXEC:
            ecmd = (struct execcmd*)cmd;
            if(ecmd->argv[0] == 0)
                exit(1);
            exec(ecmd->argv[0], ecmd->argv);
            fprintf(2, "exec %s failed\n", ecmd->argv[0]);
            break;
        case PIPE:
            pcmd = (struct pipecmd*)cmd;
            if(pipe(p) < 0)
                panic("pipe");
            if(fork1() == 0){
                close(1);
                dup(p[1]);
                close(p[0]);
                close(p[1]);
                runcmd(pcmd->left);
            }
            if(fork1() == 0){
                close(0);
                dup(p[0]);
                close(p[0]);
                close(p[1]);
                runcmd(pcmd->right);
            }
            close(p[0]);
            close(p[1]);
            wait(0);
            wait(0);
            break;
    }
    exit(0);
}

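Suppose we test with the command find . b | xargs grep hello. Then pcmd->left is find . b and pcmd->right is xargs grep hello. Notice that the left child's stdout (descriptor 1) is redirected to the write end of the pipe with dup(p[1]), and the right child's stdin (descriptor 0) is redirected to the read end with dup(p[0]). Both children then close the pipe's original read and write descriptors... yet the pipe is still there, because the duplicated descriptors 0 and 1 still refer to it. How did people first come up with this idiom?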
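The same close/dup pattern can be reproduced outside xv6. The block below is a minimal POSIX sketch of a two-stage pipeline, assuming descriptors 0 and 1 are open in the caller; `run_pipeline` is a hypothetical helper, and `execvp` stands in for the recursive runcmd calls in sh.c.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `left | right`.
 * Returns the exit status of `right`, or -1 on error. */
int run_pipeline(char *const left[], char *const right[])
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    pid_t lpid = fork();
    if (lpid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (lpid == 0) {
        close(1);
        dup(p[1]);       /* lowest free descriptor is 1: stdout -> pipe */
        close(p[0]);
        close(p[1]);     /* descriptor 1 still keeps the pipe open */
        execvp(left[0], left);
        _exit(127);
    }

    pid_t rpid = fork();
    if (rpid < 0) {
        close(p[0]);
        close(p[1]);
        waitpid(lpid, NULL, 0);
        return -1;
    }
    if (rpid == 0) {
        close(0);
        dup(p[0]);       /* lowest free descriptor is 0: stdin <- pipe */
        close(p[0]);
        close(p[1]);
        execvp(right[0], right);
        _exit(127);
    }

    /* The parent closes both ends too; otherwise the reader would
     * never see end-of-file, since a write end would remain open. */
    close(p[0]);
    close(p[1]);

    int status;
    waitpid(lpid, NULL, 0);
    if (waitpid(rpid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Closing p[0] and p[1] in every process is what makes end-of-file work: a reader only sees EOF once every descriptor referring to the write end has been closed.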

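So all we have to do is read stdin (descriptor 0) in a loop: on reaching a newline \n, append the line as one argument to the argument list passed to the child process, wait for the child to finish, and move on to the next round regardless of whether it failed.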
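The argument assembly step can be checked in isolation on a host system. This is a small sketch with a hypothetical helper `build_args`; the `cap` parameter stands in for xv6's MAXARG.

```c
#include <stddef.h>

/* Hypothetical helper: fill `out` with argv[1..argc-1] followed by
 * `line` and a terminating NULL. `cap` is the number of slots in `out`.
 * Returns the number of arguments stored (not counting the NULL),
 * or -1 if there is no command or the arguments do not fit. */
int build_args(int argc, char *argv[], char *line, char *out[], int cap)
{
    int n = 0;

    /* argc - 1 command words + 1 input line + 1 NULL must fit. */
    if (argc < 2 || argc + 1 > cap)
        return -1;
    for (int i = 1; i < argc; i++)
        out[n++] = argv[i];
    out[n++] = line;
    out[n] = NULL;
    return n;
}
```

Note that the whole line is passed as a single argument, which is the approach this post takes. Standard UNIX xargs splits input on blanks as well, but for the lab's examples the visible output is the same.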

/**
 * xargs.c - A simplified implementation of the Unix xargs command
 *
 * This module implements a basic version of the xargs command, which reads
 * items from standard input, delimited by newlines, and executes a command for
 * each item.
 *
 * Algorithm Description:
 * 1. Read command-line arguments (the command to execute)
 * 2. Read input lines from stdin
 * 3. For each input line:
 *    - Construct argument list by combining command args and input line
 *    - Fork a child process
 *    - Execute the command in child process
 *    - Parent waits for child to complete
 *
 * Sequence Diagram:
 *
 * Parent Process                    Child Process
 *     |                                |
 *     |-- fork() --------------------> |
 *     |                                |
 *     |<-- exec(command, args) ------- |
 *     |                                |
 *     |-- wait() --------------------> |
 *     |                                |
 *     |<-- exit() -------------------- |
 *     |                                |
 *
 * @author: xv6-labs
 * @version: 1.0
 */

#include "kernel/types.h"
#include "kernel/param.h"
#include "user/user.h"
#include "kernel/fcntl.h"

/**
 * Reads a single line from standard input into the provided buffer
 *
 * @param buf: Buffer to store the read line
 * @param max: Maximum number of characters to read
 * @return: Number of characters read (excluding null terminator)
 *
 * This function reads characters one by one from stdin until either:
 * - A newline character is encountered
 * - The maximum buffer size is reached
 * - End of input is reached
 */
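int readline_from_stdin(char* buf, int max) {
  int n = 0;
  // Stop at max - 1 so the terminating '\0' below always fits in buf.
  while (n < max - 1 && read(0, buf + n, 1) > 0) {
    if (buf[n] == '\n') {
      break;
    }
    n++;
  }

  buf[n] = '\0';  // Overwrites the newline, if one was read
  return n;       // 0 on end of input and also on an empty line
}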

/**
 * Main function implementing the xargs command
 *
 * Usage: xargs command [arg1 arg2 ...]
 *
 * The program:
 * 1. Validates command-line arguments
 * 2. Reads input lines from stdin
 * 3. For each line:
 *    - Constructs argument list
 *    - Executes command with arguments
 *
 * @param argc: Number of command-line arguments
 * @param argv: Array of command-line argument strings
 * @return: 0 on success, 1 on error
 */
int main(int argc, char* argv[]) {
  // Validate minimum number of arguments
  if (argc < 2) {
    fprintf(2, "usage: xargs command\n");
    exit(1);
  }

  // Check if number of arguments exceeds system limit
  // args needs argc + 1 slots: argc - 1 command words, the input line,
  // and the terminating 0, so argc may be at most MAXARG - 1
  if (argc > MAXARG - 1) {
    fprintf(2, "xargs: too many arguments\n");
    exit(1);
  }

  char buf[1024];      // Buffer for reading input lines
  char* args[MAXARG];  // Array to hold command arguments
  args[0] = argv[1];   // First argument is the command to execute

  // Process each line from stdin
  while (readline_from_stdin(buf, sizeof(buf) / sizeof(buf[0])) > 0) {
    // Copy command-line arguments to args array
    for (int i = 2; i < argc; i++) {
      args[i - 1] = argv[i];
    }
    args[argc - 1] = buf;  // Add the input line as the last argument
    args[argc] = 0;        // Null terminate the argument list

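    // Create child process to execute command
    if (fork() == 0) {
      exec(argv[1], args);  // Execute command with constructed arguments
      // exec only returns on failure; exit so the child does not fall
      // through into the parent's read loop
      fprintf(2, "xargs: exec %s failed\n", argv[1]);
      exit(1);
    }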

    wait(0);  // Parent waits for child to complete
  }

  exit(0);
}
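One caveat about the line reader above: readline_from_stdin returns 0 both at end of input and for an empty line, so a blank line in the input ends the loop early. The block below is a sketch of one possible variant that tells the two cases apart. It is written against POSIX `read` so that it can be tested on a host, and the name `read_line` and its -1 convention are assumptions of this sketch, not part of the lab.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical variant of readline_from_stdin: reads one line from fd
 * into buf (capacity max, at least 1), drops the newline, and returns
 * the line length. Returns -1 only when end of input is reached before
 * any character was stored, so an empty line (length 0) is not
 * mistaken for EOF. */
int read_line(int fd, char *buf, int max)
{
    int n = 0;
    char c;

    for (;;) {
        ssize_t r = read(fd, &c, 1);
        if (r <= 0) {
            if (n == 0)
                return -1;   /* nothing left to read */
            break;           /* last line had no trailing newline */
        }
        if (c == '\n')
            break;
        if (n < max - 1)
            buf[n++] = c;    /* characters beyond the capacity are dropped */
    }
    buf[n] = '\0';
    return n;
}
```

With this variant the main loop would run while the return value is >= 0 and skip lines of length 0, instead of stopping at the first blank line.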

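At first I went off track: partway through I forgot that the task is to pass each line as an argument to the child process, and I set up pipe redirection all over again. It kept failing, and only after reading the grep source did I see the problem. The results: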

make qemu
qemu-system-riscv64 -machine virt -bios none -kernel kernel/kernel -m 128M -smp 3 -nographic -global virtio-mmio.force-legacy=false -drive file=fs.img,if=none,format=raw,id=x0 -device virtio-blk-device,drive=x0,bus=virtio-mmio-bus.0

xv6 kernel is booting

hart 1 starting
hart 2 starting
init: starting sh
$ sh < xargstest.sh
$ $ $ $ $ $ hello
hello
hello
$ $ 

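As for the sentence at the end of the lab, I did not understand it. As I read it, this is clearly an issue with the shell, user/sh.c. Why would I be asked to change find?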

You may have to go back and fix bugs in your find program. The output has many $ because the xv6 shell doesn’t realize it is processing commands from a file instead of from the console, and prints a $ for each command in the file.

If I run find . b | xargs grep hello directly, there are not that many $ signs, so why should I change find?

$ find . b | xargs grep hello
hello
hello
hello
$