Hadoop: HDFS
1. HDFS Shell Operations
Start the Hadoop cluster (for the tests that follow)
[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-yarn.sh
-help: print the usage of a command
[atguigu@hadoop102 ~]$ hadoop fs -help rm
-ls: list directory contents
[atguigu@hadoop102 ~]$ hadoop fs -ls /
-mkdir: create a directory on HDFS
[atguigu@hadoop102 ~]$ hadoop fs -mkdir -p /user/atguigu/input
[atguigu@hadoop102 ~]$ hadoop fs -mkdir /bigdata0523
-moveFromLocal: move (cut and paste) a file from the local file system to HDFS
[atguigu@hadoop102 hadoop-3.1.3]$ vim qiangge.txt
---
只年说:流年同学你好,可以加个微信吗?
---
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -moveFromLocal ./qiangge.txt /bigdata0523
-appendToFile: append a local file to the end of an existing HDFS file
[atguigu@hadoop102 hadoop-3.1.3]$ vim pantongxue.txt
---
流年同学说:你说加就加啊。不给!!
---
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -appendToFile pantongxue.txt /bigdata0523/qiangge.txt
-cat: display the contents of a file
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cat /bigdata0523/qiangge.txt
2025-07-05 12:03:57,587 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
只年说:流年同学你好,可以加个微信吗?
流年同学说:你说加就加啊。不给!!
-chgrp, -chmod, -chown: same usage as in the Linux file system; change a file's group, permissions, or owner
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -chmod u+x /bigdata0523/qiangge.txt
-copyFromLocal: copy a file from the local file system to an HDFS path
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -copyFromLocal pantongxue.txt /bigdata0523
-put does the same thing; in practice, just use put
[atguigu@hadoop102 hadoop-3.1.3]$ vim xinge.txt
---
龙哥说:你们俩在干啥,带我一个呗!!!
---
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -put xinge.txt /bigdata0523
-copyToLocal: copy from HDFS to the local file system
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -copyToLocal /bigdata0523/qiangge.txt .
-get does the same thing; in practice, just use get
[atguigu@hadoop102 hadoop-3.1.3]$ vim mengmeng.txt
---
被班主任耽误的舞蹈选手
---
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -moveFromLocal mengmeng.txt /bigdata0523
[atguigu@hadoop102 hadoop-3.1.3]$ ll | grep mengmeng
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -get /bigdata0523/mengmeng.txt .
[atguigu@hadoop102 hadoop-3.1.3]$ ll | grep mengmeng
-rw-r--r--. 1 atguigu atguigu 34 7月  5 12:34 mengmeng.txt
-cp: copy from one HDFS path to another HDFS path
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp /bigdata0523/mengmeng.txt /
-mv: move files within HDFS
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /mengmeng.txt /user/atguigu/input
# -mv can also rename. In the command below, dagaoge.txt has no explicit path in front of it, so it ends up under /user/atguigu by default; much like Linux, HDFS has the concept of a user home directory
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /user/atguigu/input/mengmeng.txt dagaoge.txt
# A move can change the name at the same time
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /user/atguigu/dagaoge.txt /user/atguigu/input/mengmeng.txt
-get: same as copyToLocal; download a file from HDFS to the local file system
# -get was already demonstrated under copyToLocal above, so it is not run again here
hadoop fs -get /bigdata0523/mengmeng.txt .
-getmerge: download and merge multiple files, e.g. when the HDFS directory /user/atguigu/test contains several files: log.1, log.2, log.3 ...
# Ignore the file extension; it means nothing. The file simply holds the merged contents. getmerge is rarely used
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -getmerge /bigdata0523 ./teacherAndStudent.avi
[atguigu@hadoop102 hadoop-3.1.3]$ cat teacherAndStudent.avi
被班主任耽误的舞蹈选手
流年同学说:你说加就加啊。不给!!
只年说:流年同学你好,可以加个微信吗?
流年同学说:你说加就加啊。不给!!
龙哥说:你们俩在干啥,带我一个呗!!!
-put: same as copyFromLocal
# -put was already demonstrated under copyFromLocal above, so it is not run again here
hadoop fs -put xinge.txt /bigdata0523
-tail: display the end (last 1 KB) of a file
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -tail /input/README.txt
# -n (showing a given number of lines) is not supported; add -f to follow the file in real time
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -tail -f /input/README.txt
# After the replication factor was changed to 5, the append operation fails
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -appendToFile mengmeng.txt /input/README.txt
2025-07-05 13:11:46,387 WARN hdfs.DataStreamer: DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[192.168.1.102:9866,DS-da913c4d-3a02-424d-b964-2a3602c1db98,DISK], DatanodeInfoWithStorage[192.168.1.103:9866,DS-3849d948-2fd7-40c0-925b-c498f6c67f76,DISK], DatanodeInfoWithStorage[192.168.1.104:9866,DS-692fc8b7-3c6d-464e-8d81-674708d0ee44,DISK]], original=[DatanodeInfoWithStorage[192.168.1.104:9866,DS-692fc8b7-3c6d-464e-8d81-674708d0ee44,DISK], DatanodeInfoWithStorage[192.168.1.103:9866,DS-3849d948-2fd7-40c0-925b-c498f6c67f76,DISK], DatanodeInfoWithStorage[192.168.1.102:9866,DS-da913c4d-3a02-424d-b964-2a3602c1db98,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
	at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
	at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
	at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:720)
appendToFile: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[192.168.1.102:9866,DS-da913c4d-3a02-424d-b964-2a3602c1db98,DISK], DatanodeInfoWithStorage[192.168.1.103:9866,DS-3849d948-2fd7-40c0-925b-c498f6c67f76,DISK], DatanodeInfoWithStorage[192.168.1.104:9866,DS-692fc8b7-3c6d-464e-8d81-674708d0ee44,DISK]], original=[DatanodeInfoWithStorage[192.168.1.104:9866,DS-692fc8b7-3c6d-464e-8d81-674708d0ee44,DISK], DatanodeInfoWithStorage[192.168.1.103:9866,DS-3849d948-2fd7-40c0-925b-c498f6c67f76,DISK], DatanodeInfoWithStorage[192.168.1.102:9866,DS-da913c4d-3a02-424d-b964-2a3602c1db98,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
# Watch wc.input instead
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -tail -f /wcinput/wc.input
# Then open a second terminal on hadoop102
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -appendToFile mengmeng.txt /wcinput/wc.input
# The appended content appears in the first terminal's -tail -f output
Why the append to README.txt fails after raising its replication factor to 5: the write pipeline reports a bad datanode, presumably because with 5 replicas requested on a 3-datanode cluster two of them can never be satisfied. We leave it at that for now.
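The error message itself points at the client setting dfs.client.block.write.replace-datanode-on-failure.policy. As a sketch only (we do not apply it here), a client that must append on such a small cluster could relax that policy in its hdfs-site.xml:

<!-- client-side hdfs-site.xml; sketch only: do not demand a replacement datanode when none is available on a 3-node cluster -->
<property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>NEVER</value>
</property>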
-rm: delete files or directories
# Delete a file
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm /wcoutput/_SUCCESS
Deleted /wcoutput/_SUCCESS
# Delete a directory (plain rm cannot delete directories)
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm /wcoutput
rm: `/wcoutput': Is a directory
# -r (recursive) is required to delete a directory
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm -r /wcoutput
Deleted /wcoutput
-rmdir: delete an empty directory
# Delete an empty directory
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir /abc
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rmdir /abc
# Delete a non-empty directory (not allowed)
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rmdir /bigdata0523
rmdir: `/bigdata0523': Directory is not empty
-du: report the size of a directory
# Per-file sizes under a directory (read the first column; the second column is the total space including replicas, so with a replication factor of 3 it is roughly 3x the first column)
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -du -h /input
1.3 K    4.0 K    /input/README.txt
186.0 M  557.9 M  /input/jdk-8u212-linux-x64.tar.gz
# Size of the directory as a whole (-s produces a summary)
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -du -h -s /input
186.0 M  557.9 M  /input
-setrep: set the replication factor of a file in HDFS
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -setrep 2 /input/README.txt
Replication 2 set: /input/README.txt
# Which machines hold these two replicas is decided by the NameNode. Next, set the replication factor to 5: the recorded replication becomes 5, but only 3 replicas are actually stored, because we only have 3 datanodes
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -setrep 5 /input/README.txt
Replication 5 set: /input/README.txt
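If you want to confirm how many replicas were actually written, the standard fsck block report can be used (output omitted here; it varies by cluster):

# show the file's blocks and the datanodes that hold each replica
[atguigu@hadoop102 hadoop-3.1.3]$ hdfs fsck /input/README.txt -files -blocks -locations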
Fixing the permission error when operating through the web UI
When you operate on files through the web page, you may see an error similar to:
Permission denied: user=dr.who, access=WRITE, inode="/":atguigu:supergroup:drwxr-xr-x
This happens because Hadoop enables permission checking by default and uses dr.who as the static user for HTTP access. Either disable permission checking or configure the static HTTP user as atguigu; one of the two is enough. We already configured the static user earlier, so this error does not occur for us.
In core-site.xml, set the static HTTP user to atguigu
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>atguigu</value>
</property>
In hdfs-site.xml, disable permission checking
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>
After making the change, restart HDFS. If you run into the error
Cannot delete /wcoutput. Name node is in safe mode.
The reported blocks 40 has reached the threshold 0.9990 of total blocks 40. The minimum number of live datanodes is not required. In safe mode extension. Safe mode will be turned off automatically in 6 seconds. NamenodeHostName:hadoop102
just wait a moment: right after HDFS restarts, the NameNode stays in safe mode for a short while, during which such operations are not allowed.
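If you would rather not wait, safe mode can be inspected or left manually with the standard dfsadmin commands (normally unnecessary, since it exits on its own):

# check whether the NameNode is still in safe mode
[atguigu@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -safemode get
# force the NameNode to leave safe mode
[atguigu@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -safemode leave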
2. HDFS Client Operations
2.1 Client Environment Setup
Place hadoop-3.1.0 under D:\DevSoft\Hadoop\, so its path is D:\DevSoft\Hadoop\hadoop-3.1.0
Configure the HADOOP_HOME environment variable with the value D:\DevSoft\Hadoop\hadoop-3.1.0
Add %HADOOP_HOME%\bin to the Path environment variable, then restart the computer
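A quick sanity check, assuming the hadoop-3.1.0 folder is the usual Windows dependency package that ships winutils.exe in its bin directory (if yours does not, skip this):

:: open cmd after the restart; if HADOOP_HOME and Path are set correctly, winutils prints its usage text instead of "not recognized"
C:\> winutils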
Create a Maven project
Add the required dependency coordinates plus logging
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
        <version>2.12.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
</dependencies>
In the project's src/main/resources directory, create a new file named log4j2.xml and fill it with:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="error" strict="true" name="XMLConfig">
    <Appenders>
        <!-- The type is Console; the name attribute is required -->
        <Appender type="Console" name="STDOUT">
            <!-- PatternLayout output, e.g. [INFO] [2018-01-22 17:34:01][org.test.Console]I'm here -->
            <Layout type="PatternLayout" pattern="[%p] [%d{yyyy-MM-dd HH:mm:ss}][%c{10}]%m%n" />
        </Appender>
    </Appenders>
    <Loggers>
        <!-- additivity is set to false -->
        <Logger name="test" level="info" additivity="false">
            <AppenderRef ref="STDOUT" />
        </Logger>
        <!-- root logger configuration -->
        <Root level="info">
            <AppenderRef ref="STDOUT" />
        </Root>
    </Loggers>
</Configuration>
Create a package: com.atguigu.hdfs
Create the TestHDFS class
/**
 * Client-style development:
 * 1. Get the client object
 * 2. Call the relevant methods to do the work
 * 3. Close the client
 */
public class TestHDFS {

    @Test
    public void testHDFS() throws IOException, InterruptedException {
        // 1. Create the file system object
        URI uri
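The listing above is cut off at `URI uri`. A minimal sketch of how the test typically continues, assuming the NameNode RPC address hdfs://hadoop102:8020 and the atguigu user (adjust both to your cluster); the directory name /java-client-test is made up for illustration:

package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;

public class TestHDFS {

    @Test
    public void testHDFS() throws IOException, InterruptedException {
        // 1. Create the file system (client) object; address and user are assumptions for this sketch
        URI uri = URI.create("hdfs://hadoop102:8020");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(uri, conf, "atguigu");

        // 2. Call a method to do the actual work, e.g. create a directory
        fs.mkdirs(new Path("/java-client-test"));

        // 3. Close the client
        fs.close();
    }
}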