基于
Docker
构建
Hadoop
平台
0.
绪论
使⽤
Docker
搭建
Hadoop
技术平台,包括安装
Docker
、
Java
、
Scala
、
Hadoop
、
Hbase
、
Spark
。
集群共有
5
台机器,主机名分别为
h01
、
h02
、
h03
、
h04
、
h05
。其中
h01
为
master
,其他的为
slave
。
虚拟机配置:建议
1
盒
2
线程、
8G
内存、
30G
硬盘。最早配置
4G
内存,
HBase
和
Spark
运⾏异常。
JDK 1.8
Scala 2.11.12
Hadoop 3.3.3
Hbase 3.0.0
Spark 3.3.0
1. Docker
1.1 Ubuntu 22.04
安装
Docker
在
Ubuntu
下对
Docker
的操作都需要加上
sudo
,如果已经是
root
账号了,则不需要。
如果不加
sudo
,
Docker
相关命令会⽆法执⾏。
在
Ubuntu
下安装
Docker
的时候需在管理员的账号下操作。
安装完成之后,以
sudo
启动
Docker
服务。
显⽰
Docker
中所有正在运⾏的容器,由于
Docker
才安装,我们没有运⾏任何容器,所以显⽰结果如
下所⽰。
1.2
使⽤
Docker
现在的
Docker
⽹络能够提供
DNS
解析功能,我们可以使⽤如下命令为接下来的
Hadoop
集群单独构
建⼀个虚拟的⽹络。可以采⽤直通、桥接或
macvlan
⽅式,这⾥采⽤桥接模式,可以做到
5
台主机互联,
并能访问宿主机和⽹关,可以连接外⽹,便于在线下载程序资源。
以上命令创建了⼀个名为
hadoop
的虚拟桥接⽹络,该虚拟⽹络内部提供了⾃动的
DNS
解析服务。使⽤
下⾯这个命令查看
Docker
中的⽹络,可以看到刚刚创建的名为
hadoop
的虚拟桥接⽹络。
mike@ubuntu2204:~$ wget -qO- https://get.docker.com/ | sh
mike@ubuntu2204:~$ sudo service docker start
mike@ubuntu2204:~$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
mike@ubuntu2204:~$
mike@ubuntu2204:~$sudo docker network create --driver=bridge hadoop
mike@ubuntu2204:~$ sudo docker network ls
[sudo] password for mike:
NETWORK ID NAME DRIVER SCOPE
3948edc3e8f3 bridge bridge local
337965dd9b1e hadoop bridge local
cb8f2c453adc host host local
fff4bd1c15ee mynet macvlan local
30e1132ad754 none null local
mike@ubuntu2204:~$
查找
ubuntu
容器
打开
https://hub.docker.com/
官⽹,搜索
ubuntu
,找到官⽅认证镜像,这⾥选取第⼀个
点击第⼀个
ubuntu
,查找可选⽤的版本,这⾥选取
22.04
下载
ubuntu 22.04
版本的镜像⽂件
mike@ubuntu2204:~$ sudo docker pull ubuntu:22.04
查看已经下载的镜像
mike@ubuntu2204:~$ sudo docker images
[sudo] password for mike:
REPOSITORY TAG IMAGE ID CREATED SIZE
newuhadoop latest fe08b5527281 3 days ago 2.11GB
ubuntu 22.04 27941809078c 6 weeks ago 77.8MB
mike@ubuntu2204:~$
根据镜像启动⼀个容器,可以看出
shell
已经是容器的
shell
了,这⾥注意
@
后⾯的容器
ID
与上图镜像
ID
⼀致
mike@ubuntu2204:~$ sudo docker run -it ubuntu:22.04 /bin/bash
root@27941809078c:/#
输⼊
exit
可以退出容器,不过建议使⽤
Ctrl + P + Q
,退出容器状态,但仍让容器处于后台运⾏状
态。
mike@ubuntu2204:~$
查看本机上所有的容器
此处会看到刚刚创建好的容器,并在后台运⾏。这⾥因为是后期制作的教程,为了节省内存,只保留了
5
台
hadoop
的容器,最原始的容器已经删除。
启动⼀个状态为退出的容器,最后⼀个参数为容器
ID
进⼊⼀个容器
关闭⼀个容器
2.
安装集群
主要是安装
JDK 1.8
的环境,因为
Spark
要
Scala
,
Scala
要
JDK 1.8
,以及
Hadoop
,以此来构建基础
镜像。
2.1
安装
Java
与
Scala
进⼊之前的
Ubuntu
容器
先更换
apt
的源
2.1.1
修改
apt
源
备份源
先删除就源⽂件,这个时候没有
vim
⼯具
..
mike@ubuntu2204:~$ sudo docker ps -a
[sudo] password for mike:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS
NAMES
8016da5278ae newuhadoop "/bin/bash" 3 days ago Up 2 days
h05
409c7e8aa2e9 newuhadoop "/bin/bash" 3 days ago Up 2 days
h04
0d8af236e1e7 newuhadoop "/bin/bash" 3 days ago Up 2 days
h03
72d62b7d4874 newuhadoop "/bin/bash" 3 days ago Up 2 days
h02
d4d3ca3bbb61 newuhadoop "/bin/bash" 3 days ago Up 2 days 0.0.0.0:8088-
>8088/tcp, :::8088->8088/tcp, 0.0.0.0:9870->9870/tcp, :::9870->9870/tcp h01
mike@ubuntu2204:~$
mike@ubuntu2204:~$ sudo docker start 27941809078c
mike@ubuntu2204:~$ sudo docker attach 27941809078c
mike@ubuntu2204:~$ sudo docker stop 27941809078c
root@27941809078c:/# cp /etc/apt/sources.list /etc/apt/sources_init.list
root@27941809078c:/#
root@27941809078c:/# rm /etc/apt/sources.list
复制以下命令,回⻋,即可⼀键切换到阿⾥云
ubuntu 22.04
镜像:(此时已经是
root
权限,提⽰符为
#
)
再使⽤
apt update
/ apt upgrade
来更新,
update
更列表,
upgrade
更新包
2.1.2
安装
Java
与
Scala
安装
jdk 1.8
,直接输⼊命令
测试⼀下安装结果
接下来安装
scala
测试⼀下安装结果
bash -c "cat << EOF > /etc/apt/sources.list && apt update
deb http://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy main restricted universe
multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe
multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe
multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe
multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe
multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe
multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe
multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe
multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted
universe multiverse
EOF"
root@27941809078c:/# apt update
root@27941809078c:/# apt upgrade
root@27941809078c:/# apt install openjdk-8-jdk
root@27941809078c:/# java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
root@27941809078c:/#
root@27941809078c:/# apt install scala
输⼊
:quit
退出
scala
2.2
安装
Hadoop
在当前容器中将配置配好
导⼊出为镜像
以此镜像为基础创建五个容器,并赋予
hostname
进⼊
h01
容器,启动
Hadoop
2.2.1
安装
Vim
与 ⽹络⼯具包
安装
vim
,⽤来编辑⽂件
安装
net-tools
、
iputils-ping
、
iproute2
⽹络⼯具包,⽬的是为了使⽤
ping
、
ifconfig
、
ip
、
traceroute
等命令
2.2.2
安装
SSH
安装
SSH
,并配置免密登录,由于后⾯的容器之间是由⼀个镜像启动的,就像同⼀个磨具出来的
5
把锁
与钥匙,可以互相开锁。所以在当前容器⾥配置
SSH
⾃⾝免密登录就
OK
了。
安装
SSH
服务器端
安装
SSH
的客⼾端
进⼊当前⽤⼾的⽤⼾根⽬录
⽣成密钥,不⽤输⼊,⼀直回⻋就⾏,⽣成的密钥在当前⽤⼾根⽬录下的
.ssh
⽂件夹中。以
.
开头
的⽂件与⽂件夹
ls
是隐藏的,需要
ls
-
al
才能查看。
将公钥追加到
authorized_keys
⽂件中
root@27941809078c:/# scala
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_312).
Type in expressions for evaluation. Or try :help.
scala>
root@27941809078c:/# apt install vim
root@27941809078c:/# apt install net-tools
root@27941809078c:/# apt install iputils-ping
root@27941809078c:/# apt install iproute2
root@27941809078c:/# apt install openssh-server
root@27941809078c:/# apt install openssh-client
root@27941809078c:/# cd ~
root@27941809078c:~#
root@27941809078c:~# ssh-keygen -t rsa -P ""
root@27941809078c:~# cat .ssh/id_rsa.pub >> .ssh/authorized_keys
root@27941809078c:~#
启动
SSH
服务
root@27941809078c:~# service ssh start
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@27941809078c:~#
免密登录⾃⼰
root@27941809078c:~# ssh 127.0.0.1
Welcome to Ubuntu 22.04 LTS (GNU/Linux 5.15.0-41-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
Last login: Sun Jul 17 08:26:15 2022 from 172.18.0.1
* Starting OpenBSD Secure Shell server sshd
root@27941809078c:~#
修改
.bashrc
⽂件,启动
shell
的时候,⾃动启动
SSH
服务
⽤
vim
打开
.bashrc
⽂件
root@27941809078c:~# vim ~/.bashrc
按⼀下
i
键,使得
vim
进⼊插⼊模式,此时终端的左下⻆会显⽰为
--
INSERT
--
,将光标移动到最后
⾯,添加⼀⾏(
Caps + g
可直接到最后⼀⾏
)
service ssh start
添加完的结果为,只显⽰最后⼏⾏
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi
# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
#if [ -f /etc/bash_completion ] && ! shopt -oq posix; then
# . /etc/bash_completion
#fi
service ssh start
按⼀下
Esc
键,使得
vim
退出插⼊模式
再输⼊英⽂模式下的冒号
:
,此时终端的左下⽅会有⼀个冒号
:
显⽰出来
再输⼊三个字符
wq!
,这是⼀个组合命令
w
是保存的意思
q
是退出的意思
!
是强制的意思
再输⼊回⻋,退出
vim
。
此时,
SSH
免密登录已经完全配置好。
2.2.3
安装
Hadoop
下载
Hadoop
的安装⽂件
解压到
/usr/local
⽬录下⾯并重命名⽂件夹
修改
/etc/profile
⽂件,添加⼀下环境变量到⽂件中
先⽤
vim
打开
/etc/profile
追加以下内容
JAVA_HOME
为
JDK
安装路径,使⽤
apt
安装就是这个,⽤
update
-
alternatives
--
config java
可
查看
root@27941809078c:~# wget https://mirrors.aliyun.com/apache/hadoop/common/hadoop-
3.3.3/hadoop-3.3.3.tar.gz
root@27941809078c:~# tar -zxvf hadoop-3.3.3.tar.gz -C /usr/local/
root@27941809078c:~# cd /usr/local/
root@27941809078c:/usr/local# mv hadoop-3.3.3 hadoop
root@27941809078c:/usr/local#
vim /etc/profile
#java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
#hadoop
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HDFS_DATANODE_USER=root
export HDFS_DATANODE_SECURE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_NAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
使环境变量⽣效
root@27941809078c:/usr/local# source /etc/profile
root@27941809078c:/usr/local#
在⽬录
/usr/local/hadoop/etc/hadoop
下,修改
6
个重要配置⽂件
修改
hadoop-env.sh
⽂件,在⽂件末尾添加⼀下信息
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
修改
core-site.xml
,修改为
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://h01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop3/hadoop/tmp</value>
</property>
</configuration>
修改
hdfs-site.xml
,修改为
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop3/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.namenode.data.dir</name>
<value>/home/hadoop3/hadoop/hdfs/data</value>
</property>
</configuration>
修改
mapred-site.xml
,修改为
修改
yarn-site.xml
,修改为
修改
worker
为
此时,
hadoop
已经配置好了
2.2.4
在
Docker
中启动集群
先将当前容器导出为镜像,并查看当前镜像。使⽤
ctrl + p + q
,退出容器,回到宿主机
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>
/usr/local/hadoop/etc/hadoop,
/usr/local/hadoop/share/hadoop/common/*,
/usr/local/hadoop/share/hadoop/common/lib/*,
/usr/local/hadoop/share/hadoop/hdfs/*,
/usr/local/hadoop/share/hadoop/hdfs/lib/*,
/usr/local/hadoop/share/hadoop/mapreduce/*,
/usr/local/hadoop/share/hadoop/mapreduce/lib/*,
/usr/local/hadoop/share/hadoop/yarn/*,
/usr/local/hadoop/share/hadoop/yarn/lib/*
</value>
</property>
</configuration>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>h01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
h01
h02
h03
h04
h05
mike@ubuntu2204:~$ sudo docker commit -m "hadoop" -a "hadoop" 27941809078c
newuhadoop
sha256:648d8e082a231919faeaa14e09f5ce369b20879544576c03ef94074daf978823
mike@ubuntu2204:~$ sudo docker images
[sudo] password for mike:
REPOSITORY TAG IMAGE ID CREATED SIZE
newuhadoop latest fe08b5527281 4 days ago 2.11GB
ubuntu 22.04 27941809078c 6 weeks ago 77.8MB
mike@ubuntu2204:~$
启动
5
个终端,分别执⾏这⼏个命令
第⼀条命令启动的是
h01
是做
master
节点的,所以暴露了端⼝,以供访问
web
⻚⾯
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h01" --name "h01" -p
9870:9870 -p 8088:8088 newuhadoop /bin/bash
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h01:/#
其余的四条命令就是⼏乎⼀样的了,注意:启动容器后,使⽤
ctrl + p + q
退回到宿主机,之后再启动下
⼀个容器
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h02" --name "h02"
newuhadoop /bin/bash
[sudo] password for mike:
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h02:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h03" --name "h03"
newuhadoop /bin/bash
[sudo] password for mike:
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h03:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h04" --name "h04"
newuhadoop /bin/bash
[sudo] password for mike:
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h04:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h05" --name "h05"
newuhadoop /bin/bash
[sudo] password for mike:
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h05:/#
接下来,在
h01
主机中,启动
Haddop
集群
先进⾏格式化操作,不格式化操作,
hdfs
会起不来
root@h01:/usr/local/hadoop/bin# ./hadoop namenode -format
进⼊
hadoop
的
sbin
⽬录
root@h01:/# cd /usr/local/hadoop/sbin/
root@h01:/usr/local/hadoop/sbin#
启动
hadoop
root@h01:/usr/local/hadoop/sbin# ./start-all.sh
Starting namenodes on [h01]
h01: Warning: Permanently added 'h01,172.18.0.2' (ECDSA) to the list of known
hosts.
Starting datanodes
h05: Warning: Permanently added 'h05,172.18.0.6' (ECDSA) to the list of known
hosts.
h02: Warning: Permanently added 'h02,172.18.0.3' (ECDSA) to the list of known
hosts.
h03: Warning: Permanently added 'h03,172.18.0.4' (ECDSA) to the list of known
hosts.
h04: Warning: Permanently added 'h04,172.18.0.5' (ECDSA) to the list of known
hosts.
h03: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h05: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h02: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h04: WARNING: /usr/local/hadoop/logs does not exist. Creating.
Starting secondary namenodes [h01]
Starting resourcemanager
Starting nodemanagers
root@h01:/usr/local/hadoop/sbin#
使⽤
jps
查看集群启动状态 (这个状态不是固定不变的,随着应⽤不同⽽不同,但⾄少应该有
3
个)
root@h01:~# jps
10017 HRegionServer
10609 Master
9778 HQuorumPeer
8245 SecondaryNameNode
8087 DataNode
9881 HMaster
41081 Jps
10684 Worker
7965 NameNode
8477 ResourceManager
8591 NodeManager
root@h01:~#
使⽤命令
./hdfs dfsadmin
-
report
可查看分布式⽂件系统的状态
root@h01:/usr/local/hadoop/bin# ./hdfs dfsadmin -report
Configured Capacity: 90810798080 (84.57 GB)
Present Capacity: 24106247929 (22.45 GB)
DFS Remaining: 24097781497 (22.44 GB)
DFS Used: 8466432 (8.07 MB)
DFS Used%: 0.04%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (5):
Name: 172.18.0.2:9866 (h01)
Hostname: h01
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 2875392 (2.74 MB)
Non DFS Used: 11887669248 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.02%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Tue Jul 19 23:36:54 GMT 2022
Num of Blocks: 293
Name: 172.18.0.3:9866 (h02.hadoop)
Hostname: h02
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1396736 (1.33 MB)
Non DFS Used: 11889147904 (11.07 GB)
DFS Remaining: 4846399828 (4.51 GB)
DFS Used%: 0.01%
DFS Remaining%: 26.68%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 8
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Tue Jul 19 23:51:39 GMT 2022
Num of Blocks: 153
Name: 172.18.0.4:9866 (h03.hadoop)
Hostname: h03
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1323008 (1.26 MB)
Non DFS Used: 11889221632 (11.07 GB)
DFS Remaining: 5114835114 (4.76 GB)
DFS Used%: 0.01%
DFS Remaining%: 28.16%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 4
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 02:14:39 GMT 2022
Num of Blocks: 151
Name: 172.18.0.5:9866 (h04.hadoop)
Hostname: h04
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1527808 (1.46 MB)
Non DFS Used: 11889016832 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.01%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 00:42:09 GMT 2022
Num of Blocks: 134
Name: 172.18.0.6:9866 (h05.hadoop)
Hostname: h05
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1343488 (1.28 MB)
Non DFS Used: 11889201152 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.01%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
访问宿主机的
8088
与
9870
端⼝就可以看到监控信息了
⾄此,
Hadoop
集群已经构建好了
2.2.5
运⾏内置
WordCount
例⼦
把
license
作为需要统计的⽂件
在
HDFS
中创建
input
⽂件夹
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 02:36:21 GMT 2022
Num of Blocks: 149
root@h01:/usr/local/hadoop/bin#
root@h01:/usr/local/hadoop# cat LICENSE.txt > file1.txt
root@h01:/usr/local/hadoop# ls
root@h01:/usr/local/hadoop/bin# ./hadoop fs -mkdir /input
root@h01:/usr/local/hadoop/bin#
上传
file1.txt
⽂件到
HDFS
中
root@h01:/usr/local/hadoop/bin# ./hadoop fs -put ../file1.txt /input
root@h01:/usr/local/hadoop/bin#
查看
HDFS
中
input
⽂件夹⾥的内容
root@h01:/usr/local/hadoop/bin# ./hadoop fs -ls /input
Found 1 items
-rw-r--r-- 2 root supergroup 15217 2022-07-17 08:50 /input/file1.txt
root@h01:/usr/local/hadoop/bin#
运⾏
wordcount
例⼦程序
root@h01:/usr/local/hadoop/bin# ./hadoop jar ../share/hadoop/mapreduce/hadoop
mapreduce-examples-3.3.3.jar wordcount /input /output
输出如下:
root@h01:/usr/local/hadoop/bin# ./hadoop jar ../share/hadoop/mapreduce/hadoop
mapreduce-examples-3.3.3.jar wordcount /input /output
2022-07-20 05:12:38,394 INFO client.DefaultNoHARMFailoverProxyProvider:
Connecting to ResourceManager at h01/172.18.0.2:8032
2022-07-20 05:12:38,816 INFO mapreduce.JobResourceUploader: Disabling Erasure
Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1658047711391_0002
2022-07-20 05:12:39,076 INFO input.FileInputFormat: Total input files to process
: 1
2022-07-20 05:12:39,198 INFO mapreduce.JobSubmitter: number of splits:1
2022-07-20 05:12:39,399 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1658047711391_0002
2022-07-20 05:12:39,399 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-07-20 05:12:39,674 INFO conf.Configuration: resource-types.xml not found
2022-07-20 05:12:39,674 INFO resource.ResourceUtils: Unable to find 'resource
types.xml'.
2022-07-20 05:12:39,836 INFO impl.YarnClientImpl: Submitted application
application_1658047711391_0002
2022-07-20 05:12:39,880 INFO mapreduce.Job: The url to track the job:
http://h01:8088/proxy/application_1658047711391_0002/
2022-07-20 05:12:39,882 INFO mapreduce.Job: Running job: job_1658047711391_0002
2022-07-20 05:12:49,171 INFO mapreduce.Job: Job job_1658047711391_0002 running in
uber mode : false
2022-07-20 05:12:49,174 INFO mapreduce.Job: map 0% reduce 0%
2022-07-20 05:12:54,285 INFO mapreduce.Job: map 100% reduce 0%
2022-07-20 05:13:01,356 INFO mapreduce.Job: map 100% reduce 100%
2022-07-20 05:13:02,391 INFO mapreduce.Job: Job job_1658047711391_0002 completed
successfully
2022-07-20 05:13:02,524 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=12507
FILE: Number of bytes written=577413
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=15313
HDFS: Number of bytes written=9894
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3141
Total time spent by all reduces in occupied slots (ms)=3811
Total time spent by all map tasks (ms)=3141
Total time spent by all reduce tasks (ms)=3811
Total vcore-milliseconds taken by all map tasks=3141
Total vcore-milliseconds taken by all reduce tasks=3811
Total megabyte-milliseconds taken by all map tasks=3216384
Total megabyte-milliseconds taken by all reduce tasks=3902464
Map-Reduce Framework
Map input records=270
Map output records=1672
Map output bytes=20756
Map output materialized bytes=12507
Input split bytes=96
Combine input records=1672
Combine output records=657
Reduce input groups=657
Reduce shuffle bytes=12507
Reduce input records=657
Reduce output records=657
Spilled Records=1314
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=126
CPU time spent (ms)=1110
Physical memory (bytes) snapshot=474148864
Virtual memory (bytes) snapshot=5063700480
Total committed heap usage (bytes)=450887680
Peak Map Physical memory (bytes)=288309248
Peak Map Virtual memory (bytes)=2528395264
Peak Reduce Physical memory (bytes)=185839616
Peak Reduce Virtual memory (bytes)=2535305216
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=15217
File Output Format Counters
Bytes Written=9894
root@h01:/usr/local/hadoop/bin#
查看
HDFS
中的
/output
⽂件夹的内容
查看
part
-
r
-
00000
⽂件的内容
⾄此,
hadoop
部分已经结束
2.3
安装
Hbase
在
Hadoop
集群的基础上安装
Hbase
下载
Hbase 3.0.0
解压到
/usr/local
⽬录下⾯
修改
/etc/profile
环境变量⽂件,添加
Hbase
的环境变量,追加下述代码
使环境变量配置⽂件⽣效
使⽤
ssh h02
可进⼊
h02
容器,修改
profile
⽂件如上。依次修改
h03
、
h04
、
h05
即是每个容器都要在
/etc/profile
⽂件后追加那两⾏环境变量
在⽬录
/usr/local/hbase
-
3.0.0/conf
修改配置
修改
hbase-env.sh
,追加
修改
hbase-site.xml
为
root@h01:/usr/local/hadoop/bin# ./hadoop fs -ls /output
Found 2 items
-rw-r--r-- 2 root supergroup 0 2022-07-20 05:13 /output/_SUCCESS
-rw-r--r-- 2 root supergroup 9894 2022-07-20 05:13 /output/part-r-00000
root@h01:/usr/local/hadoop/bin#
root@h01:/usr/local/hadoop/bin# ./hadoop fs -cat /output/part-r-00000
root@h01:~# wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/3.0.0-alpha-
3/hbase-3.0.0-alpha-3-bin.tar.gz
root@h01:~# tar -zxvf hbase-3.0.0-bin.tar.gz -C /usr/local/
export HBASE_HOME=/usr/local/hbase-3.0.0
export PATH=$PATH:$HBASE_HOME/bin
root@h01:/usr/local# source /etc/profile
root@h01:/usr/local#
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_MANAGES_ZK=true
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://h01:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.master</name>
<value>h01:60000</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>h01,h02,h03,h04,h05</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/zoodata</value>
</property>
</configuration>
修改
regionservers
⽂件为
h01
h02
h03
h04
h05
使⽤
scp
命令将配置好的
Hbase
复制到其他
4
个容器中
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h02:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h03:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h04:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h05:/usr/local/
启动
Hbase
root@h01:/usr/local/hbase-3.0.0/bin# ./start-hbase.sh
h04: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root
zookeeper-h04.out
h02: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root
zookeeper-h02.out
h03: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root
zookeeper-h03.out
h05: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root
zookeeper-h05.out
h01: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root
zookeeper-h01.out
running master, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase--master
h01.out
h05: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase
root-regionserver-h05.out
h01: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase
root-regionserver-h01.out
h04: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase
root-regionserver-h04.out
h03: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase
root-regionserver-h03.out
h02: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase
root-regionserver-h02.out
root@h01:/usr/local/hbase-3.0.0/bin#
打开
Hbase
的
shell
root@h01:/usr/local/hbase-3.0.0/bin# ./hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/book.html#shell
Version 3.0.0-alpha-3, rb3657484850f9fa9679f2186bf53e7df768f21c7, Wed Jun 15
07:56:54 UTC 2022
Took 0.0017 seconds
hbase:001:0>
hbase
测试
创建表
member
hbase:006:0> create 'member','id','address','info'
Created table member
Took 0.6838 seconds
=> Hbase::Table - member
hbase:007:0>
添加数据,并查看表中数据
hbase:007:0> put 'member', 'debugo','id','11'
Took 0.1258 seconds
hbase:008:0> put 'member', 'debugo','info:age','27'
2.4
安装
Spark
在
Hadoop
的基础上安装
Spark
下载
Spark 3.3.0
解压到
/usr/local
⽬录下⾯
修改⽂件夹的名字
修改
/etc/profile
环境变量⽂件,添加
Hbase
的环境变量,追加下述代码
使环境变量配置⽂件⽣效
使⽤
ssh h02
可进⼊其他四个容器,依次修改。
即是每个容器都要在
/etc/profile
⽂件后追加那两⾏环境变量
在⽬录
/usr/local/spark
-
3.3.0/conf
修改配置
修改⽂件名
Took 0.0108 seconds
hbase:009:0> count 'member'
1 row(s)
Took 0.0499 seconds
=> 1
hbase:010:0> scan 'member'
ROW COLUMN+CELL
debugo column=id:, timestamp=2022-07-
20T05:37:58.720, value=11
debugo column=info:age, timestamp=2022-07-
20T05:38:11.302, value=27
1 row(s)
Took 0.0384 seconds
hbase:011:0>
root@h01:~# wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-
3.3.0/spark-3.3.0-bin-hadoop3.tgz
root@h01:~# tar -zxvf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/
root@h01:~# cd /usr/local/
root@h01:/usr/local# mv spark-3.3.0-bin-hadoop3 spark-3.3.0
export SPARK_HOME=/usr/local/spark-3.3.0
export PATH=$PATH:$SPARK_HOME/bin
root@h01:/usr/local# source /etc/profile
root@h01:/usr/local#
修改
spark-env.sh
,追加
修改⽂件名
修改
slaves
如下
使⽤
scp
命令将配置好的
Hbase
复制到其他
4
个容器中
启动
Spark
3
其他
root@h01:/usr/local/spark-3.3.0/conf# mv spark-env.sh.template spark-env.sh
root@h01:/usr/local/spark-3.3.0/conf#
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SCALA_HOME=/usr/share/scala
export SPARK_MASTER_HOST=h01
export SPARK_MASTER_IP=h01
export SPARK_WORKER_MEMORY=4g
root@h01:/usr/local/spark-3.3.0/conf# mv slaves.template slaves
root@h01:/usr/local/spark-3.3.0/conf#
h01
h02
h03
h04
h05
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h02:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h03:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h04:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h05:/usr/local/
root@h01:/usr/local/spark-3.3.0/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-
3.3.0/logs/spark--org.apache.spark.deploy.master.Master-1-h01.out
h03: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-
3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h03.out
h02: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-
3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h02.out
h04: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-
3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h04.out
h05: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-
3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h05.out
h01: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-
3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h01.out
root@h01:/usr/local/spark-3.3.0/sbin#
3.1 HDFS
重格式化问题
参考
https://blog.csdn.net/gis_101/article/details/52821946
重新格式化意味着集群的数据会被全部删除,格式化前需考虑数据备份或转移问题;
先删除主节点(即
namenode
节点),
Hadoop
的临时存储⽬录
tmp
、
namenode
存储永久性元数
据⽬录
dfs/name
、
Hadoop
系统⽇志⽂件⽬录
log
中的内容 (注意是删除⽬录下的内容不是⽬
录);
删除所有数据节点
(
即
datanode
节点
)
,
Hadoop
的临时存储⽬录
tmp
、
namenode
存储永久性元数
据⽬录
dfs/name
、
Hadoop
系统⽇志⽂件⽬录
log
中的内容;
格式化⼀个新的分布式⽂件系统:
注意事项
:
Hadoop
的临时存储⽬录
tmp
(即
core-site.xml
配置⽂件中的
hadoop.tmp.dir
属性,默认值
是
/tmp/hadoop-${user.name}
),如果没有配置
hadoop.tmp.dir
属性,那么
hadoop
格式化时将
会在
/tmp
⽬录下创建⼀个⽬录,例如在
cloud
⽤⼾下安装配置
hadoop
,那么
Hadoop
的临时存储⽬
录就位于
/tmp/hadoop-cloud
⽬录下
Hadoop
的
namenode
元数据⽬录(即
hdfs-site.xml
配置⽂件中的
dfs.namenode.name.dir
属性,
默认值是
${hadoop.tmp.dir}/dfs/name
),同样如果没有配置该属性,那么
hadoop
在格式化时将
⾃⾏创建。必须注意的是在格式化前必须清楚所有⼦节点(即
DataNode
节点)
dfs/name
下的内
容,否则在启动
hadoop
时⼦节点的守护进程会启动失败。这是由于,每⼀次
format
主节点
namenode
,
dfs/name/current
⽬录下的
VERSION
⽂件会产⽣新的
clusterID
、
namespaceID
。但
是如果⼦节点的
dfs/name/current
仍存在,
hadoop
格式化时就不会重建该⽬录,因此形成⼦节点
的
clusterID
、
namespaceID
与主节点(即
namenode
节点)的
clusterID
、
namespaceID
不⼀致。
最终导致
hadoop
启动失败。
root@h01:/usr/local/hadoop/bin# ./hadoop namenode -format
文章来源地址https://www.uudwc.com/A/rZAkW/
文章来源:https://www.uudwc.com/A/rZAkW/