Merging small files in Hadoop: how does HDFS merge multiple files it reads into one whole file?

『壹』 How does Hadoop's HDFS merge multiple files it reads into one whole file, and how does it split one file into several?

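The pieces are linked to the file they belong to, and the master node (the NameNode) keeps a record of them, so a read locates all of the pieces.
Splitting is done purely by the default block size, with no regard for the structure of the data; even a single line of data can be cut in two.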
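To illustrate the point about splitting, here is a minimal sketch in plain Python (not HDFS code). It cuts a byte string into fixed-size blocks and joins them back together. The function names and the tiny 8-byte block size are made up for the example.

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def merge_blocks(blocks: List[bytes]) -> bytes:
    """Reassemble the original data by concatenating the blocks in order."""
    return b"".join(blocks)


data = b"first line\nsecond line\nthird line\n"
# Hypothetical 8-byte block size, chosen only so the effect is visible.
blocks = split_into_blocks(data, block_size=8)
```

Here the first block ends in the middle of the first line, yet concatenating the blocks in order gives back the original bytes unchanged.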

『贰』 Hive optimization: merging small files

Too many files put pressure on HDFS and hurt processing efficiency. Merging the result files of Map and Reduce removes this effect:

-- Merge small files at the end of a map-only job
set hive.merge.mapfiles=true;

-- When true, merge small files at the end of a MapReduce job
set hive.merge.mapredfiles=false;

-- Target size of the merged files, in bytes
set hive.merge.size.per.task=256000000;

-- Maximum split size for each map, in bytes
set mapred.max.split.size=256000000;

-- Minimum split size on a single node, in bytes
set mapred.min.split.size.per.node=1;

-- Merge small files before the map phase runs
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
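The idea behind combining small files into larger map inputs can be sketched in a few lines of Python. This is only a simplified greedy illustration of packing files into splits under a maximum split size. It is not Hive's actual CombineHiveInputFormat logic, and the function name is hypothetical.

```python
from typing import List


def pack_into_splits(file_sizes: List[int], max_split_size: int) -> List[List[int]]:
    """Greedily group file sizes into splits no larger than max_split_size.

    A single file larger than the limit still gets a split of its own.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        # Close the current split if adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


MB = 1000 * 1000
small_files = [10 * MB] * 100          # one hundred 10 MB files
splits = pack_into_splits(small_files, 256 * MB)
```

With a 256 MB limit, the hundred 10 MB files end up in 4 splits of 25 files each, so 4 maps would be needed instead of 100.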

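『叁』 How Hive handles large numbers of small files

Where small files come from:
1. Inserting data with dynamic partitions produces a large number of small files, which in turn makes the number of maps surge
2. The data source itself contains a large number of small files
3. The more reduces there are, the more small files are generated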

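Why small files are a problem:
1 From Hive's point of view, the more small files there are, the more maps there are. Every map starts its own JVM, and every JVM has to create and run a task. All of this wastes a great deal of resources and seriously hurts performance
2 In HDFS, each small file takes up about 150 bytes of NameNode memory, so too many small files consume a large amount of memory. NameNode memory capacity then severely limits how far the cluster can grow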
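As a rough worked example of the memory cost, the sketch below multiplies a file count by the figure of about 150 bytes per small file quoted above. The constant comes from the text. The function name is made up, and real NameNode usage depends on the version and on how many blocks each file has.

```python
BYTES_PER_SMALL_FILE = 150  # approximate figure quoted in the text above


def namenode_bytes(num_files: int, bytes_per_file: int = BYTES_PER_SMALL_FILE) -> int:
    """Rough estimate of NameNode memory used by file metadata, in bytes."""
    return num_files * bytes_per_file


ten_million = namenode_bytes(10_000_000)
one_billion = namenode_bytes(1_000_000_000)
```

Under this estimate, ten million small files cost about 1.5 GB of NameNode memory and one billion cost about 150 GB, regardless of how little data the files actually hold.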

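How to deal with them:
4.1 Use Hadoop archive (HAR) to archive the small files
4.2 Rebuild the table, reducing the number of reduces when building it
4.3 Tune parameters to set the number of maps/reduces
4.3.1 Set the parameters for merging small files at map input: mapred.max.split.size, mapred.min.split.size.per.node, and hive.input.format, as listed in section 『贰』

4.3.2 Set the parameters for merging map output and reduce output: hive.merge.mapfiles, hive.merge.mapredfiles, and hive.merge.size.per.task, as listed in section 『贰』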

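『肆』 How to add nodes to a Hadoop system

How to add a node to Hadoop
The steps I actually followed to add a node:
1. Set up the environment on the new slave first, including ssh, the JDK, and copies of the relevant config, lib, bin, etc.;
2. Add the new datanode's host to the cluster's namenode and the other datanodes;
3. Add the new datanode's IP to conf/slaves on the master;
4. Restart the cluster and check that the new datanode shows up in it;
5. Run bin/start-balancer.sh; this can take a long time
Notes:
1. Without balancing, the cluster will put all new data on the new node, which lowers MR efficiency;
2. bin/start-balancer.sh can be run as is or with a parameter, e.g. -threshold 5
threshold is the balancing threshold; the default is 10%. The lower the value, the more evenly balanced the nodes, but the longer it takes.
3. The balancer can also run on a cluster that has MR jobs running. The default dfs.balance.bandwidthPerSec is very low, 1 MB/s. When no MR jobs are running, raising this setting speeds up balancing.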
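What the threshold means can be sketched as follows. This assumes the usual definition: a node counts as balanced when its utilization differs from the overall cluster utilization by no more than the threshold, in percentage points. The function names and node figures are made up. This is an illustration, not the balancer's code.

```python
from typing import Dict, List, Tuple

# Each node maps to (used space, total capacity), in any consistent unit.
Nodes = Dict[str, Tuple[float, float]]


def cluster_utilization(nodes: Nodes) -> float:
    """Overall utilization of the cluster, in percent."""
    used = sum(u for u, _ in nodes.values())
    capacity = sum(c for _, c in nodes.values())
    return 100.0 * used / capacity


def unbalanced_nodes(nodes: Nodes, threshold_pct: float = 10.0) -> List[str]:
    """Names of nodes whose utilization is more than threshold_pct away
    from the cluster-wide utilization."""
    target = cluster_utilization(nodes)
    return sorted(
        name
        for name, (used, capacity) in nodes.items()
        if abs(100.0 * used / capacity - target) > threshold_pct
    )


# Three old nodes at 80% plus one freshly added, empty node.
after_adding_node = {"dn1": (80, 100), "dn2": (80, 100), "dn3": (80, 100), "dn4": (0, 100)}
# A later state in which every node is within 6 points of the 60% average.
partly_balanced = {"dn1": (66, 100), "dn2": (66, 100), "dn3": (54, 100), "dn4": (54, 100)}
```

The second state already passes the default threshold of 10 but not a threshold of 5, which is why a lower threshold means more data movement and a longer run.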

Other notes:
1. Make sure the firewall on the slave is turned off;
2. Make sure the new slave's IP has been added to /etc/hosts on the master and the other slaves, and conversely that the IPs of the master and the other slaves have been added to /etc/hosts on the new slave

Number of mappers and reducers
URL: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
HowManyMapsAndReduces
Partitioning your job into maps and reduces
Picking the appropriate size for the tasks for your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but increases load balancing and lowers the cost of failures. At one extreme is the 1 map/1 reduce case where nothing is distributed. The other extreme is to have 1,000,000 maps/ 1,000,000 reduces where the framework runs out of resources for the overhead.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files, although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
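The rules above can be turned into a small calculation. This is a simplified model that assumes a single large input file and ignores per-file boundaries. The function and parameter names are made up for illustration and are not Hadoop API. It treats the block size as an upper bound, the minimum split size as a lower bound, and the mapred.map.tasks hint as a goal.

```python
TB = 1024 ** 4
MB = 1024 ** 2


def split_size(total_bytes: int, block_size: int,
               map_tasks_hint: int = 1, min_split_size: int = 1) -> int:
    """Split size implied by the hint, capped by the block size and
    floored by the minimum split size."""
    goal_size = total_bytes // max(map_tasks_hint, 1)
    return max(min_split_size, min(goal_size, block_size))


def estimated_maps(total_bytes: int, block_size: int,
                   map_tasks_hint: int = 1, min_split_size: int = 1) -> int:
    """Approximate number of map tasks: input size divided by split size, rounded up."""
    size = split_size(total_bytes, block_size, map_tasks_hint, min_split_size)
    return -(-total_bytes // size)  # ceiling division


default_case = estimated_maps(10 * TB, 128 * MB)
bigger_min_split = estimated_maps(10 * TB, 128 * MB, min_split_size=256 * MB)
bigger_hint = estimated_maps(10 * TB, 128 * MB, map_tasks_hint=200_000)
```

The default case gives 81,920 maps, which is the "82k" figure in the text. Raising the minimum split size to 256 MB halves that, and a hint larger than 82k pushes the count up, matching the statement that the hint can raise but not lower the number of maps.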
Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
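The rule of thumb for reduces is easy to work through. The helper below is a hypothetical name, and the 20-node cluster with 2 reduce slots per node is an assumed example, not a figure from the text.

```python
def suggested_reduces(nodes: int, reduce_slots_per_node: int, factor: float = 0.95) -> int:
    """Rule-of-thumb reduce count: factor * (nodes * reduce slots per node)."""
    return round(factor * nodes * reduce_slots_per_node)


# Assumed example cluster: 20 nodes with 2 reduce slots each (40 slots in total).
single_wave = suggested_reduces(20, 2, factor=0.95)
two_waves = suggested_reduces(20, 2, factor=1.75)
```

For this cluster the 0.95 factor gives 38 reduces, which all fit in the 40 slots at once. The 1.75 factor gives 70, so the faster nodes pick up a second round.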
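My own understanding:
Setting the number of mappers: it depends on the input files and on the file splits. The upper limit of a file split is dfs.block.size, the lower limit can be set through mapred.min.split.size, and in the end the InputFormat decides.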

A good recommendation:
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<description>The maximum number of reduce tasks that will be run
simultaneously by a task tracker.
</description>
</property>

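Adding a new disk to a single node
1. On the node that gets the new disk, edit dfs.data.dir, listing the new and old data directories separated by commas
2. Restart DFS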

Syncing Hadoop code
hadoop-env.sh
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

Merging small HDFS files with a command (the merged file is written to the local file system)
hadoop fs -getmerge <src> <dest>

Restarting reduce jobs
Introduced recovery of jobs when JobTracker restarts. This facility is off by default.
Introduced config parameters "mapred.jobtracker.restart.recover", "mapred.jobtracker.job.history.block.size", and "mapred.jobtracker.job.history.buffer.size".
Not yet verified.

Problems with I/O write operations
0-1246359584298, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5911099437886836280_1292 to /172.16.100.165:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.16.100.165:50010 remote=/172.16.100.165:50930]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
at java.lang.Thread.run(Thread.java:619)

It seems there are many reasons that it can time out; the example given in
HADOOP-3831 is a slow reading client.

Workaround: try setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml;
My understanding is that this issue should be fixed in Hadoop 0.19.1 so that
we should leave the standard timeout. However until then this can help
resolve issues like the one you're seeing.

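How to decommission nodes in HDFS
The help text of dfsadmin in the current version does not explain this clearly, and a bug has already been filed. The correct procedure is:
1. Point dfs.hosts at the current slaves file, using its full path. Note that the host names in the list must be the full names, i.e. what uname -n returns.
2. Put the full names of the nodes to be decommissioned in another file, e.g. slaves.ex, and point the dfs.hosts.exclude parameter at the full path of that file
3. Run bin/hadoop dfsadmin -refreshNodes
4. In the web UI or in the output of bin/hadoop dfsadmin -report, the decommissioning nodes show the state Decommission in progress until all the data that needs to be re-replicated has been copied
5. Once that is done, remove the decommissioned nodes from slaves (that is, the file dfs.hosts points to)

Incidentally, the -refreshNodes command has three other uses:
2. Adding allowed nodes to the list (add the host names to dfs.hosts)
3. Removing nodes directly, without re-replicating their data (remove the host names from dfs.hosts)
4. The reverse of decommissioning: stopping the decommissioning of nodes that are listed in both the exclude file and dfs.hosts and are currently being decommissioned, i.e. turning Decommission in progress nodes back to Normal (called in service in the web UI)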
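The way the two host lists interact on -refreshNodes can be summarized in a small decision function. This is only my reading of the cases listed above, written as plain Python. The function name, status strings, and host names are made up, and the real NameNode logic has more cases (an empty dfs.hosts, for example).

```python
from typing import Set


def node_status(host: str, include: Set[str], exclude: Set[str]) -> str:
    """Status of a node after -refreshNodes, given the hosts listed in the
    dfs.hosts file (include) and the dfs.hosts.exclude file (exclude)."""
    if host not in include:
        # Dropped from dfs.hosts: removed directly, no re-replication first.
        return "removed"
    if host in exclude:
        # Listed in both files: data is re-replicated before the node leaves.
        return "decommission in progress"
    return "in service"


include = {"dn1.example", "dn2.example", "dn3.example"}  # hypothetical host names
exclude = {"dn3.example"}
```

With these lists, dn3.example is decommissioned gracefully, a host missing from the include list is dropped outright, and taking dn3.example back out of the exclude list returns it to service.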

『伍』 How to merge multiple input files into one file in Hadoop

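When using Hadoop, we sometimes need to merge several files on the local file system into a single file on the Hadoop file system. The Hadoop file system shell has no command for exactly this, but its put command can read data from standard input, so the following command does the job: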

cat localfile1 localfile2 | bin/hadoop fs -put /dev/fd/0 destfile
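The cat half of that pipeline can also be written as a short Python helper when the list of local files is built programmatically. This is a minimal sketch with a made-up function name. It only concatenates local files into one binary stream, and the upload itself is still left to the hadoop command.

```python
import shutil
from typing import BinaryIO, Iterable


def concat_files(paths: Iterable[str], out: BinaryIO) -> None:
    """Append the contents of each local file, in order, to the output stream."""
    for path in paths:
        with open(path, "rb") as src:
            # Stream in chunks so large files are never loaded fully into memory.
            shutil.copyfileobj(src, out)
```

Calling concat_files(paths, sys.stdout.buffer) in a script and piping that script's output into the put command shown above would give the same result as cat.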

『陆』 Hadoop: merging of fsimage and edits in the HDFS namenode

This does not affect reading files. When serving a read, the NN looks up file information using both fsimage and edits
