Merging small files in Hadoop: how HDFS combines the multiple files it reads into one whole file

『1』 How does Hadoop's HDFS combine the multiple pieces it reads into one whole file, and how does it split one file into several?

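The pieces of a file have related names, and the master node (the NameNode) keeps a record of them, so a read locates all of the pieces.
Splitting is done purely by the default block size, with no regard for any structure in the data: even a single line of data can be cut in two.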
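
To make the size-only splitting concrete, here is a minimal Python sketch. It is only an illustration of cutting bytes at fixed offsets and joining them back, not HDFS code; `split_into_blocks`, `join_blocks`, and the 16-byte block size are made up for the example.

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive chunks of at most block_size bytes.

    The cut points depend only on byte offsets, never on line boundaries.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def join_blocks(blocks: List[bytes]) -> bytes:
    """Reassemble the original data by concatenating the blocks in order."""
    return b"".join(blocks)


lines = b"first line\nsecond line\nthird line\n"
blocks = split_into_blocks(lines, block_size=16)  # hypothetical tiny block size
# The first block ends in the middle of "second line".
restored = join_blocks(blocks)
```

Reading the blocks back in order gives the original bytes, which is why a line cut across two blocks is not lost.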

『2』 Hive optimization: merging small files

Too many files put pressure on HDFS and hurt processing efficiency. Merging the result files of Map and Reduce removes this effect:

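-- merge small files at the end of a map-only job
set hive.merge.mapfiles = true;

-- when true, merge small files at the end of a MapReduce job
set hive.merge.mapredfiles = false;

-- target size of the merged files, in bytes (a literal value, not an expression)
set hive.merge.size.per.task = 256000000;

-- maximum split size for each Map, in bytes
set mapred.max.split.size = 256000000;

-- minimum split size on a single node, in bytes
set mapred.min.split.size.per.node = 1;

-- combine small files before the Map phase runs
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;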
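
The effect of combining small inputs under a maximum split size can be pictured with a toy packing routine. This is a simplified illustration only, not the algorithm CombineHiveInputFormat actually uses (the real one also has to consider where the blocks are stored); `pack_into_splits` and the file sizes are hypothetical.

```python
from typing import List


def pack_into_splits(file_sizes: List[int], max_split_size: int) -> List[List[int]]:
    """Greedily group file sizes into splits of at most max_split_size bytes.

    A single file larger than max_split_size ends up alone in its own split.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        # Start a new split when adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Ten hypothetical 64,000,000-byte files, 256,000,000-byte maximum split size.
small_files = [64_000_000] * 10
splits = pack_into_splits(small_files, max_split_size=256_000_000)
files_per_split = [len(s) for s in splits]  # 3 maps instead of 10
```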

『3』 How Hive handles large numbers of small files

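Where small files come from:
1. Inserting data with dynamic partitions produces a large number of small files, which makes the number of maps surge.
2. The data source itself contains a large number of small files.
3. The more reduces there are, the more small files are generated.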

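Why small files hurt:
1. From Hive's point of view, more small files mean more maps. Each map starts its own JVM, and each JVM has to create and run a task; all of this wastes a lot of resources and seriously hurts performance.
2. In HDFS, each small file takes about 150 bytes of NameNode memory, so too many small files use up a large amount of memory. NameNode memory capacity then becomes a hard limit on how far the cluster can grow.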
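
A quick back-of-the-envelope calculation shows how the 150-byte figure adds up. This sketch uses only the number quoted above; the function name and the file count of ten million are hypothetical.

```python
BYTES_PER_FILE = 150  # per-file NameNode memory cost quoted in the text


def namenode_memory_bytes(num_files: int, bytes_per_file: int = BYTES_PER_FILE) -> int:
    """Rough NameNode memory needed to track num_files small files."""
    return num_files * bytes_per_file


estimate = namenode_memory_bytes(10_000_000)  # hypothetical file count
print(f"{estimate / 1e9:.1f} GB")  # prints "1.5 GB"
```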

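Solutions:
4.1 Archive the small files with Hadoop archive (HAR).
4.2 Rebuild the table, reducing the number of reduces when creating it.
4.3 Tune parameters to set the number of maps and reduces.
4.3.1 Set the parameters that merge small files on map input.

4.3.2 Set the parameters that merge the map output and the reduce output.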

『4』 How are nodes added to a Hadoop system?

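How to add nodes to Hadoop
The procedure I actually followed to add a node:
1. Set up the environment on the new slave first: ssh, the JDK, and copies of the relevant config, lib, and bin directories.
2. Add the new datanode's host entry to the namenode and the other datanodes in the cluster.
3. Add the new datanode's IP to conf/slaves on the master.
4. Restart the cluster and check that the new datanode shows up in it.
5. Run bin/start-balancer.sh; this takes a long time.
Notes:
1. Without balancing, the cluster stores all new data on the new node, which lowers MapReduce efficiency.
2. bin/start-balancer.sh can be run as is, or with a parameter such as -threshold 5.
The threshold is the balancing threshold, 10% by default. The lower the value, the more evenly the nodes are balanced, but the longer balancing takes.
3. The balancer can also run on a cluster that has MapReduce jobs running. The default dfs.balance.bandwidthPerSec is very low, 1 MB/s. When no MapReduce jobs are running, raising this setting speeds up balancing.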
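
The meaning of the threshold can be sketched in a few lines of Python. This assumes the usual definition, that a datanode counts as balanced when its utilization differs from the cluster-wide utilization by no more than the threshold; the function name, node names, and numbers are hypothetical, and this is not the balancer's own code.

```python
from typing import Dict, List


def unbalanced_nodes(used: Dict[str, float], capacity: Dict[str, float],
                     threshold_pct: float = 10.0) -> List[str]:
    """Return the nodes whose utilization is more than threshold_pct
    percentage points away from the overall cluster utilization."""
    cluster_util = 100.0 * sum(used.values()) / sum(capacity.values())
    result = []
    for node in sorted(used):
        node_util = 100.0 * used[node] / capacity[node]
        if abs(node_util - cluster_util) > threshold_pct:
            result.append(node)
    return result


# Three old nodes at 80% and one freshly added empty node (cluster average: 60%).
used = {"old1": 80.0, "old2": 80.0, "old3": 80.0, "new": 0.0}
capacity = {"old1": 100.0, "old2": 100.0, "old3": 100.0, "new": 100.0}
strict = unbalanced_nodes(used, capacity, threshold_pct=10.0)
loose = unbalanced_nodes(used, capacity, threshold_pct=25.0)
```

With the 10% default every node is out of range here, while a looser 25% threshold only flags the new node, which is why a lower threshold means more data to move.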

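Other notes:
1. Make sure the firewall on the slave is turned off.
2. Make sure the new slave's IP has been added to /etc/hosts on the master and the other slaves, and conversely that the IPs of the master and the other slaves have been added to /etc/hosts on the new slave.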
Number of mappers and reducers
URL: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
HowManyMapsAndReduces
Partitioning your job into maps and reduces
Picking the appropriate size for the tasks for your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but improves load balancing and lowers the cost of failures. At one extreme is the 1 map/1 reduce case, where nothing is distributed. The other extreme is to have 1,000,000 maps/1,000,000 reduces, where the framework runs out of resources for the overhead.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files, although that leads people to adjust their DFS block size in order to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
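
The arithmetic behind the 82k figure can be checked with a short sketch. It models the rule described above (block size as the upper bound, mapred.min.split.size as the lower bound, mapred.map.tasks as a hint) as max(min size, min(goal size, block size)), and it treats the input as one large file, which is a simplification. The function names are hypothetical.

```python
TB = 1024 ** 4
MB = 1024 ** 2


def split_size(total_bytes: int, block_size: int,
               map_tasks_hint: int = 1, min_split_size: int = 1) -> int:
    """Split size: the hint sets a goal size, capped by the block size
    and floored by the minimum split size."""
    goal_size = total_bytes // max(map_tasks_hint, 1)
    return max(min_split_size, min(goal_size, block_size))


def estimate_num_maps(total_bytes: int, block_size: int,
                      map_tasks_hint: int = 1, min_split_size: int = 1) -> int:
    """Number of splits (one map each) if the input were a single large file."""
    size = split_size(total_bytes, block_size, map_tasks_hint, min_split_size)
    return -(-total_bytes // size)  # ceiling division


default_maps = estimate_num_maps(10 * TB, 128 * MB)  # 81920, i.e. about 82k
more_maps = estimate_num_maps(10 * TB, 128 * MB, map_tasks_hint=200_000)
fewer_maps = estimate_num_maps(10 * TB, 128 * MB, min_split_size=256 * MB)
```

A hint larger than the block count pushes the number of maps up, while raising the minimum split size above the block size pushes it down.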
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
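
As a worked example of the 0.95 / 1.75 rule of thumb, here is a small sketch. Rounding to the nearest whole number is my own assumption, since the text does not say how to round; the function name and the cluster size are hypothetical.

```python
def suggested_reduces(nodes: int, max_reduce_tasks_per_node: int,
                      factor: float = 0.95) -> int:
    """Rule-of-thumb reduce count: factor * (nodes * reduce slots per node)."""
    return round(factor * (nodes * max_reduce_tasks_per_node))


# Hypothetical cluster: 10 nodes, 2 reduce slots per node.
single_wave = suggested_reduces(10, 2, factor=0.95)  # all reduces start at once
two_waves = suggested_reduces(10, 2, factor=1.75)    # faster nodes run a second round
```

For this cluster the two factors give 19 and 35 reduces respectively.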
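My own understanding:
The number of mappers depends on the input files and on the file splits. The upper bound of a file split is dfs.block.size, the lower bound can be set via mapred.min.split.size, and in the end the InputFormat decides.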

A good recommendation:
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<description>The maximum number of reduce tasks that will be run
simultaneously by a task tracker.
</description>
</property>

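Adding a new disk to a single node
1. On the node that gets the new disk, edit dfs.data.dir, listing the old and new directories separated by commas.
2. Restart DFS.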

Syncing the Hadoop code
hadoop-env.sh
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

Merging small HDFS files from the command line (the merged result is written to a local file)
hadoop fs -getmerge <src> <dest>

How to restart reduce jobs
Introduced recovery of jobs when JobTracker restarts. This facility is off by default.
Introduced config parameters "mapred.jobtracker.restart.recover", "mapred.jobtracker.job.history.block.size", and "mapred.jobtracker.job.history.buffer.size".
I have not verified this yet.

Problems with IO write operations
0-1246359584298, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5911099437886836280_1292 to /172.16.100.165:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.16.100.165:50010 remote=/172.16.100.165:50930]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
at java.lang.Thread.run(Thread.java:619)

It seems there are many reasons it can time out; the example given in
HADOOP-3831 is a slow-reading client.

Workaround: try setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml.
My understanding is that this issue should be fixed in Hadoop 0.19.1, so that
we should leave the standard timeout. However, until then this can help
resolve issues like the one you're seeing.

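How to decommission HDFS nodes
The dfsadmin help text in the current version does not explain this clearly (a bug has already been filed). The correct procedure is:
1. Point dfs.hosts at the current slaves file, using the full path. Note that the hostnames in the list must be the full names, i.e. what uname -n returns.
2. Put the full names of the nodes to be decommissioned in another file, e.g. slaves.ex, and set the dfs.hosts.exclude parameter to the full path of that file.
3. Run the command bin/hadoop dfsadmin -refreshNodes
4. In the web UI, or in the output of bin/hadoop dfsadmin -report, the nodes being decommissioned show the state "Decommission in progress" until all the data that needs to be re-replicated has been copied.
5. When that is done, remove the decommissioned nodes from slaves (that is, from the file dfs.hosts points to).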

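While on the subject, the -refreshNodes command has three other uses:
1. Adding allowed nodes to the list (add the hostnames to dfs.hosts).
2. Removing nodes directly, without re-replicating their data (remove the hostnames from dfs.hosts).
3. The reverse of decommissioning: for nodes that are listed in both the exclude file and dfs.hosts and are currently being decommissioned, stop the decommission, i.e. turn "Decommission in progress" nodes back to Normal (shown as "in service" in the web UI).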

『5』 How to merge multiple input files into a single file in Hadoop

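When using Hadoop, we sometimes need to merge several files on the local file system into one file on the Hadoop file system. The Hadoop file system shell has no command that does this directly, but its put command can read from standard input, so the following command does the job: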

cat localfile1 localfile2 | bin/hadoop fs -put /dev/fd/0 destfile
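
The same pipe can be sketched in Python for cases where the list of files is built by a script. This is a rough sketch, not a tested tool: `concat_files` and `put_merged` are hypothetical helper names, the default `bin/hadoop` path is taken from the command above, and `put_merged` relies on `hadoop fs -put - <dest>` reading its data from standard input.

```python
import shutil
import subprocess
from typing import BinaryIO, Iterable


def concat_files(paths: Iterable[str], out: BinaryIO) -> None:
    """Stream each local file, in order, into the binary stream `out`."""
    for path in paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)


def put_merged(paths: Iterable[str], dest: str, hadoop: str = "bin/hadoop") -> None:
    """Pipe the concatenated local files into `hadoop fs -put - dest`.

    Needs a working Hadoop installation; nothing here runs it automatically.
    """
    proc = subprocess.Popen([hadoop, "fs", "-put", "-", dest],
                            stdin=subprocess.PIPE)
    try:
        concat_files(paths, proc.stdin)
    finally:
        proc.stdin.close()
    if proc.wait() != 0:
        raise RuntimeError("hadoop fs -put exited with a non-zero status")
```

A call such as `put_merged(["localfile1", "localfile2"], "destfile")` would then play the role of the shell pipeline, without holding the merged data in memory.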

『6』 Hadoop/HDFS: merging fsimage and edits in the NameNode

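This does not affect reading files. The NameNode builds its view of the file system from both fsimage and edits, so file information is found whether or not the two have been merged yet.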
