My First Lucky and Sad Hadoop Results

php中文网

Published: 2016-06-07 16:30:14

Recently I have been playing with Hadoop to analyze a data set I scraped from weibo.com. Several attempts failed because of a disk space shortage. After I reduced the input data set volume, I luckily got a completed Hadoop job result but, sadly, with only 1,000 lines of records processed.

Here is the Job Summary:

Counter	Map	Reduce	Total
Bytes Read	7,945,196	0	7,945,196
FILE_BYTES_READ	16,590,565,518	8,021,579,181	24,612,144,699
HDFS_BYTES_READ	*7,945,580*	0	7,945,580
FILE_BYTES_WRITTEN	24,612,303,774	8,021,632,091	32,633,935,865
HDFS_BYTES_WRITTEN	0	2,054,409,494	2,054,409,494
Reduce input groups	0	381,696,888	381,696,888
Map output materialized bytes	8,021,579,181	0	8,021,579,181
Combine output records	826,399,600	0	826,399,600
Map input records	*1,000*	0	1,000
Reduce shuffle bytes	0	8,021,579,181	8,021,579,181
Physical memory (bytes) snapshot	1,215,041,536	72,613,888	1,287,655,424
Reduce output records	0	*381,696,888*	381,696,888
Spilled Records	1,230,714,511	401,113,702	1,631,828,213
Map output bytes	7,667,457,405	0	7,667,457,405
Total committed heap usage (bytes)	1,038,745,600	29,097,984	1,067,843,584
CPU time spent (ms)	2,957,800	2,104,030	5,061,830
Virtual memory (bytes) snapshot	4,112,838,656	1,380,306,944	5,493,145,600
SPLIT_RAW_BYTES	384	0	384
Map output records	*426,010,418*	0	426,010,418
Combine input records	*851,296,316*	0	851,296,316
Reduce input records	0	401,113,702	401,113,702

From this we can see, especially from the highlighted metrics (marked with asterisks), that I passed in only about 7 MB of data with 1,000 lines of records, yet the Reducer output 381,696,888 records: a compressed gz file of about 2.1 GB, or some 9 GB of plain text when decompressed.
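To put the highlighted counters in proportion, the small Python sketch below recomputes the amplification ratios from the numbers in the table. The variable names are my own, not Hadoop's; the values are copied from the job summary.

```python
# Values copied from the job summary table above.
map_input_records = 1_000
map_output_records = 426_010_418
reduce_output_records = 381_696_888
hdfs_bytes_read = 7_945_580
hdfs_bytes_written = 2_054_409_494

# How many records each input line fans out into, on average.
map_fanout = map_output_records / map_input_records
reduce_fanout = reduce_output_records / map_input_records

# How much larger the compressed output on HDFS is than the input.
byte_amplification = hdfs_bytes_written / hdfs_bytes_read

print(f"map output records per input record:    {map_fanout:,.0f}")
print(f"reduce output records per input record: {reduce_fanout:,.0f}")
print(f"HDFS bytes written per byte read:       {byte_amplification:,.1f}")
```

Each input line fans out into several hundred thousand map output records on average, which is where the disk space goes.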

But clearly it is not a bug in my code that causes so much disk space usage. The output metrics above are all reasonable, although the contrast between a 7 MB input of only 1,000 records and a 9 GB output of 381,696,888 records may be surprising. The reason is that I am computing co-occurrence combinations, so the number of output records grows combinatorially with the number of items in each input record.
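The job's actual code is not shown here, so the following is only a minimal Python sketch of why co-occurrence counting explodes. It assumes each record is a list of items and that every unordered pair of distinct items in a record is emitted once; the function names `emit_pairs`, `count_cooccurrences`, and `pairs_per_record` are hypothetical. Under that assumption a record with n items yields n(n-1)/2 pairs, so output grows quadratically with record length.

```python
from collections import Counter
from itertools import combinations


def emit_pairs(items):
    """Map step: yield every unordered pair of distinct items in one record."""
    # Sorting and de-duplicating makes (a, b) and (b, a) the same key.
    return combinations(sorted(set(items)), 2)


def count_cooccurrences(records):
    """Reduce step: count how many records each pair appears in."""
    counts = Counter()
    for items in records:
        counts.update(emit_pairs(items))
    return counts


def pairs_per_record(n):
    """Number of unordered pairs emitted for a record with n distinct items."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    records = [["a", "b", "c"], ["b", "c", "d"]]
    for pair, count in sorted(count_cooccurrences(records).items()):
        print(pair, count)
    for n in (10, 100, 1000):
        print(n, "items ->", pairs_per_record(n), "pairs")
```

Whether the original job emits pairs or larger combinations is not stated. If it were pairs, an average of roughly 426,000 map output records per input line would be consistent with records holding on the order of 900 items each.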

From this experiment I learned that my personal computer really cannot play with the big elephant. I cut the input from the first 10 thousand records down to 5 thousand, then 3 thousand, and finally ONE thousand. But the data analysis must go on, and I need to find a way to make it work: I actually have 30 times as much data to process, that is, 30 thousand records.
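As a rough back-of-the-envelope estimate for the full data set, the sketch below scales the measured byte counters linearly with the number of input records. Linear scaling is my assumption, not something the job measured: it is plausible for map output if the remaining records resemble the first 1,000, while reduce output may grow more slowly because repeated combinations collapse into one group.

```python
# Counters copied from the 1,000-record run above.
measured_records = 1_000
file_bytes_written = 32_633_935_865  # cumulative local writes, not peak disk usage
hdfs_bytes_written = 2_054_409_494   # compressed job output on HDFS

target_records = 30_000
scale = target_records / measured_records  # assumed linear scaling

GB = 10**9
est_local_gb = file_bytes_written * scale / GB
est_hdfs_gb = hdfs_bytes_written * scale / GB

print(f"estimated local disk writes: {est_local_gb:,.0f} GB")
print(f"estimated compressed output: {est_hdfs_gb:,.0f} GB")
```

Even as a loose estimate, this suggests the full run needs far more disk than a single personal computer is likely to have free.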