Hive Functions & Compression - 白红宇

Published: 2019-06-28

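1. Sorting

Order By: global sort (the entire result set is ordered, which forces a single reducer)

1) Sort the employee table by bonus (comm) in ascending order

select * from emptable order by emptable.comm asc;

asc is the default and can be omitted.

2) Sort the employee table by bonus (comm) in descending order

select * from emptable order by emptable.comm desc;

3) Sort by department and then by bonus, both ascending

select * from emptable order by deptno,comm;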

Sort By: per-reducer sort (each reducer's output is sorted, but the overall result is not)

Set the number of reducers: set mapreduce.job.reduces = 3;

select * from dept_partitions sort by deptno desc;

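Distribute By: controls which reducer each row is sent to (rows with the same key go to the same reducer); it does not sort by itself and is usually combined with sort by

1) Distribute rows by department number, then sort each reducer's rows by location (loc) in descending order.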

select * from dept_partitions distribute by deptno sort by loc desc;

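Cluster By: distribute and sort on the same column

1) Distribute and sort by department number

select * from dept_partitions cluster by deptno;

Note: when distribute by and sort by use the same column, cluster by can replace them. Cluster by only sorts in ascending order; asc or desc cannot be specified.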
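The difference between a global sort and "sorted within each reducer" can be easier to see in a small simulation. The following is a minimal sketch in plain Java, not Hive's actual implementation: the class name DistributeSortSketch, the int[] row layout {deptno, loc} and the sample values are all hypothetical, and the routing rule (non-negative hash modulo the reducer count) is an assumption in the spirit of a hash partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DistributeSortSketch {

    // Mimics "distribute by deptno sort by loc desc".
    // Each row is a hypothetical int[] {deptno, loc}.
    public static List<List<int[]>> distributeAndSort(List<int[]> rows, int reducers) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            parts.add(new ArrayList<>());
        }
        for (int[] row : rows) {
            // Assumption: route by a non-negative hash of deptno modulo the reducer count.
            int target = (Integer.hashCode(row[0]) & Integer.MAX_VALUE) % reducers;
            parts.get(target).add(row);
        }
        for (List<int[]> part : parts) {
            // Sort inside each reducer only, by loc descending.
            part.sort((x, y) -> Integer.compare(y[1], x[1]));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
            new int[]{10, 1700}, new int[]{20, 1800}, new int[]{30, 1900},
            new int[]{40, 1700}, new int[]{10, 1900}, new int[]{20, 1700});
        List<List<int[]>> parts = distributeAndSort(rows, 3);
        for (int i = 0; i < parts.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (int[] row : parts.get(i)) {
                sb.append(Arrays.toString(row)).append(' ');
            }
            System.out.println("reducer " + i + ": " + sb.toString().trim());
        }
    }
}
```

Every row with the same deptno ends up in the same list, and each list is ordered by loc, but concatenating the lists does not give a globally sorted result. That is the trade-off against order by, which gives a total order at the cost of a single reducer.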

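2. Bucketing

Bucketing splits a table's data into files (partitioning, by contrast, splits it into directories).

1) Create a bucketed table

The key clause is: clustered by(id) into 4 buckets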

hive> set mapreduce.job.reduces=4;

hive> create table emptable_buck(id int, name string)
    > clustered by(id) into 4 buckets
    > row format
    > delimited fields
    > terminated by '\t';

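View the table's detailed description

hive> desc formatted emptable_buck;

Load data directly (a plain load typically copies the file as-is and does not hash rows into buckets, which is why a staging table is used next)

hive> load data local inpath '/root/hsiehchou.txt' into table emptable_buck;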

Create a plain (non-bucketed) staging table with the same columns

hive> create table emptable_b(id int, name string)
    > row format
    > delimited fields
    > terminated by '\t';

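Truncate the bucketed table

hive> truncate table emptable_buck;

Load the data into the staging table (it is inserted into the buckets from there)

hive> load data local inpath '/root/hsiehchou.txt' into table emptable_b;

Enable bucketing enforcement (rows are split into buckets on insert; when it is off, all rows land in a single bucket by default). In Hive 2.x and later this property was removed and bucketing is always enforced.

hive> set hive.enforce.bucketing=true;

hive> truncate table emptable_buck;

Finally, populate the bucketed table from the staging table with an insert ... select statement (for example, insert into table emptable_buck select id, name from emptable_b;) so that each row is hashed into its bucket.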

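Sampling: when a representative result is enough and the full result set is not needed, sample the table instead of scanning all of it.

tablesample(bucket 1 out of 2 on id)

1: the bucket to start from (here, the first bucket)

2: the divisor; with 4 buckets in the table, 4/2 = 2 buckets are read (the 1st and the 3rd)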

hive> select * from emptable_buck tablesample(bucket 1 out of 2 on id); 
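How a row lands in a bucket, and which buckets a tablesample(bucket x out of y) clause reads, can be worked out by hand. The sketch below is plain Java under stated assumptions, not Hive code: the class BucketSampleSketch and its method names are hypothetical, and it assumes the older hashing scheme in which an int key hashes to its own value (newer Hive versions may hash differently) and that y divides the table's bucket count.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketSampleSketch {

    // 0-based bucket index for an int key: non-negative hash modulo the bucket count.
    // Assumption: an int column hashes to its own value.
    public static int bucketOf(int id, int numBuckets) {
        return (Integer.hashCode(id) & Integer.MAX_VALUE) % numBuckets;
    }

    // 0-based indexes of the table buckets read by "tablesample(bucket x out of y)",
    // assuming y divides numBuckets: buckets x, x + y, x + 2y, ... in 1-based terms.
    public static List<Integer> sampledBuckets(int x, int y, int numBuckets) {
        List<Integer> picked = new ArrayList<>();
        for (int b = x - 1; b < numBuckets; b += y) {
            picked.add(b);
        }
        return picked;
    }

    public static void main(String[] args) {
        for (int id = 1; id <= 8; id++) {
            System.out.println("id " + id + " -> bucket " + bucketOf(id, 4));
        }
        // For the 4-bucket table above, "bucket 1 out of 2" reads 0-based buckets 0 and 2.
        System.out.println(sampledBuckets(1, 2, 4));
    }
}
```

With 4 buckets, "bucket 1 out of 2" covers the rows whose id is even under these assumptions, which is roughly half of the table when ids are evenly spread.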

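3. User-Defined Functions (UDF)

List the built-in functions

show functions;

Show a function's detailed description

desc function extended upper;

UDF: one row in, one row out

UDAF: aggregate function, many rows in, one row out (count, max, avg)

UDTF: one row in, many rows out (for example, explode)

Writing a UDF in Java

Add all the jar files under Hive's lib directory to the project's classpath

Write the Java code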

package com.hsiehchou;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyConcat extends UDF {

    // Concatenate the two arguments with "******" in between
    public String evaluate(String a, String b) {
        return a + "******" + String.valueOf(b);
    }
}

Export the class as a jar and copy it to hsiehchou121.

Register it as a temporary function:

add jar /root/Myconcat.jar;

create temporary function my_cat as "com.hsiehchou.MyConcat";

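To have Hive load the jar automatically at startup instead of running add jar in every session, set hive.aux.jars.path in hive-site.xml:

<property>
<name>hive.aux.jars.path</name>
<value>file:///root/hd/hive/lib/hive.jar</value>
</property>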
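The behaviour of the evaluate method in MyConcat can be checked without a Hive runtime. The sketch below copies its body into a plain class (MyConcatCheck is a hypothetical name, used only so the code compiles without the Hive jars) and shows what happens when an argument is null.

```java
public class MyConcatCheck {

    // Same body as MyConcat.evaluate, copied into a plain class so it runs without Hive jars.
    public static String evaluate(String a, String b) {
        return a + "******" + String.valueOf(b);
    }

    public static void main(String[] args) {
        System.out.println(evaluate("hello", "hive")); // prints hello******hive
        System.out.println(evaluate(null, "hive"));    // prints null******hive
    }
}
```

Java string concatenation turns a null reference into the text "null". Assuming Hive hands a SQL NULL to evaluate as a Java null, my_cat would return a string containing "null" rather than NULL, so an explicit null check may be worth adding if that is not the intended result.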

4. Hive Compression

Storage: HDFS

Computation: MapReduce

Compressing map output

Enable compression of Hive's intermediate data

set hive.exec.compress.intermediate=true;

Enable map output compression

set mapreduce.map.output.compress=true;

Use the Snappy codec

set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Compressing reduce output

Enable compression of Hive's final output

set hive.exec.compress.output=true;

Enable MapReduce output compression

set mapreduce.output.fileoutputformat.compress=true;

Specify the compression codec

set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Set the compression type to block compression

set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Test the result

insert overwrite local directory '/root/datas/rs' select * from emptable order by sal desc;
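The reason for compressing map and reduce output is that tabular data is highly repetitive, so fewer bytes have to be written to disk and moved over the network, at the cost of some CPU time. The following is a rough illustration only: it uses the GZIP classes from the JDK as a stand-in, because Snappy is not part of the standard library, and the class CompressionSketch and its generated tab-separated rows are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressionSketch {

    // Compress a byte array with GZIP (a stand-in for Snappy, which the JDK does not ship).
    public static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Restore the original bytes from GZIP-compressed data.
    public static byte[] decompress(byte[] packed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Hypothetical tab-separated rows: id, a constant name, and a department number.
    public static byte[] sampleRows(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(i).append('\t').append("employee").append('\t')
              .append((i % 4) * 10 + 10).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleRows(1000);
        byte[] packed = compress(raw);
        System.out.println("raw bytes: " + raw.length + ", compressed bytes: " + packed.length);
    }
}
```

Snappy generally trades a lower compression ratio for higher speed compared with GZIP, which is why it is a common choice for intermediate data that is written once and read once.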

Reposted from: https://www.cnblogs.com/hsiehchou/p/10479066.html
