Hive essentially translates SQL into MapReduce / Tez / Spark jobs and stores its data on HDFS. The biggest difference from MySQL: data is written once and read many times — row-level UPDATE and DELETE are not supported by default, and latency is high but throughput is large.
## Core Differences from MySQL

| Feature | MySQL | Hive |
| --- | --- | --- |
| Transactions / ACID | Supported | Not by default (limited support with ACID enabled) |
| UPDATE / DELETE | Supported | Not by default (supported on ACID tables) |
| Auto-increment keys | AUTO_INCREMENT | Not supported |
| Indexes | B+Tree indexes | No traditional indexes (relies on partition / bucket pruning) |
| Data ingestion | Real-time INSERT | LOAD DATA or INSERT OVERWRITE |
| Execution engine | Local storage engine | MapReduce / Tez / Spark |
| Latency | Milliseconds | Seconds to minutes |
| Data format | Row store | TextFile / ORC / Parquet, etc. |
| NULL comparison | IS NULL | IS NULL, or NVL() |
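The ACID rows above can be made concrete. A minimal sketch of enabling limited UPDATE / DELETE support — it assumes Hive 3.x with the metastore-backed transaction manager available, and the table name `acid_demo` is illustrative:

```sql
-- ACID tables must be stored as ORC and marked transactional
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

CREATE TABLE acid_demo (
    id   INT,
    name STRING
) STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Only possible because the table is transactional
UPDATE acid_demo SET name = 'bob' WHERE id = 1;
DELETE FROM acid_demo WHERE id = 2;
```

Updates are written as delta files that are merged on read and compacted in the background, which is why they remain far more expensive than in MySQL.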
## Database Operations

```sql
CREATE DATABASE [IF NOT EXISTS] db_name
[LOCATION '/user/hive/warehouse/db_name.db'];

SHOW DATABASES;
USE db_name;
DESCRIBE DATABASE EXTENDED db_name;

-- CASCADE also drops all tables inside the database
DROP DATABASE [IF EXISTS] db_name [CASCADE];
```
Each Hive database corresponds to a directory ending in `.db` on HDFS, under the default path `/user/hive/warehouse/`.
## Table Operations

### Creating Tables

```sql
CREATE TABLE IF NOT EXISTS table_name (
    id     INT,
    name   STRING,
    salary DOUBLE,
    dt     STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```
Differences from MySQL:

- No AUTO_INCREMENT, PRIMARY KEY, or FOREIGN KEY
- The storage format is declared with STORED AS (TEXTFILE if omitted)
- Strings all use the STRING type; no length needs to be specified
### Copying a Table Structure

```sql
-- Copy the structure only
CREATE TABLE new_table LIKE old_table;

-- Copy the structure and the data
CREATE TABLE new_table AS SELECT * FROM old_table;
```
### Altering Tables

```sql
ALTER TABLE old_name RENAME TO new_name;
ALTER TABLE table_name ADD COLUMNS (new_col STRING);
ALTER TABLE table_name CHANGE old_col new_col STRING;
-- Replaces the entire column list
ALTER TABLE table_name REPLACE COLUMNS (id INT, name STRING);
```
## Managed (Internal) vs External Tables

This is one of the most important concepts in Hive, with no MySQL counterpart.
```sql
-- Managed table: Hive owns both the metadata and the data files
CREATE TABLE managed_table (
    id   INT,
    name STRING
) STORED AS ORC;

-- External table: Hive owns only the metadata
CREATE EXTERNAL TABLE external_table (
    id   INT,
    name STRING
) LOCATION '/data/external_table/';
```
How to choose:

- Data managed exclusively by Hive → managed table
- Data shared with other systems (Spark, Flink, Presto) → external table
- In production, prefer external tables to guard against accidental data loss
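The difference matters most on DROP TABLE — a short sketch using the `managed_table` / `external_table` pair defined above:

```sql
-- Deletes the metadata AND the data files on HDFS
DROP TABLE managed_table;

-- Deletes only the metadata; the files under
-- /data/external_table/ remain on HDFS
DROP TABLE external_table;
```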
## Partitioned Tables

Partitioning splits data into separate HDFS subdirectories by the value of a chosen field, so queries can skip irrelevant partitions entirely, which dramatically improves performance. The partition field is not stored in the table's data files — it is a virtual column.
```sql
CREATE TABLE orders (
    order_id INT,
    user_id  INT,
    amount   DOUBLE
)
PARTITIONED BY (dt STRING, region STRING)
STORED AS ORC;
```
```sql
ALTER TABLE orders ADD PARTITION (dt = '2024-01-01', region = 'north');
SHOW PARTITIONS orders;
ALTER TABLE orders DROP PARTITION (dt = '2024-01-01');
```
```sql
-- Static partition insert: partition values are spelled out
INSERT INTO orders PARTITION (dt = '2024-01-01', region = 'north')
SELECT order_id, user_id, amount
FROM stg_orders
WHERE dt = '2024-01-01' AND region = 'north';

-- Dynamic partition insert: partition values come from the query itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO orders PARTITION (dt, region)
SELECT order_id, user_id, amount, dt, region FROM stg_orders;
```
Queries should always filter on partition columns; otherwise Hive scans the whole table (and strict mode rejects the query outright):
```sql
-- Good: the partition filter prunes the scan to one subdirectory
SELECT * FROM orders WHERE dt = '2024-01-01';

-- Bad: no partition filter, so every partition is scanned
SELECT * FROM orders WHERE amount > 100;
```
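Strict-mode enforcement is just a session setting — a small sketch:

```sql
SET hive.mapred.mode = strict;

-- Now rejected at compile time (no partition predicate found)
SELECT * FROM orders WHERE amount > 100;

SET hive.mapred.mode = nonstrict;  -- back to the default
```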
## Bucketed Tables

Bucketing hashes a field's value to spread rows across a fixed number of files (buckets). It is used for join optimization and data sampling.
```sql
CREATE TABLE users (
    user_id INT,
    name    STRING,
    age     INT
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```
```sql
-- Ensure inserts actually write one file per bucket
SET hive.enforce.bucketing = true;
INSERT INTO users SELECT * FROM stg_users;
```
```sql
-- Sample one of 4 hash groups on user_id
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 4 ON user_id);

-- Sample roughly 10% of the data by size
SELECT * FROM users TABLESAMPLE (10 PERCENT);
```
Partitioning vs bucketing:

| | Partitioning | Bucketing |
| --- | --- | --- |
| Purpose | Reduce the amount of data scanned | Optimize joins / sampling |
| Mechanism | HDFS subdirectories | Hash-split files within a directory |
| Count | Dynamic (driven by the data) | Fixed (set at table creation) |
| Typical use | Filtering by time / region | High-cardinality join keys such as user_id |
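The join benefit can be sketched as a bucket map join. This assumes a hypothetical pair of tables `orders_b` and `users_b`, both created with `CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS`:

```sql
-- Matching buckets on both sides are joined pairwise,
-- avoiding a full shuffle of either table
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join = true;

SELECT a.order_id, b.name
FROM orders_b a
JOIN users_b b ON a.user_id = b.user_id;
```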
## Loading and Exporting Data

```sql
-- From the local filesystem (copies the file)
LOAD DATA LOCAL INPATH '/local/path/data.csv' INTO TABLE table_name;

-- From HDFS (moves the file)
LOAD DATA INPATH '/hdfs/path/data.csv' INTO TABLE table_name;

-- Overwrite a specific partition
LOAD DATA LOCAL INPATH '/local/path/data.csv'
OVERWRITE INTO TABLE table_name
PARTITION (dt = '2024-01-01');
```
```sql
-- Export to a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/local/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM table_name;

-- Export to an HDFS directory
INSERT OVERWRITE DIRECTORY '/hdfs/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM table_name;
```
## Hive-Specific Data Types

### Complex Types

```sql
CREATE TABLE complex_demo (
    id      INT,
    tags    ARRAY<STRING>,
    scores  MAP<STRING, INT>,
    address STRUCT<city:STRING, zip:STRING>
);
```
```sql
-- Array element by index (0-based)
SELECT tags[0] FROM complex_demo;

-- Map value by key
SELECT scores['math'] FROM complex_demo;

-- Struct field by name
SELECT address.city FROM complex_demo;
```
### LATERAL VIEW + EXPLODE

EXPLODE expands an array or map into multiple rows; combined with LATERAL VIEW it is one of the most common operations in Hive.
```sql
-- Explode an array: one row per tag
SELECT id, tag
FROM complex_demo
LATERAL VIEW EXPLODE(tags) tmp AS tag;

-- Explode a map: one row per key/value pair
SELECT id, k, v
FROM complex_demo
LATERAL VIEW EXPLODE(scores) tmp AS k, v;

-- Multiple LATERAL VIEWs produce a cross product of the expansions
SELECT id, tag, k, v
FROM complex_demo
LATERAL VIEW EXPLODE(tags) t1 AS tag
LATERAL VIEW EXPLODE(scores) t2 AS k, v;
```
## Window Functions

Hive has complete window-function support with standard SQL syntax — and it gained them long before MySQL, which only added window functions in 8.0.
```sql
SELECT
    user_id,
    amount,
    dt,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rn,
    RANK()       OVER (PARTITION BY user_id ORDER BY amount DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS dense_rnk
FROM orders;
```
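A common use of ROW_NUMBER() is the top-N-per-group pattern; since window functions cannot appear in WHERE, the ranking is wrapped in a subquery. A sketch against the same orders table:

```sql
-- Top 3 orders by amount for each user
SELECT user_id, amount
FROM (
    SELECT user_id, amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rn
    FROM orders
) t
WHERE rn <= 3;
```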
```sql
-- Rolling 3-day sum: the current row plus the two before it
SELECT
    dt,
    amount,
    SUM(amount) OVER (
        ORDER BY dt
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_3day_sum
FROM daily_sales;
```
```sql
-- Previous / next row values, defaulting to 0 at the edges
SELECT
    dt,
    amount,
    LAG(amount, 1, 0)  OVER (ORDER BY dt) AS prev_day,
    LEAD(amount, 1, 0) OVER (ORDER BY dt) AS next_day
FROM daily_sales;
```
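LAG makes day-over-day comparisons a one-liner — a sketch on the same daily_sales table:

```sql
-- Change versus the previous day (the first row compares against 0)
SELECT
    dt,
    amount,
    amount - LAG(amount, 1, 0) OVER (ORDER BY dt) AS dod_change
FROM daily_sales;
```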
## Common Built-in Functions

### String Functions

```sql
SELECT
    CONCAT('hello', ' ', 'world'),          -- 'hello world'
    CONCAT_WS('-', '2024', '01', '01'),     -- '2024-01-01'
    SUBSTR('hello', 2, 3),                  -- 'ell' (1-based index)
    TRIM('  hello  '),                      -- 'hello'
    LOWER('HELLO'),
    UPPER('hello'),
    LENGTH('hello'),                        -- 5
    REGEXP_REPLACE('abc123', '[0-9]', ''),  -- 'abc'
    SPLIT('a,b,c', ',')[1];                 -- 'b'
```
### Date Functions

```sql
SELECT
    CURRENT_DATE,
    CURRENT_TIMESTAMP,
    TO_DATE('2024-01-01 12:00:00'),             -- 2024-01-01
    DATE_ADD('2024-01-01', 7),                  -- 2024-01-08
    DATE_SUB('2024-01-08', 7),                  -- 2024-01-01
    DATEDIFF('2024-01-08', '2024-01-01'),       -- 7
    DATE_FORMAT(CURRENT_DATE, 'yyyy-MM'),
    YEAR('2024-01-01'),                         -- 2024
    MONTH('2024-01-01'),                        -- 1
    FROM_UNIXTIME(1704067200, 'yyyy-MM-dd'),    -- epoch seconds to string
    UNIX_TIMESTAMP('2024-01-01', 'yyyy-MM-dd'); -- string to epoch seconds
```
### Conditional Functions

```sql
SELECT
    IF(score >= 60, 'pass', 'fail'),
    CASE
        WHEN score >= 90 THEN 'excellent'
        WHEN score >= 60 THEN 'pass'
        ELSE 'fail'
    END,
    COALESCE(col1, col2, 'default'),  -- first non-NULL argument
    NVL(col, 'default')               -- two-argument COALESCE
FROM t;
```
### Aggregate Functions

```sql
SELECT
    COUNT(*),
    COUNT(DISTINCT user_id),
    SUM(amount),
    AVG(amount),
    MAX(amount),
    MIN(amount),
    COLLECT_LIST(name),                -- keeps duplicates
    COLLECT_SET(name),                 -- de-duplicates
    CONCAT_WS(',', COLLECT_SET(name))  -- MySQL GROUP_CONCAT equivalent
FROM orders
GROUP BY dt;
```
## Storage Format Comparison

| Format | Characteristics | Best for |
| --- | --- | --- |
| TEXTFILE | Plain text, human-readable, no compression | Importing external data, debugging |
| ORC | Columnar, high compression, ACID support | First choice for Hive production |
| Parquet | Columnar, broad compatibility | Sharing across engines (Spark / Presto) |
| SEQUENCEFILE | Binary key-value storage | Rarely used |
| AVRO | Row-based, schema-evolution friendly | Data pipeline scenarios |
```sql
-- ORC with Snappy compression
CREATE TABLE t STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Parquet with Snappy compression
CREATE TABLE t STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```
## Common Configuration

```sql
-- Run small jobs locally instead of on the cluster
SET hive.exec.mode.local.auto = true;

-- Dynamic partitioning
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Number of reducers
SET mapreduce.job.reduces = 10;

-- Automatically convert joins against small tables to map joins
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;  -- 25 MB threshold

-- Compress job output
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Run independent stages in parallel
SET hive.exec.parallel = true;

-- Strict / non-strict mode
SET hive.mapred.mode = nonstrict;
```
## Join Types

```sql
-- Inner join
SELECT a.*, b.name
FROM orders a
JOIN users b ON a.user_id = b.user_id;

-- Left outer join
SELECT a.*, b.name
FROM orders a
LEFT JOIN users b ON a.user_id = b.user_id;

-- Joins against a small dimension table are converted to
-- map joins automatically when hive.auto.convert.join is on
SELECT a.*, b.name
FROM orders a
JOIN dim_users b ON a.user_id = b.user_id;

-- LEFT SEMI JOIN: like IN / EXISTS, returns only left-side columns
SELECT a.*
FROM orders a
LEFT SEMI JOIN whitelist b ON a.user_id = b.user_id;
```
## Inspecting Metadata

```sql
DESCRIBE table_name;
DESC FORMATTED table_name;  -- detailed, nicely formatted table info
DESC EXTENDED table_name;   -- raw extended info

SHOW TABLES;
SHOW TABLES LIKE 'order*';
SHOW PARTITIONS table_name;
SHOW CREATE TABLE table_name;

SHOW FUNCTIONS;
DESCRIBE FUNCTION EXTENDED function_name;
```