Hive | EUB's second brain

Don't sync privileges from ranger to hive

When Apache Ranger is configured for authorization on Secure Hadoop cluster, Hive below 4.x.x synchronizes all the ranger hive policies to rdbms for Hive as default. It is unnecessary and make unnecessary high load on db. See https://issues.apache.org/jira/browse/HIVE-25391. You can disable it with the configurations on HiveServer2 as follows. hive.privilege.synchronizer=false

Monitoring metrics related to "jute.maxbuffer"

There is a configuration named as “jute.maxbuffer” when using zookeeper. This can be set on zookeeper client side or server side. On zookeeper client side, the setting should be lower than that on zookeeper server. If a client gets data bigger than the setting, it will get an error. There are some related issue. https://issues.apache.org/jira/browse/HIVE-21993 https://issues.apache.org/jira/browse/YARN-2962 In order to avoid this errors. Some metrics should be monitored on zookeeper. last_client_response_size or max_client_response_size client_response_size is a response size in bytes from zookeeper server to client. last_proposal_size or max_proposal_size proposal_size is a proposal size in bytes from zookeeper server leader to follower. proposal?: refer to https://zookeeper.apache.org/doc/r3.7.1/zookeeperInternals.html. These values should be lower than jute.maxbuffer. This setting can be set through jvm argument like -Djute.maxbuffer=10485760 (10mb). ...

Checklist for hive metastore when using mysql

MySQL Index There are some expensive operations for hive metastore when accessing or storing metadatas on RDBMS. Here are some official hive patches for indexing. -- HIVE-21063 CREATE UNIQUE INDEX `NOTIFICATION_LOG_EVENT_ID` ON NOTIFICATION_LOG (`EVENT_ID`) USING BTREE; -- HIVE-21487 CREATE INDEX COMPLETED_COMPACTIONS_RES ON COMPLETED_COMPACTIONS (CC_DATABASE,CC_TABLE,CC_PARTITION); -- HIVE-27165 DROP INDEX TAB_COL_STATS_IDX ON TAB_COL_STATS; CREATE INDEX TAB_COL_STATS_IDX ON TAB_COL_STATS (DB_NAME, TABLE_NAME, COLUMN_NAME, CAT_NAME) USING BTREE; DROP INDEX PCS_STATS_IDX ON PART_COL_STATS; CREATE INDEX PCS_STATS_IDX ON PART_COL_STATS (DB_NAME,TABLE_NAME,COLUMN_NAME,PARTITION_NAME,CAT_NAME) USING BTREE; When you upgrade your hive, there are some changes on tables in rdbms. You can find needed SQLs depending on version at https://github.com/apache/hive/tree/master/standalone-metastore/metastore-server/src/main/sql/mysql. ...

About "HADOOP_CLASSPATH" environment variable

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/UnixShellGuide.html#HADOOP_CLASSPATH In Hadoop ecosystem, HADOOP_CLASSPATH environment variable is commonly used in many places. Hive is use this variable, too. I wonder that how the HADOOP_CLASSPATH variable is used in a script like beeline. I cannot find HADOOP_CLASSPATH variable in Hive source codes. I finally figure out that when executing beeline it uses hadoop jar command. (https://github.com/apache/hive/blob/rel/release-3.1.3/bin/ext/beeline.sh#L35) It uses RunJar.java where HADOOP_CLASSPATH is used to set CLASSPATH. (https://github.com/apache/hadoop/blob/rel/release-3.3.4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/RunJar.java#L347-L351) If something in Hadoop ecosystem uses RunJar#main, it probably repect HADOOP_CLASSPATH environment variable. ...