Don't sync privileges from Ranger to Hive

When Apache Ranger is configured for authorization on a secure Hadoop cluster, Hive versions below 4.x synchronize all Ranger Hive policies to Hive's RDBMS by default. This is unnecessary and puts a needlessly high load on the database. See https://issues.apache.org/jira/browse/HIVE-25391. You can disable it with the following HiveServer2 configuration: hive.privilege.synchronizer=false
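As a minimal sketch, the setting can be checked in a Hadoop-style configuration file with a few lines of Python. The SAMPLE string shows the usual `<property>` layout of hive-site.xml; the helper names (`get_property`, `synchronizer_disabled`) are made up for this illustration and are not part of Hive.

```python
import xml.etree.ElementTree as ET

# Hypothetical hive-site.xml content, used only for illustration.
SAMPLE = """<configuration>
  <property>
    <name>hive.privilege.synchronizer</name>
    <value>false</value>
  </property>
</configuration>"""


def get_property(xml_text, name):
    """Return the value of a Hadoop-style property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def synchronizer_disabled(xml_text):
    """True only when hive.privilege.synchronizer is explicitly set to false."""
    value = get_property(xml_text, "hive.privilege.synchronizer")
    return value is not None and value.lower() == "false"


if __name__ == "__main__":
    print(synchronizer_disabled(SAMPLE))
```

To check a real file, read it and pass its text to `synchronizer_disabled`; the location of hive-site.xml depends on the installation.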

October 19, 2024

Monitoring metrics related to "jute.maxbuffer"

ZooKeeper has a configuration property named "jute.maxbuffer". It can be set on the ZooKeeper client side or the server side. On the client side, the setting should be lower than that on the ZooKeeper server. If a client receives data larger than the setting, it gets an error. There are some related issues:
https://issues.apache.org/jira/browse/HIVE-21993
https://issues.apache.org/jira/browse/YARN-2962
To avoid these errors, some ZooKeeper metrics should be monitored:
last_client_response_size or max_client_response_size: client_response_size is the size in bytes of a response from the ZooKeeper server to a client.
last_proposal_size or max_proposal_size: proposal_size is the size in bytes of a proposal sent from the ZooKeeper leader to a follower. For what a proposal is, refer to https://zookeeper.apache.org/doc/r3.7.1/zookeeperInternals.html.
These values should be lower than jute.maxbuffer. The setting can be passed as a JVM argument such as -Djute.maxbuffer=10485760 (10 MB). ...
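The check described above can be sketched in Python. This assumes the metrics are read with ZooKeeper's `mntr` four-letter command, that the command is allowed by the server's four-letter-word whitelist, and that the keys appear with a `zk_` prefix in its output. The 80% warning ratio and the function names are arbitrary choices for this sketch.

```python
import socket

JUTE_MAXBUFFER = 10485760  # same value as -Djute.maxbuffer=10485760 in the text

WATCHED = (
    "max_client_response_size",
    "last_client_response_size",
    "max_proposal_size",
    "last_proposal_size",
)


def fetch_mntr(host, port=2181, timeout=5.0):
    """Send 'mntr' to a ZooKeeper server and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse 'key<whitespace>value' lines, keeping only integer-valued metrics."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        if key.startswith("zk_"):  # assumption: mntr prefixes keys with zk_
            key = key[3:]
        try:
            metrics[key] = int(value)
        except ValueError:
            continue  # non-numeric entries such as the version string
    return metrics


def find_risky(metrics, limit=JUTE_MAXBUFFER, ratio=0.8):
    """Return watched metrics that have reached ratio * limit or more."""
    threshold = limit * ratio
    return {n: metrics[n] for n in WATCHED if metrics.get(n, -1) >= threshold}
```

A monitoring job could call `find_risky(parse_mntr(fetch_mntr("zk1.example.com")))` periodically and alert when the result is not empty; the host name here is a placeholder.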

May 25, 2024

Checklist for Hive Metastore when using MySQL

MySQL Index
There are some expensive operations for the Hive Metastore when reading or storing metadata in the RDBMS. Here are some official Hive patches that add or change indexes.
-- HIVE-21063
CREATE UNIQUE INDEX `NOTIFICATION_LOG_EVENT_ID` ON NOTIFICATION_LOG (`EVENT_ID`) USING BTREE;
-- HIVE-21487
CREATE INDEX COMPLETED_COMPACTIONS_RES ON COMPLETED_COMPACTIONS (CC_DATABASE,CC_TABLE,CC_PARTITION);
-- HIVE-27165
DROP INDEX TAB_COL_STATS_IDX ON TAB_COL_STATS;
CREATE INDEX TAB_COL_STATS_IDX ON TAB_COL_STATS (DB_NAME, TABLE_NAME, COLUMN_NAME, CAT_NAME) USING BTREE;
DROP INDEX PCS_STATS_IDX ON PART_COL_STATS;
CREATE INDEX PCS_STATS_IDX ON PART_COL_STATS (DB_NAME,TABLE_NAME,COLUMN_NAME,PARTITION_NAME,CAT_NAME) USING BTREE;
When you upgrade Hive, some tables in the RDBMS change. You can find the required SQL for each version at https://github.com/apache/hive/tree/master/standalone-metastore/metastore-server/src/main/sql/mysql. ...
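One way to see whether a metastore database already has these indexes is to compare the rows of MySQL's information_schema.STATISTICS with the column lists from the patches above. The following is only a sketch: the `%s` placeholder style depends on the MySQL driver in use, running the query is left to the caller, and the function name is made up for this example.

```python
# Index definitions taken from the statements above: (table, index) -> columns.
EXPECTED = {
    ("NOTIFICATION_LOG", "NOTIFICATION_LOG_EVENT_ID"): ["EVENT_ID"],
    ("COMPLETED_COMPACTIONS", "COMPLETED_COMPACTIONS_RES"): [
        "CC_DATABASE", "CC_TABLE", "CC_PARTITION"],
    ("TAB_COL_STATS", "TAB_COL_STATS_IDX"): [
        "DB_NAME", "TABLE_NAME", "COLUMN_NAME", "CAT_NAME"],
    ("PART_COL_STATS", "PCS_STATS_IDX"): [
        "DB_NAME", "TABLE_NAME", "COLUMN_NAME", "PARTITION_NAME", "CAT_NAME"],
}

# Query whose result rows feed indexes_needing_attention().
# Assumption: the driver uses %s placeholders for parameters.
QUERY = """
SELECT TABLE_NAME, INDEX_NAME, SEQ_IN_INDEX, COLUMN_NAME
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = %s
"""


def indexes_needing_attention(rows):
    """Return (table, index) pairs that are missing or have other columns.

    rows: iterable of (table_name, index_name, seq_in_index, column_name).
    """
    actual = {}
    for table, index, _seq, column in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        actual.setdefault((table, index), []).append(column)
    return sorted(key for key, cols in EXPECTED.items() if actual.get(key) != cols)
```

An empty result means all four indexes exist with the expected column order; anything returned should be compared against the upgrade SQL for the Hive version in use before changing it.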

October 12, 2023

Some cases where "rdns = false" in krb5.conf does not work in Hadoop ecosystem

https://web.mit.edu/kerberos/krb5-1.13/doc/admin/princ_dns.html
The MIT Kerberos documentation notes: "Operating system bugs may prevent a setting of rdns = false from disabling reverse DNS lookup. Some versions of GNU libc have a bug in getaddrinfo() that cause them to look up PTR records even when not required. MIT Kerberos releases krb5-1.10.2 and newer have a workaround for this problem, as does the krb5-1.9.x series as of release krb5-1.9.4."
There are some cases where "rdns = false" in krb5.conf is not respected in the Hadoop ecosystem. You can try modifying /etc/hosts or registering PTR records to fix this kind of issue. ...
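Before editing /etc/hosts or adding PTR records, it can help to confirm whether forward and reverse lookups agree for a host. The sketch below uses Python's standard socket module. The function name is made up for this example, and the resolver arguments are injectable only so the logic can be exercised without a network.

```python
import socket


def check_reverse_dns(hostname,
                      forward=socket.gethostbyname,
                      reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, resolve the IP back, and compare the names.

    Returns (ip, ptr_name, matches). ptr_name is None when no reverse
    record can be found.
    """
    ip = forward(hostname)
    try:
        ptr_name = reverse(ip)[0]  # first element is the primary host name
    except (socket.herror, socket.gaierror):
        ptr_name = None
    matches = (
        ptr_name is not None
        and ptr_name.lower().rstrip(".") == hostname.lower().rstrip(".")
    )
    return ip, ptr_name, matches


if __name__ == "__main__":
    # The local FQDN is used here only as a convenient example input.
    print(check_reverse_dns(socket.getfqdn()))
```

A result with `matches` set to False points at the hosts whose /etc/hosts entry or PTR record needs attention.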

July 2, 2023

Hadoop commands

HDFS
Reconfigure without restart
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
Only a limited set of keys can be reconfigured without a restart.
$ hdfs dfsadmin -reconfig namenode nn1.example.com:8020 properties
Node [nn1.example.com:8020] Reconfigurable properties:
dfs.block.placement.ec.classname
dfs.block.replicator.classname
dfs.heartbeat.interval
dfs.image.parallel.load
dfs.namenode.avoid.read.slow.datanode
dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled
dfs.namenode.heartbeat.recheck-interval
dfs.namenode.max.slowpeer.collect.nodes
dfs.namenode.replication.max-streams
dfs.namenode.replication.max-streams-hard-limit
dfs.namenode.replication.work.multiplier.per.iteration
dfs.storage.policy.satisfier.mode
fs.protected.directories
hadoop.caller.context.enabled
ipc.8020.backoff.enable
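The same command also accepts `start` and `status` actions. A small Python sketch can build the command line and parse the `properties` output shown above. The helper names are made up for this example, and the parser assumes the output keeps the "Reconfigurable properties:" header line.

```python
import subprocess

NODE_TYPES = ("namenode", "datanode")
ACTIONS = ("start", "status", "properties")


def reconfig_command(node_type, address, action):
    """Build the argv for 'hdfs dfsadmin -reconfig ...'."""
    if node_type not in NODE_TYPES:
        raise ValueError("unsupported node type: %s" % node_type)
    if action not in ACTIONS:
        raise ValueError("unsupported action: %s" % action)
    return ["hdfs", "dfsadmin", "-reconfig", node_type, address, action]


def parse_reconfigurable(output):
    """Extract property names listed after the 'Reconfigurable properties:' line."""
    names = []
    seen_header = False
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("Reconfigurable properties:"):
            seen_header = True
            continue
        if seen_header:
            names.append(line)
    return names


def list_reconfigurable(node_type, address):
    """Run the 'properties' action; requires the hdfs CLI on PATH."""
    cmd = reconfig_command(node_type, address, "properties")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_reconfigurable(result.stdout)
```

A typical flow would be to edit the value in the node's configuration file, run the `start` action, and then poll `status` until the reconfiguration task reports completion.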

February 5, 2023

About "HADOOP_CLASSPATH" environment variable

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/UnixShellGuide.html#HADOOP_CLASSPATH In the Hadoop ecosystem, the HADOOP_CLASSPATH environment variable is used in many places. Hive uses this variable, too. I wondered how the HADOOP_CLASSPATH variable is used in a script like beeline, because I could not find it in the Hive source code. I finally figured out that beeline is executed through the hadoop jar command. (https://github.com/apache/hive/blob/rel/release-3.1.3/bin/ext/beeline.sh#L35) That command runs RunJar.java, where HADOOP_CLASSPATH is used to set CLASSPATH. (https://github.com/apache/hadoop/blob/rel/release-3.3.4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/RunJar.java#L347-L351) If something in the Hadoop ecosystem uses RunJar#main, it probably respects the HADOOP_CLASSPATH environment variable. ...
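As an illustration, a wrapper script could extend HADOOP_CLASSPATH before launching a tool such as beeline. This is a minimal sketch: the jar path is hypothetical, the helper name is made up, and ":" is assumed as the separator because the Hadoop shell scripts run on Unix-like systems.

```python
import os


def env_with_hadoop_classpath(extra_entries, base_env=None):
    """Return a copy of the environment with entries appended to HADOOP_CLASSPATH."""
    env = dict(os.environ if base_env is None else base_env)
    parts = [p for p in env.get("HADOOP_CLASSPATH", "").split(":") if p]
    for entry in extra_entries:
        if entry not in parts:  # avoid adding the same entry twice
            parts.append(entry)
    env["HADOOP_CLASSPATH"] = ":".join(parts)
    return env


if __name__ == "__main__":
    # /opt/extra/lib/custom.jar is a hypothetical path used for illustration.
    env = env_with_hadoop_classpath(["/opt/extra/lib/custom.jar"])
    print(env["HADOOP_CLASSPATH"])
    # The environment could then be passed to a launcher, for example:
    # subprocess.run(["beeline", "--help"], env=env)
```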

February 5, 2023

SPNEGO-enabled Hadoop DataNode misjudges kerberos "replay attack".

References: https://docs.cloudera.com/cloudera-manager/7.5.5/security-troubleshooting/cm-security-troubleshooting.pdf https://search-guard.com/elasticsearch-kibana-kerberos/ I guess that this is caused by sharing the same Kerberos keytab (/etc/security/keytabs/spnego.service.keytab) and principal (HTTP/_HOST@{REALM}) among Hadoop daemons (NameNode, DataNode, JournalNode, ResourceManager, NodeManager …). I assume that the DataNode misjudges a request as a replay attack in certain circumstances. As a workaround, adding the following JVM system property to the Hadoop daemons fixes this issue. It means the Java process will not use a replay cache. -Dsun.security.krb5.rcache=none

February 5, 2023