MySQL Index There are some expensive operations for hive metastore when accessing or storing metadatas on RDBMS. Here are some official hive patches for indexing. -- HIVE-21063 CREATE UNIQUE INDEX `NOTIFICATION_LOG_EVENT_ID` ON NOTIFICATION_LOG (`EVENT_ID`) USING BTREE; -- HIVE-21487 CREATE INDEX COMPLETED_COMPACTIONS_RES ON COMPLETED_COMPACTIONS (CC_DATABASE,CC_TABLE,CC_PARTITION); -- HIVE-27165 DROP INDEX TAB_COL_STATS_IDX ON TAB_COL_STATS; CREATE INDEX TAB_COL_STATS_IDX ON TAB_COL_STATS (DB_NAME, TABLE_NAME, COLUMN_NAME, CAT_NAME) USING BTREE; DROP INDEX PCS_STATS_IDX ON PART_COL_STATS; CREATE INDEX PCS_STATS_IDX ON PART_COL_STATS (DB_NAME,TABLE_NAME,COLUMN_NAME,PARTITION_NAME,CAT_NAME) USING BTREE; When you upgrade your hive, there are some changes on tables in rdbms....
Some cases where "rdns = false" in krb5.conf does not work in Hadoop ecosystem
https://web.mit.edu/kerberos/krb5-1.13/doc/admin/princ_dns.html Operating system bugs may prevent a setting of rdns = false from disabling reverse DNS lookup. Some versions of GNU libc have a bug in getaddrinfo() that cause them to look up PTR records even when not required. MIT Kerberos releases krb5-1.10.2 and newer have a workaround for this problem, as does the krb5-1.9.x series as of release krb5-1.9.4. There are some cases where “rdns = false” in krb5.conf is not respected in Hadoop ecosystem....
Hadoop commands
HDFS Reconfigure without restart https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html configurable keys are limited without restart $ hdfs dfsadmin -reconfig namenode nn1.example.com:8020 properties Node [nn1.example.com:8020] Reconfigurable properties: dfs.block.placement.ec.classname dfs.block.replicator.classname dfs.heartbeat.interval dfs.image.parallel.load dfs.namenode.avoid.read.slow.datanode dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled dfs.namenode.heartbeat.recheck-interval dfs.namenode.max.slowpeer.collect.nodes dfs.namenode.replication.max-streams dfs.namenode.replication.max-streams-hard-limit dfs.namenode.replication.work.multiplier.per.iteration dfs.storage.policy.satisfier.mode fs.protected.directories hadoop.caller.context.enabled ipc.8020.backoff.enable
About "HADOOP_CLASSPATH" environment variable
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/UnixShellGuide.html#HADOOP_CLASSPATH In Hadoop ecosystem, HADOOP_CLASSPATH environment variable is commonly used in many places. Hive is use this variable, too. I wonder that how the HADOOP_CLASSPATH variable is used in a script like beeline. I cannot find HADOOP_CLASSPATH variable in Hive source codes. I finally figure out that when executing beeline it uses hadoop jar command. (https://github.com/apache/hive/blob/rel/release-3.1.3/bin/ext/beeline.sh#L35) It uses RunJar.java where HADOOP_CLASSPATH is used to set CLASSPATH. (https://github.com/apache/hadoop/blob/rel/release-3.3.4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/RunJar.java#L347-L351) If something in Hadoop ecosystem uses RunJar#main, it probably repect HADOOP_CLASSPATH environment variable....
SPNEGO-enabled Hadoop DataNode misjudges kerberos "replay attack".
참고 https://docs.cloudera.com/cloudera-manager/7.5.5/security-troubleshooting/cm-security-troubleshooting.pdf https://search-guard.com/elasticsearch-kibana-kerberos/ I guess that this is caused by sharing the same kerberos keytab (/etc/security/keytabs/spnego.service.keytab) and principal(HTTP/_HOST@{REALM}) among Hadoop daemons (NameNode, DataNode, JournalNodes, ResourceManager, NodeManager …). I assume that DataNode misjudges it is a replay attack in certain circumstances. Adding the following jvm system properties to Hadoop daemons will fix this issue as a workaround. It means java process will not use replay cache. -Dsun.security.krb5.rcache=none