Last night we had ADFS configured in my lab on vIDM (globalenvironment), and it was working as expected with vRealize Automation.
Over a period of time, we noticed that globalenvironment on vRSLCM was not reporting health status.
This was weird, as vIDM was healthy and authentication was going through as expected.
When I browsed to the Identity and Tenant Management pane, there was a message stating:
VMware Identity Manager is not available at the moment. There are some requests which are in progress
There was something wrong with the new ADFS configuration, as the permissions seemed to be messed up as well.
vIDM Inventory Sync never went past the initial stage.
*** Request Creation ***
2022-06-15 01:43:32.459 INFO [http-nio-8080-exec-7] c.v.v.l.l.u.RequestSubmissionUtil - -- ++++++++++++++++++ Creating request to Request_Service :::>>> {
"vmid" : "7d637ba2-5294-4e92-836a-2610a509fa03",
"transactionId" : null,
"tenant" : "default",
"requestName" : "productinventorysync",
"requestReason" : "VIDM in Environment globalenvironment - Product Inventory Sync",
"requestType" : "PRODUCT_INVENTORY_SYNC",
"requestSource" : "globalenvironment",
"requestSourceType" : "user",
"inputMap" : {
"environmentId" : "globalenvironment",
"productId" : "vidm",
"tenantId" : ""
},
"outputMap" : { },
"state" : "CREATED",
"executionId" : null,
"executionPath" : null,
"executionStatus" : null,
"errorCause" : null,
"resultSet" : null,
"isCancelEnabled" : null,
"lastUpdatedOn" : 1655257412458,
"createdBy" : null
}
*** Request Response ***
2022-06-15 01:43:32.466 INFO [http-nio-8080-exec-7] c.v.v.l.l.u.RequestSubmissionUtil - -- Generic Request Response : {
"requestId" : "7d637ba2-5294-4e92-836a-2610a509fa03"
}
2022-06-15 01:43:34.001 INFO [scheduling-1] c.v.v.l.r.c.RequestProcessor - -- Number of request to be processed : 1
2022-06-15 01:43:34.022 INFO [scheduling-1] c.v.v.l.r.c.p.ProductInventorySyncPlanner - -- Creating spec for inventory sync for product : vidm in environment : globalenvironment
2022-06-15 01:43:34.025 INFO [scheduling-1] c.v.v.l.r.u.InfrastructurePropertiesHelper - -- VCF properties: {
"vcfEnabled" : false,
"sddcManagerDetails" : [ ]
}
2022-06-15 01:43:34.027 INFO [scheduling-1] c.v.v.l.r.c.p.ProductInventorySyncPlanner - -- Found product with id vidm
2022-06-15 01:43:34.038 INFO [scheduling-1] c.v.v.l.r.c.p.CreateEnvironmentPlanner - -- Not a clustered vIDM, fetching the hostname from primary node.
*** Suite Request is generated and then set to IN_PROGRESS ***
2022-06-15 01:43:34.785 INFO [scheduling-1] c.v.v.l.r.c.RequestProcessor - -- Processing request with ID : 7d637ba2-5294-4e92-836a-2610a509fa03 with request type PRODUCT_INVENTORY_SYNC with request state INPROGRESS.
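To spot requests like this one without reading the whole log, the RequestProcessor lines can be filtered. This is only a minimal sketch: the function name `inprogress_request_ids` is hypothetical, and it assumes the log lines keep the exact wording shown above.

```shell
# Hypothetical helper, not part of vRSLCM.
# Reads log lines on stdin and prints the ID of every request that
# RequestProcessor reports with request state INPROGRESS.
inprogress_request_ids() {
    sed -n 's/.*Processing request with ID : \([0-9a-f-]*\) with request type .* with request state INPROGRESS.*/\1/p'
}
```

It could be fed from the vRSLCM log, for example `inprogress_request_ids < /var/log/vrlcm/vmware_vrlcm.log`. That path is an assumption, so adjust it to wherever the log lives on your appliance.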
Browsing to "Authentication Provider" under Settings would return an exception stating:
Failed to fetch Complete Authentication Provider details
The networksettings API returned 400, and the response was "No Settings".
The problem described here could be due to the engine not picking up the request because it is already busy processing other requests (overloaded), or due to genuinely unprocessed but stuck requests.
So I logged into the database and executed a query that returns the requests stuck in the "IN_PROGRESS" state.
Log in to the vRSLCM database
/opt/vmware/vpostgres/11/bin/psql -U postgres -d vrlcm
Check for IN_PROGRESS entries in the vm_engine_event table
select currentState,status from vm_engine_event where status='IN_PROGRESS';
We do see 50 of them.
If this number is greater than or equal to 50 (for vRSLCM versions above 8.4), then we need to follow the steps below to clear this data and bring the system back to a functional state.
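The threshold check above can be scripted. This is a minimal sketch, assuming the same psql path, user and database as in the login command; the function names `count_in_progress` and `engine_is_saturated` are hypothetical, and the default limit of 50 is simply the number quoted above.

```shell
# Hypothetical helpers, not part of vRSLCM.
PSQL=/opt/vmware/vpostgres/11/bin/psql   # same path as the login command above

# Print the number of engine events stuck in IN_PROGRESS.
# -t (tuples only) and -A (unaligned) make psql print just the bare number.
count_in_progress() {
    "$PSQL" -U postgres -d vrlcm -t -A \
        -c "select count(*) from vm_engine_event where status='IN_PROGRESS';"
}

# Succeed (exit status 0) when the given count has reached the limit.
# Arguments: count, optional limit (defaults to 50).
engine_is_saturated() {
    [ "$1" -ge "${2:-50}" ]
}
```

On the appliance this could be used as `engine_is_saturated "$(count_in_progress)" && echo "cleanup needed"`.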
Remediation
Before proceeding any further, please take a snapshot of the vRSLCM appliance. This is mandatory.
Here's the plan to fix this issue:
*** select count(*) queries. These queries will help you in identifying the number of records per table in a defined state ***
select count(*) from vm_rs_request where requestname= 'lcmgenricsetting';
select count(*) from vm_engine_execution_request where enginestatus= 'INITIATED';
select count(*) from vm_engine_statemachine_instance where status= 'CREATED';
select count(*) from vm_engine_event where status= 'IN_PROGRESS';
*** execute these delete queries. This would clear up all stuck queries ***
delete from vm_rs_request where requestname= 'lcmgenricsetting';
delete from vm_engine_execution_request where enginestatus= 'INITIATED';
delete from vm_engine_statemachine_instance where status= 'CREATED';
delete from vm_engine_event where status= 'IN_PROGRESS';
*** Perform a full VACUUM of the vm_rs_request table ***
VACUUM FULL verbose analyze vm_rs_request;
*** exit database ***
\q
*** Restart vRSLCM Service ***
systemctl restart vrlcm-server
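As an extra precaution on top of the snapshot, a delete can first be tried inside a transaction that is rolled back, so psql reports how many rows would go without removing anything. This is a sketch of my own and not part of the plan above; the function name `dry_run_delete_sql` is hypothetical.

```shell
# Hypothetical helper, not part of vRSLCM.
# Prints a delete wrapped in begin/rollback, so running it shows the
# "DELETE n" row count while leaving the table untouched.
# Arguments: table name, column name, value to match.
dry_run_delete_sql() {
    printf 'begin;\n'
    printf "delete from %s where %s='%s';\n" "$1" "$2" "$3"
    printf 'rollback;\n'
}
```

For example, `dry_run_delete_sql vm_engine_event status IN_PROGRESS | /opt/vmware/vpostgres/11/bin/psql -U postgres -d vrlcm` would print the row count followed by ROLLBACK.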
Let's implement this plan in my environment and see if it helps.
*** select count(*) queries. These queries will help you in identifying the number of records per table in a defined state ***
vrlcm=# select count(*) from vm_engine_event where status='IN_PROGRESS';
count
-------
50
vrlcm=# select count(*) from vm_engine_statemachine_instance where status= 'CREATED';
count
-------
1580
vrlcm=# select count(*) from vm_engine_execution_request where enginestatus = 'INITIATED';
count
-------
1616
vrlcm=# select count(*) from vm_rs_request where requestname = 'lcmgenericsetting';
count
-------
0
*** execute these delete queries. This would clear up all stuck queries ***
vrlcm=# delete from vm_engine_event where status='IN_PROGRESS';
DELETE 50
vrlcm=# delete from vm_engine_statemachine_instance where status= 'CREATED';
DELETE 1580
vrlcm=# delete from vm_engine_execution_request where enginestatus = 'INITIATED';
DELETE 1616
*** Note: since select count(*) from vm_rs_request where requestname = 'lcmgenericsetting'; returned 0 records, I am not executing the delete statement ***
*** Perform a full VACUUM of the vm_rs_request table ***
vrlcm=# VACUUM FULL verbose analyze vm_rs_request;
INFO: vacuuming "public.vm_rs_request"
INFO: "vm_rs_request": found 48 removable, 1565 nonremovable row versions in 379 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 0.06 s, system: 0.02 s, elapsed: 0.46 s.
INFO: analyzing "public.vm_rs_request"
INFO: "vm_rs_request": scanned 339 of 339 pages, containing 1565 live rows and 0 dead rows; 1565 rows in sample, 1565 estimated total rows
VACUUM
*** exit database ***
\q
*** Restart vRSLCM Service ***
systemctl restart vrlcm-server
After implementing this plan, it takes 2 minutes for the server to start up and come back to its functional state again.
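Rather than watching the clock, the wait can be scripted by polling systemd. This is a minimal sketch with a hypothetical function name `wait_for_unit`; the defaults of 24 attempts 5 seconds apart are my own choice to cover the 2 minutes mentioned above. Note that an active unit only means the service process is running, and the UI may still need a little longer.

```shell
# Hypothetical helper, not part of vRSLCM.
# Polls until the given systemd unit reports active, or gives up.
# Arguments: unit name, max attempts (default 24), seconds between attempts (default 5).
wait_for_unit() {
    wfu_unit=$1
    wfu_attempts=${2:-24}
    wfu_delay=${3:-5}
    wfu_i=0
    while [ "$wfu_i" -lt "$wfu_attempts" ]; do
        # is-active --quiet exits 0 only when the unit is active
        if systemctl is-active --quiet "$wfu_unit"; then
            return 0
        fi
        wfu_i=$((wfu_i + 1))
        sleep "$wfu_delay"
    done
    return 1
}
```

Usage after the restart: `wait_for_unit vrlcm-server && echo "vrlcm-server is active"`.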
Then, when we go ahead and check the Authentication Provider settings, the details are fetched again.
The globalenvironment product inventory sync completed too.
To conclude, something went wrong in last night's tests in the lab, which triggered these queued requests in vRSLCM.
Happy to get the environment back to a functional state. Ensure a snapshot is taken before making any changes. I can't stress enough how important this is if things do not work out.