Jun 15, 2022 (4 min read)

vIDM Inventory Sync was stuck and made no progress even though vIDM was healthy. Why?

Updated: Aug 8, 2023

Last night we configured ADFS on vIDM (globalenvironment) in my lab, and it was working as expected with vRealize Automation.

Over time, we noticed that globalenvironment on vRSLCM was not reporting its health status.

This was odd, as vIDM was healthy and authentication was going through as expected.

When I browsed to the Identity and Tenant Management pane, there was a message stating:

VMware Identity Manager is not available at the moment. There are some requests which are in progress

Something was wrong with the new ADFS configuration, as the permissions seemed to be messed up as well.

The vIDM Inventory Sync never went past the initial stage.


*** Request Creation *** 


2022-06-15 01:43:32.459 INFO  [http-nio-8080-exec-7] c.v.v.l.l.u.RequestSubmissionUtil -  -- ++++++++++++++++++ Creating request to Request_Service :::>>> {
  "vmid" : "7d637ba2-5294-4e92-836a-2610a509fa03",
  "transactionId" : null,
  "tenant" : "default",
  "requestName" : "productinventorysync",
  "requestReason" : "VIDM in Environment globalenvironment - Product Inventory Sync",
  "requestType" : "PRODUCT_INVENTORY_SYNC",
  "requestSource" : "globalenvironment",
  "requestSourceType" : "user",
  "inputMap" : {
    "environmentId" : "globalenvironment",
    "productId" : "vidm",
    "tenantId" : ""
  },
  "outputMap" : { },
  "state" : "CREATED",
  "executionId" : null,
  "executionPath" : null,
  "executionStatus" : null,
  "errorCause" : null,
  "resultSet" : null,
  "isCancelEnabled" : null,
  "lastUpdatedOn" : 1655257412458,
  "createdBy" : null
}
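The lastUpdatedOn field in the request above is an epoch timestamp in milliseconds. Here is a minimal Python sketch for converting it so it can be lined up with the log timestamps. The function name is my own and not part of vRSLCM.

```python
from datetime import datetime, timezone


def epoch_millis_to_utc(millis: int) -> str:
    """Format an epoch-milliseconds value as 'YYYY-MM-DD HH:MM:SS.mmm' in UTC."""
    # Integer arithmetic keeps the millisecond part exact (no float rounding).
    seconds, ms = divmod(millis, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S") + f".{ms:03d}"


# Value copied from the "lastUpdatedOn" field of the request above.
print(epoch_millis_to_utc(1655257412458))  # 2022-06-15 01:43:32.458
```

The result is one millisecond before the 01:43:32.459 log entry, which suggests the log timestamps in this lab are in UTC.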



*** Request Response *** 


2022-06-15 01:43:32.466 INFO  [http-nio-8080-exec-7] c.v.v.l.l.u.RequestSubmissionUtil -  -- Generic Request Response : {
  "requestId" : "7d637ba2-5294-4e92-836a-2610a509fa03"
}
2022-06-15 01:43:34.001 INFO  [scheduling-1] c.v.v.l.r.c.RequestProcessor -  -- Number of request to be processed : 1
2022-06-15 01:43:34.022 INFO  [scheduling-1] c.v.v.l.r.c.p.ProductInventorySyncPlanner -  -- Creating spec for inventory sync for product : vidm in environment : globalenvironment
2022-06-15 01:43:34.025 INFO  [scheduling-1] c.v.v.l.r.u.InfrastructurePropertiesHelper -  -- VCF properties: {
  "vcfEnabled" : false,
  "sddcManagerDetails" : [ ]
}
2022-06-15 01:43:34.027 INFO  [scheduling-1] c.v.v.l.r.c.p.ProductInventorySyncPlanner -  -- Found product with id vidm
2022-06-15 01:43:34.038 INFO  [scheduling-1] c.v.v.l.r.c.p.CreateEnvironmentPlanner -  -- Not a clustered vIDM, fetching the hostname from primary node.



*** Suite Request is generated and then set to IN_PROGRESS ***

2022-06-15 01:43:34.785 INFO  [scheduling-1] c.v.v.l.r.c.RequestProcessor -  -- Processing request with ID : 7d637ba2-5294-4e92-836a-2610a509fa03 with request type PRODUCT_INVENTORY_SYNC with request state INPROGRESS.
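When hunting for requests that get picked up but never finish, it can help to pull the request ID, type and state out of lines like the one above. This is a rough Python sketch. The regular expression is inferred from this single log line, so treat the pattern and the helper name as assumptions rather than a documented log format.

```python
import re

# Pattern inferred from the RequestProcessor line shown above (an assumption,
# not an official log format).
PROCESSING_RE = re.compile(
    r"Processing request with ID : (?P<request_id>[0-9a-f-]{36}) "
    r"with request type (?P<request_type>\w+) "
    r"with request state (?P<state>\w+)"
)


def parse_processing_line(line: str):
    """Return request_id, request_type and state as a dict, or None if no match."""
    match = PROCESSING_RE.search(line)
    return match.groupdict() if match else None


sample = (
    "2022-06-15 01:43:34.785 INFO  [scheduling-1] c.v.v.l.r.c.RequestProcessor -  -- "
    "Processing request with ID : 7d637ba2-5294-4e92-836a-2610a509fa03 "
    "with request type PRODUCT_INVENTORY_SYNC with request state INPROGRESS."
)
print(parse_processing_line(sample))
```

A request ID that appears with state INPROGRESS and never shows up again later in the log is a good candidate for a stuck request.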

Browsing to "Authentication Provider" under Settings returned an exception stating:

Failed to fetch Complete Authentication Provider details

The networksettings API returned a 400, and the response was "No Settings".

The problem described here could have two causes: either the engine is not picking up the request because it is already busy processing other requests (overloaded), or there are requests that are genuinely stuck and never get processed.

So I logged in to the database and ran a query to find the requests stuck in the "IN_PROGRESS" state.

Log in to the vRSLCM database

/opt/vmware/vpostgres/11/bin/psql -U postgres -d vrlcm

Check for IN_PROGRESS entries in the vm_engine_event table

select currentState,status from vm_engine_event where status='IN_PROGRESS';

We see 50 of them.

If this number is greater than or equal to 50 (on vRSLCM versions later than 8.4), follow the steps below to clear this data and bring the system back to a functional state.
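To make the check repeatable, the count can be read from a small script instead of an interactive psql session. This is a minimal sketch, assuming it runs on the appliance as a user allowed to call psql. The psql path, user and database come from the command above. The function names and the use of the `-At` flags (unaligned, tuples-only output) are my own choices.

```python
import subprocess

PSQL = "/opt/vmware/vpostgres/11/bin/psql"  # path as used in this post
STUCK_THRESHOLD = 50  # threshold quoted in this post for versions later than 8.4
COUNT_SQL = "select count(*) from vm_engine_event where status='IN_PROGRESS';"


def parse_count(psql_output: str) -> int:
    """Parse the single integer that `psql -At -c 'select count(*) ...'` prints."""
    return int(psql_output.strip())


def is_engine_saturated(in_progress: int, threshold: int = STUCK_THRESHOLD) -> bool:
    """True when the number of IN_PROGRESS events has reached the threshold."""
    return in_progress >= threshold


def count_in_progress() -> int:
    """Run the count query through psql and return the result (appliance only)."""
    result = subprocess.run(
        [PSQL, "-U", "postgres", "-d", "vrlcm", "-At", "-c", COUNT_SQL],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_count(result.stdout)
```

On the appliance, `is_engine_saturated(count_in_progress())` returns True for the 50 rows seen here. The script only reads from the database, so it is safe to run before deciding on any cleanup.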

Remediation

Before proceeding any further, take a snapshot of the vRSLCM appliance. This is mandatory.

Here's the plan to fix this issue:


*** select count(*) queries. These queries help identify the number of records per table in a given state ***

select count(*) from vm_rs_request where requestname= 'lcmgenericsetting';

select count(*) from vm_engine_execution_request where enginestatus= 'INITIATED';

select count(*) from vm_engine_statemachine_instance where status= 'CREATED';

select count(*) from vm_engine_event where status= 'IN_PROGRESS';



*** Execute these delete queries. This clears all stuck requests ***

delete from vm_rs_request where requestname= 'lcmgenericsetting';

delete from vm_engine_execution_request where enginestatus= 'INITIATED';

delete from vm_engine_statemachine_instance where status= 'CREATED';

delete from vm_engine_event where status= 'IN_PROGRESS';
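Since every delete should be preceded by the matching count, one way to keep the two in step is to generate both from a single list of targets. This is only a sketch. The table, column and value names come from the queries above, while the helper names are made up for illustration. It uses the 'lcmgenericsetting' spelling from the psql session I actually ran.

```python
# (table, column, value) triples taken from the queries in this post.
CLEANUP_TARGETS = [
    ("vm_rs_request", "requestname", "lcmgenericsetting"),
    ("vm_engine_execution_request", "enginestatus", "INITIATED"),
    ("vm_engine_statemachine_instance", "status", "CREATED"),
    ("vm_engine_event", "status", "IN_PROGRESS"),
]


def count_sql(table: str, column: str, value: str) -> str:
    """Build the select count(*) statement for one cleanup target."""
    return f"select count(*) from {table} where {column} = '{value}';"


def delete_sql(table: str, column: str, value: str) -> str:
    """Build the delete statement with exactly the same predicate."""
    return f"delete from {table} where {column} = '{value}';"


def build_plan(targets):
    """Return (count statement, delete statement) pairs, one per target."""
    return [(count_sql(*t), delete_sql(*t)) for t in targets]


# Print the statements for review; nothing is executed against the database.
for count_stmt, delete_stmt in build_plan(CLEANUP_TARGETS):
    print(count_stmt)
    print(delete_stmt)
```

String formatting is acceptable here only because the values are hard-coded constants. The sketch prints statements for review and pasting into psql, and it does not replace the snapshot.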


*** Perform a full VACUUM of the postgres database *** 

VACUUM FULL verbose analyze vm_rs_request;



*** exit database ***
\q



*** Restart vRSLCM Service ***

systemctl restart vrlcm-server

Let's implement this plan in my environment and see if it helps.


   
*** select count(*) queries. These queries help identify the number of records per table in a given state ***

 vrlcm=# select count(*) from vm_engine_event where status='IN_PROGRESS'; 
 count 
------- 
    50



vrlcm=# select count(*) from vm_engine_statemachine_instance  where status= 'CREATED'; 
 count 
------- 
  1580
  
  
  
vrlcm=# select count(*)  from vm_engine_execution_request where enginestatus = 'INITIATED'; 
 count 
------- 
  1616
  
  

vrlcm=# select count(*) from vm_rs_request where requestname = 'lcmgenericsetting'; 
 count 
------- 
     0

     
*** Execute these delete queries. This clears all stuck requests ***


vrlcm=# delete from vm_engine_event where status='IN_PROGRESS'; 
DELETE 50




vrlcm=# delete from vm_engine_statemachine_instance  where status= 'CREATED'; 
DELETE 1580




vrlcm=# delete from vm_engine_execution_request where enginestatus = 'INITIATED'; 
DELETE 1616


*** Note: since select count(*) from vm_rs_request where requestname = 'lcmgenericsetting'; returned 0 records, I am not executing the corresponding delete statement ***


*** Perform a full VACUUM of the postgres database *** 



vrlcm=# VACUUM FULL verbose analyze vm_rs_request; 
INFO:  vacuuming "public.vm_rs_request" 
INFO:  "vm_rs_request": found 48 removable, 1565 nonremovable row versions in 379 pages 
DETAIL:  0 dead row versions cannot be removed yet. 
CPU: user: 0.06 s, system: 0.02 s, elapsed: 0.46 s. 
INFO:  analyzing "public.vm_rs_request" 
INFO:  "vm_rs_request": scanned 339 of 339 pages, containing 1565 live rows and 0 dead rows; 1565 rows in sample, 1565 estimated total rows 
VACUUM


*** exit database ***
\q

*** Restart vRSLCM Service ***

systemctl restart vrlcm-server

After implementing this plan, the server takes 2 minutes to start up and return to its functional state.
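Rather than watching the clock, the wait can be scripted. This is a minimal sketch that polls systemd until the unit is active. The unit name comes from the restart command above. The helper names, the timeout and the polling interval are assumptions. Note that "active" only means systemd considers the service started, so the UI may still need a little longer.

```python
import subprocess
import time


def service_is_active(unit: str = "vrlcm-server") -> bool:
    """`systemctl is-active --quiet` exits with 0 only when the unit is active."""
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0


def wait_until(check, timeout_s: float = 300.0, interval_s: float = 5.0,
               sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call check() repeatedly until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

On the appliance, `wait_until(service_is_active)` returns True once the unit is active, or False after five minutes. The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.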

Then, when we check the Authentication Provider settings, the information loads again.

The product inventory sync for globalenvironment completed too.

To conclude, something went wrong during last night's tests in the lab, which triggered these queued requests in vRSLCM.

Happy to have the environment back in a functional state. Make sure a snapshot is taken before making any changes. I can't stress enough how important this is if things do not work out.

arunnukula
