A production release on Tuesday, 10/11/16, in the 10-11 PM time window, introduced a bug in one of the underlying order data indexing libraries. This caused intermittent failures across several applications within Infoplus that update order data. The bug appears to have impacted only Infoplus clients and lines of business that use Infoplus’s tax lookup capabilities.
In addition, this bug caused a subsystem used by some of the Infoplus legacy applications (specifically, “Call Center Application”, “eStore”, and “EDI Manager”) to suffer intermittent service failures, causing intermittent errors in those applications as well.
During this outage, some orders placed in the Infoplus legacy applications mentioned above may not have been received by Infoplus. Such order entry errors may be identified in the logs of these applications, and/or would have been reported to users while they were placing the orders.
Once the issue was identified, the Infoplus engineering team deployed a patch to eliminate the bug within approximately 3 hours. After the patch was deployed, the Infoplus engineering team re-synchronized the order data indexes across all client databases.
Tuesday, 10/11/2016
Wednesday, 10/12/2016
The root cause of this outage was a programming error that was released to production in the 10/11 release.
The library used to synchronize order index data uses a temporary database table to store values involved in the calculation of tax amounts on orders. Multiple functions within the program access this temporary table, and before accessing it, those functions first check to confirm that the table exists (creating it if needed).
The change that caused the error replaced a static/global process-level variable, which tracked whether or not the temporary table existed, with a query against the database’s internal metadata tables. In a single-process environment, this change is semantically consistent with the use of a static/global process-level variable. However, when multiple processes are running at the same time, one process will see in the metadata that the temporary table exists, even if it was created by a different process. That second process would then attempt to use the temporary table without creating it within its own database session, causing a runtime error.
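The failure mode described above can be sketched with a small toy model. This is only an illustration under stated assumptions, not the Infoplus code: the classes `ToyDatabase` and `Session`, the function `ensure_table_via_metadata`, and the table name `tax_values` are all hypothetical. The model assumes that temporary tables are private to the session that created them, while the metadata catalog is shared across sessions.

```python
class ToyDatabase:
    """Toy stand-in for a database server (hypothetical, not a real driver)."""

    def __init__(self):
        # Server-wide metadata: temp table names visible to every session.
        self.catalog = set()


class Session:
    """One database session, standing in for one running process."""

    def __init__(self, db):
        self.db = db
        # Temporary tables are private to the session that created them.
        self.temp_tables = {}

    def create_temp_table(self, name):
        self.temp_tables[name] = []
        self.db.catalog.add(name)

    def table_in_metadata(self, name):
        # Reads the shared catalog, so it also sees other sessions' tables.
        return name in self.db.catalog

    def insert(self, name, row):
        if name not in self.temp_tables:
            raise RuntimeError(f"table {name!r} does not exist in this session")
        self.temp_tables[name].append(row)


def ensure_table_via_metadata(session, name="tax_values"):
    # The faulty check: trusts the shared metadata instead of per-process state.
    if not session.table_in_metadata(name):
        session.create_temp_table(name)


db = ToyDatabase()
process_a = Session(db)
process_b = Session(db)

ensure_table_via_metadata(process_a)  # creates the table in A's session
process_a.insert("tax_values", 1.25)

ensure_table_via_metadata(process_b)  # sees A's table in metadata, skips creation
try:
    process_b.insert("tax_values", 2.50)
    error = None
except RuntimeError as exc:
    error = str(exc)

print(error)
```

In this model the second session fails only when the first session has already created its table, which matches the timing dependence of the real error.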
The issue was not identified in local development, quality assurance (QA), or production testing because the error requires fairly specific timing: two or more processes must attempt to use this temporary table at the same time.
In addition, this error was only triggered for clients who take advantage of Infoplus’s tax lookup functionality, which is a limited subset of clients (and which is not well covered by our primary test data sets).
The outage was resolved by reverting the change: instead of checking the database metadata for the temporary table’s existence, the code once again tracks the table’s existence through a static/global process-level variable. As a result, each individual process correctly sees only the temporary table within its own database session.
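A minimal sketch of the process-level tracking approach follows, using Python's built-in `sqlite3` module purely for illustration. The `Worker` class, its `temp_table_created` flag, and the `tax_values` table are hypothetical names, not the Infoplus implementation. Each `Worker` stands in for one process with its own database session, and the flag models the static/global process-level variable. Separate in-memory connections are used here as stand-ins for separate sessions.

```python
import sqlite3


class Worker:
    """Stands in for one process: one DB session plus its own 'static' flag."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        # Models the static/global process-level variable from the fix.
        self.temp_table_created = False

    def ensure_temp_table(self):
        # Consult only this worker's own state, never shared metadata.
        if not self.temp_table_created:
            self.conn.execute(
                "CREATE TEMP TABLE tax_values (order_id INTEGER, tax_amount REAL)"
            )
            self.temp_table_created = True

    def record_tax(self, order_id, tax_amount):
        self.ensure_temp_table()
        self.conn.execute(
            "INSERT INTO tax_values VALUES (?, ?)", (order_id, tax_amount)
        )

    def row_count(self):
        return self.conn.execute("SELECT COUNT(*) FROM tax_values").fetchone()[0]


worker_a = Worker()
worker_b = Worker()

worker_a.record_tax(1, 1.25)
worker_b.record_tax(2, 2.50)  # B creates its own table; A's state is irrelevant
worker_b.record_tax(3, 0.75)  # flag already set, so no second CREATE is issued

counts = (worker_a.row_count(), worker_b.row_count())
print(counts)
```

Because each worker decides from its own flag, whether another worker has already created a table of the same name has no effect on it.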