Infoplus Feature Issue - Fulfillment Processing
Incident Report for Infoplus
Postmortem

Summary

A production release on Tuesday, 10/11/16, in the 10-11 PM time window introduced a bug in one of the underlying order data indexing libraries. This caused intermittent failures across several Infoplus applications that update order data. The bug appears to have impacted only Infoplus clients and lines of business that use Infoplus’s tax lookup capabilities.

In addition, this bug caused a subsystem used by some of the Infoplus legacy applications (specifically, “Call Center Application”, “eStore”, and “EDI Manager”) to suffer intermittent service failures, causing intermittent errors in those applications as well.

During this outage, some orders placed in the Infoplus legacy applications mentioned above may not have been received by Infoplus. Such order entry errors can be identified in the logs of these applications, and/or would have been reported to users while they were placing the orders.

Once the issue was identified, the Infoplus engineering team deployed a patch to eliminate the bug within approximately 3 hours. After the patch was deployed, the Infoplus engineering team re-synchronized the order data indexes across all client databases.

Timeline

Summary
  • Timezone: CDT
  • Outage Duration: 3 hours
  • Feature Outage Initially Reported: 10/12/2016 7:07 AM
  • Feature Outage Resolved: 10/12/2016 10:11 AM
Detailed Events

Tuesday, 10/11/2016

  • 10:00 PM - 10:45 PM Production Release of Infoplus (Release-50)
  • 10:45 PM - 11:00 PM Infoplus Release QA Verification Completed

Wednesday, 10/12/2016

  • 7:07 AM - Initial Customer Down ticket was submitted (regarding errors in a legacy shipping user interface)
  • 7:09 AM - Ticket received by Infoplus support team; research into issue began.
  • 7:11 AM - Additional Customer Down ticket was submitted (regarding errors in fulfillment processes)
  • 7:26 AM - Tickets were escalated to Engineering
  • 7:45 AM - Engineering found error messages suggesting root cause of issue in error logs
  • 7:58 AM - Infoplus posted incident to status page
  • 8:31 AM - Additional Customer Down ticket was submitted (regarding errors in legacy Call Center Application)
  • 8:52 AM - Engineering completed patch to fix bug; began build & deployment process.
  • 9:56 AM - Additional Customer Down ticket was submitted (regarding errors in legacy eStore Application)
  • 10:11 AM - Deployment of patched code was completed across all application servers. Incident was updated on status page and clients were given updates.

Root Cause

The root cause of this outage was a programming error that shipped to production in the 10/11 release.

The library used to synchronize order index data uses a temporary database table to store values involved in the calculation of tax amounts on orders. Multiple functions within the program access this temporary table, and before accessing it, those functions first check to confirm that the table exists (creating it if needed).

The change that caused the error replaced a static/global process-level variable, which tracked whether or not the temporary table existed, with a query against the database’s internal metadata tables. In a single-process environment, the two approaches are semantically equivalent. When multiple processes run at the same time, however, the metadata query reports that the temporary table exists even if it was created by a different process. That second process would then attempt to use the temporary table without having created it within its own database session, causing a runtime error.
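The failure mode described above can be illustrated with a small simulation. This is a minimal sketch, not the Infoplus code: `SharedCatalog`, `Session`, `ensure_table_via_metadata`, and the table name `tax_tmp` are all hypothetical, and the toy model simply assumes that metadata is visible to every session while a temporary table is usable only by the session that created it.

```python
class SharedCatalog:
    """Toy stand-in for database metadata that every session can read."""

    def __init__(self):
        self.table_names = set()


class Session:
    """Toy database session; a temp table is usable only by its creator."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.temp_tables = {}

    def create_temp_table(self, name):
        self.temp_tables[name] = []
        # Assumption: creating the table makes its name visible in shared metadata.
        self.catalog.table_names.add(name)

    def insert(self, name, row):
        if name not in self.temp_tables:
            raise RuntimeError(f"temp table {name!r} does not exist in this session")
        self.temp_tables[name].append(row)


def ensure_table_via_metadata(session, name):
    """The problematic pattern: trust shared metadata, not session-local state."""
    if name not in session.catalog.table_names:
        session.create_temp_table(name)


def second_session_fails():
    """Return True if the second session hits a runtime error."""
    catalog = SharedCatalog()
    first, second = Session(catalog), Session(catalog)

    ensure_table_via_metadata(first, "tax_tmp")
    first.insert("tax_tmp", (1, 1.25))

    # Metadata reports that the table exists, so creation is skipped here.
    ensure_table_via_metadata(second, "tax_tmp")
    try:
        second.insert("tax_tmp", (2, 0.75))
    except RuntimeError:
        return True
    return False
```

In this model the first session works normally, while the second skips table creation and fails on its first insert, which mirrors the intermittent, timing-dependent errors described in this report.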

Contributing Factors

The issue was not identified in local development, quality assurance (QA), or production testing because the error requires fairly specific timing: two or more processes must attempt to use this temporary table at the same moment.

In addition, this error was only triggered for clients who take advantage of Infoplus’s tax lookup functionality, which is a limited subset of clients (and which is not well covered by our primary test data sets).

Resolution and Recovery

The outage was resolved by reverting the code change: instead of checking the database metadata for the temporary table’s existence, the library once again tracks the table’s existence through a static/global process-level variable. This correctly causes each individual process to see only the temporary table within its own database session.
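As an illustration of the restored approach, the sketch below keeps the "does the table exist?" answer in local state rather than asking the database. It uses Python's built-in `sqlite3` module purely as a stand-in database; the `TaxScratch` class, its methods, and the `tax_tmp` table are hypothetical, and the per-object flag stands in for the process-level variable described above.

```python
import sqlite3


class TaxScratch:
    """Hypothetical helper that tracks temp-table existence in local state."""

    def __init__(self, conn):
        self.conn = conn
        self._created = False  # local flag; never shared with other sessions

    def ensure(self):
        # Create the temp table the first time this session needs it.
        if not self._created:
            self.conn.execute(
                "CREATE TEMP TABLE tax_tmp (order_id INTEGER, tax REAL)"
            )
            self._created = True

    def add(self, order_id, tax):
        self.ensure()
        self.conn.execute("INSERT INTO tax_tmp VALUES (?, ?)", (order_id, tax))

    def total(self):
        self.ensure()
        row = self.conn.execute(
            "SELECT COALESCE(SUM(tax), 0) FROM tax_tmp"
        ).fetchone()
        return row[0]


# Two independent sessions, each with its own flag and its own temp table.
session_a = TaxScratch(sqlite3.connect(":memory:"))
session_b = TaxScratch(sqlite3.connect(":memory:"))
session_a.add(1, 1.25)
session_b.add(2, 0.75)
```

Because each session consults only its own flag, neither can be misled by a table that another session created.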

Corrective and Preventative

Corrective Actions Taken
  • The technique of checking for the existence of temporary database tables by querying the database's internal metadata has been removed from the codebase so that it will not be used in the future, since it does not provide correct semantics in multi-process environments.
Preventative Measures Taken
  • A ticket has been added to the Engineering Support queue to develop a test for the library that experienced this error. The test will launch multiple processes in parallel to demonstrate the library's resilience to such usage.
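One possible shape for such a test is sketched below. It is only an assumption about how the test might look: the worker script, the `tax_tmp` table, and `run_parallel_workers` are hypothetical, and `sqlite3` stands in for the real database and library. Each worker is a separate operating-system process that creates and uses its own temporary table.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical worker: creates a temp table in its own session and uses it.
WORKER = r"""
import sqlite3, sys
conn = sqlite3.connect(sys.argv[1], timeout=30)
conn.execute("CREATE TEMP TABLE IF NOT EXISTS tax_tmp (order_id INTEGER, tax REAL)")
conn.execute("INSERT INTO tax_tmp VALUES (?, ?)", (int(sys.argv[2]), 1.0))
count = conn.execute("SELECT COUNT(*) FROM tax_tmp").fetchone()[0]
conn.close()
sys.exit(0 if count == 1 else 1)
"""


def run_parallel_workers(n):
    """Start n worker processes at once and return their exit codes."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "orders.db")
        procs = [
            subprocess.Popen([sys.executable, "-c", WORKER, db_path, str(i)])
            for i in range(n)
        ]
        # Wait for every worker; exit code 0 means the worker saw only its own row.
        return [p.wait() for p in procs]
```

A test would then assert that every exit code is 0, for example `assert run_parallel_workers(4) == [0, 0, 0, 0]`.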
Posted Oct 13, 2016 - 16:37 CDT

Resolved
The issue with fulfillment processing and order entry has been resolved. The root cause has been identified and measures are being taken to mitigate issues like this in the future. We will be conducting a postmortem and will provide an incident report in the near future.
Posted Oct 12, 2016 - 14:15 CDT
Monitoring
We have deployed a patch which we believe resolves the underlying issue that was impacting Fulfillment Processing.  

We have also determined that the root cause of this problem was impacting more areas of Infoplus than just Fulfillment Processing. In particular, order entry across various channels has been affected. Other components of Infoplus may have also been impacted.

We are currently monitoring the patch that we have deployed.  If you experience any errors, please submit them to our support team.
Posted Oct 12, 2016 - 10:11 CDT
Identified
We've identified the issue with fulfillment processing and are currently preparing a patch to resolve the issue.

We will provide an update once the patch has been tested and released.
Posted Oct 12, 2016 - 08:59 CDT
Investigating
An issue has been reported with fulfillment processing within Infoplus.

We are currently investigating and will provide an update as we learn more.
Posted Oct 12, 2016 - 07:58 CDT