Summary
During a period of increased usage and system load on some centralized components of the Infoplus WMS backend infrastructure on Black Friday, 11/27/2020, performance of the entire system degraded significantly. In some cases, asynchronous processing jobs became heavily backlogged and experienced increased error rates.
Our devops team responded by adjusting system utilization parameters, restoring services that had been temporarily unavailable, prioritizing jobs based on their impact on our customers, allocating additional system resources (increased processing capacity), and noting specific bottlenecks for prioritized optimization to reduce the likelihood of similar incidents in the future.
Timeline
- 11/27/2020 9:38 am CST - Initial devops alarms issued re: database server connection pools
- 11/27/2020 9:55 am CST - Initial system restarts performed by Infoplus devops team
- 11/27/2020 11:11 am CST - Infoplus devops team began adjusting per-job thread counts
- 11/27/2020 12:07 pm CST - Infoplus support notified Infoplus devops team of multiple clients reporting performance issues
- 11/27/2020 12:29 pm CST - Incident posted to status.infopluscommerce.com
- 11/27/2020 2:15 pm CST - Infoplus devops team attempted to relieve congestion by decreasing the number of active job-processing application servers
- 11/27/2020 2:30 pm CST - Infoplus devops team began adjusting thread pool sizes within individual application servers
- 11/27/2020 3:00 pm CST - Number of queued jobs began to decrease for the first time since the incident began
- 11/27/2020 3:30 pm CST - Infoplus devops team worked on clearing the backlog of orders and Fulfillment Processes that were stuck in Pending status
- 11/27/2020 4:10 pm CST - Alarms issued re: database server utilization above maximum acceptable threshold
- 11/28/2020 9:00 pm CST - Infoplus devops team reported jobs were successfully processing, but queues remained larger than normal
- 11/29/2020 9:49 am CST - Incident resolved on status.infopluscommerce.com
Root Cause
As a larger number of jobs needed to run concurrently due to increased demand, the database connection pools used by the job-processing application servers reached maximum capacity, and jobs became blocked waiting to obtain connections to complete their work. This filled the job-thread pools, so new jobs could not be kicked off. Simultaneously, jobs that had successfully started would block midway through their work, eventually timing out and failing. New jobs would then start, get caught in the same scenario, and eventually fail as well.
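The failure mode can be modeled with a minimal sketch. The class name, pool sizes, timeout, and job durations below are all hypothetical, and a Semaphore stands in for a bounded database connection pool; this is an illustration of the pattern, not the actual Infoplus job framework. Once job threads outnumber connections, every thread ends up blocked in the checkout call, queued jobs pile up, and in-flight jobs start timing out.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/**
 * Minimal model of the failure mode described above: more job threads than
 * database connections. The Semaphore stands in for a bounded connection
 * pool; all sizes and timings are hypothetical, not Infoplus's settings.
 */
public class PoolExhaustionSketch {
    static final int DB_CONNECTIONS = 4;   // capacity of the (simulated) connection pool
    static final int JOB_THREADS    = 16;  // job threads on one application server

    static final Semaphore connectionPool = new Semaphore(DB_CONNECTIONS);

    public static void main(String[] args) throws InterruptedException {
        ExecutorService jobPool = Executors.newFixedThreadPool(JOB_THREADS);

        for (int i = 0; i < 100; i++) {
            final int jobId = i;
            jobPool.submit(() -> {
                try {
                    // Every job thread blocks here once all connections are
                    // checked out; with a checkout timeout (as in a real pool)
                    // the job eventually fails instead of finishing.
                    if (!connectionPool.tryAcquire(3, TimeUnit.SECONDS)) {
                        System.out.println("job " + jobId + " timed out waiting for a connection");
                        return;
                    }
                    try {
                        Thread.sleep(2_000); // simulated database work
                        System.out.println("job " + jobId + " finished");
                    } finally {
                        connectionPool.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        jobPool.shutdown();
        jobPool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```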
Contributing Factors
- Various order management components of Infoplus using excessive numbers of database connections (more than other components use for similar tasks).
- Recent increases in per-job thread-pool sizes, bringing the total possible number of job threads beyond the capacity of the database connection pools (see the sizing sketch after this list).
- Lack of monitoring on connection-pool and thread-pool statistics.
- The solution in this case ran contrary to the typical response to performance issues: decreasing, rather than increasing, the number of threads available to specific jobs. This took time to recognize.
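The second contributing factor is essentially an arithmetic mismatch: across all job-processing servers, the maximum possible number of job threads exceeded the number of database connections the pools could hand out. A back-of-the-envelope check like the following (all figures are hypothetical, not Infoplus's actual settings) is the kind of sizing sanity check that would flag the mismatch before it is hit under load.

```java
/**
 * Back-of-the-envelope capacity check for the thread-pool vs. connection-pool
 * sizing factor above. All figures are hypothetical examples.
 */
public class CapacityCheck {
    public static void main(String[] args) {
        int appServers           = 6;   // job-processing application servers
        int threadsPerServer     = 40;  // per-job thread pools, summed per server
        int connPoolPerServer    = 20;  // database connections pooled per server
        int dbMaxConnections     = 150; // hard limit on the database server

        int maxConcurrentJobs    = appServers * threadsPerServer;
        int maxPooledConnections = Math.min(appServers * connPoolPerServer, dbMaxConnections);

        System.out.printf("possible job threads: %d, available connections: %d%n",
                maxConcurrentJobs, maxPooledConnections);
        if (maxConcurrentJobs > maxPooledConnections) {
            System.out.println("under full load, jobs must block waiting for connections");
        }
    }
}
```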
Impacts
- Increased time to import orders from Shopping Cart Connections.
- Decreased accuracy of orderable quantities on items (e.g., quantities pushed to shopping carts), due to orders being queued between shopping carts and Infoplus.
- Increased time to move orders from Pending to On Order status (e.g., running on-insert Trigger actions).
- Delays moving some Fulfillment Processes from Pending to Running status.
- Some Fulfillment Processes failed (ending up in Error status, with Orders returned to On Order status).
- Increased time to update system audits and search indexes.
Corrective and Preventative Measures
Corrective Actions Taken
- 11/27/2020 11:11 am CST - Reallocated threads per queue (to better align with customer priorities)
- 11/27/2020 2:30 pm CST - Decreased thread pool sizes for application servers
- 11/30/2020 10:00 pm CST - Increased capacity of primary database servers
- 11/30/2020 10:00 pm CST - Released application patch for order components to use fewer database connections per job (a sketch of this pattern follows this list)
- 12/1/2020 1:00 pm CST - Added additional job-processing application servers
- 12/2/2020 (multiple times) - Added additional database indexes to improve performance and decrease load
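The 11/30 patch reduced how many database connections each order job holds at once. One common pattern for doing so, shown here as a hypothetical sketch rather than the actual Infoplus code (the table names, columns, and helper methods are invented), is to check out a single connection per job and pass it through each step of the work, instead of letting every step borrow its own connection from the pool.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;

/**
 * Hypothetical illustration of the "fewer connections per job" patch:
 * one connection is borrowed per job and shared by each step, rather than
 * each step checking out its own connection. Schema names are invented.
 */
public class OrderJob {
    private final DataSource pool;

    public OrderJob(DataSource pool) {
        this.pool = pool;
    }

    public void run(long orderId) throws Exception {
        // Borrow exactly one connection for the whole job.
        try (Connection conn = pool.getConnection()) {
            conn.setAutoCommit(false);
            reserveInventory(conn, orderId);
            updateOrderStatus(conn, orderId, "On Order");
            conn.commit();
        } // the single connection returns to the pool here
    }

    private void reserveInventory(Connection conn, long orderId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE inventory SET reserved = reserved + 1 WHERE order_id = ?")) {
            ps.setLong(1, orderId);
            ps.executeUpdate();
        }
    }

    private void updateOrderStatus(Connection conn, long orderId, String status) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE orders SET status = ? WHERE id = ?")) {
            ps.setString(1, status);
            ps.setLong(2, orderId);
            ps.executeUpdate();
        }
    }
}
```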
Preventative Measures
Adding monitors & alarms for application-server database connection pool usage and active-thread counts (a minimal sketch appears at the end of this section).
- This will allow us to tune proactively and respond to events before they reach the critical state seen during this incident.
Adding metrics around application key performance indicators (KPIs), which will be used to prioritize engineering focus on areas of application performance.
- This shift in focus will have a long-term payoff, as the product engineering team spends more of its time addressing existing application performance issues, and it gives us actionable data to trigger additional optimization work and/or system resource allocation.
Enabled additional database performance metrics, to provide greater real-time insights into performance bottlenecks.
- This has already started to allow us to identify “hot spots” for database optimization, which in turn will improve overall system performance, further reducing the likelihood of a similar incident and generally improving speed and responsiveness for all users.
Reset (lowered) alarm thresholds on critical system resources, so that additional capacity is allocated at lower utilization levels.
- This change will increase our normal amount of reserved capacity, providing additional room for activity bursts so that an unusually busy day cannot recreate this situation.
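As a rough sketch of the connection-pool and thread-pool monitoring measure above, the pools themselves already expose the statistics we need. The example below assumes a HikariCP connection pool and a standard ThreadPoolExecutor for job threads (both assumptions, since this report does not name the libraries in use), and logs to standard output where a real implementation would publish to our alarming system.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

/**
 * Sketch of connection-pool / thread-pool monitoring. Assumes a HikariCP
 * pool and a plain ThreadPoolExecutor; in practice the readings would feed
 * the existing metrics/alarm pipeline rather than a log statement.
 */
public class PoolMonitor {
    public static void start(HikariDataSource dataSource, ThreadPoolExecutor jobPool) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
            int active     = pool.getActiveConnections();
            int total      = pool.getTotalConnections();
            int waiting    = pool.getThreadsAwaitingConnection();
            int jobsBusy   = jobPool.getActiveCount();
            int jobsQueued = jobPool.getQueue().size();

            System.out.printf(
                "db connections %d/%d in use, %d threads waiting; job threads %d busy, %d jobs queued%n",
                active, total, waiting, jobsBusy, jobsQueued);

            // Example alarm condition: threads are already queuing for connections.
            if (waiting > 0 || jobsQueued > 0) {
                System.out.println("ALERT: pool pressure detected - tune before jobs back up");
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```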