Summary
During a period of increased usage and system load on some centralized components of the Infoplus WMS backend infrastructure on Black Friday, 11/27/2020, performance of the entire system degraded significantly. In some cases, asynchronous processing jobs became heavily backlogged and experienced increased error rates.
Our devops team responded by adjusting system utilization parameters, restoring services that had been temporarily unavailable, prioritizing jobs based on their impact on our customers, allocating additional system resources (increased processing capacity), and noting specific bottlenecks for prioritized optimization to reduce the likelihood of similar incidents in the future.
Timeline
- 11/27/2020 9:38 am CST - Initial devops alarms issued re: database server connection pools
- 11/27/2020 9:55 am CST - Initial system restarts performed by Infoplus devops team
- 11/27/2020 11:11 am CST - Infoplus devops team began adjusting per-job thread counts
- 11/27/2020 12:07 pm CST - Infoplus support notified Infoplus devops team of multiple clients reporting performance issues
- 11/27/2020 12:29 pm CST - Incident posted to status.infopluscommerce.com
- 11/27/2020 2:15 pm CST - Infoplus devops team attempted to relieve congestion by decreasing the number of active job-processing application servers
- 11/27/2020 2:30 pm CST - Infoplus devops team began adjusting thread pool sizes within individual application servers
- 11/27/2020 3:00 pm CST - Number of queued jobs began to decrease for the first time since the incident began
- 11/27/2020 3:30 pm CST - Infoplus devops team worked on clearing the backlog of orders and Fulfillment Processes that were stuck in Pending status
- 11/27/2020 4:10 pm CST - Alarms issued re: database server utilization above maximum acceptable threshold
- 11/28/2020 9:00 pm CST - Infoplus devops team reported jobs were successfully processing, but queues remained larger than normal
- 11/29/2020 9:49 am CST - Incident resolved on status.infopluscommerce.com
Root Cause
As a larger number of jobs needed to run concurrently due to increased demand, the database connection pools used by the job-processing application servers reached maximum capacity, and jobs became blocked waiting to obtain connections to complete their work. This filled the job-thread pools, so new jobs could not be kicked off. Simultaneously, jobs that had successfully started would block midway through their work, eventually timing out and failing. New jobs would then start, get caught in the same scenario, and eventually fail as well.
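The failure mode can be modeled with a minimal sketch. The class name, pool sizes, timeout, and job durations below are all hypothetical, and a Semaphore stands in for a bounded database connection pool; this is an illustration of the pattern, not the actual Infoplus job framework. Once job threads outnumber connections, every thread ends up blocked in the checkout call, queued jobs pile up, and in-flight jobs start timing out.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/**
 * Minimal model of the failure mode described above: more job threads than
 * database connections. The Semaphore stands in for a bounded connection
 * pool; all sizes and timings are hypothetical, not Infoplus's settings.
 */
public class PoolExhaustionSketch {
    static final int DB_CONNECTIONS = 4;   // capacity of the (simulated) connection pool
    static final int JOB_THREADS    = 16;  // job threads on one application server

    static final Semaphore connectionPool = new Semaphore(DB_CONNECTIONS);

    public static void main(String[] args) throws InterruptedException {
        ExecutorService jobPool = Executors.newFixedThreadPool(JOB_THREADS);

        for (int i = 0; i < 100; i++) {
            final int jobId = i;
            jobPool.submit(() -> {
                try {
                    // Every job thread blocks here once all connections are
                    // checked out; with a checkout timeout (as in a real pool)
                    // the job eventually fails instead of finishing.
                    if (!connectionPool.tryAcquire(3, TimeUnit.SECONDS)) {
                        System.out.println("job " + jobId + " timed out waiting for a connection");
                        return;
                    }
                    try {
                        Thread.sleep(2_000); // simulated database work
                        System.out.println("job " + jobId + " finished");
                    } finally {
                        connectionPool.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        jobPool.shutdown();
        jobPool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```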
Contributing Factors
- Various order management components of Infoplus using excessive numbers of database connections (more than other components use for similar tasks).
- Recent increases in per-job thread-pool sizes, bringing the total possible number of job threads beyond the capacity of the database connection pools (see the sizing sketch after this list).
- Lack of monitoring on connection-pool and thread-pool statistics.
- The solution in this case ran contrary to the typical response to performance issues: decreasing, rather than increasing, the number of threads available to specific jobs. This took time to recognize.
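The second contributing factor is essentially an arithmetic mismatch: across all job-processing servers, the maximum possible number of job threads exceeded the number of database connections the pools could hand out. A back-of-the-envelope check like the following (all figures are hypothetical, not Infoplus's actual settings) is the kind of sizing sanity check that would flag the mismatch before it is hit under load.

```java
/**
 * Back-of-the-envelope capacity check for the thread-pool vs. connection-pool
 * sizing factor above. All figures are hypothetical examples.
 */
public class CapacityCheck {
    public static void main(String[] args) {
        int appServers           = 6;   // job-processing application servers
        int threadsPerServer     = 40;  // per-job thread pools, summed per server
        int connPoolPerServer    = 20;  // database connections pooled per server
        int dbMaxConnections     = 150; // hard limit on the database server

        int maxConcurrentJobs    = appServers * threadsPerServer;
        int maxPooledConnections = Math.min(appServers * connPoolPerServer, dbMaxConnections);

        System.out.printf("possible job threads: %d, available connections: %d%n",
                maxConcurrentJobs, maxPooledConnections);
        if (maxConcurrentJobs > maxPooledConnections) {
            System.out.println("under full load, jobs must block waiting for connections");
        }
    }
}
```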
Impacts
- Increased time to import orders from Shopping Cart Connections.
- Decreased accuracy of orderable quantities on items (e.g., quantities pushed to shopping carts), due to orders being queued between shopping carts and Infoplus.
- Increased time to move orders from Pending to On Order status (e.g., running on-insert Trigger actions).
- Delays moving some Fulfillment Processes from Pending to Running status.
- Some Fulfillment Processes failed (ending up in Error status, with Orders returned to On Order status).
- Increased time to update system audits and search indexes.
Corrective and Preventative Measures
Corrective Actions Taken
- 11/27/2020 11:11 am CST - Reallocated threads per queue (to better align with customer priorities)
- 11/27/2020 2:30 pm CST - Decreased thread pool sizes for application servers
- 11/30/2020 10:00 pm CST - Increased capacity of primary database servers
- 11/30/2020 10:00 pm CST - Released application patch for order components to use fewer database connections per job (a sketch of this pattern follows this list)
- 12/1/2020 1:00 pm CST - Added additional job-processing application servers
- 12/2/2020 (multiple times) - Added additional database indexes to improve performance and decrease load
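The 11/30 patch reduced how many database connections each order job holds at once. One common pattern for doing so, shown here as a hypothetical sketch rather than the actual Infoplus code (the table names, columns, and helper methods are invented), is to check out a single connection per job and pass it through each step of the work, instead of letting every step borrow its own connection from the pool.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;

/**
 * Hypothetical illustration of the "fewer connections per job" patch:
 * one connection is borrowed per job and shared by each step, rather than
 * each step checking out its own connection. Schema names are invented.
 */
public class OrderJob {
    private final DataSource pool;

    public OrderJob(DataSource pool) {
        this.pool = pool;
    }

    public void run(long orderId) throws Exception {
        // Borrow exactly one connection for the whole job.
        try (Connection conn = pool.getConnection()) {
            conn.setAutoCommit(false);
            reserveInventory(conn, orderId);
            updateOrderStatus(conn, orderId, "On Order");
            conn.commit();
        } // the single connection returns to the pool here
    }

    private void reserveInventory(Connection conn, long orderId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE inventory SET reserved = reserved + 1 WHERE order_id = ?")) {
            ps.setLong(1, orderId);
            ps.executeUpdate();
        }
    }

    private void updateOrderStatus(Connection conn, long orderId, String status) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE orders SET status = ? WHERE id = ?")) {
            ps.setString(1, status);
            ps.setLong(2, orderId);
            ps.executeUpdate();
        }
    }
}
```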
Preventative Measures
Adding monitors & alarms for application-server database connection pool usage and active-thread counts (a minimal sketch appears at the end of this section).
- This will allow us to tune proactively and respond to events before they reach the critical state seen during this incident.
Adding metrics around application key performance indicators (KPIs), which will be used to prioritize engineering focus on areas of application performance.
- This shift in focus will have a long-term payoff, as the product engineering team spends more of its time addressing existing application performance issues, and it gives us actionable data to trigger additional optimization work and/or system resource allocation.
Enabled additional database performance metrics, to provide greater real-time insights into performance bottlenecks.
- This has already started to allow us to identify “hot spots” for database optimization, which in turn will improve overall system performance, further reducing the likelihood of a similar incident and generally improving speed and responsiveness for all users.
Reset (lowered) alarm thresholds on critical system resources, so that additional capacity is allocated at lower utilization levels.
- This change will increase our normal amount of reserved capacity, providing additional room for activity bursts so that an unusually busy day cannot recreate this situation.
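As a rough sketch of the connection-pool and thread-pool monitoring measure above, the pools themselves already expose the statistics we need. The example below assumes a HikariCP connection pool and a standard ThreadPoolExecutor for job threads (both assumptions, since this report does not name the libraries in use), and logs to standard output where a real implementation would publish to our alarming system.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

/**
 * Sketch of connection-pool / thread-pool monitoring. Assumes a HikariCP
 * pool and a plain ThreadPoolExecutor; in practice the readings would feed
 * the existing metrics/alarm pipeline rather than a log statement.
 */
public class PoolMonitor {
    public static void start(HikariDataSource dataSource, ThreadPoolExecutor jobPool) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
            int active     = pool.getActiveConnections();
            int total      = pool.getTotalConnections();
            int waiting    = pool.getThreadsAwaitingConnection();
            int jobsBusy   = jobPool.getActiveCount();
            int jobsQueued = jobPool.getQueue().size();

            System.out.printf(
                "db connections %d/%d in use, %d threads waiting; job threads %d busy, %d jobs queued%n",
                active, total, waiting, jobsBusy, jobsQueued);

            // Example alarm condition: threads are already queuing for connections.
            if (waiting > 0 || jobsQueued > 0) {
                System.out.println("ALERT: pool pressure detected - tune before jobs back up");
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```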