
When O(n+m) Isn't Fast Enough: A Java Optimization Adventure

My very first project at my current company was to migrate a system written in PHP to Java. One function plays an important role in that system.

The function is given a sorted list A and a set B, both containing Strings. A and B are loaded from Redis and cached in memory; list A remains constant between requests, while set B changes. It must:

  1. Find the intersection of A and B
  2. Create list C containing the intersecting items, ordered as they appear in A
  3. Append all remaining items from B to C
  4. Return the reverse of C
Here's the original Java implementation of the core logic:
private static List<String> sortItems(List<String> sortedItemList, Set<String> items) {
    if (items == null || items.isEmpty()) {
        return Collections.emptyList();
    }
    if (sortedItemList == null || sortedItemList.isEmpty()) {
        return new ArrayList<>(items);
    }
    // Find intersection, preserving order from sortedItemList
    List<String> intersection = new ArrayList<>();
    for (String item : sortedItemList) {
        if (items.contains(item)) {
            intersection.add(item);
        }
    }
    List<String> reversedList = new ArrayList<>(intersection);
    Set<String> sortedSet = new HashSet<>(intersection);
    // Add remaining items
    for (String s : items) {
        if (!sortedSet.contains(s)) {
            reversedList.add(s);
        }
    }
    // Reverse the list to get the final desired order
    Collections.reverse(reversedList);
    return reversedList;
}
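To make the expected behavior concrete, here's a small, hypothetical usage example (the values are illustrative, and the snippet is assumed to run inside the same class, since sortItems is private):
List<String> a = Arrays.asList("a", "b", "c", "d", "e");           // sorted list A
Set<String> b = new LinkedHashSet<>(Arrays.asList("b", "d", "x")); // set B
// Intersection ordered by A: [b, d]; C = [b, d, x]; reverse of C: [x, d, b]
System.out.println(sortItems(a, b));                               // prints [x, d, b]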

Performance Observations

The time complexity of the original implementation is O(n + m), where n is the size of A and m is the size of B. This sounds good, since there are no quadratic or exponential terms, but theoretical complexity does not always reflect actual runtime.

Through profiling and observation, I noted that the sizes of A and B could vary significantly between requests. Specifically, two main scenarios emerged:

Case 1: A is very large (2k–10k elements) while B is very small (50–100 elements).

Case 2: The sizes of A and B are more comparable (50–2k elements).

This led to two key observations about the original implementation:

First: The initial for() loop that finds the intersection between A and B scans every item in A. In Case 1, when A is much larger than B, many iterations result in no useful work.

Second: The final Collections.reverse() call seems inefficient. Collections.reverse() swaps list elements pairwise, which adds a full extra pass over the result; since we build the list ourselves, we can write elements into their final reversed positions directly and skip that pass.
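
As a minimal illustration of that idea (toy values, not the production code), elements can be written into a preallocated array from the back, so the result comes out already reversed:
List<String> forward = Arrays.asList("a", "b", "c");
String[] reversed = new String[forward.size()];
int idx = forward.size() - 1;
for (String s : forward) {
    reversed[idx--] = s;                       // "a" lands last, "c" lands first
}
System.out.println(Arrays.asList(reversed));   // prints [c, b, a]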

But... why does this matter?

A better implementation means faster runtime, lower CPU utilization, and reduced costs. JProfiler attributes a noticeable share of processing time to this function (around 7%–8% at normal load), despite A and B being cached in memory. CPU utilization can quickly climb from 50% to over 90% when the size of A or B grows beyond 7k.

Developing an Optimized Solution

My goal was to create a solution that performed well across both Case 1 and Case 2 by addressing the observed inefficiencies. This led to developing two distinct implementations and dynamically choosing between them.

Solution Approach:

The core idea is to use different strategies based on the relative sizes of A and B.

  • For Case 1 (small B, large A), iterating through B to find the intersection is more efficient than iterating through A.
  • For Case 2 (comparable sizes), iterating through A (like the original O(n + m) approach) is generally fine, but we can optimize by avoiding the separate Collections.reverse() call.

Implementation 1: O(m log m) Approach (Optimized for Small B)
For Case 1, when the size of B is very small, iterating over B should be faster than iterating over A. To efficiently check for intersection membership while looping through B (achieving O(1) average lookup time per item in B), A must first be converted into a Map (e.g., a HashMap<String, Integer>) mapping each item to its original index (order). This allows finding the intersection in O(m) time overall.

After finding the intersection, these items need to be sorted based on their original order derived from A (using the indices stored as values in the Map). This sorting step takes O(k log k) time, where k is the size of the intersection (k <= m). Finally, we populate the result array by adding the sorted intersection items and the remaining items from B, using reverse indexing to build the final reversed list directly without a separate reverse() call.

The overall time complexity for this implementation is dominated by the sorting step, resulting in O(m log m) (since k <= m).
private static List<String> sortItemFromSet_givenMap(Map<String, Integer> item2Order, Set<String> items) {
    if (items == null || items.isEmpty()) {
        // return empty list if no items are provided
        return Collections.emptyList();
    }
    if (item2Order == null || item2Order.isEmpty()) {
        // return items as is if no orders are provided
        return new ArrayList<>(items);
    }     
    List<ItemOrder> itemOrderList = new ArrayList<>();
    Set<String> addedItem = new HashSet<>();
    for (String item : items) {
        // find the intersection: items of B that also appear in A
        if (item2Order.containsKey(item) && addedItem.add(item)) {
            itemOrderList.add(new ItemOrder(item, item2Order.get(item)));
        }
    }
    if (itemOrderList.isEmpty()) {
        return new ArrayList<>(items);
    }
    // sort by order
    itemOrderList.sort(Comparator.comparingInt(ItemOrder::getOrder));
    String[] result = new String[items.size()];
    int currentIndex = items.size() - 1;
    // fill the result array with items in order
    for (ItemOrder item : itemOrderList) {
        result[currentIndex--] = item.getItem();
    }
    // fill the rest of the result array with the items that are not in the intersection
    for (String item : items) {
        if (!addedItem.contains(item)) {
            result[currentIndex--] = item;
        }
    }
    return Arrays.asList(result);
}
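// Lombok's @Getter and @AllArgsConstructor below generate ItemOrder's getters and constructor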
@Getter
@AllArgsConstructor
static class ItemOrder {
    private String item;
    private int order;
}
This works very well for Case 1 where B is very small, but for Case 2 it might be slower compared to the O(n + m) approach.

Implementation 2: O(n + m) Approach (Optimized for Reverse)
This revised implementation keeps the O(n + m) complexity while avoiding the explicit Collections.reverse() call. By iterating over A once, we find the intersection items and add them directly to the result array using reverse indexing. Then we iterate through B to add the remaining non-intersecting items, also using reverse indexing. This builds the final list in the correct reversed order from the start.
private static List<String> sortItemFromSet_givenList(List<String> sortedItems, Set<String> items) {
    if (items == null || items.isEmpty()) {
        // return empty list if no items are provided
        return Collections.emptyList();
    }
    if (sortedItems == null || sortedItems.isEmpty()) {
        return new ArrayList<>(items);
    }        
    int size = items.size();
    int currentIndex = size - 1;
    String[] result = new String[size];
    Set<String> addedItems = new HashSet<>(size);

    // Iterate over sortedItems (A) and keep the items that also appear in B (the intersection)
    for (String item : sortedItems) {
        if (items.contains(item) && addedItems.add(item)) {
            result[currentIndex--] = item;
        }
    }
    if (addedItems.isEmpty()) {
        return new ArrayList<>(items);
    }
    // Fill the front of the result array with the items from B that are not in the intersection
    for (String item : items) {
        if (!addedItems.contains(item)) {
            result[currentIndex--] = item;
        }
    }
    return Arrays.asList(result);
}

Data Structures
Supporting both implementations efficiently requires preparing two data structures derived from the initial list A. After A is loaded from Redis, I create an ItemStruct object that holds both the original List (sortedItemList) and a Map (item2OrderMap) which maps items to their original index (order) in A. This pre-computation allows the dynamic selection logic to quickly access the required data structure.
@Getter
@AllArgsConstructor
public static class ItemStruct {
    private List<String> sortedItemList;
    private Map<String, Integer> item2OrderMap;
}
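How item2OrderMap is built isn't shown here; a minimal sketch, assuming the usual java.util imports, could look like this (buildItemStruct is a hypothetical helper name, not part of the original code):
private static ItemStruct buildItemStruct(List<String> sortedItemList) {
    Map<String, Integer> item2OrderMap = new HashMap<>(sortedItemList.size());
    for (int i = 0; i < sortedItemList.size(); i++) {
        // map each item to its index (order) in A, keeping the first occurrence on duplicates
        item2OrderMap.putIfAbsent(sortedItemList.get(i), i);
    }
    return new ItemStruct(sortedItemList, item2OrderMap);
}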

Dynamic Implementation Selection

Finally, the selection logic chooses Implementation 1 (sortItemFromSet_givenMap) if the estimated complexity m * log2(m) is less than n + m; otherwise, it chooses Implementation 2 (sortItemFromSet_givenList). This heuristic aims to pick the theoretically faster approach based on the input sizes. (The IntMath.log2 function here is from Google's Guava library).
private static List<String> sortItem(ItemStruct itemStruct, Set<String> items) {
    if (items == null || items.isEmpty()) {
        return Collections.emptyList();
    }
    if (itemStruct != null && !itemStruct.getSortedItemList().isEmpty()) {
        int sizeOfItems = items.size();
        int sizeOfItemSortedList = itemStruct.getSortedItemList().size();
        if (sizeOfItems * IntMath.log2(sizeOfItems, RoundingMode.HALF_UP) < (sizeOfItems + sizeOfItemSortedList)) {
            return sortItemFromSet_givenMap(itemStruct.getItem2OrderMap(), items);
        }
        return sortItemFromSet_givenList(itemStruct.getSortedItemList(), items);
    }
    return new ArrayList<>(items);
}

Testing and Results

Test results on my laptop:

Case 1:
  • Size of A: 7000
  • Size of B: 100
  • Intersection: 100
  • Calls: 10,000
Run Time:
  • Old Implementation: 256 ms
  • New Implementation: 98 ms  (~2.6x faster)

Case 2:
  • Size of A: 5000
  • Size of B: 5000
  • Intersection: 3000
  • Calls: 10,000
Run Time:
  • Old Implementation: 1255 ms
  • New Implementation: 909 ms (~1.4x faster)
The test results clearly show that the optimized implementation significantly outperforms the old one.
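
For anyone who wants to reproduce a similar comparison, a very simple timing harness could look like the sketch below (this is not the original benchmark; the warm-up count and timing approach are assumptions):
// Hypothetical micro-benchmark helper: warms up the JVM, then times N calls in milliseconds
private static long timeMillis(Runnable task, int calls) {
    for (int i = 0; i < 1_000; i++) {
        task.run();                                 // warm-up so the JIT can compile the hot path
    }
    long start = System.nanoTime();
    for (int i = 0; i < calls; i++) {
        task.run();
    }
    return (System.nanoTime() - start) / 1_000_000; // elapsed milliseconds
}
// Example: timeMillis(() -> sortItem(itemStruct, items), 10_000);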

Impact in Production

The true impact of this optimization was validated in our production environment. Previously, large list A sizes under heavy load caused CPU utilization to spike above 90%, impacting system stability. After deploying the optimized code with dynamic algorithm selection, peak CPU usage under similar conditions dropped significantly to around 70%. This confirmed the optimization's real-world effectiveness in improving performance and resource efficiency for this critical function.

Conclusion

DSA is hard, especially when you're tasked with optimizing legacy functions at work. When optimizing, it's essential to use profiling tools like JProfiler to identify the bottleneck. Once you've pinpointed the performance constraints, carefully analyze which factors are impacting the runtime and test your approaches rigorously. This methodical process not only leads to significant improvements in efficiency but also reinforces the importance of combining theoretical knowledge with practical testing in real-world applications.
