深入理解 SortedList 中自定义对象的高效搜索-Python教程-PHP中文网

深入理解 sortedlist 中自定义对象的高效搜索

本文探讨了如何在 Python 的 `sortedcontainers.SortedList` 中高效搜索自定义类的对象，特别是当 `SortedList` 内部存储的是复杂对象，而搜索条件是其某个特定属性时。通过重载自定义类的富比较方法（如 `__lt__`），我们能够使对象直接与搜索值（例如字符串）进行比较，从而简化 `bisect_left` 的使用，避免了创建临时对象或编写复杂的自定义二分查找逻辑，显著提升了代码的简洁性和可维护性。

SortedList 中自定义对象搜索的挑战

sortedcontainers 库提供的 SortedList 是一个功能强大的有序列表实现，它在保持元素有序的同时，提供了高效的插入、删除和查找操作。当 SortedList 中存储的是基本数据类型（如整数、字符串）时，其内置的 bisect_left 等方法可以直观地用于查找。然而，当列表中存储的是自定义类的对象时，情况会变得复杂。

考虑一个场景，我们有一个 Supplier 类，包含 Name、Id、SapId 等属性，并且我们希望将这些 Supplier 对象存储在一个按 Name 属性排序的 SortedList 中。

from typing import List
from sortedcontainers import SortedList

class Supplier:
    def __init__(self, name: str, id: int = 0, sap_id: int = 0):
        self.Name = name
        self.Id = id
        self.SapId = sap_id

    def __repr__(self):
        return f"Supplier(Name='{self.Name}', Id={self.Id})"

class Data:
    def __init__(self):
        # 初始化时指定key，按Supplier的Name属性（小写）排序
        self.suppliers = SortedList(key=lambda x: x.Name.lower())

# 示例数据
data_store = Data()
data_store.suppliers.add(Supplier("Apple", 101, 2001))
data_store.suppliers.add(Supplier("Banana", 102, 2002))
data_store.suppliers.add(Supplier("Cherry", 103, 2003))
print(data_store.suppliers)
# 输出: SortedList([Supplier(Name='Apple', Id=101), Supplier(Name='Banana', Id=102), Supplier(Name='Cherry', Id=103)])

登录后复制

现在，我们想要根据供应商的名称来查找 Supplier 对象。直观的尝试是直接使用 bisect_left 方法：

# 假设在Data类中有一个查找方法
# def find_supplier(self, name: str):
#     index = self.suppliers.bisect_left(name.lower()) # 尝试直接传入字符串
#     # ... 后续检查

登录后复制

然而，这种做法会遇到类型不匹配的问题。bisect_left 在内部进行比较时，会尝试将传入的搜索值（一个字符串）与 SortedList 中的元素（Supplier 对象）进行比较。由于 SortedList 是通过 key=lambda x: x.Name.lower() 来排序的，bisect_left 期望一个可以与 Supplier 对象的 Name.lower() 属性进行比较的值，但它本身在查找过程中，实际上是将 name.lower() 与 Supplier 对象本身进行比较，或者更准确地说，是与 Supplier 对象通过 key 函数转换后的结果进行比较。当直接传入一个字符串时，它会尝试将这个字符串与 key 函数的返回值（也是一个字符串）进行比较。

更深层次的问题在于，虽然 SortedList 使用 key 函数来决定元素的排序位置，但 bisect_left 在执行二分查找时，其比较逻辑是基于列表元素自身的。如果列表中存储的是 Supplier 对象，那么 bisect_left 在内部比较时，会尝试比较 Supplier 对象与你传入的搜索值。当传入一个字符串时，Python 默认的比较行为会认为一个 Supplier 对象和一个字符串是不同类型的，通常会导致 TypeError 或不符合预期的比较结果。

一种常见的“变通”方法是创建一个临时的 Supplier 对象，将其 Name 属性设置为搜索名称，然后用这个临时对象进行查找：

纳米搜索

纳米搜索：360推出的新一代AI搜索引擎

查看详情

# 在Data类中
def find_supplier_with_temp_object(self, name: str):
    temporary_supplier = Supplier(name) # 创建一个临时对象
    index = self.suppliers.bisect_left(temporary_supplier)

    if index != len(self.suppliers) and self.suppliers[index].Name.lower() == name.lower():
        return self.suppliers[index]

    return None

# print(data_store.find_supplier_with_temp_object("Apple"))

登录后复制

这种方法虽然能够工作，但它引入了不必要的临时对象创建，增加了代码的复杂性和潜在的性能开销，尤其是在高频查找的场景下，显得不够优雅。

优雅的解决方案：重载富比较方法

为了避免创建临时对象并实现更简洁的搜索逻辑，我们可以通过在自定义类 Supplier 中重载富比较方法（rich comparison methods）来解决这个问题。核心思想是让 Supplier 对象能够直接与字符串进行比较。

Python 中的富比较方法包括：

__lt__(self, other): 小于 (<)
__le__(self, other): 小于等于 (<=)
__eq__(self, other): 等于 (==)
__ne__(self, other): 不等于 (!=)
__gt__(self, other): 大于 (>)
__ge__(self, other): 大于等于 (>=)

通过实现 __lt__ 方法，我们可以定义 Supplier 对象如何与另一个对象（包括字符串）进行“小于”比较。

from typing import List
from sortedcontainers import SortedList

class Supplier:
    def __init__(self, name: str, id: int = 0, sap_id: int = 0):
        self.Name = name
        self.Id = id
        self.SapId = sap_id

    def __repr__(self):
        return f"Supplier(Name='{self.Name}', Id={self.Id})"

    # 重载小于操作符
    def __lt__(self, other):
        if isinstance(other, str):
            # 如果另一个操作数是字符串，则与自己的Name属性进行比较
            return self.Name.lower() < other.lower()
        elif isinstance(other, Supplier):
            # 如果另一个操作数是Supplier对象，则与另一个Supplier的Name属性进行比较
            return self.Name.lower() < other.Name.lower()
        # 处理其他类型或抛出错误，这里简化为默认False
        return NotImplemented # 或者 raise TypeError(f"Cannot compare Supplier with {type(other)}")

    # 重载等于操作符 (推荐实现，确保精确匹配)
    def __eq__(self, other):
        if isinstance(other, str):
            return self.Name.lower() == other.lower()
        elif isinstance(other, Supplier):
            return self.Name.lower() == other.Name.lower()
        return NotImplemented

    # 如果实现了__eq__，通常也建议实现__hash__，除非明确不希望对象可哈希
    # def __hash__(self):
    #     return hash(self.Name.lower())

class Data:
    def __init__(self):
        # 此时SortedList不再需要key函数，因为它存储的对象本身就可比较了
        self.suppliers = SortedList()

    def add_supplier(self, supplier: Supplier):
        self.suppliers.add(supplier)

    def find_supplier(self, name: str):
        # 直接传入字符串进行二分查找
        index = self.suppliers.bisect_left(name)

        # 检查找到的索引是否有效，并且对应元素的名称是否与搜索名称匹配
        if index != len(self.suppliers) and self.suppliers[index].Name.lower() == name.lower():
            return self.suppliers[index]

        return None

# 示例用法
data_store = Data()
data_store.add_supplier(Supplier("Banana", 102, 2002))
data_store.add_supplier(Supplier("Apple", 101, 2001))
data_store.add_supplier(Supplier("Cherry", 103, 2003))

print("排序后的供应商列表:", data_store.suppliers)
# 预期输出: SortedList([Supplier(Name='Apple', Id=101), Supplier(Name='Banana', Id=102), Supplier(Name='Cherry', Id=103)])

found_supplier = data_store.find_supplier("Apple")
print("查找 'Apple':", found_supplier)
# 预期输出: 查找 'Apple': Supplier(Name='Apple', Id=101)

not_found_supplier = data_store.find_supplier("Grape")
print("查找 'Grape':", not_found_supplier)
# 预期输出: 查找 'Grape': None

found_supplier_case_insensitive = data_store.find_supplier("apple")
print("查找 'apple' (不区分大小写):", found_supplier_case_insensitive)
# 预期输出: 查找 'apple' (不区分大小写): Supplier(Name='Apple', Id=101)

登录后复制

在这个优化后的方案中：

Supplier 类重载 __lt__ 方法：
- 当 other 是 str 类型时，它会将 self.Name.lower() 与 other.lower() 进行比较。
- 当 other 是 Supplier 类型时，它会将 self.Name.lower() 与 other.Name.lower() 进行比较。
- 通过这种方式，Supplier 对象自身变得可比较，并且能够理解与字符串的比较。
SortedList 初始化简化：Data 类的 __init__ 方法中，self.suppliers 不再需要 key=lambda x: x.Name.lower() 参数，因为 Supplier 对象自身已经定义了排序逻辑。SortedList 会直接使用 Supplier 对象之间定义的 __lt__ 进行排序。
find_supplier 方法简洁：bisect_left 可以直接传入搜索字符串 name，而不再需要创建临时 Supplier 对象。SortedList 内部会利用 Supplier 对象中重载的 __lt__ 方法来正确执行二分查找。
__eq__ 方法的实现：为了确保在查找后进行精确匹配（self.suppliers[index].Name.lower() == name.lower()）时，比较行为符合预期，我们同样重载了 __eq__ 方法。这对于 SortedList 的其他操作（如 remove）也可能很重要。

注意事项与总结

一致性：重载比较方法时，确保它们之间的一致性至关重要。例如，如果 a < b 为真，那么 b < a 应该为假，并且 a == b 应该为假。同时，比较逻辑应与你期望的排序顺序保持一致。
处理不同类型：在 __lt__ 和 __eq__ 方法中，通过 isinstance 检查 other 的类型是一种健壮的做法，可以避免不必要的 TypeError。对于无法处理的类型，返回 NotImplemented 是 Python 推荐的做法，它允许解释器尝试反向操作或调用其他比较方法。
排序键的单一性：这种方法将排序逻辑（基于 Name 属性）硬编码到 Supplier 类中。如果你的 SortedList 需要根据 Supplier 对象的不同属性进行排序（例如，有时按 Name，有时按 Id），那么这种方法可能不适用。在这种情况下，使用 SortedList(key=...) 并在查找时创建临时对象（或手动实现二分查找）可能是更灵活的选择。
性能：重载比较方法通常比创建临时对象更高效，因为它避免了每次查找时的对象实例化开销。
可读性与维护性：此方案使代码更具可读性，将对象的比较逻辑封装在对象自身内部，符合面向对象的原则，降低了 Data 类中查找方法的复杂性。