Added GenericAliasHandler class [ENG-226]#76
Conversation
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
eywalker
left a comment
There was a problem hiding this comment.
Thanks for adding support to this! Could you add new tests that cover the GenericAliasHandler usage?
There was a problem hiding this comment.
Pull request overview
Adds support for hashing parameterized generic type annotations (e.g. dict[int, str]) by introducing and registering a new semantic hashing handler, and fixes a truthiness check in Python→Arrow conversion to avoid failures on containers with ambiguous boolean evaluation.
Changes:
- Introduce
GenericAliasHandlerand register it fortypes.GenericAliasand (programmatically)typing._GenericAlias. - Update the default data context (
v0.1.json) to register the new handler fortypes.GenericAlias. - Adjust Python→Arrow list/map converters to use
value is not Nonerather thanif value.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/orcapod/semantic_types/universal_converter.py |
Avoids truthiness checks when converting list/map values to Arrow-friendly structures. |
src/orcapod/hashing/semantic_hashing/builtin_handlers.py |
Adds and registers GenericAliasHandler for parameterized generic annotations. |
src/orcapod/contexts/data/v0.1.json |
Registers GenericAliasHandler in the versioned default context’s TypeHandlerRegistry. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| class GenericAliasHandler: | ||
| """ | ||
| Handler for generic alias type annotations such as ``dict[int, list[int]]`` | ||
| (``types.GenericAlias``) and ``typing`` generics (``typing._GenericAlias``). | ||
|
|
||
| Produces a stable dict containing the origin type and a list of hashed | ||
| argument types so that structurally identical generic annotations always | ||
| yield the same hash, and structurally different ones yield different hashes. | ||
| """ | ||
|
|
||
| def handle(self, obj: Any, hasher: "SemanticHasherProtocol") -> Any: | ||
| origin = getattr(obj, "__origin__", None) | ||
| args = getattr(obj, "__args__", None) or () | ||
| if origin is None: | ||
| return f"generic_alias:{obj!r}" | ||
| return { | ||
| "__type__": "generic_alias", | ||
| "origin": hasher.hash_object(origin).to_string(), | ||
| "args": [hasher.hash_object(arg).to_string() for arg in args], | ||
| } |
There was a problem hiding this comment.
There are extensive semantic hasher tests, but none appear to cover parameterized generics (e.g. dict[int, str], list[int], typing.Dict[int, str]). Adding a regression test that hashes these types in strict mode (and asserts stable equality/inequality across different args) would prevent this handler from silently breaking in future refactors.
| elif origin is list: | ||
| element_converter = self.get_python_to_arrow_converter(args[0]) | ||
| return ( | ||
| lambda value: [element_converter(item) for item in value] | ||
| if value | ||
| if value is not None | ||
| else [] | ||
| ) | ||
|
|
||
| elif origin is dict: | ||
| key_converter = self.get_python_to_arrow_converter(args[0]) | ||
| value_converter = self.get_python_to_arrow_converter(args[1]) | ||
| return ( | ||
| lambda value: [ | ||
| {"key": key_converter(k), "value": value_converter(v)} | ||
| for k, v in value.items() | ||
| ] | ||
| if value | ||
| if value is not None | ||
| else [] |
There was a problem hiding this comment.
This change fixes truthiness checks for list/dict values, but there’s no regression test ensuring python_to_arrow conversion works for iterable containers with an ambiguous/invalid __bool__ (e.g. numpy arrays raise on if value). Consider adding a test using a small custom iterable whose __bool__ raises, and verifying conversion still succeeds when the value is empty/non-empty.
| [{"_type": "types.BuiltinFunctionType"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.FunctionHandler", "_config": {"function_info_extractor": {"_ref": "function_info_extractor"}}}], | ||
| [{"_type": "types.MethodType"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.FunctionHandler", "_config": {"function_info_extractor": {"_ref": "function_info_extractor"}}}], | ||
| [{"_type": "builtins.type"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.TypeObjectHandler", "_config": {}}], | ||
| [{"_type": "types.GenericAlias"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.GenericAliasHandler", "_config": {}}], | ||
| [{"_type": "pyarrow.Table"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.ArrowTableHandler", "_config": {"arrow_hasher": {"_ref": "arrow_hasher"}}}], | ||
| [{"_type": "pyarrow.RecordBatch"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.ArrowTableHandler", "_config": {"arrow_hasher": {"_ref": "arrow_hasher"}}}] |
There was a problem hiding this comment.
type_handler_registry in this context only registers types.GenericAlias, but the PR description (and register_builtin_handlers) also registers typing._GenericAlias. Because the default semantic_hasher here uses this JSON-constructed registry, typing generics like typing.Dict[int, str] / typing.Optional[int] can still raise in strict mode. Consider adding a typing._GenericAlias entry (or switching this context to a registry implementation that conditionally registers it).
There was a problem hiding this comment.
@brian-arnold this is a good point. Shall we register _GenericAlias as well?
| registry.register(type, TypeObjectHandler()) | ||
|
|
||
| # generic alias type annotations: dict[int, str], list[str], etc. | ||
| import types as _types |
There was a problem hiding this comment.
register_builtin_handlers imports types as _types twice (once for functions, again for GenericAlias). This is redundant and makes the section a bit harder to read; you can reuse the existing _types import from the functions block instead of re-importing it.
| import types as _types |
…ping._SpecialForm objects (like typing.Union) that appear as __origin__ on typing generics - Registered typing._SpecialForm alongside typing._GenericAlias in register_builtin_handlers, and removed the duplicate import types as _types - Added typing._GenericAlias and typing._SpecialForm entries to v0.1.json so the JSON-constructed hasher also handles them correctly
|
@eywalker I addressed all comments above — I've added a TestGenericAliasHandler test class with 11 tests. I also needed to add a SpecialFormHandler for typing._SpecialForm (e.g. typing.Union), since Optional[int].origin is typing.Union, which was unregistered and caused strict-mode failures. I also added typing._GenericAlias (to support e.g. Dict[int,str] as well as dict[int,str]) and typing._SpecialForm entries to v0.1.json so the JSON-constructed hasher handles them too. All new and existing tests pass. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| # generic alias type annotations: dict[int, str], list[str], etc. | ||
| generic_alias_handler = GenericAliasHandler() | ||
| registry.register(_types.GenericAlias, generic_alias_handler) | ||
| # typing._GenericAlias covers Optional[X], Union[X, Y], Dict[K, V], etc. | ||
| # typing._SpecialForm covers typing.Union, typing.ClassVar, etc. which | ||
| # appear as __origin__ on those generics (e.g. Optional[int].__origin__ | ||
| # is typing.Union, a _SpecialForm). | ||
| try: | ||
| import typing as _typing | ||
|
|
||
| registry.register(_typing._GenericAlias, generic_alias_handler) # type: ignore[attr-defined] | ||
| registry.register(_typing._SpecialForm, SpecialFormHandler()) # type: ignore[attr-defined] | ||
| except AttributeError: | ||
| pass |
There was a problem hiding this comment.
types.GenericAlias and typing._GenericAlias are now handled, but PEP 604 union annotations (instances of types.UnionType, e.g. int | None / int | str) are still unhandled in strict mode and will raise the same "no TypeHandlerProtocol registered" TypeError when they appear in user annotations or internal schema inference (e.g. default_type | None in semantic_types/pydata_utils.py). Consider adding a handler/registration for types.UnionType (and ensure it hashes equivalently to typing.Union[...] / typing.Optional[...] if you want those forms to be interchangeable).
| def test_optional_hashed(self, hasher): | ||
| """Optional[int] (a typing._GenericAlias) produces a ContentHash.""" | ||
| import typing | ||
|
|
||
| assert isinstance(hasher.hash_object(typing.Optional[int]), ContentHash) | ||
|
|
There was a problem hiding this comment.
The new tests cover typing.Optional[...] / typing.Union[...], but they don’t cover PEP 604 unions (int | None, int | str), which are common in Python 3.11+ and are used internally in semantic_types/pydata_utils.py (e.g. default_type | None). Adding at least one test that hasher.hash_object(int | None) succeeds (and ideally matches typing.Optional[int] if that equivalence is intended) would prevent regressions.
| assert reg.has_handler(_types.GenericAlias) | ||
| assert reg.has_handler(_typing._GenericAlias) # type: ignore[attr-defined] | ||
| assert reg.has_handler(_typing._SpecialForm) # type: ignore[attr-defined] |
There was a problem hiding this comment.
Since this test is asserting that the default registry includes the new typing handlers, it would be good to also assert support for PEP 604 unions (types.UnionType) here if you add that handler. Otherwise the default-context registry can still be missing coverage for int | None annotations even though typing.Optional[...] works.
eywalker
left a comment
There was a problem hiding this comment.
@brian-arnold would you mind creating issues corresponding to my comments? When issues are created in Linear, I'm happy to merge this!
|
|
||
| def handle(self, obj: Any, hasher: "SemanticHasherProtocol") -> Any: | ||
| name = getattr(obj, "_name", None) or repr(obj) | ||
| return f"special_form:typing.{name}" |
There was a problem hiding this comment.
For the case of Optional[int], I'd like this to hash to identical value as int|None as these two are semantically identical. I'm going to merge this PR but let's be sure to address this correctly when implementing for UnionType
| if origin is None: | ||
| return f"generic_alias:{obj!r}" | ||
| return { | ||
| "__type__": "generic_alias", |
There was a problem hiding this comment.
I'm starting to think that it'd actually be better to consider how to handle both type and Generic alias in a uniform/consistent manner. Let's address this as a separate issue.
| return ( | ||
| lambda value: [element_converter(item) for item in value] | ||
| if value | ||
| if value is not None |
There was a problem hiding this comment.
I'm actually not sure about dropping the value if it's None. I think this ought to be correctly expressed?
The semantic hasher had no registered handler for types.GenericAlias — the type Python uses to represent parameterized generics like dict[int, str] or list[int]. Plain types like dict or list worked fine, but as soon as key/value types were specified, the hasher raised a TypeError in strict mode.
This affected both user-written type annotations on @function_pod functions and orcapod's own internal schema inference, which produces parameterized generics like dict[key_type, value_type].
The fix adds a GenericAliasHandler that produces a stable serializable representation by recursively hashing origin and args, and registers it for both types.GenericAlias and typing._GenericAlias.
Original error message: