Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for serializing/deserializing auto-generated key-value pairs in the new IR format. #127

Merged
merged 6 commits into from
Jan 22, 2025
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 6 additions & 4 deletions clp_ffi_py/ir/native.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ from __future__ import annotations

from datetime import tzinfo
from types import TracebackType
from typing import Any, Dict, IO, List, Optional, Type
from typing import Any, Dict, IO, List, Optional, Tuple, Type

from clp_ffi_py.wildcard_query import WildcardQuery

Expand Down Expand Up @@ -86,8 +86,8 @@ class FourByteDeserializer:
) -> Optional[LogEvent]: ...

class KeyValuePairLogEvent:
Copy link
Member

@junhaoliao junhaoliao Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too sure how we are going to implement the compatibility layer for CLP Text IR Streams. In the future, will log events also be accessed through this interface?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, timestamp will be stored inside the auto-generated dictionary and msg will be stored inside the user-generated dictionary.

def __init__(self, dictionary: Dict[Any, Any]): ...
def to_dict(self) -> Dict[Any, Any]: ...
def __init__(self, auto_gen_kv_pairs: Dict[Any, Any], user_gen_kv_pairs: Dict[Any, Any]): ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we swap the order of user_gen_kv_pairs & auto_gen_kv_pairs and make auto_gen_kv_pairs optional? That way we can achieve backward compatibility.

(That said, I can see any generic developers might not need to call this constructor.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor is mainly for testing purpose; we usually construct KeyValuePairLogEvent directly from the deserializer in CPython level.
But I think your comment might make sense in Serializer's API changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought a little more. I feel like we shouldn't allow optional input even in Serializer. As this is the low level API, both values should be explicitly given. Also, most users won't use this API directly anyways. They should use Python logging library to log their kv pairs, which will be handled as user-generated by the loglib's code.

def to_dict(self) -> Tuple[Dict[Any, Any], Dict[Any, Any]]: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That whether kv-pairs are auto-generated might not matter to all developers. Can we also provide a helper to return a single dict?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, my understanding is this should be the low level API, so we should return whatever given on the serialization end.
If you are suggesting merging two dicts into one, I think it would also be problematic because:

  • We can't simply merge them into one since there might be key conflicts in two dicts
  • If we add keys on the top level, like {"auto_gen": auto_gen_dict, "user_gen": user_gen_dict}, it would be a little confusing since the returned dictionaries aren't exactly what user inputs are and we need more documents to clarify things. It might be better just to return two dictionaries separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, i was proposing that we can simply merge the two dicts into one. That there can be potential key conflicts, though rare, is valid concern.

If we are to keep returning a tuple, having the user-kv-pairs as the first element might be more flexible to developers. e.g., they can write

kv_pairs, = log_event.to_dict()

to only get the user inserted pairs, if they don't care about the auto generated ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, the current ordering strictly follows the underlying ffi core APIs.
If really necessary, we can add APIs like get_user_gen_kv_pairs_as_dict and get_auto_gen_kv_pairs_as_dict in future PRs?


class Serializer:
def __init__(self, output_stream: IO[bytes], buffer_size_limit: int = 65536): ...
Expand All @@ -98,7 +98,9 @@ class Serializer:
exc_value: Optional[BaseException],
traceback: Optional[TracebackType],
) -> None: ...
def serialize_log_event_from_msgpack_map(self, msgpack_map: bytes) -> int: ...
def serialize_log_event_from_msgpack_map(
self, auto_gen_msgpack_map: bytes, user_gen_msgpack_map: bytes
) -> int: ...
def get_num_bytes_serialized(self) -> int: ...
def flush(self) -> None: ...
def close(self) -> None: ...
Expand Down
2 changes: 1 addition & 1 deletion src/clp
Submodule clp updated 241 files
5 changes: 2 additions & 3 deletions src/clp_ffi_py/ir/native/PyDeserializer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -192,9 +192,8 @@ auto PyDeserializer::init(

PyDeserializer::IrUnitHandler::SchemaTreeNodeInsertionHandle
trivial_schema_tree_node_insertion_handle
= []([[maybe_unused]] clp::ffi::SchemaTree::NodeLocator) -> IRErrorCode {
return IRErrorCode::IRErrorCode_Success;
};
= []([[maybe_unused]] bool, [[maybe_unused]] clp::ffi::SchemaTree::NodeLocator
) -> IRErrorCode { return IRErrorCode::IRErrorCode_Success; };

PyDeserializer::IrUnitHandler::EndOfStreamHandle end_of_stream_handle
= [this]() -> IRErrorCode { return this->handle_end_of_stream(); };
Expand Down
10 changes: 6 additions & 4 deletions src/clp_ffi_py/ir/native/PyDeserializer.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -121,9 +121,10 @@ class PyDeserializer {
= std::function<clp::ffi::ir_stream::IRErrorCode(clp::ffi::KeyValuePairLogEvent&&)>;
using UtcOffsetChangeHandle
= std::function<clp::ffi::ir_stream::IRErrorCode(clp::UtcOffset, clp::UtcOffset)>;
using SchemaTreeNodeInsertionHandle
= std::function<clp::ffi::ir_stream::IRErrorCode(clp::ffi::SchemaTree::NodeLocator
)>;
using SchemaTreeNodeInsertionHandle = std::function<clp::ffi::ir_stream::IRErrorCode(
bool is_auto_generated,
clp::ffi::SchemaTree::NodeLocator
)>;
using EndOfStreamHandle = std::function<clp::ffi::ir_stream::IRErrorCode()>;

// Constructor
Expand Down Expand Up @@ -162,9 +163,10 @@ class PyDeserializer {
}

[[nodiscard]] auto handle_schema_tree_node_insertion(
bool is_auto_generated,
clp::ffi::SchemaTree::NodeLocator schema_tree_node_locator
) -> clp::ffi::ir_stream::IRErrorCode {
return m_schema_tree_node_insertion_handle(schema_tree_node_locator);
return m_schema_tree_node_insertion_handle(is_auto_generated, schema_tree_node_locator);
}

[[nodiscard]] auto handle_end_of_stream() -> clp::ffi::ir_stream::IRErrorCode {
Expand Down
173 changes: 124 additions & 49 deletions src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,7 @@ class IrUnitHandler {
}

[[nodiscard]] static auto handle_schema_tree_node_insertion(
[[maybe_unused]] bool is_auto_generated,
[[maybe_unused]] clp::ffi::SchemaTree::NodeLocator schema_tree_node_locator
) -> IRErrorCode {
return IRErrorCode::IRErrorCode_Success;
Expand Down Expand Up @@ -239,13 +240,18 @@ PyDoc_STRVAR(
" key-value pairs. This class is designed to be instantiated by the IR deserializer."
" However, direct instantiation using the `__init__` method is also supported for testing"
" purposes, although this may not be as efficient as emission from the IR deserializer.\n\n"
"__init__(self, dictionary)\n\n"
"__init__(self, auto_gen_kv_pairs, user_gen_kv_pairs)\n\n"
"Initializes a :class:`KeyValuePairLogEvent` from the given Python dictionary. Note that"
" each object should only be initialized once. Double initialization will result in a"
" memory leak.\n\n"
":param dictionary: A dictionary representing the key-value pair log event, where all keys"
" must be strings, including keys inside any sub-dictionaries.\n"
":type dictionary: dict[str, Any]\n"
":param auto_gen_kv_pairs: A dictionary representing the auto-generated key-value pairs of"
" the given log event, where all keys must be strings, including keys inside any"
" sub-dictionaries.\n"
":type auto_gen_kv_pairs: dict[str, Any]\n"
":param user_gen_kv_pairs: A dictionary representing the user-generated key-value pairs of"
" the given log event, where all keys must be strings, including keys inside any"
" sub-dictionaries.\n"
":type user_gen_kv_pairs: dict[str, Any]\n"
);
CLP_FFI_PY_METHOD auto
PyKeyValuePairLogEvent_init(PyKeyValuePairLogEvent* self, PyObject* args, PyObject* keywords)
Expand All @@ -259,9 +265,11 @@ PyDoc_STRVAR(
cPyKeyValuePairLogEventToDictDoc,
"to_dict(self)\n"
"--\n\n"
"Converts the log event into a Python dictionary.\n\n"
":return: The log event as a Python dictionary.\n"
":rtype: dict[str, Any]\n"
"Converts the log event into Python dictionaries.\n\n"
":return: A tuple of Python dictionaries:\n\n"
" - A dictionary for auto-generated key-value pairs.\n"
" - A dictionary for user-generated key-value pairs.\n"
":rtype: tuple[dict[str, Any], dict[str, Any]]\n"
);
CLP_FFI_PY_METHOD auto PyKeyValuePairLogEvent_to_dict(PyKeyValuePairLogEvent* self) -> PyObject*;

Expand Down Expand Up @@ -313,12 +321,15 @@ PyType_Spec PyKeyValuePairLogEvent_type_spec{
* instance. This approach is inefficient and intended solely for testing purposes, as it allows
* instance creation without a full IR stream. TODO: Replace this method with a more efficient
* conversion once a direct utility is available.
* @param py_dict
* @param py_auto_gen_kv_pairs_dict
* @param py_user_gen_kv_pairs_dict
* @return The converted key-value log event of the given dictionary on success.
* @return std::nullopt on failure with the relevant Python exception and error set.
*/
[[nodiscard]] auto convert_py_dict_to_key_value_pair_log_event(PyDictObject* py_dict)
-> std::optional<clp::ffi::KeyValuePairLogEvent>;
[[nodiscard]] auto convert_py_dict_to_key_value_pair_log_event(
PyDictObject* py_auto_gen_kv_pairs_dict,
PyDictObject* py_user_gen_kv_pairs_dict
) -> std::optional<clp::ffi::KeyValuePairLogEvent>;

/**
* Serializes the given node id value pairs into a Python dictionary object.
Expand Down Expand Up @@ -361,33 +372,46 @@ PyType_Spec PyKeyValuePairLogEvent_type_spec{
CLP_FFI_PY_METHOD auto
PyKeyValuePairLogEvent_init(PyKeyValuePairLogEvent* self, PyObject* args, PyObject* keywords)
-> int {
static char keyword_dictionary[]{"dictionary"};
static char* keyword_table[]{static_cast<char*>(keyword_dictionary), nullptr};
static char keyword_auto_gen_kv_pairs[]{"auto_gen_kv_pairs"};
static char keyword_user_gen_kv_pairs[]{"user_gen_kv_pairs"};
static char* keyword_table[]{
static_cast<char*>(keyword_auto_gen_kv_pairs),
static_cast<char*>(keyword_user_gen_kv_pairs),
nullptr
};

// If the argument parsing fails, `self` will be deallocated. We must reset all pointers to
// nullptr in advance, otherwise the deallocator might trigger segmentation fault.
self->default_init();

PyObject* dictionary{Py_None};
PyObject* py_auto_gen_kv_pairs{Py_None};
PyObject* py_user_gen_kv_pairs{Py_None};
if (false
== static_cast<bool>(PyArg_ParseTupleAndKeywords(
args,
keywords,
"O",
"OO",
static_cast<char**>(keyword_table),
&dictionary
&py_auto_gen_kv_pairs,
&py_user_gen_kv_pairs
)))
{
return -1;
}

if (false == static_cast<bool>(PyDict_Check(dictionary))) {
PyErr_SetString(PyExc_TypeError, "`dictionary` must be a Python dictionary object");
if (false == static_cast<bool>(PyDict_Check(py_auto_gen_kv_pairs))) {
PyErr_SetString(PyExc_TypeError, "`auto_gen_kv_pairs` must be a Python dictionary object");
return -1;
}
if (false == static_cast<bool>(PyDict_Check(py_user_gen_kv_pairs))) {
PyErr_SetString(PyExc_TypeError, "`user_gen_kv_pairs` must be a Python dictionary object");
return -1;
}
PyDictObject* py_dict{py_reinterpret_cast<PyDictObject>(dictionary)};

auto optional_kv_pair_log_event{convert_py_dict_to_key_value_pair_log_event(py_dict)};
auto optional_kv_pair_log_event{convert_py_dict_to_key_value_pair_log_event(
py_reinterpret_cast<PyDictObject>(py_auto_gen_kv_pairs),
py_reinterpret_cast<PyDictObject>(py_user_gen_kv_pairs)
)};
if (false == optional_kv_pair_log_event.has_value()) {
return -1;
}
Expand All @@ -396,36 +420,44 @@ PyKeyValuePairLogEvent_init(PyKeyValuePairLogEvent* self, PyObject* args, PyObje
}

CLP_FFI_PY_METHOD auto PyKeyValuePairLogEvent_to_dict(PyKeyValuePairLogEvent* self) -> PyObject* {
return py_reinterpret_cast<PyObject>(self->to_dict());
return self->to_dict();
}

CLP_FFI_PY_METHOD auto PyKeyValuePairLogEvent_dealloc(PyKeyValuePairLogEvent* self) -> void {
self->clean();
Py_TYPE(self)->tp_free(py_reinterpret_cast<PyObject>(self));
}

auto convert_py_dict_to_key_value_pair_log_event(PyDictObject* py_dict)
-> std::optional<clp::ffi::KeyValuePairLogEvent> {
PyObjectPtr<PyBytesObject> const serialized_msgpack_byte_sequence{
py_utils_serialize_dict_to_msgpack(py_dict)
auto convert_py_dict_to_key_value_pair_log_event(
PyDictObject* py_auto_gen_kv_pairs_dict,
PyDictObject* py_user_gen_kv_pairs_dict
) -> std::optional<clp::ffi::KeyValuePairLogEvent> {
PyObjectPtr<PyBytesObject> const msgpack_serialized_auto_gen_kv_pairs{
py_utils_serialize_dict_to_msgpack(py_auto_gen_kv_pairs_dict)
};
if (nullptr == serialized_msgpack_byte_sequence) {
if (nullptr == msgpack_serialized_auto_gen_kv_pairs) {
return std::nullopt;
}
PyObjectPtr<PyBytesObject> const msgpack_serialized_user_gen_kv_pairs{
py_utils_serialize_dict_to_msgpack(py_user_gen_kv_pairs_dict)
};
if (nullptr == msgpack_serialized_user_gen_kv_pairs) {
return std::nullopt;
}

// Since the type is already checked, we can use the macro to avoid duplicated type checking.
std::span<char const> const data_view{
PyBytes_AS_STRING(serialized_msgpack_byte_sequence.get()),
static_cast<size_t>(PyBytes_GET_SIZE(serialized_msgpack_byte_sequence.get()))
};
auto const unpack_result{unpack_msgpack(data_view)};
if (unpack_result.has_error()) {
PyErr_SetString(PyExc_RuntimeError, unpack_result.error().c_str());
auto const optional_auto_gen_msgpack_map_handle{unpack_msgpack_map(
{PyBytes_AS_STRING(msgpack_serialized_auto_gen_kv_pairs.get()),
static_cast<size_t>(PyBytes_GET_SIZE(msgpack_serialized_auto_gen_kv_pairs.get()))}
)};
if (false == optional_auto_gen_msgpack_map_handle.has_value()) {
return std::nullopt;
}
auto const& msgpack_obj{unpack_result.value().get()};
if (msgpack::type::MAP != msgpack_obj.type) {
PyErr_SetString(PyExc_TypeError, "Unpacked msgpack is not a map");
auto const optional_user_gen_msgpack_map_handle{unpack_msgpack_map(
{PyBytes_AS_STRING(msgpack_serialized_user_gen_kv_pairs.get()),
static_cast<size_t>(PyBytes_GET_SIZE(msgpack_serialized_user_gen_kv_pairs.get()))}
)};
if (false == optional_user_gen_msgpack_map_handle.has_value()) {
return std::nullopt;
}

Expand All @@ -442,8 +474,14 @@ auto convert_py_dict_to_key_value_pair_log_event(PyDictObject* py_dict)
}

auto& serializer{serializer_result.value()};
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-union-access)
if (false == serializer.serialize_msgpack_map(msgpack_obj.via.map)) {
// NOLINTBEGIN(cppcoreguidelines-pro-type-union-access)
if (false
== serializer.serialize_msgpack_map(
optional_auto_gen_msgpack_map_handle.value().get().via.map,
optional_user_gen_msgpack_map_handle.value().get().via.map
))
// NOLINTEND(cppcoreguidelines-pro-type-union-access)
{
PyErr_SetString(
PyExc_RuntimeError,
get_c_str_from_constexpr_string_view(cSerializerSerializeMsgpackMapError)
Expand Down Expand Up @@ -694,24 +732,61 @@ auto PyKeyValuePairLogEvent::init(clp::ffi::KeyValuePairLogEvent kv_pair_log_eve
return true;
}

[[nodiscard]] auto PyKeyValuePairLogEvent::to_dict() -> PyDictObject* {
[[nodiscard]] auto PyKeyValuePairLogEvent::to_dict() -> PyObject* {
try {
auto const& node_id_value_pairs{m_kv_pair_log_event->get_node_id_value_pairs()};
auto const& schema_tree{m_kv_pair_log_event->get_schema_tree()};
auto const schema_subtree_bitmap_result{m_kv_pair_log_event->get_schema_subtree_bitmap()};
if (schema_subtree_bitmap_result.has_error()) {
auto const& auto_gen_node_id_value_pairs{
m_kv_pair_log_event->get_auto_gen_node_id_value_pairs()
};
auto const& auto_gen_keys_schema_tree{m_kv_pair_log_event->get_auto_gen_keys_schema_tree()};
auto const auto_gen_keys_schema_subtree_bitmap_result{
m_kv_pair_log_event->get_auto_gen_keys_schema_subtree_bitmap()
};
if (auto_gen_keys_schema_subtree_bitmap_result.has_error()) {
PyErr_Format(
PyExc_RuntimeError,
"Failed to get schema subtree bitmap: %s",
schema_subtree_bitmap_result.error().message().c_str()
"Failed to get auto-generated keys schema subtree bitmap: %s",
auto_gen_keys_schema_subtree_bitmap_result.error().message().c_str()
);
return nullptr;
}
return serialize_node_id_value_pair_to_py_dict(
schema_tree,
schema_subtree_bitmap_result.value(),
node_id_value_pairs
);
PyObjectPtr<PyDictObject> const auto_gen_kv_pairs_dict{
serialize_node_id_value_pair_to_py_dict(
auto_gen_keys_schema_tree,
auto_gen_keys_schema_subtree_bitmap_result.value(),
auto_gen_node_id_value_pairs
)
};
if (nullptr == auto_gen_kv_pairs_dict) {
return nullptr;
}

auto const& user_gen_node_id_value_pairs{
m_kv_pair_log_event->get_user_gen_node_id_value_pairs()
};
auto const& user_gen_keys_schema_tree{m_kv_pair_log_event->get_user_gen_keys_schema_tree()};
auto const user_gen_keys_schema_subtree_bitmap_result{
m_kv_pair_log_event->get_user_gen_keys_schema_subtree_bitmap()
};
if (user_gen_keys_schema_subtree_bitmap_result.has_error()) {
PyErr_Format(
PyExc_RuntimeError,
"Failed to get user-generated keys schema subtree bitmap: %s",
user_gen_keys_schema_subtree_bitmap_result.error().message().c_str()
);
return nullptr;
}
PyObjectPtr<PyDictObject> const user_gen_kv_pairs_dict{
serialize_node_id_value_pair_to_py_dict(
user_gen_keys_schema_tree,
user_gen_keys_schema_subtree_bitmap_result.value(),
user_gen_node_id_value_pairs
)
};
if (nullptr == user_gen_kv_pairs_dict) {
return nullptr;
}

return Py_BuildValue("(OO)", auto_gen_kv_pairs_dict.get(), user_gen_kv_pairs_dict.get());
} catch (clp::TraceableException& ex) {
handle_traceable_exception(ex);
return nullptr;
Expand Down
9 changes: 6 additions & 3 deletions src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -85,11 +85,14 @@ class PyKeyValuePairLogEvent {
}

/**
* Converts the underlying key-value pair log event into a Python dictionary.
* @return A new reference to the created dictionary on success.
* Converts the underlying key-value pair log event into Python dictionaries.
* @return A new reference to a Python tuple containing a pair of Python dictionaries on
* success:
* - A Python dictionary for auto-generated key-value pairs.
* - A Python dictionary for user-generated key-value pairs.
* @return nullptr on failure with the relevant Python exception and error set.
*/
[[nodiscard]] auto to_dict() -> PyDictObject*;
[[nodiscard]] auto to_dict() -> PyObject*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, any reason we don't return PyTupleObject* instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used Py_BuildValue to create the tuple, which returns PyObject*.
Reference: https://docs.python.org/3/c-api/arg.html#c.Py_BuildValue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuple APIs would also return PyObject* directly, like this one (which is equivalent to Py_BuildValue): https://docs.python.org/3/c-api/tuple.html#c.PyTuple_Pack

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Right, I read from the https://docs.python.org/3/c-api/arg.html#c.Py_BuildValue docs that the API is not guaranteed to return a tuple, but since we have the format string fixed, it should not return anything else i assume.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swith to PyTuple_Pack since it doesn't need argument parsing.


private:
static inline PyObjectStaticPtr<PyTypeObject> m_py_type{nullptr};
Expand Down
Loading
Loading