The GLM-5.2 model now offers an OpenAI-compatible API that supports reasoning effort controls, function calling, and long-context retrieval. This guide demonstrates how to configure the client, manage costs, and test these features in a practical workflow.
Setting Up the Client and Chat Wrapper
The tutorial begins by defining multiple provider options, including ZAI, OpenRouter, Together, Requesty, and Hugging Face. It securely loads the API key and creates a reusable chat wrapper. This wrapper handles standard chat, thinking mode, streaming, tool calling, and token tracking.
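The code maps each provider to a base URL, a model ID, and the name of the environment variable that holds its key, then builds a single OpenAI client for the provider selected in `PROVIDER`. The `load_api_key` function looks for the key in Google Colab userdata first, then in the system environment, and finally prompts the user through `getpass` if neither is set.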
Cost tracking is implemented via a `_track` function that accumulates input tokens, output tokens, and call counts. A helper function, `get_reasoning`, extracts hidden reasoning traces from the model’s response, checking for `reasoning_content` attributes in the message, extra fields, or the dictionary representation.
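The main `chat` function accepts parameters for reasoning effort, thinking mode, streaming, and tool selection. It constructs an `extra_body` dictionary to pass GLM-specific settings like `thinking` and `reasoning_effort` alongside standard OpenAI arguments. `reasoning_effort` is only sent when thinking is enabled, and streaming requests also set `stream_options` so that usage is reported in the stream.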
import sys, subprocess
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)

import os, re, json, time, getpass
from openai import OpenAI

PROVIDERS = {
    "zai": {"base_url": "https://api.z.ai/api/paas/v4/", "model": "glm-5.2", "env": "ZAI_API_KEY"},
    "openrouter": {"base_url": "https://openrouter.ai/api/v1", "model": "z-ai/glm-5.2", "env": "OPENROUTER_API_KEY"},
    "together": {"base_url": "https://api.together.xyz/v1", "model": "zai-org/GLM-5.2", "env": "TOGETHER_API_KEY"},
    "requesty": {"base_url": "https://router.requesty.ai/v1", "model": "zai/glm-5.2", "env": "REQUESTY_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1", "model": "zai-org/GLM-5.2", "env": "HF_TOKEN"},
}
PROVIDER = "zai"
CFG = PROVIDERS[PROVIDER]
MODEL = CFG["model"]

def load_api_key(env_name):
    try:
        from google.colab import userdata
        v = userdata.get(env_name)
        if v:
            return v
    except Exception:
        pass
    if os.environ.get(env_name):
        return os.environ[env_name]
    return getpass.getpass(f"Enter your {env_name}: ")

client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])

PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}

def _track(usage):
    if usage:
        _USAGE["in"] += getattr(usage, "prompt_tokens", 0) or 0
        _USAGE["out"] += getattr(usage, "completion_tokens", 0) or 0
    _USAGE["calls"] += 1

def get_reasoning(obj):
    """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
    val = getattr(obj, "reasoning_content", None)
    if val:
        return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"):
        return extra["reasoning_content"]
    try:
        return obj.to_dict().get("reasoning_content")
    except Exception:
        return None

def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
         stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
    """
    effort: None | "high" | "max" (GLM-5.2 thinking-effort level; max is the model default)
    thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
    GLM-specific params go through extra_body so any OpenAI client works.
    """
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        extra["reasoning_effort"] = effort
    if tool_stream:
        extra["tool_stream"] = True
    kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        kwargs["stream_options"] = {"include_usage": True}
    return client.chat.completions.create(**kwargs)
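To see what the wrapper would send without spending tokens, the argument-building logic can be pulled into a pure function. The sketch below mirrors the `extra_body` handling of `chat` under that assumption; `build_request` is a hypothetical name that does not appear in the tutorial, and it makes no network call.

```python
def build_request(messages, model="glm-5.2", effort=None, thinking=True,
                  tools=None, tool_choice="auto", stream=False,
                  max_tokens=2048, temperature=1.0):
    """Hypothetical dry-run helper: return the kwargs that chat() would pass on."""
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        # reasoning_effort is only meaningful when thinking is enabled
        extra["reasoning_effort"] = effort
    kwargs = dict(model=model, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        # ask for a usage report in the stream so tokens can still be tracked
        kwargs["stream_options"] = {"include_usage": True}
    return kwargs


msgs = [{"role": "user", "content": "hello"}]
fast = build_request(msgs, effort="max", thinking=False)
deep = build_request(msgs, effort="high", stream=True)
print(fast["extra_body"])   # {'thinking': {'type': 'disabled'}}
print(deep["extra_body"])   # {'thinking': {'type': 'enabled'}, 'reasoning_effort': 'high'}
```

Note that passing `effort` together with `thinking=False` silently drops the effort setting, which matches the condition in `chat`.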
Testing Reasoning Effort and Streaming
The guide then moves to practical testing, starting with a basic sanity check that runs with thinking disabled. It compares three modes on the same word problem (two trains that depart at different times and meet en route): thinking off, `effort="high"`, and `effort="max"`. For each mode, the code records wall-clock latency and the number of output tokens.
Streaming is demonstrated by separating the reasoning channel from the final answer. The code iterates through chunks to print the thinking process first, followed by the main content.
def demo_basic():
    print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
    resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                 {"role": "user", "content": "In one sentence, what is GLM-5.2 best at?"}],
                thinking=False, max_tokens=200)
    _track(resp.usage)
    print(resp.choices[0].message.content.strip())

def demo_effort():
    print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
    problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
               "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
               "At what clock time do they meet? Show the key steps briefly.")
    for label, kw in [("thinking OFF", dict(thinking=False)),
                      ("effort=high", dict(thinking=True, effort="high")),
                      ("effort=max", dict(thinking=True, effort="max"))]:
        t0 = time.time()
        resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
        dt = time.time() - t0
        _track(resp.usage)
        msg, u = resp.choices[0].message, resp.usage
        print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
        r = get_reasoning(msg)
        if r:
            print(" [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
        # Collapse whitespace and show the first 350 characters of the answer
        print(" [answer]: " + " ".join((msg.content or '').split())[:350])

def demo_streaming():
    print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
    stream = chat([{"role": "user", "content":
                    "Explain why the sky is blue, then give a one-line TL;DR."}],
                  thinking=True, effort="high", stream=True, max_tokens=1200)
    saw_r = saw_a = False
    usage = None
    for chunk in stream:
        # The usage report arrives on a chunk that may carry no choices
        if getattr(chunk, "usage", None):
            usage = chunk.usage
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        r = get_reasoning(delta)
        if r:
            if not saw_r:
                print("\n[thinking] ", end="", flush=True)
                saw_r = True
            print(r, end="", flush=True)
        if getattr(delta, "content", None):
            if not saw_a:
                print("\n\n[answer] ", end="", flush=True)
                saw_a = True
            print(delta.content, end="", flush=True)
    print()
    _track(usage)
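When comparing the three effort modes, it helps to know the correct answer to the train problem in `demo_effort`. The following is a small independent check, not part of the tutorial's code: it assumes train A departs first and that the trains have not yet met when train B departs. The helper name `meeting_time` and the placeholder calendar date are illustrative.

```python
from datetime import datetime, timedelta


def meeting_time(distance_km, speed_a, start_a, speed_b, start_b):
    """Clock time at which two trains travelling toward each other meet.

    Assumes start_a <= start_b and that A has not covered the whole
    distance before B departs.
    """
    head_start_h = (start_b - start_a).total_seconds() / 3600
    gap_km = distance_km - speed_a * head_start_h      # distance left when B departs
    hours_after_b = gap_km / (speed_a + speed_b)       # closing speed is the sum
    # Round to whole seconds to avoid float noise in the timestamp
    return start_b + timedelta(seconds=round(hours_after_b * 3600))


# The date is arbitrary; only the clock time matters.
start_a = datetime(2000, 1, 1, 9, 0)
start_b = datetime(2000, 1, 1, 9, 30)
print(meeting_time(420, 60, start_a, 90, start_b).strftime("%H:%M"))  # 12:06
```

By 9:30 train A has covered 30 km, leaving 390 km that the trains close at 150 km/h, which takes 2.6 hours (2 h 36 min).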
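Costs are estimated from the `PRICE_IN_PER_M` and `PRICE_OUT_PER_M` constants: 1.40 per million input tokens and 4.40 per million output tokens. Each demonstration passes its `usage` object to `_track`, so token totals accumulate across all calls.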
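The tracked totals can be turned into a cost estimate with one multiplication per token type. This is a minimal sketch that assumes a counter shaped like the tutorial's `_USAGE` dictionary; `estimate_cost` is a hypothetical helper, and the sample totals are made up for illustration.

```python
# Prices per million tokens, copied from the tutorial's constants.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40


def estimate_cost(usage, price_in=PRICE_IN_PER_M, price_out=PRICE_OUT_PER_M):
    """Estimate spend from a counter with 'in' and 'out' token totals."""
    cost_in = usage["in"] / 1_000_000 * price_in
    cost_out = usage["out"] / 1_000_000 * price_out
    return cost_in + cost_out


# Made-up totals, only to show the arithmetic: 1.40 + 0.5 * 4.40 = 3.60
sample = {"in": 1_000_000, "out": 500_000, "calls": 3}
print(f"{sample['calls']} calls, estimated cost: {estimate_cost(sample):.2f}")
# 3 calls, estimated cost: 3.60
```

Because output tokens cost roughly three times as much as input tokens at these prices, the longer responses produced with thinking enabled tend to dominate the estimate.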
Function Calling and Tool Use
The final section builds a multi-step agent using function calling. It defines two tools: a calculator for arithmetic and a city population lookup. The agent runs in a loop: the model requests tool calls, the code executes them, and the results are appended to the conversation until the model returns a final answer.
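The loop described above can be sketched as follows. This is an illustrative reconstruction, not the tutorial's actual agent: the names `TOOLS`, `TOOL_IMPLS`, `run_agent`, and `scripted_model` are assumptions, the population figures are placeholders rather than real data, and a scripted stand-in replaces the live model so the example runs offline.

```python
import ast
import json
import operator
from types import SimpleNamespace

# Tool schemas in the OpenAI function-calling format.
TOOLS = [
    {"type": "function", "function": {
        "name": "calculator", "description": "Evaluate an arithmetic expression.",
        "parameters": {"type": "object", "required": ["expression"],
                       "properties": {"expression": {"type": "string"}}}}},
    {"type": "function", "function": {
        "name": "city_population", "description": "Look up a city's population.",
        "parameters": {"type": "object", "required": ["city"],
                       "properties": {"city": {"type": "string"}}}}},
]

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def calculator(expression):
    """Evaluate +, -, *, /, ** on numbers by walking the AST (no eval())."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

# Placeholder entries for illustration only, not real census data.
_POPULATIONS = {"exampleville": 120_000, "sampleton": 45_000}

def city_population(city):
    return _POPULATIONS.get(city.strip().lower(), "unknown city")

TOOL_IMPLS = {"calculator": calculator, "city_population": city_population}

def run_agent(call_model, prompt, max_steps=5):
    """Call the model, run any tools it requests, feed results back, repeat."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        msg = call_model(messages)
        tool_calls = getattr(msg, "tool_calls", None)
        if not tool_calls:
            return msg.content                      # final answer, no tools requested
        messages.append({"role": "assistant", "content": msg.content or "",
                         "tool_calls": [{"id": tc.id, "type": "function",
                                         "function": {"name": tc.function.name,
                                                      "arguments": tc.function.arguments}}
                                        for tc in tool_calls]})
        for tc in tool_calls:
            try:
                args = json.loads(tc.function.arguments or "{}")
                result = TOOL_IMPLS[tc.function.name](**args)
            except Exception as exc:                # report failures back to the model
                result = f"error: {exc}"
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": str(result)})
    return None                                     # step budget exhausted

def scripted_model(messages):
    """Offline stand-in for the model: one calculator call, then a final answer."""
    if messages[-1]["role"] == "tool":
        return SimpleNamespace(content=f"The result is {messages[-1]['content']}.", tool_calls=None)
    call = SimpleNamespace(id="call_1", function=SimpleNamespace(
        name="calculator", arguments=json.dumps({"expression": "6 * 7"})))
    return SimpleNamespace(content=None, tool_calls=[call])

print(run_agent(scripted_model, "What is 6 * 7?"))  # The result is 42.
```

To run the same loop against the live API, `call_model` could presumably be a small adapter around the tutorial's wrapper, such as `lambda m: chat(m, tools=TOOLS).choices[0].message`; that pairing has not been verified here. The `max_steps` cap keeps a model that never stops requesting tools from looping indefinitely.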




