agentic workflows with langgraph and browser use: building an ai-powered apartment search tool
the rapid advancement of multimodal models with powerful vision capabilities has opened exciting new possibilities for ai applications, particularly in the realm of autonomous web agents. with openai’s operator and deep research showcasing what’s possible, there’s growing industry momentum toward building agentic systems that can perceive, reason about, and interact with web interfaces. in this project, i implemented a simple web agent that navigates craigslist housing listings using browser-use for web navigation, langgraph for workflow/prompt orchestration, and remix for the frontend. this combination of technologies enables an end-to-end solution that accepts natural language queries, autonomously searches apartment listings, and presents structured results—showcasing how these emerging tools can be combined to create powerful applications with minimal development effort.
code: https://github.com/petermiller310/craigslist-agent
webvoyager: a vision-enabled web-browsing agent
webvoyager marked a significant advancement in autonomous web agents by integrating several key techniques:
- visual perception of web elements: using screenshots to understand page layout and content
- set-of-marks annotations: highlighting interactive elements with bounding boxes
- action space definition: providing a set of primitive actions (click, type, scroll)
- function calling: utilizing structured function calls to execute actions reliably
- ReAct loop implementation: continuous observation, reasoning, and action
these techniques enabled ai agents to navigate websites with unprecedented autonomy, but implementing them required specialized knowledge and significant engineering effort. the academic research was powerful, but the prompt orchestration needed to make it work was a technical challenge for most developers.
browser-use: a clean interface for complex web navigation
browser-use is an open-source library that abstracts away the complexities of connecting LLMs to web browsers. it provides a clean, straightforward interface for developers to build web automation tools without dealing with the intricate details of browser control, visual processing, and prompt orchestration.
with just a few lines of code, developers can create agents capable of complex web navigation tasks that previously required specialized expertise. browser-use handles:
- browser initialization and management: launching and controlling browser instances
- visual perception: capturing screenshots and processing visual information
- element identification: detecting interactive elements on the page
- action execution: clicking, typing, scrolling, and other interactions
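as a concrete illustration, here's roughly what those "few lines of code" look like, following the browser-use quickstart (the task string is a placeholder, and constructor options may differ slightly between versions):

```python
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI


async def main():
    # hand the agent a natural-language task; browser-use takes care of the
    # browser lifecycle, screenshots, element annotation, and action execution
    agent = Agent(
        task="find 2-bedroom apartments under $3,000 in san francisco on craigslist",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    history = await agent.run()
    print(history.final_result())


asyncio.run(main())
```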
the ReAct pattern: reasoning + acting
what makes browser-use particularly powerful is its implementation of the ReAct pattern (reasoning + acting), which enables agents to alternate between reasoning about the current state and taking actions in the environment. this approach allows agents to:
- reason about observations: analyze what they see on the page
- plan next steps: determine the most appropriate action
- execute actions: interact with the web interface
- observe results: process feedback from the environment
- adjust strategy: modify plans based on new information
this tight loop of reasoning and acting creates more robust agents capable of handling complex, multi-step tasks that require adaptation to changing environments.
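to make the shape of that loop concrete, here's a deliberately simplified toy sketch; the observe/act helpers and the model stub are hypothetical stand-ins, not browser-use internals:

```python
# toy ReAct loop: not browser-use internals, just the control flow it implements
def observe() -> str:
    return "screenshot + annotated interactive elements"


def act(action: str) -> str:
    return f"executed: {action}"


class ToyModel:
    def reason(self, goal: str, observation: str) -> str:
        # a real agent calls the llm here; we hard-code a single step
        return "click element [3]"

    def goal_reached(self, goal: str, result: str) -> bool:
        # a real agent re-inspects the page before declaring success
        return True


def react_loop(model: ToyModel, goal: str, max_steps: int = 10) -> str:
    for _ in range(max_steps):
        observation = observe()                    # 1. observe the page
        action = model.reason(goal, observation)   # 2. reason + plan next step
        result = act(action)                       # 3. execute the action
        if model.goal_reached(goal, result):       # 4. observe results
            return result                          # 5. stop or adjust strategy
    raise RuntimeError("step budget exhausted without reaching the goal")


print(react_loop(ToyModel(), "open the housing section"))
```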
visual affordances and bounding boxes
the set-of-marks technique from webvoyager is implemented through browser-use’s built-in capabilities. browser-use automatically:
- takes screenshots of the current page
- identifies interactive elements
- annotates them with bounding boxes
- provides this visual context to the llm
this approach allows the agent to “see” the page as a human would, identifying buttons, forms, and other interactive elements through their visual appearance and position.
action space and reasoning
browser-use implements a “computer-use” action space for web navigation, which includes several core primitives:
- click: interacting with buttons, links, and other clickable elements
- type: entering text into form fields
- scroll: navigating through content that doesn’t fit on screen
these represent just a few of the core actions available; the full action space is more comprehensive. beyond these built-in actions, browser-use allows developers to extend the action space with custom functions that the model can choose from when navigating the web, enabling specialized behaviors tailored to specific use cases.
the library sequences these primitive actions to accomplish complex tasks, using the ReAct loop to continuously observe, reason, and act.
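as a sketch of what extending the action space can look like, here's a made-up "shortlist" action registered via browser-use's Controller decorator (exact signatures and the expected return type may vary by version):

```python
from browser_use import Agent, Controller
from langchain_openai import ChatOpenAI

controller = Controller()
shortlist: list[str] = []


# a custom action the model can choose alongside the built-in click/type/scroll
@controller.action("save a promising listing url to the shortlist")
def save_listing(url: str) -> str:
    shortlist.append(url)
    return f"saved {url} to the shortlist"


agent = Agent(
    task="browse craigslist housing and shortlist promising 2-bedroom listings",
    llm=ChatOpenAI(model="gpt-4o-mini"),
    controller=controller,  # exposes the custom action to the agent
)
```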
function calling and structured outputs
function calling is the critical capability that enables LLMs to reliably interact with external tools and apis. it allows models to:
- recognize when a tool is needed: identify when a task requires external functionality
- structure the appropriate call: format arguments correctly for the target function
- interpret and use results: process the returned data to continue the task
after the agent navigates the website and completes its task, extracting structured data is surprisingly simple: provide the LLM with a pydantic model, and it will automatically extract and organize the information. for example, this pydantic model captures the essential details of a craigslist apartment listing:
```python
from typing import List, Optional

from pydantic import BaseModel, Field


class ListingDetails(BaseModel):
    title: str = Field(description="title of the listing")
    price: str = Field(
        description="a dollar string for the price of the apartment, for example $3,000 or $3000"
    )
    location: str = Field(description="location of the apartment")
    address: Optional[str] = Field(
        None, description="approximate address of the apartment"
    )
    url: str = Field(description="url of the listing")
    bedrooms: int = Field(description="number of bedrooms")
    bathrooms: Optional[float] = Field(None, description="number of bathrooms")
    description: str = Field(description="description of the apartment")
    images: List[str] = Field(
        default_factory=list, description="urls of listing images"
    )
```

this approach forces the model to return data in a consistent format, making downstream processing more reliable and reducing the need for error handling. the field descriptions also help guide the model to extract the right information from the webpage.
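in browser-use, one way to wire this up is to pass the model to the Controller and validate the agent's final answer against it (a sketch based on my reading of the browser-use docs; it reuses the ListingDetails model defined above):

```python
from typing import Optional

from browser_use import Agent, Controller
from langchain_openai import ChatOpenAI

# constrain the agent's final answer to the ListingDetails schema
controller = Controller(output_model=ListingDetails)

agent = Agent(
    task="open this craigslist listing and extract its details",
    llm=ChatOpenAI(model="gpt-4o-mini"),
    controller=controller,
)


async def extract_listing() -> Optional[ListingDetails]:
    history = await agent.run()
    raw = history.final_result()  # a json string matching ListingDetails
    return ListingDetails.model_validate_json(raw) if raw else None
```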
building agentic workflows with langgraph
langgraph’s state management capabilities complement browser-use’s ReAct pattern perfectly. while browser-use handles the reasoning-action loop for web navigation, langgraph orchestrates the overall workflow, managing transitions between different states and ensuring that the agent follows a coherent process from start to finish.
this graph-based approach provides several advantages:
- clear separation of concerns: each node handles a specific task
- explicit state management: all data is passed through a well-defined state object
- error handling and recovery: the system can retry specific nodes or take alternative paths
- maintainability: components can be tested and updated independently
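a minimal sketch of that orchestration (the node names and state fields here are illustrative, not the exact graph from the repo):

```python
from typing import List
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END


class SearchState(TypedDict):
    query: str            # the user's natural-language query
    listings: List[dict]  # raw listings found by the browser agent
    results: List[dict]   # structured, validated output


def parse_query(state: SearchState) -> dict:
    # turn the free-form query into explicit search criteria
    return {"query": state["query"].strip()}


def search_listings(state: SearchState) -> dict:
    # in the real workflow this node drives the browser-use agent
    return {"listings": []}


def extract_details(state: SearchState) -> dict:
    # validate and structure each listing, e.g. with the pydantic model above
    return {"results": state["listings"]}


graph = StateGraph(SearchState)
graph.add_node("parse_query", parse_query)
graph.add_node("search_listings", search_listings)
graph.add_node("extract_details", extract_details)
graph.add_edge(START, "parse_query")
graph.add_edge("parse_query", "search_listings")
graph.add_edge("search_listings", "extract_details")
graph.add_edge("extract_details", END)

app = graph.compile()
state = app.invoke({"query": "2br in the mission under $3,500", "listings": [], "results": []})
```

because each node only reads and writes the shared state object, every step can be tested in isolation, which is where the maintainability benefits above come from.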
the agency-orchestration spectrum
my journey with this project revealed a critical insight about balancing agency and structure:
initially, i gave gpt-4o a single broad instruction: “search craigslist for apartments matching these criteria.” it could handle the entire workflow—navigating menus, applying filters, and extracting listings—but with significant drawbacks:
- the cost was prohibitively high (those tokens add up quickly!)
- reliability varied, especially when dealing with many types of filters
- debugging was cumbersome when things went wrong
by shifting to a more structured approach with langgraph, i broke the process into smaller, discrete steps. this allowed me to:
- use gpt-4o-mini for most tasks, dramatically reducing costs
- create more predictable behavior with explicit transitions
- test and debug each component independently
the key insight wasn’t choosing one approach over another, but understanding their tradeoffs. full agency works for novel, complex tasks where flexibility matters most. structured orchestration excels when reliability, cost, and maintainability are priorities.
key learnings: model specialization, reflection loops, and future directions
building this tool reinforced a practical insight about ai system design—the power of specialized model roles. using gpt-4o as a “planner” while delegating execution to gpt-4o-mini created a sweet spot for cost/performance. this approach slashed costs dramatically without meaningful quality degradation, proving that we don’t always need our most powerful models handling every workflow step.
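concretely, the split can be as simple as instantiating two models and reserving the expensive one for planning (the prompts here are hypothetical):

```python
from langchain_openai import ChatOpenAI

planner = ChatOpenAI(model="gpt-4o")        # strong model: decomposes the task
executor = ChatOpenAI(model="gpt-4o-mini")  # cheap model: runs each step

plan = planner.invoke(
    "break this apartment search into concrete browsing steps, one per line: "
    "2-bedroom in san francisco under $3,000"
)

for step in plan.content.split("\n"):
    if step.strip():
        # each step is cheap to execute; in the real workflow this would
        # drive a browser-use agent rather than a bare llm call
        executor.invoke(f"carry out this step: {step}")
```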
as openai moves toward custom-tuned models specifically designed for “computer use” tasks, this specialization strategy will become even more effective. purpose-built models optimized for web navigation could further reduce costs while improving reliability in these domain-specific tasks.
future iterations should leverage a reflection-style multi-agent architecture. my current linear graph works but lacks robustness against common failures—hallucinated urls, incomplete filter application, and geocoding errors. implementing critique agents to evaluate and improve the main agent’s output would create self-correcting loops that increase reliability without constant human babysitting.
a few practical enhancements that would transform this from proof-of-concept to production-ready:
- semantic reranking of results using cohere’s reranker (sketched below)
- reflection-based verification for geocoding queries
- failure recovery mechanisms for retry loops
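for the first of these, the reranking step might look something like this with cohere's python sdk (the model name and response fields are my assumptions from cohere's docs):

```python
import cohere

co = cohere.Client()  # assumes the cohere api key is configured in the environment


def rerank_listings(query: str, listings: list[dict], top_n: int = 5) -> list[dict]:
    # score each listing against the user's query and keep the best matches
    docs = [f"{l['title']}: {l['description']}" for l in listings]
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_n,
    )
    return [listings[r.index] for r in response.results]
```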
what i’ve learned from this project is that current models can already handle most practical tasks—success often comes down to how you structure the problem rather than waiting for more powerful ai. thoughtful orchestration, strategic model selection, and well-designed workflows can create systems that are both capable and cost-effective today.